# Proposal: Register-based Go calling convention
Author: Austin Clements, with input from Cherry Zhang, Michael
Knyszek, Martin Möhrmann, Michael Pratt, David Chase, Keith Randall,
Dan Scales, and Ian Lance Taylor.
Last updated: 2020-08-10
Discussion at https://golang.org/issue/40724.
## Abstract
We propose switching the Go ABI from its current stack-based calling
convention to a register-based calling convention.
[Preliminary experiments
indicate](https://github.com/golang/go/issues/18597#issue-199914923)
this will achieve at least a 5–10% throughput improvement across a
range of applications.
This will remain backwards compatible with existing assembly code that
assumes Go’s current stack-based calling convention through Go’s
[multiple ABI
mechanism](https://golang.org/design/27539-internal-abi).
## Background
Since its initial release, Go has used a *stack-based calling
convention* based on the Plan 9 ABI, in which arguments and result
values are passed via memory on the stack.
This has significant simplicity benefits: the rules of the calling
convention are simple and build on existing struct layout rules; all
platforms can use essentially the same conventions, leading to shared,
portable compiler and runtime code; and call frames have an obvious
first-class representation, which simplifies the implementation of the
`go` and `defer` statements and reflection calls.
Furthermore, the current Go ABI has no *callee-save registers*,
meaning that no register contents live across a function call (any
live state in a function must be flushed to the stack before a call).
This simplifies stack tracing for garbage collection and stack growth
and stack unwinding during panic recovery.
Unfortunately, Go’s stack-based calling convention leaves a lot of
performance on the table.
While modern high-performance CPUs heavily optimize stack access,
accessing arguments in registers is still roughly [40%
faster](https://gist.github.com/aclements/ded22bb8451eead8249d22d3cd873566)
than accessing arguments on the stack.
Furthermore, a stack-based calling convention, especially one with no
callee-save registers, induces additional memory traffic, which has
secondary effects on overall performance.
Most language implementations on most platforms use a register-based
calling convention that passes function arguments and results via
registers rather than memory and designates some registers as
callee-save, allowing functions to keep state in registers across
calls.
## Proposal
We propose switching the Go ABI to a register-based calling
convention, starting with a minimum viable product (MVP) on amd64, and
then expanding to other architectures and improving on the MVP.
We further propose that this calling convention should be designed
specifically for Go, rather than using platform ABIs.
There are several reasons for this.
It’s incredibly tempting to use the platform calling convention, as it
seems that would allow for more efficient language interoperability.
Unfortunately, there are two major reasons it would do little good,
both related to the scalability of goroutines, a central feature of
the Go language.
One reason goroutines scale so well is that the Go runtime dynamically
resizes their stacks, but this imposes requirements on the ABI that
aren’t satisfied by non-Go functions, thus requiring the runtime to
transition out of the dynamic stack regime on a foreign call.
Another reason is that goroutines are scheduled by the Go runtime
rather than the OS kernel, but this means that transitions to and from
non-Go code must be communicated to the Go scheduler.
These two things mean that sharing a calling convention wouldn’t
significantly lower the cost of calling non-Go code.
The other tempting reason to use the platform calling convention would
be tooling interoperability, particularly with debuggers and profiling
tools.
However, these almost universally support DWARF or, for profilers,
frame pointer unwinding.
Go will continue to work with DWARF-based tools and we can make the Go
ABI compatible with platform frame pointer unwinding without otherwise
taking on the platform ABI.
Hence, there’s little upside to using the platform ABI.
And there are several reasons to favor using our own ABI:
- Most existing ABIs were based on the C language, which differs in
important ways from Go.
For example, most ELF ABIs (at least x86-64, ARM64, and RISC-V)
would force Go slices to be passed on the stack rather than in
registers because the slice header is three words.
Similarly, because C functions rarely return more than one word,
most platform ABIs reserve at most two registers for results.
Since Go functions commonly return at least three words (a result
and a two word error interface value), the platform ABI would force
such functions to return values on the stack.
Other things that influence the platform ABI include that array
arguments in C are passed by reference rather than by value and
small integer types in C are promoted to `int` rather than retaining
their type.
Hence, platform ABIs simply aren’t a good fit for the Go language.
- Platform ABIs typically define callee-save registers, which place
substantial additional requirements on a garbage collector.
There are alternatives to callee-save registers that share many of
their benefits, while being much better suited to Go.
- While platform ABIs are generally similar at a high level, their
details differ in myriad ways.
By defining our own ABI, we can follow a common structure across all
platforms and maintain much of the cross-platform simplicity and
reliability of Go’s stack-based calling convention.
The new calling convention will remain backwards-compatible with
existing assembly code that’s based on the stack-based calling
convention via Go’s [multiple ABI
mechanism](https://golang.org/design/27539-internal-abi).
This same multiple ABI mechanism allows us to continue to evolve the
Go calling convention in future versions.
This lets us start with a simple, minimal calling convention and
continue to optimize it in the future.
The rest of this proposal outlines the work necessary to switch Go to
a register-based calling convention.
While it lays out the requirements for the ABI, it does not describe a
specific ABI.
Defining a specific ABI will be one of the first implementation steps,
and its definition should reside in a living document rather than a
proposal.
## Go’s current stack-based ABI
We give an overview of Go’s current ABI to give a sense of the
requirements of any Go ABI and because the register-based calling
convention builds on the same concepts.
In the stack-based Go ABI, when a function F calls a function or
method G, F reserves space in its own stack frame for G’s receiver (if
it’s a method), arguments, and results.
These are laid out in memory as if G’s receiver, arguments, and
results were simply fields in a struct.
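To make the "fields in a struct" rule concrete, the following is a minimal sketch that computes frame offsets for a hypothetical method using ordinary struct alignment rules. The `field` type, the `layout` helper, and the hand-written sizes (which assume a 64-bit platform) are illustrative assumptions; the real stack-based ABI has further details that this sketch ignores.

```go
package main

import "fmt"

// field describes one receiver, argument, or result slot. Sizes and
// alignments are written out by hand, assuming a 64-bit platform.
type field struct {
	name  string
	size  uintptr
	align uintptr
}

// alignUp rounds off up to a multiple of align (a power of two).
func alignUp(off, align uintptr) uintptr {
	return (off + align - 1) &^ (align - 1)
}

// layout assigns offsets as if the fields were consecutive struct
// fields and returns the offsets plus the padded total size.
func layout(fields []field) (offsets []uintptr, size uintptr) {
	maxAlign := uintptr(1)
	for _, f := range fields {
		size = alignUp(size, f.align)
		offsets = append(offsets, size)
		size += f.size
		if f.align > maxAlign {
			maxAlign = f.align
		}
	}
	return offsets, alignUp(size, maxAlign)
}

// exampleFrame models a hypothetical method
//
//	func (r *T) G(a int32, b []byte) (n int64, ok bool)
func exampleFrame() []field {
	return []field{
		{"r", 8, 8},  // receiver pointer
		{"a", 4, 4},  // int32 argument
		{"b", 24, 8}, // slice header: pointer, len, cap
		{"n", 8, 8},  // int64 result
		{"ok", 1, 1}, // bool result
	}
}

func main() {
	fields := exampleFrame()
	offsets, size := layout(fields)
	for i, f := range fields {
		fmt.Printf("%-2s at offset %d\n", f.name, offsets[i])
	}
	fmt.Println("frame size:", size)
}
```

In this model the caller F would reserve `size` bytes in its own frame and G would read and write each slot at its computed offset.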
There is one exception to all call state being passed on the stack: if
G is a closure, F passes a pointer to its function object in a
*context register*, via which G can quickly access any closed-over
values.
Other than a few fixed-function registers, all registers are
caller-save, meaning F must spill any live state in registers to its
stack frame before calling G and reload the registers after the call.
The Go ABI also keeps a pointer to the runtime structure representing
the current goroutine (“G”) available for quick access.
On 386 and amd64, it is stored in thread-local storage; on all other
platforms, it is stored in a dedicated register.<sup>1</sup>
Every function must ensure sufficient stack space is available before
reserving its stack frame.
The current stack bound is stored in the runtime goroutine structure,
which is why the ABI keeps this readily accessible.
The standard prologue checks the stack pointer against this bound and
calls into the runtime to grow the stack if necessary.
In assembly code, this prologue is automatically generated by the
assembler itself.
Cooperative preemption is implemented by poisoning a goroutine’s stack
bound, and thus also makes use of this standard prologue.
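The prologue check and its reuse for preemption can be modeled in a few lines of Go. This is only a sketch: the `g` type, its `stackGuard` field, and the `poisoned` constant are hypothetical stand-ins, and the real prologue is emitted as machine code rather than written like this.

```go
package main

import "fmt"

// g is a stand-in for the runtime's goroutine structure; only the
// stack bound is modeled. The field name is hypothetical.
type g struct {
	stackGuard uintptr // bound the prologue compares against
}

// poisoned is larger than any stack pointer, so every prologue check
// against it fails and control reaches the runtime.
const poisoned = ^uintptr(0)

// prologueOK models the check at function entry. The stack grows
// down, so the frame fits only if sp-frameSize stays above the bound.
func prologueOK(gp *g, sp, frameSize uintptr) bool {
	return sp >= frameSize && sp-frameSize > gp.stackGuard
}

// requestPreempt models cooperative preemption: poisoning the bound
// forces the next prologue to call into the runtime.
func requestPreempt(gp *g) {
	gp.stackGuard = poisoned
}

func main() {
	gp := &g{stackGuard: 0x1400}
	fmt.Println(prologueOK(gp, 0x2000, 0x100)) // plenty of room
	fmt.Println(prologueOK(gp, 0x1480, 0x100)) // frame would cross the bound
	requestPreempt(gp)
	fmt.Println(prologueOK(gp, 0x2000, 0x100)) // poisoned bound always fails
}
```

When the check fails, the runtime has to distinguish a genuine need for more stack from a preemption request, which this sketch leaves out.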
Finally, both stack growth and the Go garbage collector must be able
to find all live pointers.
Logically, function entry and every call instruction each have an associated
bitmap indicating which slots in the local frame and the function’s
argument frame contain live pointers.
Sometimes liveness information is path-sensitive, in which case a
function will have additional [*stack
object*](https://golang.org/cl/134155) metadata.
In all cases, all pointers are in known locations on the stack.
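A pointer bitmap of this kind can be pictured as one bit per pointer-sized slot. The sketch below is illustrative only: the `ptrBitmap` type and the bit order are assumptions, not the encoding the toolchain actually uses.

```go
package main

import "fmt"

// ptrBitmap holds one bit per pointer-sized slot; a set bit means the
// slot contains a live pointer at the associated program point.
// The bit order (least significant bit first) is an assumption.
type ptrBitmap []uint8

// isPointer reports whether the given slot is marked as a live pointer.
func (b ptrBitmap) isPointer(slot int) bool {
	return b[slot/8]&(1<<(slot%8)) != 0
}

// livePointerSlots returns the indices, among the first n slots, that
// a stack scan would need to visit.
func livePointerSlots(b ptrBitmap, n int) []int {
	var slots []int
	for i := 0; i < n; i++ {
		if b.isPointer(i) {
			slots = append(slots, i)
		}
	}
	return slots
}

func main() {
	// Ten slots: slots 0, 2, and 9 hold live pointers.
	b := ptrBitmap{0b00000101, 0b00000010}
	fmt.Println(livePointerSlots(b, 10))
}
```

Both the garbage collector and stack growth would walk such a map: the former to mark the referenced objects, the latter to adjust pointers that refer to the stack being moved.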
<sup>1</sup> This is largely a historical accident.
The G pointer was originally stored in a register on 386/amd64.
This is ideal, since it’s accessed in nearly every function prologue.
It was moved to TLS in order to support cgo, since transitions from C
back to Go (including the runtime signal handler) needed a way to
access the current G.
However, when we added ARM support, it turned out accessing TLS in
every function prologue was far too expensive on ARM, so all later
ports used a hybrid approach where the G is stored in both a register
and TLS and transitions from C restore it from TLS.
## ABI design recommendations
Here we lay out various recommendations for the design of a
register-based Go ABI.
The rest of this document assumes we’ll be following these
recommendations.
1. Common structure across platforms.
This dramatically simplifies porting work in the compiler and
runtime.
We propose that each architecture should define a sequence of
integer and floating point registers (and in the future perhaps
vector registers), plus size and alignment constraints, and that
beyond this, the calling convention should be derived using a
shared set of rules as much as possible.
1. Efficient access to the current goroutine pointer and the context
register for closure calls.
Ideally these will be in registers; however, we may use TLS on
architectures with extremely limited registers (namely, 386).
1. Support for many-word return values.
Go functions frequently return three or more words, so this must be
supported efficiently.
1. Support for scanning and adjusting pointers in register arguments
on stack growth.
Since the function prologue checks the stack bound before reserving
a stack frame, the runtime must be able to spill argument registers
and identify those containing pointers.
1. First-class generic call frame representation.
The `go` and `defer` statements as well as reflection calls need to
manipulate call frames as first-class, in-memory objects.
Reflect calls in particular are simplified by a common, generic
representation with fairly generic bridge code (the compiler could
generate bridge code for `go` and `defer`).
1. No callee-save registers.
Callee-save registers complicate stack unwinding (and garbage
collection if pointers are allowed in callee-save registers).
Inter-function clobber sets have many of the benefits of
callee-save registers, but are much simpler to implement in a
garbage collected language and are well-suited to Go’s compilation
model.
For an MVP, we’re unlikely to implement any form of live registers
across calls, but we’ll want to revisit this later.
1. Where possible, be compatible with platform frame-pointer unwinding
rules.
This helps Go interoperate with system-level profilers, and can
potentially be used to optimize stack unwinding in Go itself.
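As a rough illustration of the first recommendation, the sketch below drives one shared assignment rule from a small per-architecture description. This proposal does not define a specific ABI, so everything here is a toy: the `arch` and `loc` types, the register counts, and the rule itself (one register per word, overflow to the stack) are assumptions made for the example.

```go
package main

import "fmt"

// kind classifies one argument or result word.
type kind int

const (
	integer kind = iota // integers and pointers
	float
)

// arch is the only per-architecture input to the shared rule.
type arch struct {
	name   string
	nInt   int // integer argument registers
	nFloat int // floating-point argument registers
}

// loc says where a value was assigned.
type loc struct {
	inReg bool
	kind  kind
	index int // register index within its class, or stack slot index
}

// assign applies the same toy rule on every architecture: take the
// next free register of the right class, else the next stack slot.
func assign(a arch, vals []kind) []loc {
	var out []loc
	nextInt, nextFloat, nextStack := 0, 0, 0
	for _, k := range vals {
		switch {
		case k == integer && nextInt < a.nInt:
			out = append(out, loc{inReg: true, kind: k, index: nextInt})
			nextInt++
		case k == float && nextFloat < a.nFloat:
			out = append(out, loc{inReg: true, kind: k, index: nextFloat})
			nextFloat++
		default:
			out = append(out, loc{inReg: false, kind: k, index: nextStack})
			nextStack++
		}
	}
	return out
}

func main() {
	toy := arch{name: "toy", nInt: 2, nFloat: 1}
	vals := []kind{integer, float, integer, integer, float}
	for i, l := range assign(toy, vals) {
		fmt.Printf("value %d: inReg=%v index=%d\n", i, l.inReg, l.index)
	}
}
```

Porting this toy rule to another architecture would only mean supplying a different `arch` value, which is the property the recommendation is after.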
There are also some notable non-requirements:
1. No compatibility with the platform ABI (other than frame pointers).
This has more downsides than upsides, as described above.
1. No binary compatibility between Go versions.
This is important for shared libraries in C, but Go already
requires all shared libraries in a process to use the same Go
toolchain version.
This means we can continue to evolve and improve the ABI.
## Toolchain changes overview
This section outlines the changes that will be necessary to the Go
build toolchain and runtime.
The "Detailed design" section will go into greater depth on some of
these.
### Compiler
*Abstract argument registers*: The compiler’s register allocator will
need to allocate function arguments and results to the appropriate
registers.
However, it needs to represent argument and result registers in a
platform-independent way prior to architecture lowering and register
allocation.
We propose introducing generic SSA values to represent the argument
and result registers, as done in [David Chase’s
prototype](https://golang.org/cl/28832).
These would simply represent the *i*th argument/result register and
register allocation would assign them to the appropriate architecture
registers.
Having a common ABI structure across platforms means the
architecture-independent parts of the compiler would only need to know
how many argument/result registers the target architecture has.
*Late call lowering*: Call lowering and argument frame construction
currently happen during AST to SSA lowering, which happens well before
register allocation.
Hence, we propose moving call lowering much later in the compilation
process.
Late call lowering will have knock-on effects, as the current approach
hides a lot of the structure of calls from most optimization passes.
*ABI bridges*: For compatibility with existing assembly code, the
compiler must generate ABI bridges when calling between Go
(ABIInternal) and assembly (ABI0) code, as described in the [internal
ABI proposal](https://golang.org/design/27539-internal-abi).
These are small functions that translate between ABIs according to a
function’s type.
While the compiler currently differentiates between the two ABIs
internally, since they’re actually identical right now, it currently
only generates *ABI aliases* and has no mechanism for generating ABI
bridges.
As a post-MVP optimization, the compiler should inline these ABI
bridges where possible.
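Conceptually, a bridge unpacks a stack-style argument frame, calls the target, and stores the results back. The following is a pure-Go model of that idea, not what the compiler would emit: `addFrame` stands in for the ABI0 argument frame of a hypothetical function `add`, and `addBridge` stands in for the generated bridge.

```go
package main

import "fmt"

const maxInt64 = 1<<63 - 1

// add plays the role of the ABIInternal function: its arguments and
// results are ordinary Go values (registers, in the proposed ABI).
func add(a, b int64) (sum int64, overflow bool) {
	sum = a + b
	overflow = (a > 0 && b > 0 && sum < 0) || (a < 0 && b < 0 && sum >= 0)
	return sum, overflow
}

// addFrame models the stack-based frame an ABI0 caller would build:
// arguments followed by result slots, laid out like struct fields.
type addFrame struct {
	a, b     int64 // arguments
	sum      int64 // first result
	overflow bool  // second result
}

// addBridge models an ABI0-to-ABIInternal bridge derived from add's
// type: load arguments from the frame, call, store results back.
func addBridge(f *addFrame) {
	f.sum, f.overflow = add(f.a, f.b)
}

func main() {
	f := addFrame{a: 2, b: 3}
	addBridge(&f)
	fmt.Println(f.sum, f.overflow)
}
```

Because the translation depends only on the function's type, one such bridge can be generated per symbol without any knowledge of the function body.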
*Argument GC map*: The garbage collector needs to know which arguments
contain live pointers at function entry and at any calls (since these
are preemption points).
Currently this is represented as a bitmap over words in the function’s
argument frame.
With the register-based ABI, the compiler will need to emit a liveness
map for argument registers for the function entry point.
Since initially we won't have any live registers across calls, live
arguments will be spilled to the stack at a call, so the compiler does
*not* need to emit register maps at calls.
For functions that still require a stack argument frame (because their
arguments don’t all fit in registers), the compiler will also need to
emit argument frame liveness maps at the same points it does today.
*Traceback argument maps*: Go tracebacks currently display a simple
word-based hex dump of a function’s argument frame.
This is neither particularly user-friendly nor high-fidelity, but it can
be incredibly valuable for debugging.
With a register-based ABI, there’s a wide range of possible designs
for retaining this functionality.
For an MVP, we propose trying to maintain a similar level of fidelity.
In the future, we may want more detailed maps, or may want to simply
switch to using DWARF location descriptions.
To that end, we propose that the compiler should emit two logical
maps: a *location map* from (PC, argument word index) to
register/`stack`/`dead` and a *home map* from argument word index to
stack home (if any).
Since a named variable’s stack spill home is fixed if it ever spills,
the location map can use a single distinguished value for `stack` that
tells the runtime to refer to the home map.
This approach works well for an ABI that passes argument values in
separate registers without packing small values.
The `dead` value is not necessarily the same as the garbage
collector’s notion of a dead slot: for the garbage collector, you want
slots to become dead as soon as possible, while for debug printing,
you want them to stay live as long as possible (until clobbered by
something else).
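The two logical maps can be sketched as plain Go data structures. Since the encoding is still to be determined, the types below (`argMaps`, `pcRange`, `location`) and the lookup strategy are assumptions chosen only to show how a traceback could resolve an argument word.

```go
package main

import "fmt"

// where is the kind of location recorded in the location map.
type where int

const (
	dead where = iota
	onStack
	inRegister
)

// location is one entry of the location map.
type location struct {
	where where
	reg   int // register number, valid when where == inRegister
}

// pcRange gives the location of every argument word from startPC
// until the next range begins.
type pcRange struct {
	startPC uintptr
	locs    []location // indexed by argument word
}

// argMaps bundles the location map with the home map.
type argMaps struct {
	ranges []pcRange   // sorted by startPC
	home   map[int]int // argument word index -> stack offset
}

// describe resolves (pc, word) to a printable location, consulting
// the home map whenever the location map says "stack".
func (m *argMaps) describe(pc uintptr, word int) string {
	var cur *pcRange
	for i := range m.ranges {
		if m.ranges[i].startPC <= pc {
			cur = &m.ranges[i]
		}
	}
	if cur == nil || word >= len(cur.locs) {
		return "dead"
	}
	switch l := cur.locs[word]; l.where {
	case inRegister:
		return fmt.Sprintf("reg%d", l.reg)
	case onStack:
		if off, ok := m.home[word]; ok {
			return fmt.Sprintf("stack+%d", off)
		}
	}
	return "dead"
}

// exampleMaps describes a hypothetical two-word function in which
// word 0 spills at PC 0x20 and word 1 is clobbered at PC 0x40.
func exampleMaps() *argMaps {
	return &argMaps{
		ranges: []pcRange{
			{0x00, []location{{inRegister, 0}, {inRegister, 1}}},
			{0x20, []location{{onStack, 0}, {inRegister, 1}}},
			{0x40, []location{{onStack, 0}, {dead, 0}}},
		},
		home: map[int]int{0: 8},
	}
}

func main() {
	m := exampleMaps()
	fmt.Println(m.describe(0x10, 0), m.describe(0x24, 0), m.describe(0x48, 1))
}
```

Keeping the stack offset in the home map means the per-PC entries only need to distinguish a register number from two distinguished values, which should keep the location map compact.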
The exact encoding of these tables is to be determined.
Most likely, we’ll want to introduce pseudo-ops for representing
changes in the location map that the `cmd/internal/obj` package can
then encode into `FUNCDATA`.
The home map could be produced directly by the compiler as `FUNCDATA`.
*DWARF locations*: The compiler will need to generate DWARF location
lists for arguments and results.
It already has this ability for local variables, and we should reuse
that as much as possible.
We will need to ensure Delve and GDB are compatible with this.
Both already support location lists in general, so this is unlikely to
require much (if any) work in these debuggers.
Clobber sets will require further changes, which we discuss later.
We propose not implementing clobber sets (or any form of callee-save)
for the MVP.
### Linker
The linker requires relatively minor changes, all related to ABI
bridges.
*Eliminate ABI aliases*: Currently, the linker resolves ABI aliases
generated by the compiler by treating all references to a symbol
aliased under one ABI as references to the symbol under the other
ABI.
Once the compiler generates ABI bridges rather than aliases, we can
remove this mechanism, which is likely to simplify and speed up the
linker somewhat.
*ABI name mangling*: Since Go ABIs work by having multiple symbol
definitions under the same name, the linker will also need to
implement a name mangling scheme for non-Go symbol tables.
### Runtime
*First-class call frame representation*: The `go` and `defer`
statements and reflection calls must manipulate call frames as
first-class objects.
While the requirements of these three cases differ, we propose having
a common first-class call frame representation that can capture a
function’s register and stack arguments and record its register and
stack results, along with a small set of generic call bridges that
invoke a call using the generic call frame.
*Stack growth*: Almost every Go function checks for sufficient stack
space before opening its local stack frame.
If there is insufficient space, it calls into the `runtime.morestack`
function to grow the stack.
Currently, `morestack` saves only the calling PC, the stack pointer,
and the context register (if any) because these are the only registers
that can be live at function entry.
With register-based arguments, `morestack` will also have to save all
argument registers.
We propose that it simply spill all *possible* argument registers
rather than trying to be specific to the function; `morestack` is
relatively rare, so the cost of this is unlikely to be noticeable.
It’s likely possible to spill all argument registers to the stack
itself: every function that can grow the stack ensures that there’s
room not only for its local frame, but also for a reasonably large
“guard” space.
`morestack` can spill into this guard space.
The garbage collector can recognize `morestack`’s spill space and use
the argument map of its caller as the stack map of `morestack`.
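The spill-everything approach, combined with pointer adjustment when the stack moves, can be modeled as follows. This is a sketch under stated assumptions: `numArgRegs`, the register mask format, and `spillAndAdjust` are hypothetical, and the real `morestack` is assembly that works on machine registers.

```go
package main

import "fmt"

// numArgRegs is a hypothetical count of possible argument registers.
const numArgRegs = 4

// spillAndAdjust models stack growth with register arguments. It
// spills every possible argument register, then uses the function's
// entry register map (ptrMask, one bit per register) to relocate the
// registers holding pointers into the old stack [oldLo, oldHi).
func spillAndAdjust(regs [numArgRegs]uintptr, ptrMask uint8, oldLo, oldHi, delta uintptr) [numArgRegs]uintptr {
	spill := regs // array copy: all registers are saved, live or not
	for i := range spill {
		isPtr := ptrMask&(1<<i) != 0
		if isPtr && spill[i] >= oldLo && spill[i] < oldHi {
			spill[i] += delta // pointer into the stack being moved
		}
	}
	return spill
}

func main() {
	regs := [numArgRegs]uintptr{0x1010, 42, 0x9000, 0x1020}
	// Registers 0, 2, and 3 hold pointers; register 1 holds an integer.
	out := spillAndAdjust(regs, 0b1101, 0x1000, 0x2000, 0x10000)
	fmt.Printf("%#x\n", out)
}
```

Spilling unconditionally keeps the spill code independent of the function being entered; only the pointer mask, taken from that function's argument map, is function-specific.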
*Runtime assembly*: While Go’s multiple ABI mechanism makes it
generally possible to transparently call between Go and assembly code
even if they’re using different ABIs, there are runtime assembly
functions that have deep knowledge of the Go ABI and will have to be
modified.
This includes any function that takes a closure (`mcall`,
`systemstack`), is called in a special context (`morestack`), or is
involved in reflection-like calls (`reflectcall`, `debugCallV1`).
*Cgo wrappers*: Generated cgo wrappers marked with
`//go:cgo_unsafe_args` currently access their argument structure by
casting a pointer to their first argument.
This violates the `unsafe.Pointer` rules and will no longer work with
this change.
We can either special case `//go:cgo_unsafe_args` functions to use
ABI0 or change the way these wrappers are generated.
*Stack unwinding for panic recovery*: When a panic is recovered, the
Go runtime must unwind the panicking stack and resume execution after
the deferred call of the recovering function.
For the MVP, we propose not retaining any live registers across calls,
in which case stack unwinding will not have to change.
This is not the case with callee-save registers or clobber sets.
*Traceback argument printing*: As mentioned in the compiler section,
the runtime currently prints a hex dump of function arguments in panic
tracebacks.
This will have to consume the new traceback argument metadata produced
by the compiler.
## Detailed design
This section dives deeper into some of the toolchain changes described
above.
We’ll expand this section over time.
### `go`, `defer` and reflection calls
Above we proposed using a first-class call frame representation for
`go` and `defer` statements and reflection calls with a small set of
call bridges.
These three cases have somewhat different requirements:
- The types of `go` and `defer` calls are known statically, while
reflect calls are not.
This means the compiler could statically generate bridges to
unmarshal arguments for `go` and `defer` calls, but this isn’t an
option for reflection calls.
- The return values of `go` and `defer` calls are always ignored,
while reflection calls must capture results.
This means a call bridge for a `go` or `defer` call can be a tail
call, while reflection calls can require marshalling return values.
- Call frames for `go` and `defer` calls are long-lived, while
reflection call frames are transient.
This means the garbage collector must be able to scan `go` and
`defer` call frames, while we could use non-preemptible regions for
reflection calls.
- Finally, `go` call frames are stored directly on the stack, while
`defer` and reflection call frames may be constructed in the heap.
This means the garbage collector must be able to construct the
appropriate stack map for `go` call frames, but `defer` and
reflection call frames can use the heap bitmap.
It also means `defer` and reflection calls that require stack
arguments must copy that part of the call frame from the heap to the
stack, though we don’t expect this to be the common case.
To satisfy these requirements, we propose the following generic
call-frame representation:
```
struct {
    pc           uintptr          // PC of target function
    nInt, nFloat uintptr          // # of int and float registers
    ints         [nInt]uintptr    // Int registers
    floats       [nFloat]uint64   // Float registers
    ctxt         uintptr          // Context register
    stack        [...]uintptr     // Stack arguments/result space
}
```
`go` calls can build this structure on the new goroutine stack and the
call bridge can pop the register part of this structure from the
stack, leaving just the `stack` part on the stack, and tail-call `pc`.
The garbage collector can recognize this call bridge and construct the
stack map by inspecting the `pc` in the call frame.
`defer` and reflection calls can build frames in the heap with the
appropriate heap bitmap.
The call bridge in these cases must open a new stack frame, copy
`stack` to the stack, load the register arguments, call `pc`, and then
copy the register results and the stack results back to the in-heap
frame (using write barriers where necessary).
It may be valuable to have optimized versions of this bridge for
tail-calls (always the case for `defer`) and register-only calls
(likely a common case).
In the register-only reflection call case, the bridge could take the
register arguments as arguments itself and return register results as
results; this would avoid any copying or write barriers.
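The generic frame above uses variable-length arrays, which Go cannot express directly. The sketch below is one possible rendering with slices, together with a helper that flattens a frame into words in the same field order. The `callFrame` type, `marshal`, and `registerOnly` are illustrative names, and the word-sized handling of floats assumes a 64-bit platform.

```go
package main

import (
	"fmt"
	"math"
)

// callFrame mirrors the generic call-frame representation, with
// slices standing in for the variable-length arrays.
type callFrame struct {
	pc     uintptr   // PC of target function
	ints   []uintptr // integer register values
	floats []uint64  // floating-point register values, as raw bits
	ctxt   uintptr   // context register
	stack  []uintptr // stack arguments/result space
}

// marshal flattens the frame in field order: pc, nInt, nFloat, ints,
// floats, ctxt, stack. It assumes uintptr is 64 bits wide.
func (f *callFrame) marshal() []uintptr {
	words := []uintptr{f.pc, uintptr(len(f.ints)), uintptr(len(f.floats))}
	words = append(words, f.ints...)
	for _, bits := range f.floats {
		words = append(words, uintptr(bits))
	}
	words = append(words, f.ctxt)
	return append(words, f.stack...)
}

// registerOnly reports whether the call needs no stack space, the
// case for which an optimized bridge is suggested above.
func (f *callFrame) registerOnly() bool {
	return len(f.stack) == 0
}

func main() {
	f := callFrame{
		pc:     0x401000, // hypothetical target address
		ints:   []uintptr{1, 2},
		floats: []uint64{math.Float64bits(1.5)},
	}
	fmt.Println(len(f.marshal()), f.registerOnly())
}
```

A `go` bridge would consume the leading words and leave only the `stack` portion in place, while a `defer` or reflection bridge would copy `stack` out of a heap-allocated frame before the call.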
## Compatibility
This proposal is Go 1-compatible.
While Go assembly is not technically covered by Go 1 compatibility,
this will maintain compatibility with the vast majority of assembly
code using Go’s [multiple ABI
mechanism](https://golang.org/design/27539-internal-abi).
This translates between Go’s existing stack-based calling convention
used by all existing assembly code and Go’s internal calling
convention.
There are a few known forms of unsafe code that this change will
break:
- Assembly code that invokes Go closures.
The closure calling convention was never publicly documented, but
there may be code that does this anyway.
- Code that performs `unsafe.Pointer` arithmetic on pointers to
arguments in order to observe the contents of the stack.
This is a violation of the [`unsafe.Pointer`
rules](https://pkg.go.dev/unsafe#Pointer) today.
## Implementation
We aim to implement a minimum viable register-based Go ABI for amd64
in the 1.16 time frame.
As of this writing (nearing the opening of the 1.16 tree), Dan Scales
has made substantial progress on ABI bridges for a simple ABI change
and David Chase has made substantial progress on late call lowering.
Austin Clements will lead the work with David Chase and Than McIntosh
focusing on the compiler side, Cherry Zhang focusing on aspects that
bridge the compiler and runtime, and Michael Knyszek focusing on the
runtime.