Proposal: Register-based Go calling convention

Author: Austin Clements, with input from Cherry Zhang, Michael Knyszek, Martin Möhrmann, Michael Pratt, David Chase, Keith Randall, Dan Scales, and Ian Lance Taylor.

Last updated: 2020-08-10

Discussion at https://golang.org/issue/40724.

Abstract

We propose switching the Go ABI from its current stack-based calling convention to a register-based calling convention. Preliminary experiments indicate this will achieve at least a 5–10% throughput improvement across a range of applications. This will remain backwards compatible with existing assembly code that assumes Go’s current stack-based calling convention through Go’s multiple ABI mechanism.

Background

Since its initial release, Go has used a stack-based calling convention based on the Plan 9 ABI, in which arguments and result values are passed via memory on the stack. This has significant simplicity benefits: the rules of the calling convention are simple and build on existing struct layout rules; all platforms can use essentially the same conventions, leading to shared, portable compiler and runtime code; and call frames have an obvious first-class representation, which simplifies the implementation of the go and defer statements and reflection calls. Furthermore, the current Go ABI has no callee-save registers, meaning that no register contents live across a function call (any live state in a function must be flushed to the stack before a call). This simplifies stack tracing for garbage collection and stack growth and stack unwinding during panic recovery.

Unfortunately, Go’s stack-based calling convention leaves a lot of performance on the table. While modern high-performance CPUs heavily optimize stack access, accessing arguments in registers is still roughly 40% faster than accessing arguments on the stack. Furthermore, a stack-based calling convention, especially one with no callee-save registers, induces additional memory traffic, which has secondary effects on overall performance.

Most language implementations on most platforms use a register-based calling convention that passes function arguments and results via registers rather than memory and designates some registers as callee-save, allowing functions to keep state in registers across calls.

Proposal

We propose switching the Go ABI to a register-based calling convention, starting with a minimum viable product (MVP) on amd64, and then expanding to other architectures and improving on the MVP.

We further propose that this calling convention should be designed specifically for Go, rather than using platform ABIs. There are several reasons for this.

It’s incredibly tempting to use the platform calling convention, as it seems that would allow for more efficient language interoperability. Unfortunately, there are two major reasons it would do little good, both related to the scalability of goroutines, a central feature of the Go language. One reason goroutines scale so well is that the Go runtime dynamically resizes their stacks, but this imposes requirements on the ABI that aren’t satisfied by non-Go functions, thus requiring the runtime to transition out of the dynamic stack regime on a foreign call. Another reason is that goroutines are scheduled by the Go runtime rather than the OS kernel, but this means that transitions to and from non-Go code must be communicated to the Go scheduler. These two things mean that sharing a calling convention wouldn’t significantly lower the cost of calling non-Go code.

The other tempting reason to use the platform calling convention would be tooling interoperability, particularly with debuggers and profiling tools. However, these almost universally support DWARF or, for profilers, frame pointer unwinding. Go will continue to work with DWARF-based tools and we can make the Go ABI compatible with platform frame pointer unwinding without otherwise taking on the platform ABI.

Hence, there’s little upside to using the platform ABI. And there are several reasons to favor using our own ABI:

  • Most existing ABIs were based on the C language, which differs in important ways from Go. For example, most ELF ABIs (at least x86-64, ARM64, and RISC-V) would force Go slices to be passed on the stack rather than in registers because the slice header is three words. Similarly, because C functions rarely return more than one word, most platform ABIs reserve at most two registers for results. Since Go functions commonly return at least three words (a result and a two-word error interface value), the platform ABI would force such functions to return values on the stack. Other things that influence the platform ABI include that array arguments in C are passed by reference rather than by value and small integer types in C are promoted to int rather than retaining their type. Hence, platform ABIs simply aren’t a good fit for the Go language.

  • Platform ABIs typically define callee-save registers, which place substantial additional requirements on a garbage collector. There are alternatives to callee-save registers that share many of their benefits, while being much better suited to Go.

  • While platform ABIs are generally similar at a high level, their details differ in myriad ways. By defining our own ABI, we can follow a common structure across all platforms and maintain much of the cross-platform simplicity and reliability of Go’s stack-based calling convention.
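The word counts cited in the first bullet can be checked directly. The following is a minimal sketch that uses only unsafe.Sizeof from the standard library; the helper names (words, sliceWords, errorWords, resultWords) are ours and exist only for this illustration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// words reports how many machine words a value of n bytes occupies.
// It is a helper for this illustration only.
func words(n uintptr) uintptr {
	const wordSize = unsafe.Sizeof(uintptr(0))
	return (n + wordSize - 1) / wordSize
}

// sliceWords is the size of a slice header in words.
func sliceWords() uintptr {
	var s []byte
	return words(unsafe.Sizeof(s))
}

// errorWords is the size of an error interface value in words.
func errorWords() uintptr {
	var err error
	return words(unsafe.Sizeof(err))
}

// resultWords is the size of a typical (int, error) result list in words.
func resultWords() uintptr {
	return words(unsafe.Sizeof(int(0))) + errorWords()
}

func main() {
	fmt.Println("slice header words:", sliceWords())
	fmt.Println("error interface words:", errorWords())
	fmt.Println("(int, error) result words:", resultWords())
}
```

With the gc toolchain this reports three words for the slice header, two for the interface value, and three for the combined result, which is more than the two result registers most platform ABIs provide.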

The new calling convention will remain backwards-compatible with existing assembly code that’s based on the stack-based calling convention via Go’s multiple ABI mechanism.

This same multiple ABI mechanism allows us to continue to evolve the Go calling convention in future versions. This lets us start with a simple, minimal calling convention and continue to optimize it in the future.

The rest of this proposal outlines the work necessary to switch Go to a register-based calling convention. While it lays out the requirements for the ABI, it does not describe a specific ABI. Defining a specific ABI will be one of the first implementation steps, and its definition should reside in a living document rather than a proposal.

Go’s current stack-based ABI

We give an overview of Go’s current ABI to give a sense of the requirements of any Go ABI and because the register-based calling convention builds on the same concepts.

In the stack-based Go ABI, when a function F calls a function or method G, F reserves space in its own stack frame for G’s receiver (if it’s a method), arguments, and results. These are laid out in memory as if G’s receiver, arguments, and results were simply fields in a struct.
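To make the "fields in a struct" description concrete, the sketch below declares a struct that mirrors the frame for a hypothetical function G and prints the offset of each slot. The function signature, the type gFrame, and the helper frameOffsets are assumptions made for illustration. The sketch relies only on ordinary struct field alignment, so it approximates the real layout rather than defining it.

```go
package main

import (
	"fmt"
	"unsafe"
)

// gFrame mirrors the argument frame a caller would reserve for a
// hypothetical function
//
//	func G(a int32, b int64, c bool) (r int64)
//
// under the stack-based convention: arguments followed by results,
// placed as if they were struct fields.
type gFrame struct {
	a int32
	b int64
	c bool
	r int64
}

// frameOffsets returns the byte offset of each slot in declaration order.
func frameOffsets() [4]uintptr {
	var f gFrame
	return [4]uintptr{
		unsafe.Offsetof(f.a),
		unsafe.Offsetof(f.b),
		unsafe.Offsetof(f.c),
		unsafe.Offsetof(f.r),
	}
}

func main() {
	fmt.Println("offsets of a, b, c, r:", frameOffsets())
	fmt.Println("frame size:", unsafe.Sizeof(gFrame{}))
}
```

On a typical 64-bit platform the padding after a and c is visible in the offsets: each slot is placed at the next address that satisfies its alignment, exactly as struct layout would.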

There is one exception to all call state being passed on the stack: if G is a closure, F passes a pointer to its function object in a context register, via which G can quickly access any closed-over values.

Other than a few fixed-function registers, all registers are caller-save, meaning F must spill any live state in registers to its stack frame before calling G and reload the registers after the call.

The Go ABI also keeps a pointer to the runtime structure representing the current goroutine (“G”) available for quick access. On 386 and amd64, it is stored in thread-local storage; on all other platforms, it is stored in a dedicated register (see note 1 below).

Every function must ensure sufficient stack space is available before reserving its stack frame. The current stack bound is stored in the runtime goroutine structure, which is why the ABI keeps this readily accessible. The standard prologue checks the stack pointer against this bound and calls into the runtime to grow the stack if necessary. In assembly code, this prologue is automatically generated by the assembler itself. Cooperative preemption is implemented by poisoning a goroutine’s stack bound, and thus also makes use of this standard prologue.
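The prologue check and its reuse for preemption can be modeled in a few lines. This is a simplified sketch, not the runtime's code: the goroutine type, the stackGuard field, the stackPreempt value, and prologueCallsRuntime are all names invented here, and addresses are plain integers.

```go
package main

import "fmt"

// stackPreempt is a hypothetical poison value for this sketch: a bound
// larger than any real stack pointer, so every prologue check fails.
const stackPreempt = ^uintptr(0)

// goroutine models only the part of the runtime structure the prologue reads.
type goroutine struct {
	stackGuard uintptr // lowest address a new frame may reach
}

// prologueCallsRuntime reports whether a function entered with stack
// pointer sp and needing frameSize bytes must call into the runtime
// instead of running its body. Stacks grow down, so the new frame
// would occupy [sp-frameSize, sp).
func prologueCallsRuntime(g *goroutine, sp, frameSize uintptr) bool {
	if frameSize > sp {
		return true // frame would wrap below address zero
	}
	return sp-frameSize < g.stackGuard
}

func main() {
	g := &goroutine{stackGuard: 0x1000}

	// Enough room: the frame stays above the guard.
	fmt.Println(prologueCallsRuntime(g, 0x2000, 0x100))

	// Not enough room: the frame would cross the guard.
	fmt.Println(prologueCallsRuntime(g, 0x1080, 0x100))

	// Poisoning the bound makes the same check fail even with ample
	// space, which is how a preemption request reaches the runtime.
	g.stackGuard = stackPreempt
	fmt.Println(prologueCallsRuntime(g, 0x2000, 0x100))
}
```

The point of the design is that one comparison serves two purposes: the runtime tells a genuine overflow from a preemption request only after the prologue has already called into it.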

Finally, both stack growth and the Go garbage collector must be able to find all live pointers. Logically, function entry and every call instruction each have an associated bitmap indicating which slots in the local frame and the function’s argument frame contain live pointers. Sometimes liveness information is path-sensitive, in which case a function will have additional stack object metadata. In all cases, all pointers are in known locations on the stack.

1 This is largely a historical accident. The G pointer was originally stored in a register on 386/amd64. This is ideal, since it’s accessed in nearly every function prologue. It was moved to TLS in order to support cgo, since transitions from C back to Go (including the runtime signal handler) needed a way to access the current G. However, when we added ARM support, it turned out accessing TLS in every function prologue was far too expensive on ARM, so all later ports used a hybrid approach where the G is stored in both a register and TLS and transitions from C restore it from TLS.

ABI design recommendations

Here we lay out various recommendations for the design of a register-based Go ABI. The rest of this document assumes we’ll be following these recommendations.

  1. Common structure across platforms. This dramatically simplifies porting work in the compiler and runtime. We propose that each architecture should define a sequence of integer and floating point registers (and in the future perhaps vector registers), plus size and alignment constraints, and that beyond this, the calling convention should be derived using a shared set of rules as much as possible.

  2. Efficient access to the current goroutine pointer and the context register for closure calls. Ideally these will be in registers; however, we may use TLS on architectures with extremely limited registers (namely, 386).

  3. Support for many-word return values. Go functions frequently return three or more words, so this must be supported efficiently.

  4. Support for scanning and adjusting pointers in register arguments on stack growth. Since the function prologue checks the stack bound before reserving a stack frame, the runtime must be able to spill argument registers and identify those containing pointers.

  5. First-class generic call frame representation. The go and defer statements as well as reflection calls need to manipulate call frames as first-class, in-memory objects. Reflect calls in particular are simplified by a common, generic representation with fairly generic bridge code (the compiler could generate bridge code for go and defer).

  6. No callee-save registers. Callee-save registers complicate stack unwinding (and garbage collection if pointers are allowed in callee-save registers). Inter-function clobber sets have many of the benefits of callee-save registers, but are much simpler to implement in a garbage collected language and are well-suited to Go’s compilation model. For an MVP, we’re unlikely to implement any form of live registers across calls, but we’ll want to revisit this later.

  7. Where possible, be compatible with platform frame-pointer unwinding rules. This helps Go interoperate with system-level profilers, and can potentially be used to optimize stack unwinding in Go itself.
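As an illustration of the first recommendation, the sketch below derives argument locations from nothing more than a per-architecture register count and one shared rule. This proposal deliberately does not define a specific ABI, so everything here is an assumption: the types kind, arch, and slot, the function assign, the register counts, and the rule itself (take the next free register of the matching class, otherwise use the stack).

```go
package main

import "fmt"

type kind int

const (
	integer kind = iota // integers and pointers
	float
)

// arch describes a target purely by how many registers of each class
// are available for arguments. The counts used below are made up.
type arch struct {
	intRegs, floatRegs int
}

// slot says where one argument word lives: in the index-th register of
// its class, or at word offset index in the stack argument frame.
type slot struct {
	inReg bool
	class kind
	index int
}

// assign walks the argument words in order and hands out registers of
// the matching class until that class is exhausted, then falls back to
// the stack.
func assign(a arch, args []kind) []slot {
	var nextInt, nextFloat, nextStack int
	out := make([]slot, 0, len(args))
	for _, k := range args {
		switch {
		case k == integer && nextInt < a.intRegs:
			out = append(out, slot{true, integer, nextInt})
			nextInt++
		case k == float && nextFloat < a.floatRegs:
			out = append(out, slot{true, float, nextFloat})
			nextFloat++
		default:
			out = append(out, slot{false, k, nextStack})
			nextStack++
		}
	}
	return out
}

func main() {
	toy := arch{intRegs: 2, floatRegs: 1}
	args := []kind{integer, float, integer, integer, float}
	for i, s := range assign(toy, args) {
		fmt.Printf("arg word %d: %+v\n", i, s)
	}
}
```

Porting such a scheme to a new architecture would mean supplying a different arch value and nothing else, which is the simplification the recommendation is after.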

There are also some notable non-requirements:

  1. No compatibility with the platform ABI (other than frame pointers). This has more downsides than upsides, as described above.

  2. No binary compatibility between Go versions. This is important for shared libraries in C, but Go already requires all shared libraries in a process to use the same Go toolchain version. This means we can continue to evolve and improve the ABI.

Toolchain changes overview

This section outlines the changes that will be necessary to the Go build toolchain and runtime. The “Detailed design” section will go into greater depth on some of these.

Compiler

Abstract argument registers: The compiler’s register allocator will need to allocate function arguments and results to the appropriate registers. However, it needs to represent argument and result registers in a platform-independent way prior to architecture lowering and register allocation. We propose introducing generic SSA values to represent the argument and result registers, as done in David Chase’s prototype. These would simply represent the ith argument/result register and register allocation would assign them to the appropriate architecture registers. Having a common ABI structure across platforms means the architecture-independent parts of the compiler would only need to know how many argument/result registers the target architecture has.
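The idea of an abstract "ith argument register" that is resolved late can be sketched as a table lookup. This is only a model of the concept, not the compiler's SSA representation: the type argReg, the table regTable, the function lower, and both architectures with their register names are invented for illustration.

```go
package main

import "fmt"

// argReg is an architecture-independent value: "the i-th integer
// argument register". It stands in for the generic SSA values the
// text proposes.
type argReg struct{ index int }

// regTable lists, per architecture, the concrete registers that back
// the abstract sequence. Both architectures and their register names
// are invented for this sketch.
var regTable = map[string][]string{
	"toy64": {"R0", "R1", "R2", "R3"},
	"toy32": {"A", "B"},
}

// lower resolves an abstract argument register for one architecture.
// ok is false when the architecture has no register at that index, in
// which case the value would have to live on the stack.
func lower(archName string, r argReg) (name string, ok bool) {
	regs := regTable[archName]
	if r.index < 0 || r.index >= len(regs) {
		return "", false
	}
	return regs[r.index], true
}

func main() {
	third := argReg{index: 2}
	for _, a := range []string{"toy64", "toy32"} {
		name, ok := lower(a, third)
		fmt.Printf("%s: register=%q inRegister=%v\n", a, name, ok)
	}
}
```

Passes that run before lowering only ever see argReg values and the length of the table, which matches the claim that the architecture-independent parts need to know only how many registers exist.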

Late call lowering: Call lowering and argument frame construction currently happen during AST to SSA lowering, which happens well before register allocation. Hence, we propose moving call lowering much later in the compilation process. Late call lowering will have knock-on effects, as the current approach hides a lot of the structure of calls from most optimization passes.

ABI bridges: For compatibility with existing assembly code, the compiler must generate ABI bridges when calling between Go (ABIInternal) and assembly (ABI0) code, as described in the internal ABI proposal. These are small functions that translate between ABIs according to a function’s type. While the compiler currently differentiates between the two ABIs internally, since they’re actually identical right now, it currently only generates ABI aliases and has no mechanism for generating ABI bridges. As a post-MVP optimization, the compiler should inline these ABI bridges where possible.
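The role of an ABI bridge can be modeled in ordinary Go by treating function parameters as "registers" and a slice as the in-memory argument frame. This is a conceptual sketch only; real bridges are generated machine code, and the types regFunc and stackFunc, the two bridge functions, and the frame layout [a, b, result] are assumptions made here.

```go
package main

import "fmt"

// regFunc models a function under a register-based convention: two
// argument words arrive in "registers" (Go parameters) and one result
// word is returned the same way.
type regFunc func(a, b uintptr) uintptr

// stackFunc models the stack-based convention: arguments and results
// live in a caller-provided memory frame laid out as [a, b, result].
type stackFunc func(frame []uintptr)

// bridgeToStack wraps a register-convention function so that
// stack-convention callers can invoke it.
func bridgeToStack(f regFunc) stackFunc {
	return func(frame []uintptr) {
		// Load arguments from memory, call, store the result back.
		frame[2] = f(frame[0], frame[1])
	}
}

// bridgeToReg wraps a stack-convention function so that
// register-convention callers can invoke it.
func bridgeToReg(f stackFunc) regFunc {
	return func(a, b uintptr) uintptr {
		frame := []uintptr{a, b, 0} // spill arguments into a fresh frame
		f(frame)
		return frame[2]
	}
}

func add(a, b uintptr) uintptr { return a + b }

func main() {
	viaStack := bridgeToStack(add)
	frame := []uintptr{2, 3, 0}
	viaStack(frame)
	fmt.Println(frame[2])

	// Bridging back again yields a function with the original shape.
	roundTrip := bridgeToReg(viaStack)
	fmt.Println(roundTrip(4, 5))
}
```

Each bridge depends only on the function's type, not on its body, which is why the compiler can generate one mechanically for every function that crosses the ABI boundary.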

Argument GC map: The garbage collector needs to know which arguments contain live pointers at function entry and at any calls (since these are preemption points). Currently this is represented as a bitmap over words in the function’s argument frame. With the register-based ABI, the compiler will need to emit a liveness map for argument registers for the function entry point. Since initially we won't have any live registers across calls, live arguments will be spilled to the stack at a call, so the compiler does not need to emit register maps at calls. For functions that still require a stack argument frame (because their arguments don’t all fit in registers), the compiler will also need to emit argument frame liveness maps at the same points it does today.

Traceback argument maps: Go tracebacks currently display a simple word-based hex dump of a function’s argument frame. This is neither particularly user-friendly nor high-fidelity, but it can be incredibly valuable for debugging. With a register-based ABI, there’s a wide range of possible designs for retaining this functionality. For an MVP, we propose trying to maintain a similar level of fidelity. In the future, we may want more detailed maps, or may want to simply switch to using DWARF location descriptions.

To that end, we propose that the compiler should emit two logical maps: a location map from (PC, argument word index) to register/stack/dead and a home map from argument word index to stack home (if any). Since a named variable’s stack spill home is fixed if it ever spills, the location map can use a single distinguished value for stack that tells the runtime to refer to the home map. This approach works well for an ABI that passes argument values in separate registers without packing small values. The dead value is not necessarily the same as the garbage collector’s notion of a dead slot: for the garbage collector, you want slots to become dead as soon as possible, while for debug printing, you want them to stay live as long as possible (until clobbered by something else).

The exact encoding of these tables is to be determined. Most likely, we’ll want to introduce pseudo-ops for representing changes in the location map that the cmd/internal/obj package can then encode into FUNCDATA. The home map could be produced directly by the compiler as FUNCDATA.
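Since the encoding is undecided, the following sketch shows only the logical shape of the two maps and how a traceback might consult them. All names (locKind, loc, pcRange, argMaps, describe, exampleMaps) and the in-memory representation are hypothetical; the actual tables would be compact FUNCDATA, not Go slices.

```go
package main

import (
	"fmt"
	"sort"
)

type locKind uint8

const (
	dead    locKind = iota // value is no longer recoverable
	inReg                  // value is in register reg
	onStack                // value is in its stack home; consult the home map
)

type loc struct {
	kind locKind
	reg  int // meaningful only when kind == inReg
}

// pcRange gives the location of every argument word from startPC up to
// the next range's startPC.
type pcRange struct {
	startPC uintptr
	locs    []loc
}

// argMaps pairs a location map (ranges sorted by startPC) with a home
// map from argument word index to stack offset.
type argMaps struct {
	ranges []pcRange
	home   []int
}

// describe reports where argument word arg can be found at pc.
func (m *argMaps) describe(pc uintptr, arg int) string {
	// Find the last range whose startPC is <= pc.
	i := sort.Search(len(m.ranges), func(i int) bool {
		return m.ranges[i].startPC > pc
	}) - 1
	if i < 0 || arg < 0 || arg >= len(m.ranges[i].locs) {
		return "unknown"
	}
	switch l := m.ranges[i].locs[arg]; l.kind {
	case inReg:
		return fmt.Sprintf("register %d", l.reg)
	case onStack:
		return fmt.Sprintf("stack offset %d", m.home[arg])
	default:
		return "dead"
	}
}

// exampleMaps describes a two-argument function: both arguments start
// in registers, argument 0 is spilled at PC 0x10, and argument 1's
// register is clobbered at PC 0x20.
func exampleMaps() *argMaps {
	return &argMaps{
		ranges: []pcRange{
			{startPC: 0x00, locs: []loc{{inReg, 0}, {inReg, 1}}},
			{startPC: 0x10, locs: []loc{{onStack, 0}, {inReg, 1}}},
			{startPC: 0x20, locs: []loc{{onStack, 0}, {dead, 0}}},
		},
		home: []int{8, 16},
	}
}

func main() {
	m := exampleMaps()
	fmt.Println(m.describe(0x04, 0))
	fmt.Println(m.describe(0x14, 0))
	fmt.Println(m.describe(0x24, 1))
}
```

Because the stack home never changes, the location map needs only one "stack" marker per word; the offset lives once in the home map instead of being repeated at every PC range.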

DWARF locations: The compiler will need to generate DWARF location lists for arguments and results. It already has this ability for local variables, and we should reuse that as much as possible. We will need to ensure Delve and GDB are compatible with this. Both already support location lists in general, so this is unlikely to require much (if any) work in these debuggers.

Clobber sets will require further changes, which we discuss later. We propose not implementing clobber sets (or any form of callee-save) for the MVP.

Linker

The linker requires relatively minor changes, all related to ABI bridges.

Eliminate ABI aliases: Currently, the linker resolves ABI aliases generated by the compiler by treating all references to a symbol aliased under one ABI as references to the symbol under the other ABI. Once the compiler generates ABI bridges rather than aliases, we can remove this mechanism, which is likely to simplify and speed up the linker somewhat.

ABI name mangling: Since Go ABIs work by having multiple symbol definitions under the same name, the linker will also need to implement a name mangling scheme for non-Go symbol tables.

Runtime

First-class call frame representation: The go and defer statements and reflection calls must manipulate call frames as first-class objects. While the requirements of these three cases differ, we propose having a common first-class call frame representation that can capture a function’s register and stack arguments and record its register and stack results, along with a small set of generic call bridges that invoke a call using the generic call frame.

Stack growth: Almost every Go function checks for sufficient stack space before opening its local stack frame. If there is insufficient space, it calls into the runtime.morestack function to grow the stack. Currently, morestack saves only the calling PC, the stack pointer, and the context register (if any) because these are the only registers that can be live at function entry. With register-based arguments, morestack will also have to save all argument registers. We propose that it simply spill all possible argument registers rather than trying to be specific to the function; morestack is relatively rare, so the cost of this is unlikely to be noticeable. It’s likely possible to spill all argument registers to the stack itself: every function that can grow the stack ensures that there’s room not only for its local frame, but also for a reasonably large “guard” space. morestack can spill into this guard space. The garbage collector can recognize morestack’s spill space and use the argument map of its caller as the stack map of morestack.
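The spill-everything approach and the use of the caller's argument map can be modeled as follows. This is a toy simulation with integer "addresses", not runtime code: numArgRegs, spillAll, adjust, growExample, and the bitmask encoding of the argument map are assumptions made for illustration. The old and new stacks are aligned at their high ends because stacks grow down.

```go
package main

import "fmt"

const numArgRegs = 4 // made-up register count for this sketch

// spillAll saves every possible argument register, regardless of how
// many the interrupted function actually uses.
func spillAll(regs [numArgRegs]uintptr) []uintptr {
	spill := make([]uintptr, numArgRegs)
	copy(spill, regs[:])
	return spill
}

// adjust rewrites spilled words after the stack contents moved from
// [oldLo, oldHi) to a new stack whose high end is newHi. Bit i of
// ptrMask is set when register i holds a pointer according to the
// function's argument map. Only such words are candidates, and only
// if they point into the old stack.
func adjust(spill []uintptr, ptrMask uint, oldLo, oldHi, newHi uintptr) {
	for i, w := range spill {
		isPtr := ptrMask&(1<<uint(i)) != 0
		if isPtr && w >= oldLo && w < oldHi {
			spill[i] = newHi - (oldHi - w)
		}
	}
}

// growExample spills four registers and moves a stack from
// [0x1000, 0x2000) to one ending at 0x8000.
func growExample() []uintptr {
	regs := [numArgRegs]uintptr{
		0x1f00, // pointer into the old stack
		0x1f00, // integer that merely looks like a stack address
		0x9000, // pointer outside the stack (for example, the heap)
		7,      // plain integer
	}
	const ptrMask = 1<<0 | 1<<2 // registers 0 and 2 hold pointers
	spill := spillAll(regs)
	adjust(spill, ptrMask, 0x1000, 0x2000, 0x8000)
	return spill
}

func main() {
	for i, w := range growExample() {
		fmt.Printf("register %d restored as %#x\n", i, w)
	}
}
```

Register 1 shows why the argument map is required: without it, the runtime could not tell an integer from a stack pointer and would either corrupt the integer or leave a dangling pointer.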

Runtime assembly: While Go’s multiple ABI mechanism makes it generally possible to transparently call between Go and assembly code even if they’re using different ABIs, there are runtime assembly functions that have deep knowledge of the Go ABI and will have to be modified. This includes any function that takes a closure (mcall, systemstack), is called in a special context (morestack), or is involved in reflection-like calls (reflectcall, debugCallV1).

Cgo wrappers: Generated cgo wrappers marked with //go:cgo_unsafe_args currently access their argument structure by casting a pointer to their first argument. This violates the unsafe.Pointer rules and will no longer work with this change. We can either special case //go:cgo_unsafe_args functions to use ABI0 or change the way these wrappers are generated.

Stack unwinding for panic recovery: When a panic is recovered, the Go runtime must unwind the panicking stack and resume execution after the deferred call of the recovering function. For the MVP, we propose not retaining any live registers across calls, in which case stack unwinding will not have to change. This is not the case with callee-save registers or clobber sets.

Traceback argument printing: As mentioned in the compiler section, the runtime currently prints a hex dump of function arguments in panic tracebacks. This will have to consume the new traceback argument metadata produced by the compiler.

Detailed design

This section dives deeper into some of the toolchain changes described above. We’ll expand this section over time.

go, defer and reflection calls

Above we proposed using a first-class call frame representation for go and defer statements and reflection calls with a small set of call bridges. These three cases have somewhat different requirements:

  • The types of go and defer calls are known statically, while reflect calls are not. This means the compiler could statically generate bridges to unmarshal arguments for go and defer calls, but this isn’t an option for reflection calls.

  • The return values of go and defer calls are always ignored, while reflection calls must capture results. This means a call bridge for a go or defer call can be a tail call, while reflection calls can require marshalling return values.

  • Call frames for go and defer calls are long-lived, while reflection call frames are transient. This means the garbage collector must be able to scan go and defer call frames, while we could use non-preemptible regions for reflection calls.

  • Finally, go call frames are stored directly on the stack, while defer and reflection call frames may be constructed in the heap. This means the garbage collector must be able to construct the appropriate stack map for go call frames, but defer and reflection call frames can use the heap bitmap. It also means defer and reflection calls that require stack arguments must copy that part of the call frame from the heap to the stack, though we don’t expect this to be the common case.

To satisfy these requirements, we propose the following generic call-frame representation:

struct {
    pc           uintptr          // PC of target function
    nInt, nFloat uintptr          // # of int and float registers
    ints         [nInt]uintptr    // Int registers
    floats       [nFloat]uint64   // Float registers
    ctxt         uintptr          // Context register
    stack        [...]uintptr     // Stack arguments/result space
}

go calls can build this structure on the new goroutine stack and the call bridge can pop the register part of this structure from the stack, leaving just the stack part on the stack, and tail-call pc. The garbage collector can recognize this call bridge and construct the stack map by inspecting the pc in the call frame.

defer and reflection calls can build frames in the heap with the appropriate heap bitmap. The call bridge in these cases must open a new stack frame, copy the frame’s stack field to the goroutine stack, load the register arguments, call pc, and then copy the register results and the stack results back to the in-heap frame (using write barriers where necessary). It may be valuable to have optimized versions of this bridge for tail-calls (always the case for defer) and register-only calls (likely a common case). In the register-only reflection call case, the bridge could take the register arguments as arguments itself and return register results as results; this would avoid any copying or write barriers.
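The bridge steps for heap-allocated frames can be walked through in a small model. This sketch is not the runtime's implementation: slices replace the variable-length arrays of the layout above, pc is modeled as a Go function value rather than a code address, and regFile, callBridge, scale, and callScale are hypothetical names. Write barriers and the garbage collector are out of scope.

```go
package main

import (
	"fmt"
	"math"
)

// regFile stands in for the machine's argument/result registers.
type regFile struct {
	ints   []uintptr
	floats []uint64
}

// callFrame is a Go rendering of the generic call-frame layout.
type callFrame struct {
	pc     func(r *regFile, stack []uintptr) // target function
	ints   []uintptr                         // int registers
	floats []uint64                          // float registers
	ctxt   uintptr                           // context register (unused here)
	stack  []uintptr                         // stack arguments/result space
}

// callBridge performs the steps described for heap-allocated frames:
// copy the stack part, load register arguments, call, and copy the
// register and stack results back into the frame.
func callBridge(f *callFrame) {
	stack := append([]uintptr(nil), f.stack...) // "open a new stack frame"
	regs := &regFile{
		ints:   append([]uintptr(nil), f.ints...),
		floats: append([]uint64(nil), f.floats...),
	}
	f.pc(regs, stack)
	copy(f.ints, regs.ints)     // register results
	copy(f.floats, regs.floats) // register results
	copy(f.stack, stack)        // stack results
}

// scale is the call target. It takes an integer in ints[0] and a float
// in floats[0], and returns their product in floats[0].
func scale(r *regFile, _ []uintptr) {
	x := float64(r.ints[0])
	y := math.Float64frombits(r.floats[0])
	r.floats[0] = math.Float64bits(x * y)
}

// callScale marshals its arguments into a frame, invokes the bridge,
// and unmarshals the result, as a reflection call would.
func callScale(n uintptr, y float64) float64 {
	f := &callFrame{
		pc:     scale,
		ints:   []uintptr{n},
		floats: []uint64{math.Float64bits(y)},
	}
	callBridge(f)
	return math.Float64frombits(f.floats[0])
}

func main() {
	fmt.Println(callScale(3, 1.5))
}
```

Float registers are stored as raw uint64 bit patterns, as in the layout, so the frame itself needs no knowledge of the target's types; only the code that builds and reads the frame does.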

Compatibility

This proposal is Go 1-compatible.

While Go assembly is not technically covered by Go 1 compatibility, this will maintain compatibility with the vast majority of assembly code using Go’s multiple ABI mechanism. This translates between Go’s existing stack-based calling convention used by all existing assembly code and Go’s internal calling convention.

There are a few known forms of unsafe code that this change will break:

  • Assembly code that invokes Go closures. The closure calling convention was never publicly documented, but there may be code that does this anyway.

  • Code that performs unsafe.Pointer arithmetic on pointers to arguments in order to observe the contents of the stack. This is a violation of the unsafe.Pointer rules today.

Implementation

We aim to implement a minimum viable register-based Go ABI for amd64 in the 1.16 time frame. As of this writing (nearing the opening of the 1.16 tree), Dan Scales has made substantial progress on ABI bridges for a simple ABI change and David Chase has made substantial progress on late call lowering. Austin Clements will lead the work with David Chase and Than McIntosh focusing on the compiler side, Cherry Zhang focusing on aspects that bridge the compiler and runtime, and Michael Knyszek focusing on the runtime.