design/40724-register-calling.md - proposal - Git at Google

 # Proposal: Register-based Go calling convention

 Author: Austin Clements, with input from Cherry Zhang, Michael
 Knyszek, Martin Möhrmann, Michael Pratt, David Chase, Keith Randall,
 Dan Scales, and Ian Lance Taylor.

 Last updated: 2020-08-10

 Discussion at https://golang.org/issue/40724.

 ## Abstract

 We propose switching the Go ABI from its current stack-based calling
 convention to a register-based calling convention.
 [Preliminary experiments
 indicate](https://github.com/golang/go/issues/18597#issue-199914923)
 this will achieve at least a 5–10% throughput improvement across a
 range of applications.
 This will remain backwards compatible with existing assembly code that
 assumes Go’s current stack-based calling convention through Go’s
 [multiple ABI
 mechanism](https://golang.org/design/27539-internal-abi).

 ## Background

 Since its initial release, Go has used a *stack-based calling
 convention* based on the Plan 9 ABI, in which arguments and result
 values are passed via memory on the stack.
 This has significant simplicity benefits: the rules of the calling
 convention are simple and build on existing struct layout rules; all
 platforms can use essentially the same conventions, leading to shared,
 portable compiler and runtime code; and call frames have an obvious
 first-class representation, which simplifies the implementation of the
 `go` and `defer` statements and reflection calls.
 Furthermore, the current Go ABI has no *callee-save registers*,
 meaning that no register contents live across a function call (any
 live state in a function must be flushed to the stack before a call).
 This simplifies stack tracing for garbage collection and stack growth
 and stack unwinding during panic recovery.

 Unfortunately, Go’s stack-based calling convention leaves a lot of
 performance on the table.
 While modern high-performance CPUs heavily optimize stack access,
 accessing arguments in registers is still roughly [40%
 faster](https://gist.github.com/aclements/ded22bb8451eead8249d22d3cd873566)
 than accessing arguments on the stack.
 Furthermore, a stack-based calling convention, especially one with no
 callee-save registers, induces additional memory traffic, which has
 secondary effects on overall performance.

 Most language implementations on most platforms use a register-based
 calling convention that passes function arguments and results via
 registers rather than memory and designates some registers as
 callee-save, allowing functions to keep state in registers across
 calls.

 ## Proposal

 We propose switching the Go ABI to a register-based calling
 convention, starting with a minimum viable product (MVP) on amd64, and
 then expanding to other architectures and improving on the MVP.

 We further propose that this calling convention should be designed
 specifically for Go, rather than using platform ABIs.
 There are several reasons for this.

 It’s incredibly tempting to use the platform calling convention, as it
 seems that would allow for more efficient language interoperability.
 Unfortunately, there are two major reasons it would do little good,
 both related to the scalability of goroutines, a central feature of
 the Go language.
 One reason goroutines scale so well is that the Go runtime dynamically
 resizes their stacks, but this imposes requirements on the ABI that
 aren’t satisfied by non-Go functions, thus requiring the runtime to
 transition out of the dynamic stack regime on a foreign call.
 Another reason is that goroutines are scheduled by the Go runtime
 rather than the OS kernel, but this means that transitions to and from
 non-Go code must be communicated to the Go scheduler.
 These two things mean that sharing a calling convention wouldn’t
 significantly lower the cost of calling non-Go code.

 The other tempting reason to use the platform calling convention would
 be tooling interoperability, particularly with debuggers and profiling
 tools.
 However, these almost universally support DWARF or, for profilers,
 frame pointer unwinding.
 Go will continue to work with DWARF-based tools and we can make the Go
 ABI compatible with platform frame pointer unwinding without otherwise
 taking on the platform ABI.

 Hence, there’s little upside to using the platform ABI.
 And there are several reasons to favor using our own ABI:

 - Most existing ABIs were based on the C language, which differs in
   important ways from Go.
   For example, most ELF ABIs (at least x64-64, ARM64, and RISC-V)
   would force Go slices to be passed on the stack rather than in
   registers because the slice header is three words.
   Similarly, because C functions rarely return more than one word,
   most platform ABIs reserve at most two registers for results.
   Since Go functions commonly return at least three words (a result
   and a two word error interface value), the platform ABI would force
   such functions to return values on the stack.
   Other things that influence the platform ABI include that array
   arguments in C are passed by reference rather than by value and
   small integer types in C are promoted to `int` rather than retaining
   their type.
   Hence, platform ABIs simply aren’t a good fit for the Go language.

 - Platform ABIs typically define callee-save registers, which place
   substantial additional requirements on a garbage collector.
   There are alternatives to callee-save registers that share many of
   their benefits, while being much better suited to Go.

 - While platform ABIs are generally similar at a high level, their
   details differ in myriad ways.
   By defining our own ABI, we can follow a common structure across all
   platforms and maintain much of the cross-platform simplicity and
   reliability of Go’s stack-based calling convention.

 The new calling convention will remain backwards-compatible with
 existing assembly code that’s based on the stack-based calling
 convention via Go’s [multiple ABI
 mechanism](https://golang.org/design/27539-internal-abi).

 This same multiple ABI mechanism allows us to continue to evolve the
 Go calling convention in future versions.
 This lets us start with a simple, minimal calling convention and
 continue to optimize it in the future.

 The rest of this proposal outlines the work necessary to switch Go to
 a register-based calling convention.
 While it lays out the requirements for the ABI, it does not describe a
 specific ABI.
 Defining a specific ABI will be one of the first implementation steps,
 and its definition should reside in a living document rather than a
 proposal.

 ## Go’s current stack-based ABI

 We give an overview of Go’s current ABI to give a sense of the
 requirements of any Go ABI and because the register-based calling
 convention builds on the same concepts.

 In the stack-based Go ABI, when a function F calls a function or
 method G, F reserves space in its own stack frame for G’s receiver (if
 it’s a method), arguments, and results.
 These are laid out in memory as if G’s receiver, arguments, and
 results were simply fields in a struct.

 There is one exception to all call state being passed on the stack: if
 G is a closure, F passes a pointer to its function object in a
 *context register*, via which G can quickly access any closed-over
 values.

 Other than a few fixed-function registers, all registers are
 caller-save, meaning F must spill any live state in registers to its
 stack frame before calling G and reload the registers after the call.

 The Go ABI also keeps a pointer to the runtime structure representing
 the current goroutine (“G”) available for quick access.
 On 386 and amd64, it is stored in thread-local storage; on all other
 platforms, it is stored in a dedicated register.<sup>1</sup>

 Every function must ensure sufficient stack space is available before
 reserving its stack frame.
 The current stack bound is stored in the runtime goroutine structure,
 which is why the ABI keeps this readily accessible.
 The standard prologue checks the stack pointer against this bound and
 calls into the runtime to grow the stack if necessary.
 In assembly code, this prologue is automatically generated by the
 assembler itself.
 Cooperative preemption is implemented by poisoning a goroutine’s stack
 bound, and thus also makes use of this standard prologue.

 Finally, both stack growth and the Go garbage collector must be able
 to find all live pointers.
 Logically, function entry and every call instruction has an associated
 bitmap indicating which slots in the local frame and the function’s
 argument frame contain live pointers.
 Sometimes liveness information is path-sensitive, in which case a
 function will have additional [*stack
 object*](https://golang.org/cl/134155) metadata.
 In all cases, all pointers are in known locations on the stack.

 <sup>1</sup> This is largely a historical accident.
 The G pointer was originally stored in a register on 386/amd64.
 This is ideal, since it’s accessed in nearly every function prologue.
 It was moved to TLS in order to support cgo, since transitions from C
 back to Go (including the runtime signal handler) needed a way to
 access the current G.
 However, when we added ARM support, it turned out accessing TLS in
 every function prologue was far too expensive on ARM, so all later
 ports used a hybrid approach where the G is stored in both a register
 and TLS and transitions from C restore it from TLS.

 ## ABI design recommendations

 Here we lay out various recommendations for the design of a
 register-based Go ABI.
 The rest of this document assumes we’ll be following these
 recommendations.

 1. Common structure across platforms.
    This dramatically simplifies porting work in the compiler and
    runtime.
    We propose that each architecture should define a sequence of
    integer and floating point registers (and in the future perhaps
    vector registers), plus size and alignment constraints, and that
    beyond this, the calling convention should be derived using a
    shared set of rules as much as possible.

 1. Efficient access to the current goroutine pointer and the context
    register for closure calls.
    Ideally these will be in registers; however, we may use TLS on
    architectures with extremely limited registers (namely, 386).

 1. Support for many-word return values.
    Go functions frequently return three or more words, so this must be
    supported efficiently.

 1. Support for scanning and adjusting pointers in register arguments
    on stack growth.
    Since the function prologue checks the stack bound before reserving
    a stack frame, the runtime must be able to spill argument registers
    and identify those containing pointers.

 1. First-class generic call frame representation.
    The `go` and `defer` statements as well as reflection calls need to
    manipulate call frames as first-class, in-memory objects.
    Reflect calls in particular are simplified by a common, generic
    representation with fairly generic bridge code (the compiler could
    generate bridge code for `go` and `defer`).

 1. No callee-save registers.
    Callee-save registers complicate stack unwinding (and garbage
    collection if pointers are allowed in callee-save registers).
    Inter-function clobber sets have many of the benefits of
    callee-save registers, but are much simpler to implement in a
    garbage collected language and are well-suited to Go’s compilation
    model.
    For an MVP, we’re unlikely to implement any form of live registers
    across calls, but we’ll want to revisit this later.

 1. Where possible, be compatible with platform frame-pointer unwinding
    rules.
    This helps Go interoperate with system-level profilers, and can
    potentially be used to optimize stack unwinding in Go itself.

 There are also some notable non-requirements:

 1. No compatibility with the platform ABI (other than frame pointers).
    This has more downsides and upsides, as described above.

 1. No binary compatibility between Go versions.
    This is important for shared libraries in C, but Go already
    requires all shared libraries in a process to use the same Go
    toolchain version.
    This means we can continue to evolve and improve the ABI.

 ## Toolchain changes overview

 This section outlines the changes that will be necessary to the Go
 build toolchain and runtime.
 The "Detailed design" section will go into greater depth on some of
 these.

 ### Compiler

 *Abstract argument registers*: The compiler’s register allocator will
 need to allocate function arguments and results to the appropriate
 registers.
 However, it needs to represent argument and result registers in a
 platform-independent way prior to architecture lowering and register
 allocation.
 We propose introducing generic SSA values to represent the argument
 and result registers, as done in [David Chase’s
 prototype](https://golang.org/cl/28832).
 These would simply represent the *i*th argument/result register and
 register allocation would assign them to the appropriate architecture
 registers.
 Having a common ABI structure across platforms means the
 architecture-independent parts of the compiler would only need to know
 how many argument/result registers the target architecture has.

 *Late call lowering*: Call lowering and argument frame construction
 currently happen during AST to SSA lowering, which happens well before
 register allocation.
 Hence, we propose moving call lowering much later in the compilation
 process.
 Late call lowering will have knock-on effects, as the current approach
 hides a lot of the structure of calls from most optimization passes.

 *ABI bridges*: For compatibility with existing assembly code, the
 compiler must generate ABI bridges when calling between Go
 (ABIInternal) and assembly (ABI0) code, as described in the [internal
 ABI proposal](https://golang.org/design/27539-internal-abi).
 These are small functions that translate between ABIs according to a
 function’s type.
 While the compiler currently differentiates between the two ABIs
 internally, since they’re actually identical right now, it currently
 only generates *ABI aliases* and has no mechanism for generating ABI
 bridges.
 As a post-MVP optimization, the compiler should inline these ABI
 bridges where possible.

 *Argument GC map*: The garbage collector needs to know which arguments
 contain live pointers at function entry and at any calls (since these
 are preemption points).
 Currently this is represented as a bitmap over words in the function’s
 argument frame.
 With the register-based ABI, the compiler will need to emit a liveness
 map for argument registers for the function entry point.
 Since initially we won't have any live registers across calls, live
 arguments will be spilled to the stack at a call, so the compiler does
 *not* need to emit register maps at calls.
 For functions that still require a stack argument frame (because their
 arguments don’t all fit in registers), the compiler will also need to
 emit argument frame liveness maps at the same points it does today.

 *Traceback argument maps*: Go tracebacks currently display a simple
 word-based hex dump of a function’s argument frame.
 This is not particularly user-friendly nor high-fidelity, but it can
 be incredibly valuable for debugging.
 With a register-based ABI, there’s a wide range of possible designs
 for retaining this functionality.
 For an MVP, we propose trying to maintain a similar level of fidelity.
 In the future, we may want more detailed maps, or may want to simply
 switch to using DWARF location descriptions.

 To that end, we propose that the compiler should emit two logical
 maps: a *location map* from (PC, argument word index) to
 register/`stack`/`dead` and a *home map* from argument word index to
 stack home (if any).
 Since a named variable’s stack spill home is fixed if it ever spills,
 the location map can use a single distinguished value for `stack` that
 tells the runtime to refer to the home map.
 This approach works well for an ABI that passes argument values in
 separate registers without packing small values.
 The `dead` value is not necessarily the same as the garbage
 collector’s notion of a dead slot: for the garbage collector, you want
 slots to become dead as soon as possible, while for debug printing,
 you want them to stay live as long as possible (until clobbered by
 something else).

 The exact encoding of these tables is to be determined.
 Most likely, we’ll want to introduce pseudo-ops for representing
 changes in the location map that the `cmd/internal/obj` package can
 then encode into `FUNCDATA`.
 The home map could be produced directly by the compiler as `FUNCDATA`.

 *DWARF locations*: The compiler will need to generate DWARF location
 lists for arguments and results.
 It already has this ability for local variables, and we should reuse
 that as much as possible.
 We will need to ensure Delve and GDB are compatible with this.
 Both already support location lists in general, so this is unlikely to
 require much (if any) work in these debuggers.

 Clobber sets will require further changes, which we discuss later.
 We propose not implementing clobber sets (or any form of callee-save)
 for the MVP.

 ### Linker

 The linker requires relatively minor changes, all related to ABI
 bridges.

 *Eliminate ABI aliases*: Currently, the linker resolves ABI aliases
 generated by the compiler by treating all references to a symbol
 aliased under one ABI as references to the symbol another the other
 ABI.
 Once the compiler generates ABI bridges rather than aliases, we can
 remove this mechanism, which is likely to simplify and speed up the
 linker somewhat.

 *ABI name mangling*: Since Go ABIs work by having multiple symbol
 definitions under the same name, the linker will also need to
 implement a name mangling scheme for non-Go symbol tables.

 ### Runtime

 *First-class call frame representation*: The `go` and `defer`
 statements and reflection calls must manipulate call frames as
 first-class objects.
 While the requirements of these three cases differ, we propose having
 a common first-class call frame representation that can capture a
 function’s register and stack arguments and record its register and
 stack results, along with a small set of generic call bridges that
 invoke a call using the generic call frame.

 *Stack growth*: Almost every Go function checks for sufficient stack
 space before opening its local stack frame.
 If there is insufficient space, it calls into the `runtime.morestack`
 function to grow the stack.
 Currently, `morestack` saves only the calling PC, the stack pointer,
 and the context register (if any) because these are the only registers
 that can be live at function entry.
 With register-based arguments, `morestack` will also have to save all
 argument registers.
 We propose that it simply spill all *possible* argument registers
 rather than trying to be specific to the function; `morestack` is
 relatively rare, so the cost is this is unlikely to be noticeable.
 It’s likely possible to spill all argument registers to the stack
 itself: every function that can grow the stack ensures that there’s
 room not only for its local frame, but also for a reasonably large
 “guard” space.
 `morestack` can spill into this guard space.
 The garbage collector can recognize `morestack`’s spill space and use
 the argument map of its caller as the stack map of `morestack`.

 *Runtime assembly*: While Go’s multiple ABI mechanism makes it
 generally possible to transparently call between Go and assembly code
 even if they’re using different ABIs, there are runtime assembly
 functions that have deep knowledge of the Go ABI and will have to be
 modified.
 This includes any function that takes a closure (`mcall`,
 `systemstack`), is called in a special context (`morestack`), or is
 involved in reflection-like calls (`reflectcall`, `debugCallV1`).

 *Cgo wrappers*: Generated cgo wrappers marked with
 `//go:cgo_unsafe_args` currently access their argument structure by
 casting a pointer to their first argument.
 This violates the `unsafe.Pointer` rules and will no longer work with
 this change.
 We can either special case `//go:cgo_unsafe_args` functions to use
 ABI0 or change the way these wrappers are generated.

 *Stack unwinding for panic recovery*: When a panic is recovered, the
 Go runtime must unwind the panicking stack and resume execution after
 the deferred call of the recovering function.
 For the MVP, we propose not retaining any live registers across calls,
 in which case stack unwinding will not have to change.
 This is not the case with callee-save registers or clobber sets.

 *Traceback argument printing*: As mentioned in the compiler section,
 the runtime currently prints a hex dump of function arguments in panic
 tracebacks.
 This will have to consume the new traceback argument metadata produced
 by the compiler.

 ## Detailed design

 This section dives deeper into some of the toolchain changes described
 above.
 We’ll expand this section over time.

 ### `go`, `defer` and reflection calls

 Above we proposed using a first-class call frame representation for
 `go` and `defer` statements and reflection calls with a small set of
 call bridges.
 These three cases have somewhat different requirements:

 - The types of `go` and `defer` calls are known statically, while
   reflect calls are not.
   This means the compiler could statically generate bridges to
   unmarshall arguments for `go` and `defer` calls, but this isn’t an
   option for reflection calls.

 - The return values of `go` and `defer` calls are always ignored,
   while reflection calls must capture results.
   This means a call bridge for a `go` or `defer` call can be a tail
   call, while reflection calls can require marshalling return values.

 - Call frames for `go` and `defer` calls are long-lived, while
   reflection call frames are transient.
   This means the garbage collector must be able to scan `go` and
   `defer` call frames, while we could use non-preemptible regions for
   reflection calls.

 - Finally, `go` call frames are stored directly on the stack, while
   `defer` and reflection call frames may be constructed in the heap.
   This means the garbage collector must be able to construct the
   appropriate stack map for `go` call frames, but `defer` and
   reflection call frames can use the heap bitmap.
   It also means `defer` and reflection calls that require stack
   arguments must copy that part of the call frame from the heap to the
   stack, though we don’t expect this to be the common case.

 To satisfy these requirements, we propose the following generic
 call-frame representation:

 ```
 struct {
     pc           uintptr          // PC of target function
     nInt, nFloat uintptr          // # of int and float registers
     ints         [nInt]uintptr    // Int registers
     floats       [nFloat]uint64   // Float registers
     ctxt         uintptr          // Context register
     stack        [...]uintptr     // Stack arguments/result space
 }
 ```

 `go` calls can build this structure on the new goroutine stack and the
 call bridge can pop the register part of this structure from the
 stack, leaving just the `stack` part on the stack, and tail-call `pc`.
 The garbage collector can recognize this call bridge and construct the
 stack map by inspecting the `pc` in the call frame.

 `defer` and reflection calls can build frames in the heap with the
 appropriate heap bitmap.
 The call bridge in these cases must open a new stack frame, copy
 `stack` to the stack, load the register arguments, call `pc`, and then
 copy the register results and the stack results back to the in-heap
 frame (using write barriers where necessary).
 It may be valuable to have optimized versions of this bridge for
 tail-calls (always the case for `defer`) and register-only calls
 (likely a common case).
 In the register-only reflection call case, the bridge could take the
 register arguments as arguments itself and return register results as
 results; this would avoid any copying or write barriers.

 ## Compatibility

 This proposal is Go 1-compatible.

 While Go assembly is not technically covered by Go 1 compatibility,
 this will maintain compatibility with the vast majority of assembly
 code using Go’s [multiple ABI
 mechanism](https://golang.org/design/27539-internal-abi).
 This translates between Go’s existing stack-based calling convention
 used by all existing assembly code and Go’s internal calling
 convention.

 There are a few known forms of unsafe code that this change will
 break:

 - Assembly code that invokes Go closures.
   The closure calling convention was never publicly documented, but
   there may be code that does this anyway.

 - Code that performs `unsafe.Pointer` arithmetic on pointers to
   arguments in order to observe the contents of the stack.
   This is a violation of the [`unsafe.Pointer`
   rules](https://pkg.go.dev/unsafe#Pointer) today.

 ## Implementation

 We aim to implement a minimum viable register-based Go ABI for amd64
 in the 1.16 time frame.
 As of this writing (nearing the opening of the 1.16 tree), Dan Scales
 has made substantial progress on ABI bridges for a simple ABI change
 and David Chase has made substantial progress on late call lowering.
 Austin Clements will lead the work with David Chase and Than McIntosh
 focusing on the compiler side, Cherry Zhang focusing on aspects that
 bridge the compiler and runtime, and Michael Knyszek focusing on the
 runtime.