design/40724-register-calling: add design doc
Reviewed-by: Keith Randall <email@example.com>
Reviewed-by: Cherry Zhang <firstname.lastname@example.org>
Reviewed-by: Martin Möhrmann <email@example.com>
Reviewed-by: Michael Knyszek <firstname.lastname@example.org>
Reviewed-by: Michael Pratt <email@example.com>
diff --git a/design/40724-register-calling.md b/design/40724-register-calling.md
new file mode 100644
@@ -0,0 +1,548 @@
+# Proposal: Register-based Go calling convention
+Author: Austin Clements, with input from Cherry Zhang, Michael
+Knyszek, Martin Möhrmann, Michael Pratt, David Chase, Keith Randall,
+Dan Scales, and Ian Lance Taylor.
+Last updated: 2020-08-10
+Discussion at https://golang.org/issue/40724.
+We propose switching the Go ABI from its current stack-based calling
+convention to a register-based calling convention.
+This will achieve at least a 5–10% throughput improvement across a
+range of applications.
+This will remain backwards compatible with existing assembly code that
+assumes Go’s current stack-based calling convention through Go’s
+Since its initial release, Go has used a *stack-based calling
+convention* based on the Plan 9 ABI, in which arguments and result
+values are passed via memory on the stack.
+This has significant simplicity benefits: the rules of the calling
+convention are simple and build on existing struct layout rules; all
+platforms can use essentially the same conventions, leading to shared,
+portable compiler and runtime code; and call frames have an obvious
+first-class representation, which simplifies the implementation of the
+`go` and `defer` statements and reflection calls.
+Furthermore, the current Go ABI has no *callee-save registers*,
+meaning that no register contents live across a function call (any
+live state in a function must be flushed to the stack before a call).
+This simplifies stack tracing for garbage collection and stack growth
+and stack unwinding during panic recovery.
+Unfortunately, Go’s stack-based calling convention leaves a lot of
+performance on the table.
+While modern high-performance CPUs heavily optimize stack access,
+accessing arguments in registers is still roughly 40% faster
+than accessing arguments on the stack.
+Furthermore, a stack-based calling convention, especially one with no
+callee-save registers, induces additional memory traffic, which has
+secondary effects on overall performance.
+Most language implementations on most platforms use a register-based
+calling convention that passes function arguments and results via
+registers rather than memory and designates some registers as
+callee-save, allowing functions to keep state in registers across calls.
+We propose switching the Go ABI to a register-based calling
+convention, starting with a minimum viable product (MVP) on amd64, and
+then expanding to other architectures and improving on the MVP.
+We further propose that this calling convention should be designed
+specifically for Go, rather than using platform ABIs.
+There are several reasons for this.
+It’s incredibly tempting to use the platform calling convention, as it
+seems that would allow for more efficient language interoperability.
+Unfortunately, there are two major reasons it would do little good,
+both related to the scalability of goroutines, a central feature of
+the Go language.
+One reason goroutines scale so well is that the Go runtime dynamically
+resizes their stacks, but this imposes requirements on the ABI that
+aren’t satisfied by non-Go functions, thus requiring the runtime to
+transition out of the dynamic stack regime on a foreign call.
+Another reason is that goroutines are scheduled by the Go runtime
+rather than the OS kernel, but this means that transitions to and from
+non-Go code must be communicated to the Go scheduler.
+These two things mean that sharing a calling convention wouldn’t
+significantly lower the cost of calling non-Go code.
+The other tempting reason to use the platform calling convention would
+be tooling interoperability, particularly with debuggers and profiling tools.
+However, these almost universally support DWARF or, for profilers,
+frame pointer unwinding.
+Go will continue to work with DWARF-based tools and we can make the Go
+ABI compatible with platform frame pointer unwinding without otherwise
+taking on the platform ABI.
+Hence, there’s little upside to using the platform ABI.
+And there are several reasons to favor using our own ABI:
+- Most existing ABIs were based on the C language, which differs in
+ important ways from Go.
 For example, most ELF ABIs (at least x86-64, ARM64, and RISC-V)
+ would force Go slices to be passed on the stack rather than in
+ registers because the slice header is three words.
+ Similarly, because C functions rarely return more than one word,
+ most platform ABIs reserve at most two registers for results.
+ Since Go functions commonly return at least three words (a result
+ and a two word error interface value), the platform ABI would force
+ such functions to return values on the stack.
+ Other things that influence the platform ABI include that array
+ arguments in C are passed by reference rather than by value and
+ small integer types in C are promoted to `int` rather than retaining
+ their type.
+ Hence, platform ABIs simply aren’t a good fit for the Go language.
+- Platform ABIs typically define callee-save registers, which place
+ substantial additional requirements on a garbage collector.
+ There are alternatives to callee-save registers that share many of
+ their benefits, while being much better suited to Go.
+- While platform ABIs are generally similar at a high level, their
+ details differ in myriad ways.
+ By defining our own ABI, we can follow a common structure across all
+ platforms and maintain much of the cross-platform simplicity and
+ reliability of Go’s stack-based calling convention.
+The new calling convention will remain backwards-compatible with
+existing assembly code that’s based on the stack-based calling
+convention via Go’s multiple ABI mechanism.
+This same multiple ABI mechanism allows us to continue to evolve the
+Go calling convention in future versions.
+This lets us start with a simple, minimal calling convention and
+continue to optimize it in the future.
+The rest of this proposal outlines the work necessary to switch Go to
+a register-based calling convention.
+While it lays out the requirements for the ABI, it does not describe a specific ABI.
+Defining a specific ABI will be one of the first implementation steps,
+and its definition should reside in a living document rather than a
+## Go’s current stack-based ABI
+We give an overview of Go’s current ABI to give a sense of the
+requirements of any Go ABI and because the register-based calling
+convention builds on the same concepts.
+In the stack-based Go ABI, when a function F calls a function or
+method G, F reserves space in its own stack frame for G’s receiver (if
+it’s a method), arguments, and results.
+These are laid out in memory as if G’s receiver, arguments, and
+results were simply fields in a struct.
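+As a rough illustration of this layout rule, the sketch below models the argument frame of a hypothetical method `func (r *receiver) G(n int32, s []byte) (int, error)` as an ordinary Go struct and reports the field offsets.
+The names `receiver`, `frame`, and `frameOffsets` are assumptions made for this example; the real frame is constructed by the compiler, and the actual ABI has additional alignment rules that this model does not capture.

```go
package main

import (
	"fmt"
	"unsafe"
)

// wordSize is the size of one machine word on the platform running this sketch.
const wordSize = unsafe.Sizeof(uintptr(0))

// receiver is a hypothetical receiver type for the method G.
type receiver struct{ id int }

// frame models how the stack-based ABI lays out G's receiver,
// arguments, and results: as if they were consecutive struct fields.
type frame struct {
	recv *receiver // receiver
	n    int32     // first argument
	s    []byte    // second argument (three-word slice header)
	r0   int       // first result
	r1   error     // second result (two-word interface value)
}

// frameOffsets returns the offsets of the arguments and results
// within the modeled frame.
func frameOffsets() (n, s, r0, r1 uintptr) {
	var f frame
	return unsafe.Offsetof(f.n), unsafe.Offsetof(f.s),
		unsafe.Offsetof(f.r0), unsafe.Offsetof(f.r1)
}

func main() {
	n, s, r0, r1 := frameOffsets()
	fmt.Printf("word=%d n@%d s@%d r0@%d r1@%d size=%d\n",
		wordSize, n, s, r0, r1, unsafe.Sizeof(frame{}))
}
```

+The padding after `n` falls out of the ordinary struct layout rules, which is the sense in which the calling convention builds on them.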
+There is one exception to all call state being passed on the stack: if
+G is a closure, F passes a pointer to its function object in a
+*context register*, via which G can quickly access any closed-over
+Other than a few fixed-function registers, all registers are
+caller-save, meaning F must spill any live state in registers to its
+stack frame before calling G and reload the registers after the call.
+The Go ABI also keeps a pointer to the runtime structure representing
+the current goroutine (“G”) available for quick access.
+On 386 and amd64, it is stored in thread-local storage; on all other
+platforms, it is stored in a dedicated register.<sup>1</sup>
+Every function must ensure sufficient stack space is available before
+reserving its stack frame.
+The current stack bound is stored in the runtime goroutine structure,
+which is why the ABI keeps this readily accessible.
+The standard prologue checks the stack pointer against this bound and
+calls into the runtime to grow the stack if necessary.
+In assembly code, this prologue is automatically generated by the
+Cooperative preemption is implemented by poisoning a goroutine’s stack
+bound, and thus also makes use of this standard prologue.
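+The prologue check and the preemption trick can be modeled in a few lines of ordinary Go.
+This is only a simulation under stated assumptions: the type `g`, the field `stackGuard`, the constant `stackPreempt`, and the function `prologue` are hypothetical names chosen for the sketch, and the real prologue is a short machine-code sequence rather than a Go function.

```go
package main

import "fmt"

// stackPreempt is a hypothetical poison value. It is larger than any
// real stack pointer, so every prologue check fails once it is installed.
const stackPreempt = ^uintptr(0)

// g is a toy stand-in for the runtime's goroutine structure.
type g struct {
	stackGuard uintptr // bound compared against SP in each prologue
	preempt    bool    // set when the scheduler wants this goroutine to yield
}

// prologue models the check at function entry. The stack grows down,
// so the frame fits only if SP stays above the guard after reserving it.
// It reports which path the function would take.
func prologue(gp *g, sp, frameSize uintptr) string {
	if sp >= frameSize && sp-frameSize > gp.stackGuard {
		return "run" // fast path: enough stack, continue into the body
	}
	// Slow path: call into the runtime, which tells the two cases apart.
	if gp.preempt {
		return "preempt"
	}
	return "grow"
}

func main() {
	gp := &g{stackGuard: 0x1380}
	fmt.Println(prologue(gp, 0x2000, 64)) // plenty of room
	fmt.Println(prologue(gp, 0x1390, 64)) // frame would cross the guard

	// Cooperative preemption: poison the bound so the next check fails.
	gp.preempt = true
	gp.stackGuard = stackPreempt
	fmt.Println(prologue(gp, 0x2000, 64))
}
```

+Because preemption reuses the same comparison, it adds no extra instructions to the fast path.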
+Finally, both stack growth and the Go garbage collector must be able
+to find all live pointers.
+Logically, function entry and every call instruction has an associated
+bitmap indicating which slots in the local frame and the function’s
+argument frame contain live pointers.
+Sometimes liveness information is path-sensitive, in which case a
+function will have additional [*stack
+In all cases, all pointers are in known locations on the stack.
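+A pointer bitmap of this kind is simple to model.
+The sketch below is a toy version, assuming one bit per frame word; the names `stackMap`, `markPointer`, and `livePointers` are hypothetical, and the runtime's actual encoding is not shown here.

```go
package main

import "fmt"

// stackMap is a toy pointer bitmap: bit i is set when word i of the
// frame holds a live pointer at the associated PC.
type stackMap struct {
	bits []byte
	n    int // number of words covered by the map
}

func newStackMap(n int) *stackMap {
	return &stackMap{bits: make([]byte, (n+7)/8), n: n}
}

// markPointer records that the given slot holds a live pointer.
func (m *stackMap) markPointer(slot int) {
	m.bits[slot/8] |= 1 << (slot % 8)
}

// livePointers lists the slots a garbage collector or stack copier
// would have to visit for this frame.
func (m *stackMap) livePointers() []int {
	var slots []int
	for i := 0; i < m.n; i++ {
		if m.bits[i/8]&(1<<(i%8)) != 0 {
			slots = append(slots, i)
		}
	}
	return slots
}

func main() {
	// A five-word frame: an int, a *T, and a three-word slice header
	// whose first word is the data pointer.
	m := newStackMap(5)
	m.markPointer(1) // the *T
	m.markPointer(2) // the slice's data pointer
	fmt.Println(m.livePointers())
}
```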
+<sup>1</sup> This is largely a historical accident.
+The G pointer was originally stored in a register on 386/amd64.
+This is ideal, since it’s accessed in nearly every function prologue.
+It was moved to TLS in order to support cgo, since transitions from C
+back to Go (including the runtime signal handler) needed a way to
+access the current G.
+However, when we added ARM support, it turned out accessing TLS in
+every function prologue was far too expensive on ARM, so all later
+ports used a hybrid approach where the G is stored in both a register
+and TLS and transitions from C restore it from TLS.
+## ABI design recommendations
+Here we lay out various recommendations for the design of a
+register-based Go ABI.
+The rest of this document assumes we’ll be following these
+1. Common structure across platforms.
+ This dramatically simplifies porting work in the compiler and
+ We propose that each architecture should define a sequence of
+ integer and floating point registers (and in the future perhaps
+ vector registers), plus size and alignment constraints, and that
+ beyond this, the calling convention should be derived using a
+ shared set of rules as much as possible.
+1. Efficient access to the current goroutine pointer and the context
+ register for closure calls.
+ Ideally these will be in registers; however, we may use TLS on
+ architectures with extremely limited registers (namely, 386).
+1. Support for many-word return values.
+ Go functions frequently return three or more words, so this must be
+ supported efficiently.
+1. Support for scanning and adjusting pointers in register arguments
+ on stack growth.
+ Since the function prologue checks the stack bound before reserving
+ a stack frame, the runtime must be able to spill argument registers
+ and identify those containing pointers.
+1. First-class generic call frame representation.
+ The `go` and `defer` statements as well as reflection calls need to
+ manipulate call frames as first-class, in-memory objects.
+ Reflect calls in particular are simplified by a common, generic
+ representation with fairly generic bridge code (the compiler could
+ generate bridge code for `go` and `defer`).
+1. No callee-save registers.
+ Callee-save registers complicate stack unwinding (and garbage
+ collection if pointers are allowed in callee-save registers).
+ Inter-function clobber sets have many of the benefits of
+ callee-save registers, but are much simpler to implement in a
+ garbage collected language and are well-suited to Go’s compilation
+ For an MVP, we’re unlikely to implement any form of live registers
+ across calls, but we’ll want to revisit this later.
+1. Where possible, be compatible with platform frame-pointer unwinding
+ This helps Go interoperate with system-level profilers, and can
+ potentially be used to optimize stack unwinding in Go itself.
+There are also some notable non-requirements:
+1. No compatibility with the platform ABI (other than frame pointers).
 This has more downsides than upsides, as described above.
+1. No binary compatibility between Go versions.
+ This is important for shared libraries in C, but Go already
+ requires all shared libraries in a process to use the same Go
+ toolchain version.
+ This means we can continue to evolve and improve the ABI.
+## Toolchain changes overview
+This section outlines the changes that will be necessary to the Go
+build toolchain and runtime.
+The "Detailed design" section will go into greater depth on some of
+*Abstract argument registers*: The compiler’s register allocator will
+need to allocate function arguments and results to the appropriate
+However, it needs to represent argument and result registers in a
+platform-independent way prior to architecture lowering and register
+We propose introducing generic SSA values to represent the argument
+and result registers, as done in [David Chase’s
+These would simply represent the *i*th argument/result register and
+register allocation would assign them to the appropriate architecture
+Having a common ABI structure across platforms means the
+architecture-independent parts of the compiler would only need to know
+how many argument/result registers the target architecture has.
+*Late call lowering*: Call lowering and argument frame construction
+currently happen during AST to SSA lowering, which happens well before
+Hence, we propose moving call lowering much later in the compilation
+Late call lowering will have knock-on effects, as the current approach
+hides a lot of the structure of calls from most optimization passes.
+*ABI bridges*: For compatibility with existing assembly code, the
+compiler must generate ABI bridges when calling between Go
+(ABIInternal) and assembly (ABI0) code, as described in the [internal
+These are small functions that translate between ABIs according to a
+While the compiler currently differentiates between the two ABIs
+internally, since they’re actually identical right now, it currently
+only generates *ABI aliases* and has no mechanism for generating ABI
+As a post-MVP optimization, the compiler should inline these ABI
+bridges where possible.
+*Argument GC map*: The garbage collector needs to know which arguments
+contain live pointers at function entry and at any calls (since these
+are preemption points).
+Currently this is represented as a bitmap over words in the function’s
+With the register-based ABI, the compiler will need to emit a liveness
+map for argument registers for the function entry point.
+Since initially we won't have any live registers across calls, live
+arguments will be spilled to the stack at a call, so the compiler does
+*not* need to emit register maps at calls.
+For functions that still require a stack argument frame (because their
+arguments don’t all fit in registers), the compiler will also need to
+emit argument frame liveness maps at the same points it does today.
+*Traceback argument maps*: Go tracebacks currently display a simple
+word-based hex dump of a function’s argument frame.
+This is not particularly user-friendly nor high-fidelity, but it can
+be incredibly valuable for debugging.
+With a register-based ABI, there’s a wide range of possible designs
+for retaining this functionality.
+For an MVP, we propose trying to maintain a similar level of fidelity.
+In the future, we may want more detailed maps, or may want to simply
+switch to using DWARF location descriptions.
+To that end, we propose that the compiler should emit two logical
+maps: a *location map* from (PC, argument word index) to
+register/`stack`/`dead` and a *home map* from argument word index to
+stack home (if any).
+Since a named variable’s stack spill home is fixed if it ever spills,
+the location map can use a single distinguished value for `stack` that
+tells the runtime to refer to the home map.
+This approach works well for an ABI that passes argument values in
+separate registers without packing small values.
+The `dead` value is not necessarily the same as the garbage
+collector’s notion of a dead slot: for the garbage collector, you want
+slots to become dead as soon as possible, while for debug printing,
+you want them to stay live as long as possible (until clobbered by
+The exact encoding of these tables is to be determined.
+Most likely, we’ll want to introduce pseudo-ops for representing
+changes in the location map that the `cmd/internal/obj` package can
+then encode into `FUNCDATA`.
+The home map could be produced directly by the compiler as `FUNCDATA`.
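+Since the encoding is still to be determined, the following is only a sketch of the two logical maps as plain Go data structures, to show how a traceback printer might consult them.
+The types `locEntry` and `argMaps`, the `lookup` method, the numeric values chosen for `stack` and `dead`, and the example PCs and offsets are all assumptions made for this illustration.

```go
package main

import "fmt"

// location says where one argument word lives at a given PC.
// Values >= 0 are abstract argument register numbers.
type location int

const (
	dead  location = -2 // the value has been clobbered
	stack location = -1 // spilled: consult the home map
)

// locEntry gives the location of every argument word from pc onward,
// until the next entry takes over.
type locEntry struct {
	pc   uintptr
	locs []location // indexed by argument word
}

// argMaps bundles the two logical maps for one function.
type argMaps struct {
	locations []locEntry // location map, sorted by pc
	home      []int      // home map: word index -> frame offset, -1 if none
}

// lookup describes where the given argument word can be found at pc.
func (m *argMaps) lookup(pc uintptr, word int) string {
	var cur []location
	for _, e := range m.locations {
		if e.pc > pc {
			break
		}
		cur = e.locs
	}
	if cur == nil {
		return "unknown"
	}
	switch l := cur[word]; {
	case l == dead:
		return "dead"
	case l == stack:
		return fmt.Sprintf("stack+%d", m.home[word])
	default:
		return fmt.Sprintf("reg%d", l)
	}
}

// exampleMaps describes a two-word argument list: word 0 is spilled
// at PC 0x10, and word 1 is clobbered at PC 0x20 without ever spilling.
func exampleMaps() *argMaps {
	return &argMaps{
		locations: []locEntry{
			{pc: 0x00, locs: []location{0, 1}},
			{pc: 0x10, locs: []location{stack, 1}},
			{pc: 0x20, locs: []location{stack, dead}},
		},
		home: []int{8, -1},
	}
}

func main() {
	m := exampleMaps()
	for _, pc := range []uintptr{0x04, 0x18, 0x24} {
		fmt.Printf("pc=%#x word0=%s word1=%s\n", pc, m.lookup(pc, 0), m.lookup(pc, 1))
	}
}
```

+Note that only the location map varies with the PC; the home map is fixed per function, which is why a single distinguished `stack` value suffices.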
+*DWARF locations*: The compiler will need to generate DWARF location
+lists for arguments and results.
+It already has this ability for local variables, and we should reuse
+that as much as possible.
+We will need to ensure Delve and GDB are compatible with this.
+Both already support location lists in general, so this is unlikely to
+require much (if any) work in these debuggers.
+Clobber sets will require further changes, which we discuss later.
+We propose not implementing clobber sets (or any form of callee-save)
+for the MVP.
+The linker requires relatively minor changes, all related to ABI
+*Eliminate ABI aliases*: Currently, the linker resolves ABI aliases
+generated by the compiler by treating all references to a symbol
+aliased under one ABI as references to the symbol under the other ABI.
+Once the compiler generates ABI bridges rather than aliases, we can
+remove this mechanism, which is likely to simplify and speed up the
+*ABI name mangling*: Since Go ABIs work by having multiple symbol
+definitions under the same name, the linker will also need to
+implement a name mangling scheme for non-Go symbol tables.
+*First-class call frame representation*: The `go` and `defer`
+statements and reflection calls must manipulate call frames as
+While the requirements of these three cases differ, we propose having
+a common first-class call frame representation that can capture a
+function’s register and stack arguments and record its register and
+stack results, along with a small set of generic call bridges that
+invoke a call using the generic call frame.
+*Stack growth*: Almost every Go function checks for sufficient stack
+space before opening its local stack frame.
+If there is insufficient space, it calls into the `runtime.morestack`
+function to grow the stack.
+Currently, `morestack` saves only the calling PC, the stack pointer,
+and the context register (if any) because these are the only registers
+that can be live at function entry.
+With register-based arguments, `morestack` will also have to save all
+We propose that it simply spill all *possible* argument registers
+rather than trying to be specific to the function; `morestack` is
+relatively rare, so the cost of this is unlikely to be noticeable.
+It’s likely possible to spill all argument registers to the stack
+itself: every function that can grow the stack ensures that there’s
+room not only for its local frame, but also for a reasonably large
+`morestack` can spill into this guard space.
+The garbage collector can recognize `morestack`’s spill space and use
+the argument map of its caller as the stack map of `morestack`.
+*Runtime assembly*: While Go’s multiple ABI mechanism makes it
+generally possible to transparently call between Go and assembly code
+even if they’re using different ABIs, there are runtime assembly
+functions that have deep knowledge of the Go ABI and will have to be
+This includes any function that takes a closure (`mcall`,
+`systemstack`), is called in a special context (`morestack`), or is
+involved in reflection-like calls (`reflectcall`, `debugCallV1`).
+*Cgo wrappers*: Generated cgo wrappers marked with
+`//go:cgo_unsafe_args` currently access their argument structure by
+casting a pointer to their first argument.
+This violates the `unsafe.Pointer` rules and will no longer work with
+We can either special case `//go:cgo_unsafe_args` functions to use
+ABI0 or change the way these wrappers are generated.
+*Stack unwinding for panic recovery*: When a panic is recovered, the
+Go runtime must unwind the panicking stack and resume execution after
+the deferred call of the recovering function.
+For the MVP, we propose not retaining any live registers across calls,
+in which case stack unwinding will not have to change.
+This is not the case with callee-save registers or clobber sets.
+*Traceback argument printing*: As mentioned in the compiler section,
+the runtime currently prints a hex dump of function arguments in panic
+This will have to consume the new traceback argument metadata produced
+by the compiler.
+## Detailed design
+This section dives deeper into some of the toolchain changes described
+We’ll expand this section over time.
+### `go`, `defer` and reflection calls
+Above we proposed using a first-class call frame representation for
+`go` and `defer` statements and reflection calls with a small set of
+These three cases have somewhat different requirements:
+- The types of `go` and `defer` calls are known statically, while
+ reflect calls are not.
+ This means the compiler could statically generate bridges to
 unmarshal arguments for `go` and `defer` calls, but this isn’t an
+ option for reflection calls.
+- The return values of `go` and `defer` calls are always ignored,
+ while reflection calls must capture results.
+ This means a call bridge for a `go` or `defer` call can be a tail
+ call, while reflection calls can require marshalling return values.
+- Call frames for `go` and `defer` calls are long-lived, while
+ reflection call frames are transient.
+ This means the garbage collector must be able to scan `go` and
+ `defer` call frames, while we could use non-preemptible regions for
+ reflection calls.
+- Finally, `go` call frames are stored directly on the stack, while
+ `defer` and reflection call frames may be constructed in the heap.
+ This means the garbage collector must be able to construct the
+ appropriate stack map for `go` call frames, but `defer` and
+ reflection call frames can use the heap bitmap.
+ It also means `defer` and reflection calls that require stack
+ arguments must copy that part of the call frame from the heap to the
+ stack, though we don’t expect this to be the common case.
+To satisfy these requirements, we propose the following generic
+ pc uintptr // PC of target function
+ nInt, nFloat uintptr // # of int and float registers
+ ints [nInt]uintptr // Int registers
+ floats [nFloat]uint64 // Float registers
+ ctxt uintptr // Context register
+ stack [...]uintptr // Stack arguments/result space
+`go` calls can build this structure on the new goroutine stack and the
+call bridge can pop the register part of this structure from the
+stack, leaving just the `stack` part on the stack, and tail-call `pc`.
+The garbage collector can recognize this call bridge and construct the
+stack map by inspecting the `pc` in the call frame.
+`defer` and reflection calls can build frames in the heap with the
+appropriate heap bitmap.
+The call bridge in these cases must open a new stack frame, copy
+`stack` to the stack, load the register arguments, call `pc`, and then
+copy the register results and the stack results back to the in-heap
+frame (using write barriers where necessary).
+It may be valuable to have optimized versions of this bridge for
+tail-calls (always the case for `defer`) and register-only calls
+(likely a common case).
+In the register-only reflection call case, the bridge could take the
+register arguments as arguments itself and return register results as
+results; this would avoid any copying or write barriers.
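+The heap-frame path described above can be simulated in plain Go to show the order of the copies.
+This is a model under heavy assumptions, not runtime code: slices stand in for the variable-length arrays, a Go function value stands in for the target `pc`, the target is assumed to leave its results in the same registers and stack slots it received, and the names `callFrame`, `callBridge`, and `addTarget` are hypothetical.

```go
package main

import "fmt"

// target stands in for the code at the frame's pc. It receives the
// "registers" and "stack" and overwrites them with its results.
type target func(ints []uintptr, floats []uint64, stack []uintptr)

// callFrame is a Go rendering of the generic call frame, with slices
// standing in for the variable-length arrays.
type callFrame struct {
	pc     target
	ints   []uintptr // int registers
	floats []uint64  // float registers
	ctxt   uintptr   // context register (unused in this sketch)
	stack  []uintptr // stack arguments/result space
}

// callBridge models the bridge used for frames built in the heap.
func callBridge(f *callFrame) {
	// Open a new stack frame and copy the stack part into it.
	stk := append([]uintptr(nil), f.stack...)
	// Load the register arguments.
	ints := append([]uintptr(nil), f.ints...)
	floats := append([]uint64(nil), f.floats...)
	// Call the target.
	f.pc(ints, floats, stk)
	// Copy register and stack results back to the in-heap frame.
	// (The real bridge would need write barriers where necessary.)
	copy(f.ints, ints)
	copy(f.floats, floats)
	copy(f.stack, stk)
}

// addTarget is a toy callee: it returns ints[0]+ints[1] in ints[0].
func addTarget(ints []uintptr, floats []uint64, stack []uintptr) {
	ints[0] = ints[0] + ints[1]
}

func main() {
	f := &callFrame{pc: addTarget, ints: []uintptr{2, 3}}
	callBridge(f)
	fmt.Println("result register 0:", f.ints[0])
}
```

+A `defer`-style bridge would skip the copy-back step entirely, since its results are ignored.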
+This proposal is Go 1-compatible.
+While Go assembly is not technically covered by Go 1 compatibility,
+this will maintain compatibility with the vast majority of assembly
+code using Go’s multiple ABI mechanism.
+This translates between Go’s existing stack-based calling convention
+used by all existing assembly code and Go’s internal calling
+There are a few known forms of unsafe code that this change will
+- Assembly code that invokes Go closures.
+ The closure calling convention was never publicly documented, but
+ there may be code that does this anyway.
+- Code that performs `unsafe.Pointer` arithmetic on pointers to
+ arguments in order to observe the contents of the stack.
+ This is a violation of the [`unsafe.Pointer`
+ rules](https://pkg.go.dev/unsafe#Pointer) today.
+We aim to implement a minimum viable register-based Go ABI for amd64
+in the 1.16 time frame.
+As of this writing (nearing the opening of the 1.16 tree), Dan Scales
+has made substantial progress on ABI bridges for a simple ABI change
+and David Chase has made substantial progress on late call lowering.
+Austin Clements will lead the work with David Chase and Than McIntosh
+focusing on the compiler side, Cherry Zhang focusing on aspects that
+bridge the compiler and runtime, and Michael Knyszek focusing on the