design/40724-register-calling: add design doc

Change-Id: Ib491db5e2523acf6f21b94924339a22d236717bc
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/248178
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
diff --git a/design/40724-register-calling.md b/design/40724-register-calling.md
new file mode 100644
index 0000000..3e7cabf
--- /dev/null
+++ b/design/40724-register-calling.md
@@ -0,0 +1,548 @@
+# Proposal: Register-based Go calling convention
+
+Author: Austin Clements, with input from Cherry Zhang, Michael
+Knyszek, Martin Möhrmann, Michael Pratt, David Chase, Keith Randall,
+Dan Scales, and Ian Lance Taylor.
+
+Last updated: 2020-08-10
+
+Discussion at https://golang.org/issue/40724.
+
+## Abstract
+
+We propose switching the Go ABI from its current stack-based calling
+convention to a register-based calling convention.
+[Preliminary experiments
+indicate](https://github.com/golang/go/issues/18597#issue-199914923)
+this will achieve at least a 5–10% throughput improvement across a
+range of applications.
+This will remain backwards compatible with existing assembly code that
+assumes Go’s current stack-based calling convention through Go’s
+[multiple ABI
+mechanism](https://golang.org/design/27539-internal-abi).
+
+## Background
+
+Since its initial release, Go has used a *stack-based calling
+convention* based on the Plan 9 ABI, in which arguments and result
+values are passed via memory on the stack.
+This has significant simplicity benefits: the rules of the calling
+convention are simple and build on existing struct layout rules; all
+platforms can use essentially the same conventions, leading to shared,
+portable compiler and runtime code; and call frames have an obvious
+first-class representation, which simplifies the implementation of the
+`go` and `defer` statements and reflection calls.
+Furthermore, the current Go ABI has no *callee-save registers*,
+meaning that no register contents live across a function call (any
+live state in a function must be flushed to the stack before a call).
+This simplifies stack tracing for garbage collection and stack growth
+and stack unwinding during panic recovery.
+
+Unfortunately, Go’s stack-based calling convention leaves a lot of
+performance on the table.
+While modern high-performance CPUs heavily optimize stack access,
+accessing arguments in registers is still roughly [40%
+faster](https://gist.github.com/aclements/ded22bb8451eead8249d22d3cd873566)
+than accessing arguments on the stack.
+Furthermore, a stack-based calling convention, especially one with no
+callee-save registers, induces additional memory traffic, which has
+secondary effects on overall performance.
+
+Most language implementations on most platforms use a register-based
+calling convention that passes function arguments and results via
+registers rather than memory and designates some registers as
+callee-save, allowing functions to keep state in registers across
+calls.
+
+## Proposal
+
+We propose switching the Go ABI to a register-based calling
+convention, starting with a minimum viable product (MVP) on amd64, and
+then expanding to other architectures and improving on the MVP.
+
+We further propose that this calling convention should be designed
+specifically for Go, rather than using platform ABIs.
+There are several reasons for this.
+
+It’s incredibly tempting to use the platform calling convention, as it
+seems that would allow for more efficient language interoperability.
+Unfortunately, there are two major reasons it would do little good,
+both related to the scalability of goroutines, a central feature of
+the Go language.
+One reason goroutines scale so well is that the Go runtime dynamically
+resizes their stacks, but this imposes requirements on the ABI that
+aren’t satisfied by non-Go functions, thus requiring the runtime to
+transition out of the dynamic stack regime on a foreign call.
+Another reason is that goroutines are scheduled by the Go runtime
+rather than the OS kernel, but this means that transitions to and from
+non-Go code must be communicated to the Go scheduler.
+These two things mean that sharing a calling convention wouldn’t
+significantly lower the cost of calling non-Go code.
+
+The other tempting reason to use the platform calling convention would
+be tooling interoperability, particularly with debuggers and profiling
+tools.
+However, these almost universally support DWARF or, for profilers,
+frame pointer unwinding.
+Go will continue to work with DWARF-based tools and we can make the Go
+ABI compatible with platform frame pointer unwinding without otherwise
+taking on the platform ABI.
+
+Hence, there’s little upside to using the platform ABI.
+And there are several reasons to favor using our own ABI:
+
+- Most existing ABIs were based on the C language, which differs in
+  important ways from Go.
+  For example, most ELF ABIs (at least x86-64, ARM64, and RISC-V)
+  would force Go slices to be passed on the stack rather than in
+  registers because the slice header is three words.
+  Similarly, because C functions rarely return more than one word,
+  most platform ABIs reserve at most two registers for results.
+  Since Go functions commonly return at least three words (a result
+  and a two word error interface value), the platform ABI would force
+  such functions to return values on the stack.
+  Other things that influence the platform ABI include that array
+  arguments in C are passed by reference rather than by value and
+  small integer types in C are promoted to `int` rather than retaining
+  their type.
+  Hence, platform ABIs simply aren’t a good fit for the Go language.
+
+- Platform ABIs typically define callee-save registers, which place
+  substantial additional requirements on a garbage collector.
+  There are alternatives to callee-save registers that share many of
+  their benefits, while being much better suited to Go.
+
+- While platform ABIs are generally similar at a high level, their
+  details differ in myriad ways.
+  By defining our own ABI, we can follow a common structure across all
+  platforms and maintain much of the cross-platform simplicity and
+  reliability of Go’s stack-based calling convention.
+
+The new calling convention will remain backwards-compatible with
+existing assembly code that’s based on the stack-based calling
+convention via Go’s [multiple ABI
+mechanism](https://golang.org/design/27539-internal-abi).
+
+This same multiple ABI mechanism allows us to continue to evolve the
+Go calling convention in future versions.
+This lets us start with a simple, minimal calling convention and
+continue to optimize it in the future.
+
+The rest of this proposal outlines the work necessary to switch Go to
+a register-based calling convention.
+While it lays out the requirements for the ABI, it does not describe a
+specific ABI.
+Defining a specific ABI will be one of the first implementation steps,
+and its definition should reside in a living document rather than a
+proposal.
+
+## Go’s current stack-based ABI
+
+We give an overview of Go’s current ABI to give a sense of the
+requirements of any Go ABI and because the register-based calling
+convention builds on the same concepts.
+
+In the stack-based Go ABI, when a function F calls a function or
+method G, F reserves space in its own stack frame for G’s receiver (if
+it’s a method), arguments, and results.
+These are laid out in memory as if G’s receiver, arguments, and
+results were simply fields in a struct.
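+
+As a rough illustration of this layout rule, the sketch below models
+the call state of a hypothetical function
+`func G(a int32, b int64) (ok bool)` as a struct and prints the field
+offsets.
+The `gFrame` type is an assumption made for illustration; the real ABI
+has further alignment details (for example, around where the results
+start) that this sketch does not model.
+
+```go
+package main
+
+import (
+	"fmt"
+	"unsafe"
+)
+
+// gFrame is a hypothetical model of the argument frame for
+//
+//	func G(a int32, b int64) (ok bool)
+//
+// with arguments and results laid out like struct fields.
+type gFrame struct {
+	a  int32 // first argument
+	b  int64 // second argument, padded up to its alignment
+	ok bool  // result
+}
+
+func main() {
+	var f gFrame
+	// Offsets are platform-dependent because int64 alignment differs
+	// between 32-bit and 64-bit targets.
+	fmt.Println("a at", unsafe.Offsetof(f.a))
+	fmt.Println("b at", unsafe.Offsetof(f.b))
+	fmt.Println("ok at", unsafe.Offsetof(f.ok))
+}
+```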
+
+There is one exception to all call state being passed on the stack: if
+G is a closure, F passes a pointer to its function object in a
+*context register*, via which G can quickly access any closed-over
+values.
+
+Other than a few fixed-function registers, all registers are
+caller-save, meaning F must spill any live state in registers to its
+stack frame before calling G and reload the registers after the call.
+
+The Go ABI also keeps a pointer to the runtime structure representing
+the current goroutine (“G”) available for quick access.
+On 386 and amd64, it is stored in thread-local storage; on all other
+platforms, it is stored in a dedicated register.<sup>1</sup>
+
+Every function must ensure sufficient stack space is available before
+reserving its stack frame.
+The current stack bound is stored in the runtime goroutine structure,
+which is why the ABI keeps this readily accessible.
+The standard prologue checks the stack pointer against this bound and
+calls into the runtime to grow the stack if necessary.
+In assembly code, this prologue is automatically generated by the
+assembler itself.
+Cooperative preemption is implemented by poisoning a goroutine’s stack
+bound, and thus also makes use of this standard prologue.
+
+Finally, both stack growth and the Go garbage collector must be able
+to find all live pointers.
+Logically, function entry and every call instruction each have an
+associated bitmap indicating which slots in the local frame and the
+function’s argument frame contain live pointers.
+Sometimes liveness information is path-sensitive, in which case a
+function will have additional [*stack
+object*](https://golang.org/cl/134155) metadata.
+In all cases, all pointers are in known locations on the stack.
+
+<sup>1</sup> This is largely a historical accident.
+The G pointer was originally stored in a register on 386/amd64.
+This is ideal, since it’s accessed in nearly every function prologue.
+It was moved to TLS in order to support cgo, since transitions from C
+back to Go (including the runtime signal handler) needed a way to
+access the current G.
+However, when we added ARM support, it turned out accessing TLS in
+every function prologue was far too expensive on ARM, so all later
+ports used a hybrid approach where the G is stored in both a register
+and TLS and transitions from C restore it from TLS.
+
+## ABI design recommendations
+
+Here we lay out various recommendations for the design of a
+register-based Go ABI.
+The rest of this document assumes we’ll be following these
+recommendations.
+
+1. Common structure across platforms.
+   This dramatically simplifies porting work in the compiler and
+   runtime.
+   We propose that each architecture should define a sequence of
+   integer and floating point registers (and in the future perhaps
+   vector registers), plus size and alignment constraints, and that
+   beyond this, the calling convention should be derived using a
+   shared set of rules as much as possible.
+
+1. Efficient access to the current goroutine pointer and the context
+   register for closure calls.
+   Ideally these will be in registers; however, we may use TLS on
+   architectures with extremely limited registers (namely, 386).
+
+1. Support for many-word return values.
+   Go functions frequently return three or more words, so this must be
+   supported efficiently.
+
+1. Support for scanning and adjusting pointers in register arguments
+   on stack growth.
+   Since the function prologue checks the stack bound before reserving
+   a stack frame, the runtime must be able to spill argument registers
+   and identify those containing pointers.
+
+1. First-class generic call frame representation.
+   The `go` and `defer` statements as well as reflection calls need to
+   manipulate call frames as first-class, in-memory objects.
+   Reflect calls in particular are simplified by a common, generic
+   representation with fairly generic bridge code (the compiler could
+   generate bridge code for `go` and `defer`).
+
+1. No callee-save registers.
+   Callee-save registers complicate stack unwinding (and garbage
+   collection if pointers are allowed in callee-save registers).
+   Inter-function clobber sets have many of the benefits of
+   callee-save registers, but are much simpler to implement in a
+   garbage collected language and are well-suited to Go’s compilation
+   model.
+   For an MVP, we’re unlikely to implement any form of live registers
+   across calls, but we’ll want to revisit this later.
+
+1. Where possible, be compatible with platform frame-pointer unwinding
+   rules.
+   This helps Go interoperate with system-level profilers, and can
+   potentially be used to optimize stack unwinding in Go itself.
+
+There are also some notable non-requirements:
+
+1. No compatibility with the platform ABI (other than frame pointers).
+   This has more downsides than upsides, as described above.
+
+1. No binary compatibility between Go versions.
+   This is important for shared libraries in C, but Go already
+   requires all shared libraries in a process to use the same Go
+   toolchain version.
+   This means we can continue to evolve and improve the ABI.
+
+## Toolchain changes overview
+
+This section outlines the changes that will be necessary to the Go
+build toolchain and runtime.
+The "Detailed design" section will go into greater depth on some of
+these.
+
+### Compiler
+
+*Abstract argument registers*: The compiler’s register allocator will
+need to allocate function arguments and results to the appropriate
+registers.
+However, it needs to represent argument and result registers in a
+platform-independent way prior to architecture lowering and register
+allocation.
+We propose introducing generic SSA values to represent the argument
+and result registers, as done in [David Chase’s
+prototype](https://golang.org/cl/28832).
+These would simply represent the *i*th argument/result register and
+register allocation would assign them to the appropriate architecture
+registers.
+Having a common ABI structure across platforms means the
+architecture-independent parts of the compiler would only need to know
+how many argument/result registers the target architecture has.
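+
+To make the idea of abstract argument registers concrete, here is a
+minimal sketch that assigns argument words to numbered abstract
+registers and falls back to stack slots once they run out.
+The `loc` type and the `assign` rule are hypothetical: this proposal
+deliberately does not define a specific ABI, and a real one would
+also need rules for multi-word values, floating point values, and
+alignment.
+
+```go
+package main
+
+import "fmt"
+
+// loc describes where one argument word lives. Both the type and the
+// assignment rule below are hypothetical.
+type loc struct {
+	inReg bool
+	index int // abstract register number, or stack slot index
+}
+
+// assign gives the first nRegs argument words abstract registers
+// 0..nRegs-1 and places the remaining words in consecutive stack
+// slots. Only nRegs depends on the target architecture.
+func assign(nWords, nRegs int) []loc {
+	locs := make([]loc, 0, nWords)
+	for i := 0; i < nWords; i++ {
+		if i < nRegs {
+			locs = append(locs, loc{inReg: true, index: i})
+		} else {
+			locs = append(locs, loc{inReg: false, index: i - nRegs})
+		}
+	}
+	return locs
+}
+
+func main() {
+	// Four argument words on a target with three argument registers.
+	for i, l := range assign(4, 3) {
+		fmt.Println(i, l.inReg, l.index)
+	}
+}
+```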
+
+*Late call lowering*: Call lowering and argument frame construction
+currently happen during AST to SSA lowering, which happens well before
+register allocation.
+Hence, we propose moving call lowering much later in the compilation
+process.
+Late call lowering will have knock-on effects, as the current approach
+hides a lot of the structure of calls from most optimization passes.
+
+*ABI bridges*: For compatibility with existing assembly code, the
+compiler must generate ABI bridges when calling between Go
+(ABIInternal) and assembly (ABI0) code, as described in the [internal
+ABI proposal](https://golang.org/design/27539-internal-abi).
+These are small functions that translate between ABIs according to a
+function’s type.
+While the compiler currently differentiates between the two ABIs
+internally, since they’re actually identical right now, it currently
+only generates *ABI aliases* and has no mechanism for generating ABI
+bridges.
+As a post-MVP optimization, the compiler should inline these ABI
+bridges where possible.
+
+*Argument GC map*: The garbage collector needs to know which arguments
+contain live pointers at function entry and at any calls (since these
+are preemption points).
+Currently this is represented as a bitmap over words in the function’s
+argument frame.
+With the register-based ABI, the compiler will need to emit a liveness
+map for argument registers for the function entry point.
+Since initially we won’t have any live registers across calls, live
+arguments will be spilled to the stack at a call, so the compiler does
+*not* need to emit register maps at calls.
+For functions that still require a stack argument frame (because their
+arguments don’t all fit in registers), the compiler will also need to
+emit argument frame liveness maps at the same points it does today.
+
+*Traceback argument maps*: Go tracebacks currently display a simple
+word-based hex dump of a function’s argument frame.
+This is not particularly user-friendly nor high-fidelity, but it can
+be incredibly valuable for debugging.
+With a register-based ABI, there’s a wide range of possible designs
+for retaining this functionality.
+For an MVP, we propose trying to maintain a similar level of fidelity.
+In the future, we may want more detailed maps, or may want to simply
+switch to using DWARF location descriptions.
+
+To that end, we propose that the compiler should emit two logical
+maps: a *location map* from (PC, argument word index) to
+register/`stack`/`dead` and a *home map* from argument word index to
+stack home (if any).
+Since a named variable’s stack spill home is fixed if it ever spills,
+the location map can use a single distinguished value for `stack` that
+tells the runtime to refer to the home map.
+This approach works well for an ABI that passes argument values in
+separate registers without packing small values.
+The `dead` value is not necessarily the same as the garbage
+collector’s notion of a dead slot: for the garbage collector, you want
+slots to become dead as soon as possible, while for debug printing,
+you want them to stay live as long as possible (until clobbered by
+something else).
+
+The exact encoding of these tables is to be determined.
+Most likely, we’ll want to introduce pseudo-ops for representing
+changes in the location map that the `cmd/internal/obj` package can
+then encode into `FUNCDATA`.
+The home map could be produced directly by the compiler as `FUNCDATA`.
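+
+The two logical maps could be modeled roughly as follows.
+This is only a sketch under assumed names (`argMaps`, `pcRange`,
+`lookup`) and an assumed in-memory form; as noted above, the actual
+encoding is still to be determined.
+
+```go
+package main
+
+import "fmt"
+
+// kind is a hypothetical location kind for one argument word.
+type kind int
+
+const (
+	dead  kind = iota // value is no longer recoverable
+	reg               // value is in register number n
+	stack             // value is in its spill home; see the home map
+)
+
+type location struct {
+	k kind
+	n int // register number when k == reg
+}
+
+// pcRange gives the location of every argument word for PCs in
+// [start, end).
+type pcRange struct {
+	start, end uintptr
+	words      []location
+}
+
+type argMaps struct {
+	locs []pcRange   // location map
+	home map[int]int // home map: argument word index -> frame offset
+}
+
+// lookup describes where argument word i lives at pc.
+func (m argMaps) lookup(pc uintptr, i int) string {
+	for _, r := range m.locs {
+		if pc < r.start || pc >= r.end || i >= len(r.words) {
+			continue
+		}
+		switch l := r.words[i]; l.k {
+		case reg:
+			return fmt.Sprintf("reg %d", l.n)
+		case stack:
+			if off, ok := m.home[i]; ok {
+				return fmt.Sprintf("stack+%d", off)
+			}
+		}
+		return "dead"
+	}
+	return "dead"
+}
+
+func main() {
+	m := argMaps{
+		locs: []pcRange{
+			{0x00, 0x10, []location{{reg, 0}, {reg, 1}}},
+			{0x10, 0x20, []location{{stack, 0}, {dead, 0}}},
+		},
+		home: map[int]int{0: 8},
+	}
+	fmt.Println(m.lookup(0x04, 1))
+	fmt.Println(m.lookup(0x18, 0))
+	fmt.Println(m.lookup(0x18, 1))
+}
+```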
+
+*DWARF locations*: The compiler will need to generate DWARF location
+lists for arguments and results.
+It already has this ability for local variables, and we should reuse
+that as much as possible.
+We will need to ensure Delve and GDB are compatible with this.
+Both already support location lists in general, so this is unlikely to
+require much (if any) work in these debuggers.
+
+Clobber sets will require further changes, which we discuss later.
+We propose not implementing clobber sets (or any form of callee-save)
+for the MVP.
+
+### Linker
+
+The linker requires relatively minor changes, all related to ABI
+bridges.
+
+*Eliminate ABI aliases*: Currently, the linker resolves ABI aliases
+generated by the compiler by treating all references to a symbol
+aliased under one ABI as references to the symbol under the other
+ABI.
+Once the compiler generates ABI bridges rather than aliases, we can
+remove this mechanism, which is likely to simplify and speed up the
+linker somewhat.
+
+*ABI name mangling*: Since Go ABIs work by having multiple symbol
+definitions under the same name, the linker will also need to
+implement a name mangling scheme for non-Go symbol tables.
+
+### Runtime
+
+*First-class call frame representation*: The `go` and `defer`
+statements and reflection calls must manipulate call frames as
+first-class objects.
+While the requirements of these three cases differ, we propose having
+a common first-class call frame representation that can capture a
+function’s register and stack arguments and record its register and
+stack results, along with a small set of generic call bridges that
+invoke a call using the generic call frame.
+
+*Stack growth*: Almost every Go function checks for sufficient stack
+space before opening its local stack frame.
+If there is insufficient space, it calls into the `runtime.morestack`
+function to grow the stack.
+Currently, `morestack` saves only the calling PC, the stack pointer,
+and the context register (if any) because these are the only registers
+that can be live at function entry.
+With register-based arguments, `morestack` will also have to save all
+argument registers.
+We propose that it simply spill all *possible* argument registers
+rather than trying to be specific to the function; `morestack` is
+relatively rare, so the cost of this is unlikely to be noticeable.
+It’s likely possible to spill all argument registers to the stack
+itself: every function that can grow the stack ensures that there’s
+room not only for its local frame, but also for a reasonably large
+“guard” space.
+`morestack` can spill into this guard space.
+The garbage collector can recognize `morestack`’s spill space and use
+the argument map of its caller as the stack map of `morestack`.
+
+*Runtime assembly*: While Go’s multiple ABI mechanism makes it
+generally possible to transparently call between Go and assembly code
+even if they’re using different ABIs, there are runtime assembly
+functions that have deep knowledge of the Go ABI and will have to be
+modified.
+This includes any function that takes a closure (`mcall`,
+`systemstack`), is called in a special context (`morestack`), or is
+involved in reflection-like calls (`reflectcall`, `debugCallV1`).
+
+*Cgo wrappers*: Generated cgo wrappers marked with
+`//go:cgo_unsafe_args` currently access their argument structure by
+casting a pointer to their first argument.
+This violates the `unsafe.Pointer` rules and will no longer work with
+this change.
+We can either special case `//go:cgo_unsafe_args` functions to use
+ABI0 or change the way these wrappers are generated.
+
+*Stack unwinding for panic recovery*: When a panic is recovered, the
+Go runtime must unwind the panicking stack and resume execution after
+the deferred call of the recovering function.
+For the MVP, we propose not retaining any live registers across calls,
+in which case stack unwinding will not have to change.
+This is not the case with callee-save registers or clobber sets.
+
+*Traceback argument printing*: As mentioned in the compiler section,
+the runtime currently prints a hex dump of function arguments in panic
+tracebacks.
+This will have to consume the new traceback argument metadata produced
+by the compiler.
+
+## Detailed design
+
+This section dives deeper into some of the toolchain changes described
+above.
+We’ll expand this section over time.
+
+### `go`, `defer` and reflection calls
+
+Above we proposed using a first-class call frame representation for
+`go` and `defer` statements and reflection calls with a small set of
+call bridges.
+These three cases have somewhat different requirements:
+
+- The types of `go` and `defer` calls are known statically, while
+  reflect calls are not.
+  This means the compiler could statically generate bridges to
+  unmarshal arguments for `go` and `defer` calls, but this isn’t an
+  option for reflection calls.
+
+- The return values of `go` and `defer` calls are always ignored,
+  while reflection calls must capture results.
+  This means a call bridge for a `go` or `defer` call can be a tail
+  call, while reflection calls can require marshalling return values.
+
+- Call frames for `go` and `defer` calls are long-lived, while
+  reflection call frames are transient.
+  This means the garbage collector must be able to scan `go` and
+  `defer` call frames, while we could use non-preemptible regions for
+  reflection calls.
+
+- Finally, `go` call frames are stored directly on the stack, while
+  `defer` and reflection call frames may be constructed in the heap.
+  This means the garbage collector must be able to construct the
+  appropriate stack map for `go` call frames, but `defer` and
+  reflection call frames can use the heap bitmap.
+  It also means `defer` and reflection calls that require stack
+  arguments must copy that part of the call frame from the heap to the
+  stack, though we don’t expect this to be the common case.
+
+To satisfy these requirements, we propose the following generic
+call-frame representation:
+
+```
+struct {
+    pc           uintptr          // PC of target function
+    nInt, nFloat uintptr          // # of int and float registers
+    ints         [nInt]uintptr    // Int registers
+    floats       [nFloat]uint64   // Float registers
+    ctxt         uintptr          // Context register
+    stack        [...]uintptr     // Stack arguments/result space
+}
+```
+
+`go` calls can build this structure on the new goroutine stack and the
+call bridge can pop the register part of this structure from the
+stack, leaving just the `stack` part on the stack, and tail-call `pc`.
+The garbage collector can recognize this call bridge and construct the
+stack map by inspecting the `pc` in the call frame.
+
+`defer` and reflection calls can build frames in the heap with the
+appropriate heap bitmap.
+The call bridge in these cases must open a new stack frame, copy
+`stack` to the stack, load the register arguments, call `pc`, and then
+copy the register results and the stack results back to the in-heap
+frame (using write barriers where necessary).
+It may be valuable to have optimized versions of this bridge for
+tail-calls (always the case for `defer`) and register-only calls
+(likely a common case).
+In the register-only reflection call case, the bridge could take the
+register arguments as arguments itself and return register results as
+results; this would avoid any copying or write barriers.
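+
+The call-frame layout above is not valid Go, since its array lengths
+depend on other fields.
+One possible way to model it in Go is sketched below, using slices
+and a `marshal` helper that flattens the frame in the field order
+given above.
+The `callFrame` type, its methods, and the use of 64-bit words
+throughout are assumptions for illustration, not part of the proposed
+runtime.
+
+```go
+package main
+
+import "fmt"
+
+// callFrame is a hypothetical Go rendering of the generic call
+// frame, with slices standing in for variable-length arrays.
+type callFrame struct {
+	pc     uintptr   // PC of target function
+	ints   []uintptr // int registers
+	floats []uint64  // float registers
+	ctxt   uintptr   // context register
+	stack  []uintptr // stack arguments/result space
+}
+
+// registerOnly reports whether the call needs no stack space, the
+// case in which a register-only bridge could apply.
+func (f *callFrame) registerOnly() bool { return len(f.stack) == 0 }
+
+// marshal flattens the frame into 64-bit words in the order
+// pc, nInt, nFloat, ints, floats, ctxt, stack.
+func (f *callFrame) marshal() []uint64 {
+	w := []uint64{uint64(f.pc), uint64(len(f.ints)), uint64(len(f.floats))}
+	for _, v := range f.ints {
+		w = append(w, uint64(v))
+	}
+	w = append(w, f.floats...)
+	w = append(w, uint64(f.ctxt))
+	for _, v := range f.stack {
+		w = append(w, uint64(v))
+	}
+	return w
+}
+
+func main() {
+	f := &callFrame{pc: 0x1000, ints: []uintptr{1, 2}, stack: []uintptr{7}}
+	fmt.Println(f.registerOnly(), f.marshal())
+}
+```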
+
+## Compatibility
+
+This proposal is Go 1-compatible.
+
+While Go assembly is not technically covered by Go 1 compatibility,
+this will maintain compatibility with the vast majority of assembly
+code using Go’s [multiple ABI
+mechanism](https://golang.org/design/27539-internal-abi).
+This translates between Go’s existing stack-based calling convention
+used by all existing assembly code and Go’s internal calling
+convention.
+
+There are a few known forms of unsafe code that this change will
+break:
+
+- Assembly code that invokes Go closures.
+  The closure calling convention was never publicly documented, but
+  there may be code that does this anyway.
+
+- Code that performs `unsafe.Pointer` arithmetic on pointers to
+  arguments in order to observe the contents of the stack.
+  This is a violation of the [`unsafe.Pointer`
+  rules](https://pkg.go.dev/unsafe#Pointer) today.
+
+## Implementation
+
+We aim to implement a minimum viable register-based Go ABI for amd64
+in the 1.16 time frame.
+As of this writing (nearing the opening of the 1.16 tree), Dan Scales
+has made substantial progress on ABI bridges for a simple ABI change
+and David Chase has made substantial progress on late call lowering.
+Austin Clements will lead the work with David Chase and Than McIntosh
+focusing on the compiler side, Cherry Zhang focusing on aspects that
+bridge the compiler and runtime, and Michael Knyszek focusing on the
+runtime.