design: add 34481-opencoded-defers.md

This changes adds a design document for reducing the cost of defers significantly
by essentially inlining defer calls on normal exits in most cases.

Updates golang/go#34481

Change-Id: If64787e591864ab7843503b8f09c4d6dd6a7a535
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/196964
Reviewed-by: Keith Randall <khr@golang.org>
diff --git a/design/34481-opencoded-defers.md b/design/34481-opencoded-defers.md
new file mode 100644
index 0000000..adf0954
--- /dev/null
+++ b/design/34481-opencoded-defers.md
@@ -0,0 +1,328 @@
+# Proposal: Low-cost defers through inline code, and extra funcdata to manage the panic case
+
+Author(s): Dan Scales, Keith Randall, and Austin Clements
+(with input from many others, including Russ Cox and Cherry Zhang)
+
+Last updated: 2019-09-23
+
+Discussion at https://golang.org/issue/34481
+
+General defer performance discussion at https://golang.org/issue/14939.
+
+## Abstract
+
+As of Go 1.13, most `defer` operations take about 35ns (reduced from about 50ns
+in Go 1.12).
+In contrast, a direct call takes about 6ns.
+This gap incentivizes engineers to eliminate `defer` operations from hot code
+paths, which takes away time from more productive tasks, leads to less
+maintainable code (e.g., if a `panic` is later introduced, the "optimization" is
+no longer correct), and discourages people from using a language feature when it
+would otherwise be an appropriate solution to a problem.
+
+We propose a way to make most `defer`s no more expensive than open-coding the
+call, hence eliminating the incentives to shy away from using this language
+feature to its fullest extent.
+
+
+## Background
+
+Go 1.13 implements the `defer` statement by calling into the runtime to push a
+"defer object" onto the defer chain.
+Then, in any function that contains `defer` statements, the compiler inserts a
+call to `runtime.deferreturn` at every function exit point to unwind that
+function's defers.
+Both of these cause overhead: the defer object must be populated with function
+call information (function pointer, arguments, closure information, etc.) when
+it is added to the chain, and `deferreturn` must find the right defers to
+unwind, copy out the call information, and invoke the deferred calls.
+Furthermore, this inhibits compiler optimizations like inlining, since the defer
+functions are called from the runtime using the defer information.
+
+When a function panics, the runtime runs the deferred calls on the defer chain
+until one of these calls `recover` or it exhausts the chain, resulting in a
+fatal panic.
+The stack itself is *not* unwound unless a deferred call recovers.
+This has the important property that examining the stack from a deferred call
+run during a panic will include the panicking frame, even if the defer was
+pushed by an ancestor of the panicking frame.
+
+In general, this defer chain is necessary since a function can defer an
+unbounded or dynamic number of calls that must all run when it returns.
+For example, a `defer` statement can appear in a loop or an `if` block.
+This also means that, in general, the defer objects must be heap-allocated,
+though the runtime uses an allocation pool to reduce the cost of the allocation.
+
+This is notably different from exception handling in C++ or Java, where the
+applicable set of `except` or `finally` blocks can be determined statically at
+every program counter in a function.
+In these languages, the non-exception case is typically inlined and exception
+handling is driven by a side table giving the locations of the `except` and
+`finally` blocks that apply to each PC.
+
+However, while Go's `defer` mechanism permits unbounded calls, the vast majority
+of functions that use `defer` invoke each `defer` statement at most once, and do
+not invoke `defer` in a loop.
+Go 1.13 adds an [optimization](https://golang.org/cl/171758) to stack-allocate
+defer objects in this case, but they must still be pushed and popped from the
+defer chain.
+This applies to 363 out of the 370 static defer sites in the `cmd/go` binary and
+speeds up this case by 30% relative to heap-allocated defer objects.
+
+This proposal combines this insight with the insights used in C++ and Java to
+make the non-panic case of most `defer` operations no more expensive than the
+manually open-coded case, while retaining correct `panic` handling.
+
+
+## Requirements
+
+While the common case of defer handling is simple enough, it can interact in
+often non-obvious ways with things like recursive panics, recover, and stack
+traces.
+Here we attempt to enumerate the requirements that any new defer implementation
+should likely satisfy, in addition to those in the language specification for
+[Defer statements](https://golang.org/ref/spec#Defer_statements) and [Handling
+panics](https://golang.org/ref/spec#Handling_panics).
+
+1. Executing a `defer` statement logically pushes a deferred function call onto
+   a per-goroutine stack.
+   Deferred calls are always executed starting from the top of this stack (hence
+   in the reverse order of the execution of `defer` statements).
+   Furthermore, each execution of a `defer` statement corresponds to exactly one
+   deferred call (except in the case of program termination, where a deferred
+   function may not be called at all).
+
+2. Defer calls are executed in one of two ways.
+   Whenever a function call returns normally, the runtime starts popping and
+   executing all existing defer calls for that stack frame only (in reverse order
+   of original execution).
+   Separately, whenever a panic (or a call to Goexit) occurs, the runtime starts
+   popping and executing all existing defer calls for the entire defer stack.
+   The execution of any defer call may be interrupted by a panic within the
+   execution of the defer call.
+
+3. A program may have multiple outstanding panics, since a recursive (second)
+   panic may occur during any of the defer calls being executed during the
+   processing of the first panic.
+   A previous panic is “aborted” if the processing of defers by the new panic
+   reaches the frame where the previous panic was processing defers when the new
+   panic happened.
+   When a defer call returns that did a successful `recover` that applies to a
+   panic, the stack is immediately unwound to the frame which contains the defer
+   that did the recover call, and any remaining defers in that frame are
+   executed.
+   Normal execution continues in the preceding frame (but note that normal
+   execution may actually be continuing a defer call for an outer panic).
+   Any panic that has not been recovered or aborted must still appear on the
+   caller stack.
+   Note that the first panic may never continue its defer processing, if the
+   second panic actually successfully runs all defer calls, but the original
+   panic must appear on the stack during all the processing by the second panic.
+
+4. When a defer call is executed because a function is returning normally
+   (whether there are any outstanding panics or not), the call site of a
+   deferred call must appear to be the function that invoked `defer` to push
+   that function on the defer stack, at the line where that function is
+   returning.
+   A consequence of this is that, if the runtime is executing deferred calls in
+   panic mode and a deferred call recovers, it must unwind the stack immediately
+   after that deferred call returns and before executing another deferred call.
+
+5. When a defer call is executed because of an explicit panic, the call stack of
+   a deferred function must include `runtime.gopanic` and the frame that
+   panicked (and its callers) immediately below the deferred function call.
+   As mentioned, the call stack must also include any outstanding previous
+   panics.
+   If a defer call is executed because of a run-time panic, the same condition
+   applies, except that ‘runtime.gopanic’ does not necessarily need to be on the
+   stack.
+   (In the current gc-go implementation, runtime.gopanic does appear on
+   the stack even for run-time panics.)
+
+## Proposal
+
+We propose optimizing deferred calls in functions where every `defer` is
+executed at most once (specifically, a `defer` may be on a conditional path, but
+is never in a loop in the control-flow graph).
+In this optimization, the compiler assigns a bit for every `defer` site to
+indicate whether that defer had been reached or not.
+The `defer` statement itself simply sets the corresponding bit and stores all
+necessary arguments in specific stack slots.
+Then, at every exit point of the function, the compiler open-codes each deferred
+call, protected by (and clearing) each corresponding bit.
+
+For example, the following:
+
+```go
+defer f1(a)
+if cond {
+ defer f2(b)
+}
+body...
+```
+
+would compile to
+
+```go
+deferBits |= 1<<0
+tmpF1 = f1
+tmpA = a
+if cond {
+ deferBits |= 1<<1
+ tmpF2 = f2
+tmpB = b
+}
+body...
+exit:
+if deferBits & 1<<1 != 0 {
+ deferBits &^= 1<<1
+ tmpF2(tmpB)
+}
+if deferBits & 1<<0 != 0 {
+ deferBits &^= 1<<0
+ tmpF1(tmpA)
+}
+```
+
+In order to ensure that the value of `deferBits` and all the tmp variables are
+available in case of a panic, these variables must be allocated explicit stack
+slots and the stores to deferBits and the tmp variables (`tmpF1`, `tmpA`, etc.)
+must write the values into these stack slots.
+In addition, the updates to `deferBits` in the defer exit code must explicitly
+store the `deferBits` value to the corresponding stack slot.
+This will ensure that panic processing can determine exactly which defers have
+been executed so far.
+
+However, the defer exit code can still be optimized significantly in many cases.
+We can refer directly to the `deferBits` and tmpA ‘values’ (in the SSA sense),
+and these accesses can therefore be optimized in terms of using existing values
+in registers, propagating constants, etc.
+Also, if the defers were called unconditionally, then constant propagation may
+in some cases to eliminate the checks on `deferBits` (because the value of
+`deferBits` is known statically at the exit point).
+
+If there are multiple exits (returns) from the function, we can either duplicate
+the defer exit code at each exit, or we can have one copy of the defer exit code
+that is shared among all (or most) of the exits.
+Note that any sharing of defer-exit code code may lead to less specific line
+numbers (which don’t indicate the exact exit location) if the user happens to
+look at the call stack while in a call made by the defer exit code.
+
+## Panic processing
+
+Because no actual defer records have been created, panic processing is quite
+different and somewhat more complex in this approach.
+When generating the code for a function, the compiler also emits an extra set of
+`FUNCDATA` information that records information about each of the open-coded
+defers.
+For each open-coded defer, the compiler emits `FUNCDATA` that specifies the
+exact stack locations that store the function pointer and each of the arguments.
+It also emits the location of the stack slot containing `deferBits`.
+Since stack frames can get arbitrarily large, the compiler uses a varint
+encoding for the stack slot offsets.
+
+In addition, for all functions with open-coded defers, the compiler adds a small
+segment of code that does a call to `runtime.deferreturn` and then returns.
+This code segment is not reachable by the main code of the function, but is used
+to unwind the stack properly when a panic is successfully recovered.
+
+To handle a `panic`, the runtime conceptually walks the defer chain in parallel
+with the stack in order to interleave execution of pushed defers with defers in
+open-coded frames.
+When the runtime encounters an open-coded frame `F` executing function ‘f’, it
+executes the following steps.
+
+1. The runtime reads the funcdata for function `f` that contains the open-defer
+   information.
+
+2. Using the information about the location in frame `F` of the stack slot for
+   `deferBits`, the runtime loads the current value of `deferBits` for this
+   frame.
+   The runtime processes each of the active defers, as specified by the value of
+   `deferBits`, in reverse order.
+
+3. For each active defer, the runtime loads the function pointer for the defer
+   call from the appropriate stack slot.
+   It also builds up an argument frame by copying each of the defer arguments
+   from its specified stack slot to the appropriate location in the argument
+   frame.
+   It then updates `deferBits` in its stack slot after masking off the bit for
+   the current defer.
+   Then it uses the function pointer and argument frame to call the deferred
+   function.
+
+4. If the defer call returns normally without doing a recovery, then the runtime
+   continues executing active defer calls for frame F until all active defer
+   calls have finished.
+
+5. If any defer call returns normally but has done a successful recover, then
+   the runtime stops processing defers in the current frame.
+   There may or may not be any remaining defers to process.
+   The runtime then arranges to jump to the `deferreturn` code segment and
+   unwind the stack to frame `F`, by simultaneously setting the PC to the
+   address of the `deferreturn` segment and setting the SP to the appropriate
+   value for frame `F`.
+   The `deferreturn` code segment then calls back into the runtime.
+   The runtime can now process any remaining active defers from frame `F`.
+   But for these defers, the stack has been appropriately unwound and the defer
+   appears to be called directly from function `f`.
+   When all defers for the frame have finished, the deferreturn finishes and the
+   code segment returns from frame F to continue execution.
+
+If a deferred call in step 3 itself panics, the runtime starts its normal panic
+processing again.
+For any frame with open-coded defers that has already run some defers, the
+deferBits value at the specified stack slot will always accurately reflect the
+remaining defers that need to be run.
+
+## Rationale
+
+One other approach that we extensively considered (and prototyped) also has
+inlined defer code for the normal case, but actual executes the defer exit code
+directly even in the panic case.
+Executing the defer exit code in the panic case requires duplication of stack
+frame F and some complex runtime code to start execution of the defer exit code
+using this new duplicated frame and to regain control when the defer exit code
+complete.
+The required runtime code for this approach is much more architecture-dependent
+and seems to be much more complex (and possibly fragile).
+
+
+## Compatibility
+
+This proposal does not change any user-facing APIs, and hence satisfies the [compatibility
+guidelines](https://golang.org/doc/go1compat).
+
+## Implementation
+
+An implementation has been mostly done.
+The change is [here](https://go-review.googlesource.com/c/go/+/190098/6)
+Comments on the design or implementation are very welcome.
+
+Some miscellaneous implementation details:
+
+1. We need to restrict the number of defers in a function to the size of the
+   deferBits bitmask.
+   To minimize code size, we currently make deferBits to be 8 bits, and don’t do
+   open-coded defers if there are more than 8 defers in a function.
+
+2. The deferBits variable and defer arguments variables (such as ‘tmpA’) must be
+   declared (via OpVarDef) in the entry block, since the unconditional defer
+   exit code at the bottom of the function will access them, so these variables
+   are live throughout the entire function.
+   (And, of course, they can be accessed by panic processing at any point within
+   the function that might cause a panic.)
+   For any defer argument stack slots that are pointers (or contain pointers),
+   we must initialize those stack slots to zero in the entry block.
+   The initialization is required for garbage collection, which doesn’t know
+   which of these defer arguments are active (i.e. which of the defer sites have
+   been reached, but the corresponding defer call has not yet happened)
+
+2. Because the `deferreturn` code segment is disconnected from the rest of the
+   function, it would not normally indicate that any stack slots are live.
+   However, we want the liveness information at the `deferreturn` call to
+   indicate that all of the stack slots associated with defers (which may
+   include pointers to variables accessed by closures) and all of the return
+   values are live.
+   We must explicitly set the liveness for the `deferreturn` call to be the same
+   as the liveness at the first defer call on the defer exit path.