design: add mid-stack inlining design doc

For golang/go#19348.

Change-Id: Ibf3e3817b35226a33d961e76fedb924e15e37069
Reviewed-on: https://go-review.googlesource.com/38090
Run-TryBot: David Lazar <lazard@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
diff --git a/design/19348-midstack-inlining.md b/design/19348-midstack-inlining.md
new file mode 100644
index 0000000..b983998
--- /dev/null
+++ b/design/19348-midstack-inlining.md
@@ -0,0 +1,502 @@
+# Proposal: Mid-stack inlining in the Go compiler
+
+Author(s): David Lazar, Austin Clements
+
+Last updated: 2017-03-10
+
+Discussion at: https://golang.org/issue/19348
+
+See also: https://golang.org/s/go19inliningtalk
+
+# Abstract
+
+As of Go 1.8, the compiler does not inline mid-stack functions (functions
+that call other non-inlineable functions) by default.
+This is because the runtime does not have sufficient information to generate
+accurate tracebacks for inlined code.
+
+We propose fixing this limitation of tracebacks and enabling mid-stack
+inlining by default.
+To do this, we will add a new PC-value table to functions with inlined
+calls that the runtime can use to generate accurate tracebacks, generate
+DWARF inlining information for debuggers, modify runtime.Callers and
+related functions to operate in terms of “logical” stack frames, and
+modify tools that work with stack traces such as pprof and trace.
+
+Preliminary results show that mid-stack inlining can improve performance by 9%
+(Go1 benchmarks on both amd64 and ppc64) with a 15% increase in binary size.
+Follow-on work will focus on improving the inlining heuristics to hopefully
+achieve this performance with less increase in binary size.
+
+
+# Background
+
+Inlining is a fundamental compiler optimization that replaces a function
+call with the body of the called function.
+This eliminates call overhead but more importantly enables other compiler
+optimizations, such as constant folding, common subexpression elimination,
+loop-invariant code motion, and better register allocation.
+As of Go 1.8, inlining happens at the AST level.
+To illustrate how the Go 1.8 compiler does inlining, consider the code in
+the left column of the table below.
+Using heuristics, the compiler decides that the call to PutUint32 in app.go
+can be inlined. It replaces the call with a copy of the AST nodes that make
+up the body of PutUint32, creating new AST nodes for the arguments.
+The resulting code is shown in the right column.
+
+<table>
+<tr><th>Before inlining</th><th>After inlining</th></tr>
+<tr><td>
+<pre>
+binary.go:20 func PutUint32(b []byte, v uint32) {
+binary.go:21     b[0] = byte(v >> 24)
+binary.go:22     b[1] = byte(v >> 16)
+binary.go:23     b[2] = byte(v >> 8)
+binary.go:24     b[3] = byte(v)
+binary.go:25 }
+
+app.go:5 func main() {
+app.go:6     // ...
+app.go:7     PutUint32(data, input)
+app.go:8     // ...
+app.go:9 }
+</pre>
+</td><td>
+<pre>
+app.go:5 func main() {
+app.go:6     // ...
+app.go:7     var b []byte, v uint32
+app.go:7     b, v = data, input
+app.go:7     b[0] = byte(v >> 24)
+app.go:7     b[1] = byte(v >> 16)
+app.go:7     b[2] = byte(v >> 8)
+app.go:7     b[3] = byte(v)
+app.go:8     // ...
+app.go:9 }
+</pre>
+</td></tr>
+</table>
+
+Notice that the compiler replaces the source positions of the inlined AST
+nodes with the source position of the call.
+If the inlined code panics (due to an index out of range error), the
+resulting stack trace is missing a stack frame for PutUint32 and the user
+doesn't get an accurate line number for what caused the panic:
+<pre>
+panic: runtime error: index out of range
+
+main.main()
+	/home/gopher/app.go:7 +0x114
+</pre>
+Thus, even without aggressive inlining, the user might see inaccurate
+tracebacks due to inlining.
+
+To mitigate this problem somewhat, the Go 1.8 compiler does not inline
+functions that contain calls.
+This reduces the likelihood that the user will see an inaccurate traceback,
+but it has a negative impact on performance.
+Suppose in the example below that `intrinsicLog` is a large function that
+won’t be inlined.
+By default, the compiler will not inline the calls to `Log` or `LogBase`
+since these functions make calls to non-inlineable functions.
+However, we can force the compiler to inline these call to using the
+compiler flag `-l=4`.
+
+<table>
+<tr><th>Before inlining</th><th>After inlining (-l=4)</th></tr>
+<tr><td>
+<pre>
+math.go:41 func Log(x float64) float64 {
+math.go:42     if x <= 0 {
+math.go:43         panic("log x <= 0")
+math.go:44     }
+math.go:45     return intrinsicLog(x)
+math.go:46 }
+
+math.go:93 func LogBase(x float64, base float64) float64 {
+math.go:94     n := Log(x)
+math.go:95     d := Log(base)
+math.go:96     return n / d
+math.go:97 }
+
+app.go:5 func main() {
+app.go:6     // ...
+app.go:7     val := LogBase(input1, input2)
+app.go:8     // ...
+app.go:9 }
+</pre>
+</td><td>
+<pre>
+app.go:5 func main() {
+app.go:6     // ...
+app.go:7     x, base := input1, input2
+app.go:7     x1 := x
+app.go:7     if x1 <= 0 {
+app.go:7         panic("log x <= 0")
+app.go:7     }
+app.go:7     r1 := intrinsicLog(x1)
+app.go:7     x2 := base
+app.go:7     if x2 <= 0 {
+app.go:7         panic("log x <= 0")
+app.go:7     }
+app.go:7     r2 := intrinsicLog(x2)
+app.go:7     n := r1
+app.go:7     d := r2
+app.go:7     r3 := n / d
+app.go:7     val := r3
+app.go:8     // ...
+app.go:9 }
+</pre>
+</td></tr>
+</table>
+
+Below we have the corresponding stack traces for these two versions of code,
+caused by a call to `Log(0)`.
+With mid-stack inlining, there is no stack frame or line number information
+available for `LogBase`, so the user is unable to determine which input was 0.
+<table>
+<tr><th>Stack trace before inlining</th><th>Stack trace after inlining (-l=4)</th></tr>
+<tr><td>
+<pre>
+panic(0x497140, 0xc42000e340)
+	/usr/lib/go/src/runtime/panic.go:500 +0x1a1
+main.Log(0x0, 0x400de6bf542e3d2d)
+	/home/gopher/math.go:43 +0xa0
+main.LogBase(0x4045000000000000, 0x0, 0x0)
+	/home/gopher/math.go:95 +0x49
+main.main()
+	/home/gopher/app.go:7 +0x4c
+</pre>
+</td><td>
+<pre>
+panic(0x497140, 0xc42000e340)
+	/usr/lib/go/src/runtime/panic.go:500 +0x1a1
+main.main()
+	/home/gopher/app.go:7 +0x161
+</pre>
+</td></tr>
+</table>
+
+The goal of this proposed change is to produce complete tracebacks in the
+presence of inlining and to enable the compiler to inline non-leaf functions
+like `Log` and `LogBase` without sacrificing debuggability.
+
+
+# Proposal
+
+## Changes to the compiler
+
+Our approach is to modify the compiler to retain the original source
+position information of inlined AST nodes and to store information about
+the call site in a separate data structure.
+Here is what the inlined example from above would look like instead:
+
+<pre>
+app.go:5   func main() {
+app.go:6       // ...
+app.go:7       x, base := input1, input2 ┓ LogBase
+math.go:94     x1 := x                   ┃ app.go:7 ┓ Log
+math.go:42     if x1 <= 0 {              ┃          ┃ math.go:94
+math.go:43         panic("log x <= 0")   ┃          ┃
+math.go:44     }                         ┃          ┃
+math.go:45     r1 := intrinsicLog(x1)    ┃          ┛
+math.go:95     x2 := base                ┃          ┓ Log
+math.go:42     if x2 <= 0 {              ┃          ┃ math.go:95
+math.go:43         panic("log x <= 0")   ┃          ┃
+math.go:44     }                         ┃          ┃
+math.go:45     r2 := intrinsicLog(x2)    ┃          ┛
+math.go:94     n := r1                   ┃
+math.go:95     d := r2                   ┃
+math.go:96     r3 := n / d               ┛
+app.go:7       val := r3
+app.go:8       // ...
+app.go:9   }
+</pre>
+
+Information about inlined calls is stored in a compiler-global data
+structure called the *global inlining tree*.
+Every time a call is inlined, the compiler adds a new node to the global
+inlining tree that contains information about the call site (line number,
+file name, and function name).
+If the parent function of the inlined call is also inlined, the node for
+the inner inlined call points to the node for the parent's inlined call.
+For example, here is the inlining tree for the code above:
+
+<pre>
+┌──────────┐
+│ LogBase  │
+│ app.go:7 │
+└──────────┘
+   ↑    ↑   ┌────────────┐
+   │    └───┤ Log        │
+   │        │ math.go:94 │
+   │        └────────────┘
+   │        ┌────────────┐
+   └────────┤ Log        │
+            │ math.go:95 │
+            └────────────┘
+</pre>
+
+The inlining tree is encoded as a table with one row per node in the tree.
+The parent column is the row index of the node's parent in the table, or -1
+if the node has no parent:
+
+| Parent | File           | Line | Function Name     |
+| ------ | -------------- | ---- | ----------------- |
+| -1     | app.go         | 7    | LogBase           |
+| 0      | math.go        | 94   | Log               |
+| 0      | math.go        | 95   | Log               |
+
+Every AST node is associated to a row index in the global inlining
+tree/table (or -1 if the node is not the result of inlining).
+We maintain this association by extending the `src.PosBase` type with a new
+field called the *inlining index*.
+Here is what our AST looks like now:
+
+<pre>
+app.go:5   func main() {
+app.go:6       // ...
+app.go:7       x, base := input1, input2 ┃ 0
+math.go:94     x1 := x                   ┓
+math.go:42     if x1 <= 0 {              ┃
+math.go:43         panic("log x <= 0")   ┃ 1
+math.go:44     }                         ┃
+math.go:45     r1 := intrinsicLog(x1)    ┛
+math.go:95     x2 := base                ┓
+math.go:42     if x2 <= 0 {              ┃
+math.go:43         panic("log x <= 0")   ┃ 2
+math.go:44     }                         ┃
+math.go:45     r2 := intrinsicLog(x2)    ┛
+math.go:94     n := r1                   ┓
+math.go:95     d := r2                   ┃ 0
+math.go:96     r3 := n / d               ┛
+app.go:7       val := r3
+app.go:8       // ...
+app.go:9   }
+</pre>
+
+As the AST nodes are lowered, their `src.PosBase` values are copied to
+the resulting `Prog` pseudo-instructions.
+The object writer reads the global inlining tree and the inlining index of
+each `Prog` and writes this information compactly to object files.
+
+## Changes to the object writer
+
+The object writer creates two new tables per function.
+The first table is the *local inlining tree* which contains all the
+branches from the global inlining tree that are referenced by the Progs
+in that function.
+The second table is a PC-value table called the *pcinline table* that maps
+each PC to a row index in the local inlining tree, or -1 if the PC does not
+correspond to a function that has been inlined.
+
+The local inlining tree and pcinline table are written to object files as
+part of each function's pcln table.
+The file names and function names in the local inlining tree are represented
+using symbol references which are resolved to name offsets by the linker.
+
+## Changes to the linker
+
+The linker reads the new tables produced by the object writer and writes
+the tables to the final binary.
+We reserve `pcdata[1]` for the pcinline table and `funcdata[2]` for the
+local inlining tree.
+The linker writes the pcinline table to `pcdata[1]` unmodified.
+
+The local inlining tree is encoded using 16 bytes per row (4 bytes per column).
+The parent and line numbers are encoded directly as int32 values.
+The file name and function names are encoded as int32 offsets into existing
+global string tables.
+This table must be written by the linker rather than the compiler because the
+linker deduplicates these names and resolves them to global name offsets.
+
+If necessary, we can encode the inlining tree more compactly using a varint
+for each column value.
+In the compact encoding, the parent column and the values in the pcinline
+table would be byte offsets into the local inlining tree instead of row
+indices.
+In this case, the linker would have to regenerate the pcinline table.
+
+## Changes to the runtime
+
+The `runtime.gentraceback` function generates tracebacks and is modified
+to produce logical stack frames for inlined functions.
+The `gentraceback` function has two modes that are affected by inlining:
+printing mode, used to print a stack trace when the runtime panics, and
+pcbuf mode, which returns a buffer of PC values used by `runtime.Callers`.
+In both modes, `gentraceback` checks if the current PC is mapped to a node
+in the function's inlining tree by decoding the pcinline table for the
+current function until it finds the value at the current PC.
+If the value is -1, this instruction is not a result of inlining, so the
+traceback proceeds normally.
+Otherwise, `gentraceback` decodes the inlining tree and follows the path
+up the tree to create the traceback.
+
+Suppose that `pcPos` is the position information for the current PC
+(obtained from the pcline and pcfile tables), `pcFunc` is the function
+name for the current PC, and `st[0] -> st[1] -> ... -> st[k]` is the
+path up the inlining tree for the current PC.
+To print an accurate stack trace, `gentraceback` prints function names
+and their corresponding position information in this order:
+
+| Function name | Source position |
+| ------------- | --------------- |
+| st[0].Func    | pcPos           |
+| st[1].Func    | st[0].Pos       |
+|     ...       |       ...       |
+| st[k].Func    | st[k-1].Pos     |
+| pcFunc        | st[k].Pos       |
+
+This process repeats for every PC in the traceback.
+Note that the runtime only has sufficient information to print function
+arguments and PC offsets for the last entry in this table.
+Here is the resulting stack trace from the example above with our changes:
+
+<pre>
+main.Log(...)
+	/home/gopher/math.go:43
+main.LogBase(...)
+	/home/gopher/math.go:95
+main.main()
+	/home/gopher/app.go:7 +0x1c8
+</pre>
+
+## Changes to the runtime public API
+
+With inlining, a PC may represent multiple logical calls, so we need to
+clarify the meaning of some runtime APIs related to tracebacks.
+For example, the `skip` argument passed to `runtime.Caller` and
+`runtime.Callers` will be interpreted as the number of logical calls to skip
+(rather than the number of physical stack frames to skip).
+
+Unfortunately, the runtime.Callers API requires some modification to be
+compatible with mid-stack inlining.
+The result value of runtime.Callers is a slice of program counters
+([]uintptr) representing physical stack frames.
+If the `skip` parameter to runtime.Callers skips part-way into a physical
+frame, there is no convenient way to encode that in the resulting slice.
+To avoid changing the API in an incompatible way, our solution is to store
+the number of skipped logical calls of the first frame in the _second_
+uintptr returned by runtime.Callers.
+Since this number is a small integer, we encode it as a valid PC value
+into a small symbol called `runtime.skipPleaseUseCallersFrames`.
+For example, if f() calls g(), g() calls `runtime.Callers(2, pcs)`, and
+g() is inlined into f, then the frame for f will be partially skipped,
+resulting in the following slice:
+
+    pcs = []uintptr{pc_in_f, runtime.skipPleaseUseCallersFrames+1, ...}
+
+The `runtime.CallersFrames` function will check if the second PC is
+in `runtime.skipPleaseUseCallersFrames` and skip the corresponding
+number of logical calls.
+We store the skip PC in `pcs[1]` instead of `pcs[0]` so that `pcs[i:]`
+will truncate the captured stack trace rather than grow it for all i
+(otherwise `pcs[1:]` would grow the stack trace).
+
+Code that iterates over the PC slice from `runtime.Callers` calling
+`FuncForPC` will have to be updated as described below to continue
+observing complete stack traces.
+
+
+# Rationale
+
+Even with just leaf inlining, the new inlining tables increase the
+size of binaries (see Preliminary Results).
+However, this increase is unavoidable if the runtime is to print complete
+stack traces.
+Turning on mid-stack inlining increases binary size more significantly,
+but we can tweak the inlining heuristic to find a good tradeoff between
+binary size, performance, and build times.
+
+We considered several alternative designs before we reached the design
+described in this document.
+One tempting alternative is to reuse the existing file and line PC-value
+tables and simply add a new PC-value table for the “parent” PC of each
+instruction, rather than a new funcdata table.
+This appears to represent the same information as the proposed funcdata table.
+However, some PCs might not have a parent PC to point at, for example
+if an inlined call is the very first instructions in a function.
+We considered adding NOP instructions to represent the parent of an inlined
+call, but concluded that a separate inlining tree is more compact.
+
+Another alternative design involves adding push and pop operations to the
+PC-value table decoder for representing the inlined call stack.
+We didn't prototype this design since the other designs seemed conceptually
+simpler.
+
+
+# Compatibility
+
+Prior to Go 1.7, the recommended way to use `runtime.Callers` was to loop
+over the returned PCs and call functions like `runtime.FuncForPC` on each
+PC directly.
+With mid-stack inlining, code using this pattern will observe incomplete
+call stacks, since inlined frames will be omitted.
+In preparation for this, the `runtime.Frames` API was introduced in Go 1.7
+as a higher-level way to interpret the results of `runtime.Callers`.
+We consider this to be a minor issue, since users will have had two releases
+to update to `runtime.Frames` and any remaining direct uses of
+`runtime.FuncForPC` will continue to work, simply in a degraded fashion.
+
+
+# Implementation
+
+David will implement this proposal during the Go 1.9 time frame.
+As of the beginning of the Go 1.9 development cycle, a mostly complete
+prototype of the changes the compiler, linker, and runtime is already
+working.
+
+The initial implementation goal is to make all tests pass with `-l=4`.
+We will then focus on bringing tools and DWARF information up-to-date
+with mid-stack inlining.
+Once this support is complete, we plan to make `-l=4` the default setting.
+
+We should also update the `debug/gosym` package to expose the new inlining
+information.
+
+*Update* (2017-03-04): CLs that add inlining info and fix stack traces
+have been merged into master.
+CLs that fix runtime.Callers are under submission.
+
+
+# Prerequisite Changes
+
+Prior to this work, Go had the `-l=4` flag to turn on mid-stack inlining,
+but this mode had issues beyond incomplete stack traces.
+
+For example, before we could run experiments with `-l=4`, we had to fix
+inlining of variadic functions ([CL 33671](golang.org/cl/33671)),
+mark certain cgo functions as uninlineable ([CL 33722](golang.org/cl/33722)),
+and include linknames in export data ([CL 33911](golang.org/cl/33911)).
+
+Before we turn on mid-stack inlining, we will have to update uses
+of runtime.Callers in the runtime to use runtime.CallersFrames.
+We will also have to make tests independent of inlining
+(e.g., [CL 37237](golang.org/cl/37237)).
+
+
+# Preliminary Results
+
+Mid-stack inlining (`-l=4`) gives a 9% geomean improvement on the Go1
+benchmarks on amd64:
+
+  https://perf.golang.org/search?q=upload:20170309.1
+
+The same experiment on ppc64 also showed a 9-10% improvement.
+
+The new inlining tables increase binary size by 4% without mid-stack inlining.
+Mid-stack inlining increases the size of the Go1 benchmark binary by an
+additional 11%.
+
+
+# Open issues
+
+One limitation of this approach is that the runtime is unable to print
+the arguments to inlined calls in a stack trace.
+This is because the runtime gets arguments by assuming a certain stack
+layout, but there is no stack frame for inlined calls.
+
+This proposal does not propose significant changes to the existing
+inlining heuristics.
+Since mid-stack inlining is now a possibility, we should revisit the
+inlining heuristics in follow-on work.