| # Proposal: Mid-stack inlining in the Go compiler |
| |
| Author(s): David Lazar, Austin Clements |
| |
| Last updated: 2017-03-10 |
| |
| Discussion at: https://golang.org/issue/19348 |
| |
| See also: https://golang.org/s/go19inliningtalk |
| |
| # Abstract |
| |
| As of Go 1.8, the compiler does not inline mid-stack functions (functions |
| that call other non-inlineable functions) by default. |
| This is because the runtime does not have sufficient information to generate |
| accurate tracebacks for inlined code. |
| |
| We propose fixing this limitation of tracebacks and enabling mid-stack |
| inlining by default. |
| To do this, we will add a new PC-value table to functions with inlined |
| calls that the runtime can use to generate accurate tracebacks, generate |
| DWARF inlining information for debuggers, modify runtime.Callers and |
| related functions to operate in terms of “logical” stack frames, and |
| modify tools that work with stack traces such as pprof and trace. |
| |
| Preliminary results show that mid-stack inlining can improve performance by 9% |
| (Go1 benchmarks on both amd64 and ppc64) with a 15% increase in binary size. |
| Follow-on work will focus on improving the inlining heuristics to hopefully |
| achieve this performance with less increase in binary size. |
| |
| |
| # Background |
| |
| Inlining is a fundamental compiler optimization that replaces a function |
| call with the body of the called function. |
| This eliminates call overhead but more importantly enables other compiler |
| optimizations, such as constant folding, common subexpression elimination, |
| loop-invariant code motion, and better register allocation. |
| As of Go 1.8, inlining happens at the AST level. |
| To illustrate how the Go 1.8 compiler does inlining, consider the code in |
| the left column of the table below. |
| Using heuristics, the compiler decides that the call to PutUint32 in app.go |
| can be inlined. It replaces the call with a copy of the AST nodes that make |
| up the body of PutUint32, creating new AST nodes for the arguments. |
| The resulting code is shown in the right column. |
| |
| <table> |
| <tr><th>Before inlining</th><th>After inlining</th></tr> |
| <tr><td> |
| <pre> |
| binary.go:20 func PutUint32(b []byte, v uint32) { |
| binary.go:21 b[0] = byte(v >> 24) |
| binary.go:22 b[1] = byte(v >> 16) |
| binary.go:23 b[2] = byte(v >> 8) |
| binary.go:24 b[3] = byte(v) |
| binary.go:25 } |
| |
| app.go:5 func main() { |
| app.go:6 // ... |
| app.go:7 PutUint32(data, input) |
| app.go:8 // ... |
| app.go:9 } |
| </pre> |
| </td><td> |
| <pre> |
| app.go:5 func main() { |
| app.go:6 // ... |
| app.go:7 var b []byte, v uint32 |
| app.go:7 b, v = data, input |
| app.go:7 b[0] = byte(v >> 24) |
| app.go:7 b[1] = byte(v >> 16) |
| app.go:7 b[2] = byte(v >> 8) |
| app.go:7 b[3] = byte(v) |
| app.go:8 // ... |
| app.go:9 } |
| </pre> |
| </td></tr> |
| </table> |
| |
| Notice that the compiler replaces the source positions of the inlined AST |
| nodes with the source position of the call. |
| If the inlined code panics (due to an index out of range error), the |
| resulting stack trace is missing a stack frame for PutUint32 and the user |
| doesn't get an accurate line number for what caused the panic: |
| <pre> |
| panic: runtime error: index out of range |
| |
| main.main() |
| /home/gopher/app.go:7 +0x114 |
| </pre> |
| Thus, even without aggressive inlining, the user might see inaccurate |
| tracebacks due to inlining. |
| |
| To mitigate this problem somewhat, the Go 1.8 compiler does not inline |
| functions that contain calls. |
| This reduces the likelihood that the user will see an inaccurate traceback, |
| but it has a negative impact on performance. |
| Suppose in the example below that `intrinsicLog` is a large function that |
| won’t be inlined. |
| By default, the compiler will not inline the calls to `Log` or `LogBase` |
| since these functions make calls to non-inlineable functions. |
| However, we can force the compiler to inline these call to using the |
| compiler flag `-l=4`. |
| |
| <table> |
| <tr><th>Before inlining</th><th>After inlining (-l=4)</th></tr> |
| <tr><td> |
| <pre> |
| math.go:41 func Log(x float64) float64 { |
| math.go:42 if x <= 0 { |
| math.go:43 panic("log x <= 0") |
| math.go:44 } |
| math.go:45 return intrinsicLog(x) |
| math.go:46 } |
| |
| math.go:93 func LogBase(x float64, base float64) float64 { |
| math.go:94 n := Log(x) |
| math.go:95 d := Log(base) |
| math.go:96 return n / d |
| math.go:97 } |
| |
| app.go:5 func main() { |
| app.go:6 // ... |
| app.go:7 val := LogBase(input1, input2) |
| app.go:8 // ... |
| app.go:9 } |
| </pre> |
| </td><td> |
| <pre> |
| app.go:5 func main() { |
| app.go:6 // ... |
| app.go:7 x, base := input1, input2 |
| app.go:7 x1 := x |
| app.go:7 if x1 <= 0 { |
| app.go:7 panic("log x <= 0") |
| app.go:7 } |
| app.go:7 r1 := intrinsicLog(x1) |
| app.go:7 x2 := base |
| app.go:7 if x2 <= 0 { |
| app.go:7 panic("log x <= 0") |
| app.go:7 } |
| app.go:7 r2 := intrinsicLog(x2) |
| app.go:7 n := r1 |
| app.go:7 d := r2 |
| app.go:7 r3 := n / d |
| app.go:7 val := r3 |
| app.go:8 // ... |
| app.go:9 } |
| </pre> |
| </td></tr> |
| </table> |
| |
| Below we have the corresponding stack traces for these two versions of code, |
| caused by a call to `Log(0)`. |
| With mid-stack inlining, there is no stack frame or line number information |
| available for `LogBase`, so the user is unable to determine which input was 0. |
| <table> |
| <tr><th>Stack trace before inlining</th><th>Stack trace after inlining (-l=4)</th></tr> |
| <tr><td> |
| <pre> |
| panic(0x497140, 0xc42000e340) |
| /usr/lib/go/src/runtime/panic.go:500 +0x1a1 |
| main.Log(0x0, 0x400de6bf542e3d2d) |
| /home/gopher/math.go:43 +0xa0 |
| main.LogBase(0x4045000000000000, 0x0, 0x0) |
| /home/gopher/math.go:95 +0x49 |
| main.main() |
| /home/gopher/app.go:7 +0x4c |
| </pre> |
| </td><td> |
| <pre> |
| panic(0x497140, 0xc42000e340) |
| /usr/lib/go/src/runtime/panic.go:500 +0x1a1 |
| main.main() |
| /home/gopher/app.go:7 +0x161 |
| </pre> |
| </td></tr> |
| </table> |
| |
| The goal of this proposed change is to produce complete tracebacks in the |
| presence of inlining and to enable the compiler to inline non-leaf functions |
| like `Log` and `LogBase` without sacrificing debuggability. |
| |
| |
| # Proposal |
| |
| ## Changes to the compiler |
| |
| Our approach is to modify the compiler to retain the original source |
| position information of inlined AST nodes and to store information about |
| the call site in a separate data structure. |
| Here is what the inlined example from above would look like instead: |
| |
| <pre> |
| app.go:5 func main() { |
| app.go:6 // ... |
| app.go:7 x, base := input1, input2 ┓ LogBase |
| math.go:94 x1 := x ┃ app.go:7 ┓ Log |
| math.go:42 if x1 <= 0 { ┃ ┃ math.go:94 |
| math.go:43 panic("log x <= 0") ┃ ┃ |
| math.go:44 } ┃ ┃ |
| math.go:45 r1 := intrinsicLog(x1) ┃ ┛ |
| math.go:95 x2 := base ┃ ┓ Log |
| math.go:42 if x2 <= 0 { ┃ ┃ math.go:95 |
| math.go:43 panic("log x <= 0") ┃ ┃ |
| math.go:44 } ┃ ┃ |
| math.go:45 r2 := intrinsicLog(x2) ┃ ┛ |
| math.go:94 n := r1 ┃ |
| math.go:95 d := r2 ┃ |
| math.go:96 r3 := n / d ┛ |
| app.go:7 val := r3 |
| app.go:8 // ... |
| app.go:9 } |
| </pre> |
| |
| Information about inlined calls is stored in a compiler-global data |
| structure called the *global inlining tree*. |
| Every time a call is inlined, the compiler adds a new node to the global |
| inlining tree that contains information about the call site (line number, |
| file name, and function name). |
| If the parent function of the inlined call is also inlined, the node for |
| the inner inlined call points to the node for the parent's inlined call. |
| For example, here is the inlining tree for the code above: |
| |
| <pre> |
| ┌──────────┐ |
| │ LogBase │ |
| │ app.go:7 │ |
| └──────────┘ |
| ↑ ↑ ┌────────────┐ |
| │ └───┤ Log │ |
| │ │ math.go:94 │ |
| │ └────────────┘ |
| │ ┌────────────┐ |
| └────────┤ Log │ |
| │ math.go:95 │ |
| └────────────┘ |
| </pre> |
| |
| The inlining tree is encoded as a table with one row per node in the tree. |
| The parent column is the row index of the node's parent in the table, or -1 |
| if the node has no parent: |
| |
| | Parent | File | Line | Function Name | |
| | ------ | -------------- | ---- | ----------------- | |
| | -1 | app.go | 7 | LogBase | |
| | 0 | math.go | 94 | Log | |
| | 0 | math.go | 95 | Log | |
| |
| Every AST node is associated to a row index in the global inlining |
| tree/table (or -1 if the node is not the result of inlining). |
| We maintain this association by extending the `src.PosBase` type with a new |
| field called the *inlining index*. |
| Here is what our AST looks like now: |
| |
| <pre> |
| app.go:5 func main() { |
| app.go:6 // ... |
| app.go:7 x, base := input1, input2 ┃ 0 |
| math.go:94 x1 := x ┓ |
| math.go:42 if x1 <= 0 { ┃ |
| math.go:43 panic("log x <= 0") ┃ 1 |
| math.go:44 } ┃ |
| math.go:45 r1 := intrinsicLog(x1) ┛ |
| math.go:95 x2 := base ┓ |
| math.go:42 if x2 <= 0 { ┃ |
| math.go:43 panic("log x <= 0") ┃ 2 |
| math.go:44 } ┃ |
| math.go:45 r2 := intrinsicLog(x2) ┛ |
| math.go:94 n := r1 ┓ |
| math.go:95 d := r2 ┃ 0 |
| math.go:96 r3 := n / d ┛ |
| app.go:7 val := r3 |
| app.go:8 // ... |
| app.go:9 } |
| </pre> |
| |
| As the AST nodes are lowered, their `src.PosBase` values are copied to |
| the resulting `Prog` pseudo-instructions. |
| The object writer reads the global inlining tree and the inlining index of |
| each `Prog` and writes this information compactly to object files. |
| |
| ## Changes to the object writer |
| |
| The object writer creates two new tables per function. |
| The first table is the *local inlining tree* which contains all the |
| branches from the global inlining tree that are referenced by the Progs |
| in that function. |
| The second table is a PC-value table called the *pcinline table* that maps |
| each PC to a row index in the local inlining tree, or -1 if the PC does not |
| correspond to a function that has been inlined. |
| |
| The local inlining tree and pcinline table are written to object files as |
| part of each function's pcln table. |
| The file names and function names in the local inlining tree are represented |
| using symbol references which are resolved to name offsets by the linker. |
| |
| ## Changes to the linker |
| |
| The linker reads the new tables produced by the object writer and writes |
| the tables to the final binary. |
| We reserve `pcdata[1]` for the pcinline table and `funcdata[2]` for the |
| local inlining tree. |
| The linker writes the pcinline table to `pcdata[1]` unmodified. |
| |
| The local inlining tree is encoded using 16 bytes per row (4 bytes per column). |
| The parent and line numbers are encoded directly as int32 values. |
| The file name and function names are encoded as int32 offsets into existing |
| global string tables. |
| This table must be written by the linker rather than the compiler because the |
| linker deduplicates these names and resolves them to global name offsets. |
| |
| If necessary, we can encode the inlining tree more compactly using a varint |
| for each column value. |
| In the compact encoding, the parent column and the values in the pcinline |
| table would be byte offsets into the local inlining tree instead of row |
| indices. |
| In this case, the linker would have to regenerate the pcinline table. |
| |
| ## Changes to the runtime |
| |
| The `runtime.gentraceback` function generates tracebacks and is modified |
| to produce logical stack frames for inlined functions. |
| The `gentraceback` function has two modes that are affected by inlining: |
| printing mode, used to print a stack trace when the runtime panics, and |
| pcbuf mode, which returns a buffer of PC values used by `runtime.Callers`. |
| In both modes, `gentraceback` checks if the current PC is mapped to a node |
| in the function's inlining tree by decoding the pcinline table for the |
| current function until it finds the value at the current PC. |
| If the value is -1, this instruction is not a result of inlining, so the |
| traceback proceeds normally. |
| Otherwise, `gentraceback` decodes the inlining tree and follows the path |
| up the tree to create the traceback. |
| |
| Suppose that `pcPos` is the position information for the current PC |
| (obtained from the pcline and pcfile tables), `pcFunc` is the function |
| name for the current PC, and `st[0] -> st[1] -> ... -> st[k]` is the |
| path up the inlining tree for the current PC. |
| To print an accurate stack trace, `gentraceback` prints function names |
| and their corresponding position information in this order: |
| |
| | Function name | Source position | |
| | ------------- | --------------- | |
| | st[0].Func | pcPos | |
| | st[1].Func | st[0].Pos | |
| | ... | ... | |
| | st[k].Func | st[k-1].Pos | |
| | pcFunc | st[k].Pos | |
| |
| This process repeats for every PC in the traceback. |
| Note that the runtime only has sufficient information to print function |
| arguments and PC offsets for the last entry in this table. |
| Here is the resulting stack trace from the example above with our changes: |
| |
| <pre> |
| main.Log(...) |
| /home/gopher/math.go:43 |
| main.LogBase(...) |
| /home/gopher/math.go:95 |
| main.main() |
| /home/gopher/app.go:7 +0x1c8 |
| </pre> |
| |
| ## Changes to the runtime public API |
| |
| With inlining, a PC may represent multiple logical calls, so we need to |
| clarify the meaning of some runtime APIs related to tracebacks. |
| For example, the `skip` argument passed to `runtime.Caller` and |
| `runtime.Callers` will be interpreted as the number of logical calls to skip |
| (rather than the number of physical stack frames to skip). |
| |
| Unfortunately, the runtime.Callers API requires some modification to be |
| compatible with mid-stack inlining. |
| The result value of runtime.Callers is a slice of program counters |
| ([]uintptr) representing physical stack frames. |
| If the `skip` parameter to runtime.Callers skips part-way into a physical |
| frame, there is no convenient way to encode that in the resulting slice. |
| To avoid changing the API in an incompatible way, our solution is to store |
| the number of skipped logical calls of the first frame in the _second_ |
| uintptr returned by runtime.Callers. |
| Since this number is a small integer, we encode it as a valid PC value |
| into a small symbol called `runtime.skipPleaseUseCallersFrames`. |
| For example, if f() calls g(), g() calls `runtime.Callers(2, pcs)`, and |
| g() is inlined into f, then the frame for f will be partially skipped, |
| resulting in the following slice: |
| |
| pcs = []uintptr{pc_in_f, runtime.skipPleaseUseCallersFrames+1, ...} |
| |
| The `runtime.CallersFrames` function will check if the second PC is |
| in `runtime.skipPleaseUseCallersFrames` and skip the corresponding |
| number of logical calls. |
| We store the skip PC in `pcs[1]` instead of `pcs[0]` so that `pcs[i:]` |
| will truncate the captured stack trace rather than grow it for all i |
| (otherwise `pcs[1:]` would grow the stack trace). |
| |
| Code that iterates over the PC slice from `runtime.Callers` calling |
| `FuncForPC` will have to be updated as described below to continue |
| observing complete stack traces. |
| |
| |
| # Rationale |
| |
| Even with just leaf inlining, the new inlining tables increase the |
| size of binaries (see Preliminary Results). |
| However, this increase is unavoidable if the runtime is to print complete |
| stack traces. |
| Turning on mid-stack inlining increases binary size more significantly, |
| but we can tweak the inlining heuristic to find a good tradeoff between |
| binary size, performance, and build times. |
| |
| We considered several alternative designs before we reached the design |
| described in this document. |
| One tempting alternative is to reuse the existing file and line PC-value |
| tables and simply add a new PC-value table for the “parent” PC of each |
| instruction, rather than a new funcdata table. |
| This appears to represent the same information as the proposed funcdata table. |
| However, some PCs might not have a parent PC to point at, for example |
| if an inlined call is the very first instructions in a function. |
| We considered adding NOP instructions to represent the parent of an inlined |
| call, but concluded that a separate inlining tree is more compact. |
| |
| Another alternative design involves adding push and pop operations to the |
| PC-value table decoder for representing the inlined call stack. |
| We didn't prototype this design since the other designs seemed conceptually |
| simpler. |
| |
| |
| # Compatibility |
| |
| Prior to Go 1.7, the recommended way to use `runtime.Callers` was to loop |
| over the returned PCs and call functions like `runtime.FuncForPC` on each |
| PC directly. |
| With mid-stack inlining, code using this pattern will observe incomplete |
| call stacks, since inlined frames will be omitted. |
| In preparation for this, the `runtime.Frames` API was introduced in Go 1.7 |
| as a higher-level way to interpret the results of `runtime.Callers`. |
| We consider this to be a minor issue, since users will have had two releases |
| to update to `runtime.Frames` and any remaining direct uses of |
| `runtime.FuncForPC` will continue to work, simply in a degraded fashion. |
| |
| |
| # Implementation |
| |
| David will implement this proposal during the Go 1.9 time frame. |
| As of the beginning of the Go 1.9 development cycle, a mostly complete |
| prototype of the changes the compiler, linker, and runtime is already |
| working. |
| |
| The initial implementation goal is to make all tests pass with `-l=4`. |
| We will then focus on bringing tools and DWARF information up-to-date |
| with mid-stack inlining. |
| Once this support is complete, we plan to make `-l=4` the default setting. |
| |
| We should also update the `debug/gosym` package to expose the new inlining |
| information. |
| |
| *Update* (2017-03-04): CLs that add inlining info and fix stack traces |
| have been merged into master. |
| CLs that fix runtime.Callers are under submission. |
| |
| |
| # Prerequisite Changes |
| |
| Prior to this work, Go had the `-l=4` flag to turn on mid-stack inlining, |
| but this mode had issues beyond incomplete stack traces. |
| |
| For example, before we could run experiments with `-l=4`, we had to fix |
| inlining of variadic functions ([CL 33671](golang.org/cl/33671)), |
| mark certain cgo functions as uninlineable ([CL 33722](golang.org/cl/33722)), |
| and include linknames in export data ([CL 33911](golang.org/cl/33911)). |
| |
| Before we turn on mid-stack inlining, we will have to update uses |
| of runtime.Callers in the runtime to use runtime.CallersFrames. |
| We will also have to make tests independent of inlining |
| (e.g., [CL 37237](golang.org/cl/37237)). |
| |
| |
| # Preliminary Results |
| |
| Mid-stack inlining (`-l=4`) gives a 9% geomean improvement on the Go1 |
| benchmarks on amd64: |
| |
| https://perf.golang.org/search?q=upload:20170309.1 |
| |
| The same experiment on ppc64 also showed a 9-10% improvement. |
| |
| The new inlining tables increase binary size by 4% without mid-stack inlining. |
| Mid-stack inlining increases the size of the Go1 benchmark binary by an |
| additional 11%. |
| |
| |
| # Open issues |
| |
| One limitation of this approach is that the runtime is unable to print |
| the arguments to inlined calls in a stack trace. |
| This is because the runtime gets arguments by assuming a certain stack |
| layout, but there is no stack frame for inlined calls. |
| |
| This proposal does not propose significant changes to the existing |
| inlining heuristics. |
| Since mid-stack inlining is now a possibility, we should revisit the |
| inlining heuristics in follow-on work. |