blob: b98399843109a36fcd69beca12af36ad82f6e223 [file] [log] [blame] [view]
# Proposal: Mid-stack inlining in the Go compiler
Author(s): David Lazar, Austin Clements
Last updated: 2017-03-10
Discussion at: https://golang.org/issue/19348
See also: https://golang.org/s/go19inliningtalk
# Abstract
As of Go 1.8, the compiler does not inline mid-stack functions (functions
that call other non-inlineable functions) by default.
This is because the runtime does not have sufficient information to generate
accurate tracebacks for inlined code.
We propose fixing this limitation of tracebacks and enabling mid-stack
inlining by default.
To do this, we will add a new PC-value table to functions with inlined
calls that the runtime can use to generate accurate tracebacks, generate
DWARF inlining information for debuggers, modify runtime.Callers and
related functions to operate in terms of “logical” stack frames, and
modify tools that work with stack traces such as pprof and trace.
Preliminary results show that mid-stack inlining can improve performance by 9%
(Go1 benchmarks on both amd64 and ppc64) with a 15% increase in binary size.
Follow-on work will focus on improving the inlining heuristics to hopefully
achieve this performance with less increase in binary size.
# Background
Inlining is a fundamental compiler optimization that replaces a function
call with the body of the called function.
This eliminates call overhead but more importantly enables other compiler
optimizations, such as constant folding, common subexpression elimination,
loop-invariant code motion, and better register allocation.
As of Go 1.8, inlining happens at the AST level.
To illustrate how the Go 1.8 compiler does inlining, consider the code in
the left column of the table below.
Using heuristics, the compiler decides that the call to PutUint32 in app.go
can be inlined. It replaces the call with a copy of the AST nodes that make
up the body of PutUint32, creating new AST nodes for the arguments.
The resulting code is shown in the right column.
<table>
<tr><th>Before inlining</th><th>After inlining</th></tr>
<tr><td>
<pre>
binary.go:20 func PutUint32(b []byte, v uint32) {
binary.go:21 b[0] = byte(v >> 24)
binary.go:22 b[1] = byte(v >> 16)
binary.go:23 b[2] = byte(v >> 8)
binary.go:24 b[3] = byte(v)
binary.go:25 }
app.go:5 func main() {
app.go:6 // ...
app.go:7 PutUint32(data, input)
app.go:8 // ...
app.go:9 }
</pre>
</td><td>
<pre>
app.go:5 func main() {
app.go:6 // ...
app.go:7 var b []byte, v uint32
app.go:7 b, v = data, input
app.go:7 b[0] = byte(v >> 24)
app.go:7 b[1] = byte(v >> 16)
app.go:7 b[2] = byte(v >> 8)
app.go:7 b[3] = byte(v)
app.go:8 // ...
app.go:9 }
</pre>
</td></tr>
</table>
Notice that the compiler replaces the source positions of the inlined AST
nodes with the source position of the call.
If the inlined code panics (due to an index out of range error), the
resulting stack trace is missing a stack frame for PutUint32 and the user
doesn't get an accurate line number for what caused the panic:
<pre>
panic: runtime error: index out of range
main.main()
/home/gopher/app.go:7 +0x114
</pre>
Thus, even without aggressive inlining, the user might see inaccurate
tracebacks due to inlining.
To mitigate this problem somewhat, the Go 1.8 compiler does not inline
functions that contain calls.
This reduces the likelihood that the user will see an inaccurate traceback,
but it has a negative impact on performance.
Suppose in the example below that `intrinsicLog` is a large function that
won’t be inlined.
By default, the compiler will not inline the calls to `Log` or `LogBase`
since these functions make calls to non-inlineable functions.
However, we can force the compiler to inline these call to using the
compiler flag `-l=4`.
<table>
<tr><th>Before inlining</th><th>After inlining (-l=4)</th></tr>
<tr><td>
<pre>
math.go:41 func Log(x float64) float64 {
math.go:42 if x <= 0 {
math.go:43 panic("log x <= 0")
math.go:44 }
math.go:45 return intrinsicLog(x)
math.go:46 }
math.go:93 func LogBase(x float64, base float64) float64 {
math.go:94 n := Log(x)
math.go:95 d := Log(base)
math.go:96 return n / d
math.go:97 }
app.go:5 func main() {
app.go:6 // ...
app.go:7 val := LogBase(input1, input2)
app.go:8 // ...
app.go:9 }
</pre>
</td><td>
<pre>
app.go:5 func main() {
app.go:6 // ...
app.go:7 x, base := input1, input2
app.go:7 x1 := x
app.go:7 if x1 <= 0 {
app.go:7 panic("log x <= 0")
app.go:7 }
app.go:7 r1 := intrinsicLog(x1)
app.go:7 x2 := base
app.go:7 if x2 <= 0 {
app.go:7 panic("log x <= 0")
app.go:7 }
app.go:7 r2 := intrinsicLog(x2)
app.go:7 n := r1
app.go:7 d := r2
app.go:7 r3 := n / d
app.go:7 val := r3
app.go:8 // ...
app.go:9 }
</pre>
</td></tr>
</table>
Below we have the corresponding stack traces for these two versions of code,
caused by a call to `Log(0)`.
With mid-stack inlining, there is no stack frame or line number information
available for `LogBase`, so the user is unable to determine which input was 0.
<table>
<tr><th>Stack trace before inlining</th><th>Stack trace after inlining (-l=4)</th></tr>
<tr><td>
<pre>
panic(0x497140, 0xc42000e340)
/usr/lib/go/src/runtime/panic.go:500 +0x1a1
main.Log(0x0, 0x400de6bf542e3d2d)
/home/gopher/math.go:43 +0xa0
main.LogBase(0x4045000000000000, 0x0, 0x0)
/home/gopher/math.go:95 +0x49
main.main()
/home/gopher/app.go:7 +0x4c
</pre>
</td><td>
<pre>
panic(0x497140, 0xc42000e340)
/usr/lib/go/src/runtime/panic.go:500 +0x1a1
main.main()
/home/gopher/app.go:7 +0x161
</pre>
</td></tr>
</table>
The goal of this proposed change is to produce complete tracebacks in the
presence of inlining and to enable the compiler to inline non-leaf functions
like `Log` and `LogBase` without sacrificing debuggability.
# Proposal
## Changes to the compiler
Our approach is to modify the compiler to retain the original source
position information of inlined AST nodes and to store information about
the call site in a separate data structure.
Here is what the inlined example from above would look like instead:
<pre>
app.go:5 func main() {
app.go:6 // ...
app.go:7 x, base := input1, input2 ┓ LogBase
math.go:94 x1 := x ┃ app.go:7 ┓ Log
math.go:42 if x1 <= 0 { ┃ ┃ math.go:94
math.go:43 panic("log x <= 0") ┃ ┃
math.go:44 } ┃ ┃
math.go:45 r1 := intrinsicLog(x1) ┃ ┛
math.go:95 x2 := base ┃ ┓ Log
math.go:42 if x2 <= 0 { ┃ ┃ math.go:95
math.go:43 panic("log x <= 0") ┃ ┃
math.go:44 } ┃ ┃
math.go:45 r2 := intrinsicLog(x2) ┃ ┛
math.go:94 n := r1 ┃
math.go:95 d := r2 ┃
math.go:96 r3 := n / d ┛
app.go:7 val := r3
app.go:8 // ...
app.go:9 }
</pre>
Information about inlined calls is stored in a compiler-global data
structure called the *global inlining tree*.
Every time a call is inlined, the compiler adds a new node to the global
inlining tree that contains information about the call site (line number,
file name, and function name).
If the parent function of the inlined call is also inlined, the node for
the inner inlined call points to the node for the parent's inlined call.
For example, here is the inlining tree for the code above:
<pre>
┌──────────┐
│ LogBase │
│ app.go:7 │
└──────────┘
↑ ↑ ┌────────────┐
│ └───┤ Log │
│ │ math.go:94 │
│ └────────────┘
│ ┌────────────┐
└────────┤ Log │
│ math.go:95 │
└────────────┘
</pre>
The inlining tree is encoded as a table with one row per node in the tree.
The parent column is the row index of the node's parent in the table, or -1
if the node has no parent:
| Parent | File | Line | Function Name |
| ------ | -------------- | ---- | ----------------- |
| -1 | app.go | 7 | LogBase |
| 0 | math.go | 94 | Log |
| 0 | math.go | 95 | Log |
Every AST node is associated to a row index in the global inlining
tree/table (or -1 if the node is not the result of inlining).
We maintain this association by extending the `src.PosBase` type with a new
field called the *inlining index*.
Here is what our AST looks like now:
<pre>
app.go:5 func main() {
app.go:6 // ...
app.go:7 x, base := input1, input2 ┃ 0
math.go:94 x1 := x ┓
math.go:42 if x1 <= 0 { ┃
math.go:43 panic("log x <= 0") ┃ 1
math.go:44 } ┃
math.go:45 r1 := intrinsicLog(x1) ┛
math.go:95 x2 := base ┓
math.go:42 if x2 <= 0 { ┃
math.go:43 panic("log x <= 0") ┃ 2
math.go:44 } ┃
math.go:45 r2 := intrinsicLog(x2) ┛
math.go:94 n := r1 ┓
math.go:95 d := r2 ┃ 0
math.go:96 r3 := n / d ┛
app.go:7 val := r3
app.go:8 // ...
app.go:9 }
</pre>
As the AST nodes are lowered, their `src.PosBase` values are copied to
the resulting `Prog` pseudo-instructions.
The object writer reads the global inlining tree and the inlining index of
each `Prog` and writes this information compactly to object files.
## Changes to the object writer
The object writer creates two new tables per function.
The first table is the *local inlining tree* which contains all the
branches from the global inlining tree that are referenced by the Progs
in that function.
The second table is a PC-value table called the *pcinline table* that maps
each PC to a row index in the local inlining tree, or -1 if the PC does not
correspond to a function that has been inlined.
The local inlining tree and pcinline table are written to object files as
part of each function's pcln table.
The file names and function names in the local inlining tree are represented
using symbol references which are resolved to name offsets by the linker.
## Changes to the linker
The linker reads the new tables produced by the object writer and writes
the tables to the final binary.
We reserve `pcdata[1]` for the pcinline table and `funcdata[2]` for the
local inlining tree.
The linker writes the pcinline table to `pcdata[1]` unmodified.
The local inlining tree is encoded using 16 bytes per row (4 bytes per column).
The parent and line numbers are encoded directly as int32 values.
The file name and function names are encoded as int32 offsets into existing
global string tables.
This table must be written by the linker rather than the compiler because the
linker deduplicates these names and resolves them to global name offsets.
If necessary, we can encode the inlining tree more compactly using a varint
for each column value.
In the compact encoding, the parent column and the values in the pcinline
table would be byte offsets into the local inlining tree instead of row
indices.
In this case, the linker would have to regenerate the pcinline table.
## Changes to the runtime
The `runtime.gentraceback` function generates tracebacks and is modified
to produce logical stack frames for inlined functions.
The `gentraceback` function has two modes that are affected by inlining:
printing mode, used to print a stack trace when the runtime panics, and
pcbuf mode, which returns a buffer of PC values used by `runtime.Callers`.
In both modes, `gentraceback` checks if the current PC is mapped to a node
in the function's inlining tree by decoding the pcinline table for the
current function until it finds the value at the current PC.
If the value is -1, this instruction is not a result of inlining, so the
traceback proceeds normally.
Otherwise, `gentraceback` decodes the inlining tree and follows the path
up the tree to create the traceback.
Suppose that `pcPos` is the position information for the current PC
(obtained from the pcline and pcfile tables), `pcFunc` is the function
name for the current PC, and `st[0] -> st[1] -> ... -> st[k]` is the
path up the inlining tree for the current PC.
To print an accurate stack trace, `gentraceback` prints function names
and their corresponding position information in this order:
| Function name | Source position |
| ------------- | --------------- |
| st[0].Func | pcPos |
| st[1].Func | st[0].Pos |
| ... | ... |
| st[k].Func | st[k-1].Pos |
| pcFunc | st[k].Pos |
This process repeats for every PC in the traceback.
Note that the runtime only has sufficient information to print function
arguments and PC offsets for the last entry in this table.
Here is the resulting stack trace from the example above with our changes:
<pre>
main.Log(...)
/home/gopher/math.go:43
main.LogBase(...)
/home/gopher/math.go:95
main.main()
/home/gopher/app.go:7 +0x1c8
</pre>
## Changes to the runtime public API
With inlining, a PC may represent multiple logical calls, so we need to
clarify the meaning of some runtime APIs related to tracebacks.
For example, the `skip` argument passed to `runtime.Caller` and
`runtime.Callers` will be interpreted as the number of logical calls to skip
(rather than the number of physical stack frames to skip).
Unfortunately, the runtime.Callers API requires some modification to be
compatible with mid-stack inlining.
The result value of runtime.Callers is a slice of program counters
([]uintptr) representing physical stack frames.
If the `skip` parameter to runtime.Callers skips part-way into a physical
frame, there is no convenient way to encode that in the resulting slice.
To avoid changing the API in an incompatible way, our solution is to store
the number of skipped logical calls of the first frame in the _second_
uintptr returned by runtime.Callers.
Since this number is a small integer, we encode it as a valid PC value
into a small symbol called `runtime.skipPleaseUseCallersFrames`.
For example, if f() calls g(), g() calls `runtime.Callers(2, pcs)`, and
g() is inlined into f, then the frame for f will be partially skipped,
resulting in the following slice:
pcs = []uintptr{pc_in_f, runtime.skipPleaseUseCallersFrames+1, ...}
The `runtime.CallersFrames` function will check if the second PC is
in `runtime.skipPleaseUseCallersFrames` and skip the corresponding
number of logical calls.
We store the skip PC in `pcs[1]` instead of `pcs[0]` so that `pcs[i:]`
will truncate the captured stack trace rather than grow it for all i
(otherwise `pcs[1:]` would grow the stack trace).
Code that iterates over the PC slice from `runtime.Callers` calling
`FuncForPC` will have to be updated as described below to continue
observing complete stack traces.
# Rationale
Even with just leaf inlining, the new inlining tables increase the
size of binaries (see Preliminary Results).
However, this increase is unavoidable if the runtime is to print complete
stack traces.
Turning on mid-stack inlining increases binary size more significantly,
but we can tweak the inlining heuristic to find a good tradeoff between
binary size, performance, and build times.
We considered several alternative designs before we reached the design
described in this document.
One tempting alternative is to reuse the existing file and line PC-value
tables and simply add a new PC-value table for the “parent” PC of each
instruction, rather than a new funcdata table.
This appears to represent the same information as the proposed funcdata table.
However, some PCs might not have a parent PC to point at, for example
if an inlined call is the very first instructions in a function.
We considered adding NOP instructions to represent the parent of an inlined
call, but concluded that a separate inlining tree is more compact.
Another alternative design involves adding push and pop operations to the
PC-value table decoder for representing the inlined call stack.
We didn't prototype this design since the other designs seemed conceptually
simpler.
# Compatibility
Prior to Go 1.7, the recommended way to use `runtime.Callers` was to loop
over the returned PCs and call functions like `runtime.FuncForPC` on each
PC directly.
With mid-stack inlining, code using this pattern will observe incomplete
call stacks, since inlined frames will be omitted.
In preparation for this, the `runtime.Frames` API was introduced in Go 1.7
as a higher-level way to interpret the results of `runtime.Callers`.
We consider this to be a minor issue, since users will have had two releases
to update to `runtime.Frames` and any remaining direct uses of
`runtime.FuncForPC` will continue to work, simply in a degraded fashion.
# Implementation
David will implement this proposal during the Go 1.9 time frame.
As of the beginning of the Go 1.9 development cycle, a mostly complete
prototype of the changes the compiler, linker, and runtime is already
working.
The initial implementation goal is to make all tests pass with `-l=4`.
We will then focus on bringing tools and DWARF information up-to-date
with mid-stack inlining.
Once this support is complete, we plan to make `-l=4` the default setting.
We should also update the `debug/gosym` package to expose the new inlining
information.
*Update* (2017-03-04): CLs that add inlining info and fix stack traces
have been merged into master.
CLs that fix runtime.Callers are under submission.
# Prerequisite Changes
Prior to this work, Go had the `-l=4` flag to turn on mid-stack inlining,
but this mode had issues beyond incomplete stack traces.
For example, before we could run experiments with `-l=4`, we had to fix
inlining of variadic functions ([CL 33671](golang.org/cl/33671)),
mark certain cgo functions as uninlineable ([CL 33722](golang.org/cl/33722)),
and include linknames in export data ([CL 33911](golang.org/cl/33911)).
Before we turn on mid-stack inlining, we will have to update uses
of runtime.Callers in the runtime to use runtime.CallersFrames.
We will also have to make tests independent of inlining
(e.g., [CL 37237](golang.org/cl/37237)).
# Preliminary Results
Mid-stack inlining (`-l=4`) gives a 9% geomean improvement on the Go1
benchmarks on amd64:
https://perf.golang.org/search?q=upload:20170309.1
The same experiment on ppc64 also showed a 9-10% improvement.
The new inlining tables increase binary size by 4% without mid-stack inlining.
Mid-stack inlining increases the size of the Go1 benchmark binary by an
additional 11%.
# Open issues
One limitation of this approach is that the runtime is unable to print
the arguments to inlined calls in a stack trace.
This is because the runtime gets arguments by assuming a certain stack
layout, but there is no stack frame for inlined calls.
This proposal does not propose significant changes to the existing
inlining heuristics.
Since mid-stack inlining is now a possibility, we should revisit the
inlining heuristics in follow-on work.