Author(s): David Lazar, Austin Clements
Last updated: 2017-03-10
Discussion at: https://golang.org/issue/19348
See also: https://golang.org/s/go19inliningtalk
As of Go 1.8, the compiler does not inline mid-stack functions (functions that call other non-inlineable functions) by default. This is because the runtime does not have sufficient information to generate accurate tracebacks for inlined code.
We propose fixing this limitation of tracebacks and enabling mid-stack inlining by default. To do this, we will add a new PC-value table to functions with inlined calls that the runtime can use to generate accurate tracebacks, generate DWARF inlining information for debuggers, modify runtime.Callers and related functions to operate in terms of “logical” stack frames, and modify tools that work with stack traces such as pprof and trace.
Preliminary results show that mid-stack inlining can improve performance by 9% (Go1 benchmarks on both amd64 and ppc64) with a 15% increase in binary size. Follow-on work will focus on improving the inlining heuristics to hopefully achieve this performance with less increase in binary size.
Inlining is a fundamental compiler optimization that replaces a function call with the body of the called function. This eliminates call overhead but more importantly enables other compiler optimizations, such as constant folding, common subexpression elimination, loop-invariant code motion, and better register allocation. As of Go 1.8, inlining happens at the AST level. To illustrate how the Go 1.8 compiler does inlining, consider the code in the left column of the table below. Using heuristics, the compiler decides that the call to PutUint32 in app.go can be inlined. It replaces the call with a copy of the AST nodes that make up the body of PutUint32, creating new AST nodes for the arguments. The resulting code is shown in the right column.
app.go:5 func main() { app.go:6 // ... app.go:7 PutUint32(data, input) app.go:8 // ... app.go:9 }
Notice that the compiler replaces the source positions of the inlined AST nodes with the source position of the call. If the inlined code panics (due to an index out of range error), the resulting stack trace is missing a stack frame for PutUint32 and the user doesn't get an accurate line number for what caused the panic:
Thus, even without aggressive inlining, the user might see inaccurate tracebacks due to inlining.
To mitigate this problem somewhat, the Go 1.8 compiler does not inline functions that contain calls. This reduces the likelihood that the user will see an inaccurate traceback, but it has a negative impact on performance. Suppose in the example below that intrinsicLog
is a large function that won’t be inlined. By default, the compiler will not inline the calls to Log
or LogBase
since these functions make calls to non-inlineable functions. However, we can force the compiler to inline these call to using the compiler flag -l=4
.
math.go:93 func LogBase(x float64, base float64) float64 { math.go:94 n := Log(x) math.go:95 d := Log(base) math.go:96 return n / d math.go:97 }
app.go:5 func main() { app.go:6 // ... app.go:7 val := LogBase(input1, input2) app.go:8 // ... app.go:9 }
Below we have the corresponding stack traces for these two versions of code, caused by a call to Log(0)
. With mid-stack inlining, there is no stack frame or line number information available for LogBase
, so the user is unable to determine which input was 0.
The goal of this proposed change is to produce complete tracebacks in the presence of inlining and to enable the compiler to inline non-leaf functions like Log
and LogBase
without sacrificing debuggability.
Our approach is to modify the compiler to retain the original source position information of inlined AST nodes and to store information about the call site in a separate data structure. Here is what the inlined example from above would look like instead:
Information about inlined calls is stored in a compiler-global data structure called the global inlining tree. Every time a call is inlined, the compiler adds a new node to the global inlining tree that contains information about the call site (line number, file name, and function name). If the parent function of the inlined call is also inlined, the node for the inner inlined call points to the node for the parent's inlined call. For example, here is the inlining tree for the code above:
The inlining tree is encoded as a table with one row per node in the tree. The parent column is the row index of the node's parent in the table, or -1 if the node has no parent:
Parent | File | Line | Function Name |
---|---|---|---|
-1 | app.go | 7 | LogBase |
0 | math.go | 94 | Log |
0 | math.go | 95 | Log |
Every AST node is associated to a row index in the global inlining tree/table (or -1 if the node is not the result of inlining). We maintain this association by extending the src.PosBase
type with a new field called the inlining index. Here is what our AST looks like now:
As the AST nodes are lowered, their src.PosBase
values are copied to the resulting Prog
pseudo-instructions. The object writer reads the global inlining tree and the inlining index of each Prog
and writes this information compactly to object files.
The object writer creates two new tables per function. The first table is the local inlining tree which contains all the branches from the global inlining tree that are referenced by the Progs in that function. The second table is a PC-value table called the pcinline table that maps each PC to a row index in the local inlining tree, or -1 if the PC does not correspond to a function that has been inlined.
The local inlining tree and pcinline table are written to object files as part of each function's pcln table. The file names and function names in the local inlining tree are represented using symbol references which are resolved to name offsets by the linker.
The linker reads the new tables produced by the object writer and writes the tables to the final binary. We reserve pcdata[1]
for the pcinline table and funcdata[2]
for the local inlining tree. The linker writes the pcinline table to pcdata[1]
unmodified.
The local inlining tree is encoded using 16 bytes per row (4 bytes per column). The parent and line numbers are encoded directly as int32 values. The file name and function names are encoded as int32 offsets into existing global string tables. This table must be written by the linker rather than the compiler because the linker deduplicates these names and resolves them to global name offsets.
If necessary, we can encode the inlining tree more compactly using a varint for each column value. In the compact encoding, the parent column and the values in the pcinline table would be byte offsets into the local inlining tree instead of row indices. In this case, the linker would have to regenerate the pcinline table.
The runtime.gentraceback
function generates tracebacks and is modified to produce logical stack frames for inlined functions. The gentraceback
function has two modes that are affected by inlining: printing mode, used to print a stack trace when the runtime panics, and pcbuf mode, which returns a buffer of PC values used by runtime.Callers
. In both modes, gentraceback
checks if the current PC is mapped to a node in the function's inlining tree by decoding the pcinline table for the current function until it finds the value at the current PC. If the value is -1, this instruction is not a result of inlining, so the traceback proceeds normally. Otherwise, gentraceback
decodes the inlining tree and follows the path up the tree to create the traceback.
Suppose that pcPos
is the position information for the current PC (obtained from the pcline and pcfile tables), pcFunc
is the function name for the current PC, and st[0] -> st[1] -> ... -> st[k]
is the path up the inlining tree for the current PC. To print an accurate stack trace, gentraceback
prints function names and their corresponding position information in this order:
Function name | Source position |
---|---|
st[0].Func | pcPos |
st[1].Func | st[0].Pos |
... | ... |
st[k].Func | st[k-1].Pos |
pcFunc | st[k].Pos |
This process repeats for every PC in the traceback. Note that the runtime only has sufficient information to print function arguments and PC offsets for the last entry in this table. Here is the resulting stack trace from the example above with our changes:
With inlining, a PC may represent multiple logical calls, so we need to clarify the meaning of some runtime APIs related to tracebacks. For example, the skip
argument passed to runtime.Caller
and runtime.Callers
will be interpreted as the number of logical calls to skip (rather than the number of physical stack frames to skip).
Unfortunately, the runtime.Callers API requires some modification to be compatible with mid-stack inlining. The result value of runtime.Callers is a slice of program counters ([]uintptr) representing physical stack frames. If the skip
parameter to runtime.Callers skips part-way into a physical frame, there is no convenient way to encode that in the resulting slice. To avoid changing the API in an incompatible way, our solution is to store the number of skipped logical calls of the first frame in the second uintptr returned by runtime.Callers. Since this number is a small integer, we encode it as a valid PC value into a small symbol called runtime.skipPleaseUseCallersFrames
. For example, if f() calls g(), g() calls runtime.Callers(2, pcs)
, and g() is inlined into f, then the frame for f will be partially skipped, resulting in the following slice:
pcs = []uintptr{pc_in_f, runtime.skipPleaseUseCallersFrames+1, ...}
The runtime.CallersFrames
function will check if the second PC is in runtime.skipPleaseUseCallersFrames
and skip the corresponding number of logical calls. We store the skip PC in pcs[1]
instead of pcs[0]
so that pcs[i:]
will truncate the captured stack trace rather than grow it for all i (otherwise pcs[1:]
would grow the stack trace).
Code that iterates over the PC slice from runtime.Callers
calling FuncForPC
will have to be updated as described below to continue observing complete stack traces.
Even with just leaf inlining, the new inlining tables increase the size of binaries (see Preliminary Results). However, this increase is unavoidable if the runtime is to print complete stack traces. Turning on mid-stack inlining increases binary size more significantly, but we can tweak the inlining heuristic to find a good tradeoff between binary size, performance, and build times.
We considered several alternative designs before we reached the design described in this document. One tempting alternative is to reuse the existing file and line PC-value tables and simply add a new PC-value table for the “parent” PC of each instruction, rather than a new funcdata table. This appears to represent the same information as the proposed funcdata table. However, some PCs might not have a parent PC to point at, for example if an inlined call is the very first instructions in a function. We considered adding NOP instructions to represent the parent of an inlined call, but concluded that a separate inlining tree is more compact.
Another alternative design involves adding push and pop operations to the PC-value table decoder for representing the inlined call stack. We didn't prototype this design since the other designs seemed conceptually simpler.
Prior to Go 1.7, the recommended way to use runtime.Callers
was to loop over the returned PCs and call functions like runtime.FuncForPC
on each PC directly. With mid-stack inlining, code using this pattern will observe incomplete call stacks, since inlined frames will be omitted. In preparation for this, the runtime.Frames
API was introduced in Go 1.7 as a higher-level way to interpret the results of runtime.Callers
. We consider this to be a minor issue, since users will have had two releases to update to runtime.Frames
and any remaining direct uses of runtime.FuncForPC
will continue to work, simply in a degraded fashion.
David will implement this proposal during the Go 1.9 time frame. As of the beginning of the Go 1.9 development cycle, a mostly complete prototype of the changes the compiler, linker, and runtime is already working.
The initial implementation goal is to make all tests pass with -l=4
. We will then focus on bringing tools and DWARF information up-to-date with mid-stack inlining. Once this support is complete, we plan to make -l=4
the default setting.
We should also update the debug/gosym
package to expose the new inlining information.
Update (2017-03-04): CLs that add inlining info and fix stack traces have been merged into master. CLs that fix runtime.Callers are under submission.
Prior to this work, Go had the -l=4
flag to turn on mid-stack inlining, but this mode had issues beyond incomplete stack traces.
For example, before we could run experiments with -l=4
, we had to fix inlining of variadic functions (CL 33671), mark certain cgo functions as uninlineable (CL 33722), and include linknames in export data (CL 33911).
Before we turn on mid-stack inlining, we will have to update uses of runtime.Callers in the runtime to use runtime.CallersFrames. We will also have to make tests independent of inlining (e.g., CL 37237).
Mid-stack inlining (-l=4
) gives a 9% geomean improvement on the Go1 benchmarks on amd64:
https://perf.golang.org/search?q=upload:20170309.1
The same experiment on ppc64 also showed a 9-10% improvement.
The new inlining tables increase binary size by 4% without mid-stack inlining. Mid-stack inlining increases the size of the Go1 benchmark binary by an additional 11%.
One limitation of this approach is that the runtime is unable to print the arguments to inlined calls in a stack trace. This is because the runtime gets arguments by assuming a certain stack layout, but there is no stack frame for inlined calls.
This proposal does not propose significant changes to the existing inlining heuristics. Since mid-stack inlining is now a possibility, we should revisit the inlining heuristics in follow-on work.