| # Proposal: Design and Implementation of Profile-Guided Optimization (PGO) for Go |
| |
| Author(s): Raj Barik, Jin Lin |
| |
| Last updated: 2022-09-12 |
| |
| Discussion at https://golang.org/issue/55025. \ |
| Linked high-level issue for PGO at https://golang.org/issue/55022. |
| |
| ## Abstract |
| |
Inefficiencies in Go programs can be isolated via profiling tools such as [pprof](https://pkg.go.dev/runtime/pprof) and the Linux profiler [perf](https://www.brendangregg.com/perf.html).
Such tools can pinpoint source code regions where most of the execution time is spent.
Unlike other optimizing compilers such as [LLVM](https://llvm.org/docs/HowToBuildWithPGO.html), the Go compiler does not yet perform Profile-Guided Optimization (PGO).
PGO uses information about the code’s runtime behavior to guide compiler optimizations such as inlining, code layout, etc.
PGO can improve application performance in the range of 15-30% [[LLVM](https://llvm.org/docs/HowToBuildWithPGO.html), [AutoFDO](https://dl.acm.org/doi/pdf/10.1145/2854038.2854044)].
| In this proposal, we extend the Go compiler with PGO. |
| Specifically, we incorporate the profiles into the frontend of the compiler to build a call graph with node & edge weights (called _WeightedCallGraph_). |
| The Inliner subsequently uses the WeightedCallGraph to perform _profile-guided inlining_ which aggressively inlines hot functions. |
| We introduce a _profile-guided code specialization_ pass that is tightly integrated with the Inliner and eliminates indirect method call overheads in hot code paths. |
| Furthermore, we annotate IR instructions with their associated profile weights and propagate these to the SSA-level in order to facilitate _profile-guided basic-block layout_ optimization to benefit from better instruction-cache and TLB performance. |
| Finally, we extend Go's linker to also consume the profiles directly and perform _function reordering_ optimization across package boundaries -- which also helps instruction-cache and TLB performance. |
| The format of the profile file consumed by our PGO is identical to the protobuf format produced by the [pprof](https://pkg.go.dev/runtime/pprof) tool. |
| This format is rich enough to carry additional hardware performance counter information such as cache misses, LBR, etc. |
The existing [perf\_data\_converter](https://github.com/google/perf_data_converter) tool from Google can convert a _perf.data_ file produced by Linux [perf](https://www.brendangregg.com/perf.html) into a _profile.proto_ file in protobuf format.
| The first version of the code that performs _profile-guided inlining_ is available [here](http://github.com/rajbarik/go). |
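
To make the input format concrete, the profile consumed by PGO can be produced with the standard `runtime/pprof` package. The sketch below is illustrative (the output file name `cpu.pprof` and the `workload` function are assumptions, not part of the proposal); it writes a CPU profile in the required protobuf format.

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// workload is a stand-in for the application's hot code.
func workload() int {
	sum := 0
	for i := 0; i < 1_000_000; i++ {
		sum += i % 7
	}
	return sum
}

func main() {
	// The resulting file is in the same protobuf format that the
	// proposed PGO support consumes.
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	workload()
}
```

Running the binary leaves `cpu.pprof` behind, which can be inspected with `go tool pprof` or fed to a PGO-enabled compiler.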
| |
| ## Background |
| |
Many compiler optimizations, such as inlining, register allocation, and instruction scheduling, often rely on statically estimated caller-callee frequencies, basic-block frequencies, and branch probabilities to guide optimization.
This static estimation often leads to poor-quality code generated by the compiler.
| These optimizations can easily benefit from dynamic information collected via profiling an application. |
| |
| Traditionally, a PGO-based compilation begins with an instrumentation phase to generate an instrumented version of the application. |
| Next, the instrumented program runs with training data to collect the profile (e.g., [edge-profiles](https://dl.acm.org/doi/10.1145/183432.183527)). |
| These profiles are subsequently fed to the compiler and the application is recompiled to produce an optimized binary. |
During this process, the compiler updates and propagates profile information, feeding it to optimization passes to optimize hot code paths.
Modern compilers such as LLVM have embedded PGO in them and have reported [speed ups](https://llvm.org/docs/HowToBuildWithPGO.html) of ~20%.
| Since the instrumentation build of an application incurs significant overhead, [recent work](https://dl.acm.org/doi/10.1145/1772954.1772963) has demonstrated little or no performance loss by collecting execution profiles via sampling (e.g., [hardware performance counter](https://dl.acm.org/doi/10.1145/1772954.1772963), [pprof](https://pkg.go.dev/runtime/pprof)). |
| |
| Go binaries are often large as they are statically linked and include all dependent packages and runtimes. |
| For such large binaries, misses in instruction cache and TLB can cause stalled cycles in the front-end leading to performance degradation. |
| Profile-guided _code layout_ optimization is known to alleviate this problem. |
Recent works including [FB-BOLT](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/) and [Google-Propeller](https://github.com/google/llvm-propeller) have shown more than 10% performance improvements by optimizing code locality in large data-center workloads.
| _Code layout optimization_ improves code locality and it comprises basic-block layout, function splitting, and function reordering optimizations. |
| In order to extract maximum performance, these optimizations are typically performed during or post link-time using profiling information. |
| |
| ### Standard Compilation flow in Go |
| |
| ![](55022/image1.png) |
| |
| **Figure 1.** Go compiler. |
| |
At a very high level, the Go compiler produces an object file per package.
It links all object files together to produce an executable.
It avoids unnecessary recompilation of packages that have not been modified.
| For each package, the frontend of the Go compiler performs passes such as type checking and ast-lowering before generating the frontend-IR. |
| At this IR-level, the Go compiler performs optimizations such as inlining, devirtualization, and escape analysis before lowering to SSA-IR. |
| Several optimizations (dead-code, cse, dead-store, basic-block layout, etc.) are performed at the SSA-level before producing an object file for a package. |
| |
| Inlining optimization in Go compiler proceeds as a bottom-up traversal of the call graph. |
For each function, it first invokes _CanInline_ to determine its eligibility for inlining; e.g., functions marked with _go:noinline_ cannot be inlined.
CanInline traverses the IR (via a HairyVisitor) to calculate the _Cost_ of the function.
If this Cost is greater than _maxInlineBudget_ (80), CanInline marks this function as not inlinable in its upstream callers.
Subsequently, the Inliner performs inlining in the same function for call-sites that have direct callees with cost less than maxInlineBudget (80).
As an additional code size optimization, the Inliner lowers the budget from maxInlineBudget (80) to inlineBigFunctionMaxCost (20) for a big function with more than 5K IR instructions.
These parameters have been tuned heavily to strike a good balance between code size and performance.
On the other hand, there have been several discussions around maxInlineBudget (80) being too small, which prevents frequently executed methods from being inlined, e.g., NextSet in [bitset](https://lemire.me/blog/2017/09/05/go-does-not-inline-functions-when-it-should/).
It is definitely compelling to be able to tune these parameters based on the runtime behavior of an application.
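
As a sketch of how the budget selection described above behaves (the constant and function names below are illustrative, not the compiler's actual identifiers):

```go
package main

import "fmt"

const (
	maxInlineBudget          = 80   // default per-callee cost budget
	inlineBigFunctionMaxCost = 20   // reduced budget inside big functions
	bigFunctionInstrs        = 5000 // a "big" function has more than 5K IR instructions
)

// inlineBudget returns the cost budget used for call sites inside a
// caller with the given number of IR instructions.
func inlineBudget(callerInstrs int) int {
	if callerInstrs > bigFunctionInstrs {
		return inlineBigFunctionMaxCost
	}
	return maxInlineBudget
}

func main() {
	fmt.Println(inlineBudget(100))  // 80: normal caller, default budget
	fmt.Println(inlineBudget(6000)) // 20: big caller, reduced budget
}
```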
| |
The Devirtualization pass in the Go compiler replaces interface method calls with direct concrete-type method calls where possible.
Since Inlining happens before Devirtualization, there is no guarantee that the concrete method calls generated by the devirtualizer get inlined.
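
The effect can be illustrated at the source level with a hypothetical example: the indirect interface call in `sumAreas` is what devirtualization tries to turn into the direct, inlinable call in `sumSquareAreas`.

```go
package main

import "fmt"

type Shape interface{ Area() float64 }

type Square struct{ side float64 }

func (s Square) Area() float64 { return s.side * s.side }

// sumAreas makes an indirect interface call on each element.
func sumAreas(shapes []Shape) float64 {
	total := 0.0
	for _, s := range shapes {
		total += s.Area() // indirect call through the interface
	}
	return total
}

// sumSquareAreas is what devirtualization can produce when the concrete
// type is provably Square: a direct, inlinable method call.
func sumSquareAreas(squares []Square) float64 {
	total := 0.0
	for _, s := range squares {
		total += s.Area() // direct call; eligible for inlining
	}
	return total
}

func main() {
	fmt.Println(sumAreas([]Shape{Square{2}, Square{3}})) // 13
	fmt.Println(sumSquareAreas([]Square{{2}, {3}}))      // 13
}
```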
| |
The Go compiler performs basic-block layout at the SSA-level today, but it is not profile-guided.
| This optimization is more effective when runtime information is provided to the compiler. |
| |
The Go linker does not yet perform any function reordering optimization.
To the best of our knowledge, the linker today creates two sections, one for data and another for text/code.
Ideally, one would like to do basic-block layout, hot-cold splitting, and function reordering optimizations in the linker ([FB-BOLT](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/)); however, without support for sections at the function and basic-block level, it is hard to do them in the linker.
| This limitation has led us to enable the basic-block layout optimization at SSA-level and function reordering optimization in the linker. |
| |
| ## Proposal |
| |
| ### New compilation flow proposed in Go for PGO |
| |
| ![](55022/image2.png) |
| |
| **Figure 2.** New PGO-enabled Go compiler. |
| |
| Our proposed Go compilation flow consists of the following high-level components. |
| |
* __Pprof-graph__: First, we process the profile proto file that is generated either via the [pprof](https://github.com/google/pprof/tree/main/profile) tool or the perf+[perf\_data\_converter](https://github.com/google/perf_data_converter) tool to produce an inter-package _pprof-graph_. We use the existing _internal/profile_ package to get this graph straight out-of-the-box.
This graph contains most of the information needed to perform PGO, e.g., function names, flat weights, cumulative weights, call-sites with callee information for both direct and indirect callees, etc.
Ideally, we would like to generate this graph once and cache it for individual package-level compilations in order to keep compile time under control.
However, since packages can be compiled in parallel in multiple processes, we could not find an easy way to share this graph across packages, so we fall back to building it for every package.
| |
| * __CallGraphBuilder and WeightedCallGraph__: While compiling each package, we visit the IRs of all the functions in the package to produce an intra-package call graph with edge and node weights. |
| The edge and node weights are obtained from the _pprof-graph_. |
We term this the _WeightedCallGraph_ (shown as _CallGraphBuilder_ in Figure 2).
In other words, the _WeightedCallGraph_ is a succinct version of the package-level pprof-graph with additional information such as associated IR function bodies, call edges linked with IR call sites, multiple callee edges for indirect calls, etc.
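
A minimal sketch of what such a structure might look like (all type and field names here are illustrative, not the actual implementation):

```go
package main

import "fmt"

// IRNode represents one function together with its profiled weights.
type IRNode struct {
	Name       string
	FlatWeight int64 // samples attributed to the function itself
	CumWeight  int64 // samples including its callees
	HotNode    bool
}

// IREdge represents one call edge, keyed by the call-site position,
// so an indirect call site can carry several callee edges.
type IREdge struct {
	Caller, Callee *IRNode
	CallSite       string // e.g. "main.foo:42"
	Weight         int64
}

// WeightedCallGraph is the package-level view of the pprof-graph.
type WeightedCallGraph struct {
	IRNodes map[string]*IRNode
	Edges   []*IREdge
}

func (g *WeightedCallGraph) AddEdge(caller, callee *IRNode, site string, w int64) {
	g.Edges = append(g.Edges, &IREdge{Caller: caller, Callee: callee, CallSite: site, Weight: w})
}

func main() {
	foo := &IRNode{Name: "main.foo", FlatWeight: 10, CumWeight: 90}
	bar := &IRNode{Name: "main.bar", FlatWeight: 80, CumWeight: 80, HotNode: true}
	g := &WeightedCallGraph{IRNodes: map[string]*IRNode{foo.Name: foo, bar.Name: bar}}
	g.AddEdge(foo, bar, "main.foo:42", 80)
	fmt.Println(len(g.Edges), g.IRNodes["main.bar"].HotNode)
}
```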
| |
* __Embed profiles in the frontend IR & SSA IR__: While building the WeightedCallGraph, we also create a map table that stores each IR instruction and its corresponding profiled weight in order to facilitate the basic-block layout optimization at the SSA-level.
| The primary reason for storing every IR instruction with weights is that there is no control-flow graph available at the IR-level, which could have been leveraged to just annotate branch instructions and branch-edges. |
| During the lowering from the frontend IR to SSA IR, the profiling information is annotated into the basic blocks and edges. |
| The compiler propagates the profiling information in the control flow graph. |
| Similarly, every optimization based on the SSA IR maintains the profiling information. |
Annotating every single instruction with weights can increase memory pressure; in the future we plan on exploring other options.
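
As a simplified illustration of the propagation step, the branch probability at a conditional can be derived from the profiled weights of its successor blocks (a hypothetical helper; the real pass operates on the compiler's SSA representation):

```go
package main

import "fmt"

// branchProb returns the probability of taking the "then" successor of
// a conditional block, given the profiled weights of the two successors.
func branchProb(thenWeight, elseWeight int64) float64 {
	total := thenWeight + elseWeight
	if total == 0 {
		return 0.5 // no profile data: fall back to an even split
	}
	return float64(thenWeight) / float64(total)
}

func main() {
	fmt.Println(branchProb(90, 10)) // 0.9
	fmt.Println(branchProb(0, 0))   // 0.5
}
```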
| |
* __Profile-guided Inlining__: During inlining, we determine hot functions and hot call-sites based on the _WeightedCallGraph_.
We increase the inlining budget from 80 to a larger value (default 160, tunable via the command-line) based on the hotness of the function.
This enables a function to be inlined in its upstream callers even if its cost is more than 80.
Moreover, if the function has a cost of more than 80 and has more than one upstream caller, then only hot call edges will be inlined, thereby controlling code size growth.
Similarly, hot call-sites are also inlined aggressively in the "mkinlcall" function using the increased budget.
The hotness criterion is based on a threshold, which can be tuned via the command-line (default set to 2% of the total execution time).
One subtle note is that the total number of inlined functions via PGO may be either greater or smaller than with the original inliner without PGO.
For example, in a pathological case, if a hot function gets bulkier due to inlining performed in transitive callees (i.e., its cost exceeds 160), then it may not get inlined in upstream callers (unlike standard inlining), and if there are a large number of upstream callers, this may lead to fewer inlining outcomes.
This can be addressed by increasing the budget via the command-line, but that may also lead to increased code size, something to keep an eye out for.
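
The resulting inlining decision can be sketched as follows (illustrative names, using the default budgets of 80 and 160 described above):

```go
package main

import "fmt"

const (
	maxInlineBudget    = 80  // default inlining budget
	inlineHotMaxBudget = 160 // raised budget for hot functions/call-sites
)

// canInlineCall reports whether a callee of the given cost may be
// inlined at a call site, given the hotness of the callee or site.
func canInlineCall(calleeCost int32, hot bool) bool {
	if calleeCost <= maxInlineBudget {
		return true // always within the default budget
	}
	// Over the default budget: only hot callees/call-sites get the
	// larger budget, which limits code-size growth on cold paths.
	return hot && calleeCost <= inlineHotMaxBudget
}

func main() {
	fmt.Println(canInlineCall(70, false))  // true
	fmt.Println(canInlineCall(120, false)) // false
	fmt.Println(canInlineCall(120, true))  // true
	fmt.Println(canInlineCall(200, true))  // false
}
```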
| |
| * __Code Specialization__: In order to eliminate indirect function call overheads, we introduce a _code specialization_ step inside the Inliner to transform an indirect callsite into an "if-else" code block based on the hot receiver. |
The WeightedCallGraph is used to determine a hot receiver.
| The "if" condition introduces a check on the runtime type of the hot receiver. |
| The body of the "if" block converts the indirect function call into a direct function call. |
| This call subsequently gets inlined since Inlining follows this optimization. |
| The "else" block retains the original slow indirect call. |
| As mentioned earlier, our goal is to be able to inline hot receivers. |
| We purposefully avoid making multiple checks on the receiver type for multiple hot receiver cases in order to avoid unnecessary code growth and performance degradation in corner cases. |
| In certain cases, it is possible to hoist the "if" condition check outside of the surrounding loop nests; this is a subject for future work. |
During Inlining and Specialization, we update and propagate the weights of the IR instructions.
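
In source form, the rewrite looks roughly like the following hypothetical example, assuming the profile identifies `Circle` as the hot receiver type (the compiler performs the equivalent transformation on the IR):

```go
package main

import "fmt"

type Shape interface{ Area() float64 }

type Circle struct{ r float64 }

func (c Circle) Area() float64 { return 3.14159 * c.r * c.r }

type Rect struct{ w, h float64 }

func (r Rect) Area() float64 { return r.w * r.h }

// areaIndirect is the original code: one indirect call.
func areaIndirect(s Shape) float64 {
	return s.Area()
}

// areaSpecialized is the specialized form, assuming the profile shows
// Circle to be the hot receiver type.
func areaSpecialized(s Shape) float64 {
	if c, ok := s.(Circle); ok {
		return c.Area() // direct call on the hot path; inlinable
	}
	return s.Area() // cold path keeps the original indirect call
}

func main() {
	fmt.Println(areaSpecialized(Circle{1}) == areaIndirect(Circle{1}))   // true
	fmt.Println(areaSpecialized(Rect{2, 3}) == areaIndirect(Rect{2, 3})) // true
}
```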
| |
* __Basic-block Reordering__: Since the Go compiler produces a large monolithic binary, we implement a state-of-the-art profile-guided basic-block layout optimization, the [EXT-TSP heuristic](https://ieeexplore.ieee.org/document/9050435), at the SSA-level.
| The algorithm is a greedy heuristic that works with chains (ordered lists) of basic blocks. |
| Initially all chains are isolated basic blocks. |
On every iteration, the pair of chains whose merging yields the biggest benefit in terms of instruction-cache reuse is merged.
The algorithm stops when one chain is left or when merging does not improve the overall benefit.
If more than one chain remains, the chains are sorted using a density function that prioritizes placing frequently executed, smaller chains into each page.
Unlike earlier algorithms (e.g., [Pettis-Hansen](https://dl.acm.org/doi/10.1145/93548.93550)), the ExtTSP algorithm first splits two chains, X and Y, into three hypothetical chains, X1, X2, and Y.
It then considers all possible ways of combining these three chains (e.g., X1YX2, X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1) and chooses the one producing the largest benefit.
| We also have an implementation of [Pettis-Hansen](https://dl.acm.org/doi/10.1145/93548.93550) available for performance comparisons. |
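
A much-simplified sketch of the greedy chain-merging idea (closer to Pettis-Hansen than to full ExtTSP: it merges along the hottest remaining edge and omits the three-way split and cache-oriented scoring described above):

```go
package main

import "fmt"

// mergeChains greedily concatenates block chains along the hottest
// remaining edge until no edge connects two distinct chains.
// edges maps a (from, to) block pair to its profiled weight.
func mergeChains(blocks []string, edges map[[2]string]int64) [][]string {
	// Start with one single-block chain per basic block.
	chainOf := map[string]int{}
	chains := [][]string{}
	for i, b := range blocks {
		chainOf[b] = i
		chains = append(chains, []string{b})
	}
	for {
		// Pick the hottest edge whose endpoints are in different chains.
		var best [2]string
		var bestW int64 = -1
		for e, w := range edges {
			if chainOf[e[0]] != chainOf[e[1]] && w > bestW {
				best, bestW = e, w
			}
		}
		if bestW < 0 {
			break
		}
		// Concatenate the callee's chain onto the caller's chain.
		src, dst := chainOf[best[0]], chainOf[best[1]]
		chains[src] = append(chains[src], chains[dst]...)
		for _, b := range chains[dst] {
			chainOf[b] = src
		}
		chains[dst] = nil
	}
	var out [][]string
	for _, c := range chains {
		if len(c) > 0 {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	edges := map[[2]string]int64{
		{"entry", "hot"}:  100,
		{"entry", "cold"}: 1,
		{"hot", "exit"}:   100,
	}
	fmt.Println(mergeChains([]string{"entry", "hot", "cold", "exit"}, edges))
}
```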
| |
| * __Function Reorderings__: Finally, we introduce a "function reordering" optimization in the linker. |
| Linker code is extended to rebuild the cross-package "pprof-graph" via a command-line argument. |
| It first produces a summary table consisting of a set of (caller, callee, edge-weight) entries [similar to WeightedCallGraph without node weights]. |
| We then use the [C3 heuristic](https://dl.acm.org/doi/10.5555/3049832.3049858) to re-layout the functions in the text section. |
The structure of the C3 algorithm is similar to that of ExtTSP, so we avoid describing it again.
One thing to note is that the C3 heuristic places a function as close as possible to its most common caller, following a priority order from the hottest to the coldest functions in the program.
One issue we ran into while implementing this algorithm is that the C3 heuristic requires the relative distance of a callsite from the start of a function.
We found this tricky to implement in the linker, so our current implementation approximates this distance as half of the code size of the function instead of computing it precisely.
We are able to compute the code size directly from the object file in order to compute density.
In the future, we should explore computing this information in the compiler and storing it at a fixed location in the object file.
| We also have an implementation of [Pettis-Hansen](https://dl.acm.org/doi/10.1145/93548.93550) available as an alternative. |
| Our function reordering optimization is automatically cross-package. |
| One subtle issue we have run into is that often GC-related functions are intermingled with application code in profiles making function reordering optimization sub-optimal. |
| While GC functions emanating from the "assistGC" function called from the "malloc" routine should be included in the function reordering optimization, other GC functions probably should be sequenced as a separate chain/cluster of its own. |
This makes sense given that GC gets invoked at arbitrary times and runs in a separate thread of its own (and perhaps on a separate core as well).
| This is a subject for future work for us. |
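
The density computation used when ordering functions can be sketched as follows (illustrative types; the actual implementation also honors caller-callee adjacency from the C3 chains):

```go
package main

import (
	"fmt"
	"sort"
)

// FuncInfo carries the per-function data the linker derives from the
// profile summary table and the object file.
type FuncInfo struct {
	Name     string
	Weight   int64 // cumulative sample weight
	CodeSize int64 // bytes of text, read from the object file
}

// sortByDensity orders functions by weight per byte, densest first, so
// that frequently executed small functions end up sharing pages.
func sortByDensity(funcs []FuncInfo) {
	sort.SliceStable(funcs, func(i, j int) bool {
		// Compare w_i/s_i > w_j/s_j without floating point:
		// w_i*s_j > w_j*s_i.
		return funcs[i].Weight*funcs[j].CodeSize > funcs[j].Weight*funcs[i].CodeSize
	})
}

func main() {
	fs := []FuncInfo{
		{"big.Cold", 10, 4096},
		{"small.Hot", 500, 128},
		{"mid.Warm", 300, 1024},
	}
	sortByDensity(fs)
	for _, f := range fs {
		fmt.Println(f.Name)
	}
}
```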
| |
* __Stale Profile__: As source code evolves, profiles are likely to become stale over time.
In order to deal with this, we propose to use _\<pkg\_name, function\_name, line\_offset\>_ instead of _\<pkg\_name, function\_name, line\_number\>_.
With line\_offset information, we can identify whether a call site has moved up or down relative to the start of the function.
In that case, we do not inline this call site; however, other call sites that have not moved will continue to use the profile information and get optimized.
This design also allows functions that are unchanged to continue to be optimized via PGO.
This design is similar to [AutoFDO](https://dl.acm.org/doi/pdf/10.1145/2854038.2854044).
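
The offset-based matching amounts to keying call sites as sketched below (a hypothetical helper illustrating the proposed key; the key format itself is an assumption):

```go
package main

import "fmt"

// callSiteKey identifies a call site by its offset from the start of
// the enclosing function rather than by absolute line number, so the
// key survives edits elsewhere in the file.
func callSiteKey(pkg, fn string, funcStartLine, callLine int) string {
	offset := callLine - funcStartLine
	return fmt.Sprintf("%s.%s+%d", pkg, fn, offset)
}

func main() {
	// The function moved from line 10 to line 25, but the call site is
	// still 3 lines into the function, so the key is unchanged.
	oldKey := callSiteKey("main", "foo", 10, 13)
	newKey := callSiteKey("main", "foo", 25, 28)
	fmt.Println(oldKey == newKey) // true
}
```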
| |
| ## Rationale |
| |
| There are several design choices we have made in this proposal. |
| |
| __Why not inline every hot callee? Why introduce a budget for hot callee?__ |
| |
The primary motivation is to control code growth while retaining good performance.
| Auto-tuning frameworks such as [OpenTuner](https://github.com/minshuzhan/opentuner) can tune binaries using the command-line options for budget and threshold. |
| |
| __What is the benefit of WeightedCallGraph data structure?__ |
| |
The core purpose of the WeightedCallGraph is to associate IR functions and callsites with node and edge weights from the pprof-graph.
This could perhaps have been achieved with a simpler data structure, making the WeightedCallGraph somewhat heavyweight if used only for Inlining and Specialization.
However, it is a forward-looking design that other (interprocedural) optimizations can leverage in the future.
| |
__Why is Specialization performed during inlining and not before or after?__
| |
Ideally, we want code specialization to be performed before inlining so that the resulting direct callees can be inlined.
On the other hand, if specialization is performed as a separate pass before the Inliner, the WeightedCallGraph needs to be updated during specialization as well.
By performing specialization along with the Inliner, we avoid updating the WeightedCallGraph twice.
Moreover, Specialization requires parts of the CanInline logic to perform legality checks.
Keeping them together also avoids this code duplication.
| |
| __Why use the Protobuf format? Why not use other formats similar to LLVM?__ |
| |
From our experience, the protobuf format is rich enough to carry additional profiling information, particularly hardware performance counter data.
Tools such as [perf\_data\_converter](https://github.com/google/perf_data_converter) can convert a _perf.data_ file produced by Linux [perf](https://www.brendangregg.com/perf.html) into a _profile.proto_ file in protobuf format.
Thus, we do not lose any functionality.
| |
| __Why do we not perform basic-block layout in the linker?__ |
| |
| Ideally, we should. |
| Currently, the Go compiler produces one section for code/text. |
| If it can be extended to produce sections at the basic-block level, we can port our basic-block layout implementation to the linker. |
| |
| ## Compatibility |
| |
Currently, our optimizations have been tested only on Linux.
To the best of our knowledge, our implementation follows the [Go 1 compatibility guidelines](https://golang.org/doc/go1compat).
| |
| ## Implementation |
| |
We have implemented all the optimizations described in this proposal, i.e., Inlining, Code Specialization, Basic-block Layout, and Function Reordering.
However, pushing all these changes upstream at once can take a while.
Our strategy is to release only "CallGraphBuilder" and "Profile-guided Inlining" (see Figure 2) in the first phase, targeting the Go 1.20 release.
Below we provide the implementation details for CallGraphBuilder and Profile-guided Inlining.
| |
| ### CallGraphBuilder implementation |
| |
| * __BuildPProfGraph__: This function builds the pprof-graph based on the "internal/profile" package. |
The pprof-graph is built using the _cpuprofile_ data from the profile.
The protobuf profile file is fed to the compiler via a command-line flag (_-profileuse_).
It preprocesses the pprof-graph to compute additional information such as _TotalNodeWeights_ and _TotalEdgeWeights_.
| It also creates a global node-to-weight map which is cheaper to look up during IR traversal in later phases. |
| |
* __BuildWeightedCallGraph__: This function builds the WeightedCallGraph data structure by visiting all the _ir.Func_ objects in the decl-list of a package.
During the traversal of the IR, it is necessary to determine the callee function for a direct callsite.
Our logic for this is very similar to "inlCallee" in "inline/inl.go".
Since reusing that code would create a cyclic dependency between the inline and pgo packages, we have duplicated the "inlCallee" logic in irgraph.go (under the same name).
We should refactor this code in the future so that it can live outside both packages.
| |
| * __RedirectEdges__: This function updates the WeightedCallGraph nodes and edges during and after Inlining based on a list of inlined functions. |
| |
| ### Profile-guided Inlining implementation |
| |
| * __Prologue__: Before performing inlining, we walk over the _WeightedCallGraph_ to identify hot functions and hot call-sites based on a command-line tunable threshold (_-inlinehotthreshold_). |
| This threshold is expressed as a percentage of time exclusively spent in a function or call-site over total execution time. |
| It is by default set to 2% and can be updated via the command-line argument. |
| When a function or call-site exclusively spends more time than this threshold percentage, we classify the function or call-site as a hot function or hot call-site. |
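
This classification amounts to a simple percentage check (an illustrative sketch; with the default threshold, `thresholdPercent` is 2):

```go
package main

import "fmt"

// isHot reports whether a function or call site whose exclusive sample
// weight is flatWeight should be treated as hot, given the total
// profile weight and a threshold expressed as a percentage.
func isHot(flatWeight, totalWeight int64, thresholdPercent float64) bool {
	if totalWeight == 0 {
		return false
	}
	return float64(flatWeight)/float64(totalWeight)*100 >= thresholdPercent
}

func main() {
	fmt.Println(isHot(30, 1000, 2)) // true: 3% of total time
	fmt.Println(isHot(10, 1000, 2)) // false: 1% of total time
}
```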
| |
* __Extensions to CanInline__: Currently, the HairyVisitor bails out as soon as a function's cost exceeds maxInlineBudget (80).
| We introduce a new budget for profile-guided inlining, _inlineHotMaxBudget_, with a default value of 160, which can also be updated via a command line flag (_-inlinehotbudget_). |
| We use the inlineHotMaxBudget for hot functions during the CanInline step. |
| |
```go
if v.budget < 0 {
	if pgo.WeightedCG != nil {
		if n, ok := pgo.WeightedCG.IRNodes[ir.PkgFuncName(fn)]; ok {
			// For hot functions, keep visiting as long as the accumulated
			// cost (inlineMaxBudget - v.budget) stays within the larger
			// inlineHotMaxBudget.
			if inlineMaxBudget-v.budget < inlineHotMaxBudget && n.HotNode {
				return false
			}
		}
	}
	...
}
```
| |
| During the HairyVisitor traversal, we also track all the hot call-sites of a function. |
| |
```go
// Determine if the callee edge is a hot callee or not.
if pgo.WeightedCG != nil && ir.CurFunc != nil {
	if fn := inlCallee(n.X); fn != nil && typecheck.HaveInlineBody(fn) {
		// Extract the line number from the call site's "file:line:col" position.
		lineno := fmt.Sprintf("%v", ir.Line(n))
		splits := strings.Split(lineno, ":")
		l, _ := strconv.ParseInt(splits[len(splits)-2], 0, 64)
		linenum := fmt.Sprintf("%v", l)
		// The caller, call-site line, and callee together identify the call edge.
		canonicalName := ir.PkgFuncName(ir.CurFunc) + "-" + linenum + "-" + ir.PkgFuncName(fn)
		if _, o := candHotEdgeMap[canonicalName]; o {
			listOfHotCallSites[pgo.CallSiteInfo{ir.Line(n), ir.CurFunc}] = struct{}{}
		}
	}
}
```
| |
* __Extension to mkinlcall__: When a hot callee's cost is more than maxCost, we check whether this cost is less than inlineHotMaxBudget in order to permit inlining of hot callees.
| |
```go
if fn.Inl.Cost > maxCost {
	...

	// If the call site is hot and the callee's cost is within the
	// inlineHotMaxBudget, inline it; otherwise bail out.
	if _, ok := listOfHotCallSites[pgo.CallSiteInfo{ir.Line(n), ir.CurFunc}]; ok {
		if fn.Inl.Cost > inlineHotMaxBudget {
			return n
		}
	} else {
		return n
	}
}
```
| |
| * __Epilogue__: During this phase, we update the WeightedCallGraph based on the inlining decisions made earlier. |
| |
| ```go |
// Revisit all functions bottom-up and redirect the call edges affected by inlining.
ir.VisitFuncsBottomUp(typecheck.Target.Decls, func(list []*ir.Func, recursive bool) {
| for _, f := range list { |
| name := ir.PkgFuncName(f) |
| if n, ok := pgo.WeightedCG.IRNodes[name]; ok { |
| pgo.RedirectEdges(n, inlinedCallSites) |
| } |
| } |
| }) |
| ``` |
| |
| * __Command line flags list__: We introduce the following flag in base/flag.go. |
| ```go |
| PGOProfile string "help:\"read profile from `file`\"" |
| ``` |
| |
We also introduce the following debug flags in base/debug.go.
| ```go |
| InlineHotThreshold string "help:\"threshold percentage for determining hot methods and callsites for inlining\"" |
| InlineHotBudget int "help:\"inline budget for hot methods\"" |
| ``` |
| |
| ## Open issues (if applicable) |
| |
* Build the pprof-graph once across all packages instead of building it over and over again for each package.
Currently, the Go compiler builds packages in parallel using multiple processes, which makes it hard to build the graph only once.
| |
| * Enable separate sections for basic-blocks and functions in the linker. |
| This would enable basic-block reordering optimization at the linker level. |
| Our implementation performs this optimization at the SSA-level. |