| # Proposal: Design and Implementation of Profile-Guided Optimization (PGO) for Go |
| |
| Author(s): Raj Barik, Jin Lin |
| |
| Last updated: 2022-09-12 |
| |
| Discussion at https://golang.org/issue/55025. \ |
| Linked high-level issue for PGO at https://golang.org/issue/55022. |
| |
| ## Abstract |
| |
Inefficiencies in Go programs can be isolated via profiling tools such as [pprof](https://pkg.go.dev/runtime/pprof) and the Linux profiler [perf](https://www.brendangregg.com/perf.html).
Such tools can pinpoint source code regions where most of the execution time is spent.
Unlike other optimizing compilers such as [LLVM](https://llvm.org/docs/HowToBuildWithPGO.html), the Go compiler does not yet perform Profile-Guided Optimization (PGO).
PGO uses information about the code’s runtime behavior to guide compiler optimizations such as inlining, code layout, etc.
PGO can improve application performance in the range of 15-30% [[LLVM](https://llvm.org/docs/HowToBuildWithPGO.html), [AutoFDO](https://dl.acm.org/doi/pdf/10.1145/2854038.2854044)].
| In this proposal, we extend the Go compiler with PGO. |
| Specifically, we incorporate the profiles into the frontend of the compiler to build a call graph with node & edge weights (called _WeightedCallGraph_). |
| The Inliner subsequently uses the WeightedCallGraph to perform _profile-guided inlining_ which aggressively inlines hot functions. |
| We introduce a _profile-guided code specialization_ pass that is tightly integrated with the Inliner and eliminates indirect method call overheads in hot code paths. |
| Furthermore, we annotate IR instructions with their associated profile weights and propagate these to the SSA-level in order to facilitate _profile-guided basic-block layout_ optimization to benefit from better instruction-cache and TLB performance. |
| Finally, we extend Go's linker to also consume the profiles directly and perform _function reordering_ optimization across package boundaries -- which also helps instruction-cache and TLB performance. |
| The format of the profile file consumed by our PGO is identical to the protobuf format produced by the [pprof](https://pkg.go.dev/runtime/pprof) tool. |
| This format is rich enough to carry additional hardware performance counter information such as cache misses, LBR, etc. |
The existing [perf\_data\_converter](https://github.com/google/perf_data_converter) tool from Google can convert a _perf.data_ file produced by Linux [perf](https://www.brendangregg.com/perf.html) into a _profile.proto_ file in protobuf format.
| The first version of the code that performs _profile-guided inlining_ is available [here](http://github.com/rajbarik/go). |
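
To make the input format concrete, the profile consumed by PGO can be produced with the standard `runtime/pprof` package. The sketch below is illustrative (the output file name `cpu.pprof` and the `workload` function are assumptions, not part of the proposal); it writes a CPU profile in the required protobuf format.

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// workload is a stand-in for the application's hot code.
func workload() int {
	sum := 0
	for i := 0; i < 1_000_000; i++ {
		sum += i % 7
	}
	return sum
}

func main() {
	// The resulting file is in the same protobuf format that the
	// proposed PGO support consumes.
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	workload()
}
```

Running the binary leaves `cpu.pprof` behind, which can be inspected with `go tool pprof` or fed to a PGO-enabled compiler.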
| |
| ## Background |
| |
Many compiler optimizations, such as inlining, register allocation, and instruction scheduling, often rely on statically estimated caller-callee frequencies, basic-block frequencies, and branch probabilities to guide optimization.
This static estimation often leads to poor-quality code generated by the compiler.
| These optimizations can easily benefit from dynamic information collected via profiling an application. |
| |
| Traditionally, a PGO-based compilation begins with an instrumentation phase to generate an instrumented version of the application. |
| Next, the instrumented program runs with training data to collect the profile (e.g., [edge-profiles](https://dl.acm.org/doi/10.1145/183432.183527)). |
| These profiles are subsequently fed to the compiler and the application is recompiled to produce an optimized binary. |
During this process, the compiler updates and propagates profile information, feeding it to optimization passes to optimize hot code paths.
Modern compilers such as LLVM have embedded PGO in them and have reported [speed ups](https://llvm.org/docs/HowToBuildWithPGO.html) of ~20%.
| Since the instrumentation build of an application incurs significant overhead, [recent work](https://dl.acm.org/doi/10.1145/1772954.1772963) has demonstrated little or no performance loss by collecting execution profiles via sampling (e.g., [hardware performance counter](https://dl.acm.org/doi/10.1145/1772954.1772963), [pprof](https://pkg.go.dev/runtime/pprof)). |
| |
| Go binaries are often large as they are statically linked and include all dependent packages and runtimes. |
| For such large binaries, misses in instruction cache and TLB can cause stalled cycles in the front-end leading to performance degradation. |
| Profile-guided _code layout_ optimization is known to alleviate this problem. |
Recent works including [FB-BOLT](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/) and [Google-Propeller](https://github.com/google/llvm-propeller) have shown more than 10% performance improvements by optimizing code locality in large data-center workloads.
| _Code layout optimization_ improves code locality and it comprises basic-block layout, function splitting, and function reordering optimizations. |
| In order to extract maximum performance, these optimizations are typically performed during or post link-time using profiling information. |
| |
| ### Standard Compilation flow in Go |
| |
| ![](55022/image1.png) |
| |
| **Figure 1.** Go compiler. |
| |
At a very high level, the Go compiler produces an object file per package.
It links all object files together to produce an executable.
It avoids unnecessary recompilation of packages that have not been modified.
| For each package, the frontend of the Go compiler performs passes such as type checking and ast-lowering before generating the frontend-IR. |
| At this IR-level, the Go compiler performs optimizations such as inlining, devirtualization, and escape analysis before lowering to SSA-IR. |
| Several optimizations (dead-code, cse, dead-store, basic-block layout, etc.) are performed at the SSA-level before producing an object file for a package. |
| |
| Inlining optimization in Go compiler proceeds as a bottom-up traversal of the call graph. |
For each function, it first invokes _CanInline_ to determine its eligibility for inlining; e.g., functions marked with _go:noinline_ cannot be inlined.
CanInline traverses the IR (via a HairyVisitor) to calculate the _Cost_ of the function.
If this Cost is greater than _maxInlineBudget_ (80), CanInline marks this function as not inlinable in its upstream callers.
Subsequently, the Inliner performs inlining in the same function for call-sites that have direct callees with cost less than maxInlineBudget (80).
As an additional code size optimization, the Inliner lowers the budget from maxInlineBudget (80) to inlineBigFunctionMaxCost (20) for a big function with more than 5K IR instructions.
These parameters have been tuned heavily to strike a good balance between code size and performance.
On the other hand, there have been several discussions around maxInlineBudget (80) being too small, which prevents frequently executed methods from being inlined, e.g., NextSet in [bitset](https://lemire.me/blog/2017/09/05/go-does-not-inline-functions-when-it-should/).
It is definitely compelling to be able to tune these parameters based on the runtime behavior of an application.
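
As a sketch of how the budget selection described above behaves (the constant and function names below are illustrative, not the compiler's actual identifiers):

```go
package main

import "fmt"

const (
	maxInlineBudget          = 80   // default per-callee cost budget
	inlineBigFunctionMaxCost = 20   // reduced budget inside big functions
	bigFunctionInstrs        = 5000 // a "big" function has more than 5K IR instructions
)

// inlineBudget returns the cost budget used for call sites inside a
// caller with the given number of IR instructions.
func inlineBudget(callerInstrs int) int {
	if callerInstrs > bigFunctionInstrs {
		return inlineBigFunctionMaxCost
	}
	return maxInlineBudget
}

func main() {
	fmt.Println(inlineBudget(100))  // 80: normal caller, default budget
	fmt.Println(inlineBudget(6000)) // 20: big caller, reduced budget
}
```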
| |
The Devirtualization pass in the Go compiler replaces interface method calls with direct concrete-type method calls where possible.
Since Inlining happens before Devirtualization, there is no guarantee that the concrete method calls generated by the devirtualizer get inlined.
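
The effect can be illustrated at the source level with a hypothetical example: the indirect interface call in `sumAreas` is what devirtualization tries to turn into the direct, inlinable call in `sumSquareAreas`.

```go
package main

import "fmt"

type Shape interface{ Area() float64 }

type Square struct{ side float64 }

func (s Square) Area() float64 { return s.side * s.side }

// sumAreas makes an indirect interface call on each element.
func sumAreas(shapes []Shape) float64 {
	total := 0.0
	for _, s := range shapes {
		total += s.Area() // indirect call through the interface
	}
	return total
}

// sumSquareAreas is what devirtualization can produce when the concrete
// type is provably Square: a direct, inlinable method call.
func sumSquareAreas(squares []Square) float64 {
	total := 0.0
	for _, s := range squares {
		total += s.Area() // direct call; eligible for inlining
	}
	return total
}

func main() {
	fmt.Println(sumAreas([]Shape{Square{2}, Square{3}})) // 13
	fmt.Println(sumSquareAreas([]Square{{2}, {3}}))      // 13
}
```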
| |
The Go compiler performs basic-block layout at the SSA-level today, but it is not profile-guided.
| This optimization is more effective when runtime information is provided to the compiler. |
| |
The Go linker does not yet perform any function reordering optimization.
To the best of our knowledge, the linker today creates two sections, one for data and another for text/code.
Ideally, one would like to do basic-block layout, hot-cold splitting, and function reordering optimizations in the linker ([FB-BOLT](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/)); however, without support for sections at the function and basic-block level, it is hard to do them in the linker.
| This limitation has led us to enable the basic-block layout optimization at SSA-level and function reordering optimization in the linker. |
| |
| ## Proposal |
| |
| ### New compilation flow proposed in Go for PGO |
| |
| ![](55022/image2.png) |
| |
| **Figure 2.** New PGO-enabled Go compiler. |
| |
| Our proposed Go compilation flow consists of the following high-level components. |
| |
* __Pprof-graph__: First, we process the profile proto file that is generated either via the [pprof](https://github.com/google/pprof/tree/main/profile) tool or the perf+[perf\_data\_converter](https://github.com/google/perf_data_converter) tool to produce an inter-package _pprof-graph_. We use the existing _internal/profile_ package to get this graph straight out-of-the-box.
This graph contains most of the information needed to perform PGO, e.g., function names, flat weights, cumulative weights, call-sites with callee information for both direct and indirect callees, etc.
Ideally, we would like to generate this graph once and cache it for individual package-level compilations in order to keep compile time under control.
However, since packages can be compiled in parallel in multiple processes, we could not find an easy way to share this graph across packages, so we fall back to building it for every package.
| |
| * __CallGraphBuilder and WeightedCallGraph__: While compiling each package, we visit the IRs of all the functions in the package to produce an intra-package call graph with edge and node weights. |
| The edge and node weights are obtained from the _pprof-graph_. |
We term this the _WeightedCallGraph_ (shown as _CallGraphBuilder_ in Figure 2).
In other words, the _WeightedCallGraph_ is a succinct version of the package-level pprof-graph with additional information such as associated IR function bodies, call edges linked with IR call sites, multiple callee edges for indirect calls, etc.
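
A minimal sketch of what such a structure might look like (all type and field names here are illustrative, not the actual implementation):

```go
package main

import "fmt"

// IRNode represents one function together with its profiled weights.
type IRNode struct {
	Name       string
	FlatWeight int64 // samples attributed to the function itself
	CumWeight  int64 // samples including its callees
	HotNode    bool
}

// IREdge represents one call edge, keyed by the call-site position,
// so an indirect call site can carry several callee edges.
type IREdge struct {
	Caller, Callee *IRNode
	CallSite       string // e.g. "main.foo:42"
	Weight         int64
}

// WeightedCallGraph is the package-level view of the pprof-graph.
type WeightedCallGraph struct {
	IRNodes map[string]*IRNode
	Edges   []*IREdge
}

func (g *WeightedCallGraph) AddEdge(caller, callee *IRNode, site string, w int64) {
	g.Edges = append(g.Edges, &IREdge{Caller: caller, Callee: callee, CallSite: site, Weight: w})
}

func main() {
	foo := &IRNode{Name: "main.foo", FlatWeight: 10, CumWeight: 90}
	bar := &IRNode{Name: "main.bar", FlatWeight: 80, CumWeight: 80, HotNode: true}
	g := &WeightedCallGraph{IRNodes: map[string]*IRNode{foo.Name: foo, bar.Name: bar}}
	g.AddEdge(foo, bar, "main.foo:42", 80)
	fmt.Println(len(g.Edges), g.IRNodes["main.bar"].HotNode)
}
```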
| |
* __Embed profiles in the frontend IR & SSA IR__: While building the WeightedCallGraph, we also create a map table that stores each IR instruction and its corresponding profiled weight in order to facilitate the basic-block layout optimization at the SSA-level.
| The primary reason for storing every IR instruction with weights is that there is no control-flow graph available at the IR-level, which could have been leveraged to just annotate branch instructions and branch-edges. |
| During the lowering from the frontend IR to SSA IR, the profiling information is annotated into the basic blocks and edges. |
| The compiler propagates the profiling information in the control flow graph. |
| Similarly, every optimization based on the SSA IR maintains the profiling information. |
Annotating every single instruction with weights can increase memory pressure; in the future we plan on exploring other options.
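
As a simplified illustration of the propagation step, the branch probability at a conditional can be derived from the profiled weights of its successor blocks (a hypothetical helper; the real pass operates on the compiler's SSA representation):

```go
package main

import "fmt"

// branchProb returns the probability of taking the "then" successor of
// a conditional block, given the profiled weights of the two successors.
func branchProb(thenWeight, elseWeight int64) float64 {
	total := thenWeight + elseWeight
	if total == 0 {
		return 0.5 // no profile data: fall back to an even split
	}
	return float64(thenWeight) / float64(total)
}

func main() {
	fmt.Println(branchProb(90, 10)) // 0.9
	fmt.Println(branchProb(0, 0))   // 0.5
}
```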
| |
* __Profile-guided Inlining__: During inlining, we determine hot functions and hot call-sites based on the _WeightedCallGraph_.
We increase the inlining budget from 80 to a larger value (default 160, tunable via the command-line) based on the hotness of the function.
This enables a function to be inlined in its upstream callers even if its cost is more than 80.
Moreover, if the function has a cost of more than 80 and has more than one upstream caller, then only hot call edges will be inlined, thereby controlling code size growth.
Similarly, hot call-sites are also inlined aggressively in the "mkinlcall" function using the increased budget.
The hotness criterion is based on a threshold, which can be tuned via the command-line (default set to 2% of the total execution time).
One subtle note is that the total number of inlined functions via PGO may be either greater or smaller than with the original inliner without PGO.
For example, in a pathological case, if a hot function gets bulkier due to inlining performed in transitive callees (i.e., its cost exceeds 160), then it may not get inlined in upstream callers (unlike standard inlining), and if there are a large number of upstream callers, this may lead to fewer inlining outcomes.
This can be addressed by increasing the budget via the command-line, but that may also lead to increased code size, something to keep an eye out for.
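
The resulting inlining decision can be sketched as follows (illustrative names, using the default budgets of 80 and 160 described above):

```go
package main

import "fmt"

const (
	maxInlineBudget    = 80  // default inlining budget
	inlineHotMaxBudget = 160 // raised budget for hot functions/call-sites
)

// canInlineCall reports whether a callee of the given cost may be
// inlined at a call site, given the hotness of the callee or site.
func canInlineCall(calleeCost int32, hot bool) bool {
	if calleeCost <= maxInlineBudget {
		return true // always within the default budget
	}
	// Over the default budget: only hot callees/call-sites get the
	// larger budget, which limits code-size growth on cold paths.
	return hot && calleeCost <= inlineHotMaxBudget
}

func main() {
	fmt.Println(canInlineCall(70, false))  // true
	fmt.Println(canInlineCall(120, false)) // false
	fmt.Println(canInlineCall(120, true))  // true
	fmt.Println(canInlineCall(200, true))  // false
}
```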
| |
| * __Code Specialization__: In order to eliminate indirect function call overheads, we introduce a _code specialization_ step inside the Inliner to transform an indirect callsite into an "if-else" code block based on the hot receiver. |
The WeightedCallGraph is used to determine a hot receiver.
| The "if" condition introduces a check on the runtime type of the hot receiver. |
| The body of the "if" block converts the indirect function call into a direct function call. |
| This call subsequently gets inlined since Inlining follows this optimization. |
| The "else" block retains the original slow indirect call. |
| As mentioned earlier, our goal is to be able to inline hot receivers. |
| We purposefully avoid making multiple checks on the receiver type for multiple hot receiver cases in order to avoid unnecessary code growth and performance degradation in corner cases. |
| In certain cases, it is possible to hoist the "if" condition check outside of the surrounding loop nests; this is a subject for future work. |
During Inlining and Specialization, we update and propagate the weights of the IR instructions.
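
In source form, the rewrite looks roughly like the following hypothetical example, assuming the profile identifies `Circle` as the hot receiver type (the compiler performs the equivalent transformation on the IR):

```go
package main

import "fmt"

type Shape interface{ Area() float64 }

type Circle struct{ r float64 }

func (c Circle) Area() float64 { return 3.14159 * c.r * c.r }

type Rect struct{ w, h float64 }

func (r Rect) Area() float64 { return r.w * r.h }

// areaIndirect is the original code: one indirect call.
func areaIndirect(s Shape) float64 {
	return s.Area()
}

// areaSpecialized is the specialized form, assuming the profile shows
// Circle to be the hot receiver type.
func areaSpecialized(s Shape) float64 {
	if c, ok := s.(Circle); ok {
		return c.Area() // direct call on the hot path; inlinable
	}
	return s.Area() // cold path keeps the original indirect call
}

func main() {
	fmt.Println(areaSpecialized(Circle{1}) == areaIndirect(Circle{1}))   // true
	fmt.Println(areaSpecialized(Rect{2, 3}) == areaIndirect(Rect{2, 3})) // true
}
```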
| |
* __Basic-block Reordering__: Since the Go compiler produces a large monolithic binary, we implement a state-of-the-art profile-guided basic-block layout optimization, the [EXT-TSP heuristic](https://ieeexplore.ieee.org/document/9050435), at the SSA-level.
| The algorithm is a greedy heuristic that works with chains (ordered lists) of basic blocks. |
| Initially all chains are isolated basic blocks. |
On every iteration, the pair of chains whose merging yields the biggest benefit in terms of instruction-cache reuse is merged.
The algorithm stops when one chain is left or when merging does not improve the overall benefit.
If more than one chain remains, the chains are sorted using a density function that prioritizes placing frequently executed, smaller chains into each page.
Unlike earlier algorithms (e.g., [Pettis-Hansen](https://dl.acm.org/doi/10.1145/93548.93550)), the ExtTSP algorithm first splits two chains, X and Y, into three hypothetical chains, X1, X2, and Y.
It then considers all possible ways of combining these three chains (e.g., X1YX2, X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1) and chooses the one producing the largest benefit.
| We also have an implementation of [Pettis-Hansen](https://dl.acm.org/doi/10.1145/93548.93550) available for performance comparisons. |
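
A much-simplified sketch of the greedy chain-merging idea (closer to Pettis-Hansen than to full ExtTSP: it merges along the hottest remaining edge and omits the three-way split and cache-oriented scoring described above):

```go
package main

import "fmt"

// mergeChains greedily concatenates block chains along the hottest
// remaining edge until no edge connects two distinct chains.
// edges maps a (from, to) block pair to its profiled weight.
func mergeChains(blocks []string, edges map[[2]string]int64) [][]string {
	// Start with one single-block chain per basic block.
	chainOf := map[string]int{}
	chains := [][]string{}
	for i, b := range blocks {
		chainOf[b] = i
		chains = append(chains, []string{b})
	}
	for {
		// Pick the hottest edge whose endpoints are in different chains.
		var best [2]string
		var bestW int64 = -1
		for e, w := range edges {
			if chainOf[e[0]] != chainOf[e[1]] && w > bestW {
				best, bestW = e, w
			}
		}
		if bestW < 0 {
			break
		}
		// Concatenate the callee's chain onto the caller's chain.
		src, dst := chainOf[best[0]], chainOf[best[1]]
		chains[src] = append(chains[src], chains[dst]...)
		for _, b := range chains[dst] {
			chainOf[b] = src
		}
		chains[dst] = nil
	}
	var out [][]string
	for _, c := range chains {
		if len(c) > 0 {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	edges := map[[2]string]int64{
		{"entry", "hot"}:  100,
		{"entry", "cold"}: 1,
		{"hot", "exit"}:   100,
	}
	fmt.Println(mergeChains([]string{"entry", "hot", "cold", "exit"}, edges))
}
```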
| |
| * __Function Reorderings__: Finally, we introduce a "function reordering" optimization in the linker. |
| Linker code is extended to rebuild the cross-package "pprof-graph" via a command-line argument. |
| It first produces a summary table consisting of a set of (caller, callee, edge-weight) entries [similar to WeightedCallGraph without node weights]. |
| We then use the [C3 heuristic](https://dl.acm.org/doi/10.5555/3049832.3049858) to re-layout the functions in the text section. |
The structure of the C3 algorithm is similar to that of ExtTSP, so we avoid describing it again.
One thing to note is that the C3 heuristic places a function as close as possible to its most common caller, following a priority order from the hottest to the coldest functions in the program.
One issue we ran into while implementing this algorithm is that the C3 heuristic requires the relative distance of a callsite from the start of a function.
We found this tricky to implement in the linker, so our current implementation approximates this distance as half of the code size of the function instead of computing it precisely.
We are able to compute the code size directly from the object file in order to compute density.
In the future, we should explore computing this information in the compiler and storing it at a fixed location in the object file.
| We also have an implementation of [Pettis-Hansen](https://dl.acm.org/doi/10.1145/93548.93550) available as an alternative. |
| Our function reordering optimization is automatically cross-package. |
| One subtle issue we have run into is that often GC-related functions are intermingled with application code in profiles making function reordering optimization sub-optimal. |
| While GC functions emanating from the "assistGC" function called from the "malloc" routine should be included in the function reordering optimization, other GC functions probably should be sequenced as a separate chain/cluster of its own. |
This makes sense given that GC gets invoked at arbitrary times and runs in a separate thread of its own (and perhaps on a separate core as well).
| This is a subject for future work for us. |
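
The density computation used when ordering functions can be sketched as follows (illustrative types; the actual implementation also honors caller-callee adjacency from the C3 chains):

```go
package main

import (
	"fmt"
	"sort"
)

// FuncInfo carries the per-function data the linker derives from the
// profile summary table and the object file.
type FuncInfo struct {
	Name     string
	Weight   int64 // cumulative sample weight
	CodeSize int64 // bytes of text, read from the object file
}

// sortByDensity orders functions by weight per byte, densest first, so
// that frequently executed small functions end up sharing pages.
func sortByDensity(funcs []FuncInfo) {
	sort.SliceStable(funcs, func(i, j int) bool {
		// Compare w_i/s_i > w_j/s_j without floating point:
		// w_i*s_j > w_j*s_i.
		return funcs[i].Weight*funcs[j].CodeSize > funcs[j].Weight*funcs[i].CodeSize
	})
}

func main() {
	fs := []FuncInfo{
		{"big.Cold", 10, 4096},
		{"small.Hot", 500, 128},
		{"mid.Warm", 300, 1024},
	}
	sortByDensity(fs)
	for _, f := range fs {
		fmt.Println(f.Name)
	}
}
```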
| |
* __Stale Profile__: As source code evolves, profiles are likely to become stale over time.
In order to deal with this, we propose to use _\<pkg\_name, function\_name, line\_offset\>_ instead of _\<pkg\_name, function\_name, line\_number\>_.
With line\_offset information, we can identify whether a call site has moved up or down relative to the start of the function.
In that case, we do not inline this call site; however, other call sites that have not moved will continue to use the profile information and get optimized.
This design also allows functions that are unchanged to continue to be optimized via PGO.
This design is similar to [AutoFDO](https://dl.acm.org/doi/pdf/10.1145/2854038.2854044).
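
The offset-based matching amounts to keying call sites as sketched below (a hypothetical helper illustrating the proposed key; the key format itself is an assumption):

```go
package main

import "fmt"

// callSiteKey identifies a call site by its offset from the start of
// the enclosing function rather than by absolute line number, so the
// key survives edits elsewhere in the file.
func callSiteKey(pkg, fn string, funcStartLine, callLine int) string {
	offset := callLine - funcStartLine
	return fmt.Sprintf("%s.%s+%d", pkg, fn, offset)
}

func main() {
	// The function moved from line 10 to line 25, but the call site is
	// still 3 lines into the function, so the key is unchanged.
	oldKey := callSiteKey("main", "foo", 10, 13)
	newKey := callSiteKey("main", "foo", 25, 28)
	fmt.Println(oldKey == newKey) // true
}
```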
| |
| ## Rationale |
| |
| There are several design choices we have made in this proposal. |
| |
| __Why not inline every hot callee? Why introduce a budget for hot callee?__ |
| |
The primary motivation is to control code growth while retaining good performance.
| Auto-tuning frameworks such as [OpenTuner](https://github.com/minshuzhan/opentuner) can tune binaries using the command-line options for budget and threshold. |
| |
| __What is the benefit of WeightedCallGraph data structure?__ |
| |
The core purpose of the WeightedCallGraph is to associate IR functions and callsites with node and edge weights from the pprof-graph.
This could perhaps have been achieved with a simpler data structure, making the WeightedCallGraph somewhat heavyweight if used only for Inlining and Specialization.
However, it is a forward-looking design that other (interprocedural) optimizations can leverage in the future.
| |
__Why is Specialization performed during inlining and not before or after?__
| |
Ideally, we want code specialization to be performed before inlining so that the resulting direct callees can be inlined.
On the other hand, if specialization is performed as a separate pass before the Inliner, the WeightedCallGraph needs to be updated during specialization as well.
By performing specialization along with the Inliner, we avoid updating the WeightedCallGraph twice.
Moreover, Specialization requires parts of the CanInline logic to perform legality checks.
Keeping them together also avoids this code duplication.
| |
| __Why use the Protobuf format? Why not use other formats similar to LLVM?__ |
| |
From our experience, the protobuf format is rich enough to carry additional profiling information, particularly hardware performance counter data.
Tools such as [perf\_data\_converter](https://github.com/google/perf_data_converter) can convert a _perf.data_ file produced by Linux [perf](https://www.brendangregg.com/perf.html) into a _profile.proto_ file in protobuf format.
Thus, we do not lose any functionality.
| |
| __Why do we not perform basic-block layout in the linker?__ |
| |
| Ideally, we should. |
| Currently, the Go compiler produces one section for code/text. |
| If it can be extended to produce sections at the basic-block level, we can port our basic-block layout implementation to the linker. |
| |
| ## Compatibility |
| |
Currently, our optimizations have been tested only on Linux.
To the best of our knowledge, our implementation follows the [Go 1 compatibility guidelines](https://golang.org/doc/go1compat).
| |
| ## Implementation |
| |
We have implemented all the optimizations described in this proposal, i.e., Inlining, Code Specialization, Basic-block Layout, and Function Reordering.
However, pushing all these changes upstream at once can take a while.
Our strategy is to release only "CallGraphBuilder" and "Profile-guided Inlining" (see Figure 2) in the first phase, targeting the Go 1.20 release.
Below we provide the implementation details for CallGraphBuilder and Profile-guided Inlining.
| |
| ### CallGraphBuilder implementation |
| |
| * __BuildPProfGraph__: This function builds the pprof-graph based on the "internal/profile" package. |
The pprof-graph is built using the _cpuprofile_ data from the profile.
The protobuf profile file is fed to the compiler via a command-line flag (_-profileuse_).
It preprocesses the pprof-graph to compute additional information such as _TotalNodeWeights_ and _TotalEdgeWeights_.
| It also creates a global node-to-weight map which is cheaper to look up during IR traversal in later phases. |
| |
* __BuildWeightedCallGraph__: This function builds the WeightedCallGraph data structure by visiting all the _ir.Func_ objects in the decl-list of a package.
During the traversal of the IR, it is necessary to determine the callee function for a direct callsite.
Our logic for this is very similar to "inlCallee" in "inline/inl.go".
Since reusing that code would create a cyclic dependency between the inline and pgo packages, we have duplicated the "inlCallee" logic in irgraph.go (under the same name).
We should refactor this code in the future so that it can live outside both packages.
| |
| * __RedirectEdges__: This function updates the WeightedCallGraph nodes and edges during and after Inlining based on a list of inlined functions. |
| |
| ### Profile-guided Inlining implementation |
| |
| * __Prologue__: Before performing inlining, we walk over the _WeightedCallGraph_ to identify hot functions and hot call-sites based on a command-line tunable threshold (_-inlinehotthreshold_). |
| This threshold is expressed as a percentage of time exclusively spent in a function or call-site over total execution time. |
| It is by default set to 2% and can be updated via the command-line argument. |
| When a function or call-site exclusively spends more time than this threshold percentage, we classify the function or call-site as a hot function or hot call-site. |
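
This classification amounts to a simple percentage check (an illustrative sketch; with the default threshold, `thresholdPercent` is 2):

```go
package main

import "fmt"

// isHot reports whether a function or call site whose exclusive sample
// weight is flatWeight should be treated as hot, given the total
// profile weight and a threshold expressed as a percentage.
func isHot(flatWeight, totalWeight int64, thresholdPercent float64) bool {
	if totalWeight == 0 {
		return false
	}
	return float64(flatWeight)/float64(totalWeight)*100 >= thresholdPercent
}

func main() {
	fmt.Println(isHot(30, 1000, 2)) // true: 3% of total time
	fmt.Println(isHot(10, 1000, 2)) // false: 1% of total time
}
```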
| |
* __Extensions to CanInline__: Currently, the HairyVisitor bails out as soon as a function's cost exceeds maxInlineBudget (80).
| We introduce a new budget for profile-guided inlining, _inlineHotMaxBudget_, with a default value of 160, which can also be updated via a command line flag (_-inlinehotbudget_). |
| We use the inlineHotMaxBudget for hot functions during the CanInline step. |
| |
```go
if v.budget < 0 {
	if pgo.WeightedCG != nil {
		if n, ok := pgo.WeightedCG.IRNodes[ir.PkgFuncName(fn)]; ok {
			// For hot functions, keep visiting as long as the accumulated
			// cost (inlineMaxBudget - v.budget) stays within the larger
			// inlineHotMaxBudget.
			if inlineMaxBudget-v.budget < inlineHotMaxBudget && n.HotNode {
				return false
			}
		}
	}
	...
}
```
| |
| During the HairyVisitor traversal, we also track all the hot call-sites of a function. |
| |
```go
// Determine if the callee edge is a hot callee or not.
if pgo.WeightedCG != nil && ir.CurFunc != nil {
	if fn := inlCallee(n.X); fn != nil && typecheck.HaveInlineBody(fn) {
		// Extract the line number from the call site's "file:line:col" position.
		lineno := fmt.Sprintf("%v", ir.Line(n))
		splits := strings.Split(lineno, ":")
		l, _ := strconv.ParseInt(splits[len(splits)-2], 0, 64)
		linenum := fmt.Sprintf("%v", l)
		// The caller, call-site line, and callee together identify the call edge.
		canonicalName := ir.PkgFuncName(ir.CurFunc) + "-" + linenum + "-" + ir.PkgFuncName(fn)
		if _, o := candHotEdgeMap[canonicalName]; o {
			listOfHotCallSites[pgo.CallSiteInfo{ir.Line(n), ir.CurFunc}] = struct{}{}
		}
	}
}
```
| |
* __Extension to mkinlcall__: When a hot callee's cost is more than maxCost, we check whether this cost is less than inlineHotMaxBudget in order to permit inlining of hot callees.
| |
```go
if fn.Inl.Cost > maxCost {
	...

	// If the call site is hot and the callee's cost is within the
	// inlineHotMaxBudget, inline it; otherwise bail out.
	if _, ok := listOfHotCallSites[pgo.CallSiteInfo{ir.Line(n), ir.CurFunc}]; ok {
		if fn.Inl.Cost > inlineHotMaxBudget {
			return n
		}
	} else {
		return n
	}
}
```
| |
| * __Epilogue__: During this phase, we update the WeightedCallGraph based on the inlining decisions made earlier. |
| |
| ```go |
// Revisit all functions bottom-up and redirect the call edges affected by inlining.
ir.VisitFuncsBottomUp(typecheck.Target.Decls, func(list []*ir.Func, recursive bool) {
| for _, f := range list { |
| name := ir.PkgFuncName(f) |
| if n, ok := pgo.WeightedCG.IRNodes[name]; ok { |
| pgo.RedirectEdges(n, inlinedCallSites) |
| } |
| } |
| }) |
| ``` |
| |
| * __Command line flags list__: We introduce the following flag in base/flag.go. |
| ```go |
| PGOProfile string "help:\"read profile from `file`\"" |
| ``` |
| |
We also introduce the following debug flags in base/debug.go.
| ```go |
| InlineHotThreshold string "help:\"threshold percentage for determining hot methods and callsites for inlining\"" |
| InlineHotBudget int "help:\"inline budget for hot methods\"" |
| ``` |
| |
| ## Open issues (if applicable) |
| |
* Build the pprof-graph once across all packages instead of building it over and over again for each package.
Currently, the Go compiler builds packages in parallel using multiple processes, which makes it hard to build the graph only once.
| |
| * Enable separate sections for basic-blocks and functions in the linker. |
| This would enable basic-block reordering optimization at the linker level. |
| Our implementation performs this optimization at the SSA-level. |