design/55022-pgo-implementation.md: add pgo implementation doc

This doc describes the proposed implementation details for the PGO design in 55022-pgo.md.

For golang/go#55022.

Change-Id: Ifa277419ec820fa5f18bf4d3220eeb0c41db1d07
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/430398
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
diff --git a/design/55022-pgo-implementation.md b/design/55022-pgo-implementation.md
new file mode 100644
index 0000000..f788a10
--- /dev/null
+++ b/design/55022-pgo-implementation.md
@@ -0,0 +1,316 @@
+# Proposal: Design and Implementation of Profile-Guided Optimization (PGO) for Go
+
+Author(s): Raj Barik, Jin Lin
+
+Last updated: 2022-09-12
+
+Discussion at https://golang.org/issue/55025. \
+Linked high-level issue for PGO at https://golang.org/issue/55022.
+
+## Abstract
+
+Inefficiencies in Go programs can be isolated via profiling tools such as [pprof](https://pkg.go.dev/runtime/pprof) and linux profiler [perf](https://www.brendangregg.com/perf.html).
+Such tools can pinpoint source code regions where most of the execution time is spent.
+Unlike other optimizing compilers such as [LLVM](https://llvm.org/docs/HowToBuildWithPGO.html), the Go compiler does not yet perform Profile-Guided Optimization (PGO).
+PGO uses information about the code’s runtime behavior to guide compiler optimizations such as inlining, code layout, etc.
+PGO can improve application performance in the range of 15-30% [[LLVM](https://llvm.org/docs/HowToBuildWithPGO.html), [AutoFDO](https://dl.acm.org/doi/pdf/10.1145/2854038.2854044)].
+In this proposal, we extend the Go compiler with PGO.
+Specifically, we incorporate the profiles into the frontend of the compiler to build a call graph with node & edge weights (called _WeightedCallGraph_).
+The Inliner subsequently uses the WeightedCallGraph to perform _profile-guided inlining_ which aggressively inlines hot functions.
+We introduce a _profile-guided code specialization_ pass that is tightly integrated with the Inliner and eliminates indirect method call overheads in hot code paths.
+Furthermore, we annotate IR instructions with their associated profile weights and propagate these to the SSA-level in order to facilitate _profile-guided basic-block layout_ optimization to benefit from better instruction-cache and TLB performance.
+Finally, we extend Go's linker to also consume the profiles directly and perform _function reordering_ optimization across package boundaries -- which also helps instruction-cache and TLB performance.
+The format of the profile file consumed by our PGO is identical to the protobuf format produced by the [pprof](https://pkg.go.dev/runtime/pprof) tool.
+This format is rich enough to carry additional hardware performance counter information such as cache misses, LBR, etc.
+The existing [perf\_data\_converter](https://github.com/google/perf_data_converter) tool from Google can convert a _perf.data_ file produced by the Linux [perf](https://www.brendangregg.com/perf.html) tool into a _profile.proto_ file in protobuf format.
+The first version of the code that performs _profile-guided inlining_ is available [here](http://github.com/rajbarik/go).
+
+## Background
+
+Many compiler optimizations, such as inlining, register allocation, and instruction scheduling, often use hard-coded information related to caller-callee frequencies, basic-block frequencies, and branch probabilities to guide optimization.
+The static estimation of these metrics often leads to poor-quality code generated by the compiler.
+These optimizations can easily benefit from dynamic information collected via profiling an application.
+
+Traditionally, a PGO-based compilation begins with an instrumentation phase to generate an instrumented version of the application.
+Next, the instrumented program runs with training data to collect the profile (e.g., [edge-profiles](https://dl.acm.org/doi/10.1145/183432.183527)).
+These profiles are subsequently fed to the compiler and the application is recompiled to produce an optimized binary.
+During this process, the compiler updates and propagates profile information, including feeding it to optimization passes to optimize hot code paths.
+Modern compilers such as LLVM have embedded PGO in them and have reported [speedups](https://llvm.org/docs/HowToBuildWithPGO.html) of ~20%.
+Since the instrumentation build of an application incurs significant overhead, [recent work](https://dl.acm.org/doi/10.1145/1772954.1772963) has demonstrated little or no performance loss by collecting execution profiles via sampling (e.g., [hardware performance counter](https://dl.acm.org/doi/10.1145/1772954.1772963), [pprof](https://pkg.go.dev/runtime/pprof)).
+
+Go binaries are often large as they are statically linked and include all dependent packages and runtimes.
+For such large binaries, misses in the instruction cache and TLB can cause stalled cycles in the front-end, leading to performance degradation.
+Profile-guided _code layout_ optimization is known to alleviate this problem.
+Recent works including [FB-BOLT](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/) and [Google-Propeller](https://github.com/google/llvm-propeller) have shown more than 10% performance improvements by optimizing code locality in large data-center workloads.
+_Code layout optimization_ improves code locality; it comprises basic-block layout, function splitting, and function reordering optimizations.
+In order to extract maximum performance, these optimizations are typically performed during or after link time using profiling information.
+
+### Standard Compilation flow in Go
+
+![](55022/image1.png)
+
+**Figure 1.** Go compiler.
+
+At a very high level, the Go compiler produces an object file per package.
+It links all object files together to produce an executable.
+It avoids unnecessary recompilation of packages which are not modified.
+For each package, the frontend of the Go compiler performs passes such as type checking and ast-lowering before generating the frontend-IR.
+At this IR-level, the Go compiler performs optimizations such as inlining, devirtualization, and escape analysis before lowering to SSA-IR.
+Several optimizations (dead-code, cse, dead-store, basic-block layout, etc.) are performed at the SSA-level before producing an object file for a package.
+
+Inlining optimization in Go compiler proceeds as a bottom-up traversal of the call graph.
+For each function, it first invokes _CanInline_ to determine eligibility for inlining; e.g., functions marked with _go:noinline_ cannot be inlined.
+CanInline traverses the IR (via a HairyVisitor) to calculate the _Cost_ of the function.
+If this Cost is greater than _maxInlineBudget_ (80), CanInline marks the function as not inlinable in its upstream callers.
+Subsequently, the Inliner performs inlining in the same function for call-sites that have direct callees with cost less than maxInlineBudget (80).
+As an additional code-size optimization, the Inliner lowers the budget from maxInlineBudget (80) to inlineBigFunctionMaxCost (20) for a big function with more than 5K IR instructions.
+These parameters have been tuned heavily to strike a good balance between code size and performance.
+On the other hand, there have been several discussions around maxInlineBudget (80) being too small, which prevents frequently executed methods from being inlined, e.g., NextSet in [bitset](https://lemire.me/blog/2017/09/05/go-does-not-inline-functions-when-it-should/).
+It is definitely compelling to be able to tune these parameters based on runtime behavior of an application.
+
+Devirtualization pass in the Go compiler replaces interface method calls with direct concrete-type method calls where possible.
+Since inlining happens before devirtualization, there is no guarantee that the concrete method calls generated by the devirtualizer get inlined by the compiler.
+
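To make this concrete, the hedged source-level sketch below shows the kind of rewrite devirtualization performs; the `Shape` and `Square` types are invented for this example, and the compiler applies the transformation on its IR rather than on source code.

```go
package main

import "fmt"

type Shape interface{ Area() float64 }

type Square struct{ side float64 }

func (s Square) Area() float64 { return s.side * s.side }

// Before devirtualization: an indirect call through the interface
// method table, which the inliner cannot see through.
func sumAreas(shapes []Shape) float64 {
	total := 0.0
	for _, s := range shapes {
		total += s.Area()
	}
	return total
}

// What devirtualization effectively produces when the compiler can prove
// every element is a Square: a direct call that the inliner can flatten.
func sumSquareAreas(shapes []Square) float64 {
	total := 0.0
	for _, s := range shapes {
		total += s.Area() // direct, inlinable call
	}
	return total
}

func main() {
	fmt.Println(sumAreas([]Shape{Square{2}, Square{3}}))
	fmt.Println(sumSquareAreas([]Square{{2}, {3}}))
}
```

The direct `Area` call in `sumSquareAreas` is a straightforward inlining candidate, which is exactly the opportunity lost when inlining runs first.
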
+The Go compiler performs basic-block layout at the SSA-level today, but it is not profile-guided.
+This optimization is more effective when runtime information is provided to the compiler.
+
+The Go linker does not yet perform any function reordering optimization.
+To the best of our knowledge, the linker today creates two sections, one for data and another for text/code.
+Ideally, one would like to do basic-block layout, hot-cold splitting, and function reordering optimizations in the linker ([FB-BOLT](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/)); however, without support for sections at the function and basic-block level, it is hard to do them in the linker.
+This limitation has led us to enable the basic-block layout optimization at the SSA-level and the function reordering optimization in the linker.
+
+## Proposal
+
+### New compilation flow proposed in Go for PGO
+
+![](55022/image2.png)
+
+**Figure 2.** New PGO-enabled Go compiler.
+
+Our proposed Go compilation flow consists of the following high-level components.
+
+* __Pprof-graph__: First, we process the profile proto file that is generated either via the [pprof](https://github.com/google/pprof/tree/main/profile) tool or the perf+[perf\_data\_converter](https://github.com/google/perf_data_converter) tool to produce an inter-package _pprof-graph._ We use the existing _internal/profile_ package to get this graph straight out-of-the-box.
+This graph contains most of the information needed to perform PGO, e.g., function names, flat weights, cumulative weights, and call-sites with callee information for both direct and indirect callees.
+Ideally, we would like to generate this graph once and cache it for individual package-level compilations in order to keep compile time under control; however, since packages may be compiled in parallel in multiple processes, we could not find an easy way to share this graph across packages.
+We fall back to building this graph for every package.
+
+* __CallGraphBuilder and WeightedCallGraph__: While compiling each package, we visit the IRs of all the functions in the package to produce an intra-package call graph with edge and node weights.
+The edge and node weights are obtained from the _pprof-graph_.
+We term this the _WeightedCallGraph_ (shown as _CallGraphBuilder_ in Figure 2).
+In other words, the _WeightedCallGraph_ is a succinct version of the package-level pprof-graph with additional information such as associated IR function bodies, call edges linked to IR call sites, and multiple callee edges for indirect calls.
+
+* __Embed profiles in the frontend IR & SSA IR__: While building the WeightedCallGraph, we also create a map table that stores each IR instruction and its corresponding profiled weight in order to facilitate the basic-block layout optimization at the SSA-level.
+The primary reason for storing every IR instruction with weights is that there is no control-flow graph available at the IR-level, which could otherwise have been leveraged to annotate only branch instructions and branch edges.
+During the lowering from the frontend IR to the SSA IR, the profiling information is annotated onto basic blocks and edges.
+The compiler propagates the profiling information through the control-flow graph.
+Similarly, every optimization on the SSA IR maintains the profiling information.
+Annotating every single instruction with weights can increase memory pressure; in the future we plan to explore other options.
+
+* __Profile-guided Inlining__: During inlining, we determine hot functions and hot call-sites based on the _WeightedCallGraph_.
+We increase the inlining budget from 80 to a larger value (default 160, tunable via the command line) based on the hotness of the function.
+This enables a function to be inlined in its upstream callers even if its cost is more than 80.
+Moreover, if the function has a cost of more than 80 and more than one upstream caller, then only hot call edges are inlined, thereby controlling code-size growth.
+Similarly, hot call-sites are inlined aggressively in the "mkinlcall" function using the increased budget.
+The hotness criterion is based on a threshold, which can be tuned via the command line (default set to 2% of the total execution time).
+One subtle note is that the total number of functions inlined via PGO may be either greater or smaller than with the original inliner without PGO.
+For example, in a pathological case, if a hot function gets bulkier (i.e., its cost exceeds 160) due to inlining performed in transitive callees, then it may not get inlined in upstream callers (unlike with standard inlining); if there are a large number of upstream callers, this may lead to fewer inlining outcomes.
+This can be addressed by increasing the budget via the command line, but that may also lead to increased code size, something to keep an eye on.
+
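The budget decision described above can be sketched as follows; `inlineBudget` is a hypothetical helper, not the compiler's actual code, and uses the default values quoted in the text (80 and 160).

```go
package main

import "fmt"

const (
	inlineMaxBudget    = 80  // default budget of the existing inliner
	inlineHotMaxBudget = 160 // raised budget for hot functions (tunable)
)

// inlineBudget mirrors the decision described above: hot functions are
// judged against the larger PGO budget, everything else keeps the default.
func inlineBudget(cost int32, hot bool) bool {
	if hot {
		return cost <= inlineHotMaxBudget
	}
	return cost <= inlineMaxBudget
}

func main() {
	fmt.Println(inlineBudget(120, false)) // over the default budget
	fmt.Println(inlineBudget(120, true))  // within the hot budget
}
```
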
+* __Code Specialization__: In order to eliminate indirect function call overheads, we introduce a _code specialization_ step inside the Inliner to transform an indirect callsite into an "if-else" code block based on the hot receiver.
+The WeightedCallGraph is used to determine a hot receiver.
+The "if" condition introduces a check on the runtime type of the hot receiver.
+The body of the "if" block converts the indirect function call into a direct function call.
+This call subsequently gets inlined since Inlining follows this optimization.
+The "else" block retains the original slow indirect call.
+As mentioned earlier, our goal is to be able to inline hot receivers.
+We purposefully avoid making multiple checks on the receiver type for multiple hot receiver cases in order to avoid unnecessary code growth and performance degradation in corner cases.
+In certain cases, it is possible to hoist the "if" condition check outside of the surrounding loop nests; this is a subject for future work.
+During Inlining and Specialization, we update and propagate weights of the IR instructions.
+
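The "if-else" transformation can be illustrated at the source level as below; the `Encoder` and `JSONEncoder` names are invented for this sketch, and the compiler performs the rewrite on the IR rather than in source.

```go
package main

import "fmt"

type Encoder interface{ Encode(s string) string }

type JSONEncoder struct{}

func (JSONEncoder) Encode(s string) string { return `"` + s + `"` }

// Before specialization: an indirect call through the interface.
func encodeAll(e Encoder, items []string) []string {
	out := make([]string, len(items))
	for i, s := range items {
		out[i] = e.Encode(s)
	}
	return out
}

// After specialization, assuming the profile says JSONEncoder is the hot
// receiver: a type check guards a direct (and therefore inlinable) call,
// while the else branch keeps the original indirect call.
func encodeAllSpecialized(e Encoder, items []string) []string {
	out := make([]string, len(items))
	for i, s := range items {
		if je, ok := e.(JSONEncoder); ok {
			out[i] = je.Encode(s) // direct call, candidate for inlining
		} else {
			out[i] = e.Encode(s) // slow path: original indirect call
		}
	}
	return out
}

func main() {
	fmt.Println(encodeAll(JSONEncoder{}, []string{"a"}))
	fmt.Println(encodeAllSpecialized(JSONEncoder{}, []string{"a"}))
}
```

Both variants compute the same result; the specialized form only changes which call the inliner can see through on the hot path.
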
+* __Basic-block Reordering__: Since the Go compiler produces a large monolithic binary, we implement state-of-the-art profile-guided basic block layout optimization, i.e., [EXT-TSP heuristic](https://ieeexplore.ieee.org/document/9050435), at the SSA-level.
+The algorithm is a greedy heuristic that works with chains (ordered lists) of basic blocks.
+Initially all chains are isolated basic blocks.
+On every iteration, the pair of chains whose merging yields the biggest benefit in terms of instruction-cache reuse is merged.
+The algorithm stops when one chain is left or merging does not improve the overall benefit.
+If more than one chain is left, the remaining chains are sorted using a density function that prioritizes placing frequently executed, smaller chains into each page.
+Unlike earlier algorithms (e.g., [Pettis-Hansen](https://dl.acm.org/doi/10.1145/93548.93550)), the ExtTSP algorithm first splits one of two chains X and Y, giving three hypothetical chains X1, X2, and Y.
+Then it considers all possible ways of combining these three chains (e.g., X1YX2, X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1) and chooses the one producing the largest benefit.
+We also have an implementation of [Pettis-Hansen](https://dl.acm.org/doi/10.1145/93548.93550) available for performance comparisons.
+
+* __Function Reorderings__: Finally, we introduce a "function reordering" optimization in the linker.
+Linker code is extended to rebuild the cross-package "pprof-graph" via a command-line argument.
+It first produces a summary table consisting of a set of (caller, callee, edge-weight) entries [similar to WeightedCallGraph without node weights].
+We then use the [C3 heuristic](https://dl.acm.org/doi/10.5555/3049832.3049858) to re-layout the functions in the text section.
+The structure of the C3 algorithm is similar to that of ExtTSP, so we do not describe it again.
+One thing to note is that the C3 heuristic places a function as close as possible to its most common caller, following a priority from the hottest to the coldest functions in the program.
+One issue we ran into while implementing this algorithm is that the C3 heuristic requires the relative distance of a callsite from the start of a function.
+We found this tricky to implement in the linker, so our current implementation approximates this distance as half of the code size of the function instead of computing it precisely.
+We are able to compute code size directly from the object file in order to compute density.
+In the future, we should explore computing this information in the compiler and storing it at a fixed location in the object file.
+We also have an implementation of [Pettis-Hansen](https://dl.acm.org/doi/10.1145/93548.93550) available as an alternative.
+Our function reordering optimization is automatically cross-package.
+One subtle issue we have run into is that GC-related functions are often intermingled with application code in profiles, making the function reordering optimization sub-optimal.
+While GC functions emanating from the "assistGC" function called from the "malloc" routine should be included in the function reordering optimization, other GC functions should probably be sequenced as a separate chain/cluster of their own.
+This makes sense given that GC gets invoked at arbitrary times and runs in its own thread (and possibly on a separate core).
+This is a subject for future work for us.
+
+* __Stale Profile__: As source code evolves, it is highly likely that the profiles could become stale over time.
+In order to deal with this, we propose to use _\<pkg\_name, function\_name, line\_offset\>_ instead of _\<pkg\_name, function\_name, line\_number\>_.
+With line\_offset information, we can identify whether a call site has moved up or down relative to the start of the function.
+In this case, we do not inline this call site; however, other call sites that have not moved will continue to use the profile information and get optimized.
+This design also allows unchanged functions to continue to be optimized via PGO.
+This design is similar to [AutoFDO](https://dl.acm.org/doi/pdf/10.1145/2854038.2854044).
+
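The line-offset keying can be sketched as below; `callSiteKey` and its fields are invented for illustration and do not match the compiler's actual types.

```go
package main

import "fmt"

// callSiteKey mirrors the <pkg_name, function_name, line_offset> key
// described above.
type callSiteKey struct {
	pkg, fn    string
	lineOffset int // call-site line relative to the function start
}

func key(pkg, fn string, funcStartLine, callLine int) callSiteKey {
	return callSiteKey{pkg, fn, callLine - funcStartLine}
}

func main() {
	// The profile recorded a call site at line 12 of a function starting at
	// line 10. After an edit, the whole function moved down to line 40, but
	// the call site is still 2 lines from the start, so the key still matches.
	before := key("mypkg", "Foo", 10, 12)
	after := key("mypkg", "Foo", 40, 42)
	fmt.Println(before == after) // offsets match: profile entry still applies

	// If the call site itself moved within the function, the offsets differ
	// and the compiler skips PGO for just this call site.
	moved := key("mypkg", "Foo", 40, 45)
	fmt.Println(before == moved)
}
```
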
+## Rationale
+
+There are several design choices we have made in this proposal.
+
+__Why not inline every hot callee? Why introduce a budget for hot callee?__
+
+The primary motivation is to control code growth while retaining good performance.
+Auto-tuning frameworks such as [OpenTuner](https://github.com/minshuzhan/opentuner) can tune binaries using the command-line options for budget and threshold.
+
+__What is the benefit of WeightedCallGraph data structure?__
+
+The core purpose of the WeightedCallGraph is to associate IR functions and callsites with node and edge weights from the pprof-graph.
+This could perhaps have been achieved with a simpler data structure, and the WeightedCallGraph may be somewhat heavyweight for inlining and specialization alone.
+However, it is a forward-looking design that other (interprocedural) optimizations can leverage in the future.
+
+__Why is specialization performed during inlining and not before or after?__
+
+Ideally, we want code specialization to be performed before inlining so that the direct callees can be inlined.
+On the other hand, when specialization is performed as a separate pass before the Inliner, one needs to update the WeightedCallGraph during specialization.
+When we perform specialization within the Inliner, we avoid updating the WeightedCallGraph twice.
+Moreover, Specialization also requires parts of CanInline logic to perform legality checks.
+Keeping them together also avoids this code duplication.
+
+__Why use the Protobuf format? Why not use other formats similar to LLVM?__
+
+In our experience, the protobuf format is rich enough to carry additional profiling information, particularly hardware performance counter information.
+Tools such as [perf\_data\_converter](https://github.com/google/perf_data_converter) can convert a _perf.data_ file produced by Linux [perf](https://www.brendangregg.com/perf.html) into a _profile.proto_ file in protobuf format.
+So we do not lose any functionality.
+
+__Why do we not perform basic-block layout in the linker?__
+
+Ideally, we should.
+Currently, the Go compiler produces one section for code/text.
+If it can be extended to produce sections at the basic-block level, we can port our basic-block layout implementation to the linker.
+
+## Compatibility
+
+Currently, our optimizations have been tested only on Linux.
+To the best of our knowledge, our implementation follows the guidelines described in the [compatibility guidelines](https://golang.org/doc/go1compat).
+
+## Implementation
+
+We have implemented all the optimizations described in this proposal, i.e., inlining, code specialization, basic-block layout, and function reordering.
+However, pushing all these changes upstream at once can take a while.
+Our strategy is to release only "CallGraphBuilder" and "Profile-guided Inlining" (see Figure 2) in the first phase, targeting the v1.20 release.
+Below we provide the implementation details for CallGraphBuilder and Profile-guided inlining.
+
+### CallGraphBuilder implementation
+
+* __BuildPProfGraph__: This function builds the pprof-graph based on the "internal/profile" package.
+The pprof-graph is built using the _cpuprofile_ data from the profile.
+The protobuf profile file is fed to the compiler via a command-line flag (_-profileuse_).
+It preprocesses the pprof-graph to compute additional information such as _TotalNodeWeights_ and _TotalEdgeWeights_.
+It also creates a global node-to-weight map which is cheaper to look up during IR traversal in later phases.
+
+* __BuildWeightedCallGraph__: This function builds the WeightedCallGraph data structure by visiting all the _ir.Func_ nodes in the decl-list of a package.
+During the traversal of the IR, it is necessary to determine the callee function for a direct callsite.
+Our logic to do this is very similar to "inlCallee" in "inline/inl.go".
+Since reusing the code would create a cyclic dependency between the inline and pgo packages, we have duplicated the "inlCallee" logic in irgraph.go (with the same name).
+We should refactor this code in the future so it can live outside these two packages.
+
+* __RedirectEdges__: This function updates the WeightedCallGraph nodes and edges during and after Inlining based on a list of inlined functions.
+
+### Profile-guided Inlining implementation
+
+* __Prologue__: Before performing inlining, we walk over the _WeightedCallGraph_ to identify hot functions and hot call-sites based on a command-line tunable threshold (_-inlinehotthreshold_).
+This threshold is expressed as the percentage of time spent exclusively in a function or call-site relative to total execution time.
+By default it is set to 2% and can be updated via the command-line argument.
+When a function or call-site exclusively spends more time than this threshold percentage, we classify it as a hot function or hot call-site.
+
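The hotness classification can be sketched as below; `isHot` is a hypothetical helper using the documented 2% default, not the compiler's actual code.

```go
package main

import "fmt"

const inlineHotThreshold = 2.0 // percent, the documented default

// isHot mirrors the classification described above: a function or
// call-site is hot when its flat (exclusive) weight is at least the
// threshold percentage of total execution time.
func isHot(flatWeight, totalWeight int64) bool {
	return float64(flatWeight)/float64(totalWeight)*100 >= inlineHotThreshold
}

func main() {
	fmt.Println(isHot(50, 1000)) // 5% of samples: hot
	fmt.Println(isHot(5, 1000))  // 0.5% of samples: cold
}
```
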
+* __Extensions to CanInline__: Currently, the HairyVisitor bails out as soon as its cost exceeds maxInlineBudget (80).
+We introduce a new budget for profile-guided inlining, _inlineHotMaxBudget_, with a default value of 160, which can also be updated via a command line flag (_-inlinehotbudget_).
+We use the inlineHotMaxBudget for hot functions during the CanInline step.
+
+```go
+if v.budget < 0 {
+  if pgo.WeightedCG != nil {
+    if n, ok := pgo.WeightedCG.IRNodes[ir.PkgFuncName(fn)]; ok {
+      if inlineMaxBudget-v.budget < inlineHotMaxBudget && n.HotNode {
+        return false
+      }
+    }
+  }
+...
+}
+```
+
+During the HairyVisitor traversal, we also track all the hot call-sites of a function.
+
+```go
+// Determine if callee edge is a hot callee or not.
+if pgo.WeightedCG != nil && ir.CurFunc != nil {
+  if fn := inlCallee(n.X); fn != nil && typecheck.HaveInlineBody(fn) {
+    lineno := fmt.Sprintf("%v", ir.Line(n))
+    splits := strings.Split(lineno, ":")
+    l, _ := strconv.ParseInt(splits[len(splits)-2], 0, 64)
+    linenum := fmt.Sprintf("%v", l)
+    canonicalName := ir.PkgFuncName(ir.CurFunc) + "-" + linenum + "-" + ir.PkgFuncName(fn)
+    if _, o := candHotEdgeMap[canonicalName]; o {
+      listOfHotCallSites[pgo.CallSiteInfo{ir.Line(n), ir.CurFunc}] = struct{}{}
+    }
+  }
+}
+```
+
+* __Extension to mkinlcall__: When a hot callee's cost is more than maxCost, we check whether it is less than inlineHotMaxBudget in order to permit inlining of hot callees.
+
+```go
+if fn.Inl.Cost > maxCost {
+...
+
+  // If the call site is hot and its cost is below the inlineHotMaxBudget budget, then inline it; otherwise bail.
+  if _, ok := listOfHotCallSites[pgo.CallSiteInfo{ir.Line(n), ir.CurFunc}]; ok {
+    if fn.Inl.Cost > inlineHotMaxBudget {
+      return n
+    }
+  } else {
+    return n
+  }
+}
+```
+
+* __Epilogue__: During this phase, we update the WeightedCallGraph based on the inlining decisions made earlier.
+
+```go
+ir.VisitFuncsBottomUp(typecheck.Target.Decls, func(list []*ir.Func, recursive bool) {
+    for _, f := range list {
+      name := ir.PkgFuncName(f)
+      if n, ok := pgo.WeightedCG.IRNodes[name]; ok {
+        pgo.RedirectEdges(n, inlinedCallSites)
+      }
+    }
+})
+```
+
+* __Command line flags list__: We introduce the following flag in base/flag.go.
+```go
+PGOProfile         string       "help:\"read profile from `file`\""
+```
+
+We also introduce the following debug flags in base/debug.go.
+```go
+InlineHotThreshold string       "help:\"threshold percentage for determining hot methods and callsites for inlining\""
+InlineHotBudget    int          "help:\"inline budget for hot methods\""
+```
+
+## Open issues (if applicable)
+
+* Build the pprof-graph once across all packages instead of building it over and over again for each package.
+Currently, the Go compiler builds packages in parallel using multiple processes, making it hard to do this once.
+
+* Enable separate sections for basic-blocks and functions in the linker.
+This would enable basic-block reordering optimization at the linker level.
+Our implementation performs this optimization at the SSA-level.
diff --git a/design/55022/image1.png b/design/55022/image1.png
new file mode 100644
index 0000000..d42dd33
--- /dev/null
+++ b/design/55022/image1.png
Binary files differ
diff --git a/design/55022/image2.png b/design/55022/image2.png
new file mode 100644
index 0000000..34e63af
--- /dev/null
+++ b/design/55022/image2.png
Binary files differ