proposal: hardware performance counters for CPU profiling.

Addresses the issue golang/go#36821.

Change-Id: I232f0da8396f49c21aea021acfad52bb5dcc515a
design/36821-perf-counter-pprof.md (new file)
# Proposal: hardware performance counters for CPU profiling.

Author(s): Milind Chabbi

Last updated: Feb/13/2020

Discussion at https://golang.org/issues/36821.

## 1. Abstract

The Go pprof CPU profiles are quite inaccurate and imprecise due to several limitations related to the operating system (OS) interval timer used as their sampling engine.
Inaccurate profiles can mislead performance investigation and may lead to inappropriate optimization guidance.
We propose incorporating hardware performance counters as an alternative sampling engine for CPU profiling.
Hardware performance counter-based sampling rectifies many of the problems related to OS timers.
The hardware CPU cycles event provides microsecond measurement granularity, and it is available in almost all commodity CPUs.
Furthermore, hardware Performance Monitoring Units (PMUs) allow measuring not only CPU cycles but also other activities, such as cache misses, that help root-cause complex performance problems.
A [prototype implementation](https://github.com/uber-dev/go/tree/pmu_pprof) shows significantly better-quality CPU profiles when using PMUs while maintaining a low measurement overhead.
The proposal will retain the OS timer as the default sampling engine.
The proposal exposes the PMU-based profiling feature on Linux-based systems, which provide an excellent facility to program PMUs.

## Table of contents
- [1. Abstract](#1-abstract)
- [2. Background](#2-background)
  - [2.1 High-level description of Go's CPU profiling](#21-high-level-description-of-gos-cpu-profiling)
  - [2.2 Desiderata and deficiencies of Go CPU profiler](#22-desiderata-and-deficiencies-of-go-cpu-profiler)
  - [2.3 Examples demonstrating inaccuracy and imprecision in Go pprof profiles](#23-examples-demonstrating-inaccuracy-and-imprecision-in-go-pprof-profiles)
  - [2.4 State-of-the-art in profiling using performance monitoring units](#24-state-of-the-art-in-profiling-using-performance-monitoring-units)
  - [2.5 Low-level details of the current Go CPU profiler](#25-low-level-details-of-the-current-go-cpu-profiler)
- [3. Problem statement](#3-problem-statement)
- [4. Proposal](#4-proposal)
  - [4.1 High-level design](#41-high-level-design)
  - [4.2 The devil is in the details](#42-the-devil-is-in-the-details)
  - [4.3 Exposing PMU events via `runtime/pprof` package](#43-exposing-pmu-events-via-runtimepprof-package)
  - [4.4 The optional multiple CPU profiles feature](#44-the-optional-multiple-cpu-profiles-feature)
  - [4.5 The `testing` package](#45-the-testing-package)
  - [4.6 The `net/http/pprof` package](#46-the-nethttppprof-package)
- [5. Empirical evidence on the accuracy and precision of PMU profiles](#5-empirical-evidence-on-the-accuracy-and-precision-of-pmu-profiles)
  - [5.1 The PMU produces precise and accurate profiles for concurrent programs](#51-the-pmu-produces-precise-and-accurate-profiles-for-concurrent-programs)
  - [5.2 The PMU produces precise and accurate profiles for serial programs](#52-the-pmu-produces-precise-and-accurate-profiles-for-serial-programs)
  - [5.3 A practical use](#53-a-practical-use)
- [6. Rationale](#6-rationale)
  - [6.1 Advantages](#61-advantages)
  - [6.2 Disadvantages](#62-disadvantages)
  - [6.3 Alternatives explored](#63-alternatives-explored)
- [7. Compatibility](#7-compatibility)
- [8. Implementation](#8-implementation)
- [9. Open issues](#9-open-issues)
- [10. Acknowledgment](#10-acknowledgment)
- [11. Appendix](#11-appendix)
- [12. References](#12-references)

## 2. Background

In this section, we first provide a high-level description of Go's CPU profiling, motivate the problems in the current profiler with examples, highlight the state of the art in CPU profiling, and then dive deeper into the current implementation of CPU profiling inside Go.
Our proposed changes appear in the next section.
Readers familiar with profiling technology and the internals of pprof may skip this section.

### 2.1 High-level description of Go's CPU profiling

Pprof, the de facto Go CPU profiler, employs sampling-based profiling for time measurement.
It attributes the time metric to different functions, including their calling context (aka call path or backtrace).
Pprof relies on the underlying operating-system timers (exposed by the Go runtime) to periodically interrupt a program's execution.
The sampling rate is typically one hundred hertz (a sample taken every 10ms); a different sampling rate can be configured, but typical systems cannot deliver more than 250 hertz.

When profiling is enabled, the Go runtime scheduler sets up each OS thread in the process to generate periodic timer interrupts.
On each interrupt, the Go runtime's signal handler obtains a backtrace and appends it to a ring buffer.
All OS threads in a Go process append their backtraces to the same ring buffer, coordinated via a lock.
A side thread, launched by the `pprof` package, periodically scrapes the backtraces from the ring buffer and serializes them into a protobuf file to be used by the downstream tools.

### 2.2 Desiderata and deficiencies of Go CPU profiler

*Accuracy* and *precision* are desirable properties of a good measurement tool.

A profiling datum is said to be *accurate* if it is close to the ground truth (aka true value).
For example, if `API_A()` consumes 25% of the total execution time and a profiler attributes 24% of the total execution time to it, the measurement has 96% accuracy.

Profiling data are said to be *precise* if there is low variability among multiple measurements.
The precision of a set of measurements of the same variable is often [expressed](https://www.wikihow.com/Calculate-Precision) either as the min and max deviation from the sample mean or as the [standard error](https://en.wikipedia.org/wiki/Standard_error) or [coefficient of variation](https://en.wikipedia.org/wiki/Coefficient_of_variation) of the sample.

Unfortunately, the current pprof CPU profiles meet neither of these two criteria.

#### 2.2.1 Inaccuracy and its causes

Inaccuracy of profiles arises from two sources: sample size and sampling bias.

#### Sample size

By the [law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers), a larger sample size makes the sample mean closer to the population mean.
Moreover, in profiling, where the sample size is very small compared to the population size, increasing the sample size greatly improves accuracy.
Conversely, a smaller sample size leads to lower accuracy.

The low-resolution OS timer is the primary reason behind the small sample size in a measurement window.
One can obtain more samples either by increasing the length of a measurement window or by increasing the number of samples in a measurement window.
The former works well for regular programs, e.g., code dominated by loops; however, for irregular code, such as microservices, a larger measurement window does not rectify the measurement inaccuracy because more irregularities can easily surface in a larger observation window.
Hence, there is a dire need to collect more samples within a measurement window and obtain fine-grained samples of sub-millisecond executions.
Furthermore, linearly scaling the measurement time to collect more samples is impractical if orders of magnitude more samples are necessary to correct the small sample size issue.

#### Sampling bias

A larger sample size alone does not improve accuracy if the samples are biased.
OS timer samples on some systems (e.g., Linux) are biased because a timer interrupt from one OS thread, say `T1`, may be handled by an arbitrary (not to be confused with uniformly random) thread, say `T2`.
This means `T2` will handle the interrupt and incorrectly attribute the sample to its own call stack, which results in biased samples.
If more timer signals are handled by `T2` compared to `T1`, a [systematic sampling bias](https://en.wikipedia.org/wiki/Observational_error) will occur, which leads to inaccurate profiles.

#### 2.2.2 Imprecision and its causes

Imprecision in measurement comes from at least the following two sources: sample size and measurement skid.

#### Sample size

A smaller number of samples directly contributes to a larger [standard error](https://en.wikipedia.org/wiki/Standard_error).
The low resolution of OS timers is responsible for fewer samples in a measurement window, which results in lower precision of pprof profiles.
Conversely, a higher resolution of samples will improve precision.

#### Measurement skid

The second source of measurement error is the [random error](https://en.wikipedia.org/wiki/Observational_error) inherent in a measurement apparatus.
An OS timer apparatus configured to expire after `N` milliseconds is only guaranteed to generate an interrupt some time after `N` milliseconds -- not precisely at `N` milliseconds.
This randomness introduces a large "skid" in measurement.
Assume a periodic timer set to fire every 10ms, and assume it has 1ms left before its expiry when the OS scheduler with a 4ms resolution inspects it.
This means the timer will fire 4ms later, which is 3ms after the scheduled expiry time.
This results in up to 30% imprecision for a 10ms periodic timer.
Although random error may not be entirely eliminated, a superior measurement apparatus reduces the effects of random error.

### 2.3 Examples demonstrating inaccuracy and imprecision in Go pprof profiles

#### 2.3.1 A concurrent code

Consider the Go program [goroutine.go](https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go), which has ten identical goroutines `f1`-`f10`.
These functions are computationally intensive, long-running, and unaffected by I/O or the memory hierarchy.
The functions are non-preemptive, and hence the Go runtime scheduler overhead is zero.
It is expected that each function takes exactly the same amount of CPU time (i.e., approximately 10% of the overall execution), no matter how many OS threads or CPU cores they run on.

Table 1 summarizes the result of `go tool pprof -top` for this program running on Linux on a 12-core Intel Skylake server-class machine.
The measurements were taken three times under the same configuration.
The `Flat (ms)` column shows the absolute millisecond measurement attributed to each of the ten routines, and the `Flat (%)` column shows the relative time in a routine with respect to the total execution time.
The `Expected (%)` column shows the expected relative time in each routine.
The `Rank order` column assigns ranks in descending order of execution time.
The raw output is available in the [appendix](#11-appendix).

*Table 1: demonstration of inaccuracy and imprecision of pprof profiles using OS timers on a concurrent code.*

First, let us focus on the data in a single run, `RUN 1`.
`f1`-`f10` have a wide variance in the time attributed to them; the expectation is that each of them gets `10%` of the execution time.
There is up to a 6x difference between the time attributed to the routine with the highest attribution (`main.f7` with 4210ms, 23.31%) and the routine with the lowest attribution (`main.f9` with 700ms, 3.88%).
**This demonstrates the poor accuracy (deviation from the ground truth) of pprof timer-based profiles.**
The issue reproduces with different numbers of CPU cores.

Now focus on the three runs, `RUN 1-3`, together.
The time attributed to any routine varies widely from one run to another.
The rank order of the routines changes significantly from run to run.
In `RUN 1`, `main.f7` is shown to run for 4210ms with the rank order of 1, whereas in `RUN 2`, it is shown to run for only 520ms with the rank order of 10.
The expectation is that the measurements remain the same from run to run.
From one run to another, function-level time attributions are not precise; for example, there is a 71% coefficient of variation among the three runs of the `f1` routine, and the average coefficient of variation across all functions is 28%.
**This demonstrates the imprecision (unpredictability of measurement) of pprof timer-based profiles.**

For the curious readers, Table 2 below shows near-perfect accuracy and precision when profiling the same code using the PMU-based profiler described in this proposal.
We show two runs instead of three for brevity.
Additional details appear in [Section 5.1](#51-the-pmu-produces-precise-and-accurate-profiles-for-concurrent-programs).

*Table 2: demonstration of accuracy and precision of pprof profiles via PMU CPU cycles event on a concurrent code.*

#### 2.3.2 A serial code

The issue is not specific to concurrent code alone.
Refer to a carefully crafted Go program -- [`serial.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go) -- which has a single goroutine with complete data dependence from one statement to the next.
It has ten functions: `A_expect_1_82`, `B_expect_3_64`, ..., `J_expect_18_18`.
The functions are intentionally written to consume different amounts of time to make this test case more representative.
Each function is similar and has a loop, but the trip count of the loop in each function is different.
The loop in the function with prefix `A` runs `1*C` times, where `C` is a constant; the loop in the function with prefix `B` runs `2*C` times, and so on.
This arrangement leads to a predictable relative execution time for each function: the function with prefix `A` consumes `1/sum(1..10)=1.82%` of the total execution time, the function with prefix `B` consumes `2/sum(1..10)=3.64%`, and so on, until the function with prefix `J` consumes `10/sum(1..10)=18.18%` of the total execution time.

The expected relative percentage of CPU time is encoded in each function name: `A_expect_1_82` should consume 1.82% of the execution time, `B_expect_3_64` should consume 3.64%, and so on, up to `J_expect_18_18`, which should consume 18.18% of the CPU time.
Table 3 summarizes the output of pprof CPU profiling for this code.
As before, the measurements were taken three times under the same configuration.
The `Flat (ms)` column shows the absolute millisecond measurement attributed to each of the ten functions, and the `Flat (%)` column shows the relative time with respect to the total time taken by all functions.
The `Expected (%)` column shows the expected relative time in each routine.
The `Rank order` column assigns ranks in descending order of the execution time.
The raw output is available in the [appendix](#11-appendix).

*Table 3: demonstration of inaccuracy and imprecision of pprof profiles on a serial code.*

A `-` in a cell represents zero samples.
First, let's focus on just one set of measurements, from `RUN 1`.
Rather than `J_expect_18_18` being the top time consumer with 18.18%, both `J_expect_18_18` and `H_expect_14_546` are the top time consumers, each with 25% of the CPU time.
`B_expect_3_64`, `D_expect_7_27`, `F_expect_10_91`, and `I_expect_16_36` are all lumped into a single rank order of 4, each attributed 6.25% of the execution time.
`E_expect_9_09` does not even appear in the profiles despite its 9% expected execution time.
More inaccuracies are evident in the data in the table.
If we inspect the data from `RUN 2` and `RUN 3`, we notice more discrepancies, where much of the data is missing or incorrect.
The rank ordering and the relative execution percentages both have very low correlation with the ground truth.
Despite tens of milliseconds of execution time, some functions are overcounted and some functions are undercounted.

Now, let's focus on the three runs together.
The highest time consumer, `J_expect_18_18`, does not even get a single sample in `RUN 3`, indicating the imprecision.
`I_expect_16_36`, which has measurement data available in all three runs, varies from 20ms in `RUN 1` with a rank order of 4 to 70ms in `RUN 2` with a rank order of 1.
More discrepancies are visible in the table.
Thus, the serial program also shows run-to-run measurement variation, and hence the profiles are imprecise.

Table 4 below shows the near-perfect accuracy and precision of profiling the same code using the PMU-based profiler described in this proposal.
The measurements were taken two times instead of three for brevity.
Additional details appear in [Section 5.2](#52-the-pmu-produces-precise-and-accurate-profiles-for-serial-programs).

*Table 4: demonstration of accuracy and precision of pprof profiles via PMU CPU cycles event on a serial code.*

### 2.4 State-of-the-art in profiling using performance monitoring units

An alternative to OS timer-based sampling is using hardware performance counters, which are ubiquitous in modern commodity CPUs.
The CPU's Performance Monitoring Units (PMUs) have the capability to count various events such as cycles, instructions, and cache misses, to name a few.
Furthermore, PMUs can be configured to deliver an interrupt when an event reaches a specified threshold value.
These PMU interrupts, like the OS timer interrupts, can serve as another source of sampling the call stack.
This technique is employed in many modern profilers [1,2,4].
PMU sampling is called "event-based sampling," where the period (interval between two interrupts) is expressed as the number of occurrences of an event, not the number of timer ticks.
As a side note, in event-based sampling, the sampling overhead is proportional to the number of events irrespective of the length of execution, whereas in timer-based sampling, the overhead is proportional to the length of execution.
If there are more events in a short time span, there is also more interest in observing that region of execution deeply, and event-based sampling takes more samples there.
More overhead is incurred in places where there is interest (more events) and low/no overhead otherwise.
CPU cycles is one of the most common PMU events, and it occurs periodically, similar to OS timer ticks.
Hence, using CPU cycles as a substitute for OS timers is very attractive.

### 2.5 Low-level details of the current Go CPU profiler

We now provide the current implementation details of CPU profiling in Go, which are necessary to understand the changes we propose in the next section.
The description is relevant for most Unix-based systems, especially Linux; most of the discussion is irrelevant for Windows since it does not use signals for handling CPU profiling.

*Figure 1: A pictorial representation of the OS timer sampling workflow in the existing implementation.
The first line of each record box is the function name.
The remaining lines provide a textual description of the steps taken by that function.*

Currently, `pprof.StartCPUProfile(io.Writer)` is the entry point for starting CPU profiling.
Analogously, `pprof.StopCPUProfile()` stops a running profile.
Two concurrent CPU profiling sessions are disallowed via a lock-protected variable, `cpu.profiling`, in pprof.
The `pprof` package interacts with the `runtime` package to start/stop profiling via the `runtime.SetCPUProfileRate(hz int)` API.

There are two critical pieces of work done inside `pprof.StartCPUProfile(io.Writer)`.
First, it invokes `runtime.SetCPUProfileRate(hz=100)`, asking the runtime to start profiling; the rest of the profile generation is handled by the `runtime` package; the integer argument is the sampling rate in Hz (samples per second).

Second, in order to collect the profiles generated by the runtime, `pprof.StartCPUProfile(io.Writer)` launches a goroutine, `profileWriter(io.Writer)`.
`profileWriter`:

1. creates a protocol buffer via `newProfileBuilder`,
2. goes into a sleep-wakeup loop calling `runtime.readProfile()`, which provides a chunk of the profiles collected inside the runtime in a ring buffer (`eof` if profiling has stopped), and
3. writes to the sender end of a channel, `cpu.done`, once `runtime.readProfile()` returns `eof`.
The other side of the channel is waited upon by `pprof.StopCPUProfile()`.

`runtime.SetCPUProfileRate(int)` is the entry point to start/stop profiling in the `runtime` package.
It:

1. creates a ring buffer, `cpuprof.log`,
2. appends profile header information into the ring buffer,
3. invokes `setcpuprofilerate(hz)`, which
    1. calls `setThreadCPUProfileRate(hz=0)` to stop any ongoing timer,
    2. invokes `setProcessCPUProfiler(hz)`, which installs the `SIGPROF` handler, and
    3. records the current sampling rate `hz` into the scheduler variable `sched.profilehz` to be visible to other OS threads, and
4. calls `setThreadCPUProfileRate(hz)`, which uses an OS timer (such as `setitimer`) to start the sampling engine.
This step is performed only if the `hz` value is non-zero.

Inside the scheduler, the `execute()` method schedules a goroutine (also known as `G` in the Go runtime scheduler) to run on the current OS thread (also known as `M` in the Go runtime scheduler).
This method compares the current value of the scheduler-recorded `sched.profilehz` variable against its own snapshot (`m.profilehz`) of the previously recorded value of `sched.profilehz`, and if they differ, it reconfigures its timer by calling `setThreadCPUProfileRate(sched.profilehz)`.
This functionality sets an OS timer on behalf of the current OS thread in the process.

Once armed, the timer periodically delivers a `SIGPROF` signal, handled by `sighandler(sig int, info *siginfo, ctxt unsafe.Pointer, gp *g)`, which in turn invokes `sigprof()`, which

1. obtains a backtrace, and
2. appends the backtrace into the ring buffer via `cpuprof.add()`.

Occasionally, a sampled program counter may fall into non-Go space, or the ring buffer may be full; such samples are simply counted, and a backtrace is not taken.

The ring buffer is coordinated via a `signalLock` for concurrency control.
The edges marked with "signalLock" in Figure 1 show the operations that take the lock to access the ring buffer.

`pprof.StopCPUProfile()` follows a very similar path to `pprof.StartCPUProfile(io.Writer)`.
When the `hz` argument is zero, the runtime APIs described previously (e.g., `SetCPUProfileRate(hz)`) interpret it as the end of profiling.
`pprof.StopCPUProfile()` performs the following steps:

1. calls `runtime.SetCPUProfileRate(0)` to stop profiling,
2. waits on the `cpu.done` channel, which is flagged by the side thread once `eof` is reached in the runtime ring buffer, and
3. closes the supplied `io.Writer`.

`runtime.SetCPUProfileRate(hz)`, when called with `hz=0`, invokes `setcpuprofilerate(0)` and closes the ring buffer.
`setcpuprofilerate(0)`, in turn,

1. calls `setProcessCPUProfiler(0)`, which restores the previous signal handler for `SIGPROF`,
2. resets `sched.profilehz=0`, indicating globally that profiling has stopped, and
3. calls `setThreadCPUProfileRate(0)` to stop sampling on the current OS thread.

Subsequently, the other `M`s notice the change in the global `sched.profilehz` and turn off their timers when they run their `execute()` method.

The direct users of pprof are:

1. The `testing` package, which can take a `-cpuprofile <file>` flag to collect profiles.
2. The `net/http/pprof` package, which can collect CPU profiles on the `/debug/pprof/profile` endpoint with an optional argument `?seconds=<N>`.
3. Third-party libraries such as `pkg/profile`.

## 3. Problem statement

*Pprof CPU profiles are inaccurate because the profiling resolution is low and sampling is biased in OS timers.*

*Pprof CPU profiles are imprecise because the profiling resolution is low and the samples have a large skid due to OS timers.*

*Pprof CPU profiles measure only the time metric; additional events should be monitored to help performance profiling.*

## 4. Proposal

In this proposal, we address the concerns -- small sample size, sampling bias, measurement error, and the need for more metrics -- that were raised in the background section.
**Sample size.** First, this proposal addresses the problem of the low sampling resolution of OS timers by using high-resolution samples from hardware performance monitoring units (PMUs) to generate more samples.
While OS timers have ~10ms resolution (that is, one sample every 10ms), sampling CPU cycles via PMUs can deliver (if desired) a sample every few microseconds -- a three-orders-of-magnitude finer resolution.
This fine-grained measurement increases the sample size and, as a result, alleviates both the inaccuracy and imprecision problems.
While PMUs can deliver 100x more samples with just ~13% measurement overhead, the alternative of lengthening a measurement window by 100x to collect the same number of samples via OS timers would be impractical.

**Sampling bias.** Second, the proposal arranges for each sample to be delivered to the OS thread causing the counter overflow, which eliminates the sampling bias and improves accuracy.

**Random error (skid).** Third, PMU samples alleviate the large sampling skid seen in OS timers because PMUs allow low-skid or skid-free precise sample attribution.

**More metrics.** Finally, the PMU's capability to monitor various events such as cache misses, TLB misses, branch mispredictions, and inter-core cache line ping-ponging, to name a few, offers additional insights when diagnosing complex performance issues.

Implementing this idea calls for extending the Go runtime package's CPU profiling capabilities and making the necessary adjustments in the `runtime/pprof`, `testing`, and `net/http/pprof` packages.
We first describe the details of the changes to the `runtime` package and then describe the details of the `pprof`, `testing`, and `net/http/pprof` packages, in that order.

Our design brings the full capability of PMUs into the `runtime` package but exposes only a small amount of it via the `pprof` package to users.
For example, the runtime might be able to configure whether it wants to monitor the CPU cycles spent inside the kernel; such configurations will not be exposed (at least in the initial design) to the users of the `pprof` package.

### 4.1 High-level design

In its simplest form, we substitute the start/stop of an OS interval timer with the start/stop of a PMU event configured in sampling mode; sampling mode delivers an interrupt each time a hardware counter overflows, which is analogous to the OS timer interrupt.
The interrupt generated by a PMU counter overflow follows the same signal handling path as the interval timer.
This policy allows collecting and appending a backtrace to a ring buffer on each PMU interrupt, just like the OS timer.
Moreover, like the OS timer, each OS thread (`M`) checks whether any changes were made to the PMU-based profiling during each scheduling activity and performs the necessary adjustments to reconfigure the PMU.
Finally, we retain `SIGPROF` as the signal type delivered -- for both OS timer interrupts and PMU counter overflow interrupts -- instead of hijacking another signal number for PMU-based profiling.

### 4.2 The devil is in the details

This seemingly straightforward approach poses implementation challenges related to retaining compatibility with the interval timer and incorporating additional PMU features.
The first part of this section addresses those challenges.
Other than the design described in the previous subsection, the rest of the details are closer to implementation details.
As before, the discussion is only relevant to Linux.

PMUs can count multiple CPU events simultaneously (e.g., cycles, instructions, and cache misses).
Our proposal does not preclude the runtime from simultaneously enabling multiple PMU events, or even PMU events alongside an OS timer event.
In fact, certain events are best used when measured with others.
For example:

1. Cache misses measured alongside cache hits enable computing the cache miss ratio.
2. Remote-core cache line transfers measured alongside total cache misses provide a view of the cause of cache misses.

In our proposal, we will limit the maximum number of concurrent CPU events to `_CPUPROF_EVENTS_MAX` (`EV_MAX` for short in the rest of this document).
We will define a few common portable events for which the user does not have to look up event codes in the CPU manufacturer's manual.
This closely resembles the preset events of `perf events` in Linux.

In order to enable advanced users to exploit the full capability of the PMUs, we define an opaque RAW event that allows users to specify a model-specific hexadecimal event number to be sampled.
A motivating example of such a raw event is the remote [HITM](https://software.intel.com/sites/default/files/managed/8b/6e/335279_performance_monitoring_events_guide.pdf?ref=hvper.com) event on Intel processors; this event counts the number of times a cache line was loaded while the cache line was in the modified state in another remote CPU's cache.
This event helps diagnose problems such as cache line false sharing; we have observed this event to be helpful in identifying false sharing even inside the Go runtime.

We have come up with the following initial set of PMU events.
These events will be exposed by the `runtime` package to the `pprof` package.
The `pprof` package will wrap these constant values with matching function names, to be presented later in the `pprof` package subsection.
Notice that the OS timer is one of the CPU events.
Last-level cache events are included to capture the behavior of the CPU's memory hierarchy.
Branch events are included to capture potentially wasted CPU cycles.
The design discussions can refine this list further (e.g., whether TLB events are needed); the events are easy to grow based on requests.
+ +``` +type cpuEvent int32 + +// These constants are a mirror of the runtime +// These constants are a mirror of the runtime +const ( _CPUPROF_OS_TIMER, _CPUPROF_FIRST_EVENT cpuEvent = iota, iota + _CPUPROF_HW_CPU_CYCLES, _CPUPROF_FIRST_PMU_EVENT + _CPUPROF_HW_INSTRUCTIONS cpuEvent = iota + _CPUPROF_HW_CACHE_REFERENCES + _CPUPROF_HW_CACHE_MISSES + _CPUPROF_HW_BRANCH_INSTRUCTIONS + _CPUPROF_HW_BRANCH_MISSES + _CPUPROF_HW_RAW + _CPUPROF_EVENTS_MAX + _CPUPROF_LAST_EVENT = _CPUPROF_EVENTS_MAX-1 +) +``` + +Since PMU support may not be available on all platforms in all settings, we +need to make it configurable in the Go runtime. +If an OS supports PMU-based +profiling (e.g., Linux), it will dynamically register two functions during the +runtime startup. + + +1. `setThreadPMUProfilerFunc`: this is the OS-specific function used to +start/stop a PMU event analogous to `setThreadCPUProfile`. +If`setThreadPMUProfilerFunc` is `nil`, the runtime will not attempt to +start/stop PMU profiles even if it is requested to do so. +However, this +situation will not arise because `runtime/pprof` will guard against it. +2. `sigprofPMUHandlerFunc`: this is the OS-specific function called from the +OS-specific signal handler when the PMU delivers an interrupt. + +Currently, each `M` (type `m`) and the scheduler (type `schedt`) maintains a +`profilehz` variable. +In lieu of `profilehz` we’ll introduce a runtime structure +`cpuProfileConfig struct `, which captures the essential configurations needed +to start/stop either a PMU counter or an OS timer. +`cpuProfileConfig` is +designed to reflect the underlying `perf_event_attr` struct in Linux, but being +an internal data structure we can modify it, if needed in the future, without +affecting any public-surface API. + +``` +type cpuProfileConfig struct { + hz uint64 // the sampling rate, used only for OS_TIMER. + period uint64 // the sampling interval, used for PMU events. 
	rawEvent uint64 // the hexadecimal value of the CPU-specific event;
	                // used only with the _CPUPROF_HW_RAW cpuEvent type.
	preciseIP profilePCPrecision // the level of precision for the program counter in PMU samples.
	isSampleIPIncluded bool // should the sample include the PC?
	isSampleAddrIncluded bool // should the sample include the memory address involved?
	isKernelIncluded bool // should the kernel count events in kernel mode?
	isHvIncluded bool // should the kernel count events in hypervisor mode?
	isIdleIncluded bool // should the kernel count events when the idle task is running?
	isSampleCallchainIncluded bool // should the sample report the call chain?
	isCallchainKernelIncluded bool // should the sample include the kernel call chain?
	isCallchainUserIncluded bool // should the sample include the user call chain?
}
```

Some PMUs can be configured so that they offer the exact program counter involved in the counter overflow.
`profilePCPrecision` dictates the level of precision needed from the PMU, as described by the constants below.
If a PMU is configured to provide the precise program counter for a sample, we'll substitute the leaf-level program counter in the backtrace with the one presented by the PMU.
Please refer to [Section 6.2](#62-disadvantages) for a discussion of PMU precision.

```
type profilePCPrecision uint8

const (
	_CPUPROF_IP_ARBITRARY_SKID profilePCPrecision = iota
	_CPUPROF_IP_CONSTANT_SKID
	_CPUPROF_IP_SUGGEST_NO_SKID
	_CPUPROF_IP_NO_SKID
)
```

`pprof` will set the default to `_CPUPROF_IP_ARBITRARY_SKID` to work across architectures and event types.
However, we can easily change it to the highest supported level by probing the CPU, as done in
[certain](https://github.com/HPCToolkit/hpctoolkit/blob/master/src/tool/hpcrun/sample-sources/perf/perf_skid.c#L176)
profilers.

`cpuProfileConfig` will be visible to the `pprof` package but will not be exposed to its clients.
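To make the relationship between the `hz` field (which drives the OS timer) and the `period` field (which drives PMU events) concrete, here is a small arithmetic sketch; the helper name `cyclesPeriodForRate` is ours for illustration and is not part of the proposed API.

```go
package main

import "fmt"

// cyclesPeriodForRate converts a desired sampling rate (samples/sec)
// into a cycles-event period for a core running at clockHz.
// A rate is "how often per second"; a period is "how many events apart",
// so the two are reciprocal once the clock frequency is fixed.
func cyclesPeriodForRate(clockHz, samplesPerSec uint64) uint64 {
	return clockHz / samplesPerSec
}

func main() {
	// 100 samples/sec on a 2.4 GHz core: one sample every 24M cycles.
	fmt.Println(cyclesPeriodForRate(2400000000, 100)) // 24000000
}
```

These are the same numbers used in the overhead discussion later in this document (a 24000000-cycle period on a 2.4GHz Skylake machine yields roughly 100 samples/sec).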
The fields in `cpuProfileConfig` control how a PMU event can be customized.
Pprof will set the field values to the lowest common denominator that is expected to work across many architectures and security settings.
For example, `isKernelIncluded` will most likely be `false`.
The runtime will be (mostly) agnostic to these values and will pass them down to the OS-specific layers that handle programming the PMUs.

The `runtime` package will expose an API `runtime_pprof_setCPUProfileConfig(cpuEvent, *cpuProfileConfig)`, only to `pprof`.
This API is the equivalent of `SetCPUProfileRate(hz int)`.
Since the legacy `SetCPUProfileRate(hz int)` cannot be dropped, we alter its internals to wrap the `hz` value into a `cpuProfileConfig` structure and pass it down to `runtime_pprof_setCPUProfileConfig` for a uniform treatment of any event, whether an OS-timer event or a PMU event.

The `runtime` can be bifurcated into OS-agnostic and OS-specific parts.

The OS-agnostic part of the runtime will have the following duties:

1. Keep track of the CPU events currently running.
2. Handle new start/stop invocations of the OS timer or PMU events, one at a time.
3. Allocate and coordinate access to the ring buffers associated with each of the running CPU events.
4. Reconfigure the CPU events for an `M` during its scheduling based on whether its current snapshot differs from the global configuration stored in the scheduler.

The OS-specific part of the runtime will have the following duties:

1. Install/restore the profiling signal handler.
2. Start/stop OS timer or PMU events.
3. Handle the profiling signals and pause the PMUs during signal handling.
4. Map a delivered interrupt to the appropriate active CPU profiling event and record the backtrace into the appropriate ring buffer.
Since there can be `_CPUPROF_EVENTS_MAX` concurrent CPU profiles, we need to change some of the runtime data structures that previously assumed a single CPU profile.
Importantly,

1. We will change the `profilehz` field in the `schedt` struct to be an array of `cpuProfileConfig` pointers.
2. We will change the `profilehz` field in the `m` struct to be an array of `cpuProfileConfig` pointers.
3. We will introduce an array of opaque pointers, the `cpuProfileHandle` field, in the `m` struct.
This will be OS-specific information stored in each `M`.
For example, on Linux, this will hold the file handles obtained from `perf_event_open` system calls and the corresponding `mmap` ring buffers where the kernel populates the sample data.
This information will be used to close the resources once profiling is stopped on an `M`.

We’ll replace the `hz` argument used in various runtime APIs involved in CPU profiling with the following two arguments:

1. `eventId cpuEvent` (described in the constants previously), and
2. `profConfig *cpuProfileConfig`.

Figure 2 shows the call graph for the new workflow.
We abbreviate `eventId` as `i` and `cpuProfileConfig` as `p` in most places.

*Figure 2: A pictorial representation of the OS timer-based and PMU-based sampling workflow.
The first line of each record box is the function name.
The remaining lines provide a textual description of the steps taken by that function.*

A `nil` value of `cpuProfileConfig` implies stop profiling for the specified `cpuEvent`, and a non-`nil` value implies start profiling for the specified `cpuEvent`.
At the runtime level, `runtime_pprof_setCPUProfileConfig()` makes a copy of the `cpuProfileConfig` passed from `pprof`, so that inside the runtime we can perform fast pointer comparisons, relying on the fact that two large `cpuProfileConfig` structs are equal if the pointers to them are equal.

Starting/stopping a PMU counter is different from starting/stopping an OS timer.
Hence, `if eventId == _CPUPROF_OS_TIMER` checks will appear at a few places inside the runtime’s profiling-related code.

A few concurrency aspects deserve a mention.

1. Each active event will create its own ring buffer on demand.
There will be a maximum of `EV_MAX` ring buffers.
Each ring buffer will be created only if there is a corresponding active event.
The size of each ring buffer is 16KB.
If all `EV_MAX` events are active at the same time, it will result in `16KB * 8 = 128KB` of ring buffers, which is paltry on modern machines.
2. We’ll change the `cpuprof` variable in `cpuprof.go` to be an array of `EV_MAX` size.
3. `runtime_pprof_setCPUProfileConfig` will hold a lock only for `cpuprof[i]`, where `i` is the event id for which the profile has to be started or stopped.
4. There will be a single `signalLock` rather than `EV_MAX` `signalLock`s.
This is because the `setcpuprofileconfig` method in `cpuprof.go` needs to check (read) all `cpuprof[eventId].config`s to infer whether all events are stopped before uninstalling the signal handler.
Taking `EV_MAX` locks one after another would be no better than holding just one lock.
The consequence is that we serialize simultaneous writes to two different ring buffers.
This should not be a serious issue since there cannot be two concurrent `SIGPROF`s.
If a system supports concurrent `SIGPROF` handling, and if there are a lot of threads in the system, we can revisit this choice.
5. In the very rare case when a `SIGPROF` is delivered for an already running event while another event is being set (which will hold the `signalLock`), the signal handler recognizes it and drops the sample; otherwise, a deadlock would ensue.

At the lowest level, we will incorporate the following three system calls into all `runtime/sys_linux_<arch>.s` files.

1. `perf_event_open`, which is Linux’s means to program PMUs.
The Go signature will be `func perfEventOpen(attr *perfEventAttr, pid uintptr, cpu, groupFd int32, flags uintptr) int32`.
2. `ioctl`, which is needed to stop the PMU counter when handling the `SIGPROF`, and to reset and restart it after handling the signal.
The Go signature will be `func ioctl(fd, req int32, arg uintptr) int32`.
3. `fcntl`, which is necessary to arrange for a `perf_event` file handle to deliver the `SIGPROF` signal, and also to set up the signal to be delivered exactly to the `M` (OS thread) whose PMU counter overflowed.
The Go signature will be `func fcntl(fd, cmd int32, arg uintptr) int32`.

#### 4.2.1 Segregating `SIGPROF` from different sampling sources

All PMU events and the OS timer use the same `SIGPROF` signal to deliver their interrupts.
Hence, the Unix signal handler needs to address the following concerns.

1. Identify whether a signal was delivered as a result of the OS timer expiring or a PMU counter overflowing.
This is necessary because the subsequent processing will differ based on the event.
2. Identify which PMU counter caused the interrupt, when multiple PMU events are being sampled at the same time.
This is necessary because the subsequent processing needs to append the backtraces into the buffer corresponding to the PMU event that triggered the interrupt; different PMU events do not share a ring buffer.

Although both an OS timer expiry and any PMU counter overflow result in the same `SIGPROF` signal, the signal codes differ.
A PMU counter overflow generates a `POLL_IN` signal code indicating the presence of data, whereas OS timer interrupts generate a `SI_KERNEL` or `SI_ALARM` signal code.
We use this information to address concern #1 above.
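A sketch of this discrimination is shown below. The `si_code` values are taken from Linux’s `asm-generic/siginfo.h`; the helper `sigprofSource` is hypothetical, and the real handler would read the code from the `siginfo_t` delivered with the signal rather than take it as an argument.

```go
package main

import "fmt"

// Linux si_code values relevant here (from asm-generic/siginfo.h).
const (
	siKernel = 0x80 // SI_KERNEL: kernel-generated, e.g. setitimer expiry
	pollIn   = 1    // POLL_IN: data available on an fd (perf_event overflow)
)

// sigprofSource classifies a SIGPROF by its si_code, mirroring concern 1:
// distinguishing OS-timer expiry from PMU counter overflow.
func sigprofSource(siCode int32) string {
	switch siCode {
	case pollIn:
		return "pmu" // proceed to identify which counter's fd fired
	case siKernel:
		return "os-timer" // legacy OS-timer sample path
	default:
		// Unknown origin; a conservative handler would drop the sample.
		return "unknown"
	}
}

func main() {
	fmt.Println(sigprofSource(pollIn))   // pmu
	fmt.Println(sigprofSource(siKernel)) // os-timer
}
```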
Once we know that the signal was generated due to some PMU counter overflowing, we use the file descriptor `fd` passed in the signal info and match it against the opened perf events file descriptors on the same `M`, and use the matching location to infer the PMU event `i` that caused the signal; the rest of the signal handling makes use of this `i` to record the backtrace in the appropriate ring buffer.
During the signal handling, we’ll pause all active PMU counters via `ioctl` calls on their `fd`s so that the signal handling is not counted towards the PMU event counters.

### 4.3 Exposing PMU events via `runtime/pprof` package

We now turn our attention to the `runtime/pprof` package, which exposes the PMU sampling to its clients.
There are several choices.
There is tension between a revolutionary and an evolutionary solution; there is friction between exposing the full PMU functionality and offering the bare minimum.
In our proposal, we hope to make pragmatic choices among the myriad options.

We propose adding a new API, which accepts the different configurations needed to start any CPU event -- the OS timer and PMU counters.
The API signature is shown below.

```
pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts ...ProfilingOption) error
```

The optional variadic `ProfilingOption` allows more than one event to be started during a profiling session.

A `ProfilingOption` is an interface exposed by `runtime/pprof` as shown below.

```
type ProfilingOption interface { apply() error }
```

Pprof will implement the following profiling option functions that mirror the various `cpuEvent` constants previously presented.

```
func OSTimer(w io.Writer) ProfilingOption
func CPUCycles(w io.Writer, period uint64) ProfilingOption
func CPUInstructions(w io.Writer, period uint64) ProfilingOption
func CPUCacheReferences(w io.Writer, period uint64) ProfilingOption
func CPUCacheMisses(w io.Writer, period uint64) ProfilingOption
func CPUBranchInstructions(w io.Writer, period uint64) ProfilingOption
func CPUBranchMisses(w io.Writer, period uint64) ProfilingOption
func CPURawEvent(w io.Writer, period uint64, hex uint64) ProfilingOption
```

All of them consume an `io.Writer`.
`OSTimer` accepts no other argument.
The `period` accepted by all PMU events specifies the number of events that must elapse between two consecutive interrupts.
A larger value means lower granularity (the opposite of Hz).
Passing a zero value will result in using a preset period.
`CPURawEvent` is special; it accepts any user-provided hexadecimal PMU event code.
Notice that each event requires its own `io.Writer`, since different profile types cannot be serialized into the same pprof protocol buffer.
In the cases where multiple `ProfilingOption`s are passed to `pprof.StartCPUProfileWithConfig`, sanity checks ensure the uniqueness of each `cpuEvent` and each `io.Writer`.

Below are a few example usages of the API.

```
// Simple
StartCPUProfileWithConfig(OSTimer(w))
StartCPUProfileWithConfig(CPUCycles(w, period))
StartCPUProfileWithConfig(CPURawEvent(w, period, hexadecimalEvent))

// Advanced
StartCPUProfileWithConfig(CPUCacheReferences(w1, period1), CPUCacheMisses(w2, period2))
```

The implementation of `StartCPUProfile(writer)` will simply invoke `StartCPUProfileWithConfig(OSTimer(writer))`.
The implementation of `StartCPUProfileWithConfig` will ensure a maximum of `EV_MAX` elements, unique `io.Writer`s, and unique `cpuEvent`s.
After sanitizing the arguments, it will create a `cpuProfileConfig` for each `cpuEvent` requested to be profiled.
In a global data structure `cpu`, we will maintain the active events, their `*cpuProfileConfig`s, and the corresponding `io.Writer`s.

```
var cpu struct {
	sync.Mutex
	profiling    bool
	activeConfig [EV_MAX]*cpuProfileConfig
	activeWriter [EV_MAX]io.Writer
	eventName    [EV_MAX]string
	done         chan bool
}
```

Finally, `StartCPUProfileWithConfig` will invoke `runtime_pprof_setCPUProfileConfig` multiple times, once for each active event, to request the `runtime` to actually configure either the OS timer or the PMU with a specific event to start sampling.
A lock will be held throughout the invocation of `StartCPUProfileWithConfig`, and hence no two `StartCPUProfileWithConfig`/`StartCPUProfile` invocations run concurrently.
Once a `StartCPUProfileWithConfig`/`StartCPUProfile` session is active, another profiling session is disallowed until `StopCPUProfile` is invoked.

Before successfully returning from `StartCPUProfileWithConfig`, we'll launch a goroutine `profileWriter`, which will periodically wake up, scrape runtime-produced stack samples from the ring buffers associated with each of the active events, and serialize them into the corresponding `io.Writer`s.
This follows the same design as the original pprof but adds the responsibility of collecting profiles from multiple events.
Creating one goroutine to scrape each active event would be wasteful; a single goroutine suffices to collect backtraces from all ring buffers.

### 4.4 The optional multiple CPU profiles feature

Allowing multiple CPU profiles in a single profiling session via the `pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts ...ProfilingOption)` API is a power-user feature.
To avoid any accidental misuse, we'll protect it behind an environment variable `GO_PPROF_ENABLE_MULTIPLE_CPU_PROFILES=<true|false>`, which will be disabled by default.

### 4.5 The `testing` package

We’ll add the following two command line arguments to the testing package:

1.
`-cpuprofileevent=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>`

2. `-cpuprofileperiod=<Int64Value>`

The default value of `-cpuprofileevent` is `timer` with a 100Hz rate when only `-cpuprofile` is passed, which ensures compatibility with the current behavior.
`-cpuprofileperiod` will be ignored for `-cpuprofileevent=timer`, which will default to a 100Hz rate as is the case currently.
The testing package will map the user-specified string for the `-cpuprofileevent` value to the appropriate `pprof.ProfilingOption` and the `-cpuprofileperiod` to an integer, and invoke `pprof.StartCPUProfileWithConfig()`.
For example, `-cpuprofileevent=cycles` and `-cpuprofileperiod=100000` will result in `pprof.StartCPUProfileWithConfig(CPUCycles(w, 100000))`.

### 4.6 The `net/http/pprof` package

We’ll add the following two additional arguments to the `http` request exposed by the `net/http/pprof` CPU profiling endpoint.

1. `event=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>`

2. `period=<Int64Value>`

`period` will be ignored for `event=timer`, which will default to a 100Hz rate as is the case currently.
The default value of `event` is `timer` with 100Hz, which ensures compatibility with the current behavior.
The `net/http/pprof` package will map the user-specified string for the `event` value to an appropriate `pprof.ProfilingOption` and `period` to an integer, and invoke `pprof.StartCPUProfileWithConfig`.

## 5. Empirical evidence on the accuracy and precision of PMU profiles

### 5.1 The PMU produces precise and accurate profiles for concurrent programs

Below we show the `go tool pprof -top` output of the previously shown
[`goroutine.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go)
concurrent program run two times.
The profiles used the proposed new API with a 1M sampling period: `pprof.StartCPUProfileWithConfig(CPUCycles(w, 1000000))`.
The CPU cycles attributed to each routine `f1`-`f10` match the expectation (approximately 10% of the total execution time) within one run and across multiple runs.
*Thus, the profiles are accurate and precise.
The measurement overhead is indistinguishable from the OS timer in this setting.*

#### (PMU Cycles profile) goroutine.go/Run 1:

```
File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:49pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
       flat  flat%   sum%         cum   cum%
23400000000 10.00% 10.00% 23400000000 10.00%  main.f1
23400000000 10.00% 20.00% 23400000000 10.00%  main.f10
23400000000 10.00% 30.00% 23400000000 10.00%  main.f2
23400000000 10.00% 40.00% 23400000000 10.00%  main.f3
23400000000 10.00% 50.00% 23400000000 10.00%  main.f4
23400000000 10.00% 60.00% 23400000000 10.00%  main.f5
23400000000 10.00% 70.00% 23400000000 10.00%  main.f6
23400000000 10.00% 80.00% 23400000000 10.00%  main.f7
23400000000 10.00% 90.00% 23400000000 10.00%  main.f8
23400000000 10.00%   100% 23400000000 10.00%  main.f9
```

#### (PMU Cycles profile) goroutine.go/Run 2:

```
File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:51pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
       flat  flat%   sum%         cum   cum%
23800000000 10.17% 10.17% 23800000000 10.17%  main.f1
23500000000 10.04% 20.21% 23500000000 10.04%  main.f7
23500000000 10.04% 30.26% 23500000000 10.04%  main.f9
23400000000 10.00% 40.26% 23400000000 10.00%  main.f10
23400000000 10.00% 50.26% 23400000000 10.00%  main.f2
23400000000 10.00% 60.26% 23400000000 10.00%  main.f4
23400000000 10.00% 70.26% 23400000000 10.00%  main.f6
23400000000 10.00% 80.26% 23400000000 10.00%  main.f8
23300000000  9.96% 90.21% 23300000000  9.96%  main.f3
22900000000  9.79%   100% 22900000000  9.79%  main.f5
```

### 5.2 The PMU produces precise and accurate profiles for serial programs

Below we show the `go tool pprof -top` output of the previously shown
[`serial.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go)
program run two times.
We remind the reader that the expected relative execution time is encoded in the function names themselves (`J_expect_18_18` should consume 18.18% of the CPU cycles, and so on).
The profiles used the proposed new API with a 1M sampling period: `pprof.StartCPUProfileWithConfig(CPUCycles(w, 1000000))`.
The CPU cycles attributed to each routine match the expectation within a run and across multiple runs.
*Thus, the profiles are accurate and precise.*

#### (PMU Cycles profile) serial.go/Run 1:

```
File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
      flat  flat%   sum%        cum   cum%
 200000000 18.10% 18.10%  200000000 18.10%  main.J_expect_18_18
 183000000 16.56% 34.66%  183000000 16.56%  main.I_expect_16_36
 165000000 14.93% 49.59%  165000000 14.93%  main.H_expect_14_546
 137000000 12.40% 61.99%  137000000 12.40%  main.G_expect_12_73
 120000000 10.86% 72.85%  120000000 10.86%  main.F_expect_10_91
 100000000  9.05% 81.90%  100000000  9.05%  main.E_expect_9_09
  82000000  7.42% 89.32%   82000000  7.42%  main.D_expect_7_27
  63000000  5.70% 95.02%   63000000  5.70%  main.C_expect_5_46
  37000000  3.35% 98.37%   37000000  3.35%  main.B_expect_3_64
  18000000  1.63%   100%   18000000  1.63%  main.A_expect_1_82
         0     0%   100% 1105000000   100%  main.main
         0     0%   100% 1105000000   100%  runtime.main
```

#### (PMU Cycles profile) serial.go/Run 2:

```
File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
      flat  flat%   sum%        cum   cum%
 200000000 18.10% 18.10%  200000000 18.10%  main.J_expect_18_18
 183000000 16.56% 34.66%  183000000 16.56%  main.I_expect_16_36
 159000000 14.39% 49.05%  159000000 14.39%  main.H_expect_14_546
 142000000 12.85% 61.90%  142000000 12.85%  main.G_expect_12_73
 119000000 10.77% 72.67%  119000000 10.77%  main.F_expect_10_91
 100000000  9.05% 81.72%  100000000  9.05%  main.E_expect_9_09
  82000000  7.42% 89.14%   82000000  7.42%  main.D_expect_7_27
  61000000  5.52% 94.66%   61000000  5.52%  main.C_expect_5_46
  40000000  3.62% 98.28%   40000000  3.62%  main.B_expect_3_64
  18000000  1.63% 99.91%   18000000  1.63%  main.A_expect_1_82
   1000000  0.09%   100% 1105000000   100%  main.main
         0     0%   100% 1105000000   100%  runtime.main
```

### 5.3 A practical use

We use pprof to collect production CPU profiles at Uber.
On one of our Go microservices, when we used the default OS timer-based CPU profiles, we saw that `mutex.Lock()` accounted for 15.79% of the CPU time and `mutex.Unlock()` accounted for 13.55%.
We knew from the program structure that `mutex.Lock()` and `mutex.Unlock()` were invoked in pairs, one after another, and also that `mutex.Lock()` is significantly more heavyweight than `mutex.Unlock()`; thus, we were expecting much less time to be spent inside `mutex.Unlock()` relative to `mutex.Lock()`.
The OS-timer profiling data did not match our expectations.
However, when we used a prototype of the PMU-based profiler and sampled the CPU cycles event, the quantification changed and matched our expectations: `mutex.Lock()` accounted for 33.36% of the CPU cycles and `mutex.Unlock()` accounted for 7.57% of the CPU cycles.
The PMU profiles avoided an unnecessary performance investigation.

## 6. Rationale

### 6.1 Advantages

CPU profiling via PMUs results in richer-quality profiles -- high-resolution measurement, higher accuracy, and higher precision.
Access to various PMU events, such as cache misses, allows better insight into the causes of complex performance issues.
Any kind of PMU event can be attributed to call stack samples, which is useful for providing actionable feedback to the developer.
Incorporating the PMU facility in the runtime will allow the Go profiler to take full advantage of the progress happening in the hardware performance monitoring facilities available in modern CPUs.
This framework will allow for independently developing more advanced profiling features, such as attributing high-latency data accesses to Go objects, which can help refactor code or data structures for better data locality [5, 6, 7].

### 6.2 Disadvantages

1. PMU-based profiling may not work when a Go program is running under a virtual machine, since virtual machines may not expose the hardware performance counters to a guest OS.
Hence, the proposal keeps the OS timer as the default and makes PMU-based sampling available only when it is explicitly specified.
2. A higher sampling rate (a shorter period between samples) obviously incurs higher overhead.
For example, the default timer with a 100Hz sampling rate introduces 3.6% measurement overhead on a 2.4GHz Intel Skylake machine.
Using CPU cycles with a period of 24000000 cycles, which amounts to the same 100 samples/second, introduces the same 3.6% overhead.
The table below shows overheads at other sampling periods using the PMU cycles event and also compares against the default OS-timer profiler overhead.

```
          The new PMU-based profiler with cycles event
 -----------------------------------------------------
 Sampling period in CPU cycles               Overhead
 -----------------------------------------------------
 24000000 (100 samples/sec)                  1.036x
 2400000  (1K samples/sec)                   1.036x
 240000   (10K samples/sec)                  1.13x
 24000    (100K samples/sec)                 2.05x
```
```
          The default OS-timer based profiler
 -----------------------------------------------------
 Sampling rate                               Overhead
 -----------------------------------------------------
 100 samples/sec                             1.036x
```

A reasonable sampling period, such as once in 2.4M CPU cycles on a 2.4GHz Intel Skylake machine, offers a good trade-off between a high sampling rate and low overhead.

3. When using any other PMU event, such as cache misses, the sampling period needs to be more carefully curated.
We’ll provide preset values for the preset events.

4. PMU profiling has a mild skid on deeply pipelined, multi-issue, out-of-order, superscalar CPU architectures due to multiple instructions flowing in the pipeline.
Thus, the program counter seen in the interrupt can be a handful of instructions off from the one generating the counter overflow.
This "skid" is much smaller than the several milliseconds of skid in an OS timer.
Moreover, modern PMUs have built-in features to eliminate such skid (e.g., Intel Precise Event Based Sampling ([PEBS](https://software.intel.com/sites/default/files/managed/8b/6e/335279_performance_monitoring_events_guide.pdf?ref=hvper.com)), AMD Instruction Based Sampling ([IBS](https://developer.amd.com/wordpress/media/2012/10/AMD_IBS_paper_EN.pdf)), and PowerPC Marked events ([MRK](https://wiki.raptorcs.com/w/images/6/6b/POWER9_PMU_UG_v12_28NOV2018_pub.pdf))).
Some PMUs can be configured so that they offer the exact program counter involved in the counter overflow.
We have introduced `profilePCPrecision` for this reason.

5. There are finite counters on any PMU.
Counter multiplexing happens if the number of events is more than the number of available PMU counter slots; in that case, the events run only part of the time.

### 6.3 Alternatives explored

1. The Linux `perf` command-line tool is powerful and offers comparable solutions.
However, it does not address several important use cases.
    1. Profiling via an `http` endpoint (provided by the `net/http/pprof` package in Go) is the preferred, readily available, and scalable solution that works at enterprise scale. Making `perf` work in a similar fashion would lead to creating too many custom solutions.
    2. `pprof` works as an in-situ profiler in the same address space as the process being profiled and hence does not require any special system privileges; in contrast, attaching `perf` to a running process requires changes to a system’s security settings.
    3. We are also focusing on containerized environments, which are widely used in large-scale Go deployments. Although `perf` can collect call stack samples from a container, it fails to find the symbols of Go binaries located within a container and hence cannot show clean reports (e.g., flame graphs). The workaround is non-trivial.
2. We explored library-level facilities to collect performance counters, such as perf-utils [[3]](https://github.com/hodgesds/perf-utils).
However, no library support will provide good profiling for Go, since the runtime can schedule a monitored function onto another thread, where the counters may not be enabled.
The correct solution, hence, is to implement the hardware performance counter facility inside the Go runtime’s scheduler, as we have proposed here.
3. We seriously considered avoiding a new pprof API and tried to work with the existing API `StartCPUProfile(io.Writer)`.
We considered adding variadic `ProfilingOption`s to it.
But any such change would qualify as breaking `Go1.x` compatibility, so we went ahead and introduced a new API.
4.
We considered passing an `Options` struct to the new API `StartCPUProfileWithConfig`.
However, the usage looked complicated, with certain fields making sense only in certain circumstances, and hence we decided to use the functional option pattern.
5. We considered having only one CPU profiling event active at a time.
However, experience has shown that this limits performance analysis in several ways; advanced performance tuning demands correlating more than one event to find the root cause.
Furthermore, this richness in Go profiling can serve to advance the state of the art in profilers.
6. On the implementation front, we considered using a `map` for active events at several places instead of an `EV_MAX`-sized array.
However, concurrent access to different array elements was superior to locking a shared `map`.
Moreover, `EV_MAX` is small enough to search linearly for active events.

## 7. Compatibility

The design ensures [compatibility with Go1.0](https://golang.org/doc/go1compat) since no existing public API signatures, interface signatures, or data structures are changed.
We have introduced a new interface and a handful of new APIs in the `pprof` package.

## 8. Implementation

Milind Chabbi from Uber Technologies will implement the proposal.
The implementation will proceed as follows:

1. Add assembly language support for the `perf_events`-related system calls.
2. Modify the `runtime` and `runtime/pprof` packages to support PMU sampling, reflecting the call graph/workflow shown in Figure 2.
3. Expose the new facility via the `testing` and `net/http/pprof` packages.
4. Change the `pprof` CPU profiling tests to use PMU cycles in addition to the OS timer (on supported platforms).
5. Add PMU profiling tests to run on supported platforms.
The implementation will be broken into a large checkin related to the `runtime` and `runtime/pprof` packages, a few small checkins related to the `testing` and `net/http/pprof` packages, and the tests accompanying the code.
The implementation will align with the February-April 2020 development cycle and the May-July 2020 release cycle.

A prototype is available [here](https://github.com/uber-dev/go/tree/pmu_pprof).

## 9. Open issues

1. Supporting Windows and macOS will be left for future work, but any support from experts from those lands is appreciated.
2. The issue [21295](https://github.com/golang/go/issues/21295) proposes adding a `-test.counters` flag to the testing package.
That proposal is orthogonal to this one.
There isn’t a detailed description of how that proposal wants to accomplish it.
Simply starting and stopping a PMU counter by making a `perf_event_open` system call at the testing-package level will not accomplish the objective: the monitored functions can migrate from one `M` to another, and hence the changes must be made in the runtime, as described in this proposal.
However, it is relatively easy to accomplish the objective of issue #21295 by making use of the changes suggested in this proposal.
We are open to discussing with the creators of #21295 how we can incorporate their needs into this design.
3. The additions to the public APIs and the naming aesthetics can be refined with inputs from the experts and the community.
4. Currently, we do not bubble up runtime failures to start a PMU profiler (the same is true for a timer).
Should we bubble up any failure to configure a PMU, or silently produce zero samples in the profiles?
5. Should we introduce a throttling mechanism if the rate of events is too high for a PMU event?

## 10. Acknowledgment

Pengfei Su implemented a prototype of the design while interning at Uber Technologies.
Joshua Corbin helped design the public surface APIs.

This work is generously funded by the Uber Technologies’ Programming Systems Research team, and the team members provided feedback for improvement.

## 11. Appendix

##### (OS timer profile) goroutine.go/Run 1:

```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.70s, Total samples = 18060ms (269.37%)
Showing nodes accounting for 18060ms, 100% of 18060ms total
      flat  flat%   sum%        cum   cum%
    4210ms 23.31% 23.31%     4210ms 23.31%  main.f7
    2610ms 14.45% 37.76%     2610ms 14.45%  main.f2
    2010ms 11.13% 48.89%     2010ms 11.13%  main.f6
    1810ms 10.02% 58.91%     1810ms 10.02%  main.f10
    1780ms  9.86% 68.77%     1780ms  9.86%  main.f3
    1410ms  7.81% 76.58%     1410ms  7.81%  main.f1
    1310ms  7.25% 83.83%     1310ms  7.25%  main.f4
    1110ms  6.15% 89.98%     1110ms  6.15%  main.f5
    1110ms  6.15% 96.12%     1110ms  6.15%  main.f8
     700ms  3.88%   100%      700ms  3.88%  main.f9
```

##### (OS timer profile) goroutine.go/Run 2:

```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.71s, Total samples = 17400ms (259.39%)
Showing nodes accounting for 17400ms, 100% of 17400ms total
      flat  flat%   sum%        cum   cum%
    3250ms 18.68% 18.68%     3250ms 18.68%  main.f2
    2180ms 12.53% 31.21%     2180ms 12.53%  main.f9
    2100ms 12.07% 43.28%     2100ms 12.07%  main.f1
    1770ms 10.17% 53.45%     1770ms 10.17%  main.f6
    1700ms  9.77% 63.22%     1700ms  9.77%  main.f5
    1550ms  8.91% 72.13%     1550ms  8.91%  main.f4
    1500ms  8.62% 80.75%     1500ms  8.62%  main.f8
    1440ms  8.28% 89.02%     1440ms  8.28%  main.f3
    1390ms  7.99% 97.01%     1390ms  7.99%  main.f10
     520ms  2.99%   100%      520ms  2.99%  main.f7
```

##### (OS timer profile) goroutine.go/Run 3:

```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:48pm (PST)
Duration: 6.71s, Total samples = 17.73s (264.31%)
Showing nodes accounting for 17.73s, 100% of 17.73s total
      flat  flat%   sum%        cum   cum%
     3.74s 21.09% 21.09%      3.74s 21.09%  main.f7
     2.08s 11.73% 32.83%      2.08s 11.73%  main.f9
     2.05s 11.56% 44.39%      2.05s 11.56% 
main.f2 + 1.85s 10.43% 54.82% 1.85s 10.43% main.f10 + 1.78s 10.04% 64.86% 1.78s 10.04% main.f1 + 1.43s 8.07% 72.93% 1.43s 8.07% main.f3 + 1.42s 8.01% 80.94% 1.42s 8.01% main.f8 + 1.18s 6.66% 87.59% 1.18s 6.66% main.f6 + 1.17s 6.60% 94.19% 1.17s 6.60% main.f5 + 1.03s 5.81% 100% 1.03s 5.81% main.f4 +``` + +##### (OS timer profile) serial.go/Run 1: +``` +File: serial +Type: cpu +Time: Jan 27, 2020 at 1:42pm (PST) +Duration: 501.51ms, Total samples = 320ms (63.81%) +Showing nodes accounting for 320ms, 100% of 320ms total + flat flat% sum% cum cum% + 80ms 25.00% 25.00% 80ms 25.00% main.H_expect_14_546 + 80ms 25.00% 50.00% 80ms 25.00% main.J_expect_18_18 + 60ms 18.75% 68.75% 60ms 18.75% main.G_expect_12_73 + 20ms 6.25% 75.00% 20ms 6.25% main.B_expect_3_64 + 20ms 6.25% 81.25% 20ms 6.25% main.D_expect_7_27 + 20ms 6.25% 87.50% 20ms 6.25% main.F_expect_10_91 + 20ms 6.25% 93.75% 20ms 6.25% main.I_expect_16_36 + 10ms 3.12% 96.88% 10ms 3.12% main.A_expect_1_82 + 10ms 3.12% 100% 10ms 3.12% main.C_expect_5_46 + 0 0% 100% 320ms 100% main.main + 0 0% 100% 320ms 100% runtime.main +``` + + +##### (OS timer profile) serial.go/Run 2: + +``` +File: serial +Type: cpu +Time: Jan 27, 2020 at 1:44pm (PST) +Duration: 501.31ms, Total samples = 320ms (63.83%) +Showing nodes accounting for 320ms, 100% of 320ms total + flat flat% sum% cum cum% + 70ms 21.88% 21.88% 70ms 21.88% main.I_expect_16_36 + 50ms 15.62% 37.50% 50ms 15.62% main.J_expect_18_18 + 40ms 12.50% 50.00% 40ms 12.50% main.E_expect_9_09 + 40ms 12.50% 62.50% 40ms 12.50% main.F_expect_10_91 + 40ms 12.50% 75.00% 40ms 12.50% main.H_expect_14_546 + 30ms 9.38% 84.38% 30ms 9.38% main.D_expect_7_27 + 20ms 6.25% 90.62% 20ms 6.25% main.B_expect_3_64 + 20ms 6.25% 96.88% 20ms 6.25% main.G_expect_12_73 + 10ms 3.12% 100% 10ms 3.12% main.C_expect_5_46 + 0 0% 100% 320ms 100% main.main + 0 0% 100% 320ms 100% runtime.main +``` + +##### (OS timer profile) serial.go/Run 3: +``` +File: serial +Type: cpu +Time: Jan 27, 2020 at 1:45pm (PST) +Duration: 
501.39ms, Total samples = 310ms (61.83%) +Showing nodes accounting for 310ms, 100% of 310ms total + flat flat% sum% cum cum% + 110ms 35.48% 35.48% 110ms 35.48% main.J_expect_18_18 + 70ms 22.58% 58.06% 70ms 22.58% main.G_expect_12_73 + 60ms 19.35% 77.42% 60ms 19.35% main.F_expect_10_91 + 30ms 9.68% 87.10% 30ms 9.68% main.I_expect_16_36 + 20ms 6.45% 93.55% 20ms 6.45% main.H_expect_14_546 + 10ms 3.23% 96.77% 10ms 3.23% main.B_expect_3_64 + 10ms 3.23% 100% 10ms 3.23% main.C_expect_5_46 + 0 0% 100% 310ms 100% main.main + 0 0% 100% 310ms 100% runtime.main +``` + + + +## 12. References + +1. HPCToolkit: https://github.com/HPCToolkit/hpctoolkit +2. Oracle studio: +https://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html +3. Perf utils: https://github.com/hodgesds/perf-utils +4. vTune +https://software.intel.com/en-us/vtune +5. "Pinpointing Data Locality Problems +Using Data-Centric Analysis", Xu Liu and John Mellor-Crummey, 2011 International +Symposium on Code Generation and Optimization, April 2-6, 2011, Chamonix, +France. +6. "A Data-centric Profiler for Parallel Programs", Xu Liu and John +Mellor-Crummey, The International Conference for High Performance Computing, +Networking, Storage and Analysis, November 17-22, 2013, Denver, Colorado, USA. +7. "ScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel +Programs", Xu Liu and Bo Wu, The International Conference for High Performance +Computing, Networking, Storage and Analysis, Nov 15-20, 2015, Austin, Texas, +USA. +
diff --git a/design/36821/call_graph_pprof++.png b/design/36821/call_graph_pprof++.png new file mode 100644 index 0000000..2f3e451 --- /dev/null +++ b/design/36821/call_graph_pprof++.png Binary files differ
diff --git a/design/36821/call_graph_pprof.png b/design/36821/call_graph_pprof.png new file mode 100644 index 0000000..9ffaed2 --- /dev/null +++ b/design/36821/call_graph_pprof.png Binary files differ
diff --git a/design/36821/goroutine.png b/design/36821/goroutine.png new file mode 100644 index 0000000..f867efe --- /dev/null +++ b/design/36821/goroutine.png Binary files differ
diff --git a/design/36821/goroutine_pmu.png b/design/36821/goroutine_pmu.png new file mode 100644 index 0000000..18dde54 --- /dev/null +++ b/design/36821/goroutine_pmu.png Binary files differ
diff --git a/design/36821/serial.png b/design/36821/serial.png new file mode 100644 index 0000000..0beae7e --- /dev/null +++ b/design/36821/serial.png Binary files differ
diff --git a/design/36821/serial_pmu.png b/design/36821/serial_pmu.png new file mode 100644 index 0000000..7bb9e3f --- /dev/null +++ b/design/36821/serial_pmu.png Binary files differ