# Proposal: hardware performance counters for CPU profiling.
Author(s): Milind Chabbi
Last updated: Feb/13/2020
Discussion at https://golang.org/issues/36821.
## 1. Abstract
The Go pprof CPU profiles are quite inaccurate and imprecise due to several
limitations related to the operating system (OS) interval timer used as its
sampling engine.
Inaccurate profiles can mislead performance investigation and may lead
to inappropriate optimization guidance.
We propose incorporating hardware
performance counters as an alternative sampling engine for CPU profiling.
Hardware performance counter-based sampling rectifies many of the problems
related to the OS timers.
The hardware CPU cycles event provides microsecond
measurement granularity, and it is available in almost all commodity CPUs.
Furthermore, hardware Performance Monitoring Units (PMUs) allow measuring not
only CPU cycles but other activities such as cache misses that help root cause
complex performance problems.
A [prototype implementation](https://github.com/uber-dev/go/tree/pmu_pprof) shows significantly
better-quality CPU profiles when using PMUs while maintaining a low measurement
overhead.
The proposal will retain the OS timer as the default sampling engine.
The proposal exposes the PMU-based profiling feature on Linux-based systems,
which provide an excellent facility to program PMUs.
## Table of contents
- [1. Abstract](#1-abstract)
- [2. Background](#2-background)
- [2.1 High-level description of Go's CPU
profiling](#21-high-level-description-of-gos-cpu-profiling)
- [2.2 Desiderata and deficiencies of Go CPU
profiler](#22-desiderata-and-deficiencies-of-go-cpu-profiler)
- [2.3 Examples demonstrating inaccuracy and imprecision in Go
pprof profiles](#23-examples-demonstrating-inaccuracy-and-imprecision-in-go-pprof-profiles)
- [2.4 State-of-the-art in profiling using performance
monitoring units](#24-state-of-the-art-in-profiling-using-performance-monitoring-units)
- [2.5 Low-level details of the current Go CPU
profiler](#25-low-level-details-of-the-current-go-cpu-profiler)
- [3. Problem statement](#3-problem-statement)
- [4. Proposal](#4-proposal)
- [4.1 High-level design](#41-high-level-design)
- [4.2 The devil is in the
details](#42-the-devil-is-in-the-details)
- [4.3 Exposing PMU events via `runtime/pprof`
package](#43-exposing-pmu-events-via-runtimepprof-package)
- [4.4 The optional multiple CPU profiles
feature](#44-the-optional-multiple-cpu-profiles-feature)
- [4.5 The `testing` package](#45-the-testing-package)
- [4.6 The `net/http/pprof`
package](#46-the-nethttppprof-package)
- [5. Empirical evidence on the accuracy and precision of PMU
profiles](#5-empirical-evidence-on-the-accuracy-and-precision-of-pmu-profiles)
- [5.1 The PMU produces precise and accurate profiles for
concurrent programs](#51-the-pmu-produces-precise-and-accurate-profiles-for-concurrent-programs)
- [5.2 The PMU produces precise and accurate profiles for
serial programs](#52-the-pmu-produces-precise-and-accurate-profiles-for-serial-programs)
- [5.3 A practical use](#53-a-practical-use)
- [6. Rationale](#6-rationale)
- [6.1 Advantages](#61-advantages)
- [6.2 Disadvantages](#62-disadvantages)
- [6.3 Alternatives explored](#63-alternatives-explored)
- [7. Compatibility](#7-compatibility)
- [8. Implementation](#8-implementation)
- [9. Open issues](#9-open-issues)
- [10. Acknowledgment](#10-acknowledgment)
- [11. Appendix](#11-appendix)
- [12. References](#12-references)
## 2. Background
In this section, we first provide a high-level description of Go's CPU
profiling, motivate the problems in the current profiler with examples,
highlight the state-of-the-art in CPU profiling, and then dive deeper into the
current implementation of CPU profiling inside Go.
Our proposed changes appear
in the next section.
Readers familiar with profiling technology and the
internals of pprof may skip this section.
### 2.1 High-level description of Go’s CPU profiling
Pprof, the de facto Go CPU profiler, employs sampling-based profiling for time
measurement.
It attributes the time metric to different functions, including
their calling context (aka call path or backtrace).
Pprof relies on the
underlying operating-system timers (exposed by the Go runtime) to periodically
interrupt a program’s execution.
The sampling rate is typically one hundred
hertz (a sample taken every 10ms); a different sampling rate can be configured,
but typical systems cannot deliver more than 250 hertz.
When profiling is enabled, the Go runtime scheduler sets up each OS thread in
the process to generate periodic timer interrupts.
On each interrupt, the Go
runtime's signal handler obtains a backtrace and appends it to a ring buffer.
All OS threads in a Go process append their backtraces to the same ring buffer
coordinated via a lock.
A side thread, launched by the `pprof` package,
periodically scrapes the backtraces from the ring buffer and serializes them
into a protobuf file to be used by the downstream tools.
### 2.2 Desiderata and deficiencies of Go CPU profiler
*Accuracy* and *precision* are desirable properties of a good measurement tool.
A profiling datum is said to be *accurate* if it is close to the ground truth
(aka true value).
For example, if `API_A()` consumes 25% of the total
execution time and a profiler attributes 24% of the total execution time to it,
the measurement has 96% accuracy.
Profiling data are said to be *precise* if there is low variability among
multiple measurements.
Precision of a set of measurements of the same variable is often
[expressed](https://www.wikihow.com/Calculate-Precision) either as min and max
from the sample mean or as [standard
error](https://en.wikipedia.org/wiki/Standard_error) or [coefficient of
variation](https://en.wikipedia.org/wiki/Coefficient_of_variation) of the
sample.
Unfortunately, the current pprof CPU profiles meet neither of these two
criteria.
#### 2.2.1 Inaccuracy and its causes
Inaccuracy of profiles arises from two sources: sample size and sampling bias.
#### Sample size
By the [law of large
numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers), a larger sample
size makes the sample mean closer to the population mean.
Moreover, in
profiling, where the sample size is very small compared to the population size,
increasing the sample size greatly improves accuracy.
Conversely, a smaller
sample size leads to lower accuracy.
The low-resolution OS timer is the primary reason behind the small sample size
in a measurement window.
One can obtain more samples either by increasing the
length of a measurement window or by increasing the number of samples in a
measurement window.
The former works well for regular programs, e.g., code
dominated by loops; however, for irregular code, such as microservices, a
larger measurement window does not rectify the measurement inaccuracy because
more irregularities can easily surface in a larger observation window.
Hence,
there is a dire need to collect more samples within a measurement window and
obtain fine-grained samples of sub-millisecond executions.
Furthermore, linearly scaling the measurement time to collect more samples is
impractical if orders of magnitude more samples are necessary to correct the
small sample size issue.
#### Sampling bias
A larger sample size alone does not improve accuracy if the samples are biased.
OS timer samples on some systems (e.g., Linux) are biased because a timer
interrupt from one OS thread, say `T1`, may be handled by an arbitrary (not to
be confused with uniformly random) thread, say `T2`.
This means `T2` will
handle the interrupt and incorrectly attribute the sample to its call stack,
which results in biased samples.
If more timer signals are handled by `T2`
compared to `T1`, a [systematic sampling
bias](https://en.wikipedia.org/wiki/Observational_error) will occur, which leads
to inaccurate profiles.
#### 2.2.2 Imprecision and its causes
Imprecision in measurement comes from at least the following two sources:
sample size and measurement skid.
#### Sample size
A smaller number of samples directly contributes to a large [standard
error](https://en.wikipedia.org/wiki/Standard_error).
The low resolution of OS
timers is responsible for fewer samples in a measurement window, which results
in a lower precision of pprof profiles.
Conversely, a higher resolution of
samples will improve precision.
#### Measurement skid
The second source of measurement error is the [random
error](https://en.wikipedia.org/wiki/Observational_error) inherent in a
measurement apparatus.
An OS timer apparatus configured to expire after `N` milliseconds
is only guaranteed to generate an interrupt some time after `N` milliseconds --
not precisely at `N` milliseconds.
This randomness introduces a large "skid" in
measurement.
Assume a periodic timer set to fire every 10ms.
Assume it has
1ms left before its expiry when the OS scheduler with a 4ms resolution inspects
it.
This means the timer will fire 4ms later, which is 3ms after the scheduled
expiry time.
This results in up to 30% imprecision for a 10ms periodic timer.
Although random error may not be entirely eliminated, a superior measurement
apparatus reduces the effects of random error.
### 2.3 Examples demonstrating inaccuracy and imprecision in Go pprof profiles
#### 2.3.1 A concurrent code
Consider the Go program
[goroutine.go](https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go),
which has ten exactly similar goroutines `f1`-`f10`.
These functions are
computationally intensive, long-running, and are unaffected by IO or memory
hierarchy.
The functions are non-preemptive, and hence the Go runtime scheduler
overhead is zero.
It is expected that each function takes exactly the same
amount of CPU time (i.e., approximately 10% of the overall execution).
It
should not matter how many OS threads or CPU cores they run on.
Table 1 summarizes the result of `go tool pprof -top` for this program running
on a Linux OS on a 12-core Intel Skylake server-class machine.
The
measurements were taken three times under the same configuration.
The `Flat
(ms)` column shows the absolute millisecond measurement attributed to each of
the ten routines, and the `Flat (%)` column shows the relative time in a
routine with respect to the total execution time.
The `Expected (%)` column
shows the expected relative time in each routine.
The `Rank order` column
assigns ranks in the descending order of execution time.
The raw output is
available in the [appendix](#11-appendix).
![Table 1](36821/goroutine.png) *Table 1: demonstration of inaccuracy and
imprecision of pprof profiles using OS timers on a concurrent code.*
First, let us focus on the data in a single run `RUN 1`.
`f1`-`f10` have a wide
variance in the time attributed to them; the expectation is that each of them
gets `10%` execution time.
There is up to a 6x difference between the time attributed
to the routine with the highest attribution (`main.f7` with 4210ms,
23.31%) and the routine with the lowest (`main.f9` with
700ms, 3.88%).
**This demonstrates the poor accuracy (deviation from the
ground truth) of pprof timer-based profiles.**
The issue reproduces when using different numbers of CPU cores.
Now focus on the three runs `RUN 1-3` together.
The time attributed to any
routine widely varies from one run to another.
The rank order of the routines
changes significantly from run to run.
In Run 1, `main.f7` is shown to run for
4210ms with the rank order of 1, whereas in Run 2, it is shown to run for only
520ms with the rank order 10.
The expectation is that the measurements remain
the same from run to run.
From one run to another, function-level time
attributions are not precise; for example, there is a 71% coefficient of
variation among the three runs of the `f1` routine, and the average coefficient
of variation across all functions is 28%.
**This demonstrates imprecision (unpredictability of
measurement) of pprof timer-based profiles.**
For the curious readers, Table 2 below shows the near-perfect accuracy and
precision obtained when profiling the same code using the PMU-based profiler
described in this proposal.
We show two runs instead of three for
brevity.
Additional details appear in [Section
5.1](#51-the-pmu-produces-precise-and-accurate-profiles-for-concurrent-programs).
![Table 2](36821/goroutine_pmu.png) *Table 2: demonstration of accuracy and
precision of pprof profiles via PMU CPU cycles event on a concurrent code.*
#### 2.3.2 A serial code
The issue is not specific to concurrent code alone.
Refer to a carefully
crafted Go program --
[`serial.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go)
-- which has a single goroutine with complete data
dependence from one statement to another.
It has ten functions
`A_expect_1_82`, `B_expect_3_64`, ... , `J_expect_18_18`.
The functions are
intentionally written to consume different amounts of time to make this test
case more representative.
Each function is similar and has a loop, but the
trip-count of the loop in each function is different.
The loop in the function
with prefix `A` runs `1*C` times, where `C` is a constant.
The loop in the function
with prefix `B` runs `2*C` times, and so on.
This arrangement leads to a
predictable relative execution time for each function; the function with prefix
`A` consumes `1/sum(1..10)=1.82%` of the total execution time; and the function
with prefix `B` consumes `2/sum(1..10)=3.64%` of the total execution time, and
so on until the function with prefix `J` consumes `10/sum(1..10)=18.18%` of the
total execution time.
The expected relative percentage of CPU time is encoded in each function name:
`A_expect_1_82` should consume 1.82% of the execution time, `B_expect_3_64`
should consume 3.64%, and so on, up to `J_expect_18_18`, which should consume
18.18% of the CPU time.
Table 3 summarizes the output of pprof CPU profiling for this
code.
As before, the measurements were taken three times under the same
configuration.
The `Flat (ms)` column shows the absolute millisecond
measurement attributed to each of the ten functions, and the `Flat (%)` column
shows the relative time with respect to the total time taken by all functions.
The `Expected (%)` column shows the expected relative time in each routine.
The
`Rank order` column assigns ranks in descending order of the execution time.
The raw output is available in the [appendix](#11-appendix).
![Table 3](36821/serial.png) *Table 3: demonstration of inaccuracy and
imprecision of pprof profiles on a serial code.*
A `-` in a cell represents zero samples.
First, let's focus on just one set of measurements from `RUN 1`.
Rather than
`J_expect_18_18` being the top time consumer with 18.18%, both `J_expect_18_18`
and `H_expect_14_546` are the top time consumers each with 25% CPU time.
`B_expect_3_64`, `D_expect_7_27`, `F_expect_10_91`, and `I_expect_16_36` are
all lumped into a single rank order of 4 each attributed with 6.25% of
execution time.
`E_expect_9_09` does not even appear in the profiles despite
its 9% expected execution time.
More inaccuracies are evident in the data in
the table.
If we inspect the data from `RUN 2` and `RUN 3`, we notice more
discrepancies, with much of the data missing or incorrect.
The rank ordering
and the relative execution percentage both have very low correlation with the
ground truth.
Despite tens of milliseconds of execution time, some functions
are overcounted and some are undercounted.
Now, let's focus on the three runs together.
The highest time consumer,
`J_expect_18_18`, does not get even a single sample in `RUN 3`, indicating
imprecision.
`I_expect_16_36`, which has measurement data available in all
three runs, varies from 20ms in `RUN 1` with a rank order of 4 to 70ms in `RUN 2`
with a rank order of 1.
More discrepancies are visible in the table.
Thus, the
serial program also shows run-to-run measurement variation, and hence the
profiles are imprecise.
Table 4 below shows the near-perfect accuracy and precision of profiling the
same code using the PMU-based profiler described in this proposal.
The measurements
were taken two times instead of three for brevity.
Additional details appear
in [Section
5.2](#52-the-pmu-produces-precise-and-accurate-profiles-for-serial-programs).
![Table
4](36821/serial_pmu.png) *Table 4: demonstration of accuracy and precision
of pprof profiles via PMU CPU cycles event on a serial code.*
### 2.4 State-of-the-art in profiling using performance monitoring units
An alternative to the OS timer-based sampling is using hardware performance
counters, which are ubiquitous in modern commodity CPUs.
A CPU's Performance
Monitoring Unit (PMU) can count various events such as
cycles, instructions, and cache misses, to name a few.
Furthermore, PMUs can be
configured to deliver an interrupt when an event reaches a specified threshold
value.
These PMU interrupts, like the OS timer interrupts, can serve as another
source of sampling the call stack.
This technique is employed in many modern
profilers [1,2,4].
PMU sampling is called “event-based sampling,” where the
period (interval between two interrupts) is expressed as the number of
occurrences of an event, not the number of timer ticks.
As a side note, in
event-based sampling, the sampling overhead is proportional to the number of
events disregarding the length of execution, whereas in timer-based sampling,
the overhead is proportional to the length of execution.
If there are more
events in a short time span, then there is also more interest in observing
that region of execution closely, and event-based sampling takes more
samples there.
More overhead is incurred in places where there is interest (more
events) and low/no overhead otherwise.
The CPU cycles event is one of the most common
PMU events; it occurs periodically, similar to OS timer ticks.
Hence, using CPU
cycles as a substitute for the OS timer is very attractive.
### 2.5 Low-level details of the current Go CPU profiler
We now describe the current implementation of CPU profiling in Go, which
is necessary background for understanding the changes we propose in the next section.
The
description is relevant for most Unix-based systems, especially Linux, and most
of the discussion is irrelevant for Windows since it does not use signals for
handling CPU profiling.
![call_graph_pprof](36821/call_graph_pprof.png)
*Figure 1: A pictorial representation of the OS timer sampling workflow in the
existing implementation.
The first line of each record box is the function
name.
The remaining lines provide a textual description of the steps taken
by that function.*
Currently, `pprof.StartCPUProfile(io.Writer)` is the entry point for starting
CPU profiling.
Analogously, `pprof.StopCPUProfile()` stops a running profile.
Two concurrent CPU profiling sessions are disallowed via a lock-protected
variable `cpu.profiling` in pprof.
The `pprof` package interacts with the
`runtime` package for starting/stopping profiling via
`runtime.SetCPUProfileRate(hz int)` API.
There are two critical pieces of work done inside
`pprof.StartCPUProfile(io.Writer)`.
First, it invokes
`runtime.SetCPUProfileRate(hz=100)` asking the runtime to start profiling; the
rest of the profile generation will be handled by the `runtime` package; the
integer argument is the sampling rate in Hz (samples per second).
Second, in order to collect the profiles generated by the runtime,
`pprof.StartCPUProfile(io.Writer)` launches a goroutine
`profileWriter(io.Writer)`.
`profileWriter`:
1. creates a protocol buffer via `newProfileBuilder`,
2. goes into a sleep-wakeup loop calling `runtime.readProfile()`, which
provides a chunk of the profiles collected inside the runtime's ring buffer
(`eof` if profiling has stopped), and
3. writes to the sender end of a channel, `cpu.done`, once
`runtime.readProfile()` returns `eof`; the other side of the channel is
waited upon by `pprof.StopCPUProfile()`.
`runtime.SetCPUProfileRate(int)` is the entry point to start/stop profiling in
the `runtime` package.
It:
1. creates a ring buffer, `cpuprof.log`,
2. appends profile header information into the ring buffer, and
3. invokes `setcpuprofilerate(hz)`, which:
   1. calls `setThreadCPUProfileRate(hz=0)` to stop any ongoing timer,
   2. invokes `setProcessCPUProfiler(hz)`, which installs the `SIGPROF` handler,
   3. records the current sampling rate `hz` into the scheduler variable
   `sched.profilehz` to make it visible to other OS threads, and
   4. calls `setThreadCPUProfileRate(hz)`, which uses an OS timer (e.g.,
   `setitimer`) to start the sampling engine; this step is performed only if
   the `hz` value is non-zero.
Inside the scheduler, the `execute()` method schedules a goroutine (called
`G` in the Go runtime scheduler) to run on the current OS thread (called
`M` in the Go runtime scheduler).
This method compares the current
value of the scheduler-recorded `sched.profilehz` variable against its own
snapshot (`m.profilehz`) of the previously recorded value; if they differ,
it reconfigures its timer by calling
`setThreadCPUProfileRate(sched.profilehz)`.
This functionality sets an OS
timer on behalf of the current OS thread in the process.
Once armed, the timer periodically delivers a `SIGPROF` signal, handled by
`sighandler(sig int, info *siginfo, ctxt unsafe.Pointer, gp *g)`, which in turn
invokes `sigprof()`, which:
1. obtains a backtrace, and
2. appends the backtrace into the ring buffer via `cpuprof.add()`.
Occasionally, a sampled program counter may fall into non-Go space, or the ring
buffer may be full; such samples are simply counted, and a backtrace is not taken.
The ring buffer is coordinated via a `signalLock` for concurrency control.
The edges marked with “signalLock”, in Figure 1, show the operations that take
the lock to access the ring buffer.
`pprof.StopCPUProfile()` follows a very similar path as
`pprof.StartCPUProfile(io.Writer)`.
When the `hz` argument is zero, the runtime
APIs described previously (e.g., `SetCPUProfileRate(hz)`) interpret it as the
end of profiling.
`pprof.StopCPUProfile()` performs the following steps:
1. calls `runtime.SetCPUProfileRate(0)` to stop profiling,
2. waits on the `cpu.done` channel, which will be flagged by the side thread
once `eof` is reached in the runtime ring buffer, and
3. closes the supplied `io.Writer`.
`runtime.SetCPUProfileRate(hz)`, when called with `hz=0`, invokes
`setcpuprofilerate(0)` and closes the ring buffer.
`setcpuprofilerate(0)`, in turn:
1. calls `setProcessCPUProfiler(0)`, which restores the previous signal handler
for `SIGPROF`,
2. resets `sched.profilehz=0`, indicating globally that profiling has stopped,
and
3. calls `setThreadCPUProfileRate(0)` to stop sampling on the current OS thread.
Subsequently, the other `M`s notice the change in the global `sched.profilehz`
and turn off their timers when they run their `execute()` method.
The direct users of pprof are:
1. The `testing` package which can take a `-cpuprofile <file>` flag to collect
profiles.
2. The `net/http/pprof` package which can collect cpu profiles on the
`/debug/pprof/profile` endpoint with an optional argument `?seconds=<N>`.
3. Third party libraries such as `pkg/profile`.
## 3. Problem statement
*Pprof CPU profiles are inaccurate because the profiling resolution is low and
sampling is biased in OS timers.*
*Pprof CPU profiles are imprecise because the profiling resolution is low and
the samples have a large skid due to OS timers.*
*Pprof CPU profiles measure only the time metric; additional events should be
monitored to help performance profiling.*
## 4. Proposal
In this proposal, we address the concerns -- small sample size, sampling bias,
measurement error, and the need for more metrics -- which were raised in the
background section.
**Sample size.** First, this proposal addresses the problem of low sampling
resolution of OS timers by using high-resolution samples from hardware
performance monitoring units (PMUs) to generate more samples.
While OS timers have ~10ms
resolution (that is, one sample every 10ms), sampling CPU cycles via PMUs can
deliver (if desired) a sample every few microseconds -- a
three-orders-of-magnitude finer resolution.
This fine-grained measurement increases the
sample size and, as a result, alleviates both the inaccuracy and imprecision
problems.
While PMUs can deliver 100x more samples with just ~13% measurement
overhead, the alternative of lengthening a measurement window by 100x to
collect the same number of samples via the OS timers would be impractical.
**Sampling bias.** Second, the proposal arranges for each sample to be delivered
to the OS thread that caused the counter overflow, which eliminates the sampling
bias and improves accuracy.
**Random error (skid).** Third, PMU samples alleviate the large sampling skid
seen in OS timers because PMUs allow a low-skid or skid-free precise sample
attribution.
**More metrics.** Finally, the PMU's capability to monitor various events such
as cache misses, TLB misses, branch mispredictions, inter-core cacheline
ping-ponging, to name a few, offers additional insights when diagnosing complex
performance issues.
Implementing this idea calls for extending the Go runtime package’s CPU
profiling capabilities and making necessary adjustments in `runtime/pprof`,
`testing`, and `http/pprof` packages.
We first describe the details of the
changes to the `runtime` package and then describe the details of the `pprof`,
`testing`, and `net/http/pprof` packages, in that order.
Our design brings the full capability of PMUs into the `runtime` package but
exposes only a small amount of it via the `pprof` package to users.
For
example, the runtime might be able to configure whether it wants to monitor the
CPU cycles spent inside the kernel or not; but these configurations will not be
exposed (at least in the initial design) to the users of `pprof` package.
### 4.1 High-level design
In its simplest form, we substitute the start/stop of an OS interval timer with
the start/stop of a PMU event configured in the sampling mode; the sampling
mode delivers an interrupt each time a hardware counter overflows, which is
analogous to the OS timer interrupt.
The interrupt generated by the PMU counter
overflow follows the same signal handling path as the interval timer.
This
policy allows collecting and appending a backtrace on each PMU interrupt into a
ring buffer, just like the OS timer.
Moreover, like the OS timer, each
OS-thread (`M`) checks whether any changes were made to the PMU-based profiling
during each scheduling activity and performs the necessary adjustments to
reconfigure the PMU.
Finally, we retain `SIGPROF` as the signal type delivered
-- for both OS timer interrupts and a PMU counter overflow interrupts --
instead of hijacking another signal number for PMU-based profiling.
### 4.2 The devil is in the details
This seemingly straightforward approach poses implementation challenges
related to retaining the compatibility with interval timer and incorporating
additional PMU features.
The first part of this section addresses those
challenges.
Other than the design described in the previous subsection, the
rest of the details are closer to implementation details.
As before, the
discussion is only relevant to Linux.
PMUs can count multiple CPU events simultaneously (e.g., cycles, instructions,
cache misses, etc.).
Our proposal does not preclude the runtime from
simultaneously enabling multiple PMU events or even PMU events alongside an OS
timer event.
In fact, certain events are best used when measured with others.
For example:
1. cache misses measured alongside cache hits enable computing the cache miss
ratio, and
2. remote-core cache line transfers measured alongside total cache misses
provide a view of the cause of the cache misses.
In our proposal, we will limit the maximum number of concurrent CPU events to
`_CPUPROF_EVENTS_MAX` (`EV_MAX` for short in the rest of this document).
We
will define a few common portable events so that the user does not have to
look up event codes in the CPU manufacturer's manual.
This closely
resembles the preset events of `perf events` in Linux.
In order to enable advanced users to exploit the full capability of the PMUs,
we define an opaque RAW event that allows the users to specify a model-specific
hexadecimal event number that can be sampled.
A motivating example for such a
raw event is the remote
[HITM](https://software.intel.com/sites/default/files/managed/8b/6e/335279_performance_monitoring_events_guide.pdf?ref=hvper.com)
event on Intel processors; this event counts the number of times a cache line
was loaded where the cache line was in the modified state in another remote
CPU’s cache.
This event helps diagnose problems such as cache line false
sharing; we have observed this event to be helpful in identifying false sharing
even inside the Go runtime.
We have come up with the following initial set of PMU events.
These events will
be exposed by the `runtime` package to the `pprof` package.
The `pprof` package
will wrap these constant values with matching function names to be presented
later in the `pprof` package subsection.
Notice that the OS timer is one of the CPU events.
Last-level cache events are included to capture the behavior of the CPU's memory
hierarchy.
Branch events are included to capture potentially wasted CPU
cycles.
The design discussions can refine this list further (e.g., whether TLB events
are needed); the list is easy to grow based on requests.
```
type cpuEvent int32

// These constants are a mirror of the runtime
const (
	_CPUPROF_OS_TIMER, _CPUPROF_FIRST_EVENT cpuEvent = iota, iota
	_CPUPROF_HW_CPU_CYCLES, _CPUPROF_FIRST_PMU_EVENT
	_CPUPROF_HW_INSTRUCTIONS cpuEvent = iota
	_CPUPROF_HW_CACHE_REFERENCES
	_CPUPROF_HW_CACHE_MISSES
	_CPUPROF_HW_BRANCH_INSTRUCTIONS
	_CPUPROF_HW_BRANCH_MISSES
	_CPUPROF_HW_RAW
	_CPUPROF_EVENTS_MAX
	_CPUPROF_LAST_EVENT = _CPUPROF_EVENTS_MAX - 1
)
```
Since PMU support may not be available on all platforms in all settings, we
need to make it configurable in the Go runtime.
If an OS supports PMU-based
profiling (e.g., Linux), it will dynamically register two functions during the
runtime startup.
1. `setThreadPMUProfilerFunc`: this is the OS-specific function used to
start/stop a PMU event, analogous to `setThreadCPUProfile`.
If `setThreadPMUProfilerFunc` is `nil`, the runtime will not attempt to
start/stop PMU profiles even if it is requested to do so.
However, this
situation will not arise because `runtime/pprof` will guard against it.
2. `sigprofPMUHandlerFunc`: this is the OS-specific function called from the
OS-specific signal handler when the PMU delivers an interrupt.
Currently, each `M` (type `m`) and the scheduler (type `schedt`) maintain a
`profilehz` variable.
In lieu of `profilehz`, we'll introduce a runtime structure,
`cpuProfileConfig`, which captures the essential configuration needed
to start/stop either a PMU counter or an OS timer.
`cpuProfileConfig` is
designed to reflect the underlying `perf_event_attr` struct in Linux, but being
an internal data structure we can modify it, if needed in the future, without
affecting any public-surface API.
```
type cpuProfileConfig struct {
	hz       uint64 // the sampling rate, used only for OS_TIMER
	period   uint64 // the sampling interval, used for PMU events
	rawEvent uint64 // the hexadecimal value of the CPU-specific event;
	// used only with the _CPUPROF_HW_RAW cpuEvent type
	preciseIP profilePCPrecision // the level of precision for the program counter in PMU samples

	isSampleIPIncluded        bool // should the sample include the PC?
	isSampleAddrIncluded      bool // should the sample include the memory address involved?
	isKernelIncluded          bool // should the kernel count events in kernel mode?
	isHvIncluded              bool // should the kernel count events in hypervisor mode?
	isIdleIncluded            bool // should the kernel count events when the idle task is running?
	isSampleCallchainIncluded bool // should the sample report the call chain?
	isCallchainKernelIncluded bool // should the sample include the kernel callchain?
	isCallchainUserIncluded   bool // should the sample include the user callchain?
}
```
Some PMUs can be configured so that they can offer the
exact program counter involved in the counter overflow.
`profilePCPrecision`
dictates the level of precision needed from the PMU as described by the
constants below.
If a PMU is configured to provide the precise program counter for a sample,
we'll substitute the leaf-level program counter in the backtrace with the one
presented by the PMU.
Please refer to [Section 6.2](#62-disadvantages) for details on PMU
skid and precision.
```
type profilePCPrecision uint8
const (
_CPUPROF_IP_ARBITRARY_SKID profilePCPrecision = iota
_CPUPROF_IP_CONSTANT_SKID
_CPUPROF_IP_SUGGEST_NO_SKID
_CPUPROF_IP_NO_SKID
)
```
`pprof` will set the default to `_CPUPROF_IP_ARBITRARY_SKID` to work across
architectures and event types.
However, we can easily change it to the highest
supported level by probing the CPU, as done in
[certain](https://github.com/HPCToolkit/hpctoolkit/blob/master/src/tool/hpcrun/sample-sources/perf/perf_skid.c#L176)
profilers.
`cpuProfileConfig` will be visible to the `pprof` package but will not be exposed to
its clients.
The fields in `cpuProfileConfig` control how a PMU event is
customized.
Pprof will set the field values to the lowest common denominator
that is expected to work across many architectures and security settings.
For example, `isKernelIncluded` will most likely be `false`.
The runtime will be
(mostly) agnostic to these values and will pass them down to OS-specific layers
that will handle programming the PMUs.
The `runtime` package will expose an API
`runtime_pprof_setCPUProfileConfig(cpuEvent, *cpuProfileConfig)`, only to
`pprof`.
This API is the equivalent of `SetCPUProfileRate(hz int)`.
Since
the legacy `SetCPUProfileRate(hz int)` cannot be dropped, we alter its internals
to wrap the `hz` value into a `cpuProfileConfig` structure and pass it down to
`runtime_pprof_setCPUProfileConfig` for a uniform treatment of any event whether
an OS-timer event or a PMU event.
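To illustrate this uniform treatment, below is a minimal sketch of how the legacy entry point could wrap `hz` into a config. The types and function bodies here are stand-ins that mirror the proposal's names, not the actual runtime internals:

```go
package main

import "fmt"

// Stand-ins for the runtime internals described above (hypothetical).
type cpuEvent int

const _CPUPROF_OS_TIMER cpuEvent = 0

type cpuProfileConfig struct {
	hz     uint64 // sampling rate, used only for the OS timer
	period uint64 // sampling interval, used for PMU events
}

var lastConfig *cpuProfileConfig // records the last request, for demonstration only

func runtime_pprof_setCPUProfileConfig(ev cpuEvent, cfg *cpuProfileConfig) {
	lastConfig = cfg // the real runtime would start/stop the event here
}

// SetCPUProfileRate keeps its legacy signature but wraps hz in a config, so
// the OS timer and PMU events get uniform treatment downstream.
func SetCPUProfileRate(hz int) {
	var cfg *cpuProfileConfig
	if hz > 0 {
		cfg = &cpuProfileConfig{hz: uint64(hz)}
	}
	runtime_pprof_setCPUProfileConfig(_CPUPROF_OS_TIMER, cfg) // nil stops profiling
}

func main() {
	SetCPUProfileRate(100)
	fmt.Println(lastConfig.hz) // 100
	SetCPUProfileRate(0)
	fmt.Println(lastConfig) // <nil>
}
```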
The `runtime` can be bifurcated into OS-agnostic and OS-specific parts.
The OS-agnostic part of the runtime will have the following duties:
1. Keep track of the CPU events currently running,
2. Handle new start/stop
invocations of the OS timer or PMU event, one at a time.
3. Allocate and
coordinate access to the ring buffers associated with each of the running CPU
events.
4. Reconfigure the CPU events for an `M` during its scheduling based on
whether its current snapshot is different from the global configuration stored
in the scheduler.
The OS-specific part of runtime will have the following duties:
1. Install/restore profiling signal handler.
2. Start/stop OS timer or PMU
events.
3. Handle the profiling signals and pause the PMUs during signal
handling.
4. Map an interrupt delivered to the appropriate active CPU profiling
event for recording the backtraces into an appropriate ring buffer.
Since there can be `_CPUPROF_EVENTS_MAX` concurrent CPU profiles, we
need to change some of the runtime data structures that previously assumed a
single CPU profile.
Importantly,
1. We will change the `profilehz` field in the `schedt` struct to be an array of
`cpuProfileConfig` pointers.
2. We will change the `profilehz` field in the `m` struct
to be an array of `cpuProfileConfig` pointers.
3. We will introduce an array of
opaque pointers, `cpuProfileHandle`, in the `m` struct.
This will be OS-specific
information stored in each `M`.
For example, on Linux, this will hold the
file handles obtained from `perf_event_open` system calls and the corresponding
`mmap` ring buffers where the kernel populates the sample data.
This information
will be used to close the resources once the profiling is stopped on an `M`.
We’ll replace the `hz` argument used in various runtime APIs involved in CPU
profiling with the following two arguments.
1. `eventId cpuEvent` (described in the constants previously), and
2. `profConfig *cpuProfileConfig`.
Figure 2 shows the call graph for the new workflow.
We abbreviate `eventId` as
`i` and `cpuProfileConfig` as `p` in most places.
![call_graph_pprof++](36821/call_graph_pprof++.png)
*Figure 2: A pictorial representation of the OS timer-based and PMU-based
sampling workflow.
The first line of each record box is the function name.
The
remaining lines provide a textual description of the steps taken by that
function.*
A `nil` value of `cpuProfileConfig` implies stop profiling for the specified
`cpuEvent`, and a `non-nil` value implies start profiling for the specified
`cpuEvent`.
At the runtime-level, `runtime_pprof_setCPUProfileConfig()` makes a
copy of the `cpuProfileConfig` passed from `pprof`, so that inside the runtime,
we can perform fast pointer comparison and rely on two large `cpuProfileConfig`
structs to be equal if pointers to them are equal.
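This copy-then-compare-pointers scheme can be sketched as follows (names are illustrative, not the actual runtime code):

```go
package main

import "fmt"

type cpuProfileConfig struct {
	hz     uint64
	period uint64
}

// copyConfig makes the runtime's private copy of a caller-supplied config.
// Because each start request produces exactly one private copy, two configs
// inside the runtime are equal iff their pointers are equal, so deep struct
// comparison is never needed on hot paths.
func copyConfig(cfg *cpuProfileConfig) *cpuProfileConfig {
	if cfg == nil {
		return nil
	}
	copied := *cfg
	return &copied
}

func main() {
	caller := &cpuProfileConfig{period: 1000000}
	private := copyConfig(caller)
	fmt.Println(private != caller)   // true: distinct pointers
	fmt.Println(*private == *caller) // true: identical values
	// Per-M reconfiguration then reduces to a pointer comparison, e.g.:
	// if m.profConfig[i] != sched.profConfig[i] { reconfigure(i) }
}
```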
Starting/stopping a PMU counter is different from starting/stopping an OS timer.
Hence, `if eventId == _CPUPROF_OS_TIMER` checks will appear in a few places inside
the runtime’s profiling-related code.
A few concurrency aspects deserve a mention.
1. Each active event will create its own ring buffer on demand.
There will be a
maximum of `EV_MAX` ring buffers.
Each ring buffer will be created only if there
is a corresponding active event.
The size of each ring buffer is 16KB.
If all
`EV_MAX` events are active at the same time, it will result in `16KB * 8 =
128KB` of ring buffers, which is paltry on modern machines.
2. We’ll change the
`cpuprof` variable in `cpuprof.go` to be an array of `EV_MAX` size.
3.
`runtime_pprof_setCPUProfileConfig` will hold a lock only for the `cpuprof[i]`,
where `i` is the event id for which profile has to be started or stopped.
4. There will be a single `signalLock` rather than `EV_MAX` `signalLock`s.
This is
because the `setcpuprofileconfig` method in `cpuprof.go` needs to check (read)
all `prof[eventId].config`s to infer whether all events are stopped before
uninstalling the signal handler.
Taking `EV_MAX` locks one after another would
be no better than holding just one lock.
The consequence of this is that we
serialize simultaneous writes to two different ring buffers.
This should not be
a serious issue since there cannot be two concurrent `SIGPROF`s.
If a system
supports concurrent `SIGPROF` handling, and if there are a lot of threads in a
system, we can revisit this choice.
5. In a very rare case when a `SIGPROF` is delivered for an already running
event when another event is being set (which will hold the `signalLock`), the
signal handler recognizes it and drops such a sample; otherwise, a deadlock would
ensue.
At the lowest level, we will incorporate the following three system calls into
all `runtime/sys_linux_<arch>.s` files.
1. `perf_event_open`, which is Linux’s means to program PMUs.
The Go signature
will be `func perfEventOpen(attr *perfEventAttr, pid uintptr, cpu, groupFd
int32, flags uintptr) int32`
2. `ioctl`, which is needed to stop the PMU counter
when handling the `SIGPROF`, and reset and restart after handling the signal.
The
Go signature will be `func ioctl(fd, req int32, arg uintptr) int32`
3. `fcntl`,
which is necessary to arrange a `perf_event` file handle to deliver the
`SIGPROF` signal and also to setup the signal to be delivered exactly to the `M`
(OS thread) whose PMU counter overflowed.
The Go signature will be `func fcntl(fd,
cmd int32, arg uintptr) int32`
#### 4.2.1 Segregating `SIGPROF` from different sampling sources
All PMU events and the OS timer use the same `SIGPROF` signal to deliver their
interrupts.
Hence, the Unix signal handler needs to address the following
concerns.
1. Identify whether a signal was delivered as a result of the OS timer expiring
or a PMU counter overflowing.
This is necessary because the subsequent processing will be
different based on the event.
2. Identify which PMU counter caused the interrupt, when multiple PMU events are being
sampled at the same time.
This is necessary because the subsequent
processing needs to append the backtraces into the buffer corresponding to the
PMU event that triggered the interrupt.
Different PMU events do not go into the same
ring buffer.
Although both OS timer expiry and any PMU counter overflow result in the
same `SIGPROF` signal, the signal codes are different.
The PMU counter overflow
generates a `POLL_IN` signal code indicating the presence of data whereas the OS
timer interrupts generate a `SI_KERNEL` or `SI_ALARM` signal code.
We use this
information to address #1 above.
Once we know that the signal was generated due to some PMU counter overflowing,
we use the file descriptor `fd` passed in the signal info and match it against
the opened perf events file descriptors on the same `M` and use the matching
location to infer the PMU event `i` that caused the signal; the rest of the
signal handling makes use of this `i` to record the backtrace in the appropriate
ring buffer.
During the signal handling, we’ll pause all active PMU counters via `ioctl`
calls on the `fd` so that the signal handling is not counted towards the PMU
event counters.
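The segregation logic above can be sketched as a pure function. On Linux, `POLL_IN` corresponds to `si_code` 1 and `SI_KERNEL` to 0x80, but treat the constants and the helper name here as assumptions for illustration:

```go
package main

import "fmt"

// si_code values (assumed; Linux defines POLL_IN = 1, SI_KERNEL = 0x80).
const (
	_POLL_IN   int32 = 1
	_SI_KERNEL int32 = 0x80
)

// classifySIGPROF maps a delivered SIGPROF to its source: either the OS
// timer, or the index of the perf event whose file descriptor matches,
// or -1/false when the sample must be dropped.
func classifySIGPROF(siCode, fd int32, perfFDs []int32) (event int, isTimer bool) {
	if siCode != _POLL_IN {
		return -1, true // timer expiry (SI_KERNEL / SI_ALARM codes)
	}
	for i, f := range perfFDs { // match against this M's open perf_event fds
		if f == fd {
			return i, false
		}
	}
	return -1, false // unmatched PMU interrupt: drop it
}

func main() {
	fds := []int32{7, 9, 11} // fds from perf_event_open, one per active event
	fmt.Println(classifySIGPROF(_SI_KERNEL, 0, fds)) // -1 true
	fmt.Println(classifySIGPROF(_POLL_IN, 9, fds))   // 1 false
}
```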
### 4.3 Exposing PMU events via `runtime/pprof` package
We now turn our attention to the `runtime/pprof` package, which exposes the PMU
sampling to its clients.
There are several design choices.
There is a tension between
a revolutionary and an evolutionary solution; there is friction between exposing the
full PMU functionality and offering the bare minimum.
In our proposal, we hope
to make pragmatic choices among the myriad options.
We propose adding a new API, which accepts different configurations needed to
start any CPU event -- OS timer and PMU counters.
The API signature is shown
below.
```
pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts ...ProfilingOption) error
```
There is an optional variadic `ProfilingOption` to allow
more than one event to be started during a profiling session.
A `ProfilingOption` is an interface exposed by `runtime/pprof` as shown below.
```
type ProfilingOption interface { apply() error }
```
Pprof will implement the following profiling option functions that mirror
various `cpuEvent` constants previously presented.
```
func OSTimer(w io.Writer) ProfilingOption
func CPUCycles(w io.Writer, period uint64) ProfilingOption
func CPUInstructions(w io.Writer, period uint64) ProfilingOption
func CPUCacheReferences(w io.Writer, period uint64) ProfilingOption
func CPUCacheMisses(w io.Writer, period uint64) ProfilingOption
func CPUBranchInstructions(w io.Writer, period uint64) ProfilingOption
func CPUBranchMisses(w io.Writer, period uint64) ProfilingOption
func CPURawEvent(w io.Writer, period uint64, hex uint64) ProfilingOption
```
All of them consume an `io.Writer`.
The `OSTimer` accepts no other argument.
A
`period` accepted by all PMU events specifies the number of events that must
elapse between two consecutive interrupts.
A larger value means coarser sampling
granularity (the opposite of a rate in Hz).
Passing a zero value will result in using a
preset period.
`CPURawEvent` is special; it accepts any user-provided
hexadecimal PMU event code.
Notice that each event requires an `io.Writer`
since different profile types cannot be serialized into the same pprof protocol
buffer.
In the cases where multiple `ProfilingOption`s are passed to the
`pprof.StartCPUProfileWithConfig`, sanity checks are made to ensure the
uniqueness of each `cpuEvent` and each `io.Writer`.
Below are a few example usages of the API.
```
// Simple
StartCPUProfileWithConfig(OSTimer(w))
StartCPUProfileWithConfig(CPUCycles(w, period))
StartCPUProfileWithConfig(CPURawEvent(w, period, hexadecimalEvent))
// Advanced
StartCPUProfileWithConfig(CPUCacheReferences(w1, period1), CPUCacheMisses(w2, period2))
```
The implementation of `StartCPUProfile(writer)` will simply invoke
`StartCPUProfileWithConfig(OSTimer(writer))`.
The implementation of
`StartCPUProfileWithConfig` will ensure a maximum of `EV_MAX` elements, unique
`io.Writers`, and unique `cpuEvents`.
After sanitizing the arguments, it will create
a `cpuProfileConfig` for each `cpuEvent` requested to be profiled.
In a global
data structure `cpu`, we will maintain the active events, their
`*cpuProfileConfig`, and the corresponding `io.Writer`.
```
var cpu struct {
sync.Mutex
profiling bool
activeConfig [EV_MAX]*cpuProfileConfig
activeWriter [EV_MAX]io.Writer
eventName [EV_MAX]string
done chan bool
}
```
Finally, `StartCPUProfileWithConfig` will invoke
`runtime_pprof_setCPUProfileConfig` multiple times once for each active event
to request the `runtime` to actually configure either the OS timer or the PMU
with a specific event to start sampling.
A lock will be held throughout the invocation of
`StartCPUProfileWithConfig` and hence no two
`StartCPUProfileWithConfig/StartCPUProfile` invocations run concurrently.
Once
a session of `StartCPUProfileWithConfig/StartCPUProfile` is active, another
profiling is disallowed until `StopCPUProfile` is invoked.
Before successfully returning from `StartCPUProfileWithConfig`, we'll launch a
goroutine `profileWriter`, which will periodically wake up and scrape
runtime-produced stack samples from the ring buffers associated with each of
the active events and serialize them into the corresponding `io.Writer`s.
This follows the same design as the original pprof but adds the
responsibility of collecting profiles from multiple events. Creating one
goroutine to scrape each active event would be wasteful; a single
goroutine suffices to collect backtraces from all ring buffers.
### 4.4 The optional multiple CPU profiles feature
Allowing multiple CPU profiles in a single profiling session via
`pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts
...ProfilingOption)` API is a power-user feature.
To avoid accidental
misuse, we'll protect it behind an environment variable,
`GO_PPROF_ENABLE_MULTIPLE_CPU_PROFILES=<true|false>`, which will be disabled by
default.
### 4.5 The `testing` package
We’ll add the following two command line arguments to the testing package:
1. `-cpuprofileevent=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>`
2. `-cpuprofileperiod=<Int64Value>`
The default value of `-cpuprofileevent` is `timer` with 100Hz when only
`-cpuprofile` is passed, which ensures compatibility with the current behavior.
`-cpuprofileperiod` will be ignored for `-cpuprofileevent=timer`, which will
default to the current 100Hz rate.
The testing package will map a
user-specified string for `-cpuprofileevent` value to the appropriate
`pprof.ProfilingOption`, the `-cpuprofileperiod` to an integer, and invoke
`pprof.StartCPUProfileWithConfig()`.
For example, `-cpuprofileevent=cycles` and
`-cpuprofileperiod=100000` will result
in `pprof.StartCPUProfileWithConfig(CPUCycles(w, 100000))`.
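The flag-to-event mapping, including decoding the `r<hexValue>` raw-event form, could look like the following sketch (the helper name and return shape are assumptions):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseEvent maps a -cpuprofileevent flag value to an event name and, for
// raw events of the form r<hex>, the decoded raw event code.
func parseEvent(flag string) (name string, raw uint64, err error) {
	switch flag {
	case "timer", "cycles", "instructions", "cacheMisses",
		"cacheReferences", "branches", "branchMisses":
		return flag, 0, nil
	}
	if strings.HasPrefix(flag, "r") { // e.g. "r3c" for raw event 0x3c
		raw, err = strconv.ParseUint(flag[1:], 16, 64)
		if err == nil {
			return "raw", raw, nil
		}
	}
	return "", 0, fmt.Errorf("unrecognized event %q", flag)
}

func main() {
	fmt.Println(parseEvent("cycles"))
	fmt.Println(parseEvent("r3c")) // raw event code 0x3c = 60
}
```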
### 4.6 The `net/http/pprof` package
We’ll add the following two additional arguments in an `http` request exposed
by the `net/http/pprof` CPU profiling endpoint.
1. `event=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>`
2. `period=<Int64Value>`
`period` will be ignored for `event=timer`, which will default
to the current 100Hz rate.
The default value of `event`
is `timer` with 100Hz, which ensures compatibility with the current behavior.
The `net/http/pprof` package will map a user-specified string for `event`
value to an appropriate `pprof.ProfilingOption`, `period` to an
integer, and invoke `pprof.StartCPUProfileWithConfig`.
## 5. Empirical evidence on the accuracy and precision of PMU profiles
### 5.1 The PMU produces precise and accurate profiles for concurrent programs
Below we show the `go tool pprof -top` output of the previously shown
[`goroutine.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go)
concurrent program run for two times.
The profiles used the proposed new API
with a 1M sampling period: `pprof.StartCPUProfileWithConfig(CPUCycles(w,
1000000))`.
The CPU cycles attributed to each routine `f1-f10` match the
expectation (approximately 10% of the total execution time) within one run and
across multiple runs.
*Thus the profiles are accurate and precise.
The
measurement overhead is indistinguishable from the OS timer in this setting.*
#### (PMU Cycles profile) goroutine.go/Run 1:
```
File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:49pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
flat flat% sum% cum cum%
23400000000 10.00% 10.00% 23400000000 10.00% main.f1
23400000000 10.00% 20.00% 23400000000 10.00% main.f10
23400000000 10.00% 30.00% 23400000000 10.00% main.f2
23400000000 10.00% 40.00% 23400000000 10.00% main.f3
23400000000 10.00% 50.00% 23400000000 10.00% main.f4
23400000000 10.00% 60.00% 23400000000 10.00% main.f5
23400000000 10.00% 70.00% 23400000000 10.00% main.f6
23400000000 10.00% 80.00% 23400000000 10.00% main.f7
23400000000 10.00% 90.00% 23400000000 10.00% main.f8
23400000000 10.00% 100% 23400000000 10.00% main.f9
```
#### (PMU Cycles profile) goroutine.go/Run 2:
```
File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:51pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
flat flat% sum% cum cum%
23800000000 10.17% 10.17% 23800000000 10.17% main.f1
23500000000 10.04% 20.21% 23500000000 10.04% main.f7
23500000000 10.04% 30.26% 23500000000 10.04% main.f9
23400000000 10.00% 40.26% 23400000000 10.00% main.f10
23400000000 10.00% 50.26% 23400000000 10.00% main.f2
23400000000 10.00% 60.26% 23400000000 10.00% main.f4
23400000000 10.00% 70.26% 23400000000 10.00% main.f6
23400000000 10.00% 80.26% 23400000000 10.00% main.f8
23300000000 9.96% 90.21% 23300000000 9.96% main.f3
22900000000 9.79% 100% 22900000000 9.79% main.f5
```
### 5.2 The PMU produces precise and accurate profiles for serial programs
Below we show the `go tool pprof -top` output of the previously shown
[`serial.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go)
program run for two times.
We remind the reader that the expected relative
execution time is encoded in the function names themselves (`J_expect_18_18`
should consume 18.18% CPU cycles and so on).
The profiles used the proposed
new API with a 1M sampling period: `pprof.StartCPUProfileWithConfig(CPUCycles(w,
1000000))`.
The CPU cycles attributed to each routine match the expectation
within a run and across multiple runs.
*Thus the profiles are accurate and
precise.*
#### (PMU Cycles profile) serial.go/Run 1:
```
File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
flat flat% sum% cum cum%
200000000 18.10% 18.10% 200000000 18.10% main.J_expect_18_18
183000000 16.56% 34.66% 183000000 16.56% main.I_expect_16_36
165000000 14.93% 49.59% 165000000 14.93% main.H_expect_14_546
137000000 12.40% 61.99% 137000000 12.40% main.G_expect_12_73
120000000 10.86% 72.85% 120000000 10.86% main.F_expect_10_91
100000000 9.05% 81.90% 100000000 9.05% main.E_expect_9_09
82000000 7.42% 89.32% 82000000 7.42% main.D_expect_7_27
63000000 5.70% 95.02% 63000000 5.70% main.C_expect_5_46
37000000 3.35% 98.37% 37000000 3.35% main.B_expect_3_64
18000000 1.63% 100% 18000000 1.63% main.A_expect_1_82
0 0% 100% 1105000000 100% main.main
0 0% 100% 1105000000 100% runtime.main
```
#### (PMU Cycles profile) serial.go/Run 2:
```
File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
flat flat% sum% cum cum%
200000000 18.10% 18.10% 200000000 18.10% main.J_expect_18_18
183000000 16.56% 34.66% 183000000 16.56% main.I_expect_16_36
159000000 14.39% 49.05% 159000000 14.39% main.H_expect_14_546
142000000 12.85% 61.90% 142000000 12.85% main.G_expect_12_73
119000000 10.77% 72.67% 119000000 10.77% main.F_expect_10_91
100000000 9.05% 81.72% 100000000 9.05% main.E_expect_9_09
82000000 7.42% 89.14% 82000000 7.42% main.D_expect_7_27
61000000 5.52% 94.66% 61000000 5.52% main.C_expect_5_46
40000000 3.62% 98.28% 40000000 3.62% main.B_expect_3_64
18000000 1.63% 99.91% 18000000 1.63% main.A_expect_1_82
1000000 0.09% 100% 1105000000 100% main.main
0 0% 100% 1105000000 100% runtime.main
```
### 5.3 A practical use
We use pprof to collect production CPU profiles at Uber.
On one of our Go microservices, when we used the default OS timer-based CPU
profiles, we saw that the `mutex.Lock()` accounted for 15.79% of the time and
`mutex.Unlock()` accounted for 13.55% of the CPU time.
We knew from the program structure that `mutex.Lock()` and `mutex.Unlock()`
were invoked in pairs one after another, and also that `mutex.Lock()` is
significantly more heavyweight than `mutex.Unlock()`; thus, we
expected much less time to be spent inside `mutex.Unlock()` relative to
`mutex.Lock()`.
The OS-timer profiling data did not match our expectations.
However, when we used a prototype of the PMU-based profiler and sampled the CPU
cycles event, the quantification changed and matched our expectations;
`mutex.Lock()` accounted for 33.36% of the CPU cycles and `mutex.Unlock()`
accounted for 7.57% of the CPU cycles.
The PMU profiles avoided an unnecessary performance investigation.
## 6. Rationale
### 6.1 Advantages
CPU profiling via PMUs results in richer-quality profiles -- high-resolution
measurement, higher accuracy, and higher precision.
Access to various events
such as cache misses in PMUs allows better insights into understanding the
causes of complex performance issues.
Any kind of PMU event can be attributed
to call stack samples, which is useful to provide actionable feedback to the
developer.
Incorporating the PMU facility in the runtime will allow the Go
profiler to take full advantage of the progress happening in the hardware
performance monitoring facilities available in modern CPUs.
This framework will
allow for independently developing more advanced profiling features such as
attributing high-latency data accesses to Go objects, which can help refactor
code or data structures for better data locality [5, 6, 7].
### 6.2 Disadvantages
1. PMU-based profiling may not work when a Go program is running under a
virtual machine, since virtual machines may not expose the hardware performance
counters to a guest OS.
Hence, the proposal keeps the OS timer as the default
and makes PMU-based sampling available only when it is explicitly requested.
2. A higher sampling rate (a shorter period between samples) obviously incurs
higher overhead.
For example, the default timer with a 100Hz sampling rate
introduces 3.6% measurement overhead on a 2.4GHz Intel Skylake machine.
Using
CPU cycles with a period of 24000000 cycles, which amounts to the same 100
samples/second, introduces the same 3.6% overhead.
The table below shows
overheads at other sampling periods using the PMU cycles and also compares
against the default OS-timer profiler overhead.
```
The new PMU-based profiler with cycles event
-----------------------------------------------------
Sampling period in CPU cycles Overhead
-----------------------------------------------------
24000000 (100 samples/sec) 1.036x
2400000 (1K samples/sec) 1.036x
240000 (10K samples/sec) 1.13x
24000 (100K samples/sec) 2.05x
```
```
The default OS-timer based profiler
-----------------------------------------------------
Sampling rate Overhead
-----------------------------------------------------
100 samples/sec 1.036x
```
A reasonable sampling period, once in 2.4M CPU cycles on a 2.4GHz
Intel Skylake machine, offers a good trade-off between a high sampling rate and low
overhead.
3. When using any other PMU event, such as cache misses, the sampling period
needs to be more carefully curated.
We’ll provide preset values for the predefined
events.
4. PMU profiling has a mild skid on deeply pipelined, multi-issue,
out-of-order, superscalar CPU architectures due to multiple instructions
flowing in the pipeline.
Thus the program counter seen in the interrupt can be
a handful of instructions off from the one generating the counter overflow.
This "skid" is much smaller than the several milliseconds of skid in an OS
timer.
Moreover, modern PMUs have built-in features to eliminate such skid
(e.g., Intel Precise Event Based Sampling ([PEBS](https://software.intel.com/sites/default/files/managed/8b/6e/335279_performance_monitoring_events_guide.pdf?ref=hvper.com)), AMD Instruction Based
Sampling ([IBS](https://developer.amd.com/wordpress/media/2012/10/AMD_IBS_paper_EN.pdf)), and PowerPC Marked events ([MRK](https://wiki.raptorcs.com/w/images/6/6b/POWER9_PMU_UG_v12_28NOV2018_pub.pdf))).
Some PMUs can be configured so
that they can offer the exact program counter involved in the counter overflow.
We have introduced `profilePCPrecision` for this reason.
5. There are finite counters on any PMU. Counter multiplexing happens if the
number of events is more than the number of available PMU counter slots. In
that case the events run only part of the time.
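The period-to-rate arithmetic behind the overhead table in item 2 can be checked with a tiny sketch (the clock frequency is a parameter; 2.4GHz matches the machine used above):

```go
package main

import "fmt"

// periodForRate converts a desired samples/sec rate into a CPU-cycles
// sampling period for a machine with the given clock frequency.
func periodForRate(clockHz, samplesPerSec uint64) uint64 {
	return clockHz / samplesPerSec
}

func main() {
	const skylakeHz = 2400000000 // 2.4GHz machine from the table above
	fmt.Println(periodForRate(skylakeHz, 100))    // 24000000
	fmt.Println(periodForRate(skylakeHz, 1000))   // 2400000
	fmt.Println(periodForRate(skylakeHz, 100000)) // 24000
}
```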
### 6.3 Alternatives explored
1. The Linux `perf` command-line tool is powerful and offers comparable solutions.
However, it does not address several important use cases.
1. Profiling via an `http` endpoint (provided by `net/http/pprof` package in
Go) is the preferred, readily available, and scalable solution that works at
the enterprise scale. Making `perf` work in a similar fashion would lead to
creating too many custom solutions.
2. `pprof` is an in-situ profiler running in the same address
space as the process being profiled and hence does not require any special
system privileges; in contrast, attaching `perf` to a running process requires
changes to a system’s security settings.
3. We are also focusing on
containerized environments, which are widely used in large-scale Go deployments.
Although `perf` can collect call stack samples from a container, it fails to find
symbols of Go binaries located within a container to produce clean reports
(e.g., flame graphs).
The workaround is non-trivial.
2. We explored library-level facilities to collect performance counters such as perf-util [\[3\]](https://github.com/hodgesds/perf-utils).
However, no library support will provide good profiling for Go since the runtime
can schedule a monitored function onto another thread, where the counters may not
be enabled.
The correct solution, hence, is to implement the hardware
performance counter facility inside the Go runtime's scheduler, as we have
proposed here.
3. We seriously contemplated not introducing a new pprof API
and tried to work with the existing API `StartCPUProfile(io.Writer)`.
We
considered adding variadic `ProfilingOptions` to it.
However, any such change
would qualify as breaking `Go1.X` compatibility, so we went ahead and introduced a new API.
4. We considered passing an `Options` struct to the new API
`StartCPUProfileWithConfig`.
However, the usage looked complicated with certain
fields making sense only in certain circumstances and hence decided to use the
functional option pattern.
5. We considered having only one CPU profiling
event active at a time.
However, experience has shown that this limits
performance analysis in several ways, and advanced performance tuning demands
correlating more than one event to find the root cause.
Furthermore, this richness
in Go profiling can serve to advance the state-of-the-art in profilers.
6. On the implementation front, we considered using a `map` for active events at
several places instead of an `EV_MAX` array.
However, concurrent access to
different array elements was superior to locking a shared `map`.
Moreover,
`EV_MAX` is small enough to search linearly for active events.
## 7. Compatibility
The design ensures [compatibility with Go1.0](https://golang.org/doc/go1compat)
since no existing public API signatures, interface signatures, or data
structures are changed.
We have introduced a new interface and a handful of new
APIs in the `pprof` package.
## 8. Implementation
Milind Chabbi from Uber Technologies will implement the proposal.
The
implementation will proceed as follows:
1. Add assembly language support for `perf_events` related system calls.
2. Modify the `runtime` and `runtime/pprof` packages to support
PMU sampling, reflecting the call graph / workflow shown in Figure 2.
3. Expose the new facility via `testing` and `net/http/pprof` packages.
4. Make changes to `pprof` CPU profiling tests to use PMU cycles in addition to
OS timer (on supported platforms).
5. Add PMU profiling tests to run on supported platforms.
The implementation will be broken into a large checkin related to the `runtime`
and `runtime/pprof` packages, a few small checkins related to `testing`, and
`net/http/pprof` packages, and the tests accompanying the code.
The
implementation will align with the February-April 2020 development cycle and
May-July 2020 release cycle.
A prototype is available [here](https://github.com/uber-dev/go/tree/pmu_pprof).
## 9. Open issues
1. Supporting Windows and MacOS will be left for future work, but any support
from experts on those platforms is appreciated.
2. The issue [21295](https://github.com/golang/go/issues/21295) proposes
adding a `-test.counters` flag to the testing package. That proposal is
orthogonal to this one.
There
isn’t a detailed description of how the proposal wants to accomplish it.
Simply
starting and stopping a PMU counter by making a `perf_event_open` system call at
the testing package level will not accomplish the objective.
The monitored
functions can migrate from one `M` to another, and hence the changes must be
made in the runtime as described in this proposal.
However, it is relatively
easy to accomplish the objective of issue #21295 by building on the changes
suggested in this proposal.
We are open to
discussing with the creators of #21295 to understand how we can incorporate
their needs into this design.
3. The additions to the public APIs and naming
aesthetics can be refined with inputs from the experts and the community.
4. Currently, we do not bubble up runtime failures to start a PMU profiler (the same is
true for a timer).
Should we bubble up any failure to configure a PMU or
silently produce zero samples in the profiles?
5. Should we introduce a throttling mechanism if the rate of events is too
high for a PMU event?
## 10. Acknowledgment
Pengfei Su implemented a prototype of the design
while interning at Uber Technologies.
Joshua Corbin helped design the public surface APIs.
This work is generously funded by Uber Technologies’ Programming Systems
Research team, whose members provided feedback for improvement.
## 11. Appendix
##### (OS timer profile) goroutine.go/run1:
```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.70s, Total samples = 18060ms (269.37%)
Showing nodes accounting for 18060ms, 100% of 18060ms total
flat flat% sum% cum cum%
4210ms 23.31% 23.31% 4210ms 23.31% main.f7
2610ms 14.45% 37.76% 2610ms 14.45% main.f2
2010ms 11.13% 48.89% 2010ms 11.13% main.f6
1810ms 10.02% 58.91% 1810ms 10.02% main.f10
1780ms 9.86% 68.77% 1780ms 9.86% main.f3
1410ms 7.81% 76.58% 1410ms 7.81% main.f1
1310ms 7.25% 83.83% 1310ms 7.25% main.f4
1110ms 6.15% 89.98% 1110ms 6.15% main.f5
1110ms 6.15% 96.12% 1110ms 6.15% main.f8
700ms 3.88% 100% 700ms 3.88% main.f9
```
##### (OS timer profile) goroutine.go/run2:
```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.71s, Total samples = 17400ms (259.39%)
Showing nodes accounting for 17400ms, 100% of 17400ms total
flat flat% sum% cum cum%
3250ms 18.68% 18.68% 3250ms 18.68% main.f2
2180ms 12.53% 31.21% 2180ms 12.53% main.f9
2100ms 12.07% 43.28% 2100ms 12.07% main.f1
1770ms 10.17% 53.45% 1770ms 10.17% main.f6
1700ms 9.77% 63.22% 1700ms 9.77% main.f5
1550ms 8.91% 72.13% 1550ms 8.91% main.f4
1500ms 8.62% 80.75% 1500ms 8.62% main.f8
1440ms 8.28% 89.02% 1440ms 8.28% main.f3
1390ms 7.99% 97.01% 1390ms 7.99% main.f10
520ms 2.99% 100% 520ms 2.99% main.f7
```
##### (OS timer profile) goroutine.go/Run 3:
```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:48pm (PST)
Duration: 6.71s, Total samples = 17.73s (264.31%)
Showing nodes accounting for 17.73s, 100% of 17.73s total
flat flat% sum% cum cum%
3.74s 21.09% 21.09% 3.74s 21.09% main.f7
2.08s 11.73% 32.83% 2.08s 11.73% main.f9
2.05s 11.56% 44.39% 2.05s 11.56% main.f2
1.85s 10.43% 54.82% 1.85s 10.43% main.f10
1.78s 10.04% 64.86% 1.78s 10.04% main.f1
1.43s 8.07% 72.93% 1.43s 8.07% main.f3
1.42s 8.01% 80.94% 1.42s 8.01% main.f8
1.18s 6.66% 87.59% 1.18s 6.66% main.f6
1.17s 6.60% 94.19% 1.17s 6.60% main.f5
1.03s 5.81% 100% 1.03s 5.81% main.f4
```
##### (OS timer profile) serial.go/Run 1:
```
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:42pm (PST)
Duration: 501.51ms, Total samples = 320ms (63.81%)
Showing nodes accounting for 320ms, 100% of 320ms total
flat flat% sum% cum cum%
80ms 25.00% 25.00% 80ms 25.00% main.H_expect_14_546
80ms 25.00% 50.00% 80ms 25.00% main.J_expect_18_18
60ms 18.75% 68.75% 60ms 18.75% main.G_expect_12_73
20ms 6.25% 75.00% 20ms 6.25% main.B_expect_3_64
20ms 6.25% 81.25% 20ms 6.25% main.D_expect_7_27
20ms 6.25% 87.50% 20ms 6.25% main.F_expect_10_91
20ms 6.25% 93.75% 20ms 6.25% main.I_expect_16_36
10ms 3.12% 96.88% 10ms 3.12% main.A_expect_1_82
10ms 3.12% 100% 10ms 3.12% main.C_expect_5_46
0 0% 100% 320ms 100% main.main
0 0% 100% 320ms 100% runtime.main
```
##### (OS timer profile) serial.go/Run 2:
```
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:44pm (PST)
Duration: 501.31ms, Total samples = 320ms (63.83%)
Showing nodes accounting for 320ms, 100% of 320ms total
flat flat% sum% cum cum%
70ms 21.88% 21.88% 70ms 21.88% main.I_expect_16_36
50ms 15.62% 37.50% 50ms 15.62% main.J_expect_18_18
40ms 12.50% 50.00% 40ms 12.50% main.E_expect_9_09
40ms 12.50% 62.50% 40ms 12.50% main.F_expect_10_91
40ms 12.50% 75.00% 40ms 12.50% main.H_expect_14_546
30ms 9.38% 84.38% 30ms 9.38% main.D_expect_7_27
20ms 6.25% 90.62% 20ms 6.25% main.B_expect_3_64
20ms 6.25% 96.88% 20ms 6.25% main.G_expect_12_73
10ms 3.12% 100% 10ms 3.12% main.C_expect_5_46
0 0% 100% 320ms 100% main.main
0 0% 100% 320ms 100% runtime.main
```
##### (OS timer profile) serial.go/Run 3:
```
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:45pm (PST)
Duration: 501.39ms, Total samples = 310ms (61.83%)
Showing nodes accounting for 310ms, 100% of 310ms total
flat flat% sum% cum cum%
110ms 35.48% 35.48% 110ms 35.48% main.J_expect_18_18
70ms 22.58% 58.06% 70ms 22.58% main.G_expect_12_73
60ms 19.35% 77.42% 60ms 19.35% main.F_expect_10_91
30ms 9.68% 87.10% 30ms 9.68% main.I_expect_16_36
20ms 6.45% 93.55% 20ms 6.45% main.H_expect_14_546
10ms 3.23% 96.77% 10ms 3.23% main.B_expect_3_64
10ms 3.23% 100% 10ms 3.23% main.C_expect_5_46
0 0% 100% 310ms 100% main.main
0 0% 100% 310ms 100% runtime.main
```
## 12. References
1. HPCToolkit: https://github.com/HPCToolkit/hpctoolkit
2. Oracle Developer Studio:
https://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html
3. Perf utils: https://github.com/hodgesds/perf-utils
4. Intel VTune: https://software.intel.com/en-us/vtune
5. "Pinpointing Data Locality Problems
Using Data-Centric Analysis", Xu Liu and John Mellor-Crummey, 2011 International
Symposium on Code Generation and Optimization, April 2-6, 2011, Chamonix,
France.
6. "A Data-centric Profiler for Parallel Programs", Xu Liu and John
Mellor-Crummey, The International Conference for High Performance Computing,
Networking, Storage and Analysis, November 17-22, 2013, Denver, Colorado, USA.
7. "ScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel
Programs", Xu Liu and Bo Wu, The International Conference for High Performance
Computing, Networking, Storage and Analysis, Nov 15-20, 2015, Austin, Texas,
USA.