proposal: hardware performance counters for CPU profiling.
Addresses the issue golang/go#36821.
Change-Id: I232f0da8396f49c21aea021acfad52bb5dcc515a
diff --git a/design/36821-perf-counter-pprof.md b/design/36821-perf-counter-pprof.md
new file mode 100644
index 0000000..64b256c
--- /dev/null
+++ b/design/36821-perf-counter-pprof.md
@@ -0,0 +1,1613 @@
+# Proposal: hardware performance counters for CPU profiling.
+
+Author(s): Milind Chabbi
+
+Last updated: Feb/13/2020
+
+Discussion at https://golang.org/issues/36821.
+
+## 1. Abstract
+
+The Go pprof CPU profiles are quite inaccurate and imprecise due to several
+limitations related to the operating system (OS) interval timer used as its
+sampling engine.
+Inaccurate profiles can mislead performance investigation and may lead
+to inappropriate optimization guidance.
+We propose incorporating hardware
+performance counters as an alternative sampling engine for CPU profiling.
+Hardware performance counter-based sampling rectifies many of the problems
+related to the OS timers.
+The hardware CPU cycles event provides microsecond
+measurement granularity, and it is available in almost all commodity CPUs.
+Furthermore, hardware Performance Monitoring Units (PMUs) allow measuring not
+only CPU cycles but other activities such as cache misses that help root cause
+complex performance problems.
+A [prototype implementation](https://github.com/uber-dev/go/tree/pmu_pprof) shows significantly
+better-quality CPU profiles when using PMUs while maintaining a low measurement
+overhead.
+The proposal will retain the OS-timer as the default sampling engine.
+The proposal exposes the PMU-based profiling feature on Linux-based systems,
+which provide an excellent facility to program PMUs.
+
+
+
+## Table of contents
+- [1. Abstract](#1-abstract)
+- [2. Background](#2-background)
+ - [2.1 High-level description of Go's CPU
+ profiling](#21-high-level-description-of-gos-cpu-profiling)
+ - [2.2 Desiderata and deficiencies of Go CPU
+ profiler](#22-desiderata-and-deficiencies-of-go-cpu-profiler)
+ - [2.3 Examples demonstrating inaccuracy and imprecision in Go
+ pprof profiles](#23-examples-demonstrating-inaccuracy-and-imprecision-in-go-pprof-profiles)
+ - [2.4 State-of-the-art in profiling using performance
+ monitoring units](#24-state-of-the-art-in-profiling-using-performance-monitoring-units)
+ - [2.5 Low-level details of the current Go CPU
+ profiler](#25-low-level-details-of-the-current-go-cpu-profiler)
+- [3. Problem statement](#3-problem-statement)
+- [4. Proposal](#4-proposal)
+ - [4.1 High-level design](#41-high-level-design)
+ - [4.2 The devil is in the
+ details](#42-the-devil-is-in-the-details)
+ - [4.3 Exposing PMU events via `runtime/pprof`
+ package](#43-exposing-pmu-events-via-runtimepprof-package)
+ - [4.4 The optional multiple CPU profiles
+ feature](#44-the-optional-multiple-cpu-profiles-feature)
+ - [4.5 The `testing` package](#45-the-testing-package)
+ - [4.6 The `net/http/pprof`
+ package](#46-the-nethttppprof-package)
+- [5. Empirical evidence on the accuracy and precision of PMU
+ profiles](#5-empirical-evidence-on-the-accuracy-and-precision-of-pmu-profiles)
+ - [5.1 The PMU produces precise and accurate profiles for
+ concurrent programs](#51-the-pmu-produces-precise-and-accurate-profiles-for-concurrent-programs)
+ - [5.2 The PMU produces precise and accurate profiles for
+ serial programs](#52-the-pmu-produces-precise-and-accurate-profiles-for-serial-programs)
+ - [5.3 A practical use](#53-a-practical-use)
+- [6. Rationale](#6-rationale)
+ - [6.1 Advantages](#61-advantages)
+ - [6.2 Disadvantages](#62-disadvantages)
+ - [6.3 Alternatives explored](#63-alternatives-explored)
+- [7. Compatibility](#7-compatibility)
+- [8. Implementation](#8-implementation)
+- [9. Open issues](#9-open-issues)
+- [10. Acknowledgment](#10-acknowledgment)
+- [11. Appendix](#11-appendix)
+- [12. References](#12-references)
+
+
+## 2. Background
+
+In this section, we first provide a high-level description of Go's CPU
+profiling, motivate the problems in the current profiler with examples,
+highlight the state-of-the-art in CPU profiling, and then dive deeper into the
+current implementation of CPU profiling inside Go.
+Our proposed changes appear
+in the next section.
+Readers familiar with profiling technology and the
+internals of pprof may skip this section.
+
+### 2.1 High-level description of Go’s CPU profiling
+
+Pprof, the de facto Go CPU profiler, employs sampling-based profiling for time
+measurement.
+It attributes the time metric to different functions, including
+their calling context (aka call path or backtrace).
+Pprof relies on the
+underlying operating-system timers (exposed by the Go runtime) to periodically
+interrupt a program’s execution.
+The sampling rate is typically one hundred
+hertz (a sample taken every 10ms); a different sampling rate can be configured,
+but typical systems cannot deliver more than 250 hertz.
+
+
+When profiling is enabled, the Go runtime scheduler sets up each OS thread in
+the process to generate periodic timer interrupts.
+In each interrupt, the go
+runtime’s signal handler obtains a backtrace and appends it to a ring buffer.
+All OS threads in a Go process append their backtraces to the same ring buffer
+coordinated via a lock.
+A side thread, launched by the `pprof` package,
+periodically scrapes the backtraces from the ring buffer and serializes them
+into a protobuf file to be used by the downstream tools.
+
+### 2.2 Desiderata and deficiencies of Go CPU profiler
+
+*Accuracy* and *precision* are desirable properties of a good measurement tool.
+
+A profiling datum is said to be *accurate* if it is close to the ground truth
+(aka true value).
+For example, if `API_A()` consumes 25% of the total
+execution time and a profiler attributes it 24% of the total execution time,
+the measurement has 96% accuracy.
+
+Profiling data are said to be *precise* if there is low variability among
+multiple measurements.
+Precision of a set of measurements of the same variable is often
+[expressed](https://www.wikihow.com/Calculate-Precision) either as min and max
+from the sample mean or as [standard
+error](https://en.wikipedia.org/wiki/Standard_error) or [coefficient of
+variation](https://en.wikipedia.org/wiki/Coefficient_of_variation) of the
+sample.
+
+Unfortunately, the current pprof CPU profiles meet neither of these two
+criteria.
+
+
+#### 2.2.1 Inaccuracy and its causes
+
+Inaccuracy of profiles arises from two sources: sample size and sampling bias.
+
+
+#### Sample size
+
+By the [law of large
+numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers), a larger sample
+size makes the sample mean closer to the population mean.
+Moreover, in
+profiling, where the sample size is very small compared to the population size,
+increasing the sample size greatly improves accuracy.
+Conversely, a smaller
+sample size leads to a lower inaccuracy.
+
+
+Low-resolution OS-timer is the primary reasons behind small sample size in a
+measurement window.
+One can obtain more samples either by increasing the
+length of a measurement window or by increasing the number of samples in a
+measurement window.
+The former works well for regular programs e.g., codes
+dominated by loops; however, for irregular codes, such as microservices, a
+larger measurement window does not rectify the measurement inaccuracy because
+more irregularities can easily surface in a larger observation window.
+Hence,
+there is a dire need to collect more samples within a measurement window and
+obtain fine-grained samples of sub-millisecond executions.
+Furthermore, linearly scaling the measurement time to collect more samples is
+impractical if orders of magnitude more samples are necessary to correct the
+small samples size issue.
+
+#### Sampling bias
+
+A larger sample size alone does not improve accuracy if the samples are biased.
+OS timer samples on some systems (e.g., Linux) are biased because a timer
+interrupt from one OS thread, say `T1` may be handled by an arbitrary (not to
+be confused with uniformly random) thread, say `T2`.
+This means, `T2` will
+handle the interrupt and incorrectly attribute the sample to its call stack,
+which results in biased samples.
+If more timer signals are handled by `T2`
+compared to `T1`, a [systematic sampling
+bias](https://en.wikipedia.org/wiki/Observational_error) will occur, which leads
+to inaccurate profiles.
+
+
+#### 2.2.1 Imprecision and its causes
+
+Imprecision in measurement comes from at least the following two sources:
+sample size and measurement skid.
+
+#### Sample size
+
+Fewer number of samples directly contributes to a large [standard
+error](https://en.wikipedia.org/wiki/Standard_error).
+The low resolution of OS
+timers is responsible for fewer samples in a measurement window, which results
+in a lower precision of pprof profiles.
+Conversely, a higher resolution of
+samples will improve precision.
+
+#### Measurement skid
+
+The second source of measurement error is the [random
+error](https://en.wikipedia.org/wiki/Observational_error) inherent in a
+measurement apparatus.
+An OS timer apparatus configured to expire after `N` milliseconds
+is only guaranteed to generate an interrupt some time after `N` milliseconds --
+not precisely at `N` milliseconds.
+This randomness introduces a large "skid" in
+measurement.
+Assume a periodic timer set to fire every 10ms.
+Assume it has
+1ms left before its expiry when the OS scheduler with a 4ms resolution inspects
+it.
+This means the timer will fire 4ms later, which is 3ms after the scheduled
+expiry time.
+This results in up to 30% imprecision for a 10ms periodic timer.
+Although random error may not be entirely eliminated, a superior measurement
+apparatus reduces the effects of random error.
+
+### 2.3 Examples demonstrating inaccuracy and imprecision in Go pprof profiles
+
+#### 2.3.1 A concurrent code
+
+Consider the Go program
+[goroutine.go](https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go),
+which has ten exactly similar goroutines `f1`-`f10`.
+These functions are
+computationally intensive, long-running, and are unaffected by IO or memory
+hierarchy.
+The functions are non-preemptive, and hence the Go runtime scheduler
+overhead is zero.
+It is expected that each function takes exactly the same
+amount of CPU time (i.e., approximately 10% of the overall execution).
+It
+should not matter how many OS threads or CPU cores they run on.
+
+
+Table 1 summarizes the result of `go tool pprof -top` for this program running
+on a Linux OS on a 12-core Intel Skylake server-class machine.
+The
+measurements were taken three times under the same configuration.
+The `Flat
+(ms)` column shows the absolute millisecond measurement attributed to each of
+the ten routines, and the `Flat (%)` column shows the relative time in a
+routine with respect to the total execution time.
+The `Expected (%)` column
+shows the expected relative time in each routine.
+The `Rank order` column
+assigns ranks in the descending order of execution time.
+The raw output is
+available in the [appendix](#11-appendix).
+
+ *Table 1: demonstration of inaccuracy and
+imprecision of pprof profiles using OS timers on a concurrent code.*
+
+First, let us focus on the data in a single run `RUN 1`.
+`f1`-`f10` have a wide
+variance in the time attributed to them; the expectation is that each of them
+gets `10%` execution time.
+There is up to 6x difference in the time attributed
+to the routine with the highest amount of attribution (`main.f7` with 4210ms,
+23.31%) and the routine with the lowest amount of attribution (`main.f9` with
+700ms, 3.88%).
+**This demonstrates the poor accuracy (deviation from the
+ground truth) of pprof timer-based profiles.**
+The issue reproduces when using different number of CPU cores.
+
+Now focus on the three runs `RUN 1-3` together .
+The time attributed to any
+routine widely varies from one run to another.
+The rank order of the routines
+changes significantly from run to run.
+In Run 1, `main.f7` is shown to run for
+4210ms with the rank order of 1, whereas in Run 2, it is shown to run for only
+520ms with the rank order 10.
+The expectation is that the measurements remain
+the same from run to run.
+From one run to another, function-level time
+attribution are not precise; for example, there is a 71% coefficient of
+variance among the three runs of the `f1` routine and the average coefficient
+of variance across all functions is 28%.
+**This demonstrates imprecision (unpredictability of
+measurement) of pprof timer-based profiles.**
+
+For the curious readers, Table 2 below shows near-prefect accuracy and
+precision when profiling the same code using the PMU-based profiler described in
+this proposal.
+We show two runs instead of three for
+brevity.
+Additional details appear in [Section
+5.1](#51-the-pmu-produces-precise-and-accurate-profiles-for-concurrent-programs)
+
+ *Table 2: demonstration of accuracy and
+precision of pprof profiles via PMU CPU cycles event on a concurrent code.*
+
+#### 2.3.2 A serial code
+
+The issue is not specific to concurrent code alone.
+Refer to a carefully
+crafted Go program --
+[`serial.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go)
+-- which has a single goroutine, and it has complete data
+dependence from one statement to another.
+It has ten functions
+`A_expect_1_82`, `B_expect_3_64`, ... , `J_expect_18_18`.
+The functions are
+intentionally written to consume different amounts of time to make this test
+case more representative.
+Each function is similar and has a loop, but the
+trip-count of the loop in each function is different.
+The loop in function
+with prefix `A` runs `1*C` times, where `C` is a constant.
+The loop in function
+with prefix `B` runs `2*C` times, and so on.
+This arrangement leads to a
+predictable relative execution time for each function; the function with prefix
+`A` consumes `1/sum(1..10)=1.82%` of the total execution time; and the function
+with prefix `B` consumes `2/sum(1..10)=3.64%` of the total execution time, and
+so on until the function with prefix `J` consumes `10/sum(1..10)=18.18%` of the
+total execution time.
+
+
+The expected relative percent of CPU time is encoded in each function name,
+`A_expect_1_82` should consume 1.82% execution time, `B_expect_3_64` should
+consume 3.64% execution time, `J_expect_18_18` should consume 18.18% of CPU
+time, and so on.
+Table 3 summarizes the output of pprof CPU profiling for this
+code.
+As before, the measurements were taken three times under the same
+configuration.
+The `Flat (ms)` column shows the absolute millisecond
+measurement attributed to each of the ten functions, and the `Flat (%)` column
+shows the relative time with respect to the total time taken by all functions.
+The `Expected (%)` column shows the expected relative time in each routine.
+The
+`Rank order` column assigns ranks in descending order of the execution time.
+The raw output is available in the [appendix](#11-appendix).
+
+
+ *Table 3: demonstration of inaccuracy and
+imprecision of pprof profiles on a serial code.*
+
+A `-` in a cell represents zero samples.
+First, let's focus on just one set of measurements from `RUN 1`.
+Rather than
+`J_expect_18_18` being the top time consumer with 18.18%, both `J_expect_18_18`
+and `H_expect_14_546` are the top time consumers each with 25% CPU time.
+`B_expect_3_64`, `D_expect_7_27`, `F_expect_10_91`, and `I_expect_16_36` are
+all lumped into a single rank order of 4 each attributed with 6.25% of
+execution time.
+`E_expect_9_09` does not even appear in the profiles despite
+its 9% expected execution time.
+More inaccuracies are evident in the data in
+the table.
+If we inspect the data from `RUN 2` and `RUN 3`, we notice more
+discrepancies where a lot of data is missing and incorrect.
+The rank ordering
+and the relative execution percentage both have very low correlation with the
+ground truth.
+Despite tens of milliseconds of execution time, some functions
+are over counted and some functions are under counted.
+
+
+Now, let's focus on the three runs together.
+The highest time consumer
+`J_expect_18_18` does not even get a single sample in `RUN 3`, indicating the
+imprecision.
+`I_expect_16_36`, which has the measurement data available in all
+three cases varies from 20ms in `RUN 1` with a rank order of 4 to 70ms in `RUN 2`
+with a rank order of 1.
+More discrepancies are visible in the table.
+Thus,
+serial program also shows run to run measurement variation, and hence the
+profiles are imprecise.
+
+
+Table 4 below shows near perfect accuracy and precision of profiling the same
+code using the PMU-based profiler described in this proposal.
+The measurements
+were taken two times instead of three for brevity.
+Additional details appear
+in [Section
+5.2](#52-the-pmu-produces-precise-and-accurate-profiles-for-serial-programs)
+
+ *Table 4: demonstration of accuracy and precision
+of pprof profiles via PMU CPU cycles event on a serial code.*
+
+### 2.4 State-of-the-art in profiling using performance monitoring units
+
+An alternative to the OS timer-based sampling is using hardware performance
+counters, which are ubiquitous in modern commodity CPUs.
+CPU’s Performance
+Monitoring Units (PMUs) have the capability to count various events such as
+cycles, instructions, cache misses, to name a few.
+Furthermore, PMUs can be
+configured to deliver an interrupt when an event reaches a specified threshold
+value.
+These PMU interrupts, like the OS timer interrupts, can serve as another
+source of sampling the call stack.
+This technique is employed in many modern
+profilers [1,2,4].
+PMU sampling is called “event-based sampling,” where the
+period (interval between two interrupts) is expressed as the number of
+occurrences of an event, not the number of timer ticks.
+As a side note, in
+event-based sampling, the sampling overhead is proportional to the number of
+events disregarding the length of execution, whereas in timer-based sampling,
+the overhead is proportional to the length of execution.
+If there are more
+events in a short time span, then there is also more interest in observing
+deeply in that region of execution and the event-based sampling takes more
+samples.
+More overhead is incurred in places where there is interest (more
+events) and low/no overhead otherwise.
+CPU cycles is one of the most common
+PMU events, which happens periodically, similar to OS timers.
+Hence, using CPU
+cycles as a substitute for OS timers is very attractive.
+
+### 2.5 Low-level details of the current Go CPU profiler
+
+We now provide the current implementation details of CPU profiling in Go, which
+is necessary to understand the changes we propose in the next section.
+The
+description is relevant for most Unix-based systems, especially Linux, and most
+of the discussion is irrelevant for Windows since it does not use signals for
+handling CPU profiling.
+
+
+
+
+*Figure 1: A pictorial representation of the OS timer sampling workflow in the
+existing implementation.
+The first line of each record box is the function
+name.
+The remaining lines provide a textual description of the steps taken
+by that function.*
+
+Currently, `pprof.StartCPUProfile(io.Writer)` is the entry point for starting
+CPU profiling.
+Analogously, `pprof.StopCPUProfile()` stops a running profile.
+Two concurrent CPU profiling sessions are disallowed via a lock-protected
+variable `cpu.profiling` in pprof.
+The `pprof` package interacts with the
+`runtime` package for starting/stopping profiling via
+`runtime.SetCPUProfileRate(hz int)` API.
+
+There are two critical pieces of work done inside
+`pprof.StartCPUProfile(io.Writer)`.
+First, it invokes
+`runtime.SetCPUProfileRate(hz=100)` asking the runtime to start profiling; the
+rest of the profile generation will be handled by the `runtime` package; the
+integer argument is the sampling rate in Hz (samples per second).
+
+Second, in order to collect the profiles generated by the runtime,
+`pprof.StartCPUProfile(io.Writer)` launches a goroutine
+`profileWriter(io.Writer)`.
+`profileWriter`,
+
+1. creates a protocol buffer via `newProfileBuilder`,
+2. goes into a sleep-wakeup loop calling `runtime.readProfile()`.
+`runtime.readProfile()` provides a chunk of profiles collected inside the
+runtime in a ring buffer (`eof` if the profiling stopped), and
+3. writes to the
+sender end of a channel `cpu.done` once
+`runtime.readProfile()` returns `eof`.
+The other side of the channel will be
+waited upon by `pprof.StopCPUProfile()`.
+
+
+`runtime.SetCPUProfileRate(int)` is the entry point to start/stop profiling in
+the `runtime` package.
+It,
+
+1. creates a ring buffer, `cpuprof.log`,
+2. appends profile header information
+into the ring buffer,
+3. invokes `setcpuprofilerate(hz)`, which,
+ 1. calls `setThreadCPUProfileRate(hz=0)` to stop any on-going timer,
+ 2. invokes `setProcessCPUProfiler(hz)`, which installs the `SIGPROF` handler, and
+ 3. records the current sampling rate `hz` into the scheduler `sched.profilehz` to be
+visible to other OS threads.
+4. calls `setThreadCPUProfileRate(hz)`, which uses
+the OS timer such as (`setitimer`) to start the sampling engine.
+This step is
+performed only if the `hz` value is non-zero.
+
+
+Inside the scheduler, the `execute()` method schedules a goroutine (also called
+as as `G` in the Go runtime scheduler) to run on the current OS thread (also
+called as `M` in the Go runtime scheduler).
+This method compares the current
+value of the scheduler-recorded `sched.profilehz` variable against its own
+snapshot (`m.profilehz`) from the previously recorded value of
+`sched.profilehz` and if they differ, it reconfigures its timer by calling
+`setThreadCPUProfileRate(sched.profilehz)`.
+This functionality sets an OS
+timer on behalf of the current OS thread in the process.
+
+Once armed, the timer periodically delivers a `SIGPROF` signal, handled by
+`sighandler(sig int, info *siginfo, ctxt unsafe.Pointer, gp *g)`, which in turn
+invokes`sigprof()`, which,
+
+1. obtains a backtrace, and
+2. appends the backtrace into the ring buffer `cpuprof.add()`.
+
+Occasionally, a sampled program counter may fall into non-Go space, or the ring
+buffer may be full; such samples are simply counted, and backtrace is not taken.
+
+The ring buffer is coordinated via a `signalLock` for concurrency control.
+The edges marked with “signalLock”, in Figure 1, show the operations that take
+the lock to access the ring buffer.
+
+`pprof.StopCPUProfile()` follows a very similar path as
+`pprof.StartCPUProfile(io.Writer)`.
+When the `hz` argument is zero, the runtime
+APIs described previously (e.g., `SetCPUProfileRate(hz)`) interpret it as the
+end of profiling.
+`pprof.StopCPUProfile()` performs the following steps.
+
+1. calls`runtime.SetCPUProfileRate(0)` to stop profiling.
+2. waits on the `cpu.done` channel, which will be flagged by the side thread once `eof` is
+reached in the runtime ring buffer.
+3. closes the supplied `io.Writer`.
+
+
+`runtime.SetCPUProfileRate(hz)`, when called with `hz=0` invokes
+`setcpuprofilerate(0)` and closes the ring buffer.
+`setcpuprofilerate(0)` when
+called with `hz=0`,
+
+1. calls `setProcessCPUProfiler(0)`, which restores the previous signal handler
+for `SIGPROF`.
+2. resets `sched.profilehz=0` indicating globally that the
+profiling has stopped.
+3. calls `setThreadCPUProfileRate(0)` to stop sampling on the current OS thread.
+
+Subsequently, the other `M`s notice the change in the global `sched.profilehz`
+and turn off their timers when they run their `execute()` method.
+
+The direct users of pprof are:
+
+1. The `testing` package which can take a `-cpuprofile <file>` flag to collect
+profiles.
+2. The `net/http/pprof` package which can collect cpu profiles on the
+`/debug/pprof/profile` endpoint with an optional argument `?seconds=<N>`.
+3. Third party libraries such as `pkg/profile`.
+
+## 3. Problem statement
+
+*Pprof CPU profiles are inaccurate because the profiling resolution is low and
+sampling is biased in OS timers.*
+
+*Pprof CPU profiles are imprecise because the profiling resolution is low and
+the samples have a large skid due to OS timers.*
+
+*Pprof CPU profiles measure only the time metric, additional events should be
+monitored to help performance profiling.*
+
+## 4. Proposal
+
+In this proposal, we address the concerns -- small sample size, sampling bias,
+measurement error, and the need for more metrics -- which were raised in the
+background section.
+
+**Sample size.** First, this proposal addresses the problem of low sampling
+resolution of OS timers by using high-resolution samples from hardware
+performance monitoring units (PMUs) to generate more samples.
+While the OS-timers have ~10ms
+resolution (that is one sample every 10ms), sampling CPU cycles via PMUs can
+deliver (if desired) a sample every few microseconds -- a three order magnitude
+finer resolution.
+This fine-grained measurement results in increasing the
+sample size, and as a result alleviates both inaccuracy and imprecision
+problems.
+While PMUs can deliver 100x more samples with just ~13% measurement
+overhead, the alternative of lengthening a measurement window by 100x to
+collect the same number of samples via the OS timers would be impractical.
+
+**Sampling bias.** Second, the proposal arranges each sample to be delivered to
+the OS thread causing the counter overflow to eliminate the sampling bias,
+which improves accuracy.
+
+**Random error (skid).** Third, PMU samples alleviate the large sampling skid
+seen in OS timers because PMUs allow a low-skid or skid-free precise sample
+attribution.
+
+**More metrics.** Finally, the PMU's capability to monitor various events such
+as cache misses, TLB misses, branch mispredictions, inter-core cacheline
+ping-ponging, to name a few, offers additional insights when diagnosing complex
+performance issues.
+
+Implementing this idea calls for extending the Go runtime package’s CPU
+profiling capabilities and making necessary adjustments in `runtime/pprof`,
+`testing`, and `http/pprof` packages.
+We first describe the details of the
+changes to the `runtime` package and then describe the details of the `pprof`,
+`testing`, and `net/http/pprof` packages, in that order.
+
+Our design brings the full capability of PMUs into the `runtime` package but
+exposes only a small amount of it via the `pprof` package to users.
+For
+example, the runtime might be able to configure whether it wants to monitor the
+CPU cycles spent inside the kernel or not; but these configurations will not be
+exposed (at least in the initial design) to the users of `pprof` package.
+
+### 4.1 High-level design
+
+In its simplest form, we substitute the start/stop of an OS interval timer with
+the start/stop of a PMU event configured in the sampling mode; the sampling
+mode delivers an interrupt each time a hardware counter overflows, which is
+analogous to the OS timer interrupt.
+The interrupt generated by the PMU counter
+overflow follows the same signal handling path as the interval timer.
+This
+policy allows collecting and appending a backtrace on each PMU interrupt into a
+ring buffer, just like the OS timer.
+Moreover, like the OS timer, each
+OS-thread (`M`) checks whether any changes were made to the PMU-based profiling
+during each scheduling activity and performs the necessary adjustments to
+reconfigure the PMU.
+Finally, we retain `SIGPROF` as the signal type delivered
+-- for both OS timer interrupts and a PMU counter overflow interrupts --
+instead of hijacking another signal number for PMU-based profiling.
+
+### 4.2 The devil is in the details
+
+This seemingly straightforward approach poses implementation challenges
+related to retaining the compatibility with interval timer and incorporating
+additional PMU features.
+The first part of this section addresses those
+challenges.
+Other than the design described in the previous subsection, the
+rest of the details are closer to implementation details.
+As before, the
+discussion is only relevant to Linux.
+
+PMUs can count multiple CPU events simultaneously (e.g., cycles, instruction,
+cache misses, etc.).
+Our proposal does not preclude the runtime from
+simultaneously enabling multiple PMU events or even PMU events alongside an OS
+timer event.
+In fact, certain events are best used when measured with others.
+For example,
+1. Cache misses measured alongside cache hits enable computing cache miss ratio,
+2. Remote core cache line transfers measured alongside total cache misses
+provides a view of the cause of cache misses.
+
+In our proposal, we will limit the maximum number of concurrent CPU events to
+`_CPUPROF_EVENTS_MAX` (`EV_MAX` for short in the rest of this document).
+We
+will define a few common portable events for which the user does not have to
+lookup within the CPU manufacturer’s manual for event codes.
+This closely
+resembles the preset events of `perf events` in Linux.
+
+In order to enable advanced users to exploit the full capability of the PMUs,
+we define an opaque RAW event that allows the users to specify a model-specific
+hexadecimal event number that can be sampled.
+A motivating example for such a
+raw event is the remote
+[HITM](https://software.intel.com/sites/default/files/managed/8b/6e/335279_performance_monitoring_events_guide.pdf?ref=hvper.com)
+event on Intel processors; this event counts the number of times a cache line
+was loaded where the cache line was in the modified state in another remote
+CPU’s cache.
+This event helps diagnose problems such as cache line false
+sharing; we have observed this event to be helpful in identifying false sharing
+even inside the Go runtime.
+
+We have come up with the following initial set of PMU events.
+These events will
+be exposed by the `runtime` package to the `pprof` package.
+The `pprof` package
+will wrap these constant values with matching function names to be presented
+later in the `pprof` package subsection.
+Notice that the OS timer is one of the CPU events.
+Last-level cache events are included to capture the behavior of the CPUs memory
+hierarchy.
+Branch events are included to capture potentially wasted CPU
+cycles.
+The design discussions can refine this list further (e.g., whether TLB events
+are needed).
+They are easy to grow based on requests.
+
+```
+type cpuEvent int32
+
+// These constants are a mirror of the runtime
+// These constants are a mirror of the runtime
+const ( _CPUPROF_OS_TIMER, _CPUPROF_FIRST_EVENT cpuEvent = iota, iota
+ _CPUPROF_HW_CPU_CYCLES, _CPUPROF_FIRST_PMU_EVENT
+ _CPUPROF_HW_INSTRUCTIONS cpuEvent = iota
+ _CPUPROF_HW_CACHE_REFERENCES
+ _CPUPROF_HW_CACHE_MISSES
+ _CPUPROF_HW_BRANCH_INSTRUCTIONS
+ _CPUPROF_HW_BRANCH_MISSES
+ _CPUPROF_HW_RAW
+ _CPUPROF_EVENTS_MAX
+ _CPUPROF_LAST_EVENT = _CPUPROF_EVENTS_MAX-1
+)
+```
+
+Since PMU support may not be available on all platforms in all settings, we
+need to make it configurable in the Go runtime.
+If an OS supports PMU-based
+profiling (e.g., Linux), it will dynamically register two functions during the
+runtime startup.
+
+
+1. `setThreadPMUProfilerFunc`: this is the OS-specific function used to
+start/stop a PMU event analogous to `setThreadCPUProfile`.
+If`setThreadPMUProfilerFunc` is `nil`, the runtime will not attempt to
+start/stop PMU profiles even if it is requested to do so.
+However, this
+situation will not arise because `runtime/pprof` will guard against it.
+2. `sigprofPMUHandlerFunc`: this is the OS-specific function called from the
+OS-specific signal handler when the PMU delivers an interrupt.
+
+Currently, each `M` (type `m`) and the scheduler (type `schedt`) maintains a
+`profilehz` variable.
+In lieu of `profilehz` we’ll introduce a runtime structure
+`cpuProfileConfig struct `, which captures the essential configurations needed
+to start/stop either a PMU counter or an OS timer.
+`cpuProfileConfig` is
+designed to reflect the underlying `perf_event_attr` struct in Linux, but being
+an internal data structure we can modify it, if needed in the future, without
+affecting any public-surface API.
+
+```
+type cpuProfileConfig struct {
+ hz uint64 // the sampling rate, used only for OS_TIMER.
+ period uint64 // the sampling interval, used for PMU events.
+ rawEvent uint64 // the hexa-decimal value of the CPU-specific event.
+ // used only with _CPUPROF_HW_RAW cpuEvent type
+ preciseIP profilePCPrecision // the level of precision for the program counter in PMU samples.
+ isSampleIPIncluded bool // should the sample include the PC?
+ isSampleAddrIncluded bool // should the sample include the memory address involved?
+ isKernelIncluded bool // should the kernel count events in kernel mode?
+ isHvIncluded bool // should the kernel count events in hypervisor mode?
+ isIdleIncluded bool // should the kernel count events when idle task is running?
+ isSampleCallchainIncluded bool // should the sample report the call chain?
+ isCallchainKernelIncluded bool // should the sample include the kernel callchain?
+ isCallchainUserIncluded bool // should the sample include the user callchain?
+}
+```
+
+Some PMUs can be configured so that they can offer the
+exact program counter involved in the counter overflow.
+`profilePCPrecision`
+dictates the level of precision needed from the PMU as described by the
+constants below.
+If a PMU is configured to provide the precise program counter for a sample,
+we'll substitute the leaf-level program counter in the backtrace with the one
+presented by the PMU.
+Please refer to the [Section 6.2](#62-disadvantages) for PMU
+precision.
+
+```
+type profilePCPrecision uint8
+const (
+ _CPUPROF_IP_ARBITRARY_SKID profilePCPrecision = iota
+ _CPUPROF_IP_CONSTANT_SKID
+ _CPUPROF_IP_SUGGEST_NO_SKID
+ _CPUPROF_IP_NO_SKID)
+```
+
+`pprof` will set the default to `_CPUPROF_IP_ARBITRARY_SKID` to work across
+architectures and event types.
+However, we can easily change it to the highest
+supported, by probing the CPU as done in
+[certain](https://github.com/HPCToolkit/hpctoolkit/blob/master/src/tool/hpcrun/sample-sources/perf/perf_skid.c#L176)
+profilers.
+
+`cpuProfileConfig` will be visible to `pprof` package but will not be exposed to
+its clients.
+The fields in `cpuProfileConfig` allow how a PMU event can be
+customized.
+Pprof will set the field values to the lowest common denominator
+that is expected to work across many architectures and security settings.
+For example, `isKernelIncluded` will most likely be `false`.
+The runtime will be
+(mostly) agnostic to these values and will pass them down to OS-specific layers
+that will handle programming the PMUs.
+
+The `runtime` package will expose an API
+`runtime_pprof_setCPUProfileConfig(cpuEvent, *cpuProfileConfig)`, only to
+`pprof`.
+This API is the equivalent of `SetCPUProfileRate(hz int)`.
+Since
+the legacy `SetCPUProfileRate(hz int)` cannot be dropped, we alter its internals
+to wrap the `hz` value into a `cpuProfileConfig` structure and pass it down to
+`runtime_pprof_setCPUProfileConfig` for a uniform treatment of any event whether
+an OS-timer event or a PMU event.
+
+The `runtime` can be bifurcated into OS-agnostic and OS-specific parts.
+
+The OS-agnostic part of the runtime will have the following duties:
+
+1. Keep track of the CPU events currently running,
+2. Handle new start/stop
+invocations of the OS timer or PMU event, one at a time.
+3. Allocate and
+coordinate access to the ring buffers associated with each of the running CPU
+event.
+4. Reconfigure the CPU events for an `M` during its scheduling based on
+whether its current snapshot is different from the global configuration stored
+in the scheduler.
+
+The OS-specific part of runtime will have the following duties:
+
+1. Install/restore profiling signal handler.
+2. Start/stop OS timer or PMU
+events.
+3. Handle the profiling signals and pause the PMUs during signal
+handling.
+4. Map an interrupt delivered to the appropriate active CPU profiling
+event for recording the backtraces into an appropriate ring buffer.
+
+
+Since there can be `CPUPROF_EVENTS_MAX` number of concurrent CPU profiles, we
+need to change some of the runtime data structures that previously assumed a
+single CPU profile.
+Importantly,
+
+1. We will change the `profilehz` field in `schedt struct` to be an array of
+`cpuProfileConfig` pointers.
+2. We will change `profilehz` field in `m struct`
+to be an array of `cpuProfileConfig` pointers.
+3. We will introduce an array of
+opaque pointers `cpuProfileHandle` field in `m struct`.
+This will be OS-specific
+information stored in each `M`.
+For example, on Linux, this will hold the
+file handles obtained from `perf_event_open` system calls and the corresponding
+`mmap` ring buffers where the kernel populates the sample data.
+This information
+will be used to close the resources once the profiling is stopped on an `M`.
+
+We’ll replace the `hz` argument used in various runtime APIs involved in CPU
+profiling with the following two arguments.
+
+1. `eventId cpuEvent` (described in the constants previously), and
+2. `profConfig *cpuProfileConfig`.
+
+Figure 2 shows the call graph for the new workflow.
+We abbreviate `eventId` as
+`i` and `cpuProfileConfig` as `p` in most places.
+
+
+*Figure 2: A pictorial representation of the OS timer-based and PMU-based
+sampling workflow.
+The first line of each record box is the function name.
+The
+remaining lines provide a textual description of the steps taken by that
+function.*
+
+A `nil` value of `cpuProfileConfig` implies stop profiling for the specified
+`cpuEvent`, and a `non-nil` value implies start profiling for the specified
+`cpuEvent`.
+At the runtime-level, `runtime_pprof_setCPUProfileConfig()` makes a
+copy of the `cpuProfileConfig` passed from `pprof`, so that inside the runtime,
+we can perform fast pointer comparison and rely on two large `cpuProfileConfig`
+structs to be equal if pointers to them are equal.
+
+Starting/stopping a PMU counter is different from starting/stopping an OS timer.
+Hence `if eventId==CPUPROF_OS_TIMER` checks will appear at a few places inside
+the runtime’s profiling-related code.
+
+
+A few concurrency aspects deserve a mention.
+
+1. Each active event will create its own ring buffer on demand.
+There will be a
+maximum of `EV_MAX` ring buffers.
+Each ring buffer will be created only if there
+is a corresponding active event.
+The size of each ring buffer is 16KB.
+If all
+`EV_MAX` events are active at the same time, it will result in `16KB * 8 =
+128KB` of ring buffers, which is paltry on modern machines.
+2. We’ll change the
+`cpuprof` variable in `cpuprof.go` to be an array of `EV_MAX` size.
+3.
+`runtime_pprof_setCPUProfileConfig` will hold a lock only for the `cpuprof[i]`,
+where `i` is the event id for which profile has to be started or stopped.
+4. There will be a single `signalLock` rather than `EV_MAX` `signalLock`s.
+This is
+because the `setcpuprofileconfig` method in `cpuprof.go` needs to check (read)
+all `prof[eventId].config`s to infer whether all events are stopped before
+uninstalling the signal handler.
+Taking `EV_MAX` locks one after another would
+be no better than holding just one lock.
+The consequence of this is that we
+serialize simultaneous writes to two different ring buffers.
+This should not be
+a serious issue since there cannot be two concurrent `SIGPROF`s.
+If a system
+supports concurrent `SIGPROF` handling, and if there are a lot of threads in a
+system, we can revisit this choice.
+5. In a very rare case when a `SIGPROF` is delivered for an already running
+event when another event is being set (which will hold the `signalLock`), the
+signal handler recognizes it and drops such sample. Otherwise a deadlock will
+ensue.
+
+At the lowest level, we will incorporate the following three system calls into
+all `runtime/sys_linux_<arch>.s` files.
+
+1. `perf_event_open`, which is Linux’s means to program PMUs.
+The Go signature
+will be `func perfEventOpen(attr *perfEventAttr, pid uintptr, cpu, groupFd
+int32, flags uintptr) int32`
+2. `ioctl`, which is needed to stop the PMU counter
+when handling the `SIGPROF`, and reset and restart after handling the signal.
+The
+go signature will be `func ioctl(fd, req int32, arg uintptr) int32`
+3. `fcntl`,
+which is necessary to arrange a `perf_event` file handle to deliver the
+`SIGPROF` signal and also to setup the signal to be delivered exactly to the `M`
+(OS thread) whose PMU counter overflew.
+The go signature will be `func fcntl(fd,
+cmd int32, arg uintptr) int32`
+
+#### 4.2.1 Segregating `SIGPROF` from different sampling sources
+
+All PMU events and the OS timer use the same `SIGPROF` signal to deliver their
+interrupts.
+Hence, the Unix signal handler needs to address the following
+concerns.
+
+1. Identify whether a signal was delivered as a result of the OS timer expiring
+or a PMU counter overflowing.
+This is necessary because the subsequent processing will be
+different based on the event.
+2. Identify which PMU counter caused the interrupt, when multiple PMU events are being
+sampled at the same time.
+This is necessary because the subsequent
+processing needs to append the backtraces into the buffer corresponding to the
+PMU event that triggered the interrupt.
+All PMU events do not go into the same
+ring buffer.
+
+Although both OS timer expiry and any PMU counter overflow result in the
+same `SIGPROF` signal, the signal codes are different.
+The PMU counter overflow
+generates a `POLL_IN` signal code indicating the presence of data whereas the OS
+timer interrupts generate a `SI_KERNEL` or `SI_ALARM` signal code.
+We use this
+information to address #1 above.
+
+Once we know that the signal was generated due to some PMU counter overflowing,
+we use the file descriptor `fd` passed in the signal info and match it against
+the opened perf events file descriptors on the same `M` and use the matching
+location to infer the PMU event `i` that caused the signal; the rest of the
+signal handling makes use of this `i` to record the backtrace in the appropriate
+ring buffer.
+During the signal handling, we’ll pause all active PMU counters via `ioctl`
+calls on the `fd` so that the signal handling is not counted towards the PMU
+event counters.
+
+
+### 4.3 Exposing PMU events via `runtime/pprof` package
+
+We now turn our attention to the `runtime/pprof` package, which exposes the PMU
+sampling to its clients.
+There are several choices.
+There is a tension between
+revolutionary vs. evolutionary solution; there is friction between exposing the
+full PMU functionality vs. offering the basic minimum.
+In our proposal, we hope
+to make pragmatic choices among myriad options.
+
+We propose adding a new API, which accepts different configurations needed to
+start any CPU event -- OS timer and PMU counters.
+The API signature is shown
+below.
+```
+pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts ...ProfilingOption) error
+```
+There is an optional variadic `ProfilingOption` to allow
+more than one event to be started during a profiling session.
+
+A `ProfilingOption` is an interface exposed by `runtime/pprof` as shown below.
+
+```
+type ProfilingOption interface { apply() error }
+```
+
+Pprof will implement the following profiling option functions that mirror
+various `cpuEvent` constants previously presented.
+
+```
+func OSTimer(w io.Writer) ProfilingOption
+func CPUCycles(w io.Writer, period uint64) ProfilingOption
+func CPUInstructions(w io.Writer, period uint64) ProfilingOption
+func CPUCacheReferences(w io.Writer, period uint64) ProfilingOption
+func CPUCacheMisses(w io.Writer, period uint64) ProfilingOption
+func CPUBranchInstructions(w io.Writer, period uint64) ProfilingOption
+func CPUBranchMisses(w io.Writer, period uint64) ProfilingOption
+func CPURawEvent(w io.Writer, period uint64, hex uint64) ProfilingOption
+```
+
+All of them consume an `io.Writer`.
+The `OSTimer` accepts no other argument.
+A
+`period` accepted by all PMU events specifies the number of events that must
+elapse between two consecutive interrupts.
+A larger value means lower
+granularity (the opposite of Hz).
+Passing a `zero` value will result in using a
+preset period.
+`CPURawEvent` is special; it accepts any user-provided
+hexadecimal PMU event code.
+Notice that each event requires an `io.Writer`
+since different profile types cannot be serialized into the same pprof protocol
+buffer.
+In the cases where multiple `ProfilingOption`s are passed to the
+`pprof.StartCPUProfileWithConfig`, sanity checks are made to ensure the
+uniqueness of each `cpuEvent` and each `io.Writer`.
+
+Below are a few example usages of the API.
+
+```
+// Simple
+StartCPUProfileWithConfig(OSTimer(w))
+StartCPUProfileWithConfig(CPUCycles(w, period))
+StartCPUProfileWithConfig(CPURawEvent(w, period, hexadecimalEvent))
+
+// Advanced
+StartCPUProfileWithConfig(CPUCacheReferences(w1, period1), CPUCacheMisses(w2, period2))
+```
+
+
+The implementation of `StartCPUProfile(writer)` will simply invoke
+`StartCPUProfileWithConfig(OSTimer(writer))`.
+The implementation of
+`StartCPUProfileWithConfig` will ensure a maximum of `EV_MAX` elements, unique
+`io.Writers`, and unique `cpuEvents`.
+After sanitizing the arguments, it will create
+a `cpuProfileConfig` for each `cpuEvent` requested to be profiled.
+In a global
+data structure `cpu`, we will maintain the active events, their
+`*cpuProfileConfig`, and the corresponding `io.Writer`.
+
+```
+var cpu struct {
+ sync.Mutex
+ profiling bool
+ activeConfig [EV_MAX]*cpuProfileConfig
+ activeWriter [EV_MAX]io.Writer
+ eventName [EV_MAX]string
+ done chan bool
+}
+```
+
+Finally, `StartCPUProfileWithConfig` will invoke
+`runtime_pprof_setCPUProfileConfig` multiple times once for each active event
+to request the `runtime` to actually configure either the OS timer or the PMU
+with a specific event to start sampling.
+A lock will be held throughout the invocation of
+`StartCPUProfileWithConfig` and hence no two
+`StartCPUProfileWithConfig/StartCPUProfile` invocations run concurrently.
+Once
+a session of `StartCPUProfileWithConfig/StartCPUProfile` is active, another
+profiling is disallowed until `StopCPUProfile` is invoked.
+
+Before successfully returning from `StartCPUProfileWithConfig`, we'll launch a
+goroutine `profileWriter`, which will periodically wake up and scrape
+runtime-produced stack samples from the ring buffers associated with each of
+the active events and serialize them into the corresponding `io.Writer`s.
+This follows the same design as the original pprof, but will add the
+responsibility of collecting profiles from multiple events. Creating one
+goroutine each to scrape every active event would be wasteful, a single
+goroutine suffices to collect backtraces from all ring buffers.
+
+### 4.4 The optional multiple CPU profiles feature
+
+Allowing multiple CPU profiles in a single profiling session via
+`pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts
+...ProfilingOption)` API is a power-user feature.
+To avoid any accidental
+misuse, we'll protect it under an environmental variable
+`GO_PPROF_ENABLE_MULTIPLE_CPU_PROFILES=<true|false>`, which will be disabled by
+default.
+
+
+### 4.5 The `testing` package
+
+We’ll add the following two command line arguments to the testing package:
+
+1. `-cpuprofileevent=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>`
+
+2. `-cpuprofileperiod=<Int64Value>`
+
+The default value of `-cpuprofileevent` is `timer` with 100Hz when only
+`-cpuprofile` is passed, which ensures compatibility with the current behavior.
+`-cpuprofileperiod` will be ignored for `-cpuprofileevent=timer` and will
+default to a 100Hz rate as is the case currently.
+The testing package will map a
+user-specified string for `-cpuprofileevent` value to the appropriate
+`pprof.ProfilingOption`, the `-cpuprofileperiod` to an integer, and invoke
+`pprof.StartCPUProfileWithConfig()`.
+For example `-cpuprofileevent=cycles` and
+`-cpuprofileperiod=100000` will result
+in`pprof.StartCPUProfileWithConfig(CPUCycles(w, 100000)`.
+
+### 4.6 The `net/http/pprof` package
+
+We’ll add the following two additional arguments in an `http` request exposed
+by the `net/http/pprof` CPU profiling endpoint.
+
+1. `event=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>`
+
+2. `period=<Int64Value>`
+
+`period` will be ignored for `event=timer` and will default
+to a 100Hz rate as is the case currently.
+The default value of `event`
+is `timer` with 100Hz, which ensures compatibility with the current behavior.
+The `net/http/pprof` package will map a user-specified string for `event`
+value to an appropriate `pprof.ProfilingOption`, `period` to an
+integer, and invoke `pprof.StartCPUProfileWithConfig`.
+
+## 5. Empirical evidence on the accuracy and precision of PMU profiles
+
+### 5.1 The PMU produces precise and accurate profiles for concurrent programs
+
+Below we show the `go tool pprof -top` output of the previously shown
+[`goroutine.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go)
+concurrent program run for two times.
+The profiles used the proposed new API
+with 1M sampling period: `pprof.StartCPUPrileWithConfig(CPUCycles(w,
+1000000))`.
+The CPU cycles attributed to each routine `f1-f10` match the
+expectation (approximately 10% of the total execution time) within one run and
+across multiple runs.
+*Thus the profiles are accurate and precise.
+The
+measurement overhead is indistinguishable from the OS timer in this setting.*
+
+
+#### (PMU Cycles profile) goroutine.go/Run 1:
+
+```
+File: goroutine
+Type: cycles
+Time: Jan 27, 2020 at 4:49pm (PST)
+Showing nodes accounting for 234000000000, 100% of 234000000000 total
+ flat flat% sum% cum cum%
+23400000000 10.00% 10.00% 23400000000 10.00% main.f1
+23400000000 10.00% 20.00% 23400000000 10.00% main.f10
+23400000000 10.00% 30.00% 23400000000 10.00% main.f2
+23400000000 10.00% 40.00% 23400000000 10.00% main.f3
+23400000000 10.00% 50.00% 23400000000 10.00% main.f4
+23400000000 10.00% 60.00% 23400000000 10.00% main.f5
+23400000000 10.00% 70.00% 23400000000 10.00% main.f6
+23400000000 10.00% 80.00% 23400000000 10.00% main.f7
+23400000000 10.00% 90.00% 23400000000 10.00% main.f8
+23400000000 10.00% 100% 23400000000 10.00% main.f9
+```
+#### (PMU Cycles profile) goroutine.go/Run 2:
+
+```
+File: goroutine
+Type: cycles
+Time: Jan 27, 2020 at 4:51pm (PST)
+Showing nodes accounting for 234000000000, 100% of 234000000000 total
+ flat flat% sum% cum cum%
+23800000000 10.17% 10.17% 23800000000 10.17% main.f1
+23500000000 10.04% 20.21% 23500000000 10.04% main.f7
+23500000000 10.04% 30.26% 23500000000 10.04% main.f9
+23400000000 10.00% 40.26% 23400000000 10.00% main.f10
+23400000000 10.00% 50.26% 23400000000 10.00% main.f2
+23400000000 10.00% 60.26% 23400000000 10.00% main.f4
+23400000000 10.00% 70.26% 23400000000 10.00% main.f6
+23400000000 10.00% 80.26% 23400000000 10.00% main.f8
+23300000000 9.96% 90.21% 23300000000 9.96% main.f3
+22900000000 9.79% 100% 22900000000 9.79% main.f5
+```
+
+### 5.2 The PMU produces precise and accurate profiles for serial programs
+
+Below we show the `go tool pprof -top` output of the previously shown
+[`serial.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go)
+program run for two times.
+We remind the reader that the expected relative
+execution time is encoded in the function names themselves (`J_expect_18_18`
+should consume 18.18% CPU cycles and so on).
+The profiles used the proposed
+new API with 1M sampling period: `pprof.StartCPUPrileWithConfig(CPUCycles(w,
+1000000))`.
+The CPU cycles attributed to each routine match the expectation
+within a run and across multiple runs.
+*Thus the profiles are accurate and
+precise.*
+
+
+#### (PMU Cycles profile) serial.go/Run 1:
+
+```
+File: serial
+Type: cycles
+Time: Jan 27, 2020 at 4:54pm (PST)
+Showing nodes accounting for 1105000000, 100% of 1105000000 total
+ flat flat% sum% cum cum%
+ 200000000 18.10% 18.10% 200000000 18.10% main.J_expect_18_18
+ 183000000 16.56% 34.66% 183000000 16.56% main.I_expect_16_36
+ 165000000 14.93% 49.59% 165000000 14.93% main.H_expect_14_546
+ 137000000 12.40% 61.99% 137000000 12.40% main.G_expect_12_73
+ 120000000 10.86% 72.85% 120000000 10.86% main.F_expect_10_91
+ 100000000 9.05% 81.90% 100000000 9.05% main.E_expect_9_09
+ 82000000 7.42% 89.32% 82000000 7.42% main.D_expect_7_27
+ 63000000 5.70% 95.02% 63000000 5.70% main.C_expect_5_46
+ 37000000 3.35% 98.37% 37000000 3.35% main.B_expect_3_64
+ 18000000 1.63% 100% 18000000 1.63% main.A_expect_1_82
+ 0 0% 100% 1105000000 100% main.main
+ 0 0% 100% 1105000000 100% runtime.main
+```
+
+#### (PMU Cycles profile) serial.go/Run 2:
+
+```
+File: serial
+Type: cycles
+Time: Jan 27, 2020 at 4:54pm (PST)
+Showing nodes accounting for 1105000000, 100% of 1105000000 total
+ flat flat% sum% cum cum%
+ 200000000 18.10% 18.10% 200000000 18.10% main.J_expect_18_18
+ 183000000 16.56% 34.66% 183000000 16.56% main.I_expect_16_36
+ 159000000 14.39% 49.05% 159000000 14.39% main.H_expect_14_546
+ 142000000 12.85% 61.90% 142000000 12.85% main.G_expect_12_73
+ 119000000 10.77% 72.67% 119000000 10.77% main.F_expect_10_91
+ 100000000 9.05% 81.72% 100000000 9.05% main.E_expect_9_09
+ 82000000 7.42% 89.14% 82000000 7.42% main.D_expect_7_27
+ 61000000 5.52% 94.66% 61000000 5.52% main.C_expect_5_46
+ 40000000 3.62% 98.28% 40000000 3.62% main.B_expect_3_64
+ 18000000 1.63% 99.91% 18000000 1.63% main.A_expect_1_82
+ 1000000 0.09% 100% 1105000000 100% main.main
+ 0 0% 100% 1105000000 100% runtime.main
+```
+
+### 5.3 A practical use
+
+We use pprof to collect production CPU profiles at Uber.
+On one of our Go microservices, when we used the default OS timer-based CPU
+profiles, we saw that the `mutex.Lock()` accounted for 15.79% of the time and
+`mutex.Unlock()` accounted for 13.55% of the CPU time.
+We knew from the program structure that `mutex.Lock()` and `mutex.Unlock()`
+were invoked in pairs one after another, and also that `mutex.Lock()` is
+significantly more heavyweight compared to `mutex.Unlock()`, thus we were
+expecting much lesser time being spent inside `mutex.Unlock()` relative to
+`mutex.Lock()`.
+The OS-timer profiling data was not matching our expectations.
+However, when we used a prototype of the PMU-based profiler and sampled the CPU
+cycles event, the quantification changed and matched our expectations;
+`mutex.Lock()` accounted for 33.36% of the CPU cycles and `mutex.Unlock()`
+accounted for 7.57% the CPU cycles.
+The PMU profiles avoided an unnecessary performance investigation.
+
+
+## 6. Rationale
+
+### 6.1 Advantages
+
+CPU profiling via PMUs result in richer quality profiles -- high-resolution
+measurement, higher accuracy, and higher precision.
+Access to various events
+such as cache misses in PMUs allows better insights into understanding the
+causes of complex performance issues.
+Any kind of PMU event can be attributed
+to call stack samples, which is useful to provide actionable feedback to the
+developer.
+Incorporating the PMU facility in the runtime will allow the Go
+profiler to take full advantage of the progress happening in the hardware
+performance monitoring facilities available in modern CPUs.
+This framework will
+allow for independently developing more advanced profiling features such as
+attributing high-latency data accesses to Go objects, which can help refactor
+code or data structures for better data locality [5, 6, 7].
+
+### 6.2 Disadvantages
+
+1. PMU-based profiling may not work when a Go program is running under a
+virtual machine, since virtual machines may not expose the hardware performance
+counters to a guest OS.
+Hence, the proposal keeps the OS-timer as the default
+and allows PMU-based sampling available only when it is explicitly specified.
+2. A higher sampling rate (a shorter period between samples) obviously incurs
+higher overhead.
+For example, the default timer with a 100Hz sampling rate
+introduces 3.6% measurement overhead on a 2.4GHz Intel Skylake machine.
+Using
+CPU cycles with a period of 24000000 cycles, which amounts to the same 100
+samples/second introduces the same 3.6% overhead.
+The table below shows
+overheads at other sampling periods using the PMU cycles and also compares
+against the default OS-timer profiler overhead.
+
+```
+ The new PMU-based profiler with cycles event
+ -----------------------------------------------------
+ Sampling period in CPU cycles Overhead
+ -----------------------------------------------------
+ 24000000 (100 samples/sec) 1.036x
+ 2400000 (1K samples/sec) 1.036x
+ 240000 (10K samples/sec) 1.13x
+ 24000 (100K samples/sec) 2.05x
+```
+```
+ The default OS-timer based profiler
+ -----------------------------------------------------
+ Sampling rate Overhead
+ -----------------------------------------------------
+ 100 samples/sec 1.036x
+```
+
+A reasonable sampling period, once in 2.4M CPU cycles on a 2.4GHz
+Intel Skylake machine, offers a good trade off between high sampling rate at low
+overhead.
+
+3. When using any other PMU event, such as cache misses, the sampling period
+needs to be more carefully curated.
+We’ll provide preset values for the preset
+events.
+
+4. PMU profiling has a mild skid on deeply pipelined, multi-issue,
+out-of-order, superscalar CPU architectures due to multiple instructions
+flowing in the pipeline.
+Thus the program counter seen in the interrupt can be
+a handful of instructions off from the one generating the counter overflow.
+This "skid" is much smaller than the several milliseconds of skid in an OS
+timer.
+Moreover, modern PMUs have built-in features to eliminate such skid
+(e.g., Intel Precise Event Based Sampling ([PEBS](https://software.intel.com/sites/default/files/managed/8b/6e/335279_performance_monitoring_events_guide.pdf?ref=hvper.com)), AMD Instruction Based
+Sampling ([IBS](https://developer.amd.com/wordpress/media/2012/10/AMD_IBS_paper_EN.pdf)), and PowerPC Marked events ([MRK](https://wiki.raptorcs.com/w/images/6/6b/POWER9_PMU_UG_v12_28NOV2018_pub.pdf)).
+Some PMUs can be configured so
+that they can offer the exact program counter involved in the counter overflow.
+We have introduced `profilePCPrecision` for this reason.
+
+5. There are finite counters on any PMU. Counter multiplexing happens if the
+number of events is more than the number of available PMU counter slots. In
+that case the events run only part of the time.
+
+### 6.3 Alternatives explored
+
+1. Linux `perf` commandline tool is powerful and offers comparable solutions.
+However, it does not solve several important use cases.
+ 1. Profiling via an `http` endpoint (provided by `net/http/pprof` package in
+Go) is the preferred, readily available, and scalable solution that works at
+the enterprise scale. Making `perf` to work in a similar fashion would lead to
+creating too many custom solutions.
+ 2. `pprof` works as an in-situ profiler working in the same address
+space as the process being profiled and hence does not require any special
+system privileges; in contrast, attaching `perf` to a running process requires
+changes to a system’s security settings.
+ 3. We are also focusing on
+containerized environments, which is widely used in large-scale Go deployments.
+Although `perf` can collect call stack samples from a container, it fails to find
+symbols of go binaries located within a container to show clean reports
+(e.g., flame graphs).
+The workaround is non-trivial.
+2. We explored library-level facilities to collect performance counters such as perf-util [\[3\]](https://github.com/hodgesds/perf-utils).
+However, no library support will provide good profiling for Go since the runtime
+can schedule a monitored function onto another thread, where the counters may not
+be enabled.
+The correct solution, hence, is to implement the hardware
+performance counter facility inside the go runtime into the scheduler as we have
+proposed here.
+3. We seriously contemplated against introducing a new pprof API
+and tried to work with the existing API `StartCPUProfile(io.Writer)`.
+We
+considered adding variadic `ProfilingOptions` to it.
+But any such change would
+be qualified as breaking `Go1.X` and went ahead introducing a new API.
+4. We considered passing an `Options` struct to the new API
+`StartCPUProfileWithConfig`.
+However, the usage looked complicated with certain
+fields making sense only in certain circumstances and hence decided to use the
+functional option pattern.
+5. We considered having only one CPU profiling
+event active at a time.
+However, the experience has shown that this limits
+performance analysis in several ways; and advanced performance tuning demands
+correlating more than one event to find the root cause.
+Furthermore, this richness
+in Go profiling can serve to advance the state-of-the-art in profilers.
+6. On the implementation front, we considered using a `map` for active events at
+several places instead of an `EV_MAX` array.
+However, concurrent access to
+different array elements was superior to locking a shared `map`.
+Moreover,
+`EV_MAX` is small enough to search linearly for active events.
+
+## 7. Compatibility
+
+The design ensures [compatibility with Go1.0](https://golang.org/doc/go1compat)
+since no existing public API signatures, interface signatures, or data
+structures are changed.
+We have introduced a new interface and a handful of new
+APIs in the `pprof` package.
+
+## 8. Implementation
+
+Milind Chabbi from Uber technologies will implement the proposal.
+The
+implementation will proceed as follows:
+
+1. Add assembly language support for `perf_events` related system calls.
+2. Modify the `runtime`, `runtime/pprof`to support
+PMU sampling to reflect the call graph / workflow shown in Figure 2.
+3. Expose the new facility via `testing` and `net/http/pprof` packages.
+4. Make changes to `pprof` CPU profiling tests to use PMU cycles in addition to
+OS timer (on supported platforms).
+5. Add PMU profiling tests to run on supported platforms.
+
+The implementation will be broken into a large checkin related to the `runtime`
+and `runtime/pprof` packages, a few small checkins related to `testing`, and
+`net/http/pprof` packages, and the tests accompanying the code.
+The
+implementation will align with the February-April 2020 development cycle and
+May-July 2020 release cycle.
+
+A prototype is available [here](https://github.com/uber-dev/go/tree/pmu_pprof).
+
+## 9. Open issues
+
+1. Supporting Windows and MacOS will be left for future work, but any support
+from experts from those lands is appreciated.
+2. The issue [21295](https://github.com/golang/go/issues/21295) proposes
+adding `-test.counters` flag to the testing package. That proposal is
+orthogonal to this one.
+There
+isn’t a detailed description of how the proposal wants to accomplish it.
+Simply
+starting and stopping a PMU counter by making `perf_events_open` system call at
+the testing package level will not accomplish the objective.
+The monitored
+functions can migrate from one `M` to another, and hence the changes must be
+made in the runtime as described in this proposal.
+However, it is relatively
+easy to incorporate the requested feature by making use of changes suggested in
+this proposal to accomplish the objective of issue #21295.
+We are open to
+discussing with the creators of #21295 to understand how we can incorporate
+their needs into this design.
+3. The additions to the public APIs and naming
+aesthetics can be refined with inputs from the experts and the community.
+4. Currently, we do not bubble up runtime failures to start a PMU profiler (same is
+true for a timer).
+Should we bubble up any failure to configure a PMU or
+silently produce zero samples in the profiles?
+5. Should we introduce any throttling mechanism if the rate of events is too
+high for a PMU event.
+
+## 10. Acknowledgment
+
+Pengfei Su implemented a prototype of the design
+while interning at Uber Technologies.
+
+Joshua Corbin helped design the public surface APIs.
+
+This work is generously funded by the Uber Technologies’ Programming Systems
+Research team and the team members provided feedback for improvement.
+
+## 11. Appendix
+
+##### (OS timer profile) goroutine.go/run1:
+
+```
+File: goroutine
+Type: cpu
+Time: Jan 27, 2020 at 3:45pm (PST)
+Duration: 6.70s, Total samples = 18060ms (269.37%)
+Showing nodes accounting for 18060ms, 100% of 18060ms total
+ flat flat% sum% cum cum%
+ 4210ms 23.31% 23.31% 4210ms 23.31% main.f7
+ 2610ms 14.45% 37.76% 2610ms 14.45% main.f2
+ 2010ms 11.13% 48.89% 2010ms 11.13% main.f6
+ 1810ms 10.02% 58.91% 1810ms 10.02% main.f10
+ 1780ms 9.86% 68.77% 1780ms 9.86% main.f3
+ 1410ms 7.81% 76.58% 1410ms 7.81% main.f1
+ 1310ms 7.25% 83.83% 1310ms 7.25% main.f4
+ 1110ms 6.15% 89.98% 1110ms 6.15% main.f5
+ 1110ms 6.15% 96.12% 1110ms 6.15% main.f8
+ 700ms 3.88% 100% 700ms 3.88% main.f9
+```
+
+##### (OS timer profile) goroutine.go/run2:
+
+```
+File: goroutine
+Type: cpu
+Time: Jan 27, 2020 at 3:45pm (PST)
+Duration: 6.71s, Total samples = 17400ms (259.39%)
+Showing nodes accounting for 17400ms, 100% of 17400ms total
+ flat flat% sum% cum cum%
+ 3250ms 18.68% 18.68% 3250ms 18.68% main.f2
+ 2180ms 12.53% 31.21% 2180ms 12.53% main.f9
+ 2100ms 12.07% 43.28% 2100ms 12.07% main.f1
+ 1770ms 10.17% 53.45% 1770ms 10.17% main.f6
+ 1700ms 9.77% 63.22% 1700ms 9.77% main.f5
+ 1550ms 8.91% 72.13% 1550ms 8.91% main.f4
+ 1500ms 8.62% 80.75% 1500ms 8.62% main.f8
+ 1440ms 8.28% 89.02% 1440ms 8.28% main.f3
+ 1390ms 7.99% 97.01% 1390ms 7.99% main.f10
+ 520ms 2.99% 100% 520ms 2.99% main.f7
+```
+
+##### (OS timer profile) goroutine.go/Run 3:
+
+```
+File: goroutine
+Type: cpu
+Time: Jan 27, 2020 at 3:48pm (PST)
+Duration: 6.71s, Total samples = 17.73s (264.31%)
+Showing nodes accounting for 17.73s, 100% of 17.73s total
+ flat flat% sum% cum cum%
+ 3.74s 21.09% 21.09% 3.74s 21.09% main.f7
+ 2.08s 11.73% 32.83% 2.08s 11.73% main.f9
+ 2.05s 11.56% 44.39% 2.05s 11.56% main.f2
+ 1.85s 10.43% 54.82% 1.85s 10.43% main.f10
+ 1.78s 10.04% 64.86% 1.78s 10.04% main.f1
+ 1.43s 8.07% 72.93% 1.43s 8.07% main.f3
+ 1.42s 8.01% 80.94% 1.42s 8.01% main.f8
+ 1.18s 6.66% 87.59% 1.18s 6.66% main.f6
+ 1.17s 6.60% 94.19% 1.17s 6.60% main.f5
+ 1.03s 5.81% 100% 1.03s 5.81% main.f4
+```
+
+##### (OS timer profile) serial.go/Run 1:
+```
+File: serial
+Type: cpu
+Time: Jan 27, 2020 at 1:42pm (PST)
+Duration: 501.51ms, Total samples = 320ms (63.81%)
+Showing nodes accounting for 320ms, 100% of 320ms total
+ flat flat% sum% cum cum%
+ 80ms 25.00% 25.00% 80ms 25.00% main.H_expect_14_546
+ 80ms 25.00% 50.00% 80ms 25.00% main.J_expect_18_18
+ 60ms 18.75% 68.75% 60ms 18.75% main.G_expect_12_73
+ 20ms 6.25% 75.00% 20ms 6.25% main.B_expect_3_64
+ 20ms 6.25% 81.25% 20ms 6.25% main.D_expect_7_27
+ 20ms 6.25% 87.50% 20ms 6.25% main.F_expect_10_91
+ 20ms 6.25% 93.75% 20ms 6.25% main.I_expect_16_36
+ 10ms 3.12% 96.88% 10ms 3.12% main.A_expect_1_82
+ 10ms 3.12% 100% 10ms 3.12% main.C_expect_5_46
+ 0 0% 100% 320ms 100% main.main
+ 0 0% 100% 320ms 100% runtime.main
+```
+
+
+##### (OS timer profile) serial.go/Run 2:
+
+```
+File: serial
+Type: cpu
+Time: Jan 27, 2020 at 1:44pm (PST)
+Duration: 501.31ms, Total samples = 320ms (63.83%)
+Showing nodes accounting for 320ms, 100% of 320ms total
+ flat flat% sum% cum cum%
+ 70ms 21.88% 21.88% 70ms 21.88% main.I_expect_16_36
+ 50ms 15.62% 37.50% 50ms 15.62% main.J_expect_18_18
+ 40ms 12.50% 50.00% 40ms 12.50% main.E_expect_9_09
+ 40ms 12.50% 62.50% 40ms 12.50% main.F_expect_10_91
+ 40ms 12.50% 75.00% 40ms 12.50% main.H_expect_14_546
+ 30ms 9.38% 84.38% 30ms 9.38% main.D_expect_7_27
+ 20ms 6.25% 90.62% 20ms 6.25% main.B_expect_3_64
+ 20ms 6.25% 96.88% 20ms 6.25% main.G_expect_12_73
+ 10ms 3.12% 100% 10ms 3.12% main.C_expect_5_46
+ 0 0% 100% 320ms 100% main.main
+ 0 0% 100% 320ms 100% runtime.main
+```
+
+##### (OS timer profile) serial.go/Run 3:
+```
+File: serial
+Type: cpu
+Time: Jan 27, 2020 at 1:45pm (PST)
+Duration: 501.39ms, Total samples = 310ms (61.83%)
+Showing nodes accounting for 310ms, 100% of 310ms total
+ flat flat% sum% cum cum%
+ 110ms 35.48% 35.48% 110ms 35.48% main.J_expect_18_18
+ 70ms 22.58% 58.06% 70ms 22.58% main.G_expect_12_73
+ 60ms 19.35% 77.42% 60ms 19.35% main.F_expect_10_91
+ 30ms 9.68% 87.10% 30ms 9.68% main.I_expect_16_36
+ 20ms 6.45% 93.55% 20ms 6.45% main.H_expect_14_546
+ 10ms 3.23% 96.77% 10ms 3.23% main.B_expect_3_64
+ 10ms 3.23% 100% 10ms 3.23% main.C_expect_5_46
+ 0 0% 100% 310ms 100% main.main
+ 0 0% 100% 310ms 100% runtime.main
+```
+
+
+
+## 12. References
+
+1. HPCToolkit: https://github.com/HPCToolkit/hpctoolkit
+2. Oracle studio:
+https://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html
+3. Perf utils: https://github.com/hodgesds/perf-utils
+4. vTune
+https://software.intel.com/en-us/vtune
+5. "Pinpointing Data Locality Problems
+Using Data-Centric Analysis", Xu Liu and John Mellor-Crummey, 2011 International
+Symposium on Code Generation and Optimization, April 2-6, 2011, Chamonix,
+France.
+6. "A Data-centric Profiler for Parallel Programs", Xu Liu and John
+Mellor-Crummey, The International Conference for High Performance Computing,
+Networking, Storage and Analysis, November 17-22, 2013, Denver, Colorado, USA.
+7. "ScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel
+Programs", Xu Liu and Bo Wu, The International Conference for High Performance
+Computing, Networking, Storage and Analysis, Nov 15-20, 2015, Austin, Texas,
+USA.
+
diff --git a/design/36821/call_graph_pprof++.png b/design/36821/call_graph_pprof++.png
new file mode 100644
index 0000000..2f3e451
--- /dev/null
+++ b/design/36821/call_graph_pprof++.png
Binary files differ
diff --git a/design/36821/call_graph_pprof.png b/design/36821/call_graph_pprof.png
new file mode 100644
index 0000000..9ffaed2
--- /dev/null
+++ b/design/36821/call_graph_pprof.png
Binary files differ
diff --git a/design/36821/goroutine.png b/design/36821/goroutine.png
new file mode 100644
index 0000000..f867efe
--- /dev/null
+++ b/design/36821/goroutine.png
Binary files differ
diff --git a/design/36821/goroutine_pmu.png b/design/36821/goroutine_pmu.png
new file mode 100644
index 0000000..18dde54
--- /dev/null
+++ b/design/36821/goroutine_pmu.png
Binary files differ
diff --git a/design/36821/serial.png b/design/36821/serial.png
new file mode 100644
index 0000000..0beae7e
--- /dev/null
+++ b/design/36821/serial.png
Binary files differ
diff --git a/design/36821/serial_pmu.png b/design/36821/serial_pmu.png
new file mode 100644
index 0000000..7bb9e3f
--- /dev/null
+++ b/design/36821/serial_pmu.png
Binary files differ