# Proposal: hardware performance counters for CPU profiling.
Author(s): Milind Chabbi
Last updated: Feb/13/2020
Discussion at https://golang.org/issues/36821.
## 1. Abstract
The Go pprof CPU profiles are quite inaccurate and imprecise due to several
limitations related to the operating system (OS) interval timer used as its
sampling engine.
Inaccurate profiles can mislead performance investigation and may lead
to inappropriate optimization guidance.
We propose incorporating hardware
performance counters as an alternative sampling engine for CPU profiling.
Hardware performance counter-based sampling rectifies many of the problems
related to the OS timers.
The hardware CPU cycles event provides microsecond
measurement granularity, and it is available in almost all commodity CPUs.
Furthermore, hardware Performance Monitoring Units (PMUs) allow measuring not
only CPU cycles but other activities such as cache misses that help root cause
complex performance problems.
A [prototype implementation](https://github.com/uber-dev/go/tree/pmu_pprof) shows significantly
better-quality CPU profiles when using PMUs while maintaining a low measurement
overhead.
The proposal will retain the OS timer as the default sampling engine.
The proposal exposes the PMU-based profiling feature on Linux-based systems,
which provide an excellent facility to program PMUs.
## Table of contents
- [1. Abstract](#1-abstract)
- [2. Background](#2-background)
- [2.1 High-level description of Go's CPU
profiling](#21-high-level-description-of-gos-cpu-profiling)
- [2.2 Desiderata and deficiencies of Go CPU
profiler](#22-desiderata-and-deficiencies-of-go-cpu-profiler)
- [2.3 Examples demonstrating inaccuracy and imprecision in Go
pprof profiles](#23-examples-demonstrating-inaccuracy-and-imprecision-in-go-pprof-profiles)
- [2.4 State-of-the-art in profiling using performance
monitoring units](#24-state-of-the-art-in-profiling-using-performance-monitoring-units)
- [2.5 Low-level details of the current Go CPU
profiler](#25-low-level-details-of-the-current-go-cpu-profiler)
- [3. Problem statement](#3-problem-statement)
- [4. Proposal](#4-proposal)
- [4.1 High-level design](#41-high-level-design)
- [4.2 The devil is in the
details](#42-the-devil-is-in-the-details)
- [4.3 Exposing PMU events via `runtime/pprof`
package](#43-exposing-pmu-events-via-runtimepprof-package)
- [4.4 The optional multiple CPU profiles
feature](#44-the-optional-multiple-cpu-profiles-feature)
- [4.5 The `testing` package](#45-the-testing-package)
- [4.6 The `net/http/pprof`
package](#46-the-nethttppprof-package)
- [5. Empirical evidence on the accuracy and precision of PMU
profiles](#5-empirical-evidence-on-the-accuracy-and-precision-of-pmu-profiles)
- [5.1 The PMU produces precise and accurate profiles for
concurrent programs](#51-the-pmu-produces-precise-and-accurate-profiles-for-concurrent-programs)
- [5.2 The PMU produces precise and accurate profiles for
serial programs](#52-the-pmu-produces-precise-and-accurate-profiles-for-serial-programs)
- [5.3 A practical use](#53-a-practical-use)
- [6. Rationale](#6-rationale)
- [6.1 Advantages](#61-advantages)
- [6.2 Disadvantages](#62-disadvantages)
- [6.3 Alternatives explored](#63-alternatives-explored)
- [7. Compatibility](#7-compatibility)
- [8. Implementation](#8-implementation)
- [9. Open issues](#9-open-issues)
- [10. Acknowledgment](#10-acknowledgment)
- [11. Appendix](#11-appendix)
- [12. References](#12-references)
## 2. Background
In this section, we first provide a high-level description of Go's CPU
profiling, motivate the problems in the current profiler with examples,
highlight the state-of-the-art in CPU profiling, and then dive deeper into the
current implementation of CPU profiling inside Go.
Our proposed changes appear
in the next section.
Readers familiar with profiling technology and the
internals of pprof may skip this section.
### 2.1 High-level description of Go’s CPU profiling
Pprof, the de facto Go CPU profiler, employs sampling-based profiling for time
measurement.
It attributes the time metric to different functions, including
their calling context (aka call path or backtrace).
Pprof relies on the
underlying operating-system timers (exposed by the Go runtime) to periodically
interrupt a program’s execution.
The sampling rate is typically one hundred
hertz (a sample taken every 10ms); a different sampling rate can be configured,
but typical systems cannot deliver more than 250 hertz.
When profiling is enabled, the Go runtime scheduler sets up each OS thread in
the process to generate periodic timer interrupts.
On each interrupt, the Go
runtime's signal handler obtains a backtrace and appends it to a ring buffer.
All OS threads in a Go process append their backtraces to the same ring buffer
coordinated via a lock.
A side thread, launched by the `pprof` package,
periodically scrapes the backtraces from the ring buffer and serializes them
into a protobuf file to be used by the downstream tools.
### 2.2 Desiderata and deficiencies of Go CPU profiler
*Accuracy* and *precision* are desirable properties of a good measurement tool.
A profiling datum is said to be *accurate* if it is close to the ground truth
(aka true value).
For example, if `API_A()` consumes 25% of the total
execution time and a profiler attributes 24% of the total execution time to it,
the measurement has 96% accuracy.
Profiling data are said to be *precise* if there is low variability among
multiple measurements.
Precision of a set of measurements of the same variable is often
[expressed](https://www.wikihow.com/Calculate-Precision) either as min and max
from the sample mean or as [standard
error](https://en.wikipedia.org/wiki/Standard_error) or [coefficient of
variation](https://en.wikipedia.org/wiki/Coefficient_of_variation) of the
sample.
Unfortunately, the current pprof CPU profiles meet neither of these two
criteria.
#### 2.2.1 Inaccuracy and its causes
Inaccuracy of profiles arises from two sources: sample size and sampling bias.
#### Sample size
By the [law of large
numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers), a larger sample
size makes the sample mean closer to the population mean.
Moreover, in
profiling, where the sample size is very small compared to the population size,
increasing the sample size greatly improves accuracy.
Conversely, a smaller
sample size leads to lower accuracy.
The low-resolution OS timer is the primary reason behind the small sample size
in a measurement window.
One can obtain more samples either by increasing the
length of a measurement window or by increasing the number of samples in a
measurement window.
The former works well for regular programs, e.g., code
dominated by loops; however, for irregular code, such as microservices, a
larger measurement window does not rectify the measurement inaccuracy because
more irregularities can easily surface in a larger observation window.
Hence,
there is a dire need to collect more samples within a measurement window and
obtain fine-grained samples of sub-millisecond executions.
Furthermore, linearly scaling the measurement time to collect more samples is
impractical if orders of magnitude more samples are necessary to correct the
small sample size issue.
#### Sampling bias
A larger sample size alone does not improve accuracy if the samples are biased.
OS timer samples on some systems (e.g., Linux) are biased because a timer
interrupt from one OS thread, say `T1`, may be handled by an arbitrary (not to
be confused with uniformly random) thread, say `T2`.
This means `T2` will
handle the interrupt and incorrectly attribute the sample to its call stack,
which results in biased samples.
If more timer signals are handled by `T2`
compared to `T1`, a [systematic sampling
bias](https://en.wikipedia.org/wiki/Observational_error) will occur, which leads
to inaccurate profiles.
#### 2.2.2 Imprecision and its causes
Imprecision in measurement comes from at least the following two sources:
sample size and measurement skid.
#### Sample size
A smaller number of samples directly contributes to a large [standard
error](https://en.wikipedia.org/wiki/Standard_error).
The low resolution of OS
timers is responsible for fewer samples in a measurement window, which results
in a lower precision of pprof profiles.
Conversely, a higher resolution of
samples will improve precision.
#### Measurement skid
The second source of measurement error is the [random
error](https://en.wikipedia.org/wiki/Observational_error) inherent in a
measurement apparatus.
An OS timer apparatus configured to expire after `N` milliseconds
is only guaranteed to generate an interrupt some time after `N` milliseconds --
not precisely at `N` milliseconds.
This randomness introduces a large "skid" in
measurement.
Assume a periodic timer set to fire every 10ms.
Assume it has
1ms left before its expiry when the OS scheduler with a 4ms resolution inspects
it.
This means the timer will fire 4ms later, which is 3ms after the scheduled
expiry time.
This results in up to 30% imprecision for a 10ms periodic timer.
Although random error may not be entirely eliminated, a superior measurement
apparatus reduces the effects of random error.
### 2.3 Examples demonstrating inaccuracy and imprecision in Go pprof profiles
#### 2.3.1 A concurrent code
Consider the Go program
[goroutine.go](https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go),
which has ten exactly similar goroutines `f1`-`f10`.
These functions are
computationally intensive, long-running, and are unaffected by IO or memory
hierarchy.
The functions are non-preemptive, and hence the Go runtime scheduler
overhead is zero.
It is expected that each function takes exactly the same
amount of CPU time (i.e., approximately 10% of the overall execution).
It
should not matter how many OS threads or CPU cores they run on.
Table 1 summarizes the result of `go tool pprof -top` for this program running
on a Linux OS on a 12-core Intel Skylake server-class machine.
The
measurements were taken three times under the same configuration.
The `Flat
(ms)` column shows the absolute millisecond measurement attributed to each of
the ten routines, and the `Flat (%)` column shows the relative time in a
routine with respect to the total execution time.
The `Expected (%)` column
shows the expected relative time in each routine.
The `Rank order` column
assigns ranks in the descending order of execution time.
The raw output is
available in the [appendix](#11-appendix).
![Table 1](36821/goroutine.png) *Table 1: demonstration of inaccuracy and
imprecision of pprof profiles using OS timers on a concurrent code.*
First, let us focus on the data in a single run `RUN 1`.
`f1`-`f10` have a wide
variance in the time attributed to them; the expectation is that each of them
gets `10%` execution time.
There is up to a 6x difference between the time attributed
to the routine with the highest attribution (`main.f7` with 4210ms,
23.31%) and the routine with the lowest (`main.f9` with
700ms, 3.88%).
**This demonstrates the poor accuracy (deviation from the
ground truth) of pprof timer-based profiles.**
The issue reproduces when using different numbers of CPU cores.
Now focus on the three runs `RUN 1-3` together.
The time attributed to any
routine widely varies from one run to another.
The rank order of the routines
changes significantly from run to run.
In Run 1, `main.f7` is shown to run for
4210ms with the rank order of 1, whereas in Run 2, it is shown to run for only
520ms with the rank order 10.
The expectation is that the measurements remain
the same from run to run.
From one run to another, function-level time
attributions are not precise; for example, there is a 71% coefficient of
variation among the three runs of the `f1` routine, and the average coefficient
of variation across all functions is 28%.
**This demonstrates imprecision (unpredictability of
measurement) of pprof timer-based profiles.**
For the curious readers, Table 2 below shows the near-perfect accuracy and
precision obtained when profiling the same code using the PMU-based profiler
described in this proposal.
We show two runs instead of three for
brevity.
Additional details appear in [Section
5.1](#51-the-pmu-produces-precise-and-accurate-profiles-for-concurrent-programs).
![Table 2](36821/goroutine_pmu.png) *Table 2: demonstration of accuracy and
precision of pprof profiles via PMU CPU cycles event on a concurrent code.*
#### 2.3.2 A serial code
The issue is not specific to concurrent code alone.
Refer to a carefully
crafted Go program --
[`serial.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go)
-- which has a single goroutine with complete data
dependence from one statement to another.
It has ten functions
`A_expect_1_82`, `B_expect_3_64`, ... , `J_expect_18_18`.
The functions are
intentionally written to consume different amounts of time to make this test
case more representative.
Each function is similar and has a loop, but the
trip-count of the loop in each function is different.
The loop in the function
with prefix `A` runs `1*C` times, where `C` is a constant.
The loop in the function
with prefix `B` runs `2*C` times, and so on.
This arrangement leads to a
predictable relative execution time for each function; the function with prefix
`A` consumes `1/sum(1..10)=1.82%` of the total execution time; and the function
with prefix `B` consumes `2/sum(1..10)=3.64%` of the total execution time, and
so on until the function with prefix `J` consumes `10/sum(1..10)=18.18%` of the
total execution time.
The expected relative percentage of CPU time is encoded in each function name:
`A_expect_1_82` should consume 1.82% of the execution time, `B_expect_3_64`
should consume 3.64%, and so on, up to `J_expect_18_18`, which should consume
18.18% of the CPU time.
Table 3 summarizes the output of pprof CPU profiling for this
code.
As before, the measurements were taken three times under the same
configuration.
The `Flat (ms)` column shows the absolute millisecond
measurement attributed to each of the ten functions, and the `Flat (%)` column
shows the relative time with respect to the total time taken by all functions.
The `Expected (%)` column shows the expected relative time in each routine.
The
`Rank order` column assigns ranks in descending order of the execution time.
The raw output is available in the [appendix](#11-appendix).
![Table 3](36821/serial.png) *Table 3: demonstration of inaccuracy and
imprecision of pprof profiles on a serial code.*
A `-` in a cell represents zero samples.
First, let's focus on just one set of measurements from `RUN 1`.
Rather than
`J_expect_18_18` being the top time consumer with 18.18%, both `J_expect_18_18`
and `H_expect_14_546` are the top time consumers each with 25% CPU time.
`B_expect_3_64`, `D_expect_7_27`, `F_expect_10_91`, and `I_expect_16_36` are
all lumped into a single rank order of 4 each attributed with 6.25% of
execution time.
`E_expect_9_09` does not even appear in the profiles despite
its 9% expected execution time.
More inaccuracies are evident in the data in
the table.
If we inspect the data from `RUN 2` and `RUN 3`, we notice more
discrepancies, with much of the data missing or incorrect.
The rank ordering
and the relative execution percentage both have very low correlation with the
ground truth.
Despite tens of milliseconds of execution time, some functions
are overcounted and some are undercounted.
Now, let's focus on the three runs together.
The highest time consumer,
`J_expect_18_18`, does not get even a single sample in `RUN 3`, indicating
imprecision.
`I_expect_16_36`, which has measurement data available in all
three runs, varies from 20ms in `RUN 1` with a rank order of 4 to 70ms in `RUN 2`
with a rank order of 1.
More discrepancies are visible in the table.
Thus, the
serial program also shows run-to-run measurement variation, and hence the
profiles are imprecise.
Table 4 below shows the near-perfect accuracy and precision of profiling the
same code using the PMU-based profiler described in this proposal.
The measurements
were taken two times instead of three for brevity.
Additional details appear
in [Section
5.2](#52-the-pmu-produces-precise-and-accurate-profiles-for-serial-programs).
![Table
4](36821/serial_pmu.png) *Table 4: demonstration of accuracy and precision
of pprof profiles via PMU CPU cycles event on a serial code.*
### 2.4 State-of-the-art in profiling using performance monitoring units
An alternative to the OS timer-based sampling is using hardware performance
counters, which are ubiquitous in modern commodity CPUs.
A CPU's Performance
Monitoring Unit (PMU) can count various events such as
cycles, instructions, and cache misses, to name a few.
Furthermore, PMUs can be
configured to deliver an interrupt when an event reaches a specified threshold
value.
These PMU interrupts, like the OS timer interrupts, can serve as another
source of sampling the call stack.
This technique is employed in many modern
profilers [1,2,4].
PMU sampling is called “event-based sampling,” where the
period (interval between two interrupts) is expressed as the number of
occurrences of an event, not the number of timer ticks.
As a side note, in
event-based sampling, the sampling overhead is proportional to the number of
events disregarding the length of execution, whereas in timer-based sampling,
the overhead is proportional to the length of execution.
If there are more
events in a short time span, then there is also more interest in observing
that region of execution closely, and event-based sampling takes more
samples there.
More overhead is incurred in places where there is interest (more
events) and low/no overhead otherwise.
The CPU cycles event is one of the most common
PMU events; it occurs periodically, similar to OS timer ticks.
Hence, using CPU
cycles as a substitute for the OS timer is very attractive.
### 2.5 Low-level details of the current Go CPU profiler
We now describe the current implementation of CPU profiling in Go, which
is necessary background for understanding the changes we propose in the next section.
The
description is relevant for most Unix-based systems, especially Linux, and most
of the discussion is irrelevant for Windows since it does not use signals for
handling CPU profiling.
![call_graph_pprof](36821/call_graph_pprof.png)
*Figure 1: A pictorial representation of the OS timer sampling workflow in the
existing implementation.
The first line of each record box is the function
name.
The remaining lines provide a textual description of the steps taken
by that function.*
Currently, `pprof.StartCPUProfile(io.Writer)` is the entry point for starting
CPU profiling.
Analogously, `pprof.StopCPUProfile()` stops a running profile.
Two concurrent CPU profiling sessions are disallowed via a lock-protected
variable `cpu.profiling` in pprof.
The `pprof` package interacts with the
`runtime` package for starting/stopping profiling via
`runtime.SetCPUProfileRate(hz int)` API.
There are two critical pieces of work done inside
`pprof.StartCPUProfile(io.Writer)`.
First, it invokes
`runtime.SetCPUProfileRate(hz=100)` asking the runtime to start profiling; the
rest of the profile generation will be handled by the `runtime` package; the
integer argument is the sampling rate in Hz (samples per second).
Second, in order to collect the profiles generated by the runtime,
`pprof.StartCPUProfile(io.Writer)` launches a goroutine
`profileWriter(io.Writer)`.
`profileWriter`:
1. creates a protocol buffer via `newProfileBuilder`,
2. goes into a sleep-wakeup loop calling `runtime.readProfile()`, which
provides a chunk of the profiles collected inside the runtime's ring buffer
(`eof` if profiling has stopped), and
3. writes to the sender end of a channel, `cpu.done`, once
`runtime.readProfile()` returns `eof`; the other side of the channel is
waited upon by `pprof.StopCPUProfile()`.
`runtime.SetCPUProfileRate(int)` is the entry point to start/stop profiling in
the `runtime` package.
It:
1. creates a ring buffer, `cpuprof.log`,
2. appends profile header information into the ring buffer, and
3. invokes `setcpuprofilerate(hz)`, which:
   1. calls `setThreadCPUProfileRate(hz=0)` to stop any ongoing timer,
   2. invokes `setProcessCPUProfiler(hz)`, which installs the `SIGPROF` handler,
   3. records the current sampling rate `hz` into the scheduler variable
   `sched.profilehz` to make it visible to other OS threads, and
   4. calls `setThreadCPUProfileRate(hz)`, which uses an OS timer (e.g.,
   `setitimer`) to start the sampling engine; this step is performed only if
   the `hz` value is non-zero.
Inside the scheduler, the `execute()` method schedules a goroutine (called
`G` in the Go runtime scheduler) to run on the current OS thread (called
`M` in the Go runtime scheduler).
This method compares the current
value of the scheduler-recorded `sched.profilehz` variable against its own
snapshot (`m.profilehz`) of the previously recorded value; if they differ,
it reconfigures its timer by calling
`setThreadCPUProfileRate(sched.profilehz)`.
This functionality sets an OS
timer on behalf of the current OS thread in the process.
Once armed, the timer periodically delivers a `SIGPROF` signal, handled by
`sighandler(sig int, info *siginfo, ctxt unsafe.Pointer, gp *g)`, which in turn
invokes `sigprof()`, which:
1. obtains a backtrace, and
2. appends the backtrace into the ring buffer via `cpuprof.add()`.
Occasionally, a sampled program counter may fall into non-Go space, or the ring
buffer may be full; such samples are simply counted, and a backtrace is not taken.
The ring buffer is coordinated via a `signalLock` for concurrency control.
The edges marked with “signalLock”, in Figure 1, show the operations that take
the lock to access the ring buffer.
`pprof.StopCPUProfile()` follows a very similar path as
`pprof.StartCPUProfile(io.Writer)`.
When the `hz` argument is zero, the runtime
APIs described previously (e.g., `SetCPUProfileRate(hz)`) interpret it as the
end of profiling.
`pprof.StopCPUProfile()` performs the following steps:
1. calls `runtime.SetCPUProfileRate(0)` to stop profiling,
2. waits on the `cpu.done` channel, which will be flagged by the side thread
once `eof` is reached in the runtime ring buffer, and
3. closes the supplied `io.Writer`.
`runtime.SetCPUProfileRate(hz)`, when called with `hz=0`, invokes
`setcpuprofilerate(0)` and closes the ring buffer.
`setcpuprofilerate(0)`, in turn:
1. calls `setProcessCPUProfiler(0)`, which restores the previous signal handler
for `SIGPROF`,
2. resets `sched.profilehz=0`, indicating globally that profiling has stopped,
and
3. calls `setThreadCPUProfileRate(0)` to stop sampling on the current OS thread.
Subsequently, the other `M`s notice the change in the global `sched.profilehz`
and turn off their timers when they run their `execute()` method.
The direct users of pprof are:
1. The `testing` package which can take a `-cpuprofile <file>` flag to collect
profiles.
2. The `net/http/pprof` package which can collect cpu profiles on the
`/debug/pprof/profile` endpoint with an optional argument `?seconds=<N>`.
3. Third party libraries such as `pkg/profile`.
## 3. Problem statement
*Pprof CPU profiles are inaccurate because the profiling resolution is low and
sampling is biased in OS timers.*
*Pprof CPU profiles are imprecise because the profiling resolution is low and
the samples have a large skid due to OS timers.*
*Pprof CPU profiles measure only the time metric; additional events should be
monitored to help performance profiling.*
## 4. Proposal
In this proposal, we address the concerns -- small sample size, sampling bias,
measurement error, and the need for more metrics -- which were raised in the
background section.
**Sample size.** First, this proposal addresses the problem of low sampling
resolution of OS timers by using high-resolution samples from hardware
performance monitoring units (PMUs) to generate more samples.
While OS timers have ~10ms
resolution (that is, one sample every 10ms), sampling CPU cycles via PMUs can
deliver (if desired) a sample every few microseconds -- a
three-orders-of-magnitude finer resolution.
This fine-grained measurement increases the
sample size and, as a result, alleviates both the inaccuracy and imprecision
problems.
While PMUs can deliver 100x more samples with just ~13% measurement
overhead, the alternative of lengthening a measurement window by 100x to
collect the same number of samples via the OS timers would be impractical.
**Sampling bias.** Second, the proposal arranges for each sample to be delivered
to the OS thread that caused the counter overflow, which eliminates the sampling
bias and improves accuracy.
**Random error (skid).** Third, PMU samples alleviate the large sampling skid
seen in OS timers because PMUs allow a low-skid or skid-free precise sample
attribution.
**More metrics.** Finally, the PMU's capability to monitor various events such
as cache misses, TLB misses, branch mispredictions, inter-core cacheline
ping-ponging, to name a few, offers additional insights when diagnosing complex
performance issues.
Implementing this idea calls for extending the Go runtime package’s CPU
profiling capabilities and making necessary adjustments in `runtime/pprof`,
`testing`, and `http/pprof` packages.
We first describe the details of the
changes to the `runtime` package and then describe the details of the `pprof`,
`testing`, and `net/http/pprof` packages, in that order.
Our design brings the full capability of PMUs into the `runtime` package but
exposes only a small amount of it via the `pprof` package to users.
For
example, the runtime might be able to configure whether it wants to monitor the
CPU cycles spent inside the kernel or not; but these configurations will not be
exposed (at least in the initial design) to the users of `pprof` package.
### 4.1 High-level design
In its simplest form, we substitute the start/stop of an OS interval timer with
the start/stop of a PMU event configured in the sampling mode; the sampling
mode delivers an interrupt each time a hardware counter overflows, which is
analogous to the OS timer interrupt.
The interrupt generated by the PMU counter
overflow follows the same signal handling path as the interval timer.
This
policy allows collecting and appending a backtrace on each PMU interrupt into a
ring buffer, just like the OS timer.
Moreover, like the OS timer, each
OS-thread (`M`) checks whether any changes were made to the PMU-based profiling
during each scheduling activity and performs the necessary adjustments to
reconfigure the PMU.
Finally, we retain `SIGPROF` as the signal type delivered
-- for both OS timer interrupts and a PMU counter overflow interrupts --
instead of hijacking another signal number for PMU-based profiling.
### 4.2 The devil is in the details
This seemingly straightforward approach poses implementation challenges
related to retaining the compatibility with interval timer and incorporating
additional PMU features.
The first part of this section addresses those
challenges.
Other than the design described in the previous subsection, the
rest of the details are closer to implementation details.
As before, the
discussion is only relevant to Linux.
PMUs can count multiple CPU events simultaneously (e.g., cycles, instructions,
cache misses, etc.).
Our proposal does not preclude the runtime from
simultaneously enabling multiple PMU events or even PMU events alongside an OS
timer event.
In fact, certain events are best used when measured with others.
For example:
1. cache misses measured alongside cache hits enable computing the cache miss
ratio, and
2. remote-core cache line transfers measured alongside total cache misses
provide a view of the cause of the cache misses.
In our proposal, we will limit the maximum number of concurrent CPU events to
`_CPUPROF_EVENTS_MAX` (`EV_MAX` for short in the rest of this document).
We
will define a few common portable events so that the user does not have to
look up event codes in the CPU manufacturer's manual.
This closely
resembles the preset events of `perf events` in Linux.
In order to enable advanced users to exploit the full capability of the PMUs,
we define an opaque RAW event that allows the users to specify a model-specific
hexadecimal event number that can be sampled.
A motivating example for such a
raw event is the remote
[HITM](https://software.intel.com/sites/default/files/managed/8b/6e/335279_performance_monitoring_events_guide.pdf?ref=hvper.com)
event on Intel processors; this event counts the number of times a cache line
was loaded where the cache line was in the modified state in another remote
CPU’s cache.
This event helps diagnose problems such as cache line false
sharing; we have observed this event to be helpful in identifying false sharing
even inside the Go runtime.
We have come up with the following initial set of PMU events.
These events will
be exposed by the `runtime` package to the `pprof` package.
The `pprof` package
will wrap these constant values with matching function names to be presented
later in the `pprof` package subsection.
Notice that the OS timer is one of the CPU events.
Last-level cache events are included to capture the behavior of the CPU's memory
hierarchy.
Branch events are included to capture potentially wasted CPU
cycles.
The design discussions can refine this list further (e.g., whether TLB events
are needed); the list is easy to grow based on requests.
```
type cpuEvent int32

// These constants are a mirror of the runtime
const (
	_CPUPROF_OS_TIMER, _CPUPROF_FIRST_EVENT cpuEvent = iota, iota
	_CPUPROF_HW_CPU_CYCLES, _CPUPROF_FIRST_PMU_EVENT
	_CPUPROF_HW_INSTRUCTIONS cpuEvent = iota
	_CPUPROF_HW_CACHE_REFERENCES
	_CPUPROF_HW_CACHE_MISSES
	_CPUPROF_HW_BRANCH_INSTRUCTIONS
	_CPUPROF_HW_BRANCH_MISSES
	_CPUPROF_HW_RAW
	_CPUPROF_EVENTS_MAX
	_CPUPROF_LAST_EVENT = _CPUPROF_EVENTS_MAX - 1
)
```
Since PMU support may not be available on all platforms in all settings, we
need to make it configurable in the Go runtime.
If an OS supports PMU-based
profiling (e.g., Linux), it will dynamically register two functions during the
runtime startup.
1. `setThreadPMUProfilerFunc`: this is the OS-specific function used to
start/stop a PMU event, analogous to `setThreadCPUProfile`.
If `setThreadPMUProfilerFunc` is `nil`, the runtime will not attempt to
start/stop PMU profiles even if it is requested to do so.
However, this
situation will not arise because `runtime/pprof` will guard against it.
2. `sigprofPMUHandlerFunc`: this is the OS-specific function called from the
OS-specific signal handler when the PMU delivers an interrupt.
Currently, each `M` (type `m`) and the scheduler (type `schedt`) maintain a
`profilehz` variable.
In lieu of `profilehz`, we'll introduce a runtime structure,
`cpuProfileConfig`, which captures the essential configuration needed
to start/stop either a PMU counter or an OS timer.
`cpuProfileConfig` is
designed to reflect the underlying `perf_event_attr` struct in Linux, but being
an internal data structure we can modify it, if needed in the future, without
affecting any public-surface API.
```
type cpuProfileConfig struct {
	hz       uint64 // the sampling rate, used only for OS_TIMER
	period   uint64 // the sampling interval, used for PMU events
	rawEvent uint64 // the hexadecimal value of the CPU-specific event;
	// used only with the _CPUPROF_HW_RAW cpuEvent type
	preciseIP profilePCPrecision // the level of precision for the program counter in PMU samples

	isSampleIPIncluded        bool // should the sample include the PC?
	isSampleAddrIncluded      bool // should the sample include the memory address involved?
	isKernelIncluded          bool // should the kernel count events in kernel mode?
	isHvIncluded              bool // should the kernel count events in hypervisor mode?
	isIdleIncluded            bool // should the kernel count events when the idle task is running?
	isSampleCallchainIncluded bool // should the sample report the call chain?
	isCallchainKernelIncluded bool // should the sample include the kernel callchain?
	isCallchainUserIncluded   bool // should the sample include the user callchain?
}
```
Some PMUs can be configured so that they can offer the
exact program counter involved in the counter overflow.
`profilePCPrecision`
dictates the level of precision needed from the PMU as described by the
constants below.
If a PMU is configured to provide the precise program counter for a sample,
we'll substitute the leaf-level program counter in the backtrace with the one
presented by the PMU.
Please refer to [Section 6.2](#62-disadvantages) for details on PMU
skid and precision.
```
type profilePCPrecision uint8
const (
_CPUPROF_IP_ARBITRARY_SKID profilePCPrecision = iota
_CPUPROF_IP_CONSTANT_SKID
_CPUPROF_IP_SUGGEST_NO_SKID
_CPUPROF_IP_NO_SKID
)
```
`pprof` will set the default to `_CPUPROF_IP_ARBITRARY_SKID` to work across
architectures and event types.
However, we can easily change it to the highest
supported level by probing the CPU, as done in
[certain](https://github.com/HPCToolkit/hpctoolkit/blob/master/src/tool/hpcrun/sample-sources/perf/perf_skid.c#L176)
profilers.
`cpuProfileConfig` will be visible to the `pprof` package but will not be exposed to
its clients.
The fields in `cpuProfileConfig` control how a PMU event is
customized.
Pprof will set the field values to the lowest common denominator
that is expected to work across many architectures and security settings.
For example, `isKernelIncluded` will most likely be `false`.
The runtime will be
(mostly) agnostic to these values and will pass them down to OS-specific layers
that will handle programming the PMUs.
The `runtime` package will expose an API
`runtime_pprof_setCPUProfileConfig(cpuEvent, *cpuProfileConfig)`, only to
`pprof`.
This API is the equivalent of `SetCPUProfileRate(hz int)`.
Since
the legacy `SetCPUProfileRate(hz int)` cannot be dropped, we alter its internals
to wrap the `hz` value into a `cpuProfileConfig` structure and pass it down to
`runtime_pprof_setCPUProfileConfig` for a uniform treatment of any event whether
an OS-timer event or a PMU event.
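To illustrate this uniform treatment, below is a minimal sketch of how the legacy entry point could wrap `hz` into a config. The types and function bodies here are stand-ins that mirror the proposal's names, not the actual runtime internals:

```go
package main

import "fmt"

// Stand-ins for the runtime internals described above (hypothetical).
type cpuEvent int

const _CPUPROF_OS_TIMER cpuEvent = 0

type cpuProfileConfig struct {
	hz     uint64 // sampling rate, used only for the OS timer
	period uint64 // sampling interval, used for PMU events
}

var lastConfig *cpuProfileConfig // records the last request, for demonstration only

func runtime_pprof_setCPUProfileConfig(ev cpuEvent, cfg *cpuProfileConfig) {
	lastConfig = cfg // the real runtime would start/stop the event here
}

// SetCPUProfileRate keeps its legacy signature but wraps hz in a config, so
// the OS timer and PMU events get uniform treatment downstream.
func SetCPUProfileRate(hz int) {
	var cfg *cpuProfileConfig
	if hz > 0 {
		cfg = &cpuProfileConfig{hz: uint64(hz)}
	}
	runtime_pprof_setCPUProfileConfig(_CPUPROF_OS_TIMER, cfg) // nil stops profiling
}

func main() {
	SetCPUProfileRate(100)
	fmt.Println(lastConfig.hz) // 100
	SetCPUProfileRate(0)
	fmt.Println(lastConfig) // <nil>
}
```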
The `runtime` can be bifurcated into OS-agnostic and OS-specific parts.
The OS-agnostic part of the runtime will have the following duties:
1. Keep track of the CPU events currently running,
2. Handle new start/stop
invocations of the OS timer or PMU event, one at a time.
3. Allocate and
coordinate access to the ring buffers associated with each of the running CPU
events.
4. Reconfigure the CPU events for an `M` during its scheduling based on
whether its current snapshot is different from the global configuration stored
in the scheduler.
The OS-specific part of runtime will have the following duties:
1. Install/restore profiling signal handler.
2. Start/stop OS timer or PMU
events.
3. Handle the profiling signals and pause the PMUs during signal
handling.
4. Map an interrupt delivered to the appropriate active CPU profiling
event for recording the backtraces into an appropriate ring buffer.
Since there can be `_CPUPROF_EVENTS_MAX` concurrent CPU profiles, we
need to change some of the runtime data structures that previously assumed a
single CPU profile.
Importantly,
1. We will change the `profilehz` field in the `schedt` struct to be an array of
`cpuProfileConfig` pointers.
2. We will change the `profilehz` field in the `m` struct
to be an array of `cpuProfileConfig` pointers.
3. We will introduce an array of
opaque pointers, `cpuProfileHandle`, in the `m` struct.
This will be OS-specific
information stored in each `M`.
For example, on Linux, this will hold the
file handles obtained from `perf_event_open` system calls and the corresponding
`mmap` ring buffers where the kernel populates the sample data.
This information
will be used to close the resources once the profiling is stopped on an `M`.
We’ll replace the `hz` argument used in various runtime APIs involved in CPU
profiling with the following two arguments.
1. `eventId cpuEvent` (described in the constants previously), and
2. `profConfig *cpuProfileConfig`.
Figure 2 shows the call graph for the new workflow.
We abbreviate `eventId` as
`i` and `cpuProfileConfig` as `p` in most places.
![call_graph_pprof++](36821/call_graph_pprof++.png)
*Figure 2: A pictorial representation of the OS timer-based and PMU-based
sampling workflow.
The first line of each record box is the function name.
The
remaining lines provide a textual description of the steps taken by that
function.*
A `nil` value of `cpuProfileConfig` implies stop profiling for the specified
`cpuEvent`, and a `non-nil` value implies start profiling for the specified
`cpuEvent`.
At the runtime-level, `runtime_pprof_setCPUProfileConfig()` makes a
copy of the `cpuProfileConfig` passed from `pprof`, so that inside the runtime,
we can perform fast pointer comparison and rely on two large `cpuProfileConfig`
structs to be equal if pointers to them are equal.
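This copy-then-compare-pointers scheme can be sketched as follows (names are illustrative, not the actual runtime code):

```go
package main

import "fmt"

type cpuProfileConfig struct {
	hz     uint64
	period uint64
}

// copyConfig makes the runtime's private copy of a caller-supplied config.
// Because each start request produces exactly one private copy, two configs
// inside the runtime are equal iff their pointers are equal, so deep struct
// comparison is never needed on hot paths.
func copyConfig(cfg *cpuProfileConfig) *cpuProfileConfig {
	if cfg == nil {
		return nil
	}
	copied := *cfg
	return &copied
}

func main() {
	caller := &cpuProfileConfig{period: 1000000}
	private := copyConfig(caller)
	fmt.Println(private != caller)   // true: distinct pointers
	fmt.Println(*private == *caller) // true: identical values
	// Per-M reconfiguration then reduces to a pointer comparison, e.g.:
	// if m.profConfig[i] != sched.profConfig[i] { reconfigure(i) }
}
```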
Starting/stopping a PMU counter is different from starting/stopping an OS timer.
Hence, `if eventId == _CPUPROF_OS_TIMER` checks will appear in a few places inside
the runtime’s profiling-related code.
A few concurrency aspects deserve a mention.
1. Each active event will create its own ring buffer on demand.
There will be a
maximum of `EV_MAX` ring buffers.
Each ring buffer will be created only if there
is a corresponding active event.
The size of each ring buffer is 16KB.
If all
`EV_MAX` events are active at the same time, it will result in `16KB * 8 =
128KB` of ring buffers, which is paltry on modern machines.
2. We’ll change the
`cpuprof` variable in `cpuprof.go` to be an array of `EV_MAX` size.
3.
`runtime_pprof_setCPUProfileConfig` will hold a lock only for the `cpuprof[i]`,
where `i` is the event id for which profile has to be started or stopped.
4. There will be a single `signalLock` rather than `EV_MAX` `signalLock`s.
This is
because the `setcpuprofileconfig` method in `cpuprof.go` needs to check (read)
all `prof[eventId].config`s to infer whether all events are stopped before
uninstalling the signal handler.
Taking `EV_MAX` locks one after another would
be no better than holding just one lock.
The consequence of this is that we
serialize simultaneous writes to two different ring buffers.
This should not be
a serious issue since there cannot be two concurrent `SIGPROF`s.
If a system
supports concurrent `SIGPROF` handling, and if there are a lot of threads in a
system, we can revisit this choice.
5. In a very rare case when a `SIGPROF` is delivered for an already running
event when another event is being set (which will hold the `signalLock`), the
signal handler recognizes it and drops such a sample; otherwise, a deadlock would
ensue.
At the lowest level, we will incorporate the following three system calls into
all `runtime/sys_linux_<arch>.s` files.
1. `perf_event_open`, which is Linux’s means to program PMUs.
The Go signature
will be `func perfEventOpen(attr *perfEventAttr, pid uintptr, cpu, groupFd
int32, flags uintptr) int32`
2. `ioctl`, which is needed to stop the PMU counter
when handling the `SIGPROF`, and reset and restart after handling the signal.
The
Go signature will be `func ioctl(fd, req int32, arg uintptr) int32`
3. `fcntl`,
which is necessary to arrange a `perf_event` file handle to deliver the
`SIGPROF` signal and also to setup the signal to be delivered exactly to the `M`
(OS thread) whose PMU counter overflowed.
The Go signature will be `func fcntl(fd,
cmd int32, arg uintptr) int32`
#### 4.2.1 Segregating `SIGPROF` from different sampling sources
All PMU events and the OS timer use the same `SIGPROF` signal to deliver their
interrupts.
Hence, the Unix signal handler needs to address the following
concerns.
1. Identify whether a signal was delivered as a result of the OS timer expiring
or a PMU counter overflowing.
This is necessary because the subsequent processing will be
different based on the event.
2. Identify which PMU counter caused the interrupt, when multiple PMU events are being
sampled at the same time.
This is necessary because the subsequent
processing needs to append the backtraces into the buffer corresponding to the
PMU event that triggered the interrupt.
Different PMU events do not go into the same
ring buffer.
Although both OS timer expiry and any PMU counter overflow result in the
same `SIGPROF` signal, the signal codes are different.
The PMU counter overflow
generates a `POLL_IN` signal code indicating the presence of data whereas the OS
timer interrupts generate a `SI_KERNEL` or `SI_ALARM` signal code.
We use this
information to address #1 above.
Once we know that the signal was generated due to some PMU counter overflowing,
we use the file descriptor `fd` passed in the signal info and match it against
the opened perf events file descriptors on the same `M` and use the matching
location to infer the PMU event `i` that caused the signal; the rest of the
signal handling makes use of this `i` to record the backtrace in the appropriate
ring buffer.
During the signal handling, we’ll pause all active PMU counters via `ioctl`
calls on the `fd` so that the signal handling is not counted towards the PMU
event counters.
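The segregation logic above can be sketched as a pure function. On Linux, `POLL_IN` corresponds to `si_code` 1 and `SI_KERNEL` to 0x80, but treat the constants and the helper name here as assumptions for illustration:

```go
package main

import "fmt"

// si_code values (assumed; Linux defines POLL_IN = 1, SI_KERNEL = 0x80).
const (
	_POLL_IN   int32 = 1
	_SI_KERNEL int32 = 0x80
)

// classifySIGPROF maps a delivered SIGPROF to its source: either the OS
// timer, or the index of the perf event whose file descriptor matches,
// or -1/false when the sample must be dropped.
func classifySIGPROF(siCode, fd int32, perfFDs []int32) (event int, isTimer bool) {
	if siCode != _POLL_IN {
		return -1, true // timer expiry (SI_KERNEL / SI_ALARM codes)
	}
	for i, f := range perfFDs { // match against this M's open perf_event fds
		if f == fd {
			return i, false
		}
	}
	return -1, false // unmatched PMU interrupt: drop it
}

func main() {
	fds := []int32{7, 9, 11} // fds from perf_event_open, one per active event
	fmt.Println(classifySIGPROF(_SI_KERNEL, 0, fds)) // -1 true
	fmt.Println(classifySIGPROF(_POLL_IN, 9, fds))   // 1 false
}
```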
### 4.3 Exposing PMU events via `runtime/pprof` package
We now turn our attention to the `runtime/pprof` package, which exposes the PMU
sampling to its clients.
There are several design choices.
There is a tension between
a revolutionary and an evolutionary solution; there is friction between exposing the
full PMU functionality and offering the bare minimum.
In our proposal, we hope
to make pragmatic choices among the myriad options.
We propose adding a new API, which accepts different configurations needed to
start any CPU event -- OS timer and PMU counters.
The API signature is shown
below.
```
pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts ...ProfilingOption) error
```
There is an optional variadic `ProfilingOption` to allow
more than one event to be started during a profiling session.
A `ProfilingOption` is an interface exposed by `runtime/pprof` as shown below.
```
type ProfilingOption interface { apply() error }
```
Pprof will implement the following profiling option functions that mirror
various `cpuEvent` constants previously presented.
```
func OSTimer(w io.Writer) ProfilingOption
func CPUCycles(w io.Writer, period uint64) ProfilingOption
func CPUInstructions(w io.Writer, period uint64) ProfilingOption
func CPUCacheReferences(w io.Writer, period uint64) ProfilingOption
func CPUCacheMisses(w io.Writer, period uint64) ProfilingOption
func CPUBranchInstructions(w io.Writer, period uint64) ProfilingOption
func CPUBranchMisses(w io.Writer, period uint64) ProfilingOption
func CPURawEvent(w io.Writer, period uint64, hex uint64) ProfilingOption
```
All of them consume an `io.Writer`.
The `OSTimer` accepts no other argument.
A
`period` accepted by all PMU events specifies the number of events that must
elapse between two consecutive interrupts.
A larger value means coarser sampling
granularity (the opposite of a rate in Hz).
Passing a zero value will result in using a
preset period.
`CPURawEvent` is special; it accepts any user-provided
hexadecimal PMU event code.
Notice that each event requires an `io.Writer`
since different profile types cannot be serialized into the same pprof protocol
buffer.
In the cases where multiple `ProfilingOption`s are passed to the
`pprof.StartCPUProfileWithConfig`, sanity checks are made to ensure the
uniqueness of each `cpuEvent` and each `io.Writer`.
Below are a few example usages of the API.
```
// Simple
StartCPUProfileWithConfig(OSTimer(w))
StartCPUProfileWithConfig(CPUCycles(w, period))
StartCPUProfileWithConfig(CPURawEvent(w, period, hexadecimalEvent))
// Advanced
StartCPUProfileWithConfig(CPUCacheReferences(w1, period1), CPUCacheMisses(w2, period2))
```
The implementation of `StartCPUProfile(writer)` will simply invoke
`StartCPUProfileWithConfig(OSTimer(writer))`.
The implementation of
`StartCPUProfileWithConfig` will ensure a maximum of `EV_MAX` elements, unique
`io.Writers`, and unique `cpuEvents`.
After sanitizing the arguments, it will create
a `cpuProfileConfig` for each `cpuEvent` requested to be profiled.
In a global
data structure `cpu`, we will maintain the active events, their
`*cpuProfileConfig`, and the corresponding `io.Writer`.
```
var cpu struct {
sync.Mutex
profiling bool
activeConfig [EV_MAX]*cpuProfileConfig
activeWriter [EV_MAX]io.Writer
eventName [EV_MAX]string
done chan bool
}
```
Finally, `StartCPUProfileWithConfig` will invoke
`runtime_pprof_setCPUProfileConfig` multiple times once for each active event
to request the `runtime` to actually configure either the OS timer or the PMU
with a specific event to start sampling.
A lock will be held throughout the invocation of
`StartCPUProfileWithConfig` and hence no two
`StartCPUProfileWithConfig/StartCPUProfile` invocations run concurrently.
Once
a session of `StartCPUProfileWithConfig/StartCPUProfile` is active, another
profiling is disallowed until `StopCPUProfile` is invoked.
Before successfully returning from `StartCPUProfileWithConfig`, we'll launch a
goroutine `profileWriter`, which will periodically wake up and scrape
runtime-produced stack samples from the ring buffers associated with each of
the active events and serialize them into the corresponding `io.Writer`s.
This follows the same design as the original pprof but adds the
responsibility of collecting profiles from multiple events. Creating one
goroutine to scrape each active event would be wasteful; a single
goroutine suffices to collect backtraces from all ring buffers.
### 4.4 The optional multiple CPU profiles feature
Allowing multiple CPU profiles in a single profiling session via
`pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts
...ProfilingOption)` API is a power-user feature.
To avoid accidental
misuse, we'll protect it behind an environment variable,
`GO_PPROF_ENABLE_MULTIPLE_CPU_PROFILES=<true|false>`, which will be disabled by
default.
### 4.5 The `testing` package
We’ll add the following two command line arguments to the testing package:
1. `-cpuprofileevent=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>`
2. `-cpuprofileperiod=<Int64Value>`
The default value of `-cpuprofileevent` is `timer` with 100Hz when only
`-cpuprofile` is passed, which ensures compatibility with the current behavior.
`-cpuprofileperiod` will be ignored for `-cpuprofileevent=timer`, which will
default to the current 100Hz rate.
The testing package will map a
user-specified string for `-cpuprofileevent` value to the appropriate
`pprof.ProfilingOption`, the `-cpuprofileperiod` to an integer, and invoke
`pprof.StartCPUProfileWithConfig()`.
For example, `-cpuprofileevent=cycles` and
`-cpuprofileperiod=100000` will result
in `pprof.StartCPUProfileWithConfig(CPUCycles(w, 100000))`.
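The flag-to-event mapping, including decoding the `r<hexValue>` raw-event form, could look like the following sketch (the helper name and return shape are assumptions):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseEvent maps a -cpuprofileevent flag value to an event name and, for
// raw events of the form r<hex>, the decoded raw event code.
func parseEvent(flag string) (name string, raw uint64, err error) {
	switch flag {
	case "timer", "cycles", "instructions", "cacheMisses",
		"cacheReferences", "branches", "branchMisses":
		return flag, 0, nil
	}
	if strings.HasPrefix(flag, "r") { // e.g. "r3c" for raw event 0x3c
		raw, err = strconv.ParseUint(flag[1:], 16, 64)
		if err == nil {
			return "raw", raw, nil
		}
	}
	return "", 0, fmt.Errorf("unrecognized event %q", flag)
}

func main() {
	fmt.Println(parseEvent("cycles"))
	fmt.Println(parseEvent("r3c")) // raw event code 0x3c = 60
}
```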
### 4.6 The `net/http/pprof` package
We’ll add the following two additional arguments in an `http` request exposed
by the `net/http/pprof` CPU profiling endpoint.
1. `event=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>`
2. `period=<Int64Value>`
`period` will be ignored for `event=timer`, which will default
to the current 100Hz rate.
The default value of `event`
is `timer` with 100Hz, which ensures compatibility with the current behavior.
The `net/http/pprof` package will map a user-specified string for `event`
value to an appropriate `pprof.ProfilingOption`, `period` to an
integer, and invoke `pprof.StartCPUProfileWithConfig`.
## 5. Empirical evidence on the accuracy and precision of PMU profiles
### 5.1 The PMU produces precise and accurate profiles for concurrent programs
Below we show the `go tool pprof -top` output of the previously shown
[`goroutine.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go)
concurrent program run for two times.
The profiles used the proposed new API
with a 1M sampling period: `pprof.StartCPUProfileWithConfig(CPUCycles(w,
1000000))`.
The CPU cycles attributed to each routine `f1-f10` match the
expectation (approximately 10% of the total execution time) within one run and
across multiple runs.
*Thus the profiles are accurate and precise.
The
measurement overhead is indistinguishable from the OS timer in this setting.*
#### (PMU Cycles profile) goroutine.go/Run 1:
```
File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:49pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
flat flat% sum% cum cum%
23400000000 10.00% 10.00% 23400000000 10.00% main.f1
23400000000 10.00% 20.00% 23400000000 10.00% main.f10
23400000000 10.00% 30.00% 23400000000 10.00% main.f2
23400000000 10.00% 40.00% 23400000000 10.00% main.f3
23400000000 10.00% 50.00% 23400000000 10.00% main.f4
23400000000 10.00% 60.00% 23400000000 10.00% main.f5
23400000000 10.00% 70.00% 23400000000 10.00% main.f6
23400000000 10.00% 80.00% 23400000000 10.00% main.f7
23400000000 10.00% 90.00% 23400000000 10.00% main.f8
23400000000 10.00% 100% 23400000000 10.00% main.f9
```
#### (PMU Cycles profile) goroutine.go/Run 2:
```
File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:51pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
flat flat% sum% cum cum%
23800000000 10.17% 10.17% 23800000000 10.17% main.f1
23500000000 10.04% 20.21% 23500000000 10.04% main.f7
23500000000 10.04% 30.26% 23500000000 10.04% main.f9
23400000000 10.00% 40.26% 23400000000 10.00% main.f10
23400000000 10.00% 50.26% 23400000000 10.00% main.f2
23400000000 10.00% 60.26% 23400000000 10.00% main.f4
23400000000 10.00% 70.26% 23400000000 10.00% main.f6
23400000000 10.00% 80.26% 23400000000 10.00% main.f8
23300000000 9.96% 90.21% 23300000000 9.96% main.f3
22900000000 9.79% 100% 22900000000 9.79% main.f5
```
### 5.2 The PMU produces precise and accurate profiles for serial programs
Below we show the `go tool pprof -top` output of the previously shown
[`serial.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go)
program run for two times.
We remind the reader that the expected relative
execution time is encoded in the function names themselves (`J_expect_18_18`
should consume 18.18% CPU cycles and so on).
The profiles used the proposed
new API with a 1M sampling period: `pprof.StartCPUProfileWithConfig(CPUCycles(w,
1000000))`.
The CPU cycles attributed to each routine match the expectation
within a run and across multiple runs.
*Thus the profiles are accurate and
precise.*
#### (PMU Cycles profile) serial.go/Run 1:
```
File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
flat flat% sum% cum cum%
200000000 18.10% 18.10% 200000000 18.10% main.J_expect_18_18
183000000 16.56% 34.66% 183000000 16.56% main.I_expect_16_36
165000000 14.93% 49.59% 165000000 14.93% main.H_expect_14_546
137000000 12.40% 61.99% 137000000 12.40% main.G_expect_12_73
120000000 10.86% 72.85% 120000000 10.86% main.F_expect_10_91
100000000 9.05% 81.90% 100000000 9.05% main.E_expect_9_09
82000000 7.42% 89.32% 82000000 7.42% main.D_expect_7_27
63000000 5.70% 95.02% 63000000 5.70% main.C_expect_5_46
37000000 3.35% 98.37% 37000000 3.35% main.B_expect_3_64
18000000 1.63% 100% 18000000 1.63% main.A_expect_1_82
0 0% 100% 1105000000 100% main.main
0 0% 100% 1105000000 100% runtime.main
```
#### (PMU Cycles profile) serial.go/Run 2:
```
File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
flat flat% sum% cum cum%
200000000 18.10% 18.10% 200000000 18.10% main.J_expect_18_18
183000000 16.56% 34.66% 183000000 16.56% main.I_expect_16_36
159000000 14.39% 49.05% 159000000 14.39% main.H_expect_14_546
142000000 12.85% 61.90% 142000000 12.85% main.G_expect_12_73
119000000 10.77% 72.67% 119000000 10.77% main.F_expect_10_91
100000000 9.05% 81.72% 100000000 9.05% main.E_expect_9_09
82000000 7.42% 89.14% 82000000 7.42% main.D_expect_7_27
61000000 5.52% 94.66% 61000000 5.52% main.C_expect_5_46
40000000 3.62% 98.28% 40000000 3.62% main.B_expect_3_64
18000000 1.63% 99.91% 18000000 1.63% main.A_expect_1_82
1000000 0.09% 100% 1105000000 100% main.main
0 0% 100% 1105000000 100% runtime.main
```
### 5.3 A practical use
We use pprof to collect production CPU profiles at Uber.
On one of our Go microservices, when we used the default OS timer-based CPU
profiles, we saw that the `mutex.Lock()` accounted for 15.79% of the time and
`mutex.Unlock()` accounted for 13.55% of the CPU time.
We knew from the program structure that `mutex.Lock()` and `mutex.Unlock()`
were invoked in pairs one after another, and also that `mutex.Lock()` is
significantly more heavyweight than `mutex.Unlock()`; thus, we
expected much less time to be spent inside `mutex.Unlock()` relative to
`mutex.Lock()`.
The OS-timer profiling data did not match our expectations.
However, when we used a prototype of the PMU-based profiler and sampled the CPU
cycles event, the quantification changed and matched our expectations;
`mutex.Lock()` accounted for 33.36% of the CPU cycles and `mutex.Unlock()`
accounted for 7.57% of the CPU cycles.
The PMU profiles avoided an unnecessary performance investigation.
## 6. Rationale
### 6.1 Advantages
CPU profiling via PMUs results in richer-quality profiles -- high-resolution
measurement, higher accuracy, and higher precision.
Access to various events
such as cache misses in PMUs allows better insights into understanding the
causes of complex performance issues.
Any kind of PMU event can be attributed
to call stack samples, which is useful to provide actionable feedback to the
developer.
Incorporating the PMU facility in the runtime will allow the Go
profiler to take full advantage of the progress happening in the hardware
performance monitoring facilities available in modern CPUs.
This framework will
allow for independently developing more advanced profiling features such as
attributing high-latency data accesses to Go objects, which can help refactor
code or data structures for better data locality [5, 6, 7].
### 6.2 Disadvantages
1. PMU-based profiling may not work when a Go program is running under a
virtual machine, since virtual machines may not expose the hardware performance
counters to a guest OS.
Hence, the proposal keeps the OS timer as the default
and makes PMU-based sampling available only when it is explicitly requested.
2. A higher sampling rate (a shorter period between samples) obviously incurs
higher overhead.
For example, the default timer with a 100Hz sampling rate
introduces 3.6% measurement overhead on a 2.4GHz Intel Skylake machine.
Using
CPU cycles with a period of 24000000 cycles, which amounts to the same 100
samples/second, introduces the same 3.6% overhead.
The table below shows
overheads at other sampling periods using the PMU cycles and also compares
against the default OS-timer profiler overhead.
```
The new PMU-based profiler with cycles event
-----------------------------------------------------
Sampling period in CPU cycles Overhead
-----------------------------------------------------
24000000 (100 samples/sec) 1.036x
2400000 (1K samples/sec) 1.036x
240000 (10K samples/sec) 1.13x
24000 (100K samples/sec) 2.05x
```
```
The default OS-timer based profiler
-----------------------------------------------------
Sampling rate Overhead
-----------------------------------------------------
100 samples/sec 1.036x
```
A reasonable sampling period, once in 2.4M CPU cycles on a 2.4GHz
Intel Skylake machine, offers a good trade-off between a high sampling rate and low
overhead.
3. When using any other PMU event, such as cache misses, the sampling period
needs to be more carefully curated.
We’ll provide preset values for the predefined
events.
4. PMU profiling has a mild skid on deeply pipelined, multi-issue,
out-of-order, superscalar CPU architectures due to multiple instructions
flowing in the pipeline.
Thus the program counter seen in the interrupt can be
a handful of instructions off from the one generating the counter overflow.
This "skid" is much smaller than the several milliseconds of skid in an OS
timer.
Moreover, modern PMUs have built-in features to eliminate such skid
(e.g., Intel Precise Event Based Sampling ([PEBS](https://software.intel.com/sites/default/files/managed/8b/6e/335279_performance_monitoring_events_guide.pdf?ref=hvper.com)), AMD Instruction Based
Sampling ([IBS](https://developer.amd.com/wordpress/media/2012/10/AMD_IBS_paper_EN.pdf)), and PowerPC Marked events ([MRK](https://wiki.raptorcs.com/w/images/6/6b/POWER9_PMU_UG_v12_28NOV2018_pub.pdf))).
Some PMUs can be configured so
that they can offer the exact program counter involved in the counter overflow.
We have introduced `profilePCPrecision` for this reason.
5. There are finite counters on any PMU. Counter multiplexing happens if the
number of events is more than the number of available PMU counter slots. In
that case the events run only part of the time.
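The period-to-rate arithmetic behind the overhead table in item 2 can be checked with a tiny sketch (the clock frequency is a parameter; 2.4GHz matches the machine used above):

```go
package main

import "fmt"

// periodForRate converts a desired samples/sec rate into a CPU-cycles
// sampling period for a machine with the given clock frequency.
func periodForRate(clockHz, samplesPerSec uint64) uint64 {
	return clockHz / samplesPerSec
}

func main() {
	const skylakeHz = 2400000000 // 2.4GHz machine from the table above
	fmt.Println(periodForRate(skylakeHz, 100))    // 24000000
	fmt.Println(periodForRate(skylakeHz, 1000))   // 2400000
	fmt.Println(periodForRate(skylakeHz, 100000)) // 24000
}
```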
### 6.3 Alternatives explored
1. The Linux `perf` command-line tool is powerful and offers comparable solutions.
However, it does not address several important use cases.
1. Profiling via an `http` endpoint (provided by `net/http/pprof` package in
Go) is the preferred, readily available, and scalable solution that works at
the enterprise scale. Making `perf` work in a similar fashion would lead to
creating too many custom solutions.
2. `pprof` is an in-situ profiler running in the same address
space as the process being profiled and hence does not require any special
system privileges; in contrast, attaching `perf` to a running process requires
changes to a system’s security settings.
3. We are also focusing on
containerized environments, which are widely used in large-scale Go deployments.
Although `perf` can collect call stack samples from a container, it fails to find
symbols of Go binaries located within a container to produce clean reports
(e.g., flame graphs).
The workaround is non-trivial.
2. We explored library-level facilities to collect performance counters such as perf-util [\[3\]](https://github.com/hodgesds/perf-utils).
However, no library support will provide good profiling for Go since the runtime
can schedule a monitored function onto another thread, where the counters may not
be enabled.
The correct solution, hence, is to implement the hardware
performance counter facility inside the Go runtime's scheduler, as we have
proposed here.
3. We seriously contemplated not introducing a new pprof API
and tried to work with the existing API `StartCPUProfile(io.Writer)`.
We
considered adding variadic `ProfilingOptions` to it.
However, any such change
would qualify as breaking `Go1.X` compatibility, so we went ahead and introduced a new API.
4. We considered passing an `Options` struct to the new API
`StartCPUProfileWithConfig`.
However, the usage looked complicated with certain
fields making sense only in certain circumstances and hence decided to use the
functional option pattern.
5. We considered having only one CPU profiling
event active at a time.
However, experience has shown that this limits
performance analysis in several ways, and advanced performance tuning demands
correlating more than one event to find the root cause.
Furthermore, this richness
in Go profiling can serve to advance the state-of-the-art in profilers.
6. On the implementation front, we considered using a `map` for active events at
several places instead of an `EV_MAX` array.
However, concurrent access to
different array elements was superior to locking a shared `map`.
Moreover,
`EV_MAX` is small enough to search linearly for active events.
## 7. Compatibility
The design ensures [compatibility with Go1.0](https://golang.org/doc/go1compat)
since no existing public API signatures, interface signatures, or data
structures are changed.
We have introduced a new interface and a handful of new
APIs in the `pprof` package.
## 8. Implementation
Milind Chabbi from Uber Technologies will implement the proposal.
The
implementation will proceed as follows:
1. Add assembly language support for `perf_events` related system calls.
2. Modify the `runtime` and `runtime/pprof` packages to support
PMU sampling, reflecting the call graph / workflow shown in Figure 2.
3. Expose the new facility via `testing` and `net/http/pprof` packages.
4. Make changes to `pprof` CPU profiling tests to use PMU cycles in addition to
OS timer (on supported platforms).
5. Add PMU profiling tests to run on supported platforms.
The implementation will be broken into a large checkin related to the `runtime`
and `runtime/pprof` packages, a few small checkins related to `testing`, and
`net/http/pprof` packages, and the tests accompanying the code.
The
implementation will align with the February-April 2020 development cycle and
May-July 2020 release cycle.
A prototype is available [here](https://github.com/uber-dev/go/tree/pmu_pprof).
## 9. Open issues
1. Supporting Windows and MacOS will be left for future work, but any support
from experts on those platforms is appreciated.
2. The issue [21295](https://github.com/golang/go/issues/21295) proposes
adding a `-test.counters` flag to the testing package. That proposal is
orthogonal to this one.
There
isn’t a detailed description of how the proposal wants to accomplish it.
Simply
starting and stopping a PMU counter by making a `perf_event_open` system call at
the testing package level will not accomplish the objective.
The monitored
functions can migrate from one `M` to another, and hence the changes must be
made in the runtime as described in this proposal.
However, it is relatively
easy to accomplish the objective of issue #21295 by building on the changes
suggested in this proposal.
We are open to
discussing with the creators of #21295 to understand how we can incorporate
their needs into this design.
3. The additions to the public APIs and naming
aesthetics can be refined with inputs from the experts and the community.
4. Currently, we do not bubble up runtime failures to start a PMU profiler (the same is
true for a timer).
Should we bubble up any failure to configure a PMU or
silently produce zero samples in the profiles?
5. Should we introduce a throttling mechanism if the rate of events is too
high for a PMU event?
## 10. Acknowledgment
Pengfei Su implemented a prototype of the design
while interning at Uber Technologies.
Joshua Corbin helped design the public surface APIs.
This work is generously funded by Uber Technologies’ Programming Systems
Research team, whose members provided feedback for improvement.
## 11. Appendix
##### (OS timer profile) goroutine.go/run1:
```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.70s, Total samples = 18060ms (269.37%)
Showing nodes accounting for 18060ms, 100% of 18060ms total
flat flat% sum% cum cum%
4210ms 23.31% 23.31% 4210ms 23.31% main.f7
2610ms 14.45% 37.76% 2610ms 14.45% main.f2
2010ms 11.13% 48.89% 2010ms 11.13% main.f6
1810ms 10.02% 58.91% 1810ms 10.02% main.f10
1780ms 9.86% 68.77% 1780ms 9.86% main.f3
1410ms 7.81% 76.58% 1410ms 7.81% main.f1
1310ms 7.25% 83.83% 1310ms 7.25% main.f4
1110ms 6.15% 89.98% 1110ms 6.15% main.f5
1110ms 6.15% 96.12% 1110ms 6.15% main.f8
700ms 3.88% 100% 700ms 3.88% main.f9
```
##### (OS timer profile) goroutine.go/run2:
```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.71s, Total samples = 17400ms (259.39%)
Showing nodes accounting for 17400ms, 100% of 17400ms total
flat flat% sum% cum cum%
3250ms 18.68% 18.68% 3250ms 18.68% main.f2
2180ms 12.53% 31.21% 2180ms 12.53% main.f9
2100ms 12.07% 43.28% 2100ms 12.07% main.f1
1770ms 10.17% 53.45% 1770ms 10.17% main.f6
1700ms 9.77% 63.22% 1700ms 9.77% main.f5
1550ms 8.91% 72.13% 1550ms 8.91% main.f4
1500ms 8.62% 80.75% 1500ms 8.62% main.f8
1440ms 8.28% 89.02% 1440ms 8.28% main.f3
1390ms 7.99% 97.01% 1390ms 7.99% main.f10
520ms 2.99% 100% 520ms 2.99% main.f7
```
##### (OS timer profile) goroutine.go/Run 3:
```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:48pm (PST)
Duration: 6.71s, Total samples = 17.73s (264.31%)
Showing nodes accounting for 17.73s, 100% of 17.73s total
flat flat% sum% cum cum%
3.74s 21.09% 21.09% 3.74s 21.09% main.f7
2.08s 11.73% 32.83% 2.08s 11.73% main.f9
2.05s 11.56% 44.39% 2.05s 11.56% main.f2
1.85s 10.43% 54.82% 1.85s 10.43% main.f10
1.78s 10.04% 64.86% 1.78s 10.04% main.f1
1.43s 8.07% 72.93% 1.43s 8.07% main.f3
1.42s 8.01% 80.94% 1.42s 8.01% main.f8
1.18s 6.66% 87.59% 1.18s 6.66% main.f6
1.17s 6.60% 94.19% 1.17s 6.60% main.f5
1.03s 5.81% 100% 1.03s 5.81% main.f4
```
##### (OS timer profile) serial.go/Run 1:
```
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:42pm (PST)
Duration: 501.51ms, Total samples = 320ms (63.81%)
Showing nodes accounting for 320ms, 100% of 320ms total
flat flat% sum% cum cum%
80ms 25.00% 25.00% 80ms 25.00% main.H_expect_14_546
80ms 25.00% 50.00% 80ms 25.00% main.J_expect_18_18
60ms 18.75% 68.75% 60ms 18.75% main.G_expect_12_73
20ms 6.25% 75.00% 20ms 6.25% main.B_expect_3_64
20ms 6.25% 81.25% 20ms 6.25% main.D_expect_7_27
20ms 6.25% 87.50% 20ms 6.25% main.F_expect_10_91
20ms 6.25% 93.75% 20ms 6.25% main.I_expect_16_36
10ms 3.12% 96.88% 10ms 3.12% main.A_expect_1_82
10ms 3.12% 100% 10ms 3.12% main.C_expect_5_46
0 0% 100% 320ms 100% main.main
0 0% 100% 320ms 100% runtime.main
```
##### (OS timer profile) serial.go/Run 2:
```
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:44pm (PST)
Duration: 501.31ms, Total samples = 320ms (63.83%)
Showing nodes accounting for 320ms, 100% of 320ms total
flat flat% sum% cum cum%
70ms 21.88% 21.88% 70ms 21.88% main.I_expect_16_36
50ms 15.62% 37.50% 50ms 15.62% main.J_expect_18_18
40ms 12.50% 50.00% 40ms 12.50% main.E_expect_9_09
40ms 12.50% 62.50% 40ms 12.50% main.F_expect_10_91
40ms 12.50% 75.00% 40ms 12.50% main.H_expect_14_546
30ms 9.38% 84.38% 30ms 9.38% main.D_expect_7_27
20ms 6.25% 90.62% 20ms 6.25% main.B_expect_3_64
20ms 6.25% 96.88% 20ms 6.25% main.G_expect_12_73
10ms 3.12% 100% 10ms 3.12% main.C_expect_5_46
0 0% 100% 320ms 100% main.main
0 0% 100% 320ms 100% runtime.main
```
##### (OS timer profile) serial.go/Run 3:
```
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:45pm (PST)
Duration: 501.39ms, Total samples = 310ms (61.83%)
Showing nodes accounting for 310ms, 100% of 310ms total
flat flat% sum% cum cum%
110ms 35.48% 35.48% 110ms 35.48% main.J_expect_18_18
70ms 22.58% 58.06% 70ms 22.58% main.G_expect_12_73
60ms 19.35% 77.42% 60ms 19.35% main.F_expect_10_91
30ms 9.68% 87.10% 30ms 9.68% main.I_expect_16_36
20ms 6.45% 93.55% 20ms 6.45% main.H_expect_14_546
10ms 3.23% 96.77% 10ms 3.23% main.B_expect_3_64
10ms 3.23% 100% 10ms 3.23% main.C_expect_5_46
0 0% 100% 310ms 100% main.main
0 0% 100% 310ms 100% runtime.main
```
## 12. References
1. HPCToolkit: https://github.com/HPCToolkit/hpctoolkit
2. Oracle Developer Studio:
https://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html
3. Perf utils: https://github.com/hodgesds/perf-utils
4. Intel VTune: https://software.intel.com/en-us/vtune
5. "Pinpointing Data Locality Problems
Using Data-Centric Analysis", Xu Liu and John Mellor-Crummey, 2011 International
Symposium on Code Generation and Optimization, April 2-6, 2011, Chamonix,
France.
6. "A Data-centric Profiler for Parallel Programs", Xu Liu and John
Mellor-Crummey, The International Conference for High Performance Computing,
Networking, Storage and Analysis, November 17-22, 2013, Denver, Colorado, USA.
7. "ScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel
Programs", Xu Liu and Bo Wu, The International Conference for High Performance
Computing, Networking, Storage and Analysis, Nov 15-20, 2015, Austin, Texas,
USA.