Proposal: hardware performance counters for CPU profiling.

Author(s): Milind Chabbi

Last updated: Feb/13/2020

Discussion at

1. Abstract

The Go pprof CPU profiles are quite inaccurate and imprecise due to several limitations related to the operating system (OS) interval timer used as its sampling engine. Inaccurate profiles can mislead performance investigation and may lead to inappropriate optimization guidance. We propose incorporating hardware performance counters as an alternative sampling engine for CPU profiling. Hardware performance counter-based sampling rectifies many of the problems related to the OS timers. The hardware CPU cycles event provides microsecond measurement granularity, and it is available in almost all commodity CPUs. Furthermore, hardware Performance Monitoring Units (PMUs) allow measuring not only CPU cycles but other activities such as cache misses that help root cause complex performance problems. A prototype implementation shows significantly better-quality CPU profiles when using PMUs while maintaining a low measurement overhead. The proposal will retain the OS-timer as the default sampling engine. The proposal exposes the PMU-based profiling feature on Linux-based systems, which provide an excellent facility to program PMUs.

Table of contents

2. Background

In this section, we first provide a high-level description of Go's CPU profiling, motivate the problems in the current profiler with examples, highlight the state-of-the-art in CPU profiling, and then dive deeper into the current implementation of CPU profiling inside Go. Our proposed changes appear in the next section. Readers familiar with profiling technology and the internals of pprof may skip this section.

2.1 High-level description of Go’s CPU profiling

Pprof, the de facto Go CPU profiler, employs sampling-based profiling for time measurement. It attributes the time metric to different functions, including their calling context (aka call path or backtrace). Pprof relies on the underlying operating-system timers (exposed by the Go runtime) to periodically interrupt a program’s execution. The sampling rate is typically one hundred hertz (a sample taken every 10ms); a different sampling rate can be configured, but typical systems cannot deliver more than 250 hertz.

When profiling is enabled, the Go runtime scheduler sets up each OS thread in the process to generate periodic timer interrupts. In each interrupt, the go runtime’s signal handler obtains a backtrace and appends it to a ring buffer. All OS threads in a Go process append their backtraces to the same ring buffer coordinated via a lock. A side thread, launched by the pprof package, periodically scrapes the backtraces from the ring buffer and serializes them into a protobuf file to be used by the downstream tools.

2.2 Desiderata and deficiencies of Go CPU profiler

Accuracy and precision are desirable properties of a good measurement tool.

A profiling datum is said to be accurate if it is close to the ground truth (aka true value). For example, if API_A() consumes 25% of the total execution time and a profiler attributes it 24% of the total execution time, the measurement has 96% accuracy.

Profiling data are said to be precise if there is low variability among multiple measurements. Precision of a set of measurements of the same variable is often expressed either as min and max from the sample mean or as standard error or coefficient of variation of the sample.

Unfortunately, the current pprof CPU profiles meet neither of these two criteria.

2.2.1 Inaccuracy and its causes

Inaccuracy of profiles arises from two sources: sample size and sampling bias.

Sample size

By the law of large numbers, a larger sample size makes the sample mean closer to the population mean. Moreover, in profiling, where the sample size is very small compared to the population size, increasing the sample size greatly improves accuracy. Conversely, a smaller sample size leads to a lower inaccuracy.

Low-resolution OS-timer is the primary reasons behind small sample size in a measurement window. One can obtain more samples either by increasing the length of a measurement window or by increasing the number of samples in a measurement window. The former works well for regular programs e.g., codes dominated by loops; however, for irregular codes, such as microservices, a larger measurement window does not rectify the measurement inaccuracy because more irregularities can easily surface in a larger observation window. Hence, there is a dire need to collect more samples within a measurement window and obtain fine-grained samples of sub-millisecond executions. Furthermore, linearly scaling the measurement time to collect more samples is impractical if orders of magnitude more samples are necessary to correct the small samples size issue.

Sampling bias

A larger sample size alone does not improve accuracy if the samples are biased. OS timer samples on some systems (e.g., Linux) are biased because a timer interrupt from one OS thread, say T1 may be handled by an arbitrary (not to be confused with uniformly random) thread, say T2. This means, T2 will handle the interrupt and incorrectly attribute the sample to its call stack, which results in biased samples. If more timer signals are handled by T2 compared to T1, a systematic sampling bias will occur, which leads to inaccurate profiles.

2.2.1 Imprecision and its causes

Imprecision in measurement comes from at least the following two sources: sample size and measurement skid.

Sample size

Fewer number of samples directly contributes to a large standard error. The low resolution of OS timers is responsible for fewer samples in a measurement window, which results in a lower precision of pprof profiles. Conversely, a higher resolution of samples will improve precision.

Measurement skid

The second source of measurement error is the random error inherent in a measurement apparatus. An OS timer apparatus configured to expire after N milliseconds is only guaranteed to generate an interrupt some time after N milliseconds -- not precisely at N milliseconds. This randomness introduces a large “skid” in measurement. Assume a periodic timer set to fire every 10ms. Assume it has 1ms left before its expiry when the OS scheduler with a 4ms resolution inspects it. This means the timer will fire 4ms later, which is 3ms after the scheduled expiry time. This results in up to 30% imprecision for a 10ms periodic timer. Although random error may not be entirely eliminated, a superior measurement apparatus reduces the effects of random error.

2.3 Examples demonstrating inaccuracy and imprecision in Go pprof profiles

2.3.1 A concurrent code

Consider the Go program goroutine.go, which has ten exactly similar goroutines f1-f10. These functions are computationally intensive, long-running, and are unaffected by IO or memory hierarchy. The functions are non-preemptive, and hence the Go runtime scheduler overhead is zero. It is expected that each function takes exactly the same amount of CPU time (i.e., approximately 10% of the overall execution). It should not matter how many OS threads or CPU cores they run on.

Table 1 summarizes the result of go tool pprof -top for this program running on a Linux OS on a 12-core Intel Skylake server-class machine. The measurements were taken three times under the same configuration. The Flat (ms) column shows the absolute millisecond measurement attributed to each of the ten routines, and the Flat (%) column shows the relative time in a routine with respect to the total execution time. The Expected (%) column shows the expected relative time in each routine. The Rank order column assigns ranks in the descending order of execution time. The raw output is available in the appendix.

Table 1 Table 1: demonstration of inaccuracy and imprecision of pprof profiles using OS timers on a concurrent code.

First, let us focus on the data in a single run RUN 1. f1-f10 have a wide variance in the time attributed to them; the expectation is that each of them gets 10% execution time. There is up to 6x difference in the time attributed to the routine with the highest amount of attribution (main.f7 with 4210ms, 23.31%) and the routine with the lowest amount of attribution (main.f9 with 700ms, 3.88%). This demonstrates the poor accuracy (deviation from the ground truth) of pprof timer-based profiles. The issue reproduces when using different number of CPU cores.

Now focus on the three runs RUN 1-3 together . The time attributed to any routine widely varies from one run to another. The rank order of the routines changes significantly from run to run. In Run 1, main.f7 is shown to run for 4210ms with the rank order of 1, whereas in Run 2, it is shown to run for only 520ms with the rank order 10. The expectation is that the measurements remain the same from run to run. From one run to another, function-level time attribution are not precise; for example, there is a 71% coefficient of variance among the three runs of the f1 routine and the average coefficient of variance across all functions is 28%. This demonstrates imprecision (unpredictability of measurement) of pprof timer-based profiles.

For the curious readers, Table 2 below shows near-prefect accuracy and precision when profiling the same code using the PMU-based profiler described in this proposal. We show two runs instead of three for brevity. Additional details appear in Section 5.1

Table 2 Table 2: demonstration of accuracy and precision of pprof profiles via PMU CPU cycles event on a concurrent code.

2.3.2 A serial code

The issue is not specific to concurrent code alone. Refer to a carefully crafted Go program -- serial.go -- which has a single goroutine, and it has complete data dependence from one statement to another. It has ten functions A_expect_1_82, B_expect_3_64, ... , J_expect_18_18. The functions are intentionally written to consume different amounts of time to make this test case more representative. Each function is similar and has a loop, but the trip-count of the loop in each function is different. The loop in function with prefix A runs 1*C times, where C is a constant. The loop in function with prefix B runs 2*C times, and so on. This arrangement leads to a predictable relative execution time for each function; the function with prefix A consumes 1/sum(1..10)=1.82% of the total execution time; and the function with prefix B consumes 2/sum(1..10)=3.64% of the total execution time, and so on until the function with prefix J consumes 10/sum(1..10)=18.18% of the total execution time.

The expected relative percent of CPU time is encoded in each function name, A_expect_1_82 should consume 1.82% execution time, B_expect_3_64 should consume 3.64% execution time, J_expect_18_18 should consume 18.18% of CPU time, and so on. Table 3 summarizes the output of pprof CPU profiling for this code. As before, the measurements were taken three times under the same configuration. The Flat (ms) column shows the absolute millisecond measurement attributed to each of the ten functions, and the Flat (%) column shows the relative time with respect to the total time taken by all functions. The Expected (%) column shows the expected relative time in each routine. The Rank order column assigns ranks in descending order of the execution time. The raw output is available in the appendix.

Table 3 Table 3: demonstration of inaccuracy and imprecision of pprof profiles on a serial code.

A - in a cell represents zero samples. First, let's focus on just one set of measurements from RUN 1. Rather than J_expect_18_18 being the top time consumer with 18.18%, both J_expect_18_18 and H_expect_14_546 are the top time consumers each with 25% CPU time. B_expect_3_64, D_expect_7_27, F_expect_10_91, and I_expect_16_36 are all lumped into a single rank order of 4 each attributed with 6.25% of execution time. E_expect_9_09 does not even appear in the profiles despite its 9% expected execution time. More inaccuracies are evident in the data in the table. If we inspect the data from RUN 2 and RUN 3, we notice more discrepancies where a lot of data is missing and incorrect. The rank ordering and the relative execution percentage both have very low correlation with the ground truth. Despite tens of milliseconds of execution time, some functions are over counted and some functions are under counted.

Now, let's focus on the three runs together. The highest time consumer J_expect_18_18 does not even get a single sample in RUN 3, indicating the imprecision. I_expect_16_36, which has the measurement data available in all three cases varies from 20ms in RUN 1 with a rank order of 4 to 70ms in RUN 2 with a rank order of 1. More discrepancies are visible in the table. Thus, serial program also shows run to run measurement variation, and hence the profiles are imprecise.

Table 4 below shows near perfect accuracy and precision of profiling the same code using the PMU-based profiler described in this proposal. The measurements were taken two times instead of three for brevity. Additional details appear in Section 5.2

Table4 Table 4: demonstration of accuracy and precision of pprof profiles via PMU CPU cycles event on a serial code.

2.4 State-of-the-art in profiling using performance monitoring units

An alternative to the OS timer-based sampling is using hardware performance counters, which are ubiquitous in modern commodity CPUs. CPU’s Performance Monitoring Units (PMUs) have the capability to count various events such as cycles, instructions, cache misses, to name a few. Furthermore, PMUs can be configured to deliver an interrupt when an event reaches a specified threshold value. These PMU interrupts, like the OS timer interrupts, can serve as another source of sampling the call stack. This technique is employed in many modern profilers [1,2,4]. PMU sampling is called “event-based sampling,” where the period (interval between two interrupts) is expressed as the number of occurrences of an event, not the number of timer ticks. As a side note, in event-based sampling, the sampling overhead is proportional to the number of events disregarding the length of execution, whereas in timer-based sampling, the overhead is proportional to the length of execution. If there are more events in a short time span, then there is also more interest in observing deeply in that region of execution and the event-based sampling takes more samples. More overhead is incurred in places where there is interest (more events) and low/no overhead otherwise. CPU cycles is one of the most common PMU events, which happens periodically, similar to OS timers. Hence, using CPU cycles as a substitute for OS timers is very attractive.

2.5 Low-level details of the current Go CPU profiler

We now provide the current implementation details of CPU profiling in Go, which is necessary to understand the changes we propose in the next section. The description is relevant for most Unix-based systems, especially Linux, and most of the discussion is irrelevant for Windows since it does not use signals for handling CPU profiling.


Figure 1: A pictorial representation of the OS timer sampling workflow in the existing implementation. The first line of each record box is the function name. The remaining lines provide a textual description of the steps taken by that function.

Currently, pprof.StartCPUProfile(io.Writer) is the entry point for starting CPU profiling. Analogously, pprof.StopCPUProfile() stops a running profile. Two concurrent CPU profiling sessions are disallowed via a lock-protected variable cpu.profiling in pprof. The pprof package interacts with the runtime package for starting/stopping profiling via runtime.SetCPUProfileRate(hz int) API.

There are two critical pieces of work done inside pprof.StartCPUProfile(io.Writer). First, it invokes runtime.SetCPUProfileRate(hz=100) asking the runtime to start profiling; the rest of the profile generation will be handled by the runtime package; the integer argument is the sampling rate in Hz (samples per second).

Second, in order to collect the profiles generated by the runtime, pprof.StartCPUProfile(io.Writer) launches a goroutine profileWriter(io.Writer). profileWriter,

  1. creates a protocol buffer via newProfileBuilder,
  2. goes into a sleep-wakeup loop calling runtime.readProfile(). runtime.readProfile() provides a chunk of profiles collected inside the runtime in a ring buffer (eof if the profiling stopped), and
  3. writes to the sender end of a channel cpu.done once runtime.readProfile() returns eof. The other side of the channel will be waited upon by pprof.StopCPUProfile().

runtime.SetCPUProfileRate(int) is the entry point to start/stop profiling in the runtime package. It,

  1. creates a ring buffer, cpuprof.log,
  2. appends profile header information into the ring buffer,
  3. invokes setcpuprofilerate(hz), which,
  4. calls setThreadCPUProfileRate(hz=0) to stop any on-going timer,
  5. invokes setProcessCPUProfiler(hz), which installs the SIGPROF handler, and
  6. records the current sampling rate hz into the scheduler sched.profilehz to be visible to other OS threads.
  7. calls setThreadCPUProfileRate(hz), which uses the OS timer such as (setitimer) to start the sampling engine. This step is performed only if the hz value is non-zero.

Inside the scheduler, the execute() method schedules a goroutine (also called as as G in the Go runtime scheduler) to run on the current OS thread (also called as M in the Go runtime scheduler). This method compares the current value of the scheduler-recorded sched.profilehz variable against its own snapshot (m.profilehz) from the previously recorded value of sched.profilehz and if they differ, it reconfigures its timer by calling setThreadCPUProfileRate(sched.profilehz). This functionality sets an OS timer on behalf of the current OS thread in the process.

Once armed, the timer periodically delivers a SIGPROF signal, handled by sighandler(sig int, info *siginfo, ctxt unsafe.Pointer, gp *g), which in turn invokessigprof(), which,

  1. obtains a backtrace, and
  2. appends the backtrace into the ring buffer cpuprof.add().

Occasionally, a sampled program counter may fall into non-Go space, or the ring buffer may be full; such samples are simply counted, and backtrace is not taken.

The ring buffer is coordinated via a signalLock for concurrency control. The edges marked with “signalLock”, in Figure 1, show the operations that take the lock to access the ring buffer.

pprof.StopCPUProfile() follows a very similar path as pprof.StartCPUProfile(io.Writer). When the hz argument is zero, the runtime APIs described previously (e.g., SetCPUProfileRate(hz)) interpret it as the end of profiling. pprof.StopCPUProfile() performs the following steps.

  1. callsruntime.SetCPUProfileRate(0) to stop profiling.
  2. waits on the cpu.done channel, which will be flagged by the side thread once eof is reached in the runtime ring buffer.
  3. closes the supplied io.Writer.

runtime.SetCPUProfileRate(hz), when called with hz=0 invokes setcpuprofilerate(0) and closes the ring buffer. setcpuprofilerate(0) when called with hz=0,

  1. calls setProcessCPUProfiler(0), which restores the previous signal handler for SIGPROF.
  2. resets sched.profilehz=0 indicating globally that the profiling has stopped.
  3. calls setThreadCPUProfileRate(0) to stop sampling on the current OS thread.

Subsequently, the other Ms notice the change in the global sched.profilehz and turn off their timers when they run their execute() method.

The direct users of pprof are:

  1. The testing package which can take a -cpuprofile <file> flag to collect profiles.
  2. The net/http/pprof package which can collect cpu profiles on the /debug/pprof/profile endpoint with an optional argument ?seconds=<N>.
  3. Third party libraries such as pkg/profile.

3. Problem statement

Pprof CPU profiles are inaccurate because the profiling resolution is low and sampling is biased in OS timers.

Pprof CPU profiles are imprecise because the profiling resolution is low and the samples have a large skid due to OS timers.

Pprof CPU profiles measure only the time metric, additional events should be monitored to help performance profiling.

4. Proposal

In this proposal, we address the concerns -- small sample size, sampling bias, measurement error, and the need for more metrics -- which were raised in the background section.

Sample size. First, this proposal addresses the problem of low sampling resolution of OS timers by using high-resolution samples from hardware performance monitoring units (PMUs) to generate more samples. While the OS-timers have ~10ms resolution (that is one sample every 10ms), sampling CPU cycles via PMUs can deliver (if desired) a sample every few microseconds -- a three order magnitude finer resolution. This fine-grained measurement results in increasing the sample size, and as a result alleviates both inaccuracy and imprecision problems. While PMUs can deliver 100x more samples with just ~13% measurement overhead, the alternative of lengthening a measurement window by 100x to collect the same number of samples via the OS timers would be impractical.

Sampling bias. Second, the proposal arranges each sample to be delivered to the OS thread causing the counter overflow to eliminate the sampling bias, which improves accuracy.

Random error (skid). Third, PMU samples alleviate the large sampling skid seen in OS timers because PMUs allow a low-skid or skid-free precise sample attribution.

More metrics. Finally, the PMU's capability to monitor various events such as cache misses, TLB misses, branch mispredictions, inter-core cacheline ping-ponging, to name a few, offers additional insights when diagnosing complex performance issues.

Implementing this idea calls for extending the Go runtime package’s CPU profiling capabilities and making necessary adjustments in runtime/pprof, testing, and http/pprof packages. We first describe the details of the changes to the runtime package and then describe the details of the pprof, testing, and net/http/pprof packages, in that order.

Our design brings the full capability of PMUs into the runtime package but exposes only a small amount of it via the pprof package to users. For example, the runtime might be able to configure whether it wants to monitor the CPU cycles spent inside the kernel or not; but these configurations will not be exposed (at least in the initial design) to the users of pprof package.

4.1 High-level design

In its simplest form, we substitute the start/stop of an OS interval timer with the start/stop of a PMU event configured in the sampling mode; the sampling mode delivers an interrupt each time a hardware counter overflows, which is analogous to the OS timer interrupt. The interrupt generated by the PMU counter overflow follows the same signal handling path as the interval timer. This policy allows collecting and appending a backtrace on each PMU interrupt into a ring buffer, just like the OS timer. Moreover, like the OS timer, each OS-thread (M) checks whether any changes were made to the PMU-based profiling during each scheduling activity and performs the necessary adjustments to reconfigure the PMU. Finally, we retain SIGPROF as the signal type delivered -- for both OS timer interrupts and a PMU counter overflow interrupts -- instead of hijacking another signal number for PMU-based profiling.

4.2 The devil is in the details

This seemingly straightforward approach poses implementation challenges related to retaining the compatibility with interval timer and incorporating additional PMU features. The first part of this section addresses those challenges. Other than the design described in the previous subsection, the rest of the details are closer to implementation details. As before, the discussion is only relevant to Linux.

PMUs can count multiple CPU events simultaneously (e.g., cycles, instruction, cache misses, etc.). Our proposal does not preclude the runtime from simultaneously enabling multiple PMU events or even PMU events alongside an OS timer event. In fact, certain events are best used when measured with others. For example,

  1. Cache misses measured alongside cache hits enable computing cache miss ratio,
  2. Remote core cache line transfers measured alongside total cache misses provides a view of the cause of cache misses.

In our proposal, we will limit the maximum number of concurrent CPU events to _CPUPROF_EVENTS_MAX (EV_MAX for short in the rest of this document). We will define a few common portable events for which the user does not have to lookup within the CPU manufacturer’s manual for event codes. This closely resembles the preset events of perf events in Linux.

In order to enable advanced users to exploit the full capability of the PMUs, we define an opaque RAW event that allows the users to specify a model-specific hexadecimal event number that can be sampled. A motivating example for such a raw event is the remote HITM event on Intel processors; this event counts the number of times a cache line was loaded where the cache line was in the modified state in another remote CPU’s cache. This event helps diagnose problems such as cache line false sharing; we have observed this event to be helpful in identifying false sharing even inside the Go runtime.

We have come up with the following initial set of PMU events. These events will be exposed by the runtime package to the pprof package. The pprof package will wrap these constant values with matching function names to be presented later in the pprof package subsection. Notice that the OS timer is one of the CPU events. Last-level cache events are included to capture the behavior of the CPUs memory hierarchy. Branch events are included to capture potentially wasted CPU cycles. The design discussions can refine this list further (e.g., whether TLB events are needed). They are easy to grow based on requests.

type cpuEvent int32

// These constants are a mirror of the runtime
// These constants are a mirror of the runtime
const ( _CPUPROF_OS_TIMER, _CPUPROF_FIRST_EVENT          cpuEvent = iota, iota
  _CPUPROF_HW_INSTRUCTIONS                               cpuEvent = iota

Since PMU support may not be available on all platforms in all settings, we need to make it configurable in the Go runtime. If an OS supports PMU-based profiling (e.g., Linux), it will dynamically register two functions during the runtime startup.

  1. setThreadPMUProfilerFunc: this is the OS-specific function used to start/stop a PMU event analogous to setThreadCPUProfile. IfsetThreadPMUProfilerFunc is nil, the runtime will not attempt to start/stop PMU profiles even if it is requested to do so. However, this situation will not arise because runtime/pprof will guard against it.
  2. sigprofPMUHandlerFunc: this is the OS-specific function called from the OS-specific signal handler when the PMU delivers an interrupt.

Currently, each M (type m) and the scheduler (type schedt) maintains a profilehz variable. In lieu of profilehz we’ll introduce a runtime structure cpuProfileConfig struct , which captures the essential configurations needed to start/stop either a PMU counter or an OS timer. cpuProfileConfig is designed to reflect the underlying perf_event_attr struct in Linux, but being an internal data structure we can modify it, if needed in the future, without affecting any public-surface API.

type cpuProfileConfig struct {
    hz                        uint64 // the sampling rate, used only for OS_TIMER.
    period                    uint64 // the sampling interval, used for PMU events.
    rawEvent                  uint64 // the hexa-decimal value of the CPU-specific event.
                                                                                                                                                                                                                                                                                                                                              // used only with _CPUPROF_HW_RAW cpuEvent type
    preciseIP                 profilePCPrecision  // the level of precision for the program counter in PMU samples.
    isSampleIPIncluded        bool   // should the sample include the PC?
    isSampleAddrIncluded      bool   // should the sample include the memory address involved?
    isKernelIncluded          bool   // should the kernel count events in kernel mode?
    isHvIncluded              bool   // should the kernel count events in hypervisor mode?
    isIdleIncluded            bool   // should the kernel count events when idle task is running?
    isSampleCallchainIncluded bool   // should the sample report the call chain?
    isCallchainKernelIncluded bool   // should the sample include the kernel callchain?
    isCallchainUserIncluded   bool   // should the sample include the user callchain?

Some PMUs can be configured so that they can offer the exact program counter involved in the counter overflow. profilePCPrecision dictates the level of precision needed from the PMU as described by the constants below. If a PMU is configured to provide the precise program counter for a sample, we'll substitute the leaf-level program counter in the backtrace with the one presented by the PMU. Please refer to the Section 6.2 for PMU precision.

type profilePCPrecision uint8
const (
    _CPUPROF_IP_ARBITRARY_SKID profilePCPrecision = iota

pprof will set the default to _CPUPROF_IP_ARBITRARY_SKID to work across architectures and event types. However, we can easily change it to the highest supported, by probing the CPU as done in certain profilers.

cpuProfileConfig will be visible to pprof package but will not be exposed to its clients. The fields in cpuProfileConfig allow how a PMU event can be customized. Pprof will set the field values to the lowest common denominator that is expected to work across many architectures and security settings. For example, isKernelIncluded will most likely be false. The runtime will be (mostly) agnostic to these values and will pass them down to OS-specific layers that will handle programming the PMUs.

The runtime package will expose an API runtime_pprof_setCPUProfileConfig(cpuEvent, *cpuProfileConfig), only to pprof. This API is the equivalent of SetCPUProfileRate(hz int). Since the legacy SetCPUProfileRate(hz int) cannot be dropped, we alter its internals to wrap the hz value into a cpuProfileConfig structure and pass it down to runtime_pprof_setCPUProfileConfig for a uniform treatment of any event whether an OS-timer event or a PMU event.

The runtime can be bifurcated into OS-agnostic and OS-specific parts.

The OS-agnostic part of the runtime will have the following duties:

  1. Keep track of the CPU events currently running,
  2. Handle new start/stop invocations of the OS timer or PMU event, one at a time.
  3. Allocate and coordinate access to the ring buffers associated with each of the running CPU event.
  4. Reconfigure the CPU events for an M during its scheduling based on whether its current snapshot is different from the global configuration stored in the scheduler.

The OS-specific part of runtime will have the following duties:

  1. Install/restore profiling signal handler.
  2. Start/stop OS timer or PMU events.
  3. Handle the profiling signals and pause the PMUs during signal handling.
  4. Map an interrupt delivered to the appropriate active CPU profiling event for recording the backtraces into an appropriate ring buffer.

Since there can be CPUPROF_EVENTS_MAX number of concurrent CPU profiles, we need to change some of the runtime data structures that previously assumed a single CPU profile. Importantly,

  1. We will change the profilehz field in schedt struct to be an array of cpuProfileConfig pointers.
  2. We will change profilehz field in m struct to be an array of cpuProfileConfig pointers.
  3. We will introduce an array of opaque pointers cpuProfileHandle field in m struct. This will be OS-specific information stored in each M. For example, on Linux, this will hold the file handles obtained from perf_event_open system calls and the corresponding mmap ring buffers where the kernel populates the sample data. This information will be used to close the resources once the profiling is stopped on an M.

We’ll replace the hz argument used in various runtime APIs involved in CPU profiling with the following two arguments.

  1. eventId cpuEvent (described in the constants previously), and
  2. profConfig *cpuProfileConfig.

Figure 2 shows the call graph for the new workflow. We abbreviate eventId as i and cpuProfileConfig as p in most places.

call_graph_pprof++ Figure 2: A pictorial representation of the OS timer-based and PMU-based sampling workflow. The first line of each record box is the function name. The remaining lines provide a textual description of the steps taken by that function.

A nil value of cpuProfileConfig implies stop profiling for the specified cpuEvent, and a non-nil value implies start profiling for the specified cpuEvent. At the runtime-level, runtime_pprof_setCPUProfileConfig() makes a copy of the cpuProfileConfig passed from pprof, so that inside the runtime, we can perform fast pointer comparison and rely on two large cpuProfileConfig structs to be equal if pointers to them are equal.

Starting/stopping a PMU counter is different from starting/stopping an OS timer. Hence if eventId==CPUPROF_OS_TIMER checks will appear at a few places inside the runtime’s profiling-related code.

A few concurrency aspects deserve a mention.

  1. Each active event will create its own ring buffer on demand. There will be a maximum of EV_MAX ring buffers. Each ring buffer will be created only if there is a corresponding active event. The size of each ring buffer is 16KB. If all EV_MAX events are active at the same time, it will result in 16KB * 8 = 128KB of ring buffers, which is paltry on modern machines.
  2. We’ll change the cpuprof variable in cpuprof.go to be an array of EV_MAX size.

runtime_pprof_setCPUProfileConfig will hold a lock only for the cpuprof[i], where i is the event id for which profile has to be started or stopped. 4. There will be a single signalLock rather than EV_MAX signalLocks. This is because the setcpuprofileconfig method in cpuprof.go needs to check (read) all prof[eventId].configs to infer whether all events are stopped before uninstalling the signal handler. Taking EV_MAX locks one after another would be no better than holding just one lock. The consequence of this is that we serialize simultaneous writes to two different ring buffers. This should not be a serious issue since there cannot be two concurrent SIGPROFs. If a system supports concurrent SIGPROF handling, and if there are a lot of threads in a system, we can revisit this choice. 5. In a very rare case when a SIGPROF is delivered for an already running event when another event is being set (which will hold the signalLock), the signal handler recognizes it and drops such sample. Otherwise a deadlock will ensue.

At the lowest level, we will incorporate the following three system calls into all runtime/sys_linux_<arch>.s files.

  1. perf_event_open, which is Linux’s means to program PMUs. The Go signature will be func perfEventOpen(attr *perfEventAttr, pid uintptr, cpu, groupFd int32, flags uintptr) int32
  2. ioctl, which is needed to stop the PMU counter when handling the SIGPROF, and reset and restart after handling the signal. The go signature will be func ioctl(fd, req int32, arg uintptr) int32
  3. fcntl, which is necessary to arrange a perf_event file handle to deliver the SIGPROF signal and also to setup the signal to be delivered exactly to the M (OS thread) whose PMU counter overflew. The go signature will be func fcntl(fd, cmd int32, arg uintptr) int32

4.2.1 Segregating SIGPROF from different sampling sources

All PMU events and the OS timer use the same SIGPROF signal to deliver their interrupts. Hence, the Unix signal handler needs to address the following concerns.

  1. Identify whether a signal was delivered as a result of the OS timer expiring or a PMU counter overflowing. This is necessary because the subsequent processing will be different based on the event.
  2. Identify which PMU counter caused the interrupt, when multiple PMU events are being sampled at the same time. This is necessary because the subsequent processing needs to append the backtraces into the buffer corresponding to the PMU event that triggered the interrupt. All PMU events do not go into the same ring buffer.

Although both OS timer expiry and any PMU counter overflow result in the same SIGPROF signal, the signal codes are different. The PMU counter overflow generates a POLL_IN signal code indicating the presence of data whereas the OS timer interrupts generate a SI_KERNEL or SI_ALARM signal code. We use this information to address #1 above.

Once we know that the signal was generated due to some PMU counter overflowing, we use the file descriptor fd passed in the signal info and match it against the opened perf events file descriptors on the same M and use the matching location to infer the PMU event i that caused the signal; the rest of the signal handling makes use of this i to record the backtrace in the appropriate ring buffer. During the signal handling, we’ll pause all active PMU counters via ioctl calls on the fd so that the signal handling is not counted towards the PMU event counters.

4.3 Exposing PMU events via runtime/pprof package

We now turn our attention to the runtime/pprof package, which exposes the PMU sampling to its clients. There are several choices. There is a tension between revolutionary vs. evolutionary solution; there is friction between exposing the full PMU functionality vs. offering the basic minimum. In our proposal, we hope to make pragmatic choices among myriad options.

We propose adding a new API, which accepts different configurations needed to start any CPU event -- OS timer and PMU counters. The API signature is shown below.

pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts ...ProfilingOption) error

There is an optional variadic ProfilingOption to allow more than one event to be started during a profiling session.

A ProfilingOption is an interface exposed by runtime/pprof as shown below.

type ProfilingOption interface { apply() error }

Pprof will implement the following profiling option functions that mirror various cpuEvent constants previously presented.

func OSTimer(w io.Writer) ProfilingOption
func CPUCycles(w io.Writer, period uint64) ProfilingOption
func CPUInstructions(w io.Writer, period uint64) ProfilingOption
func CPUCacheReferences(w io.Writer, period uint64) ProfilingOption
func CPUCacheMisses(w io.Writer, period uint64) ProfilingOption
func CPUBranchInstructions(w io.Writer, period uint64) ProfilingOption
func CPUBranchMisses(w io.Writer, period uint64) ProfilingOption
func CPURawEvent(w io.Writer, period uint64, hex uint64) ProfilingOption

All of them consume an io.Writer. The OSTimer accepts no other argument. A period accepted by all PMU events specifies the number of events that must elapse between two consecutive interrupts. A larger value means lower granularity (the opposite of Hz). Passing a zero value will result in using a preset period. CPURawEvent is special; it accepts any user-provided hexadecimal PMU event code. Notice that each event requires an io.Writer since different profile types cannot be serialized into the same pprof protocol buffer. In the cases where multiple ProfilingOptions are passed to the pprof.StartCPUProfileWithConfig, sanity checks are made to ensure the uniqueness of each cpuEvent and each io.Writer.

Below are a few example usages of the API.

// Simple
StartCPUProfileWithConfig(CPUCycles(w, period))
StartCPUProfileWithConfig(CPURawEvent(w, period, hexadecimalEvent))

// Advanced
StartCPUProfileWithConfig(CPUCacheReferences(w1, period1), CPUCacheMisses(w2, period2))

The implementation of StartCPUProfile(writer) will simply invoke StartCPUProfileWithConfig(OSTimer(writer)). The implementation of StartCPUProfileWithConfig will ensure a maximum of EV_MAX elements, unique io.Writers, and unique cpuEvents. After sanitizing the arguments, it will create a cpuProfileConfig for each cpuEvent requested to be profiled. In a global data structure cpu, we will maintain the active events, their *cpuProfileConfig, and the corresponding io.Writer.

var cpu struct {
    profiling      bool
    activeConfig   [EV_MAX]*cpuProfileConfig
    activeWriter   [EV_MAX]io.Writer
    eventName      [EV_MAX]string
    done chan      bool

Finally, StartCPUProfileWithConfig will invoke runtime_pprof_setCPUProfileConfig multiple times once for each active event to request the runtime to actually configure either the OS timer or the PMU with a specific event to start sampling. A lock will be held throughout the invocation of StartCPUProfileWithConfig and hence no two StartCPUProfileWithConfig/StartCPUProfile invocations run concurrently. Once a session of StartCPUProfileWithConfig/StartCPUProfile is active, another profiling is disallowed until StopCPUProfile is invoked.

Before successfully returning from StartCPUProfileWithConfig, we'll launch a goroutine profileWriter, which will periodically wake up and scrape runtime-produced stack samples from the ring buffers associated with each of the active events and serialize them into the corresponding io.Writers. This follows the same design as the original pprof, but will add the responsibility of collecting profiles from multiple events. Creating one goroutine each to scrape every active event would be wasteful, a single goroutine suffices to collect backtraces from all ring buffers.

4.4 The optional multiple CPU profiles feature

Allowing multiple CPU profiles in a single profiling session via pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts ...ProfilingOption) API is a power-user feature. To avoid any accidental misuse, we'll protect it under an environmental variable GO_PPROF_ENABLE_MULTIPLE_CPU_PROFILES=<true|false>, which will be disabled by default.

4.5 The testing package

We’ll add the following two command line arguments to the testing package:

  1. -cpuprofileevent=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>

  2. -cpuprofileperiod=<Int64Value>

The default value of -cpuprofileevent is timer with 100Hz when only -cpuprofile is passed, which ensures compatibility with the current behavior. -cpuprofileperiod will be ignored for -cpuprofileevent=timer and will default to a 100Hz rate as is the case currently. The testing package will map a user-specified string for -cpuprofileevent value to the appropriate pprof.ProfilingOption, the -cpuprofileperiod to an integer, and invoke pprof.StartCPUProfileWithConfig(). For example -cpuprofileevent=cycles and -cpuprofileperiod=100000 will result inpprof.StartCPUProfileWithConfig(CPUCycles(w, 100000).

4.6 The net/http/pprof package

We’ll add the following two additional arguments in an http request exposed by the net/http/pprof CPU profiling endpoint.

  1. event=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>

  2. period=<Int64Value>

period will be ignored for event=timer and will default to a 100Hz rate as is the case currently. The default value of event is timer with 100Hz, which ensures compatibility with the current behavior. The net/http/pprof package will map a user-specified string for event value to an appropriate pprof.ProfilingOption, period to an integer, and invoke pprof.StartCPUProfileWithConfig.

5. Empirical evidence on the accuracy and precision of PMU profiles

5.1 The PMU produces precise and accurate profiles for concurrent programs

Below we show the go tool pprof -top output of the previously shown goroutine.go concurrent program run for two times. The profiles used the proposed new API with 1M sampling period: pprof.StartCPUPrileWithConfig(CPUCycles(w, 1000000)). The CPU cycles attributed to each routine f1-f10 match the expectation (approximately 10% of the total execution time) within one run and across multiple runs. Thus the profiles are accurate and precise. The measurement overhead is indistinguishable from the OS timer in this setting.

(PMU Cycles profile) goroutine.go/Run 1:

File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:49pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
      flat  flat%   sum%        cum   cum%
23400000000 10.00% 10.00% 23400000000 10.00%  main.f1
23400000000 10.00% 20.00% 23400000000 10.00%  main.f10
23400000000 10.00% 30.00% 23400000000 10.00%  main.f2
23400000000 10.00% 40.00% 23400000000 10.00%  main.f3
23400000000 10.00% 50.00% 23400000000 10.00%  main.f4
23400000000 10.00% 60.00% 23400000000 10.00%  main.f5
23400000000 10.00% 70.00% 23400000000 10.00%  main.f6
23400000000 10.00% 80.00% 23400000000 10.00%  main.f7
23400000000 10.00% 90.00% 23400000000 10.00%  main.f8
23400000000 10.00%   100% 23400000000 10.00%  main.f9

(PMU Cycles profile) goroutine.go/Run 2:

File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:51pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
      flat  flat%   sum%        cum   cum%
23800000000 10.17% 10.17% 23800000000 10.17%  main.f1
23500000000 10.04% 20.21% 23500000000 10.04%  main.f7
23500000000 10.04% 30.26% 23500000000 10.04%  main.f9
23400000000 10.00% 40.26% 23400000000 10.00%  main.f10
23400000000 10.00% 50.26% 23400000000 10.00%  main.f2
23400000000 10.00% 60.26% 23400000000 10.00%  main.f4
23400000000 10.00% 70.26% 23400000000 10.00%  main.f6
23400000000 10.00% 80.26% 23400000000 10.00%  main.f8
23300000000  9.96% 90.21% 23300000000  9.96%  main.f3
22900000000  9.79%   100% 22900000000  9.79%  main.f5

5.2 The PMU produces precise and accurate profiles for serial programs

Below we show the go tool pprof -top output of the previously shown serial.go program run for two times. We remind the reader that the expected relative execution time is encoded in the function names themselves (J_expect_18_18 should consume 18.18% CPU cycles and so on). The profiles used the proposed new API with 1M sampling period: pprof.StartCPUPrileWithConfig(CPUCycles(w, 1000000)). The CPU cycles attributed to each routine match the expectation within a run and across multiple runs. Thus the profiles are accurate and precise.

(PMU Cycles profile) serial.go/Run 1:

File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
      flat  flat%   sum%        cum   cum%
 200000000 18.10% 18.10%  200000000 18.10%  main.J_expect_18_18
 183000000 16.56% 34.66%  183000000 16.56%  main.I_expect_16_36
 165000000 14.93% 49.59%  165000000 14.93%  main.H_expect_14_546
 137000000 12.40% 61.99%  137000000 12.40%  main.G_expect_12_73
 120000000 10.86% 72.85%  120000000 10.86%  main.F_expect_10_91
 100000000  9.05% 81.90%  100000000  9.05%  main.E_expect_9_09
  82000000  7.42% 89.32%   82000000  7.42%  main.D_expect_7_27
  63000000  5.70% 95.02%   63000000  5.70%  main.C_expect_5_46
  37000000  3.35% 98.37%   37000000  3.35%  main.B_expect_3_64
  18000000  1.63%   100%   18000000  1.63%  main.A_expect_1_82
         0     0%   100% 1105000000   100%  main.main
         0     0%   100% 1105000000   100%  runtime.main

(PMU Cycles profile) serial.go/Run 2:

File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
      flat  flat%   sum%        cum   cum%
 200000000 18.10% 18.10%  200000000 18.10%  main.J_expect_18_18
 183000000 16.56% 34.66%  183000000 16.56%  main.I_expect_16_36
 159000000 14.39% 49.05%  159000000 14.39%  main.H_expect_14_546
 142000000 12.85% 61.90%  142000000 12.85%  main.G_expect_12_73
 119000000 10.77% 72.67%  119000000 10.77%  main.F_expect_10_91
 100000000  9.05% 81.72%  100000000  9.05%  main.E_expect_9_09
  82000000  7.42% 89.14%   82000000  7.42%  main.D_expect_7_27
  61000000  5.52% 94.66%   61000000  5.52%  main.C_expect_5_46
  40000000  3.62% 98.28%   40000000  3.62%  main.B_expect_3_64
  18000000  1.63% 99.91%   18000000  1.63%  main.A_expect_1_82
   1000000  0.09%   100% 1105000000   100%  main.main
         0     0%   100% 1105000000   100%  runtime.main

5.3 A practical use

We use pprof to collect production CPU profiles at Uber. On one of our Go microservices, when we used the default OS timer-based CPU profiles, we saw that the mutex.Lock() accounted for 15.79% of the time and mutex.Unlock() accounted for 13.55% of the CPU time. We knew from the program structure that mutex.Lock() and mutex.Unlock() were invoked in pairs one after another, and also that mutex.Lock() is significantly more heavyweight compared to mutex.Unlock(), thus we were expecting much lesser time being spent inside mutex.Unlock() relative to mutex.Lock(). The OS-timer profiling data was not matching our expectations. However, when we used a prototype of the PMU-based profiler and sampled the CPU cycles event, the quantification changed and matched our expectations; mutex.Lock() accounted for 33.36% of the CPU cycles and mutex.Unlock() accounted for 7.57% the CPU cycles. The PMU profiles avoided an unnecessary performance investigation.

6. Rationale

6.1 Advantages

CPU profiling via PMUs result in richer quality profiles -- high-resolution measurement, higher accuracy, and higher precision. Access to various events such as cache misses in PMUs allows better insights into understanding the causes of complex performance issues. Any kind of PMU event can be attributed to call stack samples, which is useful to provide actionable feedback to the developer. Incorporating the PMU facility in the runtime will allow the Go profiler to take full advantage of the progress happening in the hardware performance monitoring facilities available in modern CPUs. This framework will allow for independently developing more advanced profiling features such as attributing high-latency data accesses to Go objects, which can help refactor code or data structures for better data locality [5, 6, 7].

6.2 Disadvantages

  1. PMU-based profiling may not work when a Go program is running under a virtual machine, since virtual machines may not expose the hardware performance counters to a guest OS. Hence, the proposal keeps the OS-timer as the default and allows PMU-based sampling available only when it is explicitly specified.
  2. A higher sampling rate (a shorter period between samples) obviously incurs higher overhead. For example, the default timer with a 100Hz sampling rate introduces 3.6% measurement overhead on a 2.4GHz Intel Skylake machine. Using CPU cycles with a period of 24000000 cycles, which amounts to the same 100 samples/second introduces the same 3.6% overhead. The table below shows overheads at other sampling periods using the PMU cycles and also compares against the default OS-timer profiler overhead.
	The new PMU-based profiler with cycles event
	Sampling period in CPU cycles			Overhead
	24000000 (100 samples/sec)              1.036x	
	 2400000 (1K samples/sec)               1.036x
	  240000 (10K samples/sec)              1.13x
	   24000 (100K samples/sec)             2.05x
	The default OS-timer based profiler
	Sampling rate 				Overhead
	100 samples/sec               1.036x	

A reasonable sampling period, once in 2.4M CPU cycles on a 2.4GHz Intel Skylake machine, offers a good trade off between high sampling rate at low overhead.

  1. When using any other PMU event, such as cache misses, the sampling period needs to be more carefully curated. We’ll provide preset values for the preset events.

  2. PMU profiling has a mild skid on deeply pipelined, multi-issue, out-of-order, superscalar CPU architectures due to multiple instructions flowing in the pipeline. Thus the program counter seen in the interrupt can be a handful of instructions off from the one generating the counter overflow. This “skid” is much smaller than the several milliseconds of skid in an OS timer. Moreover, modern PMUs have built-in features to eliminate such skid (e.g., Intel Precise Event Based Sampling (PEBS), AMD Instruction Based Sampling (IBS), and PowerPC Marked events (MRK). Some PMUs can be configured so that they can offer the exact program counter involved in the counter overflow. We have introduced profilePCPrecision for this reason.

  3. There are finite counters on any PMU. Counter multiplexing happens if the number of events is more than the number of available PMU counter slots. In that case the events run only part of the time.

6.3 Alternatives explored

  1. Linux perf commandline tool is powerful and offers comparable solutions. However, it does not solve several important use cases.
  2. Profiling via an http endpoint (provided by net/http/pprof package in Go) is the preferred, readily available, and scalable solution that works at the enterprise scale. Making perf to work in a similar fashion would lead to creating too many custom solutions.
  3. pprof works as an in-situ profiler working in the same address space as the process being profiled and hence does not require any special system privileges; in contrast, attaching perf to a running process requires changes to a system’s security settings.
  4. We are also focusing on containerized environments, which is widely used in large-scale Go deployments. Although perf can collect call stack samples from a container, it fails to find symbols of go binaries located within a container to show clean reports (e.g., flame graphs). The workaround is non-trivial.
  5. We explored library-level facilities to collect performance counters such as perf-util [3]. However, no library support will provide good profiling for Go since the runtime can schedule a monitored function onto another thread, where the counters may not be enabled. The correct solution, hence, is to implement the hardware performance counter facility inside the go runtime into the scheduler as we have proposed here.
  6. We seriously contemplated against introducing a new pprof API and tried to work with the existing API StartCPUProfile(io.Writer). We considered adding variadic ProfilingOptions to it. But any such change would be qualified as breaking Go1.X and went ahead introducing a new API.
  7. We considered passing an Options struct to the new API StartCPUProfileWithConfig. However, the usage looked complicated with certain fields making sense only in certain circumstances and hence decided to use the functional option pattern.
  8. We considered having only one CPU profiling event active at a time. However, the experience has shown that this limits performance analysis in several ways; and advanced performance tuning demands correlating more than one event to find the root cause. Furthermore, this richness in Go profiling can serve to advance the state-of-the-art in profilers.
  9. On the implementation front, we considered using a map for active events at several places instead of an EV_MAX array. However, concurrent access to different array elements was superior to locking a shared map. Moreover, EV_MAX is small enough to search linearly for active events.

7. Compatibility

The design ensures compatibility with Go1.0 since no existing public API signatures, interface signatures, or data structures are changed. We have introduced a new interface and a handful of new APIs in the pprof package.

8. Implementation

Milind Chabbi from Uber technologies will implement the proposal. The implementation will proceed as follows:

  1. Add assembly language support for perf_events related system calls.
  2. Modify the runtime, runtime/pprofto support PMU sampling to reflect the call graph / workflow shown in Figure 2.
  3. Expose the new facility via testing and net/http/pprof packages.
  4. Make changes to pprof CPU profiling tests to use PMU cycles in addition to OS timer (on supported platforms).
  5. Add PMU profiling tests to run on supported platforms.

The implementation will be broken into a large checkin related to the runtime and runtime/pprof packages, a few small checkins related to testing, and net/http/pprof packages, and the tests accompanying the code. The implementation will align with the February-April 2020 development cycle and May-July 2020 release cycle.

A prototype is available here.

9. Open issues

  1. Supporting Windows and MacOS will be left for future work, but any support from experts from those lands is appreciated.
  2. The issue 21295 proposes adding -test.counters flag to the testing package. That proposal is orthogonal to this one. There isn’t a detailed description of how the proposal wants to accomplish it. Simply starting and stopping a PMU counter by making perf_events_open system call at the testing package level will not accomplish the objective. The monitored functions can migrate from one M to another, and hence the changes must be made in the runtime as described in this proposal. However, it is relatively easy to incorporate the requested feature by making use of changes suggested in this proposal to accomplish the objective of issue #21295. We are open to discussing with the creators of #21295 to understand how we can incorporate their needs into this design.
  3. The additions to the public APIs and naming aesthetics can be refined with inputs from the experts and the community.
  4. Currently, we do not bubble up runtime failures to start a PMU profiler (same is true for a timer). Should we bubble up any failure to configure a PMU or silently produce zero samples in the profiles?
  5. Should we introduce any throttling mechanism if the rate of events is too high for a PMU event.

10. Acknowledgment

Pengfei Su implemented a prototype of the design while interning at Uber Technologies.

Joshua Corbin helped design the public surface APIs.

This work is generously funded by the Uber Technologies’ Programming Systems Research team and the team members provided feedback for improvement.

11. Appendix

(OS timer profile) goroutine.go/run1:
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.70s, Total samples = 18060ms (269.37%)
Showing nodes accounting for 18060ms, 100% of 18060ms total
      flat  flat%   sum%        cum   cum%
    4210ms 23.31% 23.31%     4210ms 23.31%  main.f7
    2610ms 14.45% 37.76%     2610ms 14.45%  main.f2
    2010ms 11.13% 48.89%     2010ms 11.13%  main.f6
    1810ms 10.02% 58.91%     1810ms 10.02%  main.f10
    1780ms  9.86% 68.77%     1780ms  9.86%  main.f3
    1410ms  7.81% 76.58%     1410ms  7.81%  main.f1
    1310ms  7.25% 83.83%     1310ms  7.25%  main.f4
    1110ms  6.15% 89.98%     1110ms  6.15%  main.f5
    1110ms  6.15% 96.12%     1110ms  6.15%  main.f8
     700ms  3.88%   100%      700ms  3.88%  main.f9
(OS timer profile) goroutine.go/run2:
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.71s, Total samples = 17400ms (259.39%)
Showing nodes accounting for 17400ms, 100% of 17400ms total
      flat  flat%   sum%        cum   cum%
    3250ms 18.68% 18.68%     3250ms 18.68%  main.f2
    2180ms 12.53% 31.21%     2180ms 12.53%  main.f9
    2100ms 12.07% 43.28%     2100ms 12.07%  main.f1
    1770ms 10.17% 53.45%     1770ms 10.17%  main.f6
    1700ms  9.77% 63.22%     1700ms  9.77%  main.f5
    1550ms  8.91% 72.13%     1550ms  8.91%  main.f4
    1500ms  8.62% 80.75%     1500ms  8.62%  main.f8
    1440ms  8.28% 89.02%     1440ms  8.28%  main.f3
    1390ms  7.99% 97.01%     1390ms  7.99%  main.f10
     520ms  2.99%   100%      520ms  2.99%  main.f7
(OS timer profile) goroutine.go/Run 3:
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:48pm (PST)
Duration: 6.71s, Total samples = 17.73s (264.31%)
Showing nodes accounting for 17.73s, 100% of 17.73s total
      flat  flat%   sum%        cum   cum%
     3.74s 21.09% 21.09%      3.74s 21.09%  main.f7
     2.08s 11.73% 32.83%      2.08s 11.73%  main.f9
     2.05s 11.56% 44.39%      2.05s 11.56%  main.f2
     1.85s 10.43% 54.82%      1.85s 10.43%  main.f10
     1.78s 10.04% 64.86%      1.78s 10.04%  main.f1
     1.43s  8.07% 72.93%      1.43s  8.07%  main.f3
     1.42s  8.01% 80.94%      1.42s  8.01%  main.f8
     1.18s  6.66% 87.59%      1.18s  6.66%  main.f6
     1.17s  6.60% 94.19%      1.17s  6.60%  main.f5
     1.03s  5.81%   100%      1.03s  5.81%  main.f4
(OS timer profile) serial.go/Run 1:
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:42pm (PST)
Duration: 501.51ms, Total samples = 320ms (63.81%)
Showing nodes accounting for 320ms, 100% of 320ms total
      flat  flat%   sum%        cum   cum%
      80ms 25.00% 25.00%       80ms 25.00%  main.H_expect_14_546
      80ms 25.00% 50.00%       80ms 25.00%  main.J_expect_18_18
      60ms 18.75% 68.75%       60ms 18.75%  main.G_expect_12_73
      20ms  6.25% 75.00%       20ms  6.25%  main.B_expect_3_64
      20ms  6.25% 81.25%       20ms  6.25%  main.D_expect_7_27
      20ms  6.25% 87.50%       20ms  6.25%  main.F_expect_10_91
      20ms  6.25% 93.75%       20ms  6.25%  main.I_expect_16_36
      10ms  3.12% 96.88%       10ms  3.12%  main.A_expect_1_82
      10ms  3.12%   100%       10ms  3.12%  main.C_expect_5_46
         0     0%   100%      320ms   100%  main.main
         0     0%   100%      320ms   100%  runtime.main
(OS timer profile) serial.go/Run 2:
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:44pm (PST)
Duration: 501.31ms, Total samples = 320ms (63.83%)
Showing nodes accounting for 320ms, 100% of 320ms total
      flat  flat%   sum%        cum   cum%
      70ms 21.88% 21.88%       70ms 21.88%  main.I_expect_16_36
      50ms 15.62% 37.50%       50ms 15.62%  main.J_expect_18_18
      40ms 12.50% 50.00%       40ms 12.50%  main.E_expect_9_09
      40ms 12.50% 62.50%       40ms 12.50%  main.F_expect_10_91
      40ms 12.50% 75.00%       40ms 12.50%  main.H_expect_14_546
      30ms  9.38% 84.38%       30ms  9.38%  main.D_expect_7_27
      20ms  6.25% 90.62%       20ms  6.25%  main.B_expect_3_64
      20ms  6.25% 96.88%       20ms  6.25%  main.G_expect_12_73
      10ms  3.12%   100%       10ms  3.12%  main.C_expect_5_46
         0     0%   100%      320ms   100%  main.main
         0     0%   100%      320ms   100%  runtime.main
(OS timer profile) serial.go/Run 3:
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:45pm (PST)
Duration: 501.39ms, Total samples = 310ms (61.83%)
Showing nodes accounting for 310ms, 100% of 310ms total
      flat  flat%   sum%        cum   cum%
     110ms 35.48% 35.48%      110ms 35.48%  main.J_expect_18_18
      70ms 22.58% 58.06%       70ms 22.58%  main.G_expect_12_73
      60ms 19.35% 77.42%       60ms 19.35%  main.F_expect_10_91
      30ms  9.68% 87.10%       30ms  9.68%  main.I_expect_16_36
      20ms  6.45% 93.55%       20ms  6.45%  main.H_expect_14_546
      10ms  3.23% 96.77%       10ms  3.23%  main.B_expect_3_64
      10ms  3.23%   100%       10ms  3.23%  main.C_expect_5_46
         0     0%   100%      310ms   100%  main.main
         0     0%   100%      310ms   100%  runtime.main

12. References

  1. HPCToolkit:
  2. Oracle studio:
  3. Perf utils:
  4. vTune
  5. “Pinpointing Data Locality Problems Using Data-Centric Analysis”, Xu Liu and John Mellor-Crummey, 2011 International Symposium on Code Generation and Optimization, April 2-6, 2011, Chamonix, France.
  6. “A Data-centric Profiler for Parallel Programs”, Xu Liu and John Mellor-Crummey, The International Conference for High Performance Computing, Networking, Storage and Analysis, November 17-22, 2013, Denver, Colorado, USA.
  7. “ScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel Programs”, Xu Liu and Bo Wu, The International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 15-20, 2015, Austin, Texas, USA.