Author(s): Milind Chabbi
Last updated: Feb/13/2020
Discussion at https://golang.org/issues/36821.
The Go pprof CPU profiles are quite inaccurate and imprecise due to several limitations related to the operating system (OS) interval timer used as its sampling engine. Inaccurate profiles can mislead performance investigation and may lead to inappropriate optimization guidance. We propose incorporating hardware performance counters as an alternative sampling engine for CPU profiling. Hardware performance counter-based sampling rectifies many of the problems related to the OS timers. The hardware CPU cycles event provides microsecond measurement granularity, and it is available in almost all commodity CPUs. Furthermore, hardware Performance Monitoring Units (PMUs) allow measuring not only CPU cycles but other activities such as cache misses that help root cause complex performance problems. A prototype implementation shows significantly better-quality CPU profiles when using PMUs while maintaining a low measurement overhead. The proposal will retain the OS-timer as the default sampling engine. The proposal exposes the PMU-based profiling feature on Linux-based systems, which provide an excellent facility to program PMUs.
In this section, we first provide a high-level description of Go's CPU profiling, motivate the problems in the current profiler with examples, highlight the state-of-the-art in CPU profiling, and then dive deeper into the current implementation of CPU profiling inside Go. Our proposed changes appear in the next section. Readers familiar with profiling technology and the internals of pprof may skip this section.
Pprof, the de facto Go CPU profiler, employs sampling-based profiling for time measurement. It attributes the time metric to different functions, including their calling context (aka call path or backtrace). Pprof relies on the underlying operating-system timers (exposed by the Go runtime) to periodically interrupt a program’s execution. The sampling rate is typically one hundred hertz (a sample taken every 10ms); a different sampling rate can be configured, but typical systems cannot deliver more than 250 hertz.
When profiling is enabled, the Go runtime scheduler sets up each OS thread in the process to generate periodic timer interrupts. In each interrupt, the Go runtime's signal handler obtains a backtrace and appends it to a ring buffer. All OS threads in a Go process append their backtraces to the same ring buffer, coordinated via a lock. A side thread, launched by the pprof package, periodically scrapes the backtraces from the ring buffer and serializes them into a protobuf file to be used by the downstream tools.
Accuracy and precision are desirable properties of a good measurement tool.
A profiling datum is said to be accurate if it is close to the ground truth (aka true value). For example, if API_A() consumes 25% of the total execution time and a profiler attributes it 24% of the total execution time, the measurement has 96% accuracy.
Profiling data are said to be precise if there is low variability among multiple measurements. The precision of a set of measurements of the same variable is often expressed either as the min and max deviation from the sample mean, or as the standard error or coefficient of variation of the sample.
Unfortunately, the current pprof CPU profiles meet neither of these two criteria.
Inaccuracy of profiles arises from two sources: sample size and sampling bias.
By the law of large numbers, a larger sample size makes the sample mean closer to the population mean. Moreover, in profiling, where the sample size is very small compared to the population size, increasing the sample size greatly improves accuracy. Conversely, a smaller sample size leads to a lower accuracy.
The low-resolution OS timer is the primary reason behind the small sample size in a measurement window. One can obtain more samples either by increasing the length of a measurement window or by increasing the number of samples in a measurement window. The former works well for regular programs, e.g., codes dominated by loops; however, for irregular codes, such as microservices, a larger measurement window does not rectify the measurement inaccuracy because more irregularities can easily surface in a larger observation window. Hence, there is a dire need to collect more samples within a measurement window and obtain fine-grained samples of sub-millisecond executions. Furthermore, linearly scaling the measurement time to collect more samples is impractical if orders of magnitude more samples are necessary to correct the small sample size issue.
A larger sample size alone does not improve accuracy if the samples are biased. OS timer samples on some systems (e.g., Linux) are biased because a timer interrupt from one OS thread, say T1, may be handled by an arbitrary (not to be confused with uniformly random) thread, say T2. This means T2 will handle the interrupt and incorrectly attribute the sample to its own call stack, which results in biased samples. If more timer signals are handled by T2 compared to T1, a systematic sampling bias will occur, which leads to inaccurate profiles.
Imprecision in measurement comes from at least the following two sources: sample size and measurement skid.
A smaller number of samples directly contributes to a larger standard error. The low resolution of OS timers is responsible for fewer samples in a measurement window, which results in a lower precision of pprof profiles. Conversely, a higher resolution of samples will improve precision.
The second source of measurement error is the random error inherent in a measurement apparatus. An OS timer apparatus configured to expire after N milliseconds is only guaranteed to generate an interrupt some time after N milliseconds -- not precisely at N milliseconds. This randomness introduces a large "skid" in the measurement. Consider a periodic timer set to fire every 10ms that has 1ms left before its expiry when the OS scheduler, with a 4ms resolution, inspects it. The timer will then fire 4ms later, which is 3ms after the scheduled expiry time; this amounts to up to 30% imprecision for a 10ms periodic timer. Although random error may not be entirely eliminated, a superior measurement apparatus reduces the effects of random error.
Consider the Go program goroutine.go, which has ten identical goroutines f1-f10. These functions are computationally intensive, long-running, and unaffected by IO or the memory hierarchy. The functions are non-preemptive, and hence the Go runtime scheduler overhead is zero. It is expected that each function takes exactly the same amount of CPU time (i.e., approximately 10% of the overall execution). It should not matter how many OS threads or CPU cores they run on.
Table 1 summarizes the result of go tool pprof -top for this program running on a Linux OS on a 12-core Intel Skylake server-class machine. The measurements were taken three times under the same configuration. The Flat (ms) column shows the absolute millisecond measurement attributed to each of the ten routines, and the Flat (%) column shows the relative time in a routine with respect to the total execution time. The Expected (%) column shows the expected relative time in each routine. The Rank order column assigns ranks in the descending order of execution time. The raw output is available in the appendix.
Table 1: demonstration of inaccuracy and imprecision of pprof profiles using OS timers on a concurrent code.
First, let us focus on the data in a single run, RUN 1. f1-f10 show a wide variance in the time attributed to them; the expectation is that each of them gets 10% of the execution time. There is up to a 6x difference between the routine with the highest attribution (main.f7 with 4210ms, 23.31%) and the routine with the lowest attribution (main.f9 with 700ms, 3.88%). This demonstrates the poor accuracy (deviation from the ground truth) of pprof timer-based profiles. The issue reproduces when using a different number of CPU cores.
Now focus on the three runs, RUN 1-3, together. The time attributed to any routine varies widely from one run to another, and the rank order of the routines changes significantly from run to run. In RUN 1, main.f7 is shown to run for 4210ms with a rank order of 1, whereas in RUN 2, it is shown to run for only 520ms with a rank order of 10. The expectation is that the measurements remain the same from run to run. From one run to another, the function-level time attributions are not precise; for example, there is a 71% coefficient of variation among the three runs of the f1 routine, and the average coefficient of variation across all functions is 28%. This demonstrates the imprecision (unpredictability of measurement) of pprof timer-based profiles.
For the curious readers, Table 2 below shows near-perfect accuracy and precision when profiling the same code using the PMU-based profiler described in this proposal. We show two runs instead of three for brevity. Additional details appear in Section 5.1.
Table 2: demonstration of accuracy and precision of pprof profiles via PMU CPU cycles event on a concurrent code.
The issue is not specific to concurrent code alone. Refer to a carefully crafted Go program -- serial.go -- which has a single goroutine with complete data dependence from one statement to the next. It has ten functions: A_expect_1_82, B_expect_3_64, ..., J_expect_18_18. The functions are intentionally written to consume different amounts of time to make this test case more representative. Each function is similar and has a loop, but the trip count of the loop in each function is different. The loop in the function with prefix A runs 1*C times, where C is a constant; the loop in the function with prefix B runs 2*C times, and so on. This arrangement leads to a predictable relative execution time for each function: the function with prefix A consumes 1/sum(1..10)=1.82% of the total execution time, the function with prefix B consumes 2/sum(1..10)=3.64%, and so on, until the function with prefix J consumes 10/sum(1..10)=18.18% of the total execution time.
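A hedged sketch of serial.go's structure follows; the function names and trip counts come from the description above, while the loop body and the constant's value are illustrative assumptions.

```go
package main

const _C = 50_000_000 // the common trip-count multiplier (illustrative value)

// Each function carries its own loop (rather than calling a shared helper) so
// that pprof attributes the flat time to the named function itself.
func A_expect_1_82(v int) int { // 1/sum(1..10) = 1.82% of total time
	for i := 0; i < 1*_C; i++ {
		v += i
	}
	return v
}

// B_expect_3_64 through I_expect_16_36 are identical except that the loop
// bound grows as 2*_C, 3*_C, ..., 9*_C.

func J_expect_18_18(v int) int { // 10/sum(1..10) = 18.18% of total time
	for i := 0; i < 10*_C; i++ {
		v += i
	}
	return v
}

func main() {
	v := A_expect_1_82(0)
	// ... each call consumes the previous result, forcing serial execution ...
	v = J_expect_18_18(v)
	_ = v
}
```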
The expected relative percentage of CPU time is encoded in each function name: A_expect_1_82 should consume 1.82% of the execution time, B_expect_3_64 should consume 3.64%, J_expect_18_18 should consume 18.18%, and so on. Table 3 summarizes the output of pprof CPU profiling for this code. As before, the measurements were taken three times under the same configuration. The Flat (ms) column shows the absolute millisecond measurement attributed to each of the ten functions, and the Flat (%) column shows the relative time with respect to the total time taken by all functions. The Expected (%) column shows the expected relative time in each routine. The Rank order column assigns ranks in descending order of the execution time. The raw output is available in the appendix.
Table 3: demonstration of inaccuracy and imprecision of pprof profiles on a serial code.
A "-" in a cell represents zero samples. First, let's focus on just one set of measurements from RUN 1. Rather than J_expect_18_18 being the top time consumer with 18.18%, both J_expect_18_18 and H_expect_14_546 are the top time consumers, each with 25% of the CPU time. B_expect_3_64, D_expect_7_27, F_expect_10_91, and I_expect_16_36 are all lumped into a single rank order of 4, each attributed 6.25% of the execution time. E_expect_9_09 does not even appear in the profiles despite its 9% expected execution time. More inaccuracies are evident in the data in the table. If we inspect the data from RUN 2 and RUN 3, we notice more discrepancies, where a lot of data is missing or incorrect. The rank ordering and the relative execution percentages both have very low correlation with the ground truth. Despite tens of milliseconds of execution time, some functions are overcounted and some are undercounted.
Now, let's focus on the three runs together. The highest time consumer, J_expect_18_18, does not even get a single sample in RUN 3, indicating the imprecision. I_expect_16_36, which has measurement data available in all three runs, varies from 20ms in RUN 1 with a rank order of 4 to 70ms in RUN 2 with a rank order of 1. More discrepancies are visible in the table. Thus, the serial program also shows run-to-run measurement variation, and hence the profiles are imprecise.
Table 4 below shows near-perfect accuracy and precision when profiling the same code using the PMU-based profiler described in this proposal. The measurements were taken two times instead of three for brevity. Additional details appear in Section 5.2.
Table 4: demonstration of accuracy and precision of pprof profiles via PMU CPU cycles event on a serial code.
An alternative to OS timer-based sampling is using hardware performance counters, which are ubiquitous in modern commodity CPUs. The CPU's Performance Monitoring Units (PMUs) can count various events such as cycles, instructions, and cache misses, to name a few. Furthermore, PMUs can be configured to deliver an interrupt when an event reaches a specified threshold value. These PMU interrupts, like the OS timer interrupts, can serve as another source of sampling the call stack. This technique is employed in many modern profilers [1,2,4]. PMU sampling is called "event-based sampling," where the period (the interval between two interrupts) is expressed as the number of occurrences of an event, not the number of timer ticks. As a side note, in event-based sampling the sampling overhead is proportional to the number of events, regardless of the length of execution, whereas in timer-based sampling the overhead is proportional to the length of execution. If there are more events in a short time span, then there is also more interest in observing that region of execution deeply, and event-based sampling takes more samples there. More overhead is incurred in places where there is interest (more events) and low/no overhead otherwise. CPU cycles is one of the most common PMU events, and it occurs periodically, similar to OS timer ticks. Hence, using CPU cycles as a substitute for OS timers is very attractive.
We now provide the current implementation details of CPU profiling in Go, which is necessary to understand the changes we propose in the next section. The description is relevant for most Unix-based systems, especially Linux, and most of the discussion is irrelevant for Windows since it does not use signals for handling CPU profiling.
Figure 1: A pictorial representation of the OS timer sampling workflow in the existing implementation. The first line of each record box is the function name. The remaining lines provide a textual description of the steps taken by that function.
Currently, pprof.StartCPUProfile(io.Writer) is the entry point for starting CPU profiling. Analogously, pprof.StopCPUProfile() stops a running profile. Two concurrent CPU profiling sessions are disallowed via a lock-protected variable cpu.profiling in pprof. The pprof package interacts with the runtime package for starting/stopping profiling via the runtime.SetCPUProfileRate(hz int) API.
There are two critical pieces of work done inside pprof.StartCPUProfile(io.Writer). First, it invokes runtime.SetCPUProfileRate(hz=100), asking the runtime to start profiling; the rest of the profile generation will be handled by the runtime package; the integer argument is the sampling rate in Hz (samples per second).
Second, in order to collect the profiles generated by the runtime, pprof.StartCPUProfile(io.Writer) launches a goroutine, profileWriter(io.Writer), condensed below. profileWriter:
1. creates a newProfileBuilder,
2. periodically calls runtime.readProfile(), which provides a chunk of profiles collected inside the runtime in a ring buffer (eof if the profiling stopped), and
3. flags the cpu.done channel once runtime.readProfile() returns eof. The other side of the channel will be waited upon by pprof.StopCPUProfile().
runtime.SetCPUProfileRate(int) is the entry point to start/stop profiling in the runtime package. It sets up the ring buffer cpuprof.log and calls setcpuprofilerate(hz), which:
1. calls setThreadCPUProfileRate(hz=0) to stop any on-going timer,
2. calls setProcessCPUProfiler(hz), which installs the SIGPROF handler,
3. records hz into the scheduler's sched.profilehz to be visible to other OS threads, and
4. calls setThreadCPUProfileRate(hz), which uses the OS timer (setitimer) to start the sampling engine; this step is performed only if the hz value is non-zero.

Inside the scheduler, the execute() method schedules a goroutine (called a G in the Go runtime scheduler) to run on the current OS thread (called an M in the Go runtime scheduler). This method compares the current value of the scheduler-recorded sched.profilehz variable against its own snapshot (m.profilehz) of the previously recorded value of sched.profilehz; if they differ, it reconfigures the timer by calling setThreadCPUProfileRate(sched.profilehz), which sets an OS timer on behalf of the current OS thread in the process, as sketched below.
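The check looks roughly like the following; identifier names follow this document, and the surrounding scheduler logic is elided.

```go
func execute(gp *g, inheritTime bool) {
	mp := getg().m
	// ... bind gp to this M ...
	if mp.profilehz != sched.profilehz {
		// The global profiling rate changed since this M last checked:
		// re-arm (or disarm, when the rate is 0) this thread's OS timer.
		setThreadCPUProfileRate(sched.profilehz)
		mp.profilehz = sched.profilehz
	}
	// ... run gp ...
}
```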
Once armed, the timer periodically delivers a SIGPROF signal, handled by sighandler(sig int, info *siginfo, ctxt unsafe.Pointer, gp *g), which in turn invokes sigprof(), which collects a backtrace and appends it to the ring buffer via cpuprof.add(). Occasionally, a sampled program counter may fall into non-Go space, or the ring buffer may be full; such samples are simply counted, and a backtrace is not taken. The ring buffer is coordinated via a signalLock for concurrency control. The edges marked with "signalLock" in Figure 1 show the operations that take the lock to access the ring buffer.
pprof.StopCPUProfile() follows a path very similar to pprof.StartCPUProfile(io.Writer). When the hz argument is zero, the runtime APIs described previously (e.g., SetCPUProfileRate(hz)) interpret it as the end of profiling. pprof.StopCPUProfile() performs the following steps:
1. invokes runtime.SetCPUProfileRate(0) to stop profiling,
2. waits on the cpu.done channel, which will be flagged by the side thread once eof is reached in the runtime ring buffer, and
3. finishes serializing the profile data into the io.Writer.

runtime.SetCPUProfileRate(hz), when called with hz=0, invokes setcpuprofilerate(0) and closes the ring buffer. setcpuprofilerate(0), when called with hz=0:
1. calls setProcessCPUProfiler(0), which restores the previous signal handler for SIGPROF,
2. sets sched.profilehz=0, indicating globally that profiling has stopped, and
3. calls setThreadCPUProfileRate(0) to stop sampling on the current OS thread.

Subsequently, the other Ms notice the change in the global sched.profilehz and turn off their timers when they run their execute() method.
The direct users of pprof are:
1. the testing package, which can take a -cpuprofile <file> flag to collect profiles,
2. the net/http/pprof package, which can collect CPU profiles on the /debug/pprof/profile endpoint with an optional argument ?seconds=<N>, and
3. third-party packages such as pkg/profile.
To summarize:
1. Pprof CPU profiles are inaccurate because the profiling resolution is low and sampling is biased in OS timers.
2. Pprof CPU profiles are imprecise because the profiling resolution is low and the samples have a large skid due to OS timers.
3. Pprof CPU profiles measure only the time metric; additional events should be monitored to help performance profiling.
In this proposal, we address the concerns -- small sample size, sampling bias, measurement error, and the need for more metrics -- which were raised in the background section.
Sample size. First, this proposal addresses the problem of the low sampling resolution of OS timers by using high-resolution samples from hardware performance monitoring units (PMUs) to generate more samples. While OS timers have ~10ms resolution (that is, one sample every 10ms), sampling CPU cycles via PMUs can deliver (if desired) a sample every few microseconds -- a resolution three orders of magnitude finer. This fine-grained measurement increases the sample size and, as a result, alleviates both the inaccuracy and imprecision problems. While PMUs can deliver 100x more samples with just ~13% measurement overhead, the alternative of lengthening a measurement window by 100x to collect the same number of samples via OS timers would be impractical.
Sampling bias. Second, the proposal arranges each sample to be delivered to the OS thread causing the counter overflow to eliminate the sampling bias, which improves accuracy.
Random error (skid). Third, PMU samples alleviate the large sampling skid seen in OS timers because PMUs allow a low-skid or skid-free precise sample attribution.
More metrics. Finally, the PMU's capability to monitor various events such as cache misses, TLB misses, branch mispredictions, inter-core cacheline ping-ponging, to name a few, offers additional insights when diagnosing complex performance issues.
Implementing this idea calls for extending the Go runtime package's CPU profiling capabilities and making necessary adjustments in the runtime/pprof, testing, and net/http/pprof packages. We first describe the details of the changes to the runtime package and then describe the details of the pprof, testing, and net/http/pprof packages, in that order.
Our design brings the full capability of PMUs into the runtime package but exposes only a small amount of it via the pprof package to users. For example, the runtime might be able to configure whether it wants to monitor the CPU cycles spent inside the kernel or not; but these configurations will not be exposed (at least in the initial design) to the users of the pprof package.
In its simplest form, we substitute the start/stop of an OS interval timer with the start/stop of a PMU event configured in the sampling mode; the sampling mode delivers an interrupt each time a hardware counter overflows, which is analogous to the OS timer interrupt. The interrupt generated by the PMU counter overflow follows the same signal handling path as the interval timer. This policy allows collecting and appending a backtrace on each PMU interrupt into a ring buffer, just like the OS timer. Moreover, like the OS timer, each OS thread (M) checks whether any changes were made to the PMU-based profiling during each scheduling activity and performs the necessary adjustments to reconfigure the PMU. Finally, we retain SIGPROF as the signal type delivered -- for both OS timer interrupts and PMU counter overflow interrupts -- instead of hijacking another signal number for PMU-based profiling.
This seemingly straightforward approach poses implementation challenges related to retaining compatibility with the interval timer and incorporating additional PMU features. The first part of this section addresses those challenges; the remaining subsections are closer to implementation details. As before, the discussion is only relevant to Linux.
PMUs can count multiple CPU events simultaneously (e.g., cycles, instructions, cache misses, etc.). Our proposal does not preclude the runtime from simultaneously enabling multiple PMU events, or even PMU events alongside an OS timer event. In fact, certain events are best used when measured with others; for example, cache misses are best measured together with cache references so that a cache miss ratio can be computed.
In our proposal, we will limit the maximum number of concurrent CPU events to _CPUPROF_EVENTS_MAX (EV_MAX for short in the rest of this document). We will define a few common portable events for which the user does not have to look up event codes in the CPU manufacturer's manual. This closely resembles the preset events of perf events in Linux.
In order to enable advanced users to exploit the full capability of the PMUs, we define an opaque RAW event that allows the users to specify a model-specific hexadecimal event number that can be sampled. A motivating example for such a raw event is the remote HITM event on Intel processors; this event counts the number of times a cache line was loaded where the cache line was in the modified state in another remote CPU’s cache. This event helps diagnose problems such as cache line false sharing; we have observed this event to be helpful in identifying false sharing even inside the Go runtime.
We have come up with the following initial set of PMU events. These events will be exposed by the runtime package to the pprof package. The pprof package will wrap these constant values with matching function names, presented later in the pprof package subsection. Notice that the OS timer is one of the CPU events. Last-level cache events are included to capture the behavior of the CPU's memory hierarchy. Branch events are included to capture potentially wasted CPU cycles. The design discussions can refine this list further (e.g., whether TLB events are needed); it is easy to grow based on requests.
```go
type cpuEvent int32

// These constants are a mirror of the runtime constants.
const (
	_CPUPROF_OS_TIMER, _CPUPROF_FIRST_EVENT cpuEvent = iota, iota
	_CPUPROF_HW_CPU_CYCLES, _CPUPROF_FIRST_PMU_EVENT
	_CPUPROF_HW_INSTRUCTIONS cpuEvent = iota
	_CPUPROF_HW_CACHE_REFERENCES
	_CPUPROF_HW_CACHE_MISSES
	_CPUPROF_HW_BRANCH_INSTRUCTIONS
	_CPUPROF_HW_BRANCH_MISSES
	_CPUPROF_HW_RAW
	_CPUPROF_EVENTS_MAX
	_CPUPROF_LAST_EVENT = _CPUPROF_EVENTS_MAX - 1
)
```
Since PMU support may not be available on all platforms in all settings, we need to make it configurable in the Go runtime. If an OS supports PMU-based profiling (e.g., Linux), it will dynamically register two functions during the runtime startup.
1. setThreadPMUProfilerFunc: the OS-specific function used to start/stop a PMU event, analogous to setThreadCPUProfile. If setThreadPMUProfilerFunc is nil, the runtime will not attempt to start/stop PMU profiles even if it is requested to do so. However, this situation will not arise because runtime/pprof will guard against it.
2. sigprofPMUHandlerFunc: the OS-specific function called from the OS-specific signal handler when the PMU delivers an interrupt.

Currently, each M (type m) and the scheduler (type schedt) maintain a profilehz variable. In lieu of profilehz, we'll introduce a runtime structure, cpuProfileConfig, which captures the essential configurations needed to start/stop either a PMU counter or an OS timer. cpuProfileConfig is designed to reflect the underlying perf_event_attr struct in Linux, but being an internal data structure, we can modify it, if needed in the future, without affecting any public-surface API.
```go
type cpuProfileConfig struct {
	hz       uint64 // the sampling rate, used only for OS_TIMER
	period   uint64 // the sampling interval, used for PMU events
	rawEvent uint64 // the hexadecimal value of the CPU-specific event,
	// used only with the _CPUPROF_HW_RAW cpuEvent type

	preciseIP profilePCPrecision // the level of precision for the program counter in PMU samples

	isSampleIPIncluded        bool // should the sample include the PC?
	isSampleAddrIncluded      bool // should the sample include the memory address involved?
	isKernelIncluded          bool // should the kernel count events in kernel mode?
	isHvIncluded              bool // should the kernel count events in hypervisor mode?
	isIdleIncluded            bool // should the kernel count events when the idle task is running?
	isSampleCallchainIncluded bool // should the sample report the call chain?
	isCallchainKernelIncluded bool // should the sample include the kernel callchain?
	isCallchainUserIncluded   bool // should the sample include the user callchain?
}
```
Some PMUs can be configured so that they offer the exact program counter involved in the counter overflow. profilePCPrecision dictates the level of precision requested from the PMU, as described by the constants below. If a PMU is configured to provide the precise program counter for a sample, we'll substitute the leaf-level program counter in the backtrace with the one presented by the PMU. Please refer to Section 6.2 for PMU precision.
```go
type profilePCPrecision uint8

const (
	_CPUPROF_IP_ARBITRARY_SKID profilePCPrecision = iota
	_CPUPROF_IP_CONSTANT_SKID
	_CPUPROF_IP_SUGGEST_NO_SKID
	_CPUPROF_IP_NO_SKID
)
```
pprof will set the default to _CPUPROF_IP_ARBITRARY_SKID to work across architectures and event types. However, we can easily change it to the highest supported precision by probing the CPU, as done in certain profilers.
cpuProfileConfig will be visible to the pprof package but will not be exposed to its clients. The fields in cpuProfileConfig control how a PMU event is customized. Pprof will set the field values to the lowest common denominator that is expected to work across many architectures and security settings. For example, isKernelIncluded will most likely be false. The runtime will be (mostly) agnostic to these values and will pass them down to the OS-specific layers that handle programming the PMUs.
The runtime package will expose an API, runtime_pprof_setCPUProfileConfig(cpuEvent, *cpuProfileConfig), only to pprof. This API is the equivalent of SetCPUProfileRate(hz int). Since the legacy SetCPUProfileRate(hz int) cannot be dropped, we alter its internals to wrap the hz value into a cpuProfileConfig structure and pass it down to runtime_pprof_setCPUProfileConfig for a uniform treatment of any event, whether an OS-timer event or a PMU event, as sketched below.
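A minimal sketch of this wrapping, assuming the internal names introduced in this document:

```go
// SetCPUProfileRate retains its public signature; internally it now builds a
// cpuProfileConfig for the OS-timer event and forwards it.
func SetCPUProfileRate(hz int) {
	if hz <= 0 {
		// A nil config means "stop profiling" for the given event.
		runtime_pprof_setCPUProfileConfig(_CPUPROF_OS_TIMER, nil)
		return
	}
	runtime_pprof_setCPUProfileConfig(_CPUPROF_OS_TIMER,
		&cpuProfileConfig{hz: uint64(hz)})
}
```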
The runtime can be bifurcated into OS-agnostic and OS-specific parts.

The OS-agnostic part of the runtime will, importantly, reconfigure profiling on each M during its scheduling based on whether its current snapshot is different from the global configuration stored in the scheduler.

The OS-specific part of the runtime will have the following duties:
1. implement setThreadPMUProfilerFunc, which starts/stops a PMU event on an OS thread, and
2. implement sigprofPMUHandlerFunc, which collects a sample when the PMU delivers an overflow interrupt.
Since there can be _CPUPROF_EVENTS_MAX concurrent CPU profiles, we need to change some of the runtime data structures that previously assumed a single CPU profile. Importantly, we:
1. change the profilehz field in the schedt struct to be an array of cpuProfileConfig pointers,
2. change the profilehz field in the m struct to be an array of cpuProfileConfig pointers, and
3. introduce a cpuProfileHandle field in the m struct. This holds OS-specific information stored in each M. For example, on Linux, this will hold the file handles obtained from perf_event_open system calls and the corresponding mmap ring buffers where the kernel populates the sample data. This information will be used to close the resources once profiling is stopped on an M. A sketch of these field changes follows this list.
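The field changes could look as follows; the field names are illustrative, the real structs contain many unrelated fields, and cpuProfileHandle's concrete element type is OS-specific:

```go
type schedt struct {
	// was: profilehz int32
	profileConfig [_CPUPROF_EVENTS_MAX]*cpuProfileConfig // global, per-event
	// ...
}

type m struct {
	// was: profilehz int32
	profileConfig    [_CPUPROF_EVENTS_MAX]*cpuProfileConfig // this M's snapshot
	cpuProfileHandle [_CPUPROF_EVENTS_MAX]uintptr           // e.g., perf fds and mmap buffers on Linux
	// ...
}
```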
argument used in various runtime APIs involved in CPU profiling with the following two arguments.
eventId cpuEvent
(described in the constants previously), andprofConfig *cpuProfileConfig
.Figure 2 shows the call graph for the new workflow. We abbreviate eventId
as i
and cpuProfileConfig
as p
in most places.
Figure 2: A pictorial representation of the OS timer-based and PMU-based sampling workflow. The first line of each record box is the function name. The remaining lines provide a textual description of the steps taken by that function.
A nil value of cpuProfileConfig implies stopping profiling for the specified cpuEvent, and a non-nil value implies starting profiling for the specified cpuEvent. At the runtime level, runtime_pprof_setCPUProfileConfig() makes a copy of the cpuProfileConfig passed from pprof, so that inside the runtime we can perform fast pointer comparisons and rely on two large cpuProfileConfig structs being equal if the pointers to them are equal, as sketched below.
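A sketch of the copy-on-entry, with the start/stop machinery elided:

```go
func runtime_pprof_setCPUProfileConfig(eventId cpuEvent, p *cpuProfileConfig) {
	var cfg *cpuProfileConfig
	if p != nil {
		c := *p // snapshot the caller's struct once, on entry
		cfg = &c
	}
	// The runtime never mutates cfg after this point, so comparing stored
	// *cpuProfileConfig values by pointer stands in for struct equality.
	// ... start/stop machinery elided ...
	_ = cfg
}
```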
Starting/stopping a PMU counter is different from starting/stopping an OS timer. Hence, "if eventId == _CPUPROF_OS_TIMER" checks will appear in a few places inside the runtime's profiling-related code.
A few concurrency aspects deserve a mention.
1. There will be EV_MAX ring buffers. Each ring buffer will be created only if there is a corresponding active event. The size of each ring buffer is 16KB. If all EV_MAX events are active at the same time, it will result in 16KB * 8 = 128KB of ring buffers, which is paltry on modern machines.
2. We change the cpuprof variable in cpuprof.go to be an array of EV_MAX size.
3. runtime_pprof_setCPUProfileConfig will hold a lock only for cpuprof[i], where i is the event id for which the profile has to be started or stopped.
4. There will be a single signalLock rather than EV_MAX signalLocks. This is because the setcpuprofileconfig method in cpuprof.go needs to check (read) all prof[eventId].configs to infer whether all events are stopped before uninstalling the signal handler. Taking EV_MAX locks one after another would be no better than holding just one lock. The consequence of this is that we serialize simultaneous writes to two different ring buffers. This should not be a serious issue since there cannot be two concurrent SIGPROFs. If a system supports concurrent SIGPROF handling, and if there are a lot of threads in a system, we can revisit this choice.
5. In the very rare case when a SIGPROF is delivered for an already-running event while another event is being set (which will hold the signalLock), the signal handler recognizes it and drops the sample; otherwise, a deadlock would ensue. A sketch of this rule follows the list.
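A sketch of the drop rule from item 5, with illustrative names; the real handler runs in signal context and therefore must never block:

```go
func trySample(i cpuEvent, stk []uintptr) {
	if !atomic.Cas(&prof.signalLock, 0, 1) {
		// Another event is being configured; blocking inside a signal
		// handler could deadlock, so count the sample as lost and bail.
		atomic.Xadd64(&prof.lostSamples, 1)
		return
	}
	// ... append stk to cpuprof[i]'s ring buffer ...
	atomic.Store(&prof.signalLock, 0)
}
```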
At the lowest level, we will incorporate the following three system calls into all runtime/sys_linux_<arch>.s files:
1. perf_event_open, which is Linux's means to program PMUs. The Go signature will be func perfEventOpen(attr *perfEventAttr, pid uintptr, cpu, groupFd int32, flags uintptr) int32.
2. ioctl, which is needed to stop the PMU counter when handling the SIGPROF, and to reset and restart it after handling the signal. The Go signature will be func ioctl(fd, req int32, arg uintptr) int32.
3. fcntl, which is necessary to arrange for a perf_event file handle to deliver the SIGPROF signal and also to set up the signal to be delivered exactly to the M (OS thread) whose PMU counter overflowed. The Go signature will be func fcntl(fd, cmd int32, arg uintptr) int32. A sketch combining these three calls appears after this list.
Handling SIGPROF from different sampling sources

All PMU events and the OS timer use the same SIGPROF signal to deliver their interrupts. Hence, the Unix signal handler needs to address the following concerns:
1. determining whether a given SIGPROF was generated by the OS timer or by a PMU counter overflow, and
2. if it came from a PMU, determining which event's counter overflowed.
Although both an OS timer expiry and any PMU counter overflow result in the same SIGPROF signal, the signal codes differ. A PMU counter overflow generates a POLL_IN signal code, indicating the presence of data, whereas the OS timer interrupts generate a SI_KERNEL or SI_ALARM signal code. We use this information to address concern #1 above, as sketched below.
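The discrimination could look roughly like this inside the handler; the constant names and the sigprofOSTimer helper are assumptions, with _POLL_IN standing for Linux's POLL_IN si_code:

```go
func sighandler(sig int, info *siginfo, ctxt unsafe.Pointer, gp *g) {
	if sig == _SIGPROF {
		if info.si_code == _POLL_IN {
			// A PMU counter overflowed; info also carries the perf fd
			// used to identify the event.
			sigprofPMUHandlerFunc(info, ctxt, gp)
		} else {
			// SI_KERNEL / SI_ALARM: the interval timer expired; take the
			// existing sigprof() path (arguments elided).
			sigprofOSTimer(info, ctxt, gp)
		}
		return
	}
	// ... handling of other signals elided ...
}
```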
Once we know that the signal was generated by some PMU counter overflowing, we use the file descriptor fd passed in the signal info and match it against the opened perf-events file descriptors on the same M, and we use the matching location to infer the PMU event i that caused the signal; the rest of the signal handling makes use of this i to record the backtrace in the appropriate ring buffer. During the signal handling, we'll pause all active PMU counters via ioctl calls on the fds so that the signal handling is not counted towards the PMU event counters.
The runtime/pprof package

We now turn our attention to the runtime/pprof package, which exposes PMU sampling to its clients. There are several choices. There is a tension between a revolutionary vs. an evolutionary solution; there is friction between exposing the full PMU functionality vs. offering the bare minimum. In our proposal, we hope to make pragmatic choices among the myriad options.
We propose adding a new API, which accepts different configurations needed to start any CPU event -- OS timer and PMU counters. The API signature is shown below.
pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts ...ProfilingOption) error
There is an optional variadic ProfilingOption to allow more than one event to be started during a profiling session. A ProfilingOption is an interface exposed by runtime/pprof, as shown below.
```go
type ProfilingOption interface {
	apply() error
}
```
Pprof will implement the following profiling option functions that mirror the various cpuEvent constants previously presented.
```go
func OSTimer(w io.Writer) ProfilingOption
func CPUCycles(w io.Writer, period uint64) ProfilingOption
func CPUInstructions(w io.Writer, period uint64) ProfilingOption
func CPUCacheReferences(w io.Writer, period uint64) ProfilingOption
func CPUCacheMisses(w io.Writer, period uint64) ProfilingOption
func CPUBranchInstructions(w io.Writer, period uint64) ProfilingOption
func CPUBranchMisses(w io.Writer, period uint64) ProfilingOption
func CPURawEvent(w io.Writer, period uint64, hex uint64) ProfilingOption
```
All of them consume an io.Writer. OSTimer accepts no other argument. The period accepted by all PMU events specifies the number of events that must elapse between two consecutive interrupts; a larger value means lower granularity (the opposite of Hz). Passing a zero value will result in using a preset period. CPURawEvent is special; it accepts any user-provided hexadecimal PMU event code. Notice that each event requires its own io.Writer, since different profile types cannot be serialized into the same pprof protocol buffer. In the cases where multiple ProfilingOptions are passed to pprof.StartCPUProfileWithConfig, sanity checks are made to ensure the uniqueness of each cpuEvent and each io.Writer.
Below are a few example usages of the API.
```go
// Simple
StartCPUProfileWithConfig(OSTimer(w))
StartCPUProfileWithConfig(CPUCycles(w, period))
StartCPUProfileWithConfig(CPURawEvent(w, period, hexadecimalEvent))

// Advanced
StartCPUProfileWithConfig(CPUCacheReferences(w1, period1), CPUCacheMisses(w2, period2))
```
The implementation of StartCPUProfile(writer) will simply invoke StartCPUProfileWithConfig(OSTimer(writer)). The implementation of StartCPUProfileWithConfig will ensure a maximum of EV_MAX elements, unique io.Writers, and unique cpuEvents. After sanitizing the arguments, it will create a cpuProfileConfig for each cpuEvent requested to be profiled. In a global data structure cpu, we will maintain the active events, their *cpuProfileConfigs, and the corresponding io.Writers.
```go
var cpu struct {
	sync.Mutex
	profiling    bool
	activeConfig [EV_MAX]*cpuProfileConfig
	activeWriter [EV_MAX]io.Writer
	eventName    [EV_MAX]string
	done         chan bool
}
```
Finally, StartCPUProfileWithConfig will invoke runtime_pprof_setCPUProfileConfig multiple times, once for each active event, to request the runtime to actually configure either the OS timer or the PMU with a specific event to start sampling. A lock will be held throughout the invocation of StartCPUProfileWithConfig, and hence no two StartCPUProfileWithConfig/StartCPUProfile invocations run concurrently. Once a session of StartCPUProfileWithConfig/StartCPUProfile is active, another profiling session is disallowed until StopCPUProfile is invoked.
Before successfully returning from StartCPUProfileWithConfig, we'll launch a goroutine, profileWriter, which will periodically wake up, scrape the runtime-produced stack samples from the ring buffers associated with each of the active events, and serialize them into the corresponding io.Writers. This follows the same design as the original pprof, but adds the responsibility of collecting profiles from multiple events. Creating one goroutine to scrape each active event would be wasteful; a single goroutine suffices to collect backtraces from all ring buffers.
Allowing multiple CPU profiles in a single profiling session via the pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts ...ProfilingOption) API is a power-user feature. To avoid any accidental misuse, we'll protect it under an environment variable, GO_PPROF_ENABLE_MULTIPLE_CPU_PROFILES=<true|false>, which will be disabled by default.
The testing package

We'll add the following two command-line arguments to the testing package:
-cpuprofileevent=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>
-cpuprofileperiod=<Int64Value>
The default value of -cpuprofileevent is timer with 100Hz when only -cpuprofile is passed, which ensures compatibility with the current behavior. -cpuprofileperiod will be ignored for -cpuprofileevent=timer and will default to a 100Hz rate, as is the case currently. The testing package will map a user-specified string for the -cpuprofileevent value to the appropriate pprof.ProfilingOption and the -cpuprofileperiod to an integer, and invoke pprof.StartCPUProfileWithConfig(). For example, -cpuprofileevent=cycles and -cpuprofileperiod=100000 will result in pprof.StartCPUProfileWithConfig(CPUCycles(w, 100000)).
The net/http/pprof package

We'll add the following two additional arguments to the http request exposed by the net/http/pprof CPU profiling endpoint:
event=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>
period=<Int64Value>
period will be ignored for event=timer and will default to a 100Hz rate, as is the case currently. The default value of event is timer with 100Hz, which ensures compatibility with the current behavior. The net/http/pprof package will map a user-specified string for the event value to an appropriate pprof.ProfilingOption and period to an integer, and invoke pprof.StartCPUProfileWithConfig.
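As a usage illustration, the hedged example below fetches a 30-second CPU-cycles profile from a server that imports net/http/pprof; the event and period parameters are the proposed ones, while the host, port, and output file name are assumptions.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	const url = "http://localhost:6060/debug/pprof/profile?seconds=30&event=cycles&period=2400000"
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("cycles.pb.gz") // pprof protobuf output
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```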
Below we show the go tool pprof -top output of the previously shown goroutine.go concurrent program, run two times. The profiles used the proposed new API with a 1M sampling period: pprof.StartCPUProfileWithConfig(CPUCycles(w, 1000000)). The CPU cycles attributed to each routine f1-f10 match the expectation (approximately 10% of the total execution time) within one run and across multiple runs. Thus, the profiles are accurate and precise. The measurement overhead is indistinguishable from the OS timer in this setting.
```
File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:49pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
       flat  flat%   sum%         cum   cum%
23400000000 10.00% 10.00% 23400000000 10.00%  main.f1
23400000000 10.00% 20.00% 23400000000 10.00%  main.f10
23400000000 10.00% 30.00% 23400000000 10.00%  main.f2
23400000000 10.00% 40.00% 23400000000 10.00%  main.f3
23400000000 10.00% 50.00% 23400000000 10.00%  main.f4
23400000000 10.00% 60.00% 23400000000 10.00%  main.f5
23400000000 10.00% 70.00% 23400000000 10.00%  main.f6
23400000000 10.00% 80.00% 23400000000 10.00%  main.f7
23400000000 10.00% 90.00% 23400000000 10.00%  main.f8
23400000000 10.00%   100% 23400000000 10.00%  main.f9
```
```
File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:51pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
       flat  flat%   sum%         cum   cum%
23800000000 10.17% 10.17% 23800000000 10.17%  main.f1
23500000000 10.04% 20.21% 23500000000 10.04%  main.f7
23500000000 10.04% 30.26% 23500000000 10.04%  main.f9
23400000000 10.00% 40.26% 23400000000 10.00%  main.f10
23400000000 10.00% 50.26% 23400000000 10.00%  main.f2
23400000000 10.00% 60.26% 23400000000 10.00%  main.f4
23400000000 10.00% 70.26% 23400000000 10.00%  main.f6
23400000000 10.00% 80.26% 23400000000 10.00%  main.f8
23300000000  9.96% 90.21% 23300000000  9.96%  main.f3
22900000000  9.79%   100% 22900000000  9.79%  main.f5
```
Below we show the go tool pprof -top output of the previously shown serial.go program, run two times. We remind the reader that the expected relative execution time is encoded in the function names themselves (J_expect_18_18 should consume 18.18% of the CPU cycles, and so on). The profiles used the proposed new API with a 1M sampling period: pprof.StartCPUProfileWithConfig(CPUCycles(w, 1000000)). The CPU cycles attributed to each routine match the expectation within a run and across multiple runs. Thus, the profiles are accurate and precise.
```
File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
      flat  flat%   sum%        cum   cum%
 200000000 18.10% 18.10%  200000000 18.10%  main.J_expect_18_18
 183000000 16.56% 34.66%  183000000 16.56%  main.I_expect_16_36
 165000000 14.93% 49.59%  165000000 14.93%  main.H_expect_14_546
 137000000 12.40% 61.99%  137000000 12.40%  main.G_expect_12_73
 120000000 10.86% 72.85%  120000000 10.86%  main.F_expect_10_91
 100000000  9.05% 81.90%  100000000  9.05%  main.E_expect_9_09
  82000000  7.42% 89.32%   82000000  7.42%  main.D_expect_7_27
  63000000  5.70% 95.02%   63000000  5.70%  main.C_expect_5_46
  37000000  3.35% 98.37%   37000000  3.35%  main.B_expect_3_64
  18000000  1.63%   100%   18000000  1.63%  main.A_expect_1_82
         0     0%   100% 1105000000   100%  main.main
         0     0%   100% 1105000000   100%  runtime.main
```
```
File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
      flat  flat%   sum%        cum   cum%
 200000000 18.10% 18.10%  200000000 18.10%  main.J_expect_18_18
 183000000 16.56% 34.66%  183000000 16.56%  main.I_expect_16_36
 159000000 14.39% 49.05%  159000000 14.39%  main.H_expect_14_546
 142000000 12.85% 61.90%  142000000 12.85%  main.G_expect_12_73
 119000000 10.77% 72.67%  119000000 10.77%  main.F_expect_10_91
 100000000  9.05% 81.72%  100000000  9.05%  main.E_expect_9_09
  82000000  7.42% 89.14%   82000000  7.42%  main.D_expect_7_27
  61000000  5.52% 94.66%   61000000  5.52%  main.C_expect_5_46
  40000000  3.62% 98.28%   40000000  3.62%  main.B_expect_3_64
  18000000  1.63% 99.91%   18000000  1.63%  main.A_expect_1_82
   1000000  0.09%   100% 1105000000   100%  main.main
         0     0%   100% 1105000000   100%  runtime.main
```
We use pprof to collect production CPU profiles at Uber. On one of our Go microservices, when we used the default OS timer-based CPU profiles, we saw that mutex.Lock() accounted for 15.79% and mutex.Unlock() accounted for 13.55% of the CPU time. We knew from the program structure that mutex.Lock() and mutex.Unlock() were invoked in pairs, one after another, and also that mutex.Lock() is significantly more heavyweight than mutex.Unlock(); thus, we were expecting much less time to be spent inside mutex.Unlock() relative to mutex.Lock(). The OS-timer profiling data did not match our expectations. However, when we used a prototype of the PMU-based profiler and sampled the CPU cycles event, the quantification changed and matched our expectations: mutex.Lock() accounted for 33.36% of the CPU cycles and mutex.Unlock() accounted for 7.57% of the CPU cycles. The PMU profiles avoided an unnecessary performance investigation.
CPU profiling via PMUs results in richer-quality profiles -- high-resolution measurement, higher accuracy, and higher precision. Access to various events, such as cache misses, in PMUs allows better insights into the causes of complex performance issues. Any kind of PMU event can be attributed to call stack samples, which is useful for providing actionable feedback to the developer. Incorporating the PMU facility in the runtime will allow the Go profiler to take full advantage of the progress happening in the hardware performance monitoring facilities available in modern CPUs. This framework will allow for independently developing more advanced profiling features, such as attributing high-latency data accesses to Go objects, which can help refactor code or data structures for better data locality [5, 6, 7].
```
The new PMU-based profiler with cycles event
-----------------------------------------------------
Sampling period in CPU cycles            Overhead
-----------------------------------------------------
24000000 (100 samples/sec)               1.036x
2400000  (1K samples/sec)                1.036x
240000   (10K samples/sec)               1.13x
24000    (100K samples/sec)              2.05x
```
```
The default OS-timer based profiler
-----------------------------------------------------
Sampling rate                            Overhead
-----------------------------------------------------
100 samples/sec                          1.036x
```
A reasonable sampling period of once every 2.4M CPU cycles on a 2.4GHz Intel Skylake machine (about 1000 samples/sec) offers a good trade-off between a high sampling rate and low overhead.
When using any other PMU event, such as cache misses, the sampling period needs to be more carefully curated. We’ll provide preset values for the preset events.
PMU profiling has a mild skid on deeply pipelined, multi-issue, out-of-order, superscalar CPU architectures due to the multiple instructions in flight in the pipeline. Thus, the program counter seen in the interrupt can be a handful of instructions away from the one generating the counter overflow. This "skid" is much smaller than the several milliseconds of skid in an OS timer. Moreover, modern PMUs have built-in features to eliminate such skid (e.g., Intel Precise Event Based Sampling (PEBS), AMD Instruction Based Sampling (IBS), and PowerPC Marked events (MRK)). Some PMUs can be configured so that they offer the exact program counter involved in the counter overflow. We have introduced profilePCPrecision for this reason.
There are a finite number of counters on any PMU. Counter multiplexing happens if the number of events exceeds the number of available PMU counter slots; in that case, the events run only part of the time.
The perf command-line tool is powerful and offers comparable solutions. However, it does not solve several important use cases:
1. The http endpoint (provided by the net/http/pprof package in Go) is the preferred, readily available, and scalable solution that works at enterprise scale. Making perf work in a similar fashion would lead to creating too many custom solutions.
2. pprof works as an in-situ profiler in the same address space as the process being profiled and hence does not require any special system privileges; in contrast, attaching perf to a running process requires changes to a system's security settings.
3. Although perf can collect call stack samples from a container, it fails to find the symbols of Go binaries located within a container to show clean reports (e.g., flame graphs). The workaround is non-trivial.

We considered retaining StartCPUProfile(io.Writer) and adding variadic ProfilingOptions to it, but any such change would be qualified as breaking Go1.X compatibility, so we went ahead and introduced a new API. We also considered passing an Options struct to the new API StartCPUProfileWithConfig; however, the usage looked complicated, with certain fields making sense only in certain circumstances, and we hence decided to use the functional option pattern. Finally, we considered using a map for active events at several places instead of an EV_MAX array; however, concurrent access to different array elements was superior to locking a shared map, and EV_MAX is small enough to search linearly for active events.

The design ensures compatibility with Go 1 since no existing public API signatures, interface signatures, or data structures are changed. We have introduced a new interface and a handful of new APIs in the pprof package.
Milind Chabbi from Uber Technologies will implement the proposal. The implementation will proceed as follows:
1. Incorporate the perf_events-related system calls.
2. Change the runtime and runtime/pprof packages to support PMU sampling, reflecting the call graph / workflow shown in Figure 2.
3. Make the changes to the testing and net/http/pprof packages.
4. Modify the pprof CPU profiling tests to use PMU cycles in addition to the OS timer (on supported platforms).

The implementation will be broken into a large checkin related to the runtime and runtime/pprof packages, a few small checkins related to the testing and net/http/pprof packages, and the tests accompanying the code. The implementation will align with the February-April 2020 development cycle and the May-July 2020 release cycle.
A prototype is available here.

Issue #21295 requests adding a -test.counters flag to the testing package. That proposal is orthogonal to this one, and it does not include a detailed description of how it intends to accomplish its goal. Simply starting and stopping a PMU counter by making a perf_events_open system call at the testing-package level will not accomplish the objective: the monitored functions can migrate from one M to another, and hence the changes must be made in the runtime, as described in this proposal. However, it is relatively easy to incorporate the requested feature by making use of the changes suggested in this proposal. We are open to discussing with the creators of #21295 to understand how we can incorporate their needs into this design.

Pengfei Su implemented a prototype of the design while interning at Uber Technologies.
Joshua Corbin helped design the public surface APIs.
This work was generously funded by Uber Technologies' Programming Systems Research team, and the team members provided feedback for improvement.
```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.70s, Total samples = 18060ms (269.37%)
Showing nodes accounting for 18060ms, 100% of 18060ms total
      flat  flat%   sum%        cum   cum%
    4210ms 23.31% 23.31%     4210ms 23.31%  main.f7
    2610ms 14.45% 37.76%     2610ms 14.45%  main.f2
    2010ms 11.13% 48.89%     2010ms 11.13%  main.f6
    1810ms 10.02% 58.91%     1810ms 10.02%  main.f10
    1780ms  9.86% 68.77%     1780ms  9.86%  main.f3
    1410ms  7.81% 76.58%     1410ms  7.81%  main.f1
    1310ms  7.25% 83.83%     1310ms  7.25%  main.f4
    1110ms  6.15% 89.98%     1110ms  6.15%  main.f5
    1110ms  6.15% 96.12%     1110ms  6.15%  main.f8
     700ms  3.88%   100%      700ms  3.88%  main.f9
```
```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.71s, Total samples = 17400ms (259.39%)
Showing nodes accounting for 17400ms, 100% of 17400ms total
      flat  flat%   sum%        cum   cum%
    3250ms 18.68% 18.68%     3250ms 18.68%  main.f2
    2180ms 12.53% 31.21%     2180ms 12.53%  main.f9
    2100ms 12.07% 43.28%     2100ms 12.07%  main.f1
    1770ms 10.17% 53.45%     1770ms 10.17%  main.f6
    1700ms  9.77% 63.22%     1700ms  9.77%  main.f5
    1550ms  8.91% 72.13%     1550ms  8.91%  main.f4
    1500ms  8.62% 80.75%     1500ms  8.62%  main.f8
    1440ms  8.28% 89.02%     1440ms  8.28%  main.f3
    1390ms  7.99% 97.01%     1390ms  7.99%  main.f10
     520ms  2.99%   100%      520ms  2.99%  main.f7
```
```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:48pm (PST)
Duration: 6.71s, Total samples = 17.73s (264.31%)
Showing nodes accounting for 17.73s, 100% of 17.73s total
      flat  flat%   sum%        cum   cum%
     3.74s 21.09% 21.09%      3.74s 21.09%  main.f7
     2.08s 11.73% 32.83%      2.08s 11.73%  main.f9
     2.05s 11.56% 44.39%      2.05s 11.56%  main.f2
     1.85s 10.43% 54.82%      1.85s 10.43%  main.f10
     1.78s 10.04% 64.86%      1.78s 10.04%  main.f1
     1.43s  8.07% 72.93%      1.43s  8.07%  main.f3
     1.42s  8.01% 80.94%      1.42s  8.01%  main.f8
     1.18s  6.66% 87.59%      1.18s  6.66%  main.f6
     1.17s  6.60% 94.19%      1.17s  6.60%  main.f5
     1.03s  5.81%   100%      1.03s  5.81%  main.f4
```
```
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:42pm (PST)
Duration: 501.51ms, Total samples = 320ms (63.81%)
Showing nodes accounting for 320ms, 100% of 320ms total
      flat  flat%   sum%        cum   cum%
      80ms 25.00% 25.00%       80ms 25.00%  main.H_expect_14_546
      80ms 25.00% 50.00%       80ms 25.00%  main.J_expect_18_18
      60ms 18.75% 68.75%       60ms 18.75%  main.G_expect_12_73
      20ms  6.25% 75.00%       20ms  6.25%  main.B_expect_3_64
      20ms  6.25% 81.25%       20ms  6.25%  main.D_expect_7_27
      20ms  6.25% 87.50%       20ms  6.25%  main.F_expect_10_91
      20ms  6.25% 93.75%       20ms  6.25%  main.I_expect_16_36
      10ms  3.12% 96.88%       10ms  3.12%  main.A_expect_1_82
      10ms  3.12%   100%       10ms  3.12%  main.C_expect_5_46
         0     0%   100%      320ms   100%  main.main
         0     0%   100%      320ms   100%  runtime.main
```
```
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:44pm (PST)
Duration: 501.31ms, Total samples = 320ms (63.83%)
Showing nodes accounting for 320ms, 100% of 320ms total
      flat  flat%   sum%        cum   cum%
      70ms 21.88% 21.88%       70ms 21.88%  main.I_expect_16_36
      50ms 15.62% 37.50%       50ms 15.62%  main.J_expect_18_18
      40ms 12.50% 50.00%       40ms 12.50%  main.E_expect_9_09
      40ms 12.50% 62.50%       40ms 12.50%  main.F_expect_10_91
      40ms 12.50% 75.00%       40ms 12.50%  main.H_expect_14_546
      30ms  9.38% 84.38%       30ms  9.38%  main.D_expect_7_27
      20ms  6.25% 90.62%       20ms  6.25%  main.B_expect_3_64
      20ms  6.25% 96.88%       20ms  6.25%  main.G_expect_12_73
      10ms  3.12%   100%       10ms  3.12%  main.C_expect_5_46
         0     0%   100%      320ms   100%  main.main
         0     0%   100%      320ms   100%  runtime.main
```
File: serial Type: cpu Time: Jan 27, 2020 at 1:45pm (PST) Duration: 501.39ms, Total samples = 310ms (61.83%) Showing nodes accounting for 310ms, 100% of 310ms total flat flat% sum% cum cum% 110ms 35.48% 35.48% 110ms 35.48% main.J_expect_18_18 70ms 22.58% 58.06% 70ms 22.58% main.G_expect_12_73 60ms 19.35% 77.42% 60ms 19.35% main.F_expect_10_91 30ms 9.68% 87.10% 30ms 9.68% main.I_expect_16_36 20ms 6.45% 93.55% 20ms 6.45% main.H_expect_14_546 10ms 3.23% 96.77% 10ms 3.23% main.B_expect_3_64 10ms 3.23% 100% 10ms 3.23% main.C_expect_5_46 0 0% 100% 310ms 100% main.main 0 0% 100% 310ms 100% runtime.main