| # Proposal: hardware performance counters for CPU profiling. |
| |
| Author(s): Milind Chabbi |
| |
| Last updated: Feb/13/2020 |
| |
| Discussion at https://golang.org/issues/36821. |
| |
| ## 1. Abstract |
| |
| The Go pprof CPU profiles are quite inaccurate and imprecise due to several |
| limitations related to the operating system (OS) interval timer used as its |
| sampling engine. |
| Inaccurate profiles can mislead performance investigation and may lead |
| to inappropriate optimization guidance. |
| We propose incorporating hardware |
| performance counters as an alternative sampling engine for CPU profiling. |
| Hardware performance counter-based sampling rectifies many of the problems |
| related to the OS timers. |
| The hardware CPU cycles event provides microsecond |
| measurement granularity, and it is available in almost all commodity CPUs. |
| Furthermore, hardware Performance Monitoring Units (PMUs) allow measuring not |
| only CPU cycles but other activities such as cache misses that help root cause |
| complex performance problems. |
| A [prototype implementation](https://github.com/uber-dev/go/tree/pmu_pprof) shows significantly |
| better-quality CPU profiles when using PMUs while maintaining a low measurement |
| overhead. |
| The proposal will retain the OS-timer as the default sampling engine. |
| The proposal exposes the PMU-based profiling feature on Linux-based systems, |
| which provide an excellent facility to program PMUs. |
| |
| |
| |
| ## Table of contents |
| - [1. Abstract](#1-abstract) |
| - [2. Background](#2-background) |
| - [2.1 High-level description of Go's CPU |
| profiling](#21-high-level-description-of-gos-cpu-profiling) |
| - [2.2 Desiderata and deficiencies of Go CPU |
| profiler](#22-desiderata-and-deficiencies-of-go-cpu-profiler) |
| - [2.3 Examples demonstrating inaccuracy and imprecision in Go |
| pprof profiles](#23-examples-demonstrating-inaccuracy-and-imprecision-in-go-pprof-profiles) |
| - [2.4 State-of-the-art in profiling using performance |
| monitoring units](#24-state-of-the-art-in-profiling-using-performance-monitoring-units) |
| - [2.5 Low-level details of the current Go CPU |
| profiler](#25-low-level-details-of-the-current-go-cpu-profiler) |
| - [3. Problem statement](#3-problem-statement) |
| - [4. Proposal](#4-proposal) |
| - [4.1 High-level design](#41-high-level-design) |
| - [4.2 The devil is in the |
| details](#42-the-devil-is-in-the-details) |
| - [4.3 Exposing PMU events via `runtime/pprof` |
| package](#43-exposing-pmu-events-via-runtimepprof-package) |
| - [4.4 The optional multiple CPU profiles |
| feature](#44-the-optional-multiple-cpu-profiles-feature) |
| - [4.5 The `testing` package](#45-the-testing-package) |
| - [4.6 The `net/http/pprof` |
| package](#46-the-nethttppprof-package) |
| - [5. Empirical evidence on the accuracy and precision of PMU |
| profiles](#5-empirical-evidence-on-the-accuracy-and-precision-of-pmu-profiles) |
| - [5.1 The PMU produces precise and accurate profiles for |
| concurrent programs](#51-the-pmu-produces-precise-and-accurate-profiles-for-concurrent-programs) |
| - [5.2 The PMU produces precise and accurate profiles for |
| serial programs](#52-the-pmu-produces-precise-and-accurate-profiles-for-serial-programs) |
| - [5.3 A practical use](#53-a-practical-use) |
| - [6. Rationale](#6-rationale) |
| - [6.1 Advantages](#61-advantages) |
| - [6.2 Disadvantages](#62-disadvantages) |
| - [6.3 Alternatives explored](#63-alternatives-explored) |
| - [7. Compatibility](#7-compatibility) |
| - [8. Implementation](#8-implementation) |
| - [9. Open issues](#9-open-issues) |
| - [10. Acknowledgment](#10-acknowledgment) |
| - [11. Appendix](#11-appendix) |
| - [12. References](#12-references) |
| |
| |
| ## 2. Background |
| |
| In this section, we first provide a high-level description of Go's CPU |
| profiling, motivate the problems in the current profiler with examples, |
| highlight the state-of-the-art in CPU profiling, and then dive deeper into the |
| current implementation of CPU profiling inside Go. |
| Our proposed changes appear |
| in the next section. |
| Readers familiar with profiling technology and the |
| internals of pprof may skip this section. |
| |
| ### 2.1 High-level description of Go’s CPU profiling |
| |
| Pprof, the de facto Go CPU profiler, employs sampling-based profiling for time |
| measurement. |
| It attributes the time metric to different functions, including |
| their calling context (aka call path or backtrace). |
| Pprof relies on the |
| underlying operating-system timers (exposed by the Go runtime) to periodically |
| interrupt a program’s execution. |
| The sampling rate is typically one hundred |
| hertz (a sample taken every 10ms); a different sampling rate can be configured, |
| but typical systems cannot deliver more than 250 hertz. |
| |
| |
| When profiling is enabled, the Go runtime scheduler sets up each OS thread in |
| the process to generate periodic timer interrupts. |
| In each interrupt, the go |
| runtime’s signal handler obtains a backtrace and appends it to a ring buffer. |
| All OS threads in a Go process append their backtraces to the same ring buffer |
| coordinated via a lock. |
| A side thread, launched by the `pprof` package, |
| periodically scrapes the backtraces from the ring buffer and serializes them |
| into a protobuf file to be used by the downstream tools. |
| |
| ### 2.2 Desiderata and deficiencies of Go CPU profiler |
| |
| *Accuracy* and *precision* are desirable properties of a good measurement tool. |
| |
| A profiling datum is said to be *accurate* if it is close to the ground truth |
| (aka true value). |
| For example, if `API_A()` consumes 25% of the total |
| execution time and a profiler attributes it 24% of the total execution time, |
| the measurement has 96% accuracy. |
| |
| Profiling data are said to be *precise* if there is low variability among |
| multiple measurements. |
| Precision of a set of measurements of the same variable is often |
| [expressed](https://www.wikihow.com/Calculate-Precision) either as min and max |
| from the sample mean or as [standard |
| error](https://en.wikipedia.org/wiki/Standard_error) or [coefficient of |
| variation](https://en.wikipedia.org/wiki/Coefficient_of_variation) of the |
| sample. |
| |
| Unfortunately, the current pprof CPU profiles meet neither of these two |
| criteria. |
| |
| |
| #### 2.2.1 Inaccuracy and its causes |
| |
| Inaccuracy of profiles arises from two sources: sample size and sampling bias. |
| |
| |
| #### Sample size |
| |
| By the [law of large |
| numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers), a larger sample |
| size makes the sample mean closer to the population mean. |
| Moreover, in |
| profiling, where the sample size is very small compared to the population size, |
| increasing the sample size greatly improves accuracy. |
| Conversely, a smaller |
| sample size leads to a lower inaccuracy. |
| |
| |
| Low-resolution OS-timer is the primary reasons behind small sample size in a |
| measurement window. |
| One can obtain more samples either by increasing the |
| length of a measurement window or by increasing the number of samples in a |
| measurement window. |
| The former works well for regular programs e.g., codes |
| dominated by loops; however, for irregular codes, such as microservices, a |
| larger measurement window does not rectify the measurement inaccuracy because |
| more irregularities can easily surface in a larger observation window. |
| Hence, |
| there is a dire need to collect more samples within a measurement window and |
| obtain fine-grained samples of sub-millisecond executions. |
| Furthermore, linearly scaling the measurement time to collect more samples is |
| impractical if orders of magnitude more samples are necessary to correct the |
| small samples size issue. |
| |
| #### Sampling bias |
| |
| A larger sample size alone does not improve accuracy if the samples are biased. |
| OS timer samples on some systems (e.g., Linux) are biased because a timer |
| interrupt from one OS thread, say `T1` may be handled by an arbitrary (not to |
| be confused with uniformly random) thread, say `T2`. |
| This means, `T2` will |
| handle the interrupt and incorrectly attribute the sample to its call stack, |
| which results in biased samples. |
| If more timer signals are handled by `T2` |
| compared to `T1`, a [systematic sampling |
| bias](https://en.wikipedia.org/wiki/Observational_error) will occur, which leads |
| to inaccurate profiles. |
| |
| |
| #### 2.2.1 Imprecision and its causes |
| |
| Imprecision in measurement comes from at least the following two sources: |
| sample size and measurement skid. |
| |
| #### Sample size |
| |
| Fewer number of samples directly contributes to a large [standard |
| error](https://en.wikipedia.org/wiki/Standard_error). |
| The low resolution of OS |
| timers is responsible for fewer samples in a measurement window, which results |
| in a lower precision of pprof profiles. |
| Conversely, a higher resolution of |
| samples will improve precision. |
| |
| #### Measurement skid |
| |
| The second source of measurement error is the [random |
| error](https://en.wikipedia.org/wiki/Observational_error) inherent in a |
| measurement apparatus. |
| An OS timer apparatus configured to expire after `N` milliseconds |
| is only guaranteed to generate an interrupt some time after `N` milliseconds -- |
| not precisely at `N` milliseconds. |
| This randomness introduces a large "skid" in |
| measurement. |
| Assume a periodic timer set to fire every 10ms. |
| Assume it has |
| 1ms left before its expiry when the OS scheduler with a 4ms resolution inspects |
| it. |
| This means the timer will fire 4ms later, which is 3ms after the scheduled |
| expiry time. |
| This results in up to 30% imprecision for a 10ms periodic timer. |
| Although random error may not be entirely eliminated, a superior measurement |
| apparatus reduces the effects of random error. |
| |
| ### 2.3 Examples demonstrating inaccuracy and imprecision in Go pprof profiles |
| |
| #### 2.3.1 A concurrent code |
| |
| Consider the Go program |
| [goroutine.go](https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go), |
| which has ten exactly similar goroutines `f1`-`f10`. |
| These functions are |
| computationally intensive, long-running, and are unaffected by IO or memory |
| hierarchy. |
| The functions are non-preemptive, and hence the Go runtime scheduler |
| overhead is zero. |
| It is expected that each function takes exactly the same |
| amount of CPU time (i.e., approximately 10% of the overall execution). |
| It |
| should not matter how many OS threads or CPU cores they run on. |
| |
| |
| Table 1 summarizes the result of `go tool pprof -top` for this program running |
| on a Linux OS on a 12-core Intel Skylake server-class machine. |
| The |
| measurements were taken three times under the same configuration. |
| The `Flat |
| (ms)` column shows the absolute millisecond measurement attributed to each of |
| the ten routines, and the `Flat (%)` column shows the relative time in a |
| routine with respect to the total execution time. |
| The `Expected (%)` column |
| shows the expected relative time in each routine. |
| The `Rank order` column |
| assigns ranks in the descending order of execution time. |
| The raw output is |
| available in the [appendix](#11-appendix). |
| |
|  *Table 1: demonstration of inaccuracy and |
| imprecision of pprof profiles using OS timers on a concurrent code.* |
| |
| First, let us focus on the data in a single run `RUN 1`. |
| `f1`-`f10` have a wide |
| variance in the time attributed to them; the expectation is that each of them |
| gets `10%` execution time. |
| There is up to 6x difference in the time attributed |
| to the routine with the highest amount of attribution (`main.f7` with 4210ms, |
| 23.31%) and the routine with the lowest amount of attribution (`main.f9` with |
| 700ms, 3.88%). |
| **This demonstrates the poor accuracy (deviation from the |
| ground truth) of pprof timer-based profiles.** |
| The issue reproduces when using different number of CPU cores. |
| |
| Now focus on the three runs `RUN 1-3` together . |
| The time attributed to any |
| routine widely varies from one run to another. |
| The rank order of the routines |
| changes significantly from run to run. |
| In Run 1, `main.f7` is shown to run for |
| 4210ms with the rank order of 1, whereas in Run 2, it is shown to run for only |
| 520ms with the rank order 10. |
| The expectation is that the measurements remain |
| the same from run to run. |
| From one run to another, function-level time |
| attribution are not precise; for example, there is a 71% coefficient of |
| variance among the three runs of the `f1` routine and the average coefficient |
| of variance across all functions is 28%. |
| **This demonstrates imprecision (unpredictability of |
| measurement) of pprof timer-based profiles.** |
| |
| For the curious readers, Table 2 below shows near-prefect accuracy and |
| precision when profiling the same code using the PMU-based profiler described in |
| this proposal. |
| We show two runs instead of three for |
| brevity. |
| Additional details appear in [Section |
| 5.1](#51-the-pmu-produces-precise-and-accurate-profiles-for-concurrent-programs) |
| |
|  *Table 2: demonstration of accuracy and |
| precision of pprof profiles via PMU CPU cycles event on a concurrent code.* |
| |
| #### 2.3.2 A serial code |
| |
| The issue is not specific to concurrent code alone. |
| Refer to a carefully |
| crafted Go program -- |
| [`serial.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go) |
| -- which has a single goroutine, and it has complete data |
| dependence from one statement to another. |
| It has ten functions |
| `A_expect_1_82`, `B_expect_3_64`, ... , `J_expect_18_18`. |
| The functions are |
| intentionally written to consume different amounts of time to make this test |
| case more representative. |
| Each function is similar and has a loop, but the |
| trip-count of the loop in each function is different. |
| The loop in function |
| with prefix `A` runs `1*C` times, where `C` is a constant. |
| The loop in function |
| with prefix `B` runs `2*C` times, and so on. |
| This arrangement leads to a |
| predictable relative execution time for each function; the function with prefix |
| `A` consumes `1/sum(1..10)=1.82%` of the total execution time; and the function |
| with prefix `B` consumes `2/sum(1..10)=3.64%` of the total execution time, and |
| so on until the function with prefix `J` consumes `10/sum(1..10)=18.18%` of the |
| total execution time. |
| |
| |
| The expected relative percent of CPU time is encoded in each function name, |
| `A_expect_1_82` should consume 1.82% execution time, `B_expect_3_64` should |
| consume 3.64% execution time, `J_expect_18_18` should consume 18.18% of CPU |
| time, and so on. |
| Table 3 summarizes the output of pprof CPU profiling for this |
| code. |
| As before, the measurements were taken three times under the same |
| configuration. |
| The `Flat (ms)` column shows the absolute millisecond |
| measurement attributed to each of the ten functions, and the `Flat (%)` column |
| shows the relative time with respect to the total time taken by all functions. |
| The `Expected (%)` column shows the expected relative time in each routine. |
| The |
| `Rank order` column assigns ranks in descending order of the execution time. |
| The raw output is available in the [appendix](#11-appendix). |
| |
| |
|  *Table 3: demonstration of inaccuracy and |
| imprecision of pprof profiles on a serial code.* |
| |
| A `-` in a cell represents zero samples. |
| First, let's focus on just one set of measurements from `RUN 1`. |
| Rather than |
| `J_expect_18_18` being the top time consumer with 18.18%, both `J_expect_18_18` |
| and `H_expect_14_546` are the top time consumers each with 25% CPU time. |
| `B_expect_3_64`, `D_expect_7_27`, `F_expect_10_91`, and `I_expect_16_36` are |
| all lumped into a single rank order of 4 each attributed with 6.25% of |
| execution time. |
| `E_expect_9_09` does not even appear in the profiles despite |
| its 9% expected execution time. |
| More inaccuracies are evident in the data in |
| the table. |
| If we inspect the data from `RUN 2` and `RUN 3`, we notice more |
| discrepancies where a lot of data is missing and incorrect. |
| The rank ordering |
| and the relative execution percentage both have very low correlation with the |
| ground truth. |
| Despite tens of milliseconds of execution time, some functions |
| are over counted and some functions are under counted. |
| |
| |
| Now, let's focus on the three runs together. |
| The highest time consumer |
| `J_expect_18_18` does not even get a single sample in `RUN 3`, indicating the |
| imprecision. |
| `I_expect_16_36`, which has the measurement data available in all |
| three cases varies from 20ms in `RUN 1` with a rank order of 4 to 70ms in `RUN 2` |
| with a rank order of 1. |
| More discrepancies are visible in the table. |
| Thus, |
| serial program also shows run to run measurement variation, and hence the |
| profiles are imprecise. |
| |
| |
| Table 4 below shows near perfect accuracy and precision of profiling the same |
| code using the PMU-based profiler described in this proposal. |
| The measurements |
| were taken two times instead of three for brevity. |
| Additional details appear |
| in [Section |
| 5.2](#52-the-pmu-produces-precise-and-accurate-profiles-for-serial-programs) |
| |
|  *Table 4: demonstration of accuracy and precision |
| of pprof profiles via PMU CPU cycles event on a serial code.* |
| |
| ### 2.4 State-of-the-art in profiling using performance monitoring units |
| |
| An alternative to the OS timer-based sampling is using hardware performance |
| counters, which are ubiquitous in modern commodity CPUs. |
| CPU’s Performance |
| Monitoring Units (PMUs) have the capability to count various events such as |
| cycles, instructions, cache misses, to name a few. |
| Furthermore, PMUs can be |
| configured to deliver an interrupt when an event reaches a specified threshold |
| value. |
| These PMU interrupts, like the OS timer interrupts, can serve as another |
| source of sampling the call stack. |
| This technique is employed in many modern |
| profilers [1,2,4]. |
| PMU sampling is called “event-based sampling,” where the |
| period (interval between two interrupts) is expressed as the number of |
| occurrences of an event, not the number of timer ticks. |
| As a side note, in |
| event-based sampling, the sampling overhead is proportional to the number of |
| events disregarding the length of execution, whereas in timer-based sampling, |
| the overhead is proportional to the length of execution. |
| If there are more |
| events in a short time span, then there is also more interest in observing |
| deeply in that region of execution and the event-based sampling takes more |
| samples. |
| More overhead is incurred in places where there is interest (more |
| events) and low/no overhead otherwise. |
| CPU cycles is one of the most common |
| PMU events, which happens periodically, similar to OS timers. |
| Hence, using CPU |
| cycles as a substitute for OS timers is very attractive. |
| |
| ### 2.5 Low-level details of the current Go CPU profiler |
| |
| We now provide the current implementation details of CPU profiling in Go, which |
| is necessary to understand the changes we propose in the next section. |
| The |
| description is relevant for most Unix-based systems, especially Linux, and most |
| of the discussion is irrelevant for Windows since it does not use signals for |
| handling CPU profiling. |
| |
| |
|  |
| |
| *Figure 1: A pictorial representation of the OS timer sampling workflow in the |
| existing implementation. |
| The first line of each record box is the function |
| name. |
| The remaining lines provide a textual description of the steps taken |
| by that function.* |
| |
| Currently, `pprof.StartCPUProfile(io.Writer)` is the entry point for starting |
| CPU profiling. |
| Analogously, `pprof.StopCPUProfile()` stops a running profile. |
| Two concurrent CPU profiling sessions are disallowed via a lock-protected |
| variable `cpu.profiling` in pprof. |
| The `pprof` package interacts with the |
| `runtime` package for starting/stopping profiling via |
| `runtime.SetCPUProfileRate(hz int)` API. |
| |
| There are two critical pieces of work done inside |
| `pprof.StartCPUProfile(io.Writer)`. |
| First, it invokes |
| `runtime.SetCPUProfileRate(hz=100)` asking the runtime to start profiling; the |
| rest of the profile generation will be handled by the `runtime` package; the |
| integer argument is the sampling rate in Hz (samples per second). |
| |
| Second, in order to collect the profiles generated by the runtime, |
| `pprof.StartCPUProfile(io.Writer)` launches a goroutine |
| `profileWriter(io.Writer)`. |
| `profileWriter`, |
| |
| 1. creates a protocol buffer via `newProfileBuilder`, |
| 2. goes into a sleep-wakeup loop calling `runtime.readProfile()`. |
| `runtime.readProfile()` provides a chunk of profiles collected inside the |
| runtime in a ring buffer (`eof` if the profiling stopped), and |
| 3. writes to the |
| sender end of a channel `cpu.done` once |
| `runtime.readProfile()` returns `eof`. |
| The other side of the channel will be |
| waited upon by `pprof.StopCPUProfile()`. |
| |
| |
| `runtime.SetCPUProfileRate(int)` is the entry point to start/stop profiling in |
| the `runtime` package. |
| It, |
| |
| 1. creates a ring buffer, `cpuprof.log`, |
| 2. appends profile header information |
| into the ring buffer, |
| 3. invokes `setcpuprofilerate(hz)`, which, |
| 1. calls `setThreadCPUProfileRate(hz=0)` to stop any on-going timer, |
| 2. invokes `setProcessCPUProfiler(hz)`, which installs the `SIGPROF` handler, and |
| 3. records the current sampling rate `hz` into the scheduler `sched.profilehz` to be |
| visible to other OS threads. |
| 4. calls `setThreadCPUProfileRate(hz)`, which uses |
| the OS timer such as (`setitimer`) to start the sampling engine. |
| This step is |
| performed only if the `hz` value is non-zero. |
| |
| |
| Inside the scheduler, the `execute()` method schedules a goroutine (also called |
| as as `G` in the Go runtime scheduler) to run on the current OS thread (also |
| called as `M` in the Go runtime scheduler). |
| This method compares the current |
| value of the scheduler-recorded `sched.profilehz` variable against its own |
| snapshot (`m.profilehz`) from the previously recorded value of |
| `sched.profilehz` and if they differ, it reconfigures its timer by calling |
| `setThreadCPUProfileRate(sched.profilehz)`. |
| This functionality sets an OS |
| timer on behalf of the current OS thread in the process. |
| |
| Once armed, the timer periodically delivers a `SIGPROF` signal, handled by |
| `sighandler(sig int, info *siginfo, ctxt unsafe.Pointer, gp *g)`, which in turn |
| invokes`sigprof()`, which, |
| |
| 1. obtains a backtrace, and |
| 2. appends the backtrace into the ring buffer `cpuprof.add()`. |
| |
| Occasionally, a sampled program counter may fall into non-Go space, or the ring |
| buffer may be full; such samples are simply counted, and backtrace is not taken. |
| |
| The ring buffer is coordinated via a `signalLock` for concurrency control. |
| The edges marked with “signalLock”, in Figure 1, show the operations that take |
| the lock to access the ring buffer. |
| |
| `pprof.StopCPUProfile()` follows a very similar path as |
| `pprof.StartCPUProfile(io.Writer)`. |
| When the `hz` argument is zero, the runtime |
| APIs described previously (e.g., `SetCPUProfileRate(hz)`) interpret it as the |
| end of profiling. |
| `pprof.StopCPUProfile()` performs the following steps. |
| |
| 1. calls`runtime.SetCPUProfileRate(0)` to stop profiling. |
| 2. waits on the `cpu.done` channel, which will be flagged by the side thread once `eof` is |
| reached in the runtime ring buffer. |
| 3. closes the supplied `io.Writer`. |
| |
| |
| `runtime.SetCPUProfileRate(hz)`, when called with `hz=0` invokes |
| `setcpuprofilerate(0)` and closes the ring buffer. |
| `setcpuprofilerate(0)` when |
| called with `hz=0`, |
| |
| 1. calls `setProcessCPUProfiler(0)`, which restores the previous signal handler |
| for `SIGPROF`. |
| 2. resets `sched.profilehz=0` indicating globally that the |
| profiling has stopped. |
| 3. calls `setThreadCPUProfileRate(0)` to stop sampling on the current OS thread. |
| |
| Subsequently, the other `M`s notice the change in the global `sched.profilehz` |
| and turn off their timers when they run their `execute()` method. |
| |
| The direct users of pprof are: |
| |
| 1. The `testing` package which can take a `-cpuprofile <file>` flag to collect |
| profiles. |
| 2. The `net/http/pprof` package which can collect cpu profiles on the |
| `/debug/pprof/profile` endpoint with an optional argument `?seconds=<N>`. |
| 3. Third party libraries such as `pkg/profile`. |
| |
| ## 3. Problem statement |
| |
| *Pprof CPU profiles are inaccurate because the profiling resolution is low and |
| sampling is biased in OS timers.* |
| |
| *Pprof CPU profiles are imprecise because the profiling resolution is low and |
| the samples have a large skid due to OS timers.* |
| |
| *Pprof CPU profiles measure only the time metric, additional events should be |
| monitored to help performance profiling.* |
| |
| ## 4. Proposal |
| |
| In this proposal, we address the concerns -- small sample size, sampling bias, |
| measurement error, and the need for more metrics -- which were raised in the |
| background section. |
| |
| **Sample size.** First, this proposal addresses the problem of low sampling |
| resolution of OS timers by using high-resolution samples from hardware |
| performance monitoring units (PMUs) to generate more samples. |
| While the OS-timers have ~10ms |
| resolution (that is one sample every 10ms), sampling CPU cycles via PMUs can |
| deliver (if desired) a sample every few microseconds -- a three order magnitude |
| finer resolution. |
| This fine-grained measurement results in increasing the |
| sample size, and as a result alleviates both inaccuracy and imprecision |
| problems. |
| While PMUs can deliver 100x more samples with just ~13% measurement |
| overhead, the alternative of lengthening a measurement window by 100x to |
| collect the same number of samples via the OS timers would be impractical. |
| |
| **Sampling bias.** Second, the proposal arranges each sample to be delivered to |
| the OS thread causing the counter overflow to eliminate the sampling bias, |
| which improves accuracy. |
| |
| **Random error (skid).** Third, PMU samples alleviate the large sampling skid |
| seen in OS timers because PMUs allow a low-skid or skid-free precise sample |
| attribution. |
| |
| **More metrics.** Finally, the PMU's capability to monitor various events such |
| as cache misses, TLB misses, branch mispredictions, inter-core cacheline |
| ping-ponging, to name a few, offers additional insights when diagnosing complex |
| performance issues. |
| |
| Implementing this idea calls for extending the Go runtime package’s CPU |
| profiling capabilities and making necessary adjustments in `runtime/pprof`, |
| `testing`, and `http/pprof` packages. |
| We first describe the details of the |
| changes to the `runtime` package and then describe the details of the `pprof`, |
| `testing`, and `net/http/pprof` packages, in that order. |
| |
| Our design brings the full capability of PMUs into the `runtime` package but |
| exposes only a small amount of it via the `pprof` package to users. |
| For |
| example, the runtime might be able to configure whether it wants to monitor the |
| CPU cycles spent inside the kernel or not; but these configurations will not be |
| exposed (at least in the initial design) to the users of `pprof` package. |
| |
| ### 4.1 High-level design |
| |
| In its simplest form, we substitute the start/stop of an OS interval timer with |
| the start/stop of a PMU event configured in the sampling mode; the sampling |
| mode delivers an interrupt each time a hardware counter overflows, which is |
| analogous to the OS timer interrupt. |
| The interrupt generated by the PMU counter |
| overflow follows the same signal handling path as the interval timer. |
| This |
| policy allows collecting and appending a backtrace on each PMU interrupt into a |
| ring buffer, just like the OS timer. |
| Moreover, like the OS timer, each |
| OS-thread (`M`) checks whether any changes were made to the PMU-based profiling |
| during each scheduling activity and performs the necessary adjustments to |
| reconfigure the PMU. |
| Finally, we retain `SIGPROF` as the signal type delivered |
| -- for both OS timer interrupts and a PMU counter overflow interrupts -- |
| instead of hijacking another signal number for PMU-based profiling. |
| |
| ### 4.2 The devil is in the details |
| |
| This seemingly straightforward approach poses implementation challenges |
| related to retaining the compatibility with interval timer and incorporating |
| additional PMU features. |
| The first part of this section addresses those |
| challenges. |
| Other than the design described in the previous subsection, the |
| rest of the details are closer to implementation details. |
| As before, the |
| discussion is only relevant to Linux. |
| |
| PMUs can count multiple CPU events simultaneously (e.g., cycles, instruction, |
| cache misses, etc.). |
| Our proposal does not preclude the runtime from |
| simultaneously enabling multiple PMU events or even PMU events alongside an OS |
| timer event. |
| In fact, certain events are best used when measured with others. |
| For example, |
| 1. Cache misses measured alongside cache hits enable computing cache miss ratio, |
| 2. Remote core cache line transfers measured alongside total cache misses |
| provides a view of the cause of cache misses. |
| |
| In our proposal, we will limit the maximum number of concurrent CPU events to |
| `_CPUPROF_EVENTS_MAX` (`EV_MAX` for short in the rest of this document). |
| We |
| will define a few common portable events for which the user does not have to |
| lookup within the CPU manufacturer’s manual for event codes. |
| This closely |
| resembles the preset events of `perf events` in Linux. |
| |
| In order to enable advanced users to exploit the full capability of the PMUs, |
| we define an opaque RAW event that allows the users to specify a model-specific |
| hexadecimal event number that can be sampled. |
| A motivating example for such a |
| raw event is the remote |
| [HITM](https://software.intel.com/sites/default/files/managed/8b/6e/335279_performance_monitoring_events_guide.pdf?ref=hvper.com) |
| event on Intel processors; this event counts the number of times a cache line |
| was loaded where the cache line was in the modified state in another remote |
| CPU’s cache. |
| This event helps diagnose problems such as cache line false |
| sharing; we have observed this event to be helpful in identifying false sharing |
| even inside the Go runtime. |
| |
| We have come up with the following initial set of PMU events. |
| These events will |
| be exposed by the `runtime` package to the `pprof` package. |
| The `pprof` package |
| will wrap these constant values with matching function names to be presented |
| later in the `pprof` package subsection. |
| Notice that the OS timer is one of the CPU events. |
| Last-level cache events are included to capture the behavior of the CPUs memory |
| hierarchy. |
| Branch events are included to capture potentially wasted CPU |
| cycles. |
| The design discussions can refine this list further (e.g., whether TLB events |
| are needed). |
| They are easy to grow based on requests. |
| |
| ``` |
| type cpuEvent int32 |
| |
| // These constants are a mirror of the runtime |
| // These constants are a mirror of the runtime |
| const ( _CPUPROF_OS_TIMER, _CPUPROF_FIRST_EVENT cpuEvent = iota, iota |
| _CPUPROF_HW_CPU_CYCLES, _CPUPROF_FIRST_PMU_EVENT |
| _CPUPROF_HW_INSTRUCTIONS cpuEvent = iota |
| _CPUPROF_HW_CACHE_REFERENCES |
| _CPUPROF_HW_CACHE_MISSES |
| _CPUPROF_HW_BRANCH_INSTRUCTIONS |
| _CPUPROF_HW_BRANCH_MISSES |
| _CPUPROF_HW_RAW |
| _CPUPROF_EVENTS_MAX |
| _CPUPROF_LAST_EVENT = _CPUPROF_EVENTS_MAX-1 |
| ) |
| ``` |
| |
| Since PMU support may not be available on all platforms in all settings, we |
| need to make it configurable in the Go runtime. |
| If an OS supports PMU-based |
| profiling (e.g., Linux), it will dynamically register two functions during the |
| runtime startup. |
| |
| |
| 1. `setThreadPMUProfilerFunc`: this is the OS-specific function used to |
| start/stop a PMU event analogous to `setThreadCPUProfile`. |
| If`setThreadPMUProfilerFunc` is `nil`, the runtime will not attempt to |
| start/stop PMU profiles even if it is requested to do so. |
| However, this |
| situation will not arise because `runtime/pprof` will guard against it. |
| 2. `sigprofPMUHandlerFunc`: this is the OS-specific function called from the |
| OS-specific signal handler when the PMU delivers an interrupt. |
| |
| Currently, each `M` (type `m`) and the scheduler (type `schedt`) maintains a |
| `profilehz` variable. |
| In lieu of `profilehz` we’ll introduce a runtime structure |
| `cpuProfileConfig struct `, which captures the essential configurations needed |
| to start/stop either a PMU counter or an OS timer. |
| `cpuProfileConfig` is |
| designed to reflect the underlying `perf_event_attr` struct in Linux, but being |
| an internal data structure we can modify it, if needed in the future, without |
| affecting any public-surface API. |
| |
| ``` |
| type cpuProfileConfig struct { |
| hz uint64 // the sampling rate, used only for OS_TIMER. |
| period uint64 // the sampling interval, used for PMU events. |
| rawEvent uint64 // the hexa-decimal value of the CPU-specific event. |
| // used only with _CPUPROF_HW_RAW cpuEvent type |
| preciseIP profilePCPrecision // the level of precision for the program counter in PMU samples. |
| isSampleIPIncluded bool // should the sample include the PC? |
| isSampleAddrIncluded bool // should the sample include the memory address involved? |
| isKernelIncluded bool // should the kernel count events in kernel mode? |
| isHvIncluded bool // should the kernel count events in hypervisor mode? |
| isIdleIncluded bool // should the kernel count events when idle task is running? |
| isSampleCallchainIncluded bool // should the sample report the call chain? |
| isCallchainKernelIncluded bool // should the sample include the kernel callchain? |
| isCallchainUserIncluded bool // should the sample include the user callchain? |
| } |
| ``` |
| |
| Some PMUs can be configured so that they can offer the |
| exact program counter involved in the counter overflow. |
| `profilePCPrecision` |
| dictates the level of precision needed from the PMU as described by the |
| constants below. |
| If a PMU is configured to provide the precise program counter for a sample, |
| we'll substitute the leaf-level program counter in the backtrace with the one |
| presented by the PMU. |
| Please refer to the [Section 6.2](#62-disadvantages) for PMU |
| precision. |
| |
| ``` |
| type profilePCPrecision uint8 |
| const ( |
| _CPUPROF_IP_ARBITRARY_SKID profilePCPrecision = iota |
| _CPUPROF_IP_CONSTANT_SKID |
| _CPUPROF_IP_SUGGEST_NO_SKID |
| _CPUPROF_IP_NO_SKID) |
| ``` |
| |
| `pprof` will set the default to `_CPUPROF_IP_ARBITRARY_SKID` to work across |
| architectures and event types. |
| However, we can easily change it to the highest |
| supported, by probing the CPU as done in |
| [certain](https://github.com/HPCToolkit/hpctoolkit/blob/master/src/tool/hpcrun/sample-sources/perf/perf_skid.c#L176) |
| profilers. |
| |
| `cpuProfileConfig` will be visible to `pprof` package but will not be exposed to |
| its clients. |
| The fields in `cpuProfileConfig` allow how a PMU event can be |
| customized. |
| Pprof will set the field values to the lowest common denominator |
| that is expected to work across many architectures and security settings. |
| For example, `isKernelIncluded` will most likely be `false`. |
| The runtime will be |
| (mostly) agnostic to these values and will pass them down to OS-specific layers |
| that will handle programming the PMUs. |
| |
| The `runtime` package will expose an API |
| `runtime_pprof_setCPUProfileConfig(cpuEvent, *cpuProfileConfig)`, only to |
| `pprof`. |
| This API is the equivalent of `SetCPUProfileRate(hz int)`. |
| Since |
| the legacy `SetCPUProfileRate(hz int)` cannot be dropped, we alter its internals |
| to wrap the `hz` value into a `cpuProfileConfig` structure and pass it down to |
| `runtime_pprof_setCPUProfileConfig` for a uniform treatment of any event whether |
| an OS-timer event or a PMU event. |
| |
| The `runtime` can be bifurcated into OS-agnostic and OS-specific parts. |
| |
| The OS-agnostic part of the runtime will have the following duties: |
| |
| 1. Keep track of the CPU events currently running, |
| 2. Handle new start/stop |
| invocations of the OS timer or PMU event, one at a time. |
| 3. Allocate and |
| coordinate access to the ring buffers associated with each of the running CPU |
| event. |
| 4. Reconfigure the CPU events for an `M` during its scheduling based on |
| whether its current snapshot is different from the global configuration stored |
| in the scheduler. |
| |
| The OS-specific part of runtime will have the following duties: |
| |
| 1. Install/restore profiling signal handler. |
| 2. Start/stop OS timer or PMU |
| events. |
| 3. Handle the profiling signals and pause the PMUs during signal |
| handling. |
| 4. Map an interrupt delivered to the appropriate active CPU profiling |
| event for recording the backtraces into an appropriate ring buffer. |
| |
| |
| Since there can be `CPUPROF_EVENTS_MAX` number of concurrent CPU profiles, we |
| need to change some of the runtime data structures that previously assumed a |
| single CPU profile. |
| Importantly, |
| |
| 1. We will change the `profilehz` field in `schedt struct` to be an array of |
| `cpuProfileConfig` pointers. |
| 2. We will change `profilehz` field in `m struct` |
| to be an array of `cpuProfileConfig` pointers. |
| 3. We will introduce an array of |
| opaque pointers `cpuProfileHandle` field in `m struct`. |
| This will be OS-specific |
| information stored in each `M`. |
| For example, on Linux, this will hold the |
| file handles obtained from `perf_event_open` system calls and the corresponding |
| `mmap` ring buffers where the kernel populates the sample data. |
| This information |
| will be used to close the resources once the profiling is stopped on an `M`. |
| |
| We’ll replace the `hz` argument used in various runtime APIs involved in CPU |
| profiling with the following two arguments. |
| |
| 1. `eventId cpuEvent` (described in the constants previously), and |
| 2. `profConfig *cpuProfileConfig`. |
| |
| Figure 2 shows the call graph for the new workflow. |
| We abbreviate `eventId` as |
| `i` and `cpuProfileConfig` as `p` in most places. |
| |
|  |
| *Figure 2: A pictorial representation of the OS timer-based and PMU-based |
| sampling workflow. |
| The first line of each record box is the function name. |
| The |
| remaining lines provide a textual description of the steps taken by that |
| function.* |
| |
| A `nil` value of `cpuProfileConfig` implies stop profiling for the specified |
| `cpuEvent`, and a `non-nil` value implies start profiling for the specified |
| `cpuEvent`. |
| At the runtime-level, `runtime_pprof_setCPUProfileConfig()` makes a |
| copy of the `cpuProfileConfig` passed from `pprof`, so that inside the runtime, |
| we can perform fast pointer comparison and rely on two large `cpuProfileConfig` |
| structs to be equal if pointers to them are equal. |
| |
| Starting/stopping a PMU counter is different from starting/stopping an OS timer. |
| Hence `if eventId==CPUPROF_OS_TIMER` checks will appear at a few places inside |
| the runtime’s profiling-related code. |
| |
| |
| A few concurrency aspects deserve a mention. |
| |
| 1. Each active event will create its own ring buffer on demand. |
| There will be a |
| maximum of `EV_MAX` ring buffers. |
| Each ring buffer will be created only if there |
| is a corresponding active event. |
| The size of each ring buffer is 16KB. |
| If all |
| `EV_MAX` events are active at the same time, it will result in `16KB * 8 = |
| 128KB` of ring buffers, which is paltry on modern machines. |
| 2. We’ll change the |
| `cpuprof` variable in `cpuprof.go` to be an array of `EV_MAX` size. |
| 3. |
| `runtime_pprof_setCPUProfileConfig` will hold a lock only for the `cpuprof[i]`, |
| where `i` is the event id for which profile has to be started or stopped. |
| 4. There will be a single `signalLock` rather than `EV_MAX` `signalLock`s. |
| This is |
| because the `setcpuprofileconfig` method in `cpuprof.go` needs to check (read) |
| all `prof[eventId].config`s to infer whether all events are stopped before |
| uninstalling the signal handler. |
| Taking `EV_MAX` locks one after another would |
| be no better than holding just one lock. |
| The consequence of this is that we |
| serialize simultaneous writes to two different ring buffers. |
| This should not be |
| a serious issue since there cannot be two concurrent `SIGPROF`s. |
| If a system |
| supports concurrent `SIGPROF` handling, and if there are a lot of threads in a |
| system, we can revisit this choice. |
| 5. In a very rare case when a `SIGPROF` is delivered for an already running |
| event when another event is being set (which will hold the `signalLock`), the |
| signal handler recognizes it and drops such sample. Otherwise a deadlock will |
| ensue. |
| |
| At the lowest level, we will incorporate the following three system calls into |
| all `runtime/sys_linux_<arch>.s` files. |
| |
| 1. `perf_event_open`, which is Linux’s means to program PMUs. |
| The Go signature |
| will be `func perfEventOpen(attr *perfEventAttr, pid uintptr, cpu, groupFd |
| int32, flags uintptr) int32` |
| 2. `ioctl`, which is needed to stop the PMU counter |
| when handling the `SIGPROF`, and reset and restart after handling the signal. |
| The |
| go signature will be `func ioctl(fd, req int32, arg uintptr) int32` |
| 3. `fcntl`, |
| which is necessary to arrange a `perf_event` file handle to deliver the |
| `SIGPROF` signal and also to setup the signal to be delivered exactly to the `M` |
| (OS thread) whose PMU counter overflew. |
| The go signature will be `func fcntl(fd, |
| cmd int32, arg uintptr) int32` |
| |
| #### 4.2.1 Segregating `SIGPROF` from different sampling sources |
| |
| All PMU events and the OS timer use the same `SIGPROF` signal to deliver their |
| interrupts. |
| Hence, the Unix signal handler needs to address the following |
| concerns. |
| |
| 1. Identify whether a signal was delivered as a result of the OS timer expiring |
| or a PMU counter overflowing. |
| This is necessary because the subsequent processing will be |
| different based on the event. |
| 2. Identify which PMU counter caused the interrupt, when multiple PMU events are being |
| sampled at the same time. |
| This is necessary because the subsequent |
| processing needs to append the backtraces into the buffer corresponding to the |
| PMU event that triggered the interrupt. |
| All PMU events do not go into the same |
| ring buffer. |
| |
| Although both OS timer expiry and any PMU counter overflow result in the |
| same `SIGPROF` signal, the signal codes are different. |
| The PMU counter overflow |
| generates a `POLL_IN` signal code indicating the presence of data whereas the OS |
| timer interrupts generate a `SI_KERNEL` or `SI_ALARM` signal code. |
| We use this |
| information to address #1 above. |
| |
| Once we know that the signal was generated due to some PMU counter overflowing, |
| we use the file descriptor `fd` passed in the signal info and match it against |
| the opened perf events file descriptors on the same `M` and use the matching |
| location to infer the PMU event `i` that caused the signal; the rest of the |
| signal handling makes use of this `i` to record the backtrace in the appropriate |
| ring buffer. |
| During the signal handling, we’ll pause all active PMU counters via `ioctl` |
| calls on the `fd` so that the signal handling is not counted towards the PMU |
| event counters. |
| |
| |
| ### 4.3 Exposing PMU events via `runtime/pprof` package |
| |
| We now turn our attention to the `runtime/pprof` package, which exposes the PMU |
| sampling to its clients. |
| There are several choices. |
| There is a tension between |
| revolutionary vs. evolutionary solution; there is friction between exposing the |
| full PMU functionality vs. offering the basic minimum. |
| In our proposal, we hope |
| to make pragmatic choices among myriad options. |
| |
| We propose adding a new API, which accepts different configurations needed to |
| start any CPU event -- OS timer and PMU counters. |
| The API signature is shown |
| below. |
| ``` |
| pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts ...ProfilingOption) error |
| ``` |
| There is an optional variadic `ProfilingOption` to allow |
| more than one event to be started during a profiling session. |
| |
| A `ProfilingOption` is an interface exposed by `runtime/pprof` as shown below. |
| |
| ``` |
| type ProfilingOption interface { apply() error } |
| ``` |
| |
| Pprof will implement the following profiling option functions that mirror |
| various `cpuEvent` constants previously presented. |
| |
| ``` |
| func OSTimer(w io.Writer) ProfilingOption |
| func CPUCycles(w io.Writer, period uint64) ProfilingOption |
| func CPUInstructions(w io.Writer, period uint64) ProfilingOption |
| func CPUCacheReferences(w io.Writer, period uint64) ProfilingOption |
| func CPUCacheMisses(w io.Writer, period uint64) ProfilingOption |
| func CPUBranchInstructions(w io.Writer, period uint64) ProfilingOption |
| func CPUBranchMisses(w io.Writer, period uint64) ProfilingOption |
| func CPURawEvent(w io.Writer, period uint64, hex uint64) ProfilingOption |
| ``` |
| |
| All of them consume an `io.Writer`. |
| The `OSTimer` accepts no other argument. |
| A |
| `period` accepted by all PMU events specifies the number of events that must |
| elapse between two consecutive interrupts. |
| A larger value means lower |
| granularity (the opposite of Hz). |
| Passing a `zero` value will result in using a |
| preset period. |
| `CPURawEvent` is special; it accepts any user-provided |
| hexadecimal PMU event code. |
| Notice that each event requires an `io.Writer` |
| since different profile types cannot be serialized into the same pprof protocol |
| buffer. |
| In the cases where multiple `ProfilingOption`s are passed to the |
| `pprof.StartCPUProfileWithConfig`, sanity checks are made to ensure the |
| uniqueness of each `cpuEvent` and each `io.Writer`. |
| |
| Below are a few example usages of the API. |
| |
| ``` |
| // Simple |
| StartCPUProfileWithConfig(OSTimer(w)) |
| StartCPUProfileWithConfig(CPUCycles(w, period)) |
| StartCPUProfileWithConfig(CPURawEvent(w, period, hexadecimalEvent)) |
| |
| // Advanced |
| StartCPUProfileWithConfig(CPUCacheReferences(w1, period1), CPUCacheMisses(w2, period2)) |
| ``` |
| |
| |
| The implementation of `StartCPUProfile(writer)` will simply invoke |
| `StartCPUProfileWithConfig(OSTimer(writer))`. |
| The implementation of |
| `StartCPUProfileWithConfig` will ensure a maximum of `EV_MAX` elements, unique |
| `io.Writers`, and unique `cpuEvents`. |
| After sanitizing the arguments, it will create |
| a `cpuProfileConfig` for each `cpuEvent` requested to be profiled. |
| In a global |
| data structure `cpu`, we will maintain the active events, their |
| `*cpuProfileConfig`, and the corresponding `io.Writer`. |
| |
| ``` |
| var cpu struct { |
| sync.Mutex |
| profiling bool |
| activeConfig [EV_MAX]*cpuProfileConfig |
| activeWriter [EV_MAX]io.Writer |
| eventName [EV_MAX]string |
| done chan bool |
| } |
| ``` |
| |
| Finally, `StartCPUProfileWithConfig` will invoke |
| `runtime_pprof_setCPUProfileConfig` multiple times once for each active event |
| to request the `runtime` to actually configure either the OS timer or the PMU |
| with a specific event to start sampling. |
| A lock will be held throughout the invocation of |
| `StartCPUProfileWithConfig` and hence no two |
| `StartCPUProfileWithConfig/StartCPUProfile` invocations run concurrently. |
| Once |
| a session of `StartCPUProfileWithConfig/StartCPUProfile` is active, another |
| profiling is disallowed until `StopCPUProfile` is invoked. |
| |
| Before successfully returning from `StartCPUProfileWithConfig`, we'll launch a |
| goroutine `profileWriter`, which will periodically wake up and scrape |
| runtime-produced stack samples from the ring buffers associated with each of |
| the active events and serialize them into the corresponding `io.Writer`s. |
| This follows the same design as the original pprof, but will add the |
| responsibility of collecting profiles from multiple events. Creating one |
| goroutine each to scrape every active event would be wasteful, a single |
| goroutine suffices to collect backtraces from all ring buffers. |
| |
| ### 4.4 The optional multiple CPU profiles feature |
| |
| Allowing multiple CPU profiles in a single profiling session via |
| `pprof.StartCPUProfileWithConfig(opt ProfilingOption, moreOpts |
| ...ProfilingOption)` API is a power-user feature. |
| To avoid any accidental |
| misuse, we'll protect it under an environmental variable |
| `GO_PPROF_ENABLE_MULTIPLE_CPU_PROFILES=<true|false>`, which will be disabled by |
| default. |
| |
| |
| ### 4.5 The `testing` package |
| |
| We’ll add the following two command line arguments to the testing package: |
| |
| 1. `-cpuprofileevent=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>` |
| |
| 2. `-cpuprofileperiod=<Int64Value>` |
| |
| The default value of `-cpuprofileevent` is `timer` with 100Hz when only |
| `-cpuprofile` is passed, which ensures compatibility with the current behavior. |
| `-cpuprofileperiod` will be ignored for `-cpuprofileevent=timer` and will |
| default to a 100Hz rate as is the case currently. |
| The testing package will map a |
| user-specified string for `-cpuprofileevent` value to the appropriate |
| `pprof.ProfilingOption`, the `-cpuprofileperiod` to an integer, and invoke |
| `pprof.StartCPUProfileWithConfig()`. |
| For example `-cpuprofileevent=cycles` and |
| `-cpuprofileperiod=100000` will result |
| in`pprof.StartCPUProfileWithConfig(CPUCycles(w, 100000)`. |
| |
| ### 4.6 The `net/http/pprof` package |
| |
| We’ll add the following two additional arguments in an `http` request exposed |
| by the `net/http/pprof` CPU profiling endpoint. |
| |
| 1. `event=<timer|cycles|instructions|cacheMisses|cacheReferences|branches|branchMisses|rHexValue>` |
| |
| 2. `period=<Int64Value>` |
| |
| `period` will be ignored for `event=timer` and will default |
| to a 100Hz rate as is the case currently. |
| The default value of `event` |
| is `timer` with 100Hz, which ensures compatibility with the current behavior. |
| The `net/http/pprof` package will map a user-specified string for `event` |
| value to an appropriate `pprof.ProfilingOption`, `period` to an |
| integer, and invoke `pprof.StartCPUProfileWithConfig`. |
| |
| ## 5. Empirical evidence on the accuracy and precision of PMU profiles |
| |
| ### 5.1 The PMU produces precise and accurate profiles for concurrent programs |
| |
| Below we show the `go tool pprof -top` output of the previously shown |
| [`goroutine.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go) |
| concurrent program run for two times. |
| The profiles used the proposed new API |
| with 1M sampling period: `pprof.StartCPUPrileWithConfig(CPUCycles(w, |
| 1000000))`. |
| The CPU cycles attributed to each routine `f1-f10` match the |
| expectation (approximately 10% of the total execution time) within one run and |
| across multiple runs. |
| *Thus the profiles are accurate and precise. |
| The |
| measurement overhead is indistinguishable from the OS timer in this setting.* |
| |
| |
| #### (PMU Cycles profile) goroutine.go/Run 1: |
| |
| ``` |
| File: goroutine |
| Type: cycles |
| Time: Jan 27, 2020 at 4:49pm (PST) |
| Showing nodes accounting for 234000000000, 100% of 234000000000 total |
| flat flat% sum% cum cum% |
| 23400000000 10.00% 10.00% 23400000000 10.00% main.f1 |
| 23400000000 10.00% 20.00% 23400000000 10.00% main.f10 |
| 23400000000 10.00% 30.00% 23400000000 10.00% main.f2 |
| 23400000000 10.00% 40.00% 23400000000 10.00% main.f3 |
| 23400000000 10.00% 50.00% 23400000000 10.00% main.f4 |
| 23400000000 10.00% 60.00% 23400000000 10.00% main.f5 |
| 23400000000 10.00% 70.00% 23400000000 10.00% main.f6 |
| 23400000000 10.00% 80.00% 23400000000 10.00% main.f7 |
| 23400000000 10.00% 90.00% 23400000000 10.00% main.f8 |
| 23400000000 10.00% 100% 23400000000 10.00% main.f9 |
| ``` |
| #### (PMU Cycles profile) goroutine.go/Run 2: |
| |
| ``` |
| File: goroutine |
| Type: cycles |
| Time: Jan 27, 2020 at 4:51pm (PST) |
| Showing nodes accounting for 234000000000, 100% of 234000000000 total |
| flat flat% sum% cum cum% |
| 23800000000 10.17% 10.17% 23800000000 10.17% main.f1 |
| 23500000000 10.04% 20.21% 23500000000 10.04% main.f7 |
| 23500000000 10.04% 30.26% 23500000000 10.04% main.f9 |
| 23400000000 10.00% 40.26% 23400000000 10.00% main.f10 |
| 23400000000 10.00% 50.26% 23400000000 10.00% main.f2 |
| 23400000000 10.00% 60.26% 23400000000 10.00% main.f4 |
| 23400000000 10.00% 70.26% 23400000000 10.00% main.f6 |
| 23400000000 10.00% 80.26% 23400000000 10.00% main.f8 |
| 23300000000 9.96% 90.21% 23300000000 9.96% main.f3 |
| 22900000000 9.79% 100% 22900000000 9.79% main.f5 |
| ``` |
| |
| ### 5.2 The PMU produces precise and accurate profiles for serial programs |
| |
| Below we show the `go tool pprof -top` output of the previously shown |
| [`serial.go`](https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go) |
| program run for two times. |
| We remind the reader that the expected relative |
| execution time is encoded in the function names themselves (`J_expect_18_18` |
| should consume 18.18% CPU cycles and so on). |
| The profiles used the proposed |
| new API with 1M sampling period: `pprof.StartCPUPrileWithConfig(CPUCycles(w, |
| 1000000))`. |
| The CPU cycles attributed to each routine match the expectation |
| within a run and across multiple runs. |
| *Thus the profiles are accurate and |
| precise.* |
| |
| |
| #### (PMU Cycles profile) serial.go/Run 1: |
| |
| ``` |
| File: serial |
| Type: cycles |
| Time: Jan 27, 2020 at 4:54pm (PST) |
| Showing nodes accounting for 1105000000, 100% of 1105000000 total |
| flat flat% sum% cum cum% |
| 200000000 18.10% 18.10% 200000000 18.10% main.J_expect_18_18 |
| 183000000 16.56% 34.66% 183000000 16.56% main.I_expect_16_36 |
| 165000000 14.93% 49.59% 165000000 14.93% main.H_expect_14_546 |
| 137000000 12.40% 61.99% 137000000 12.40% main.G_expect_12_73 |
| 120000000 10.86% 72.85% 120000000 10.86% main.F_expect_10_91 |
| 100000000 9.05% 81.90% 100000000 9.05% main.E_expect_9_09 |
| 82000000 7.42% 89.32% 82000000 7.42% main.D_expect_7_27 |
| 63000000 5.70% 95.02% 63000000 5.70% main.C_expect_5_46 |
| 37000000 3.35% 98.37% 37000000 3.35% main.B_expect_3_64 |
| 18000000 1.63% 100% 18000000 1.63% main.A_expect_1_82 |
| 0 0% 100% 1105000000 100% main.main |
| 0 0% 100% 1105000000 100% runtime.main |
| ``` |
| |
| #### (PMU Cycles profile) serial.go/Run 2: |
| |
| ``` |
| File: serial |
| Type: cycles |
| Time: Jan 27, 2020 at 4:54pm (PST) |
| Showing nodes accounting for 1105000000, 100% of 1105000000 total |
| flat flat% sum% cum cum% |
| 200000000 18.10% 18.10% 200000000 18.10% main.J_expect_18_18 |
| 183000000 16.56% 34.66% 183000000 16.56% main.I_expect_16_36 |
| 159000000 14.39% 49.05% 159000000 14.39% main.H_expect_14_546 |
| 142000000 12.85% 61.90% 142000000 12.85% main.G_expect_12_73 |
| 119000000 10.77% 72.67% 119000000 10.77% main.F_expect_10_91 |
| 100000000 9.05% 81.72% 100000000 9.05% main.E_expect_9_09 |
| 82000000 7.42% 89.14% 82000000 7.42% main.D_expect_7_27 |
| 61000000 5.52% 94.66% 61000000 5.52% main.C_expect_5_46 |
| 40000000 3.62% 98.28% 40000000 3.62% main.B_expect_3_64 |
| 18000000 1.63% 99.91% 18000000 1.63% main.A_expect_1_82 |
| 1000000 0.09% 100% 1105000000 100% main.main |
| 0 0% 100% 1105000000 100% runtime.main |
| ``` |
| |
| ### 5.3 A practical use |
| |
| We use pprof to collect production CPU profiles at Uber. |
| On one of our Go microservices, when we used the default OS timer-based CPU |
| profiles, we saw that the `mutex.Lock()` accounted for 15.79% of the time and |
| `mutex.Unlock()` accounted for 13.55% of the CPU time. |
| We knew from the program structure that `mutex.Lock()` and `mutex.Unlock()` |
| were invoked in pairs one after another, and also that `mutex.Lock()` is |
| significantly more heavyweight compared to `mutex.Unlock()`, thus we were |
| expecting much lesser time being spent inside `mutex.Unlock()` relative to |
| `mutex.Lock()`. |
| The OS-timer profiling data was not matching our expectations. |
| However, when we used a prototype of the PMU-based profiler and sampled the CPU |
| cycles event, the quantification changed and matched our expectations; |
| `mutex.Lock()` accounted for 33.36% of the CPU cycles and `mutex.Unlock()` |
| accounted for 7.57% the CPU cycles. |
| The PMU profiles avoided an unnecessary performance investigation. |
| |
| |
| ## 6. Rationale |
| |
| ### 6.1 Advantages |
| |
| CPU profiling via PMUs result in richer quality profiles -- high-resolution |
| measurement, higher accuracy, and higher precision. |
| Access to various events |
| such as cache misses in PMUs allows better insights into understanding the |
| causes of complex performance issues. |
| Any kind of PMU event can be attributed |
| to call stack samples, which is useful to provide actionable feedback to the |
| developer. |
| Incorporating the PMU facility in the runtime will allow the Go |
| profiler to take full advantage of the progress happening in the hardware |
| performance monitoring facilities available in modern CPUs. |
| This framework will |
| allow for independently developing more advanced profiling features such as |
| attributing high-latency data accesses to Go objects, which can help refactor |
| code or data structures for better data locality [5, 6, 7]. |
| |
| ### 6.2 Disadvantages |
| |
| 1. PMU-based profiling may not work when a Go program is running under a |
| virtual machine, since virtual machines may not expose the hardware performance |
| counters to a guest OS. |
| Hence, the proposal keeps the OS-timer as the default |
| and allows PMU-based sampling available only when it is explicitly specified. |
| 2. A higher sampling rate (a shorter period between samples) obviously incurs |
| higher overhead. |
| For example, the default timer with a 100Hz sampling rate |
| introduces 3.6% measurement overhead on a 2.4GHz Intel Skylake machine. |
| Using |
| CPU cycles with a period of 24000000 cycles, which amounts to the same 100 |
| samples/second introduces the same 3.6% overhead. |
| The table below shows |
| overheads at other sampling periods using the PMU cycles and also compares |
| against the default OS-timer profiler overhead. |
| |
| ``` |
| The new PMU-based profiler with cycles event |
| ----------------------------------------------------- |
| Sampling period in CPU cycles Overhead |
| ----------------------------------------------------- |
| 24000000 (100 samples/sec) 1.036x |
| 2400000 (1K samples/sec) 1.036x |
| 240000 (10K samples/sec) 1.13x |
| 24000 (100K samples/sec) 2.05x |
| ``` |
| ``` |
| The default OS-timer based profiler |
| ----------------------------------------------------- |
| Sampling rate Overhead |
| ----------------------------------------------------- |
| 100 samples/sec 1.036x |
| ``` |
| |
| A reasonable sampling period, once in 2.4M CPU cycles on a 2.4GHz |
| Intel Skylake machine, offers a good trade off between high sampling rate at low |
| overhead. |
| |
| 3. When using any other PMU event, such as cache misses, the sampling period |
| needs to be more carefully curated. |
| We’ll provide preset values for the preset |
| events. |
| |
| 4. PMU profiling has a mild skid on deeply pipelined, multi-issue, |
| out-of-order, superscalar CPU architectures due to multiple instructions |
| flowing in the pipeline. |
| Thus the program counter seen in the interrupt can be |
| a handful of instructions off from the one generating the counter overflow. |
| This "skid" is much smaller than the several milliseconds of skid in an OS |
| timer. |
| Moreover, modern PMUs have built-in features to eliminate such skid |
| (e.g., Intel Precise Event Based Sampling ([PEBS](https://software.intel.com/sites/default/files/managed/8b/6e/335279_performance_monitoring_events_guide.pdf?ref=hvper.com)), AMD Instruction Based |
| Sampling ([IBS](https://developer.amd.com/wordpress/media/2012/10/AMD_IBS_paper_EN.pdf)), and PowerPC Marked events ([MRK](https://wiki.raptorcs.com/w/images/6/6b/POWER9_PMU_UG_v12_28NOV2018_pub.pdf)). |
| Some PMUs can be configured so |
| that they can offer the exact program counter involved in the counter overflow. |
| We have introduced `profilePCPrecision` for this reason. |
| |
| 5. There are finite counters on any PMU. Counter multiplexing happens if the |
| number of events is more than the number of available PMU counter slots. In |
| that case the events run only part of the time. |
| |
| ### 6.3 Alternatives explored |
| |
| 1. Linux `perf` commandline tool is powerful and offers comparable solutions. |
| However, it does not solve several important use cases. |
| 1. Profiling via an `http` endpoint (provided by `net/http/pprof` package in |
| Go) is the preferred, readily available, and scalable solution that works at |
| the enterprise scale. Making `perf` to work in a similar fashion would lead to |
| creating too many custom solutions. |
| 2. `pprof` works as an in-situ profiler working in the same address |
| space as the process being profiled and hence does not require any special |
| system privileges; in contrast, attaching `perf` to a running process requires |
| changes to a system’s security settings. |
| 3. We are also focusing on |
| containerized environments, which is widely used in large-scale Go deployments. |
| Although `perf` can collect call stack samples from a container, it fails to find |
| symbols of go binaries located within a container to show clean reports |
| (e.g., flame graphs). |
| The workaround is non-trivial. |
| 2. We explored library-level facilities to collect performance counters such as perf-util [\[3\]](https://github.com/hodgesds/perf-utils). |
| However, no library support will provide good profiling for Go since the runtime |
| can schedule a monitored function onto another thread, where the counters may not |
| be enabled. |
| The correct solution, hence, is to implement the hardware |
| performance counter facility inside the go runtime into the scheduler as we have |
| proposed here. |
| 3. We seriously contemplated against introducing a new pprof API |
| and tried to work with the existing API `StartCPUProfile(io.Writer)`. |
| We |
| considered adding variadic `ProfilingOptions` to it. |
| But any such change would |
| be qualified as breaking `Go1.X` and went ahead introducing a new API. |
| 4. We considered passing an `Options` struct to the new API |
| `StartCPUProfileWithConfig`. |
| However, the usage looked complicated with certain |
| fields making sense only in certain circumstances and hence decided to use the |
| functional option pattern. |
| 5. We considered having only one CPU profiling |
| event active at a time. |
| However, the experience has shown that this limits |
| performance analysis in several ways; and advanced performance tuning demands |
| correlating more than one event to find the root cause. |
| Furthermore, this richness |
| in Go profiling can serve to advance the state-of-the-art in profilers. |
| 6. On the implementation front, we considered using a `map` for active events at |
| several places instead of an `EV_MAX` array. |
| However, concurrent access to |
| different array elements was superior to locking a shared `map`. |
| Moreover, |
| `EV_MAX` is small enough to search linearly for active events. |
| |
| ## 7. Compatibility |
| |
| The design ensures [compatibility with Go1.0](https://golang.org/doc/go1compat) |
| since no existing public API signatures, interface signatures, or data |
| structures are changed. |
| We have introduced a new interface and a handful of new |
| APIs in the `pprof` package. |
| |
| ## 8. Implementation |
| |
| Milind Chabbi from Uber technologies will implement the proposal. |
| The |
| implementation will proceed as follows: |
| |
| 1. Add assembly language support for `perf_events` related system calls. |
| 2. Modify the `runtime`, `runtime/pprof`to support |
| PMU sampling to reflect the call graph / workflow shown in Figure 2. |
| 3. Expose the new facility via `testing` and `net/http/pprof` packages. |
| 4. Make changes to `pprof` CPU profiling tests to use PMU cycles in addition to |
| OS timer (on supported platforms). |
| 5. Add PMU profiling tests to run on supported platforms. |
| |
| The implementation will be broken into a large checkin related to the `runtime` |
| and `runtime/pprof` packages, a few small checkins related to `testing`, and |
| `net/http/pprof` packages, and the tests accompanying the code. |
| The |
| implementation will align with the February-April 2020 development cycle and |
| May-July 2020 release cycle. |
| |
| A prototype is available [here](https://github.com/uber-dev/go/tree/pmu_pprof). |
| |
| ## 9. Open issues |
| |
| 1. Supporting Windows and MacOS will be left for future work, but any support |
| from experts from those lands is appreciated. |
| 2. The issue [21295](https://github.com/golang/go/issues/21295) proposes |
| adding `-test.counters` flag to the testing package. That proposal is |
| orthogonal to this one. |
| There |
| isn’t a detailed description of how the proposal wants to accomplish it. |
| Simply |
| starting and stopping a PMU counter by making `perf_events_open` system call at |
| the testing package level will not accomplish the objective. |
| The monitored |
| functions can migrate from one `M` to another, and hence the changes must be |
| made in the runtime as described in this proposal. |
| However, it is relatively |
| easy to incorporate the requested feature by making use of changes suggested in |
| this proposal to accomplish the objective of issue #21295. |
| We are open to |
| discussing with the creators of #21295 to understand how we can incorporate |
| their needs into this design. |
| 3. The additions to the public APIs and naming |
| aesthetics can be refined with inputs from the experts and the community. |
| 4. Currently, we do not bubble up runtime failures to start a PMU profiler (same is |
| true for a timer). |
| Should we bubble up any failure to configure a PMU or |
| silently produce zero samples in the profiles? |
| 5. Should we introduce any throttling mechanism if the rate of events is too |
| high for a PMU event. |
| |
| ## 10. Acknowledgment |
| |
| Pengfei Su implemented a prototype of the design |
| while interning at Uber Technologies. |
| |
| Joshua Corbin helped design the public surface APIs. |
| |
| This work is generously funded by the Uber Technologies’ Programming Systems |
| Research team and the team members provided feedback for improvement. |
| |
| ## 11. Appendix |
| |
| ##### (OS timer profile) goroutine.go/run1: |
| |
| ``` |
| File: goroutine |
| Type: cpu |
| Time: Jan 27, 2020 at 3:45pm (PST) |
| Duration: 6.70s, Total samples = 18060ms (269.37%) |
| Showing nodes accounting for 18060ms, 100% of 18060ms total |
| flat flat% sum% cum cum% |
| 4210ms 23.31% 23.31% 4210ms 23.31% main.f7 |
| 2610ms 14.45% 37.76% 2610ms 14.45% main.f2 |
| 2010ms 11.13% 48.89% 2010ms 11.13% main.f6 |
| 1810ms 10.02% 58.91% 1810ms 10.02% main.f10 |
| 1780ms 9.86% 68.77% 1780ms 9.86% main.f3 |
| 1410ms 7.81% 76.58% 1410ms 7.81% main.f1 |
| 1310ms 7.25% 83.83% 1310ms 7.25% main.f4 |
| 1110ms 6.15% 89.98% 1110ms 6.15% main.f5 |
| 1110ms 6.15% 96.12% 1110ms 6.15% main.f8 |
| 700ms 3.88% 100% 700ms 3.88% main.f9 |
| ``` |
| |
| ##### (OS timer profile) goroutine.go/run2: |
| |
| ``` |
| File: goroutine |
| Type: cpu |
| Time: Jan 27, 2020 at 3:45pm (PST) |
| Duration: 6.71s, Total samples = 17400ms (259.39%) |
| Showing nodes accounting for 17400ms, 100% of 17400ms total |
| flat flat% sum% cum cum% |
| 3250ms 18.68% 18.68% 3250ms 18.68% main.f2 |
| 2180ms 12.53% 31.21% 2180ms 12.53% main.f9 |
| 2100ms 12.07% 43.28% 2100ms 12.07% main.f1 |
| 1770ms 10.17% 53.45% 1770ms 10.17% main.f6 |
| 1700ms 9.77% 63.22% 1700ms 9.77% main.f5 |
| 1550ms 8.91% 72.13% 1550ms 8.91% main.f4 |
| 1500ms 8.62% 80.75% 1500ms 8.62% main.f8 |
| 1440ms 8.28% 89.02% 1440ms 8.28% main.f3 |
| 1390ms 7.99% 97.01% 1390ms 7.99% main.f10 |
| 520ms 2.99% 100% 520ms 2.99% main.f7 |
| ``` |
| |
| ##### (OS timer profile) goroutine.go/Run 3: |
| |
| ``` |
| File: goroutine |
| Type: cpu |
| Time: Jan 27, 2020 at 3:48pm (PST) |
| Duration: 6.71s, Total samples = 17.73s (264.31%) |
| Showing nodes accounting for 17.73s, 100% of 17.73s total |
| flat flat% sum% cum cum% |
| 3.74s 21.09% 21.09% 3.74s 21.09% main.f7 |
| 2.08s 11.73% 32.83% 2.08s 11.73% main.f9 |
| 2.05s 11.56% 44.39% 2.05s 11.56% main.f2 |
| 1.85s 10.43% 54.82% 1.85s 10.43% main.f10 |
| 1.78s 10.04% 64.86% 1.78s 10.04% main.f1 |
| 1.43s 8.07% 72.93% 1.43s 8.07% main.f3 |
| 1.42s 8.01% 80.94% 1.42s 8.01% main.f8 |
| 1.18s 6.66% 87.59% 1.18s 6.66% main.f6 |
| 1.17s 6.60% 94.19% 1.17s 6.60% main.f5 |
| 1.03s 5.81% 100% 1.03s 5.81% main.f4 |
| ``` |
| |
| ##### (OS timer profile) serial.go/Run 1: |
| ``` |
| File: serial |
| Type: cpu |
| Time: Jan 27, 2020 at 1:42pm (PST) |
| Duration: 501.51ms, Total samples = 320ms (63.81%) |
| Showing nodes accounting for 320ms, 100% of 320ms total |
| flat flat% sum% cum cum% |
| 80ms 25.00% 25.00% 80ms 25.00% main.H_expect_14_546 |
| 80ms 25.00% 50.00% 80ms 25.00% main.J_expect_18_18 |
| 60ms 18.75% 68.75% 60ms 18.75% main.G_expect_12_73 |
| 20ms 6.25% 75.00% 20ms 6.25% main.B_expect_3_64 |
| 20ms 6.25% 81.25% 20ms 6.25% main.D_expect_7_27 |
| 20ms 6.25% 87.50% 20ms 6.25% main.F_expect_10_91 |
| 20ms 6.25% 93.75% 20ms 6.25% main.I_expect_16_36 |
| 10ms 3.12% 96.88% 10ms 3.12% main.A_expect_1_82 |
| 10ms 3.12% 100% 10ms 3.12% main.C_expect_5_46 |
| 0 0% 100% 320ms 100% main.main |
| 0 0% 100% 320ms 100% runtime.main |
| ``` |
| |
| |
| ##### (OS timer profile) serial.go/Run 2: |
| |
| ``` |
| File: serial |
| Type: cpu |
| Time: Jan 27, 2020 at 1:44pm (PST) |
| Duration: 501.31ms, Total samples = 320ms (63.83%) |
| Showing nodes accounting for 320ms, 100% of 320ms total |
| flat flat% sum% cum cum% |
| 70ms 21.88% 21.88% 70ms 21.88% main.I_expect_16_36 |
| 50ms 15.62% 37.50% 50ms 15.62% main.J_expect_18_18 |
| 40ms 12.50% 50.00% 40ms 12.50% main.E_expect_9_09 |
| 40ms 12.50% 62.50% 40ms 12.50% main.F_expect_10_91 |
| 40ms 12.50% 75.00% 40ms 12.50% main.H_expect_14_546 |
| 30ms 9.38% 84.38% 30ms 9.38% main.D_expect_7_27 |
| 20ms 6.25% 90.62% 20ms 6.25% main.B_expect_3_64 |
| 20ms 6.25% 96.88% 20ms 6.25% main.G_expect_12_73 |
| 10ms 3.12% 100% 10ms 3.12% main.C_expect_5_46 |
| 0 0% 100% 320ms 100% main.main |
| 0 0% 100% 320ms 100% runtime.main |
| ``` |
| |
| ##### (OS timer profile) serial.go/Run 3: |
| ``` |
| File: serial |
| Type: cpu |
| Time: Jan 27, 2020 at 1:45pm (PST) |
| Duration: 501.39ms, Total samples = 310ms (61.83%) |
| Showing nodes accounting for 310ms, 100% of 310ms total |
| flat flat% sum% cum cum% |
| 110ms 35.48% 35.48% 110ms 35.48% main.J_expect_18_18 |
| 70ms 22.58% 58.06% 70ms 22.58% main.G_expect_12_73 |
| 60ms 19.35% 77.42% 60ms 19.35% main.F_expect_10_91 |
| 30ms 9.68% 87.10% 30ms 9.68% main.I_expect_16_36 |
| 20ms 6.45% 93.55% 20ms 6.45% main.H_expect_14_546 |
| 10ms 3.23% 96.77% 10ms 3.23% main.B_expect_3_64 |
| 10ms 3.23% 100% 10ms 3.23% main.C_expect_5_46 |
| 0 0% 100% 310ms 100% main.main |
| 0 0% 100% 310ms 100% runtime.main |
| ``` |
| |
| |
| |
| ## 12. References |
| |
| 1. HPCToolkit: https://github.com/HPCToolkit/hpctoolkit |
| 2. Oracle studio: |
| https://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html |
| 3. Perf utils: https://github.com/hodgesds/perf-utils |
| 4. vTune |
| https://software.intel.com/en-us/vtune |
| 5. "Pinpointing Data Locality Problems |
| Using Data-Centric Analysis", Xu Liu and John Mellor-Crummey, 2011 International |
| Symposium on Code Generation and Optimization, April 2-6, 2011, Chamonix, |
| France. |
| 6. "A Data-centric Profiler for Parallel Programs", Xu Liu and John |
| Mellor-Crummey, The International Conference for High Performance Computing, |
| Networking, Storage and Analysis, November 17-22, 2013, Denver, Colorado, USA. |
| 7. "ScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel |
| Programs", Xu Liu and Bo Wu, The International Conference for High Performance |
| Computing, Networking, Storage and Analysis, Nov 15-20, 2015, Austin, Texas, |
| USA. |
| |