GC Pacer Redesign

Author: Michael Knyszek

Updated: 8 February 2021

Abstract

Go's tracing garbage collector runs concurrently with the application, and thus requires an algorithm to determine when to start a new cycle. In the runtime, this algorithm is referred to as the pacer. Until now, the garbage collector has framed this process as an optimization problem, utilizing a proportional controller to achieve a desired stopping-point (that is, the cycle completes just as the heap reaches a certain size) as well as a desired CPU utilization. While this approach has served Go well for a long time, it has accrued many corner cases due to resolved issues, as well as a backlog of unresolved issues.

I propose redesigning the garbage collector's pacer from the ground up to capture the things it does well and eliminate the problems that have been discovered. More specifically, I propose:

  1. Including all non-heap sources of GC work (stacks, globals) in pacing decisions.
  2. Reframing the pacing problem as a search problem,
  3. Extending the hard heap goal to the worst-case heap goal of the next GC,

(1) will resolve long-standing issues with small heap sizes, allowing the Go garbage collector to scale down and act more predictably in general. (2) will eliminate offset error present in the current design, will allow turning off GC assists in the steady-state, and will enable clearer designs for setting memory limits on Go applications. (3) will enable smooth and consistent response to large changes in the live heap size with large GOGC values.

Background

Since version 1.5 Go has had a tracing mark-sweep garbage collector (GC) that is able to execute concurrently with user goroutines. The garbage collector manages several goroutines called “workers” to carry out its task. A key problem in concurrent garbage collection is deciding when to begin, such that the work is complete “on time.” Timeliness, today, is defined by the optimization of two goals:

  1. The heap size relative to the live heap at the end of the last cycle, and
  2. A target CPU utilization for the garbage collector while it is active.

These two goals are tightly related. If a garbage collection cycle starts too late, for instance, it may consume more CPU to avoid missing its target. If a cycle begins too early, it may end too early, resulting in GC cycles happening more often than expected.

Go's garbage collector sets a fixed target of 30% CPU utilization (25% from GC workers, with 5% from user goroutines donating their time to assist) while the GC is active. It also offers a parameter to allow the user to set their own memory use and CPU trade-off: GOGC. GOGC is a percent overhead describing how much additional memory (over the live heap) the garbage collector may use. A higher GOGC value indicates that the garbage collector may use more memory, setting the target heap size higher, and conversely a lower GOGC value sets the target heap size lower. The process of deciding when a garbage collection should start given these parameters has often been called “pacing” in the Go runtime.

To attempt to reach its goals, Go's “pacer” utilizes a proportional controller to decide when to start a garbage collection cycle. The controller attempts to find the correct point to begin directly, given an error term that captures the two aforementioned optimization goals.

It's worth noting that the optimization goals are defined for some steady-state. Today, the steady-state is implicitly defined as: constant allocation rate, constant heap size, and constant heap composition (hence, constant mark rate). The pacer expects the application to settle on some average global behavior across GC cycles.

However, the GC is still robust to transient application states. When the GC is in some transient state, the pacer is often operating with stale information, and is actively trying to find the new steady-state. To avoid issues with memory blow-up, among other things, the GC makes allocating goroutines donate their time to assist the garbage collector, proportionally to the amount of memory that they allocate. This GC assist system keeps memory use stable in unstable conditions, at the expense of user CPU time and latency.

The GC assist system operates by dynamically computing an assist ratio. The assist ratio is the slope of a curve in the space of allocation time and GC work time, a curve that the application is required to stay under. This assist ratio is then used as a conversion factor between the amount a goroutine has allocated, and how much GC assist work it should do. Meanwhile, GC workers generate assist credit from the work that they do and place it in a global pool that allocating goroutines may steal from to avoid having to assist.

Motivation

Since version 1.5, the pacer has gone through several minor tweaks and changes in order to resolve issues, usually adding special cases and making its behavior more difficult to understand, though resolving the motivating problem. Meanwhile, more issues have been cropping up that are diagnosed but more difficult to tackle in the existing design. Most of these issues are listed in the GC pacer meta-issue.

Even more fundamentally, the proportional controller at the center of the pacer is demonstrably unable to completely eliminate error in its scheduling, a well-known issue with proportional-only controllers.

Another significant motivator, beyond resolving latent issues, is that the Go runtime lacks facilities for dealing with finite memory limits. While the GOGC mechanism works quite well and has served Go for a long time, it falls short when there‘s a hard memory limit. For instance, a consequence of GOGC that often surprises new gophers coming from languages like Java, is that if GOGC is 100, then Go really needs 2x more memory than the peak live heap size. The garbage collector will not automatically run more aggressively as it approaches some memory limit, leading to out-of-memory errors. Conversely, users that know they’ll have a fixed amount of memory up-front are unable to take advantage of it if their live heap is usually small. Users have taken to fooling the GC into thinking more memory is live than their application actually needs in order to let the application allocate more memory in between garbage collections. Simply increasing GOGC doesn't tend to work in this scenario either because of the previous problem: if the live heap spikes suddenly, GOGC will result in much more memory being used overall. See issue #42430 for more details.

The current pacer is not designed with these use-cases in mind.

Design

Definitions

\begin{aligned}
\gamma & = 1+\frac{GOGC}{100} \\
   S_n & = \textrm{bytes of memory allocated to goroutine stacks at the beginning of the mark phase of GC } n \\
   G_n & = \textrm{bytes of memory dedicated to scannable global variables at the beginning of the mark phase of GC } n \\
   M_n & = \textrm{bytes of memory marked live after GC } n
\end{aligned}

There is some nuance to these definitions.

Firstly, $\gamma$ is used in place of GOGC because it makes the math easier to understand.

Secondly, $S_n$ may vary throughout the sweep phase, but effectively becomes fixed once a GC cycle starts. Stacks may not shrink, only grow during this time, so there‘s a chance any value used by the runtime during a GC cycle will be stale. $S_n$ also includes space that may not be actively used for the stack. That is, if an 8 KiB goroutine stack is actually only 2 KiB high (and thus only 2 KiB is actually scannable), for consistency’s sake the stack's height will be considered 8 KiB. Both of these estimates introduce the potential for skew. In general, however, stacks are roots in the GC and will be some of the first sources of work for the GC, so the estimate should be fairly close. If that turns out not to be true in practice, it is possible, though tricky to track goroutine stack heights more accurately, though there must necessarily always be some imprecision because actual scannable stack height is rapidly changing.

Thirdly, $G_n$ acts similarly to $S_n$. The amount of global memory in a Go program can change while the application is running because of the plugin package. This action is relatively rare compared to a change in the size of stacks. Because of this rarity, I propose allowing a bit of skew. At worst (as we'll see later) the pacer will overshoot a little bit.

Lastly, $M_n$ is the amount of heap memory known to be live to the runtime the instant after a garbage collection cycle completes. Intuitively, it is the bottom of the classic GC sawtooth pattern.

Heap goal

Like in the previous definition of the pacer, the runtime sets some target heap size for the GC cycle based on GOGC. Intuitively, this target heap size is the targeted heap size at the top of the classic GC sawtooth pattern.

The definition I propose is very similar, except it includes non-heap sources of GC work. Let $N_n$ be the heap goal for GC $n$ (“N” stands for “Next GC”).

N_n = \gamma M_{n-1} + S_n + G_n

The old definition makes the assumption that non-heap sources of GC work are negligible. In practice, that is often not true, such as with small heaps. This definition says that we‘re trading off not just heap memory, but all memory that influences the garbage collector’s CPU consumption.

From a philospical standpoint wherein GOGC is intended to be a knob controlling the trade-off between CPU resources and memory footprint, this definition is more accurate.

This change has one large user-visible ramification: the default GOGC, in most cases, will use slightly more memory than before.

This change will inevitably cause some friction, but I believe the change is worth it. It unlocks the ability to scale down to heaps smaller than 4 MiB (the origin of this limit is directly tied to this lack of accounting). It also unlocks better behavior in applications with many, or large goroutine stacks, or very many globals. That GC work is now accounted for, leading to fewer surprises.

Deciding when to trigger a GC

Unlike the current pacer, I propose that instead of finding the right point to start a GC such that the runtime reaches some target in the steady-state, that the pacer instead searches for a value that is more fundamental, though more indirect.

Before continuing I want take a moment to point out some very fundamental and necessary assumptions made in both this design and the current pacer. Here, we are taking a “macro-economic” view of the Go garbage collector. The actual behavior of the application at the “micro” level is all about individual allocations, but the pacer is concerned not with the moment-to-moment behavior of the application. Instead, it concerns itself with broad aggregate patterns. And evidently, this abstraction is useful. Most programs are not wildly unpredictable in their behavior; in fact it's somewhat of a challenge to write a useful application that non-trivially has unpredictable memory allocation behavior, thanks to the law of large numbers. This observation is why it is useful to talk about the steady-state of an application at all.

The pacer concerns itself with two notions of time: the time it takes to allocate from the GC trigger point to the heap goal and the time it takes to find and perform all outstanding GC work. These are only notions of time because the pacer's job is to make them happen in the same amount of time, relative to a wall clock. Since in the steady-state the amount of GC work (broadly speaking) stays fixed, the pacer is then concerned with figuring out how early it should start such that it meets its goal. Because they should happen in the same amount of time, this question of “how early” is answered in “bytes allocated so far.”

So what‘s this more fundamental value? Suppose we model a Go program as such: the world is split in two while a GC is active: the application is either spending time on itself and potentially allocating, or on performing GC work. This model is “zero-sum,” in a sense: if more time is spent on GC work, then less time is necessarily spent on the application and vice versa. Given this model, suppose we had two measures of program behavior during a GC cycle: how often the application is allocating, and how rapidly the GC can scan and mark memory. Note that these measure are not peak throughput. They are a measure of the rates, in actuality, of allocation and GC work happens during a GC. To give them a concrete unit, let’s say they‘re bytes per cpu-seconds per core. The idea with this unit is to have some generalized, aggregate notion of this behavior, independent of available CPU resources. We’ll see why this is important shortly. Lets call these rates $a$ and $s$ respectively. In the steady-state, these rates aren't changing, so we can use them to predict when to start a garbage collection.

Coming back to our model, some amount of CPU time is going to go to each of these activities. Let's say our target GC CPU utilization in the steady-state is $u_t$. If $C$ is the number of CPU cores available and $t$ is some wall-clock time window, then $a(1-u_t)Ct$ bytes will be allocated and $s u_t Ct$ bytes will be scanned in that window.

Notice that ratio of “bytes allocated” to “bytes scanned” is constant in the steady-state in this model, because both $a$ and $s$ are constant. Let‘s call this ratio $r$. To make things a little more general, let’s make $r$ also a function of utilization $u$, because part of the Go garbage collector's design is the ability to dynamically change CPU utilization to keep it on-pace.

r(u) = \frac{a(1-u)Ct}{suCt}

The big idea here is that this value, $r(u)$ is a conversion rate between these two notions of time.

Consider the following: in the steady-state, the runtime can perfectly back out the correct time to start a GC cycle, given that it knows exactly how much work it needs to do. Let $T_n$ be the trigger point for GC cycle $n$. Let $P_n$ be the size of the live scannable heap at the end of GC $n$. More precisely, $P_n$ is the subset of $M_n$ that contains pointers. Why include only pointer-ful memory? Because GC work is dominated by the cost of the scan loop, and each pointer that is found is marked; memory containing Go types without pointers are never touched, and so are totally ignored by the GC. Furthermore, this does include non-pointers in pointer-ful memory, because scanning over those is a significant cost in GC, enough so that GC is roughly proportional to it, not just the number of pointer slots. In the steady-state, the size of the scannable heap should not change, so $P_n$ remains constant.

T_n = N_n - r(u_t)(P_{n-1} + S_n + G_n)

That‘s nice, but we don’t know $r$ while the runtime is executing. And worse, it could change over time.

But if the Go runtime can somehow accurately estimate and predict $r$ then it can find a steady-state.

Suppose we had some prediction of $r$ for GC cycle $n$ called $r_n$. Then, our trigger condition is a simple extension of the formula above. Let $A$ be the size of the Go live heap at any given time. $A$ is thus monotonically increasing during a GC cycle, and then instantaneously drops at the end of the GC cycle. In essence $A$ is the classic GC sawtooth pattern.

A \ge N_n - r_n(u_t)(P_{n-1} + S_n + G_n)

Note that this formula is in fact a condition and not a predetermined trigger point, like the trigger ratio. In fact, this formula could transform into the previous formula for $T_n$ if it were not for the fact that $S_n$ actively changes during a GC cycle, since the rest of the values are constant for each GC cycle.

A big question remains: how do we predict $r$?

To answer that, we first need to determine how to measure $r$ at all. I propose a straightforward approximation: each GC cycle, take the amount of memory allocated, divide it by the amount of memory scanned, and scale it from the actual GC CPU utilization to the target GC CPU utilization. Note that this scaling factor is necessary because we want our trigger to use an $r$ value that is at the target utilization, such that the GC is given enough time to only use that amount of CPU. This note is a key aspect of the proposal and will come up later.

What does this scaling factor look like? Recall that because of our model, any value of $r$ has a $1-u$ factor in the numerator and a $u$ factor in the denominator. Scaling from one utilization to another is as simple as switching out factors.

Let $\hat{A}_n$ be the actual peak live heap size at the end of a GC cycle (as opposed to $N_n$, which is only a target). Let $u_n$ be the GC CPU utilization over cycle $n$ and $u_t$ be the target utilization. Altogether,

r_{measured} \textrm{ for GC } n = \frac{\hat{A}_n - T_n}{M_n + S_n + G_n}\frac{(1-u_t)u_n}{(1-u_n)u_t}

Now that we have a way to measure $r$, we could use this value directly as our prediction. But I fear that using it directly has the potential to introduce a significant amount of noise, so smoothing over transient changes to this value is desirable. To do so, I propose using this measurement as the set-point for a proportional-integral (PI) controller. This means that the PI controller is always chasing whatever value was measured for each GC cycle. In a steady-state with only small changes, this means the PI controller acts as a smoothing function.

The advantage of a PI controller over a proportional controller is that it guarantees that steady-state error will be driven to zero. Note that the current GC pacer has issues with offset error. It may also find the wrong point on the isocline of GC CPU utilization and peak heap size because the error term can go to zero even if both targets are missed. The disadvantage of a PI controller, however, is that it oscillates and may overshoot significantly on its way to reaching a steady value. This disadvantage could be mitigated by overdamping the controller, but I propose we tune it using the tried-and-tested standard Ziegler-Nichols method. In simulations (see the simulations section) this tuning tends to work well. It's worth noting that PI (more generally, PID controllers) have a lot of years of research and use behind them, and this design lets us take advantage of that and tune the pacer further if need be.

Why a PI controller and not a PID controller? Firstly, PI controllers are simpler to reason about. Secondly, the derivative term in a PID controller tends to be sensitive to high-frequency noise, and the entire point here is to smooth over noise. Furthermore, the advantage of the derivative term is a shorter rise time, but simulations show that the rise time is roughly 1 GC cycle, so I don‘t think there’s much reason to include it. Adding the derivative term though is trivial once the rest of the design is in place, so the door is always open.

By focusing on this $r$ value, we've now reframed the pacing problem as a search problem instead of an optimization problem. That raises question: are we still reaching our optimization goals? And how do GC assists fit into this picture?

The good news is that we're always triggering for the right CPU utilization. Because $r$ being scaled for the target GC CPU utilization and $r$ picks the trigger, the pacer will naturally start at a point that will generate a certain utilization in the steady-state.

Following from this fact, there is no longer any reason to have the target GC CPU utilization be 30%. Originally, in the design for the current pacer, the target GC CPU utilization, began at 25%, with GC assists always extending from that, so in the steady-state there would be no GC assists. However, because the pacer was structured to solve an optimization problem, it required feedback from both directions. That is, it needed to know whether it was actinng too aggressively or not aggressively enough. This feedback could only be obtained by actually performing GC assists. But with this design, that's no longer necessary. The target CPU utilization can completely exclude GC assists in the steady-state with a mitigated risk of bad behavior.

As a result, I propose the target utilization be reduced once again to 25%, eliminating GC assists in the steady-state (that's not out-pacing the GC), and potentially improving application latency as a result.

Idle-priority background GC workers

Idle-priority GC workers are extra workers that run if the application isn‘t utilizing all GOMAXPROCS worth of parallelism. The scheduler schedules “low priority” background workers on any additional CPU resources, and this ultimately skews utilization measurements in the GC. In today’s pacer, they're somewhat accounted for. In general, the idle workers skew toward undershoot: because the pacer is not explicitly aware of the idle workers, GC cycles will complete sooner than it might expect. However if a GC cycle completes early, then in theory the current pacer will simply adjust. From its perspective, scanning and marking objects was just faster than expected.

The new proposed design must also account for them somehow. I propose including idle priority worker CPU utilization in both the measured utilization, $u_n$, and the target utilization, $u_t$, in each GC cycle. In this way, the only difference between the two terms remains GC assist utilization, and while idle worker CPU utilization remains stable, the pacer may effectively account for it.

Unfortunately this does mean idle-priority GC worker utilization becomes part of the definition of a GC steady-state, making it slightly more fragile. The good news is that that was already true. Due to other issues with idle-priority GC workers, it may be worth revisiting them as a concept, but they do appear to legitimately help certain applications, particularly those that spend a significant chunk of time blocked on I/O, yet spend some significant chunk of their time awake allocating memory.

Smoothing out GC assists

This discussion of GC assists brings us to the existing issues around pacing decisions made while the GC is active (which I will refer to as the “GC assist pacer” below). For the most part, this system works very well, and is able to smooth over small hiccups in performance, due to noise from the underlying platform or elsewhere.

Unfortunately, there‘s one place where it doesn’t do so well: the hard heap goal. Currently, the GC assist pacer prevents memory blow-up in pathological cases by ramping up assists once either the GC has found more work than it expected (i.e. the live scannable heap has grown) or the GC is behind and the application's heap size has exceeded the heap goal. In both of these cases, it sets a somewhat arbitrarily defined hard limit at 1.1x the heap goal.

The problem with this policy is that high GOGC values create the opportunity for very large changes in live heap size, because the GC has quite a lot of runway (consider an application with GOGC=51100 has a steady-state live heap of size 10 MiB and suddenly all the memory it allocates is live). In this case, the GC assist pacer is going to find all this new live memory and panic: the rate of assists will begin to skyrocket. This particular problem impedes the adoption of any sort of target heap size, or configurable minimum heap size. One can imagine a small live heap with a large target heap size as having a large effective GOGC value, so it reduces to exactly the same case.

To deal with this, I propose modifying the GC assist policy to set a hard heap goal of $\gamma N_n$. The intuition behind this goal is that if all the memory allocated in this GC cycle turns out to be live, the next GC cycle will end up using that much memory anyway, so we let it slide.

But this hard goal need not be used for actually pacing GC assists other than in extreme cases. In fact, it must not, because an assist ratio computed from this hard heap goal and the worst-case scan work turns out to be extremely loose, leading to the GC assist pacer consistently missing the heap goal in some steady-states.

So, I propose an alternative calculation for the assist ratio. I believe that the assist ratio must always pass through the heap goal, since otherwise there‘s no guarantee that the GC meets its heap goal in the steady-state (which is a fundamental property of the pacer in Go’s existing GC design). However, there‘s no reason why the ratio itself needs to change dramatically when there’s more GC work than expected. In fact, the preferable case is that it does not, because that lends itself to a much more even distribution of GC assist work across the cycle.

So, I propose that the assist ratio be an extrapolation of the current steady-state assist ratio, with the exception that it now include non-heap GC work as the rest of this document does.

That is,

\begin{aligned}
\textrm{max scan work} & = T_n + S_n + G_n \\
\textrm{extrapolated runway} & = \frac{N_n - T_n}{P_{n-1} + S_n + G_n} (T_n + S_n + G_n) \\
\textrm{assist ratio} & = \frac{\textrm{extrapolated runway}}{\textrm{max scan work}}
\end{aligned}

This definition is intentially roundabout. The assist ratio changes dynamically as the amount of GC work left decreases and the amount of memory allocated increases. This responsiveness is what allows the pacing mechanism to be so robust.

Today, the assist ratio is calculated by computing the remaining heap runway and the remaining expected GC work, and dividing the former by the latter. But of course, that‘s not possible if there’s more GC work than expected, since then the assist ratio could go negative, which is meaningless.

So that's the purpose defining the “max scan work” and “extrapolated runway”: these are worst-case values that are always safe to subtract from, such that we can maintain roughly the same assist ratio throughout the cycle (assuming no hiccups).

One minor details is that the “extrapolated runway” needs to be capped at the hard heap goal to prevent breaking that promise, though in practice this will almost. The hard heap goal is such a loose bound that it‘s really only useful in pathological cases, but it’s still necessary to ensure robustness.

A key point in this choice is that the GC assist pacer will only respond to changes in allocation behavior and scan rate, not changes in the size of the live heap. This point seems minor, but it means the GC assist pacer's function is much simpler and more predictable.

Remaining unanswered questions

Not every problem listed in issue #42430 is resolved by this design, though many are.

Notable exclusions are:

  1. Mark assists are front-loaded in a GC cycle.
  2. The hard heap goal isn't actually hard in some circumstances.
  3. Existing trigger limits to prevent unbounded memory growth.

(1) is difficult to resolve without special cases and arbitrary heuristics, and I think in practice it‘s OK; the system was fairly robust and will now be more so to this kind of noise. That doesn’t mean that it shouldn‘t be revisited, but it’s not quite as big as the other problems, so I leave it outside the scope of this proposal.

(2) is also tricky and somewhat orthogonal. I believe the path forward there involves better scheduling of fractional GC workers, which are currently very loosely scheduled. This design has made me realize how important dedicated GC workers are to progress, and how GC assists are a back-up mechanism. I believe that the fundamental problem there lies with the fact that fractional GC workers don't provide that sort of consistent progress.

For (3), I propose we retain the limits, translated to the current design. For reference, these limits are $0.95 (\gamma - 1)$ as the upper-bound on the trigger ratio, and $0.6 (\gamma - 1)$ as the lower-bound.

The upper bound exists to prevent ever starting the GC too late in low-activity scenarios. It may cause consistent undershoot, but prevents issues in GC pacer calculations by preventing the calculated runway from ever being too low. The upper-bound may need to be revisited when considering a configurable target heap size.

The lower bound exists to prevent the application from causing excessive memory growth due to floating garbage as the application‘s allocation rate increases. Before that limit was installed, it wasn’t very easy for an application to allocate hard enough for that to happen. The lower bound probably should be revisited, but I leave that outside of the scope of this document.

To translate them to the current design, I propose we simply modify the trigger condition to include these limits. It's not important to put these limits in the rest of the pacer because it no longer tries to compute the trigger point ahead of time.

Initial conditions

Like today, the pacer has to start somewhere for the first GC. I propose we carry forward what we already do today: set the trigger point at 7/8ths of the first heap goal, which will always be the minimum heap size. If GC 1 is the first GC, then in terms of the math above, we choose to avoid defining $M_0$, and instead directly define

\begin{aligned}
N_1 & = \textrm{minimum heap size} \\
T_1 & = \frac{7}{8} N_1 \\
P_0 & = 0
\end{aligned}

The definition of $P_0$ is necessary for the GC assist pacer.

Furthermore, the PI controller's state will be initialized to zero otherwise.

These choices are somewhat arbitrary, but the fact is that the pacer has no knowledge of the progam's past behavior for the first GC. Naturally the behavior of the GC will always be a little odd, but it should, in general, stabilize quite quickly (note that this is the case in each scenario for the simulations.

A note about CPU utilization

This document uses the term “GC CPU utilization” quite frequently, but so far has refrained from defining exactly how it‘s measured. Before doing that, let’s define CPU utilization over a GC mark phase, as it‘s been used so far. First, let’s define the mark phase: it is the period of wall-clock time between the end of sweep termination and the start of mark termination. In the mark phase, the process will have access to some total number of CPU-seconds of execution time. This CPU time can then be divided into “time spent doing GC work” and “time spent doing anything else.” GC CPU utilization is then defined as a proportion of that total CPU time that is spent doing GC work.

This definition seems straightforward enough but in reality it‘s more complicated. Measuring CPU time on most platforms is tricky, so what Go does today is an approximation: take the wall-clock time of the GC mark phase, multiply it by GOMAXPROCS. Call this $T$. Take 25% of that (representing the dedicated GC workers) and add total amount of time all goroutines spend in GC assists. The latter is computed directly, but is just the difference between the start and end time in the critical section; it does not try to account for context switches forced by the underlying system, or anything like that. Now take this value we just computed and divide it by $T$. That’s our GC CPU utilization.

This approximation is mostly accurate in the common case, but is prone to skew in various scenarios, such as when the system is CPU-starved. This fact can be problematic, but I believe it is largely orthogonal to the content of this document; we can work on improving this approximation without having to change any of this design. It already assumes that we have a good measure of CPU utilization.

Alternatives considered

The alternatives considered for this design basically boil down to its individual components.

For instance, I considered grouping stacks and globals into the current formulation of the pacer, but that adds another condition to the definition of the steady-state: stacks and globals do not change. That makes the steady-state more fragile.

I also considered a design that was similar, but computed everything in terms of an “effective” GOGC, and “converted” that back to GOGC for pacing purposes (that is, what would the heap trigger have been had the expected amount of live memory been correct?). This formulation is similar to how Austin formulated the experimental SetMaxHeap API. Austin suggested I avoid this formulation because math involving GOGC tends to have to work around infinities. A good example of this is if runtime.GC is called while GOGC is off: the runtime has to “fake” a very large GOGC value in the pacer. By using a ratio of rates that's more grounded in actual application behavior the trickiness of the math is avoided.

I also considered not using a PI controller and just using the measured $r$ value directly, assuming it doesn't change across GC cycles, but that method is prone to noise.

Justification

Pros:

  • The steady-state is now independent of the amount of GC work to be done.
  • Steady-state mark assist drops to zero if not allocating too heavily (a likely latency improvement in many scenarios) (see the “high GOGC” scenario in simulations).
  • GC amortization includes non-heap GC work, and responds well in those cases.
  • Eliminates offset error present in the existing design.

Cons:

  • Definition of GOGC has changed slightly, so a GOGC of 100 will use slightly more memory in nearly all cases.
  • $r$ is a little bit unintuitive.

Implementation

This pacer redesign will be implemented by Michael Knyszek.

  1. The existing pacer will be refactored into a form fit for simulation.
  2. A comprehensive simulation-based test suite will be written for the pacer.
  3. The existing pacer will be swapped out with the new implementation.

The purpose of the simulation infrastructure is to make the pacer, in general, more testable. This lets us write regression test cases based on real-life problems we encounter and confirm that they don‘t break going forward. Furthermore, with fairly large changes to the Go compiler and runtime in the pipeline, it’s especially important to reduce the risk of this change as much as possible.

Go 1 backwards compatibility

This change will not modify any user-visible APIs, but may have surprising effects on application behavior. The two “most” incompatible changes in this proposal are the redefinition of the heap goal to include non-heap sources of GC work, since that directly influences the meaning of GOGC, and the change in target GC CPU utilization. These two factors together mean that, by default and on average, Go applications will use slightly more memory than before. To obtain previous levels of memory usage, users may be required to tune down GOGC lower manually, but the overall result should be more consistent, more predictable, and more efficient.

Simulations

In order to show the effectiveness of the new pacer and compare it to the current one, I modeled both the existing pacer and the new pacer and simulated both in a variety of scenarios.

The code used to run these simulations and generate the plots below may be found at github.com/mknyszek/pacer-model.

Assumptions and caveats

The model of each pacer is fairly detailed, and takes into account most details like allocations made during a GC being marked. The one big assumption it makes, however, is that the behavior of the application while a GC cycle is running is perfectly smooth, such that the GC assist pacer is perfectly paced according to the initial assist ratio. In practice, this is close to true, but it's worth accounting for the more extreme cases. (TODO: Show simulations that inject some noise into the GC assist pacer.)

Another caveat with the simulation is the graph of “R value” (that is, $r_n$), and “Alloc/scan ratio.” The latter is well-defined for all simulations (it's a part of the input) but the former is not a concept used in the current pacer. So for simulations of the current pacer, the “R value” is backed out from the trigger ratio: we know the runway, we know the expected scan work for the target utilization, so we can compute the $r_n$ that the trigger point encodes.

Results

Perfectly steady heap size.

The simplest possible scenario.

Current pacer:

New pacer:

Notes:

  • The current pacer doesn't seem to find the right utilization.
  • Both pacers do reasonably well at meeting the heap goal.

Jittery heap size and alloc/scan ratio.

A mostly steady-state heap with a slight jitter added to both live heap size and the alloc/scan ratio.

Current pacer:

New pacer:

Notes:

  • Both pacers are resilient to a small amount of noise.

Small step in alloc/scan ratio.

This scenario demonstrates the transitions between two steady-states, that are not far from one another.

Current pacer:

New pacer:

Notes:

  • Both pacers react to the change in alloc/scan rate.
  • Clear oscillations in utilization visible for the new pacer.

Large step in alloc/scan ratio.

This scenario demonstrates the transitions between two steady-states, that are further from one another.

Current pacer:

New pacer:

Notes:

  • The old pacer consistently overshoots the heap size post-step.
  • The new pacer minimizes overshoot.

Large step in heap size with a high GOGC value.

This scenario demonstrates the “high GOGC problem” described in the GC pacer meta-issue.

Current pacer:

New pacer:

Notes:

  • The new pacer‘s heap size stabilizes faster than the old pacer’s.
  • The new pacer has a spike in overshoot; this is by design.
  • The new pacer's utilization is independent of this heap size spike.
  • The old pacer has a clear spike in utilization.

Oscillating alloc/scan ratio.

This scenario demonstrates an oscillating alloc/scan ratio. This scenario is interesting because it shows a somewhat extreme case where a steady-state is never actually reached for any amount of time. However, this is not a realistic scenario.

Current pacer:

New pacer:

Notes:

  • The new pacer tracks the oscillations worse than the old pacer. This is likely due to the error never settling, so the PI controller is always overshooting.

Large amount of goroutine stacks.

This scenario demonstrates the “heap amortization problem” described in the GC pacer meta-issue for goroutine stacks.

Current pacer:

New pacer:

Notes:

  • The old pacer consistently overshoots because it's underestimating the amount of work it has to do.
  • The new pacer uses more memory, since the heap goal is now proportional to stack space, but it stabilizes and is otherwise sane.

Large amount of global variables.

This scenario demonstrates the “heap amortization problem” described in the GC pacer meta-issue for global variables.

Current pacer:

New pacer:

Notes:

  • This is essentially identical to the stack space case.

High alloc/scan ratio.

This scenario shows the behavior of each pacer in the face of a very high alloc/scan ratio, with jitter applied to both the live heap size and the alloc/scan ratio.

Current pacer:

New pacer:

Notes:

  • In the face of a very high allocation rate, the old pacer consistently overshoots, though both maintain a similar GC CPU utilization.