Author: Michael Knyszek
Updated: 8 February 2021
Go's tracing garbage collector runs concurrently with the application, and thus requires an algorithm to determine when to start a new cycle. In the runtime, this algorithm is referred to as the pacer. Until now, the garbage collector has framed this process as an optimization problem, utilizing a proportional controller to achieve a desired stopping-point (that is, the cycle completes just as the heap reaches a certain size) as well as a desired CPU utilization. While this approach has served Go well for a long time, the design has accrued many corner cases due to resolved issues, as well as a backlog of unresolved issues.
I propose redesigning the garbage collector's pacer from the ground up to capture the things it does well and eliminate the problems that have been discovered. More specifically, I propose:

1. Including all non-heap sources of GC work (goroutine stacks, globals) in pacing decisions.
2. Reframing the pacing problem as a search problem rather than an optimization problem.
3. Extending the hard heap goal to the worst-case heap size of the next GC cycle.

(1) will resolve long-standing issues with small heap sizes, allowing the Go garbage collector to scale down and act more predictably in general. (2) will eliminate the offset error present in the current design, will allow turning off mark assists almost entirely outside of exceptional cases, improving allocation latency, and will enable clearer designs for setting memory limits on Go applications. (3) will enable a smooth and consistent response to large changes in the live heap size with large GOGC values.
Since version 1.5 Go has had a tracing mark-sweep garbage collector (GC) that is able to execute concurrently with user goroutines. The garbage collector manages several goroutines called “workers” to carry out its task. A key problem in concurrent garbage collection is deciding when to begin, such that the work is complete “on time.” Timeliness, today, is defined by the optimization of two goals: the GC cycle should complete just as the heap reaches its target size (derived from GOGC), and the GC should consume a fixed target fraction of CPU while it is active.
These two goals are tightly related. If a garbage collection cycle starts too late, for instance, it may consume more CPU to avoid missing its target. If a cycle begins too early, it may end too early, resulting in GC cycles happening more often than expected.
Go's garbage collector sets a fixed target of 30% CPU utilization (25% from GC workers, with 5% from user goroutines donating their time to assist) while the GC is active. It also offers a parameter to allow the user to set their own memory use and CPU trade-off: GOGC. GOGC is a percent overhead describing how much additional memory (over the live heap) the garbage collector may use. A higher GOGC value indicates that the garbage collector may use more memory, setting the target heap size higher, and conversely a lower GOGC value sets the target heap size lower. The process of deciding when a garbage collection should start given these parameters has often been called “pacing” in the Go runtime.
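For example, under the current definition the target heap size is simply the live heap scaled by the GOGC overhead (the numbers below are purely illustrative):

$$ \text{target heap} = \text{live heap} \times \left(1 + \frac{GOGC}{100}\right), \qquad \text{e.g. } 10\ \text{MiB} \times \left(1 + \frac{100}{100}\right) = 20\ \text{MiB}. $$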
To attempt to reach its goals, Go's “pacer” utilizes a proportional controller to decide when to start a garbage collection cycle. The controller attempts to find the correct point to begin directly, given an error term that captures the two aforementioned optimization goals.
It's worth noting that the optimization goals are defined for some steady-state. Today, the steady-state is implicitly defined as: constant allocation rate, constant heap size, and constant heap composition (hence, constant mark rate). The pacer expects the application to settle on some average global behavior across GC cycles.
However, the GC is still robust to transient application states. When the GC is in some transient state, the pacer is often operating with stale information, and is actively trying to find the new steady-state. To avoid issues with memory blow-up, among other things, the GC makes allocating goroutines donate their time to assist the garbage collector, proportionally to the amount of memory that they allocate. This GC assist system keeps memory use stable in unstable conditions, at the expense of user CPU time and latency.
The GC assist system operates by dynamically computing an assist ratio. The assist ratio is the slope of a curve in the space of allocation time and GC work time, a curve that the application is required to stay under. This assist ratio is then used as a conversion factor between the amount a goroutine has allocated, and how much GC assist work it should do. Meanwhile, GC workers generate assist credit from the work that they do and place it in a global pool that allocating goroutines may steal from to avoid having to assist.
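To make the accounting concrete, here is a minimal sketch of how an assist ratio converts bytes allocated into scan work owed, and how credit banked by background workers offsets that debt. The types and names (`assistState`, `onAllocate`, and so on) are illustrative, not the runtime's actual structures:

```go
package main

import "fmt"

// assistState is a sketch of the state behind GC assists. The real runtime
// tracks this per-goroutine and in a central controller.
type assistState struct {
	assistRatio float64 // scan work (bytes) owed per byte allocated
	scanCredit  int64   // scan work (bytes) banked by background GC workers
}

// onAllocate models a goroutine allocating n bytes while the GC is active:
// it accrues scan-work debt at the assist ratio, pays what it can from the
// global credit pool, and returns the remaining debt the goroutine must
// work off by assisting the GC directly.
func (a *assistState) onAllocate(n int64) (assistWork int64) {
	debt := int64(float64(n) * a.assistRatio)
	steal := debt
	if steal > a.scanCredit {
		steal = a.scanCredit
	}
	a.scanCredit -= steal
	return debt - steal
}

func main() {
	// Hypothetical numbers: half a byte of scan work owed per byte allocated,
	// with 1 KiB of banked credit available.
	a := &assistState{assistRatio: 0.5, scanCredit: 1 << 10}
	fmt.Println(a.onAllocate(4096)) // 2048 bytes of debt, 1024 paid by credit -> 1024
}
```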
Since version 1.5, the pacer has gone through several minor tweaks and changes in order to resolve issues, usually adding special cases and making its behavior more difficult to understand, though resolving the underlying problem. Meanwhile, more issues have been cropping up that are diagnosed but more difficult to tackle in the existing design. Most of these issues are listed in the GC pacer meta-issue.
Even more fundamentally, the proportional controller at the center of the pacer is demonstrably unable to completely eliminate offset error in its scheduling, a well-known issue with proportional controllers.
Another significant motivator, beyond resolving latent issues, is that the Go runtime lacks facilities for dealing with finite memory limits. While the GOGC mechanism works quite well and has served Go for a long time, it falls short when there's a hard memory limit. For instance, a consequence of GOGC that often surprises new gophers coming from languages like Java is that if GOGC is 100, then Go really needs 2x more memory than the peak live heap size. The garbage collector will not automatically throttle itself as it approaches some memory limit, leading to out-of-memory errors. Conversely, users that know they'll have a fixed amount of memory up-front are unable to take advantage of it if their live heap is usually small. Users have taken to fooling the GC into thinking more memory is live than their application actually needs in order to let the application allocate more memory in between garbage collections. Simply increasing GOGC doesn't tend to work in this scenario either because of the previous problem: if the live heap spikes suddenly, a high GOGC will result in much more memory being used overall. See issue #42430 for more details.
The current pacer is not designed with these use-cases in mind.
For these definitions, let $\gamma$ stand in for GOGC, let $S_n$ be the bytes of goroutine stack space at the start of the mark phase of GC cycle $n$, let $G_n$ be the bytes of scannable globals for GC cycle $n$, and let $M_n$ be the bytes of heap memory marked live at the end of GC cycle $n$. There is some nuance to these definitions.
Firstly, $\gamma$ is used in place of GOGC because it makes the math easier to understand; the two are related by $\gamma = 1 + \frac{GOGC}{100}$.
Secondly, $S_n$ may vary throughout the sweep phase, but effectively becomes fixed once a GC cycle starts. Stacks may not shrink, only grow during this time, so there's a chance any value used by the runtime during a GC cycle will be stale. $S_n$ also includes space that may not be actively used for the stack. That is, if an 8 KiB goroutine stack is actually only 2 KiB high (and thus only 2 KiB is actually scannable), for consistency's sake the stack's height will be considered 8 KiB. Both of these estimates introduce the potential for skew. In general, however, stacks are roots in the GC and will be some of the first sources of work for the GC, so the estimate should be fairly close. If that turns out not to be true in practice, it is possible, though tricky, to track goroutine stack heights more accurately, though there must necessarily always be some imprecision because the actual scannable stack height is rapidly changing.
Thirdly, $G_n$ acts similarly to $S_n$. The amount of global variable space in a Go program can change because of the plugin package, though this action is relatively rare compared to a change in the size of stacks. Because of this rarity, I propose allowing a bit of skew. At worst (as we'll see later) the pacer will overshoot a little bit.
Lastly, $M_n$ is the amount of heap memory known to be live to the runtime the instant after garbage collection cycle $n$ completes. Intuitively, it is the bottom of the classic GC sawtooth pattern.
Like in the previous definition of the pacer, the runtime sets some target heap size for the GC cycle based on GOGC. Intuitively, this target is the heap size at the top of the classic GC sawtooth pattern.
The definition I propose is very similar, except it includes non-heap sources of GC work. Let $N_n$ be the heap goal for GC cycle $n$ (“N” stands for “Next GC”).
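Assuming the straightforward reading that the GOGC overhead is applied to the sum of the live heap, stacks, and globals, the old and proposed definitions can be sketched as:

$$ \text{old: } N_n = \gamma \cdot M_{n-1} \qquad\qquad \text{proposed: } N_n = \gamma \cdot (M_{n-1} + S_n + G_n) $$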
The old definition makes the assumption that non-heap sources of GC work are negligible. In practice, that is often not true, such as with small heaps. This definition says that we're trading off not just heap memory, but all memory that influences the garbage collector's CPU consumption.
From a philosophical standpoint wherein GOGC is intended to be a knob controlling the trade-off between CPU resources and memory footprint, this definition is more accurate.
This change has one large user-visible ramification: the default GOGC, in most cases, will use slightly more memory than before.
This change will inevitably cause some friction, but I believe the change is worth it. It unlocks the ability to scale down to heaps smaller than 4 MiB (the origin of this limit is directly tied to this lack of accounting). It also unlocks better behavior in applications with many or large goroutine stacks, or with very many globals. That GC work is now accounted for, leading to fewer surprises.
Unlike the current pacer, I propose that instead of finding the right point to start a GC such that the runtime reaches some target in the steady-state, the pacer instead search for a value that is more fundamental, though more indirect.
Before continuing I want to take a moment to point out some very fundamental and necessary assumptions made in both this design and the current pacer. Here, we are taking a “macro-economic” view of the Go garbage collector. The actual behavior of the application at the “micro” level is all about individual allocations, but the pacer is concerned not with the moment-to-moment behavior of the application. Instead, it concerns itself with broad aggregate patterns. And evidently, this abstraction is useful. Most programs are not wildly unpredictable in their behavior; in fact it's somewhat of a challenge to write a useful application that has non-trivially unpredictable memory allocation behavior, thanks to the law of large numbers. This observation is why it is useful to talk about the steady-state of an application at all.
The pacer concerns itself with two notions of time: the time it takes to allocate from the GC trigger point to the heap goal and the time it takes to find and perform all outstanding GC work. These are only notions of time because the pacer's job is to make them happen in the same amount of time, relative to a wall clock. Since in the steady-state the amount of GC work (broadly speaking) stays fixed, the pacer is then concerned with figuring out how early it should start such that it meets its goal. Because they should happen in the same amount of time, this question of “how early” is answered in “bytes allocated so far.”
So what's this more fundamental value? Suppose we model a Go program as follows: while a GC is active, the world is split in two: the application is either spending time on itself and potentially allocating, or on performing GC work. This model is “zero-sum,” in a sense: if more time is spent on GC work, then less time is necessarily spent on the application, and vice versa. Given this model, suppose we had two measures of program behavior during a GC cycle: how often the application is allocating, and how rapidly the GC can scan and mark memory. Note that these measures are not peak throughput. They are measures of the rates at which allocation and GC work actually happen during a GC. To give them a concrete unit, let's say they're bytes per CPU-second per core. The idea with this unit is to have some generalized, aggregate notion of this behavior, independent of available CPU resources. We'll see why this is important shortly. Let's call these rates $a$ (allocation) and $s$ (scan) respectively. In the steady-state, these rates aren't changing, so we can use them to predict when to start a garbage collection.
Coming back to our model, some amount of CPU time is going to go to each of these activities. Let's say our target GC CPU utilization in the steady-state is $u_t$. If $C$ is the number of CPU cores available and $\Delta t$ is some wall-clock time window, then $a \cdot (1 - u_t) \cdot C \cdot \Delta t$ bytes will be allocated and $s \cdot u_t \cdot C \cdot \Delta t$ bytes will be scanned in that window.
Notice that the ratio of “bytes allocated” to “bytes scanned” is constant in the steady-state in this model, because both $a$ and $s$ are constant. Let's call this ratio $r$. To make things a little more general, let's make $r$ also a function of utilization $u$, because part of the Go garbage collector's design is the ability to dynamically change CPU utilization to keep itself on-pace.
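Written out from the model above (using the $a$, $s$, $C$, and $\Delta t$ introduced in the last two paragraphs), the $C \Delta t$ terms cancel and the ratio is:

$$ r(u) = \frac{a \cdot (1 - u) \cdot C \cdot \Delta t}{s \cdot u \cdot C \cdot \Delta t} = \frac{a (1 - u)}{s\, u}. $$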
The big idea here is that this value, $r(u)$, is a conversion rate between these two notions of time.
Consider the following: in the steady-state, the runtime can perfectly back out the correct time to start a GC cycle, given that it knows exactly how much work it needs to do. Let $T_n$ be the trigger point for GC cycle $n$. Let $P_n$ be the size of the live scannable heap at the end of GC $n$. More precisely, $P_n$ is the subset of $M_n$ that contains pointers. Why include only pointer-ful memory? Because GC work is dominated by the cost of the scan loop, and each pointer that is found is marked; memory containing Go types without pointers is never touched, and so is totally ignored by the GC. Furthermore, this does include non-pointer words in pointer-ful memory, because scanning over those is a significant cost in GC, enough so that GC work is roughly proportional to it, not just to the number of pointer slots. In the steady-state, the size of the scannable heap should not change, so $P_n$ remains constant.
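As a sketch of that steady-state back-calculation (assuming the expected scan work for cycle $n$ is $P_{n-1} + S_n + G_n$, that is, last cycle's scannable heap plus this cycle's stacks and globals), the trigger point would be placed so that the runway to the goal matches the work, converted through $r$:

$$ T_n = N_n - r(u_t) \cdot (P_{n-1} + S_n + G_n). $$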
That's nice, but we don't know $r$ while the runtime is executing. And worse, it could change over time.
But if the Go runtime can somehow accurately estimate and predict $r$, then it can find a steady-state.
Suppose we had some prediction of $r$ for GC cycle $n$, called $r_n$. Then, our trigger condition is a simple extension of the formula above. Let $A$ be the size of the Go live heap at any given time. $A$ is thus monotonically increasing during a GC cycle, and then instantaneously drops at the end of the GC cycle. In essence, $A$ traces out the classic GC sawtooth pattern.
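Written out under the same assumptions as the sketch above, with $A$ the current heap size, the condition is to start GC cycle $n$ once:

$$ A \geq N_n - r_n \cdot (P_{n-1} + S_n + G_n). $$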
Note that this formula is in fact a condition and not a predetermined trigger point, like the trigger ratio. In fact, this formula could be transformed into the previous formula for $T_n$ were it not for the fact that $S_n$ and $G_n$ actively change during a GC cycle, since the rest of the values are constant for each GC cycle.
A big question remains: how do we predict $r_n$?
To answer that, we first need to determine how to measure $r$ at all. I propose a straightforward approximation: each GC cycle, take the amount of memory allocated, divide it by the amount of memory scanned, and scale the result from the actual GC CPU utilization to the target GC CPU utilization. Note that this scaling factor is necessary because we want our trigger to use an $r$ value that is at the target utilization, such that the GC is given enough time to use only that amount of CPU. This note is a key aspect of the proposal and will come up later.
What does this scaling factor look like? Recall that, because of our model, any value of $r$ has a $1-u$ factor in the numerator and a $u$ factor in the denominator. Scaling from one utilization to another is as simple as switching out factors.
Let $\hat{N}_n$ be the actual peak heap size at the end of GC cycle $n$ (as opposed to $N_n$, which is only a target). Let $u_n$ be the GC CPU utilization over cycle $n$ and $u_t$ be the target utilization. Altogether, the measurement combines the raw allocated-to-scanned ratio over the cycle with this utilization scaling factor.
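One way to write that measurement as a single formula, under the model above and assuming the “memory allocated” term is the heap growth from the trigger to the actual peak and the “memory scanned” term is the expected scan work, is:

$$ r_n = \frac{\hat{N}_n - T_n}{P_{n-1} + S_n + G_n} \cdot \frac{u_n}{1 - u_n} \cdot \frac{1 - u_t}{u_t}. $$

The last two factors are exactly the “switching out” of the $1-u$ and $u$ factors described above.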
Now that we have a way to measure $r$, we could use this value directly as our prediction. But I fear that using it directly has the potential to introduce a significant amount of noise, so smoothing over transient changes to this value is desirable. To do so, I propose using this measurement as the set-point for a proportional-integral (PI) controller.
The advantage of a PI controller over a proportional controller is that it guarantees that error will be driven to zero. Note that the current GC pacer has issues with offset error. It may also find the wrong point on the isocline of GC CPU utilization and peak heap size because the error term can go to zero even if both targets are missed. The disadvantage of a PI controller, however, is that it oscillates and may overshoot significantly on its way to reaching a steady value. This disadvantage could be mitigated by overdamping the controller, but I propose we tune it using the tried-and-tested standard Ziegler-Nichols method. In simulations (see the simulations section) this tuning tends to work well. It's worth noting that PI controllers (and PID controllers more generally) have many years of research and use behind them, and this design lets us take advantage of that and tune the pacer further if need be.
Why a PI controller and not a PID controller? PI controllers are simpler to reason about, and the derivative term in a PID controller tends to be sensitive to high-frequency noise. The advantage of the derivative term is a shorter rise time, but simulations show that the rise time is already roughly 1 GC cycle, so I don't think there's much reason to include it just yet. Adding the derivative term, though, is trivial once the rest of the design is in place, so the door is always open.
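As a rough illustration of the controller's shape (not the runtime's implementation; the gains, the error definition, and the way the measurement is fed in are placeholders, and tuning is its own problem):

```go
package main

import "fmt"

// piController is a minimal discrete proportional-integral controller. Each
// update nudges the output toward driving the error (setpoint minus
// measurement) to zero; the integral term is what eliminates the steady-state
// offset error that a purely proportional controller leaves behind.
type piController struct {
	kp, ki   float64 // proportional and integral gains (tuning is separate)
	integral float64 // accumulated error
}

func (c *piController) next(setpoint, measured float64) float64 {
	err := setpoint - measured
	c.integral += err
	return c.kp*err + c.ki*c.integral
}

func main() {
	// Hypothetical usage: smooth noisy per-cycle measurements of r into a
	// stable prediction. The gains and measurements below are arbitrary.
	c := &piController{kp: 0.5, ki: 0.1}
	measurements := []float64{0.9, 1.4, 1.1, 1.2, 1.15}
	prediction := 1.0
	for _, m := range measurements {
		prediction += c.next(m, prediction)
		fmt.Printf("prediction: %.3f\n", prediction)
	}
}
```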
By focusing on this value, we've now reframed the pacing problem as a search problem instead of an optimization problem. That raises a question: are we still reaching our optimization goals? And how do GC assists fit into this picture?
The good news is that we're always triggering for the right CPU utilization. Because $r_n$ is scaled for the target GC CPU utilization and $r_n$ is what picks the trigger, the pacer will naturally start GC cycles at a point that produces that utilization in the steady-state.
Following from this fact, there is no longer any reason for the target GC CPU utilization to be 30%. Originally, in the design for the current pacer, the target GC CPU utilization began at 25%, with GC assists always extending from that, so in the steady-state there would be no GC assists. However, because the pacer was structured to solve an optimization problem, it required feedback from both directions. That is, it needed to know whether it was acting too aggressively or not aggressively enough. This feedback could only be obtained by actually performing GC assists. But with this design, that's no longer necessary. The target CPU utilization can completely exclude GC assists in the steady-state with a mitigated risk of bad behavior.
As a result, I propose the target utilization be reduced once again to 25%, eliminating GC assists in the steady-state (provided the application isn't out-pacing the GC), and potentially improving application latency as a result.
This discussion of GC assists brings us to the existing issues around pacing decisions made while the GC is active (which I will refer to as the “GC assist pacer” below). For the most part, this system works very well, and is able to smooth over small hiccups in performance, due to noise from the underlying platform or elsewhere.
Unfortunately, there's one place where it doesn't do so well: the hard heap goal. Currently, the GC assist pacer prevents memory blow-up in pathological cases by ramping up assists once either the GC has found more work than it expected (i.e. the live scannable heap has grown) or the GC is behind and the application's heap size has exceeded the heap goal. In both of these cases, it sets a somewhat arbitrarily defined hard limit at 1.1x the heap goal.
The problem with this policy is that high GOGC values create the opportunity for very large changes in live heap size, because the GC has quite a lot of runway (consider an application with GOGC=51100 that has a steady-state live heap of size 10 MiB, where suddenly all the memory it allocates is live). In this case, the GC assist pacer is going to find all this new live memory and panic: the rate of assists will begin to skyrocket. This particular problem impedes the adoption of any sort of target heap size, or configurable minimum heap size. One can imagine a small live heap with a large target heap size as having a large effective GOGC value, so it reduces to exactly the same case.
To deal with this, I propose modifying the GC assist policy to set a hard heap goal of $\gamma \cdot N_n$. The intuition behind this goal is that if all the memory allocated in this GC cycle turns out to be live, the next GC cycle will end up using that much memory anyway, so we let it slide.
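As a small worked example under that $\gamma N_n$ form (illustrative numbers only):

$$ GOGC = 100 \Rightarrow \gamma = 2, \quad N_n = 20\ \text{MiB} \Rightarrow \text{hard goal} = \gamma N_n = 40\ \text{MiB}, \quad \text{versus } 1.1 \times N_n = 22\ \text{MiB today}. $$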
But this hard goal need not be used for actually pacing GC assists other than in extreme cases. In fact, it must not, because an assist ratio computed from this hard heap goal and the worst-case scan work turns out to be extremely loose, leading to the GC assist pacer consistently missing the heap goal in some steady-states.
So, I propose an alternative calculation for the assist ratio. I believe that the assist ratio must always pass through the heap goal, since otherwise there's no guarantee that the GC meets its heap goal in the steady-state (which is a fundamental property of the pacer in Go's existing GC design). However, there's no reason why the ratio itself needs to change dramatically when there's more GC work than expected. In fact, the preferable case is that it does not, because that lends itself to a much more even distribution of GC assist work across the cycle.
So, I propose that the assist ratio be an extrapolation of the current steady-state assist ratio, with the exception that it now includes non-heap GC work, as the rest of this document does.
This definition is intentionally roundabout. The assist ratio changes dynamically as the amount of GC work left decreases and the amount of memory allocated increases. This responsiveness is what allows the pacing mechanism to be so robust.
Today, the assist ratio is calculated by computing the remaining heap runway and the remaining expected GC work, and dividing the former by the latter. But of course, that's not possible if there's more GC work than expected, since then the assist ratio could go negative, which is meaningless.
So that's the purpose of defining the “max scan work” and “extrapolated runway”: these are worst-case values that are always safe to subtract from, such that we can maintain roughly the same assist ratio throughout the cycle (assuming no hiccups).
One minor detail is that the “extrapolated runway” needs to be capped at the hard heap goal to prevent breaking that promise, though in practice this cap will almost never be hit. The hard heap goal is such a loose bound that it's really only useful in pathological cases, but it's still necessary to ensure robustness.
A key point in this choice is that the GC assist pacer will only respond to changes in allocation behavior and scan rate, not changes in the size of the live heap. This point seems minor, but it means the GC assist pacer's function is much simpler and more predictable.
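To make the shape of that calculation concrete, here is a rough Go sketch; the parameter names, the exact extrapolation, and the clamping are my reading of the description above rather than actual runtime code:

```go
package pacer

// assistRatioSketch is a rough rendering of the proposed assist ratio
// calculation: an extrapolation of the steady-state ratio using worst-case
// quantities that are always safe to subtract from.
func assistRatioSketch(
	trigger, heapGoal, hardHeapGoal float64, // T_n, N_n, and the hard heap goal
	expectedScanWork float64, // steady-state expectation of total scan work
	maxScanWork float64, // worst case: everything scannable turns out to be live
	heapLive, doneScanWork float64, // current heap size and scan work done so far
) float64 {
	// The steady-state ratio always passes through the heap goal.
	steadyStateRatio := (heapGoal - trigger) / expectedScanWork

	// Extrapolate the runway so the same ratio would hold against the
	// worst-case scan work, but cap it at the hard heap goal so that promise
	// is never broken.
	extrapolatedRunway := steadyStateRatio * maxScanWork
	if trigger+extrapolatedRunway > hardHeapGoal {
		extrapolatedRunway = hardHeapGoal - trigger
	}

	// Divide remaining runway by remaining work. Using the worst-case values
	// keeps the ratio roughly stable across the cycle instead of swinging
	// wildly when more work than expected is discovered.
	remainingRunway := extrapolatedRunway - (heapLive - trigger)
	remainingWork := maxScanWork - doneScanWork
	if remainingWork <= 0 || remainingRunway <= 0 {
		return 0 // no runway (or no work) left; assists become maximal
	}
	return remainingRunway / remainingWork
}
```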
Not every problem listed in issue #42430 is resolved by this design, though many are.
Notable exclusions are:
(1) is difficult to resolve without special cases and arbitrary heuristics, and I think in practice it's OK; the system was fairly robust and will now be more robust to this kind of noise. That doesn't mean that it shouldn't be revisited, but it's not quite as big as the other problems, so I leave it outside the scope of this proposal.
(2) is also tricky and somewhat orthogonal. I believe the path forward there involves better scheduling of fractional GC workers, which are currently very loosely scheduled. This design has made me realize how important dedicated GC workers are to progress, and how GC assists are a back-up mechanism. I believe that the fundamental problem there lies with the fact that fractional GC workers don't provide that sort of consistent progress.
For (3) I believe we should remove idle GC workers entirely, which is why this document ignores them. Idle GC workers are extra mark workers that run if the application isn't utilizing all GOMAXPROCS worth of parallelism. The scheduler schedules “low priority” background workers on any additional CPU resources, and this ultimately skews utilization measurements in the GC, because as of today they're unaccounted for. Unfortunately, it's likely that idle GC workers have accidentally become necessary for the GC to make progress, so just pulling them out won't be quite so easy. I believe that needs a separate proposal given other large potential changes coming to the compiler and runtime in the near future, because there's unfortunately a fairly significant risk of bugs with doing so, though I do think it's ultimately an improvement. See the related issue for more details.
(4) is easy and I don't believe needs any deliberation. That is a bug we should simply fix.
For (5), I propose we retain the limits, translated to the current design. For reference, these limits are $0.95 (\gamma - 1)$ as the upper-bound on the trigger ratio, and $0.6 (\gamma - 1)$ as the lower-bound.
The upper bound exists to prevent ever starting the GC too late in low-activity scenarios. It may cause consistent undershoot, but prevents issues in GC pacer calculations by preventing the calculated runway from ever being too low. The upper-bound may need to be revisited when considering a configurable target heap size.
The lower bound exists to prevent the application from causing excessive memory growth due to floating garbage as the application's allocation rate increases. Before that limit was installed, it wasn't very easy for an application to allocate hard enough for that to happen. The lower bound probably should be revisited, but I leave that outside of the scope of this document.
To translate them to the current design, I propose we simply modify the trigger condition to include these limits. It's not important to put these limits in the rest of the pacer because it no longer tries to compute the trigger point ahead of time.
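A sketch of what the translated limits might look like, expressed as bounds on the trigger point implied by $r$ (the names and the exact placement of the clamp are illustrative only):

```go
package pacer

// triggerPoint is a sketch of the proposed trigger condition with the old
// trigger-ratio limits retained as bounds on the trigger point.
func triggerPoint(heapGoal, liveHeap, expectedScanWork, r, gamma float64) float64 {
	// Start the GC early enough that the predicted conversion ratio r says
	// the remaining runway matches the expected scan work.
	trigger := heapGoal - r*expectedScanWork

	// Retain the old limits: never trigger earlier than a 0.6*(gamma-1)
	// growth over the live heap, and never later than a 0.95*(gamma-1) growth.
	minTrigger := liveHeap * (1 + 0.6*(gamma-1))
	maxTrigger := liveHeap * (1 + 0.95*(gamma-1))
	if trigger < minTrigger {
		trigger = minTrigger
	}
	if trigger > maxTrigger {
		trigger = maxTrigger
	}
	return trigger
}

// shouldTrigger reports whether a GC cycle should start now, given the
// current heap size heapLive.
func shouldTrigger(heapLive, heapGoal, liveHeap, expectedScanWork, r, gamma float64) bool {
	return heapLive >= triggerPoint(heapGoal, liveHeap, expectedScanWork, r, gamma)
}
```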
The alternatives considered for this design basically boil down to its individual components.
For instance, I considered grouping stacks and globals into the current formulation of the pacer, but that adds another condition to the definition of the steady-state: stacks and globals do not change. That makes the steady-state more fragile.
I also considered a design that was similar, but computed everything in terms of an “effective” GOGC, and “converted” that back to GOGC for pacing purposes (that is, what would the heap trigger have been had the expected amount of live memory been correct?). This formulation is similar to how Austin formulated the experimental
SetMaxHeap API. Austin suggested I avoid this formulation because math involving GOGC tends to have to work around infinities. A good example of this is if
runtime.GC is called while GOGC is off: the runtime has to “fake” a very large GOGC value in the pacer. By using a ratio of rates that is more grounded in actual application behavior, the trickiness of this math is avoided.
I also considered not using a PI controller and just using the measured value directly, assuming it doesn't change across GC cycles, but that method is prone to noise.
This pacer redesign will be implemented by Michael Knyszek.
The purpose of the simulation infrastructure is to make the pacer, in general, more testable. This lets us write regression test cases based on real-life problems we encounter and confirm that they don't break going forward. Furthermore, with fairly large changes to the Go compiler and runtime in the pipeline, it's especially important to reduce the risk of this change as much as possible.
This change will not modify any user-visible APIs, but may have surprising effects on application behavior. The two “most” incompatible changes in this proposal are the redefinition of the heap goal to include non-heap sources of GC work, since that directly influences the meaning of GOGC, and the change in target GC CPU utilization. These two factors together mean that, by default and on average, Go applications will use slightly more memory than before. To obtain previous levels of memory usage, users may be required to manually tune GOGC lower, but the overall result should be more consistent, more predictable, and more efficient.
In order to show the effectiveness of the new pacer and compare it to the current one, I modeled both the existing pacer and the new pacer and simulated both in a variety of scenarios.
The code used to run these simulations and generate the plots below may be found at github.com/mknyszek/pacer-model.
The model of each pacer is fairly detailed, and takes into account most details like allocations made during a GC being marked. The one big assumption it makes, however, is that the behavior of the application while a GC cycle is running is perfectly smooth, such that the GC assist pacer is perfectly paced according to the initial assist ratio. In practice, this is close to true, but it's worth accounting for the more extreme cases. (TODO: Show simulations that inject some noise into the GC assist pacer.)
Another caveat with the simulations is the graph of the “R value” (that is, $r_n$) versus the “Alloc/scan ratio.” The latter is well-defined for all simulations (it's a part of the input), but the former is not a concept used in the current pacer. So for simulations of the current pacer, the “R value” is backed out from the trigger ratio: we know the runway and we know the expected scan work for the target utilization, so we can compute the $r_n$ that the trigger point encodes.
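For reference, backing the value out of the current pacer's trigger amounts to inverting the trigger sketch from earlier (same assumptions about which terms make up the expected scan work):

$$ r_n = \frac{N_n - T_n}{P_{n-1} + S_n + G_n}. $$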
Perfectly steady heap size.
The simplest possible scenario.
Jittery heap size and alloc/scan ratio.
A mostly steady-state heap with a slight jitter added to both live heap size and the alloc/scan ratio.
Small step in alloc/scan ratio.
This scenario demonstrates the transition between two steady-states that are not far from one another.
Large step in alloc/scan ratio.
This scenario demonstrates the transition between two steady-states that are further from one another.
Large step in heap size with a high GOGC value.
This scenario demonstrates the “high GOGC problem” described in the GC pacer meta-issue.
Oscillating alloc/scan ratio.
This scenario demonstrates an oscillating alloc/scan ratio. This scenario is interesting because it shows a somewhat extreme case where a steady-state is never actually reached for any amount of time. However, this is not a realistic scenario.
Large amount of goroutine stacks.
This scenario demonstrates the “heap amortization problem” described in the GC pacer meta-issue for goroutine stacks.
Large amount of global variables.
This scenario demonstrates the “heap amortization problem” described in the GC pacer meta-issue for global variables.