design: add GC pacer redesign

For golang/go#44167.

Change-Id: I468aa78edb8588b4e48008ad44cecc08544a8f48
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/290489
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Jeremy Faller <jeremy@golang.org>
diff --git a/design/44167-gc-pacer-redesign.md b/design/44167-gc-pacer-redesign.md
new file mode 100644
index 0000000..8bb27f9
--- /dev/null
+++ b/design/44167-gc-pacer-redesign.md
@@ -0,0 +1,864 @@
+# GC Pacer Redesign
+
+Author: Michael Knyszek
+
+Updated: 8 February 2021
+
+## Abstract
+
+Go's tracing garbage collector runs concurrently with the application, and thus
+requires an algorithm to determine when to start a new cycle.
+In the runtime, this algorithm is referred to as the pacer.
+Until now, the garbage collector has framed this process as an optimization
+problem, utilizing a proportional controller to achieve a desired stopping-point
+(that is, the cycle completes just as the heap reaches a certain size) as well
+as a desired CPU utilization.
+While this approach has served Go well for a long time, it has accrued many
+corner cases due to resolved issues, as well as a backlog of unresolved issues.
+
+I propose redesigning the garbage collector's pacer from the ground up to
+capture the things it does well and eliminate the problems that have been
+discovered.
+More specifically, I propose:
+
+1. Including all non-heap sources of GC work (stacks, globals) in pacing
+   decisions.
+1. Reframing the pacing problem as a search problem, solved by a
+   proportional-integral controller.
+1. Extending the hard heap goal to the worst-case heap goal of the next GC.
+
+(1) will resolve long-standing issues with small heap sizes, allowing the Go
+garbage collector to scale *down* and act more predictably in general.
+(2) will eliminate offset error present in the current design, will allow
+turning off GC assists in the steady-state, and will enable clearer designs for
+setting memory limits on Go applications.
+(3) will enable smooth and consistent response to large changes in the live heap
+size with large `GOGC` values.
+
+## Background
+
+Since version 1.5, Go has had a tracing mark-sweep garbage collector (GC) that is
+able to execute concurrently with user goroutines.
+The garbage collector manages several goroutines called "workers" to carry out
+its task.
+A key problem in concurrent garbage collection is deciding when to begin, such
+that the work is complete "on time."
+Timeliness, today, is defined by the optimization of two goals:
+
+1. The heap size relative to the live heap at the end of the last cycle, and
+1. A target CPU utilization for the garbage collector while it is active.
+
+These two goals are tightly related.
+If a garbage collection cycle starts too late, for instance, it may consume more
+CPU to avoid missing its target.
+If a cycle begins too early, it may end too early, resulting in GC cycles
+happening more often than expected.
+
+Go's garbage collector sets a fixed target of 30% CPU utilization (25% from GC
+workers, with 5% from user goroutines donating their time to assist) while the
+GC is active.
+It also offers a parameter to allow the user to set their own memory use and CPU
+trade-off: `GOGC`.
+`GOGC` is a percent overhead describing how much *additional* memory (over the
+live heap) the garbage collector may use.
+A higher `GOGC` value indicates that the garbage collector may use more memory,
+setting the target heap size higher, and conversely a lower `GOGC` value sets
+the target heap size lower.
+The process of deciding when a garbage collection should start given these
+parameters has often been called "pacing" in the Go runtime.
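+
+To make the trade-off concrete, here is a purely illustrative sketch of the
+current rule (the runtime's actual accounting has more detail than this):
+
+```go
+// heapTarget sketches the current GOGC rule: the target heap size is the
+// live heap plus GOGC percent of it. For example, with a 50 MiB live heap,
+// GOGC=100 yields a 100 MiB target and GOGC=200 yields a 150 MiB target.
+func heapTarget(liveHeap, gogc uint64) uint64 {
+	return liveHeap + liveHeap*gogc/100
+}
+```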
+
+To attempt to reach its goals, Go's "pacer" utilizes a proportional controller
+to decide when to start a garbage collection cycle.
+The controller attempts to find the correct point to begin directly, given an
+error term that captures the two aforementioned optimization goals.
+
+It's worth noting that the optimization goals are defined for some steady-state.
+Today, the steady-state is implicitly defined as: constant allocation rate,
+constant heap size, and constant heap composition (hence, constant mark rate).
+The pacer expects the application to settle on some average global behavior
+across GC cycles.
+
+However, the GC is still robust to transient application states.
+When the GC is in some transient state, the pacer is often operating with stale
+information, and is actively trying to find the new steady-state.
+To avoid issues with memory blow-up, among other things, the GC makes allocating
+goroutines donate their time to assist the garbage collector, proportionally to
+the amount of memory that they allocate.
+This GC assist system keeps memory use stable in unstable conditions, at the
+expense of user CPU time and latency.
+
+The GC assist system operates by dynamically computing an assist ratio.
+The assist ratio is the slope of a curve in the space of allocation time and GC
+work time, a curve that the application is required to stay under.
+This assist ratio is then used as a conversion factor between the amount a
+goroutine has allocated, and how much GC assist work it should do.
+Meanwhile, GC workers generate assist credit from the work that they do and
+place it in a global pool that allocating goroutines may steal from to avoid
+having to assist.
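+
+To make that conversion concrete, here is a minimal sketch of the bookkeeping;
+the names and structure are illustrative, not the runtime's actual
+implementation:
+
+```go
+// assistWorkOwed sketches how an assist ratio (scan work per byte allocated)
+// converts a goroutine's allocations into assist work: any banked credit
+// stolen from the background workers is subtracted first, and whatever debt
+// remains must be paid in scan work before allocation continues.
+func assistWorkOwed(bytesAllocated, assistCredit int64, assistRatio float64) int64 {
+	debt := int64(float64(bytesAllocated)*assistRatio) - assistCredit
+	if debt < 0 {
+		return 0 // fully covered by banked credit
+	}
+	return debt
+}
+```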
+
+## Motivation
+
+Since version 1.5, the pacer has gone through several minor tweaks and changes
+in order to resolve issues, usually adding special cases and making its behavior
+more difficult to understand, though resolving the motivating problem.
+Meanwhile, more issues have been cropping up that are diagnosed but more
+difficult to tackle in the existing design.
+Most of these issues are listed in the [GC pacer
+meta-issue](https://github.com/golang/go/issues/42430).
+
+Even more fundamentally, the proportional controller at the center of the pacer
+is demonstrably unable to completely eliminate error in its scheduling, a
+well-known issue with proportional-only controllers.
+
+Another significant motivator, beyond resolving latent issues, is that the Go
+runtime lacks facilities for dealing with finite memory limits.
+While the `GOGC` mechanism works quite well and has served Go for a long time,
+it falls short when there's a hard memory limit.
+For instance, a consequence of `GOGC` that often surprises new gophers coming
+from languages like Java is that if `GOGC` is 100, then Go really needs 2x more
+memory than the peak live heap size.
+The garbage collector will *not* automatically run more aggressively as it
+approaches some memory limit, leading to out-of-memory errors.
+Conversely, users that know they'll have a fixed amount of memory up-front are
+unable to take advantage of it if their live heap is usually small.
+Users have taken to fooling the GC into thinking more memory is live than their
+application actually needs in order to let the application allocate more memory
+in between garbage collections.
+Simply increasing `GOGC` doesn't tend to work in this scenario either because of
+the previous problem: if the live heap spikes suddenly, `GOGC` will result in
+much more memory being used overall.
+See issue [#42430](https://github.com/golang/go/issues/42430) for more details.
+
+The current pacer is not designed with these use-cases in mind.
+
+## Design
+
+### Definitions
+
+![Equation 1](44167/eqn1.png)
+
+There is some nuance to these definitions.
+
+Firstly, ![`\gamma`](44167/inl1.png) is used in place of `GOGC` because it makes
+the math easier to understand.
+
+Secondly, ![`S_n`](44167/inl2.png) may vary throughout the sweep phase, but
+effectively becomes fixed once a GC cycle starts.
+Stacks may grow, but not shrink, during this time, so any value used by the
+runtime during a GC cycle may be stale.
+![`S_n`](44167/inl2.png) also includes space that may not be actively used for
+the stack.
+That is, if an 8 KiB goroutine stack is actually only 2 KiB high (and thus only
+2 KiB is actually scannable), for consistency's sake the stack's height will be
+considered 8 KiB.
+Both of these estimates introduce the potential for skew.
+In general, however, stacks are roots in the GC and will be some of the first
+sources of work for the GC, so the estimate should be fairly close.
+If that turns out not to be true in practice, it is possible, though tricky, to
+track goroutine stack heights more accurately, but some imprecision is
+unavoidable because the actual scannable stack height changes rapidly.
+
+Thirdly, ![`G_n`](44167/inl3.png) acts similarly to ![`S_n`](44167/inl2.png).
+The amount of global memory in a Go program can change while the application is
+running because of the `plugin` package.
+This action is relatively rare compared to a change in the size of stacks.
+Because of this rarity, I propose allowing a bit of skew.
+At worst (as we'll see later) the pacer will overshoot a little bit.
+
+Lastly, ![`M_n`](44167/inl4.png) is the amount of heap memory known to be live
+to the runtime the *instant* after a garbage collection cycle completes.
+Intuitively, it is the bottom of the classic GC sawtooth pattern.
+
+### Heap goal
+
+Like in the [previous definition of the
+pacer](https://docs.google.com/document/d/1wmjrocXIWTr1JxU-3EQBI6BK6KgtiFArkG47XK73xIQ/edit#heading=h.poxawxtiwajr),
+the runtime sets some target heap size for the GC cycle based on `GOGC`.
+Intuitively, this target heap size is the targeted heap size at the top of the
+classic GC sawtooth pattern.
+
+The definition I propose is very similar, except it includes non-heap sources of
+GC work.
+Let ![`N_n`](44167/inl5.png) be the heap goal for GC ![`n`](44167/inl6.png) ("N"
+stands for "Next GC").
+
+![Equation 2](44167/eqn2.png)
+
+The old definition makes the assumption that non-heap sources of GC work are
+negligible.
+In practice, that is often not true, such as with small heaps.
+This definition says that we're trading off not just heap memory, but *all*
+memory that influences the garbage collector's CPU consumption.
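+
+A minimal sketch of this computation, with placeholder byte counts for the
+inputs:
+
+```go
+// heapGoal computes the proposed heap goal N_n = gamma*(M_{n-1} + S_n + G_n),
+// where gamma = 1 + GOGC/100, M is the live heap after the last cycle, S is
+// stack space, and G is globals space.
+func heapGoal(gogc int64, liveHeap, stackBytes, globalsBytes uint64) uint64 {
+	gamma := 1 + float64(gogc)/100
+	return uint64(gamma * float64(liveHeap+stackBytes+globalsBytes))
+}
+```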
+
+From a philosophical standpoint wherein `GOGC` is intended to be a knob
+controlling the trade-off between CPU resources and memory footprint, this
+definition is more accurate.
+
+This change has one large user-visible ramification: the default `GOGC`, in most
+cases, will use slightly more memory than before.
+
+This change will inevitably cause some friction, but I believe the change is
+worth it.
+It unlocks the ability to scale *down* to heaps smaller than 4 MiB (the origin
+of this limit is directly tied to this lack of accounting).
+It also unlocks better behavior in applications with many or large goroutine
+stacks, or with very many globals.
+That GC work is now accounted for, leading to fewer surprises.
+
+### Deciding when to trigger a GC
+
+Unlike the current pacer, I propose that instead of finding the right point to
+start a GC such that the runtime reaches some target in the steady-state, that
+the pacer instead searches for a value that is more fundamental, though more
+indirect.
+
+Before continuing, I want to take a moment to point out some very fundamental and
+necessary assumptions made in both this design and the current pacer.
+Here, we are taking a "macro-economic" view of the Go garbage collector.
+The actual behavior of the application at the "micro" level is all about
+individual allocations, but the pacer is concerned not with the moment-to-moment
+behavior of the application.
+Instead, it concerns itself with broad aggregate patterns.
+And evidently, this abstraction is useful.
+Most programs are not wildly unpredictable in their behavior; in fact, thanks
+to the law of large numbers, it's somewhat of a challenge to write a useful
+application whose memory allocation behavior is genuinely unpredictable.
+This observation is why it is useful to talk about the steady-state of an
+application *at all*.
+
+The pacer concerns itself with two notions of time: the time it takes to
+allocate from the GC trigger point to the heap goal and the time it takes to
+find and perform all outstanding GC work.
+These are only *notions* of time because the pacer's job is to make them happen
+in the *same* amount of time, relative to a wall clock.
+Since in the steady-state the amount of GC work (broadly speaking) stays fixed,
+the pacer is then concerned with figuring out how early it should start such
+that it meets its goal.
+Because they should happen in the *same* amount of time, this question of "how
+early" is answered in "bytes allocated so far."
+
+So what's this more fundamental value? Suppose we model a Go program as
+follows: while a GC is active, the world is split in two: the application is
+either spending time on itself (and potentially allocating), or performing GC
+work.
+This model is "zero-sum," in a sense: if more time is spent on GC work, then
+less time is necessarily spent on the application and vice versa.
+Given this model, suppose we had two measures of program behavior during a GC
+cycle: how often the application is allocating, and how rapidly the GC can scan
+and mark memory.
+Note that these measures are *not* peak throughput.
+They are the actual rates at which allocation and GC work happen during a GC.
+To give them a concrete unit, let's say they're bytes per CPU-second per core.
+The idea with this unit is to have some generalized, aggregate notion of this
+behavior, independent of available CPU resources.
+We'll see why this is important shortly.
+Let's call these rates ![`a`](44167/inl7.png) and ![`s`](44167/inl8.png)
+respectively.
+In the steady-state, these rates aren't changing, so we can use them to predict
+when to start a garbage collection.
+
+Coming back to our model, some amount of CPU time is going to go to each of
+these activities.
+Let's say our target GC CPU utilization in the steady-state is
+![`u_t`](44167/inl9.png).
+If ![`C`](44167/inl10.png) is the number of CPU cores available and
+![`t`](44167/inl11.png) is some wall-clock time window, then
+![`a(1-u_t)Ct`](44167/inl12.png) bytes will be allocated and ![`s u_t
+Ct`](44167/inl13.png) bytes will be scanned in that window.
+
+Notice that the *ratio* of "bytes allocated" to "bytes scanned" is constant in
+the steady-state in this model, because both ![`a`](44167/inl7.png) and
+![`s`](44167/inl8.png) are constant.
+Let's call this ratio ![`r`](44167/inl14.png).
+To make things a little more general, let's make ![`r`](44167/inl14.png) also a
+function of utilization ![`u`](44167/inl15.png), because part of the Go garbage
+collector's design is the ability to dynamically change CPU utilization to keep
+it on-pace.
+
+![Equation 3](44167/eqn3.png)
+
+The big idea here is that this value, ![`r(u)`](44167/inl16.png), is a
+*conversion rate* between these two notions of time.
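+
+Since the `Ct` factors appear in both the numerator and the denominator of
+Equation 3, they cancel, and the ratio reduces to a simple expression; the
+sketch below just writes it out:
+
+```go
+// r computes the allocation-to-scan conversion rate of Equation 3:
+// r(u) = a*(1-u)*C*t / (s*u*C*t) = a*(1-u) / (s*u).
+func r(a, s, u float64) float64 {
+	return a * (1 - u) / (s * u)
+}
+```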
+
+Consider the following: in the steady-state, the runtime can perfectly back out
+the correct time to start a GC cycle, given that it knows exactly how much work
+it needs to do.
+Let ![`T_n`](44167/inl17.png) be the trigger point for GC cycle
+![`n`](44167/inl6.png).
+Let ![`P_n`](44167/inl18.png) be the size of the live *scannable* heap at the
+end of GC ![`n`](44167/inl6.png).
+More precisely, ![`P_n`](44167/inl18.png) is the subset of
+![`M_n`](44167/inl4.png) that contains pointers.
+Why include only pointer-ful memory? Because GC work is dominated by the cost of
+the scan loop, and each pointer that is found is marked; memory containing Go
+types without pointers is never touched, and so is totally ignored by the GC.
+Furthermore, this *does* include non-pointers in pointer-ful memory, because
+scanning over those is a significant cost in GC, enough so that GC is roughly
+proportional to it, not just the number of pointer slots.
+In the steady-state, the size of the scannable heap should not change, so
+![`P_n`](44167/inl18.png) remains constant.
+
+![Equation 4](44167/eqn4.png)
+
+That's nice, but we don't know ![`r`](44167/inl14.png) while the runtime is
+executing.
+And worse, it could *change* over time.
+
+But if the Go runtime can somehow accurately estimate and predict
+![`r`](44167/inl14.png) then it can find a steady-state.
+
+Suppose we had some prediction of ![`r`](44167/inl14.png) for GC cycle
+![`n`](44167/inl6.png) called ![`r_n`](44167/inl19.png).
+Then, our trigger condition is a simple extension of the formula above.
+Let ![`A`](44167/inl20.png) be the size of the Go live heap at any given time.
+![`A`](44167/inl20.png) is thus monotonically increasing during a GC cycle, and
+then instantaneously drops at the end of the GC cycle.
+In essence ![`A`](44167/inl20.png) *is* the classic GC sawtooth pattern.
+
+![Equation 5](44167/eqn5.png)
+
+Note that this formula is in fact a *condition* and not a predetermined trigger
+point, like the trigger ratio.
+In fact, this formula could transform into the previous formula for
+![`T_n`](44167/inl17.png) if it were not for the fact that
+![`S_n`](44167/inl2.png) actively changes during a GC cycle, since the rest of
+the values are constant for each GC cycle.
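+
+In code, the condition in Equation 5 might look like the following sketch,
+with illustrative names:
+
+```go
+// shouldTriggerGC sketches the trigger condition of Equation 5:
+// trigger once A >= N_n - r_n(u_t)*(P_{n-1} + S_n + G_n).
+// heapLive is A, heapGoal is N_n, r is the predicted conversion rate at the
+// target utilization, and the remaining size arguments are the expected scan
+// work from the heap (P_{n-1}), stacks (S_n), and globals (G_n).
+func shouldTriggerGC(heapLive, heapGoal, scannableHeap, stackBytes, globalsBytes uint64, r float64) bool {
+	runway := r * float64(scannableHeap+stackBytes+globalsBytes)
+	return float64(heapLive) >= float64(heapGoal)-runway
+}
+```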
+
+A big question remains: how do we predict ![`r`](44167/inl14.png)?
+
+To answer that, we first need to determine how to measure
+![`r`](44167/inl14.png) at all.
+I propose a straightforward approximation: each GC cycle, take the amount of
+memory allocated, divide it by the amount of memory scanned, and scale it from
+the actual GC CPU utilization to the target GC CPU utilization.
+Note that this scaling factor is necessary because we want our trigger to use an
+![`r`](44167/inl14.png) value that is at the target utilization, such that the
+GC is given enough time to *only* use that amount of CPU.
+This note is a key aspect of the proposal and will come up later.
+
+What does this scaling factor look like? Recall that because of our model, any
+value of ![`r`](44167/inl14.png) has a ![`1-u`](44167/inl21.png) factor in the
+numerator and a ![`u`](44167/inl15.png) factor in the denominator.
+Scaling from one utilization to another is as simple as switching out factors.
+
+Let ![`\hat{A}_n`](44167/inl22.png) be the actual peak live heap size at the end
+of a GC cycle (as opposed to ![`N_n`](44167/inl5.png), which is only a target).
+Let ![`u_n`](44167/inl23.png) be the GC CPU utilization over cycle
+![`n`](44167/inl6.png) and ![`u_t`](44167/inl9.png) be the target utilization.
+Altogether,
+
+![Equation 6](44167/eqn6.png)
+
+Now that we have a way to measure ![`r`](44167/inl14.png), we could use this
+value directly as our prediction.
+But I fear that using it directly has the potential to introduce a significant
+amount of noise, so smoothing over transient changes to this value is desirable.
+To do so, I propose using this measurement as the set-point for a
+proportional-integral (PI) controller.
+
+The advantage of a PI controller over a proportional controller is that it
+guarantees that steady-state error will be driven to zero.
+Note that the current GC pacer has issues with offset error.
+It may also find the wrong point on the isocline of GC CPU utilization and peak
+heap size because the error term can go to zero even if both targets are missed.
+The disadvantage of a PI controller, however, is that it oscillates and may
+overshoot significantly on its way to reaching a steady value.
+This disadvantage could be mitigated by overdamping the controller, but I
+propose we tune it using the tried-and-tested standard Ziegler-Nichols method.
+In simulations (see [the simulations section](#simulations)) this tuning tends
+to work well.
+It's worth noting that PI (more generally, PID controllers) have a lot of years
+of research and use behind them, and this design lets us take advantage of that
+and tune the pacer further if need be.
+
+Why a PI controller and not a PID controller? PI controllers are simpler to
+reason about, and the derivative term in a PID controller tends to be sensitive
+to high-frequency noise.
+The advantage of the derivative term is a shorter rise time, but simulations
+show that the rise time is roughly 1 GC cycle, so I don't think there's much
+reason to include it just yet.
+Adding the derivative term though is trivial once the rest of the design is in
+place, so the door is always open.
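+
+As a rough illustration of the mechanism (not a tuned, production-ready
+controller), a discrete PI controller used as a smoothing filter might look
+like the sketch below; each cycle, the newly measured value of
+![`r`](44167/inl14.png) is fed in as the set-point, and the output becomes the
+prediction ![`r_n`](44167/inl19.png). The gains are placeholders to be filled
+in by Ziegler-Nichols tuning:
+
+```go
+// piController is an illustrative discrete PI controller.
+type piController struct {
+	kp, ti   float64 // proportional gain and integral (reset) time, from tuning
+	integral float64 // accumulated error term
+	output   float64 // current output, i.e. the prediction r_n
+}
+
+// next feeds in a new set-point (the latest measurement of r) and advances
+// the controller by one period (one GC cycle), returning the new prediction.
+func (c *piController) next(setPoint, period float64) float64 {
+	err := setPoint - c.output
+	c.output = c.kp*err + c.integral
+	c.integral += (c.kp * period / c.ti) * err
+	return c.output
+}
+```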
+
+By focusing on this ![`r`](44167/inl14.png) value, we've now reframed the pacing
+problem as a search problem instead of an optimization problem.
+That raises a question: are we still reaching our optimization goals? And how do
+GC assists fit into this picture?
+
+The good news is that we're always triggering for the right CPU utilization.
+Because ![`r`](44167/inl14.png) is scaled for the *target* GC CPU utilization
+and ![`r`](44167/inl14.png) picks the trigger, the pacer will naturally start at
+a point that will generate that utilization in the steady-state.
+
+Following from this fact, there is no longer any reason to have the target GC
+CPU utilization be 30%.
+Originally, in the design for the current pacer, the target GC CPU utilization
+began at 25%, with GC assists always *extending* from that, so in the
+steady-state there would be no GC assists.
+However, because the pacer was structured to solve an optimization problem, it
+required feedback from both directions.
+That is, it needed to know whether it was acting too aggressively *or* not
+aggressively enough.
+This feedback could only be obtained by actually performing GC assists.
+But with this design, that's no longer necessary.
+The target CPU utilization can completely exclude GC assists in the steady-state
+with a mitigated risk of bad behavior.
+
+As a result, I propose the target utilization be reduced once again to 25%,
+eliminating GC assists in the steady-state (one that's not out-pacing the GC), and
+potentially improving application latency as a result.
+
+### Smoothing out GC assists
+
+This discussion of GC assists brings us to the existing issues around pacing
+decisions made *while* the GC is active (which I will refer to as the "GC assist
+pacer" below).
+For the most part, this system works very well, and is able to smooth over small
+hiccups in performance, due to noise from the underlying platform or elsewhere.
+
+Unfortunately, there's one place where it doesn't do so well: the hard heap
+goal.
+Currently, the GC assist pacer prevents memory blow-up in pathological cases by
+ramping up assists once either the GC has found more work than it expected (i.e.
+the live scannable heap has grown) or the GC is behind and the application's
+heap size has exceeded the heap goal.
+In both of these cases, it sets a somewhat arbitrarily defined hard limit at
+1.1x the heap goal.
+
+The problem with this policy is that high `GOGC` values create the opportunity
+for very large changes in live heap size, because the GC has quite a lot of
+runway (consider an application with `GOGC=51100` whose steady-state live heap
+is 10 MiB, when suddenly all the memory it allocates is live).
+In this case, the GC assist pacer is going to find all this new live memory and
+panic: the rate of assists will begin to skyrocket.
+This particular problem impedes the adoption of any sort of target heap size, or
+configurable minimum heap size.
+One can imagine a small live heap with a large target heap size as having a
+large *effective* `GOGC` value, so it reduces to exactly the same case.
+
+To deal with this, I propose modifying the GC assist policy to set a hard heap
+goal of ![`\gamma N_n`](44167/inl24.png).
+The intuition behind this goal is that if *all* the memory allocated in this GC
+cycle turns out to be live, the *next* GC cycle will end up using that much
+memory *anyway*, so we let it slide.
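+
+For illustration (the numbers are made up): with `GOGC=100` (so
+![`\gamma`](44167/inl1.png) is 2) and a heap goal of 64 MiB, the hard goal
+would be 128 MiB, which is exactly the goal the next cycle would set (ignoring
+non-heap work) if all the memory allocated this cycle turned out to be live.
+
+```go
+// hardHeapGoal sketches the proposed hard goal: gamma times this cycle's heap
+// goal, i.e. the goal the next cycle would set in the worst case where
+// everything allocated this cycle turns out to be live.
+func hardHeapGoal(gogc int64, heapGoal uint64) uint64 {
+	gamma := 1 + float64(gogc)/100
+	return uint64(gamma * float64(heapGoal))
+}
+```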
+
+But this hard goal need not be used for actually pacing GC assists other than in
+extreme cases.
+In fact, it must not, because an assist ratio computed from this hard heap goal
+and the worst-case scan work turns out to be extremely loose, leading to the GC
+assist pacer consistently missing the heap goal in some steady-states.
+
+So, I propose an alternative calculation for the assist ratio.
+I believe that the assist ratio must always pass through the heap goal, since
+otherwise there's no guarantee that the GC meets its heap goal in the
+steady-state (which is a fundamental property of the pacer in Go's existing GC
+design).
+However, there's no reason why the ratio itself needs to change dramatically
+when there's more GC work than expected.
+In fact, the preferable case is that it does not, because that lends itself to a
+much more even distribution of GC assist work across the cycle.
+
+So, I propose that the assist ratio be an extrapolation of the current
+steady-state assist ratio, with the exception that it now include non-heap GC
+work as the rest of this document does.
+
+That is,
+
+![Equation 7](44167/eqn7.png)
+
+This definition is intentionally roundabout.
+The assist ratio changes dynamically as the amount of GC work left decreases and
+the amount of memory allocated increases.
+This responsiveness is what allows the pacing mechanism to be so robust.
+
+Today, the assist ratio is calculated by computing the remaining heap runway and
+the remaining expected GC work, and dividing the former by the latter.
+But of course, that's not possible if there's more GC work than expected, since
+then the assist ratio could go negative, which is meaningless.
+
+So that's the purpose of defining the "max scan work" and "extrapolated runway":
+these are worst-case values that are always safe to subtract from, such that we
+can maintain roughly the same assist ratio throughout the cycle (assuming no
+hiccups).
+
+One minor detail is that the "extrapolated runway" needs to be capped at the
+hard heap goal to prevent breaking that promise, though in practice this cap
+will almost never come into play.
+The hard heap goal is such a loose bound that it's really only useful in
+pathological cases, but it's still necessary to ensure robustness.
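+
+A sketch of the shape of this computation (the names are illustrative, and the
+extrapolated runway is assumed to have already been capped at the hard heap
+goal):
+
+```go
+// assistRatio divides what remains of a worst-case ("extrapolated") runway by
+// what remains of the worst-case ("max") scan work. Because both are
+// worst-case values, the subtractions stay non-negative and the ratio remains
+// roughly stable even if the GC finds more work than it expected.
+func assistRatio(extrapolatedRunway, runwayUsed, maxScanWork, scanWorkDone uint64) float64 {
+	workLeft := maxScanWork - scanWorkDone
+	if workLeft == 0 {
+		return 0
+	}
+	return float64(extrapolatedRunway-runwayUsed) / float64(workLeft)
+}
+```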
+
+A key point in this choice is that the GC assist pacer will *only* respond to
+changes in allocation behavior and scan rate, not changes in the *size* of the
+live heap.
+This point seems minor, but it means the GC assist pacer's function is much
+simpler and more predictable.
+
+## Remaining unanswered questions
+
+Not every problem listed in issue
+[#42430](https://github.com/golang/go/issues/42430) is resolved by this design,
+though many are.
+
+Notable exclusions are:
+1. Mark assists are front-loaded in a GC cycle.
+1. The hard heap goal isn't actually hard in some circumstances.
+1. Dealing with idle GC workers.
+1. Donating goroutine assist credit/debt on exit.
+1. Existing trigger limits to prevent unbounded memory growth.
+
+(1) is difficult to resolve without special cases and arbitrary heuristics, and
+I think in practice it's OK; the system was fairly robust and will now be more
+so to this kind of noise.
+That doesn't mean that it shouldn't be revisited, but it's not quite as big as
+the other problems, so I leave it outside the scope of this proposal.
+
+(2) is also tricky and somewhat orthogonal.
+I believe the path forward there involves better scheduling of fractional GC
+workers, which are currently very loosely scheduled.
+This design has made me realize how important dedicated GC workers are to
+progress, and how GC assists are a back-up mechanism.
+I believe that the fundamental problem there lies with the fact that fractional
+GC workers don't provide that sort of consistent progress.
+
+For (3) I believe we should remove idle GC workers entirely, which is why this
+document ignores them.
+Idle GC workers are extra mark workers that run if the application isn't
+utilizing all `GOMAXPROCS` worth of parallelism.
+The scheduler schedules "low priority" background workers on any additional CPU
+resources, and this ultimately skews utilization measurements in the GC, because
+as of today they're unaccounted for.
+Unfortunately, it's likely that idle GC workers have accidentally become
+necessary for the GC to make progress, so just pulling them out won't be quite
+so easy.
+I believe that needs a separate proposal: given other large potential changes
+coming to the compiler and runtime in the near future, there's unfortunately a
+fairly significant risk of introducing bugs by doing so, though I do think it's
+ultimately an improvement.
+See [the related issue for more
+details](https://github.com/golang/go/issues/44163).
+
+(4) is easy and I don't believe needs any deliberation.
+That is a bug we should simply fix.
+
+For (5), I propose we retain the limits, translated to the current design.
+For reference, these limits are ![`0.95 (\gamma - 1)`](44167/inl25.png) as the
+upper-bound on the trigger ratio, and ![`0.6 (\gamma - 1)`](44167/inl26.png) as
+the lower-bound.
+
+The upper bound exists to prevent ever starting the GC too late in low-activity
+scenarios.
+It may cause consistent undershoot, but prevents issues in GC pacer calculations
+by preventing the calculated runway from ever being too low.
+The upper-bound may need to be revisited when considering a configurable target
+heap size.
+
+The lower bound exists to prevent the application from causing excessive memory
+growth due to floating garbage as the application's allocation rate increases.
+Before that limit was installed, it wasn't very easy for an application to
+allocate hard enough for that to happen.
+The lower bound probably should be revisited, but I leave that outside of the
+scope of this document.
+
+To translate them to the current design, I propose we simply modify the trigger
+condition to include these limits.
+It's not important to put these limits in the rest of the pacer because it no
+longer tries to compute the trigger point ahead of time.
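+
+A sketch of what retaining these limits might look like, expressed in terms of
+the old trigger ratio over the live heap (how exactly the bounds plug into the
+new trigger condition is an implementation detail):
+
+```go
+// clampTrigger keeps the trigger ratio, (trigger-liveHeap)/liveHeap, within
+// [0.6*(gamma-1), 0.95*(gamma-1)]. Names are illustrative.
+func clampTrigger(trigger, liveHeap uint64, gamma float64) uint64 {
+	lo := uint64(float64(liveHeap) * (1 + 0.6*(gamma-1)))
+	hi := uint64(float64(liveHeap) * (1 + 0.95*(gamma-1)))
+	if trigger < lo {
+		return lo
+	}
+	if trigger > hi {
+		return hi
+	}
+	return trigger
+}
+```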
+
+## A note about CPU utilization
+
+This document uses the term "GC CPU utilization" quite frequently, but so far
+has refrained from defining exactly how it's measured.
+Before doing that, let's define CPU utilization over a GC mark phase, as it's
+been used so far.
+First, let's define the mark phase: it is the period of wall-clock time between
+the end of sweep termination and the start of mark termination.
+In the mark phase, the process will have access to some total number of
+CPU-seconds of execution time.
+This CPU time can then be divided into "time spent doing GC work" and "time
+spent doing anything else."
+GC CPU utilization is then defined as a proportion of that total CPU time that
+is spent doing GC work.
+
+This definition seems straightforward enough but in reality it's more
+complicated.
+Measuring CPU time on most platforms is tricky, so what Go does today is an
+approximation: take the wall-clock time of the GC mark phase, multiply it by
+`GOMAXPROCS`.
+Call this ![`T`](44167/inl27.png).
+Take 25% of that (representing the dedicated GC workers) and add the total
+amount of time all goroutines spend in GC assists.
+The latter is computed directly, but is just the difference between the start
+and end time in the critical section; it does not try to account for context
+switches forced by the underlying system, or anything like that.
+Now take this value we just computed and divide it by ![`T`](44167/inl27.png).
+That's our GC CPU utilization.
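+
+A sketch of that computation (inputs and names are illustrative):
+
+```go
+// gcUtilization approximates GC CPU utilization as described above: take the
+// mark phase's wall-clock duration times GOMAXPROCS as the total CPU time T,
+// credit the dedicated workers with 25% of it, add the measured assist time,
+// and divide by T.
+func gcUtilization(markWallSeconds float64, gomaxprocs int, assistCPUSeconds float64) float64 {
+	T := markWallSeconds * float64(gomaxprocs)
+	gcCPUSeconds := 0.25*T + assistCPUSeconds
+	return gcCPUSeconds / T
+}
+```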
+
+This approximation is mostly accurate in the common case, but is prone to skew
+in various scenarios, such as when the system is CPU-starved.
+This fact can be problematic, but I believe it is largely orthogonal to the
+content of this document; we can work on improving this approximation without
+having to change any of this design.
+It already assumes that we have a good measure of CPU utilization.
+
+## Alternatives considered
+
+The alternatives considered for this design basically boil down to its
+individual components.
+
+For instance, I considered grouping stacks and globals into the current
+formulation of the pacer, but that adds another condition to the definition of
+the steady-state: stacks and globals do not change.
+That makes the steady-state more fragile.
+
+I also considered a design that was similar, but computed everything in terms of
+an "effective" `GOGC`, and "converted" that back to `GOGC` for pacing purposes
+(that is, what would the heap trigger have been had the expected amount of live
+memory been correct?).
+This formulation is similar to how Austin formulated the experimental
+`SetMaxHeap` API.
+Austin suggested I avoid this formulation because math involving `GOGC` tends to
+have to work around infinities.
+A good example of this is if `runtime.GC` is called while `GOGC` is off: the
+runtime has to "fake" a very large `GOGC` value in the pacer.
+By using a ratio of rates that's more grounded in actual application behavior,
+the trickiness of the math is avoided.
+
+I also considered not using a PI controller and just using the measured
+![`r`](44167/inl14.png) value directly, assuming it doesn't change across GC
+cycles, but that method is prone to noise.
+
+## Justification
+
+Pros:
+- The steady-state is now independent of the amount of GC work to be done.
+- Steady-state mark assist drops to zero if not allocating too heavily (a likely
+  latency improvement in many scenarios) (see the "high `GOGC`" scenario in
+  [simulations](#simulations)).
+- GC amortization includes non-heap GC work, and responds well in those cases.
+- Eliminates offset error present in the existing design.
+
+Cons:
+- Definition of `GOGC` has changed slightly, so a `GOGC` of 100 will use
+  slightly more memory in nearly all cases.
+- ![`r`](44167/inl14.png) is a little bit unintuitive.
+
+## Implementation
+
+This pacer redesign will be implemented by Michael Knyszek.
+
+1. The existing pacer will be refactored into a form fit for simulation.
+1. A comprehensive simulation-based test suite will be written for the pacer.
+1. The existing pacer will be swapped out with the new implementation.
+
+The purpose of the simulation infrastructure is to make the pacer, in general,
+more testable.
+This lets us write regression test cases based on real-life problems we
+encounter and confirm that they don't break going forward.
+Furthermore, with fairly large changes to the Go compiler and runtime in the
+pipeline, it's especially important to reduce the risk of this change as much as
+possible.
+
+## Go 1 backwards compatibility
+
+This change will not modify any user-visible APIs, but may have surprising
+effects on application behavior.
+The two "most" incompatible changes in this proposal are the redefinition of the
+heap goal to include non-heap sources of GC work, since that directly influences
+the meaning of `GOGC`, and the change in target GC CPU utilization.
+These two factors together mean that, by default and on average, Go applications
+will use slightly more memory than before.
+To obtain previous levels of memory usage, users may be required to manually
+tune `GOGC` lower, but the overall result should be more consistent, more
+predictable, and more efficient.
+
+## Simulations
+
+In order to show the effectiveness of the new pacer and compare it to the
+current one, I modeled both the existing pacer and the new pacer and simulated
+both in a variety of scenarios.
+
+The code used to run these simulations and generate the plots below may be found
+at [github.com/mknyszek/pacer-model](https://github.com/mknyszek/pacer-model).
+
+### Assumptions and caveats
+
+The model of each pacer is fairly detailed, and takes into account most details
+like allocations made during a GC being marked.
+The one big assumption it makes, however, is that the behavior of the
+application while a GC cycle is running is perfectly smooth, such that the GC
+assist pacer is perfectly paced according to the initial assist ratio.
+In practice, this is close to true, but it's worth accounting for the more
+extreme cases.
+(TODO: Show simulations that inject some noise into the GC assist pacer.)
+
+Another caveat with the simulation is the graph of "R value" (that is,
+![`r_n`](44167/inl19.png)), and "Alloc/scan ratio."
+The latter is well-defined for all simulations (it's a part of the input) but
+the former is not a concept used in the current pacer.
+So for simulations of the current pacer, the "R value" is backed out from the
+trigger ratio: we know the runway, we know the *expected* scan work for the
+target utilization, so we can compute the ![`r_n`](44167/inl19.png) that the
+trigger point encodes.
+
+### Results
+
+**Perfectly steady heap size.**
+
+The simplest possible scenario.
+
+Current pacer:
+
+![](44167/pacer-plots/old-steady.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-steady.png)
+
+Notes:
+- The current pacer doesn't seem to find the right utilization.
+- Both pacers do reasonably well at meeting the heap goal.
+
+**Jittery heap size and alloc/scan ratio.**
+
+A mostly steady-state heap with a slight jitter added to both live heap size and
+the alloc/scan ratio.
+
+Current pacer:
+
+![](44167/pacer-plots/old-jitter-alloc.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-jitter-alloc.png)
+
+Notes:
+- Both pacers are resilient to a small amount of noise.
+
+**Small step in alloc/scan ratio.**
+
+This scenario demonstrates the transition between two steady-states that are
+not far from one another.
+
+Current pacer:
+
+![](44167/pacer-plots/old-step-alloc.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-step-alloc.png)
+
+Notes:
+- Both pacers react to the change in alloc/scan rate.
+- Clear oscillations in utilization visible for the new pacer.
+
+**Large step in alloc/scan ratio.**
+
+This scenario demonstrates the transition between two steady-states that are
+further from one another.
+
+Current pacer:
+
+![](44167/pacer-plots/old-heavy-step-alloc.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-heavy-step-alloc.png)
+
+Notes:
+- The old pacer consistently overshoots the heap size post-step.
+- The new pacer minimizes overshoot.
+
+**Large step in heap size with a high `GOGC` value.**
+
+This scenario demonstrates the "high `GOGC` problem" described in the [GC pacer
+meta-issue](https://github.com/golang/go/issues/42430).
+
+Current pacer:
+
+![](44167/pacer-plots/old-high-GOGC.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-high-GOGC.png)
+
+Notes:
+- The new pacer's heap size stabilizes faster than the old pacer's.
+- The new pacer has a spike in overshoot; this is *by design*.
+- The new pacer's utilization is independent of this heap size spike.
+- The old pacer has a clear spike in utilization.
+
+**Oscillating alloc/scan ratio.**
+
+This scenario demonstrates an oscillating alloc/scan ratio.
+This scenario is interesting because it shows a somewhat extreme case where a
+steady-state is never actually reached for any amount of time.
+However, this is not a realistic scenario.
+
+Current pacer:
+
+![](44167/pacer-plots/old-osc-alloc.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-osc-alloc.png)
+
+Notes:
+- The new pacer tracks the oscillations worse than the old pacer.
+  This is likely due to the error never settling, so the PI controller is always
+  overshooting.
+
+**Large amount of goroutine stacks.**
+
+This scenario demonstrates the "heap amortization problem" described in the [GC
+pacer meta-issue](https://github.com/golang/go/issues/42430) for goroutine
+stacks.
+
+Current pacer:
+
+![](44167/pacer-plots/old-big-stacks.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-big-stacks.png)
+
+Notes:
+- The old pacer consistently overshoots because it's underestimating the amount
+  of work it has to do.
+- The new pacer uses more memory, since the heap goal is now proportional to
+  stack space, but it stabilizes and is otherwise sane.
+
+**Large amount of global variables.**
+
+This scenario demonstrates the "heap amortization problem" described in the [GC
+pacer meta-issue](https://github.com/golang/go/issues/42430) for global
+variables.
+
+Current pacer:
+
+![](44167/pacer-plots/old-big-globals.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-big-globals.png)
+
+Notes:
+- This is essentially identical to the stack space case.
+
+**High alloc/scan ratio.**
+
+This scenario shows the behavior of each pacer in the face of a very high
+alloc/scan ratio, with jitter applied to both the live heap size and the
+alloc/scan ratio.
+
+Current pacer:
+
+![](44167/pacer-plots/old-heavy-jitter-alloc.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-heavy-jitter-alloc.png)
+
+Notes:
+- In the face of a very high allocation rate, the old pacer consistently
+  overshoots, though both maintain a similar GC CPU utilization.
diff --git a/design/44167/README.md b/design/44167/README.md
new file mode 100644
index 0000000..8994615
--- /dev/null
+++ b/design/44167/README.md
@@ -0,0 +1,21 @@
+# Design document
+
+The GC pacer design document is generated from the `.src.md` file in this
+directory.
+It contains LaTeX formulas which Markdown cannot render, so they're
+rendered by an external open-source tool,
+[md-latex](https://github.com/mknyszek/md-tools).
+
+Then, because Gitiles' markdown viewer can't render SVGs, run
+
+```
+./svg2png.bash
+cd pacer-plots
+./svg2png.bash
+cd ..
+```
+
+And go back and replace all instances of SVG with PNG in the final document.
+
+Note that `svg2png.bash` requires both ImageMagick and Google Chrome.
+
diff --git a/design/44167/eqn1.png b/design/44167/eqn1.png
new file mode 100644
index 0000000..b72d192
--- /dev/null
+++ b/design/44167/eqn1.png
Binary files differ
diff --git a/design/44167/eqn2.png b/design/44167/eqn2.png
new file mode 100644
index 0000000..42053ee
--- /dev/null
+++ b/design/44167/eqn2.png
Binary files differ
diff --git a/design/44167/eqn3.png b/design/44167/eqn3.png
new file mode 100644
index 0000000..7ae3a88
--- /dev/null
+++ b/design/44167/eqn3.png
Binary files differ
diff --git a/design/44167/eqn4.png b/design/44167/eqn4.png
new file mode 100644
index 0000000..784db07
--- /dev/null
+++ b/design/44167/eqn4.png
Binary files differ
diff --git a/design/44167/eqn5.png b/design/44167/eqn5.png
new file mode 100644
index 0000000..fade575
--- /dev/null
+++ b/design/44167/eqn5.png
Binary files differ
diff --git a/design/44167/eqn6.png b/design/44167/eqn6.png
new file mode 100644
index 0000000..f6bd4f4
--- /dev/null
+++ b/design/44167/eqn6.png
Binary files differ
diff --git a/design/44167/eqn7.png b/design/44167/eqn7.png
new file mode 100644
index 0000000..1e0b3b4
--- /dev/null
+++ b/design/44167/eqn7.png
Binary files differ
diff --git a/design/44167/eqn8.png b/design/44167/eqn8.png
new file mode 100644
index 0000000..b11aae2
--- /dev/null
+++ b/design/44167/eqn8.png
Binary files differ
diff --git a/design/44167/eqn9.png b/design/44167/eqn9.png
new file mode 100644
index 0000000..577ccf3
--- /dev/null
+++ b/design/44167/eqn9.png
Binary files differ
diff --git a/design/44167/gc-pacer-redesign.src.md b/design/44167/gc-pacer-redesign.src.md
new file mode 100644
index 0000000..3268a56
--- /dev/null
+++ b/design/44167/gc-pacer-redesign.src.md
@@ -0,0 +1,872 @@
+# GC Pacer Redesign
+
+Author: Michael Knyszek
+
+Updated: 8 February 2021
+
+## Abstract
+
+Go's tracing garbage collector runs concurrently with the application, and thus
+requires an algorithm to determine when to start a new cycle.
+In the runtime, this algorithm is referred to as the pacer.
+Until now, the garbage collector has framed this process as an optimization
+problem, utilizing a proportional controller to achieve a desired stopping-point
+(that is, the cycle completes just as the heap reaches a certain size) as well
+as a desired CPU utilization.
+While this approach has served Go well for a long time, it has accrued many
+corner cases due to resolved issues, as well as a backlog of unresolved issues.
+
+I propose redesigning the garbage collector's pacer from the ground up to
+capture the things it does well and eliminate the problems that have been
+discovered.
+More specifically, I propose:
+
+1. Including all non-heap sources of GC work (stacks, globals) in pacing
+   decisions.
+1. Reframing the pacing problem as a search problem, solved by a
+   proportional-integral controller.
+1. Extending the hard heap goal to the worst-case heap goal of the next GC.
+
+(1) will resolve long-standing issues with small heap sizes, allowing the Go
+garbage collector to scale *down* and act more predictably in general.
+(2) will eliminate offset error present in the current design, will allow
+turning off GC assists in the steady-state, and will enable clearer designs for
+setting memory limits on Go applications.
+(3) will enable smooth and consistent response to large changes in the live heap
+size with large `GOGC` values.
+
+## Background
+
+Since version 1.5, Go has had a tracing mark-sweep garbage collector (GC) that is
+able to execute concurrently with user goroutines.
+The garbage collector manages several goroutines called "workers" to carry out
+its task.
+A key problem in concurrent garbage collection is deciding when to begin, such
+that the work is complete "on time."
+Timeliness, today, is defined by the optimization of two goals:
+
+1. The heap size relative to the live heap at the end of the last cycle, and
+1. A target CPU utilization for the garbage collector while it is active.
+
+These two goals are tightly related.
+If a garbage collection cycle starts too late, for instance, it may consume more
+CPU to avoid missing its target.
+If a cycle begins too early, it may end too early, resulting in GC cycles
+happening more often than expected.
+
+Go's garbage collector sets a fixed target of 30% CPU utilization (25% from GC
+workers, with 5% from user goroutines donating their time to assist) while the
+GC is active.
+It also offers a parameter to allow the user to set their own memory use and CPU
+trade-off: `GOGC`.
+`GOGC` is a percent overhead describing how much *additional* memory (over the
+live heap) the garbage collector may use.
+A higher `GOGC` value indicates that the garbage collector may use more memory,
+setting the target heap size higher, and conversely a lower `GOGC` value sets
+the target heap size lower.
+The process of deciding when a garbage collection should start given these
+parameters has often been called "pacing" in the Go runtime.
+
+To attempt to reach its goals, Go's "pacer" utilizes a proportional controller
+to decide when to start a garbage collection cycle.
+The controller attempts to find the correct point to begin directly, given an
+error term that captures the two aforementioned optimization goals.
+
+It's worth noting that the optimization goals are defined for some steady-state.
+Today, the steady-state is implicitly defined as: constant allocation rate,
+constant heap size, and constant heap composition (hence, constant mark rate).
+The pacer expects the application to settle on some average global behavior
+across GC cycles.
+
+However, the GC is still robust to transient application states.
+When the GC is in some transient state, the pacer is often operating with stale
+information, and is actively trying to find the new steady-state.
+To avoid issues with memory blow-up, among other things, the GC makes allocating
+goroutines donate their time to assist the garbage collector, proportionally to
+the amount of memory that they allocate.
+This GC assist system keeps memory use stable in unstable conditions, at the
+expense of user CPU time and latency.
+
+The GC assist system operates by dynamically computing an assist ratio.
+The assist ratio is the slope of a curve in the space of allocation time and GC
+work time, a curve that the application is required to stay under.
+This assist ratio is then used as a conversion factor between the amount a
+goroutine has allocated, and how much GC assist work it should do.
+Meanwhile, GC workers generate assist credit from the work that they do and
+place it in a global pool that allocating goroutines may steal from to avoid
+having to assist.
+
+## Motivation
+
+Since version 1.5, the pacer has gone through several minor tweaks and changes
+in order to resolve issues, usually adding special cases and making its behavior
+more difficult to understand, though resolving the motivating problem.
+Meanwhile, more issues have been cropping up that are diagnosed but more
+difficult to tackle in the existing design.
+Most of these issues are listed in the [GC pacer
+meta-issue](https://github.com/golang/go/issues/42430).
+
+Even more fundamentally, the proportional controller at the center of the pacer
+is demonstrably unable to completely eliminate error in its scheduling, a
+well-known issue with proportional-only controllers.
+
+Another significant motivator, beyond resolving latent issues, is that the Go
+runtime lacks facilities for dealing with finite memory limits.
+While the `GOGC` mechanism works quite well and has served Go for a long time,
+it falls short when there's a hard memory limit.
+For instance, a consequence of `GOGC` that often surprises new gophers coming
+from languages like Java is that if `GOGC` is 100, then Go really needs 2x more
+memory than the peak live heap size.
+The garbage collector will *not* automatically run more aggressively as it
+approaches some memory limit, leading to out-of-memory errors.
+Conversely, users that know they'll have a fixed amount of memory up-front are
+unable to take advantage of it if their live heap is usually small.
+Users have taken to fooling the GC into thinking more memory is live than their
+application actually needs in order to let the application allocate more memory
+in between garbage collections.
+Simply increasing `GOGC` doesn't tend to work in this scenario either because of
+the previous problem: if the live heap spikes suddenly, `GOGC` will result in
+much more memory being used overall.
+See issue [#42430](https://github.com/golang/go/issues/42430) for more details.
+
+The current pacer is not designed with these use-cases in mind.
+
+## Design
+
+### Definitions
+
+```render-latex
+\begin{aligned}
+\gamma & = 1+\frac{GOGC}{100} \\
+   S_n & = \textrm{bytes of memory allocated to goroutine stacks at the beginning of the mark phase of GC } n \\
+   G_n & = \textrm{bytes of memory dedicated to scannable global variables at the beginning of the mark phase of GC } n \\
+   M_n & = \textrm{bytes of memory marked live after GC } n
+\end{aligned}
+```
+
+There is some nuance to these definitions.
+
+Firstly, `$\gamma$` is used in place of `GOGC` because it makes the math easier
+to understand.
+
+Secondly, `$S_n$` may vary throughout the sweep phase, but effectively becomes
+fixed once a GC cycle starts.
+Stacks may grow, but not shrink, during this time, so any value used by the
+runtime during a GC cycle may be stale.
+`$S_n$` also includes space that may not be actively used for the stack.
+That is, if an 8 KiB goroutine stack is actually only 2 KiB high (and thus only
+2 KiB is actually scannable), for consistency's sake the stack's height will be
+considered 8 KiB.
+Both of these estimates introduce the potential for skew.
+In general, however, stacks are roots in the GC and will be some of the first
+sources of work for the GC, so the estimate should be fairly close.
+If that turns out not to be true in practice, it is possible, though tricky, to
+track goroutine stack heights more accurately, but some imprecision is
+unavoidable because the actual scannable stack height changes rapidly.
+
+Thirdly, `$G_n$` acts similarly to `$S_n$`.
+The amount of global memory in a Go program can change while the application is
+running because of the `plugin` package.
+This action is relatively rare compared to a change in the size of stacks.
+Because of this rarity, I propose allowing a bit of skew.
+At worst (as we'll see later) the pacer will overshoot a little bit.
+
+Lastly, `$M_n$` is the amount of heap memory known to be live to the runtime the
+*instant* after a garbage collection cycle completes.
+Intuitively, it is the bottom of the classic GC sawtooth pattern.
+
+### Heap goal
+
+Like in the [previous definition of the
+pacer](https://docs.google.com/document/d/1wmjrocXIWTr1JxU-3EQBI6BK6KgtiFArkG47XK73xIQ/edit#heading=h.poxawxtiwajr),
+the runtime sets some target heap size for the GC cycle based on `GOGC`.
+Intuitively, this target heap size is the targeted heap size at the top of the
+classic GC sawtooth pattern.
+
+The definition I propose is very similar, except it includes non-heap sources of
+GC work.
+Let `$N_n$` be the heap goal for GC `$n$` ("N" stands for "Next GC").
+
+```render-latex
+N_n = \gamma(M_{n-1} + S_n + G_n)
+```
+
+The old definition makes the assumption that non-heap sources of GC work are
+negligible.
+In practice, that is often not true, such as with small heaps.
+This definition says that we're trading off not just heap memory, but *all*
+memory that influences the garbage collector's CPU consumption.
+
+From a philosophical standpoint wherein `GOGC` is intended to be a knob
+controlling the trade-off between CPU resources and memory footprint, this
+definition is more accurate.
+
+This change has one large user-visible ramification: the default `GOGC`, in most
+cases, will use slightly more memory than before.
+
+This change will inevitably cause some friction, but I believe the change is
+worth it.
+It unlocks the ability to scale *down* to heaps smaller than 4 MiB (the origin
+of this limit is directly tied to this lack of accounting).
+It also unlocks better behavior in applications with many or large goroutine
+stacks, or with very many globals.
+That GC work is now accounted for, leading to fewer surprises.
+
+### Deciding when to trigger a GC
+
+Unlike the current pacer, I propose that instead of finding the right point to
+start a GC such that the runtime reaches some target in the steady-state, that
+the pacer instead searches for a value that is more fundamental, though more
+indirect.
+
+Before continuing, I want to take a moment to point out some very fundamental and
+necessary assumptions made in both this design and the current pacer.
+Here, we are taking a "macro-economic" view of the Go garbage collector.
+The actual behavior of the application at the "micro" level is all about
+individual allocations, but the pacer is concerned not with the moment-to-moment
+behavior of the application.
+Instead, it concerns itself with broad aggregate patterns.
+And evidently, this abstraction is useful.
+Most programs are not wildly unpredictable in their behavior; in fact, thanks
+to the law of large numbers, it's somewhat of a challenge to write a useful
+application whose memory allocation behavior is genuinely unpredictable.
+This observation is why it is useful to talk about the steady-state of an
+application *at all*.
+
+The pacer concerns itself with two notions of time: the time it takes to
+allocate from the GC trigger point to the heap goal and the time it takes to
+find and perform all outstanding GC work.
+These are only *notions* of time because the pacer's job is to make them happen
+in the *same* amount of time, relative to a wall clock.
+Since in the steady-state the amount of GC work (broadly speaking) stays fixed,
+the pacer is then concerned with figuring out how early it should start such
+that it meets its goal.
+Because they should happen in the *same* amount of time, this question of "how
+early" is answered in "bytes allocated so far."
+
+So what's this more fundamental value? Suppose we model a Go program as
+follows: while a GC is active, the world is split in two: the application is
+either spending time on itself (and potentially allocating), or performing GC
+work.
+This model is "zero-sum," in a sense: if more time is spent on GC work, then
+less time is necessarily spent on the application and vice versa.
+Given this model, suppose we had two measures of program behavior during a GC
+cycle: how often the application is allocating, and how rapidly the GC can scan
+and mark memory.
+Note that these measures are *not* peak throughput.
+They are the actual rates at which allocation and GC work happen during a GC.
+To give them a concrete unit, let's say they're bytes per CPU-second per core.
+The idea with this unit is to have some generalized, aggregate notion of this
+behavior, independent of available CPU resources.
+We'll see why this is important shortly.
+Let's call these rates `$a$` and `$s$` respectively.
+In the steady-state, these rates aren't changing, so we can use them to predict
+when to start a garbage collection.
+
+Coming back to our model, some amount of CPU time is going to go to each of
+these activities.
+Let's say our target GC CPU utilization in the steady-state is `$u_t$`.
+If `$C$` is the number of CPU cores available and `$t$` is some wall-clock time
+window, then `$a(1-u_t)Ct$` bytes will be allocated and `$s u_t Ct$` bytes will
+be scanned in that window.
+
+Notice that the *ratio* of "bytes allocated" to "bytes scanned" is constant in the
+steady-state in this model, because both `$a$` and `$s$` are constant.
+Let's call this ratio `$r$`.
+To make things a little more general, let's make `$r$` also a function of
+utilization `$u$`, because part of the Go garbage collector's design is the
+ability to dynamically change CPU utilization to keep it on-pace.
+
+```render-latex
+r(u) = \frac{a(1-u)Ct}{suCt}
+```
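+
+Since the `$Ct$` factors cancel, this simplifies to
+
+```render-latex
+r(u) = \frac{a(1-u)}{s u}
+```
+
+which makes explicit that the conversion rate depends only on the two rates and
+the chosen utilization, not on the number of cores or the length of the time
+window.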
+
+The big idea here is that this value, `$r(u)$`, is a *conversion rate* between
+these two notions of time.
+
+Consider the following: in the steady-state, the runtime can perfectly back out
+the correct time to start a GC cycle, given that it knows exactly how much work
+it needs to do.
+Let `$T_n$` be the trigger point for GC cycle `$n$`.
+Let `$P_n$` be the size of the live *scannable* heap at the end of GC `$n$`.
+More precisely, `$P_n$` is the subset of `$M_n$` that contains pointers.
+Why include only pointer-ful memory? Because GC work is dominated by the cost of
+the scan loop, and each pointer that is found is marked; memory containing Go
+types without pointers is never touched, and so is totally ignored by the GC.
+Furthermore, this *does* include non-pointers in pointer-ful memory, because
+scanning over those is a significant cost in GC, enough so that GC cost is
+roughly proportional to it, not just to the number of pointer slots.
+In the steady-state, the size of the scannable heap should not change, so
+`$P_n$` remains constant.
+
+```render-latex
+T_n = N_n - r(u_t)(P_{n-1} + S_n + G_n)
+```
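+
+For intuition, here is a purely hypothetical example with made-up numbers: if
+the heap goal `$N_n$` is 20 MiB, the scannable live heap `$P_{n-1}$` is 8 MiB,
+stacks and globals `$S_n + G_n$` add up to 2 MiB, and `$r(u_t) = 0.75$`, then
+
+```render-latex
+T_n = 20\,\textrm{MiB} - 0.75 \times (8 + 2)\,\textrm{MiB} = 12.5\,\textrm{MiB}
+```
+
+that is, the GC would start once the heap reaches 12.5 MiB, leaving 7.5 MiB of
+runway for allocation during the cycle.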
+
+That's nice, but we don't know `$r$` while the runtime is executing.
+And worse, it could *change* over time.
+
+But if the Go runtime can somehow accurately estimate and predict `$r$` then it
+can find a steady-state.
+
+Suppose we had some prediction of `$r$` for GC cycle `$n$` called `$r_n$`.
+Then, our trigger condition is a simple extension of the formula above.
+Let `$A$` be the size of the Go live heap at any given time.
+`$A$` is thus monotonically increasing during a GC cycle, and then
+instantaneously drops at the end of the GC cycle.
+In essence `$A$` *is* the classic GC sawtooth pattern.
+
+```render-latex
+A \ge N_n - r_n(u_t)(P_{n-1} + S_n + G_n)
+```
+
+Note that this formula is in fact a *condition* and not a predetermined trigger
+point, like the trigger ratio.
+In fact, this formula could transform into the previous formula for `$T_n$` if
+it were not for the fact that `$S_n$` actively changes during a GC cycle, since
+the rest of the values are constant for each GC cycle.
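+
+As a minimal sketch (the names here are illustrative assumptions, not the
+runtime's actual identifiers), the condition amounts to a check like the
+following, re-evaluated as the application allocates and as the stack and
+globals scan work changes:
+
+```go
+// shouldTrigger reports whether the trigger condition has been met.
+// heapSize is A, heapGoal is N_n, rPredicted is r_n(u_t), scannableHeap is
+// P_{n-1}, and stackBytes and globalsBytes are S_n and G_n.
+func shouldTrigger(heapSize, heapGoal, rPredicted, scannableHeap, stackBytes, globalsBytes float64) bool {
+	// Runway we expect to need: the predicted conversion rate applied to
+	// the expected scan work.
+	runway := rPredicted * (scannableHeap + stackBytes + globalsBytes)
+	return heapSize >= heapGoal-runway
+}
+```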
+
+A big question remains: how do we predict `$r$`?
+
+To answer that, we first need to determine how to measure `$r$` at all.
+I propose a straightforward approximation: each GC cycle, take the amount of
+memory allocated, divide it by the amount of memory scanned, and scale it from
+the actual GC CPU utilization to the target GC CPU utilization.
+Note that this scaling factor is necessary because we want our trigger to use an
+`$r$` value that is at the target utilization, such that the GC is given enough
+time to *only* use that amount of CPU.
+This note is a key aspect of the proposal and will come up later.
+
+What does this scaling factor look like? Recall that because of our model, any
+value of `$r$` has a `$1-u$` factor in the numerator and a `$u$` factor in the
+denominator.
+Scaling from one utilization to another is as simple as switching out factors.
+
+Let `$\hat{A}_n$` be the actual peak live heap size at the end of a GC cycle (as
+opposed to `$N_n$`, which is only a target).
+Let `$u_n$` be the GC CPU utilization over cycle `$n$` and `$u_t$` be the target
+utilization.
+Altogether,
+
+```render-latex
+r_{measured} \textrm{ for GC } n = \frac{\hat{A}_n - T_n}{M_n + S_n + G_n}\frac{(1-u_t)u_n}{(1-u_n)u_t}
+```
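+
+In code, this measurement is a straightforward translation of the formula
+above (a sketch with assumed names, not the runtime's actual implementation):
+
+```go
+// measureR computes the measured conversion rate for a cycle, rescaled from
+// the actual utilization uN to the target utilization uT. peakHeap is the
+// peak heap size this cycle, trigger is the trigger point, and the scanned*
+// values are what was actually scanned.
+func measureR(peakHeap, trigger, scannedHeap, scannedStacks, scannedGlobals, uN, uT float64) float64 {
+	allocated := peakHeap - trigger
+	scanned := scannedHeap + scannedStacks + scannedGlobals
+	// Rescale per the model's (1-u)/u dependence on utilization.
+	return (allocated / scanned) * ((1 - uT) * uN) / ((1 - uN) * uT)
+}
+```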
+
+Now that we have a way to measure `$r$`, we could use this value directly as our
+prediction.
+But I fear that using it directly has the potential to introduce a significant
+amount of noise, so smoothing over transient changes to this value is desirable.
+To do so, I propose using this measurement as the set-point for a
+proportional-integral (PI) controller.
+
+The advantage of a PI controller over a proportional controller is that it
+guarantees that steady-state error will be driven to zero.
+Note that the current GC pacer has issues with offset error.
+It may also find the wrong point on the isocline of GC CPU utilization and peak
+heap size because the error term can go to zero even if both targets are missed.
+The disadvantage of a PI controller, however, is that it oscillates and may
+overshoot significantly on its way to reaching a steady value.
+This disadvantage could be mitigated by overdamping the controller, but I
+propose we tune it using the tried-and-tested standard Ziegler-Nichols method.
+In simulations (see [the simulations section](#simulations)) this tuning tends
+to work well.
+It's worth noting that PI controllers (and, more generally, PID controllers)
+have many years of research and use behind them, and this design lets us take
+advantage of that and tune the pacer further if need be.
+
+Why a PI controller and not a PID controller? PI controllers are simpler to
+reason about, and the derivative term in a PID controller tends to be sensitive
+to high-frequency noise.
+The advantage of the derivative term is a shorter rise time, but simulations
+show that the rise time is roughly 1 GC cycle, so I don't think there's much
+reason to include it just yet.
+Adding the derivative term though is trivial once the rest of the design is in
+place, so the door is always open.
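+
+To make the smoothing concrete, here is a minimal sketch of a discrete PI
+controller (illustrative structure and names only; the gains would come from
+Ziegler-Nichols tuning as described above):
+
+```go
+// piController smooths a noisy measurement. Each GC cycle, the measured r is
+// fed in as the set-point, and the output becomes the prediction r_n used by
+// the trigger condition. This is a sketch, not the runtime's implementation.
+type piController struct {
+	kp, ki      float64 // proportional and integral gains
+	errIntegral float64 // accumulated integral term
+}
+
+func (c *piController) next(input, setPoint float64) float64 {
+	err := setPoint - input
+	output := c.kp*err + c.errIntegral
+	c.errIntegral += c.ki * err
+	return output
+}
+```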
+
+By focusing on this `$r$` value, we've now reframed the pacing problem as a
+search problem instead of an optimization problem.
+That raises the question: are we still reaching our optimization goals? And how
+do GC assists fit into this picture?
+
+The good news is that we're always triggering for the right CPU utilization.
+Because `$r$` is scaled for the *target* GC CPU utilization, and because `$r$`
+picks the trigger, the pacer will naturally start at a point that generates
+that utilization in the steady-state.
+
+Following from this fact, there is no longer any reason to have the target GC
+CPU utilization be 30%.
+Originally, in the design for the current pacer, the target GC CPU utilization
+began at 25%, with GC assists always *extending* from that, so in the
+steady-state there would be no GC assists.
+However, because the pacer was structured to solve an optimization problem, it
+required feedback from both directions.
+That is, it needed to know whether it was acting too aggressively *or* not
+aggressively enough.
+This feedback could only be obtained by actually performing GC assists.
+But with this design, that's no longer necessary.
+The target CPU utilization can completely exclude GC assists in the steady-state
+with a mitigated risk of bad behavior.
+
+As a result, I propose the target utilization be reduced once again to 25%,
+eliminating GC assists in any steady-state that is not out-pacing the GC, and
+potentially improving application latency as a result.
+
+### Smoothing out GC assists
+
+This discussion of GC assists brings us to the existing issues around pacing
+decisions made *while* the GC is active (which I will refer to as the "GC assist
+pacer" below).
+For the most part, this system works very well, and is able to smooth over small
+hiccups in performance, due to noise from the underlying platform or elsewhere.
+
+Unfortunately, there's one place where it doesn't do so well: the hard heap
+goal.
+Currently, the GC assist pacer prevents memory blow-up in pathological cases by
+ramping up assists once either the GC has found more work than it expected (i.e.
+the live scannable heap has grown) or the GC is behind and the application's
+heap size has exceeded the heap goal.
+In both of these cases, it sets a somewhat arbitrarily defined hard limit at
+1.1x the heap goal.
+
+The problem with this policy is that high `GOGC` values create the opportunity
+for very large changes in live heap size, because the GC has quite a lot of
+runway (consider an application with `GOGC=51100` that has a steady-state live
+heap of 10 MiB, where suddenly all the memory it allocates turns out to be
+live).
+In this case, the GC assist pacer is going to find all this new live memory and
+panic: the rate of assists will begin to skyrocket.
+This particular problem impedes the adoption of any sort of target heap size, or
+configurable minimum heap size.
+One can imagine a small live heap with a large target heap size as having a
+large *effective* `GOGC` value, so it reduces to exactly the same case.
+
+To deal with this, I propose modifying the GC assist policy to set a hard heap
+goal of `$\gamma N_n$`.
+The intuition behind this goal is that if *all* the memory allocated in this GC
+cycle turns out to be live, the *next* GC cycle will end up using that much
+memory *anyway*, so we let it slide.
+
+But this hard goal need not be used for actually pacing GC assists other than in
+extreme cases.
+In fact, it must not, because an assist ratio computed from this hard heap goal
+and the worst-case scan work turns out to be extremely loose, leading to the GC
+assist pacer consistently missing the heap goal in some steady-states.
+
+So, I propose an alternative calculation for the assist ratio.
+I believe that the assist ratio must always pass through the heap goal, since
+otherwise there's no guarantee that the GC meets its heap goal in the
+steady-state (which is a fundamental property of the pacer in Go's existing GC
+design).
+However, there's no reason why the ratio itself needs to change dramatically
+when there's more GC work than expected.
+In fact, the preferable case is that it does not, because that lends itself to a
+much more even distribution of GC assist work across the cycle.
+
+So, I propose that the assist ratio be an extrapolation of the current
+steady-state assist ratio, with the exception that it now includes non-heap GC
+work, as the rest of this document does.
+
+That is,
+
+```render-latex
+\begin{aligned}
+\textrm{max scan work} & = T_n + S_n + G_n \\
+\textrm{extrapolated runway} & = \frac{N_n - T_n}{P_{n-1} + S_n + G_n} (T_n + S_n + G_n) \\
+\textrm{assist ratio} & = \frac{\textrm{extrapolated runway}}{\textrm{max scan work}}
+\end{aligned}
+```
+
+This definition is intentionally roundabout.
+The assist ratio changes dynamically as the amount of GC work left decreases and
+the amount of memory allocated increases.
+This responsiveness is what allows the pacing mechanism to be so robust.
+
+Today, the assist ratio is calculated by computing the remaining heap runway and
+the remaining expected GC work, and dividing the former by the latter.
+But of course, that's not possible if there's more GC work than expected, since
+then the assist ratio could go negative, which is meaningless.
+
+So that's the purpose of defining the "max scan work" and "extrapolated runway":
+these are worst-case values that are always safe to subtract from, such that we
+can maintain roughly the same assist ratio throughout the cycle (assuming no
+hiccups).
+
+One minor detail is that the "extrapolated runway" needs to be capped at the
+hard heap goal to prevent breaking that promise, though in practice this will
+almost never matter.
+The hard heap goal is such a loose bound that it's really only useful in
+pathological cases, but it's still necessary to ensure robustness.
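+
+As a rough sketch (assumed names and units of bytes; one plausible reading of
+the cap, not the runtime's actual code), the calculation looks like:
+
+```go
+// assistRatio computes the proposed assist ratio. trigger is T_n, goal is
+// N_n, hardGoal is gamma*N_n, and prevScannableHeap is P_{n-1}.
+func assistRatio(trigger, goal, hardGoal, prevScannableHeap, stacks, globals float64) float64 {
+	maxScanWork := trigger + stacks + globals
+	extrapolatedRunway := (goal - trigger) / (prevScannableHeap + stacks + globals) * maxScanWork
+	// Cap the runway so the trigger plus runway never exceeds the hard goal.
+	if maxRunway := hardGoal - trigger; extrapolatedRunway > maxRunway {
+		extrapolatedRunway = maxRunway
+	}
+	return extrapolatedRunway / maxScanWork
+}
+```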
+
+A key point in this choice is that the GC assist pacer will *only* respond to
+changes in allocation behavior and scan rate, not changes in the *size* of the
+live heap.
+This point seems minor, but it means the GC assist pacer's function is much
+simpler and more predictable.
+
+## Remaining unanswered questions
+
+Not every problem listed in issue
+[#42430](https://github.com/golang/go/issues/42430) is resolved by this design,
+though many are.
+
+Notable exclusions are:
+1. Mark assists are front-loaded in a GC cycle.
+1. The hard heap goal isn't actually hard in some circumstances.
+1. Dealing with idle GC workers.
+1. Donating goroutine assist credit/debt on exit.
+1. Existing trigger limits to prevent unbounded memory growth.
+
+(1) is difficult to resolve without special cases and arbitrary heuristics, and
+I think in practice it's OK; the system has been fairly robust to this kind of
+noise, and will now be more so.
+That doesn't mean that it shouldn't be revisited, but it's not quite as big as
+the other problems, so I leave it outside the scope of this proposal.
+
+(2) is also tricky and somewhat orthogonal.
+I believe the path forward there involves better scheduling of fractional GC
+workers, which are currently very loosely scheduled.
+This design has made me realize how important dedicated GC workers are to
+progress, and how GC assists are a back-up mechanism.
+I believe that the fundamental problem there lies with the fact that fractional
+GC workers don't provide that sort of consistent progress.
+
+For (3) I believe we should remove idle GC workers entirely, which is why this
+document ignores them.
+Idle GC workers are extra mark workers that run if the application isn't
+utilizing all GOMAXPROCS worth of parallelism.
+The scheduler schedules "low priority" background workers on any additional CPU
+resources, and this ultimately skews utilization measurements in the GC, because
+as of today they're unaccounted for.
+Unfortunately, it's likely that idle GC workers have accidentally become
+necessary for the GC to make progress, so just pulling them out won't be quite
+so easy.
+I believe that needs a separate proposal given other large potential changes
+coming to the compiler and runtime in the near future, because there's
+unfortunately a fairly significant risk of bugs with doing so, though I do think
+it's ultimately an improvement.
+See [the related issue for more
+details](https://github.com/golang/go/issues/44163).
+
+(4) is easy and I don't believe needs any deliberation.
+That is a bug we should simply fix.
+
+For (5), I propose we retain the limits, translated to the current design.
+For reference, these limits are `$0.95 (\gamma - 1)$` as the upper-bound on the
+trigger ratio, and `$0.6 (\gamma - 1)$` as the lower-bound.
+
+The upper bound exists to prevent ever starting the GC too late in low-activity
+scenarios.
+It may cause consistent undershoot, but prevents issues in GC pacer calculations
+by preventing the calculated runway from ever being too low.
+The upper-bound may need to be revisited when considering a configurable target
+heap size.
+
+The lower bound exists to prevent the application from causing excessive memory
+growth due to floating garbage as the application's allocation rate increases.
+Before that limit was installed, it wasn't very easy for an application to
+allocate hard enough for that to happen.
+The lower bound probably should be revisited, but I leave that outside of the
+scope of this document.
+
+To translate them into this design, I propose we simply modify the trigger
+condition to include these limits.
+It's not important to put these limits in the rest of the pacer because it no
+longer tries to compute the trigger point ahead of time.
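+
+One possible translation (illustrative only; the precise form is an
+implementation detail) is to clamp the point at which the condition may fire
+between the equivalents of the old bounds, where `$M_{n-1}$` is the live heap
+at the end of the previous cycle:
+
+```render-latex
+M_{n-1}(1 + 0.6(\gamma - 1)) \le \textrm{trigger point} \le M_{n-1}(1 + 0.95(\gamma - 1))
+```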
+
+## A note about CPU utilization
+
+This document uses the term "GC CPU utilization" quite frequently, but so far
+has refrained from defining exactly how it's measured.
+Before doing that, let's define CPU utilization over a GC mark phase, as it's
+been used so far.
+First, let's define the mark phase: it is the period of wall-clock time between
+the end of sweep termination and the start of mark termination.
+In the mark phase, the process will have access to some total number of
+CPU-seconds of execution time.
+This CPU time can then be divided into "time spent doing GC work" and "time
+spent doing anything else."
+GC CPU utilization is then defined as a proportion of that total CPU time that
+is spent doing GC work.
+
+This definition seems straightforward enough but in reality it's more
+complicated.
+Measuring CPU time on most platforms is tricky, so what Go does today is an
+approximation: take the wall-clock time of the GC mark phase, multiply it by
+`GOMAXPROCS`.
+Call this `$T$`.
+Take 25% of that (representing the dedicated GC workers) and add the total
+amount of time all goroutines spend in GC assists.
+The latter is computed directly, but is just the difference between the start
+and end time in the critical section; it does not try to account for context
+switches forced by the underlying system, or anything like that.
+Now take this value we just computed and divide it by `$T$`.
+That's our GC CPU utilization.
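+
+In code, the approximation amounts to the following (a sketch with assumed
+names; times are in seconds):
+
+```go
+// gcCPUUtilization approximates GC CPU utilization over the mark phase as
+// described above.
+func gcCPUUtilization(markWallSeconds float64, gomaxprocs int, assistCPUSeconds float64) float64 {
+	total := markWallSeconds * float64(gomaxprocs) // T: wall-clock mark time times GOMAXPROCS.
+	gcTime := 0.25*total + assistCPUSeconds        // dedicated workers (25% of T) plus assist time.
+	return gcTime / total
+}
+```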
+
+This approximation is mostly accurate in the common case, but is prone to skew
+in various scenarios, such as when the system is CPU-starved.
+This fact can be problematic, but I believe it is largely orthogonal to the
+content of this document; we can work on improving this approximation without
+having to change any of this design.
+It already assumes that we have a good measure of CPU utilization.
+
+## Alternatives considered
+
+The alternatives considered for this design basically boil down to its
+individual components.
+
+For instance, I considered grouping stacks and globals into the current
+formulation of the pacer, but that adds another condition to the definition of
+the steady-state: stacks and globals do not change.
+That makes the steady-state more fragile.
+
+I also considered a design that was similar, but computed everything in terms of
+an "effective" `GOGC`, and "converted" that back to `GOGC` for pacing purposes
+(that is, what would the heap trigger have been had the expected amount of live
+memory been correct?).
+This formulation is similar to how Austin formulated the experimental
+`SetMaxHeap` API.
+Austin suggested I avoid this formulation because math involving `GOGC` tends to
+have to work around infinities.
+A good example of this is if `runtime.GC` is called while `GOGC` is off: the
+runtime has to "fake" a very large `GOGC` value in the pacer.
+By using a ratio of rates that's more grounded in actual application behavior,
+the trickiness of the math is avoided.
+
+I also considered not using a PI controller and just using the measured `$r$`
+value directly, assuming it doesn't change across GC cycles, but that method is
+prone to noise.
+
+## Justification
+
+Pros:
+- The steady-state is now independent of the amount of GC work to be done.
+- Steady-state mark assists drop to zero if the application is not allocating
+  too heavily (a likely latency improvement in many scenarios) (see the "high
+  `GOGC`" scenario in [simulations](#simulations)).
+- GC amortization includes non-heap GC work, and responds well in those cases.
+- Eliminates offset error present in the existing design.
+
+Cons:
+- Definition of `GOGC` has changed slightly, so a `GOGC` of 100 will use
+  slightly more memory in nearly all cases.
+- `$r$` is a little bit unintuitive.
+
+## Implementation
+
+This pacer redesign will be implemented by Michael Knyszek.
+
+1. The existing pacer will be refactored into a form fit for simulation.
+1. A comprehensive simulation-based test suite will be written for the pacer.
+1. The existing pacer will be swapped out with the new implementation.
+
+The purpose of the simulation infrastructure is to make the pacer, in general,
+more testable.
+This lets us write regression test cases based on real-life problems we
+encounter and confirm that they don't break going forward.
+Furthermore, with fairly large changes to the Go compiler and runtime in the
+pipeline, it's especially important to reduce the risk of this change as much as
+possible.
+
+## Go 1 backwards compatibility
+
+This change will not modify any user-visible APIs, but may have surprising
+effects on application behavior.
+The two "most" incompatible changes in this proposal are the redefinition of the
+heap goal to include non-heap sources of GC work, since that directly influences
+the meaning of `GOGC`, and the change in target GC CPU utilization.
+These two factors together mean that, by default and on average, Go applications
+will use slightly more memory than before.
+To obtain previous levels of memory usage, users may need to manually tune
+`GOGC` lower, but the overall result should be more consistent, more
+predictable, and more efficient.
+
+## Simulations
+
+In order to show the effectiveness of the new pacer and compare it to the
+current one, I modeled both the existing pacer and the new pacer and simulated
+both in a variety of scenarios.
+
+The code used to run these simulations and generate the plots below may be found
+at [github.com/mknyszek/pacer-model](https://github.com/mknyszek/pacer-model).
+
+### Assumptions and caveats
+
+The model of each pacer is fairly detailed, taking into account subtleties such
+as allocations made during a GC cycle being marked.
+The one big assumption it makes, however, is that the behavior of the
+application while a GC cycle is running is perfectly smooth, such that the GC
+assist pacer is perfectly paced according to the initial assist ratio.
+In practice, this is close to true, but it's worth accounting for the more
+extreme cases.
+(TODO: Show simulations that inject some noise into the GC assist pacer.)
+
+Another caveat with the simulation is the graph of "R value" (that is, `$r_n$`),
+and "Alloc/scan ratio."
+The latter is well-defined for all simulations (it's a part of the input) but
+the former is not a concept used in the current pacer.
+So for simulations of the current pacer, the "R value" is backed out from the
+trigger ratio: we know the runway, we know the *expected* scan work for the
+target utilization, so we can compute the `$r_n$` that the trigger point
+encodes.
+
+### Results
+
+**Perfectly steady heap size.**
+
+The simplest possible scenario.
+
+Current pacer:
+
+![](44167/pacer-plots/old-steady.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-steady.png)
+
+Notes:
+- The current pacer doesn't seem to find the right utilization.
+- Both pacers do reasonably well at meeting the heap goal.
+
+**Jittery heap size and alloc/scan ratio.**
+
+A mostly steady-state heap with a slight jitter added to both live heap size and
+the alloc/scan ratio.
+
+Current pacer:
+
+![](44167/pacer-plots/old-jitter-alloc.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-jitter-alloc.png)
+
+Notes:
+- Both pacers are resilient to a small amount of noise.
+
+**Small step in alloc/scan ratio.**
+
+This scenario demonstrates the transition between two steady-states that are
+not far from one another.
+
+Current pacer:
+
+![](44167/pacer-plots/old-step-alloc.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-step-alloc.png)
+
+Notes:
+- Both pacers react to the change in alloc/scan rate.
+- Clear oscillations in utilization visible for the new pacer.
+
+**Large step in alloc/scan ratio.**
+
+This scenario demonstrates the transition between two steady-states that are
+further from one another.
+
+Current pacer:
+
+![](44167/pacer-plots/old-heavy-step-alloc.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-heavy-step-alloc.png)
+
+Notes:
+- The old pacer consistently overshoots the heap size post-step.
+- The new pacer minimizes overshoot.
+
+**Large step in heap size with a high `GOGC` value.**
+
+This scenario demonstrates the "high `GOGC` problem" described in the [GC pacer
+meta-issue](https://github.com/golang/go/issues/42430).
+
+Current pacer:
+
+![](44167/pacer-plots/old-high-GOGC.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-high-GOGC.png)
+
+Notes:
+- The new pacer's heap size stabilizes faster than the old pacer's.
+- The new pacer has a spike in overshoot; this is *by design*.
+- The new pacer's utilization is independent of this heap size spike.
+- The old pacer has a clear spike in utilization.
+
+**Oscillating alloc/scan ratio.**
+
+This scenario demonstrates an oscillating alloc/scan ratio.
+This scenario is interesting because it shows a somewhat extreme case where a
+steady-state is never actually reached for any amount of time.
+However, this is not a realistic scenario.
+
+Current pacer:
+
+![](44167/pacer-plots/old-osc-alloc.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-osc-alloc.png)
+
+Notes:
+- The new pacer tracks the oscillations worse than the old pacer.
+  This is likely due to the error never settling, so the PI controller is always
+  overshooting.
+
+**Large amount of goroutine stacks.**
+
+This scenario demonstrates the "heap amortization problem" described in the [GC
+pacer meta-issue](https://github.com/golang/go/issues/42430) for goroutine
+stacks.
+
+Current pacer:
+
+![](44167/pacer-plots/old-big-stacks.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-big-stacks.png)
+
+Notes:
+- The old pacer consistently overshoots because it's underestimating the amount
+  of work it has to do.
+- The new pacer uses more memory, since the heap goal is now proportional to
+  stack space, but it stabilizes and is otherwise sane.
+
+**Large amount of global variables.**
+
+This scenario demonstrates the "heap amortization problem" described in the [GC
+pacer meta-issue](https://github.com/golang/go/issues/42430) for global
+variables.
+
+Current pacer:
+
+![](44167/pacer-plots/old-big-globals.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-big-globals.png)
+
+Notes:
+- This is essentially identical to the stack space case.
+
+**High alloc/scan ratio.**
+
+This scenario shows the behavior of each pacer in the face of a very high
+alloc/scan ratio, with jitter applied to both the live heap size and the
+alloc/scan ratio.
+
+Current pacer:
+
+![](44167/pacer-plots/old-heavy-jitter-alloc.png)
+
+New pacer:
+
+![](44167/pacer-plots/new-heavy-jitter-alloc.png)
+
+Notes:
+- In the face of a very high allocation rate, the old pacer consistently
+  overshoots, though both maintain a similar GC CPU utilization.
diff --git a/design/44167/inl1.png b/design/44167/inl1.png
new file mode 100644
index 0000000..f8832b6
--- /dev/null
+++ b/design/44167/inl1.png
Binary files differ
diff --git a/design/44167/inl10.png b/design/44167/inl10.png
new file mode 100644
index 0000000..ddac0d0
--- /dev/null
+++ b/design/44167/inl10.png
Binary files differ
diff --git a/design/44167/inl11.png b/design/44167/inl11.png
new file mode 100644
index 0000000..759de77
--- /dev/null
+++ b/design/44167/inl11.png
Binary files differ
diff --git a/design/44167/inl12.png b/design/44167/inl12.png
new file mode 100644
index 0000000..3d9bb4a
--- /dev/null
+++ b/design/44167/inl12.png
Binary files differ
diff --git a/design/44167/inl13.png b/design/44167/inl13.png
new file mode 100644
index 0000000..acfc8e8
--- /dev/null
+++ b/design/44167/inl13.png
Binary files differ
diff --git a/design/44167/inl14.png b/design/44167/inl14.png
new file mode 100644
index 0000000..62a247d
--- /dev/null
+++ b/design/44167/inl14.png
Binary files differ
diff --git a/design/44167/inl15.png b/design/44167/inl15.png
new file mode 100644
index 0000000..5586f7c
--- /dev/null
+++ b/design/44167/inl15.png
Binary files differ
diff --git a/design/44167/inl16.png b/design/44167/inl16.png
new file mode 100644
index 0000000..19a3be1
--- /dev/null
+++ b/design/44167/inl16.png
Binary files differ
diff --git a/design/44167/inl17.png b/design/44167/inl17.png
new file mode 100644
index 0000000..c6eb198
--- /dev/null
+++ b/design/44167/inl17.png
Binary files differ
diff --git a/design/44167/inl18.png b/design/44167/inl18.png
new file mode 100644
index 0000000..46258c0
--- /dev/null
+++ b/design/44167/inl18.png
Binary files differ
diff --git a/design/44167/inl19.png b/design/44167/inl19.png
new file mode 100644
index 0000000..8954e04
--- /dev/null
+++ b/design/44167/inl19.png
Binary files differ
diff --git a/design/44167/inl2.png b/design/44167/inl2.png
new file mode 100644
index 0000000..3b2906d
--- /dev/null
+++ b/design/44167/inl2.png
Binary files differ
diff --git a/design/44167/inl20.png b/design/44167/inl20.png
new file mode 100644
index 0000000..e2ff650
--- /dev/null
+++ b/design/44167/inl20.png
Binary files differ
diff --git a/design/44167/inl21.png b/design/44167/inl21.png
new file mode 100644
index 0000000..34b21bc
--- /dev/null
+++ b/design/44167/inl21.png
Binary files differ
diff --git a/design/44167/inl22.png b/design/44167/inl22.png
new file mode 100644
index 0000000..7e49cfe
--- /dev/null
+++ b/design/44167/inl22.png
Binary files differ
diff --git a/design/44167/inl23.png b/design/44167/inl23.png
new file mode 100644
index 0000000..60d3514
--- /dev/null
+++ b/design/44167/inl23.png
Binary files differ
diff --git a/design/44167/inl24.png b/design/44167/inl24.png
new file mode 100644
index 0000000..af0131c
--- /dev/null
+++ b/design/44167/inl24.png
Binary files differ
diff --git a/design/44167/inl25.png b/design/44167/inl25.png
new file mode 100644
index 0000000..20af26a
--- /dev/null
+++ b/design/44167/inl25.png
Binary files differ
diff --git a/design/44167/inl26.png b/design/44167/inl26.png
new file mode 100644
index 0000000..1f48988
--- /dev/null
+++ b/design/44167/inl26.png
Binary files differ
diff --git a/design/44167/inl3.png b/design/44167/inl3.png
new file mode 100644
index 0000000..636e5fb
--- /dev/null
+++ b/design/44167/inl3.png
Binary files differ
diff --git a/design/44167/inl4.png b/design/44167/inl4.png
new file mode 100644
index 0000000..03cdee9
--- /dev/null
+++ b/design/44167/inl4.png
Binary files differ
diff --git a/design/44167/inl5.png b/design/44167/inl5.png
new file mode 100644
index 0000000..32ec3e5
--- /dev/null
+++ b/design/44167/inl5.png
Binary files differ
diff --git a/design/44167/inl6.png b/design/44167/inl6.png
new file mode 100644
index 0000000..2aeb809
--- /dev/null
+++ b/design/44167/inl6.png
Binary files differ
diff --git a/design/44167/inl7.png b/design/44167/inl7.png
new file mode 100644
index 0000000..1e2056e
--- /dev/null
+++ b/design/44167/inl7.png
Binary files differ
diff --git a/design/44167/inl8.png b/design/44167/inl8.png
new file mode 100644
index 0000000..922811f
--- /dev/null
+++ b/design/44167/inl8.png
Binary files differ
diff --git a/design/44167/inl9.png b/design/44167/inl9.png
new file mode 100644
index 0000000..2f96e9f
--- /dev/null
+++ b/design/44167/inl9.png
Binary files differ
diff --git a/design/44167/pacer-plots/new-big-globals.png b/design/44167/pacer-plots/new-big-globals.png
new file mode 100644
index 0000000..958f985
--- /dev/null
+++ b/design/44167/pacer-plots/new-big-globals.png
Binary files differ
diff --git a/design/44167/pacer-plots/new-big-stacks.png b/design/44167/pacer-plots/new-big-stacks.png
new file mode 100644
index 0000000..1ab5583
--- /dev/null
+++ b/design/44167/pacer-plots/new-big-stacks.png
Binary files differ
diff --git a/design/44167/pacer-plots/new-heavy-jitter-alloc.png b/design/44167/pacer-plots/new-heavy-jitter-alloc.png
new file mode 100644
index 0000000..e5596ec
--- /dev/null
+++ b/design/44167/pacer-plots/new-heavy-jitter-alloc.png
Binary files differ
diff --git a/design/44167/pacer-plots/new-heavy-step-alloc.png b/design/44167/pacer-plots/new-heavy-step-alloc.png
new file mode 100644
index 0000000..d1261ef
--- /dev/null
+++ b/design/44167/pacer-plots/new-heavy-step-alloc.png
Binary files differ
diff --git a/design/44167/pacer-plots/new-high-GOGC.png b/design/44167/pacer-plots/new-high-GOGC.png
new file mode 100644
index 0000000..8c9291f
--- /dev/null
+++ b/design/44167/pacer-plots/new-high-GOGC.png
Binary files differ
diff --git a/design/44167/pacer-plots/new-jitter-alloc.png b/design/44167/pacer-plots/new-jitter-alloc.png
new file mode 100644
index 0000000..b021d60
--- /dev/null
+++ b/design/44167/pacer-plots/new-jitter-alloc.png
Binary files differ
diff --git a/design/44167/pacer-plots/new-osc-alloc.png b/design/44167/pacer-plots/new-osc-alloc.png
new file mode 100644
index 0000000..2d18192
--- /dev/null
+++ b/design/44167/pacer-plots/new-osc-alloc.png
Binary files differ
diff --git a/design/44167/pacer-plots/new-steady.png b/design/44167/pacer-plots/new-steady.png
new file mode 100644
index 0000000..4d4f45c
--- /dev/null
+++ b/design/44167/pacer-plots/new-steady.png
Binary files differ
diff --git a/design/44167/pacer-plots/new-step-alloc.png b/design/44167/pacer-plots/new-step-alloc.png
new file mode 100644
index 0000000..a8639f3
--- /dev/null
+++ b/design/44167/pacer-plots/new-step-alloc.png
Binary files differ
diff --git a/design/44167/pacer-plots/old-big-globals.png b/design/44167/pacer-plots/old-big-globals.png
new file mode 100644
index 0000000..d1e1d13
--- /dev/null
+++ b/design/44167/pacer-plots/old-big-globals.png
Binary files differ
diff --git a/design/44167/pacer-plots/old-big-stacks.png b/design/44167/pacer-plots/old-big-stacks.png
new file mode 100644
index 0000000..80febdf
--- /dev/null
+++ b/design/44167/pacer-plots/old-big-stacks.png
Binary files differ
diff --git a/design/44167/pacer-plots/old-heavy-jitter-alloc.png b/design/44167/pacer-plots/old-heavy-jitter-alloc.png
new file mode 100644
index 0000000..3d5eb1d
--- /dev/null
+++ b/design/44167/pacer-plots/old-heavy-jitter-alloc.png
Binary files differ
diff --git a/design/44167/pacer-plots/old-heavy-step-alloc.png b/design/44167/pacer-plots/old-heavy-step-alloc.png
new file mode 100644
index 0000000..8b297cf
--- /dev/null
+++ b/design/44167/pacer-plots/old-heavy-step-alloc.png
Binary files differ
diff --git a/design/44167/pacer-plots/old-high-GOGC.png b/design/44167/pacer-plots/old-high-GOGC.png
new file mode 100644
index 0000000..3dbbc20
--- /dev/null
+++ b/design/44167/pacer-plots/old-high-GOGC.png
Binary files differ
diff --git a/design/44167/pacer-plots/old-jitter-alloc.png b/design/44167/pacer-plots/old-jitter-alloc.png
new file mode 100644
index 0000000..ad8d42c
--- /dev/null
+++ b/design/44167/pacer-plots/old-jitter-alloc.png
Binary files differ
diff --git a/design/44167/pacer-plots/old-osc-alloc.png b/design/44167/pacer-plots/old-osc-alloc.png
new file mode 100644
index 0000000..7527144
--- /dev/null
+++ b/design/44167/pacer-plots/old-osc-alloc.png
Binary files differ
diff --git a/design/44167/pacer-plots/old-steady.png b/design/44167/pacer-plots/old-steady.png
new file mode 100644
index 0000000..7576b9f
--- /dev/null
+++ b/design/44167/pacer-plots/old-steady.png
Binary files differ
diff --git a/design/44167/pacer-plots/old-step-alloc.png b/design/44167/pacer-plots/old-step-alloc.png
new file mode 100644
index 0000000..2b06691
--- /dev/null
+++ b/design/44167/pacer-plots/old-step-alloc.png
Binary files differ
diff --git a/design/44167/svg2png.bash b/design/44167/svg2png.bash
new file mode 100755
index 0000000..72a4f62
--- /dev/null
+++ b/design/44167/svg2png.bash
@@ -0,0 +1,8 @@
+#!/bin/bash
+
+for input in *.svg; do
+	output=${input%.*}.png
+	google-chrome --headless --window-size=1920,1080 --disable-gpu --screenshot "$input"
+	convert screenshot.png -trim "$output"
+done
+rm screenshot.png