| # Proposal: Soft memory limit |
| |
| Author: Michael Knyszek |
| |
| Date: 15 September 2021 |
| |
| ## Summary |
| |
| This document proposes a new option for tuning the behavior of the Go garbage |
| collector by setting a soft memory limit on the total amount of memory that Go |
| uses. |
| |
| This option comes in two flavors: a new `runtime/debug` function called |
| `SetMemoryLimit` and a `GOMEMLIMIT` environment variable. |
| |
| This new option gives applications better control over their resource economy. |
| It empowers users to: |
| * Better utilize the memory that they already have, |
| * Confidently decrease their memory limits, knowing Go will respect them, |
| * Avoid unsupported forms of garbage collection tuning like heap ballasts. |
| |
| ## Motivation |
| |
| Go famously offers a single option for tuning the Go garbage collector: `GOGC` |
| (or `runtime/debug.SetGCPercent`). |
| This option offers a direct tradeoff between CPU and memory: it directly |
| dictates the amount of memory overhead available to the garbage collector. |
| Less memory overhead means more frequent garbage collection cycles, and more CPU |
| spent on GC. |
| More memory overhead means less frequent cycles and more CPU for the |
| application. |
| |
| This option has carried the Go project well for a long time. |
However, years of experience have produced a meaningful amount of evidence
suggesting that `GOGC` is not enough.
| A direct tradeoff isn't always possible, because memory is not fungible relative |
| to CPU resources. |
| |
| Consider an application that runs as a service or a daemon. |
| In these cases, out-of-memory errors often arise due to transient spikes in |
| applications' heap sizes. |
| Today, Go does not respect users' memory limits. |
| As a result, the Go community has developed various patterns for dealing with |
| out-of-memory errors. |
| |
| In scenarios where out-of-memory errors are acceptable (to varying degrees), Go |
| users choose to live with these errors, restarting their service regularly, |
| instead of reducing `GOGC`. |
Reducing `GOGC` directly impacts users' productivity metrics, which are largely
governed by steady-state behavior rather than transients, so this course of
action is undesirable.
| |
| In scenarios where out-of-memory errors are unacceptable, a similar situation |
| occurs. |
Users are unwilling to increase `GOGC` to achieve productivity improvements,
even though in the steady-state they may actually have that memory available to
them.
| They pick an overly conservative value to reduce the chance of an out-of-memory |
| condition in transient states, but this choice leaves productivity on the table. |
| |
| This out-of-memory avoidance led to the Go community developing its own |
| homegrown garbage collector tuning. |
| |
| The first example of such tuning is the heap ballast. |
| In order to increase their productivity metrics while also avoiding |
| out-of-memory errors, users sometimes pick a low `GOGC` value, and fool the GC |
| into thinking that there's a fixed amount of memory live. |
| This solution elegantly scales with `GOGC`: as the real live heap increases, the |
| impact of that fixed set decreases, and `GOGC`'s contribution to the heap size |
| dominates. |
| In effect, `GOGC` is larger when the heap is smaller, and smaller (down to its |
| actual value) when the heap is larger. |
| Unfortunately, this solution is not portable across platforms, and is not |
| guaranteed to continue to work as the Go runtime changes. |
| Furthermore, users are forced to do complex calculations and estimate runtime |
| overheads in order to pick a heap ballast size that aligns with their memory |
| limits. |
| |
| The second example of such tuning is calling `runtime/debug.FreeOSMemory` at |
| some regular interval, forcing a garbage collection to trigger sooner, usually |
| to respect some memory limit. |
| This case is much more dangerous, because calling it too frequently can lead a |
| process to entirely freeze up, spending all its time on garbage collection. |
Working with it takes careful consideration and experimentation to be effective
and to avoid serious repercussions.
| |
Both of these situations, dealing with out-of-memory errors and homegrown
garbage collection tuning, have a straightforward solution that other platforms
(like Java and TCMalloc) already provide their users: a configurable memory
limit, enforced by the Go runtime.
A memory limit would give the Go runtime the information it needs both to
respect users' memory limits and to let them optionally use that memory at all
times, cutting back the cost of garbage collection.
| |
| ## Non-goals |
| |
1. Account for and react to memory outside the Go runtime, such as:
| |
| * Co-tenant processes in a container, |
| * C/C++ memory, and |
| * Kernel memory counted against the process. |
| |
| Dealing with and reacting to memory used by other processes, and even to memory |
| within the process governed by the semantics of a completely different |
| programming language, is an incredibly hard problem and often requires a |
| coordinated effort. |
| It's outside of the scope of the Go runtime to solve this problem for everyone, |
| but I believe the Go runtime has an important role to play in supporting these |
| worthwhile efforts. |
| |
| 1. Eliminate out-of-memory errors in 100% of cases. |
| |
| Whatever policy this API adheres to is going to fail for some use-case, and |
| that's OK. |
| The policy can be tweaked and improved upon as time goes on, but it's impossible |
| for us to create a solution that is all things to all people without a |
| tremendous amount of toil for our team (by e.g. exposing lots of tuning knobs). |
| On top of this, any such solution is likely to become difficult to use at all. |
| The best we can do is make life better for as many users as possible. |
| |
| ## Detailed design |
| |
The design of a memory soft limit consists of four parts: an API to set the
limit, mechanisms to enforce the soft limit, guidance through thorough
documentation, and telemetry for identifying issues in production.
| |
| ### API |
| |
| ```go |
package debug // import "runtime/debug"
| |
| // SetMemoryLimit provides the runtime with a soft memory limit. |
| // |
| // The runtime undertakes several processes to try to respect this |
| // memory limit, including adjustments to the frequency of garbage |
| // collections and returning memory to the underlying system more |
// aggressively. This limit will be respected even if GOGC=off (or if
// SetGCPercent(-1) is called).
| // |
| // The input limit is provided as bytes, and is intended to include |
| // all memory that the Go runtime has direct control over. In other |
| // words, runtime.MemStats.Sys - runtime.MemStats.HeapReleased. |
| // |
| // This limit does not account for memory external to Go, such as |
| // memory managed by the underlying system on behalf of the process, |
| // or memory managed by non-Go code inside the same process. |
| // |
| // A zero limit or a limit that's lower than the amount of memory |
| // used by the Go runtime may cause the garbage collector to run |
| // nearly continuously. However, the application may still make |
| // progress. |
| // |
| // See https://golang.org/doc/gc-ergonomics for a detailed guide |
| // explaining the soft memory limit as well as a variety of common |
| // use-cases and scenarios. |
| // |
| // SetMemoryLimit returns the previously set memory limit. |
| // By default, the limit is math.MaxInt64. |
| // A negative input does not adjust the limit, and allows for |
| // retrieval of the currently set memory limit. |
| func SetMemoryLimit(limit int64) int64 |
| ``` |
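As a usage sketch (the container size and the 90% headroom factor below are
hypothetical choices for illustration, not recommendations from this proposal),
a service might wire the limit up like this:

```go
package main

import "runtime/debug"

func main() {
	// Hypothetical: the container this service runs in grants 512 MiB.
	// Reserve some headroom for memory the Go runtime cannot see (e.g.
	// C allocations, kernel buffers), and hand the rest to the runtime.
	const containerLimit = 512 << 20 // 512 MiB in bytes
	debug.SetMemoryLimit(containerLimit * 9 / 10)

	// A negative input leaves the limit untouched and returns the
	// currently set value.
	current := debug.SetMemoryLimit(-1)
	_ = current

	// ... start the service ...
}
```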
| |
| Note that the soft limit is expressed in terms of the total amount of memory |
| used by the Go runtime. |
| This choice means that enforcement of the soft memory limit by the GC must |
| account for additional memory use such as heap metadata and fragmentation. |
| It also means that the runtime is responsible for any idle heap memory above the |
| limit, i.e. any memory that is currently unused by the Go runtime, but has not |
| been returned to the operating system. |
| As a result, the Go runtime's memory scavenger must also participate in |
| enforcement. |
| |
| This choice is a departure from similar APIs in other languages (including the |
| experimental `SetMaxHeap` [patch](https://golang.org/cl/46751)), whose limits |
| only include space occupied by heap objects themselves. |
To reduce confusion and facilitate understanding, each class of memory that is
accounted for will be precisely listed in the documentation.
| |
| In addition, the soft memory limit can be set directly via an environment |
| variable that all Go programs recognize: `GOMEMLIMIT`. |
| |
For ease-of-use, I propose `GOMEMLIMIT` accept either an integer value in
bytes, or a string such as "8GiB".
More specifically, an integer followed by one of several recognized unit
strings, without spaces.
I propose supporting "B", "KiB", "MiB", "GiB", and "TiB", the latter four
indicating the power-of-two versions of each unit.
Similarly, I propose supporting "KB", "MB", "GB", and "TB", which refer to
their power-of-ten counterparts.
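To make the proposed syntax concrete, here is a rough sketch of a parser for
it; `parseMemLimit` and its unit table are illustrative only, not part of the
proposed API:

```go
package main

import (
	"fmt"
	"strconv"
)

// Multipliers for the proposed unit suffixes: the "i" variants are
// powers of two, the others are powers of ten.
var units = map[string]int64{
	"B":   1,
	"KiB": 1 << 10, "MiB": 1 << 20, "GiB": 1 << 30, "TiB": 1 << 40,
	"KB": 1e3, "MB": 1e6, "GB": 1e9, "TB": 1e12,
}

// parseMemLimit parses the proposed GOMEMLIMIT format: an integer
// followed immediately (no spaces) by an optional unit suffix.
func parseMemLimit(s string) (int64, error) {
	i := 0
	for i < len(s) && s[i] >= '0' && s[i] <= '9' {
		i++
	}
	n, err := strconv.ParseInt(s[:i], 10, 64)
	if err != nil {
		return 0, fmt.Errorf("GOMEMLIMIT: invalid value %q", s)
	}
	unit := s[i:]
	if unit == "" {
		unit = "B" // a bare integer is a count of bytes
	}
	m, ok := units[unit]
	if !ok {
		return 0, fmt.Errorf("GOMEMLIMIT: unrecognized unit %q", unit)
	}
	return n * m, nil // a real parser would also check for overflow
}

func main() {
	limit, _ := parseMemLimit("8GiB")
	fmt.Println(limit) // 8589934592
}
```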
| |
| ### Enforcement |
| |
| #### Garbage collection |
| |
In order to ensure the runtime maintains the soft memory limit, the garbage
collector needs to trigger at a point such that the total heap memory used does
not exceed the soft limit.
| Because the Go garbage collector's memory use is defined entirely in terms of |
| the heap goal, altering its definition is sufficient to ensure that a memory |
| limit is enforced. |
| |
| However, the heap goal is defined in terms of object bytes, while the memory |
| limit includes a much broader variety of memory classes, necessitating a |
| conversion function between the two. |
| |
To compute the heap limit `\hat{L}` from the soft memory limit `L`, I propose
the following calculation:

`\hat{L} = L - T_n - F_n`

Where `T_n` is (per `runtime/metrics` memory names)
| |
| ``` |
| /memory/classes/metadata/mcache/free:bytes + |
| /memory/classes/metadata/mcache/inuse:bytes + |
| /memory/classes/metadata/mspan/free:bytes + |
| /memory/classes/metadata/mspan/inuse:bytes + |
| /memory/classes/metadata/other:bytes + |
| /memory/classes/os-stacks:bytes + |
| /memory/classes/other:bytes + |
| /memory/classes/profiling/buckets:bytes |
| ``` |
| |
and `F_n` is the maximum of
`/memory/classes/heap/unused:bytes + /memory/classes/heap/free:bytes` over the
last GC cycle.

These terms (called `O_n`, for "overheads") account for all memory that is not
accounted for by the GC pacer (from the [new pacer
proposal](https://github.com/golang/proposal/blob/329650d4723a558c2b76b81b4995fc5c267e6bc1/design/44167-gc-pacer-redesign.md#heap-goal)).
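To make the conversion concrete with made-up numbers: given a soft limit `L` of
8 GiB, runtime overheads `T_n` of 64 MiB, and a fragmentation highwater mark
`F_n` of 192 MiB, the GC would pace itself against a heap limit
`\hat{L} = 8192 - 64 - 192 = 7936` MiB, or about 7.75 GiB.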
| |
With `\hat{L}` fully defined, our heap goal for cycle `n` (`N_n`) is a
straightforward extension of the existing one.
| |
Where
* `M_n` is equal to bytes marked at the end of GC n's mark phase
* `S_n` is equal to stack bytes at the beginning of GC n's mark phase
* `G_n` is equal to bytes of globals at the beginning of GC n's mark phase
* `\gamma` is equal to `1 + GOGC/100`

then

`N_n = \min(\hat{L},\ M_{n-1} + (\gamma - 1)(M_{n-1} + S_{n-1} + G_{n-1}))`
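As an illustration (again with made-up numbers): with `GOGC=100` (so
`\gamma = 2`), 512 MiB marked live, and 64 MiB of stacks and globals, the
`GOGC`-derived goal is `512 + (512 + 64) = 1088` MiB; if the heap limit
`\hat{L}` is 1024 MiB, then `N_n = 1024` MiB, and the memory limit takes over
pacing.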
| |
Over the course of a GC cycle `T_n` remains stable because it increases
monotonically.
There's only one situation where `T_n` can grow tremendously (relative to
active heap objects) in a short period of time (< 1 GC cycle), and that's when
`GOMAXPROCS` increases.
So, I also propose recomputing this value at that time.
| |
Meanwhile `F_n` stays relatively stable (and doesn't have a
| sawtooth pattern, as one might expect from a sum of idle heap memory) because |
| object sweeping occurs incrementally, specifically proportionally to how fast |
| the application is allocating. |
| Furthermore, this value is guaranteed to stay relatively stable across a single |
| GC cycle, because the total size of the heap for one GC cycle is bounded by the |
| heap goal. |
| Taking the highwater mark of this value places a conservative upper bound on the |
| total impact of this memory, so the heap goal stays safe from major changes. |
| |
One concern with the above definition of `\hat{L}` is that it is fragile to
changes to the Go GC.
| In the past, seemingly unrelated changes to the Go runtime have impacted the |
| GC's pacer, usually due to an unforeseen influence on the accounting that the |
| pacer relies on. |
To minimize the impact of these accidents on the conversion function, I propose
centralizing and categorizing all the variables used in accounting, and writing
tests to ensure that expected properties of the accounting remain intact.
| |
| #### Death spirals |
| |
As the live heap grows toward `\hat{L}`, the Go garbage collector is going to
stray from the tradeoff defined by `GOGC`, and will trigger more and more often
to reach its goal.
| Left unchecked, the Go garbage collector will eventually run continuously, and |
| increase its utilization as its runway disappears. |
| Eventually, the application will fail to make progress. |
| This process is referred to as a death spiral. |
| |
| One way to deal with this situation is to place a limit on the amount of total |
| CPU utilization of the garbage collector. |
If the garbage collector would exceed that limit at any point, it instead lets
the application proceed, even if that means missing its goal and breaching the
memory limit.
| |
I propose we do exactly this, but rather than provide another knob for
determining the maximum fraction of CPU time, I believe that we should simply
pick a reasonable default.
| I propose that we pick 50% as the default fraction. |
| This fraction is reasonable and conservative since most applications won't come |
| close to this threshold in normal execution. |
| |
To implement this policy, I propose a leaky bucket mechanism inspired by
`jvmquake`, a tool [developed by
Netflix](https://netflixtechblog.medium.com/introducing-jvmquake-ec944c60ba70)
for killing Java service instances that could fall into a death spiral.
| To summarize, the mechanism consists of a conceptual bucket with a capacity, and |
| that bucket accumulates GC CPU time. |
| At the same time, the bucket is drained by mutator CPU time. |
| Should the ratio of GC CPU time to mutator CPU time exceed 1:1 for some time |
| (determined by the bucket's capacity) then the bucket's contents will tend |
| toward infinity. |
| At the point in which the bucket's contents exceed its capacity, `jvmquake` |
| would kill the target service instance. |
| |
In this case, instead of killing the process, the garbage collector will
deliberately prevent user goroutines from assisting the GC in order to keep the
bucket from overflowing.
| The purpose of the bucket is to allow brief spikes in GC CPU utilization. |
| Otherwise, anomalous situations could cause unnecessary missed assists that make |
| GC assist pacing less smooth. |
| |
| A reasonable bucket capacity will have to be chosen empirically, as it should be |
| large enough to accommodate worst-case pause times but not too large such that a |
| 100% GC CPU utilization spike could cause the program to become unresponsive for |
| more than about a second. |
| 1 CPU-second per `GOMAXPROCS` seems like a reasonable place to start. |
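The following is a minimal sketch of that mechanism, assuming CPU time arrives
in sampled slices; the names and update scheme are illustrative, not a
prescription for the runtime's implementation:

```go
package gcdemo

import "time"

// gcCPULimiter is a sketch of the leaky bucket described above.
type gcCPULimiter struct {
	fill     time.Duration // current bucket contents
	capacity time.Duration // e.g. 1 CPU-second per GOMAXPROCS
}

// update fills the bucket with a slice of GC CPU time and drains it by
// the mutator CPU time observed over the same interval. The bucket only
// grows while GC time exceeds mutator time, i.e. the ratio exceeds 1:1.
func (l *gcCPULimiter) update(gcTime, mutatorTime time.Duration) {
	l.fill += gcTime - mutatorTime
	if l.fill < 0 {
		l.fill = 0
	}
	if l.fill > l.capacity {
		l.fill = l.capacity // saturate rather than overflow
	}
}

// limiting reports whether assists should be suppressed so the
// application can proceed, even at the cost of breaching the limit.
func (l *gcCPULimiter) limiting() bool {
	return l.fill >= l.capacity
}
```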
| |
| Unfortunately, 50% is not a reasonable choice for small values of `GOGC`. |
| Consider an application running with `GOGC=10`: an overall 50% GC CPU |
| utilization limit for `GOGC=10` is likely going to be always active, leading to |
| significant overshoot. |
This high utilization is due to the fact that the Go GC at `GOGC=10` will reach
the point at which it may no longer start a GC much sooner than, say,
`GOGC=100`.
At that point, the GC has no option but to increase utilization to meet its
goal.
Because its utilization will then be capped, the GC will have no choice but to
use more memory and overshoot.
| As a result, this effectively creates a minimum `GOGC` value: below a certain |
| `GOGC`, the runtime will be effectively acting as if the `GOGC` value was |
| higher. |
| For now, I consider this acceptable. |
| |
| #### Returning memory to the platform |
| |
| In the context of maintaining a memory limit, it's critical that the Go runtime |
| return memory to the underlying platform as a part of that process. |
| |
| Today, the Go runtime returns memory to the system with a background goroutine |
| called the scavenger, which paces itself to consume around 1% of 1 CPU. |
| This pacing is conservative, but necessarily so: the scavenger must synchronize |
| with any goroutine allocating pages of memory from the heap, so this pacing is |
| generally a slight underestimate as it fails to include synchronization |
| overheads from any concurrent allocating goroutines. |
| Currently, the scavenger's goal is to return free memory to the platform until |
| it reaches the heap goal, accounting for page-level fragmentation and a fixed |
| 10% overhead to avoid paging costs. |
| |
| In the context of a memory limit, I propose that the scavenger's goal becomes |
| that limit. |
| Then, the scavenger should pace itself more aggressively as the runtime's memory |
| use approaches the limit. |
| I propose it does so using a proportional-integral controller whose input is the |
| difference between the memory limit and the memory used by Go, and whose output |
| is the CPU utilization target of the background scavenger. |
| The output will be clamped at a minimum of 1% and a maximum of 10% overall CPU |
| utilization. |
Note that the 10% is chosen arbitrarily; in general, returning memory to the
platform is nowhere near as costly as the GC, but the number must be chosen
such that the mutator still has plenty of room to make progress (with the 50%
GC CPU limit and the 10% scavenger ceiling, the mutator retains at least 40% of
CPU time, which I assert is enough).
In order to make the scavenger scale effectively to overall CPU utilization, it
will require some improvements to avoid the synchronization issues mentioned
above.
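Below is a rough sketch of such a controller under assumptions of my own: the
gains, the update period, and the sign convention are illustrative and would
need empirical tuning:

```go
package scavengedemo

import "math"

// piController is a sketch of the proposed scavenger pacing loop.
type piController struct {
	kp, ki   float64 // proportional and integral gains (tuned empirically)
	integral float64 // accumulated error term
}

// next maps the remaining headroom (memory limit minus Go's memory use,
// in bytes) to a CPU utilization target for the background scavenger,
// clamped to [1%, 10%] of overall CPU. Negating the headroom makes the
// output rise as memory use approaches the limit; a real controller
// would also need anti-windup on the integral term.
func (c *piController) next(headroom float64) float64 {
	err := -headroom
	c.integral += err
	out := c.kp*err + c.ki*c.integral
	return math.Min(0.10, math.Max(0.01, out))
}
```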
| |
Any CPU time spent in the scavenger should also be accounted for in the leaky
bucket algorithm described in the [Death spirals](#death-spirals) section as GC
time; however, I don't think it should be throttled in the same way.
| The intuition behind that is that returning memory to the platform is generally |
| going to be more immediately fruitful than spending more time in garbage |
| collection. |
| |
| #### Alternative approaches considered |
| |
| ##### Enforcement |
| |
| The conversion function from the memory limit to the heap limit described in |
| this section is the result of an impedance mismatch between how the GC pacer |
| views memory and how the memory limit is defined (i.e. how the platform views |
| memory). |
| An alternative approach would be to resolve this impedance mismatch. |
| |
| One way to do so would be to define the memory limit in terms of heap object |
| bytes. |
| As discussed in the [API section](#api) however, this makes for a poor user |
| experience. |
| |
| Another way is to redefine the GC pacer's view of memory to include other memory |
| sources. |
| Let's focus on the most significant of these: fragmentation. |
| Suppose we redefined the heap goal and the garbage collector's pacing to be |
| based on the spans containing objects, rather than the objects themselves. |
This definition is straightforward to implement: marked memory is defined as
the sum total of memory used by spans containing marked objects, and the heap
is considered to grow each time a fresh span is allocated.
| |
| Unfortunately, this redefinition comes with two major caveats that make it very |
| risky. |
The first is that the definition of the GC steady-state, upon which much of the
pacer's intuition is built, now also depends on the degree of fragmentation,
making it a more fragile state in practice.
| |
| The second is that most heaps will have an inflated size. |
| Consider a situation where we start with a very dense heap. |
| After some time, most of the objects die, but there's still at least one object |
| alive in each span that previously contained a live object. |
| With the redefinition, the overall heap will grow to the same size despite much |
| less memory being alive. |
In contrast, the existing definition will cause the heap to grow only to a
multiple of the actual live object memory, and it's very unlikely that it will
go beyond the spans already in-use.
| |
| ##### Returning memory to the platform |
| |
| A potential issue with the proposed design is that because the scavenger is |
| running in the background, it may not react readily to spikes in memory use that |
| exceed the limit. |
| |
In contrast, [TCMalloc](#tcmalloc) eagerly searches for memory to return if an
allocation would exceed the limit.
| In the Go 1.13 cycle, I attempted a similar policy when first implementing the |
| scavenger, and found that it could cause unacceptable tail latency increases in |
| some applications. |
While that policy certainly tried to return memory to the platform
significantly more often than it would in this case, it still has a couple of
downsides:
| 1. It introduces latency. |
| The background scavenger can be more efficiently time-sliced in between other |
| work, so it generally should only impact throughput. |
| 1. It's much more complicated to bound the total amount of time spent searching |
| for and returning memory to the platform during an allocation. |
| |
| The key insight as to why this policy works just fine for TCMalloc and won't |
| work for Go comes from a fundamental difference in design. |
| Manual memory allocators are typically designed to have a LIFO-style memory |
| reuse pattern. |
| Once an allocation is freed, it is immediately available for reallocation. |
| In contrast, most efficient tracing garbage collection algorithms require a |
| FIFO-style memory reuse pattern, since allocations are freed in bulk. |
The result is that the page allocator in a garbage-collected memory allocator
is accessed far more frequently than in a manual memory allocator, so this path
will be hit a lot harder.
| |
| For the purposes of this design, I don't believe the benefits of eager return |
| outweigh the costs, and I do believe that the proposed design is good enough for |
| most cases. |
| |
| ### Documentation |
| |
| Alongside this new feature I plan to create a new document in the doc directory |
| of the Go repository entitled "Go GC Ergonomics." |
| The purpose of this document is four-fold: |
| * Provide Go users with a high-level, but accurate mental and visual model of |
| how the Go GC and scavenger behave with varying GOGC and GOMEMLIMIT settings. |
| * Address common use-cases and provide advice, to promote good practice for each |
| setting. |
| * Break down how Go accounts for memory in excruciating detail. |
| Often memory-related documentation is frustratingly imprecise, making every |
| user's job much more difficult. |
| * Describe how to identify and diagnose issues related to the GC through runtime |
| metrics. |
| |
| While Hyrum's Law guarantees that the API will be used in unexpected ways, at |
| least a central and complete living document will exist to help users better |
| understand what it is that their code is doing. |
| |
| ### Telemetry |
| |
| Identifying issues with the garbage collector becomes even more important with |
| new ways to interact with it. |
| While the Go runtime already exposes metrics that could aid in identifying |
| issues, these metrics are insufficient to create a complete diagnosis, |
| especially in light of the new API. |
| |
| To further assist users in diagnosing issues related to the API (be that misuse |
| or bugs in the implementation) and the garbage collector in general, I propose |
| the addition of three new metrics to the [runtime/metrics |
| package](https://pkg.go.dev/runtime/metrics): |
| * `/gc/throttles:events`: a monotonically increasing count of leaky bucket |
| overflows. |
| * Direct indicator of the application entering a death spiral with the soft |
| memory limit enabled. |
| * `/gc/cycles-by-utilization:percent`: histogram of GC cycles by GC CPU |
| utilization. |
* Replaces the very misleading `runtime.MemStats.GCCPUFraction`.
| * `/gc/scavenge/cycles-by-utilization:percent`: histogram of scavenger |
| utilization. |
| * Since the scavenging rate can now change, identifying possible issues |
| there will be critical. |
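As a usage sketch, reading the proposed metrics with the existing
`runtime/metrics` API would look something like the following; note that these
names do not exist yet, so on today's runtimes they would simply report an
unsupported kind:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/gc/throttles:events"},
		{Name: "/gc/cycles-by-utilization:percent"},
	}
	metrics.Read(samples)
	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			fmt.Println(s.Name, "=", s.Value.Uint64())
		case metrics.KindFloat64Histogram:
			h := s.Value.Float64Histogram()
			fmt.Println(s.Name, "buckets:", len(h.Counts))
		default:
			fmt.Println(s.Name, "not supported by this runtime")
		}
	}
}
```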
| |
| ## Prior art |
| |
| ### Java |
| |
| Nearly every Java garbage collector operates with a heap limit by default. |
| As a result, the heap limit is not a special mode of operation for memory |
| limits, but rather the status quo. |
| |
| The limit is typically configured by passing the `-Xmx` flag to the Java runtime |
| executable. |
| Note that this is a heap limit, not a memory limit, and so only counts heap |
| objects. |
The OpenJDK runtime operates with a default value of ¼ of available memory or 1
GiB, whichever is lesser.
Generally speaking, Java runtimes often only return memory to the OS when they
decide to shrink the heap space used; more recent implementations (e.g. G1) do
| so more rarely, except when [the application is |
| idle](https://openjdk.java.net/jeps/346). |
| |
| ### SetMaxHeap |
| |
| For nearly 4 years, the Go project has been trialing an experimental API in the |
| `runtime/debug` package called `SetMaxHeap`. |
| The API is available as a patch in Gerrit and has been used for some time within |
| Google. |
| |
| The API proposed in this document builds on top of the work done on |
| `SetMaxHeap`. |
| |
| Some notable details about `SetMaxHeap`: |
| * Its limit is defined solely in terms of heap object bytes, like Java. |
| * It does not alter the scavenging policy of the Go runtime at all. |
| * It accepts an extra channel argument that provides GC pressure notifications. |
| * This channel is ignored in most uses I'm aware of. |
| * One exception is where it is used to log GC behavior for telemetry. |
| * It attempts to limit death spirals by placing a minimum on the runway the GC |
| has. |
| * This minimum was set to 10%. |
| |
| Lessons learned from `SetMaxHeap`: |
| * Backpressure notifications are unnecessarily complex for many use-cases. |
| * Figuring out a good heap limit is tricky and leads to conservative limits. |
| * Without a memory return policy, its usefulness for OOM avoidance is limited. |
| |
| ### TCMalloc |
| |
| TCMalloc provides a `SetMemoryLimit` function to set a [soft memory |
| limit](https://github.com/google/tcmalloc/blob/cb5aa92545ded39f75115f3b2cc2ffd66a17d55b/tcmalloc/malloc_extension.h#L306). |
| Because dynamic memory allocation is provided in C and C++ as a library, |
| TCMalloc's `SetMemoryLimit` can only be aware of its own overheads, but notably |
| it does include all sources of fragmentation and metadata in its calculation. |
| |
| Furthermore, it maintains a policy of eagerly returning memory to the platform |
| if an allocation would cause TCMalloc's memory use to exceed the specified |
| limit. |
| |
## Go 1 compatibility
| |
| This change adds an API to the Go project and does not alter existing ones. |
| Therefore, the proposed changes are Go 1 backwards compatible. |