design: update the soft memory limit design document

This change updates the design document with some additional prior art
and corrections that reflect the final implementation.

For #48409.

Change-Id: I8276a30b546c698ab1c9f946ccc0ae2a327399d0
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/421418
Reviewed-by: Michael Knyszek <mknyszek@google.com>
diff --git a/design/48409-soft-memory-limit.md b/design/48409-soft-memory-limit.md
index 59c0ab1..3edc9ad 100644
--- a/design/48409-soft-memory-limit.md
+++ b/design/48409-soft-memory-limit.md
@@ -208,70 +208,51 @@
 
 ![Equation 1](48409/eqn1.png)
 
-Where ![`O_M`](48409/inl3.png) is (per `runtime/metrics` memory names)
+![`T`](48409/inl3.png) is the total amount of memory mapped by the Go runtime.
+![`F`](48409/inl4.png) is the amount of free and unscavenged memory the Go
+runtime is holding.
+![`A`](48409/inl5.png) is the number of bytes in allocated heap objects at the
+time ![`\hat{L}`](48409/inl1.png) is computed.
 
-```
-/memory/classes/metadata/mcache/free:bytes +
-/memory/classes/metadata/mcache/inuse:bytes +
-/memory/classes/metadata/mspan/free:bytes +
-/memory/classes/metadata/mspan/inuse:bytes +
-/memory/classes/metadata/other:bytes +
-/memory/classes/os-stacks:bytes +
-/memory/classes/other:bytes +
-/memory/classes/profiling/buckets:bytes
-```
-
-and ![`O_I`](48409/inl4.png) is the maximum of
-`/memory/classes/heap/unused:bytes + /memory/classes/heap/free:bytes` over the
-last GC cycle.
-
-These terms (called ![`O`](48409/inl5.png), for "overheads") account for all
-memory that is not accounted for by the GC pacer (from the [new pacer
-proposal](https://github.com/golang/proposal/blob/329650d4723a558c2b76b81b4995fc5c267e6bc1/design/44167-gc-pacer-redesign.md#heap-goal)).
+The second term, ![`(T - F - A)`](48409/inl6.png), represents the sum of
+non-heap overheads.
+Free and unscavenged memory is specifically excluded because this is memory that
+the runtime might use in the near future, and the scavenger is specifically
+instructed to leave the memory up to the heap goal unscavenged.
+Failing to exclude free and unscavenged memory could lead to a very poor
+accounting of non-heap overheads.
 
 With ![`\hat{L}`](48409/inl1.png) fully defined, our heap goal for cycle
-![`n`](48409/inl6.png) (![`N_n`](48409/inl7.png)) is a straightforward extension
+![`n`](48409/inl7.png) (![`N_n`](48409/inl8.png)) is a straightforward extension
 of the existing one.
 
 Where
-* ![`M_n`](48409/inl8.png) is equal to bytes marked at the end of GC n's mark
+* ![`M_n`](48409/inl9.png) is equal to bytes marked at the end of GC n's mark
   phase
-* ![`S_n`](48409/inl9.png) is equal to stack bytes at the beginning of GC n's
+* ![`S_n`](48409/inl10.png) is equal to stack bytes at the beginning of GC n's
   mark phase
-* ![`G_n`](48409/inl10.png) is equal to bytes of globals at the beginning of GC
+* ![`G_n`](48409/inl11.png) is equal to bytes of globals at the beginning of GC
   n's mark phase
-* ![`\gamma`](48409/inl11.png) is equal to
-  ![`1+\frac{GOGC}{100}`](48409/inl12.png)
+* ![`\gamma`](48409/inl12.png) is equal to
+  ![`1+\frac{GOGC}{100}`](48409/inl13.png)
 
 then
 
 ![Equation 2](48409/eqn2.png)
 
-Over the course of a GC cycle ![`O_M`](48409/inl3.png) remains stable because it
-increases monotonically.
-There's only one situation where ![`O_M`](48409/inl3.png) can grow tremendously
-(relative to active heap objects) in a short period of time (< 1 GC cycle), and
-that's when `GOMAXPROCS` increases.
-So, I also propose recomputing this value at that time.
+Over the course of a GC cycle, non-heap overheads remain stable because the
+mostly increase monotonically.
+However, the GC needs to be responsive to any change in non-heap overheads.
+Therefore, I propose a more heavy-weight recomputation of the heap goal every
+time its needed, as opposed to computing it only once per cycle.
+This also means the GC trigger point needs to be dynamically recomputable.
+This check will create additional overheads, but they're likely to be low, as
+the GC's internal statistics are updated only on slow paths.
 
-Meanwhile ![`O_I`](48409/inl4.png) stays relatively stable (and doesn't have a
-sawtooth pattern, as one might expect from a sum of idle heap memory) because
-object sweeping occurs incrementally, specifically proportionally to how fast
-the application is allocating.
-Furthermore, this value is guaranteed to stay relatively stable across a single
-GC cycle, because the total size of the heap for one GC cycle is bounded by the
-heap goal.
-Taking the highwater mark of this value places a conservative upper bound on the
-total impact of this memory, so the heap goal stays safe from major changes.
-
-One concern with the above definition of ![`\hat{L}`](48409/inl1.png) is that it
-is fragile to changes to the Go GC.
-In the past, seemingly unrelated changes to the Go runtime have impacted the
-GC's pacer, usually due to an unforeseen influence on the accounting that the
-pacer relies on.
-To minimize the impact of these accidents on the conversion function, I propose
-centralizing and categorizing all the variables used in accounting, and writing
-tests to ensure that expected properties of the account remain in-tact.
+The nice thing about this definition of ![`\hat{L}`](48409/inl1.png) is that
+it's fairly robust to changes to the Go GC, since total mapped memory, free and
+unscavenged memory, and bytes allocated in objects, are fairly fundamental
+properties (especially to any tracing GC design).
 
 #### Death spirals
 
@@ -322,7 +303,7 @@
 more than about a second.
 1 CPU-second per `GOMAXPROCS` seems like a reasonable place to start.
 
-Unfortunately, 50% is not a reasonable choice for small values of `GOGC`.
+Unfortunately, 50% is not always a reasonable choice for small values of `GOGC`.
 Consider an application running with `GOGC=10`: an overall 50% GC CPU
 utilization limit for `GOGC=10` is likely going to be always active, leading to
 significant overshoot.
@@ -359,22 +340,13 @@
 I propose it does so using a proportional-integral controller whose input is the
 difference between the memory limit and the memory used by Go, and whose output
 is the CPU utilization target of the background scavenger.
-The output will be clamped at a minimum of 1% and a maximum of 10% overall CPU
-utilization.
-Note that the 10% is chosen arbitrarily; in general, returning memory to the
-platform is nowhere near as costly as the GC, but the number must be chosen such
-that the mutator still has plenty of room to make progress (thus, I assert that
-40% of CPU time is enough).
-In order to make the scavenger scale to overall CPU utilization effectively, it
-requires some improvements to avoid the aforementioned locking issues it deals
-with today.
+This will make the background scavenger more reliable.
 
-Any CPU time spent in the scavenger should also be accounted for in the leaky
-bucket algorithm described in the [Death spirals](#death-spirals) section as GC
-time, however I don't think it should be throttled in the same way.
-The intuition behind that is that returning memory to the platform is generally
-going to be more immediately fruitful than spending more time in garbage
-collection.
+However, the background scavenger likely won't return memory to the OS promptly
+enough for the memory limit, so in addition, I propose having span allocations
+eagerly return memory to the OS to stay under the limit.
+The time a goroutine spends in this will also count toward the 50% GC CPU limit
+described in the [Death spirals](#death-spirals) section.
 
 #### Alternative approaches considered
 
@@ -418,38 +390,13 @@
 
 ##### Returning memory to the platform
 
-A potential issue with the proposed design is that because the scavenger is
-running in the background, it may not react readily to spikes in memory use that
-exceed the limit.
-
-In contrast, [TCMalloc](#tcmalloc) searches for memory to return eagerly, if an
-allocation were to exceed the limit.
-In the Go 1.13 cycle, I attempted a similar policy when first implementing the
-scavenger, and found that it could cause unacceptable tail latency increases in
-some applications.
-While that policy certainly tried to return memory back to the platform
-significantly more often than it would be in this case, it still has a couple of
-downsides:
-1. It introduces latency.
-   The background scavenger can be more efficiently time-sliced in between other
-   work, so it generally should only impact throughput.
-1. It's much more complicated to bound the total amount of time spent searching
-   for and returning memory to the platform during an allocation.
-
-The key insight as to why this policy works just fine for TCMalloc and won't
-work for Go comes from a fundamental difference in design.
-Manual memory allocators are typically designed to have a LIFO-style memory
-reuse pattern.
-Once an allocation is freed, it is immediately available for reallocation.
-In contrast, most efficient tracing garbage collection algorithms require a
-FIFO-style memory reuse pattern, since allocations are freed in bulk.
-The result is that the page allocator in a garbage-collected memory allocator is
-accessed far more frequently than in manual memory allocator, so this path will
-be hit a lot harder.
-
-For the purposes of this design, I don't believe the benefits of eager return
-outweigh the costs, and I do believe that the proposed design is good enough for
-most cases.
+If returning memory to the OS eagerly becomes a significant performance issue, a
+reasonable alternative could be to crank up the background scavenger's CPU usage
+in response to growing memory pressure.
+This needs more thought, but given that it would now be controlled by a
+controller, its CPU usage will be more reliable, and this is an option we can
+keep in mind.
+One benefit of this option is that it may impact latency less prominently.
 
 ### Documentation
 
@@ -513,6 +460,11 @@
 so more rarely, except when [the application is
 idle](https://openjdk.java.net/jeps/346).
 
+Some JVMs are "container aware" and read the memory limits of their containers
+to stay under the limit.
+This behavior is closer to what is proposed in this document, but I do not
+believe the memory limit is directly configurable, like the one proposed here.
+
 ### SetMaxHeap
 
 For nearly 4 years, the Go project has been trialing an experimental API in the
diff --git a/design/48409/README.md b/design/48409/README.md
index 8994615..375a0f2 100644
--- a/design/48409/README.md
+++ b/design/48409/README.md
@@ -1,6 +1,6 @@
 # Design document
 
-The GC pacer design document is generated from the `.src.md` file in this
+The memory limit design document is generated from the `.src.md` file in this
 directory.
 It contains LaTeX formulas which Markdown cannot render, so they're
 rendered by an external open-source tool,
@@ -10,9 +10,6 @@
 
 ```
 ./svg2png.bash
-cd pacer-plots
-./svg2png.bash
-cd ..
 ```
 
 And go back and replace all instances of SVG with PNG in the final document.
diff --git a/design/48409/eqn1.png b/design/48409/eqn1.png
index f638a9a..96ce7b9 100644
--- a/design/48409/eqn1.png
+++ b/design/48409/eqn1.png
Binary files differ
diff --git a/design/48409/eqn2.png b/design/48409/eqn2.png
index b25a7dd..8174e00 100644
--- a/design/48409/eqn2.png
+++ b/design/48409/eqn2.png
Binary files differ
diff --git a/design/48409/inl1.png b/design/48409/inl1.png
index d5799ad..8636d42 100644
--- a/design/48409/inl1.png
+++ b/design/48409/inl1.png
Binary files differ
diff --git a/design/48409/inl10.png b/design/48409/inl10.png
index 0cc2d21..2d1b53e 100644
--- a/design/48409/inl10.png
+++ b/design/48409/inl10.png
Binary files differ
diff --git a/design/48409/inl11.png b/design/48409/inl11.png
index 038e874..b1966ed 100644
--- a/design/48409/inl11.png
+++ b/design/48409/inl11.png
Binary files differ
diff --git a/design/48409/inl12.png b/design/48409/inl12.png
index 48cd28f..1158f2d 100644
--- a/design/48409/inl12.png
+++ b/design/48409/inl12.png
Binary files differ
diff --git a/design/48409/inl13.png b/design/48409/inl13.png
new file mode 100644
index 0000000..1beadc3
--- /dev/null
+++ b/design/48409/inl13.png
Binary files differ
diff --git a/design/48409/inl2.png b/design/48409/inl2.png
index bb27598..7fc5478 100644
--- a/design/48409/inl2.png
+++ b/design/48409/inl2.png
Binary files differ
diff --git a/design/48409/inl3.png b/design/48409/inl3.png
index c152bf2..4afda7a 100644
--- a/design/48409/inl3.png
+++ b/design/48409/inl3.png
Binary files differ
diff --git a/design/48409/inl4.png b/design/48409/inl4.png
index cf4f20b..23b71e5 100644
--- a/design/48409/inl4.png
+++ b/design/48409/inl4.png
Binary files differ
diff --git a/design/48409/inl5.png b/design/48409/inl5.png
index 6386671..9fc70f6 100644
--- a/design/48409/inl5.png
+++ b/design/48409/inl5.png
Binary files differ
diff --git a/design/48409/inl6.png b/design/48409/inl6.png
index de20b8e..d72def7 100644
--- a/design/48409/inl6.png
+++ b/design/48409/inl6.png
Binary files differ
diff --git a/design/48409/inl7.png b/design/48409/inl7.png
index 4b7a746..4149a13 100644
--- a/design/48409/inl7.png
+++ b/design/48409/inl7.png
Binary files differ
diff --git a/design/48409/inl8.png b/design/48409/inl8.png
index e3dbc80..9e0c57c 100644
--- a/design/48409/inl8.png
+++ b/design/48409/inl8.png
Binary files differ
diff --git a/design/48409/inl9.png b/design/48409/inl9.png
index 01dff67..820bb71 100644
--- a/design/48409/inl9.png
+++ b/design/48409/inl9.png
Binary files differ
diff --git a/design/48409/soft-memory-limit.src.md b/design/48409/soft-memory-limit.src.md
index c8efb3b..a5d6e81 100644
--- a/design/48409/soft-memory-limit.src.md
+++ b/design/48409/soft-memory-limit.src.md
@@ -107,7 +107,7 @@
 but I believe the Go runtime has an important role to play in supporting these
 worthwhile efforts.
 
-2. Eliminate out-of-memory errors in 100% of cases.
+1. Eliminate out-of-memory errors in 100% of cases.
 
 Whatever policy this API adheres to is going to fail for some use-case, and
 that's OK.
@@ -207,28 +207,20 @@
 propose the following calculation:
 
 ```render-latex
-\hat{L} = L - (O_M + O_I)
+\hat{L} = L - (T - F - A)
 ```
 
-Where `$O_M$` is (per `runtime/metrics` memory names)
+`$T$` is the total amount of memory mapped by the Go runtime.
+`$F$` is the amount of free and unscavenged memory the Go runtime is holding.
+`$A$` is the number of bytes in allocated heap objects at the time `$\hat{L}$`
+is computed.
 
-```
-/memory/classes/metadata/mcache/free:bytes +
-/memory/classes/metadata/mcache/inuse:bytes +
-/memory/classes/metadata/mspan/free:bytes +
-/memory/classes/metadata/mspan/inuse:bytes +
-/memory/classes/metadata/other:bytes +
-/memory/classes/os-stacks:bytes +
-/memory/classes/other:bytes +
-/memory/classes/profiling/buckets:bytes
-```
-
-and `$O_I$` is the maximum of `/memory/classes/heap/unused:bytes +
-/memory/classes/heap/free:bytes` over the last GC cycle.
-
-These terms (called `$O$`, for "overheads") account for all memory that is not
-accounted for by the GC pacer (from the [new pacer
-proposal](https://github.com/golang/proposal/blob/329650d4723a558c2b76b81b4995fc5c267e6bc1/design/44167-gc-pacer-redesign.md#heap-goal)).
+The second term, `$(T - F - A)$`, represents the sum of non-heap overheads.
+Free and unscavenged memory is specifically excluded because this is memory that
+the runtime might use in the near future, and the scavenger is specifically
+instructed to leave the memory up to the heap goal unscavenged.
+Failing to exclude free and unscavenged memory could lead to a very poor
+accounting of non-heap overheads.
 
 With `$\hat{L}$` fully defined, our heap goal for cycle `$n$` (`$N_n$`) is a
 straightforward extension of the existing one.
@@ -245,31 +237,19 @@
 N_n = min(\hat{L}, \gamma(M_{n-1})+S_n+G_n)
 ```
 
-Over the course of a GC cycle `$O_M$` remains stable because it increases
-monotonically.
-There's only one situation where `$O_M$` can grow tremendously (relative to
-active heap objects) in a short period of time (< 1 GC cycle), and that's when
-`GOMAXPROCS` increases.
-So, I also propose recomputing this value at that time.
+Over the course of a GC cycle, non-heap overheads remain stable because the
+mostly increase monotonically.
+However, the GC needs to be responsive to any change in non-heap overheads.
+Therefore, I propose a more heavy-weight recomputation of the heap goal every
+time its needed, as opposed to computing it only once per cycle.
+This also means the GC trigger point needs to be dynamically recomputable.
+This check will create additional overheads, but they're likely to be low, as
+the GC's internal statistics are updated only on slow paths.
 
-Meanwhile `$O_I$` stays relatively stable (and doesn't have a sawtooth pattern,
-as one might expect from a sum of idle heap memory) because object sweeping
-occurs incrementally, specifically proportionally to how fast the application is
-allocating.
-Furthermore, this value is guaranteed to stay relatively stable across a single
-GC cycle, because the total size of the heap for one GC cycle is bounded by the
-heap goal.
-Taking the highwater mark of this value places a conservative upper bound on the
-total impact of this memory, so the heap goal stays safe from major changes.
-
-One concern with the above definition of `$\hat{L}$` is that it is fragile to
-changes to the Go GC.
-In the past, seemingly unrelated changes to the Go runtime have impacted the
-GC's pacer, usually due to an unforeseen influence on the accounting that the
-pacer relies on.
-To minimize the impact of these accidents on the conversion function, I propose
-centralizing and categorizing all the variables used in accounting, and writing
-tests to ensure that expected properties of the account remain in-tact.
+The nice thing about this definition of `$\hat{L}$` is that it's fairly robust
+to changes to the Go GC, since total mapped memory, free and unscavenged memory,
+and bytes allocated in objects, are fairly fundamental properties (especially to
+any tracing GC design).
 
 #### Death spirals
 
@@ -320,7 +300,7 @@
 more than about a second.
 1 CPU-second per `GOMAXPROCS` seems like a reasonable place to start.
 
-Unfortunately, 50% is not a reasonable choice for small values of `GOGC`.
+Unfortunately, 50% is not always a reasonable choice for small values of `GOGC`.
 Consider an application running with `GOGC=10`: an overall 50% GC CPU
 utilization limit for `GOGC=10` is likely going to be always active, leading to
 significant overshoot.
@@ -357,22 +337,13 @@
 I propose it does so using a proportional-integral controller whose input is the
 difference between the memory limit and the memory used by Go, and whose output
 is the CPU utilization target of the background scavenger.
-The output will be clamped at a minimum of 1% and a maximum of 10% overall CPU
-utilization.
-Note that the 10% is chosen arbitrarily; in general, returning memory to the
-platform is nowhere near as costly as the GC, but the number must be chosen such
-that the mutator still has plenty of room to make progress (thus, I assert that
-40% of CPU time is enough).
-In order to make the scavenger scale to overall CPU utilization effectively, it
-requires some improvements to avoid the aforementioned locking issues it deals
-with today.
+This will make the background scavenger more reliable.
 
-Any CPU time spent in the scavenger should also be accounted for in the leaky
-bucket algorithm described in the [Death spirals](#death-spirals) section as GC
-time, however I don't think it should be throttled in the same way.
-The intuition behind that is that returning memory to the platform is generally
-going to be more immediately fruitful than spending more time in garbage
-collection.
+However, the background scavenger likely won't return memory to the OS promptly
+enough for the memory limit, so in addition, I propose having span allocations
+eagerly return memory to the OS to stay under the limit.
+The time a goroutine spends in this will also count toward the 50% GC CPU limit
+described in the [Death spirals](#death-spirals) section.
 
 #### Alternative approaches considered
 
@@ -416,38 +387,13 @@
 
 ##### Returning memory to the platform
 
-A potential issue with the proposed design is that because the scavenger is
-running in the background, it may not react readily to spikes in memory use that
-exceed the limit.
-
-In contrast, [TCMalloc](#tcmalloc) searches for memory to return eagerly, if an
-allocation were to exceed the limit.
-In the Go 1.13 cycle, I attempted a similar policy when first implementing the
-scavenger, and found that it could cause unacceptable tail latency increases in
-some applications.
-While that policy certainly tried to return memory back to the platform
-significantly more often than it would be in this case, it still has a couple of
-downsides:
-1. It introduces latency.
-   The background scavenger can be more efficiently time-sliced in between other
-   work, so it generally should only impact throughput.
-1. It's much more complicated to bound the total amount of time spent searching
-   for and returning memory to the platform during an allocation.
-
-The key insight as to why this policy works just fine for TCMalloc and won't
-work for Go comes from a fundamental difference in design.
-Manual memory allocators are typically designed to have a LIFO-style memory
-reuse pattern.
-Once an allocation is freed, it is immediately available for reallocation.
-In contrast, most efficient tracing garbage collection algorithms require a
-FIFO-style memory reuse pattern, since allocations are freed in bulk.
-The result is that the page allocator in a garbage-collected memory allocator is
-accessed far more frequently than in manual memory allocator, so this path will
-be hit a lot harder.
-
-For the purposes of this design, I don't believe the benefits of eager return
-outweigh the costs, and I do believe that the proposed design is good enough for
-most cases.
+If returning memory to the OS eagerly becomes a significant performance issue, a
+reasonable alternative could be to crank up the background scavenger's CPU usage
+in response to growing memory pressure.
+This needs more thought, but given that it would now be controlled by a
+controller, its CPU usage will be more reliable, and this is an option we can
+keep in mind.
+One benefit of this option is that it may impact latency less prominently.
 
 ### Documentation
 
@@ -511,6 +457,11 @@
 so more rarely, except when [the application is
 idle](https://openjdk.java.net/jeps/346).
 
+Some JVMs are "container aware" and read the memory limits of their containers
+to stay under the limit.
+This behavior is closer to what is proposed in this document, but I do not
+believe the memory limit is directly configurable, like the one proposed here.
+
 ### SetMaxHeap
 
 For nearly 4 years, the Go project has been trialing an experimental API in the