_content/doc/gc-guide: incorporate additional feedback

From comments by aktau@google.com on CL 414397.

Change-Id: Ic6a5edbd477dbf6c54b996e652056e2189f727ae
Reviewed-on: https://go-review.googlesource.com/c/website/+/417649
Reviewed-by: Nicolas Hillegeer <aktau@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Lasse Folger <lassefolger@google.com>
diff --git a/_content/doc/gc-guide.html b/_content/doc/gc-guide.html
index cbb6476..5d97a5c 100644
--- a/_content/doc/gc-guide.html
+++ b/_content/doc/gc-guide.html
@@ -38,6 +38,11 @@
   height: 24px;
   width: auto;
 }
+
+.gc-guide-equation {
+  display: block;
+  text-align: center;
+}
 </style>
 
 <h2 id="Introduction">Introduction</h2>
@@ -193,7 +198,7 @@
 <p>
 This basic algorithm is common to all tracing GCs.
 Where tracing GCs differ is what they do once they discover memory is live.
-Go's GC uses the mark-sweep technique, which means is that in order to keep track
+Go's GC uses the mark-sweep technique, which means that in order to keep track
 of its progress, the GC also <b>marks</b> the values it encounters as live.
 Once tracing is complete, the GC then walks over all memory in the heap and
 makes all memory that is <i>not</i> marked available for allocation.
@@ -221,8 +226,10 @@
 marking.
 Furthermore, the GC may also not be active at all, when there's no GC-related
 work to do.
-The GC continuosly rotates through these three states of off, marking, and
-sweeping in what's known as the <b>GC cycle</b>.
+The GC continuously rotates through these three phases of sweeping, off, and
+marking in what's known as the <b>GC cycle</b>.
+For the purposes of this document, consider the GC cycle as starting with
+sweeping, then turning off, then marking.
 </p>
 
 <p>
@@ -325,8 +332,11 @@
 		During the request, at most 512 KiB of that 2 MiB stays live while
 		the request is in flight, and when the service is finished handling
 		the request, all that memory dies.
-		A steady stream of requests, say 100 requests per second, results in
-		an allocation rate of 200 MiB/s and a 50 MiB peak live heap.
+		Now, for the sake of simplicity, suppose each request takes
+		about 1 second to handle end-to-end.
+		Then, a steady stream of requests, say 100 requests per second,
+		results in an allocation rate of 200 MiB/s and a 50 MiB peak live
+		heap.
 		</i>
 		</p>
 	</li>
@@ -407,15 +417,24 @@
 <h3 id="GOGC">GOGC</h3>
 
 <p>
-GOGC is a tuning parameter for the Go GC that directly reflects a trade-off
-between CPU time and memory by controlling GC frequency.
-More specifically, GOGC sets the target heap size for the GC, or the amount
-of new memory that should be allocated by the time the mark phase has
-completed.
-GOGC is defined as a percent overhead over the amount of work the GC needs to
-do.
-This work is currently defined as the size of the live heap, plus the size of
-the GC roots in bytes.
+At a high level, GOGC determines the trade-off between GC CPU and memory.
+</p>
+
+<p>
+It works by determining the target heap size after each GC cycle, a target value
+for the total heap size in the next cycle.
+The GC's goal is to finish a collection cycle before the total heap size
+exceeds the target heap size.
+Total heap size is defined as the live heap size at the end of the previous
+cycle, plus any new heap memory allocated by the application since the previous
+cycle.
+Meanwhile, target heap memory is defined as:
+</p>
+
+<p class="gc-guide-equation">
+<i>
+Target heap memory = Live heap + (Live heap + GC roots) * GOGC / 100
+</i>
 </p>
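As a quick sanity check, the formula can be evaluated directly. The sketch below is illustrative only (the function name, units, and MiB figures are made up for the example, not taken from the runtime):

```go
package main

import "fmt"

// targetHeapMiB evaluates the target heap memory formula above.
// All sizes are in MiB; gogc is the GOGC percentage.
func targetHeapMiB(liveHeap, gcRoots, gogc float64) float64 {
	return liveHeap + (liveHeap+gcRoots)*gogc/100
}

func main() {
	// Example: 8 MiB live heap, 1 MiB of GC roots, default GOGC=100.
	fmt.Println(targetHeapMiB(8, 1, 100)) // 8 + (8+1)*100/100 = 17
}
```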
 
 <p>
@@ -430,30 +449,6 @@
 
 <p>
 <i>
-Note: GOGC may be more precisely described as defining the amount of new memory
-that can be allocated before the start of the next sweep phase.
-This timing is technically true for the <a href="#Understanding_costs">GC
-model</a> this guide has been using thus far, but also generalizes to the
-real GC implementation that Go uses that's discussed in more detail in the
-<a href="#Latency">latency section</a>.
-</i>
-</p>
-
-<p>
-The benefit of defining the trade-off this way is that the cost of GC remains
-constant in the steady-state regardless of the amount of work that the GC has
-to do (so regardless of the sizes of the live heap and root set), because the
-<i>frequency</i> will always be proportional to the amount of work that has to
-be done.
-In other words, it represents a fixed point in the trade-off between CPU cost
-and memory use.
-(It's important to note that this fixed point may shift if the steady-state
-changes as well, but crucially it is <i>not</i> dependent on the size of the
-live heap.)
-</p>
-
-<p>
-<i>
 Note: GOGC includes the root set only as of Go 1.18.
 Previously, it would only count the live heap.
 Often, the amount of memory in goroutine stacks is quite small and the live
@@ -463,6 +458,31 @@
 </p>
 
 <p>
+The heap target controls GC frequency: the bigger the target, the longer the GC
+can wait to start another mark phase and vice versa.
+While the precise formula is useful for making estimates, it's best to think of
+GOGC in terms of its fundamental purpose: a parameter that picks a point in the
+GC CPU and memory trade-off.
+The key takeaway is that <b>doubling GOGC will double heap memory overheads and
+roughly halve GC CPU cost</b>, and vice versa.
+(To see a full explanation as to why, see the
+<a href="#Additional_notes_on_GOGC">appendix</a>.)
+</p>
+
+<p>
+<i>
+Note: the target heap size is just a target, and there are several reasons why
+the GC cycle might not finish right at that target.
+For one, a large enough heap allocation can simply exceed the target.
+However, other reasons appear in GC implementations that go beyond the
+<a href="#Understanding_costs">GC model</a> this guide has been using thus far.
+For some more detail, see the <a href="#Latency">latency section</a>, but the
+complete details may be found in the <a href="#Additional_resources">additional
+resources</a>.
+</i>
+</p>
+
+<p>
 GOGC may be configured through either the <code>GOGC</code> environment
 variable (which all Go programs recognize), or through the
 <a href="https://pkg.go.dev/runtime/debug#SetGCPercent"><code>SetGCPercent</code></a>
@@ -493,7 +513,9 @@
 <p>
 Use the slider to adjust the value of GOGC to see how the application responds
 in terms of total duration and GC overhead.
-Each GC cycle occurs while the new heap drops to zero.
+Each GC cycle ends as the new heap drops to zero.
+The time taken while the new heap drops to zero is the combined time of the
+mark phase for cycle N and the sweep phase for cycle N+1.
 The X axis shifts to always show the full CPU-time duration of the program.
 Notice that additional CPU time used by the GC increases the overall duration.
 </p>
@@ -947,9 +969,10 @@
 </p>
 
 <p>
-Although the first axiom no longer holds, it wasn't really all that important
-to begin with; the rest of the costs still align as described by the model, and
-the same notion of a steady-state applies.
+Although the first axiom, that the application is paused while the GC executes,
+no longer holds, it wasn't really all that important to begin with.
+The rest of the costs still align as described by the model, and the same notion
+of a steady-state applies.
 As a result, GC frequency is still the primary way the GC trades off between CPU
 time and memory for throughput, and it also takes on this role for latency.
 With respect to throughput, it's easy to get back within the realm of the model
@@ -1001,6 +1024,8 @@
 	</li>
 </ol>
 
+<!-- TODO: Add a short section about non-steady-state behavior. -->
+
 <h3 id="Additional_resources">Additional resources</h3>
 
 <p>
@@ -1234,7 +1259,7 @@
 		</p>
 
 		<p>
-		See the documentation for the <a href="https://pkg.go.dev/runtime/trace">
+See the <a href="https://pkg.go.dev/runtime/trace">documentation for the
 		<code>runtime/trace</code></a> package for how to get started with
 		execution traces.
 		</p>
@@ -1449,5 +1474,140 @@
 costs.
 </p>
 
+<h2 id="Appendix">Appendix</h2>
+
+<h3 id="Additional_notes_on_GOGC">Additional notes on GOGC</h3>
+
+<p>
+The <a href="#GOGC">GOGC section</a> claimed that doubling GOGC doubles heap
+memory overheads and halves GC CPU costs.
+To see why, let's break it down mathematically.
+</p>
+
+<p>
+Firstly, the heap target sets a target for the total heap size.
+This target, however, mainly influences the new heap memory, because the live
+heap is fundamental to the application.
+</p>
+
+<p class="gc-guide-equation">
+<i>
+Target heap memory = Live heap + (Live heap + GC roots) * GOGC / 100
+</i>
+</p>
+
+<p class="gc-guide-equation">
+<i>
+Total heap memory = Live heap + New heap memory
+</i>
+</p>
+
+<p class="gc-guide-equation">
+<i>
+&rArr;
+</i>
+</p>
+
+<p class="gc-guide-equation">
+<i>
+New heap memory = (Live heap + GC roots) * GOGC / 100
+</i>
+</p>
+
+<p>
+From this we can see that doubling GOGC also doubles the amount of new heap
+memory that the application will allocate each cycle, which captures heap
+memory overheads.
+Note that <i>Live heap + GC roots</i> is an approximation of the amount of
+memory the GC needs to scan.
+</p>
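The doubling can be checked numerically. This sketch uses a hypothetical 8 MiB live heap and 1 MiB of GC roots (the function name and figures are illustrative, not from the runtime):

```go
package main

import "fmt"

// newHeapMiB evaluates the per-cycle allocation budget implied by the
// derivation above: New heap memory = (Live heap + GC roots) * GOGC / 100.
// All sizes are in MiB.
func newHeapMiB(liveHeap, gcRoots, gogc float64) float64 {
	return (liveHeap + gcRoots) * gogc / 100
}

func main() {
	live, roots := 8.0, 1.0
	fmt.Println(newHeapMiB(live, roots, 100)) // 9
	fmt.Println(newHeapMiB(live, roots, 200)) // 18: doubling GOGC doubles it
}
```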
+
+<p>
+Next, let's look at GC CPU cost.
+Total cost can be broken down as the cost per cycle, times GC frequency over
+some time period T.
+</p>
+
+<p class="gc-guide-equation">
+<i>
+Total GC CPU cost = (GC CPU cost per cycle) * (GC frequency) * T
+</i>
+</p>
+
+<p>
+GC CPU cost per cycle can be derived from the
+<a href="#Understanding_costs">GC model</a>:
+</p>
+
+<p class="gc-guide-equation">
+<i>
+GC CPU cost per cycle = (Live heap + GC roots) * (Cost per byte) + Fixed cost
+</i>
+</p>
+
+<p>
+Note that sweep phase costs are ignored here as mark and scan costs dominate.
+</p>
+
+<p>
+The steady-state is defined by a constant allocation rate and a constant cost
+per byte, so in the steady-state we can derive a GC frequency from this new heap
+memory:
+</p>
+
+<p class="gc-guide-equation">
+<i>
+GC frequency = (Allocation rate) / (New heap memory) = (Allocation rate) / ((Live heap + GC roots) * GOGC / 100)
+</i>
+</p>
+
+<p>
+Putting this together, we get the full equation for the total cost:
+</p>
+
+<p class="gc-guide-equation">
+<i>
+Total GC CPU cost = (Allocation rate) / ((Live heap + GC roots) * GOGC / 100) * ((Live heap + GC roots) * (Cost per byte) + Fixed cost) * T
+</i>
+</p>
+
+<p>
+For a sufficiently large heap (which represents most cases), the marginal costs
+of a GC cycle dominate the fixed costs.
+This allows for a significant simplification of the total GC CPU cost formula.
+</p>
+
+<p class="gc-guide-equation">
+<i>
+Total GC CPU cost = (Allocation rate) / (GOGC / 100) * (Cost per byte) * T
+</i>
+</p>
+
+<p>
+From this simplified formula, we can see that if we double GOGC, we halve total
+GC CPU cost.
+(Note that the visualizations in this guide do simulate fixed costs, so the GC
+CPU overheads reported by them will not exactly halve when GOGC doubles.)
+Furthermore, GC CPU costs are largely determined by allocation rate and the
+cost per byte to scan memory.
+For more information on how to reduce these costs specifically, see the
+<a href="#Optimization_guide">optimization guide</a>.
+</p>
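The halving follows directly from the simplified formula. The sketch below plugs in a hypothetical steady-state (the function name, the 200 MiB/s allocation rate, and the per-MiB scan cost are illustrative assumptions, not runtime measurements):

```go
package main

import "fmt"

// totalGCCPU evaluates the simplified steady-state formula above, with fixed
// costs ignored. allocRate is in MiB/s, costPerMiB in CPU-seconds per MiB of
// scan work, t in seconds; the result is in CPU-seconds.
func totalGCCPU(allocRate, gogc, costPerMiB, t float64) float64 {
	return allocRate / (gogc / 100) * costPerMiB * t
}

func main() {
	// Hypothetical steady-state: 200 MiB/s allocated, 0.01 CPU-s per MiB
	// scanned, over a 60-second window.
	fmt.Println(totalGCCPU(200, 100, 0.01, 60)) // 120
	fmt.Println(totalGCCPU(200, 200, 0.01, 60)) // 60: double GOGC, half cost
}
```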
+
+<p>
+<i>
+Note: there exists a discrepancy between the size of the live heap and the
+amount of that memory the GC actually needs to scan: a live heap of the same
+size but with a different structure will result in a different CPU cost but
+the same memory cost, resulting in a different trade-off.
+This is why the structure of the heap is part of the definition of the
+steady-state.
+The heap target should arguably only include the scannable live heap as a closer
+approximation of memory the GC needs to scan, but this leads to degenerate
+behavior when there's a very small amount of scannable live heap but the live
+heap is otherwise large.
+</i>
+</p>
+
 <script src="/js/d3.js"></script>
 <script async src="/doc/gc-guide.js"></script>