| <!--{ |
| "Title": "A Guide to the Go Garbage Collector", |
| "Path": "/doc/gc-guide", |
| "Breadcrumb": true |
| }--> |
| |
| <!-- |
| NOTE: In this document and others in this directory, the convention is to |
| set fixed-width phrases with non-fixed-width spaces, as in |
| <code>hello</code> <code>world</code>. |
| Do not send CLs removing the interior tags from such phrases. |
| --> |
| |
| <style> |
| .gc-guide-graph { |
| display: inline-block; |
| position: relative; |
| width: 100%; |
| vertical-align: top; |
| overflow: hidden; |
| } |
| |
| .gc-guide-graph-controls { |
| display: flex; |
| flex-direction: row; |
| flex-wrap: wrap; |
| justify-content: space-around; |
| align-items: center; |
| width: 100%; |
| } |
| |
| .gc-guide-graph-controls div { |
| display: flex; |
| flex-direction: row; |
| flex-wrap: nowrap; |
| align-items: center; |
| padding-left: 5px; |
| padding-right: 5px; |
| height: 24px; |
| width: auto; |
| } |
| |
| .gc-guide-equation { |
| display: block; |
| text-align: center; |
| } |
| </style> |
| |
| <h2 id="Introduction">Introduction</h2> |
| |
| <p> |
| This guide is intended to aid advanced Go users in better understanding their |
| application costs by providing insights into the Go garbage collector. |
| It also provides guidance on how Go users may use these insights to improve |
| their applications' resource utilization. |
| It does not assume any knowledge of garbage collection, but does assume |
| familiarity with the Go programming language. |
| </p> |
| |
| <p> |
| The Go language takes responsibility for arranging the storage of Go values; |
| in most cases, a Go developer need not care about where these values are stored, |
| or why, if at all. |
| In practice, however, these values often need to be stored in computer |
| <b>physical memory</b> and physical memory is a finite resource. |
| Because it is finite, memory must be managed carefully and recycled in order to |
| avoid running out of it while executing a Go program. |
| It's the job of a Go implementation to allocate and recycle memory as needed. |
| </p> |
| |
| <p> |
| Another term for automatically recycling memory is <b>garbage collection</b>. |
| At a high level, a garbage <i>collector</i> (or GC, for short) is a system that |
| recycles memory on behalf of the application by identifying which parts of memory |
| are no longer needed. |
| The Go standard toolchain provides a runtime library that ships with every |
| application, and this runtime library includes a garbage collector. |
| </p> |
| |
| <p> |
| Note that the existence of a garbage collector as described by this guide |
| is not guaranteed by the <a href="/ref/spec">Go specification</a>, only that |
| the underlying storage for Go values is managed by the language itself. |
| This omission is intentional and enables the use of radically different |
| memory management techniques. |
| </p> |
| |
| <p> |
| Therefore, this guide is about a specific implementation of the Go programming |
| language and <i>may not apply to other implementations</i>. |
| Specifically, this following guide applies to the standard toolchain (the |
| <code>gc</code> Go compiler and tools). |
| Gccgo and Gollvm both use a very similar GC implementation so many of the |
| same concepts apply, but details may vary. |
| </p> |
| |
| <p> |
| Furthermore, this is a living document and will change over time to best |
| reflect the latest release of Go. |
| This document currently describes the garbage collector as of Go 1.19. |
| </p> |
| |
| |
| <h3 id="Where_Go_Values_Live">Where Go Values Live</h3> |
| |
| <p> |
| Before we dive into the GC, let's first discuss the memory that doesn't need to |
| be managed by the GC. |
| </p> |
| |
| <p> |
| For instance, non-pointer Go values stored in local variables will likely not be |
| managed by the Go GC at all, and Go will instead arrange for memory to be |
| allocated that's tied to the |
| <a href="https://go.dev/ref/spec#Declarations_and_scope">lexical scope</a> in |
| which it's created. |
| In general, this is more efficient than relying on the GC, because the Go |
| compiler is able to predetermine when that memory may be freed and emit |
| machine instructions that clean up. |
| Typically, we refer to allocating memory for Go values this way as |
| "stack allocation," because the space is stored on the goroutine stack. |
| </p> |
| |
| <p> |
| Go values whose memory cannot be allocated this way, because the Go compiler |
| cannot determine its lifetime, are said to <i>escape to the heap</i>. |
| "The heap" can be thought of as a catch-all for memory allocation, for when Go |
| values need to be placed <i>somewhere</i>. |
| The act of allocating memory on the heap is typically referred to as "dynamic |
| memory allocation" because both the compiler and the runtime can make very few |
| assumptions as to how this memory is used and when it can be cleaned up. |
| That's where a GC comes in: it's a system that specifically identifies and |
| cleans up dynamic memory allocations. |
| </p> |
| |
| <p> |
| There are many reasons why a Go value might need to escape to the heap. |
| One reason could be that its size is dynamically determined. |
| Consider for instance the backing array of a slice whose initial size is |
| determined by a variable, rather than a constant. |
| Note that escaping to the heap must also be transitive: if a reference to a |
| Go value is written into another Go value that has already been determined to |
| escape, that value must also escape. |
| </p> |
| |
| <p> |
| Whether a Go value escapes or not is a function of the context in which it is |
| used and the Go compiler's escape analysis algorithm. |
| It would be fragile and difficult to try to enumerate precisely when values |
| escape: the algorithm itself is fairly sophisticated and changes between Go |
| releases. |
| For more details on how to identify which values escape and which do not, see |
| the section on |
| <a href="#Eliminating_heap_allocations">eliminating heap allocations</a>. |
| </p> |
| |
| <h3 id="Tracing_Garbage_Collection">Tracing Garbage Collection</h3> |
| |
| <p> |
| Garbage collection may refer to many different methods of automatically |
| recycling memory; for example, reference counting. |
| In the context of this document, garbage collection refers to <b>tracing</b> |
| garbage collection, which identifies in-use, so-called <b>live</b>, objects by |
| following pointers transitively. |
| </p> |
| |
| <p> |
| Let's define these terms more rigorously. |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| <b>Object</b>—An object is a dynamically allocated piece of memory |
| that contains one or more Go values. |
| </p> |
| </li> |
| <li> |
| <p> |
| <b>Pointer</b>—A memory address that references any value within an |
| object. |
| This naturally includes Go values of the form <code>*T</code>, but also includes |
| parts of built-in Go values. |
| Strings, slices, channels, maps, and interface values all contain memory |
| addresses that the GC must trace. |
| </p> |
| </li> |
| </ul> |
| |
| <p> |
| Together, objects and pointers to other objects form the <b>object graph</b>. |
| To identify live memory, the GC walks the object graph starting at the |
| program's <b>roots</b>, pointers that identify objects that are definitely |
| in-use by the program. |
| Two examples of roots are local variables and global variables. |
| The process of walking the object graph is referred to as <b>scanning</b>. |
| </p> |
| |
| <p> |
| This basic algorithm is common to all tracing GCs. |
| Where tracing GCs differ is what they do once they discover memory is live. |
| Go's GC uses the mark-sweep technique, which means that in order to keep track |
| of its progress, the GC also <b>marks</b> the values it encounters as live. |
| Once tracing is complete, the GC then walks over all memory in the heap and |
| makes all memory that is <i>not</i> marked available for allocation. |
| This process is called <b>sweeping</b>. |
| </p> |
| |
| <p> |
| One alternative technique you may be familiar with is to actually <i>move</i> |
| the objects to a new part of memory and leave behind a forwarding pointer that |
| is later used to update all the application's pointers. |
| We call a GC that moves objects in this way a <b>moving</b> GC; Go has a |
| <b>non-moving</b> GC. |
| </p> |
| |
| <h2 id="The_GC_cycle">The GC cycle</h2> |
| |
| <p> |
| Because the Go GC is a mark-sweep GC, it broadly operates in two phases: the |
| mark phase, and the sweep phase. |
| While this statement might seem tautological, it contains an important insight: |
| it's not possible to release memory back to be allocated until <i>all</i> memory |
| has been traced, because there may still be an un-scanned pointer keeping |
| an object alive. |
| As a result, the act of sweeping must be entirely separated from the act of |
| marking. |
| Furthermore, the GC may also not be active at all, when there's no GC-related |
| work to do. |
| The GC continuously rotates through these three phases of sweeping, off, and |
| marking in what's known as the <b>GC cycle</b>. |
| For the purposes of this document, consider the GC cycle starting with sweeping, |
| turning off, then marking. |
| </p> |
| |
| <p> |
| The next few sections will focus on building intuition for the costs of the |
| GC to aid users in tweaking GC parameters for their own benefit. |
| </p> |
| |
| <h3 id="Understanding_costs">Understanding costs</h3> |
| |
| <p> |
| The GC is inherently a complex piece of software built on even more complex |
| systems. |
| It's easy to become mired in detail when trying to understand the GC and |
| tweak its behavior. |
| This section is intended to provide a framework for reasoning about the cost |
| of the Go GC and tuning parameters. |
| </p> |
| |
| <p> |
| To begin with, consider this model of GC cost based on three simple axioms. |
| </p> |
| |
| <ol> |
| <li> |
| <p> |
| The GC involves only two resources: CPU time, and physical memory. |
| </p> |
| </li> |
| <li> |
| <p> |
| The GC's memory costs consist of live heap memory, new heap memory |
| allocated before the mark phase, and space for metadata that, even |
| if proportional to the previous costs, are small in comparison. |
| </p> |
| |
| <p> |
| <i> |
| Note: live heap memory is memory that was determined to be live by the |
| previous GC cycle, while new heap memory is any memory allocated in |
| the current cycle, which may or may not be live by the end. |
| </i> |
| </p> |
| </li> |
| <li> |
| <p> |
| The GC's CPU costs are modeled as a fixed cost per cycle, and a |
| marginal cost that scales proportionally with the size of the live |
| heap. |
| </p> |
| <p> |
| <i> |
| Note: Asymptotically speaking, sweeping scales worse than marking and |
| scanning, as it must perform work proportional to the size of the whole |
| heap, including memory that is determined to be not live (i.e. "dead"). |
| However, in the current implementation sweeping is so much faster than |
| marking and scanning that its associated costs can be ignored in this |
| discussion. |
| </i> |
| </p> |
| </li> |
| </ol> |
| |
| <p> |
| This model is simple but effective: it accurately categorizes the dominant |
| costs of the GC. |
| However, this model says nothing about the magnitude of these costs, nor |
| how they interact. |
| To model that, consider the following situation, referred to from here on as |
| the <b>steady-state</b>. |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| The rate at which the application allocates new memory (in bytes per |
| second) is constant. |
| </p> |
| |
| <p> |
| <i> |
| Note: it's important to understand that this allocation rate is |
| completely separate from whether or not this new memory is live. |
| None of it could be live, all of it could be live, or some of it could |
| be live. |
| (On top of this, some old heap memory could also die, so it's not |
| necessarily the case that if that memory is live, the live heap size |
| grows.) |
| </i> |
| </p> |
| |
| <p> |
| <i> |
| To put this more concretely, consider a web service that allocates |
| 2 MiB of total heap memory for each request that it handles. |
| During the request, at most 512 KiB of that 2 MiB stays live while |
| the request is in flight, and when the service is finished handling |
| the request, all that memory dies. |
| Now, for the sake of simplicity suppose each request takes about 1 |
| second to handle end-to-end. |
| Then, a steady stream of requests, say 100 requests per second, |
| results in an allocation rate of 200 MiB/s and a 50 MiB peak live |
| heap. |
| </i> |
| </p> |
| </li> |
| <li> |
| <p> |
| The application's object graph looks roughly the same each time |
| (objects are similarly sized, there's a roughly constant number of |
| pointers, the maximum depth of the graph is roughly constant). |
| </p> |
| <p> |
| Another way to think about this is that the marginal costs of |
| GC are constant. |
| </p> |
| </li> |
| </ul> |
| |
| <p> |
| <i> |
| Note: the steady-state may seem contrived, but it's representative of the |
| behavior of an application under some constant workload. |
| Naturally, workloads can change even while an application is executing, but |
| typically application behavior looks like a bunch of these steady-states strung |
| together with some transient behavior in between. |
| </i> |
| </p> |
| |
| <p> |
| <i> |
| Note: the steady-state makes no assumptions about the live heap. |
| It may be growing with each subsequent GC cycle, it may shrink, or it may |
| stay the same. |
| However, trying to encompass all of these situations in the explanations |
| to follow is tedious and not very illustrative, so the guide will focus |
| on examples where the live heap remains constant. |
| The <a href="#GOGC">GOGC section</a> explores the non-constant live heap |
| scenario in some more detail. |
| </i> |
| </p> |
| |
| <p> |
| In the steady-state while the live heap size is constant, every GC cycle is |
| going to look identical in the cost model as long as the GC executes after |
| the same amount of time has passed. |
| That's because in that fixed amount of time, with a fixed rate of allocation |
| by the application, a fixed amount of new heap memory will be allocated. |
| So with the live heap size constant, and that new heap memory constant, |
| memory use is always going to be the same. |
| And because the live heap is the same size, the marginal GC CPU costs will |
| be the same, and the fixed costs will be incurred at some regular interval. |
| </p> |
| |
| <p> |
| Now consider if the GC were to shift the point at which it runs <i>later</i> |
| in time. |
| Then, more memory would be allocated but each GC cycle would still incur |
| the same CPU cost. |
| However over some other fixed window of time fewer GC cycles would finish, |
| resulting in a lower overall CPU cost. |
| The opposite would be true if the GC decided to start <i>earlier</i> in time: |
| less memory would be allocated and CPU costs would be incurred more often. |
| </p> |
| |
| <p> |
| This situation represents the fundamental trade-off between CPU time and memory |
| that a GC can make, controlled by how often the GC actually executes. |
| In other words, the trade-off is entirely defined by <b>GC frequency</b>. |
| </p> |
| |
| <p> |
| One more detail remains to be defined, and that's when the GC should decide |
| to start. |
| Note that this directly sets the GC frequency in any particular steady-state, |
| defining the trade-off. |
| In Go, deciding when the GC should start is the main parameter which the user |
| has control over. |
| </p> |
| |
| <h3 id="GOGC">GOGC</h3> |
| |
| <p> |
| At a high level, GOGC determines the trade-off between GC CPU and memory. |
| </p> |
| |
| <p> |
| It works by determining the target heap size after each GC cycle, a target value |
| for the total heap size in the next cycle. |
| The GC's goal is to finish a collection cycle before the total heap size |
| exceeds the target heap size. |
| Total heap size is defined as the live heap size at the end of the previous |
| cycle, plus any new heap memory allocated by the application since the previous |
| cycle. |
| Meanwhile, target heap memory is defined as: |
| </p> |
| |
| <p class="gc-guide-equation"> |
| <i> |
| Target heap memory = Live heap + (Live heap + GC roots) * GOGC / 100 |
| </i> |
| </p> |
| |
| <p> |
| As an example, consider a Go program with a live heap size of 8 MiB, 1 MiB |
| of goroutine stacks, and 1 MiB of pointers in global variables. |
| Then, with a GOGC value of 100, the amount of new memory that will be allocated |
| before the next GC runs will be 10 MiB, or 100% of the 10 MiB of work, for a |
| total heap footprint of 18 MiB. |
| With a GOGC value of 50, then it'll be 50%, or 5 MiB. |
| With a GOGC value of 200, it'll be 200%, or 20 MiB. |
| </p> |
| |
| <p> |
| <i> |
| Note: GOGC includes the root set only as of Go 1.18. |
| Previously, it would only count the live heap. |
| Often, the amount of memory in goroutine stacks is quite small and the live |
| heap size dominates all other sources of GC work, but in cases where programs |
| had hundreds of thousands of goroutines, the GC was making poor judgements. |
| </i> |
| </p> |
| |
| <p> |
| The heap target controls GC frequency: the bigger the target, the longer the GC |
| can wait to start another mark phase and vice versa. |
| While the precise formula is useful for making estimates, it's best to think of |
| GOGC in terms of its fundamental purpose: a parameter that picks a point in the |
| GC CPU and memory trade-off. |
| The key takeaway is that <b>doubling GOGC will double heap memory overheads and |
| roughly halve GC CPU cost</b>, and vice versa. |
| (To see a full explanation as to why, see the |
| <a href="#Additional_notes_on_GOGC">appendix</a>.) |
| </p> |
| |
| <p> |
| <i> |
| Note: the target heap size is just a target, and there are several reasons why |
| the GC cycle might not finish right at that target. |
| For one, a large enough heap allocation can simply exceed the target. |
| However, other reasons appear in GC implementations that go beyond the |
| <a href="#Understanding_costs">GC model</a> this guide has been using thus far. |
| For some more detail, see the <a href="#Latency">latency section</a>, but the |
| complete details may be found in the <a href="#Additional_resources">additional |
| resources</a>. |
| </i> |
| </p> |
| |
| <p> |
| GOGC may be configured through either the <code>GOGC</code> environment |
| variable (which all Go programs recognize), or through the |
| <a href="https://pkg.go.dev/runtime/debug#SetGCPercent"><code>SetGCPercent</code></a> |
| API in the <code>runtime/debug</code> package. |
| </p> |
| |
| <p> |
| Note that GOGC may also be used to turn off the GC entirely (provided the |
| <a href="#Memory_limit">memory limit</a> does not apply) by setting |
| <code>GOGC=off</code> or calling <code>SetGCPercent(-1)</code>. |
| Conceptually, this setting is equivalent to setting GOGC to a value of |
| infinity, as the amount of new memory before a GC is triggered is unbounded. |
| </p> |
| |
| <p> |
| To better understand everything we've discussed so far, try out the interactive |
| visualization below that is built on the |
| <a href="#Understanding_costs">GC cost model</a> discussed earlier. |
| This visualization depicts the execution of some program whose non-GC work takes |
| 10 seconds of CPU time to complete. |
| In the first second it performs some initialization step (growing its live heap) |
| before settling into a steady-state. |
| The application allocates 200 MiB in total, with 20 MiB live at a time. |
| It assumes that the only relevant GC work to complete comes from the live heap, |
| and that (unrealistically) the application uses no additional memory. |
| </p> |
| |
| <p> |
| Use the slider to adjust the value of GOGC to see how the application responds |
| in terms of total duration and GC overhead. |
| Each GC cycle ends while the new heap drops to zero. |
| The time taken while the new heap drops to zero is the combined time for the |
| mark phase for cycle N, and the sweep phase for the cycle N+1. |
| Note that this visualization (and all the visualizations in this guide) assume |
| the application is paused while the GC executes, so GC CPU costs are fully |
| represented by the time it takes for new heap memory to drop to zero. |
| This is only to make visualization simpler; the same intuition still applies. |
| The X axis shifts to always show the full CPU-time duration of the program. |
| Notice that additional CPU time used by the GC increases the overall duration. |
| </p> |
| |
| <div class="gc-guide-graph" data-workload='[ |
| {"duration": 1.0, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 1.00, "oldDeathRate": 0.00}, |
| {"duration": 9.0, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.00, "oldDeathRate": 0.00} |
| ]' data-config='{ |
| "fixedCost": 0.04, |
| "otherMem": 0, |
| "GOGC": "graph1-gogc", |
| "memoryLimit": 100000 |
| }'></div> |
| <div class="gc-guide-graph-controls"> |
| <div> |
| GOGC |
| <input type="range" min="0" max="10" step="0.005" value="6.64" id="graph1-gogc"> |
| <div class="gc-guide-counter" id="graph1-gogc-display"></div> |
| </div> |
| </div> |
| |
| <p> |
| Notice that the GC always incurs some CPU and peak memory overhead. |
| As GOGC increases, CPU overhead decreases, but peak memory increases |
| proportionally to the live heap size. |
| As GOGC decreases, the peak memory requirement decreases at the expense of |
| additional CPU overhead. |
| </p> |
| |
| <p> |
| <i> |
| Note: the graph displays CPU time, not wall-clock time to complete the program. |
| If the program runs on 1 CPU and fully utilizes its resources, then these are |
| equivalent. |
| A real-world program likely runs on a multi-core system and does not 100% |
| utilize the CPUs at all times. |
| In these cases the wall-time impact of the GC will be lower. |
| </i> |
| </p> |
| |
| <p> |
| <i> |
| Note: the Go GC has a minimum total heap size of 4 MiB, so if the GOGC-set |
| target is ever below that, it gets rounded up. |
| The visualization reflects this detail. |
| </i> |
| </p> |
| |
| <p> |
| Here's another example that's a little bit more dynamic and realistic. |
| Once again, the application takes 10 CPU-seconds to complete without the GC, but |
| the steady-state allocation rate increases dramatically half-way through, and |
| the live heap size shifts around a bit in the first phase. |
| This example demonstrates how the steady-state might look when the live heap |
| size is actually changing, and how a higher allocation rate leads to more |
| frequent GC cycles. |
| </p> |
| |
| <div class="gc-guide-graph" data-workload='[ |
| {"duration": 1.0, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 1.00, "oldDeathRate": 0.00}, |
| {"duration": 1.0, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.00, "oldDeathRate": 0.50}, |
| {"duration": 1.0, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.50, "oldDeathRate": 0.00}, |
| {"duration": 1.0, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.00, "oldDeathRate": 0.50}, |
| {"duration": 1.0, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.50, "oldDeathRate": 0.00}, |
| {"duration": 5.0, "allocRate": 200, "scanRate": 1024, "newSurvivalRate": 0.02, "oldDeathRate": 1.00} |
| ]' data-config='{ |
| "fixedCost": 0.04, |
| "otherMem": 0, |
| "GOGC": "graph2-gogc", |
| "memoryLimit": 100000 |
| }'></div> |
| <div class="gc-guide-graph-controls"> |
| <div> |
| GOGC |
| <input type="range" min="0" max="10" step="0.005" value="6.64" id="graph2-gogc"> |
| <div class="gc-guide-counter" id="graph2-gogc-display"></div> |
| </div> |
| </div> |
| |
| <h3 id="Memory_limit">Memory limit</h3> |
| |
| <p> |
| Until Go 1.19, GOGC was the sole parameter that could be used to modify the GC's |
| behavior. |
| While it works great as a way to set a trade-off, it doesn't take into account |
| that available memory is finite. |
| Consider what happens when there's a transient spike in the live heap size: |
| because the GC will pick a total heap size proportional to that live heap size, |
| GOGC must be configured such for the <i>peak</i> live heap size, even if in the |
| usual case a higher GOGC value provides a better trade-off. |
| </p> |
| |
| <p> |
| The visualization below demonstrates this transient heap spike situation. |
| </p> |
| |
| <div class="gc-guide-graph" data-workload='[ |
| {"duration": 1.0, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 1.00, "oldDeathRate": 0.00}, |
| {"duration": 4.0, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.00, "oldDeathRate": 0.00}, |
| {"duration": 0.5, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 1.00, "oldDeathRate": 0.00}, |
| {"duration": 0.5, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.00, "oldDeathRate": 0.00}, |
| {"duration": 0.5, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.00, "oldDeathRate": 0.50}, |
| {"duration": 3.5, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.00, "oldDeathRate": 0.00} |
| ]' data-config='{ |
| "fixedCost": 0.04, |
| "otherMem": 0, |
| "GOGC": "graph3-gogc", |
| "memoryLimit": 100000 |
| }'></div> |
| <div class="gc-guide-graph-controls"> |
| <div> |
| GOGC |
| <input type="range" min="0" max="10" step="0.005" value="6.64" id="graph3-gogc"> |
| <div class="gc-guide-counter" id="graph3-gogc-display"></div> |
| </div> |
| </div> |
| |
| <p> |
| If the example workload is running in a container with a bit over 60 MiB of |
| memory available, then GOGC can't be increased beyond 100, even though the rest |
| of the GC cycles have the available memory to make use of that extra memory. |
| Furthermore, in some applications, these transient peaks can be rare and hard to |
| predict, leading to occasional, unavoidable, and potentially costly |
| out-of-memory conditions. |
| </p> |
| |
| <p> |
| That's why in the 1.19 release, Go added support for setting a runtime memory |
| limit. |
| The memory limit may be configured either via the <code>GOMEMLIMIT</code> |
| environment variable which all Go programs recognize, or through the |
| <code>SetMemoryLimit</code> function available in the <code>runtime/debug</code> |
| package. |
| </p> |
| |
| <p> |
| This memory limit sets a maximum on the <i>total amount of memory that the Go |
| runtime can use</i>. |
| The specific set of memory included is defined in terms of |
| <a href="https://pkg.go.dev/runtime#MemStats"><code>runtime.MemStats</code></a> |
| as the expression |
| </p> |
| |
| <p> |
| <code>Sys</code> <code>-</code> <code>HeapReleased</code> |
| </p> |
| |
| <p> |
| or equivalently in terms of the |
| <a href="https://pkg.go.dev/runtime/metrics"><code>runtime/metrics</code></a> |
| package, |
| </p> |
| |
| <p> |
| <code>/memory/classes/total:bytes</code> <code>-</code> <code>/memory/classes/heap/released:bytes</code> |
| </p> |
| |
| <p> |
| Because the Go GC has explicit control over how much heap memory it uses, it |
| sets the total heap size based on this memory limit and how much other memory |
| the Go runtime uses. |
| </p> |
| |
| <p> |
| The visualization below depicts the same single-phase steady-state workload from |
| the GOGC section, but this time with an extra 10 MiB of overhead from the Go |
| runtime and with an adjustable memory limit. |
| Try shifting around both GOGC and the memory limit and see what happens. |
| </p> |
| |
| <div class="gc-guide-graph" data-workload='[ |
| {"duration": 1.0, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 1.00, "oldDeathRate": 0}, |
| {"duration": 9.0, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.00, "oldDeathRate": 0} |
| ]' data-config='{ |
| "fixedCost": 0.04, |
| "otherMem": 10, |
| "GOGC": "graph4-gogc", |
| "memoryLimit": "graph4-memlimit" |
| }'></div> |
| <div class="gc-guide-graph-controls"> |
| <div> |
| GOGC |
| <input type="range" min="0" max="10" step="0.005" value="6.64" id="graph4-gogc"> |
| <div class="gc-guide-counter" id="graph4-gogc-display"></div> |
| </div> |
| <div> |
| Memory Limit |
| <input type="range" min="1" max="100" step="0.5" value="100" id="graph4-memlimit"> |
| <div class="gc-guide-counter" id="graph4-memlimit-display"></div> |
| </div> |
| </div> |
| |
| <p> |
| Notice that when the memory limit is lowered below the peak memory that's |
| determined by GOGC (42 MiB for a GOGC of 100), the GC runs more frequently to |
| keep the peak memory within the limit. |
| </p> |
| |
| <p> |
| Returning to our previous example of the transient heap spike, by setting a |
| memory limit and turning up GOGC, we can get the best of both worlds: no memory |
| limit breach, and better resource economy. |
| Try out the interactive visualization below. |
| </p> |
| |
| <div class="gc-guide-graph" data-workload='[ |
| {"duration": 1.0, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 1.00, "oldDeathRate": 0.00}, |
| {"duration": 4.0, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.00, "oldDeathRate": 0.00}, |
| {"duration": 0.5, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 1.00, "oldDeathRate": 0.00}, |
| {"duration": 0.5, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.00, "oldDeathRate": 0.00}, |
| {"duration": 0.5, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.00, "oldDeathRate": 0.50}, |
| {"duration": 3.5, "allocRate": 20, "scanRate": 1024, "newSurvivalRate": 0.00, "oldDeathRate": 0.00} |
| ]' data-config='{ |
| "fixedCost": 0.04, |
| "otherMem": 0, |
| "GOGC": "graph5-gogc", |
| "memoryLimit": "graph5-memlimit" |
| }'></div> |
| <div class="gc-guide-graph-controls"> |
| <div> |
| GOGC |
| <input type="range" min="0" max="10" step="0.005" value="6.64" id="graph5-gogc"> |
| <div class="gc-guide-counter" id="graph5-gogc-display"></div> |
| </div> |
| <div> |
| Memory Limit |
| <input type="range" min="1" max="100" step="0.5" value="100" id="graph5-memlimit"> |
| <div class="gc-guide-counter" id="graph5-memlimit-display"></div> |
| </div> |
| </div> |
| |
| <p> |
| Notice that with some values of GOGC and the memory limit, peak memory use |
| stops at whatever the memory limit is, but that the rest of the program's |
| execution still obeys the total heap size rule set by GOGC. |
| </p> |
| |
| <p> |
| This observation leads to another interesting detail: even when GOGC is set to |
| off, the memory limit is still respected! |
| In fact, this particular configuration represents a <i>maximization of resource |
| economy</i> because it sets the minimum GC frequency required to maintain some |
| memory limit. |
| In this case, <i>all</i> of the program's execution has the heap size rise to |
| meet the memory limit. |
| </p> |
| |
| <p> |
| Now, while the memory limit is clearly a powerful tool, <b>the use of |
| a memory limit does not come without a cost</b>, and certainly doesn't |
| invalidate the utility of GOGC. |
| </p> |
| |
| <p> |
| Consider what happens when the live heap grows large enough to bring total |
| memory use close to the memory limit. |
| In the steady-state visualization above, try turning GOGC off and then slowly |
| lowering the memory limit further and further to see what happens. |
| Notice that the total time the application takes will start to grow in an |
| unbounded manner as the GC is constantly executing to maintain an impossible |
| memory limit. |
| </p> |
| |
| <p> |
| This situation, where the program fails to make reasonable progress due to |
| constant GC cycles, is called <b>thrashing</b>. |
| It's particularly dangerous because it effectively stalls the program. |
| Even worse, it can happen for exactly the same situation we were trying to |
| avoid with GOGC: a large enough transient heap spike can cause a program to |
| stall indefinitely! |
| Try reducing the memory limit (around 30 MiB or lower) in the transient heap |
| spike visualization and notice how the worst behavior specifically starts with |
| the heap spike. |
| </p> |
| |
| <p> |
| In many cases, an indefinite stall is worse than an out-of-memory condition, |
| which tends to result in a much faster failure. |
| </p> |
| |
| <p> |
| For this reason, the memory limit is defined to be <b>soft</b>. |
| The Go runtime makes no guarantees that it will maintain this memory limit |
| under all circumstances; it only promises some reasonable amount of effort. |
| This relaxation of the memory limit is critical to avoiding thrashing behavior, |
| because it gives the GC a way out: let memory use surpass the limit to avoid |
| spending too much time in the GC. |
| </p> |
| |
| <p> |
| How this works internally is the GC sets an upper limit on the amount |
| of CPU time it can use over some time window (with some hysteresis for very |
| short transient spikes in CPU use). |
| This limit is currently set at roughly 50%, with a <code>2 * GOMAXPROCS</code> |
| CPU-second window. |
| The consequence of limiting GC CPU time is that the GC's work is delayed, |
| meanwhile the Go program may continue allocating new heap memory, even beyond |
| the memory limit. |
| </p> |
| |
| <p> |
| The intuition behind the 50% GC CPU limit is based on the worst-case impact |
| on a program with ample available memory. |
| In the case of a misconfiguration of the memory limit, where it is set too |
| low mistakenly, the program will slow down at most by 2x, because the GC |
| can't take more than 50% of its CPU time away. |
| </p> |
| |
| <p> |
| <i> |
| Note: the visualizations on this page do not simulate the GC CPU limit. |
| </i> |
| </p> |
| |
| <h4 id="Suggested_uses">Suggested uses</h4> |
| |
| <p> |
| While the memory limit is a powerful tool, and the Go runtime takes steps to |
| mitigate the worst behaviors from misuse, it's still important to use it |
| thoughtfully. |
| Below is a collection of tidbits of advice about where the memory limit is most |
| useful and applicable, and where it might cause more harm than good. |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| <b>Do</b> take advantage of the memory limit when the execution |
| environment of your Go program is entirely within your control, and |
| the Go program is the only program with access to some set of resources |
| (i.e. some kind of memory reservation, like a container memory limit). |
| </p> |
| <p> |
| A good example is the deployment of a web service into containers with |
| a fixed amount of available memory. |
| </p> |
| <p> |
| <b>In this case, a good rule of thumb is to leave an additional 5-10% |
| of headroom to account for memory sources the Go runtime is unaware of. |
| </b> |
| </p> |
| </li> |
| <li> |
| <p> |
| <b>Do</b> feel free to adjust the memory limit in real time to adapt to |
| changing conditions. |
| </p> |
| <p> |
| A good example is a cgo program where C libraries temporarily need to |
| use substantially more memory. |
| </p> |
| </li> |
| <li> |
| <p> |
| <b>Don't</b> set GOGC to off with a memory limit if the Go program |
| might share some of its limited memory with other programs, and those |
| programs are generally decoupled from the Go program. |
| Instead, keep the memory limit since it may help to curb undesirable |
| transient behavior, but set GOGC to some smaller, reasonable value for |
| the average case. |
| </p> |
| <p> |
| While it may be tempting to try and "reserve" memory for co-tenant |
| programs, unless the programs are fully synchronized (e.g. the Go |
| program calls some subprocess and blocks while its callee executes), |
| the result will be less reliable as inevitably both programs will |
| need more memory. |
| Letting the Go program use less memory when it doesn't need it will |
| generate a more reliable result overall. |
| This advice also applies to overcommit situations, where the sum of |
| memory limits of containers running on one machine may exceed the |
| actual physical memory available to the machine. |
| </p> |
| </li> |
| <li> |
| <p> |
| <b>Don't</b> use the memory limit when deploying to an execution |
| environment you don't control, especially when your program's memory |
| use is proportional to its inputs. |
| </p> |
| <p> |
| A good example is a CLI tool or a desktop application. |
| Baking a memory limit into the program when it's unclear what kind of |
| inputs it might be fed, or how much memory might be available on the |
| system can lead to confusing crashes and poor performance. |
| Plus, an advanced end-user can always set a memory limit if they wish. |
| </p> |
| </li> |
| <li> |
| <p> |
| <b>Don't</b> set a memory limit to avoid out-of-memory conditions when |
| a program is already close to its environment's memory limits. |
| </p> |
| <p> |
| This effectively replaces an out-of-memory risk with a risk of |
| severe application slowdown, which is often not a favorable trade, |
| even with the efforts Go makes to mitigate thrashing. |
| In such a case, it would be much more effective to either increase the |
| environment's memory limits (and <i>then</i> potentially set a memory |
| limit) or decrease GOGC (which provides a much cleaner trade-off than |
| thrashing-mitigation does). |
| </p> |
| </li> |
| </ul> |
| |
| <h3 id="Latency">Latency</h3> |
| |
| <p> |
| The visualizations in this document have modeled the application as paused while |
| the GC is executing. |
| GC implementations do exist that behave this way, and they're referred to as |
| "stop-the-world" GCs. |
| </p> |
| |
| <p> |
| The Go GC, however, is not fully stop-the-world and does most of its work |
| concurrently with the application. |
| This is primarily to reduce application <i>latencies</i>. |
| Specifically, the end-to-end duration of a single unit of computation (e.g. a |
| web request). |
| Thus far, this document mainly considered application <i>throughput</i> (e.g. |
| web requests handled per second). |
| Note that each example in the <a href="#The_GC_cycle">GC cycle</a> section |
| focused on the total CPU duration of an executing program. |
| However, such a duration is far less meaningful for say, a web service. |
| While throughput is still important for a web service (i.e. queries per second), |
| often the latency of each individual request matters even more. |
| </p> |
| |
| <p> |
| In terms of latency, a stop-the-world GC may require a considerable length of |
| time to execute both its mark and sweep phases, during which the application, |
| and in the context of a web service, any in-flight request, is unable to make |
| further progress. |
| Instead, the Go GC avoids making the length of any global application pauses |
| proportional to the size of the heap, and that the core tracing algorithm is |
| performed while the application is actively executing. |
| (The pauses are more strongly proportional to GOMAXPROCS algorithmically, but |
| most commonly are dominated by the time it takes to stop running goroutines.) |
| Collecting concurrently is not without cost: in practice it often leads to a |
| design with lower throughput than an equivalent stop-the-world garbage |
| collector. |
| However, it's important to note that <i>lower latency does not inherently mean |
| lower throughput</i>, and the performance of the Go garbage collector has |
| steadily improved over time, in both latency and throughput. |
| </p> |
| |
| <p> |
| The concurrent nature of Go's current GC does not invalidate anything discussed |
| in this document so far: none of the statements relied on this design choice. |
| GC frequency is still the primary way the GC trades off between CPU |
| time and memory for throughput, and in fact, it also takes on this role for |
| latency. |
| This is because most of the costs for the GC are incurred while the mark phase |
| is active. |
| </p> |
| |
| <p> |
| The key takeaway then, is that <b>reducing GC frequency may also lead to latency |
| improvements</b>. |
| This applies not only to reductions in GC frequency from modifying tuning |
| parameters, like increasing GOGC and/or the memory limit, but also applies to |
| the optimizations described in the |
| <a href="#Optimization_guide">optimization guide</a>. |
| </p> |
| |
| <p> |
| However, latency is often more complex to understand than throughput, because it |
| is a product of the moment-to-moment execution of the program and not just an |
| aggregation of costs. |
| As a result, the connection between latency and GC frequency is less direct. |
| Below is a list of possible sources of latency for those inclined to dig |
| deeper. |
| </p> |
| |
| <ol> |
| <li> |
| Brief stop-the-world pauses when the GC transitions between the mark |
| and sweep phases, |
| </li> |
| <li> |
| Scheduling delays because the GC takes 25% of CPU resources when in the |
| mark phase, |
| </li> |
| <li> |
| User goroutines assisting the GC in response to a high allocation rate, |
| </li> |
| <li> |
| Pointer writes requiring additional work while the GC is in the mark |
| phase, and |
| </li> |
| <li> |
| Running goroutines must be suspended for their roots to be scanned. |
| </li> |
| </ol> |
| |
| <p> |
| These latency sources are visible in |
| <a href="/doc/diagnostics#execution-tracer">execution traces</a>, except for |
| pointer writes requiring additional work. |
| </p> |
| |
| <!-- TODO: Add a short section about non-steady-state behavior. --> |
| |
| <h3 id="Additional_resources">Additional resources</h3> |
| |
| <p> |
| While the information presented above is accurate, it lacks the detail to |
| fully understand costs and trade-offs in the Go GC's design. |
| For more information, see the following additional resources. |
| </p> |
| |
| <ul> |
| <li> |
| <a href="https://gchandbook.org/">The GC Handbook</a>—An |
| excellent general resource and reference on garbage collector design. |
| </li> |
| <li> |
| <a href="https://google.github.io/tcmalloc/design.html">TCMalloc</a>—Design |
| document for the C/C++ memory allocator TCMalloc, which the Go memory allocator is based on. |
| </li> |
| <li> |
| <a href="https://go.dev/blog/go15gc">Go 1.5 GC announcement</a>—The |
| blog post announcing the Go 1.5 concurrent GC, which describes the algorithm in more detail. |
| </li> |
| <li> |
| <a href="https://go.dev/blog/ismmkeynote">Getting to Go</a>—An |
| in-depth presentation about the evolution of Go's GC design up to 2018. |
| </li> |
| <li> |
| <a href="https://docs.google.com/document/d/1wmjrocXIWTr1JxU-3EQBI6BK6KgtiFArkG47XK73xIQ/edit">Go 1.5 concurrent GC pacing</a>—Design |
| document for determining when to start a concurrent mark phase. |
| </li> |
| <li> |
| <a href="https://github.com/golang/go/issues/30333">Smarter scavenging</a>—Design |
| document for revising the way the Go runtime returns memory to the operating system. |
| </li> |
| <li> |
| <a href="https://github.com/golang/go/issues/35112">Scalable page allocator</a>—Design |
| document for revising the way the Go runtime manages memory it gets from the operating system. |
| </li> |
| <li> |
| <a href="https://github.com/golang/go/issues/44167">GC pacer redesign (Go 1.18)</a>—Design |
| document for revising the algorithm to determine when to start a concurrent mark phase. |
| </li> |
| <li> |
| <a href="https://github.com/golang/go/issues/48409">Soft memory limit (Go 1.19)</a>—Design |
| document for the soft memory limit. |
| </li> |
| </ul> |
| |
| <h2 id="A_note_about_virtual_memory">A note about virtual memory</h2> |
| |
| <p> |
| This guide has largely focused on the physical memory use of the GC, but a |
| question that comes up regularly is what exactly that means and how it compares |
| to virtual memory (typically presented in programs like <code>top</code> as |
| "VSS"). |
| </p> |
| |
| <p> |
| Physical memory is memory housed in the actual physical RAM chip in most |
| computers. |
| <a href="https://en.wikipedia.org/wiki/Virtual_memory">Virtual memory</a> is an |
| abstraction over physical memory provided by the operating system to isolate |
| programs from one another. |
| It's also typically acceptable for programs to reserve virtual address space |
| that doesn't map to any physical addresses at all. |
| </p> |
| |
| <p> |
| <b> |
| Because virtual memory is just a mapping maintained by the operating system, |
| it is typically very cheap to make large virtual memory reservations that don't |
| map to physical memory. |
| </b> |
| </p> |
| |
| <p> |
| The Go runtime generally relies upon this view of the cost of virtual memory in |
| a few ways: |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| The Go runtime never deletes virtual memory that it maps. |
| Instead, it uses special operations that most operating systems |
| provide to explicitly release any physical memory resources |
| associated with some virtual memory range. |
| </p> |
| |
| <p> |
| This technique is used explicitly to manage the |
| <a href="#Memory_limit">memory limit</a> and return memory to the |
| operating system that the Go runtime no longer needs. |
| The Go runtime also releases memory it no longer needs continuously |
| in the background. |
| See <a href="#Additional_resources">the additional resources</a> for |
| more information. |
| </p> |
| </li> |
| <li> |
| <p> |
| On 32-bit platforms, the Go runtime reserves between 128 MiB and 512 MiB |
| of address space up-front for the heap to limit fragmentation issues. |
| </p> |
| </li> |
| <li> |
| <p> |
| The Go runtime uses large virtual memory address space reservations |
| in the implementation of several internal data structures. |
| On 64-bit platforms, these typically have a minimum virtual memory |
| footprint of about 700 MiB. |
| On 32-bit platforms, their footprint is negligible. |
| </p> |
| </li> |
| </ul> |
| |
| <p> |
| As a result, virtual memory metrics such as "VSS" in <code>top</code> are |
| typically not very useful in understanding a Go program's memory footprint. |
| Instead, focus on "RSS" and similar measurements, which more directly reflect |
| physical memory usage. |
| </p> |
| |
| <h2 id="Optimization_guide">Optimization guide</h2> |
| |
| <h3 id="Identiying_costs">Identifying costs</h3> |
| |
| <p> |
| Before trying to optimize how your Go application interacts with the GC, it's |
| important to first identify that the GC is a major cost in the first place. |
| </p> |
| |
| </p> |
| The Go ecosystem provides a number of tools for identifying costs and optimizing |
| Go applications. |
| For a brief overview of these tools, see the |
| <a href="/doc/diagnostics">guide on diagnostics</a>. |
| Here, we'll focus on a subset of these tools and a reasonable order to apply |
| them in in order to understand GC impact and behavior. |
| </p> |
| |
| <ol> |
| <li> |
| <p> |
| <b>CPU profiles</b> |
| </p> |
| |
| <p> |
| A good place to start is with |
| <a href="https://pkg.go.dev/runtime/pprof#hdr-Profiling_a_Go_program">CPU profiling</a>. |
| CPU profiling provides an overview of where CPU time is spent, though to the |
| untrained eye it may be difficult to identify the magnitude of the role the |
| GC plays in a particular application. |
| Luckily, understanding how the GC fits in mostly boils down to knowing what |
| different functions in the `runtime` package mean. |
| Below is a useful subset of these functions for interpreting CPU profiles. |
| </p> |
| |
| <p> |
| <i> |
| Note: the functions listed below are not leaf functions, so they may not show |
| up in the default the <code>pprof</code> tool provides with the <code>top</code> |
| command. |
| Instead, use the <code>top -cum</code> command or use the <code>list</code> |
| command on these functions directly and focus on the cumulative percent column. |
| </i> |
| </p> |
| </li> |
| |
| <ul> |
| <li> |
| <p> |
| <b><code>runtime.gcBgMarkWorker</code></b>: Entrypoint to the |
| dedicated mark worker goroutines. |
| Time spent here scales with GC frequency and the complexity and |
| size of the object graph. |
| It represents a baseline for how much time the application spends |
| marking and scanning. |
| </p> |
| <p> |
| <i> |
| Note: In a largely idle Go application, the Go GC is going to |
| use up additional (idle) CPU resources to get its job done faster. |
| As a result, this symbol may represent a large fraction of samples, |
| that it believes are free. |
| One common reason this can happen is if an application runs entirely |
| in one goroutine but </i><code>GOMAXPROCS</code><i> is >1. |
| </i> |
| </p> |
| </li> |
| <li> |
| <p> |
| <b><code>runtime.mallocgc</code></b>: Entrypoint to the memory |
| allocator for heap memory. |
| A large amount of cumulative time spent here (>15%) |
| typically indicates a lot of memory being allocated. |
| </p> |
| </li> |
| <li> |
| <p> |
| <b><code>runtime.gcAssistAlloc</code></b>: Function goroutines |
| enter to yield some of their time to assist the GC with scanning |
| and marking. |
| A large amount of cumulative time spent here (>5%) indicates |
| that the application is likely out-pacing the GC with respect to how |
| fast it's allocating. |
| It indicates a particularly high degree of impact from the GC, |
| and also represents time the application spend marking and scanning. |
| Note that this is included in the <code>runtime.mallocgc</code> |
| call tree, so it will inflate that as well. |
| </p> |
| </li> |
| </ul> |
| |
| <li> |
| <p> |
| <b>Execution traces</b> |
| </p> |
| |
| <p> |
| While CPU profiles are great for identifying where time is spent in |
| aggregate, they're less useful for indicating performance costs that |
| are more subtle, rare, or related to latency specifically. |
| Execution traces</a> on |
| the other hand provide a rich and deep view into a short window of |
| a Go program's execution. |
| They contain a variety of events related to the Go GC and specific |
| execution paths can be directly observed, along with how the application |
| might interact with the Go GC. |
| All the GC events tracked are conveniently labeled as such in the trace |
| viewer. |
| </p> |
| |
| <p> |
| See the <a href="https://pkg.go.dev/runtime/trace">documentation for the |
| <code>runtime/trace</code></a> package for how to get started with |
| execution traces. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| <b>GC traces</b> |
| </p> |
| |
| <p> |
| When all else fails, the Go GC provides a few different specific traces |
| that provide much deeper insights into GC behavior. |
| These traces are always printed directly to STDERR, one line per GC cycle, |
| and are configured through the <code>GODEBUG</code> environment variable |
| that all Go programs recognize. |
| They're mostly useful for debugging the Go GC itself since they require |
| some familiarity with the specifics of the GC's implementation, but |
| nonetheless can occasionally be useful to gain a better understanding of |
| GC behavior. |
| </p> |
| |
| <p> |
| The core GC trace is enabled by setting <code>GODEBUG=gctrace=1</code>. |
| The output produced by this trace is documented in the |
| <a href="https://pkg.go.dev/runtime#hdr-Environment_Variables">environment |
| variables section in the documentation for the <code>runtime</code> |
| package.</a> |
| </p> |
| |
| <p> |
| A supplementary GC trace called the "pacer trace" provides even deeper |
| insights and is enabled by setting <code>GODEBUG=gcpacertrace=1</code>. |
| Interpreting this output requires an understanding of the GC's "pacer" |
| (see <a href="#Additional_resources">additional resources</a>), which is |
| outside the scope of this guide. |
| </p> |
| </li> |
| </ol> |
| |
| <h3 id="Eliminating_heap_allocations">Eliminating heap allocations</h3> |
| |
| <p> |
| One way to reduce costs from the GC is to have the GC manage fewer values to begin |
| with. |
| The techniques described below can produce some of the largest improvements in |
| performance, because as the <a href="#GOGC">GOGC section</a> demonstrated, the |
| allocation rate of a Go program is a major factor in GC frequency, the key |
| cost metric used by this guide. |
| </p> |
| |
| <h4 id="Heap_profiling">Heap profiling</h4> |
| |
| <p> |
| After <a href="#Identifying_costs">identifying that the GC is a source of |
| significant costs</a>, the next step in eliminating heap allocations is to |
| find out where most of them are coming from. |
| For this purpose, memory profiles (really, heap memory profiles) are very |
| useful. |
| Check out the <a href="https://pkg.go.dev/runtime/pprof#hdr-Profiling_a_Go_program"> |
| documentation</a> for how to get started with them. |
| </p> |
| |
| <p> |
| Memory profiles describe where in the program heap allocations come from, |
| identifying them by the stack trace at the point they were allocated. |
| Each memory profile can break down memory in four ways. |
| </p> |
| |
| <ul> |
| <li><code>inuse_objects</code>—Breaks down the number of objects that |
| are live.</li> |
| <li><code>inuse_space</code>—Breaks down live objects by how much |
| memory they use in bytes.</li> |
| <li><code>alloc_objects</code>—Breaks down the number of objects |
| that have been allocated since the Go program began executing.</li> |
| <li><code>alloc_space</code>—Breaks down the total amount of memory |
| allocated since the Go program began executing.</li> |
| </ul> |
| |
| <p> |
| Switching between these different views of heap memory may be done with either |
| the <code>-sample_index</code> flag to the <code>pprof</code> tool, or via the |
| <code>sample_index</code> option when the tool is used interactively. |
| </p> |
| |
| <p> |
| <i> |
| Note: memory profiles by default only sample a subset of heap objects so they |
| will not contain information about every single heap allocation. |
| However, this is sufficient to find hot-spots. |
| To change the sampling rate, see |
| <a href="https://pkg.go.dev/runtime#pkg-variables"><code>runtime.MemProfileRate</code></a>. |
| </i> |
| </p> |
| |
| <p> |
| For the purposes of reducing GC costs, <code>alloc_space</code> is typically the |
| most useful view as it directly corresponds to the allocation rate. |
| This view will indicate allocation hot spots that would provide the most benefit. |
| </p> |
| |
| <h4 id="Escape_analysis">Escape analysis</h4> |
| |
| <p> |
| Once candidate heap allocation sites have been identified with the help of |
| <a href="#Heap_profiling">heap profiles</a>, how can they be eliminated? |
| The key is to leverage the Go compiler's escape analysis to have the Go compiler |
| find alternative, and more efficient storage for this memory, for example |
| in the goroutine stack. |
| Luckily, the Go compiler has the ability to describe why it decides to escape |
| a Go value to the heap. |
| With that knowledge, it becomes a matter of reorganizing your source code to |
| change the outcome of the analysis (which is often the hardest part, but outside |
| the scope of this guide). |
| </p> |
| |
| <p> |
| As for how to access the information from the Go compiler's escape analysis, the |
| simplest way is through a debug flag supported by the Go compiler that describes |
| all optimizations it applied or did not apply to some package in a text format. |
| This includes whether or not values escape. |
| Try the following command, where <code>[package]</code> is some Go package path. |
| </p> |
| |
| <pre> |
| $ go build -gcflags=-m=3 [package] |
| </pre> |
| |
| <p> |
| This information can also be visualized as an overlay in VS Code. |
| This overlay is configured and enabled in the VS Code Go plugin settings. |
| </p> |
| |
| <ol> |
| <li> |
| Set the |
| <a href="https://github.com/golang/vscode-go/wiki/settings#uicodelenses"> |
| <code>ui.codelenses</code> setting to include <code>gc_details</code> |
| </a>. |
| </li> |
| <li> |
| Enable the overlay for escape analysis by |
| <a href="https://github.com/golang/vscode-go/wiki/settings#uidiagnosticannotations"> |
| setting <code>ui.diagnostic.annotations</code> to include <code>escape</code> |
| </a>. |
| </li> |
| </ol> |
| |
| <p> |
| Finally, the Go compiler provides this information in a machine-readable (JSON) |
| format that may be used to build additional custom tooling. |
| For more information on that, see the |
| <a href="https://cs.opensource.google/go/go/+/master:src/cmd/compile/internal/logopt/log_opts.go;l=25;drc=351e0f4083779d8ac91c05afebded42a302a6893"> |
| documentation in the source Go code</a>. |
| <p> |
| |
| <h3 id="Implementation-specific_optimizations">Implementation-specific optimizations</h3> |
| |
| <p> |
| The Go GC is sensitive to the demographics of live memory, because a complex |
| graph of objects and pointers both limits parallelism and generates more work |
| for the GC. |
| As a result, the GC contains a few optimizations for specific common structures. |
| The most directly useful ones for performance optimization are listed below. |
| </p> |
| |
| <p> |
| <i> |
| Note: Applying the optimizations below may reduce the readability of your code |
| by obscuring intent, and may fail to hold up across Go releases. |
| Prefer to apply these optimizations only in the places they matter most. |
| Such places may be identified by using the tools listed in the |
| <a href="#Identifying_costs">section on identifying costs</a>. |
| </i> |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| Pointer-free values are segregated from other values. |
| </p> |
| <p> |
| As a result, it may be advantageous to eliminate pointers from data |
| structures that do not strictly need them, as this reduces the cache |
| pressure the GC exerts on the program. |
| As a result, data structures that rely on indices over pointer values, |
| while less well-typed, may perform better. |
| This is only worth doing if it's clear that the object graph is complex |
| and the GC is spending a lot of time marking and scanning. |
| </p> |
| </li> |
| <li> |
| <p> |
| The GC will stop scanning values at the last pointer in the value. |
| </p> |
| <p> |
| As a result, it may be advantageous to group pointer fields in |
| struct-typed values at the beginning of the value. |
| This is only worth doing if it's clear the application spends a lot of its |
| time marking and scanning. |
| (In theory the compiler can do this automatically, but it is not yet |
| implemented, and struct fields are arranged as written in the source |
| code.) |
| </p> |
| </li> |
| </ul> |
| |
| <p> |
| Furthermore, the GC must interact with nearly every pointer it sees, so using |
| indices into an slice, for example, instead of pointers, can aid in reducing GC |
| costs. |
| </p> |
| |
| <h3 id="Linux_transparent_huge_pages">Linux transparent huge pages (THP)</h3> |
| |
| <p> |
| When a program accesses memory, the CPU needs to translate the |
| <a href="#A_note_about_virtual_memory">virtual memory</a> addresses it uses |
| into physical memory addresses that refer to the data it was trying to access. |
| To do this, the CPU consults the "page table," a data structure that represents |
| the mapping from virtual to physical memory, managed by the operating system. |
| Each entry in the page table represents an indivisible block of physical memory |
| called a page, hence the name. |
| </p> |
| |
| <p> |
| Transparent huge pages (THP) is a Linux feature that transparently replaces pages of |
| physical memory backing contiguous virtual memory regions with bigger blocks of memory |
| called huge pages. |
| By using bigger blocks, fewer page table entries are needed to represent the same memory |
| region, improving page table lookup times. |
| However, bigger blocks mean more waste if only a small part of the huge page is used |
| by the system. |
| </p> |
| |
| <p> |
| When running Go programs in production, enabling transparent huge pages on Linux |
| can improve throughput and latency at the cost of additional memory use. |
| Applications with small heaps tend not to benefit from THP and may end up using a |
| substantial amount of additional memory (as high as 50%). |
| However, applications with big heaps (1 GiB or more) tend to benefit quite a bit |
| (up to 10% throughput) without very much additional memory overhead (1-2% or less). |
| Being aware of your THP settings in either case can be helpful, and experimentation |
| is always recommended. |
| </p> |
| |
| <p> |
| One can enable or disable transparent huge pages in a Linux environment by modifying |
| <code>/sys/kernel/mm/transparent_hugepage/enabled</code>. |
| See the |
| <a href="https://www.kernel.org/doc/html/next/admin-guide/mm/transhuge.html">official |
| Linux admin guide</a> for more details. |
| If you choose to have your Linux production environment enable transparent huge pages, |
| we recommend the following additional settings for Go programs. |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| Set <code>/sys/kernel/mm/transparent_hugepage/defrag</code> |
| to <code>defer</code> or <code>defer+madvise</code>. |
| <br /> |
| <br /> |
| This setting controls how aggressively a Linux kernel coalesces regular |
| pages into huge pages. |
| <code>defer</code> tells the kernel to coalesce huge pages lazily |
| and in the background. |
| A more aggressive setting can induce stalls in memory constrained systems |
| and can often hurt application latencies. |
| <code>defer+madvise</code> is like <code>defer</code>, but is friendlier |
| to other applications on the system that request huge pages explicitly and |
| require them for performance. |
| </p> |
| </li> |
| <li> |
| <p> |
| Set <code>/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none</code> |
| to <code>0</code>. |
| <br /> |
| <br /> |
| This setting controls how many additional pages the Linux kernel daemon |
| can allocate when trying to allocate a huge page. |
| The default setting is maximally aggressive, and can often |
| <a href="https://bugzilla.kernel.org/show_bug.cgi?id=93111">undo work the Go |
| runtime does to return memory to the OS</a>. |
| Before Go 1.21, the Go runtime tried to mitigate the negative effects of the |
| default setting, but it came with a CPU cost. |
| With Go 1.21+ and Linux 6.2+, the Go runtime will coalesce regular pages |
| into huge pages itself with its own more accurate heuristics. |
| </p> |
| </li> |
| </ul> |
| |
| <h2 id="Appendix">Appendix</h2> |
| |
| <h3 id="Additional_notes_on_GOGC">Additional notes on GOGC</h3> |
| |
| <p> |
| The <a href="#GOGC">GOGC section</a> claimed that doubling GOGC doubles heap |
| memory overheads and halves GC CPU costs. |
| To see why, let's break it down mathematically. |
| </p> |
| |
| <p> |
| Firstly, the heap target sets a target for the total heap size. |
| This target, however, mainly influences the new heap memory, because the live |
| heap is fundamental to the application. |
| </p> |
| |
| <p class="gc-guide-equation"> |
| <i> |
| Target heap memory = Live heap + (Live heap + GC roots) * GOGC / 100 |
| </i> |
| </p> |
| |
| <p class="gc-guide-equation"> |
| <i> |
| Total heap memory = Live heap + New heap memory |
| </i> |
| </p> |
| |
| <p class="gc-guide-equation"> |
| <i> |
| ⇒ |
| </i> |
| </p> |
| |
| <p class="gc-guide-equation"> |
| <i> |
| New heap memory = (Live heap + GC roots) * GOGC / 100 |
| </i> |
| </p> |
| |
| <p> |
| From this we can see that doubling GOGC would also double the amount of new heap |
| memory that application will allocate each cycle, which captures heap memory |
| overheads. |
| Note that <i>Live heap + GC roots</i> is an approximation of the amount of |
| memory the GC needs to scan. |
| </p> |
| |
| <p> |
| Next, let's look at GC CPU cost. |
| Total cost can be broken down as the cost per cycle, times GC frequency over |
| some time period T. |
| </p> |
| |
| <p class="gc-guide-equation"> |
| <i> |
| Total GC CPU cost = (GC CPU cost per cycle) * (GC frequency) * T |
| </i> |
| </p> |
| |
| <p> |
| GC CPU cost per cycle can be derived from the |
| <a href="#Understanding_costs">GC model</a>: |
| </p> |
| |
| <p class="gc-guide-equation"> |
| <i> |
| GC CPU cost per cycle = (Live heap + GC roots) * (Cost per byte) + Fixed cost |
| </i> |
| </p> |
| |
| <p> |
| Note that sweep phase costs are ignored here as mark and scan costs dominate. |
| </p> |
| |
| <p> |
| The steady-state is defined by a constant allocation rate and a constant cost |
| per byte, so in the steady-state we can derive a GC frequency from this new heap |
| memory: |
| </p> |
| |
| <p class="gc-guide-equation"> |
| <i> |
| GC frequency = (Allocation rate) / (New heap memory) = (Allocation rate) / ((Live heap + GC roots) * GOGC / 100) |
| </i> |
| </p> |
| |
| <p> |
| Putting this together, we get the full equation for the total cost: |
| </p> |
| |
| <p class="gc-guide-equation"> |
| <i> |
| Total GC CPU cost = (Allocation rate) / ((Live heap + GC roots) * GOGC / 100) * ((Live heap + GC roots) * (Cost per byte) + Fixed cost) * T |
| </i> |
| </p> |
| |
| <p> |
| For a sufficiently large heap (which represents most cases), the marginal costs |
| of a GC cycle dominate the fixed costs. |
| This allows for a significant simplification of the total GC CPU cost formula. |
| </p> |
| |
| <p class="gc-guide-equation"> |
| <i> |
| Total GC CPU cost = (Allocation rate) / (GOGC / 100) * (Cost per byte) * T |
| </i> |
| </p> |
| |
| <p> |
| From this simplified formula, we can see that if we double GOGC, we halve total |
| GC CPU cost. |
| (Note that the visualizations in this guide do simulate fixed costs, so the GC |
| CPU overheads reported by them will not exactly halve when GOGC doubles.) |
| Furthermore, GC CPU costs are largely determined by allocation rate and the |
| cost per byte to scan memory. |
| For more information on how to reduce these costs specifically, see the |
| <a href="#Optimization_guide">optimization guide</a>. |
| </p> |
| |
| <p> |
| <i> |
| Note: there exists a discrepancy between the size of the live heap, and the |
| amount of that memory the GC actually needs to scan: the same size live heap but |
| with a different structure will result in a different CPU cost, but the same |
| memory cost, resulting a different trade-off. |
| This is why the structure of the heap is part of the definition of the |
| steady-state. |
| The heap target should arguably only include the scannable live heap as a closer |
| approximation of memory the GC needs to scan, but this leads to degenerate |
| behavior when there's a very small amount of scannable live heap but the live |
| heap is otherwise large. |
| </i> |
| </p> |
| |
| <script src="/js/d3.js"></script> |
| <script async src="/doc/gc-guide.js"></script> |