Authors: Michael Knyszek, Michael Pratt
- Transparent huge pages (THP) admin guide.
- Go scavenging policy. (Implementation details are out of date, but the linked policy is still relevant.)
Currently, Go's hugepage-related policies do not play well together and have bit-rotted.[^1] The result is that the memory regions the Go runtime chooses to mark as `MADV_NOHUGEPAGE` and `MADV_HUGEPAGE` are somewhat haphazard, resulting in memory overuse for small heaps: upwards of 40% overhead in some cases. Turning off huge pages entirely fixes the problem, but leaves CPU performance on the table. This policy also means large heaps might have dense sections that are erroneously mapped as `MADV_NOHUGEPAGE`, costing up to 1% throughput.
The goal of this work is to eliminate this overhead for small heaps while improving huge page utilization for large heaps.
[^1]: Large allocations will force a call to `MADV_HUGEPAGE` for any aligned huge pages within, while small allocations tend to leave memory in an undetermined state for huge pages. The scavenger will try to release entire aligned huge pages at a time. Also, when any memory is released, we `MADV_NOHUGEPAGE` any aligned pages in the range we release. However, the scavenger will only release 64 KiB at a time unless it finds an aligned huge page to release, and even then it'll only `MADV_NOHUGEPAGE` the corresponding huge pages if the region it's scavenging crosses a huge page boundary.
One key insight in the design of the scavenger is that the runtime always has a good idea of how much memory will be used soon: the total heap footprint for a GC cycle is determined by the heap goal. [^2]
[^2]: The scavenger tries to return memory to the OS such that it leaves enough paged-in memory around to reach the heap goal (adjusted for fragmentation within spans, with a 10% buffer for fragmentation outside of spans, and capped by the memory limit). The purpose behind this is to reduce the chance that the scavenger will return memory to the OS that will be used soon. The runtime also has a first-fit page allocator so that the scavenger can take pages from the high addresses in the heap, again to reduce the chance of conflict.
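As a rough sketch of the retention target that footnote describes (the function and parameter names below are hypothetical, and the runtime's actual computation differs in detail):

```go
// retainTarget estimates how much paged-in memory the scavenger should keep
// around, per the description above: the heap goal, adjusted for
// fragmentation within spans, plus a 10% buffer for fragmentation outside of
// spans, capped by the memory limit. Hypothetical names; not runtime code.
func retainTarget(heapGoal, heapAlloc, heapInUse, memoryLimit uint64) uint64 {
	target := heapGoal
	// Spans occupy more memory (heapInUse) than the bytes allocated from
	// them (heapAlloc), so scale the goal up by that fragmentation factor.
	if heapAlloc > 0 && heapInUse > heapAlloc {
		target = uint64(float64(heapGoal) * float64(heapInUse) / float64(heapAlloc))
	}
	// Add a 10% buffer for fragmentation outside of spans.
	target += target / 10
	// Never plan to retain more than the memory limit allows.
	if target > memoryLimit {
		target = memoryLimit
	}
	return target
}
```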
Indeed, by tracing page allocations and watching page state over time we can see that Go heaps tend to get very dense toward the end of a GC cycle; this makes all of that memory a decent candidate for huge pages from the perspective of fragmentation. However, it's also clear this density fluctuates significantly within a GC cycle.
Therefore, I propose the following policy:
1. All new heap memory is initially marked as `MADV_HUGEPAGE` with the expectation that it will be used.
2. When the scavenger releases memory from a chunk (an aligned 4 MiB region)[^3] that is below a high occupancy threshold (96%)[^4], it marks that chunk as `MADV_NOHUGEPAGE` if it isn't already marked as such.
    - If `max_ptes_none` is 0, then skip this step.
    - Scavenging to maintain `GOMEMLIMIT` or for `runtime/debug.FreeOSMemory` ignores this rule.[^5]
3. When an allocation brings a chunk's occupancy up to the threshold, the chunk is marked as `MADV_HUGEPAGE`.
    - If `max_ptes_none` is 0, then use `MADV_COLLAPSE` instead, if available.
4. A chunk whose memory was released to maintain `GOMEMLIMIT` or by `runtime/debug.FreeOSMemory` is not marked `MADV_HUGEPAGE` until the next allocation that causes this condition after the end of the current GC cycle.[^6]

[^3]: 4 MiB doesn't align with linux/amd64 huge page sizes, but it is a very convenient number for the runtime because the page allocator manages memory in 4 MiB chunks.
[^4]: The bar for explicit (non-default) backing by huge pages must be very high. The main issue is the default value of `/sys/kernel/mm/transparent_hugepage/defrag` on Linux: it forces regions marked as `MADV_HUGEPAGE` to be immediately backed, stalling in the kernel until it can compact and rearrange things to provide a huge page. Meanwhile, the combination of `MADV_NOHUGEPAGE` and `MADV_DONTNEED` does the opposite. Switching between these two states often creates really expensive churn.
[^5]: Note that `runtime/debug.FreeOSMemory` and the mechanism that maintains `GOMEMLIMIT` must still be able to release all memory to be effective. For that reason, this rule does not apply to those two situations. Basically, these cases get to skip waiting until the end of the GC cycle, optimistically assuming that memory won't be used.
[^6]: It might happen that the wrong memory was scavenged (memory that soon after exceeds 96% occupancy). This delay helps reduce churn.
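To make the policy concrete, here is a minimal sketch of the per-chunk decision logic it implies. All identifiers, the structure, and the way the GC cycle boundary is tracked are illustrative only; the real logic would be integrated with the page allocator and scavenger, and the madvise calls are elided.

```go
package thppolicy

// denseThreshold is the occupancy at or above which a chunk is considered
// dense enough to be worth backing with huge pages.
const denseThreshold = 0.96

// chunkState is a hypothetical per-chunk summary (one aligned 4 MiB region).
type chunkState struct {
	occupancy   float64 // fraction of the chunk's pages currently allocated
	hugeEnabled bool    // chunk is currently marked MADV_HUGEPAGE
	backoff     bool    // released for GOMEMLIMIT/FreeOSMemory; wait out the GC cycle
}

// onAlloc is called after an allocation raises a chunk's occupancy (rule 3).
func onAlloc(c *chunkState, gcCycleEnded bool) {
	if c.occupancy < denseThreshold || c.hugeEnabled {
		return
	}
	if c.backoff && !gcCycleEnded {
		// Rule 4: memory released for GOMEMLIMIT/FreeOSMemory isn't re-marked
		// until an allocation makes it dense after the current cycle ends.
		return
	}
	markHuge(c) // MADV_HUGEPAGE, or MADV_COLLAPSE if max_ptes_none == 0
	c.hugeEnabled = true
	c.backoff = false
}

// onScavenge is called when the scavenger wants to release pages (rule 2).
func onScavenge(c *chunkState, forMemLimitOrFreeOSMemory bool) {
	if !forMemLimitOrFreeOSMemory && c.occupancy >= denseThreshold {
		return // the background scavenger leaves dense chunks alone
	}
	if forMemLimitOrFreeOSMemory {
		c.backoff = true // rule 4 applies when this chunk becomes dense again
	}
	if c.hugeEnabled {
		markNoHuge(c) // MADV_NOHUGEPAGE; skipped entirely when max_ptes_none == 0
		c.hugeEnabled = false
	}
	releasePages(c)
}

func markHuge(*chunkState)     { /* madvise over the chunk's address range */ }
func markNoHuge(*chunkState)   { /* madvise over the chunk's address range */ }
func releasePages(*chunkState) { /* MADV_DONTNEED the scavenged pages */ }
```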
The goal of these changes is to ensure that when sparse regions of the heap have their memory returned to the OS, it stays that way regardless of `max_ptes_none`. Meanwhile, the policy avoids expensive churn by delaying the release of pages that were part of dense memory regions by at least a full GC cycle.
Note that there's potentially quite a lot of hysteresis here, which could impact memory reclaim for, for example, a brief memory spike followed by a long-lived idle low-memory state. In the worst case, the time between GC cycles is 2 minutes, and the scavenger's slowest return rate is ~256 MiB/sec.[^7] I suspect this isn't slow enough to be a problem in practice. Furthermore, `GOMEMLIMIT` can still be employed to maintain a memory maximum.
[^7]: The scavenger is much more aggressive than it once was, targeting 1% of total CPU usage. Spending 1% of one CPU core in 2018 on `MADV_DONTNEED` meant roughly 8 KiB released per millisecond in the worst case. For a `GOMAXPROCS=32` process, this worst case is now approximately 256 KiB per millisecond. In the best case, wherein the scavenger can identify whole unreleased huge pages, it would release 2 MiB per millisecond in 2018, so 64 MiB per millisecond today.
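Spelled out, the arithmetic behind these figures and the worst case mentioned above (a back-of-the-envelope estimate, not a measurement):

```
worst case: 8 KiB/ms per core × 32 cores = 256 KiB/ms ≈ 256 MiB/s
best case:  2 MiB/ms per core × 32 cores =  64 MiB/ms
one 2-minute GC cycle at the worst-case rate: 256 MiB/s × 120 s ≈ 30 GiB
```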
Initially, I attempted a design where all heap memory up to the heap goal (address-ordered) is marked as `MADV_HUGEPAGE` and is ineligible for scavenging. The rest is always eligible for scavenging, and the scavenger marks that memory as `MADV_NOHUGEPAGE`.
This approach had a few problems:
I also tried a simple heuristic to turn off the scavenger when it looks like the heap is growing, but not all heaps grow monotonically, so a small amount of churn still occurred. It's difficult to come up with a good heuristic without assuming monotonicity.
My next attempt was more direct: mark high-density chunks as `MADV_HUGEPAGE`, and allow low-density chunks to be scavenged and set as `MADV_NOHUGEPAGE`. A chunk would become high density if it was observed to have at least 80% occupancy, and would later switch back to low density if it had less than 20% occupancy. This gap existed for hysteresis, to reduce churn. Unfortunately, this also didn't work: GC-heavy programs often have memory regions that go from extremely low (near 0%) occupancy to 100% within a single GC cycle, creating a lot of churn.
The design above is ultimately a combination of these two designs: assume that the heap gets generally dense within a GC cycle, but handle it on a chunk-by-chunk basis.
Where all this differs from other huge page efforts, such as what TCMalloc did, is the lack of bin-packing of allocated memory into huge pages (which is really the bulk and the key part of that design). Bin-packing increases the likelihood that an entire huge page will be free by placing new memory into existing huge pages, rather than following some global policy, like “best-fit,” that may put it anywhere. This not only improves the efficiency of releasing memory, but makes the overall footprint smaller due to less fragmentation.
This is unlikely to be that useful for Go since Go's heap already, at least transiently, gets very dense. Another thing that gets in the way of doing the same kind of bin-packing for Go is that the allocator's slow path gets hit much harder than TCMalloc's slow path. The reason for this boils down to the GC memory reuse pattern (essentially, FIFO vs. LIFO reuse). Slowdowns in this path will likely create scalability problems.
Whether or not pages are eligible for THP is controlled by a combination of settings:

- `/sys/kernel/mm/transparent_hugepage/enabled`: system-wide control, possible values:
    - `never`: THP disabled.
    - `madvise`: only pages with `MADV_HUGEPAGE` are eligible.
    - `always`: all pages are eligible, unless marked `MADV_NOHUGEPAGE`.
- `prctl(PR_SET_THP_DISABLE)`: process-wide control to disable THP.
- `madvise`: per-mapping control, possible values:
    - `MADV_NOHUGEPAGE`: mapping not eligible for THP.
    - `MADV_HUGEPAGE`: mapping eligible for THP unless there is a process- or system-wide disable.
- `/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none`: system-wide control that specifies how many extra small pages can be allocated when collapsing a group of pages into a huge page. In other words, how many small pages in a candidate huge page can be not-faulted-in or faulted-in zero pages.
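As a concrete illustration of the per-mapping control (Linux-only; this just demonstrates the madvise calls using golang.org/x/sys/unix on an anonymous mapping, and is not runtime code):

```go
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	const size = 4 << 20 // a 4 MiB region, i.e. two 2 MiB huge pages on amd64

	// Anonymous private mapping, the same kind of memory the runtime
	// obtains from mmap for the heap.
	mem, err := unix.Mmap(-1, 0, size, unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_ANON|unix.MAP_PRIVATE)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Munmap(mem)

	// Opt the mapping in to THP. Under enabled=madvise this is what makes
	// the region eligible; under enabled=always it is already the default.
	if err := unix.Madvise(mem, unix.MADV_HUGEPAGE); err != nil {
		log.Fatal(err)
	}

	// Opt the mapping back out; khugepaged will no longer collapse it.
	if err := unix.Madvise(mem, unix.MADV_NOHUGEPAGE); err != nil {
		log.Fatal(err)
	}
	fmt.Println("madvise calls succeeded")
}
```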
`MADV_DONTNEED` on a smaller range within a huge page will split the huge page to zero the range. However, the full huge page range will still be immediately eligible for coalescing by `khugepaged` if `max_ptes_none > 0`, which is true for the default open source Linux configuration. Thus, to both disable future THP and split an existing huge page race-free, you must first set `MADV_NOHUGEPAGE` and then call `MADV_DONTNEED`.
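For example, a release path following that order might look like the sketch below (using the same golang.org/x/sys/unix calls as the earlier example; the function is hypothetical, not the runtime's actual release path):

```go
// releaseRange returns part of an anonymous mapping to the OS while keeping
// it ineligible for THP: first opt out of THP so khugepaged cannot
// re-collapse the range, then free the backing pages.
func releaseRange(region []byte) error {
	if err := unix.Madvise(region, unix.MADV_NOHUGEPAGE); err != nil {
		return err
	}
	// MADV_DONTNEED splits any huge page overlapping the range and releases
	// the memory; later accesses fault in zero pages.
	return unix.Madvise(region, unix.MADV_DONTNEED)
}
```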
Another consideration is the newly-upstreamed `MADV_COLLAPSE`, which collapses memory regions into huge pages unconditionally. `MADV_DONTNEED` can then be used to break them up. This scheme provides effectively complete control over huge pages, provided `khugepaged` doesn't coalesce pages in a way that undoes the `MADV_DONTNEED` (which can be ensured, for example, by setting `max_ptes_none` to zero).
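A corresponding sketch, assuming a kernel (6.1+) and a golang.org/x/sys/unix release that define `MADV_COLLAPSE` (hypothetical helper, not runtime code):

```go
// collapseRegion asks the kernel to back the region with huge pages
// immediately rather than waiting for khugepaged. Breaking the region back
// up is then releaseRange from the previous sketch; with max_ptes_none set
// to 0, khugepaged won't quietly undo the MADV_DONTNEED.
func collapseRegion(region []byte) error {
	return unix.Madvise(region, unix.MADV_COLLAPSE)
}
```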
To investigate this issue I built a low-overhead page event tracer and visualization utility to check assumptions of application and GC behavior. Below are a bunch of traces and conclusions from them.