A more hugepage-aware Go heap

Authors: Michael Knyszek, Michael Pratt

Background

Transparent huge pages (THP) admin guide.

Go scavenging policy. (Implementation details are out-of-date, but the linked policy is still relevant.)

THP flag behavior.

Motivation

Currently, Go's hugepage-related policies do not play well together and have bit-rotted.[^1] The result is that the memory regions the Go runtime chooses to mark as MADV_NOHUGEPAGE and MADV_HUGEPAGE are somewhat haphazard, leading to memory overuse for small heaps, upwards of 40% overhead in some cases. Turning off huge pages entirely fixes the problem, but leaves CPU performance on the table. The same haphazard marking also means large heaps might have dense sections erroneously mapped as MADV_NOHUGEPAGE, costing up to 1% throughput.

The goal of this work is to eliminate this overhead for small heaps while improving huge page utilization for large heaps.

[^1]: Large allocations will force a call to MADV_HUGEPAGE for any aligned huge pages within, while small allocations tend to leave memory in an undetermined state for huge pages. The scavenger will try to release entire aligned huge pages at a time. Also, when any memory is released, we MADV_NOHUGEPAGE any aligned huge pages in the range we release. However, the scavenger will only release 64 KiB at a time unless it finds an aligned huge page to release, and even then it will only MADV_NOHUGEPAGE the corresponding huge pages if the region it's scavenging crosses a huge page boundary.

Proposal

One key insight in the design of the scavenger is that the runtime always has a good idea of how much memory will be used soon: the total heap footprint for a GC cycle is determined by the heap goal. [^2]

[^2]: The scavenger tries to return memory to the OS such that it leaves enough paged-in memory around to reach the heap goal (adjusted for fragmentation within spans, plus a 10% buffer for fragmentation outside of spans, or capped by the memory limit). The purpose of this is to reduce the chance that the scavenger will return memory to the OS that will be used soon. The runtime also has a first-fit page allocator so that the scavenger can take pages from the high addresses in the heap, again reducing the chance of conflict with the allocator.
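For illustration, here is a rough sketch of that retention target as a formula. The function name, parameters, and the exact fragmentation adjustment are assumptions made for the example, not the runtime's actual accounting:

```go
// retainTarget returns how much paged-in memory the scavenger leaves
// alone: the heap goal, scaled up for fragmentation within spans, plus
// a 10% buffer for fragmentation outside of spans, capped by the memory
// limit. Purely illustrative; the real accounting has more inputs.
func retainTarget(heapGoal, liveBytes, spanBytes, memoryLimit uint64) uint64 {
	goal := heapGoal
	if liveBytes > 0 && spanBytes > liveBytes {
		// Fragmentation within spans: the live heap occupies spanBytes
		// of address space, so scale the goal up proportionally.
		goal = uint64(float64(goal) * float64(spanBytes) / float64(liveBytes))
	}
	goal += goal / 10 // 10% buffer for fragmentation outside of spans
	if goal > memoryLimit {
		goal = memoryLimit
	}
	return goal
}
```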

Indeed, by tracing page allocations and watching page state over time we can see that Go heaps tend to get very dense toward the end of a GC cycle; this makes all of that memory a decent candidate for huge pages from the perspective of fragmentation. However, it's also clear this density fluctuates significantly within a GC cycle.

Therefore, I propose the following policy:

  1. All new memory is initially marked as MADV_HUGEPAGE with the expectation that it will be used.
  2. Before the scavenger releases pages in an aligned 4 MiB region of memory [^3], it first marks the region as MADV_NOHUGEPAGE if it isn't already marked as such.
    • If max_ptes_none is 0, then skip this step.
  3. Aligned 4 MiB regions of memory are only available to scavenge if they weren't at least 96% [^4] full at the end of the last GC cycle. [^5]
    • Scavenging for GOMEMLIMIT or runtime/debug.FreeOSMemory ignores this rule.
  4. Any aligned 4 MiB region of memory that exceeds 96% occupancy is immediately marked as MADV_HUGEPAGE.
    • If max_ptes_none is 0, then use MADV_COLLAPSE instead, if available.
    • Memory scavenged for GOMEMLIMIT or runtime/debug.FreeOSMemory is not marked MADV_HUGEPAGE until the next allocation that causes this condition after the end of the current GC cycle. [^6]

[^3]: 4 MiB doesn't align with linux/amd64 huge page sizes, but it's a very convenient number for the runtime because the page allocator manages memory in 4 MiB chunks.

[^4]: The bar for explicit (non-default) backing by huge pages must be very high. The main issue is the default value of /sys/kernel/mm/transparent_hugepage/defrag on Linux: it forces regions marked as MADV_HUGEPAGE to be immediately backed, stalling in the kernel until it can compact and rearrange things to provide a huge page. Meanwhile the combination of MADV_NOHUGEPAGE and MADV_DONTNEED does the opposite: it breaks up any existing huge pages and releases the memory. Switching between these two states often creates very expensive churn.

[^5]: Note that runtime/debug.FreeOSMemory and the mechanism to maintain GOMEMLIMIT must still be able to release all memory to be effective. For that reason, this rule does not apply to those two situations. Basically, these cases get to skip waiting until the end of the GC cycle, optimistically assuming that memory won't be used.

[^6]: It might happen that the wrong memory was scavenged (memory that soon after exceeds 96% occupancy). This delay helps reduce churn.

The goal of these changes is to ensure that when sparse regions of the heap have their memory returned to the OS, it stays that way regardless of max_ptes_none. Meanwhile, the policy avoids expensive churn by delaying the release of pages that were part of dense memory regions by at least a full GC cycle.
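To make the policy concrete, here is a minimal sketch of the per-chunk bookkeeping it implies. The type and method names, and the exact threshold handling, are illustrative rather than the runtime's actual data structures; the madvise mechanics these decisions would trigger are sketched in the appendix on THP flag behavior.

```go
// Per-chunk state for the policy. The runtime's page allocator manages
// memory in aligned 4 MiB chunks, so density is tracked at that granularity.
type chunkState struct {
	occupancy      float64 // fraction of the chunk currently allocated
	denseAtGCEnd   bool    // occupancy >= denseThreshold at the end of the last GC cycle?
	hugePageBacked bool    // already marked for huge pages (rule 4)?
}

const denseThreshold = 0.96

// noteGCEnd snapshots density at the end of a GC cycle; this is the
// input to rule 3.
func (c *chunkState) noteGCEnd() {
	c.denseAtGCEnd = c.occupancy >= denseThreshold
}

// scavengeCandidate reports whether the background scavenger may release
// pages from this chunk (rule 3). Scavenging for GOMEMLIMIT or
// runtime/debug.FreeOSMemory passes force=true and ignores density.
func (c *chunkState) scavengeCandidate(force bool) bool {
	return force || !c.denseAtGCEnd
}

// noteScavenged records that pages in the chunk were released (and the
// chunk marked MADV_NOHUGEPAGE per rule 2); rule 4 may mark it for huge
// pages again if it refills.
func (c *chunkState) noteScavenged() {
	c.hugePageBacked = false
}

// markForHugePages reports whether the chunk should be marked
// MADV_HUGEPAGE (or collapsed via MADV_COLLAPSE when max_ptes_none is 0)
// after an allocation: rule 4 fires the first time occupancy crosses the
// threshold.
func (c *chunkState) markForHugePages() bool {
	if !c.hugePageBacked && c.occupancy >= denseThreshold {
		c.hugePageBacked = true
		return true
	}
	return false
}
```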

Note that there's potentially quite a lot of hysteresis here, which could delay memory reclaim after, for example, a brief memory spike followed by a long-lived idle low-memory state. In the worst case, the time between GC cycles is 2 minutes, and the scavenger's slowest return rate is ~256 MiB/sec. [^7] I suspect this isn't slow enough to be a problem in practice. Furthermore, GOMEMLIMIT can still be employed to maintain a memory maximum.

[^7]: The scavenger is much more aggressive than it once was, targeting 1% of total CPU usage. Spending 1% of one CPU core in 2018 on MADV_DONTNEED meant roughly 8 KiB released per millisecond in the worst case. For a GOMAXPROCS=32 process, this worst case is now approximately 256 KiB per millisecond. In the best case, wherein the scavenger can identify whole unreleased huge pages, it would release 2 MiB per millisecond in 2018, so 64 MiB per millisecond today.
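To connect those figures: 8 KiB/ms scaled by 32 cores is 256 KiB/ms, which is the ~256 MiB/sec worst case cited above; at that rate, a hypothetical 1 GiB of stale memory would be returned in roughly 4 seconds even in the worst case.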

Alternative attempts

Initially, I attempted a design where all heap memory up to the heap goal (address-ordered) is marked as MADV_HUGEPAGE and ineligible for scavenging. The rest is always eligible for scavenging, and the scavenger marks that memory as MADV_NOHUGEPAGE.

This approach had a few problems:

  1. The heap goal tends to fluctuate, creating churn at the boundary.
  2. When the heap is actively growing, the aftermath of this churn actually ends up in the middle of the fully-grown heap, as the scavenger works on memory beyond the boundary in between GC cycles.
  3. Any fragmentation that does exist in the middle of the heap, for example if most allocations are large, is never looked at by the scavenger.

I also tried a simple heuristic to turn off the scavenger when it looks like the heap is growing, but not all heaps grow monotonically, so a small amount of churn still occurred. It's difficult to come up with a good heuristic without assuming monotonicity.

My next attempt was more direct: mark high density chunks as MADV_HUGEPAGE, and allow low density chunks to be scavenged and set as MADV_NOHUGEPAGE. A chunk would become high density if it was observed to have at least 80% occupancy, and would later switch back to low density if it had less than 20% occupancy. This gap existed for hysteresis to reduce churn. Unfortunately, this also didn't work: GC-heavy programs often have memory regions that go from extremely low (near 0%) occupancy to 100% within a single GC cycle, creating a lot of churn.

The design above is ultimately a combination of these two designs: assume that the heap gets generally dense within a GC cycle, but handle it on a chunk-by-chunk basis.

Where all this differs from other huge page efforts, such as TCMalloc's, is the lack of bin-packing of allocated memory into huge pages (which is really the core of that design). Bin-packing increases the likelihood that an entire huge page will become free by placing new memory in existing huge pages rather than letting some global policy, like "best-fit," put it anywhere. This not only improves the efficiency of releasing memory, but also makes the overall footprint smaller due to less fragmentation.

This is unlikely to be as useful for Go, since Go's heap already, at least transiently, gets very dense. Another thing that gets in the way of doing the same kind of bin-packing for Go is that the allocator's slow path gets hit much harder than TCMalloc's slow path; the reason boils down to the GC memory reuse pattern (essentially, FIFO vs. LIFO reuse). Slowdowns in this path will likely create scalability problems.

Appendix: THP flag behavior

Whether or not pages are eligible for THP is controlled by a combination of settings:

/sys/kernel/mm/transparent_hugepage/enabled: system-wide control, possible values:

  • never: THP disabled
  • madvise: Only pages with MADV_HUGEPAGE are eligible
  • always: All pages are eligible, unless marked MADV_NOHUGEPAGE

prctl(PR_SET_THP_DISABLE): process-wide control to disable THP

madvise: per-mapping control, possible values:

  • MADV_NOHUGEPAGE: mapping not eligible for THP
    • Note that existing huge pages will not be split if this flag is set.
  • MADV_HUGEPAGE: mapping eligible for THP unless there is a process- or system-wide disable.
  • Unset: mapping eligible for THP if system-wide control is set to “always”.

/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none: system-wide control that specifies how many extra small pages can be allocated when collapsing a group of pages into a huge page. In other words, how many small pages in a candidate huge page can be not-faulted-in or faulted-in zero pages.
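For reference, here is a small sketch of how a user-space Go program might read these controls. The sysfs paths are those documented in the kernel's THP admin guide; the function names are made up for the example:

```go
package thpdemo

import (
	"os"
	"strconv"
	"strings"
)

// thpEnabledMode returns the system-wide THP mode: "always", "madvise",
// or "never". The file's contents look like "always [madvise] never",
// with the active mode in brackets.
func thpEnabledMode() (string, error) {
	data, err := os.ReadFile("/sys/kernel/mm/transparent_hugepage/enabled")
	if err != nil {
		return "", err
	}
	s := string(data)
	i := strings.IndexByte(s, '[')
	j := strings.IndexByte(s, ']')
	if i < 0 || j < i {
		return "", nil // unexpected format
	}
	return s[i+1 : j], nil
}

// maxPtesNone returns khugepaged's max_ptes_none setting: how many
// not-present (or zero) small pages may be filled in when collapsing a
// region into a huge page.
func maxPtesNone() (int, error) {
	data, err := os.ReadFile("/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none")
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(data)))
}
```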

MADV_DONTNEED on a smaller range within a huge page will split the huge page to zero the range. However, the full huge page range will still be immediately eligible for coalescing by khugepaged if max_ptes_none > 0, which is true for the default open source Linux configuration. Thus to both disable future THP and split an existing huge page race-free, you must first set MADV_NOHUGEPAGE and then call MADV_DONTNEED.

Another consideration is the newly-upstreamed MADV_COLLAPSE, which collapses memory regions into huge pages unconditionally. MADV_DONTNEED can then be used to break them up. This scheme provides effectively complete control over huge pages, provided khugepaged doesn't coalesce pages in a way that undoes the MADV_DONTNEED (which can be prevented, for example, by setting max_ptes_none to zero).
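As a rough sketch of both operations outside the runtime (which uses its own internal madvise wrappers), using golang.org/x/sys/unix and assuming region is an appropriately aligned anonymous mapping; note that older x/sys releases may not define MADV_COLLAPSE, so its Linux value is spelled out here:

```go
package thpdemo

import "golang.org/x/sys/unix"

// MADV_COLLAPSE was only recently upstreamed (Linux 6.1); older
// golang.org/x/sys/unix releases may not define unix.MADV_COLLAPSE,
// so its value is given directly.
const madvCollapse = 25

// releaseRange returns region's pages to the OS without leaving them
// eligible to be faulted back in as part of a huge page: mark the
// mapping MADV_NOHUGEPAGE first, then MADV_DONTNEED. The other order
// leaves a window where khugepaged can rebuild the huge page.
func releaseRange(region []byte) error {
	if err := unix.Madvise(region, unix.MADV_NOHUGEPAGE); err != nil {
		return err
	}
	return unix.Madvise(region, unix.MADV_DONTNEED)
}

// collapseRange asks the kernel to back region with huge pages
// immediately, independent of khugepaged's heuristics.
func collapseRange(region []byte) error {
	return unix.Madvise(region, madvCollapse)
}
```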

Appendix: Page traces

To investigate this issue I built a low-overhead page event tracer and visualization utility to check assumptions about application and GC behavior. Below are several traces and the conclusions drawn from them.

  • Tile38 K-Nearest benchmark: GC-heavy benchmark. Note the fluctuation between very low occupancy and very high occupancy. During a single GC cycle, the page heap gets at least transiently very dense. This benchmark caused me the most trouble when trying out ideas.
  • Go compiler building a massive package: Note again the high density.