# A more hugepage-aware Go heap
Authors: Michael Knyszek, Michael Pratt
## Background
[Transparent huge pages (THP) admin
guide](https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html).
[Go scavenging
policy](30333-smarter-scavenging.md#which-memory-should-we-scavenge).
(Implementation details are out-of-date, but linked policy is relevant.)
[THP flag behavior](#appendix_thp-flag-behavior).
## Motivation
Currently, Go's hugepage-related policies [do not play well
together](https://github.com/golang/go/issues/55328) and have bit-rotted.[^1]
The result is that the memory regions the Go runtime chooses to mark as
`MADV_NOHUGEPAGE` and `MADV_HUGEPAGE` are somewhat haphazard, resulting in
memory overuse for small heaps.
This overuse amounts to upwards of 40% memory overhead in some cases.
Turning off huge pages entirely fixes the problem, but leaves CPU performance on
the table.
This policy also means large heaps might have dense sections that are
erroneously mapped as `MADV_NOHUGEPAGE`, costing up to 1% throughput.
The goal of this work is to eliminate this overhead for small heaps while
improving huge page utilization for large heaps.
[^1]: [Large allocations](https://cs.opensource.google/go/go/+/master:src/runtime/mheap.go;l=1344;drc=c70fd4b30aba5db2df7b5f6b0833c62b909f50eb)
will force [a call to `MADV_HUGEPAGE` for any aligned huge pages
within](https://cs.opensource.google/go/go/+/master:src/runtime/mem_linux.go;l=148;drc=9839668b5619f45e293dd40339bf0ac614ea6bee),
while small allocations tend to leave memory in an undetermined state for
huge pages.
The scavenger will try to release entire aligned hugepages at a time.
Also, when any memory is released, [we `MADV_NOHUGEPAGE` any aligned pages
in the range we
release](https://cs.opensource.google/go/go/+/master:src/runtime/mem_linux.go;l=40;drc=9839668b5619f45e293dd40339bf0ac614ea6bee).
However, the scavenger will [only release 64 KiB at a time unless it finds
an aligned huge page to
release](https://cs.opensource.google/go/go/+/master:src/runtime/mgcscavenge.go;l=564;drc=c70fd4b30aba5db2df7b5f6b0833c62b909f50eb),
and even then it'll [only `MADV_NOHUGEPAGE` the corresponding huge pages if
the region it's scavenging crosses a huge page
boundary](https://cs.opensource.google/go/go/+/master:src/runtime/mem_linux.go;l=70;drc=9839668b5619f45e293dd40339bf0ac614ea6bee).
## Proposal
One key insight in the design of the scavenger is that the runtime always has a
good idea of how much memory will be used soon: the total heap footprint for a
GC cycle is determined by the heap goal. [^2]
[^2]: The runtime also has a first-fit page allocator so that the scavenger can
take pages from the high addresses in the heap, again to reduce the chance
of conflict.
The scavenger tries to return memory to the OS such that it leaves enough
paged-in memory around to reach the heap goal (adjusted for fragmentation
within spans and a 10% buffer for fragmentation outside of spans, or capped
by the memory limit).
The purpose behind this is to reduce the chance that the scavenger will
return memory to the OS that will be used soon.
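As a concrete illustration of that retention target, here is a minimal sketch in Go.
The function and parameter names (`scavengeRetentionTarget`, `heapGoal`, `memoryLimit`,
`heapInUse`, `spanCap`) and the exact arithmetic are assumptions made for illustration;
they are not the runtime's actual variables or formula.

```go
package hugepages

// scavengeRetentionTarget sketches how much paged-in memory the scavenger
// tries to keep around: the heap goal, scaled up for fragmentation within
// spans, plus a 10% buffer for fragmentation outside of spans, capped by the
// memory limit.
func scavengeRetentionTarget(heapGoal, memoryLimit, heapInUse, spanCap uint64) uint64 {
	// Fragmentation within spans: the fraction of span memory actually
	// occupied by objects.
	frag := 1.0
	if spanCap > 0 && heapInUse < spanCap {
		frag = float64(heapInUse) / float64(spanCap)
	}
	target := uint64(float64(heapGoal) / frag * 1.1)
	if target > memoryLimit {
		target = memoryLimit
	}
	return target
}
```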
Indeed, by [tracing page allocations and watching page state over
time](#appendix_page-traces) we can see that Go heaps tend to get very dense
toward the end of a GC cycle; this makes all of that memory a decent candidate
for huge pages from the perspective of fragmentation.
However, it's also clear this density fluctuates significantly within a GC
cycle.
Therefore, I propose the following policy (a rough code sketch follows below):
1. All new memory is initially marked as `MADV_HUGEPAGE` with the expectation
that it will be used.
1. Before the scavenger releases pages in an aligned 4 MiB region of memory, [^3]
it [first](#appendix_thp-flag-behavior) marks the region as `MADV_NOHUGEPAGE` if it
isn't already marked as such.
- If `max_ptes_none` is 0, then skip this step.
1. Aligned 4 MiB regions of memory are only available to scavenge if they
weren't at least 96% [^4] full at the end of the last GC cycle. [^5]
- Scavenging for `GOMEMLIMIT` or `runtime/debug.FreeOSMemory` ignores this
rule.
1. Any aligned 4 MiB region of memory that exceeds 96% occupancy is immediately
marked as `MADV_HUGEPAGE`.
- If `max_ptes_none` is 0, then use `MADV_COLLAPSE` instead, if available.
- Memory scavenged for `GOMEMLIMIT` or `runtime/debug.FreeOSMemory` is not
marked `MADV_HUGEPAGE` until the next allocation that causes this
condition after the end of the current GC cycle. [^6]
[^3]: 4 MiB doesn't align with linux/amd64 huge page sizes, but it is a very
convenient size for the runtime because the page allocator manages memory
in 4 MiB chunks.
[^4]: The bar for explicit (non-default) backing by huge pages must be very
high.
The main issue is the default value of
`/sys/kernel/mm/transparent_hugepage/defrag` on Linux: it forces regions
marked as `MADV_HUGEPAGE` to be immediately backed, stalling in the kernel
until it can compact and rearrange things to provide a huge page.
Meanwhile, the combination of `MADV_NOHUGEPAGE` and `MADV_DONTNEED` does the
opposite: it immediately splits and releases any huge pages backing the region.
Switching between these two states often creates very expensive churn.
[^5]: Note that `runtime/debug.FreeOSMemory` and the mechanism to maintain
`GOMEMLIMIT` must still be able to release all memory to be effective.
For that reason, this rule does not apply to those two situations.
Basically, these cases get to skip waiting until the end of the GC cycle,
optimistically assuming that memory won't be used.
[^6]: It might happen that the wrong memory was scavenged (memory that soon
after exceeds 96% occupancy).
This delay helps reduce churn.
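To make the per-chunk behavior above concrete, here is a rough sketch in Go.
The type and helper names (`chunk`, `maybeScavenge`, `noteDense`, and the madvise
stubs) are hypothetical and exist only for illustration; the sketch covers rules 2
through 4 and omits the delayed re-marking described in footnote 6.

```go
package hugepages

// chunk stands in for an aligned 4 MiB region managed by the page allocator.
type chunk struct {
	lastGCOccupancy float64 // fraction of the chunk in use at the end of the previous GC cycle
	huge            bool    // whether the chunk is currently marked MADV_HUGEPAGE
}

const denseThreshold = 0.96

// maybeScavenge applies rules 2 and 3: skip chunks that were dense at the end
// of the last GC cycle (unless forced by GOMEMLIMIT or FreeOSMemory), and mark
// the chunk MADV_NOHUGEPAGE before releasing pages unless max_ptes_none is 0.
func maybeScavenge(c *chunk, forced, maxPtesNoneIsZero bool) {
	if !forced && c.lastGCOccupancy >= denseThreshold {
		return
	}
	if !maxPtesNoneIsZero {
		markNoHugepage(c)
		c.huge = false
	}
	releaseFreePages(c)
}

// noteDense applies rule 4: when an allocation pushes the chunk past the
// density threshold, explicitly back it with huge pages.
func noteDense(c *chunk, maxPtesNoneIsZero bool) {
	if c.huge {
		return
	}
	if maxPtesNoneIsZero {
		collapse(c) // MADV_COLLAPSE, if available
	} else {
		markHugepage(c) // MADV_HUGEPAGE
	}
	c.huge = true
}

// The stubs below stand in for the corresponding madvise calls on the chunk's
// address range.
func markHugepage(c *chunk)     { /* MADV_HUGEPAGE */ }
func markNoHugepage(c *chunk)   { /* MADV_NOHUGEPAGE */ }
func releaseFreePages(c *chunk) { /* MADV_DONTNEED on free pages */ }
func collapse(c *chunk)         { /* MADV_COLLAPSE */ }
```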
The goal of these changes is to ensure that when sparse regions of the heap have
their memory returned to the OS, it stays that way regardless of
`max_ptes_none`.
Meanwhile, the policy avoids expensive churn by delaying the release of pages
that were part of dense memory regions by at least a full GC cycle.
Note that there's potentially quite a lot of hysteresis here, which could delay
memory reclaim after, for example, a brief memory spike followed by a long-lived,
low-memory idle state.
In the worst case, the time between GC cycles is 2 minutes, and the scavenger's
slowest return rate is ~256 MiB/sec. [^7] I suspect this isn't slow enough to be
a problem in practice.
Furthermore, `GOMEMLIMIT` can still be employed to maintain a memory maximum.
[^7]: The scavenger is much more aggressive than it once was, targeting 1% of
total CPU usage.
Spending 1% of one CPU core in 2018 on `MADV_DONTNEED` meant roughly 8 KiB
released per millisecond in the worst case.
For a `GOMAXPROCS=32` process, this worst case is now approximately 256 KiB
per millisecond.
In the best case, wherein the scavenger can identify whole unreleased huge
pages, it would release 2 MiB per millisecond in 2018, so 64 MiB per
millisecond today.
## Alternative attempts
Initially, I attempted a design where all heap memory up to the heap goal
(address-ordered) is marked as `MADV_HUGEPAGE` and ineligible for scavenging.
The rest is always eligible for scavenging, and the scavenger marks that memory
as `MADV_NOHUGEPAGE`.
This approach had a few problems:
1. The heap goal tends to fluctuate, creating churn at the boundary.
1. When the heap is actively growing, the aftermath of this churn actually ends
up in the middle of the fully-grown heap, as the scavenger works on memory
beyond the boundary in between GC cycles.
1. Any fragmentation that does exist in the middle of the heap, for example if
most allocations are large, is never looked at by the scavenger.
I also tried a simple heuristic to turn off the scavenger when it looks like the
heap is growing, but not all heaps grow monotonically, so a small amount of
churn still occurred.
It's difficult to come up with a good heuristic without assuming monotonicity.
My next attempt was more direct: mark high density chunks as `MADV_HUGEPAGE`,
and allow low density chunks to be scavenged and set as `MADV_NOHUGEPAGE`.
A chunk would become high density if it was observed to have at least 80%
occupancy, and would later switch back to low density if it had less than 20%
occupancy.
This gap existed for hysteresis to reduce churn.
Unfortunately, this also didn't work: GC-heavy programs often have memory
regions that go from extremely low (near 0%) occupancy to 100% within a single
GC cycle, creating a lot of churn.
The design above is ultimately a combination of these two designs: assume that
the heap gets generally dense within a GC cycle, but handle it on a
chunk-by-chunk basis.
Where all this differs from other huge page efforts, such as [what TCMalloc
did](https://google.github.io/tcmalloc/temeraire.html), is the lack of
bin-packing of allocated memory into huge pages (which is really the bulk and
key part of that design).
Bin-packing increases the likelihood that an entire huge page will become free,
because new memory is placed into existing, partially-used huge pages rather than
wherever a global policy such as "best-fit" happens to put it.
This not only improves the efficiency of releasing memory, but makes the overall
footprint smaller due to less fragmentation.
This is unlikely to be that useful for Go since Go's heap already, at least
transiently, gets very dense.
Another thing that gets in the way of doing the same kind of bin-packing for Go
is that the allocator's slow path gets hit much harder than TCMalloc's slow
path.
The reason for this boils down to the GC memory reuse pattern (essentially, FIFO
vs. LIFO reuse).
Slowdowns in this path will likely create scalability problems.
## Appendix: THP flag behavior
Whether or not pages are eligible for THP is controlled by a combination of
settings:
`/sys/kernel/mm/transparent_hugepage/enabled`: system-wide control, possible
values:
- `never`: THP disabled
- `madvise`: Only pages with `MADV_HUGEPAGE` are eligible
- `always`: All pages are eligible, unless marked `MADV_NOHUGEPAGE`
`prctl(PR_SET_THP_DISABLE)`: process-wide control to disable THP
`madvise`: per-mapping control, possible values:
- `MADV_NOHUGEPAGE`: mapping not eligible for THP
- Note that existing huge pages will not be split if this flag is set.
- `MADV_HUGEPAGE`: mapping eligible for THP unless there is a process- or
system-wide disable.
- Unset: mapping eligible for THP if the system-wide control is set to `always`.
`/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none`: system-wide
control that specifies how many extra small pages can be allocated when
collapsing a group of pages into a huge page.
In other words, how many small pages in a candidate huge page can be
not-faulted-in or faulted-in zero pages.
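As a small how-to for checking this knob (which the proposal above conditions on),
here is a sketch in Go; the helper name is made up, and the sysfs path is the one
given above.

```go
package thp

import (
	"os"
	"strconv"
	"strings"
)

// maxPtesNone reads khugepaged's max_ptes_none setting, returning -1 if it
// cannot be read (for example, on a kernel without THP support).
func maxPtesNone() int {
	data, err := os.ReadFile("/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none")
	if err != nil {
		return -1
	}
	n, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return -1
	}
	return n
}
```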
`MADV_DONTNEED` on a smaller range within a huge page will split the huge page
to zero the range.
However, the full huge page range will still be immediately eligible for
coalescing by `khugepaged` if `max_ptes_none > 0`, which is true for the default
open source Linux configuration.
Thus to both disable future THP and split an existing huge page race-free, you
must first set `MADV_NOHUGEPAGE` and then call `MADV_DONTNEED`.
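For example, that ordering looks roughly like the following sketch using
`golang.org/x/sys/unix`, where `region` is assumed to be a page-aligned mapping
obtained elsewhere (e.g. via `unix.Mmap`):

```go
package thp

import "golang.org/x/sys/unix"

// releaseAndDisableTHP releases region back to the OS and prevents khugepaged
// from collapsing it into huge pages again, using the race-free ordering
// described above.
func releaseAndDisableTHP(region []byte) error {
	// First, make the mapping ineligible for THP. This alone does not split
	// any existing huge pages.
	if err := unix.Madvise(region, unix.MADV_NOHUGEPAGE); err != nil {
		return err
	}
	// Then release the pages; this splits any huge pages backing the range,
	// and khugepaged can no longer undo the release.
	return unix.Madvise(region, unix.MADV_DONTNEED)
}
```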
Another consideration is the newly-upstreamed `MADV_COLLAPSE`, which collapses
memory regions into huge pages unconditionally.
`MADV_DONTNEED` can then be used to break them up.
This scheme represents effectively complete control over huge pages, provided
`khugepaged` doesn't coalesce pages in a way that undoes the `MADV_DONTNEED`
(which can be prevented by, for example, setting `max_ptes_none` to zero).
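A minimal sketch of using `MADV_COLLAPSE` is below. The constant is written out
locally because it requires Linux 6.1+ and may not be defined by the `unix` package
version in use; the call is best-effort and the error must be handled (it can fail
under memory pressure, for instance).

```go
package thp

import "golang.org/x/sys/unix"

// madvCollapse is MADV_COLLAPSE, available on Linux 6.1 and later.
const madvCollapse = 25

// collapseRegion asks the kernel to synchronously back region with huge
// pages, independent of the sysfs THP settings.
func collapseRegion(region []byte) error {
	return unix.Madvise(region, madvCollapse)
}
```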
## Appendix: Page traces
To investigate this issue I built a
[low-overhead](https://perf.golang.org/search?q=upload:20221024.9) [page event
tracer](https://go.dev/cl/444157) and [visualization
utility](https://go.dev/cl/444158) to check assumptions of application and GC
behavior.
Below are a bunch of traces and conclusions from them.
- [Tile38 K-Nearest benchmark](./59960/tile38.png): GC-heavy benchmark.
Note the fluctuation between very low occupancy and very high occupancy.
During a single GC cycle, the page heap gets at least transiently very dense.
This benchmark caused me the most trouble when trying out ideas.
- [Go compiler building a massive package](./59960/compiler.png): Note again the
high density.