# Memory regions
## Background
The [arena experiment](https://github.com/golang/go/issues/51317) adds a package
consisting of a single type to the standard library: `Arena`.
This type allows one to allocate data structures into it directly, and allows
early release of the arena memory in bulk.
In other words, it offers a form of region-based memory management to Go.
The implementation is memory-safe insofar as use-after-frees will never result
in memory corruption, only a potential crash.
In practice, this package has improved real application performance by 10-15%,
almost entirely due to earlier memory reuse and staving off GC execution.
The proposal to add arenas to the standard library is on indefinite hold due to
concerns about API pollution.
Specifically, for the API to be useful in a wide range of circumstances, APIs
have to accept an additional arena parameter.
A good example of this problem is the standard library: many functions in the
standard library allocate heap memory, and could allocate memory into an arena
upon request instead.
But this would require changing so many APIs that it becomes infeasible, and is
not worth a 10-15% cost reduction.
Furthermore, arenas do not compose well with the existing language.
One example is stack allocation.
The Go compiler employs an escape analysis pass to stack-allocate memory when
possible.
This decouples that memory's lifecycle from the garbage collector entirely, and
is even more effective than arenas when it is possible.
Unfortunately, when one chooses to allocate into an arena, that _forces_ that
allocation to come from an arena.
This document proposes a composable replacement for arenas in the form of
user-defined goroutine-local memory regions.
## Goals
First and foremost, our main goal is to reduce resource costs associated with
the GC.
If we can't achieve that, then this proposal is pointless.
The second most important goal is composability.
Specifically:
- No requirement to change APIs to take advantage of the new functionality.
- Composability with standard library features, like `sync.Pool` and
`unique.Handle`.
- Composability with existing optimizations, like stack allocation via escape
analysis.
Finally, whatever we implement must be relatively easy to use and intuitive to
intermediate-to-advanced Go developers.
We must offer tools for discovering where regions might be worth it, and where
they aren't working out.
## Design
The core of this design revolves around a pair of functions that behave like
annotations of function calls.
The annotations indicate whether the user expects most or all the memory
allocated by some function call (and its callees) to stay local to that function
(and its callees), and to be unreachable by the time that function returns.
If these expectations hold, then that memory is eagerly reclaimed when the
function returns.
```go
package region
// Do creates a new scope called a region, and calls f in
// that scope. The scope is destroyed when Do returns.
//
// At the implementation's discretion, memory allocated by f
// and its callees may be implicitly bound to the region.
//
// Memory is automatically unbound from the region when it
// becomes reachable from another region, another goroutine,
// the caller of Do, its caller, or from any other memory not
// bound to this region.
//
// Any memory still bound to the region when it is destroyed is
// eagerly reclaimed by the runtime.
//
// This function exists to reduce resource costs by more
// effectively reusing memory, reducing pressure on the garbage
// collector.
// However, when used incorrectly, it may instead increase
// resource costs, as there is a cost to unbinding memory from
// the region.
// Always experiment and benchmark before committing to
// using a region.
//
// If Do is called within an active region, it creates a new one,
// and the old one is reinstated once f returns.
// However, memory cannot be rebound between regions.
// If memory created by an inner region is referenced by an outer
// region, it is not rebound to the outer region, but rather unbound
// completely.
// Memory created by an outer region and referenced by an inner
// region is not unbound, because the outer region always out-lives
// the inner region.
//
// Regions are local to the goroutines that create them,
// and do not propagate to newly created goroutines.
//
// Panics and calls to [runtime.Goexit] will destroy region
// scopes the same as if f returned, if they unwind past the
// call to Do.
func Do(f func())
// Ignore causes g and its callees to ignore the current
// region on the goroutine.
//
// Calling Ignore when not in an active region has no effect.
//
// The primary use-case for Ignore is to exclude memory that
// is known to out-live a region, to more effectively make use
// of regions. Using Ignore is less expensive than the unbinding
// process described in the documentation for [region.Do].
func Ignore(g func())
```
### Comparison with arenas
Where an arena might be used like:
```go
func myFunc(buf []byte) error {
a := arena.New()
defer a.Free()
data := new(MyBigComplexProto)
if err := proto.UnmarshalOptions{Arena: a}.Unmarshal(buf, data); err != nil {
return err
}
// Do stuff with data. ...
}
```
Regions would be used like:
```go
func myFunc(buf []byte) error {
var topLevelErr error
region.Do(func() {
data := new(MyBigComplexProto)
if err := proto.Unmarshal(buf, data); err != nil {
topLevelErr = err
return
}
// Do stuff with data. ...
})
return topLevelErr
}
```
You can think of a region as an implicit goroutine-local arena that lives for
the duration of some function call.
That goroutine-local arena is then used for allocating _all_ the memory needed
by that function call and its callees (including maps, slices, structs, etc.).
And then thanks to some compiler and runtime magic, if any of that memory would
cause a use-after-free issue, it is automatically removed from the arena and
handed off to the garbage collector instead.
In practice, all uses of arenas within Google tightly limit the arena's
lifetime to that of a particular function, usually the one the arena is created
in, like the example above.
This fact suggests that regions will most likely be usable in most or all of the
same circumstances as arenas.
### Annotated examples
Below are some specific examples of different behaviors described in the API.
#### Basic
This example demonstrates that all major builtins are intended to work with this
functionality, and that leaking a pointer out of the region causes the memory it
points to to be unbound from the region.
```go
var keep *int
region.Do(func() {
w := new(int)
x := new(MyStruct)
y := make([]int, 10)
z := make(map[string]string)
*w = use(x, y, z)
keep = w // w is unbound from the region.
}) // x, y, and z's memory is eagerly cleaned up, w is not.
```
#### Nested regions
This example demonstrates how nested regions interact.
```go
region.Do(func() {
z := new(MyStruct)
var y *MyStruct
region.Do(func() {
x := new(MyStruct)
use(x, z) // z may be freely used within this inner region.
y = x // x is unbound from any region.
})
use(y)
})
```
## Implementation
### Overview
First, `region.Do` sets a field in the current `g` indicating the stack offset
at which the region begins.
If this stack offset field is non-zero, `runtime.mallocgc` uses a special
allocation path.
The non-zero stack offset field also implicitly enables a new kind of write
barrier for the goroutine (and only for that goroutine).
Then, `region.Do` calls the function that was passed to it.
The special allocation path allocates all memory in a special part of the heap.
Memory allocated this way is referred to as "**blue**" (**Bounded-Lifetime
Unshared** memory, **Eagerly** reclaimed, for fellow lovers of tortured
acronyms).
(Subtlety: "blueness" is relative to the goroutine that allocated the memory.)
In the `g`, we keep track of every span that was created as part of the blue
allocation path.
These spans are associated with the last call to `region.Do`.
The new write barrier catches writes of blue memory pointers to non-blue memory
and transitively "**fades**" the object graph rooted at the pointer being
written.
Non-blue memory includes regular heap memory, globals, other goroutine stacks,
any part of the stack above the latest `region.Do`, and memory in other regions
(only nested regions on the same goroutine matter here, since all
cross-goroutine writes must pass through already-faded memory).
Faded memory is managed by the GC in the traditional tri-color mark scheme
(hence "fading" to grayscale).
Once the function passed to `region.Do` returns, `region.Do` immediately sweeps
all the spans associated with it, releasing any blue memory and placing the
spans back on a free list.
We go into more detail in the following sections.
### New g flag
Let's first define the behavior and invariants of a new field on the `g`,
`region`.
This value is not inherited when new goroutines are created, and it is not
inherited by `g0`.
This flag is only ever mutated and accessed by a goroutine, for itself.
### Blue memory management
If `runtime.mallocgc` sees that the current goroutine's `region` value is
non-zero, it goes down a special allocation path for blue memory.
This special allocation path allocates all memory in a part of the address
space that is separate from the general heap (it may, but ideally won't,
interleave with it).
#### Address space
Region memory is allocated from dedicated `heapArenas` (not to be confused with
arenas, these are 64 MiB aligned reservations on most 64-bit platforms, or 4 MiB
on 32-bit platforms and Windows).
When these are created, we provide hints to `mmap` and the like to be contiguous
and in a distinct part of the address space from the heap, though it's OK if
these special region `heapArenas` interleave with regular `heapArenas`.
These special `heapArenas` are distinguished by a compact bitmap
(`mheap_.isRegionArena`) covering the whole address space, with each bit
representing one `heapArena`.
This bitmap is incredibly compact: a 48-bit address space divided into 64 MiB
chunks results in a 512 KiB bitmap.
We can reserve the full 512 KiB up-front and map in parts as-needed.
64 bytes of this bitmap (or one cache line on most systems) is sufficient to
cover a contiguous 32 GiB of region memory.
#### Memory layout
All blue memory is managed in fixed-size blocks with size 8 KiB.
If any allocation exceeds 2 KiB in size, it is not allocated blue, and is
instead redirected to the general heap.
These blocks are just special kinds of spans, and are backed with an `mspan`
that is managed like all other spans today.
The layout of each block follows the Immix design.
That is, each block is divided into 128-byte lines and groups of contiguous
lines are bump-allocated into.
Every allocation made into one of these regions has an 8-byte object header,
including objects without pointers and small objects (<=8 bytes in size).
(Insight: if objects are short-lived and we're going to reuse their memory
quickly, then per-object overheads don't matter as much.)
Implementation note: we can adjust the block size up to accommodate larger
objects.
We can also potentially allow much larger objects to participate in regions by
creating an additional `pageAlloc` structure for region-allocated memory.
This document assumes the greatest impact will be from smaller objects (<2 KiB),
which account for most of the allocation volume in most programs.
However, should that assumption fail, we can generalize at the cost of some
additional complexity.
#### Allocation
Just like Immix, objects are bump-pointer-allocated into contiguous free lines.
Also like Immix, each goroutine carries two local blocks to allocate from.
One block may be non-empty (that is, have some non-free lines) and the other
will always start out completely empty.
The main block is always attempted first, but medium-sized objects that fail to
allocate into the main block will be allocated into the secondary block.
Blocks are usually allocated from a P-local free list, and failing that, a
global free list.
If a full GC has run recently, each goroutine will first try to sweep its own
already-active blocks.
### Garbage collection
Each blue block reserves the first two lines for a pair of bit-per-word 128-byte
bitmaps.
One indicates the start of each object, while the other indicates which words of
the block have faded (tracked by the [write barrier](#new-write-barrier)).
The bit corresponding to an object's allocation header is set upon allocation.
Lines also each get mark and alloc bits, but since they're small (64 bits each)
they'll go right in the `mspan` for the block.
Each object's allocation header reserves the two least significant bits (which
will always be zero otherwise) for mark bits.
Each GC cycle, the two bits swap purposes.
In GC cycle N, bit N mod 2 is the mark bit for that cycle.
When an object is marked, bit (N+1) mod 2 is simultaneously cleared.
The GC will scan blocks using the object start bitmap to find the start of the
object, and thus the object header.
It marks and scans all reachable objects in blocks, even those objects that are
still blue.
It also marks any lines containing marked objects.
When the GC marks an object whose fade bits are not set, it leaves the fade bits
alone.
#### Sweeping
There are two ways in which objects in region blocks may have their memory made
available for reuse.
The first is based on the mark bits, which we'll call "full sweeping" since it's
very close in semantics to the existing sweeper.
The second is based on the fade bits, which we'll call "eager sweeping."
Full sweeping of all blocks is performed each GC cycle.
Lines that have not been marked are made available for allocation by installing
the line mark bits as the line allocation bits.
Note that freeing both blue and non-blue objects is completely fine here: the
mark bits prove whether or not the object is reachable.
This first kind of sweeping is the one referenced in the
[allocation](#allocation) path.
Eager sweeping of region blocks is performed when the function passed to
`region.Do` returns.
Lines that do not overlap with any faded words, as indicated by the bitmap, are
made available for allocation.
Note that we may free blue objects that were marked.
This is mostly fine: if the object never faded, we can be certain that the
object is dead when `f` returns.
The only hazard is if the GC queues a pointer to blue memory that is later
freed.
If the GC goes to inspect the object later, it might be gone.
Even worse, there might not be an object starting at the same address anymore!
There are many ways to deal with this problem, but here are two.
1. Delay eager sweeping until after the mark phase completes.
This is unfortunate, but easy to implement and safe.
1. Keep a generation counter on each blue block.
If the GC enqueues a blue pointer, it also queues the generation counter from
the block it lives in.
Later, it checks the generation counter when looking up the block for the
object again and if it doesn't match, it skips doing any additional work.
This complicates scanning, but would allow for eager reclamation during the
mark phase.
### New write barrier
A new write barrier tracks when pointers to objects are written outside the call
tree rooted at `f` and the blocks associated with `f`'s region.
In other words, it tracks writes of blue pointers to non-blue memory.
This write barrier is enabled only if the `region` field on the `g` is non-zero.
The write barrier executes on each pointer write that is not to the current
stack frame (same as the existing write barrier).
If the write barrier discovers a blue pointer being written to non-blue memory,
the fade bits for the words corresponding to the object's memory are set.
Then, the object's referents are transitively marked as faded.
Below is a sketch of the write barrier's fast path.
```go
func regionWriteBarrier(ptr, dst unsafe.Pointer) {
region := getg().region
if region == 0 {
return
}
if ptr == nil {
return
}
if uintptr(dst) >= getg().stack.lo && uintptr(dst) < getg().stack.lo+region {
// The pointer is being written into the stack within the region.
return
}
ptrBlock := findRegionBlock(ptr)
if ptrBlock == nil {
// Pointer doesn't belong to a region.
return
}
dstBlock := findRegionBlock(dst)
// If they belong to different regions and ptr's region does not
// out-live dst's, we must fade ptr.
if dstBlock != nil && ptrBlock.region > dstBlock.region {
// Otherwise, for within-region writes, only fade on writes of blue
// to faded memory.
if !ptrBlock.isBlue(ptr) || dstBlock.isBlue(dst) {
return
}
}
fade(ptr)
}
func findRegionBlock(ptr unsafe.Pointer) *regionBlock {
// Check if the pointer lies inside of a region arena.
arenaIdx := uintptr(ptr)/heapArenaBytes
if mheap_.isRegionArena[arenaIdx/8]&(uint8(1)<<(arenaIdx%8)) == 0 {
return nil
}
// Find the base of the block, where the fade bitmap, among other things, lives.
base := uintptr(ptr) &^ (8192-1)
return (*regionBlock)(unsafe.Pointer(base))
}
type regionBlock struct {
faded [128]byte
objStart [128]byte
region uintptr
lineFade uint64
}
func (b *regionBlock) isBlue(ptr unsafe.Pointer) bool {
// Find the word index that ptr corresponds to in the block.
word := (uintptr(ptr)-uintptr(unsafe.Pointer(b)))/8
// Load, mask, and check the bit corresponding to the word.
return b.faded[word/8] & (1<<(word%8)) == 0
}
```
`fade` performs the transitive fade bitmap marking.
Note that objects must be _immediately_ transitively faded (that is, we cannot
defer the operation), because then it would be possible to hide pointers to
those objects in another goroutine's stack, which is not protected by a write
barrier (and shouldn't be; it's too expensive).
### Comparison with the arenas
#### Implicit allocation instead of explicit allocation
Arenas require explicitly choosing what gets allocated into an arena.
On the one hand, this offers fine-grained control over memory, but on the other
hand, does not compose well with the language.
For example, it requires updating APIs to make the best use of the arena.
Standard library APIs that do not accept an arena would need to have a new
version to allocate memory into the arena.
One could imagine making arenas implicit and goroutine-bound.
But the issue with implicitly allocating memory into an arena is that it's far
less clear when it's safe to free the arena since some values may have been
allocated into the arena that code authors don't expect.
In contrast, the regions design implicitly allocates **all** eligible heap
memory into a kind of memory region, but avoids the issue of not knowing when to
free memory.
In essence, the new write barrier effectively pins memory that is not safe to
free, meaning the rest can be freed once the region ends.
The regions design also offers an opt-out mechanism (`region.Ignore`), restoring
fine-grained control over memory allocation where needed.
Furthermore, explicit allocation introduces some complications with existing
optimizations.
For example, the compiler's escape analysis may be able to deduce that some
value may be stack allocated, but this analysis immediately fails if the value
is allocated into an arena.
Special compiler support could make arena-allocated values candidates for stack
allocation, but that's likely to be complex.
#### Finer granularity of memory reuse
One property of the arena experiment is that arena chunks are very large (8 MiB)
to amortize the cost of syscalls to force faults on the memory.
This means that, naively, making good use of arenas would require sharing them
across multiple logical operations (for example, multiple HTTP requests in a
server).
In practice, the arena experiment works around this by only reusing
partially-filled arena chunks, filling them up further.
However, these partially-filled chunks still contribute to the live heap, even
if most of the memory inside of them is dead.
Meanwhile, region blocks are _three orders of magnitude_ smaller than arena
chunks.
This means that there's really no need to try to reuse partially-filled region
blocks.
As a result, the baseline live heap contribution of `region.Do` should be
substantially smaller.
#### Accidentally retaining unreachable memory is less of a hazard
Failing to free an arena, or allocating too much into an arena, may result in
running out of memory, since the lifetimes of all allocations in the arena are
truly bound together.
Even if large amounts of arena memory are no longer reachable, the GC is unable
to free them because arenas have no mechanism for reclamation except on the
order of arena chunks.
The region design meanwhile includes region blocks in the regular GC cycle.
This means that if regions are specified in an overly-broad way (regions that
are too big or too long-lived), there's no more potential for running out of
memory than usual.
Plus, blue memory in the region is still immediately reclaimable and reusable,
even if it survives multiple GC cycles.
#### Write barrier
One downside of regions over arenas is the write barrier necessary to prevent
use-after-free issues.
This design rests on the write barrier being cheap enough not to overwhelm the
savings from reduced GC pressure.
The good news is that I believe this design achieves this goal.
See the [preliminary evaluation](#preliminary-evaluation).
To summarize, an upper bound on the expected CPU overhead of the write barrier
is 1.1%–4.5%.
For applications that apply a substantial amount of GC pressure (resulting in
upwards of 10% GC CPU costs), this overhead should be readily recouped.
This analysis also assumes that all goroutines are in a region all of the time,
so it is truly a worst case.
I suspect this overhead is similar to the overhead of remapping large regions of
memory for arenas, so it can be thought of as a safety feature that has a cost.
## Preliminary evaluation
### Hypotheses
- In specific circumstances, a large fraction of allocations remain
goroutine-local and their lifetimes are bound to some call tree (in the limit,
the lifetime of a goroutine).
- The CPU benefit of faster allocation, less frequent global GCs, and the cache
benefit of fast memory reuse greatly outweigh the cumulative costs of the write
barrier test and transitive fades.
- Performance-sensitive users will be able to identify good candidates for
regions that don't result in frequent transitive fades, possibly with some light
tooling (e.g. profiling).
- Immix-style allocation is likely to slightly increase the live heap size on
average due to increased risk of fragmentation.
Allocation headers are likely to increase heap memory overheads for small
objects.
However, peak memory usage may be lower if less memory is likely to be live at
any given time due to the eager reuse.
Less time spent during GC cycles also means less memory allocated black, which
may shrink the live heap as well.
- Bump pointer allocation is likely to reduce median latency.
Reduced GC load and thus less frequent GCs are likely to decrease tail latency.
The effect of fading will depend on the application: if fading is frequent, it's
likely to increase median latency and tail latency; if it's rare, it's likely to
increase tail latency.
### Strategy
Because this technique involves new API, there are unknowns that are not
possible to quantify or estimate directly, since they depend on how the API is
adopted and used.
In the analysis below, we define these unknowns and leave them unknown, allowing
us to plug in some reasonable numbers and make some reasonable inferences for
the best, worst, and middling cases.
#### CPU effect
- _AB_ = Allocated bytes
- _AO_ = Allocated objects
- _NGC_ = Number of GC cycles
- _BR_ = Fraction of bytes allocated in any region
- _OR_ = Fraction of objects allocated in any region
- _BF_ = Fraction of bytes allocated in any region that fade
- _OF_ = Fraction of objects allocated in any region that fade
- _BS_ = Fraction of bytes allocated in any region that do not fade, but are scanned (_BF_ + _BS_ ≤ 1)
- _CR_ = Relative cost of mark/sweep of region objects (expected >1)
- _PF_ = Pointer density (per byte) of faded objects
| ΔCPU time = | |
| :---- | :---- |
| immixAllocCPU(_OR_ × _AO_, _BR_ × _AB_) | Objects in a region are bump-pointer allocated |
| + heapAllocCPU((1 − _OR_) × _AO_, (1 − _BR_) × _AB_) | Cost of allocating non-region-managed objects |
| − heapAllocCPU(_AO_, _AB_) | Gain over heap-allocating all objects |
| + total GC time × (1 − _BR_) | Cost of mark/sweep of non-region objects |
| + total GC time × _BR_ × (_BF_ + _BS_) × _CR_ | Cost of mark/sweep of faded region objects |
| − total GC time | Gain over mark/sweeping all objects |
| + wbTestTime × # of pointer writes \[ × _OR_ \] | Cost of write barrier test (upper bound) \[[scaled](#minor-model-deviation)\] |
| + fadeCPU(_AO_ × _OR_ × _OF_, _PF_ × _AB_ × _BR_ × _BF_) | Cost of fading objects (write barrier slow path) |
| + regionCleanupCPU(_AB_ × _BR_) | Cost of eagerly sweeping a region when it ends |
| − TODO Cache benefit of prompt reuse | |
_BS_ is defined a little strangely, but can be considered as a proxy for average
region lifetime.
If all regions survive for at least a full GC cycle (all region memory is
visible to the GC for scanning) then _BS_ is 1.
If all regions begin and end outside of a GC cycle (not realistic), then _BS_ is
0.
Good uses of regions will have a low non-zero _BS_ parameter.
#### RAM effect
For now, we avoid building a model for the RAM effect.
The competing factors are substantially harder to measure, but are unlikely to
be deal-breakers.
For instance, the Immix-style memory layout has a worse worst-case fragmentation
than our existing heap layout, but the average case is only a few percent
different: previous experiments (sorry, no results, [only the simulation
tool](https://github.com/mknyszek/goat)) on real application data showed up to
10% when not optimizing allocation headers for tiny objects, and ~3.5% after
optimizing allocation headers for tiny objects.
However, the effects of eager reuse on the live heap size can also be
substantial, as evidenced by performance improvements found with arenas.
The theory behind this memory improvement is that less live memory is visible to
the GC at any given time, especially via the allocate-black policy.
And lastly, because of `GOGC`, CPU benefits may generally be translated into
memory benefits as needed.
#### Latency effect
For now, we also avoid building a model for the latency effect.
Latency effects are even more contextually sensitive and harder to measure than
RAM effects.
Furthermore, in this design, the cost depends heavily on how deeply chained
faded memory turns out to be.
While I suspect that in practice this won't be much of an issue, especially
with good [diagnostics](#diagnostics), it's worth keeping an eye on latency
metrics as the design and its evaluation progresses.
#### Experiments
To estimate the CPU effect, the following experiments will be performed:
1. Prototype the write barrier test in the compiler and compute a cost per
pointer write.
1. Prototype the fade path and empirically derive a cost function.
1. Prototype the Immix bump-pointer allocator and empirically derive a cost
function.
1. Prototype the region cleanup path and empirically derive a cost function.
These experiments will fill in quite a few of the gaps mentioned above, but
still leave the following parameters for CPU cost undefined: _BR, OR, BF, OF,
BS, CR, PF_.
_BR, OR, BF, OF_, and _BS_ all depend on how users make use of the API.
Ideally, users will be able to choose regions with a very low chance of objects
escaping, and we can just pick numbers that represent this best-case scenario.
In such a case, the win should be large and obvious.
For example:
- _BR_ > 0.5
- _OR_ > 0.9 (targets small objects)
- _BF_ ≈ 0.0
- _OF_ ≈ 0.0
- _BS_ < 0.1
- _CR_ < 1.05
- _PF_ can be anything, but at most 0.125 on 64-bit platforms.
But there still remains the question of whether users will actually be able to
do that.
We can get a reasonable idea of that by estimating the sensitivity that each
parameter has on overall CPU cost in different parts of the space.
Lastly, there's _CR_.
_CR_ is defined the way it is because algorithmically, what we propose here is
not so far off from the existing GC scan loop.
It's reasonable to expect that the scan loop cost will only be a constant factor
off at most (that is, _CR_).
Furthermore, actually estimating _CR_ is quite tricky, because it's difficult to
pull a scan loop out of context.
I propose just treating _CR_ as a parameter and applying the same sensitivity
analysis to it.
### Results
#### CPU effect experiments
##### Benchmarks
I collected data from 6 benchmarks in the Sweet benchmark suite, since
statistics from real applications were needed:
* etcd Put
* etcd STM
* Tile38
* CockroachDB kv0 (single node)
* CockroachDB kv50 (single node)
* CockroachDB kv95 (single node)
The CockroachDB benchmarks were run with `GOGC=300` (the default) and `GOGC=100`.
It turns out that at `GOGC=300`, the benchmark has a total GC overhead of about
3%, for which regions make no sense.
However, I adjusted `GOGC` down to see if the possible CPU savings from regions
could translate into memory savings instead: less memory used for the same GC
overhead.
##### Write barrier test time
I estimated the write barrier test time by implementing the worst-case [write
barrier fast
path](https://github.com/mknyszek/region-eval/blob/main/cpusim/esc.go#L129)
(region -> region write, two lookups required) in Go and [benchmarking
it](https://github.com/mknyszek/region-eval/blob/main/cpusim/esc_test.go#L205).
My estimate is ~5.2 ns per pointer write on average.
`wbTestTime ≈ 5.2 ns`
This result is from the fast path's worst case: a region-to-region write
(requiring metadata inspections for both `ptr` and `dst`) with poor spatial and
temporal locality (`ptr` and `dst` are shuffled across 1 GiB of contiguous
region memory).
The fast path is also implemented in Go, and can likely be further optimized
with an assembly-based implementation.
Using a non-nil pointer write frequency we can estimate an upper-bound on the
cost of the write barrier in terms of a % overhead.
I measured the non-nil heap pointer write frequency for 3 benchmarks.
First, I [replaced cgocheck2
mode](https://go-review.googlesource.com/c/go/+/595776) with a non-nil pointer
write counting mode (that also counted pointers in typed memory copies and the
like).
After measuring the pointer write counts for the benchmark, I re-ran the
benchmarks with pointer-counting turned off, and estimated the total CPU time
available to the benchmark.
Next, I divided the CPU time by the pointer write count for the benchmark to
obtain an average pointer write rate.
Lastly, I divided the conservative cost of the new write barrier by this rate to
determine an overhead.
This process produced the following results:
* etcd STM
* Mean CPU time between pointer writes: 119 cpu-ns
* Write barrier overhead: 4.37%
* etcd Put
* Mean CPU time between pointer writes: 491 cpu-ns
* Write barrier overhead: 1.06%
* Tile38
* Mean CPU time between pointer writes: 265 cpu-ns
* Write barrier overhead: 1.96%
* CockroachDB kv0
* Mean CPU time between pointer writes: 116 cpu-ns
* Write barrier overhead: 4.45%
* CockroachDB kv50
* Mean CPU time between pointer writes: 120 cpu-ns
* Write barrier overhead: 4.30%
* CockroachDB kv95
* Mean CPU time between pointer writes: 117 cpu-ns
* Write barrier overhead: 4.43%
I conclude that the expected CPU cost of the write barrier will be between
1.1–4.5% _as an upper-bound_, since this value is computed assuming the write
barrier is enabled globally.
##### `fadeCPU(o, p)`
I determined this cost function by creating [a simple out-of-tree
prototype](https://github.com/mknyszek/region-eval/blob/main/cpusim/esc_test.go#L20)
of the fade slow path.
The fade slow path has a per-object cost approximately twice that of a regular
heap allocation, plus an additional marginal cost _per pointer_ in the object.
The fade slow path prototype is pessimal in that the benchmark actively tries to
induce cache misses by randomizing fades, so this is an upper bound.
(The dominant per-object cost is cache misses.) In practice, I expect slightly
better locality, assuming that objects which are allocated together tend to
point to one another.
```
fadeCPU(o, p) = 40ns * o + 3.37ns * p
```
##### `immixAllocCPU(o, b)`
I determined this cost function by creating [a simple out-of-tree
prototype](https://github.com/mknyszek/region-eval/blob/main/cpusim/alloc_test.go#L22)
of the allocation path, including a simulated cost of writing out GC metadata
(object start bits, allocation header).
The backing store of each block was allocated from the heap and included in the
measurement, so this number also covers the cost of the existing allocation slow
path.
Like heapAllocCPU, this cost function was derived under ideal circumstances
(allocating into a fresh block) with any GC costs excluded, so as to be as
comparable as possible.
```
immixAllocCPU(o, b) = 8ns * o + 0.15ns * b
```
Note: I do not yet have a good explanation for why the per-byte term looks so
much higher than for heapAllocCPU.
##### `regionCleanupCPU(o, b)`
I determined this cost function by adding [a simple out-of-tree
prototype](https://github.com/mknyszek/region-eval/blob/main/cpusim/alloc_test.go#L22)
of the region cleanup path to the immixAllocCPU experiment and re-running the
experiment.
The path just consists of installing the line mark bitmap (64 bits) into the
line allocation bitmap (64 bits), resetting the cursor/limit in the bump-pointer
allocator, and clearing the object-start bitmap for non-escaping lines.
We found that, in practice, the cost of the reset was negligible, well within
benchmarking noise.
(The benchmarks assume a worst case: clearing all the object-start bits.)
For the purposes of this analysis, we conclude that:
```
regionCleanupCPU(o, b) ≈ 0
```
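For convenience when plugging into the model, the three fitted cost functions above can be transcribed directly into code. The constants are copied from the fits; the workload in `main` is purely illustrative:

```go
package main

import "fmt"

// fadeCPU: per-object plus per-pointer cost of fading, in nanoseconds.
func fadeCPU(objects, pointers float64) float64 {
	return 40*objects + 3.37*pointers
}

// immixAllocCPU: per-object plus per-byte cost of region allocation.
func immixAllocCPU(objects, bytes float64) float64 {
	return 8*objects + 0.15*bytes
}

// regionCleanupCPU: measured as negligible, within benchmarking noise.
func regionCleanupCPU(objects, bytes float64) float64 {
	return 0
}

func main() {
	// Illustrative workload: one million 64-byte region allocations,
	// 5% of which fade with 4 pointers each.
	const allocs, avgBytes, fadeFrac, ptrsPerFade = 1e6, 64, 0.05, 4
	allocNS := immixAllocCPU(allocs, allocs*avgBytes)
	fadeNS := fadeCPU(allocs*fadeFrac, allocs*fadeFrac*ptrsPerFade)
	cleanupNS := regionCleanupCPU(allocs, allocs*avgBytes)
	fmt.Printf("alloc: %.1fms fade: %.1fms cleanup: %.1fms\n",
		allocNS/1e6, fadeNS/1e6, cleanupNS/1e6)
}
```

Even in this illustrative workload, the per-byte allocation term dominates, which is why the note above about the high per-byte constant matters.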
#### CPU effect analysis
To perform the sensitivity analysis, I collected data from several benchmarks to
fill into the model, namely:
- Total GC CPU time
- Number of non-nil pointer writes (with a write barrier)
- Total objects allocated (_AO_)
- Total bytes allocated (_AB_)
The actual sensitivity analysis is a one-at-a-time analysis.
First, we determine a reasonable range for each variable of interest.
Then, we fix all other variables in the model.
Finally, we adjust each variable of interest one at a time to learn how it
changes the overall result.
(This is similar to computing a partial derivative of the CPU model with respect
to some variable, but the equation of a partial derivative is far less intuitive
than a visual produced by moving one variable at a time.)
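The one-at-a-time procedure is mechanical enough to sketch generically. The stand-in `model` below is not the real CPU model (which is defined earlier in this document); only the sweep logic is the point here:

```go
package main

import "fmt"

// oat performs a one-at-a-time sweep: each parameter of interest is varied
// across its range while all others stay at their baseline values, and the
// model output is recorded at each step.
func oat(model func(map[string]float64) float64, baseline map[string]float64,
	ranges map[string][2]float64, steps int) map[string][]float64 {
	out := make(map[string][]float64)
	for name, r := range ranges {
		// Copy the baseline so each sweep starts from the same point.
		params := make(map[string]float64, len(baseline))
		for k, v := range baseline {
			params[k] = v
		}
		for i := 0; i <= steps; i++ {
			params[name] = r[0] + (r[1]-r[0])*float64(i)/float64(steps)
			out[name] = append(out[name], model(params))
		}
	}
	return out
}

func main() {
	// Stand-in for the CPU model defined earlier in this document.
	model := func(p map[string]float64) float64 { return 2*p["OR"] + 10*p["BF"] }
	baseline := map[string]float64{"OR": 0.9, "BF": 0.05}
	ranges := map[string][2]float64{"OR": {0, 1}, "BF": {0, 0.5}}
	fmt.Println(oat(model, baseline, ranges, 4)["OR"]) // BF held at 0.05
}
```

Each slice of outputs then becomes one curve in the plots below, with the swept parameter on the x-axis.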
##### Minor model deviation
We make a slight modification to the model for the analysis below: we scale the
write barrier cost by _OR_ to reflect that the write barrier is only enabled
where regions are used.
This is not a perfect proxy for the write barrier overhead, but it at least
tracks how widely regions are used.
This is shown in the model in square brackets.
##### Baseline costs
The figure below shows the measured baseline GC costs for each benchmark, to
give a sense of what the best possible improvement is.
Note: The allocation CPU overheads are estimated using heapAllocCPU from the
number of objects and bytes allocated, not the actual sampled CPU time.
This is to make the output of the model a little more internally consistent.
![Baseline GC overheads for each benchmark.](70257/gc-overheads.png)
##### Overall picture
Just to get a sense of the possible wins and the breakdown, below is the output
of the model for various benchmarks with the following parameters.
These parameters describe a "pretty good use" scenario.
- _BR_ = 0.6 (Targets small allocations, so let's make _BR_ < _OR_.)
- _OR_ = 0.9 (Most objects end up in regions.)
- _BF_ = 0.05
- _OF_ = 0.05 (Few objects fade.)
- _BS_ = 0.01 (Regions are short-lived.)
- _CR_ = 1.05 (Regions are more expensive to scan.)
- _PF_ = 0.062 (50% of faded object bytes are pointers)
For the sensitivity analysis, we will be using these parameters as a baseline.
Outside of this document, I also performed other experiments, including the best
possible and worst possible outcomes, though they're not that much more
interesting, and this scenario is sufficient to paint a good picture of the
possible wins.
![Performance delta for each benchmark with regions in a "pretty good use"
scenario.](70257/pretty-good-use-scenario.png)
One thing to note is that there's no win for CockroachDB at `GOGC=300`, but
there is one for `GOGC=100`.
We don't quite get back to the baseline overhead at `GOGC=100` in this scenario,
but adjusting `GOGC` up to find the sweet spot means the potential for
substantial _memory savings_.
Note that the best-case scenario is reducing the overheads above to zero, which
is at most ~10% in these particular benchmarks.
Thus, it's helpful to consider the _proportion_ of GC overhead eliminated
relative to that 10% (so a 7% reduction means a 70% GC overhead reduction).
##### Sensitivity analysis
Below we plot the effect of changes in the model for all parameters except _CR_
and _PF_, to which the model was not very sensitive within their reasonable
ranges ([1.00, 1.10] and [0.000, 0.625] respectively).
Below each plot are some observations.
![Plot of sensitivity to BR and OR for each
benchmark.](70257/sensitivity-br-or.png)
For the most part, the more regions are used (and used well), the better things
get.
∆CPU%, expressed as a proportion of GC overhead, suggests a striking
improvement: **a best-case 75% reduction in GC overheads**.
![Plot of sensitivity to BF and OF for each
benchmark.](70257/sensitivity-bf-of.png)
The assumptions behind regions are important: for regions to be effective, most
objects must die with the region.
However, note the crossover points on the above graph: the break-even point is
fairly reasonable at >25% faded for CockroachDB, and ~50% for Tile38.
It depends on the application, but there's plenty of wiggle-room, which is
motivating.
![Plot of sensitivity to BS for each benchmark.](70257/sensitivity-bs.png)
Region lifetimes are a bit more forgiving than fades: the crossover point sits
consistently further out across benchmarks.
## Diagnostics
Because this feature is opt-in, we need diagnostics to (1) help users understand
where in their application this feature is applicable and (2) help users
understand when this feature is not working in their favor and why.
Below are six such proposed diagnostic tools.
### Heap profile with lifetime information
To identify when memory regions may be applicable, I propose that we provide a
new kind of profile containing the size and lifetime (defined as the number of
GC cycles survived) of heap-allocated memory.
This profile would contain sufficient information to identify where short-lived
memory is allocated in high numbers, and where memory regions might be
appropriate.
This information could be collected with very low overhead.
In short, we could attach the GC cycle number in which an allocation was made to
a heap profiling `special`.
Upon deallocation, we simply record into the profile the difference between the
current GC cycle number and the cycle number recorded at allocation.
Note that such a profile would likely be larger than a typical heap profile,
because one would want to group stacks by the number of GC cycles survived by
the allocations made there.
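A minimal sketch of that bookkeeping, with the runtime's heap profiling special reduced to a plain struct; all type and field names here are illustrative, not the runtime's actual ones:

```go
package main

import "fmt"

// special is a stand-in for the heap profiling special attached to a
// sampled allocation; the only addition is the GC cycle at allocation time.
type special struct {
	stackID    uint64 // allocation site
	size       uintptr
	allocCycle uint64 // GC cycle number when allocated
}

// key groups sampled frees by (allocation site, GC cycles survived).
type key struct {
	stackID  uint64
	survived uint64
}

type lifetimeProfile struct {
	currentCycle uint64
	buckets      map[key]uintptr // total bytes per (site, lifetime) bucket
}

// free records the lifetime of a sampled allocation at deallocation time.
func (p *lifetimeProfile) free(s special) {
	if p.buckets == nil {
		p.buckets = make(map[key]uintptr)
	}
	p.buckets[key{s.stackID, p.currentCycle - s.allocCycle}] += s.size
}

func main() {
	var p lifetimeProfile
	p.currentCycle = 3
	p.free(special{stackID: 1, size: 64, allocCycle: 3}) // died within a cycle
	p.free(special{stackID: 1, size: 32, allocCycle: 1}) // survived 2 cycles
	fmt.Println(p.buckets[key{1, 0}], p.buckets[key{1, 2}]) // 64 32
}
```

Sites whose bytes concentrate in the zero-cycles-survived bucket are the natural candidates for regions.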
### Allocation tracing
With a little bit of extra tooling, it would be possible to give very precise
suggestions on where memory regions might be applicable (that is, where memory
is allocated together).
Allocation tracing is likely overkill for most purposes and has a much higher
overhead, but it already exists in the Go tree.
### Fade metrics
To quickly determine whether usage of memory regions is still worthwhile, we can
provide metrics on the number of objects allocated in memory regions and the
number of objects that are faded.
This ratio (let's call it the "fade ratio") should be as close to zero as
possible.
If the ratio is, for example, over 5%, users may want to consider experimenting
with removing memory regions.
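A sketch of the resulting heuristic, assuming the two counts come from hypothetical region metrics (the 5% threshold is the rule of thumb above):

```go
package main

import "fmt"

// fadeRatio computes the fraction of region-allocated objects that faded.
// Both counts would come from hypothetical runtime metrics for regions.
func fadeRatio(faded, allocated uint64) float64 {
	if allocated == 0 {
		return 0
	}
	return float64(faded) / float64(allocated)
}

func main() {
	r := fadeRatio(7_000, 100_000)
	// Per the rule of thumb above, a ratio over 5% suggests reconsidering
	// the use of regions at the offending sites.
	fmt.Printf("fade ratio: %.1f%% (reconsider regions: %v)\n", r*100, r > 0.05)
}
```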
### Fade profile
For digging deeper into why certain blue allocations fade, we can provide a
profile of where objects frequently escape from memory regions.
In essence, this profile would consist of stack traces taken when an object is
about to fade.
### Region skip profile
The implementation may choose to skip allocating into a region.
In the design presented in this document, this is primarily for large
allocations.
We could provide a profile for when this happens.
### Region lifetime profile
For digging deeper into whether regions are actually surviving less than a full
GC cycle, we can provide a profile containing lifetime information (in terms of
GC cycles survived) for entire memory regions.
If regions frequently survive a GC cycle (more than, for example, 0.5% of them),
then they are likely too broadly defined.
## Alternatives considered
### No API
A valid question is why this functionality should be an API instead of
automatically inferred or always enabled.
There are basically four alternative approaches in this category:
statically-determined regions, profile-guided regions, making every goroutine a
region, and dynamically inferring which goroutines would make good regions.
#### Statically-determined regions
Of the four options, this seems the least likely to bear fruit.
This has been an extensively researched topic in the literature, but it has
unfortunately never come to a good conclusion.
I don't think it ever will, given that static analysis can't do much about
unbounded loops.
#### Profile-guided regions
This direction seems promising, but identifying what information is most useful
in identifying regions may end up being somewhat of a research topic.
My hunch is that doing this right will require precise lifetime information,
since we have to make choices that are going to be good for the lifetime of the
program.
My primary concern is that "short-lived" means "survives less than one GC cycle"
and the length of a GC cycle is fundamentally dynamic in nature.
Changing `GOGC` or `GOMEMLIMIT`, or even just a change in the application that
alters the live heap size, can affect what ends up getting inferred.
More precise lifetime information (perhaps coupled with GC performance data)
would allow us to make safer inferences.
However, I suspect precise lifetime information will be very expensive to
collect from production (much more expensive than, say, a CPU profile).
#### Using profiles to disable regions
Credit to Cherry Mui for this one.
This is an interesting approach to help monitor and catch bad uses of regions.
If you already have PGO in your production loop, the compiler could tell you
from a profile when it decided to disable a region, indicating that there may be
a regression, or that the region can be removed from the codebase.
#### Every goroutine is a region
We could choose to implicitly make every goroutine a region.
This would basically turn the technique into a different kind of
request-oriented collection.
The risk is that the assumptions of the technique, that the vast majority of
memory is likely to be short-lived, may not hold.
This approach is essentially the same as the [request-oriented
collector](#request-oriented-collector).
Because it would be fairly easy to implement, this may be worth adding as a
`GOEXPERIMENT` for quickly trying out regions on an application.
#### Dynamic inference for goroutines
We could also choose to make _some_ goroutines regions, and infer which
goroutines are good regions at run-time.
This would require measuring lifetime information at run-time, but notably,
would not require that information to be very precise, since regions can be
re-inferred at run-time.
What this would entail is measuring which goroutines tend to be short-lived
(that is, die within one GC cycle) and tend to allocate memory scoped to their
lifetime.
### Allowing regions to cover multiple goroutines
Some applications, like `gopls`, have a memory use-pattern that maps well to a
fork-join pattern: a goroutine is created, and all memory created by it and its
child goroutines (who do not outlive the parent) is dropped on the floor by the
time the parent goroutine joins all the children.
This proposal's semantics are not good enough for this case, because memory
_will_ cross goroutine boundaries.
Extending this proposal to support such a use-case is hard.
There are two sources of issues: API issues, and implementation issues.
The API issues mostly revolve around bounding goroutine lifetimes.
As it stands, Go has no good way to block until a goroutine is truly exited.
Regions would either have to block until all goroutines within them have exited,
creating a new synchronization mechanism, or panic when their invariants are
violated.
Both of these options _change the semantics of the program_, which is a fairly
large negative for something that is ultimately an optimization.
The other issue is in the implementation.
I do not currently see a way around introducing a new, more expensive global
write barrier to enable the same semantics.
There are also new synchronization costs from multiple goroutines interacting
with the same region metadata simultaneously.
Lastly, there are implementation and maintenance costs, as this implementation
will almost certainly be more complicated.
This makes the cost/benefit estimate seem much more concerning.
## Prior Art
### Request-oriented collector
[Design document.](https://golang.org/s/gctoc)
This was a non-moving collector design that marked all newly allocated objects
as goroutine-local.
Once a goroutine exited, all goroutine-local objects reachable from that
goroutine were immediately freed.
Any heap objects that escaped the goroutine were caught by a write barrier.
Once an object had been determined to escape, that object and its transitive
referents were marked as part of the global heap.
This idea worked very well for some applications, but poorly for others.
The biggest cost was the write barrier.
Applications that performed poorly typically created large heap subgraphs
locally and then published them to other goroutines.
The fact that the write barrier needed to be exhaustive, and needed to run
immediately (since the objects were just about to be published), hurt latency as
well.
Furthermore, the write barrier was always on and its publication test was
relatively expensive, requiring several metadata lookups on every pointer write.
This technique is very similar to what's described in this document, if one were
to extend all regions to the lifetime of each goroutine and always enable
regions everywhere.
There are some substantial differences, however:
- Regions are opt-in, so only the parts of the application that benefit make use
of the functionality.
- Regions have an optimized data and metadata layout.
- Much faster write barrier than ROC.
- Much faster allocation path.
- Regions can free non-escaped memory that was marked by the GC.
### Yak
At face value, this technique is not so different from "[Yak: A
High-Performance Big-Data-Friendly Garbage
Collector](https://www.usenix.org/conference/osdi16/technical-sessions/presentation/nguyen)"
(Nguyen, 2016), which also makes use of annotations to denote memory regions.
There are a few differences, however:
- Yak sought to remove long-lived objects out of the regular GC's purview so
that it would perform better.
We're trying to handle short-lived objects here.
- Yak's regular GC would always stop short of actually scanning into the
region-based memory.
Instead, the write barrier would maintain a remembered set for regular heap
pointers living in the region space, which the GC would then treat as roots.
- Yak would evacuate escaped objects from memory regions when they were swept.
It would stop the world in order to do this and scan all thread stacks.
This is a non-starter for us, latency-wise, since a substantial portion of the
GC's work might be in goroutine stacks.
Overall, it's just not quite the right fit.
### Immix
This design borrows heavily from "[Immix: A Mark-Region Garbage Collector
with Space Efficiency, Fast Collection, and Mutator
Performance](https://www.steveblackburn.org/pubs/papers/immix-pldi-2008.pdf)"
(Blackburn, 2008), since it offers a mostly non-moving bump-pointer allocator
design.
This document eschews Immix's defragmentation approach on the assumption that
fading memory will be relatively rare, minimizing the impact of fragmentation
from reclaiming memory at the line granularity.
### Reaps
This design is also inspired by the paper "[Reconsidering Custom Memory
Allocation](https://people.cs.umass.edu/~emery/pubs/berger-oopsla2002.pdf)"
(Berger, 2002) which argues that, outside of straightforward region-based memory
management, custom allocators are not worth the complexity.
This argument helps motivate this design's focus on region-based memory
management and not something else.
This design is also inspired by the "reaps" concept in that paper, which seeks
to merge the general-purpose heap with region-based memory management, rather
than keep them completely separate like the arena experiment does.