Memory regions

Background

The arena experiment adds a package consisting of a single type to the standard library: Arena. This type allows one to allocate data structures into it directly, and allows early release of the arena memory in bulk. In other words, it offers a form of region-based memory management to Go. The implementation is memory-safe insofar as use-after-frees will never result in memory corruption, only a potential crash.

In practice, this package has improved real application performance by 10-15%, achieved almost entirely due to earlier memory reuse and staving off GC execution.

The proposal to add arenas to the standard library is on indefinite hold due to concerns about API pollution. Specifically, for the API to be useful in a wide range of circumstances, APIs have to accept an additional arena parameter. A good example of this problem is the standard library: many functions in the standard library allocate heap memory, and could allocate memory into an arena upon request instead. But this would require changing so many APIs that it becomes infeasible, and is not worth a 10-15% cost reduction.

Furthermore, arenas do not compose well with the existing language. One example is stack allocation. The Go compiler employs an escape analysis pass to stack-allocate memory when possible. This decouples that memory's lifecycle from the garbage collector entirely, and is even more effective than arenas when it is possible. Unfortunately, when one chooses to allocate into an arena, that forces that allocation to come from an arena.

This document proposes a composable replacement for arenas in the form of user-defined goroutine-local memory regions.

Goals

First and foremost, our main goal is to reduce resource costs associated with the GC. If we can't achieve that, then this proposal is pointless.

The second most important goal is composability. Specifically:

No requirement to change APIs to take advantage of arenas.
Composability with standard library features, like sync.Pool and unique.Handle.
Composability with existing optimizations, like stack allocation via escape analysis.

Finally, whatever we implement must be relatively easy to use and intuitive to intermediate-to-advanced Go developers. We must offer tools for discovering where regions might be worth it, and where they aren't working out.

Design

The core of this design revolves around a pair of functions that behave like annotations of function calls.

The annotations indicate whether the user expects most or all the memory allocated by some function call (and its callees) to stay local to that function (and its callees), and to be unreachable by the time that function returns. If these expectations hold, then that memory is eagerly reclaimed when the function returns.

package region

// Do creates a new scope called a region, and calls f in
// that scope. The scope is destroyed when Do returns.
//
// At the implementation's discretion, memory allocated by f
// and its callees may be implicitly bound to the region.
//
// Memory is automatically unbound from the region when it
// becomes reachable from another region, another goroutine,
// the caller of Do, its caller, or from any other memory not
// bound to this region.
//
// Any memory still bound to the region when it is destroyed is
// eagerly reclaimed by the runtime.
//
// This function exists to reduce resource costs by more
// effectively reusing memory, reducing pressure on the garbage
// collector.
// However, when used incorrectly, it may instead increase
// resource costs, as there is a cost to unbinding memory from
// the region.
// Always experiment and benchmark before committing to
// using a region.
//
// If Do is called within an active region, it creates a new one,
// and the old one is reinstated once f returns.
// However, memory cannot be rebound between regions.
// If memory created by an inner region is referenced by an outer
// region, it is not rebound to the outer region, but rather unbound
// completely.
// Memory created by an outer regions referenced by an inner region
// does not unbind anything, because the outer region always out-lives
// the inner region.
//
// Regions are local to the goroutines that create them,
// and do not propagate to newly created goroutines.
//
// Panics and calls to [runtime.Goexit] will destroy region
// scopes the same as if f returned, if they unwind past the
// call to Do.
func Do(f func())

// Ignore causes g and its callees to ignore the current
// region on the goroutine.
//
// Calling Ignore when not in an active region has no effect.
//
// The primary use-case for Ignore is to exclude memory that
// is known to out-live a region, to more effectively make use
// of regions. Using Ignore is less expensive than the unbinding
// process described in the documentation for [region.Do].
func Ignore(g func())

Comparison with arenas

Where an arena might be used like:

func myFunc(buf []byte) error {
	a := arena.New()
	defer a.Free()

	data := new(MyBigComplexProto)
	if err := proto.UnmarshalOptions{Arena: a}.Unmarshal(buf, data); err != nil {
		return err
	}
	// Do stuff with data. ...
}

Regions would be used like:

func myFunc(buf []byte) error {
	var topLevelErr error
	region.Do(func() {
		data := new(MyBigComplexProto)
		if err := proto.Unmarshal(buf, data); err != nil {
			topLevelErr = err
			return
		}
		// Do stuff with data. ...
	})
	return topLevelErr
}

You can think of a region as an implicit goroutine-local arena that lives for the duration of some function call. That goroutine-local arena is then used for allocating all the memory needed by that function call and its callees (including maps, slices, structs, etc.). And then thanks to some compiler and runtime magic, if any of that memory would cause a use-after-free issue, it is automatically removed from the arena and handed off to the garbage collector instead.

In practice, all uses of an arena within Google tightly limit the arena's lifetime to that of a particular function, usually the one they are created in, like the example above. This fact suggests that regions will most likely be usable in most or all of the same circumstances as arenas.

Annotated examples

Below are some specific examples of different behaviors described in the API.

Basic

This example demonstrates that all major builtins are intended to work with this functionality, and that leaking a pointer out of the region causes the memory it points to to be unbound from the region.

var keep *int
region.Do(func() {
	w := new(int)
	x := new(MyStruct)
	y := make([]int, 10)
	z := make(map[string]string)
	*w = use(x, y, z)
	keep = w // w is unbound from the region.
}) // x, y, and z's memory is eagerly cleaned up, w is not.

Nested regions

This example demonstrates how nested regions interact.

region.Do(func() {
	z := new(MyStruct)
	var y *MyStruct
	region.Do(func() {
		x := new(MyStruct)
		use(x, z) // z may be freely used within this inner region.
		y = x // x is unbound from any region.
	})
	use(y)
})

Implementation

Overview

First, region.Do sets a field in the current g indicating the stack offset at which the region begins.

If this stack offset field is != 0, runtime.mallocgc uses a special allocation path. The non-zero stack offset field also implicitly enables a new kind of write barrier for the goroutine (and only for that goroutine).

Then, region.Do calls the function that was passed to it.

The special allocation path allocates all memory in a special part of the heap. Memory allocated this way is referred to as “blue” (Bounded-Lifetime Unshared memory, Eagerly reclaimed, for fellow lovers of tortured acronyms). (Subtlety: “blueness” is relative to the goroutine that allocated the memory.)

In the g, we keep track of every span that was created as part of the blue allocation path. These spans are associated with the last call to region.Do.

The new write barrier catches writes of blue memory pointers to non-blue memory (including regular heap memory, globals, other goroutine stacks, any part of the stack above the latest region.Do, and cross-region writes (for nested regions, since all cross-goroutine writes must pass through already-faded memory)) and transitively “fades” the object graph rooted at the pointer being written. Faded memory is managed by the GC in the traditional tri-color mark scheme (hence “fading” to grayscale).

Once the function passed to region.Do returns, region.Do immediately sweeps all the spans associated with it, releasing any blue memory and places the spans back on a free list.

We go into more detail in the following sections.

New g flag

Let's first define the behavior and invariants of a new field on the g, region. This value is not inherited when new goroutines are created, and it is not inherited by g0. This flag is only ever mutated and accessed by a goroutine, for itself.

Blue memory management

If runtime.mallocgc sees that the current goroutine‘s region value is non-zero, it goes down a special allocation path for blue memory. This special allocation path allocates all memory in a part of the address space that is separate (that may, but ideally won’t, interleave with) the general heap.

Address space

Region memory is allocated from dedicated heapArenas (not to be confused with arenas, these are 64 MiB aligned reservations on most 64-bit platforms, or 4 MiB on 32-bit platforms and Windows). When these are created, we provide hints to mmap and the like to be contiguous and in a distinct part of the address space from the heap, though it's OK if these special region heapArenas interleave with regular heapArenas.

These special heapArenas are distinguished by a compact bitmap (mheap_.isRegionArena) covering the whole address space, with each bit representing one heapArena. This bitmap is incredibly compact: a 48-bit address space divided into 64 MiB chunks results in a 512 KiB bitmap. We can reserve the full 512 KiB up-front and map in parts as-needed. 64 bytes of this bitmap (or one cache line on most systems) is sufficient to cover a contiguous 32 GiB of region memory.

Memory layout

All blue memory is managed in fixed-size blocks with size 8 KiB. If any allocation exceeds 2 KiB in size, it is not allocated blue, and is instead redirected to the general heap. These blocks are just special kinds of spans, and are backed with an mspan that is managed like all other spans today.

The layout of each block follows the Immix design. That is, each block is divided into 128-byte lines and groups of contiguous lines are bump-allocated into. Every allocation made into one of these regions has an 8-byte object header, including objects without pointers and small objects (<=8 bytes in size). (Insight: if objects are short-lived and we‘re going to reuse their memory quickly, then per-object overheads don’t matter as much.)

Implementation note: we can adjust the block size up to accommodate larger objects. We can also potentially allow much larger objects to participate in regions by creating an additional pageAlloc structure for region-allocated memory. This document assumes the greatest impact will be from smaller objects (<2 KiB), which account for most of the allocation volume in most programs. However, should that assumption fail, we can generalize at the cost of some additional complexity.

Allocation

Just like Immix, objects are bump-pointer-allocated into contiguous free lines. Also like Immix, each goroutine carries two local blocks to allocate from. One block may be non-empty (that is, have some non-free lines) and the other will always start out completely empty. The main block is always attempted first, but medium-sized objects that fail to allocate into the main block will be allocated into the secondary block.

Blocks are usually allocated from a P-local free list, and failing that, a global free list. If a full GC has run recently, each goroutine will first try to sweep its own already-active blocks.

Garbage collection

Each blue block reserves the first two lines for a pair of bit-per-word 128-byte bitmaps. One indicates the start of each object, while the other indicates which words of the block have faded (tracked by the write barrier). The bit corresponding to an object‘s allocation header is set upon allocation. Lines also each get mark and alloc bits, but since they’re small (64 bits each) they'll go right in the mspan for the block.

Each object's allocation header reserves the two least significant bits (which will always be zero otherwise) for mark bits. Each GC cycle, the two bits swap purposes. In GC cycle N, the Nth bit mod 2 is the bit for that cycle. When an object is marked, the (N+1 mod 2)'th bit is simultaneously cleared.

The GC will scan blocks using the object start bitmap to find the start of the object, and thus the object header. It marks and scans all reachable objects in blocks, even those objects that are still blue. It also marks any lines containing marked objects. When the GC marks an object whose fade bits are not set, it leaves the fade bits alone.

Sweeping

There are two ways in which objects in region blocks may have their memory made available for reuse. The first is based on the mark bits, which we‘ll call “full sweeping” since it’s very close in semantics to the existing sweeper. The second is based on the fade bits, which we'll call “eager sweeping.”

Full sweeping of all blocks is performed each GC cycle. Lines that have not been marked are made available for allocation by installing the line mark bits as the line allocation bits. Note that freeing both blue and non-blue objects is completely fine here: the mark bits prove whether or not the object is reachable. This first kind of sweeping is the one referenced in the allocation path.

Eager sweeping of region blocks is performed when the function passed to region.Do returns. Lines that do not overlap with any faded words, as indicated by the bitmap, are made available for allocation. Note that we may free blue objects that were marked. This is mostly fine: if the object never faded, we can be certain that the object is dead when f returns. The only hazard is if the GC queues a pointer to blue memory that is later freed. If the GC goes to inspect the object later, it might be gone. Even worse, there might not be an object starting at the same address anymore!

There are many ways to deal with this problem, but here are two.

Delay eager sweeping until after the mark phase completes. This is unfortunate, but easy to implement and safe.
Keep a generation counter on each blue block. If the GC enqueues a blue pointer, it also queues the generation counter from the block it lives in. Later, it checks the generation counter when looking up the block for the object again and if it doesn't match, it skips doing any additional work. This complicates scanning, but would allow for eager reclamation during the mark phase.

New write barrier

A new write barrier tracks when pointers to objects are written outside the call tree rooted at f and the blocks associated with f's region. In other words, it tracks writes of blue pointers to non-blue memory. This write barrier is enabled only if the region field on the g is non-zero. The write barrier executes on each pointer write that is not to the current stack frame (same as the existing write barrier).

If the write barrier discovers a blue pointer being written to non-blue memory, the fade bits for the words corresponding to the object‘s memory are set. Then, the object’s referents are transitively marked as faded. Below is a sketch of the write barrier's fast path.

func regionWriteBarrier(ptr, dst unsafe.Pointer) {
	region := getg().region
	if region == 0 {
		return
	}
	if ptr == nil {
		return
	}
	if dst >= getg().stack.lo && dst < getg().stack.lo+region {
		// The pointer is being written into the stack within the region.
		return
	}
	ptrBlock := findRegionBlock(ptr)
	if ptrBlock == nil {
		// Pointer doesn't belong to a region.
		return
	}
	dstBlock := findRegionBlock(dst)
	// If they belong to different regions and ptr's region doesn't
	// does not out-live dst's, we must fade ptr.
	if dstBlock != nil && ptrBlock.region > dstBlock.region {
		// Otherwise, for within-region writes, only fade on writes of blue
		// to faded memory.
		if !ptrBlock.isBlue(ptr) || dstBlock.isBlue(dst) {
			return
		}
	}
	fade(ptr)
}

func findRegionBlock(ptr unsafe.Pointer) *regionBlock {
	// Check if the pointer lies inside of a region arena.
	arenaIdx := uintptr(ptr)/heapArenaBytes
	if mheap_.isRegionArena[arenaIdx/8]&(uint8(1)<<(arenaIdx%8)) == 0 {
		return nil
	}
	// Find the base of the block, where the fade bitmap, among other things, lives.
	base := uintptr(ptr) &^ (8192-1)
	return (*regionBlock)(unsafe.Pointer(base)))
}

type regionBlock struct {
	faded    [128]byte
	objStart [128]byte
	region   uintptr
	lineFade uint64
}

func (b *regionBlock) isBlue(ptr unsafe.Pointer) bool {
	// Find the word index that ptr corresponds to in the block.
	word := (uintptr(ptr)-uintptr(unsafe.Pointer(b)))/8

	// Load, mask, and check the bit corresponding to the word.
	return b.faded[word/8] & (1<<(word%8)) == 0
}

fade performs the transitive fade bitmap marking. Note that objects must be immediately transitively faded (that is, we cannot defer the operation), because then it would be possible to hide pointers to those objects in another goroutine‘s stack, which is not protected by a write barrier (and shouldn’t be; it's too expensive).

Comparison with the arenas

Implicit allocation instead of explicit allocation

Arenas require explicitly choosing what gets allocated into an arena. On the one hand, this offers fine-grained control over memory, but on the other hand, does not compose well with the language. For example, it requires updating APIs to make the best use of the arena. Standard library APIs that do not accept an arena would need to have a new version to allocate memory into the arena.

One could imagine making arenas implicit and goroutine-bound. But the issue with implicitly allocating memory into an arena is that it‘s far less clear when it’s safe to free the arena since some values may have been allocated into the arena that code authors don't expect.

In contrast, the regions design implicitly allocates all eligible heap memory into a kind of memory region, but avoids the issue of not knowing when to free memory. In essence, the new write barrier effectively pins memory that is not safe to free, meaning the rest can be freed once the region ends. The regions design also offers an opt-out mechanism (region.Ignore), restoring fine-grained control over memory allocation where needed.

Furthermore, explicit allocation introduces some complications with existing optimizations. For example, the compiler‘s escape analysis may be able to deduce that some value may be stack allocated, but this analysis immediately fails if the value is allocated into an arena. Special compiler support could make arena-allocated values candidates for stack allocation, but that’s likely to be complex.

Finer granularity of memory reuse

One property of the arena experiment is that arena chunks are very large (8 MiB) to amortize the cost of syscalls to force faults on the memory. This means that, naively, making good use of arenas would require sharing them across multiple logical operations (for example, multiple HTTP requests in a server). In practice, the arena experiment works around this by only reusing partially-filled arena chunks, filling them up further. However, these partially-filled chunks still contribute to the live heap, even if most of the memory inside of them is dead.

Meanwhile, region blocks are three orders of magnitude smaller than arena chunks. This means that there's really no need to try to reuse partially-filled region blocks. As a result, the baseline live heap contribution of region.Do should be substantially smaller.

Accidentally retaining unreachable memory is less of a hazard

Failing to free an arena, or allocating too much into an arena, may result in running out of memory, since the lifetime of all allocations in the arena are truly bound together. Even if large amounts of arena memory are no longer reachable, the GC is unable to free them because arenas have no mechanism for reclamation except on the order of arena chunks.

The region design meanwhile includes region blocks in the regular GC cycle. This means that if regions are specified in an overly-broad way (regions that are too big or too long-lived), there's no more potential for running out of memory than usual. Plus, blue memory in the region is still immediately reclaimable and reusable, even if it survives multiple GC cycles.

Write barrier

One downside of regions over arenas is the write barrier necessary to prevent use-after-free issues. This design rests on the write barrier being cheap enough to avoid overwhelming savings from reduced GC pressure.

The good news is that I believe this design achieves this goal. See the preliminary evaluation. To summarize, an upper bound on the expected CPU overhead of the write barrier is between 1.1%–4.5%. For applications that apply a substantial amount of GC pressure (resulting in upwards of 10% GC CPU costs), this overhead should be readily recouped. This analysis also assumes that all goroutines are in a region all of the time, so it is truly a worst case. I suspect this overhead is similar to the overhead of remapping large regions of memory for arenas, so it can be thought of as a safety feature that has a cost.

Preliminary evaluation

Hypotheses

In specific circumstances, a large fraction of allocations remain goroutine-local and their lifetimes are bound to some call tree (in the limit, the lifetime of a goroutine).
The CPU benefit of faster allocation, less frequent global GCs, and the cache benefit of fast memory reuse greatly outweigh the cumulative costs of the write barrier test and transitive fades.
Performance-sensitive users will be able to identify good candidates for regions that don't result in frequent transitive fades, possibly with some light tooling (e.g. profiling).
Immix-style allocation is likely to slightly increase the live heap size on average due to increased risk of fragmentation. Allocation headers are likely to increase heap memory overheads for small objects. However, peak memory usage may be lower if less memory is likely to be live at any given time due to the eager reuse. Less time spent during GC cycles also means less memory allocated black, which may shrink the live heap as well.
Bump pointer allocation is likely to reduce median latency. Reduced GC load and thus less frequent GCs are likely to decrease tail latency. The effect of fading will depend on the application: if fading is frequent, it‘s likely to increase median latency and tail latency; if it’s rare, it's likely to increase tail latency.

Strategy

Because this technique involves new API, there are unknowns that are not possible to quantify or estimate directly, since they depend on adoption in usage. In the analysis below, we define these unknowns and leave them unknown, allowing us to plug in some reasonable numbers and make some reasonable inferences for the best, worst, and middling cases.

CPU effect

AB = Allocated bytes
AO = Allocated objects
NGC = Number of GC cycles
BR = Fraction of bytes allocated in any region
OR = Fraction of objects allocated in any region
BF = Fraction of bytes allocated in any region, but fade
OF = Fraction of objects allocated in any region, but fade
BS = Fraction of bytes allocated in any region that do not fade, but are scanned (BF + BS ≤ 1)
CR = Relative cost of mark/sweep of region objects (expected >1)
PF = Pointer density (per byte) of faded objects

ΔCPU time =
immixAllocCPU(OR * AO, BR * AB)	Objects in a region are bump-pointer allocated
+ heapAllocCPU((1 - OR) * AO, (1 - BR) * AB)	Cost of allocating non-region-managed objects
− heapAllocCPU(AO, AB)	Gain over heap-allocating all objects
+ total GC time * (1 - BR)	Cost of mark/sweep of non-region objects
+ total GC time * BR * (BF + BS) * CR	Cost of mark/sweep of faded region objects
− total GC time	Gain over mark/sweeping all objects
+ wbTestTime * # of pointer writes [ * OR ]	Cost of write barrier test (upper bound) [scaled]
+ fadeCPU(AO * OR * OF, PF * AB * BR * BF)	Cost of fading objects (write barrier slow path)
+ regionCleanupCPU(AB * BR)	Cost of eagerly sweeping a region when it ends
− TODO Cache benefit of prompt reuse

BS is defined a little strangely, but can be considered as a proxy for average region lifetime.

If all regions survive for at least a full GC cycle (all region memory is visible to the GC for scanning) then BS is 1. If all regions begin and end outside of a GC cycle (not realistic), then BS is 0. Good uses of regions will have a low non-zero BS parameter.

RAM effect

For now, we avoid building a model for the RAM effect.

The competing factors are substantially harder to measure, but are unlikely to be deal-breakers. For instance, the Immix-style memory layout has a worse worst-case fragmentation than our existing heap layout, but the average case is only a few percent different: previous experiments (sorry, no results, only the simulation tool) on real application data showed up to 10% when not optimizing for allocation headers tiny objects, and ~3.5% after optimizing allocation headers for tiny objects.

However, the effects of eager reuse on the live heap size can also be substantial, as evidenced by performance improvements found with arenas. The theory behind this memory improvement is that less live memory is visible to the GC at any given time, especially via the allocate-black policy.

And lastly, because of GOGC, CPU benefits may generally be translated into memory benefits as needed.

Latency effect

For now, we also avoid building a model for the latency effect. Latency effects are even more contextually sensitive and harder to measure than RAM effects. Furthermore, in this design, the cost depends heavily on how deeply chained faded memory turns out to be. While I suspect that in practice this won‘t be much of an issue, especially with good diagnostics, it’s worth keeping an eye out on latency metrics as the design and its evaluation progresses.

Experiments

To estimate the CPU effect, the following experiments will be performed:

Prototype the write barrier test in the compiler and compute a cost per pointer write.
Prototype the fade path and empirically derive a cost function.
Prototype the Immix bump-pointer allocator and empirically derive a cost function.
Prototype the region cleanup path and empirically derive a cost function.

These experiments will fill in quite a few of the gaps mentioned above, but still leave the following parameters for CPU cost undefined: BR, OR, BF, OF, BS, CR, PF.

BR, OR, BF, OF, and BS all depend on how users make use of the API. Ideally, users will be able to choose regions with a very low chance of objects escaping, and we can just pick numbers that represent this best-case scenario. In such a case, the win should be large and obvious. For example:

BR > 0.5
OR > 0.9 (targets small objects)
BF ≈ 0.0
OF ≈ 0.0
BS < 0.1
CR < 1.05
PF can be anything, but at most 0.125 on 64-bit platforms.

But there still remains the question of whether users will actually be able to do that. We can get a reasonable idea of that by estimating the sensitivity that each parameter has on overall CPU cost in different parts of the space.

Lastly, there‘s CR. CR is defined the way it is because algorithmically, what we propose here is not so far off from the existing GC scan loop. It’s reasonable to expect that the scan loop cost will only be a constant factor off at most (that is, CR). Furthermore, actually estimating CR is quite tricky, because it's difficult to pull a scan loop out of context. I propose just treating CR as a parameter and apply the same sensitivity analysis to it.

Results

CPU effect experiments

Benchmarks

I collected data from 6 benchmarks in the Sweet benchmark suite where statistics from real applications were needed:

etcd Put
etcd STM
Tile38
CockroachDB kv0 (single node)
CockroachDB kv50 (single node)
CockroachDB kv95 (single node)

The CockroachDB benchmark was run with GOGC=300 (the default) and GOGC=100. It turns out that at GOGC=300, the benchmark has a total GC overhead of about 3%, for which regions make no sense. However, I adjusted GOGC down to see if the possible CPU savings from regions could translate into memory savings instead: less memory used for the same GC overhead.

Write barrier test time

I estimated the write barrier test time by implementing the worst-case write barrier fast path (region -> region write, two lookups required) in Go and benchmarking it. My estimate is ~5.2 ns per pointer write on average.

wbTestTime ≈ 5.2 ns

This result is from the fast path's worst case: a region-to-region write (requiring metadata inspections for both ptr and dst) with poor spatial and temporal locality (ptr and dst are shuffled across 1 GiB of contiguous region memory). The fast path is also implemented in Go, and can likely be further optimized with an assembly-based implementation.

Using a non-nil pointer write frequency we can estimate an upper-bound on the cost of the write barrier in terms of a % overhead.

I measured the non-nil heap pointer write frequency for 3 benchmarks. First, I replaced cgocheck2 mode with a non-nil pointer write counting mode (that also counted pointers in typed memory copies and the like). After measuring the pointer write counts for the benchmark, I re-ran the benchmarks with pointer-counting turned off, and estimated the total CPU time available to the benchmark. Next, I divided the CPU time by the pointer write count for the benchmark to obtain an average pointer write rate. Lastly, I divided the conservative cost of the new write barrier by this rate to determine an overhead. This process produced the following results:

etcd STM
- Mean CPU time between pointer writes: 119 cpu-ns
- Write barrier overhead: 4.37%
etcd Put
- Mean CPU time between pointer writes: 491 cpu-ns
- Write barrier overhead: 1.06%
Tile38
- Mean CPU time between pointer writes: 265 cpu-ns
- Write barrier overhead: 1.96%
CockroachDB kv0
- Mean CPU time between pointer writes: 116 cpu-ns
- Write barrier overhead: 4.45%
CockroachDB kv50
- Mean CPU time between pointer writes: 120 cpu-ns
- Write barrier overhead: 4.30%
CockroachDB kv95
- Mean CPU time between pointer writes: 117 cpu-ns
- Write barrier overhead: 4.43%

I conclude that the expected CPU cost of the write barrier will be between 1.1–4.5% as an upper-bound, since this value is computed assuming the write barrier is enabled globally.

`fadeCPU(o, p)`

I determined this cost function by creating a simple out-of-tree prototype of the fade slow path. The fade slow path has a per-object cost approximately equal to about twice the cost of a regular heap allocation, plus an additional marginal cost per pointer in the object. The fade slow path prototype is pessimal in that the benchmark actively tries to induce cache misses by randomizing fades, so this is an upper-bound. (The dominant per-object cost is cache misses.) In practice, I expect slightly better locality, assuming that objects which are allocated together tend to point to one another.

fadeCPU(o, p) = 40ns * o + 3.37 ns * p

`immixAllocCPU(o, b)`

I determined this cost function by creating a simple out-of-tree prototype of the allocation path, including a simulated cost of writing out GC metadata (object start bits, allocation header). The backing store of each block was allocated from heap and was measured, which means this number includes the cost of the existing allocation slow path in the measurement. Like heapAllocCPU, this cost function was derived under ideal circumstances (allocating into a fresh block) and ensured that any GC costs were excluded, so as to be as comparable as possible.

immixAllocCPU(o, b) = 8ns * o + 0.15ns * b

Note: I do not yet have a good explanation for why the per-byte term looks so much higher than for heapAllocCPU.

`regionCleanupCPU(o, b)`

I determined this cost function by adding a simple out-of-tree prototype of the region cleanup path to the immixAllocCPU experiment and re-running the experiment. The path just consists of installing the line mark bitmap (64 bits) into the line allocation bitmap (64 bits), resetting the cursor/limit in the bump-pointer allocator, and clearing the object-start bitmap for non-escaping lines. We found that, in practice, the cost of the reset was negligible, well within benchmarking noise. (The benchmarks assume a worst case – clearing all the object-start bits.)

For the purposes of this analysis, we conclude that:

regionCleanupCPU(o, b) ≈ 0

CPU effect analysis

To perform the sensitivity analysis, I collected data from several benchmarks to fill into the model, namely:

Total GC CPU time - # of non-nil pointer writes (with a write barrier) - Total objects allocated (AO) - Total bytes allocated (AB)

The actual sensitivity analysis is a one-at-a-time analysis. First, we determine a reasonable range for each variable of interest. Then, we fix all other variables in the model. Finally, we adjust each variable of interest one at a time to learn how it changes the overall result.

(This is similar to computing a partial derivative of the CPU model with respect to some variable, but the equation of a partial derivative is far less intuitive than a visual produced by moving one variable at a time.)

Minor model deviation

We make a slight modification to the model for the analysis below: we scale the write barrier cost by OR to reflect that the write barrier is only enabled where regions are used. This is not a perfect proxy for the write barrier overhead, but it at least tracks how widely regions are used. This is shown in the model in square brackets.

Baseline costs

The figure below shows the measured baseline GC costs for each benchmark, to give a sense of what the best possible improvement is.

Note: The allocation CPU overheads are estimated using heapAllocCPU from the number of objects and bytes allocated, not the actual sampled CPU time. This is to make the output of the model a little more internally consistent.

Baseline GC overheads for each benchmark.

Overall picture

Just to get a sense of the possible wins and the breakdown, below is the output of the model for various benchmarks with the following parameters. These parameters describe a “pretty good use” scenario.

BR = 0.6 (Targets small allocations, so let's make BR < OR.)
OR = 0.9 (Most objects end up in regions.)
BF = 0.05
OF = 0.05 (Few objects fade.)
BS = 0.01 (Regions are short-lived.)
CR = 1.05 (Regions are more expensive to scan.)
PF = 0.062 (50% of faded object bytes are pointers)

For the sensitivity analysis, we will be using these parameters as a baseline. Outside of this document, I also performed other experiments, including the best possible and worst possible outcomes, though they're not that much more interesting, and this scenario is sufficient to paint a good picture of the possible wins.

Performance delta for each benchmark with regions in a pretty good usescenario.

One thing to note is that there‘s no win for CockroachDB at GOGC=300, but there is one for GOGC=100. We don’t quite get back to the baseline overhead at GOGC=100 in this scenario, but adjusting GOGC up to find the sweet spot means the potential for substantial memory savings.

Note that the best-case scenario is the elimination of the overheads above to 0, which is at most ~10% in these particular benchmarks. Thus, it's helpful to consider the proportion of GC overhead eliminated relative to that 10% (so, 7% reduction means 70% GC overhead reduction).

Sensitivity analysis

Below we plot the effect of changes in the model due to all parameters except CR and PF which the model was not very sensitive to in their reasonable ranges ([1.00, 1.10] and [0.000, 0.625] respectively). Below each plot are some observations.

Plot of sensitivity to BR and OR for eachbenchmark.

For the most part, the more regions are used (and used well), the better things get. ∆CPU%, expressed in terms of a proportion of GC overhead, suggests a striking improvement: a best case 75% reduction in GC overheads.

Plot of sensitivity to BF and OF for eachbenchmark.

The assumptions behind regions are important: most objects must die with the region to be effective. However, note the crossover points on the above graph: the break-even point is fairly reasonable at >25% faded for CockroachDB, and ~50% for Tile38. It depends on the application, but there's plenty of wiggle-room, which is motivating.

Plot of sensitivity to BS for each benchmark.

Region lifetimes are a bit more forgiving than fades: the crossover point sits consistently further out across benchmarks.

Diagnostics

Because this feature is opt-in, we need diagnostics to (1) help users understand where in their application this feature is applicable and (2) help users understand when this feature is not working in their favor and why. Below are three proposed such diagnostic tools.

Heap profile with lifetime information

To identify when memory regions may be applicable, I propose that we provide a new kind of profile containing the size and lifetime (defined as the number of GC cycles survived) of heap-allocated memory. This profile would contain sufficient information to identify where short-lived memory is allocated in high numbers, and where memory regions might be appropriate.

This information could be collected with very low overhead. In short, we could attach the GC cycle number in which an allocation was made to a heap profiling special. Upon deallocation, we simply record the difference between the current GC cycle number and the one recorded in the profile. Note that such a profile would likely be larger than a typical heap profile, because one would want to group stacks by the number of GC cycles survived by the allocations made there.

Allocation tracing

With a little bit of extra tooling, it would be possible to give very precise suggestions on where memory regions might be applicable (that is, where memory is allocated together). Allocation tracing is likely overkill for most purposes and has a much higher overhead, but it already exists in the Go tree.

Fade metrics

To quickly determine whether usage of memory regions is still worthwhile, we can provide metrics on the number of objects allocated in memory regions and the number of objects that are faded. This ratio (let's call it the “fade ratio”) should be as close to zero as possible. If the ratio is, for example, over 5%, users may want to consider experimenting with removing memory regions.

Fade profile

For digging deeper into why certain blue allocations fade, we can provide a profile of where objects frequently escape from memory regions. In essence, this profile would consist of stack traces taken when an object is about to fade.

Region skip profile

The implementation may choose to skip allocating into a region. In the design presented in this document, this is primarily for large allocations. We could provide a profile for when this happens.

Region lifetime profile

For digging deeper into whether regions are actually usually surviving less than a full GC cycle, we can provide a profiling containing lifetime information (in terms of GC cycles survived) for entire memory regions. If they're frequently surviving a GC cycle (more than, for example, 0.5%), then the regions are likely too broadly-defined.

Alternatives considered

No API

A valid question is why this functionality should be an API instead of automatically inferred or always enabled. There are basically four alternative approaches in this category: statically-determined regions, profile-guided regions, make every goroutine a region, and dynamic inference of which goroutines would make good regions.

Statically-determined regions

Of the three options, this seems the least likely to bear fruit. This has been an extensively researched topic in the literature, but it has unfortunately never come to a good conclusion. I don‘t think it ever will, given that static analysis can’t do much about unbounded loops.

Profile-guided regions

This direction seems promising, but identifying what information is most useful in identifying regions may end up being somewhat of a research topic. My hunch is that doing this right will require precise lifetime information, since we have to make choices that are going to be good for the lifetime of the program. My primary concern is that “short-lived” means “survives less than one GC cycle” and the length of a GC cycle is fundamentally dynamic in nature. Changing GOGC, GOMEMLIMIT, or even just a change in the application that causes a change in the live heap size can affect what ends up getting inferred. More precise lifetime information (perhaps coupled with GC performance data) would allow us to make safer inferences. However, I suspect collecting precise lifetime information is going to be very expensive to collect from production (much more expensive than, say, a CPU profile).

Using profiles to disable regions

Credit to Cherry Mui for this one. This is an interesting approach to help monitor and catch bad uses of regions. If you already have PGO in your production loop, this compiler could tell you from a profile when it decided to disable a region, indicating that there may be a regression, or that it can be removed from the codebase.

Every goroutine is a region

We could choose to implicitly make every goroutine a region. This would basically turn the technique into a different kind of request-oriented collection. The risk is that the assumptions of the technique, that the vast majority of memory is likely to be short-lived, may not hold.

This approach is essentially the same as the request-oriented collector. Because it would be fairly easy to implement, this may be worth adding as a GOEXPERIMENT for quickly trying out regions on an application.

Dynamic inference for goroutines

We could also choose to make some goroutines regions, and infer which goroutines are good regions at run-time. This would require measuring lifetime information at run-time, but notably, would not require that information to be very precise, since regions can be re-inferred at run-time. What this would entail is measuring which goroutines tend to be short-lived (that is, die within one GC cycle) and tend to allocate memory scoped to their lifetime.

Allowing regions to cover multiple goroutines

Some applications, like gopls, have a memory use-pattern that maps well to a fork-join pattern: a goroutine is created, and all memory created by it and its child goroutines (who do not outlive the parent) is dropped on the floor by the time the parent goroutine joins all the children. This proposal's semantics are not good enough for this case, because memory will cross goroutine boundaries.

Extending this proposal to support such a use-case is hard. There are two sources of issues: API issues, and implementation issues.

The API issues mostly revolve around bounding goroutine lifetimes. As it stands, Go has no good way to block until a goroutine is truly exited. Regions would either have to block until goroutines within them have exited, creating a new synchronization mechanism or panic when their invariants are violated. Both of these options change the semantics of the program, which is a fairly large negative for something that is ultimately an optimization.

The other issue is in the implementation. I do not currently see a way around introducing a new, more expensive global write barrier to enable the same semantics. There are also new synchronization costs from multiple goroutines interacting with the same region metadata simultaneously. Lastly, there are implementation and maintenance costs, as this implementation will almost certainly be more complicated. This makes the cost/benefit estimate seem much more concerning.

Prior Art

Request-oriented collector

Design document.

This is a non-moving collector design that marks all new allocated objects as goroutine-local. Once a goroutine exits, all goroutine-local objects reachable from that goroutine are immediately freed. Any heap objects that escaped the goroutine were caught by a write barrier. Once an object had been determined to escape, that object, as well as its transitive referents, were marked as part of the global heap.

This idea worked very well for some applications, but poorly for others. The biggest cost was the write barrier. Applications that performed poorly typically created large heap subgraphs locally and then published them to other goroutines. The fact that the write barrier needed to be exhaustive, and needed to run immediately (since the objects were just about to be published), hurt latency as well. Furthermore, the write barrier was always on and its publication test was relatively expensive, requiring several metadata lookups on every pointer write.

This technique is very similar to what's described in this document, if one were to extend all regions to the lifetime of each goroutine and always enable regions everywhere. There are some substantial differences, however:

Regions are opt-in, so only the parts of the application that benefit make use of the functionality.
Regions have an optimized data and metadata layout.
Much faster write barrier than ROC.
Much faster allocation path.
Regions can free non-escaped memory that was marked by the GC.

Yak

At face value this technique is not quite so different from the “Yak: A High-Performance Big-Data-Friendly Garbage Collector” (Nguyen, 2016), which also makes use of annotations to denote memory regions. There are a few differences, however:

Yak sought to remove long-lived objects out of the regular GC‘s purview so that it would perform better. We’re trying to handle short-lived objects here.
Yak's regular GC would always stop short of actually scanning into the region-based memory. Instead, the write barrier would maintain a remembered set for regular heap pointers living in the region space, which the GC would then treat as roots.
Yak would evacuate escaped objects from memory regions when they were swept. It would stop the world in order to do this and scan all thread stacks. This is a non-starter for us, latency-wise, since a substantial portion of the GC's work might be in goroutine stacks.

Overall, it's just not quite the right fit.

Immix

This design borrows heavily from “Immix: A Mark-Region Garbage Collector with Space Efficiency, Fast Collection, and Mutator Performance” (Blackburn, 2008), since it offers a mostly non-moving bump-pointer allocator design. This document eschews Immix's defragmentation approach on the assumption that fading memory will be relatively rare, minimizing the impact of fragmentation from reclaiming memory at the line granularity.

Reaps

This design is also inspired by the paper “Reconsidering Custom Memory Allocation” (Berger, 2002) which argues that, outside of straightforward region-based memory management, custom allocators are not worth the complexity. This argument helps motivate this design's focus on region-based memory management and not something else.

This design is also inspired by the “reaps” concept in that paper, which seeks to merge the general-purpose heap with region-based memory management, rather than keep them completely separate like the arena experiment does.