blob: 71ae1a74a5814ca6b352c989039948f9d77c2d77 [file] [log] [blame] [view]
# Proposal: Dense mark bits and sweep-free allocation
or, *How I learned to stop worrying and love the bitmap*
Author: Austin Clements
Last updated: 2015-09-30
Discussion at https://golang.org/issue/12800.
## Abstract
This document proposes a change to memory allocation to eliminate the
need for the sweeper and a new representation for the mark bitmap that
enables this. This reduces the cost of allocation, significantly
improves the locality of the mark bitmap, and paves the way for future
advances to the Go GC that require multiple mark bits.
## Background
All current releases of Go up to and including Go 1.5 use a
*mark-sweep* garbage collector. As of Go 1.5, this works by
alternating between a mostly-concurrent mark phase and a concurrent
sweep phase. The mark phase finds all reachable objects and *marks*
them, leaving all unreachable objects *unmarked*. The sweep phase
walks through the entire heap, finds all unmarked objects, and adds
them to free lists, which the allocator in turn uses to allocate new
objects.
However, this sweep phase is, in a sense, redundant. It primarily
transforms one representation of the free heapthe mark bitsinto
another representation of the free heapthe free lists. Not only does
this take time, but the free list representation is unfriendly to
modern CPUs since it is not very cacheable and accesses to it are hard
to predict. Furthermore, the current mark representation is also
cache-unfriendly, which adds even more to the cost of sweeping.
This document proposes a design for eliminating the sweeper. The key
idea is to allocate directly using the mark bitmap, foregoing the free
list representation entirely. Doing this efficiently requires a new,
dense representation for mark bits that enables fast scanning and
clearing. This representation also makes it easy to maintain multiple
mark bitmaps simultaneously. We introduce the dense bitmap
representation first. We then present a simple system for allocation
based on two mark bitmaps that eliminates the free list and hence the
need for the sweeper.
## Motivation
Typical Go programs spend about 5% of their CPU in the sweeper or in
cache misses induced by the free list representation. Hence, if we can
significantly reduce or eliminate the cost of the sweeping from
allocation and improve the free set representation, we can improve
overall program performance.
To measure this, we ran the go1 benchmark suite, just the BinaryTree17
benchmark with `-benchtime 5s`, and the x/benchmarks garbage benchmark
with `-benchmem 1024`. These represent a range of CPU-intensive and
allocation-intensive benchmarks. In all cases, GOMAXPROCS is 4. The
CPU time breaks down as follows:
| | go1 (all) | BinaryTree17 | garbage 1GB |
| --- | ---------:| ------------:| -----------:|
| CPU in mark | 2.8% | 4.0% | 34.7% |
| CPU in sweep | 3.3% | 20.6% | 5.2% |
| CPU in mallocgc (excl. sweep, GC) | 6.8% | 39.0% | 15.8% |
|   % of mallocgc spent walking free list | 19.2% | 17.5% | 14.3% |
(Times were collected using pprof. mark shows samples matching
`\.gc$|gcBgMarkWorker|gcAssistAlloc|gchelper`. sweep shows
`mSpan_Sweep`. mallocgc shows `mallocgc -gcAssistAlloc -mSpan_Sweep`.)
This proposal replaces sweepone with a scan that should require
roughly 1ms of CPU time per heap GB per GC cycle. For BinaryTree17,
thats less than 0.1% of its CPU time, versus 21% for the current
sweep. It replaces the cost of walking the free list in mallocgc with
what is likely to be a smaller cost of sequentially scanning a bitmap.
Its likely to have negligible effect on mark performance. Finally, it
should increase the heap footprint by roughly 0.02%.
## Dense mark bitmaps
Currently, the mark bits are stored in the *heap bitmap*, which is a
structure that stores two bits for every word of the heap, using a
simple formula to map between a heap address and a bitmap address. The
mark bit is stored in one of the bits for the first word of every
object. Because mark bits are "on the side," spans can be efficiently
subdivided into smaller object sizes (especially power-of-two sizes).
However, this sparse bitmap is expensive to scan and clear, as it
requires a strided, irregular traversal of memory, as shown in
figure 1. It also makes it difficult (albeit not impossible) to
maintain two sets of mark bits on word-sized objects, which is
necessary for sweep-free allocation.
![](12800/sparse.png)
**Figure 1.** In Go 1.5, mark bits are sparse and irregularly strided.
Therefore, this proposal separates the mark bits from the heap bitmap
into a dedicated mark bitmap structure. The difficulty here is,
because objects are different sizes and a given span can be freed and
reused for a different size class any number of times, a dense mark
bitmap cannot be addressed solely based on an objects address. It
must be indirected through the objects span.
One solution is to store the mark bits for the objects in each span as
a dense bit array in the span itself, either before or after the
objects. This dense representation is more efficient to scan and clear
than the sparse representation, but still requires visiting every span
to do so. Furthermore, it increases memory fragmentation, especially
for power-of-two allocations, which currently have zero external
fragmentation.
We propose maintaining a dense mark bitmap, but placing it outside of
the spans and in mostly contiguous memory by allocating the mark
bitmap anew *for every GC cycle*. In both the current sparse bitmap
and the strawman dense bitmap above, an objects mark bit is in the
same location for the lifetime of that object. However, an objects
mark only needs to persist for two GC cycles. By allocating the mark
bitmap anew for each GC cycle, we can avoid impacting span
fragmentation and use contiguous memory to enable bulk clearing of the
bitmap. Furthermore, discarding and recreating the bitmap on every
cycle lets us use a trivial scheme for allocating mark bitmaps to
spans while simultaneously dealing with changing heap layouts: even
though spans are reused for different size classes, any given span can
change size class at most once per GC cycle, so theres no need for
sophisticated management of a mark bitmap that will only last two
cycles.
Between GC cycles, the runtime will prepare a fresh mark bitmap for
the upcoming mark phase, as shown in figure 2. It will traverse the
list of in-use spans and use a simple arena-style linear allocator to
assign each span a mark bitmap sized for the number of objects in that
span. The arena allocator will obtain memory from the system in
reasonably large chunks (e.g., 64K) and bulk zero it. Likewise, any
span that transitions from free to in-use during this time will also
be allocated a mark bitmap.
![](12800/dense.png)
**Figure 2.** Proposed arena allocation of dense mark bitmaps. For
illustrative purposes, bitmaps are shown allocated without alignment
constraints.
When the mark phase begins, all in-use spans will have zeroed mark
bitmaps. The mark phase will set the mark bit for every reachable
object. Then, during mark termination, the garbage collector will
transfer this bitmap to the allocator, which can use it to find free
objects in spans that were in-use at the beginning of the mark phase.
Any spans that are allocated after the mark phase (including after
mark termination) will have a nil allocation bitmap, which is
equivalent to all objects in that span being unmarked and allows for
bump-pointer allocation within that span. Finally, when the allocator
is done with the mark bitmap, the whole arena can be bulk freed.
## Dense mark bitmap performance
The entire process of allocating and clearing the new mark bitmap will
require only about 1 ms of CPU time per heap GB. Walking the list of
in-use spans requires about 1 ms per heap GB and, thanks to the arena
allocation, zeroing the bitmap should add only 40 µs per heap GB,
assuming 50 GB/sec sequential memory bandwidth and an average object
size of 64 bytes.
Furthermore, the memory overhead of the mark bitmap is minimal. The
instantaneous average object size of "go build std" and "go test
-short std" is 136 bytes and 222 bytes, respectively. At these sizes,
and assuming two bitmaps, the mark bitmaps will have an overhead of
only 0.18% (1.9 MB per heap GB) and 0.11% (1.2 MB per heap GB),
respectively. Even given a very conservative average object size of 16
bytes, the overhead is only 1.6% (16 MB per heap GB).
Dense mark bits should have negligible impact on mark phase
performance. Because of interior pointers, marking an object already
requires looking up that objects span and dividing by the spans
object size to compute the object index. With the sparse mark bitmap,
it requires multiplying and adding to compute the objects base
address; three subtractions, three shifts, and a mask to compute the
location and value of the mark bit; and a random atomic memory write
to set the mark bit. With the dense mark bitmap, it requires reading
the mark bitmap pointer from the span (which can be placed on the same
cache line as the metadata already read); an addition, two shifts, and
a mask to compute the location and value of the mark bit; and a random
atomic memory write to set the mark bit.
Dense mark bits should simplify some parts of mark that currently
require checks and branches to treat the heap bitmap for the first two
object words differently from the heap bitmap for the remaining words.
This may improve branch predictor behavior and hence performance of
object scanning.
Finally, dense mark bits may slightly improve the performance of
unrolling the heap bitmap during allocation. Currently, small objects
require atomic writes to the heap bitmap because they may race with
the garbage collector marking objects. By separating out the mark
bits, the sole writer to any word of the heap bitmap is the P
allocating from that span, so all bitmap writes can be non-atomic.
## Sweep-free allocation
The key idea behind eliminating the sweeper is to use the mark bitmap
directly during allocation to find free objects that can be
reallocated, rather than transforming this bitmap into a free list and
then allocating using the free list. However, in a concurrent garbage
collector some second copy of the heap free set is necessary for the
simple reason that the mutator continues to allocate objects from the
free set at the same time the concurrent mark phase is constructing
the new free set.
In the current design, this second copy is the free list, which is
fully constructed from the mark bits by the time the next mark phase
begins. This requires essentially no space because the free list can
be threaded through the free objects. It also gives the system a
chance to clear all mark bits in preparation for the next GC cycle,
which is expensive in the sparse mark bitmap representation, so it
needs to be done incrementally and simultaneously with sweeping the
marks. The flow of information in the current sweeper is shown in
figure 3.
![](12800/sweep-flow.png)
**Figure 3.** Go 1.5 flow of free object information.
We propose simply using two sets of mark bits. At the end of the mark
phase, the object allocator switches to using the mark bitmap
constructed by the mark phase and the object allocators current mark
bitmap is discarded. During the time between mark phases, the runtime
allocates and bulk zeroes the mark bitmap for the next mark phase. The
flow of information about free objects in this design is shown in
figure 4.
![](12800/mark-flow.png)
**Figure 4.** Proposed flow of free object information.
To allocate an object, the object allocator obtains a cached span or
allocates a new span from the span allocator as it does now. The span
allocator works much like it does now, with the exception that where
the span allocator currently sweeps the span, it will now simply reset
its bitmap pointer to the beginning of the spans bitmap. With a span
in hand, the object allocator scans its bitmap for the next unmarked
object, updates the spans bitmap pointer, and initializes the object.
If there are no more unmarked objects in the span, the object
allocator acquires another span. Note that this may happen repeatedly
if the allocator obtains spans that are fully marked (in contrast,
this is currently handled by the sweeper, so span allocation will
never return a fully marked span).
Most likely, it makes sense to cache an inverted copy of the current
word of the bitmap in the span. Allocation can then find the next set
bit using processor ctz intrinsics or efficient software ctz and bit
shifts to maintain its position in the word. This also simplifies the
handling of fresh spans that have nil allocation bitmaps.
### Finalizers
One complication of this approach is that sweeping is currently also
responsible for queuing finalizers for unmarked objects. One solution
is to simply check the mark bits of all finalized objects between GC
cycles. This could be done in the same loop that allocates new mark
bits to all spans after mark termination, and would add very little
cost. In order to do this concurrently, if the allocator obtained a
span before the garbage collector was able to check it for finalizers,
the allocator would be responsible for queuing finalizers for objects
on that span.
## Compatibility
This proposal only affects the performance of the runtime. It does not
change any user-facing Go APIs, and hence it satisfies Go 1
compatibility.
## Implementation
This work will be carried out by Austin Clements and Rick Hudson
during the Go 1.6 development cycle.
Figure 5 shows the components of this proposal and the dependencies
between implementing them.
![](12800/plan.png)
**Figure 5.** Implementation dependency diagram.
We will implement dense mark bitmaps first because it should be fairly
easy to update the current sweeper to use the dense bitmaps and this
will enable multiple mark bitmaps. We will then build sweep-free
allocation on top of this. Sweep-free allocation makes it difficult to
detect completely unmarked spans and return them to the span
allocator, so we will most likely want to implement eager freeing of
unmarked spans first, as discussed in issue
[#11979](https://golang.org/issue/11979), though this is not strictly
necessary. At any point after dense mark bitmaps are in place, we can
implement the optimizations to the heap bitmap discussed in "Dense
mark bitmap performance."
## Related work
Dense mark bitmaps have a long history in garbage collectors that use
segregated-fits allocation. The Boehm conservative collector
[Boehm, 1988] used dense mark bitmaps, but stored each spans bitmap
along with that spans objects, rather than as part of large,
contiguous allocations. Similarly, Garner [2007] explores a mark-sweep
collector that supports both mark bits in object headers and dense
"side bitmaps." Garner observes the advantages of dense mark bitmaps
for bulk zeroing, and concludes that both approaches have similar
marking performance, which supports our prediction that switching to
dense mark bitmaps will have negligible impact on mark phase
performance.
Traditionally, mark-sweep garbage collectors alternate between marking
and sweeping. However, there have various approaches to enabling
simultaneous mark and sweep in a concurrent garbage collector that
closely resemble our approach of allowing simultaneous mark and
allocation. Lamport [1976] introduced a "purple" extension to the
traditional tri-color abstraction that made it possible for the
sweeper to distinguish objects that were not marked in the previous
mark phase (and hence should be swept) from objects that are not yet
marked in the current mark phase, but may be marked later in the
phase. To reduce the cost of resetting these colors, Lamports
algorithm cycles through three different interpretations of the color
encoding. In contrast, our approach adheres to the tri-color
abstraction and simply alternates between two different bitmaps. This
means we have to reset the colors for every mark phase, but we arrange
the bitmap such that this cost is negligible. Queinnecs "mark during
sweep" algorithm [Queinnec, 1989] alternates between two bitmaps like
our approach. However, unlike our approach, both Queinnec and Lamport
still depend on a sweeper to transform the mark bits into a free list
and to reset the mark bits back to white.
## Possible extensions
### 1-bit heap bitmap
With the mark bits no longer part of the heap bitmap, its possible we
could pack the heap bitmap more tightly, which would reduce its memory
footprint, improve cache locality, and may improve the performance of
the heap bitmap unroller (the most expensive step of malloc). One of
the two bits encoded in the heap bitmap for every word is a "dead"
bit, which forms a unary representation of the index of the last
pointer word of the object. Furthermore, its always safe to increase
this number (this is how we currently steal a bit for the mark bit).
We could store this information in base 2 in an object-indexed
structure and reduce overhead by only storing it for spans with a
sufficiently large size class (where the dead bit optimization
matters). Alternatively, we could continue storing it in unary, but at
lower fidelity, such as one dead bit per eight heap words.
### Reducing atomic access to mark bitmap
If the cost of atomically setting bits in the mark bitmap turns out to
be high, we could instead dedicate a byte per object for the mark.
This idea is mentioned in GC literature [Garner, 2007]. Obviously,
this involves an 8× increase in memory overhead. Its likely that on
modern hardware, the cost of the atomic bit operation is small, while
the cost of increasing the cache footprint of the mark structure is
probably large.
Another way to reduce atomic access to the mark bitmap is to keep an
additional mark bitmap per P. When the garbage collector checks if an
object is marked, it first consults the shared bitmap. If it is not
marked there, it updates the shared bitmap by reading the entire word
(or cache line) containing the mark bit from each per-P mark bitmap,
combining these words using bitwise-or, and writing the entire word to
the shared bitmap. It can then re-check the bit. When the garbage
collector marks an object, it simply sets the bit in its per-P bitmap.
## References
Hans-Juergen Boehm and Mark Weiser. 1988. Garbage collection in an
uncooperative environment. Software Practice and Experience 18, 9
(September 1988), 807820.
Robin Garner, Stephen M. Blackburn, and Daniel Frampton. 2007.
Effective prefetch for mark-sweep garbage collection. In Proceedings
of the 6th international symposium on Memory management (ISMM '07).
ACM, New York, NY, USA, 43–54.
Leslie Lamport. 1976. Garbage collection with multiple processes: An
exercise in parallelism. In International Conference on Parallel
Processing (ICPP). 50–54.
Christian Queinnec, Barbara Beaudoing, and Jean-Pierre Queille. 1989.
Mark DURING Sweep rather than Mark THEN Sweep. In Proceedings of the
Parallel Architectures and Languages Europe, Volume I: Parallel
Architectures (PARLE '89), Eddy Odijk, Martin Rem, and Jean-Claude
Syre (Eds.). Springer-Verlag, London, UK, UK, 224237.