blob: 266c2e902e5dcbcbc2217f6f578bfdcf91e147a8 [file] [log] [blame] [view]
# Proposal: Simplify mark termination and eliminate mark 2
Author(s): Austin Clements
Last updated: 2018-08-09
Discussion at https://golang.org/issue/26903.
## Abstract
Go's garbage collector has evolved substantially over time, and as
with any software with a history, there are places where the vestigial
remnants of this evolution show.
This document proposes several related simplifications to the design
of Go's mark termination and related parts of concurrent marking that
were made possible by shifts in other parts of the garbage collector.
The keystone of these simplifications is a new mark completion
algorithm.
The current algorithm is racy and, as a result, mark termination must
cope with the possibility that there may still be marking work to do.
We propose a new algorithm based on distributed termination detection
that both eliminates this race and replaces the existing "mark 2"
sub-phase, yielding simplifications throughout concurrent mark and
mark termination.
This new mark completion algorithm combined with a few smaller changes
can simplify or completely eliminate several other parts of the
garbage collector. Hence, we propose to also:
1. Unify stop-the-world GC and checkmark mode with concurrent marking;
2. Flush mcaches after mark termination;
3. And allow safe-points without preemption in dedicated workers.
Taken together, these fairly small changes will allow us to eliminate
mark 2, "blacken promptly" mode, the second root marking pass,
blocking drain mode, `getfull` and its troublesome spin loop, work
queue draining during mark termination, `gchelper`, and idle worker
tracking.
This will eliminate a good deal of subtle code from the garbage
collector, making it simpler and more maintainable.
As an added bonus, it's likely to perform a little better, too.
## Background
Prior to Go 1.5, Go's garbage collector was a stop-the-world garbage
collector, and Go continues to support STW garbage collection as a
debugging mode.
Go 1.5 introduced a concurrent collector, but, in order to minimize a
massively invasive change, it kept much of the existing GC mechanism
as a STW "mark termination" phase, while adding a concurrent mark
phase before the STW phase.
This concurrent mark phase did as much as it could, but ultimately
fell back to the STW algorithm to clean up any work it left behind.
Until Go 1.8, concurrent marking always left behind at least some
work, since any stacks that had been modified during the concurrent
mark phase had to be re-scanned during mark termination.
Go 1.8 introduced [a new write barrier](17503-eliminate-rescan.md)
that eliminated the need to re-scan stacks.
This significantly reduced the amount of work that had to be done in
mark termination.
However, since it had never really mattered before, we were sloppy
about entering mark termination: the algorithm that decided when it
was time to enter mark termination *usually* waited until all work was
done by concurrent mark, but sometimes [work would slip
through](17503-eliminate-rescan.md#appendix-mark-completion-race),
leaving mark termination to clean up the mess.
Furthermore, in order to minimize (though not eliminate) the chance of
entering mark termination prematurely, Go 1.5 divided concurrent
marking into two phases creatively named "mark 1" and "mark 2".
During mark 1, when it ran out of global marking work, it would flush
and disable all local work caches (enabling "blacken promptly" mode)
and enter mark 2.
During mark 2, when it ran out of global marking work again, it would
enter mark termination.
Unfortunately, blacken promptly mode has performance implications
(there was a reason for those local caches), and this algorithm can
enter mark 2 very early in the GC cycle since it merely detects a work
bottleneck.
And while disabling all local caches was intended to prevent premature
mark termination, this doesn't always work.
## Proposal
There are several steps to this proposal, and it's not necessary to
implement all of them.
However, the crux of the proposal is a new termination detection
algorithm.
### Replace mark 2 with a race-free algorithm
We propose replacing mark 2 with a race-free algorithm based on ideas
from distributed termination detection [Matocha '97].
The GC maintains several work queues of grey objects to be blackened.
It maintains two global queues, one for root marking work and one for
heap objects, but we can think of these as a single logical queue.
It also maintains a queue of locally cached work on each *P* (that is,
each GC worker).
A P can move work from the global queue to its local queue or
vice-versa.
Scanning removes work from the local queue and may add work back to
the local queue.
This algorithm does not change the structure of the GC's work queues
from the current implementation.
A P *cannot* observe or remove work from another P's local queue.
A P also *cannot* create work from nothing: it must consume a marking
job in order to create more marking jobs.
This is critical to termination detection because it means termination
is a stable condition.
Furthermore, all of these actions must be *GC-atomic*; that is, there
are no safe-points within each of these actions.
Again, all of this is true of the current implementation.
The proposed algorithm is as follows:
First, each P, maintains a local *flushed* flag that it sets whenever
the P flushes any local GC work to the global queue.
The P may cache an arbitrary amount of GC work locally without setting
this flag; the flag indicates that it may have shared work with
another P.
This flag is only accessed synchronously, so it need not be atomic.
When a P's local queue is empty and the global queue is empty it runs
the termination detection algorithm:
1. Acquire a global termination detection lock (only one P may run
this algorithm at a time).
2. Check the global queue. If it is non-empty, we have not reached
termination, so abort the algorithm.
3. Execute a ragged barrier. On each P, when it reaches a safe-point,
1. Flush the local write barrier buffer.
This may mark objects and add pointers to the local work queue.
2. Flush the local work queue.
This may set the P's flushed flag.
3. Check and clear the P's flushed flag.
4. If any P's flushed flag was set, we have not reached termination,
so abort the algorithm.
If no P's flushed flag was set, enter mark termination.
Like most wave-based distributed termination algorithms, it may be
necessary to run this algorithm multiple times during a cycle.
However, this isn't necessarily a disadvantage: flushing the local
work queues also serves to balance work between Ps, and makes it okay
to keep work cached on a P that isn't actively doing GC work.
There are a few subtleties to this algorithm that are worth noting.
First, unlike many distributed termination algorithms, it *does not*
detect that no work was done since the previous barrier.
It detects that no work was *communicated*, and that all queues were
empty at some point.
As a result, while many similar algorithms require at least two waves
to detect termination [Hudson '97], this algorithm can detect
termination in a single wave.
For example, on small heaps it's possible for all Ps to work entirely
out of their local queues, in which case mark can complete after just
a single wave.
Second, while the only way to add work to the local work queue is by
consuming work, this is not true of the local write barrier buffer.
Since this buffer simply records pointer writes, and the recorded
objects may already be black, it can continue to grow after
termination has been detected.
However, once termination is detected, we know that all pointers in
the write barrier buffer must be to black objects, so this buffer can
simply be discarded.
#### Variations on the basic algorithm
There are several small variations on the basic algorithm that may be
desirable for implementation and efficiency reasons.
When flushing the local work queues during the ragged barrier, it may
be valuable to break up the work buffers that are put on the global
queue.
For efficiency, work is tracked in batches and the queues track these
batches, rather than individual marking jobs.
The ragged barrier is an excellent opportunity to break up these
batches to better balance work.
The ragged barrier places no constraints on the order in which Ps
flush, nor does it need to run on all Ps if some P has its local
flushed flag set.
One obvious optimization this allows is for the P that triggers
termination detection to flush its own queues and check its own
flushed flag before trying to interrupt other Ps.
If its own flushed flag is set, it can simply clear it and abort (or
retry) termination detection.
#### Consequences
This new termination detection algorithm replaces mark 2, which means
we no longer need blacken-promptly mode.
Hence, we can delete all code related to blacken-promptly mode.
It also eliminates the mark termination race, so, in concurrent mode,
mark termination no longer needs to detect this race and behave
differently.
However, we should probably continue to detect the race and panic, as
detecting the race is cheap and this is an excellent self-check.
### Unify STW GC and checkmark mode with concurrent marking
The next step in this proposal is to unify stop-the-world GC and
checkmark mode with concurrent marking.
Because of the GC's heritage from a STW collector, there are several
code paths that are specific to STW collection, even though STW is
only a debugging option at this point.
In fact, as we've made the collector more concurrent, more code paths
have become vestigial, existing only to support STW mode.
This adds complexity to the garbage collector and makes this debugging
mode less reliable (and less useful) as these code paths are poorly
tested.
We propose instead implementing STW collection by reusing the existing
concurrent collector, but simply telling the scheduler that all Ps
must run "dedicated GC workers".
Hence, while the world won't technically be stopped during marking, it
will effectively be stopped.
#### Consequences
Unifying STW into concurrent marking directly eliminates several code
paths specific to STW mode.
Most notably, concurrent marking currently has two root marking phases
and STW mode has a single root marking pass.
All three of these passes must behave differently.
Unifying STW and concurrent marking collapses all three passes into
one.
In conjunction with the new termination detection algorithm, this
eliminates the need for mark work draining during mark termination.
As a result, the write barrier does not need to be on during mark
termination, and we can eliminate blocking drain mode entirely.
Currently, blocking drain mode is only used if the mark termination
race happens or if we're in STW mode.
This in turn eliminates the troublesome spin loop in `getfull` that
implements blocking drain mode.
Specifically, this eliminates `work.helperDrainBlock`, `gcDrainBlock`
mode, `gcWork.get`, and `getfull`.
At this point, the `gcMark` function should be renamed, since it will
no longer have anything to do with marking.
Unfortunately, this isn't enough to eliminate work draining entirely
from mark termination, since the draining mechanism is also used to
flush mcaches during mark termination.
### Flush mcaches after mark termination
The third step in this proposal is to delay the flushing of mcaches
until after mark termination.
Each P has an mcache that tracks spans being actively allocated from
by that P.
Sweeping happens when a P brings a span into its mcache, or can happen
asynchronously as part of background sweeping.
Hence, the spans in mcaches must be flushed out in order to trigger
sweeping of those spans and to prevent a race between allocating from
an unswept span and the background sweeper sweeping that span.
While it's important for the mcaches to be flushed between enabling
the sweeper and allocating, it does not have to happen during mark
termination.
Hence, we propose to flush each P's mcache when that P returns from
the mark termination STW.
This is early enough to ensure no allocation can happen on that P,
parallelizes this flushing, and doesn't block other Ps during
flushing.
Combined with the first two steps, this eliminates the only remaining
use of work draining during mark termination, so we can eliminate mark
termination draining entirely, including `gchelper` and related
mechanisms (`mhelpgc`, `m.helpgc`, `helpgc`, `gcprocs`,
`needaddgcproc`, etc).
### Allow safe-points without preemption in dedicated workers
The final step of this proposal is to allow safe-points in dedicated
GC workers.
Currently, dedicated GC workers only reach a safe-point when there is
no more local or global work.
However, this interferes with the ragged barrier in the termination
detection algorithm (which can only run at a safe-point on each P).
As a result, it's only fruitful to run the termination detection
algorithm if there are no dedicated workers running, which in turn
requires tracking the number of running and idle workers, and may
delay work balancing.
By allowing more frequent safe-points in dedicated GC workers,
termination detection can run more eagerly.
Furthermore, worker tracking was based on the mechanism used by STW GC
to implement the `getfull` barrier.
Once that has also been eliminated, we no longer need any worker
tracking.
## Proof of termination detection algorithm
The proposed termination detection algorithm is remarkably simple to
implement, but subtle in its reasoning.
Here we prove it correct and endeavor to provide some insight into why
it works.
**Theorem.** The termination detection algorithm succeeds only if all
mark work queues are empty when the algorithm terminates.
**Proof.** Assume the termination detection algorithm succeeds.
In order to show that all mark work queues must be empty once the
algorithm succeeds, we use induction to show that all possible actions
must maintain three conditions: 1) the global queue is empty, 2) all
flushed flags are clear, and 3) after a P has been visited by the
ragged barrier, its local queue is empty.
First, we show that these conditions were true at the instant it
observed the global queue was empty.
This point in time trivially satisfies condition 1.
Since the algorithm succeeded, each P's flushed flag must have been
clear when the ragged barrier observed that P.
Because termination detection is the only operation that clears the
flushed flags, each flag must have been clear for all time between the
start of termination detection and when the ragged barrier observed
the flag.
In particular, all flags must have been clear at the instant it
observed that the global queue was empty, so condition 2 is satisfied.
Condition 3 is trivially satisfied at this point because no Ps have
been visited by the ragged barrier.
This establishes the base case for induction.
Next, we consider all possible actions that could affect the state of
the queue or the flags after this initial state.
There are four such actions:
1. The ragged barrier can visit a P.
This may modify the global queue, but if it does so it will set the
flushed flag and the algorithm will not succeed, contradicting the
assumption.
Thus it could not have modified the global queue, maintaining
condition 1.
For the same reason, we know it did not set the flushed flag,
maintaining condition 2.
Finally, the ragged barrier adds the P to the set of visited P, but
flushes the P's local queue, thus maintaining condition 3.
2. If the global queue is non-empty, a P can move work from the global
queue to its local queue.
By assumption, the global queue is empty, so this action can't
happen.
3. If its local queue is non-empty, a P can consume local work and
potentially produce local work.
This action does not modify the global queue or flushed flag, so it
maintains conditions 1 and 2.
If the P has not been visited by the ragged barrier, then condition
3 is trivially maintained.
If it has been visited, then by assumption the P's local queue is
empty, so this action can't happen.
4. If the local queue is non-empty, the P can move work from the local
queue to the global queue.
There are two sub-cases.
If the P has not been visited by the ragged barrier, then this
action would set the P's flushed flag, causing termination
detection to fail, which contradicts the assumption.
If the P has been visited by the ragged barrier, then its local
queue is empty, so this action can't happen.
Therefore, by induction, all three conditions must be true when
termination detection succeeds.
Notably, we've shown that once the ragged barrier is complete, none of
the per-P actions (2, 3, and 4) can happen.
Thus, if termination detection succeeds, then by conditions 1 and 3,
all mark work queues must be empty.
**Corollary.** Once the termination detection algorithm succeeds,
there will be no work to do in mark termination.
Go's GC never turns a black object grey because it uses black mutator
techniques (once a stack is black it remains black) and a
forward-progress barrier.
Since the mark work queues contain pointers to grey objects, it
follows that once the mark work queues are empty, they will remain
empty, including when the garbage collector transitions in to mark
termination.
## Compatibility
This proposal does not affect any user-visible APIs, so it is Go 1
compatible.
## Implementation
This proposal can be implemented incrementally, and each step opens up
new simplifications.
The first step will be to implement the new termination detection
algorithm, since all other simplifications build on that, but the
other steps can be implemented as convenient.
Austin Clements plans to implement all or most of this proposal for Go
1.12.
The actual implementation effort for each step is likely to be fairly
small (the mark termination algorithm was implemented and debugged in
under an hour).
## References
[Hudson '97] R. L. Hudson, R. Morrison, J. E. B. Moss, and D. S.
Munro. Garbage collecting the world: One car at a time. In *ACM
SIGPLAN Notices* 32(10):162–175, October 1997.
[Matocha '98] Jeff Matocha and Tracy Camp. "A taxonomy of distributed
termination detection algorithms." In *Journal of Systems and
Software* 43(3):207221, November 1998.