Proposal: Simplify mark termination and eliminate mark 2

Author(s): Austin Clements

Last updated: 2018-08-09

Discussion at https://golang.org/issue/26903.

Abstract

Go‘s garbage collector has evolved substantially over time, and as with any software with a history, there are places where the vestigial remnants of this evolution show. This document proposes several related simplifications to the design of Go’s mark termination and related parts of concurrent marking that were made possible by shifts in other parts of the garbage collector.

The keystone of these simplifications is a new mark completion algorithm. The current algorithm is racy and, as a result, mark termination must cope with the possibility that there may still be marking work to do. We propose a new algorithm based on distributed termination detection that both eliminates this race and replaces the existing “mark 2” sub-phase, yielding simplifications throughout concurrent mark and mark termination.

This new mark completion algorithm combined with a few smaller changes can simplify or completely eliminate several other parts of the garbage collector. Hence, we propose to also:

Unify stop-the-world GC and checkmark mode with concurrent marking;
Flush mcaches after mark termination;
And allow safe-points without preemption in dedicated workers.

Taken together, these fairly small changes will allow us to eliminate mark 2, “blacken promptly” mode, the second root marking pass, blocking drain mode, getfull and its troublesome spin loop, work queue draining during mark termination, gchelper, and idle worker tracking.

This will eliminate a good deal of subtle code from the garbage collector, making it simpler and more maintainable. As an added bonus, it's likely to perform a little better, too.

Background

Prior to Go 1.5, Go's garbage collector was a stop-the-world garbage collector, and Go continues to support STW garbage collection as a debugging mode. Go 1.5 introduced a concurrent collector, but, in order to minimize a massively invasive change, it kept much of the existing GC mechanism as a STW “mark termination” phase, while adding a concurrent mark phase before the STW phase. This concurrent mark phase did as much as it could, but ultimately fell back to the STW algorithm to clean up any work it left behind.

Until Go 1.8, concurrent marking always left behind at least some work, since any stacks that had been modified during the concurrent mark phase had to be re-scanned during mark termination. Go 1.8 introduced a new write barrier that eliminated the need to re-scan stacks. This significantly reduced the amount of work that had to be done in mark termination. However, since it had never really mattered before, we were sloppy about entering mark termination: the algorithm that decided when it was time to enter mark termination usually waited until all work was done by concurrent mark, but sometimes work would slip through, leaving mark termination to clean up the mess.

Furthermore, in order to minimize (though not eliminate) the chance of entering mark termination prematurely, Go 1.5 divided concurrent marking into two phases creatively named “mark 1” and “mark 2”. During mark 1, when it ran out of global marking work, it would flush and disable all local work caches (enabling “blacken promptly” mode) and enter mark 2. During mark 2, when it ran out of global marking work again, it would enter mark termination. Unfortunately, blacken promptly mode has performance implications (there was a reason for those local caches), and this algorithm can enter mark 2 very early in the GC cycle since it merely detects a work bottleneck. And while disabling all local caches was intended to prevent premature mark termination, this doesn't always work.

Proposal

There are several steps to this proposal, and it's not necessary to implement all of them. However, the crux of the proposal is a new termination detection algorithm.

Replace mark 2 with a race-free algorithm

We propose replacing mark 2 with a race-free algorithm based on ideas from distributed termination detection [Matocha '97].

The GC maintains several work queues of grey objects to be blackened. It maintains two global queues, one for root marking work and one for heap objects, but we can think of these as a single logical queue. It also maintains a queue of locally cached work on each P (that is, each GC worker). A P can move work from the global queue to its local queue or vice-versa. Scanning removes work from the local queue and may add work back to the local queue. This algorithm does not change the structure of the GC's work queues from the current implementation.

A P cannot observe or remove work from another P's local queue. A P also cannot create work from nothing: it must consume a marking job in order to create more marking jobs. This is critical to termination detection because it means termination is a stable condition. Furthermore, all of these actions must be GC-atomic; that is, there are no safe-points within each of these actions. Again, all of this is true of the current implementation.

The proposed algorithm is as follows:

First, each P, maintains a local flushed flag that it sets whenever the P flushes any local GC work to the global queue. The P may cache an arbitrary amount of GC work locally without setting this flag; the flag indicates that it may have shared work with another P. This flag is only accessed synchronously, so it need not be atomic.

When a P's local queue is empty and the global queue is empty it runs the termination detection algorithm:

Acquire a global termination detection lock (only one P may run this algorithm at a time).
Check the global queue. If it is non-empty, we have not reached termination, so abort the algorithm.
Execute a ragged barrier. On each P, when it reaches a safe-point,
1. Flush the local write barrier buffer. This may mark objects and add pointers to the local work queue.
2. Flush the local work queue. This may set the P's flushed flag.
3. Check and clear the P's flushed flag.
If any P‘s flushed flag was set, we have not reached termination, so abort the algorithm. If no P’s flushed flag was set, enter mark termination.

Like most wave-based distributed termination algorithms, it may be necessary to run this algorithm multiple times during a cycle. However, this isn‘t necessarily a disadvantage: flushing the local work queues also serves to balance work between Ps, and makes it okay to keep work cached on a P that isn’t actively doing GC work.

There are a few subtleties to this algorithm that are worth noting. First, unlike many distributed termination algorithms, it does not detect that no work was done since the previous barrier. It detects that no work was communicated, and that all queues were empty at some point. As a result, while many similar algorithms require at least two waves to detect termination [Hudson ‘97], this algorithm can detect termination in a single wave. For example, on small heaps it’s possible for all Ps to work entirely out of their local queues, in which case mark can complete after just a single wave.

Second, while the only way to add work to the local work queue is by consuming work, this is not true of the local write barrier buffer. Since this buffer simply records pointer writes, and the recorded objects may already be black, it can continue to grow after termination has been detected. However, once termination is detected, we know that all pointers in the write barrier buffer must be to black objects, so this buffer can simply be discarded.

Variations on the basic algorithm

There are several small variations on the basic algorithm that may be desirable for implementation and efficiency reasons.

When flushing the local work queues during the ragged barrier, it may be valuable to break up the work buffers that are put on the global queue. For efficiency, work is tracked in batches and the queues track these batches, rather than individual marking jobs. The ragged barrier is an excellent opportunity to break up these batches to better balance work.

The ragged barrier places no constraints on the order in which Ps flush, nor does it need to run on all Ps if some P has its local flushed flag set. One obvious optimization this allows is for the P that triggers termination detection to flush its own queues and check its own flushed flag before trying to interrupt other Ps. If its own flushed flag is set, it can simply clear it and abort (or retry) termination detection.

Consequences

This new termination detection algorithm replaces mark 2, which means we no longer need blacken-promptly mode. Hence, we can delete all code related to blacken-promptly mode.

It also eliminates the mark termination race, so, in concurrent mode, mark termination no longer needs to detect this race and behave differently. However, we should probably continue to detect the race and panic, as detecting the race is cheap and this is an excellent self-check.

Unify STW GC and checkmark mode with concurrent marking

The next step in this proposal is to unify stop-the-world GC and checkmark mode with concurrent marking. Because of the GC‘s heritage from a STW collector, there are several code paths that are specific to STW collection, even though STW is only a debugging option at this point. In fact, as we’ve made the collector more concurrent, more code paths have become vestigial, existing only to support STW mode. This adds complexity to the garbage collector and makes this debugging mode less reliable (and less useful) as these code paths are poorly tested.

We propose instead implementing STW collection by reusing the existing concurrent collector, but simply telling the scheduler that all Ps must run “dedicated GC workers”. Hence, while the world won't technically be stopped during marking, it will effectively be stopped.

Consequences

Unifying STW into concurrent marking directly eliminates several code paths specific to STW mode. Most notably, concurrent marking currently has two root marking phases and STW mode has a single root marking pass. All three of these passes must behave differently. Unifying STW and concurrent marking collapses all three passes into one.

In conjunction with the new termination detection algorithm, this eliminates the need for mark work draining during mark termination. As a result, the write barrier does not need to be on during mark termination, and we can eliminate blocking drain mode entirely. Currently, blocking drain mode is only used if the mark termination race happens or if we're in STW mode. This in turn eliminates the troublesome spin loop in getfull that implements blocking drain mode. Specifically, this eliminates work.helperDrainBlock, gcDrainBlock mode, gcWork.get, and getfull.

At this point, the gcMark function should be renamed, since it will no longer have anything to do with marking.

Unfortunately, this isn't enough to eliminate work draining entirely from mark termination, since the draining mechanism is also used to flush mcaches during mark termination.

Flush mcaches after mark termination

The third step in this proposal is to delay the flushing of mcaches until after mark termination.

Each P has an mcache that tracks spans being actively allocated from by that P. Sweeping happens when a P brings a span into its mcache, or can happen asynchronously as part of background sweeping. Hence, the spans in mcaches must be flushed out in order to trigger sweeping of those spans and to prevent a race between allocating from an unswept span and the background sweeper sweeping that span.

While it's important for the mcaches to be flushed between enabling the sweeper and allocating, it does not have to happen during mark termination.

Hence, we propose to flush each P‘s mcache when that P returns from the mark termination STW. This is early enough to ensure no allocation can happen on that P, parallelizes this flushing, and doesn’t block other Ps during flushing.

Combined with the first two steps, this eliminates the only remaining use of work draining during mark termination, so we can eliminate mark termination draining entirely, including gchelper and related mechanisms (mhelpgc, m.helpgc, helpgc, gcprocs, needaddgcproc, etc).

Allow safe-points without preemption in dedicated workers

The final step of this proposal is to allow safe-points in dedicated GC workers. Currently, dedicated GC workers only reach a safe-point when there is no more local or global work. However, this interferes with the ragged barrier in the termination detection algorithm (which can only run at a safe-point on each P). As a result, it's only fruitful to run the termination detection algorithm if there are no dedicated workers running, which in turn requires tracking the number of running and idle workers, and may delay work balancing.

By allowing more frequent safe-points in dedicated GC workers, termination detection can run more eagerly.

Furthermore, worker tracking was based on the mechanism used by STW GC to implement the getfull barrier. Once that has also been eliminated, we no longer need any worker tracking.

Proof of termination detection algorithm

The proposed termination detection algorithm is remarkably simple to implement, but subtle in its reasoning. Here we prove it correct and endeavor to provide some insight into why it works.

Theorem. The termination detection algorithm succeeds only if all mark work queues are empty when the algorithm terminates.

Proof. Assume the termination detection algorithm succeeds. In order to show that all mark work queues must be empty once the algorithm succeeds, we use induction to show that all possible actions must maintain three conditions: 1) the global queue is empty, 2) all flushed flags are clear, and 3) after a P has been visited by the ragged barrier, its local queue is empty.

First, we show that these conditions were true at the instant it observed the global queue was empty. This point in time trivially satisfies condition 1. Since the algorithm succeeded, each P's flushed flag must have been clear when the ragged barrier observed that P. Because termination detection is the only operation that clears the flushed flags, each flag must have been clear for all time between the start of termination detection and when the ragged barrier observed the flag. In particular, all flags must have been clear at the instant it observed that the global queue was empty, so condition 2 is satisfied. Condition 3 is trivially satisfied at this point because no Ps have been visited by the ragged barrier. This establishes the base case for induction.

Next, we consider all possible actions that could affect the state of the queue or the flags after this initial state. There are four such actions:

The ragged barrier can visit a P. This may modify the global queue, but if it does so it will set the flushed flag and the algorithm will not succeed, contradicting the assumption. Thus it could not have modified the global queue, maintaining condition 1. For the same reason, we know it did not set the flushed flag, maintaining condition 2. Finally, the ragged barrier adds the P to the set of visited P, but flushes the P's local queue, thus maintaining condition 3.
If the global queue is non-empty, a P can move work from the global queue to its local queue. By assumption, the global queue is empty, so this action can't happen.
If its local queue is non-empty, a P can consume local work and potentially produce local work. This action does not modify the global queue or flushed flag, so it maintains conditions 1 and 2. If the P has not been visited by the ragged barrier, then condition 3 is trivially maintained. If it has been visited, then by assumption the P‘s local queue is empty, so this action can’t happen.
If the local queue is non-empty, the P can move work from the local queue to the global queue. There are two sub-cases. If the P has not been visited by the ragged barrier, then this action would set the P‘s flushed flag, causing termination detection to fail, which contradicts the assumption. If the P has been visited by the ragged barrier, then its local queue is empty, so this action can’t happen.

Therefore, by induction, all three conditions must be true when termination detection succeeds. Notably, we've shown that once the ragged barrier is complete, none of the per-P actions (2, 3, and 4) can happen. Thus, if termination detection succeeds, then by conditions 1 and 3, all mark work queues must be empty.

Corollary. Once the termination detection algorithm succeeds, there will be no work to do in mark termination.

Go's GC never turns a black object grey because it uses black mutator techniques (once a stack is black it remains black) and a forward-progress barrier. Since the mark work queues contain pointers to grey objects, it follows that once the mark work queues are empty, they will remain empty, including when the garbage collector transitions in to mark termination.

Compatibility

This proposal does not affect any user-visible APIs, so it is Go 1 compatible.

Implementation

This proposal can be implemented incrementally, and each step opens up new simplifications. The first step will be to implement the new termination detection algorithm, since all other simplifications build on that, but the other steps can be implemented as convenient.

Austin Clements plans to implement all or most of this proposal for Go 1.12. The actual implementation effort for each step is likely to be fairly small (the mark termination algorithm was implemented and debugged in under an hour).

References

[Hudson '97] R. L. Hudson, R. Morrison, J. E. B. Moss, and D. S. Munro. Garbage collecting the world: One car at a time. In ACM SIGPLAN Notices 32(10):162–175, October 1997.

[Matocha '98] Jeff Matocha and Tracy Camp. “A taxonomy of distributed termination detection algorithms.” In Journal of Systems and Software 43(3):207–221, November 1998.