Proposal: Decentralized GC coordination

Author(s): Austin Clements

Last updated: 2015-10-25

Discussion at https://golang.org/issue/11970.

Abstract

The Go 1.5 GC is structured as a straight-line coordinator goroutine plus several helper goroutines. All state transitions go through the coordinator. This makes state transitions dependent on the scheduler, which can delay transitions, in turn extending the length of the GC cycle, blocking allocation, and occasionally leading to long delays.

We propose to replace this straight-line coordinator with an explicit state machine where state transitions can be performed by any goroutine.

Background

As of Go 1.5, all GC phase changes are managed through straight-line code in the runtime.gc function, which runs on a dedicated GC goroutine. However, much of the real work is done in other goroutines. These other goroutines generally detect when it is time for a phase change and must coordinate with the main GC goroutine to effect this phase change. This coordination delays phase changes and opens windows where, for example, the mutator can allocate uncontrolled, or nothing can be accomplished because everything is waiting on the coordinator to wake up. This has led to bugs like #11677 and #11911. We‘ve tried to mitigate this by handing control directly to the coordinator goroutine when we wake it up, but the scheduler isn’t designed for this sort of explicit co-routine scheduling, so this doesn‘t always work and it’s more likely to fall apart under stress than an explicit design.

Proposal

We will restructure the garbage collector as an explicit state machine where any goroutine can effect a state transition. This is primarily an implementation change, not an algorithm change: for the most part, these states and the transitions between them closely follow the current GC algorithm.

Each state is global and determines the GC-related behavior of all goroutines. Each state also has an exit condition. State transitions are performed immediately by whatever goroutine detects that the current state's exit condition is satisfied. Multiple goroutines may detect an exit condition simultaneously, in which case none of these goroutines may progress until the transition has been performed. For many transitions, this is necessary to prevent runaway heap growth. Each transition has a specific set of steps to prepare for the next state and the system enters the next state as soon as those steps are completed. Furthermore, each transition is designed to make the exit condition that triggers that transition false so that the transition happens once and only once per cycle.

In principle, all of the goroutines that detect an exit condition could assist in performing the transition. However, we take a simpler approach where all transitions are protected by a global transition lock and transitions are designed to perform very little non-STW work. When a goroutine detects the exit condition, it acquires the transition lock, re-checks if the exit condition is still true and, if not, simply releases the lock and continues executing in whatever the new state is. It is necessary to re-check the condition, rather than simply check the current state, in case the goroutine is blocked though an entire GC cycle.

The sequence of states and transitions is as follows:

State: Sweep/Off This is the initial state of the system. No scanning, marking, or assisting is performed. Mutators perform proportional sweeping on allocation and background sweeping performs additional sweeping on idle Ps.
In this state, after allocating, a mutator checks if the heap size has exceeded the GC trigger size and, if so, it performs concurrent sweep termination by sweeping any remaining unswept spans (there shouldn't be any for a heap-triggered transition). Once there are no unswept spans, it performs the sweep termination transition. Periodic (sysmon-triggered) GC and runtime.GC perform these same steps regardless of the heap size.
Transition: Sweep termination and initialization Acquire worldsema. Start background workers. Stop the world. Perform sweep termination. Clear sync pools. Initialize GC state and statistics. Enable write barriers, assists, and background workers. If this is a concurrent GC, configure root marking, start the world, and enter concurrent mark. If this is a STW GC (runtime.GC), continue with the mark termination transition.
State: Concurrent mark 1 In this state, background workers perform concurrent scanning and marking and mutators perform assists.
Background workers initially participate in root marking and then switch to draining heap mark work.
Mutators assist with heap marking work in response to allocation according to the assist ratio established by the GC controller.
In this state, the system keeps an atomic counter of the number of active jobs, which includes the number of background workers and assists with checked out work buffers, plus the number of workers in root marking jobs. If this number drops to zero and gcBlackenPromptly is unset, the worker or assist that dropped it to zero transitions to concurrent mark 2. Note that it's important that this transition not happen until all root mark jobs are done, which is why the counter includes this.
Note: Assists could participate in root marking jobs just like background workers do and accumulate assist credit for this scanning work. This would particularly help at the beginning of the cycle when there may be little background credit or queued heap scan work. This would also help with load balancing. In this case, we would want to update scanblock to track scan credit and modify the scan work estimate to include roots.
Transition: Disable workbuf caching Disable caching of workbufs by setting gcBlackenPromptly. Queue root mark jobs for globals.
Note: It may also make sense to queue root mark jobs for stacks. This would require making it possible to re-scan a stack (and extend existing stack barriers).
State: Concurrent mark 2 The goroutine that performed the flush transition flushes all workbuf caches using forEachP. This counts as an active job to prevent the next transition from happening before this is done.
Otherwise, this state is identical to concurrent mark 1, except that workbuf caches are disabled.
Because workbuf caches are disabled, if the active workbuf count drops to zero, there is no more work. When this happens and gcBlackenPromptly is set, the worker or assist that dropped it the count to zero performs the mark termination transition.
Transition: Mark termination Stop the world. Unblock all parked assists. Perform gcMark, checkmark (optionally), gcSweep, and re-mark (optionally). Start the world. Release worldsema. Print GC stats. Free stacks.
Note that gcMark itself runs on all Ps, so this process is parallel even though it happens during a transition.

Rationale

There are various alternatives to this approach. The most obvious is to simply continue with what we do now: a central GC coordinator with hacks to deal with delays in various transitions. This is working surprisingly well right now, but only as a result of a good deal of engineering effort (primarily the cascade of fixes on #11677) and its fragility makes it difficult to make further changes to the garbage collector.

Another approach would be make the scheduler treat the GC coordinator as a high priority goroutine and always schedule it immediately when it becomes runnable. This would consolidate several of our current state transition “hacks”, which attempt to help out the scheduler. However, in a concurrent setting it‘s important to not only run the coordinator as soon as possible to perform a state transition, but also to disallow uncontrolled allocation on other threads while this transition is being performed. Scheduler hacks don’t address the latter problem.

Compatibility

This change is internal to the Go runtime. It does not change any user-facing Go APIs, and hence it satisfies Go 1 compatibility.

Implementation

This change will be implemented by Austin Clements, hopefully in the Go 1.6 development cycle. Much of the design has already been prototyped.

Many of the prerequisite changes have already been completed. In particular, we've already moved most of the non-STW work out of the GC coordinator (CL 16059 and CL 16070), made root marking jobs smaller (CL 16043), and improved the synchronization of blocked assists (CL 15890).

The GC coordinator will be converted to a decentralized state machine incrementally, one state/transition at a time where possible. At the end of this, there will be no work left in the GC coordinator and it will be deleted.

Open issues

There are devils in the details. One known devil in the current garbage collector that will affect this design in different ways is the complex constraints on scheduling within the garbage collector (and the runtime in general). For example, background workers are currently not allowed to block, which means they can't stop the world to perform mark termination. These constraints were designed for the current coordinator-based system and we will need to find ways of resolving them in the decentralized design.