| # Proposal: Decentralized GC coordination |
| |
| Author(s): Austin Clements |
| |
| Last updated: 2015-10-25 |
| |
| Discussion at https://golang.org/issue/11970. |
| |
| ## Abstract |
| |
| The Go 1.5 GC is structured as a straight-line coordinator goroutine |
| plus several helper goroutines. All state transitions go through the |
| coordinator. This makes state transitions dependent on the scheduler, |
| which can delay transitions, in turn extending the length of the GC |
| cycle, blocking allocation, and occasionally leading to long delays. |
| |
| We propose to replace this straight-line coordinator with an explicit |
| state machine where state transitions can be performed by any |
| goroutine. |
| |
| ## Background |
| |
| As of Go 1.5, all GC phase changes are managed through straight-line |
| code in the `runtime.gc` function, which runs on a dedicated GC |
| goroutine. However, much of the real work is done in other goroutines. |
| These other goroutines generally detect when it is time for a phase |
| change and must coordinate with the main GC goroutine to effect this |
| phase change. This coordination delays phase changes and opens windows |
| where, for example, the mutator can allocate uncontrolled, or nothing |
| can be accomplished because everything is waiting on the coordinator |
| to wake up. This has led to bugs like |
| [#11677](https://golang.org/issue/11677) and |
| [#11911](https://golang.org/issue/11911). We've tried to mitigate this |
| by handing control directly to the coordinator goroutine when we wake |
| it up, but the scheduler isn't designed for this sort of explicit |
| co-routine scheduling, so this doesn't always work and it's more |
| likely to fall apart under stress than an explicit design. |
| |
| ## Proposal |
| |
| We will restructure the garbage collector as an explicit state machine |
| where any goroutine can effect a state transition. This is primarily |
| an implementation change, not an algorithm change: for the most part, |
| these states and the transitions between them closely follow the |
| current GC algorithm. |
| |
| Each state is global and determines the GC-related behavior of all |
| goroutines. Each state also has an exit condition. State transitions |
| are performed immediately by whatever goroutine detects that the |
| current state's exit condition is satisfied. Multiple goroutines may |
| detect an exit condition simultaneously, in which case none of these |
| goroutines may progress until the transition has been performed. For |
| many transitions, this is necessary to prevent runaway heap growth. |
| Each transition has a specific set of steps to prepare for the next |
| state and the system enters the next state as soon as those steps are |
| completed. Furthermore, each transition is designed to make the exit |
| condition that triggers that transition false so that the transition |
| happens once and only once per cycle. |
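
As a concrete illustration, the global state might be represented as
a single word that any goroutine can read, as in the minimal sketch
below. All identifiers here (`gcsketch`, `gcSweepOff`, `gcPhase`, and
so on) are hypothetical stand-ins, not the runtime's actual names.

```go
package gcsketch // hypothetical package, for illustration only

import "sync/atomic"

// The global states, in the order a cycle passes through them.
const (
	gcSweepOff uint32 = iota // sweeping only; no marking or assists
	gcMark1                  // concurrent mark, workbuf caches enabled
	gcMark2                  // concurrent mark, workbuf caches disabled
)

// gcPhase holds the current global state. It is written only while
// performing a transition, but any goroutine may read it.
var gcPhase uint32

func currentPhase() uint32 { return atomic.LoadUint32(&gcPhase) }
func setPhase(p uint32)    { atomic.StoreUint32(&gcPhase, p) }
```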
| |
| In principle, all of the goroutines that detect an exit condition |
| could assist in performing the transition. However, we take a simpler |
| approach where all transitions are protected by a global *transition |
| lock* and transitions are designed to perform very little non-STW |
work. When a goroutine detects the exit condition, it acquires the
transition lock and re-checks the exit condition: if it still holds,
the goroutine performs the transition; if not, it simply releases the
lock and continues executing in whatever the new state is. It is
necessary to re-check the condition, rather than simply check the
current state, in case the goroutine was blocked through an entire GC
cycle.
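
Continuing the hypothetical sketch above, the detect/lock/re-check
pattern might look like the following. `heapLive`, `trigger`, and
`sweepTermination` are stand-ins for the controller's heap accounting
and for the sweep termination transition described in the list below,
not real runtime APIs.

```go
package gcsketch

import "sync"

var transitionLock sync.Mutex

// tryTransition is called by any goroutine that observes an exit
// condition. The condition is re-checked under the lock: another
// goroutine may already have performed the transition, and the caller
// may even have been blocked through an entire GC cycle, so checking
// the current state alone would not be sufficient.
func tryTransition(exitCondition func() bool, performTransition func()) {
	transitionLock.Lock()
	if exitCondition() {
		// The transition makes exitCondition false, so it happens
		// exactly once per cycle even if many goroutines race here.
		performTransition()
	}
	transitionLock.Unlock()
	// Either way, the caller continues in whatever the new state is.
}

// Stand-ins for heap accounting and for the sweep termination
// transition described in the state list below.
func heapLive() uint64 { return 0 }
func trigger() uint64  { return 0 }
func sweepTermination() { setPhase(gcMark1) } // would also STW, sweep, etc.

// checkHeapTrigger is the check a mutator would run after allocating
// in the Sweep/Off state.
func checkHeapTrigger() {
	exit := func() bool {
		return currentPhase() == gcSweepOff && heapLive() >= trigger()
	}
	if exit() {
		tryTransition(exit, sweepTermination)
	}
}
```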
| |
| The sequence of states and transitions is as follows: |
| |
| * **State: Sweep/Off** This is the initial state of the system. No |
| scanning, marking, or assisting is performed. Mutators perform |
| proportional sweeping on allocation and background sweeping performs |
| additional sweeping on idle Ps. |
| |
| In this state, after allocating, a mutator checks if the heap size |
| has exceeded the GC trigger size and, if so, it performs concurrent |
| sweep termination by sweeping any remaining unswept spans (there |
| shouldn't be any for a heap-triggered transition). Once there are no |
| unswept spans, it performs the *sweep termination* transition. |
| Periodic (sysmon-triggered) GC and `runtime.GC` perform these same |
| steps regardless of the heap size. |
| |
| * **Transition: Sweep termination and initialization** Acquire |
| `worldsema`. Start background workers. Stop the world. Perform sweep |
| termination. Clear sync pools. Initialize GC state and statistics. |
| Enable write barriers, assists, and background workers. If this is a |
| concurrent GC, configure root marking, start the world, and enter |
| *concurrent mark*. If this is a STW GC (`runtime.GC`), continue with |
| the *mark termination* transition. |
| |
| * **State: Concurrent mark 1** In this state, background workers |
| perform concurrent scanning and marking and mutators perform |
| assists. |
| |
| Background workers initially participate in root marking and then |
| switch to draining heap mark work. |
| |
| Mutators assist with heap marking work in response to allocation |
| according to the assist ratio established by the GC controller. |
| |
In this state, the system keeps an atomic counter of the number of
active jobs, which includes the number of background workers and
assists with checked-out work buffers, plus the number of workers in
root marking jobs. If this number drops to zero and
`gcBlackenPromptly` is unset, the worker or assist that dropped it
to zero performs the transition to *concurrent mark 2*. Note that
it's important that this transition not happen until all root mark
jobs are done, which is why the counter includes them (see the
sketch after this list).
| |
| Note: Assists could participate in root marking jobs just like |
| background workers do and accumulate assist credit for this scanning |
| work. This would particularly help at the beginning of the cycle |
| when there may be little background credit or queued heap scan work. |
| This would also help with load balancing. In this case, we would |
| want to update `scanblock` to track scan credit and modify the scan |
| work estimate to include roots. |
| |
| * **Transition: Disable workbuf caching** Disable caching of workbufs |
| by setting `gcBlackenPromptly`. Queue root mark jobs for globals. |
| |
| Note: It may also make sense to queue root mark jobs for stacks. |
| This would require making it possible to re-scan a stack (and extend |
| existing stack barriers). |
| |
* **State: Concurrent mark 2** The goroutine that performed the
*disable workbuf caching* transition flushes all workbuf caches
using `forEachP`. This counts
| as an active job to prevent the next transition from happening |
| before this is done. |
| |
| Otherwise, this state is identical to *concurrent mark 1*, except |
| that workbuf caches are disabled. |
| |
Because workbuf caches are disabled, if the active job count drops
to zero, there is no more work. When this happens and
`gcBlackenPromptly` is set, the worker or assist that dropped the
count to zero performs the *mark termination* transition.
| |
| * **Transition: Mark termination** Stop the world. Unblock all parked |
| assists. Perform `gcMark`, checkmark (optionally), `gcSweep`, and |
| re-mark (optionally). Start the world. Release `worldsema`. Print GC |
| stats. Free stacks. |
| |
| Note that `gcMark` itself runs on all Ps, so this process is |
| parallel even though it happens during a transition. |
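
To make the active-job accounting in the two concurrent mark states
concrete, here is a sketch in the same hypothetical package.
`gcBlackenPromptly` mirrors the flag named in this proposal; every
other identifier is a stand-in, and synchronization of the flag is
elided for brevity.

```go
package gcsketch

import "sync/atomic"

// activeJobs counts background workers and assists holding
// checked-out work buffers, plus workers in root marking jobs. In
// concurrent mark 2, the forEachP workbuf flush also registers as a
// job so that mark termination cannot begin before the flush is done.
var activeJobs int64

// gcBlackenPromptly is set by the disable-workbuf-caching transition.
// (A real implementation would synchronize access to this flag.)
var gcBlackenPromptly bool

// startJob is called when a worker or assist checks out a work buffer
// or takes a root marking job.
func startJob() { atomic.AddInt64(&activeJobs, 1) }

// endJob is called when that work is returned. The goroutine whose
// decrement brings the count to zero has detected the exit condition,
// so it performs the next transition itself rather than waking a
// coordinator.
func endJob() {
	if atomic.AddInt64(&activeJobs, -1) != 0 {
		return
	}
	noWork := func() bool { return atomic.LoadInt64(&activeJobs) == 0 }
	if !gcBlackenPromptly {
		// Exit from concurrent mark 1. The transition sets
		// gcBlackenPromptly, falsifying this condition.
		tryTransition(func() bool { return noWork() && !gcBlackenPromptly },
			disableWorkbufCaching)
	} else {
		// Exit from concurrent mark 2: begin STW mark termination.
		tryTransition(func() bool { return noWork() && gcBlackenPromptly },
			markTermination)
	}
}

// Stand-ins for the transition bodies described above.
func disableWorkbufCaching() { gcBlackenPromptly = true }

func markTermination() {
	// Would stop the world, run gcMark and gcSweep, restart, etc.
	gcBlackenPromptly = false
	setPhase(gcSweepOff)
}
```

The re-check inside `tryTransition` also covers the race where new
mark work is queued and checked out between the decrement to zero and
the acquisition of the transition lock.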
| |
| ## Rationale |
| |
| There are various alternatives to this approach. The most obvious is |
| to simply continue with what we do now: a central GC coordinator with |
| hacks to deal with delays in various transitions. This is working |
| surprisingly well right now, but only as a result of a good deal of |
| engineering effort (primarily the cascade of fixes on |
| [#11677](https://github.com/golang/go/issues/11677)) and its fragility |
| makes it difficult to make further changes to the garbage collector. |
| |
Another approach would be to make the scheduler treat the GC coordinator
| as a high priority goroutine and always schedule it immediately when |
| it becomes runnable. This would consolidate several of our current |
| state transition "hacks", which attempt to help out the scheduler. |
| However, in a concurrent setting it's important to not only run the |
| coordinator as soon as possible to perform a state transition, but |
| also to disallow uncontrolled allocation on other threads while this |
| transition is being performed. Scheduler hacks don't address the |
| latter problem. |
| |
| ## Compatibility |
| |
| This change is internal to the Go runtime. It does not change any |
| user-facing Go APIs, and hence it satisfies Go 1 compatibility. |
| |
| ## Implementation |
| |
| This change will be implemented by Austin Clements, hopefully in the |
| Go 1.6 development cycle. Much of the design has already been |
| prototyped. |
| |
| Many of the prerequisite changes have already been completed. In |
| particular, we've already moved most of the non-STW work out of the GC |
| coordinator ([CL 16059](https://go-review.googlesource.com/#/c/16059/) |
| and [CL 16070](https://go-review.googlesource.com/#/c/16070/)), made |
| root marking jobs smaller |
| ([CL 16043](https://go-review.googlesource.com/#/c/16043)), and |
| improved the synchronization of blocked assists |
| ([CL 15890](https://go-review.googlesource.com/#/c/15890)). |
| |
| The GC coordinator will be converted to a decentralized state machine |
| incrementally, one state/transition at a time where possible. At the |
| end of this, there will be no work left in the GC coordinator and it |
| will be deleted. |
| |
| ## Open issues |
| |
| There are devils in the details. One known devil in the current |
| garbage collector that will affect this design in different ways is |
| the complex constraints on scheduling within the garbage collector |
| (and the runtime in general). For example, background workers are |
| currently not allowed to block, which means they can't stop the world |
| to perform mark termination. These constraints were designed for the |
| current coordinator-based system and we will need to find ways of |
| resolving them in the decentralized design. |