Author(s): Austin Clements, Rick Hudson
Last updated: 2016-10-18
Discussion at https://golang.org/issue/17505.
Note: We are not actually proposing this. This design was developed before proposal #17503, which is a dramatically simpler solution to the problem of stack re-scanning. We're posting this design doc for its historical value.
Since the release of the concurrent garbage collector in Go 1.5, each subsequent release has further reduced stop-the-world (STW) time by moving more tasks to the concurrent phase. As of Go 1.7, the only non-trivial STW task is stack re-scanning. We propose to make stack re-scanning concurrent for Go 1.8, likely resulting in sub-millisecond worst-case STW times.
Go's concurrent garbage collector consists of four phases: mark, mark termination, sweep, and sweep termination. The mark and sweep phases are concurrent, meaning that the application (the mutator) continues to run during these phases, while the mark termination and sweep termination phases are stop-the-world (STW), meaning that the garbage collector pauses the mutator for the duration of the phase.
Since Go 1.5, we've been steadily moving tasks from the STW phases to the concurrent phases, with a particular focus on tasks that take time proportional to something under application control, such as heap size or number of goroutines. As a result, in Go 1.7, most applications have sub-millisecond STW times.
As of Go 1.7, the only remaining application-controllable STW task is stack re-scanning. Because of this one task, applications with large numbers of active goroutines can still experience STW times in excess of 10ms.
Stack re-scanning is necessary because stacks are permagray in the Go garbage collector. Specifically, for performance reasons, there are no write barriers for writes to pointers in the current stack frame. As a result, even though the garbage collector scans all stacks at the beginning of the mark phase, it must re-scan all modified stacks with the world is stopped to catch any pointers the mutator “hid” on the stack.
Unfortunately, this makes STW time proportional to the total amount of stack that needs to be rescanned. Worse, stack scanning is relatively expensive (~5ms/MB). Hence, applications with a large number of active goroutines can quickly drive up STW time.
We propose to make stack re-scanning concurrent using a transitive mark write barrier.
In this design, we add a new concurrent phase between mark and mark termination called stack re-scan. This phase starts as soon as the mark phase has marked all objects reachable from roots other than stacks. The phase re-scans stacks that have been modified since their initial scan, and enables a special transitive mark write barrier.
Re-scanning and the write barrier ensure the following invariant during this phase:
After a goroutine stack G has been re-scanned, all objects locally reachable to G are black.
This depends on a goroutine-local notion of reachability, which is the set of objects reachable from globals or a given goroutine's stack or registers. Unlike regular global reachability, this is not stable: as goroutines modify heap pointers or communicate, an object that was locally unreachable to a given goroutine may become locally reachable. However, the concepts are closely related: a globally reachable object must be locally reachable by at least one goroutine, and, conversely, an object that is not locally reachable by any goroutine is not globally reachable.
This invariant ensures that re-scanning a stack blackens that stack, and that the stack remains black since the goroutine has no way to find a white object once its stack has been re-scanned.
Furthermore, once every goroutine stack has been re-scanned, marking is complete. Every globally reachable object must be locally reachable by some goroutine and, once every stack has been re-scanned, every object locally reachable by some goroutine is black, so it follows that every globally reachable object is black once every stack has been re-scanned.
The transitive mark write barrier for an assignment
*dst = src (where
src is a pointer) ensures that all objects reachable from
src are black before writing
*dst may make any object reachable from
src itself) locally reachable to some goroutine that has been re-scanned. Hence, to maintain the invariant, we must ensure these objects are all black.
To do this, the write barrier greys
src and then drains the mark work queue until there are no grey objects (using the same work queue logic that drives the mark phase). At this point, it writes
*dst and allows the goroutine to proceed.
The write barrier must not perform the write until all simultaneous write barriers are also ready to perform the write. We refer to this mark quiescence. To see why this is necessary, consider two simultaneous write barriers for
*D1 = S1 and
*D2 = S2 on an object graph that looks like this:
G1 [b] → D1 [b] S1 [w] ↘ O1 [w] → O2 [w] → O3 [w] ↗ D2 [b] S2 [w]
Goroutine G1 has been re-scanned (so D1 must be black), while Sn and On are all white.
Suppose the S2 write barrier blackens S2 and O1 and greys O2, then the S1 write barrier blackens S1 and observes that O1 is already black:
G1 [b] → D1 [b] S1 [b] ↘ O1 [b] → O2 [g] → O3 [w] ↗ D2 [b] S2 [b]
At this point, the S1 barrier has run out of local work, but the S2 barrier is still going. If S1 were to complete and write
*D1 = S1 at this point, it would make white object O3 reachable to goroutine G1, violating the invariant. Hence, the S1 barrier cannot complete until the S2 barrier is also ready to complete.
This requirement sounds onerous, but it can be achieved in a simple and reasonably efficient manner by sharing a global mark work queue between the write barriers. This reuses the existing mark work queue and quiescence logic and allows write barriers to help each other to completion.
The stack re-scan phase re-scans the stacks of all goroutines that have run since the initial stack scan to find pointers to white objects. The process of re-scanning a stack is identical to that of the initial scan, except that it must participate in mark quiescence. Specifically, the re-scanned goroutine must not resume execution until the system has reached mark quiescence (even if no white pointers are found on the stack). Otherwise, the same sorts of races that were described above are possible.
There are multiple ways to realize this. The whole stack scan could participate in mark quiescence, but this would block any contemporaneous stack scans or write barriers from completing during a stack scan if any white pointers were found. Alternatively, each white pointer found on the stack could participate individually in mark quiescence, blocking the stack scan at that pointer until mark quiescence, and the stack scan could again participate in mark quiescence once all frames had been scanned.
We propose an intermediate: gather small batches of white pointers from a stack at a time and reach mark quiescence on each batch individually, as well as at the end of the stack scan (even if the final batch is empty).
Goroutines that start during stack re-scanning cannot reach any white objects, so their stacks are immediately considered black.
Goroutines can also share pointers through channels, which are often implemented as direct stack-to-stack copies. Hence, channel receives also require write barriers in order to maintain the invariant. Channel receives already have write barriers to maintain stack barriers, so there is no additional work here.
The primary drawback of this approach to concurrent stack re-scanning is that a write barrier during re-scanning could introduce significant mutator latency if the transitive mark finds a large unmarked region of the heap, or if overlapping write barriers significantly delay mark quiescence. However, we consider this situation unlikely in non-adversarial applications. Furthermore, the resulting delay should be no worse than the mark termination STW time applications currently experience, since mark termination has to do exactly the same amount of marking work, in addition to the cost of stack scanning.
An alternative solution to concurrent stack re-scanning would be to adopt DMOS-style quiescence [Hudson '97]. In this approach, greying any object during stack re-scanning (either by finding a pointer to a white object on a stack or by installing a pointer to a white object in the heap) forces the GC to drain this marking work and restart the stack re-scanning phase.
This approach has a much simpler write barrier implementation that is constant time, so the write barrier would not induce significant mutator latency. However, unlike the proposed approach, the amount of work performed by DMOS-style stack re-scanning is potentially unbounded. This interacts poorly with Go's GC pacer. The pacer enforces the goal heap size making allocating and GC work proportional, but this requires an upper bound on possible GC work. As a result, if the pacer underestimates the amount of re-scanning work, it may need to block allocation entirely to avoid exceeding the goal heap size. This would be an effective STW.
There is also a hybrid solution: we could use the proposed transitive marking write barrier, but bound the amount of work it can do (and hence the latency it can induce). If the write barrier exceeds this bound, it performs a DMOS-style restart. This is likely to get the best of both worlds, but also inherits the sum of their complexity.
A final alternative would be to eliminate concurrent stack re-scanning entirely by adopting a deletion-style write barrier [Yuasa '90]. This style of write barrier allows the initial stack scan to blacken the stack, rather than merely greying it (still without the need for stack write barriers). For full details, see proposal #17503.
This proposal does not affect the language or any APIs and hence satisfies the Go 1 compatibility guidelines.
We do not plan to implement this proposal. Instead, we plan to implement proposal #17503.
The implementation steps are as follows:
While not strictly necessary, first make GC assists participate in stack scanning. Currently this is not possible, which increases mutator latency at the beginning of the GC cycle. This proposal would compound this effect by also blocking GC assists at the end of the GC cycle, causing an effective STW.
Modify the write barrier to be pre-publication instead of post-publication. Currently the write barrier occurs after the write of a pointer, but this proposal requires that the write barrier complete transitive marking before writing the pointer to its destination. A pre-publication barrier is also necessary for ROC.
Make the mark completion condition precise. Currently it's possible (albeit unlikely) to enter mark termination before all heap pointers have been marked. This proposal requires that we not start stack re-scanning until all objects reachable from globals are marked, which requires a precise completion condition.
Implement the transitive mark write barrier. This can reuse the existing work buffer pool lists and logic, including the global quiescence barrier in getfull. It may be necessary to improve the performance characteristics of the getfull barrier, since this proposal will lean far more heavily on this barrier than we currently do.
Check stack re-scanning code and make sure it is safe during non-STW. Since this only runs during STW right now, it may omit synchronization that will be necessary when running during non-STW. This is likely to be minimal, since most of the code is shared with the initial stack scan, which does run concurrently.
Make stack re-scanning participate in write barrier quiescence.
Create a new stack re-scanning phase. Make mark 2 completion transition to stack re-scanning instead of mark termination and enqueue stack re-scanning root jobs. Once all stack re-scanning jobs are complete, transition to mark termination.
We would like to thank Rhys Hiltner (@rhysh) for suggesting the idea of a transitive mark write barrier.
[Hudson '97] R. L. Hudson, R. Morrison, J. E. B. Moss, and D. S. Munro. Garbage collecting the world: One car at a time. In ACM SIGPLAN Notices 32(10):162–175, October 1997.
[Yuasa '90] T. Yuasa. Real-time garbage collection on general-purpose machines. Journal of Systems and Software, 11(3):181–198, 1990.