# Proposal: Eliminate STW stack re-scanning
Author(s): Austin Clements, Rick Hudson
Last updated: 2016-10-21
Discussion at https://golang.org/issue/17503.
## Abstract
As of Go 1.7, the one remaining source of unbounded and potentially
non-trivial stop-the-world (STW) time is stack re-scanning.
We propose to eliminate the need for stack re-scanning by switching to
a *hybrid write barrier* that combines a Yuasa-style deletion write
barrier [Yuasa '90] and a Dijkstra-style insertion write barrier
[Dijkstra '78].
Preliminary experiments show that this can reduce worst-case STW time
to under 50µs, and this approach may make it practical to eliminate
STW mark termination altogether.
Eliminating stack re-scanning will in turn simplify and eliminate many
other parts of the garbage collector that exist solely to improve the
performance of stack re-scanning.
This includes stack barriers (which introduce significant complexity
in many parts of the runtime) and maintenance of the re-scan list.
Hence, in addition to substantially improving STW time, the hybrid
write barrier should also reduce the overall complexity of the garbage
collector.
## Background
The Go garbage collector is a *tricolor* concurrent collector
[Dijkstra '78].
Every object is shaded either white, grey, or black.
At the beginning of a GC cycle, all objects are white, and it is the
goal of the garbage collector to mark all reachable objects black and
then free all white objects.
The garbage collector achieves this by shading GC roots (stacks and
globals, primarily) grey and then endeavoring to turn all grey objects
black while satisfying the *strong tricolor invariant*:
> No black object may contain a pointer to a white object.
Ensuring the tricolor invariant in the presence of concurrent pointer
updates requires *barriers* on either pointer reads or pointer writes
(or both).
There are many flavors of barrier [Pirinen '98].
Go 1.7 uses a coarsened Dijkstra write barrier [Dijkstra '78], where
pointer writes are implemented as follows:
```
writePointer(slot, ptr):
    shade(ptr)
    *slot = ptr
```
`shade(ptr)` marks the object at `ptr` grey if it is not already grey
or black.
This ensures the strong tricolor invariant by conservatively assuming
that `*slot` may be in a black object, and ensuring `ptr` cannot be
white before installing it in `*slot`.
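To make this concrete, the following is a minimal, self-contained Go
model of `shade` and the Dijkstra barrier; the `color`, `object`, and
`greyQueue` types here are illustrative stand-ins for the runtime's
mark bits and work buffers, not its actual data structures.
```
// A toy model of shade() and the Dijkstra insertion barrier.
package main

import "fmt"

type color int

const (
	white color = iota
	grey
	black
)

type object struct {
	shade color
	refs  []*object
}

// greyQueue holds grey objects awaiting scanning by the marker.
var greyQueue []*object

// shade greys the object if it is still white, making it visible
// to the concurrent marker.
func shade(obj *object) {
	if obj != nil && obj.shade == white {
		obj.shade = grey
		greyQueue = append(greyQueue, obj)
	}
}

// writePointer is the Dijkstra-style barrier from the text:
// shade the incoming pointer, then perform the store.
func writePointer(slot **object, ptr *object) {
	shade(ptr)
	*slot = ptr
}

func main() {
	container := &object{shade: black, refs: make([]*object, 1)}
	child := &object{} // white
	writePointer(&container.refs[0], child)
	fmt.Println(child.shade == grey) // true: ptr was shaded
}
```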
The Dijkstra barrier has several advantages over other types of
barriers.
It does not require any special handling of pointer reads, which has
performance advantages since pointer reads tend to outweigh pointer
writes by an order of magnitude or more.
It also ensures forward progress; unlike, for example, the Steele
write barrier [Steele '75], objects transition monotonically from
white to grey to black, so the total work is bounded by the heap size.
However, it also has disadvantages.
In particular, it presents a trade-off for pointers on stacks: either
writes to pointers on the stack must have write barriers, which is
prohibitively expensive, or stacks must be *permagrey*.
Go chooses the latter, which means that many stacks must be re-scanned
during STW.
The garbage collector first scans all stacks at the beginning of the
GC cycle to collect roots.
However, without stack write barriers, we can't ensure that the stack
won't later contain a reference to a white object, so a scanned stack
is only black until its goroutine executes again, at which point it
conservatively reverts to grey.
Thus, at the end of the cycle, the garbage collector must re-scan grey
stacks to blacken them and finish marking any remaining heap pointers.
Since it must ensure the stacks don't continue to change during this,
the whole re-scan process happens *with the world stopped*.
Re-scanning the stacks can take tens to hundreds of milliseconds in an
application with a large number of active goroutines.
## Proposal
We propose to eliminate stack re-scanning and replace Go's write
barrier with a *hybrid write barrier* that combines a Yuasa-style
deletion write barrier [Yuasa '90] with a Dijkstra-style insertion
write barrier [Dijkstra '78].
The hybrid write barrier is implemented as follows:
```
writePointer(slot, ptr):
    shade(*slot)
    if current stack is grey:
        shade(ptr)
    *slot = ptr
```
That is, the write barrier shades the object whose reference is being
overwritten, and, if the current goroutine's stack has not yet been
scanned, also shades the reference being installed.
The hybrid barrier makes stack re-scanning unnecessary; once a stack
has been scanned and blackened, it remains black.
Hence, it eliminates the need for stack re-scanning and the mechanisms
that exist to support stack re-scanning, including stack barriers and
the re-scan list.
The hybrid barrier requires that objects be allocated black
(allocate-white is a common policy, but incompatible with this
barrier).
However, although Go's current write barrier does not require it, Go
already allocates black for other reasons, so no change to allocation
is necessary.
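For illustration, allocate-black amounts to the following sketch,
where `gcMarkPhase` is an assumed stand-in for the runtime's "GC is
marking" flag and the types are illustrative:
```
// A toy sketch of allocate-black.
package main

type color int

const (
	white color = iota
	black
)

type object struct{ shade color }

// gcMarkPhase stands in for the runtime's "GC is marking" flag.
var gcMarkPhase bool

// allocate returns a new object, born black during marking so the
// collector never needs to scan or re-scan it this cycle.
func allocate() *object {
	obj := &object{shade: white}
	if gcMarkPhase {
		obj.shade = black
	}
	return obj
}

func main() {
	gcMarkPhase = true
	_ = allocate() // born black
}
```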
The hybrid write barrier is equivalent to the "double write barrier"
used in the adaptation of Metronome in the IBM real-time Java
implementation [Auerbach '07]. In that case, the garbage collector was
incremental, rather than concurrent, but ultimately had to deal with
the same problem of tightly bounded stop-the-world times.
### Reasoning
A full proof of the hybrid write barrier is given at the end of this
proposal.
Here we give the high-level intuition behind the barrier.
Unlike the Dijkstra write barrier, the hybrid barrier does *not*
satisfy the strong tricolor invariant: for example, a black goroutine
(a goroutine whose stack has been scanned) can write a pointer to a
white object into a black object without shading the white object.
However, it does satisfy the *weak tricolor invariant* [Pirinen '98]:
> Any white object pointed to by a black object is reachable from a
> grey object via a chain of white pointers (it is *grey-protected*).
The weak tricolor invariant observes that it's okay for a black object
to point to a white object, as long as *some* path ensures the garbage
collector will get around to marking that white object.
Any write barrier has to prohibit a mutator from "hiding" an object;
that is, rearranging the heap graph to violate the weak tricolor
invariant so the garbage collector fails to mark a reachable object.
For example, in a sense, the Dijkstra barrier allows a mutator to hide
a white object by moving the sole pointer to it to a stack that has
already been scanned.
The Dijkstra barrier addresses this by making stacks permagrey and
re-scanning them during STW.
In the hybrid barrier, the two shades and the condition work together
to prevent a mutator from hiding an object:
1. `shade(*slot)` prevents a mutator from hiding an object by moving
the sole pointer to it from the heap to its stack.
If it attempts to unlink an object from the heap, this will shade
it.
2. `shade(ptr)` prevents a mutator from hiding an object by moving the
sole pointer to it from its stack into a black object in the heap.
If it attempts to install the pointer into a black object, this
will shade it.
3. Once a goroutine's stack is black, the `shade(ptr)` becomes
unnecessary.
`shade(ptr)` prevents hiding an object by moving it from the stack
to the heap, but this requires first having a pointer hidden on the
stack.
Immediately after a stack is scanned, it only points to shaded
objects, so it's not hiding anything, and the `shade(*slot)`
prevents it from hiding any other pointers on its stack.
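As a concrete illustration of case 1, the following sketch (with
illustrative names) shows the hiding sequence that `shade(*slot)`
defends against:
```
// Illustration of the hiding hazard that shade(*slot) prevents.
// Without a deletion barrier, the sequence in mutator() would
// hide obj from a concurrent marker.
package main

type T struct{ payload int }

// global models a heap slot the collector has already scanned;
// it holds the sole pointer to its object.
var global = &T{payload: 42}

func mutator() int {
	// Copy the sole pointer onto this goroutine's already-black
	// stack...
	obj := global
	// ...then unlink it from the heap. With the hybrid barrier,
	// this write shades the old value (obj), so the marker still
	// finds it. Without shade(*slot), the marker would never see
	// obj again and could free a reachable object.
	global = nil
	return obj.payload
}

func main() { _ = mutator() }
```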
The hybrid barrier combines the best of the Dijkstra barrier and the
Yuasa barrier.
The Yuasa barrier requires a STW at the beginning of marking to either
scan or snapshot stacks, but does not require a re-scan at the end of
marking.
The Dijkstra barrier lets concurrent marking start right away, but
requires a STW at the end of marking to re-scan stacks (though more
sophisticated non-STW approaches are possible [Hudson '97]).
The hybrid barrier inherits the best properties of both, allowing
stacks to be concurrently scanned at the beginning of the mark phase,
while also keeping stacks black after this initial scan.
## Rationale
The advantage of the hybrid barrier is that it lets a stack scan
permanently blacken a stack (without a STW and without write barriers
to the stack), which entirely eliminates the need for stack
re-scanning, in turn eliminating the need for stack barriers and
re-scan lists.
Stack barriers in particular introduce significant complexity
throughout the runtime, as well as interfering with stack walks from
external tools such as GDB and kernel-based profilers.
Also, like the Dijkstra-style write barrier, the hybrid barrier does
not require a read barrier, so pointer reads are regular memory reads;
and it ensures progress, since objects progress monotonically from
white to grey to black.
The disadvantages of the hybrid barrier are minor.
It may result in more floating garbage, since it retains everything
reachable from roots (other than stacks) at any point during the mark
phase.
However, in practice it's likely that the current Dijkstra barrier is
retaining nearly as much.
The hybrid barrier also prohibits certain optimizations: in
particular, the Go compiler currently omits a write barrier if it can
statically show that the pointer is nil, but the hybrid barrier
requires a write barrier in this case.
This may slightly increase binary size.
### Alternative barrier approaches
There are several variations on the proposed barrier that would also
work, but we believe the proposed barrier represents the best set of
trade-offs.
A basic variation is to make the Dijkstra-style aspect of the barrier
unconditional:
```
writePointer(slot, ptr):
    shade(*slot)
    shade(ptr)
    *slot = ptr
```
The main advantage of this barrier is that it's easier to reason
about.
It directly ensures there are no black-to-white pointers in the heap,
so the only source of black-to-white pointers can be scanned stacks.
But once a stack is scanned, the only way it can get a white pointer
is by traversing reachable objects, and any white object that can be
reached by a goroutine with a black stack is grey-protected by a heap
object.
The disadvantage of this barrier is that it's effectively twice as
expensive as the proposed barrier for most of the mark phase.
Similarly, we could simply coarsen the stack condition:
```
writePointer(slot, ptr):
    shade(*slot)
    if any stack is grey:
        shade(ptr)
    *slot = ptr
```
This has the advantage of making cross-stack writes, such as those
allowed by channels, safe without any special handling, but it
prolongs the period during which the second shade is enabled, which
slows down pointer writes.
A different approach would be to require that all stacks be blackened
before any heap objects are blackened, which would enable a pure
Yuasa-style deletion barrier:
```
writePointer(slot, ptr):
    shade(*slot)
    *slot = ptr
```
As originally proposed, the Yuasa barrier takes a complete snapshot of
the stack before proceeding with marking.
Yuasa argued that this was reasonable on hardware that could perform
bulk memory copies very quickly.
However, Yuasa's proposal was in the context of a single-threaded
system with a comparatively small stack, while Go programs regularly
have thousands of stacks that can total to a large amount of memory.
However, this complete snapshot isn't necessary.
It's sufficient to ensure all stacks are black before scanning any
heap objects.
This allows stack scanning to proceed concurrently, but has the
downside that it introduces a bottleneck to the parallelism of the
mark phase between stack scanning and heap scanning.
This bottleneck has downstream effects on goroutine availability,
since allocation is paced against marking progress.
Finally, there are other types of *black mutator* barrier techniques.
However, as shown by Pirinen, all possible black mutator barriers
other than the Yuasa barrier require a read barrier [Pirinen '98].
Given the relative frequency of pointer reads to writes, we consider
this unacceptable for application performance.
### Alternative approaches to re-scanning
Going further afield, it's also possible to make stack re-scanning
concurrent without eliminating it [Hudson '97].
This does not require changes to the write barrier, but does introduce
significant additional complexity into stack re-scanning.
Proposal #17505 gives a detailed design for how to do concurrent stack
re-scanning in Go.
## Other considerations
### Channel operations and go statements
The hybrid barrier assumes a goroutine cannot write to another
goroutine's stack.
This is true in Go except for two operations: channel sends and
starting goroutines, which can copy values directly from one stack to
another.
For channel operations, the `shade(ptr)` is necessary if *either* the
source stack or the destination stack is grey.
For starting a goroutine, the destination stack is always black, so
the `shade(ptr)` is necessary if the source stack is grey.
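In sketch form, such a stack-to-stack copy might perform a check along
these lines; the `g` type and `stackColor` field are illustrative, and
the runtime's actual channel code differs:
```
// A sketch of the barrier check for copying a pointer directly
// from one goroutine's stack to another's, as a channel send can.
package main

type color int

const (
	white color = iota
	grey
	black
)

type g struct{ stackColor color }

type object struct{ shade color }

func shade(obj *object) {
	if obj != nil && obj.shade == white {
		obj.shade = grey
	}
}

// stackToStackCopy installs ptr into a slot on dst's stack.
// Unlike an ordinary heap write, the insertion shade is needed
// if *either* stack is still grey.
func stackToStackCopy(src, dst *g, dstSlot **object, ptr *object) {
	if src.stackColor == grey || dst.stackColor == grey {
		shade(ptr)
	}
	*dstSlot = ptr
}

func main() {
	sender := &g{stackColor: grey}
	receiver := &g{stackColor: black}
	var slot *object
	stackToStackCopy(sender, receiver, &slot, &object{})
}
```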
### Racy programs
In a racy program, two goroutines may store to the same pointer
simultaneously and invoke the write barrier concurrently on the same
slot.
The hazard is that this may cause the barrier to fail to shade some
object that it would have shaded in a sequential execution,
particularly given a relaxed memory model.
While racy Go programs are generally undefined, we have so far
maintained that a racy program cannot trivially defeat the soundness
of the garbage collector (since a racy program can defeat the type
system, it can technically do anything, but we try to keep the garbage
collector working as long as the program stays within the type
system).
Suppose *optr* is the value of the slot before either write to the
slot and *ptr1* and *ptr2* are the two pointers being written to the
slot.
"Before" is well-defined here because all architectures that Go
supports have *coherency*, which means there is a total order over all
reads and writes of a single memory location.
If the goroutines' respective stacks have not yet been scanned, then
*ptr1* and *ptr2* will clearly be shaded, since those shades don't
read from memory.
Hence, the difficult case is if both goroutines' stacks have been
scanned.
In this case, the barriers reduce to:
<table>
<tr><th>Goroutine G1</th><th>Goroutine G2</th></tr>
<tr><td>
<pre>optr1 = *slot
shade(optr1)
*slot = ptr1</pre>
</td><td>
<pre>optr2 = *slot
shade(optr2)
*slot = ptr2</pre>
</td></tr>
</table>
Given that we're only dealing with one memory location, the property
of coherence means we can reason about this execution as if it were
sequentially consistent.
Given this, concurrent execution of the write barriers permits one
outcome that is not permitted by sequential execution: if both
barriers read `*slot` before assigning to it, then only *optr* will be
shaded, and neither *ptr1* nor *ptr2* will be shaded by the barrier.
For example:
<table>
<tr><th>Goroutine G1</th><th>Goroutine G2</th></tr>
<tr><td>
<pre>optr1 = *slot

shade(optr1)

*slot = ptr1
</pre>
</td><td>
<pre>
optr2 = *slot

shade(optr2)

*slot = ptr2</pre>
</td></tr>
</table>
We assert that this is safe.
Suppose *ptr1* is written first.
This execution is *nearly* indistinguishable from an execution that
simply skips the write of *ptr1*.
The only way to distinguish it is if a read from another goroutine
*G3* observes *slot* between the two writes.
However, given our assumption that stacks have already been scanned,
either *ptr1* is already shaded, or it must be reachable from some
other place in the heap anyway (and will be shaded eventually), so
concurrently observing *ptr1* doesn't affect the marking or
reachability of *ptr1*.
### cgo
The hybrid barrier could be a problem if C code overwrites a Go
pointer in Go memory with either nil or a C pointer.
Currently, this operation does not require a barrier, but with any
sort of deletion barrier, this does require the barrier.
However, a program that does this would violate the cgo pointer
passing rules, since Go code is not allowed to pass memory to a C
function that contains Go pointers.
Furthermore, this is one of the "cheap dynamic checks" enabled by the
default setting of `GODEBUG=cgocheck=1`, so any program that violates
this rule will panic unless the checks have been explicitly disabled.
## Future directions
### Write barrier omission
The current write barrier can be omitted by the compiler in cases
where the compiler knows the pointer being written is permanently
shaded, such as nil pointers, pointers to globals, and pointers to
static data.
These optimizations are generally unsafe with the hybrid barrier.
However, if the compiler can show that *both* the current value of the
slot and the value being written are permanently shaded, then it can
still safely omit the write barrier.
This optimization is aided by the fact that newly allocated objects
are zeroed, so all pointer slots start out pointing to nil, which is
permanently shaded.
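For example, the store below involves only permanently shaded values,
so the barrier could still be omitted; this illustrates the rule, not
the compiler's actual analysis:
```
// An illustration of a store that remains barrier-elidable under
// the hybrid barrier: both the old and new values are permanently
// shaded.
package main

type node struct{ next *node }

// globalNode is static data, so &globalNode is permanently shaded.
var globalNode node

func example() *node {
	// n is freshly allocated, so n.next starts out nil, which is
	// also permanently shaded.
	n := new(node)
	// Old value: nil (shaded). New value: &globalNode (shaded).
	// The hybrid barrier could safely be omitted for this store.
	n.next = &globalNode
	return n
}

func main() { _ = example() }
```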
### Low-pause stack scans
Currently the garbage collector pauses a goroutine while scanning its
stack.
If goroutines have large stacks, this can introduce significant tail
latency effects.
The hybrid barrier and the removal of the existing stack barrier
mechanism would make it feasible to perform stack scans with only
brief goroutine pauses.
In this design, scanning a stack pauses the goroutine briefly while it
scans the active frame.
It then installs a *blocking stack barrier* over the return to the
next frame and lets the goroutine resume.
Stack scanning then continues toward the outer frames, moving the
stack barrier up the stack as it goes.
If the goroutine returns as far as the stack barrier before the scan
is complete, the barrier blocks the goroutine until the scan has
scanned that frame and moved the barrier further up the stack.
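In toy form, the scan/return coordination might look like the
following; the locking scheme and frame numbering are assumptions for
illustration, not a proposed implementation:
```
// A toy model of a blocking stack barrier. The scanner works
// outward from frame 0 (innermost) while the goroutine may try to
// return into a not-yet-scanned frame; if it gets there first, it
// blocks until the scanner passes.
package main

import "sync"

type stackScan struct {
	mu      sync.Mutex
	cond    *sync.Cond
	scanned int // frames [0, scanned) have been scanned
}

func newStackScan() *stackScan {
	s := &stackScan{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// scanFrame is called by the collector each time it finishes a
// frame, moving the barrier one frame up the stack.
func (s *stackScan) scanFrame() {
	s.mu.Lock()
	s.scanned++
	s.cond.Broadcast()
	s.mu.Unlock()
}

// returnInto is called by the goroutine before returning into
// frame i; it blocks while that frame is still unscanned.
func (s *stackScan) returnInto(i int) {
	s.mu.Lock()
	for i >= s.scanned {
		s.cond.Wait()
	}
	s.mu.Unlock()
}

func main() {
	s := newStackScan()
	go func() {
		for i := 0; i < 3; i++ {
			s.scanFrame()
		}
	}()
	s.returnInto(2) // blocks until frames 0..2 are scanned
}
```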
One complication is that a running goroutine could attempt to grow its
stack during the stack scan.
The simplest solution is to block the goroutine if this happens until
the scan is done.
Like the current stack barriers, this depends on write barriers when
writing through pointers to other frames.
For example, in a partially scanned stack, an active frame could use
an up-pointer to move a pointer to a white object out of an unscanned
frame and into the active frame.
Without a write barrier on the write that removes the pointer from the
unscanned frame, this could hide the white object from the garbage
collector.
However, with write barriers on up-pointers, this is safe.
Rather than arguing about "partially black" stacks, the write barrier
on up-pointers lets us view the stack as a sequence of separate
frames, where unscanned frames are treated as part of the *heap*.
Writes without write barriers can only happen to the active frame, so
we only have to view the active frame as the stack.
This design is technically possible now, but the complexity of
composing it with the existing stack barrier mechanism makes it
unappealing.
With the existing stack barriers gone, the implementation of this
approach becomes relatively straightforward.
It's also generally simpler than the existing stack barriers in many
dimensions, since there are at most two stack barriers per goroutine
at a time, and they are present only during the stack scan.
### Strictly bounded mark termination
This proposal goes a long way toward strictly bounding the time spent
in STW mark termination, but there are some other known causes of
longer mark termination pauses.
The primary cause is a race that can trigger mark termination while
there is still remaining heap mark work.
This race and how to resolve it are detailed in the "Mark completion
race" appendix.
### Concurrent mark termination
With stack re-scanning out of mark termination, it may become
practical to make the remaining tasks in mark termination concurrent
and eliminate the mark termination STW entirely.
On the other hand, the hybrid barrier may reduce STW so much that
completely eliminating it is not a practical concern.
The following is a probably incomplete list of remaining mark
termination tasks and how to address them.
Worker stack scans can be eliminated by having workers self-scan
during mark.
Scanning the finalizer queue can be eliminated by adding explicit
barriers to `queuefinalizer`.
Without these two scans (and with the fix for the mark completion race
detailed in the appendix), mark termination will produce no mark work,
so finishing the work queue drain also becomes unnecessary.
`mcache`s can be flushed lazily at the beginning of the sweep phase
using rolling synchronization.
Flushing the heap profile can be done immediately at the beginning of
sweep (this is already concurrent-safe, in fact).
Finally, updating global statistics can be done using atomics and
possibly a global memory barrier.
### Concurrent sweep termination
Likewise, it may be practical to eliminate the STW for sweep
termination.
This is slightly complicated by the fact that the hybrid barrier
requires a global memory fence at the beginning of the mark phase to
enable the write barrier and ensure all pointer writes prior to
enabling the write barrier are visible to the write barrier.
Currently, the STW for sweep termination and setting up the mark phase
accomplishes this.
If we were to make sweep termination concurrent, we could instead use
a ragged barrier to accomplish the global memory fence, or the
[`membarrier` syscall](http://man7.org/linux/man-pages/man2/membarrier.2.html)
on recent Linux kernels.
## Compatibility
This proposal does not affect the language or any APIs and hence
satisfies the Go 1 compatibility guidelines.
## Implementation
Austin plans to implement the hybrid barrier during the Go 1.8
development cycle.
For 1.8, we will leave stack re-scanning support in the runtime for
debugging purposes, but disable it by default using a `GODEBUG`
variable.
Assuming things go smoothly, we will remove stack re-scanning support
when the tree opens for Go 1.9 development.
The planned implementation approach is:
1. Fix places that do unusual or "clever" things with memory
containing pointers and make sure they cooperate with the hybrid
barrier.
We'll presumably find more of these as we debug in later steps, but
we'll have to make at least the following changes:
1. Ensure barriers on stack-to-stack copies for channel sends and
starting goroutines.
2. Check all places where we clear memory since the hybrid barrier
requires distinguishing between clearing for initialization and
clearing already-initialized memory.
This will require a barrier-aware `memclr` (see the sketch after
this list) and disabling the `duffzero` optimization for types
with pointers.
3. Check all uses of unmanaged memory in the runtime to make sure
it is initialized properly.
This is particularly important for pools of unmanaged memory
such as the fixalloc allocator that may reuse memory.
2. Implement concurrent scanning of background mark worker stacks.
Currently these are placed on the rescan list and *only* scanned
during mark termination, but we're going to disable the rescan
list.
We could arrange for background mark workers to scan their own
stacks, or explicitly keep track of heap pointers on background
mark worker stacks.
3. Modify the write barrier to implement the hybrid write barrier and
the compiler to disable write barrier elision optimizations that
aren't valid for the hybrid barrier.
4. Disable stack re-scanning by making rescan enqueuing a no-op unless
a `GODEBUG` variable is set.
Likewise, disable stack barrier insertion unless this variable is
set.
5. Use checkmark mode and stress testing to verify that no objects are
missed.
6. Wait for the Go 1.9 development cycle.
7. Remove stack re-scanning, the rescan list, stack barriers, and the
`GODEBUG` variable to enable re-scanning.
Possibly, switch to low-pause stack scans, which can reuse some of
the stack barrier mechanism.
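The following is a sketch of the "barrier-aware `memclr`" mentioned in
step 1.2 above: clearing already-initialized pointer memory must run
the deletion barrier on each old pointer, while clearing fresh memory
for initialization needs no barrier. The `shade` stub and slot layout
are assumptions for illustration.
```
// A sketch of a barrier-aware clear of already-initialized
// pointer memory.
package main

import "unsafe"

func shade(p unsafe.Pointer) {
	// Stand-in: grey the object at p if it is still white.
	_ = p
}

// clearPointerSlots zeroes n pointer-sized words starting at mem,
// shading each overwritten pointer as shade(*slot) requires.
func clearPointerSlots(mem unsafe.Pointer, n int) {
	slots := unsafe.Slice((*unsafe.Pointer)(mem), n)
	for i := range slots {
		if slots[i] != nil {
			shade(slots[i]) // deletion barrier on the old value
		}
		slots[i] = nil
	}
}

func main() {
	buf := make([]unsafe.Pointer, 4)
	x := 1
	buf[0] = unsafe.Pointer(&x)
	clearPointerSlots(unsafe.Pointer(&buf[0]), len(buf))
}
```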
## Appendix: Mark completion race
Currently, because of a race in the mark completion condition, the
garbage collector can begin mark termination when there is still
available mark work.
This is safe because mark termination will finish draining this work,
but it makes mark termination unbounded.
This also interferes with possible further optimizations that remove
all mark work from mark termination.
Specifically, the following interleaving starts mark termination
without draining all mark work:
Initially `workers` is zero and there is one buffer on the full list.
<table>
<tr><th>Thread 1</th><th>Thread 2</th></tr>
<tr><td><pre>
inc(&workers)
gcDrain(&work)
=> acquires only full buffer
=> adds more pointers to work
work.dispose()
=> returns buffer to full list
n := dec(&workers) [n=0]




if n == 0 &&                      [true]
   full.empty() &&                [true]
   markrootNext >= markrootJobs { [true]
  startMarkTerm()
}
</pre></td><td><pre>







inc(&workers)
gcDrain(&work)
=> acquires only full buffer
=> adds more pointers to work
...
</pre></td></tr></table>
In this example, a race between observing the `workers` count and
observing the state of the full list causes thread 1 to start mark
termination prematurely.
Simply checking `full.empty()` before decrementing `workers` exhibits
a similar race.
To fix this race, we propose introducing a single atomic non-zero
indicator for the number of non-empty work buffers.
Specifically, this will count the number of work caches that are
caching a non-empty work buffer plus one for a non-empty full list.
Many buffer list operations can be done without modifying this count,
so we believe it will not be highly contended.
If this does prove to be a scalability issue, there are well-known
techniques for scalable non-zero indicators [Ellen '07].
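A sketch of the corrected completion check, assuming such an indicator
(the names are illustrative, not the runtime's):
```
// A sketch of the corrected mark-completion check using a single
// atomic non-zero indicator for work availability. nonEmptyWork
// counts caches holding a non-empty buffer, plus one if the full
// list is non-empty.
package main

import "sync/atomic"

var (
	workers      int64 // active drain workers
	nonEmptyWork int64 // non-empty work sources (SNZI-style count)
	markrootNext int64
	markrootJobs int64
)

// maybeStartMarkTerm runs when a worker finishes draining.
// Consulting one atomic indicator avoids the race between
// observing the worker count and the full list separately.
func maybeStartMarkTerm(startMarkTerm func()) {
	if atomic.AddInt64(&workers, -1) == 0 &&
		atomic.LoadInt64(&nonEmptyWork) == 0 &&
		atomic.LoadInt64(&markrootNext) >= atomic.LoadInt64(&markrootJobs) {
		startMarkTerm()
	}
}

func main() {
	atomic.AddInt64(&workers, 1)
	maybeStartMarkTerm(func() { /* enter mark termination */ })
}
```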
## Appendix: Proof of soundness, completeness, and boundedness
<!-- TODO: Show that the only necessary memory fence is a global
!-- store/load fence between enabling the write barrier and
!-- blackening to ensure visibility of all pointers and the write
!-- barrier flag.
!-->
This section argues that the hybrid write barrier satisfies the weak
tricolor invariant, and hence is sound in the sense that it does not
collect reachable objects; that it terminates in a bounded number of
steps; and that it eventually collects all unreachable objects, and
hence is complete.
We have also further verified these properties using a
[randomized stateless model](https://github.com/aclements/go-misc/blob/master/go-weave/models/yuasa.go).
The following proofs consider global objects to be a subset of the
heap objects.
This is valid because the write barrier applies equally to global
objects.
Similarly, we omit explicit discussion of nil pointers, since the nil
pointer can be considered an always-black heap object of zero size.
The hybrid write barrier satisfies the *weak tricolor invariant*
[Pirinen '98].
However, rather than directly proving this, we prove that it satisfies
the following *modified tricolor invariant*:
> Any white object pointed to by a black object is grey-protected by a
> heap object (reachable via a chain of white pointers from the grey
> heap object).
> That is, for every *B -> W* edge, there is a path *G -> W₁ -> ⋯ ->
> Wₙ -> W* where *G* is a grey heap object and the *Wᵢ* are white.
This is identical to the weak tricolor invariant, except that it
requires that the grey-protector is a heap object.
This trivially implies the weak tricolor invariant, but gives us a
stronger basis for induction in the proof.
<!-- Thoughts on how to simplify the proof:
Perhaps define a more general notion of a "heap-protected object",
which is either black, grey, or grey-protected by a heap object.
-->
Lemma 1 establishes a simple property of paths we'll use several
times.
**Lemma 1.** In a path *O₁ -> ⋯ -> Oₙ* where *O₁* is a heap object,
all *Oᵢ* must be heap objects.
**Proof.** Since *O₁* is a heap object and heap objects can only point
to other heap objects, by induction, all *Oᵢ* must be heap objects.
In particular, if some object is grey-protected by a heap object,
every object in the grey-protecting path must be a heap object.
Lemma 2 extends the modified tricolor invariant to white objects that
are *indirectly* reachable from black objects.
**Lemma 2.** If the object graph satisfies the modified tricolor
invariant, then every white object reachable (directly or indirectly)
from a black object is grey-protected by a heap object.
<!-- Alternatively: every object reachable from a black object is
either black, grey, or grey-protected by a heap object. -->
**Proof.** Let *W* be a white object reachable from black object *B*
via simple path *B -> O₁ -> ⋯ -> Oₙ -> W*.
Note that *W* and all *Oᵢ* must be heap objects because stacks can
only point to themselves (in which case it would not be a simple path)
or heap objects, so *O₁* must be a heap object, and by lemma 1, the
rest of the path must be heap objects.
Without loss of generality, we can assume none of *Oᵢ* are black;
otherwise, we can simply reconsider using the shortest path suffix
that starts with a black object.
If there are no *Oᵢ*, *B* points directly to *W* and the modified
tricolor invariant directly implies that *W* is grey-protected by a
heap object.
If any *Oᵢ* is grey, then *W* is grey-protected by the last grey
object in the path.
Otherwise, all *Oᵢ* are white.
Since *O₁* is a white object pointed to by a black object, *O₁* is
grey-protected by some path *G -> W₁ -> ⋯ -> Wₙ -> O₁* where *G* is a
heap object.
Thus, *W* is grey-protected by *G -> W₁ -> ⋯ -> Wₙ -> O₁ -> ⋯ -> Oₙ ->
W*.
Lemma 3 builds on lemma 2 to establish properties of objects reachable
by goroutines.
**Lemma 3.** If the object graph satisfies the modified tricolor
invariant, then every white object reachable by a black goroutine (a
goroutine whose stack has been scanned) is grey-protected by a heap
object.
**Proof.** Let *W* be a white object reachable by a black goroutine.
If *W* is reachable from the goroutine's stack, then by lemma 2 *W* is
grey-protected by a heap object.
Otherwise, *W* must be reachable from a global *X*.
Let *O* be the last non-white object in the path from *X* to *W* (*O*
must exist because *X* itself is either grey or black).
If *O* is grey, then *O* is a heap object that grey-protects *W*.
Otherwise, *O* is black and by lemma 2, *W* is grey-protected by some
heap object.
Now we're ready to prove that the hybrid write barrier satisfies the
weak tricolor invariant, which implies it is *sound* (it marks all
reachable objects).
**Theorem 1.** The hybrid write barrier satisfies the weak tricolor
invariant.
**Proof.** We first show that the hybrid write barrier satisfies the
modified tricolor invariant.
The proof follows by induction over the operations that affect the
object graph or its coloring.
*Base case.* Initially there are no black objects, so the invariant
holds trivially.
*Write pointer in the heap.* Let *obj.slot := ptr* denote the write,
where *obj* is in the heap, and let *optr* denote the value of
*obj.slot* prior to the write.
Let *W* be a white object pointed to by a black object *B* after the
heap write.
There are two cases:
1. *B ≠ obj*: *W* was pointed to by the same black object *B* before
the write, and, by assumption, *W* was grey-protected by a path *G
-> W₁ -> ⋯ -> Wₙ -> W*, where *G* is a heap object.
If none of these edges are *obj.slot*, then *W* is still protected
by *G*.
Otherwise, the path must have included the edge *obj -> optr* and,
since the write barrier shades *optr*, *W* is grey-protected by
*optr* after the heap write.
2. *B = obj*: We first establish that *W* was grey-protected before
the write, which breaks down into two cases:
1. *W = ptr*: The goroutine must be black, because otherwise the
write barrier shades *ptr*, so it is not white.
*ptr* must have been reachable by the goroutine for it to write
it, so by lemma 3, *ptr* was grey-protected by some heap object
*G* prior to the write.
2. *W ≠ ptr*: *B* pointed to *W* before the write and, by
assumption, *W* was grey-protected by some heap object *G*
before the write.
Because *obj* was black before the write, it could not be in the
grey-protecting path from *G* to *W*, so this write did not affect
this path, so *W* is still grey-protected by *G*.
*Write pointer in a stack.* Let *stk.slot := ptr* denote the write.
Let *W* be a white object pointed to by a black object *B* after the
stack write.
We first establish that *W* was grey-protected before the stack write,
which breaks down into two cases:
1. *B = stk* and *W = ptr*: *W* may not have been pointed to by a
black object prior to the stack write (that is, the write may
create a new *B -> W* edge).
However, *ptr* must have been reachable by the goroutine, which is
black (because *B = stk*), so by lemma 3, *W* was grey-protected by
some heap object *G* prior to the write.
2. Otherwise: *W* was pointed to by the same black object *B* prior to
the stack write, so, by assumption, *W* was grey-protected by some
heap object *G* prior to the write.
By lemma 1, none of the objects in the grey-protecting path from heap
object *G* to *W* can be a stack, so the stack write does not modify
this path.
Hence, *W* is still grey-protected after the stack write by *G*.
*Scan heap object.* Let *obj* denote the scanned object.
Let *W* be a white object pointed to by a black object *B* after the scan.
*B* cannot be *obj* because immediately after the scan, *obj* does not
point to any white objects.
Thus, *B* must have been black and pointed to *W* before the scan as
well, so, by assumption, *W* was grey-protected by a path *G -> W₁ ->
⋯ -> Wₙ -> W*, where *G* is a heap object.
If some *Wᵢ* was an object pointed to by *obj*, then *W* is
grey-protected by *Wᵢ* after the scan.
Otherwise, *W* is still grey-protected by *G*.
*Stack scan.* This case is symmetric with scanning a heap object.
<!-- Old direct proof of stack scans:
*Stack scan.* Let *W* be an object pointed to by a black object *B*
after the stack scan.
Even though scanning stack *stk* blackens *stk*, *B* cannot be *stk*
because scanning greys all objects directly referenced by *stk*.
Hence, *W* was pointed to by the same black object *B* before the
stack scan, and by assumption was grey-protected by some path *G -> W₁
-> ⋯ -> Wₙ -> W* where *G* is a heap object.
By lemma 1, none of the objects in the grey-protecting path can be a
stack, so after the stack scan, *W* is either still grey-protected by
*G*, or some *Wᵢ* was greyed by the stack scan and *W* is now
grey-protected by *Wᵢ*.
-->
*Allocate an object.* Since new objects are allocated black and point
to nothing, the invariant trivially holds across allocation.
*Create a stack.* This case is symmetric with object allocation
because new stacks start out empty and hence are trivially black.
This completes the induction cases and shows that the hybrid write
barrier satisfies the modified tricolor invariant.
Since the modified tricolor invariant trivially implies the weak
tricolor invariant, the hybrid write barrier satisfies the weak
tricolor invariant.
The garbage collector is also *bounded*, meaning it eventually
terminates.
**Theorem 2.** A garbage collector using the hybrid write barrier
terminates in a finite number of marking steps.
**Proof.** We observe that objects progress strictly from white to
grey to black and, because new objects (including stacks) are
allocated black, the total marking work is bounded by the number of
objects at the beginning of garbage collection, which is finite.
Finally, the garbage collector is also *complete*, in the sense that
it eventually collects all unreachable objects.
This is trivial from the fact that the garbage collector cannot mark
any objects that are not reachable when the mark phase starts.
## References
[Auerbach '07] J. Auerbach, D. F. Bacon, B. Blainey, P. Cheng, M.
Dawson, M. Fulton, D. Grove, D. Hart, and M. Stoodley. Design and
implementation of a comprehensive real-time java virtual machine. In
*Proceedings of the 7th ACM & IEEE international conference on
Embedded software (EMSOFT '07)*, 249–258, 2007.
[Dijkstra '78] E. W. Dijkstra, L. Lamport, A. J. Martin, C. S.
Scholten, and E. F. Steffens. On-the-fly garbage collection: An
exercise in cooperation. *Communications of the ACM*, 21(11), 966–975,
1978.
[Ellen '07] F. Ellen, Y. Lev, V. Luchangco, and M. Moir. SNZI: Scalable
nonzero indicators. In *Proceedings of the 26th ACM SIGACT-SIGOPS
Symposium on Principles of Distributed Computing*, Portland, OR,
August 2007.
[Hudson '97] R. L. Hudson, R. Morrison, J. E. B. Moss, and D. S.
Munro. Garbage collecting the world: One car at a time. In *ACM
SIGPLAN Notices* 32(10):162–175, October 1997.
[Pirinen '98] P. P. Pirinen. Barrier techniques for incremental
tracing. In *ACM SIGPLAN Notices*, 34(3), 20–25, October 1998.
[Steele '75] G. L. Steele Jr. Multiprocessing compactifying garbage
collection. *Communications of the ACM*, 18(9), 495–508, 1975.
[Yuasa '90] T. Yuasa. Real-time garbage collection on general-purpose
machines. *Journal of Systems and Software*, 11(3):181–198, 1990.