# Proposal: Eliminate STW stack re-scanning
Author(s): Austin Clements, Rick Hudson
Last updated: 2016-10-21
Discussion at https://golang.org/issue/17503.
## Abstract
As of Go 1.7, the one remaining source of unbounded and potentially
non-trivial stop-the-world (STW) time is stack re-scanning.
We propose to eliminate the need for stack re-scanning by switching to
a *hybrid write barrier* that combines a Yuasa-style deletion write
barrier [Yuasa '90] and a Dijkstra-style insertion write barrier
[Dijkstra '78].
Preliminary experiments show that this can reduce worst-case STW time
to under 50µs, and this approach may make it practical to eliminate
STW mark termination altogether.
Eliminating stack re-scanning will in turn simplify and eliminate many
other parts of the garbage collector that exist solely to improve the
performance of stack re-scanning.
This includes stack barriers (which introduce significant complexity
in many parts of the runtime) and maintenance of the re-scan list.
Hence, in addition to substantially improving STW time, the hybrid
write barrier should also reduce the overall complexity of the garbage
collector.
## Background
The Go garbage collector is a *tricolor* concurrent collector
[Dijkstra '78].
Every object is shaded either white, grey, or black.
At the beginning of a GC cycle, all objects are white, and it is the
goal of the garbage collector to mark all reachable objects black and
then free all white objects.
The garbage collector achieves this by shading GC roots (stacks and
globals, primarily) grey and then endeavoring to turn all grey objects
black while satisfying the *strong tricolor invariant*:
> No black object may contain a pointer to a white object.
Ensuring the tricolor invariant in the presence of concurrent pointer
updates requires *barriers* on either pointer reads or pointer writes
(or both).
There are many flavors of barrier [Pirinen '98].
Go 1.7 uses a coarsened Dijkstra write barrier [Dijkstra '78], where
pointer writes are implemented as follows:
```
writePointer(slot, ptr):
    shade(ptr)
    *slot = ptr
```
`shade(ptr)` marks the object at `ptr` grey if it is not already grey
or black.
This ensures the strong tricolor invariant by conservatively assuming
that `*slot` may be in a black object, and ensuring `ptr` cannot be
white before installing it in `*slot`.
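To make this concrete, the following is a minimal, self-contained Go
model of `shade` and the Dijkstra barrier; the `color`, `object`, and
`greyQueue` types here are illustrative stand-ins for the runtime's
mark bits and work buffers, not its actual data structures.
```
// A toy model of shade() and the Dijkstra insertion barrier.
package main

import "fmt"

type color int

const (
	white color = iota
	grey
	black
)

type object struct {
	shade color
	refs  []*object
}

// greyQueue holds grey objects awaiting scanning by the marker.
var greyQueue []*object

// shade greys the object if it is still white, making it visible
// to the concurrent marker.
func shade(obj *object) {
	if obj != nil && obj.shade == white {
		obj.shade = grey
		greyQueue = append(greyQueue, obj)
	}
}

// writePointer is the Dijkstra-style barrier from the text:
// shade the incoming pointer, then perform the store.
func writePointer(slot **object, ptr *object) {
	shade(ptr)
	*slot = ptr
}

func main() {
	container := &object{shade: black, refs: make([]*object, 1)}
	child := &object{} // white
	writePointer(&container.refs[0], child)
	fmt.Println(child.shade == grey) // true: ptr was shaded
}
```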
The Dijkstra barrier has several advantages over other types of
barriers.
It does not require any special handling of pointer reads, which has
performance advantages since pointer reads tend to outweigh pointer
writes by an order of magnitude or more.
It also ensures forward progress; unlike, for example, the Steele
write barrier [Steele '75], objects transition monotonically from
white to grey to black, so the total work is bounded by the heap size.
However, it also has disadvantages.
In particular, it presents a trade-off for pointers on stacks: either
writes to pointers on the stack must have write barriers, which is
prohibitively expensive, or stacks must be *permagrey*.
Go chooses the latter, which means that many stacks must be re-scanned
during STW.
The garbage collector first scans all stacks at the beginning of the
GC cycle to collect roots.
However, without stack write barriers, we can't ensure that the stack
won't later contain a reference to a white object, so a scanned stack
is only black until its goroutine executes again, at which point it
conservatively reverts to grey.
Thus, at the end of the cycle, the garbage collector must re-scan grey
stacks to blacken them and finish marking any remaining heap pointers.
Since it must ensure the stacks don't continue to change during this,
the whole re-scan process happens *with the world stopped*.
Re-scanning the stacks can take tens to hundreds of milliseconds in an
application with a large number of active goroutines.
## Proposal
We propose to eliminate stack re-scanning and replace Go's write
barrier with a *hybrid write barrier* that combines a Yuasa-style
deletion write barrier [Yuasa '90] with a Dijkstra-style insertion
write barrier [Dijkstra '78].
The hybrid write barrier is implemented as follows:
```
writePointer(slot, ptr):
    shade(*slot)
    if current stack is grey:
        shade(ptr)
    *slot = ptr
```
That is, the write barrier shades the object whose reference is being
overwritten, and, if the current goroutine's stack has not yet been
scanned, also shades the reference being installed.
The hybrid barrier makes stack re-scanning unnecessary; once a stack
has been scanned and blackened, it remains black.
Hence, it eliminates the need for stack re-scanning and the mechanisms
that exist to support stack re-scanning, including stack barriers and
the re-scan list.
The hybrid barrier requires that objects be allocated black
(allocate-white is a common policy, but incompatible with this
barrier).
However, although Go's current write barrier does not require it, Go
already allocates black for other reasons, so no change to allocation
is necessary.
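For illustration, allocate-black amounts to the following sketch,
where `gcMarkPhase` is an assumed stand-in for the runtime's "GC is
marking" flag and the types are illustrative:
```
// A toy sketch of allocate-black.
package main

type color int

const (
	white color = iota
	black
)

type object struct{ shade color }

// gcMarkPhase stands in for the runtime's "GC is marking" flag.
var gcMarkPhase bool

// allocate returns a new object, born black during marking so the
// collector never needs to scan or re-scan it this cycle.
func allocate() *object {
	obj := &object{shade: white}
	if gcMarkPhase {
		obj.shade = black
	}
	return obj
}

func main() {
	gcMarkPhase = true
	_ = allocate() // born black
}
```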
The hybrid write barrier is equivalent to the "double write barrier"
used in the adaptation of Metronome in the IBM real-time Java
implementation [Auerbach '07]. In that case, the garbage collector was
incremental, rather than concurrent, but ultimately had to deal with
the same problem of tightly bounded stop-the-world times.
### Reasoning
A full proof of the hybrid write barrier is given at the end of this
proposal.
Here we give the high-level intuition behind the barrier.
Unlike the Dijkstra write barrier, the hybrid barrier does *not*
satisfy the strong tricolor invariant: for example, a black goroutine
(a goroutine whose stack has been scanned) can write a pointer to a
white object into a black object without shading the white object.
However, it does satisfy the *weak tricolor invariant* [Pirinen '98]:
> Any white object pointed to by a black object is reachable from a
> grey object via a chain of white pointers (it is *grey-protected*).
The weak tricolor invariant observes that it's okay for a black object
to point to a white object, as long as *some* path ensures the garbage
collector will get around to marking that white object.
Any write barrier has to prohibit a mutator from "hiding" an object;
that is, rearranging the heap graph to violate the weak tricolor
invariant so the garbage collector fails to mark a reachable object.
For example, in a sense, the Dijkstra barrier allows a mutator to hide
a white object by moving the sole pointer to it to a stack that has
already been scanned.
The Dijkstra barrier addresses this by making stacks permagrey and
re-scanning them during STW.
In the hybrid barrier, the two shades and the condition work together
to prevent a mutator from hiding an object:
1. `shade(*slot)` prevents a mutator from hiding an object by moving
the sole pointer to it from the heap to its stack.
If it attempts to unlink an object from the heap, this will shade
it.
2. `shade(ptr)` prevents a mutator from hiding an object by moving the
sole pointer to it from its stack into a black object in the heap.
If it attempts to install the pointer into a black object, this
will shade it.
3. Once a goroutine's stack is black, the `shade(ptr)` becomes
unnecessary.
`shade(ptr)` prevents hiding an object by moving it from the stack
to the heap, but this requires first having a pointer hidden on the
stack.
Immediately after a stack is scanned, it only points to shaded
objects, so it's not hiding anything, and the `shade(*slot)`
prevents it from hiding any other pointers on its stack.
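As a concrete illustration of case 1, the following sketch (with
illustrative names) shows the hiding sequence that `shade(*slot)`
defends against:
```
// Illustration of the hiding hazard that shade(*slot) prevents.
// Without a deletion barrier, the sequence in mutator() would
// hide obj from a concurrent marker.
package main

type T struct{ payload int }

// global models a heap slot the collector has already scanned;
// it holds the sole pointer to its object.
var global = &T{payload: 42}

func mutator() int {
	// Copy the sole pointer onto this goroutine's already-black
	// stack...
	obj := global
	// ...then unlink it from the heap. With the hybrid barrier,
	// this write shades the old value (obj), so the marker still
	// finds it. Without shade(*slot), the marker would never see
	// obj again and could free a reachable object.
	global = nil
	return obj.payload
}

func main() { _ = mutator() }
```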
The hybrid barrier combines the best of the Dijkstra barrier and the
Yuasa barrier.
The Yuasa barrier requires a STW at the beginning of marking to either
scan or snapshot stacks, but does not require a re-scan at the end of
marking.
The Dijkstra barrier lets concurrent marking start right away, but
requires a STW at the end of marking to re-scan stacks (though more
sophisticated non-STW approaches are possible [Hudson '97]).
The hybrid barrier inherits the best properties of both, allowing
stacks to be concurrently scanned at the beginning of the mark phase,
while also keeping stacks black after this initial scan.
## Rationale
The advantage of the hybrid barrier is that it lets a stack scan
permanently blacken a stack (without a STW and without write barriers
to the stack), which entirely eliminates the need for stack
re-scanning, in turn eliminating the need for stack barriers and
re-scan lists.
Stack barriers in particular introduce significant complexity
throughout the runtime, as well as interfering with stack walks from
external tools such as GDB and kernel-based profilers.
Also, like the Dijkstra-style write barrier, the hybrid barrier does
not require a read barrier, so pointer reads are regular memory reads;
and it ensures progress, since objects progress monotonically from
white to grey to black.
The disadvantages of the hybrid barrier are minor.
It may result in more floating garbage, since it retains everything
reachable from roots (other than stacks) at any point during the mark
phase.
However, in practice it's likely that the current Dijkstra barrier is
retaining nearly as much.
The hybrid barrier also prohibits certain optimizations: in
particular, the Go compiler currently omits a write barrier if it can
statically show that the pointer is nil, but the hybrid barrier
requires a write barrier in this case.
This may slightly increase binary size.
### Alternative barrier approaches
There are several variations on the proposed barrier that would also
work, but we believe the proposed barrier represents the best set of
trade-offs.
A basic variation is to make the Dijkstra-style aspect of the barrier
unconditional:
```
writePointer(slot, ptr):
    shade(*slot)
    shade(ptr)
    *slot = ptr
```
The main advantage of this barrier is that it's easier to reason
about.
It directly ensures there are no black-to-white pointers in the heap,
so the only source of black-to-white pointers can be scanned stacks.
But once a stack is scanned, the only way it can get a white pointer
is by traversing reachable objects, and any white object that can be
reached by a goroutine with a black stack is grey-protected by a heap
object.
The disadvantage of this barrier is that it's effectively twice as
expensive as the proposed barrier for most of the mark phase.
Similarly, we could simply coarsen the stack condition:
```
writePointer(slot, ptr):
    shade(*slot)
    if any stack is grey:
        shade(ptr)
    *slot = ptr
```
This has the advantage of making cross-stack writes, such as those
allowed by channels, safe without any special handling, but it
prolongs the period during which the second shade is enabled, which
slows down pointer writes.
A different approach would be to require that all stacks be blackened
before any heap objects are blackened, which would enable a pure
Yuasa-style deletion barrier:
```
writePointer(slot, ptr):
    shade(*slot)
    *slot = ptr
```
As originally proposed, the Yuasa barrier takes a complete snapshot of
the stack before proceeding with marking.
Yuasa argued that this was reasonable on hardware that could perform
bulk memory copies very quickly.
However, Yuasa's proposal was in the context of a single-threaded
system with a comparatively small stack, while Go programs regularly
have thousands of stacks that can total to a large amount of memory.
However, this complete snapshot isn't necessary.
It's sufficient to ensure all stacks are black before scanning any
heap objects.
This allows stack scanning to proceed concurrently, but has the
downside that it introduces a bottleneck to the parallelism of the
mark phase between stack scanning and heap scanning.
This bottleneck has downstream effects on goroutine availability,
since allocation is paced against marking progress.
Finally, there are other types of *black mutator* barrier techniques.
However, as shown by Pirinen, all possible black mutator barriers
other than the Yuasa barrier require a read barrier [Pirinen '98].
Given the relative frequency of pointer reads to writes, we consider
this unacceptable for application performance.
### Alternative approaches to re-scanning
Going further afield, it's also possible to make stack re-scanning
concurrent without eliminating it [Hudson '97].
This does not require changes to the write barrier, but does introduce
significant additional complexity into stack re-scanning.
Proposal #17505 gives a detailed design for how to do concurrent stack
re-scanning in Go.
## Other considerations
### Channel operations and go statements
The hybrid barrier assumes a goroutine cannot write to another
goroutine's stack.
This is true in Go except for two operations: channel sends and
starting goroutines, which can copy values directly from one stack to
another.
For channel operations, the `shade(ptr)` is necessary if *either* the
source stack or the destination stack is grey.
For starting a goroutine, the destination stack is always black, so
the `shade(ptr)` is necessary if the source stack is grey.
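In sketch form, such a stack-to-stack copy might perform a check along
these lines; the `g` type and `stackColor` field are illustrative, and
the runtime's actual channel code differs:
```
// A sketch of the barrier check for copying a pointer directly
// from one goroutine's stack to another's, as a channel send can.
package main

type color int

const (
	white color = iota
	grey
	black
)

type g struct{ stackColor color }

type object struct{ shade color }

func shade(obj *object) {
	if obj != nil && obj.shade == white {
		obj.shade = grey
	}
}

// stackToStackCopy installs ptr into a slot on dst's stack.
// Unlike an ordinary heap write, the insertion shade is needed
// if *either* stack is still grey.
func stackToStackCopy(src, dst *g, dstSlot **object, ptr *object) {
	if src.stackColor == grey || dst.stackColor == grey {
		shade(ptr)
	}
	*dstSlot = ptr
}

func main() {
	sender := &g{stackColor: grey}
	receiver := &g{stackColor: black}
	var slot *object
	stackToStackCopy(sender, receiver, &slot, &object{})
}
```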
### Racy programs
In a racy program, two goroutines may store to the same pointer
simultaneously and invoke the write barrier concurrently on the same
slot.
The hazard is that this may cause the barrier to fail to shade some
object that it would have shaded in a sequential execution,
particularly given a relaxed memory model.
While racy Go programs are generally undefined, we have so far
maintained that a racy program cannot trivially defeat the soundness
of the garbage collector (since a racy program can defeat the type
system, it can technically do anything, but we try to keep the garbage
collector working as long as the program stays within the type
system).
Suppose *optr* is the value of the slot before either write to the
slot and *ptr1* and *ptr2* are the two pointers being written to the
slot.
"Before" is well-defined here because all architectures that Go
supports have *coherency*, which means there is a total order over all
reads and writes of a single memory location.
If the goroutines' respective stacks have not yet been scanned, then
*ptr1* and *ptr2* will clearly be shaded, since those shades don't
read from memory.
Hence, the difficult case is if both goroutines' stacks have been
scanned.
In this case, the barriers reduce to:
<table>
<tr><th>Goroutine G1</th><th>Goroutine G2</th></tr>
<tr><td>
<pre>optr1 = *slot
shade(optr1)
*slot = ptr1</pre>
</td><td>
<pre>optr2 = *slot
shade(optr2)
*slot = ptr2</pre>
</td></tr>
</table>
Given that we're only dealing with one memory location, the property
of coherence means we can reason about this execution as if it were
sequentially consistent.
Given this, concurrent execution of the write barriers permits one
outcome that is not permitted by sequential execution: if both
barriers read `*slot` before assigning to it, then only *optr* will be
shaded, and neither *ptr1* nor *ptr2* will be shaded by the barrier.
For example:
<table>
<tr><th>Goroutine G1</th><th>Goroutine G2</th></tr>
<tr><td>
<pre>optr1 = *slot

shade(optr1)

*slot = ptr1
</pre>
</td><td>
<pre>
optr2 = *slot

shade(optr2)

*slot = ptr2</pre>
</td></tr>
</table>
We assert that this is safe.
Suppose *ptr1* is written first.
This execution is *nearly* indistinguishable from an execution that
simply skips the write of *ptr1*.
The only way to distinguish it is if a read from another goroutine
*G3* observes *slot* between the two writes.
However, given our assumption that stacks have already been scanned,
either *ptr1* is already shaded, or it must be reachable from some
other place in the heap anyway (and will be shaded eventually), so
concurrently observing *ptr1* doesn't affect the marking or
reachability of *ptr1*.
### cgo
The hybrid barrier could be a problem if C code overwrites a Go
pointer in Go memory with either nil or a C pointer.
Currently, this operation does not require a barrier, but with any
sort of deletion barrier, this does require the barrier.
However, a program that does this would violate the cgo pointer
passing rules, since Go code is not allowed to pass memory to a C
function that contains Go pointers.
Furthermore, this is one of the "cheap dynamic checks" enabled by the
default setting of `GODEBUG=cgocheck=1`, so any program that violates
this rule will panic unless the checks have been explicitly disabled.
## Future directions
### Write barrier omission
The current write barrier can be omitted by the compiler in cases
where the compiler knows the pointer being written is permanently
shaded, such as nil pointers, pointers to globals, and pointers to
static data.
These optimizations are generally unsafe with the hybrid barrier.
However, if the compiler can show that *both* the current value of the
slot and the value being written are permanently shaded, then it can
still safely omit the write barrier.
This optimization is aided by the fact that newly allocated objects
are zeroed, so all pointer slots start out pointing to nil, which is
permanently shaded.
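For example, the store below involves only permanently shaded values,
so the barrier could still be omitted; this illustrates the rule, not
the compiler's actual analysis:
```
// An illustration of a store that remains barrier-elidable under
// the hybrid barrier: both the old and new values are permanently
// shaded.
package main

type node struct{ next *node }

// globalNode is static data, so &globalNode is permanently shaded.
var globalNode node

func example() *node {
	// n is freshly allocated, so n.next starts out nil, which is
	// also permanently shaded.
	n := new(node)
	// Old value: nil (shaded). New value: &globalNode (shaded).
	// The hybrid barrier could safely be omitted for this store.
	n.next = &globalNode
	return n
}

func main() { _ = example() }
```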
### Low-pause stack scans
Currently the garbage collector pauses a goroutine while scanning its
stack.
If goroutines have large stacks, this can introduce significant tail
latency effects.
The hybrid barrier and the removal of the existing stack barrier
mechanism would make it feasible to perform stack scans with only
brief goroutine pauses.
In this design, scanning a stack pauses the goroutine briefly while it
scans the active frame.
It then installs a *blocking stack barrier* over the return to the
next frame and lets the goroutine resume.
Stack scanning then continues toward the outer frames, moving the
stack barrier up the stack as it goes.
If the goroutine returns as far as the stack barrier before the scan
is complete, the barrier blocks the goroutine until the scan has
scanned that frame and moved the barrier further up the stack.
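In toy form, the scan/return coordination might look like the
following; the locking scheme and frame numbering are assumptions for
illustration, not a proposed implementation:
```
// A toy model of a blocking stack barrier. The scanner works
// outward from frame 0 (innermost) while the goroutine may try to
// return into a not-yet-scanned frame; if it gets there first, it
// blocks until the scanner passes.
package main

import "sync"

type stackScan struct {
	mu      sync.Mutex
	cond    *sync.Cond
	scanned int // frames [0, scanned) have been scanned
}

func newStackScan() *stackScan {
	s := &stackScan{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// scanFrame is called by the collector each time it finishes a
// frame, moving the barrier one frame up the stack.
func (s *stackScan) scanFrame() {
	s.mu.Lock()
	s.scanned++
	s.cond.Broadcast()
	s.mu.Unlock()
}

// returnInto is called by the goroutine before returning into
// frame i; it blocks while that frame is still unscanned.
func (s *stackScan) returnInto(i int) {
	s.mu.Lock()
	for i >= s.scanned {
		s.cond.Wait()
	}
	s.mu.Unlock()
}

func main() {
	s := newStackScan()
	go func() {
		for i := 0; i < 3; i++ {
			s.scanFrame()
		}
	}()
	s.returnInto(2) // blocks until frames 0..2 are scanned
}
```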
One complication is that a running goroutine could attempt to grow its
stack during the stack scan.
The simplest solution is to block the goroutine if this happens until
the scan is done.
Like the current stack barriers, this depends on write barriers when
writing through pointers to other frames.
For example, in a partially scanned stack, an active frame could use
an up-pointer to move a pointer to a white object out of an unscanned
frame and into the active frame.
Without a write barrier on the write that removes the pointer from the
unscanned frame, this could hide the white object from the garbage
collector.
However, with write barriers on up-pointers, this is safe.
Rather than arguing about "partially black" stacks, the write barrier
on up-pointers lets us view the stack as a sequence of separate
frames, where unscanned frames are treated as part of the *heap*.
Writes without write barriers can only happen to the active frame, so
we only have to view the active frame as the stack.
This design is technically possible now, but the complexity of
composing it with the existing stack barrier mechanism makes it
unappealing.
With the existing stack barriers gone, the implementation of this
approach becomes relatively straightforward.
It's also generally simpler than the existing stack barriers in many
dimensions, since there are at most two stack barriers per goroutine
at a time, and they are present only during the stack scan.
### Strictly bounded mark termination
This proposal goes a long way toward strictly bounding the time spent
in STW mark termination, but there are some other known causes of
longer mark termination pauses.
The primary cause is a race that can trigger mark termination while
there is still remaining heap mark work.
This race and how to resolve it are detailed in the "Mark completion
race" appendix.
### Concurrent mark termination
With stack re-scanning out of mark termination, it may become
practical to make the remaining tasks in mark termination concurrent
and eliminate the mark termination STW entirely.
On the other hand, the hybrid barrier may reduce STW so much that
completely eliminating it is not a practical concern.
The following is a probably incomplete list of remaining mark
termination tasks and how to address them.
Worker stack scans can be eliminated by having workers self-scan
during mark.
Scanning the finalizer queue can be eliminated by adding explicit
barriers to `queuefinalizer`.
Without these two scans (and with the fix for the mark completion race
detailed in the appendix), mark termination will produce no mark work,
so finishing the work queue drain also becomes unnecessary.
`mcache`s can be flushed lazily at the beginning of the sweep phase
using rolling synchronization.
Flushing the heap profile can be done immediately at the beginning of
sweep (this is already concurrent-safe, in fact).
Finally, updating global statistics can be done using atomics and
possibly a global memory barrier.
### Concurrent sweep termination
Likewise, it may be practical to eliminate the STW for sweep
termination.
This is slightly complicated by the fact that the hybrid barrier
requires a global memory fence at the beginning of the mark phase to
enable the write barrier and ensure all pointer writes prior to
enabling the write barrier are visible to the write barrier.
Currently, the STW for sweep termination and setting up the mark phase
accomplishes this.
If we were to make sweep termination concurrent, we could instead use
a ragged barrier to accomplish the global memory fence, or the
[`membarrier` syscall](http://man7.org/linux/man-pages/man2/membarrier.2.html)
on recent Linux kernels.
## Compatibility
This proposal does not affect the language or any APIs and hence
satisfies the Go 1 compatibility guidelines.
## Implementation
Austin plans to implement the hybrid barrier during the Go 1.8
development cycle.
For 1.8, we will leave stack re-scanning support in the runtime for
debugging purposes, but disable it by default using a `GODEBUG`
variable.
Assuming things go smoothly, we will remove stack re-scanning support
when the tree opens for Go 1.9 development.
The planned implementation approach is:
1. Fix places that do unusual or "clever" things with memory
containing pointers and make sure they cooperate with the hybrid
barrier.
We'll presumably find more of these as we debug in later steps, but
we'll have to make at least the following changes:
1. Ensure barriers on stack-to-stack copies for channel sends and
starting goroutines.
2. Check all places where we clear memory since the hybrid barrier
requires distinguishing between clearing for initialization and
clearing already-initialized memory.
This will require a barrier-aware `memclr` (see the sketch after
this list) and disabling the `duffzero` optimization for types
with pointers.
3. Check all uses of unmanaged memory in the runtime to make sure
it is initialized properly.
This is particularly important for pools of unmanaged memory
such as the fixalloc allocator that may reuse memory.
2. Implement concurrent scanning of background mark worker stacks.
Currently these are placed on the rescan list and *only* scanned
during mark termination, but we're going to disable the rescan
list.
We could arrange for background mark workers to scan their own
stacks, or explicitly keep track of heap pointers on background
mark worker stacks.
3. Modify the write barrier to implement the hybrid write barrier and
the compiler to disable write barrier elision optimizations that
aren't valid for the hybrid barrier.
4. Disable stack re-scanning by making rescan enqueuing a no-op unless
a `GODEBUG` variable is set.
Likewise, disable stack barrier insertion unless this variable is
set.
5. Use checkmark mode and stress testing to verify that no objects are
missed.
6. Wait for the Go 1.9 development cycle.
7. Remove stack re-scanning, the rescan list, stack barriers, and the
`GODEBUG` variable to enable re-scanning.
Possibly, switch to low-pause stack scans, which can reuse some of
the stack barrier mechanism.
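The following is a sketch of the "barrier-aware `memclr`" mentioned in
step 1.2 above: clearing already-initialized pointer memory must run
the deletion barrier on each old pointer, while clearing fresh memory
for initialization needs no barrier. The `shade` stub and slot layout
are assumptions for illustration.
```
// A sketch of a barrier-aware clear of already-initialized
// pointer memory.
package main

import "unsafe"

func shade(p unsafe.Pointer) {
	// Stand-in: grey the object at p if it is still white.
	_ = p
}

// clearPointerSlots zeroes n pointer-sized words starting at mem,
// shading each overwritten pointer as shade(*slot) requires.
func clearPointerSlots(mem unsafe.Pointer, n int) {
	slots := unsafe.Slice((*unsafe.Pointer)(mem), n)
	for i := range slots {
		if slots[i] != nil {
			shade(slots[i]) // deletion barrier on the old value
		}
		slots[i] = nil
	}
}

func main() {
	buf := make([]unsafe.Pointer, 4)
	x := 1
	buf[0] = unsafe.Pointer(&x)
	clearPointerSlots(unsafe.Pointer(&buf[0]), len(buf))
}
```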
## Appendix: Mark completion race
Currently, because of a race in the mark completion condition, the
garbage collector can begin mark termination when there is still
available mark work.
This is safe because mark termination will finish draining this work,
but it makes mark termination unbounded.
This also interferes with possible further optimizations that remove
all mark work from mark termination.
Specifically, the following interleaving starts mark termination
without draining all mark work:
Initially `workers` is zero and there is one buffer on the full list.
<table>
<tr><th>Thread 1</th><th>Thread 2</th></tr>
<tr><td><pre>
inc(&workers)
gcDrain(&work)
=> acquires only full buffer
=> adds more pointers to work
work.dispose()
=> returns buffer to full list
n := dec(&workers) [n=0]




if n == 0 &&                      [true]
   full.empty() &&                [true]
   markrootNext >= markrootJobs { [true]
  startMarkTerm()
}
</pre></td><td><pre>







inc(&workers)
gcDrain(&work)
=> acquires only full buffer
=> adds more pointers to work
...
</pre></td></tr></table>
In this example, a race between observing the `workers` count and
observing the state of the full list causes thread 1 to start mark
termination prematurely.
Simply checking `full.empty()` before decrementing `workers` exhibits
a similar race.
To fix this race, we propose introducing a single atomic non-zero
indicator for the number of non-empty work buffers.
Specifically, this will count the number of work caches that are
caching a non-empty work buffer plus one for a non-empty full list.
Many buffer list operations can be done without modifying this count,
so we believe it will not be highly contended.
If this does prove to be a scalability issue, there are well-known
techniques for scalable non-zero indicators [Ellen '07].
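A sketch of the corrected completion check, assuming such an indicator
(the names are illustrative, not the runtime's):
```
// A sketch of the corrected mark-completion check using a single
// atomic non-zero indicator for work availability. nonEmptyWork
// counts caches holding a non-empty buffer, plus one if the full
// list is non-empty.
package main

import "sync/atomic"

var (
	workers      int64 // active drain workers
	nonEmptyWork int64 // non-empty work sources (SNZI-style count)
	markrootNext int64
	markrootJobs int64
)

// maybeStartMarkTerm runs when a worker finishes draining.
// Consulting one atomic indicator avoids the race between
// observing the worker count and the full list separately.
func maybeStartMarkTerm(startMarkTerm func()) {
	if atomic.AddInt64(&workers, -1) == 0 &&
		atomic.LoadInt64(&nonEmptyWork) == 0 &&
		atomic.LoadInt64(&markrootNext) >= atomic.LoadInt64(&markrootJobs) {
		startMarkTerm()
	}
}

func main() {
	atomic.AddInt64(&workers, 1)
	maybeStartMarkTerm(func() { /* enter mark termination */ })
}
```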
## Appendix: Proof of soundness, completeness, and boundedness
<!-- TODO: Show that the only necessary memory fence is a global
!-- store/load fence between enabling the write barrier and
!-- blackening to ensure visibility of all pointers and the write
!-- barrier flag.
!-->
This section argues that the hybrid write barrier satisfies the weak
tricolor invariant, and hence is sound in the sense that it does not
collect reachable objects; that it terminates in a bounded number of
steps; and that it eventually collects all unreachable objects, and
hence is complete.
We have also further verified these properties using a
[randomized stateless model](https://github.com/aclements/go-misc/blob/master/go-weave/models/yuasa.go).
The following proofs consider global objects to be a subset of the
heap objects.
This is valid because the write barrier applies equally to global
objects.
Similarly, we omit explicit discussion of nil pointers, since the nil
pointer can be considered an always-black heap object of zero size.
The hybrid write barrier satisfies the *weak tricolor invariant*
[Pirinen '98].
However, rather than directly proving this, we prove that it satisfies
the following *modified tricolor invariant*:
> Any white object pointed to by a black object is grey-protected by a
> heap object (reachable via a chain of white pointers from the grey
> heap object).
> That is, for every *B -> W* edge, there is a path *G -> W₁ -> ⋯ ->
> Wₙ -> W* where *G* is a grey heap object and the *Wᵢ* are white.
This is identical to the weak tricolor invariant, except that it
requires that the grey-protector is a heap object.
This trivially implies the weak tricolor invariant, but gives us a
stronger basis for induction in the proof.
<!-- Thoughts on how to simplify the proof:
Perhaps define a more general notion of a "heap-protected object",
which is either black, grey, or grey-protected by a heap object.
-->
Lemma 1 establishes a simple property of paths we'll use several
times.
**Lemma 1.** In a path *O₁ -> ⋯ -> Oₙ* where *O₁* is a heap object,
all *Oᵢ* must be heap objects.
**Proof.** Since *O₁* is a heap object and heap objects can only point
to other heap objects, by induction, all *Oᵢ* must be heap objects.
In particular, if some object is grey-protected by a heap object,
every object in the grey-protecting path must be a heap object.
Lemma 2 extends the modified tricolor invariant to white objects that
are *indirectly* reachable from black objects.
**Lemma 2.** If the object graph satisfies the modified tricolor
invariant, then every white object reachable (directly or indirectly)
from a black object is grey-protected by a heap object.
<!-- Alternatively: every object reachable from a black object is
either black, grey, or grey-protected by a heap object. -->
**Proof.** Let *W* be a white object reachable from black object *B*
via simple path *B -> O₁ -> ⋯ -> Oₙ -> W*.
Note that *W* and all *Oᵢ* must be heap objects because stacks can
only point to themselves (in which case it would not be a simple path)
or heap objects, so *O₁* must be a heap object, and by lemma 1, the
rest of the path must be heap objects.
Without loss of generality, we can assume none of *Oᵢ* are black;
otherwise, we can simply reconsider using the shortest path suffix
that starts with a black object.
If there are no *Oᵢ*, *B* points directly to *W* and the modified
tricolor invariant directly implies that *W* is grey-protected by a
heap object.
If any *Oᵢ* is grey, then *W* is grey-protected by the last grey
object in the path.
Otherwise, all *Oᵢ* are white.
Since *O₁* is a white object pointed to by a black object, *O₁* is
grey-protected by some path *G -> W₁ -> ⋯ -> Wₙ -> O₁* where *G* is a
heap object.
Thus, *W* is grey-protected by *G -> W₁ -> ⋯ -> Wₙ -> O₁ -> ⋯ -> Oₙ ->
W*.
Lemma 3 builds on lemma 2 to establish properties of objects reachable
by goroutines.
**Lemma 3.** If the object graph satisfies the modified tricolor
invariant, then every white object reachable by a black goroutine (a
goroutine whose stack has been scanned) is grey-protected by a heap
object.
**Proof.** Let *W* be a white object reachable by a black goroutine.
If *W* is reachable from the goroutine's stack, then by lemma 2 *W* is
grey-protected by a heap object.
Otherwise, *W* must be reachable from a global *X*.
Let *O* be the last non-white object in the path from *X* to *W* (*O*
must exist because *X* itself is either grey or black).
If *O* is grey, then *O* is a heap object that grey-protects *W*.
Otherwise, *O* is black and by lemma 2, *W* is grey-protected by some
heap object.
Now we're ready to prove that the hybrid write barrier satisfies the
weak tricolor invariant, which implies it is *sound* (it marks all
reachable objects).
**Theorem 1.** The hybrid write barrier satisfies the weak tricolor
invariant.
**Proof.** We first show that the hybrid write barrier satisfies the
modified tricolor invariant.
The proof follows by induction over the operations that affect the
object graph or its coloring.
*Base case.* Initially there are no black objects, so the invariant
holds trivially.
*Write pointer in the heap.* Let *obj.slot := ptr* denote the write,
where *obj* is in the heap, and let *optr* denote the value of
*obj.slot* prior to the write.
Let *W* be a white object pointed to by a black object *B* after the
heap write.
There are two cases:
1. *B ≠ obj*: *W* was pointed to by the same black object *B* before
the write, and, by assumption, *W* was grey-protected by a path *G
-> W₁ -> ⋯ -> Wₙ -> W*, where *G* is a heap object.
If none of these edges are *obj.slot*, then *W* is still protected
by *G*.
Otherwise, the path must have included the edge *obj -> optr* and,
since the write barrier shades *optr*, *W* is grey-protected by
*optr* after the heap write.
2. *B = obj*: We first establish that *W* was grey-protected before
the write, which breaks down into two cases:
1. *W = ptr*: The goroutine must be black, because otherwise the
write barrier shades *ptr*, so it is not white.
*ptr* must have been reachable by the goroutine for it to write
it, so by lemma 3, *ptr* was grey-protected by some heap object
*G* prior to the write.
2. *W ≠ ptr*: *B* pointed to *W* before the write and, by
assumption, *W* was grey-protected by some heap object *G*
before the write.
Because *obj* was black before the write, it could not be in the
grey-protecting path from *G* to *W*, so this write did not affect
this path, so *W* is still grey-protected by *G*.
*Write pointer in a stack.* Let *stk.slot := ptr* denote the write.
Let *W* be a white object pointed to by a black object *B* after the
stack write.
We first establish that *W* was grey-protected before the stack write,
which breaks down into two cases:
1. *B = stk* and *W = ptr*: *W* may not have been pointed to by a
black object prior to the stack write (that is, the write may
create a new *B -> W* edge).
However, *ptr* must have been reachable by the goroutine, which is
black (because *B = stk*), so by lemma 3, *W* was grey-protected by
some heap object *G* prior to the write.
2. Otherwise: *W* was pointed to by the same black object *B* prior to
the stack write, so, by assumption, *W* was grey-protected by some
heap object *G* prior to the write.
By lemma 1, none of the objects in the grey-protecting path from heap
object *G* to *W* can be a stack, so the stack write does not modify
this path.
Hence, *W* is still grey-protected after the stack write by *G*.
*Scan heap object.* Let *obj* denote the scanned object.
Let *W* be a white object pointed to by a black object *B* after the scan.
*B* cannot be *obj* because immediately after the scan, *obj* does not
point to any white objects.
Thus, *B* must have been black and pointed to *W* before the scan as
well, so, by assumption, *W* was grey-protected by a path *G -> W₁ ->
⋯ -> Wₙ -> W*, where *G* is a heap object.
If some *Wᵢ* was an object pointed to by *obj*, then *W* is
grey-protected by *Wᵢ* after the scan.
Otherwise, *W* is still grey-protected by *G*.
*Stack scan.* This case is symmetric with scanning a heap object.
<!-- Old direct proof of stack scans:
*Stack scan.* Let *W* be an object pointed to by a black object *B*
after the stack scan.
Even though scanning stack *stk* blackens *stk*, *B* cannot be *stk*
because scanning greys all objects directly referenced by *stk*.
Hence, *W* was pointed to by the same black object *B* before the
stack scan, and by assumption was grey-protected by some path *G -> W₁
-> ⋯ -> Wₙ -> W* where *G* is a heap object.
By lemma 1, none of the objects in the grey-protecting path can be a
stack, so after the stack scan, *W* is either still grey-protected by
*G*, or some *Wᵢ* was greyed by the stack scan and *W* is now
grey-protected by *Wᵢ*.
-->
*Allocate an object.* Since new objects are allocated black and point
to nothing, the invariant trivially holds across allocation.
*Create a stack.* This case is symmetric with object allocation
because new stacks start out empty and hence are trivially black.
This completes the induction cases and shows that the hybrid write
barrier satisfies the modified tricolor invariant.
Since the modified tricolor invariant trivially implies the weak
tricolor invariant, the hybrid write barrier satisfies the weak
tricolor invariant.
The garbage collector is also *bounded*, meaning it eventually
terminates.
**Theorem 2.** A garbage collector using the hybrid write barrier
terminates in a finite number of marking steps.
**Proof.** We observe that objects progress strictly from white to
grey to black and, because new objects (including stacks) are
allocated black, the total marking work is bounded by the number of
objects at the beginning of garbage collection, which is finite.
Finally, the garbage collector is also *complete*, in the sense that
it eventually collects all unreachable objects.
This is trivial from the fact that the garbage collector cannot mark
any objects that are not reachable when the mark phase starts.
## References
[Auerbach '07] J. Auerbach, D. F. Bacon, B. Blainey, P. Cheng, M.
Dawson, M. Fulton, D. Grove, D. Hart, and M. Stoodley. Design and
implementation of a comprehensive real-time java virtual machine. In
*Proceedings of the 7th ACM & IEEE international conference on
Embedded software (EMSOFT '07)*, 249–258, 2007.
[Dijkstra '78] E. W. Dijkstra, L. Lamport, A. J. Martin, C. S.
Scholten, and E. F. Steffens. On-the-fly garbage collection: An
exercise in cooperation. *Communications of the ACM*, 21(11), 966–975,
1978.
[Ellen '07] F. Ellen, Y. Lev, V. Luchangco, and M. Moir. SNZI: Scalable
nonzero indicators. In *Proceedings of the 26th ACM SIGACT-SIGOPS
Symposium on Principles of Distributed Computing*, Portland, OR,
August 2007.
[Hudson '97] R. L. Hudson, R. Morrison, J. E. B. Moss, and D. S.
Munro. Garbage collecting the world: One car at a time. In *ACM
SIGPLAN Notices* 32(10):162–175, October 1997.
[Pirinen '98] P. P. Pirinen. Barrier techniques for incremental
tracing. In *ACM SIGPLAN Notices*, 34(3), 20–25, October 1998.
[Steele '75] G. L. Steele Jr. Multiprocessing compactifying garbage
collection. *Communications of the ACM*, 18(9), 495–508, 1975.
[Yuasa '90] T. Yuasa. Real-time garbage collection on general-purpose
machines. *Journal of Systems and Software*, 11(3):181–198, 1990.