Proposal: Non-cooperative goroutine preemption

Author(s): Austin Clements

Last updated: 2018-03-26

Discussion at https://golang.org/issue/24543.

Abstract

Go currently uses compiler-inserted cooperative preemption points in function prologues. The majority of the time, this is good enough to allow Go developers to ignore preemption and focus on writing clear parallel code, but it has sharp edges that we‘ve seen degrade the developer experience time and time again. When it goes wrong, it goes spectacularly wrong, leading to mysterious system-wide latency issues and sometimes complete freezes. And because this is a language implementation issue that exists outside of Go’s language semantics, these failures are surprising and very difficult to debug.

@dr2chase has put significant effort into prototyping cooperative preemption points in loops, which is one way to solve this problem. However, even sophisticated approaches to this led to unacceptable slow-downs in tight loops (where slow-downs are generally least acceptable).

I propose that the Go implementation switch to non-cooperative preemption using stack and register maps at (essentially) every instruction. This would allow goroutines to be preempted without explicit preemption checks. This approach will solve the problem of delayed preemption with zero run-time overhead.

Background

Up to and including Go 1.10, Go has used cooperative preemption with safe-points only at function calls (and even then, not if the function is small or gets inlined). This can result in infrequent safe-points, which leads to many problems:

The most common in production code is that this can delay STW operations, such as starting and ending a GC cycle. This increases STW latency, and on large core counts can significantly impact throughput (if, for example, most threads are stopped while the runtime waits on a straggler for a long time). (#17831, #19241)
This can delay scheduling, preventing competing goroutines from executing in a timely manner.
This can delay stack scanning, which consumes CPU while the runtime waits for a preemption point and can ultimately delay GC termination, resulting in an effective STW where the system runs out of heap and no goroutines can allocate.
In really extreme cases, it can cause a program to halt, such as when a goroutine spinning on an atomic load starves out the goroutine responsible for setting that atomic. This often indicates bad or buggy code, but is surprising nonetheless and has clearly wasted a lot of developer time on debugging. (#543, #12553, #13546, #14561, #15442, #17174, #20793, #21053)

These problems impede developer productivity and production efficiency and expose Go‘s users to implementation details they shouldn’t have to worry about.

Loop preemption

@dr2chase put significant effort into trying to solve these problems using explicit loop preemption (#10958). This is a standard approach for runtimes employing cooperative preemption in which the compiler inserts preemption checks and safe-points at back-edges in the flow graph. This significantly improves the quality of preemption, since code almost never executes without a back-edge for any non-trivial amount of time.

Our most recent approach to loop preemption, which we call fault-based preemption, adds a single instruction, no branches, and no register pressure to loops on x86 and UNIX platforms (CL 43050). Despite this, the geomean slow-down on a large suite of benchmarks is 7.8%, with a handful of significantly worse outliers. Even compared to Go 1.9, where the slow-down is only 1% thanks to other improvements, most benchmarks see some slow-down and there are still significant outliers.

Fault-based preemption also has several implementation downsides. It can‘t target specific threads or goroutines, so it’s a poor match for stack scanning, ragged barriers, or regular scheduler preemption. It‘s also “sticky”, in that we can’t resume any loops until we resume all loops, so the safe-point can't simply resume if it occurs in an unsafe state (such as when runtime locks are held). And it requires more instructions (and more overhead) on non-x86 and non-UNIX platforms.

Proposal

I propose that we implement fully non-cooperative preemption by recording enough metadata to allow safe-points (almost) everywhere without explicit preemption checks in function bodies.

To do this, we would modify the compiler to produce register maps in addition to stack maps, and to emit these for as many program counters as possible. The runtime would use a signal (or GetThreadContext on Windows, or a note on Plan9) to retrieve each thread's register state, including the stack and register map at the interrupted PC. The garbage collector would then treat live pointers in registers just as it treats live pointers on the stack.

Certain instructions cannot be safe-points, so if a signal occurs at such a point, the runtime would simply resume the thread and try again later. The compiler just needs to make most instructions safe-points.

To @minux's credit, he suggested this in the very first reply to #10958. At the time we thought adding safe-points everywhere would be too difficult and that the overhead of explicit loop preemption would be lower than it turned out to be.

Many other garbage-collected languages use explicit safe-points on back-edges, or they use forward-simulation to reach a safe-point. Partly, it‘s possible for Go to support safe-points everywhere because Go’s GC already must have excellent support for interior pointers; in many languages, interior pointers never appear at a safe-point.

Handling unsafe-points

There are various points in generated code that must be GC-atomic and thus cannot have safe-points in them. A few known situations are:

Expressions involving unsafe.Pointer may temporarily represent the only pointer to an object as a uintptr. Hence, there must be no safe-points while a uintptr derived from an unsafe.Pointer is live. Likewise, we must recognize reflect.Value.Pointer, reflect.Value.UnsafeAddr, and reflect.Value.InterfaceData as unsafe.Pointer-to-uintptr conversions. Alternatively, if the compiler can reliably detect such uintptrs, it could mark this as pointers, but there's a danger that an intermediate value may not represent a legal pointer value.
In the write barrier there must not be a safe-point between the write-barrier-enabled check and a direct write. For example, suppose the goroutine is writing a pointer to B into object A. If the check happens, then GC starts and scans A, then the goroutine writes B into A and drops all references to B from its stack, the garbage collector could fail to mark B.
There are places where the compiler generates temporary pointers that can be past the end of allocations, such as in range loops over slices and arrays. These would either have to be avoided or safe-points would have to be disallowed while these are live.

All of these cases must already avoid significant reordering to avoid being split across a call. Internally, this is achieved via the “mem” pseudo-value, which must be sequentially threaded through all SSA values that manipulate memory. Mem is also threaded through values that must not be reordered, even if they don't touch memory. For example, conversion between unsafe.Pointer and uintptr is done with a special “Convert” operation that takes a mem solely to constrain reordering.

There are several possible solutions to these problem, some of which can be combined:

We could mark basic blocks that shouldn't contain preemption points. For unsafe.Pointer conversions, we would opt-out the basic block containing the conversion. For code adhering to the unsafe.Pointer rules, this should be sufficient, but it may break code that is incorrect but happens to work today in ways that are very difficult to debug. For write barriers this is also sufficient. For loops, this is overly broad and would require splitting some basic blocks.
For unsafe.Pointer conversions, we could simply opt-out entire functions that convert from unsafe.Pointer to uintptr. This would be easy to implement, and would keep even broken unsafe code working as well as it does today, but may have broad impact, especially in the presence of inlining.
A simple combination of 1 and 2 would be to opt-out any basic block that is reachable from an unsafe.Pointer to uintptr conversion, up to a function call (which is a safe-point today).
For range loops, the compiler could compile them differently such that it never constructs an out-of-bounds pointer (see below).
A far more precise and general approach (thanks to @cherrymui) would be to create new SSA operations that “taint” and “untaint” memory. The taint operation would take a mem and return a new tainted mem. This taint would flow to any values that themselves took a tainted value. The untaint operation would take a value and a mem and return an untainted value and an untainted mem. During liveness analysis, safe-points would be disallowed wherever a tainted value was live. This is probably the most precise solution, and is likely to keep even incorrect uses of unsafe working, but requires a complex implementation.

More broadly, it‘s worth considering making the compiler check unsafe.Pointer-using code and actively reject code that doesn’t follow the allowed patterns. This could be implemented as a simple type system that distinguishes pointer-ish uintptr from numeric uintptr. But this is out of scope for this proposal.

Range loops

Currently (as of 1.10), range loops are compiled roughly like:

for i, x := range s { b }
  ⇓
for i, _n, _p := 0, len(s), &s[0]; i < _n; i, _p = i+1, _p + unsafe.Sizeof(s[0]) { b }
  ⇓
i, _n, _p := 0, len(s), &s[0]
goto cond
body:
{ b }
i, _p = i+1, _p + unsafe.Sizeof(s[0])
cond:
if i < _n { goto body } else { goto end }
end:

The problem with this lowering is that _p may temporarily point past the end of the allocation the moment before the loop terminates. Currently this is safe because there's never a safe-point while this value of _p is live.

We could keep this lowering and mark the increment and condition blocks as unsafe-points. However, if the body is short, this could result in infrequent safe-points. It would also require creating a separate block for the increment, which is currently usually appended to the end of the body. Separating these blocks would inhibit reordering opportunities.

Alternatively, we could rewrite the loop to never create a past-the-end pointer. For example, we could lower it like:

i, _n, _p := 0, len(s), &s[0]
if i >= _n { goto end } else { goto body }
top:
_p += unsafe.Sizeof(s[0])
body:
{ b }
i++
if i >= _n { goto end } else { goto top }
end:

This would allow safe-points everywhere in the loop. Compared to the current loop compilation, it generates slightly more code, but executes the same number of conditional branch instructions (n+1) and results in the same number of SSA basic blocks (3).

This lowering does complicate bounds-check elimination. Currently, bounds-check elimination knows that i < _n in the body because the body block is dominated by the cond block. However, in the new lowering, deriving this fact requires detecting that i < _n on both paths into body and hence is true in body.

Runtime safe-points

Beyond generated code, the runtime in general is not written to be arbitrarily preemptible and there are many places that must not be preempted. Hence, we would likely disable safe-points by default in the runtime, except at calls (where they occur now).

While this would have little downside for most of the runtime, there are some parts of the runtime that could benefit substantially from non-cooperative preemption, such as memory functions like memmove. Non-cooperative preemption is an excellent way to make these preemptible without slowing down the common case, since we would only need to mark their register maps (which would often be empty for functions like memmove since all pointers would already be protected by arguments).

Over time we may opt-in more of the runtime.

Encoding of stack and register maps

To be determined.

@austin's prototype simply added the register liveness map as a third liveness map accompanying the existing argument and local liveness maps. The downside of this is that any change in any of these maps requires writing out all of them, and the register liveness changes tend not to correlate with argument and local liveness changes.

Other uses

Heap dump analysis

Having safe-points everywhere fixes problems with heap dump analysis from core files, which currently has to use conservative heuristics to identify live pointers in active call frames.

Call injection

Having safe-points everywhere also allows some function calls to be safely injected at runtime. This is useful in at least two situations:

To handle synchronous signals, such as nil-pointer dereferences, the runtime injects a call to runtime.sigpanic at the location of the fault. However, since there isn't usually a call at this location, the stack map may be inaccurate, which leads to complicated interactions between defers, escape analysis, and traceback handling. Having safe-points everywhere could simplify this.
Debugger function calls (#21678). Currently it's essentially impossible for a debugger to dynamically invoke a Go function call because of poor interactions with stack scanning and the garbage collector. Having stack and register maps everywhere would make this significantly more viable, since a function call could be injected nearly anywhere without making the stack un-scannable.

Testing

The primary danger of this approach is its potential for a long bug tail, since the coverage of safe-points in regular testing will decrease substantially. In addition to standard testing, I propose checking the generated liveness maps using static analysis of binaries. This tool would look for pointer dereferences or stores with a write barrier to indicate that a value is a pointer and would check the flow of that value through all possible paths. It would report anywhere a value transitioned from dead/scalar to live pointer and anywhere a value was used both like a pointer and like a scalar.

In effect, this tool would simulate the program to answer the question “for every two points in time A < B, are there allocations reachable from the liveness map at time B that were not reachable at time A and were not allocated between A and B?”

Most likely this static analysis tool could be written atop the existing golang.org/x/arch packages. These are the same packages used by, for example, go tool objdump, and handle most heavy-lifting of decoding the binary itself.

Other considerations

Space overhead. Traditionally, the main concern with having safe-points everywhere is the overhead of saving the stack/register maps. A very preliminary implementation of register maps and safe-points everywhere increased binary size by ~10%. However, this left several obvious optimizations on the table. Work by Stichmoth, et al. [1] further suggests that this overhead can be significantly curtailed with simple compression techniques.

Windows support. Unlike fault-based loop preemption, signaled preemption is quite easy to support in Windows because it provides SuspendThread and GetThreadContext, which make it trivial to get a thread's register set.

Decoupling stack-move points from GC safe-points. Because of the details of the current implementation, these are (essentially) the same. By decoupling these and only allowing stack growth and shrinking at function entry, stack copying would not need to adjust registers. This keeps stack copying simpler. It also enables better compiler optimizations and more safe-points since the compiler need not mark a register as a live pointer if it knows there‘s another live pointer to the object. Likewise, registers that are known to be derived from pointer arguments can be marked as scalar as long as those arguments are live. Such optimizations are possible because of Go’s non-moving collector. This also prevents stack moving from observing (and crashing on) transient small-valued pointers that the compiler constructs when it knows an offset from a potentially-nil pointer will be small.

Debuggers. Debuggers would have to be taught to ignore the signal used for stopping a thread. However, if we can use a distinct signal for this (such as one of the POSIX real-time signals), this should be easier than teaching debuggers to distinguish the SIGSEGVs produced by 1.10‘s loop preemption from genuine SIGSEGVs. It’s also not clear that our current fault-based approach can work at all under many debuggers on OS X due to a kernel bug.

Scheduler preemption. This mechanism is well-suited to temporary preemptions where the same goroutine will resume after the preemption because we don‘t need to save the full register state and can rely on the existing signal return path to restore the full register state. This applies to all GC-related preemptions, but it’s not as well suited to permanent preemption performed by the scheduler. However, we could still build on this mechanism. For example, since most of the time goroutines self-preempt, we only need to save the full signal state in the uncommon case, so the g could contain a pointer to its full saved state that's only used after a forced preemption. Restoring the full signal state could be done by either writing the architecture-dependent code to restore the full register set (a beefed-up runtime.gogo), or by self-signaling, swapping in the desired context, and letting the OS restore the full register set.

Targeting and resuming. In contrast with fault-based loop preemption, signaled preemption can be targeted at a specific thread and can immediately resume. Thread-targeting is a little different from cooperative preemption, which is goroutine-targeted. However, in many cases this is actually better, since targeting goroutines for preemption is racy and hence requires retry loops that can add significantly to STW time. Taking advantage of this for stack scanning will require some restructuring of how we track GC roots, but the result should eliminate the blocking retry loop we currently use.

Non-pointer pointers. This has the potential to expose incorrect uses of unsafe.Pointer for transiently storing non-pointers. Such uses are a clear violation of the unsafe.Pointer rules, but they may happen (especially in, for example, cgo-using code).

Assembly. By default, the runtime cannot safely preempt assembly code since it won‘t know what registers contain pointers. As a follow-on to the work on safe-points everywhere, we should audit assembly in the standard library for non-preemptible loops and annotate them with register maps. In most cases this should be trivial since most assembly never constructs a pointer that isn’t shadowed by an argument, so it can simply claim there are no pointers in registers. We should also document in the Go assembly guide how to do this for user code.

Alternatives

Rather than significantly increasing the number of safe-points, we could use a signal to stop a thread and then use hardware single-stepping support to advance the thread to a safe-point (or a point where the compiler has provided a branch to reach a safe-point, like in the current loop preemption approach). This works (somewhat surprisingly), but thoroughly bamboozles debuggers since both the debugger and the operating system assume the debugger owns single-stepping, not the process itself. This would also require the compiler to provide register flushing stubs for these safe-points, which increases code size (and hence instruction cache pressure) as well as stack size. Safe-points everywhere increase binary size, but not code size or stack size.

Compatibility

This proposal introduces no new APIs, so it is Go 1 compatible.

Implementation

Austin Clements (@austin) plans to implement register and stack maps everywhere for Go 1.11. This will enable some low-risk uses in the short term, such as debugger function calls.

Debugging and testing of register and stack maps can continue into the Go 1.11 freeze, including building the static analysis tool.

Then, for Go 1.12, Austin will implement safe-points everywhere atop the register and stacks maps.

References

[1] James M. Stichnoth, Guei-Yuan Lueh, and Michał Cierniak. 1999. Support for garbage collection at every instruction in a Java compiler. In Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation (PLDI '99). ACM, New York, NY, USA, 118–127.