| # Proposal: Non-cooperative goroutine preemption |
| |
| Author(s): Austin Clements |
| |
| Last updated: 2019-01-18 |
| |
| Discussion at https://golang.org/issue/24543. |
| |
| ## Abstract |
| |
| Go currently uses compiler-inserted cooperative preemption points in |
| function prologues. |
| The majority of the time, this is good enough to allow Go developers |
| to ignore preemption and focus on writing clear parallel code, but it |
| has sharp edges that we've seen degrade the developer experience time |
| and time again. |
| When it goes wrong, it goes spectacularly wrong, leading to mysterious |
| system-wide latency issues and sometimes complete freezes. |
| And because this is a language implementation issue that exists |
| outside of Go's language semantics, these failures are surprising and |
| very difficult to debug. |
| |
| @dr2chase has put significant effort into prototyping cooperative |
| preemption points in loops, which is one way to solve this problem. |
| However, even sophisticated approaches to this led to unacceptable |
| slow-downs in tight loops (where slow-downs are generally least |
| acceptable). |
| |
| I propose that the Go implementation switch to non-cooperative |
| preemption, which would allow goroutines to be preempted at |
| essentially any point without the need for explicit preemption checks. |
| This approach will solve the problem of delayed preemption and do so |
| with zero run-time overhead. |
| |
| Non-cooperative preemption is a general concept with a whole class of |
| implementation techniques. |
| This document describes and motivates the switch to non-cooperative |
| preemption and discusses common concerns of any non-cooperative |
| preemption approach in Go. |
| Specific implementation approaches are detailed in sub-proposals |
| linked from this document. |
| |
| |
| ## Background |
| |
| Up to and including Go 1.10, Go has used cooperative preemption with |
| safe-points only at function calls (and even then, not if the function |
| is small or gets inlined). |
| This means that Go can only switch between concurrently-executing |
| goroutines at specific points. |
| The main advantage of this is that the compiler can ensure useful |
| invariants at these safe-points. |
| In particular, the compiler ensures that all local garbage collection |
| roots are known at all safe-points, which is critical to precise |
| garbage collection. |
| It can also ensure that no registers are live at safe-points, which |
| means the Go runtime can switch goroutines without having to save and |
| restore a large register set. |
| |
| However, this can result in infrequent safe-points, which leads to |
| many problems: |
| |
| 1. The most common in production code is that this can delay STW |
| operations, such as starting and ending a GC cycle. |
| This increases STW latency, and on large core counts can |
| significantly impact throughput (if, for example, most threads are |
| stopped while the runtime waits on a straggler for a long time). |
| ([#17831](https://golang.org/issue/17831), |
| [#19241](https://golang.org/issue/19241)) |
| |
| 2. This can delay scheduling, preventing competing goroutines from |
| executing in a timely manner. |
| |
| 3. This can delay stack scanning, which consumes CPU while the runtime |
| waits for a preemption point and can ultimately delay GC |
| termination, resulting in an effective STW where the system runs |
| out of heap and no goroutines can allocate. |
| |
| 4. In really extreme cases, it can cause a program to halt, such as |
| when a goroutine spinning on an atomic load starves out the |
| goroutine responsible for setting that atomic. |
| This often indicates bad or buggy code, but is surprising |
| nonetheless and has clearly wasted a lot of developer time on |
| debugging. |
| ([#543](https://golang.org/issue/543), |
| [#12553](https://golang.org/issue/12553), |
| [#13546](https://golang.org/issue/13546), |
| [#14561](https://golang.org/issue/14561), |
| [#15442](https://golang.org/issue/15442), |
| [#17174](https://golang.org/issue/17174), |
| [#20793](https://golang.org/issue/20793), |
| [#21053](https://golang.org/issue/21053)) |
| |
| These problems impede developer productivity and production efficiency |
| and expose Go's users to implementation details they shouldn't have to |
| worry about. |
| |
| ### Cooperative loop preemption |
| |
| @dr2chase put significant effort into trying to solve these problems |
| using cooperative *loop preemption* |
| ([#10958](https://golang.org/issue/10958)). |
| This is a standard approach for runtimes employing cooperative |
| preemption in which the compiler inserts preemption checks and |
| safe-points at back-edges in the flow graph. |
| This significantly improves the quality of preemption, since code |
| almost never executes without a back-edge for any non-trivial amount |
| of time. |
| |
| Our most recent approach to loop preemption, which we call |
| *fault-based preemption*, adds a single instruction, no branches, and |
| no register pressure to loops on x86 and UNIX platforms ([CL |
| 43050](https://golang.org/cl/43050)). |
| Despite this, the geomean slow-down on a [large suite of |
| benchmarks](https://perf.golang.org/search?q=upload%3A20171003.1+%7C+upload-part%3A20171003.1%2F3+vs+upload-part%3A20171003.1%2F1) |
| is 7.8%, with a handful of significantly worse outliers. |
| Even [compared to Go |
| 1.9](https://perf.golang.org/search?q=upload%3A20171003.1+%7C+upload-part%3A20171003.1%2F0+vs+upload-part%3A20171003.1%2F1), |
| where the slow-down is only 1% thanks to other improvements, most |
| benchmarks see some slow-down and there are still significant |
| outliers. |
| |
| Fault-based preemption also has several implementation downsides. |
| It can't target specific threads or goroutines, so it's a poor match |
| for stack scanning, ragged barriers, or regular scheduler preemption. |
| It's also "sticky", in that we can't resume any loops until we resume |
| *all* loops, so the safe-point can't simply resume if it occurs in an |
| unsafe state (such as when runtime locks are held). |
| It requires more instructions (and more overhead) on non-x86 and |
| non-UNIX platforms. |
| Finally, it interferes with debuggers, which assume bad memory |
| references are a good reason to stop a program. |
| It's not clear it can work at all under many debuggers on OS X due to |
| a [kernel bug](https://bugs.llvm.org/show_bug.cgi?id=22868). |
| |
| |
| ## Non-cooperative preemption |
| |
| *Non-cooperative preemption* switches between concurrent execution |
| contexts without explicit preemption checks or assistance from those |
| contexts. |
| This is used by all modern desktop and server operating systems to |
| switch between threads. |
| Without this, a single poorly-behaved application could wedge the |
| entire system, much like how a single poorly-behaved goroutine can |
| currently wedge a Go application. |
| It is also a convenient abstraction: it lets us program as if there |
| are an infinite number of CPUs available, hiding the fact that the OS |
| is time-multiplexing a finite number of CPUs. |
| |
| Operating system schedulers use hardware interrupt support to switch a |
| running thread into the OS scheduler, which can save that thread's |
| state such as its CPU registers so that it can be resumed later. |
| In Go, we would use operating system support to do the same thing. |
| On UNIX-like operating systems, this can be done using signals. |
| |
| However, because of the garbage collector, Go has requirements that an |
| operating system does not: Go must be able to find the live pointers |
| on a goroutine's stack wherever it stops it. |
| Most of the complexity of non-cooperative preemption in Go derives |
| from this requirement. |
| |
| |
| ## Proposal |
| |
| I propose that Go implement non-cooperative goroutine preemption by |
| sending a POSIX signal (or using an equivalent OS mechanism) to stop a |
| running goroutine and capture its CPU state. |
| If a goroutine is interrupted at a point that must be GC atomic, as |
| detailed in the ["Handling unsafe-points"](#handling-unsafe-points) |
| section, the runtime can simply resume the goroutine and try again |
| later. |
| |
| The key difficulty of implementing non-cooperative preemption for Go |
| is finding live pointers in the stack of a preempted goroutine. |
| There are many possible ways to do this, which are detailed in these |
| sub-proposals: |
| |
| * The [safe-points everywhere |
| proposal](24543/safe-points-everywhere.md) describes an |
| implementation where the compiler records stack and register maps |
| for nearly every instruction. |
| This allows the runtime to halt a goroutine anywhere and find its GC |
| roots. |
| |
| * The [conservative inner-frame scanning |
| proposal](24543/conservative-inner-frame.md) describes an |
| implementation that uses conservative GC techniques to find pointers |
| in the inner-most stack frame of a preempted goroutine. |
| This can be done without any extra safe-point metadata. |
| |
| |
| ## Handling unsafe-points |
| |
| Any non-cooperative preemption approach in Go must deal with code |
| sequences that have to be atomic with respect to the garbage |
| collector. |
| We call these "unsafe-points", in contrast with GC safe-points. |
| A few known situations are: |
| |
| 1. Expressions involving `unsafe.Pointer` may temporarily represent |
| the only pointer to an object as a `uintptr`. |
| Hence, there must be no safe-points while a `uintptr` derived from |
| an `unsafe.Pointer` is live. |
| Likewise, we must recognize `reflect.Value.Pointer`, |
| `reflect.Value.UnsafeAddr`, and `reflect.Value.InterfaceData` as |
| `unsafe.Pointer`-to-`uintptr` conversions. |
| Alternatively, if the compiler can reliably detect such `uintptr`s, |
| it could mark this as pointers, but there's a danger that an |
| intermediate value may not represent a legal pointer value. |
| |
| 2. In the write barrier there must not be a safe-point between the |
| write-barrier-enabled check and a direct write. |
| For example, suppose the goroutine is writing a pointer to B into |
| object A. |
| If the check happens, then GC starts and scans A, then the |
| goroutine writes B into A and drops all references to B from its |
| stack, the garbage collector could fail to mark B. |
| |
| 3. There are places where the compiler generates temporary pointers |
| that can be past the end of allocations, such as in range loops |
| over slices and arrays. |
| These would either have to be avoided or safe-points would have to |
| be disallowed while these are live. |
| |
| All of these cases must already avoid significant reordering to avoid |
| being split across a call. |
| Internally, this is achieved via the "mem" pseudo-value, which must be |
| sequentially threaded through all SSA values that manipulate memory. |
| Mem is also threaded through values that must not be reordered, even |
| if they don't touch memory. |
| For example, conversion between `unsafe.Pointer` and `uintptr` is done |
| with a special "Convert" operation that takes a mem solely to |
| constrain reordering. |
| |
| There are several possible solutions to these problem, some of which |
| can be combined: |
| |
| 1. We could mark basic blocks that shouldn't contain preemption |
| points. |
| For `unsafe.Pointer` conversions, we would opt-out the basic block |
| containing the conversion. |
| For code adhering to the `unsafe.Pointer` rules, this should be |
| sufficient, but it may break code that is incorrect but happens to |
| work today in ways that are very difficult to debug. |
| For write barriers this is also sufficient. |
| For loops, this is overly broad and would require splitting some |
| basic blocks. |
| |
| 2. For `unsafe.Pointer` conversions, we could simply opt-out entire |
| functions that convert from `unsafe.Pointer` to `uintptr`. |
| This would be easy to implement, and would keep even broken unsafe |
| code working as well as it does today, but may have broad impact, |
| especially in the presence of inlining. |
| |
| 3. A simple combination of 1 and 2 would be to opt-out any basic block |
| that is *reachable* from an `unsafe.Pointer` to `uintptr` |
| conversion, up to a function call (which is a safe-point today). |
| |
| 4. For range loops, the compiler could compile them differently such |
| that it never constructs an out-of-bounds pointer (see below). |
| |
| 5. A far more precise and general approach (thanks to @cherrymui) |
| would be to create new SSA operations that "taint" and "untaint" |
| memory. |
| The taint operation would take a mem and return a new tainted mem. |
| This taint would flow to any values that themselves took a tainted |
| value. |
| The untaint operation would take a value and a mem and return an |
| untainted value and an untainted mem. |
| During liveness analysis, safe-points would be disallowed wherever |
| a tainted value was live. |
| This is probably the most precise solution, and is likely to keep |
| even incorrect uses of unsafe working, but requires a complex |
| implementation. |
| |
| More broadly, it's worth considering making the compiler check |
| `unsafe.Pointer`-using code and actively reject code that doesn't |
| follow the allowed patterns. |
| This could be implemented as a simple type system that distinguishes |
| pointer-ish `uintptr` from numeric `uintptr`. |
| But this is out of scope for this proposal. |
| |
| ### Range loops |
| |
| As of Go 1.10, range loops are compiled roughly like: |
| |
| ```go |
| for i, x := range s { b } |
| ⇓ |
| for i, _n, _p := 0, len(s), &s[0]; i < _n; i, _p = i+1, _p + unsafe.Sizeof(s[0]) { b } |
| ⇓ |
| i, _n, _p := 0, len(s), &s[0] |
| goto cond |
| body: |
| { b } |
| i, _p = i+1, _p + unsafe.Sizeof(s[0]) |
| cond: |
| if i < _n { goto body } else { goto end } |
| end: |
| ``` |
| |
| The problem with this lowering is that `_p` may temporarily point past |
| the end of the allocation the moment before the loop terminates. |
| Currently this is safe because there's never a safe-point while this |
| value of `_p` is live. |
| |
| This lowering requires that the compiler mark the increment and |
| condition blocks as unsafe-points. |
| However, if the body is short, this could result in infrequent |
| safe-points. |
| It also requires creating a separate block for the increment, which is |
| currently usually appended to the end of the body. |
| Separating these blocks would inhibit reordering opportunities. |
| |
| In preparation for non-cooperative preemption, Go 1.11 began compiling |
| range loops as follows to avoid ever creating a past-the-end pointer: |
| |
| ```go |
| i, _n, _p := 0, len(s), &s[0] |
| if i >= _n { goto end } else { goto body } |
| top: |
| _p += unsafe.Sizeof(s[0]) |
| body: |
| { b } |
| i++ |
| if i >= _n { goto end } else { goto top } |
| end: |
| ``` |
| |
| This allows safe-points everywhere in the loop. |
| Compared to the original loop compilation, it generates slightly more |
| code, but executes the same number of conditional branch instructions |
| (n+1) and results in the same number of SSA basic blocks (3). |
| |
| This lowering does complicate bounds-check elimination. |
| In Go 1.10, bounds-check elimination knew that `i < _n` in the body |
| because the body block is dominated by the cond block. |
| However, in the new lowering, deriving this fact required detecting |
| that `i < _n` on *both* paths into body and hence is true in body. |
| |
| ### Runtime safe-points |
| |
| Beyond generated code, the runtime in general is not written to be |
| arbitrarily preemptible and there are many places that must not be |
| preempted. |
| Hence, we would likely disable safe-points by default in the runtime, |
| except at calls (where they occur now). |
| |
| While this would have little downside for most of the runtime, there |
| are some parts of the runtime that could benefit substantially from |
| non-cooperative preemption, such as memory functions like `memmove`. |
| Non-cooperative preemption is an excellent way to make these |
| preemptible without slowing down the common case, since we would only |
| need to mark their register maps (which would often be empty for |
| functions like `memmove` since all pointers would already be protected |
| by arguments). |
| |
| Over time we may opt-in more of the runtime. |
| |
| ### Unsafe standard library code |
| |
| The Windows syscall package contains many `unsafe.Pointer` conversions |
| that don't follow the `unsafe.Pointer` rules. |
| It broadly makes shaky assumptions about safe-point behavior, |
| liveness, and when stack movement can happen. |
| It would likely need a thorough auditing, or would need to be opted |
| out like the runtime. |
| |
| Perhaps more troubling is that some of the Windows syscall package |
| types have uintptr fields that are actually pointers, hence forcing |
| callers to perform unsafe pointer conversions. |
| For example, see issue [#21376](https://golang.org/issue/21376). |
| |
| ### Ensuring progress with unsafe-points |
| |
| We propose simply giving up and retrying later when a goroutine is |
| interrupted at an unsafe-point. |
| One danger of this is that safe points may be rare in tight loops. |
| However, in many cases, there are more sophisticated alternatives to |
| this approach. |
| |
| For interruptions in the runtime or in functions without any safe |
| points (such as assembly), the signal handler could unwind the stack |
| and insert a return trampoline at the next return to a function with |
| safe point metadata. |
| The runtime could then let the goroutine continue running and the |
| trampoline would pause it as soon as possible. |
| |
| For write barriers and `unsafe.Pointer` sequences, the compiler could |
| insert a cheap, explicit preemption check at the end of the sequence. |
| For example, the runtime could modify some register that would be |
| checked at the end of the sequence and let the thread continue |
| executing. |
| In the write barrier sequence, this could even be the register that |
| the write barrier flag was loaded into, and the compiler could insert |
| a simple register test and conditional branch at the end of the |
| sequence. |
| To even further shrink the sequence, the runtime could put the address |
| of the stop function in this register so the stop sequence would be |
| just a register call and a jump. |
| |
| Alternatives to this check include forward and reverse simulation. |
| Forward simulation is tricky because the compiler must be careful to |
| only generate operations the runtime knows how to simulate. |
| Reverse simulation is easy *if* the compiler can always generate a |
| restartable sequence (simply move the PC back to the write barrier |
| flag check), but quickly becomes complicated if there are multiple |
| writes in the sequence or more complex writes such as DUFFCOPY. |
| |
| |
| ## Other considerations |
| |
| All of the proposed approaches to non-cooperative preemption involve |
| stopping a running goroutine by sending its thread an OS signal. |
| This section discusses general consequences of this. |
| |
| **Windows support.** Unlike fault-based loop preemption, signaled |
| preemption is quite easy to support in Windows because it provides |
| `SuspendThread` and `GetThreadContext`, which make it trivial to get a |
| thread's register set. |
| |
| **Choosing a signal.** We have to choose a signal that is unlikely to |
| interfere with existing uses of signals or with debuggers. |
| There are no perfect choices, but there are some heuristics. |
| 1) It should be a signal that's passed-through by debuggers by |
| default. |
| On Linux, this is SIGALRM, SIGURG, SIGCHLD, SIGIO, SIGVTALRM, SIGPROF, |
| and SIGWINCH, plus some glibc-internal signals. |
| 2) It shouldn't be used internally by libc in mixed Go/C binaries |
| because libc may assume it's the only thing that can handle these |
| signals. |
| For example SIGCANCEL or SIGSETXID. |
| 3) It should be a signal that can happen spuriously without |
| consequences. |
| For example, SIGALRM is a bad choice because the signal handler can't |
| tell if it was caused by the real process alarm or not (arguably this |
| means the signal is broken, but I digress). |
| SIGUSR1 and SIGUSR2 are also bad because those are often used in |
| meaningful ways by applications. |
| 4) We need to deal with platforms without real-time signals (like |
| macOS), so those are out. |
| |
| We use SIGURG because it meets all of these criteria, is extremely |
| unlikely to be used by an application for its "real" meaning (both |
| because out-of-band data is basically unused and because SIGURG |
| doesn't report which socket has the condition, making it pretty |
| useless), and even if it is, the application has to be ready for |
| spurious SIGURG. SIGIO wouldn't be a bad choice either, but is more |
| likely to be used for real. |
| |
| **Scheduler preemption.** This mechanism is well-suited to temporary |
| preemptions where the same goroutine will resume after the preemption |
| because we don't need to save the full register state and can rely on |
| the existing signal return path to restore the full register state. |
| This applies to all GC-related preemptions, but it's not as well |
| suited to permanent preemption performed by the scheduler. |
| However, we could still build on this mechanism. |
| For example, since most of the time goroutines self-preempt, we only |
| need to save the full signal state in the uncommon case, so the `g` |
| could contain a pointer to its full saved state that's only used after |
| a forced preemption. |
| Restoring the full signal state could be done by either writing the |
| architecture-dependent code to restore the full register set (a |
| beefed-up `runtime.gogo`), or by self-signaling, swapping in the |
| desired context, and letting the OS restore the full register set. |
| |
| **Targeting and resuming.** In contrast with fault-based loop |
| preemption, signaled preemption can be targeted at a specific thread |
| and can immediately resume. |
| Thread-targeting is a little different from cooperative preemption, |
| which is goroutine-targeted. |
| However, in many cases this is actually better, since targeting |
| goroutines for preemption is racy and hence requires retry loops that |
| can add significantly to STW time. |
| Taking advantage of this for stack scanning will require some |
| restructuring of how we track GC roots, but the result should |
| eliminate the blocking retry loop we currently use. |
| |
| **Non-pointer pointers.** This has the potential to expose incorrect |
| uses of `unsafe.Pointer` for transiently storing non-pointers. |
| Such uses are a clear violation of the `unsafe.Pointer` rules, but |
| they may happen (especially in, for example, cgo-using code). |
| |
| |
| ## Alternatives |
| |
| ### Single-stepping |
| |
| Rather than making an effort to be able to stop at any instruction, |
| the compiler could emit metadata for safe-points only at back-edges |
| and the runtime could use hardware single-stepping support to advance |
| the thread to a safe-point (or a point where the compiler has provided |
| a branch to reach a safe-point, like in the current loop preemption |
| approach). |
| This works (somewhat surprisingly), but thoroughly bamboozles |
| debuggers since both the debugger and the operating system assume the |
| debugger owns single-stepping, not the process itself. |
| This would also require the compiler to provide register flushing |
| stubs for these safe-points, which increases code size (and hence |
| instruction cache pressure) as well as stack size, much like |
| cooperative loop preemption. |
| However, unlike cooperative loop preemption, this approach would have |
| no effect on mainline code size or performance. |
| |
| ### Jump rewriting |
| |
| We can solve the problems of single-stepping by instead rewriting the |
| next safe-point jump instruction after the interruption point to jump |
| to a preemption path and resuming execution like usual. |
| To make this easy, the compiler could leave enough room (via padding |
| NOPs) so only the jump target needs to be modified. |
| |
| This approach has the usual drawbacks of modifiable code. |
| It's a security risk, it breaks text page sharing, and simply isn't |
| allowed on iOS. |
| It also can't target an individual goroutine (since another goroutine |
| could be executing the same code) and may have odd interactions with |
| concurrent execution on other cores. |
| |
| ### Out-of-line execution |
| |
| A further alternative in the same vein, but that doesn't require |
| modifying existing text is out-of-line execution. |
| In this approach, the signal handler relocates the instruction stream |
| from the interruption point to the next safe-point jump into a |
| temporary buffer, patches it to jump into the runtime at the end, and |
| resumes execution in this relocated sequence. |
| |
| This solves most of the problems with single-stepping and jump |
| rewriting, but is quite complex to implement and requires substantial |
| implementation effort for each platform. |
| It also isn't allowed on iOS. |
| |
| There is precedent for this sort of approach. |
| For example, when Linux uprobes injects an INT3, it relocates the |
| overwritten instructions into an "execute out-of-line" area to avoid |
| the usual problems with resuming from an INT3 instruction. |
| [The |
| implementation](https://github.com/torvalds/linux/blob/v4.18/arch/x86/kernel/uprobes.c) |
| is surprisingly simple given the complexity of the x86 instruction |
| encoding, but is still quite complex. |