# Proposal: Non-cooperative goroutine preemption

Author(s): Austin Clements

Last updated: 2018-03-26

Discussion at https://golang.org/issue/24543.

## Abstract

Go currently uses compiler-inserted cooperative preemption points in
function prologues.
The majority of the time, this is good enough to allow Go developers
to ignore preemption and focus on writing clear parallel code, but it
has sharp edges that we've seen degrade the developer experience time
and time again.
When it goes wrong, it goes spectacularly wrong, leading to mysterious
system-wide latency issues and sometimes complete freezes.
And because this is a language implementation issue that exists
outside of Go's language semantics, these failures are surprising and
very difficult to debug.

@dr2chase has put significant effort into prototyping cooperative
preemption points in loops, which is one way to solve this problem.
However, even sophisticated approaches to this led to unacceptable
slow-downs in tight loops (where slow-downs are generally least
acceptable).

I propose that the Go implementation switch to non-cooperative
preemption using stack and register maps at (essentially) every
instruction.
This would allow goroutines to be preempted without explicit
preemption checks.
This approach will solve the problem of delayed preemption with zero
run-time overhead.


## Background

Up to and including Go 1.10, Go has used cooperative preemption with
safe-points only at function calls (and even then, not if the function
is small or gets inlined).
This can result in infrequent safe-points, which leads to many
problems:

1. The most common problem in production code is that this can delay
   STW operations, such as starting and ending a GC cycle.
   This increases STW latency, and on large core counts can
   significantly impact throughput (if, for example, most threads are
   stopped while the runtime waits on a straggler for a long time).
   (#17831, #19241)

2. This can delay scheduling, preventing competing goroutines from
   executing in a timely manner.

3. This can delay stack scanning, which consumes CPU while the runtime
   waits for a preemption point and can ultimately delay GC
   termination, resulting in an effective STW where the system runs
   out of heap and no goroutines can allocate.

4. In really extreme cases, it can cause a program to halt, such as
   when a goroutine spinning on an atomic load starves out the
   goroutine responsible for setting that atomic.
   This often indicates bad or buggy code, but is surprising
   nonetheless and has clearly wasted a lot of developer time on
   debugging.
   (#543, #12553, #13546, #14561, #15442, #17174, #20793, #21053)

These problems impede developer productivity and production efficiency
and expose Go's users to implementation details they shouldn't have to
worry about.

### Loop preemption

@dr2chase put significant effort into trying to solve these problems
using explicit *loop preemption* (#10958).
This is a standard approach for runtimes employing cooperative
preemption in which the compiler inserts preemption checks and
safe-points at back-edges in the flow graph.
This significantly improves the quality of preemption, since code
rarely executes for a non-trivial amount of time without crossing a
back-edge.

Our most recent approach to loop preemption, which we call
*fault-based preemption*, adds a single instruction, no branches, and
no register pressure to loops on x86 and UNIX platforms ([CL
43050](https://golang.org/cl/43050)).
Despite this, the geomean slow-down on a [large suite of
benchmarks](https://perf.golang.org/search?q=upload%3A20171003.1+%7C+upload-part%3A20171003.1%2F3+vs+upload-part%3A20171003.1%2F1)
is 7.8%, with a handful of significantly worse outliers.
Even [compared to Go
1.9](https://perf.golang.org/search?q=upload%3A20171003.1+%7C+upload-part%3A20171003.1%2F0+vs+upload-part%3A20171003.1%2F1),
where the slow-down is only 1% thanks to other improvements, most
benchmarks see some slow-down and there are still significant
outliers.

Fault-based preemption also has several implementation downsides.
It can't target specific threads or goroutines, so it's a poor match
for stack scanning, ragged barriers, or regular scheduler preemption.
It's also "sticky", in that we can't resume any loops until we resume
*all* loops, so the safe-point can't simply resume if it occurs in an
unsafe state (such as when runtime locks are held).
And it requires more instructions (and more overhead) on non-x86 and
non-UNIX platforms.


## Proposal

I propose that we implement fully non-cooperative preemption by
recording enough metadata to allow safe-points (almost) everywhere
without explicit preemption checks in function bodies.

To do this, we would modify the compiler to produce register maps in
addition to stack maps, and to emit these for as many program counters
as possible.
The runtime would use a signal (or `GetThreadContext` on Windows, or a
note on Plan 9) to retrieve each thread's register state, including the
stack and register map at the interrupted PC.
The garbage collector would then treat live pointers in registers just
as it treats live pointers on the stack.

Certain instructions cannot be safe-points, so if a signal occurs at
such a point, the runtime would simply resume the thread and try again
later.
The compiler just needs to make *most* instructions safe-points.
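
As a rough illustration, the preemption path might look like the
following sketch.
Every name here (`sigctxt`, `liveness`, `findMaps`, `trySuspend`) is a
stand-in invented for exposition rather than the runtime's actual
code:

```go
package main

// sigctxt stands in for the register state captured by the signal
// handler (or by GetThreadContext on Windows).
type sigctxt struct{ pc uintptr }

// liveness stands in for the stack and register maps recorded by the
// compiler for a given PC.
type liveness struct {
	stackPtrs []bool // which stack slots hold live pointers
	regPtrs   []bool // which registers hold live pointers
}

// findMaps looks up the maps recorded for pc, reporting ok=false if
// the compiler marked pc as an unsafe-point.
func findMaps(pc uintptr) (m liveness, ok bool) {
	// ... compiler-generated table lookup ...
	return liveness{}, false
}

// trySuspend reports whether the thread can be suspended at ctx.pc.
func trySuspend(ctx *sigctxt) bool {
	if _, ok := findMaps(ctx.pc); !ok {
		return false // unsafe-point: resume the thread, retry later
	}
	// The GC can now treat live pointers in registers just like live
	// pointers in stack slots.
	return true
}

func main() {}
```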

To @minux's credit, he suggested this in [the very first
reply](https://github.com/golang/go/issues/10958#issuecomment-105678822)
to #10958.
At the time we thought adding safe-points everywhere would be too
difficult and that the overhead of explicit loop preemption would be
lower than it turned out to be.

Many other garbage-collected languages use explicit safe-points on
back-edges, or they use forward-simulation to reach a safe-point.
Go can support safe-points everywhere in part because its GC must
already have excellent support for interior pointers; in many
languages, interior pointers never appear at a safe-point.


## Handling unsafe-points

There are various points in generated code that must be GC-atomic and
thus cannot have safe-points in them.
A few known situations are:

1. Expressions involving `unsafe.Pointer` may temporarily represent
   the only pointer to an object as a `uintptr`.
   Hence, there must be no safe-points while a `uintptr` derived from
   an `unsafe.Pointer` is live (see the sketch after this list).
   Likewise, we must recognize `reflect.Value.Pointer`,
   `reflect.Value.UnsafeAddr`, and `reflect.Value.InterfaceData` as
   `unsafe.Pointer`-to-`uintptr` conversions.
   Alternatively, if the compiler can reliably detect such `uintptr`s,
   it could mark them as pointers, but there's a danger that an
   intermediate value may not represent a legal pointer value.

2. In the write barrier there must not be a safe-point between the
   write-barrier-enabled check and the direct write.
   For example, suppose the goroutine is writing a pointer to B into
   object A.
   Suppose the check sees that the barrier is disabled; if a GC cycle
   then starts and scans A, after which the goroutine performs the now
   unbarriered write of B into A and drops all other references to B
   from its stack, the garbage collector will fail to mark B.

3. There are places where the compiler generates temporary pointers
   that can point past the end of allocations, such as in range loops
   over slices and arrays.
   Such pointers would either have to be avoided, or safe-points would
   have to be disallowed while they are live.

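To make case 1 concrete, here is a small contrived example.
The first conversion follows the `unsafe.Pointer` rules by keeping the
arithmetic in a single expression; the second leaves a `uintptr`
derived from an `unsafe.Pointer` live across statements, so no
safe-point may be allowed while `u` is live:

```go
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	x := [2]int{10, 20}

	// Allowed pattern: pointer → uintptr → pointer within a single
	// expression, which the compiler can keep free of safe-points.
	p := (*int)(unsafe.Pointer(uintptr(unsafe.Pointer(&x[0])) + unsafe.Sizeof(x[0])))
	fmt.Println(*p) // 20

	// Case 1: u is a pointer disguised as a uintptr, so a safe-point
	// between these two statements must be disallowed.
	u := uintptr(unsafe.Pointer(&x[0]))
	q := (*int)(unsafe.Pointer(u))
	fmt.Println(*q) // 10
}
```
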
All of these cases must already avoid significant reordering to avoid
being split across a call.
Internally, this is achieved via the "mem" pseudo-value, which must be
sequentially threaded through all SSA values that manipulate memory.
Mem is also threaded through values that must not be reordered, even
if they don't touch memory.
For example, conversion between `unsafe.Pointer` and `uintptr` is done
with a special "Convert" operation that takes a mem solely to
constrain reordering.

There are several possible solutions to these problems, some of which
can be combined:

1. We could mark basic blocks that shouldn't contain preemption
   points.
   For `unsafe.Pointer` conversions, we would opt-out the basic block
   containing the conversion.
   For code adhering to the `unsafe.Pointer` rules, this should be
   sufficient, but it may break code that is incorrect but happens to
   work today in ways that are very difficult to debug.
   For write barriers this is also sufficient.
   For loops, this is overly broad and would require splitting some
   basic blocks.

2. For `unsafe.Pointer` conversions, we could simply opt-out entire
   functions that convert from `unsafe.Pointer` to `uintptr`.
   This would be easy to implement, and would keep even broken unsafe
   code working as well as it does today, but may have broad impact,
   especially in the presence of inlining.

3. A simple combination of 1 and 2 would be to opt-out any basic block
   that is *reachable* from an `unsafe.Pointer` to `uintptr`
   conversion, up to a function call (which is a safe-point today).

4. For range loops, the compiler could compile them differently such
   that it never constructs an out-of-bounds pointer (see below).

5. A far more precise and general approach (thanks to @cherrymui)
   would be to create new SSA operations that "taint" and "untaint"
   memory.
   The taint operation would take a mem and return a new tainted mem.
   This taint would flow to any values that themselves took a tainted
   value.
   The untaint operation would take a value and a mem and return an
   untainted value and an untainted mem.
   During liveness analysis, safe-points would be disallowed wherever
   a tainted value was live.
   This is probably the most precise solution, and is likely to keep
   even incorrect uses of unsafe working, but requires a complex
   implementation.

More broadly, it's worth considering making the compiler check
`unsafe.Pointer`-using code and actively reject code that doesn't
follow the allowed patterns.
This could be implemented as a simple type system that distinguishes
pointer-ish `uintptr` from numeric `uintptr`.
But this is out of scope for this proposal.

### Range loops

Currently (as of 1.10), range loops are compiled roughly like:

```go
for i, x := range s { b }
⇓
for i, _n, _p := 0, len(s), &s[0]; i < _n; i, _p = i+1, _p + unsafe.Sizeof(s[0]) { b }
⇓
i, _n, _p := 0, len(s), &s[0]
goto cond
body:
{ b }
i, _p = i+1, _p + unsafe.Sizeof(s[0])
cond:
if i < _n { goto body } else { goto end }
end:
```

The problem with this lowering is that `_p` may temporarily point past
the end of the allocation the moment before the loop terminates.
Currently this is safe because there's never a safe-point while this
value of `_p` is live.

We could keep this lowering and mark the increment and condition
blocks as unsafe-points.
However, if the body is short, this could result in infrequent
safe-points.
It would also require creating a separate block for the increment,
which is currently usually appended to the end of the body.
Separating these blocks would inhibit reordering opportunities.

Alternatively, we could rewrite the loop to never create a
past-the-end pointer.
For example, we could lower it like:

```go
i, _n, _p := 0, len(s), &s[0]
if i >= _n { goto end } else { goto body }
top:
_p += unsafe.Sizeof(s[0])
body:
{ b }
i++
if i >= _n { goto end } else { goto top }
end:
```

This would allow safe-points everywhere in the loop.
Compared to the current loop compilation, it generates slightly more
code, but executes the same number of conditional branch instructions
(n+1) and results in the same number of SSA basic blocks (3).
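
For illustration, here is a hand-expanded, runnable rendition of the
new lowering (my own sketch, not compiler output; the compiler would
perform this on SSA rather than source, and the `uintptr` arithmetic
below is confined to a single expression as the `unsafe.Pointer` rules
require):

```go
package main

import (
	"fmt"
	"unsafe"
)

// sum mirrors the alternative lowering: the element pointer is
// advanced at "top", before the body runs, so it never points past
// the end of the slice.
func sum(s []int) int {
	total := 0
	i, n := 0, len(s)
	if n == 0 {
		goto end
	}
	{
		p := unsafe.Pointer(&s[0])
		goto body
	top:
		p = unsafe.Pointer(uintptr(p) + unsafe.Sizeof(s[0]))
	body:
		total += *(*int)(p)
		i++
		if i < n {
			goto top
		}
	}
end:
	return total
}

func main() {
	fmt.Println(sum([]int{1, 2, 3})) // 6
}
```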

This lowering does complicate bounds-check elimination.
Currently, bounds-check elimination knows that `i < _n` in the body
because the body block is dominated by the cond block.
However, in the new lowering, deriving this fact requires detecting
that `i < _n` on *both* paths into body and hence is true in body.

### Runtime safe-points

Beyond generated code, the runtime in general is not written to be
arbitrarily preemptible and there are many places that must not be
preempted.
Hence, we would likely disable safe-points by default in the runtime,
except at calls (where they occur now).

While this would have little downside for most of the runtime, there
are some parts of the runtime that could benefit substantially from
non-cooperative preemption, such as memory functions like `memmove`.
Non-cooperative preemption is an excellent way to make these
preemptible without slowing down the common case, since we would only
need to mark their register maps (which would often be empty for
functions like `memmove`, since any pointers in registers would also
be live in their stack arguments).

Over time we may opt-in more of the runtime.


## Encoding of stack and register maps

To be determined.

@austin's prototype simply added the register liveness map as a third
liveness map accompanying the existing argument and local liveness
maps.
The downside of this is that any change in any of these maps requires
writing out all of them, and the register liveness changes tend not to
correlate with argument and local liveness changes.
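
For concreteness, the prototype's layout corresponds to something like
the sketch below (all names here are invented for exposition; the
actual encoding is exactly the open question above):

```go
package liveness // illustration only; the real encoding is undecided

// bitvector stands in for the runtime's compact bitmap type.
type bitvector []byte

// safePointMaps sketches the prototype's per-safe-point record: the
// three maps are stored together, so a change to any one of them
// forces a new copy of all three, even though register liveness
// rarely changes in step with argument and local liveness.
type safePointMaps struct {
	args   bitvector // argument liveness (existing)
	locals bitvector // local liveness (existing)
	regs   bitvector // register liveness (new)
}
```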


## Other uses

### Heap dump analysis

Having safe-points everywhere fixes problems with heap dump analysis
from core files, which currently has to use conservative heuristics to
identify live pointers in active call frames.

### Call injection

Having safe-points everywhere also allows some function calls to be
safely injected at runtime.
This is useful in at least two situations:

1. To handle synchronous signals, such as nil-pointer dereferences,
   the runtime injects a call to `runtime.sigpanic` at the location of
   the fault.
   However, since there isn't usually a call at this location, the
   stack map may be inaccurate, which leads to complicated
   interactions between defers, escape analysis, and traceback
   handling.
   Having safe-points everywhere could simplify this.

2. Debugger function calls (#21678).
   Currently it's essentially impossible for a debugger to dynamically
   invoke a Go function call because of poor interactions with stack
   scanning and the garbage collector.
   Having stack and register maps everywhere would make this
   significantly more viable, since a function call could be injected
   nearly anywhere without making the stack un-scannable.


## Testing

The primary danger of this approach is its potential for a long bug
tail, since the coverage of safe-points in regular testing will
decrease substantially.
In addition to standard testing, I propose checking the generated
liveness maps using static analysis of binaries.
This tool would look for pointer dereferences or stores with a write
barrier to indicate that a value is a pointer and would check the flow
of that value through all possible paths.
It would report anywhere a value transitioned from dead/scalar to live
pointer and anywhere a value was used both like a pointer and like a
scalar.

In effect, this tool would simulate the program to answer the question
"for every two points in time A < B, are there allocations reachable
from the liveness map at time B that were not reachable at time A and
were not allocated between A and B?"

Most likely this static analysis tool could be written atop the
existing [golang.org/x/arch](https://godoc.org/golang.org/x/arch)
packages.
These are the same packages used by, for example, `go tool objdump`,
and they handle most of the heavy lifting of decoding the binary
itself.
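
As a starting point, the decoding layer of such a tool might look like
the sketch below, which uses the real `golang.org/x/arch/x86/x86asm`
package to walk the `.text` section of an ELF binary; the pointer-flow
analysis itself, which is the hard part, is reduced to a comment:

```go
package main

import (
	"debug/elf"
	"fmt"
	"log"
	"os"

	"golang.org/x/arch/x86/x86asm"
)

func main() {
	f, err := elf.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	text := f.Section(".text")
	if text == nil {
		log.Fatal("no .text section")
	}
	code, err := text.Data()
	if err != nil {
		log.Fatal(err)
	}

	for off := 0; off < len(code); {
		inst, err := x86asm.Decode(code[off:], 64)
		if err != nil {
			off++ // skip bytes we can't decode (e.g., data in text)
			continue
		}
		// A real checker would track pointer-ness here: a memory
		// operand implies its base register held a pointer, and a
		// move propagates pointer-ness between locations.
		fmt.Printf("%#x\t%v\n", text.Addr+uint64(off), inst)
		off += inst.Len
	}
}
```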


## Other considerations

**Space overhead.** Traditionally, the main concern with having
safe-points everywhere is the overhead of saving the stack/register
maps.
A very preliminary implementation of register maps and safe-points
everywhere increased binary size by ~10%.
However, this left several obvious optimizations on the table.
Work by Stichnoth et al. [1] further suggests that this overhead
can be significantly curtailed with simple compression techniques.

**Windows support.** Unlike fault-based loop preemption, signaled
preemption is quite easy to support on Windows because it provides
`SuspendThread` and `GetThreadContext`, which make it trivial to get a
thread's register set.

**Decoupling stack-move points from GC safe-points.** Because of the
details of the current implementation, these are (essentially) the
same.
By decoupling these and only allowing stack growth and shrinking at
function entry, stack copying would not need to adjust registers.
This keeps stack copying simpler.
It also enables better compiler optimizations and more safe-points
since the compiler need not mark a register as a live pointer if it
knows there's another live pointer to the object.
Likewise, registers that are known to be derived from pointer
arguments can be marked as scalar as long as those arguments are live.
Such optimizations are possible because of Go's non-moving collector.
This also prevents stack moving from observing (and crashing on)
transient small-valued pointers that the compiler constructs when it
knows an offset from a potentially-nil pointer will be small.

**Debuggers.** Debuggers would have to be taught to ignore the signal
used for stopping a thread.
However, if we can use a distinct signal for this (such as one of the
POSIX real-time signals), this should be easier than teaching
debuggers to distinguish the `SIGSEGV`s produced by 1.10's loop
preemption from genuine `SIGSEGV`s.
It's also not clear that our current fault-based approach can work at
all under many debuggers on OS X due to a [kernel
bug](https://bugs.llvm.org/show_bug.cgi?id=22868).

**Scheduler preemption.** This mechanism is well-suited to temporary
preemptions where the same goroutine will resume after the preemption
because we don't need to save the full register state and can rely on
the existing signal return path to restore the full register state.
This applies to all GC-related preemptions, but it's not as well
suited to permanent preemption performed by the scheduler.
However, we could still build on this mechanism.
For example, since most of the time goroutines self-preempt, we only
need to save the full signal state in the uncommon case, so the `g`
could contain a pointer to its full saved state that's only used after
a forced preemption.
Restoring the full signal state could be done by either writing the
architecture-dependent code to restore the full register set (a
beefed-up `runtime.gogo`), or by self-signaling, swapping in the
desired context, and letting the OS restore the full register set.
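
Schematically, the suggestion amounts to something like the following
(type and field names here are hypothetical):

```go
package sched // illustration only; the real g lives in the runtime

// sigRegs stands in for the full architecture register file captured
// by the signal handler.
type sigRegs struct{ gpRegs, fpRegs [32]uint64 }

type g struct {
	// ... existing scheduler state (stack bounds, gobuf, etc.) ...

	// preemptRegs is non-nil only after a forced, signal-based
	// preemption; the common self-preemption path never sets it.
	preemptRegs *sigRegs
}
```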

**Targeting and resuming.** In contrast with fault-based loop
preemption, signaled preemption can be targeted at a specific thread
and can immediately resume.
Thread-targeting is a little different from cooperative preemption,
which is goroutine-targeted.
However, in many cases this is actually better, since targeting
goroutines for preemption is racy and hence requires retry loops that
can add significantly to STW time.
Taking advantage of this for stack scanning will require some
restructuring of how we track GC roots, but the result should
eliminate the blocking retry loop we currently use.

**Non-pointer pointers.** This has the potential to expose incorrect
uses of `unsafe.Pointer` for transiently storing non-pointers.
Such uses are a clear violation of the `unsafe.Pointer` rules, but
they may happen (especially in, for example, cgo-using code).

**Assembly.** By default, the runtime cannot safely preempt assembly
code since it won't know what registers contain pointers. As a
follow-on to the work on safe-points everywhere, we should audit
assembly in the standard library for non-preemptible loops and
annotate them with register maps. In most cases this should be trivial
since most assembly never constructs a pointer that isn't shadowed by
an argument, so it can simply claim there are no pointers in
registers. We should also document in the Go assembly guide how to do
this for user code.


## Alternatives

Rather than significantly increasing the number of safe-points, we
could use a signal to stop a thread and then use hardware
single-stepping support to advance the thread to a safe-point (or a
point where the compiler has provided a branch to reach a safe-point,
like in the current loop preemption approach).
This works (somewhat surprisingly), but thoroughly bamboozles
debuggers since both the debugger and the operating system assume the
debugger owns single-stepping, not the process itself.
This would also require the compiler to provide register flushing
stubs for these safe-points, which increases code size (and hence
instruction cache pressure) as well as stack size.
Safe-points everywhere increase binary size, but not code size or
stack size.


## Compatibility

This proposal introduces no new APIs, so it is Go 1 compatible.


## Implementation

Austin Clements (@austin) plans to implement register and stack maps
everywhere for Go 1.11.
This will enable some low-risk uses in the short term, such as
debugger function calls.

Debugging and testing of register and stack maps can continue into the
Go 1.11 freeze, including building the static analysis tool.

Then, for Go 1.12, Austin will implement safe-points everywhere atop
the register and stack maps.


## References

[1] James M. Stichnoth, Guei-Yuan Lueh, and Michał Cierniak. 1999. Support for garbage collection at every instruction in a Java compiler. In *Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation* (PLDI '99). ACM, New York, NY, USA, 118–127.
