design: add 24543-non-cooperative-preemption

For golang/go#24543.

Change-Id: Iba313a963aafcd93521bb9e006cb32d1f242301b
Reviewed-on: https://go-review.googlesource.com/102600
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
diff --git a/design/24543-non-cooperative-preemption.md b/design/24543-non-cooperative-preemption.md
new file mode 100644
index 0000000..281b6ae
--- /dev/null
+++ b/design/24543-non-cooperative-preemption.md
@@ -0,0 +1,504 @@
+# Proposal: Non-cooperative goroutine preemption
+
+Author(s): Austin Clements
+
+Last updated: 2018-03-26
+
+Discussion at https://golang.org/issue/24543.
+
+## Abstract
+
+Go currently uses compiler-inserted cooperative preemption points in
+function prologues.
+The majority of the time, this is good enough to allow Go developers
+to ignore preemption and focus on writing clear parallel code, but it
+has sharp edges that we've seen degrade the developer experience time
+and time again.
+When it goes wrong, it goes spectacularly wrong, leading to mysterious
+system-wide latency issues and sometimes complete freezes.
+And because this is a language implementation issue that exists
+outside of Go's language semantics, these failures are surprising and
+very difficult to debug.
+
+@dr2chase has put significant effort into prototyping cooperative
+preemption points in loops, which is one way to solve this problem.
+However, even sophisticated approaches to this led to unacceptable
+slow-downs in tight loops (where slow-downs are generally least
+acceptable).
+
+I propose that the Go implementation switch to non-cooperative
+preemption using stack and register maps at (essentially) every
+instruction.
+This would allow goroutines to be preempted without explicit
+preemption checks.
+This approach will solve the problem of delayed preemption with zero
+run-time overhead.
+
+
+## Background
+
+Up to and including Go 1.10, Go has used cooperative preemption with
+safe-points only at function calls (and even then, not if the function
+is small or gets inlined).
+This can result in infrequent safe-points, which leads to many
+problems:
+
+1. The most common in production code is that this can delay STW
+   operations, such as starting and ending a GC cycle.
+   This increases STW latency, and on large core counts can
+   significantly impact throughput (if, for example, most threads are
+   stopped while the runtime waits on a straggler for a long time).
+   (#17831, #19241)
+
+2. This can delay scheduling, preventing competing goroutines from
+   executing in a timely manner.
+
+3. This can delay stack scanning, which consumes CPU while the runtime
+   waits for a preemption point and can ultimately delay GC
+   termination, resulting in an effective STW where the system runs
+   out of heap and no goroutines can allocate.
+
+4. In really extreme cases, it can cause a program to halt, such as
+   when a goroutine spinning on an atomic load starves out the
+   goroutine responsible for setting that atomic.
+   This often indicates bad or buggy code, but is surprising
+   nonetheless and has clearly wasted a lot of developer time on
+   debugging.
+   (#543, #12553, #13546, #14561, #15442, #17174, #20793, #21053)
+
+These problems impede developer productivity and production efficiency
+and expose Go's users to implementation details they shouldn't have to
+worry about.
+
+### Loop preemption
+
+@dr2chase put significant effort into trying to solve these problems
+using explicit *loop preemption* (#10958).
+This is a standard approach for runtimes employing cooperative
+preemption in which the compiler inserts preemption checks and
+safe-points at back-edges in the flow graph.
+This significantly improves the quality of preemption, since code
+rarely executes for a non-trivial amount of time without crossing a
+back-edge.
+
+Our most recent approach to loop preemption, which we call
+*fault-based preemption*, adds a single instruction, no branches, and
+no register pressure to loops on x86 and UNIX platforms ([CL
+43050](https://golang.org/cl/43050)).
+Despite this, the geomean slow-down on a [large suite of
+benchmarks](https://perf.golang.org/search?q=upload%3A20171003.1+%7C+upload-part%3A20171003.1%2F3+vs+upload-part%3A20171003.1%2F1)
+is 7.8%, with a handful of significantly worse outliers.
+Even [compared to Go
+1.9](https://perf.golang.org/search?q=upload%3A20171003.1+%7C+upload-part%3A20171003.1%2F0+vs+upload-part%3A20171003.1%2F1),
+where the slow-down is only 1% thanks to other improvements, most
+benchmarks see some slow-down and there are still significant
+outliers.
+
+Fault-based preemption also has several implementation downsides.
+It can't target specific threads or goroutines, so it's a poor match
+for stack scanning, ragged barriers, or regular scheduler preemption.
+It's also "sticky", in that we can't resume any loops until we resume
+*all* loops, so the safe-point can't simply resume if it occurs in an
+unsafe state (such as when runtime locks are held).
+And it requires more instructions (and more overhead) on non-x86 and
+non-UNIX platforms.
+
+
+## Proposal
+
+I propose that we implement fully non-cooperative preemption by
+recording enough metadata to allow safe-points (almost) everywhere
+without explicit preemption checks in function bodies.
+
+To do this, we would modify the compiler to produce register maps in
+addition to stack maps, and to emit these for as many program counters
+as possible.
+The runtime would use a signal (or `GetThreadContext` on Windows, or a
+note on Plan9) to retrieve each thread's register state, including the
+stack and register map at the interrupted PC.
+The garbage collector would then treat live pointers in registers just
+as it treats live pointers on the stack.
+
+Certain instructions cannot be safe-points, so if a signal occurs at
+such a point, the runtime would simply resume the thread and try again
+later.
+The compiler just needs to make *most* instructions safe-points.
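+
+Sketched in Go, the runtime side of this might look like the following
+(all names here are hypothetical, not existing runtime APIs):
+
+```go
+package main
+
+// context stands in for the register state delivered to a signal
+// handler.
+type context struct{ pc uintptr }
+
+// isSafePoint reports whether the compiler emitted stack and register
+// maps for this PC (the actual map lookup is elided).
+func isSafePoint(pc uintptr) bool { return true }
+
+// onPreemptSignal sketches what the preemption signal handler would do.
+func onPreemptSignal(ctxt *context) {
+	if isSafePoint(ctxt.pc) {
+		// Suspend the goroutine here; the GC can scan its stack and
+		// registers using the maps recorded for ctxt.pc.
+	} else {
+		// Not a safe-point: return from the handler, let the thread
+		// continue, and retry the preemption later.
+	}
+}
+
+func main() { onPreemptSignal(&context{pc: 0}) }
+```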
+
+To @minux's credit, he suggested this in [the very first
+reply](https://github.com/golang/go/issues/10958#issuecomment-105678822)
+to #10958.
+At the time we thought adding safe-points everywhere would be too
+difficult and that the overhead of explicit loop preemption would be
+lower than it turned out to be.
+
+Many other garbage-collected languages use explicit safe-points on
+back-edges, or they use forward-simulation to reach a safe-point.
+It's possible for Go to support safe-points everywhere partly because
+Go's GC must already have excellent support for interior pointers; in
+many languages, interior pointers never appear at a safe-point.
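+
+For illustration, here is a minimal example of the kind of interior
+pointer Go's GC must already handle at any safe-point:
+
+```go
+package main
+
+func main() {
+	s := make([]int, 10)
+	p := &s[5] // interior pointer into the middle of s's backing array
+	s = nil    // drop the only pointer to the array's start
+	*p = 1     // the GC must keep the whole allocation live through p alone
+}
+```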
+
+
+## Handling unsafe-points
+
+There are various points in generated code that must be GC-atomic and
+thus cannot have safe-points in them.
+A few known situations are:
+
+1. Expressions involving `unsafe.Pointer` may temporarily represent
+   the only pointer to an object as a `uintptr`.
+   Hence, there must be no safe-points while a `uintptr` derived from
+   an `unsafe.Pointer` is live.
+   Likewise, we must recognize `reflect.Value.Pointer`,
+   `reflect.Value.UnsafeAddr`, and `reflect.Value.InterfaceData` as
+   `unsafe.Pointer`-to-`uintptr` conversions.
+   Alternatively, if the compiler can reliably detect such `uintptr`s,
+   it could mark them as pointers, but there's a danger that an
+   intermediate value may not represent a legal pointer value.
+   (A sketch of this situation appears after this list.)
+
+2. In the write barrier there must not be a safe-point between the
+   write-barrier-enabled check and a direct write.
+   For example, suppose the goroutine is writing a pointer to B into
+   object A.
+   If the goroutine checks the flag and finds barriers disabled, but a
+   GC cycle then starts and scans A before the goroutine performs the
+   direct write of B into A and drops all other references to B from
+   its stack, the garbage collector will fail to mark B.
+
+3. There are places where the compiler generates temporary pointers
+   that can be past the end of allocations, such as in range loops
+   over slices and arrays.
+   These would either have to be avoided or safe-points would have to
+   be disallowed while these are live.
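+
+To make case 1 concrete, here is a minimal sketch (the type and offset
+are illustrative) in which the only reference to an allocation is
+temporarily a `uintptr`, so no safe-point may occur between the two
+conversions:
+
+```go
+package main
+
+import "unsafe"
+
+type buf struct{ x [16]byte }
+
+func main() {
+	// After this conversion, p is the only reference to the new
+	// allocation, but it is typed as an integer.
+	p := uintptr(unsafe.Pointer(new(buf)))
+	p += 8 // no safe-point may occur while only p refers to the object
+	// Reconversion re-establishes a typed pointer the GC can see.
+	q := (*byte)(unsafe.Pointer(p))
+	_ = q
+}
+```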
+
+All of these cases must already avoid significant reordering to avoid
+being split across a call.
+Internally, this is achieved via the "mem" pseudo-value, which must be
+sequentially threaded through all SSA values that manipulate memory.
+Mem is also threaded through values that must not be reordered, even
+if they don't touch memory.
+For example, conversion between `unsafe.Pointer` and `uintptr` is done
+with a special "Convert" operation that takes a mem solely to
+constrain reordering.
+
+There are several possible solutions to these problems, some of which
+can be combined:
+
+1. We could mark basic blocks that shouldn't contain preemption
+   points.
+   For `unsafe.Pointer` conversions, we would opt-out the basic block
+   containing the conversion.
+   For code adhering to the `unsafe.Pointer` rules, this should be
+   sufficient, but it may break code that is incorrect but happens to
+   work today in ways that are very difficult to debug.
+   For write barriers this is also sufficient.
+   For loops, this is overly broad and would require splitting some
+   basic blocks.
+
+2. For `unsafe.Pointer` conversions, we could simply opt-out entire
+   functions that convert from `unsafe.Pointer` to `uintptr`.
+   This would be easy to implement, and would keep even broken unsafe
+   code working as well as it does today, but may have broad impact,
+   especially in the presence of inlining.
+
+3. A simple combination of 1 and 2 would be to opt-out any basic block
+   that is *reachable* from an `unsafe.Pointer` to `uintptr`
+   conversion, up to a function call (which is a safe-point today).
+
+4. For range loops, the compiler could compile them differently such
+   that it never constructs an out-of-bounds pointer (see below).
+
+5. A far more precise and general approach (thanks to @cherrymui)
+   would be to create new SSA operations that "taint" and "untaint"
+   memory.
+   The taint operation would take a mem and return a new tainted mem.
+   This taint would flow to any values that themselves took a tainted
+   value.
+   The untaint operation would take a value and a mem and return an
+   untainted value and an untainted mem.
+   During liveness analysis, safe-points would be disallowed wherever
+   a tainted value was live.
+   This is probably the most precise solution, and is likely to keep
+   even incorrect uses of unsafe working, but requires a complex
+   implementation.
+
+More broadly, it's worth considering making the compiler check
+`unsafe.Pointer`-using code and actively reject code that doesn't
+follow the allowed patterns.
+This could be implemented as a simple type system that distinguishes
+pointer-ish `uintptr` from numeric `uintptr`.
+But this is out of scope for this proposal.
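+
+For example, such a checker might accept the first pattern below, where
+the pointer-ish `uintptr` never outlives a single expression, and
+reject the second (a sketch; today both compile):
+
+```go
+package main
+
+import "unsafe"
+
+func main() {
+	var x [2]uint64
+	p := unsafe.Pointer(&x[0])
+	const off = 8
+
+	// Accepted: the uintptr is created and consumed in one expression,
+	// so it is never live across a statement boundary.
+	q := unsafe.Pointer(uintptr(p) + off)
+
+	// Rejected: u is a pointer-ish uintptr that stays live across
+	// statements, where a safe-point could now occur.
+	u := uintptr(p)
+	u += off
+	r := unsafe.Pointer(u)
+
+	_, _ = q, r
+}
+```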
+
+### Range loops
+
+Currently (as of 1.10), range loops are compiled roughly like:
+
+```go
+for i, x := range s { b }
+  ⇓
+for i, _n, _p := 0, len(s), &s[0]; i < _n; i, _p = i+1, _p + unsafe.Sizeof(s[0]) { b }
+  ⇓
+i, _n, _p := 0, len(s), &s[0]
+goto cond
+body:
+{ b }
+i, _p = i+1, _p + unsafe.Sizeof(s[0])
+cond:
+if i < _n { goto body } else { goto end }
+end:
+```
+
+The problem with this lowering is that `_p` may temporarily point past
+the end of the allocation the moment before the loop terminates.
+Currently this is safe because there's never a safe-point while this
+value of `_p` is live.
+
+We could keep this lowering and mark the increment and condition
+blocks as unsafe-points.
+However, if the body is short, this could result in infrequent
+safe-points.
+It would also require creating a separate block for the increment,
+which is currently usually appended to the end of the body.
+Separating these blocks would inhibit reordering opportunities.
+
+Alternatively, we could rewrite the loop to never create a
+past-the-end pointer.
+For example, we could lower it like:
+
+```go
+i, _n, _p := 0, len(s), &s[0]
+if i >= _n { goto end } else { goto body }
+top:
+_p += unsafe.Sizeof(s[0])
+body:
+{ b }
+i++
+if i >= _n { goto end } else { goto top }
+end:
+```
+
+This would allow safe-points everywhere in the loop.
+Compared to the current loop compilation, it generates slightly more
+code, but executes the same number of conditional branch instructions
+(n+1) and results in the same number of SSA basic blocks (3).
+
+This lowering does complicate bounds-check elimination.
+Currently, bounds-check elimination knows that `i < _n` in the body
+because the body block is dominated by the cond block.
+However, in the new lowering, deriving this fact requires detecting
+that `i < _n` on *both* paths into body and hence is true in body.
+
+### Runtime safe-points
+
+Beyond generated code, the runtime in general is not written to be
+arbitrarily preemptible and there are many places that must not be
+preempted.
+Hence, we would likely disable safe-points by default in the runtime,
+except at calls (where they occur now).
+
+While this would have little downside for most of the runtime, there
+are some parts of the runtime that could benefit substantially from
+non-cooperative preemption, such as memory functions like `memmove`.
+Non-cooperative preemption is an excellent way to make these
+preemptible without slowing down the common case, since we would only
+need to mark their register maps (which would often be empty for
+functions like `memmove` since all pointers would already be protected
+by arguments).
+
+Over time we may opt-in more of the runtime.
+
+
+## Encoding of stack and register maps
+
+To be determined.
+
+@aclements's prototype simply added the register liveness map as a third
+liveness map accompanying the existing argument and local liveness
+maps.
+The downside of this is that any change in any of these maps requires
+writing out all of them, and the register liveness changes tend not to
+correlate with argument and local liveness changes.
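+
+As a strawman, the prototype's layout corresponds to something like the
+following sketch (hypothetical types; the real encoding is to be
+determined):
+
+```go
+package sketch
+
+// livenessRecord is a hypothetical per-safe-point record. Because all
+// three bitmaps live in one record, a change to any one of them forces
+// a new record for all three, which is why uncorrelated register
+// liveness changes inflate the table.
+type livenessRecord struct {
+	args   []byte // bitmap of pointer slots in the argument frame
+	locals []byte // bitmap of pointer slots in the locals frame
+	regs   []byte // bitmap of registers holding live pointers
+}
+```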
+
+
+## Other uses
+
+### Heap dump analysis
+
+Having safe-points everywhere fixes problems with heap dump analysis
+from core files, which currently has to use conservative heuristics to
+identify live pointers in active call frames.
+
+### Call injection
+
+Having safe-points everywhere also allows some function calls to be
+safely injected at runtime.
+This is useful in at least two situations:
+
+1. To handle synchronous signals, such as nil-pointer dereferences,
+   the runtime injects a call to `runtime.sigpanic` at the location of
+   the fault.
+   However, since there isn't usually a call at this location, the
+   stack map may be inaccurate, which leads to complicated
+   interactions between defers, escape analysis, and traceback
+   handling.
+   Having safe-points everywhere could simplify this.
+
+2. Debugger function calls (#21678).
+   Currently it's essentially impossible for a debugger to dynamically
+   invoke a Go function call because of poor interactions with stack
+   scanning and the garbage collector.
+   Having stack and register maps everywhere would make this
+   significantly more viable, since a function call could be injected
+   nearly anywhere without making the stack un-scannable.
+
+
+## Testing
+
+The primary danger of this approach is its potential for a long bug
+tail, since the coverage of safe-points in regular testing will
+decrease substantially.
+In addition to standard testing, I propose checking the generated
+liveness maps using static analysis of binaries.
+This tool would look for pointer dereferences or stores with a write
+barrier to indicate that a value is a pointer and would check the flow
+of that value through all possible paths.
+It would report anywhere a value transitioned from dead/scalar to live
+pointer and anywhere a value was used both like a pointer and like a
+scalar.
+
+In effect, this tool would simulate the program to answer the question
+"for every two points in time A < B, are there allocations reachable
+from the liveness map at time B that were not reachable at time A and
+were not allocated between A and B?"
+
+Most likely this static analysis tool could be written atop the
+existing [golang.org/x/arch](https://godoc.org/golang.org/x/arch)
+packages.
+These are the same packages used by, for example, `go tool objdump`,
+and handle most of the heavy lifting of decoding the binary itself.
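+
+For instance, a minimal sketch of the decoding side on x86-64 (the
+liveness analysis itself is elided):
+
+```go
+package main
+
+import (
+	"fmt"
+
+	"golang.org/x/arch/x86/x86asm"
+)
+
+func main() {
+	// MOVQ 0x1000, AX -- a memory load; the tool would treat the
+	// dereferenced address as evidence of a pointer.
+	code := []byte{0x48, 0x8b, 0x04, 0x25, 0x00, 0x10, 0x00, 0x00}
+	inst, err := x86asm.Decode(code, 64)
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(inst, "length:", inst.Len)
+}
+```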
+
+
+## Other considerations
+
+**Space overhead.** Traditionally, the main concern with having
+safe-points everywhere is the space overhead of storing the
+stack/register maps for every instruction.
+A very preliminary implementation of register maps and safe-points
+everywhere increased binary size by ~10%.
+However, this left several obvious optimizations on the table.
+Work by Stichnoth et al. [1] further suggests that this overhead
+can be significantly curtailed with simple compression techniques.
+
+**Windows support.** Unlike fault-based loop preemption, signaled
+preemption is quite easy to support in Windows because it provides
+`SuspendThread` and `GetThreadContext`, which make it trivial to get a
+thread's register set.
+
+**Decoupling stack-move points from GC safe-points.** Because of the
+details of the current implementation, these are (essentially) the
+same.
+By decoupling these and only allowing stack growth and shrinking at
+function entry, stack copying would not need to adjust registers.
+This keeps stack copying simpler.
+It also enables better compiler optimizations and more safe-points
+since the compiler need not mark a register as a live pointer if it
+knows there's another live pointer to the object.
+Likewise, registers that are known to be derived from pointer
+arguments can be marked as scalar as long as those arguments are live.
+Such optimizations are possible because of Go's non-moving collector.
+This also prevents stack moving from observing (and crashing on)
+transient small-valued pointers that the compiler constructs when it
+knows an offset from a potentially-nil pointer will be small.
+
+**Debuggers.** Debuggers would have to be taught to ignore the signal
+used for stopping a thread.
+However, if we can use a distinct signal for this (such as one of the
+POSIX real-time signals), this should be easier than teaching
+debuggers to distinguish the `SIGSEGV`s produced by fault-based loop
+preemption from genuine `SIGSEGV`s.
+It's also not clear that our current fault-based approach can work at
+all under many debuggers on OS X due to a [kernel
+bug](https://bugs.llvm.org/show_bug.cgi?id=22868).
+
+**Scheduler preemption.** This mechanism is well-suited to temporary
+preemptions where the same goroutine will resume after the preemption
+because we don't need to save the full register state and can rely on
+the existing signal return path to restore the full register state.
+This applies to all GC-related preemptions, but it's not as well
+suited to permanent preemption performed by the scheduler.
+However, we could still build on this mechanism.
+For example, since most of the time goroutines self-preempt, we only
+need to save the full signal state in the uncommon case, so the `g`
+could contain a pointer to its full saved state that's only used after
+a forced preemption.
+Restoring the full signal state could be done by either writing the
+architecture-dependent code to restore the full register set (a
+beefed-up `runtime.gogo`), or by self-signaling, swapping in the
+desired context, and letting the OS restore the full register set.
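+
+A sketch of that idea (the type and field are hypothetical):
+
+```go
+package sketch
+
+// sigcontext stands in for the architecture-specific full register
+// state saved by the signal handler.
+type sigcontext struct{ /* all registers, per architecture */ }
+
+// g sketches the runtime's goroutine descriptor with one added field:
+// a full register snapshot populated only on forced preemption.
+// Self-preempting goroutines never touch it, keeping the common case
+// cheap.
+type g struct {
+	// ... existing scheduler fields ...
+	forcedCtxt *sigcontext // nil unless forcibly preempted via signal
+}
+```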
+
+**Targeting and resuming.** In contrast with fault-based loop
+preemption, signaled preemption can be targeted at a specific thread
+and can immediately resume.
+Thread-targeting is a little different from cooperative preemption,
+which is goroutine-targeted.
+However, in many cases this is actually better, since targeting
+goroutines for preemption is racy and hence requires retry loops that
+can add significantly to STW time.
+Taking advantage of this for stack scanning will require some
+restructuring of how we track GC roots, but the result should
+eliminate the blocking retry loop we currently use.
+
+**Non-pointer pointers.** This has the potential to expose incorrect
+uses of `unsafe.Pointer` for transiently storing non-pointers.
+Such uses are a clear violation of the `unsafe.Pointer` rules, but
+they may happen (especially in, for example, cgo-using code).
+
+**Assembly.** By default, the runtime cannot safely preempt assembly
+code since it won't know what registers contain pointers.
+As a follow-on to the work on safe-points everywhere, we should audit
+assembly in the standard library for non-preemptible loops and
+annotate them with register maps.
+In most cases this should be trivial since most assembly never
+constructs a pointer that isn't shadowed by an argument, so it can
+simply claim there are no pointers in registers.
+We should also document in the Go assembly guide how to do this for
+user code.
+
+
+## Alternatives
+
+Rather than significantly increasing the number of safe-points, we
+could use a signal to stop a thread and then use hardware
+single-stepping support to advance the thread to a safe-point (or a
+point where the compiler has provided a branch to reach a safe-point,
+like in the current loop preemption approach).
+This works (somewhat surprisingly), but thoroughly bamboozles
+debuggers since both the debugger and the operating system assume the
+debugger owns single-stepping, not the process itself.
+This would also require the compiler to provide register flushing
+stubs for these safe-points, which increases code size (and hence
+instruction cache pressure) as well as stack size.
+Safe-points everywhere increase binary size, but not code size or
+stack size.
+
+
+## Compatibility
+
+This proposal introduces no new APIs, so it is Go 1 compatible.
+
+
+## Implementation
+
+Austin Clements (@aclements) plans to implement register and stack maps
+everywhere for Go 1.11.
+This will enable some low-risk uses in the short term, such as
+debugger function calls.
+
+Debugging and testing of register and stack maps can continue into the
+Go 1.11 freeze, including building the static analysis tool.
+
+Then, for Go 1.12, Austin will implement safe-points everywhere atop
+the register and stack maps.
+
+
+## References
+
+[1] James M. Stichnoth, Guei-Yuan Lueh, and MichaƂ Cierniak. 1999. Support for garbage collection at every instruction in a Java compiler. In *Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation* (PLDI '99). ACM, New York, NY, USA, 118–127.
+