# Proposal: Non-cooperative goroutine preemption

Author(s): Austin Clements

Last updated: 2019-01-18

Discussion at https://golang.org/issue/24543.

## Abstract

Go currently uses compiler-inserted cooperative preemption points in
function prologues.
The majority of the time, this is good enough to allow Go developers
to ignore preemption and focus on writing clear parallel code, but it
has sharp edges that we've seen degrade the developer experience time
and time again.
When it goes wrong, it goes spectacularly wrong, leading to mysterious
system-wide latency issues and sometimes complete freezes.
And because this is a language implementation issue that exists
outside of Go's language semantics, these failures are surprising and
very difficult to debug.

@dr2chase has put significant effort into prototyping cooperative
preemption points in loops, which is one way to solve this problem.
However, even sophisticated approaches to this led to unacceptable
slow-downs in tight loops (where slow-downs are generally least
acceptable).

I propose that the Go implementation switch to non-cooperative
preemption, which would allow goroutines to be preempted at
essentially any point without the need for explicit preemption checks.
This approach will solve the problem of delayed preemption and do so
with zero run-time overhead.

Non-cooperative preemption is a general concept with a whole class of
implementation techniques.
This document describes and motivates the switch to non-cooperative
preemption and discusses common concerns of any non-cooperative
preemption approach in Go.
Specific implementation approaches are detailed in sub-proposals
linked from this document.


## Background

Up to and including Go 1.10, Go has used cooperative preemption with
safe-points only at function calls (and even then, not if the function
is small or gets inlined).
This means that Go can only switch between concurrently-executing
goroutines at specific points.
The main advantage of this is that the compiler can ensure useful
invariants at these safe-points.
In particular, the compiler ensures that all local garbage collection
roots are known at all safe-points, which is critical to precise
garbage collection.
It can also ensure that no registers are live at safe-points, which
means the Go runtime can switch goroutines without having to save and
restore a large register set.

However, this can result in infrequent safe-points, which leads to
many problems:

1. The most common in production code is that this can delay STW
   operations, such as starting and ending a GC cycle.
   This increases STW latency, and on large core counts can
   significantly impact throughput (if, for example, most threads are
   stopped while the runtime waits on a straggler for a long time).
   ([#17831](https://golang.org/issue/17831),
   [#19241](https://golang.org/issue/19241))

2. This can delay scheduling, preventing competing goroutines from
   executing in a timely manner.

3. This can delay stack scanning, which consumes CPU while the runtime
   waits for a preemption point and can ultimately delay GC
   termination, resulting in an effective STW where the system runs
   out of heap and no goroutines can allocate.

4. In really extreme cases, it can cause a program to halt, such as
   when a goroutine spinning on an atomic load starves out the
   goroutine responsible for setting that atomic.
   This often indicates bad or buggy code, but is surprising
   nonetheless and has clearly wasted a lot of developer time on
   debugging.
   ([#543](https://golang.org/issue/543),
   [#12553](https://golang.org/issue/12553),
   [#13546](https://golang.org/issue/13546),
   [#14561](https://golang.org/issue/14561),
   [#15442](https://golang.org/issue/15442),
   [#17174](https://golang.org/issue/17174),
   [#20793](https://golang.org/issue/20793),
   [#21053](https://golang.org/issue/21053))
These problems impede developer productivity and production efficiency
and expose Go's users to implementation details they shouldn't have to
worry about.

### Cooperative loop preemption

@dr2chase put significant effort into trying to solve these problems
using cooperative *loop preemption*
([#10958](https://golang.org/issue/10958)).
This is a standard approach for runtimes employing cooperative
preemption in which the compiler inserts preemption checks and
safe-points at back-edges in the flow graph.
This significantly improves the quality of preemption, since code
almost never executes without a back-edge for any non-trivial amount
of time.

Our most recent approach to loop preemption, which we call
*fault-based preemption*, adds a single instruction, no branches, and
no register pressure to loops on x86 and UNIX platforms ([CL
43050](https://golang.org/cl/43050)).
Despite this, the geomean slow-down on a [large suite of
benchmarks](https://perf.golang.org/search?q=upload%3A20171003.1+%7C+upload-part%3A20171003.1%2F3+vs+upload-part%3A20171003.1%2F1)
is 7.8%, with a handful of significantly worse outliers.
Even [compared to Go
1.9](https://perf.golang.org/search?q=upload%3A20171003.1+%7C+upload-part%3A20171003.1%2F0+vs+upload-part%3A20171003.1%2F1),
where the slow-down is only 1% thanks to other improvements, most
benchmarks see some slow-down and there are still significant
outliers.

Fault-based preemption also has several implementation downsides.
It can't target specific threads or goroutines, so it's a poor match
for stack scanning, ragged barriers, or regular scheduler preemption.
It's also "sticky", in that we can't resume any loops until we resume
*all* loops, so the safe-point can't simply resume if it occurs in an
unsafe state (such as when runtime locks are held).
It requires more instructions (and more overhead) on non-x86 and
non-UNIX platforms.
Finally, it interferes with debuggers, which assume bad memory
references are a good reason to stop a program.
It's not clear it can work at all under many debuggers on OS X due to
a [kernel bug](https://bugs.llvm.org/show_bug.cgi?id=22868).


## Non-cooperative preemption

*Non-cooperative preemption* switches between concurrent execution
contexts without explicit preemption checks or assistance from those
contexts.
This is used by all modern desktop and server operating systems to
switch between threads.
Without this, a single poorly-behaved application could wedge the
entire system, much like how a single poorly-behaved goroutine can
currently wedge a Go application.
It is also a convenient abstraction: it lets us program as if there
are an infinite number of CPUs available, hiding the fact that the OS
is time-multiplexing a finite number of CPUs.

Operating system schedulers use hardware interrupt support to switch a
running thread into the OS scheduler, which can save that thread's
state such as its CPU registers so that it can be resumed later.
In Go, we would use operating system support to do the same thing.
On UNIX-like operating systems, this can be done using signals.
However, because of the garbage collector, Go has requirements that an
operating system does not: Go must be able to find the live pointers
on a goroutine's stack wherever it stops it.
Most of the complexity of non-cooperative preemption in Go derives
from this requirement.


## Proposal

I propose that Go implement non-cooperative goroutine preemption by
sending a POSIX signal (or using an equivalent OS mechanism) to stop a
running goroutine and capture its CPU state.
If a goroutine is interrupted at a point that must be GC atomic, as
detailed in the ["Handling unsafe-points"](#handling-unsafe-points)
section, the runtime can simply resume the goroutine and try again
later.

The key difficulty of implementing non-cooperative preemption for Go
is finding live pointers in the stack of a preempted goroutine.
There are many possible ways to do this, which are detailed in these
sub-proposals:

* The [safe-points everywhere
  proposal](24543/safe-points-everywhere.md) describes an
  implementation where the compiler records stack and register maps
  for nearly every instruction.
  This allows the runtime to halt a goroutine anywhere and find its GC
  roots.

* The [conservative inner-frame scanning
  proposal](24543/conservative-inner-frame.md) describes an
  implementation that uses conservative GC techniques to find pointers
  in the inner-most stack frame of a preempted goroutine.
  This can be done without any extra safe-point metadata.


## Handling unsafe-points

Any non-cooperative preemption approach in Go must deal with code
sequences that have to be atomic with respect to the garbage
collector.
We call these "unsafe-points", in contrast with GC safe-points.
A few known situations are:

1. Expressions involving `unsafe.Pointer` may temporarily represent
   the only pointer to an object as a `uintptr`.
   Hence, there must be no safe-points while a `uintptr` derived from
   an `unsafe.Pointer` is live.
   Likewise, we must recognize `reflect.Value.Pointer`,
   `reflect.Value.UnsafeAddr`, and `reflect.Value.InterfaceData` as
   `unsafe.Pointer`-to-`uintptr` conversions.
   Alternatively, if the compiler can reliably detect such `uintptr`s,
   it could mark them as pointers, but there's a danger that an
   intermediate value may not represent a legal pointer value.

2. In the write barrier there must not be a safe-point between the
   write-barrier-enabled check and a direct write.
   For example, suppose the goroutine is writing a pointer to B into
   object A.
   If, after the check, the GC starts and scans A, and the goroutine
   then writes B into A and drops all references to B from its stack,
   the garbage collector could fail to mark B.

3. There are places where the compiler generates temporary pointers
   that can be past the end of allocations, such as in range loops
   over slices and arrays.
   These would either have to be avoided or safe-points would have to
   be disallowed while these are live.
All of these cases must already avoid significant reordering to avoid
being split across a call.
Internally, this is achieved via the "mem" pseudo-value, which must be
sequentially threaded through all SSA values that manipulate memory.
Mem is also threaded through values that must not be reordered, even
if they don't touch memory.
For example, conversion between `unsafe.Pointer` and `uintptr` is done
with a special "Convert" operation that takes a mem solely to
constrain reordering.

There are several possible solutions to these problems, some of which
can be combined:

1. We could mark basic blocks that shouldn't contain preemption
   points.
   For `unsafe.Pointer` conversions, we would opt-out the basic block
   containing the conversion.
   For code adhering to the `unsafe.Pointer` rules, this should be
   sufficient, but it may break code that is incorrect but happens to
   work today in ways that are very difficult to debug.
   For write barriers this is also sufficient.
   For loops, this is overly broad and would require splitting some
   basic blocks.

2. For `unsafe.Pointer` conversions, we could simply opt-out entire
   functions that convert from `unsafe.Pointer` to `uintptr`.
   This would be easy to implement, and would keep even broken unsafe
   code working as well as it does today, but may have broad impact,
   especially in the presence of inlining.

3. A simple combination of 1 and 2 would be to opt-out any basic block
   that is *reachable* from an `unsafe.Pointer` to `uintptr`
   conversion, up to a function call (which is a safe-point today).

4. For range loops, the compiler could compile them differently such
   that it never constructs an out-of-bounds pointer (see below).

5. A far more precise and general approach (thanks to @cherrymui)
   would be to create new SSA operations that "taint" and "untaint"
   memory.
   The taint operation would take a mem and return a new tainted mem.
   This taint would flow to any values that themselves took a tainted
   value.
   The untaint operation would take a value and a mem and return an
   untainted value and an untainted mem.
   During liveness analysis, safe-points would be disallowed wherever
   a tainted value was live.
   This is probably the most precise solution, and is likely to keep
   even incorrect uses of unsafe working, but requires a complex
   implementation.

More broadly, it's worth considering making the compiler check
`unsafe.Pointer`-using code and actively reject code that doesn't
follow the allowed patterns.
This could be implemented as a simple type system that distinguishes
pointer-ish `uintptr` from numeric `uintptr`.
But this is out of scope for this proposal.

### Range loops

As of Go 1.10, range loops are compiled roughly like:

```go
for i, x := range s { b }
⇓
for i, _n, _p := 0, len(s), &s[0]; i < _n; i, _p = i+1, _p + unsafe.Sizeof(s[0]) { b }
⇓
i, _n, _p := 0, len(s), &s[0]
goto cond
body:
{ b }
i, _p = i+1, _p + unsafe.Sizeof(s[0])
cond:
if i < _n { goto body } else { goto end }
end:
```

The problem with this lowering is that `_p` may temporarily point past
the end of the allocation the moment before the loop terminates.
Currently this is safe because there's never a safe-point while this
value of `_p` is live.

This lowering requires that the compiler mark the increment and
condition blocks as unsafe-points.
However, if the body is short, this could result in infrequent
safe-points.
It also requires creating a separate block for the increment, which is
currently usually appended to the end of the body.
Separating these blocks would inhibit reordering opportunities.

In preparation for non-cooperative preemption, Go 1.11 began compiling
range loops as follows to avoid ever creating a past-the-end pointer:

```go
i, _n, _p := 0, len(s), &s[0]
if i >= _n { goto end } else { goto body }
top:
_p += unsafe.Sizeof(s[0])
body:
{ b }
i++
if i >= _n { goto end } else { goto top }
end:
```

This allows safe-points everywhere in the loop.
Compared to the original loop compilation, it generates slightly more
code, but executes the same number of conditional branch instructions
(n+1) and results in the same number of SSA basic blocks (3).

This lowering does complicate bounds-check elimination.
In Go 1.10, bounds-check elimination knew that `i < _n` in the body
because the body block is dominated by the cond block.
However, in the new lowering, deriving this fact requires detecting
that `i < _n` on *both* paths into body and hence is true in body.

### Runtime safe-points

Beyond generated code, the runtime in general is not written to be
arbitrarily preemptible and there are many places that must not be
preempted.
Hence, we would likely disable safe-points by default in the runtime,
except at calls (where they occur now).

While this would have little downside for most of the runtime, there
are some parts of the runtime that could benefit substantially from
non-cooperative preemption, such as memory functions like `memmove`.
Non-cooperative preemption is an excellent way to make these
preemptible without slowing down the common case, since we would only
need to mark their register maps (which would often be empty for
functions like `memmove` since all pointers would already be protected
by arguments).

Over time we may opt-in more of the runtime.

### Unsafe standard library code

The Windows syscall package contains many `unsafe.Pointer` conversions
that don't follow the `unsafe.Pointer` rules.
It broadly makes shaky assumptions about safe-point behavior,
liveness, and when stack movement can happen.
It would likely need a thorough auditing, or would need to be opted
out like the runtime.

Perhaps more troubling is that some of the Windows syscall package
types have `uintptr` fields that are actually pointers, hence forcing
callers to perform unsafe pointer conversions.
For example, see issue [#21376](https://golang.org/issue/21376).

### Ensuring progress with unsafe-points

We propose simply giving up and retrying later when a goroutine is
interrupted at an unsafe-point.
One danger of this is that safe-points may be rare in tight loops.
However, in many cases, there are more sophisticated alternatives to
this approach.

For interruptions in the runtime or in functions without any
safe-points (such as assembly), the signal handler could unwind the
stack and insert a return trampoline at the next return to a function
with safe-point metadata.
The runtime could then let the goroutine continue running and the
trampoline would pause it as soon as possible.

For write barriers and `unsafe.Pointer` sequences, the compiler could
insert a cheap, explicit preemption check at the end of the sequence.
For example, the runtime could modify some register that would be
checked at the end of the sequence and let the thread continue
executing.
In the write barrier sequence, this could even be the register that
the write barrier flag was loaded into, and the compiler could insert
a simple register test and conditional branch at the end of the
sequence.
To even further shrink the sequence, the runtime could put the address
of the stop function in this register so the stop sequence would be
just a register call and a jump.

Alternatives to this check include forward and reverse simulation.
Forward simulation is tricky because the compiler must be careful to
only generate operations the runtime knows how to simulate.
Reverse simulation is easy *if* the compiler can always generate a
restartable sequence (simply move the PC back to the write barrier
flag check), but quickly becomes complicated if there are multiple
writes in the sequence or more complex writes such as DUFFCOPY.

## Other considerations

All of the proposed approaches to non-cooperative preemption involve
stopping a running goroutine by sending its thread an OS signal.
This section discusses general consequences of this.

**Windows support.** Unlike fault-based loop preemption, signaled
preemption is quite easy to support in Windows because it provides
`SuspendThread` and `GetThreadContext`, which make it trivial to get a
thread's register set.

**Choosing a signal.** We have to choose a signal that is unlikely to
interfere with existing uses of signals or with debuggers.
There are no perfect choices, but there are some heuristics:

1. It should be a signal that's passed-through by debuggers by
   default.
   On Linux, this is SIGALRM, SIGURG, SIGCHLD, SIGIO, SIGVTALRM,
   SIGPROF, and SIGWINCH, plus some glibc-internal signals.

2. It shouldn't be used internally by libc in mixed Go/C binaries
   because libc may assume it's the only thing that can handle these
   signals.
   For example, SIGCANCEL or SIGSETXID.

3. It should be a signal that can happen spuriously without
   consequences.
   For example, SIGALRM is a bad choice because the signal handler
   can't tell if it was caused by the real process alarm or not
   (arguably this means the signal is broken, but I digress).
   SIGUSR1 and SIGUSR2 are also bad because those are often used in
   meaningful ways by applications.

4. We need to deal with platforms without real-time signals (like
   macOS), so those are out.

We use SIGURG because it meets all of these criteria, is extremely
unlikely to be used by an application for its "real" meaning (both
because out-of-band data is basically unused and because SIGURG
doesn't report which socket has the condition, making it pretty
useless), and even if it is, the application has to be ready for
spurious SIGURG.
SIGIO wouldn't be a bad choice either, but is more likely to be used
for real.

**Scheduler preemption.** This mechanism is well-suited to temporary
preemptions where the same goroutine will resume after the preemption
because we don't need to save the full register state and can rely on
the existing signal return path to restore the full register state.
This applies to all GC-related preemptions, but it's not as well
suited to permanent preemption performed by the scheduler.
However, we could still build on this mechanism.
For example, since most of the time goroutines self-preempt, we only
need to save the full signal state in the uncommon case, so the `g`
could contain a pointer to its full saved state that's only used after
a forced preemption.
Restoring the full signal state could be done by either writing the
architecture-dependent code to restore the full register set (a
beefed-up `runtime.gogo`), or by self-signaling, swapping in the
desired context, and letting the OS restore the full register set.

**Targeting and resuming.** In contrast with fault-based loop
preemption, signaled preemption can be targeted at a specific thread
and can immediately resume.
Thread-targeting is a little different from cooperative preemption,
which is goroutine-targeted.
However, in many cases this is actually better, since targeting
goroutines for preemption is racy and hence requires retry loops that
can add significantly to STW time.
Taking advantage of this for stack scanning will require some
restructuring of how we track GC roots, but the result should
eliminate the blocking retry loop we currently use.

**Non-pointer pointers.** This has the potential to expose incorrect
uses of `unsafe.Pointer` for transiently storing non-pointers.
Such uses are a clear violation of the `unsafe.Pointer` rules, but
they may happen (especially in, for example, cgo-using code).

## Alternatives

### Single-stepping

Rather than making an effort to be able to stop at any instruction,
the compiler could emit metadata for safe-points only at back-edges
and the runtime could use hardware single-stepping support to advance
the thread to a safe-point (or a point where the compiler has provided
a branch to reach a safe-point, like in the current loop preemption
approach).
This works (somewhat surprisingly), but thoroughly bamboozles
debuggers since both the debugger and the operating system assume the
debugger owns single-stepping, not the process itself.
This would also require the compiler to provide register flushing
stubs for these safe-points, which increases code size (and hence
instruction cache pressure) as well as stack size, much like
cooperative loop preemption.
However, unlike cooperative loop preemption, this approach would have
no effect on mainline code size or performance.

### Jump rewriting

We can solve the problems of single-stepping by instead rewriting the
next safe-point jump instruction after the interruption point to jump
to a preemption path and resuming execution as usual.
To make this easy, the compiler could leave enough room (via padding
NOPs) so only the jump target needs to be modified.

This approach has the usual drawbacks of modifiable code.
It's a security risk, it breaks text page sharing, and simply isn't
allowed on iOS.
It also can't target an individual goroutine (since another goroutine
could be executing the same code) and may have odd interactions with
concurrent execution on other cores.

### Out-of-line execution

A further alternative in the same vein, but one that doesn't require
modifying existing text, is out-of-line execution.
In this approach, the signal handler relocates the instruction stream
from the interruption point to the next safe-point jump into a
temporary buffer, patches it to jump into the runtime at the end, and
resumes execution in this relocated sequence.

This solves most of the problems with single-stepping and jump
rewriting, but is quite complex to implement and requires substantial
implementation effort for each platform.
It also isn't allowed on iOS.

There is precedent for this sort of approach.
For example, when Linux uprobes injects an INT3, it relocates the
overwritten instructions into an "execute out-of-line" area to avoid
the usual problems with resuming from an INT3 instruction.
[The
implementation](https://github.com/torvalds/linux/blob/v4.18/arch/x86/kernel/uprobes.c)
is surprisingly simple given the complexity of the x86 instruction
encoding, but is still quite complex.