| # Execution tracer overhaul |
| |
| Authored by mknyszek@google.com with a mountain of input from others. |
| |
| In no particular order, thank you to Felix Geisendorfer, Nick Ripley, Michael |
| Pratt, Austin Clements, Rhys Hiltner, thepudds, Dominik Honnef, and Bryan |
| Boreham for your invaluable feedback. |
| |
| ## Background |
| |
| [Original design document from |
| 2014.](https://docs.google.com/document/d/1FP5apqzBgr7ahCCgFO-yoVhk4YZrNIDNf9RybngBc14/pub) |
| |
| Go execution traces provide a moment-to-moment view of what happens in a Go |
| program over some duration. |
| This information is invaluable in understanding program behavior over time and |
| can be leveraged to achieve significant performance improvements. |
| Because Go has a runtime, it can provide deep information about program |
| execution without any external dependencies, making traces particularly |
| attractive for large deployments. |
| |
Unfortunately, limitations in the trace implementation prevent widespread use.
| |
| For example, the process of analyzing execution traces scales poorly with the |
| size of the trace. |
| Traces need to be parsed in their entirety to do anything useful with them, |
| making them impossible to stream. |
| As a result, trace parsing and validation has very high memory requirements for |
| large traces. |
| |
| Also, Go execution traces are designed to be internally consistent, but don't |
| provide any way to align with other kinds of traces, for example OpenTelemetry |
| traces and Linux sched traces. |
| Alignment with higher level tracing mechanisms is critical to connecting |
| business-level tasks with resource costs. |
| Meanwhile alignment with lower level traces enables a fully vertical view of |
| application performance to root out the most difficult and subtle issues. |
| |
Thanks to work in the Go 1.21 cycle, the execution tracer's run-time overhead
was reduced from about -10% throughput and +10% request latency in web services
to about 1% in both for most applications.
| This reduced overhead in conjunction with making traces more scalable enables |
| some [exciting and powerful new opportunities](#use-cases) for the diagnostic. |
| |
| Lastly, the implementation of the execution tracer has evolved organically over |
| time and it shows. |
The codebase also has many old warts and some age-old bugs that make collecting
traces difficult and make the results seem broken.
Furthermore, many significant design decisions were made over the years but
| weren't thoroughly documented; those decisions largely exist solely in old |
| commit messages and breadcrumbs left in comments within the codebase itself. |
| |
| ## Goals |
| |
| The goal of this document is to define an alternative implementation for Go |
| execution traces that scales up to large Go deployments. |
| |
| Specifically, the design presented aims to achieve: |
| |
| - Make trace parsing require a small fraction of the memory it requires today. |
- Make traces streamable, to enable analysis without storage.
| - Fix age-old bugs and present a path to clean up the implementation. |
| |
| Furthermore, this document will present the existing state of the tracer in |
| detail and explain why it's like that to justify the changes being made. |
| |
| ## Design |
| |
| ### Overview |
| |
| The design is broken down into four parts: |
| |
| - Timestamps and sequencing. |
| - Orienting against threads instead of Ps. |
| - Partitioning. |
| - Wire format cleanup. |
| |
| These four parts are roughly ordered by how fundamental they are to the trace |
| design, and so the former sections are more like concrete proposals, while the |
| latter sections are more like design suggestions that would benefit from |
| prototyping. |
| The earlier parts can also be implemented without considering the latter parts. |
| |
Each section also includes the history and design of the existing system, both
to document the current system in one place and to more easily compare it to
the newly proposed system.
| That requires, however, a lot of prose, which can obscure the bigger picture. |
| Here are the highlights of each section without that additional context. |
| |
| **Timestamps and sequencing**. |
| |
| - Compute timestamps from the OS's monotonic clock (`nanotime`). |
| - Use per-goroutine sequence numbers for establishing a partial order of events |
| (as before). |
| |
| **Orienting against threads (Ms) instead of Ps**. |
| |
| - Batch trace events by M instead of by P. |
| - Use lightweight M synchronization for trace start and stop. |
| - Simplify syscall handling. |
| - All syscalls have a full duration in the trace. |
| |
| **Partitioning**. |
| |
| - Traces are sequences of fully self-contained partitions that may be streamed. |
| - Each partition has its own stack table and string table. |
| - Partitions are purely logical: consecutive batches with the same ID. |
| - In general, parsers need state from the previous partition to get accurate |
| timing information. |
| - Partitions have an "earliest possible" timestamp to allow for imprecise |
| analysis without a previous partition. |
| - Partitions are bound by both a maximum wall time and a maximum size |
| (determined empirically). |
| - Traces contain an optional footer delineating partition boundaries as byte |
| offsets. |
| - Emit batch lengths to allow for rapidly identifying all batches within a |
| partition. |
| |
| **Wire format cleanup**. |
| |
| - More consistent naming scheme for event types. |
| - Separate out "reasons" that a goroutine can block or stop from the event |
| types. |
| - Put trace stacks, strings, and CPU samples in dedicated batches. |
| |
| ### Timestamps and sequencing |
| |
| For years, the Go execution tracer has used the `cputicks` runtime function for |
| obtaining a timestamp. |
| On most platforms, this function queries the CPU for a tick count with a single |
| instruction. |
| (Intuitively a "tick" goes by roughly every CPU clock period, but in practice |
| this clock usually has a constant rate that's independent of CPU frequency |
entirely.) Originally, the execution tracer used this timestamp exclusively
for ordering trace events.
| Unfortunately, many [modern |
| CPUs](https://docs.google.com/spreadsheets/d/1jpw5aO3Lj0q23Nm_p9Sc8HrHO1-lm9qernsGpR0bYRg/edit#gid=0) |
| don't provide such a clock that is stable across CPU cores, meaning even though |
| cores might synchronize with one another, the clock read-out on each CPU is not |
| guaranteed to be ordered in the same direction as that synchronization. |
| This led to traces with inconsistent timestamps. |
| |
| To combat this, the execution tracer was modified to use a global sequence |
| counter that was incremented synchronously for each event. |
| Each event would then have a sequence number that could be used to order it |
| relative to other events on other CPUs, and the timestamps could just be used |
| solely as a measure of duration on the same CPU. |
| However, this approach led to tremendous overheads, especially on multiprocessor |
| systems. |
| |
| That's why in Go 1.7 the [implementation |
| changed](https://go-review.googlesource.com/c/go/+/21512) so that each goroutine |
| had its own sequence counter. |
| The implementation also cleverly avoids including the sequence number in the |
| vast majority of events by observing that running goroutines can't actually be |
| taken off their thread until they're synchronously preempted or yield |
| themselves. |
| Any event emitted while the goroutine is running is trivially ordered after the |
| start. |
The only non-trivial ordering cases left are where an event is emitted by or
on behalf of a goroutine that is not actively running.
(Note: I like to summarize this case as "emitting events at a distance"
because the scheduling resource itself is not emitting the event bound to it.)
These cases must be ordered with respect to each other and with respect to a
goroutine starting to run (i.e. the `GoStart` event).
| In the current trace format, there are only two such cases: the `GoUnblock` |
| event, which indicates that a blocked goroutine may start running again and is |
| useful for identifying scheduling latencies, and `GoSysExit`, which is used to |
| determine the duration of a syscall but may be emitted from a different P than |
| the one the syscall was originally made on. |
| (For more details on the `GoSysExit` case see in the [next section](#batching).) |
Furthermore, there are versions of the `GoStart`, `GoUnblock`, and `GoSysExit`
| events that omit a sequence number to save space if the goroutine just ran on |
| the same P as the last event, since that's also a case where the events are |
| trivially serialized. |
| In the end, this approach successfully reduced the trace overhead from over 3x |
| to 10-20%. |
| |
| However, it turns out that the trace parser still requires timestamps to be in |
| order, leading to the infamous ["time stamps out of |
| order"](https://github.com/golang/go/issues/16755) error when `cputicks` |
| inevitably doesn't actually emit timestamps in order. |
| Ps are a purely virtual resource; they don't actually map down directly to |
| physical CPU cores at all, so it's not even reasonable to assume that the same P |
| runs on the same CPU for any length of time. |
| |
| Although we can work around this issue, I propose we try to rely on the |
| operating system's clock instead and fix up timestamps as a fallback. |
| The main motivation behind this change is alignment with other tracing systems. |
| It's already difficult enough to try to internally align the `cputicks` clock, |
| but making it work with clocks used by other traces such as those produced by |
| distributed tracing systems is even more difficult. |
| |
| On Linux, the `clock_gettime` syscall, called `nanotime` in the runtime, takes |
| around 15ns on average when called in a loop. |
Compare this to roughly 10 ns for `cputicks`.
| Trivially replacing all `cputicks` calls in the current tracer with `nanotime` |
| reveals a small performance difference that depends largely on the granularity |
| of each result. |
Today, the `cputicks` result is divided by 64.
| On a 3 GHz processor, this amounts to a granularity of about 20 ns. |
| Replacing that with `nanotime` and no time division (i.e. nanosecond |
| granularity) results in a 0.22% geomean regression in throughput in the Sweet |
| benchmarks. |
| The trace size also increases by 17%. |
Dividing `nanotime` by 64, we see approximately no regression and a trace size
decrease of 1.7%.
| Overall, there seems to be little performance downside to using `nanotime`, |
| provided we pick an appropriate timing granularity: what we lose by calling |
| `nanotime`, we can easily regain by sacrificing a small amount of precision. |
| And it's likely that most of the precision below 128 nanoseconds or so is noise, |
| given the average cost of a single call into the Go scheduler (~250 |
| nanoseconds). |
| To give us plenty of precision, I propose a target timestamp granularity of 64 |
| nanoseconds. |
| This should be plenty to give us fairly fine-grained insights into Go program |
| behavior while also keeping timestamps small. |
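
As a minimal sketch, the conversion is just a truncating division (the name
`traceClockNow` is illustrative; `nanotime` is the runtime's existing
monotonic clock function):

```go
// traceClockNow returns the current trace timestamp: the monotonic
// clock reading truncated to 64-nanosecond granularity. Dividing by a
// power of two compiles down to a shift, so the conversion costs
// essentially nothing on top of the nanotime call itself.
func traceClockNow() uint64 {
	return uint64(nanotime()) / 64
}
```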
| |
| As for sequencing, I believe we must retain the per-goroutine sequence number |
| design as-is. |
| Relying solely on a timestamp, even a good one, has significant drawbacks. |
| For one, issues arise when timestamps are identical: the parser needs to decide |
| on a valid ordering and has no choice but to consider every possible ordering of |
| those events without additional information. |
| While such a case is unlikely with nanosecond-level timestamp granularity, it |
| totally precludes making timestamp granularity more coarse, as suggested in the |
| previous paragraph. |
| A sequencing system that's independent of the system's clock also retains the |
| ability of the tracing system to function despite a broken clock (modulo |
| returning an error when timestamps are out of order, which again I think we |
| should just work around). |
| Even `clock_gettime` might be broken on some machines! |
| |
| How would a tracing system continue to function despite a broken clock? For that |
| I propose making the trace parser fix up timestamps that don't line up with the |
| partial order. |
| The basic idea is that if the parser discovers a partial order edge between two |
| events A and B, and A's timestamp is later than B's, then the parser applies A's |
| timestamp to B. |
| B's new timestamp is in turn propagated later on in the algorithm in case |
| another partial order edge is discovered between B and some other event C, and |
| those events' timestamps are also out-of-order. |
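
A sketch of that fixup, with illustrative types and names (not the actual
parser API):

```go
// Event is a placeholder for a parsed trace event.
type Event struct {
	Time uint64 // timestamp, possibly from a misbehaving clock
	// ... event type, arguments, sequence numbers, etc.
}

// fixupTimestamps forces timestamps to be monotonic along a chain of
// events already ordered by the partial order (sequence numbers).
// Whenever an edge A -> B has A's timestamp after B's, B inherits A's
// timestamp, and the corrected value propagates to any later edge out
// of B.
func fixupTimestamps(ordered []*Event) {
	for i := 1; i < len(ordered); i++ {
		if ordered[i-1].Time > ordered[i].Time {
			ordered[i].Time = ordered[i-1].Time
		}
	}
}
```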
| |
| There's one last issue with timestamps here on platforms for which the runtime |
| doesn't have an efficient nanosecond-precision clock at all, like Windows. |
| (Ideally, we'd make use of system calls to obtain a higher-precision clock, like |
| [QPC](https://learn.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps), |
| but [calls to this API can take upwards of a hundred nanoseconds on modern |
| hardware, and even then the resolution is on the order of a few hundred |
| nanoseconds](https://github.com/golang/go/issues/8687#issuecomment-694498710).) |
On Windows, we can just continue to use `cputicks` for precision and rely on
the timestamp fixup logic in the parser, at the cost of being unable to
reliably align traces with other diagnostic data that uses the system clock.
| |
| ### Orienting by threads (Ms) instead of Ps |
| |
| Today the Go runtime batches trace events by P. |
| That is, trace events are grouped in batches that all happened, in order, on the |
| same P. |
| A batch is represented in the runtime as a buffer, 32 KiB in size, which is |
| attached to a P when it's actively being written to. |
| Events are written to this P's buffer in their encoded form. |
| This design choice allows most event writes to elide synchronization with the |
| rest of the runtime and linearizes the trace with respect to a P, which is |
| crucial to [sequencing without requiring a global total |
| order](#timestamps-and-sequencing). |
| |
| Batching traces by any core scheduling resource (G, M, or P) could in principle |
| have similar properties. |
| At a glance, there are a few reasons Ps make a better choice. |
| One reason is that there are generally a small number of Ps compared to Ms and |
| Gs, minimizing the maximum number of buffers that can be active at any given |
| time. |
| Another reason is convenience. |
| When batching, tracing generally requires some kind of synchronization with all |
| instances of its companion resource type to get a consistent trace. |
| (Think of a buffer which has events written to it very infrequently. |
| It needs to be flushed at some point before the trace can be considered |
| complete, because it may contain critical information needed to sequence events |
| in other buffers.) Furthermore, synchronization when starting a trace is also |
| generally useful, as that provides an opportunity to inspect the state of the |
| world and write down details about it, simplifying validation. |
| Stopping the world is a convenient way to get the attention of all Ps in the Go |
| runtime. |
| |
| However, there are problems with batching by P that make traces more complex |
| than necessary. |
| |
| The core of these problems lies with the `GoSysExit` event. |
| This event requires special arrangements in both the runtime and when validating |
| traces to ensure a consistent trace. |
| The difficulty with this event is that it's emitted by a goroutine that was |
| blocked in a syscall and lost its P, and because it doesn't have a P, it might |
| race with the runtime enabling and disabling tracing. |
| Therefore it needs to wait until it has a P to avoid that race. |
| (Note: the tracing system does have a place to put events when there is no P |
| available, but that doesn't help in this case. |
The tracing system uses the fact that it stops the world to synchronize with
| `GoSysExit` by preventing it from writing an event until the trace system can |
| finish initialization.) |
| |
| The problem with `GoSysExit` stems from a fundamental mismatch: Ms emit events, |
| but only Ps are preemptible by the Go scheduler. |
| This really extends to any situation where we'd like an M to emit an event when |
| it doesn't have a P, say for example when it goes to sleep after dropping its P |
| and not finding any work in the scheduler, and it's one reason why we don't have |
| any M-specific events at all today. |
| |
| So, suppose we batch trace events by M instead. |
| |
| In the case of `GoSysExit`, it would always be valid to write to a trace buffer, |
| because any synchronization would have to happen on the M instead of the P, so |
| no races with stopping the world. |
| However, this also means the tracer _can't_ simply stop the world, because |
| stopping the world is built around stopping user Go code, which runs with a P. |
| So, the tracer would have to use something else (more on this later). |
| |
| Although `GoSysExit` is simpler, `GoSysBlock` becomes slightly more complex in |
| the case where the P is retaken by `sysmon`. |
| In the per-P world it could be written into the buffer of the taken P, so it |
| didn't need any synchronization. |
| In the per-M world, it becomes an event that happens "at a distance" and so |
| needs a sequence number from the syscalling goroutine for the trace consumer to |
| establish a partial order. |
| |
| However, we can do better by reformulating the syscall events altogether. |
| |
| Firstly, I propose always emitting the full time range for each syscall. |
| This is a quality-of-life choice that may increase trace overheads slightly, but |
| also provides substantially more insight into how much time is spent in |
| syscalls. |
| Syscalls already take ~250 nanoseconds at a baseline on Linux and are unlikely |
| to get faster in the future for security reasons (due to Spectre and Meltdown) |
| and the additional event would never contain a stack trace, so writing it out |
| should be quite fast. |
| (Nonetheless, we may want to reconsider emitting these events for cgo calls.) |
| The new events would be called, for example, `GoSyscallBegin` and |
| `GoSyscallEnd`. |
| If a syscall blocks, `GoSyscallEnd` is replaced with `GoSyscallEndBlocked` |
| (more on this later). |
| |
| Secondly, I propose adding an explicit event for stealing a P. |
| In the per-M world, keeping just `GoSysBlock` to represent both a goroutine's |
| state transition and the state of a P is not feasible, because we're revealing |
| the fact that fundamentally two threads are racing with one another on the |
| syscall exit. |
| An explicit P-stealing event, for example `ProcSteal`, would be required to |
| be ordered against a `GoSyscallEndBlocked`, with the former always happening |
| before. |
| The precise semantics of the `ProcSteal` event would be a `ProcStop` but one |
| performed by another thread. |
Because this means events can now happen to a P "at a distance," the `ProcStart`,
| `ProcStop`, and `ProcSteal` events all need sequence numbers. |
| |
| Note that the naive emission of `ProcSteal` and `GoSyscallEndBlocked` will |
| cause them to race, but the interaction of the two represents an explicit |
| synchronization within the runtime, so the parser can always safely wait for the |
| `ProcSteal` to emerge in the frontier before proceeding. |
| The timestamp order may also not be right, but since [we already committed to |
| fixing broken timestamps](#timestamps-and-sequencing) in general, this skew will |
| be fixed up by that mechanism for presentation. |
| (In practice, I expect the skew to be quite small, since it only happens if the |
| retaking of a P races with a syscall exit.) |
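
Putting the syscall proposals together, a sketch of the emission logic might
look like the following (the `traceEvent` helper and the exact call sites are
illustrative, not the real runtime code):

```go
// On syscall entry: always emit the start of the syscall, with a
// stack trace attached.
func traceSyscallEnter() {
	traceEvent(GoSyscallBegin) // carries a stack trace
}

// On syscall exit: the common fast path, where the goroutine still
// owns its P, emits a plain GoSyscallEnd with no stack trace. If the
// P was stolen while the goroutine was blocked, emit
// GoSyscallEndBlocked instead; the parser orders it after the
// corresponding ProcSteal.
func traceSyscallExit(stillOwnsP bool) {
	if stillOwnsP {
		traceEvent(GoSyscallEnd)
	} else {
		traceEvent(GoSyscallEndBlocked)
	}
}
```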
| |
| Per-M batching might also incur a higher memory cost for tracing, since there |
| are generally more Ms than Ps. |
| I suspect this isn't actually too big of an issue since the number of Ms is |
| usually close to the number of Ps. |
| In the worst case, there may be as many Ms as Gs! However, if we also [partition |
| the trace](#partitioning), then the number of active buffers will only be |
| proportional to the number of Ms that actually ran in a given time window, which |
| is unlikely to be an issue. |
| Still, if this does become a problem, a reasonable mitigation would be to simply |
| shrink the size of each trace buffer compared to today. |
| The overhead of the tracing slow path is vanishingly small, so doubling its |
| frequency would likely not incur a meaningful compute cost. |
| |
| Other than those three details, per-M batching should function identically to |
| the current per-P batching: trace events may already be safely emitted without a |
| P (modulo `GoSysExit` synchronization), so we're not losing anything else with |
| the change. |
| |
| Instead, however, what we gain is a deeper insight into thread execution. |
| Thread information is currently present in execution traces, but difficult to |
| interpret because it's always tied to P start and stop events. |
| A switch to per-M batching forces traces to treat Ps and Ms orthogonally. |
| |
| Given all this, I propose switching to per-M batching. |
| The only remaining question to resolve is trace synchronization for Ms. |
| |
| (As an additional minor consideration, I would also like to propose adding the |
| batch length to the beginning of each batch. |
| Currently (and in the foreseeable future), the trace consumer needs to iterate |
| over the entire trace once to collect all the batches for ordering. |
| We can speed up this process tremendously by allowing the consumer to _just_ |
| collect the information it needs.) |
| |
| #### M synchronization |
| |
| The runtime already contains a mechanism to execute code on every M via signals, |
| `doAllThreadsSyscall`. |
| However, traces have different requirements than `doAllThreadsSyscall`, and I |
| think we can exploit the specifics of these requirements to achieve a more |
| lightweight alternative. |
| |
| First, observe that getting the attention of every M for tracing is not strictly |
| necessary: Ms that never use the tracer need not be mentioned in the trace at |
| all. |
| This observation allows us to delegate trace state initialization to the M |
| itself, so we can synchronize with Ms at trace start simply by atomically |
| setting a single global flag. |
| If an M ever writes into the trace buffer, then it will initialize its state by |
| emitting an event indicating what it was doing when tracing started. |
For example, if the first event the M is going to emit is `GoBlock`, then it
will first emit an additional event indicating that the goroutine had been
running since the start of the trace (a hypothetical `GoRunning` event).
| Disabling tracing is slightly more complex, as the tracer needs to flush every |
| M's buffer or identify that its buffer was never written to. |
| However, this is fairly straightforward to do with per-M seqlocks. |
| |
Specifically, Ms would double-check the `trace.enabled` flag under the seqlock,
and anything trying to stop a trace would first disable the flag, then iterate
over every M to make sure it _observed_ its seqlock was unlocked.
This guarantees that every M either observed or will observe the new flag state.
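
A sketch of that protocol, using `sync/atomic` types as stand-ins for the
runtime's internal atomics (all names illustrative):

```go
type mTrace struct {
	seq atomic.Uint64 // even: quiescent; odd: event write in progress
}

var traceEnabled atomic.Bool

// acquire brackets every event write. If tracing was disabled, the M
// backs out without writing anything.
func (mt *mTrace) acquire() bool {
	mt.seq.Add(1) // now odd
	if !traceEnabled.Load() {
		mt.seq.Add(1) // back to even; nothing was written
		return false
	}
	return true
}

func (mt *mTrace) release() { mt.seq.Add(1) } // back to even

// stopTrace disables tracing, then waits until each M has been
// observed outside an event write. After that, any M beginning a new
// write is guaranteed to see traceEnabled == false and back out.
func stopTrace(ms []*mTrace) {
	traceEnabled.Store(false)
	for _, mt := range ms {
		for mt.seq.Load()%2 != 0 {
			// Spin; the real runtime would yield the OS thread here.
		}
	}
}
```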
| |
There's just one problem with all this: an M might be running and yet never
emit an event.
| This case is critical to capturing system dynamics. |
| As far as I can tell, there are three cases here: the M is running user code |
| without emitting any trace events (e.g. a tight loop), the M is in a system call |
| or C code the whole time, or the M is `sysmon`, a special thread that always |
| runs without a P. |
| |
| The first case is fairly easy to deal with: because it's running with a P, we |
| can just preempt it and establish an invariant that any preemption always emits |
| an event if the M's trace state has not been initialized. |
| |
| The second case is a little more subtle, but luckily not very complex to |
| implement. |
The tracer can identify whether an M has a G blocked in a syscall (or
equivalently has called into C code) just before disabling tracing globally by
checking `readgstatus(m.curg) == _Gsyscall` on each M's G, and write that down.
| If the tracer can see that the M never wrote to its buffer _after_ it disables |
| tracing, it can safely conclude that the M was still in a syscall at the moment |
| when tracing was disabled, since otherwise it would have written to the buffer. |
| |
| Note that since we're starting to read `gp.atomicstatus` via `readgstatus`, we |
| now need to ensure consistency between the G's internal status and the events |
| we emit representing those status changes, from the tracer's perspective. |
| Thus, we need to make sure we hold the M's trace seqlock across any |
| `gp.atomicstatus` transitions, which will require a small refactoring of the |
| runtime and the tracer. |
| |
For the third case, we can play basically the same trick as in the second
case, but instead check whether sysmon is blocked on a note.
If it's not blocked, it's running, and if its buffer is still empty when
tracing is disabled (i.e. it never emitted any events, such as stopping on a
note), the tracer can assume it was running the whole time.
| |
| A big change with this synchronization mechanism is that the tracer no longer |
| obtains the state of _all_ goroutines when a trace starts, only those that run |
| or execute syscalls. |
| The remedy to this is to simply add it back in. |
| Since the only cases not covered are goroutines in `_Gwaiting` or `_Grunnable`, |
| the tracer can set the `_Gscan` bit on the status to ensure the goroutine can't |
| transition out. |
At that point, it collects a stack trace and writes out a status event for the
goroutine into a buffer that isn't attached to any particular M.
| To avoid ABA problems with a goroutine transitioning out of `_Gwaiting` or |
| `_Grunnable` and then back in, we also need an atomically-accessed flag for each |
| goroutine that indicates whether an event has been written for that goroutine |
| yet. |
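
A sketch of that flag (names illustrative):

```go
// statusTraced is a per-goroutine atomic flag: 0 means no event has
// been written for this goroutine in the current trace. The
// goroutine's normal event path and the tracer's scan of goroutines
// in _Gwaiting/_Grunnable race to set it, so exactly one of them
// emits the initial status event.
func maybeEmitGoStatus(gp *g) {
	if gp.statusTraced.CompareAndSwap(0, 1) {
		emitGoStatus(gp) // e.g. into a buffer not attached to any M
	}
}
```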
| |
| The result of all this synchronization is that the tracer only lightly perturbs |
| application execution to start and stop traces. |
| In theory, a stop-the-world is no longer required at all, but in practice |
| (especially with [partitioning](#partitioning)), a brief stop-the-world to begin |
| tracing dramatically simplifies some of the additional synchronization necessary. |
| Even so, starting and stopping a trace will now be substantially more lightweight |
| than before. |
| |
| ### Partitioning |
| |
| The structure of the execution trace's binary format limits what you can do with |
| traces. |
| |
| In particular, to view just a small section of the trace, the entire trace needs |
| to be parsed, since a batch containing an early event may appear at the end of |
| the trace. |
| This can happen if, for example, a P that stays relatively idle throughout the |
| trace wakes up once at the beginning and once at the end, but never writes |
| enough to the trace buffer to flush it. |
| |
| To remedy this, I propose restructuring traces into a stream of self-contained |
| partitions, called "generations." |
| More specifically, a generation is a collection of trace batches that represents |
| a complete trace. |
| In practice this means each generation boundary is a global buffer flush. |
Each trace batch will have a generation number associated with it, and
generations will not interleave.
A generation boundary is identified in the trace by the appearance of a batch
with a different generation number.
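
For example, a parser could split a batch stream into generations in a single
pass (types and names illustrative):

```go
// Batch is a placeholder for a decoded trace batch.
type Batch struct {
	Gen  uint64 // generation number from the batch header
	Data []byte
}

// splitGenerations groups a stream of batches into generations.
// Because generations never interleave, a change in the generation
// number marks a boundary.
func splitGenerations(batches []Batch) [][]Batch {
	var gens [][]Batch
	var cur []Batch
	for _, b := range batches {
		if len(cur) > 0 && b.Gen != cur[0].Gen {
			gens = append(gens, cur)
			cur = nil
		}
		cur = append(cur, b)
	}
	if len(cur) > 0 {
		gens = append(gens, cur)
	}
	return gens
}
```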
| |
| The size of each generation will be roughly constant: new generation boundaries |
| will be created when either the partition reaches some threshold size _or_ some |
| maximum amount of wall-clock time has elapsed. |
| The exact size threshold and wall-clock limit will be determined empirically, |
| though my general expectation for the size threshold is around 16 MiB. |
| |
| Generation boundaries will add a substantial amount of implementation |
| complexity, but the cost is worth it, since it'll fundamentally enable new |
| use-cases. |
| |
| The trace parser will need to be able to "stitch" together events across |
| generation boundaries, and for that I propose the addition of a new `GoStatus` |
| event. |
| This event is a generalization of the `GoInSyscall` and `GoWaiting` events. |
This event will be emitted the first time a goroutine is about to be mentioned
in a trace, and will carry a state enum argument that indicates what state the
goroutine was in just *before* the next event to be emitted.
To couple events and states more tightly, the state emitted for a goroutine
will be derived from the event about to be emitted (except for the "waiting"
and "syscall" cases, where the state is explicit).
| The trace parser will be able to use these events to validate continuity across |
| generations. |
| |
| This global buffer flush can be implemented as an extension to the |
| aforementioned [M synchronization](#m-synchronization) design by replacing the |
| `enabled` flag with a generation counter and doubling up much of the trace |
| state. |
The generation counter will be used to index into the trace state, allowing us
to dump and then toss old trace state, preparing it for the next generation
while the current generation is actively being generated.
| In other words, it will work akin to ending a trace, but for just one half of |
| the trace state. |
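
A sketch of the doubled-up state (names illustrative):

```go
// genState holds one generation's worth of trace state: buffers, the
// stack table, the string table, and so on.
type genState struct{ /* ... */ }

var trace struct {
	gen   atomic.Uint64 // current generation; 0 means tracing is off
	state [2]genState   // indexed by gen % 2
}

// stateFor returns the state for a given generation. Generation N can
// be flushed out of one slot while generation N+1 is written into the
// other.
func stateFor(gen uint64) *genState {
	return &trace.state[gen%2]
}
```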
| |
| One final subtlety created by partitioning is that now we may no longer have a |
| stop-the-world in between complete traces (since a generation is a complete |
| trace). |
This means some activities may already be in progress when a trace starts, and
we'll lose that information.
| For user-controlled event types (user tasks and user regions), I propose doing |
| nothing special. |
| It's already true that if a task or region was in progress we lose that |
| information, and it's difficult to track that efficiently given how the API is |
| structured. |
| For range event types we control (sweep, mark assist, STW, etc.), I propose |
| adding the concept of "active" events which are used to indicate that a |
| goroutine or P has been actively in one of these ranges since the start of a |
| generation. |
| These events will be emitted alongside the `GoStatus` event for a goroutine. |
(This wasn't mentioned above, but we'll also need a `ProcStatus` event, and
we'll emit active-range events as part of that process as well.)
| |
| ### Event cleanup |
| |
| Since we're modifying the trace anyway, this is a good opportunity to clean up |
| the event types, making them simpler to implement, simpler to parse and |
| understand, and more uniform overall. |
| |
The three biggest changes I would like to propose are:

1. A uniform naming scheme.
1. Separation of the reasons that goroutines might stop or block from the
   event type.
1. Placing strings, stacks, and CPU samples in their own dedicated batch types.
| |
Firstly, for the naming scheme, I propose the following rules (which are not
strictly enforced, only guidelines for documentation; an illustrative sketch
follows the list):
| |
| - Scheduling resources, such as threads and goroutines, have events related to |
| them prefixed with the related resource (i.e. "Thread," "Go," or "Proc"). |
| - Scheduling resources have "Create" and "Destroy" events. |
| - Scheduling resources have generic "Status" events used for indicating their |
| state at the start of a partition (replaces `GoInSyscall` and `GoWaiting`). |
| - Scheduling resources have "Start" and "Stop" events to indicate when that |
| resource is in use. |
| The connection between resources is understood through context today. |
| - Goroutines also have "Block" events which are like "Stop", but require an |
| "Unblock" before the goroutine can "Start" again. |
| - Note: Ps are exempt since they aren't a true resource, more like a |
| best-effort semaphore in practice. |
| There's only "Start" and "Stop" for them. |
| - Events representing ranges of time come in pairs with the start event having |
| the "Begin" suffix and the end event having the "End" suffix. |
| - Events have a prefix corresponding to the deepest resource they're associated |
| with. |
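
As an illustrative sketch of these rules in action, a hypothetical (and
non-exhaustive) event enumeration might look like this:

```go
const (
	EvGoCreate  = iota // goroutine created
	EvGoStart          // goroutine starts running
	EvGoStop           // goroutine stops, but may "Start" again directly
	EvGoBlock          // goroutine blocks; requires EvGoUnblock to run again
	EvGoUnblock        // goroutine becomes runnable again
	EvGoDestroy        // goroutine exits
	EvGoStatus         // goroutine state at the start of a partition

	EvProcStart // P starts being used (Ps get no Create/Destroy)
	EvProcStop  // P stops being used

	EvThreadCreate  // thread created
	EvThreadDestroy // thread destroyed

	EvGCMarkAssistBegin // time ranges pair a "Begin"...
	EvGCMarkAssistEnd   // ...with an "End"
)
```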
| |
| Secondly, I propose moving the reasons why a goroutine resource might stop or |
| block into an argument. |
| This choice is useful for backwards compatibility, because the most likely |
| change to the trace format at any given time will be the addition of more |
| detail, for example in the form of a new reason to block. |
| |
| Thirdly, I propose placing strings, stacks, and CPU samples in their own |
| dedicated batch types. |
| This, in combination with batch sizes in each batch header, will allow the trace |
| parser to quickly skim over a generation and find all the strings, stacks, and |
| CPU samples. |
| This makes certain tasks faster, but more importantly, it simplifies parsing and |
| validation, since the full tables are just available up-front. |
| It also means that the event batches are able to have a much more uniform format |
| (one byte event type, followed by several varints) and we can keep all format |
| deviations separate. |
| |
| Beyond these big three, there are a myriad of additional tweaks I would like to |
| propose: |
| |
| - Remove the `FutileWakeup` event, which is no longer used (even in the current |
| implementation). |
| - Remove the `TimerGoroutine` event, which is no longer used (even in the |
| current implementation). |
| - Rename `GoMaxProcs` to `ProcsChange`. |
| - This needs to be written out at trace startup, as before. |
| - Redefine `GoSleep` as a `GoBlock` and `GoUnblock` pair. |
| - Break out `GoStartLabel` into a more generic `GoLabel` event that applies a |
| label to a goroutine the first time it emits an event in a partition. |
| - Change the suffixes of pairs of events representing some activity to be |
| `Begin` and `End`. |
| - Ensure all GC-related events contain the string `GC`. |
| - Add [sequence numbers](#timestamps-and-sequencing) to events where necessary: |
| - `GoStart` still needs a sequence number. |
| - `GoUnblock` still needs a sequence number. |
- Eliminate the `{GoStart,GoUnblock,GoSysExit}Local` events for uniformity.
| - Because traces are no longer bound by stopping the world, and can indeed now |
| start during the GC mark phase, we need a way to identify that a GC mark is |
| currently in progress. |
| Add the `GCMarkActive` event. |
| - Eliminate inline strings and make all strings referenced from a table for |
| uniformity. |
| - Because of partitioning, the size of the string table is unlikely to become |
| large in practice. |
| - This means all events can have a fixed size (expressed in terms of |
| the number of arguments) and we can drop the argument count bits from the |
| event type. |
| |
| A number of these tweaks above will likely make traces bigger, mostly due to |
| additional information and the elimination of some optimizations for uniformity. |
| |
| One idea to regain some of this space is to compress goroutine IDs in each |
| generation by maintaining a lightweight mapping. |
| Each time a goroutine would emit an event that needs the current goroutine |
| ID, it'll check a g-local cache. |
| This cache will contain the goroutine's alias ID for the current partition |
| and the partition number. |
| If the goroutine does not have an alias for the current partition, it |
| increments a global counter to acquire one and writes down the mapping of |
| that counter to its full goroutine ID in a global table. |
| This table is then written out at the end of each partition. |
This optimization is optional, however.
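
If we do implement it, a sketch of the per-goroutine cache might look like
this (all names illustrative):

```go
// goidCache is per-goroutine state mapping the real goroutine ID to a
// compact per-partition alias.
type goidCache struct {
	part  uint64 // partition the alias below is valid for
	alias uint64 // compact ID within that partition
}

var nextAlias atomic.Uint64 // reset at each partition boundary

// traceGoID returns the compact ID to write into events, assigning a
// fresh alias and recording the alias -> goid mapping (written out at
// the end of the partition) on first use within a partition.
func (c *goidCache) traceGoID(part, goid uint64) uint64 {
	if c.part != part {
		c.part = part
		c.alias = nextAlias.Add(1)
		recordAlias(part, c.alias, goid) // illustrative helper
	}
	return c.alias
}
```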
| |
| ## Implementation |
| |
| As discussed in the overview, this design is intended to be implemented in |
| parts. |
| Some parts are easy to integrate into the existing tracer, such as the switch to |
| nanotime and the traceback-related improvements. |
| Others, such as per-M batching, partitioning, and the change to the event |
| encoding, are harder. |
| |
| While in theory all could be implemented separately as an evolution of the |
| existing tracer, it's simpler and safer to hide the changes behind a feature |
| flag, like a `GOEXPERIMENT` flag, at least to start with. |
| Especially for per-M batching and partitioning, it's going to be quite |
| complicated to try to branch in all the right spots in the existing tracer. |
| |
| Instead, we'll basically have two trace implementations temporarily living |
| side-by-side, with a mostly shared interface. |
| Where they differ (and there are really only a few spots), the runtime can |
| explicitly check for the `GOEXPERIMENT` flag. |
| This also gives us an opportunity to polish the new trace implementation. |
| |
| The implementation will also require updating the trace parser to make tests |
| work. |
| I suspect this will result in what is basically a separate trace parser that's |
| invoked when the new header version is identified. |
| Once we have a basic trace parser that exports the same API, we can work on |
| improving the internal trace parser API to take advantage of the new features |
| described in this document. |
| |
| ## Use-cases |
| |
This document presents a large change to the existing tracer under the banner
of "traces at scale," which is a bit abstract.
| To better understand what that means, let's explore a few concrete use-cases |
| that this design enables. |
| |
| ### Flight recording |
| |
| Because the trace in this design is partitioned, it's now possible for the |
| runtime to carry around the most recent trace partition. |
| The runtime could then expose this partition to the application as a snapshot of |
| what the application recently did. |
| This is useful for debugging at scale, because it enables the application to |
| collect data exactly when something goes wrong. |
| For instance: |
| |
| - A server's health check fails. |
| The server takes a snapshot of recent execution and puts it into storage |
| before exiting. |
| - Handling a request exceeds a certain latency threshold. |
| The server takes a snapshot of the most recent execution state and uploads it |
| to storage for future inspection. |
| - A program crashes when executing in a deployed environment and leaves behind a |
| core dump. |
| That core dump contains the recent execution state. |
| |
| Here's a quick design sketch: |
| |
| ```go |
| package trace |
| |
// FlightRecorder represents a flight recording configuration.
| type FlightRecorder struct { ... } |
| |
| // NewFlightRecorder creates a new flight recording configuration. |
| func NewFlightRecorder() *FlightRecorder |
| |
| // Start begins process-wide flight recording. Only one FlightRecorder |
| // may be started at once. If another FlightRecorder is started, this |
| // function will return an error. |
| // |
| // This function can be used concurrently with Start and Stop. |
| func (fr *FlightRecorder) Start() error |
| |
// TakeSnapshot collects information about the execution of this program
// from the last few seconds and writes it to w. The FlightRecorder must
// have been started to take a snapshot.
| func (fr *FlightRecorder) TakeSnapshot(w io.Writer) error |
| |
| // Stop ends process-wide flight recording. Stop must be called on the |
| // same FlightRecorder that started recording. |
| func (fr *FlightRecorder) Stop() error |
| ``` |
| |
| - The runtime accumulates trace buffers as usual, but when it makes a partition, |
| it puts the trace data aside as it starts a new one. |
| - Once the new partition is finished, the previous is discarded (buffers are |
| reused). |
| - At any point, the application can request a snapshot of the current trace |
| state. |
| - The runtime immediately creates a partition from whatever data it has, puts |
| that together with the previously-accumulated partition, and hands it off to |
| the application for reading. |
| - The runtime then continues accumulating trace data in a new partition while |
| the application reads the trace data. |
| - The application is almost always guaranteed two partitions, with one being as |
| large as a partition would be in a regular trace (the second one is likely to |
| be smaller). |
| - Support for flight recording in the `/debug/pprof/trace` endpoint. |
| |
| ### Fleetwide trace collection |
| |
| Today some power users collect traces at scale by setting a very modest sampling |
| rate, e.g. 1 second out of every 1000 seconds that a service is up. |
| This has resulted in collecting a tremendous amount of useful data about |
| execution at scale, but this use-case is currently severely limited by trace |
| performance properties. |
| |
| With this design, it should be reasonable to increase the sampling rate by an |
| order of magnitude and collect much larger traces for offline analysis. |
| |
| ### Online analysis |
| |
| Because the new design allows for partitions to be streamed and the trace |
| encoding is much faster to process, it's conceivable that a service could |
| process its own trace continuously and aggregate and filter only what it needs. |
| This online processing avoids the need to store traces for future inspection. |
| |
| The kinds of processing I imagine for this includes: |
| |
| - Detailed task latency breakdowns over long time horizons. |
| - CPU utilization estimates for different task types. |
| - Semantic processing and aggregation of user log events. |
| - Fully customizable flight recording (at a price). |
| |
| ### Linux sched trace association |
| |
| Tooling can be constructed that interleaves Go execution traces with traces |
| produced via: |
| |
| `perf sched record --clockid CLOCK_MONOTONIC <command>` |
| |
| ## Prior art |
| |
| ### JFR |
| |
| [JFR or "Java Flight |
| Recorder](https://developers.redhat.com/blog/2020/08/25/get-started-with-jdk-flight-recorder-in-openjdk-8u#using_jdk_flight_recorder_with_jdk_mission_control)" |
| is an execution tracing system for Java applications. |
| JFR is highly customizable with a sophisticated wire format. |
| It supports very detailed configuration of events via an XML configuration file, |
| including the ability to enable/disable stack traces per event and set a latency |
| threshold an event needs to exceed in order to be sampled (e.g. "only sample |
| file reads if they exceed 20ms"). |
| It also supports custom user-defined binary-encoded events. |
| |
| By default, JFR offers a low overhead configuration (<1%) for "continuous |
| tracing" and slightly higher overhead configuration (1-2%) for more detailed |
| "profile tracing." |
| These configurations are just default configuration files that ship with the |
| JVM. |
| |
| Continuous tracing is special in that it accumulates data into an internal |
| global ring buffer which may be dumped at any time. |
| Notably this data cannot be recovered from a crashed VM, though it can be |
| recovered in a wide variety of other cases. |
| Continuous tracing is disabled by default. |
| |
| The JFR encoding scheme is quite complex, but achieves low overhead partially |
| through varints. |
| JFR traces are also partitioned, enabling scalable analysis. |
| |
| In many ways the existing Go execution tracer looks a lot like JFR, just without |
| partitioning, and with a simpler encoding. |
| |
| ### KUTrace |
| |
| [KUTrace](https://github.com/dicksites/KUtrace) is a system-wide execution |
| tracing system for Linux. |
| It achieves low overhead, high resolution, and staggering simplicity by cleverly |
| choosing to trace on each kernel transition (the "goldilocks" point in the |
| design space, as the author puts it). |
| |
| It uses a fairly simple 8-byte-word-based encoding scheme to keep trace writing |
| fast, and exploits the very common case of a system call returning quickly |
| (>90%) to pack two events into each word when possible. |
| |
This project's low-overhead tracing inspired this effort, but in the end we
didn't take too many of its insights.
| |
| ### CTF |
| |
| [CTF or "Common Trace Format](https://diamon.org/ctf/#spec7)" is more of a meta |
| tracing system. |
| It defines a binary format, a metadata format (to describe that binary format), |
| as well as a description language. |
| In essence, it contains all the building blocks of a trace format without |
| defining one for a specific use-case. |
| |
| Traces in CTF are defined as a series of streams of events. |
| Events in a stream are grouped into self-contained packets. |
| CTF contains many of the same concepts that we define here (packets are batches; |
| the streams described in this document are per-thread, etc.), though it largely |
| leaves interpretation of the trace data up to the one defining a CTF-based trace |
| format. |
| |
| ## Future work |
| |
| ### Event encoding |
| |
| The trace format relies heavily on LEB128-encoded integers. |
| While this choice makes the trace quite compact (achieving 4-6 bytes/event, with |
| some tricks to keep each encoded integer relatively small), it comes at a cost |
| to decoding speed since LEB128 is well-known for being relatively slow to decode |
| compared to similar integer encodings. |
| (Note: this is only a hunch at present; this needs to be measured.) (The cost at |
| encoding time is dwarfed by other costs, like tracebacks, so it's not on the |
| critical path for this redesign.) |
| |
| This section proposes a possible future trace encoding that is simpler, but |
| without any change would produce a much larger trace. |
| It then proposes two possible methods of compressing the trace from there. |
| |
| #### Encoding format |
| |
| To start with, let's redefine the format as a series of 4-byte words of data. |
| |
| Each event has a single header word that includes the event type (8 bits), space |
| for a 4-bit event reason, and the timestamp delta (20 bits). |
| Note that every event requires a timestamp. |
| At a granularity of 64 nanoseconds, this gives us a timestamp range of ~1 |
| second, which is more than enough for most cases. |
| In the case where we can't fit the delta in 24 bits, we'll emit a new |
| "timestamp" event which is 2 words wide, containing the timestamp delta from the |
| start of the partition. |
| This gives us a 7-byte delta which should be plenty for any trace. |
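
A sketch of packing that header word (the exact bit assignment is
illustrative):

```go
// packHeader packs one 4-byte header word: the event type in the top
// 8 bits, a 4-bit reason, and a 20-bit timestamp delta (in 64 ns
// units) in the low bits.
func packHeader(evType, reason uint8, tsDelta uint32) uint32 {
	return uint32(evType)<<24 | uint32(reason&0xf)<<20 | tsDelta&0xfffff
}
```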
| |
| Each event's header is then followed by a number of 4-byte arguments. |
The minimum number of arguments is fixed for each event type and will be made
evident in the trace's self-description.
| A self-description table at the start of the trace could indicate (1) whether |
| there are a variable number of additional arguments, (2) which argument |
| indicates how many there are, and (3) whether any arguments are byte arrays |
| (integer followed by data). |
| Byte array data lengths are always rounded up to 4 bytes. |
| Possible arguments include, for example, per-goroutine sequence numbers, |
| goroutine IDs, thread IDs, P IDs, stack IDs, and string IDs. |
| Within a partition, most of these trivially fit in 32 bits: |
| |
| - Per-goroutine sequence numbers easily fit provided each partition causes a |
| global sequence number reset, which is straightforward to arrange. |
| - Thread IDs come from the OS as `uint32` on all platforms we support. |
| - P IDs trivially fit within 32 bits. |
| - Stack IDs are local to a partition and as a result trivially fit in 32-bits. |
| Assuming a partition is O(MiB) in size (so, O(millions) of events), even if |
| each event has a unique stack ID, it'll fit. |
| - String IDs follow the same logic as stack IDs. |
| - Goroutine IDs, when compressed as described in the event cleanup section, will |
| easily fit in 32 bits. |
| |
This format change, before any compression, will result in an encoded trace
2-3x larger.
| Plugging in the size of each proposed event above into the breakdown of a trace |
| such as [this 288 MiB one produced by |
| felixge@](https://gist.github.com/felixge/a79dad4c30e41a35cb62271e50861edc) |
| reveals an increase in encoded trace size by a little over 2x. |
| The aforementioned 288 MiB trace grows to around 640 MiB in size, not including |
| additional timestamp events, the batch header in this scheme, or any tables at |
| the end of each partition. |
| |
| This increase in trace size is likely unacceptable since there are several cases |
| where holding the trace in memory is desirable. |
| Luckily, the data in this format is very compressible. |
| |
| This design notably doesn't handle inline strings. |
| The easy thing to do would be to always put them in the string table, but then |
| they're held onto until at least the end of a partition, which isn't great for |
| memory use in the face of many non-cacheable strings. |
| This design would have to be extended to support inline strings, perhaps |
| rounding up their length to a multiple of 4 bytes. |
| |
| #### On-line integer compression |
| |
| One possible route from here is to simply choose a different integer compression |
| scheme. |
| There exist many fast block-based integer compression schemes, ranging from [GVE |
| and |
| PrefixVarint](https://en.wikipedia.org/wiki/Variable-length_quantity#Group_Varint_Encoding) |
| to SIMD-friendly bit-packing schemes. |
| |
| Inline strings would likely have to be excepted from this compression scheme, |
| and would likely need to be sent out-of-band. |
| |
| #### On-demand user compression |
| |
| Another alternative compression scheme is to have none, and instead expect the |
| trace consumer to perform their own compression, if they want the trace to be |
| compressed. |
| This certainly simplifies the runtime implementation, and would give trace |
| consumers a choice between trace encoding speed and trace size. |
| For a sufficiently fast compression scheme (perhaps LZ4), it's possible this |
| could rival integer compression in encoding overhead. |
| |
| More investigation on this front is required. |
| |
| ### Becoming a CPU profile superset |
| |
| Traces already contain CPU profile samples, but are missing some information |
| compared to what Go CPU profile pprof protos contain. |
| We should consider bridging that gap, such that it is straightforward to extract |
| a CPU profile from an execution trace. |
| |
| This might look like just emitting another section to the trace that contains |
| basically the entire pprof proto. |
| This can likely be added as a footer to the trace. |