Execution tracer overhaul

Authored by mknyszek@google.com with a mountain of input from others.

In no particular order, thank you to Felix Geisendorfer, Nick Ripley, Michael Pratt, Austin Clements, Rhys Hiltner, thepudds, Dominik Honnef, and Bryan Boreham for your invaluable feedback.

Background

Original design document.

Go execution traces provide a moment-to-moment view of what happens in a Go program over some duration. This information is invaluable in understanding program behavior over time and can be leveraged to achieve significant performance improvements. Because Go has a runtime, it can provide deep information about program execution without any external dependencies, making traces particularly attractive for large deployments.

Unfortunately, limitations in the trace implementation prevent its widespread use.

For example, the process of analyzing execution traces scales poorly with the size of the trace. Traces need to be parsed in their entirety to do anything useful with them, making them impossible to stream. As a result, trace parsing and validation have very high memory requirements for large traces.

Also, Go execution traces are designed to be internally consistent, but don't provide any way to align with other kinds of traces, such as OpenTelemetry traces and Linux sched traces. Alignment with higher level tracing mechanisms is critical to connecting business-level tasks with resource costs. Meanwhile, alignment with lower level traces enables a fully vertical view of application performance to root out the most difficult and subtle issues.

Thanks to work in the Go 1.21 cycle, the execution tracer's run-time overhead was reduced from about -10% throughput and +10% request latency in web services to about 1% in both for most applications. This reduced overhead, in conjunction with making traces more scalable, enables some exciting and powerful new opportunities for this diagnostic.

Lastly, the implementation of the execution tracer has evolved organically over time and it shows. The codebase also has many old warts and some age-old bugs that make collecting traces difficult and make the resulting traces seem broken. Furthermore, many significant design decisions were made over the years but weren't thoroughly documented; those decisions largely exist solely in old commit messages and breadcrumbs left in comments within the codebase itself.

Goals

The goal of this document is to define an alternative implementation for Go execution traces that scales up to large Go deployments.

Specifically, the design presented aims to achieve:

  • Trace parsing that requires only a small fraction of the memory it requires today.
  • Streamable traces, to enable analysis without storage.
  • Partially self-describing traces, to reduce the upgrade burden on trace consumers.
  • Fixes for age-old bugs and a path to cleaning up the implementation.

Furthermore, this document will present the existing state of the tracer in detail and explain why it's like that to justify the changes being made.

Design

Overview

The design is broken down into five parts:

  • Timestamps and sequencing.
  • Trace event batching.
  • Partitioning.
  • Event cleanup.
  • Self-description.

These five parts are roughly ordered by how fundamental they are to the trace design, and so the former sections are more like concrete proposals, while the latter sections are more like design suggestions that would benefit from prototyping. The earlier parts can also be implemented without considering the latter parts.

Each section also includes the history and design of the existing system, both to document the current system in one place and to more easily compare it to the newly proposed system. That requires, however, a lot of prose, which can obscure the bigger picture. Here are the highlights of each section without that additional context.

Timestamps and sequencing.

  • Compute timestamps from the OS's monotonic clock (nanotime).
  • Use per-goroutine sequence numbers for establishing a partial order of events (as before).

Batching.

  • Batch trace events by M instead of by P.
    • Use lightweight M synchronization for trace start and stop (no stop-the-world).
  • Make the behavior of an M (create/destroy/start/stop) explicit in the trace.

Partitioning.

  • Traces are sequences of fully self-contained partitions that may be streamed.
    • Each partition has its own stack table and string table.
  • Partitions are purely logical: consecutive batches with the same ID.
    • In general, parsers need state from the previous partition to get accurate timing information.
    • Partitions have an “earliest possible” timestamp to allow for imprecise analysis without a previous partition.
  • Partitions are bound by both a maximum wall time and a maximum size (determined empirically).
  • Traces contain an optional footer delineating partition boundaries as byte offsets.
  • Emit batch lengths to allow for rapidly identifying all batches within a partition.

Event cleanup.

  • More consistent naming scheme for event types.
  • All syscalls have a full duration in the trace.
  • Separate out “reasons” that a goroutine can block or stop from the event types.
  • Lots of other little tweaks.

Self-description.

  • Emit a table at the start of a trace.
    • That table can be pre-serialized and embedded into the runtime, then just copied out. It could even be compressed.
    • This table starts with a minor version, to imply backwards compatibility.
  • Map event types to lengths (or how to decode their length).
  • Map event types to human-friendly names and descriptions.
  • Classify event types to identify basic event semantics for analysis and validation.
    • Events are either “control” events or “description” events.
      • “Control” events are the minimal set of events necessary to understand program execution. Their interpretation is assumed.
      • “Description” events are fully described: does the event represent an instant or a range? If a range, which event does it pair with? What is the event scoped to: a goroutine, a thread, or the entire program?

Timestamps and sequencing

For years, the Go execution tracer has used the cputicks runtime function for obtaining a timestamp. On most platforms, this function queries the CPU for a tick count with a single instruction. (Intuitively a “tick” goes by roughly every CPU clock period, but in practice this clock usually has a constant rate that’s independent of CPU frequency entirely.) Originally, the execution tracer used this timestamp exclusively for ordering trace events. Unfortunately, many modern CPUs don’t provide a tick count that is stable across CPU cores, meaning even though cores might synchronize with one another, the clock read-out on each CPU is not guaranteed to be ordered in the same direction as that synchronization. This led to traces with inconsistent timestamps.

To combat this, the execution tracer was modified to use a global sequence counter that was incremented synchronously for each event. Each event would then have a sequence number that could be used to order it relative to other events on other CPUs, and the timestamps could just be used solely as a measure of duration on the same CPU. However, this approach led to tremendous overheads, especially on multiprocessor systems.

That’s why in Go 1.7 the implementation changed so that each goroutine had its own sequence counter. The implementation also cleverly avoids including the sequence number in the vast majority of events by observing that running goroutines can’t actually be taken off their thread until they’re synchronously preempted or yield themselves. Any event emitted while the goroutine is running is trivially ordered after the start. The only non-trivial ordering cases left are where an event is emitted by or on behalf of a goroutine that is not actively running. (Note: I like to summarize this case as “emitting events at a distance” because the scheduling resource itself is not emitting the event bound to it.) These cases need to be able to be ordered with respect to each other and with a goroutine starting to run (i.e. the GoStart event). In the current trace format, there are only two such cases: the GoUnblock event, which indicates that a blocked goroutine may start running again and is useful for identifying scheduling latencies, and GoSysExit, which is used to determine the duration of a syscall but may be emitted from a different P than the one the syscall was originally made on. (For more details on the GoSysExit case, see the next section.) Furthermore, there are versions of the GoStart, GoUnblock, and GoSysExit events that omit a sequence number to save space if the goroutine just ran on the same P as the last event, since that’s also a case where the events are trivially serialized. In the end, this approach successfully reduced the trace overhead from over 10x to 10%.

However, it turns out that the trace parser still requires timestamps to be in order, leading to the infamous “time stamps out of order” error when cputicks inevitably doesn’t actually emit timestamps in order. Ps are a purely virtual resource; they don’t actually map down directly to physical CPU cores at all, so it's not even reasonable to assume that the same P runs on the same CPU for any length of time.

Although we can work around this issue, I propose we try to rely on the operating system’s clock instead and fix up timestamps as a fallback. The main motivation behind this change is alignment with other tracing systems. It’s already difficult enough to try to internally align the cputicks clock, but making it work with clocks used by other traces such as those produced by distributed tracing systems is even more difficult.

On Linux, the clock_gettime syscall, called nanotime in the runtime, takes around 15ns on average when called in a loop. This is compared to cputicks' 10ns. Trivially replacing all cputicks calls in the current tracer with nanotime reveals a small performance difference that depends largely on the granularity of each result. Today, the cputicks value is divided by 64. On a 3 GHz processor, this amounts to a granularity of about 20 ns. Replacing that with nanotime and no time division (i.e. nanosecond granularity) results in a 0.22% geomean regression in throughput in the Sweet benchmarks. The trace size also increases by 17%. Dividing nanotime by 64, we see approximately no regression and a trace size decrease of 1.7%. Overall, there seems to be little performance downside to using nanotime, provided we pick an appropriate timing granularity: what we lose by calling nanotime, we can easily regain by sacrificing a small amount of precision. And it's likely that most of the precision below 128 nanoseconds or so is noise, given the average cost of a single call into the Go scheduler (~250 nanoseconds). To give us plenty of precision, I propose a target timestamp granularity of 64 nanoseconds. This should be plenty to give us fairly fine-grained insights into Go program behavior while also keeping timestamps small.
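
To make that concrete, here is a minimal sketch of taking coarsened timestamps. The names are hypothetical, and the time package stands in for the runtime's internal nanotime clock, which isn't reachable from ordinary Go code.

package main

import (
    "fmt"
    "time"
)

// traceTimeDiv is the proposed granularity: one trace tick per 64ns.
const traceTimeDiv = 64

var procStart = time.Now()

// nanotime stands in for the runtime's monotonic clock.
func nanotime() int64 { return int64(time.Since(procStart)) }

// traceClockNow returns the current time in trace ticks. Storing the
// divided value keeps per-event timestamp deltas small on the wire.
func traceClockNow() uint64 { return uint64(nanotime()) / traceTimeDiv }

func main() {
    t0 := traceClockNow()
    time.Sleep(time.Millisecond)
    t1 := traceClockNow()
    fmt.Printf("elapsed: %d ticks (~%d ns)\n", t1-t0, (t1-t0)*traceTimeDiv)
}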

As for sequencing, I believe we must retain the per-goroutine sequence number design as-is. Relying solely on a timestamp, even a good one, has significant drawbacks. For one, issues arise when timestamps are identical: the parser needs to decide on a valid ordering and has no choice but to consider every possible ordering of those events without additional information. While such a case is unlikely with nanosecond-level timestamp granularity, it totally precludes making timestamp granularity more coarse, as suggested in the previous paragraph. A sequencing system that’s independent of the system’s clock also retains the ability of the tracing system to function despite a broken clock (modulo returning an error when timestamps are out of order, which again I think we should just work around). Even clock_gettime might be broken on some machines!

How would a tracing system continue to function despite a broken clock? For that I propose making the trace parser fix up timestamps that don’t line up with the partial order. The basic idea is that if the parser discovers a partial order edge between two events A and B, and A’s timestamp is later than B’s, then the parser applies A’s timestamp to B. B’s new timestamp is in turn propagated later on in the algorithm in case another partial order edge is discovered between B and some other event C, and those events’ timestamps are also out-of-order.
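
A minimal sketch of that repair, assuming a hypothetical Event type produced by the parser; the parser would apply it to every partial order edge it discovers.

package main

import "fmt"

// Event is a hypothetical parsed trace event.
type Event struct {
    Name string
    Ts   uint64 // timestamp in trace ticks
}

// fixupEdge repairs timestamps along one partial-order edge a -> b: if a is
// ordered before b but carries a later timestamp, b inherits a's timestamp.
// Returning b lets the caller keep propagating the adjusted time if another
// edge b -> c is discovered later.
func fixupEdge(a, b *Event) *Event {
    if a.Ts > b.Ts {
        b.Ts = a.Ts
    }
    return b
}

func main() {
    a := &Event{Name: "GoUnblock", Ts: 105}
    b := &Event{Name: "GoStart", Ts: 100} // out of order with respect to the edge a -> b
    c := &Event{Name: "GoBlock", Ts: 103}
    fixupEdge(fixupEdge(a, b), c)
    fmt.Println(a.Ts, b.Ts, c.Ts) // 105 105 105
}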

There’s one last issue with timestamps here on platforms for which the runtime doesn’t have an efficient nanosecond-precision clock at all, like Windows. (Ideally, we'd make use of system calls to obtain a higher-precision clock, like QPC, but calls to this API can take upwards of a hundred nanoseconds on modern hardware, and even then the resolution is on the order of a few hundred nanoseconds.) On Windows we can just continue to use cputicks to get the precision and rely on the timestamp fixup logic in the parser, at the cost of being unable to reliably align traces with other diagnostic data that uses the system clock.

Batching

Today the Go runtime batches trace events by P. That is, trace events are grouped in batches that all happened, in order, on the same P. A batch is represented in the runtime as a buffer, 32 KiB in size, which is attached to a P when it’s actively being written to. Events are written to this P’s buffer in their encoded form. This design choice allows most event writes to elide synchronization with the rest of the runtime and linearizes the trace with respect to a P, which is crucial to sequencing without requiring a global total order.

Batching traces by any core scheduling resource (G, M, or P) could in principle have similar properties. At a glance, there are a few reasons Ps make a better choice. One reason is that there are generally a small number of Ps compared to Ms and Gs, minimizing the maximum number of buffers that can be active at any given time. Another reason is convenience. When batching, tracing generally requires some kind of synchronization with all instances of its companion resource type to get a consistent trace. (Think of a buffer which has events written to it very infrequently. It needs to be flushed at some point before the trace can be considered complete, because it may contain critical information needed to sequence events in other buffers.) Furthermore, synchronization when starting a trace is also generally useful, as that provides an opportunity to inspect the state of the world and write down details about it, simplifying validation. Stopping the world is a convenient way to get the attention of all Ps in the Go runtime.

However, there are problems with batching by P that make traces more complex than necessary.

The core of these problems lies with the GoSysExit event. This event requires special arrangements in both the runtime and when validating traces to ensure a consistent trace. The difficulty with this event is that it’s emitted by a goroutine that was blocked in a syscall and lost its P, and because it doesn’t have a P, it might race with the runtime enabling and disabling tracing. Therefore it needs to wait until it has a P to avoid that race. (Note: the tracing system does have a place to put events when there is no P available, but that doesn't help in this case. The tracing system uses the fact that it stops the world to synchronize with GoSysExit by preventing it from writing an event until the trace system can finish initialization.)

The problem with GoSysExit stems from a fundamental mismatch: Ms emit events, but only Ps are preemptible by the Go scheduler. This really extends to any situation where we’d like an M to emit an event when it doesn’t have a P, say for example when it goes to sleep after dropping its P and not finding any work in the scheduler, and it’s one reason why we don’t have any M-specific events at all today.

So, suppose we batch trace events by M instead.

In the case of GoSysExit, it would always be valid to write to a trace buffer, because any synchronization would have to happen on the M instead of the P, so no races with stopping the world. However, this also means the tracer can't simply stop the world, because stopping the world is built around stopping user Go code, which runs with a P. So, the tracer would have to use something else (more on this later).

Although GoSysExit is simpler, GoSysBlock becomes slightly more complex in the case where the P is retaken by sysmon. In the per-P world it could be written into the buffer of the taken P, so it didn’t need any synchronization. In the per-M world, it becomes an event that happens “at a distance” and so needs a sequence number from the syscalling goroutine for the trace consumer to establish a partial order. However, that goroutine could start running again at any time, and even acquire a P, GoStart, and enter another syscall. Luckily there’s a fairly simple fix here. Before the syscalling goroutine marks its P as _Psyscall, it pessimistically scribbles down its next sequence number into the P. If the P is retaken, the M that took it can safely use that sequence number to emit a GoSysBlock event. However, if the goroutine reacquires the same P, and it identifies that that P wasn’t in a syscall for some other goroutine (which the runtime is already capable of), it can safely use the same sequence number it scribbled down previously for itself. This scheme ensures that sequence numbers are never skipped, keeping validation simple, and doesn’t require any additional synchronization. Note that now GoSysBlock events might have a timestamp after GoSysExit, but this is actually OK since crucially the partial order is still correct. Since we already committed to fixing broken timestamps in general, this skew will be fixed up by that mechanism for presentation. (In practice, I expect the skew to be quite small, since it only happens if the retaking of a P races with a syscall exit.)
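
The hand-off can be sketched as follows. The types and function names are hypothetical, and the _Psyscall transition and actual event emission are elided; the point is only that the sequence number is chosen before the P can be retaken, so whichever side ends up emitting the event uses the same number.

package main

import "fmt"

// Hypothetical trace fields on g and p.
type gTrace struct {
    seq uint64 // per-goroutine sequence counter (last used number)
}

type pTrace struct {
    syscallSeq uint64 // scribbled down on syscall entry
}

// On syscall entry, before the P is marked _Psyscall, the goroutine
// pessimistically records the sequence number that an "at a distance"
// GoSysBlock event would need.
func traceSyscallEnter(gt *gTrace, pt *pTrace) {
    pt.syscallSeq = gt.seq + 1
}

// If sysmon retakes the P, it emits GoSysBlock with the scribbled number.
func traceRetakeSeq(pt *pTrace) uint64 {
    return pt.syscallSeq
}

// If the goroutine instead reacquires the same P (and the P wasn't used for
// another goroutine's syscall in the meantime), it consumes the same number
// itself, so no sequence number is ever skipped.
func traceSyscallExitSamePSeq(gt *gTrace, pt *pTrace) uint64 {
    gt.seq = pt.syscallSeq
    return gt.seq
}

func main() {
    gt, pt := &gTrace{seq: 41}, &pTrace{}
    traceSyscallEnter(gt, pt)
    // Exactly one of the following happens for a given syscall:
    fmt.Println(traceRetakeSeq(pt))               // 42, used by sysmon's GoSysBlock
    fmt.Println(traceSyscallExitSamePSeq(gt, pt)) // 42, used by the goroutine itself
}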

Another small complication is that ProcStop events can now also happen “at a distance.” As a result, both ProcStart and ProcStop require per-P sequence numbers to determine a partial ordering between them. More generally, this pair of events is now necessary to indicate the transfer of ownership of a P between Ms (the new owner is always the M whose buffer contains the ProcStart), something which the existing tracer does not need to do. It's a little extra complexity, but if we want to accurately represent threads in traces, an extra set of sequence numbers (on either Ms or Ps) is unavoidable.

Per-M batching might also incur a higher memory cost for tracing, since there are generally more Ms than Ps. I suspect this isn't actually too big of an issue since the number of Ms is usually close to the number of Ps. In the worst case, there may be as many Ms as Gs! However, if we also partition the trace, then the number of active buffers will only be proportional to the number of Ms that actually ran in a given time window, which is unlikely to be an issue. Still, if this does become a problem, a reasonable mitigation would be to simply shrink the size of each trace buffer compared to today. The overhead of the tracing slow path is vanishingly small, so doubling its frequency would likely not incur a meaningful compute cost.

Other than those three details, per-M batching should function identically to the current per-P batching: trace events may already be safely emitted without a P (modulo GoSysExit synchronization), so we're not losing anything else with the change.

Instead, however, what we gain is a deeper insight into thread execution. Thread information is currently present in execution traces, but difficult to interpret because it's always tied to P start and stop events. A switch to per-M batching forces traces to treat Ps and Ms orthogonally.

Given all this, I propose switching to per-M batching. The only remaining question to resolve is trace synchronization for Ms.

(As an additional minor consideration, I would also like to propose adding the batch length to the beginning of each batch. Currently (and in the foreseeable future), the trace consumer needs to iterate over the entire trace once to collect all the batches for ordering. We can speed up this process tremendously by allowing the consumer to just collect the information it needs.)
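
As a sketch of what this buys the consumer, the following assumes a batch header consisting of just the fields listed for the Batch event later in this document (a 4-byte partition ID, a 4-byte thread ID, and a 2-byte length), laid out little-endian; both the layout and the endianness are assumptions for illustration. With the length up front, all batches can be indexed without decoding a single event.

package main

import (
    "encoding/binary"
    "fmt"
)

// batchInfo records where one batch lives in the trace stream.
type batchInfo struct {
    off, length int
}

// indexBatches walks the trace by hopping over each batch using the length
// in its header, without decoding any events.
func indexBatches(trace []byte) []batchInfo {
    const headerSize = 10 // 4-byte partition ID + 4-byte thread ID + 2-byte length
    var batches []batchInfo
    for off := 0; off+headerSize <= len(trace); {
        n := int(binary.LittleEndian.Uint16(trace[off+8 : off+10]))
        batches = append(batches, batchInfo{off: off, length: n})
        off += headerSize + n
    }
    return batches
}

func main() {
    // Two empty batches, just to exercise the scanner.
    var trace []byte
    for i := 0; i < 2; i++ {
        hdr := make([]byte, 10)
        binary.LittleEndian.PutUint32(hdr[0:4], 1)         // partition ID
        binary.LittleEndian.PutUint32(hdr[4:8], uint32(i)) // thread ID
        binary.LittleEndian.PutUint16(hdr[8:10], 0)        // batch length: no event data
        trace = append(trace, hdr...)
    }
    fmt.Println(indexBatches(trace)) // [{0 0} {10 0}]
}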

M synchronization

The runtime already contains a mechanism to execute code on every M via signals, doAllThreadsSyscall. However, traces have different requirements than doAllThreadsSyscall, and I think we can exploit the specifics of these requirements to achieve a more lightweight alternative.

First, observe that getting the attention of every M for tracing is not strictly necessary: Ms that never use the tracer need not be mentioned in the trace at all. This observation allows us to delegate trace state initialization to the M itself, so we can synchronize with Ms at trace start simply by atomically setting a single global flag. If an M ever writes into the trace buffer, then it will initialize its state by emitting an event indicating what it was doing when tracing started. For example, if the first event the M is going to emit is GoBlock, then it will first emit an additional event indicating that the goroutine was running since the start of the trace (a hypothetical GoRunning event). Disabling tracing is slightly more complex, as the tracer needs to flush every M's buffer or identify that its buffer was never written to. However, this is fairly straightforward to do with atomic buffer swaps.

There's just one problem with all this: it may be that an M might be running and never emit an event. This case is critical to capturing system dynamics. As far as I can tell, there are three cases here: the M is running user code without emitting any trace events (e.g. a tight loop), the M is in a system call or C code the whole time, or the M is sysmon, a special thread that always runs without a P.

The first case is fairly easy to deal with: because it’s running with a P, we can just preempt it and establish an invariant that any preemption always emits an event if the M’s trace state has not been initialized.

The second case is a little more subtle, but luckily not very complex to implement. The tracer can identify whether the M has a G blocked in a syscall (or equivalently has called into C code) just before disabling tracing globally by checking readgstatus(m.curg) == _Gsyscall on each M's G and writing it down. If the tracer can see that the M never wrote to its buffer after it disables tracing, it can safely conclude that the M was still in a syscall at the moment when tracing was disabled, since otherwise it would have written to the buffer.

For the third case, we can play basically the same trick as the second case, but instead check to see if sysmon is blocked on a note. If it’s not, it’s running. If it doesn’t emit any events (e.g. an event for stopping on a note) and its buffer is empty, then the tracer can assume that it’s been running the whole time.

That’s a lot to consider, so I wrote up a sketch of what this would look like in code in this document’s appendix.

A big change with this synchronization mechanism is that the tracer no longer obtains the state of all goroutines when a trace starts, only those that run or execute syscalls. The remedy to this is to simply add it back in. Since the only cases not covered are goroutines in _Gwaiting or _Grunnable, the tracer can set the _Gscan bit on the status to ensure the goroutine can’t transition out. At that point, it collects a stack trace and writes out a status event for the goroutine in a buffer that isn’t attached to any particular M. To avoid ABA problems with a goroutine transitioning out of _Gwaiting or _Grunnable and then back in, we also need an atomically-accessed flag for each goroutine that indicates whether an event has been written for that goroutine yet.

The result of all this synchronization is that the tracer only lightly perturbs application execution to start and stop traces. This property is not just great for downstream users' latencies, but will be critical for partitioning, since the tracer will want to synchronize all Ms more frequently than before.

Partitioning

The structure of an execution trace binary format limits what you can do with it.

In particular, to view just a small section of the trace, the entire trace needs to be parsed, since a batch containing an early event may appear at the end of the trace. This can happen if, for example, a P that stays relatively idle throughout the trace wakes up once at the beginning and once at the end, but never writes enough to the trace buffer to flush it.

To remedy this, I propose restructuring traces into a stream of self-contained partitions. More specifically, a partition is a series of trace batches along with a few additional tables (like tracebacks and strings) that together could represent a full trace. In practice this means each partition boundary is a global buffer flush. The size of each partition will be roughly constant: new partition boundaries will be created when either the partition reaches some threshold size or some maximum amount of wall-clock time has elapsed. The exact size threshold and wall-clock limit will be determined empirically, though my general expectation for the size threshold is around 16 MiB.
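
A minimal sketch of that boundary policy follows; the size threshold comes from the expectation above, while the wall-clock limit here is purely illustrative, since the real values would be chosen empirically.

package main

import (
    "fmt"
    "time"
)

const (
    maxPartitionBytes = 16 << 20        // size threshold (~16 MiB, per the expectation above)
    maxPartitionAge   = 5 * time.Second // wall-clock limit (illustrative only)
)

// shouldCutPartition reports whether the tracer should flush all buffers
// and start a new partition.
func shouldCutPartition(bytesWritten int, started time.Time) bool {
    return bytesWritten >= maxPartitionBytes || time.Since(started) >= maxPartitionAge
}

func main() {
    fmt.Println(shouldCutPartition(20<<20, time.Now()))                      // true: over the size threshold
    fmt.Println(shouldCutPartition(1<<20, time.Now().Add(-10*time.Second))) // true: partition is too old
}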

This global buffer flush can be implemented as an extension to the aforementioned M synchronization design by replacing the enabled flag with a partition generation counter and doubling up a few fields to be indexed by this counter. The appendix contains a sketch of the extended design. (Note that for partitions we don't need to snapshot the state of all goroutines, so we can skip that step.)

Partitions also present an interesting opportunity for making large traces easier to analyze and visualize. Because traces are partitioned, it’s much more feasible to look at just a small section of a trace at a time, which could make visualization, for example, scale to much larger traces. To aid tooling that provides this kind of interactive debugging experience, I’d like to propose adding a partition table at the end of the trace which consumers can use to identify where partitions are in the trace.

Event cleanup

Since we're modifying the trace anyway, this is a good opportunity to clean up the event types, making them simpler to implement, simpler to parse and understand, and more uniform overall.

The four biggest changes I would like to propose are

  1. A uniform naming scheme,
  2. The addition of thread events (to support per-M batching),
  3. Simpler syscall events, including giving every syscall its own time range, and
  4. Separation of the reasons that goroutines might stop or block from the event type.

For the naming scheme, I propose the following rules (which are not strictly enforced, only guidelines for documentation):

  • Scheduling resources, such as threads and goroutines, have events related to them prefixed with the related resource (i.e. “Thread,” “Go,” or “Proc”).
    • Scheduling resources have “Create” and “Destroy” events.
    • Scheduling resources have generic “Status” events used for indicating their state at the start of a partition (replaces GoInSyscall and GoWaiting).
    • Scheduling resources have “Start” and “Stop” events to indicate when that resource is in use. The connection between resources is understood through context today.
    • Goroutines also have “Block” events which are like “Stop”, but require an “Unblock” before the goroutine can “Start” again.
    • Note: Ps are exempt since they aren’t a true resource, more like a best-effort semaphore in practice. There’s only “Start” and “Stop” for them.
  • Events representing ranges of time come in pairs with the start event having the “Begin” suffix and the end event having the “End” suffix.
  • Events have a prefix corresponding to the deepest resource they're associated with.

From this naming scheme, the thread events are clearly defined: ThreadCreate, ThreadDestroy, ThreadStatus, ThreadStart, ThreadStop, ThreadBlock, and ThreadUnblock. Threads also need their own sequence numbers because ThreadUnblock can happen “at a distance.”

Following the naming scheme, I propose representing all syscalls as a pair of events: GoSyscallBegin and GoSyscallEnd. Note that this means all syscalls will have a complete range, unlike today. However, this will likely come at a cost: an additional event for most syscalls. In practice, this is probably OK, since in the post-Meltdown world on Linux SYSCALL on its own costs ~250 nanoseconds and most trace events should be much cheaper than that. Adding a second event is unlikely to have a significant effect on performance (especially since GoSyscallEnd doesn’t require a stack trace), and will provide a lot more useful information about syscalls that is missing today if sysmon doesn’t take back the P. Lastly, I propose renaming GoSysBlock to GoSyscallBlock for consistency.

And lastly, I propose moving the reasons why a goroutine resource might stop or block into an argument. This choice is useful for backwards compatibility, because the most likely change to the trace format at any given time will be the addition of more detail, for example in the form of a new reason to block.

For the sake of generality, and because it’ll be useful in self-description, I propose generalizing these “reasons” as event “subtypes.” The subtype is an explicit argument that is always 1 byte in size and immediately follows the event type, if the event has subtypes. Subtypes do not modify the structure of an event; they’re purely descriptive.

Beyond these big four, there are a myriad of additional tweaks I would like to propose:

  • Remove the FutileWakeup event, which is no longer used (even in the current implementation).
  • Remove the TimerGoroutine event, which is no longer used (even in the current implementation).
  • Rename GoMaxProcs to ProcsChange.
    • This needs to be written out at trace startup, as before.
  • Redefine GoSleep as a GoBlock and GoUnblock pair.
  • Remove the starting stack argument for GoCreate.
    • Currently the GoCreate event includes a single PC denoting the start location for a new goroutine. However, this information is redundant with the first GoStart event from that goroutine.
  • Break out GoStartLabel into a more generic GoLabel event that applies a label to a goroutine the first time it emits an event in a partition.
  • Change the suffixes of pairs of events representing some activity to be Begin and End.
  • Ensure all GC-related events contain the string GC.
  • Add sequence numbers to events where necessary:
    • GoStart still needs a sequence number.
    • GoUnblock still needs a sequence number.
    • GoSyscallBlock needs a sequence number, as explained above.
    • Eliminate the {GoStart,GoUnblock,Go}Local events for uniformity.
    • ThreadStart and ThreadUnblock need sequence numbers.
  • Redefine STW event “kinds” as an event type extension.
  • Redefine Status event “statuses” as an event type extension.
  • Because traces are no longer bound by stopping the world, and can indeed now start during the GC mark phase, we need a way to identify that a GC mark is currently in progress. Add the GCMarkInProgress event.
  • Eliminate inline strings and make all strings referenced from a table for uniformity.
    • Because of partitions, the size of the string table is unlikely to become large in practice.
    • This means all events can have a fixed size in terms of arguments.

Finally, a number of these tweaks will likely make traces bigger, mostly due to additional information and the elimination of some optimizations for uniformity. However, there are two other places where we can save space in the current varint-based encoding.

  1. Add Timestamp events which are occasionally emitted to ensure timestamp deltas in events are small.
  2. Compress goroutine IDs in each partition by maintaining a lightweight mapping. Each time a goroutine would emit an event that needs the current goroutine ID, it’ll check a g-local cache. This cache will contain the goroutine’s alias ID for the current partition and the partition number. If the goroutine does not have an alias for the current partition, it increments a global counter to acquire one and writes down the mapping of that counter to its full goroutine ID in a global table. This table is then written out at the end of each partition.
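
Item 2 might look something like the following sketch. The names are hypothetical, and resetting the counter and table at partition boundaries is elided.

package main

import (
    "fmt"
    "sync"
)

// gTraceCache is the hypothetical g-local cache: the goroutine's alias for
// the current partition, and which partition that alias belongs to.
type gTraceCache struct {
    partition uint32
    alias     uint64
}

// traceAliasTable is the hypothetical global table mapping aliases back to
// full goroutine IDs; it is written out at the end of each partition.
type traceAliasTable struct {
    mu      sync.Mutex
    next    uint64
    realIDs map[uint64]uint64 // alias -> full goroutine ID
}

// aliasFor returns the compact ID to write into events for this goroutine
// in the given partition, allocating one on the first use per partition.
func (t *traceAliasTable) aliasFor(cache *gTraceCache, goid uint64, partition uint32) uint64 {
    if cache.partition == partition && cache.alias != 0 {
        return cache.alias // fast path: cache hit, no global work
    }
    t.mu.Lock()
    t.next++
    alias := t.next
    t.realIDs[alias] = goid
    t.mu.Unlock()
    cache.partition, cache.alias = partition, alias
    return alias
}

func main() {
    table := &traceAliasTable{realIDs: make(map[uint64]uint64)}
    var cache gTraceCache
    fmt.Println(table.aliasFor(&cache, 123456789, 1)) // 1: first event needing an ID in partition 1
    fmt.Println(table.aliasFor(&cache, 123456789, 1)) // 1: g-local cache hit
    fmt.Println(table.aliasFor(&cache, 123456789, 2)) // 2: new partition, new alias
}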

After these changes, the list of events looks like this:

PartitionTimestamp       [args: partition id (4 bytes), timestamp (8 bytes)]
Batch                    [args: partition id (4 bytes), thread id (4 bytes), size (2 bytes)]

# All integer arguments below are unsigned and use a variable-sized encoding. (LEB128)
# All arguments are integers, except for one case of inline strings.
# All events implicitly have a timestamp delta as their first argument.
# The argument count is encoded in the top 2 bits of the event type header.
# If the top 2 bits are 0b11, then there's an implicit first argument indicating
# the event length in the number of arguments.

# Control events.

ThreadStatus             [args: status          ]
ThreadCreate             [args: thread id       ]
ThreadDestroy            [args:                 ]
ThreadStart              [args: m seq           ]
ThreadStop               [args: subtype         ]
ThreadBlock              [args: subtype         ]
ThreadUnblock            [args: thread id, m seq]

ProcStart                [args: p id, p seq]
ProcStop                 [args: p seq      ]

GoStatus                 [args: subtype , goid              ]
GoCreate                 [args: goid    , stack id          ]
GoDestroy                [args:                             ]
GoStart                  [args: goid    , g seq             ]
GoStop                   [args: subtype , stack id          ]
GoBlock                  [args: subtype , stack id          ]
GoUnblock                [args: goid    , g seq   , stack id]
GoSyscallBegin           [args: stack id                    ]
GoSyscallBlock           [args: goid    , g seq             ]
GoSyscallEnd             [args:                             ]

# Descriptive events.

ProcsChange              [args: procs]

GoLabel                  [args: label (string id)]

GoGCSweepBegin           [args: stack id]
GoGCSweepEnd             [args:         ]
GoGCMarkAssistBegin      [args: stack id]
GoGCMarkAssistEnd        [args:         ]

STWBegin                 [args: subtype, stack id]
STWEnd                   [args:                  ]

GCMarkBegin              [args: stack id]
GCMarkInProgress         [args:         ]
GCMarkEnd                [args:         ]

GCHeapLive               [args: heap live]
GCHeapGoal               [args: heap goal]

UserTaskBegin            [args: task id, name (string id), parent task id, stack id]
UserTaskEnd              [args: task id, stack id]
UserRegionBegin          [args: task id, name (string id), stack id]
UserRegionEnd            [args: task id, name (string id), stack id]
UserLog                  [args: task id, key  (string id), value (inline), stack id]

CPUSample                [args: timestamp, m id, p id, goid, stack id]
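
The encoding comments at the top of the list leave the decoding mechanics implicit, so here is a sketch of decoding a single event under that scheme. It assumes the event type occupies the low 6 bits of the header byte, relies on unsigned LEB128 being the same encoding binary.ReadUvarint understands, and leaves open whether the argument count includes the implicit timestamp delta.

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
)

// decodeEvent decodes one event: a header byte whose top 2 bits carry the
// argument count (0b11 meaning the count is given by an extra leading
// argument), followed by that many LEB128-encoded arguments.
func decodeEvent(r *bytes.Reader) (evType uint8, args []uint64, err error) {
    hdr, err := r.ReadByte()
    if err != nil {
        return 0, nil, err
    }
    evType = hdr & 0x3f
    n := uint64(hdr >> 6)
    if n == 3 {
        // Variable-length event: the first argument is the argument count.
        if n, err = binary.ReadUvarint(r); err != nil {
            return 0, nil, err
        }
    }
    for i := uint64(0); i < n; i++ {
        arg, err := binary.ReadUvarint(r)
        if err != nil {
            return 0, nil, err
        }
        args = append(args, arg)
    }
    return evType, args, nil
}

func main() {
    // A made-up event: type 0x05 with two arguments (a timestamp delta of
    // 40 ticks, then one more argument), encoded by hand.
    buf := []byte{0x05 | 2<<6}
    buf = binary.AppendUvarint(buf, 40)
    buf = binary.AppendUvarint(buf, 7)
    fmt.Println(decodeEvent(bytes.NewReader(buf))) // 5 [40 7] <nil>
}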

Note that the event types are split into “control” events and “descriptive” events. The former set is the minimal set of events necessary to understand program execution. The latter set augments program execution with additional details. The “control” events may be thought of as edges in state machines for each thread and goroutine, namely:

State machine for threads:

Tracer state machine for threads

State machine for goroutines:

Tracer state machine for goroutines

Note: the state machine for Ps is not shown because it's trivial. Each P is either stopped or started, and the corresponding event moves them into that state. ProcsChange has no interesting interaction with these states because all Ps must necessarily be stopped when that happens, so a P might just never start again. In effect, ProcsChange is purely descriptive. (By this logic, stop-the-world events are also purely descriptive.)

Self-description

Changing trace internals today results in toil that someone has to deal with.

Because the trace internals are totally unexposed today and the Go toolchain provides the only tooling, it's straightforward to just keep that tooling working. However, the trace parser does still maintain compatibility with very old traces, leading to a lot of cruft.

Lastly, part of the point of all this is to work toward trace data that’s accessible to our users. That probably means documenting the wire format (for analysis from outside of Go) and providing a parsing library. There’s not all that much we can do about the former, but we can reduce the maintenance burden of the latter.

Without going into too many details on what a parsing library might look like, I think it’s clear that we’d want to expose some types of trace events as “core” trace events that we guarantee will continue to exist (e.g. goroutine-related things, thread-related things, tasks, regions, logs, etc.), and others, such as mark assists, as implementation-dependent.

In an ideal world, we could add and remove such events without having to meaningfully change the parser or break downstream tools. We could make it possible for trace consumers who don't fully understand an event type to still say something interesting about it.

For this purpose, I’d like to propose adding a self-description table to the start of each trace. This table can be pre-serialized and effectively just copied out of the runtime, so it doesn’t add much overhead. As for what goes into the table, I propose the following:

  • A minor version for the trace.
    • Today, traces are versioned by the Go version where the trace format changed. I propose retaining this version in the header, but also following it with a minor version in the self-description table. Minor versions are backwards-compatible; they simply denote that the interpretation of the trace is slightly different.
    • How the timestamp for each event is obtained is part of the major version, to avoid adding constraints on the most populous kind of information in a trace.
  • A mapping of event types (the bit patterns that are used in decoding) to their lengths, or instructions on how to compute the size of an event type from its arguments. This allows consumers to skip unfamiliar events.
  • A classification system for event types.
    • Events are first classified as either “control” events or “descriptive” events.
      • “Control” events are fundamental to making sense of execution, and represent the minimal set of events necessary to reconstruct execution.
      • “Description” events are effectively just annotations of things goroutines do. In general they're bound to some scope (goroutine, P, M, or global) and either represent a range of time or an instant.
    • “Control” events are part of the trace major version (i.e. Go version). Parsers are expected to know what they mean and how to interpret them.
    • “Description” events are further broken down into two categories:
      • Ranges, which represent a range of time.
      • Instants, which represent an event at a single point in time.
    • “Description” event types are also annotated with their scope so that trace parsers are able to validate them.
    • “Description” event types get a full description of their arguments.
  • A human-friendly identifier and description strings for each event type.
  • A human-friendly identifier and description strings for each event subtype.
  • A human-friendly identifier and description strings for description string arguments.

With all this information, it's possible for tooling to say a lot of interesting things about these events without really knowing what they mean. Going back to the mark assist example, tooling could tell the user about the mark assist latency distribution and total mark assist time.

The self-description table will be used by our own parser to simplify validation and rely less on bespoke semantics for known events. This is particularly important to ensure the quality of the self-description, so that it doesn't break without us noticing.

As for the encoding of this self-description section, I'd like to propose the following straightforward inline encoding (no alignment enforced).

Minor version                    (uint8)
Self-description size            (uint32)
Number of event types            (uint8)
[for each event type:]
  Event type bits                (uint8)
  Event self-description size    (uint32)
  Event name length              (uint16)
  Event name                     ([0...65535]byte) (UTF-8)
  Event desc length              (uint16)
  Event desc                     ([0...65535]byte) (UTF-8)
  Number of event subtypes       (uint8)
  [for each event subtype:]
    Event subtype bits           (uint8)
    Event subtype name length    (uint16)
    Event subtype name           ([0...65535]byte) (UTF-8)
    Event subtype desc length    (uint16)
    Event subtype desc           ([0...65535]byte) (UTF-8)
  Is "description" event?        (uint8)
  [if "description" event:]
    Event scope                  (uint8) # one of: G, M, P, global
    Event class                  (uint8) # (instant/range begin/range end)
    Event type link              (uint8) # (for "range" event classes)
    Number of event arguments    (uint8) # Excludes subtype
    [for each event argument:]
      Event argument type        (uint8) # one of: uint, string, string (id), stack
      Event argument name length (uint16)
      Event argument name        ([0...65535]byte) (UTF-8)
      Event argument desc length (uint16)
      Event argument desc        ([0...65535]byte) (UTF-8)
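
To illustrate how a consumer might use this table, here is a partial sketch that decodes only the per-event names and descriptions and skips everything else using the per-event size field. It assumes little-endian multi-byte fields and that the per-event size counts the bytes following that size field; both are conventions chosen for the sketch, not settled by this design.

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
    "io"
)

// eventDesc holds the parts of an event's self-description this sketch decodes.
type eventDesc struct {
    Type       uint8
    Name, Desc string
}

// readSelfDescription decodes the table header and each event's name and
// description, then steps over the rest of the entry (subtypes,
// classification, argument descriptions) using the per-event size.
func readSelfDescription(r *bytes.Reader) (minor uint8, events []eventDesc, err error) {
    le := binary.LittleEndian
    readString := func() (string, error) {
        var n uint16
        if err := binary.Read(r, le, &n); err != nil {
            return "", err
        }
        buf := make([]byte, n)
        _, err := io.ReadFull(r, buf)
        return string(buf), err
    }
    var tableSize uint32
    var numEvents uint8
    if err = binary.Read(r, le, &minor); err != nil {
        return
    }
    if err = binary.Read(r, le, &tableSize); err != nil {
        return
    }
    if err = binary.Read(r, le, &numEvents); err != nil {
        return
    }
    for i := 0; i < int(numEvents); i++ {
        var ev eventDesc
        var size uint32
        if err = binary.Read(r, le, &ev.Type); err != nil {
            return
        }
        if err = binary.Read(r, le, &size); err != nil {
            return
        }
        before := r.Len()
        if ev.Name, err = readString(); err != nil {
            return
        }
        if ev.Desc, err = readString(); err != nil {
            return
        }
        // Skip whatever this consumer doesn't understand.
        if _, err = r.Seek(int64(size)-int64(before-r.Len()), io.SeekCurrent); err != nil {
            return
        }
        events = append(events, ev)
    }
    return
}

func main() {
    // Build a table describing a single event type by hand.
    entry := new(bytes.Buffer)
    for _, s := range []string{"GoBlock", "Goroutine blocked."} { // name, description
        binary.Write(entry, binary.LittleEndian, uint16(len(s)))
        entry.WriteString(s)
    }
    entry.WriteByte(0) // number of event subtypes
    entry.WriteByte(0) // not a "description" event

    table := new(bytes.Buffer)
    table.WriteByte(1)                                            // minor version
    binary.Write(table, binary.LittleEndian, uint32(0))           // self-description size (ignored here)
    table.WriteByte(1)                                            // number of event types
    table.WriteByte(0x05)                                         // event type bits
    binary.Write(table, binary.LittleEndian, uint32(entry.Len())) // event self-description size
    table.Write(entry.Bytes())

    fmt.Println(readSelfDescription(bytes.NewReader(table.Bytes()))) // 1 [{5 GoBlock Goroutine blocked.}] <nil>
}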

Implementation

As discussed in the overview, this design is intended to be implemented in parts. Some parts are easy to integrate into the existing tracer, such as the switch to nanotime and the traceback-related improvements. Others, such as per-M batching, partitioning, and the change to the event encoding, are harder.

While in theory all could be implemented separately as an evolution of the existing tracer, it’s simpler and safer to hide the changes behind a feature flag, like a GOEXPERIMENT flag, at least to start with. Especially for per-M batching and partitioning, it’s going to be quite complicated to try to branch in all the right spots in the existing tracer.

Instead, we'll basically have two trace implementations temporarily living side-by-side, with a mostly shared interface. Where they differ (and there are really only a few spots), the runtime can explicitly check for the GOEXPERIMENT flag. This also gives us an opportunity to polish the new trace implementation.

The implementation will also require updating the trace parser to make tests work. I suspect this will result in what is basically a separate trace parser that's invoked when the new header version is identified. Once we have a basic trace parser that exports the same API, we can work on improving the internal trace parser API to take advantage of the new features described in this document.

Use-cases

This document presents a large change to the existing tracer under the banner of “traces at scale” which is a bit abstract. To better understand what that means, let's explore a few concrete use-cases that this design enables.

Flight recording

Because the trace in this design is partitioned, it's now possible for the runtime to carry around the most recent trace partition. The runtime could then expose this partition to the application as a snapshot of what the application recently did. This is useful for debugging at scale, because it enables the application to collect data exactly when something goes wrong. For instance:

  • A server's health check fails. The server takes a snapshot of recent execution and puts it into storage before exiting.
  • Handling a request exceeds a certain latency threshold. The server takes a snapshot of the most recent execution state and uploads it to storage for future inspection.
  • A program crashes when executing in a deployed environment and leaves behind a core dump. That core dump contains the recent execution state.

Here's a quick design sketch:

package trace

// FlightRecorder represents a flight recording configuration.
type FlightRecorder struct { ... }

// NewFlightRecorder creates a new flight recording configuration.
func NewFlightRecorder() *FlightRecorder

// Start begins process-wide flight recording. Only one FlightRecorder
// may be started at once. If another FlightRecorder is started, this
// function will return an error.
//
// This function can be used concurrently with Start and Stop.
func (fr *FlightRecorder) Start() error

// TakeSnapshot collects information about the execution of this program
// from the last few seconds and writes it to w. The FlightRecorder must
// have been started to take a snapshot.
func (fr *FlightRecorder) TakeSnapshot(w io.Writer) error

// Stop ends process-wide flight recording. Stop must be called on the
// same FlightRecorder that started recording.
func (fr *FlightRecorder) Stop() error

  • The runtime accumulates trace buffers as usual, but when it makes a partition, it puts the trace data aside as it starts a new one.
  • Once the new partition is finished, the previous is discarded (buffers are reused).
  • At any point, the application can request a snapshot of the current trace state.
    • The runtime immediately creates a partition from whatever data it has, puts that together with the previously-accumulated partition, and hands it off to the application for reading.
    • The runtime then continues accumulating trace data in a new partition while the application reads the trace data.
  • The application is almost always guaranteed two partitions, with one being as large as a partition would be in a regular trace (the second one is likely to be smaller).
  • Support for flight recording in the /debug/pprof/trace endpoint.
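
Assuming the sketched API landed in a package like runtime/trace (the import path is a guess, and the API above is itself only a sketch), a server applying the latency-threshold idea above might look like this.

package main

import (
    "log"
    "net/http"
    "os"
    "runtime/trace"
    "time"
)

func main() {
    // Set up flight recording for the life of the process.
    fr := trace.NewFlightRecorder()
    if err := fr.Start(); err != nil {
        log.Fatal(err)
    }
    defer fr.Stop()

    http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        doWork(w, r)
        if time.Since(start) > 500*time.Millisecond {
            // The request blew its latency budget: snapshot the last
            // couple of partitions for later inspection.
            f, err := os.CreateTemp("", "flight-*.trace")
            if err != nil {
                return
            }
            defer f.Close()
            if err := fr.TakeSnapshot(f); err != nil {
                log.Print(err)
            }
        }
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}

func doWork(w http.ResponseWriter, r *http.Request) {
    // ... application logic ...
}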

Fleetwide trace collection

Today some power users collect traces at scale by setting a very modest sampling rate, e.g. 1 second out of every 1000 seconds that a service is up. This has resulted in collecting a tremendous amount of useful data about execution at scale, but this use-case is currently severely limited by trace performance properties.

With this design, it should be reasonable to increase the sampling rate by an order of magnitude and collect much larger traces for offline analysis.

Online analysis

Because the new design allows for partitions to be streamed and the trace encoding is much faster to process, it's conceivable that a service could process its own trace continuously and aggregate and filter only what it needs. This online processing avoids the need to store traces for future inspection.

The kinds of processing I imagine for this includes:

  • Detailed task latency breakdowns over long time horizons.
  • CPU utilization estimates for different task types.
  • Semantic processing and aggregation of user log events.
  • Fully customizable flight recording (at a price).

Linux sched trace association

Tooling can be constructed that interleaves Go execution traces with traces produced via:

perf sched record --clockid CLOCK_MONOTONIC <command>

Prior art

JFR

JFR or "Java Flight Recorder" is an execution tracing system for Java applications. JFR is highly customizable with a sophisticated wire format. It supports very detailed configuration of events via an XML configuration file, including the ability to enable/disable stack traces per event and set a latency threshold an event needs to exceed in order to be sampled (e.g. “only sample file reads if they exceed 20ms”). It also supports custom user-defined binary-encoded events.

By default, JFR offers a low overhead configuration (<1%) for “continuous tracing” and slightly higher overhead configuration (1-2%) for more detailed “profile tracing.” These configurations are just default configuration files that ship with the JVM.

Continuous tracing is special in that it accumulates data into an internal global ring buffer which may be dumped at any time. Notably this data cannot be recovered from a crashed VM, though it can be recovered in a wide variety of other cases. Continuous tracing is disabled by default.

The JFR encoding scheme is quite complex, but achieves low overhead partially through varints. JFR traces are also partitioned, enabling scalable analysis.

In many ways the existing Go execution tracer looks a lot like JFR, just without partitioning, and with a simpler encoding.

KUTrace

KUTrace is a system-wide execution tracing system for Linux. It achieves low overhead, high resolution, and staggering simplicity by cleverly choosing to trace on each kernel transition (the “goldilocks” point in the design space, as the author puts it).

It uses a fairly simple 8-byte-word-based encoding scheme to keep trace writing fast, and exploits the very common case of a system call returning quickly (>90%) to pack two events into each word when possible.

This project's low-overhead tracing inspired this effort, but in the end we didn't take too many of its insights.

CTF

CTF or "Common Trace Format" is more of a meta tracing system. It defines a binary format, a metadata format (to describe that binary format), as well as a description language. In essence, it contains all the building blocks of a trace format without defining one for a specific use-case.

Traces in CTF are defined as a series of streams of events. Events in a stream are grouped into self-contained packets. CTF contains many of the same concepts that we define here (packets are batches; the streams described in this document are per-thread, etc.), though it largely leaves interpretation of the trace data up to the one defining a CTF-based trace format.

Future work

Event encoding

The trace format relies heavily on LEB128-encoded integers. While this choice makes the trace quite compact (achieving 4-6 bytes/event, with some tricks to keep each encoded integer relatively small), it comes at a cost to decoding speed since LEB128 is well-known for being relatively slow to decode compared to similar integer encodings. (Note: this is only a hunch at present; this needs to be measured.) (The cost at encoding time is dwarfed by other costs, like tracebacks, so it's not on the critical path for this redesign.)

This section proposes a possible future trace encoding that is simpler, but without any change would produce a much larger trace. It then proposes two possible methods of compressing the trace from there.

Encoding format

To start with, let's redefine the format as a series of 4-byte words of data.

Each event has a single header word that includes the event type (8 bits), space for a 4-bit event reason, and the timestamp delta (20 bits). Note that every event requires a timestamp. At a granularity of 64 nanoseconds, a 20-bit delta gives us a range of roughly 67 milliseconds, which is enough for the vast majority of events. In the case where we can’t fit the delta in those 20 bits, we’ll emit a new “timestamp” event which is 2 words wide, containing the timestamp delta from the start of the partition. This gives us a 7-byte delta which should be plenty for any trace.
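
A sketch of packing that header word, with the exact bit positions chosen arbitrarily for illustration:

package main

import "fmt"

// packHeader packs the header word described above: 8 bits of event type,
// 4 bits of reason, and a 20-bit timestamp delta in 64ns units.
func packHeader(evType, reason uint8, tsDelta uint32) (uint32, bool) {
    if tsDelta >= 1<<20 {
        // The delta doesn't fit; the writer must first emit the 2-word
        // timestamp event and then retry with a small delta.
        return 0, false
    }
    return uint32(evType)<<24 | uint32(reason&0xf)<<20 | tsDelta, true
}

func main() {
    word, ok := packHeader(0x12, 0x3, 1000)
    fmt.Printf("%#08x %v\n", word, ok) // 0x123003e8 true
    _, ok = packHeader(0x12, 0x3, 1<<21)
    fmt.Println(ok) // false: needs a timestamp event first
}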

Each event's header is then followed by a number of 4-byte arguments. The minimum number of arguments is fixed per the event type, and will be made evident in self-description. Self-description will also indicate (1) whether there are a variable number of additional arguments, (2) which argument indicates how many there are, and (3) whether any arguments are byte arrays (integer followed by data). Byte array data lengths are always rounded up to 4 bytes. Possible arguments include, for example, per-goroutine sequence numbers, goroutine IDs, thread IDs, P IDs, stack IDs, and string IDs. Within a partition, most of these trivially fit in 32 bits:

  • Per-goroutine sequence numbers easily fit provided each partition causes a global sequence number reset, which is straightforward to arrange.
  • Thread IDs come from the OS as uint32 on all platforms we support.
  • P IDs trivially fit within 32 bits.
  • Stack IDs are local to a partition and as a result trivially fit in 32-bits. Assuming a partition is O(MiB) in size (so, O(millions) of events), even if each event has a unique stack ID, it'll fit.
  • String IDs follow the same logic as stack IDs.
  • Goroutine IDs, when compressed as described in the event cleanup section, will easily fit in 32 bits.

The full range of possible encodings for an event can be summarized as follows:

This format change, before any compression, will result in a 2-3x increase in encoded trace size. Plugging the size of each proposed event above into the breakdown of a trace such as this 288 MiB one produced by felixge@ reveals an increase in encoded trace size of a little over 2x. The aforementioned 288 MiB trace grows to around 640 MiB in size, not including additional timestamp events, the batch header in this scheme, or any tables at the end of each partition.

This increase in trace size is likely unacceptable since there are several cases where holding the trace in memory is desirable. Luckily, the data in this format is very compressible.

This design notably doesn’t handle inline strings. The easy thing to do would be to always put them in the string table, but then they’re held onto until at least the end of a partition, which isn't great for memory use in the face of many non-cacheable strings. This design would have to be extended to support inline strings, perhaps rounding up their length to a multiple of 4 bytes.

On-line integer compression

One possible route from here is to simply choose a different integer compression scheme. There exist many fast block-based integer compression schemes, ranging from GVE and PrefixVarint to SIMD-friendly bit-packing schemes.

Inline strings would likely have to be excepted from this compression scheme, and would likely need to be sent out-of-band.

On-demand user compression

Another alternative compression scheme is to have none, and instead expect the trace consumer to perform their own compression, if they want the trace to be compressed. This certainly simplifies the runtime implementation, and would give trace consumers a choice between trace encoding speed and trace size. For a sufficiently fast compression scheme (perhaps LZ4), it's possible this could rival integer compression in encoding overhead.
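
For illustration, here is what consumer-side compression looks like with today's trace API, with gzip standing in for whatever faster scheme (LZ4, zstd, etc.) a consumer might actually prefer.

package main

import (
    "compress/gzip"
    "log"
    "os"
    "runtime/trace"
)

func main() {
    f, err := os.Create("trace.out.gz")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // The consumer picks its own trade-off between encoding speed and size
    // simply by wrapping the destination writer.
    zw := gzip.NewWriter(f)
    defer zw.Close()

    if err := trace.Start(zw); err != nil {
        log.Fatal(err)
    }
    defer trace.Stop()

    // ... application work ...
}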

More investigation on this front is required.

Becoming a CPU profile superset

Traces already contain CPU profile samples, but are missing some information compared to what Go CPU profile pprof protos contain. We should consider bridging that gap, such that it is straightforward to extract a CPU profile from an execution trace.

This might look like just emitting another section to the trace that contains basically the entire pprof proto. This can likely be added as a footer to the trace.

Appendix

M synchronization sketch

The sketch below elides how for each m would work exactly, but generally speaking it means iterating over allm while synchronizing with new M creation and M exit. It also elides the sysmon case, because it's not more interesting, important, or illustrative than the syscall case.

Note that the sketch back-dates any “in progress” events to the start of the trace. This is imprecise, but it's also impossible to be any more precise in this design.

Note also that this sketch synchronizes with a STW by preventing one from happening in critical sections. This is cheap, and unlikely to cause any latency issues because those sections are quite short. But, it is now possible for a trace to start while a GC is in the mark phase, so this will need to be accounted for in the binary format with another trace event.

var trace struct {
    ...
    enabled   atomic.Bool
    ...
}

const (
    traceBufNew uintptr = iota
    traceBufWriting
    traceBufDone
)

type m struct {
    ...
    traceBuf  atomic.Uintptr // *traceBuf
    ...
}
...

// Trace start, always with a P on a regular goroutine.
// Only one goroutine can be executing this code at a time.
mp := acquirem() // Prevent preemption of self and STW.
trace.enabled.Store(true)
forEachP(func() {}) // Make sure we catch all running Ms before continuing.
releasem(mp)

// Start trace reader...

...

// Trace event emission, may run without a P.
if trace.enabled.Load() {
    buf := m.traceBuf.Swap(traceBufWriting)
    if buf == traceBufDone {
        m.traceBuf.Store(traceBufNew)
        return
    }
    if buf == traceBufNew {
    	buf = newTraceBuf(...)
        // Write events indicating that a goroutine was
        // running or in a syscall or any other time range,
        // if we're about to write an event that ends it.
        // Back-date these to trace start which is imprecise,
        // but the best we can do.
        ...
    }
    ...
    if not enough space for event {
        traceFullQueue(buf)
        buf = newTraceBuf(...)
    }
    ...
    m.traceBuf.Store(buf)
}

...

// Trace end, always with a P on a regular goroutine.
// Only one goroutine can be executing this code at a time.
mp := acquirem() // Prevent preemption of self and STW.

mInSyscall := make(map[*m]struct{})
for each m {
    if readgstatus(m.curg) == _Gsyscall {
        mInSyscall[m] = struct{}{}
    }
}

trace.enabled.Store(false)

// Allow for STW now. Tracing is off globally, and the
// buffers aren't going anywhere.
releasem(mp)

mToFlush := make([]*m, 0, len(allm))
for each m {
    mToFlush = append(mToFlush, m)
}

for len(mToFlush) > 0 {
    for i, m := range mToFlush {
        buf := m.traceBuf.Load()
        if buf == traceBufNew {
            if _, ok := mInSyscall[m]; ok {
                // Create a buffer and write out its state, back-dating
                // any "in progress" events to the start of the trace.
                ...
                traceFullQueue(buf)
            }
            // Swap remove.
            mToFlush[i] = mToFlush[len(mToFlush)-1]
            mToFlush = mToFlush[:len(mToFlush)-1]
            continue
        }
        if buf == traceBufWriting {
            continue
        }
        if !m.traceBuf.CompareAndSwap(buf, traceBufDone) {
            continue
        }
        traceFullQueue(buf)

        // Swap remove.
        mToFlush[i] = mToFlush[len(mToFlush)-1]
        mToFlush = mToFlush[:len(mToFlush)-1]
    }
    osyield()
}

// Write out end of trace tables.

// Notify trace reader that all buffers are flushed.

// Reset traceBuf state.
for each m {
    m.traceBuf.Store(traceBufNew)
}

// Wait for trace reader to finish.

M synchronization sketch with partitions

Note that in the sketch below partitions have no well-defined single global point in time. Instead, batches contain the partition ID and all batches from the same partition would appear together in the output. This also means that any “in progress” events' timestamps can contradict what a previous partition says about the activity it represents. This is on purpose: this back-dating is really just a lower bound timestamp that's useful for visualization if there is no previous partition. In practice, I expect the following behavior from the trace consumer:

  • If the consumer is simply streaming the trace, it ignores all “in-progress” events except for in the first partition. It has enough state to otherwise exactly identify the start and end of mentioned activities.
  • If the consumer is observing a vertical slice of a trace, then it must always start building its state from one partition earlier to get the most precise timing information.
var trace struct {
    ...
    // Trace generation.
    //
    // gen != 0 means tracing is enabled.
    // gen wraps around from ^uint32(0) to 2.
    gen atomic.Uint32
    ...
}

const (
    traceBufNew uintptr = iota
    traceBufWriting
    traceBufDone
)

type m struct {
    ...
    traceBuf  [2]atomic.Uintptr // *traceBuf
    ...
}

...

// Trace start, always with a P on a regular goroutine.
// Only one goroutine can be executing this code at a time.
mp := acquirem() // Prevent preemption of self and STW.
trace.gen.Store(1)
forEachP(func() {}) // Make sure we catch all running Ms before continuing.
releasem(mp)

...

// Trace event emission, may run without a P.
if gen := trace.gen.Load(); gen != 0 {
    buf := m.traceBuf[gen%2].Swap(traceBufWriting)
    if buf == traceBufDone {
        m.traceBuf[gen%2].Store(traceBufNew)
        return
    }
    if buf == traceBufNew {
    	buf = newTraceBuf(...)
        // Write events indicating that a goroutine was
        // running or in a syscall or any other time range,
        // if we're about to write an event that ends it.
        // Back-date these to partition start bound which
        // is imprecise, but the best we can do.
        ...
    }
    ...
    if not enough space for event {
        traceFullQueue(gen, buf)
        buf = newTraceBuf(gen, ...)
    }
    ...
    m.traceBuf[gen%2].Store(buf)
}

...

// Partition change or trace end, always with a P on a regular goroutine.
// Only one goroutine can be executing this code at a time.
mp := acquirem() // Prevent preemption of self and STW.

mInSyscall := make(map[*m]struct{})
for each m {
    if readgstatus(m.curg) == _Gsyscall {
        mInSyscall[m] = struct{}{}
    }
}

partitionStartBound := nanotime()
oldGen := trace.gen.Load()

if trace ending {
	trace.gen.Store(0)
} else {
    if oldGen == ^uint32(0) {
        trace.gen.Store(2)
    } else {
        trace.gen.Add(1)
    }
}

// Allow for STW now. We're just cleaning up state at this point.
releasem(mp)

mToFlush := make([]*m, 0, len(allm))
for each m {
    mToFlush = append(mToFlush, m)
}

for len(mToFlush) > 0 {
    for i, m := range mToFlush {
        buf := m.traceBuf[oldGen%2].Load()
        if buf == traceBufNew {
            if _, ok := mInSyscall[m]; ok {
                // Create a buffer and write out its state, back-dating
                // any "in progress" events to partitionStartBound.
                ...
                traceFullQueue(oldGen, buf)
            }
            // Swap remove.
            mToFlush[i] = mToFlush[len(mToFlush)-1]
            mToFlush = mToFlush[:len(mToFlush)-1]
            continue
        }
        if buf == traceBufWriting {
            continue
        }
        if !m.traceBuf[oldGen%2].CompareAndSwap(buf, traceBufDone) {
            continue
        }
        traceFullQueue(oldGen, buf)

        // Swap remove.
        mToFlush[i] = mToFlush[len(mToFlush)-1]
        mToFlush = mToFlush[:len(mToFlush)-1]
    }
    osyield()
}

// Write out end-of-partition tables.

// Notify trace reader that all buffers for oldGen are flushed.

if trace ending {
	// Notify trace reader that the trace is ending.
}

// Reset traceBuf state.
for each m {
    m.traceBuf[oldGen%2].Store(traceBufNew)
}

if trace ending {
	// Wait for trace reader to finish processing.
}

...

// Trace reader calls traceFullDequeue which always returns buffers from
// earlier generations before later generations.