design/60773-execution-tracer-overhaul.md: add design

For golang/go#60773.

Change-Id: I0f35c07941b5ebafc4e57db037a4a92092d6c7b0
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/503038
Reviewed-by: Michael Pratt <mpratt@google.com>
diff --git a/design/60773-execution-tracer-overhaul.md b/design/60773-execution-tracer-overhaul.md
new file mode 100644
index 0000000..b454b29
--- /dev/null
+++ b/design/60773-execution-tracer-overhaul.md
@@ -0,0 +1,958 @@
+# Execution tracer overhaul
+
+Authored by mknyszek@google.com with a mountain of input from others.
+
+In no particular order, thank you to Felix Geisendorfer, Nick Ripley, Michael
+Pratt, Austin Clements, Rhys Hiltner, thepudds, Dominik Honnef, and Bryan
+Boreham for your invaluable feedback.
+
+## Background
+
+[Original design document from
+2014.](https://docs.google.com/document/d/1FP5apqzBgr7ahCCgFO-yoVhk4YZrNIDNf9RybngBc14/pub)
+
+Go execution traces provide a moment-to-moment view of what happens in a Go
+program over some duration.
+This information is invaluable in understanding program behavior over time and
+can be leveraged to achieve significant performance improvements.
+Because Go has a runtime, it can provide deep information about program
+execution without any external dependencies, making traces particularly
+attractive for large deployments.
+
+Unfortunately, limitations in the trace implementation prevent widespread use.
+
+For example, the process of analyzing execution traces scales poorly with the
+size of the trace.
+Traces need to be parsed in their entirety to do anything useful with them,
+making them impossible to stream.
+As a result, trace parsing and validation has very high memory requirements for
+large traces.
+
+Also, Go execution traces are designed to be internally consistent, but don't
+provide any way to align with other kinds of traces, for example OpenTelemetry
+traces and Linux sched traces.
+Alignment with higher level tracing mechanisms is critical to connecting
+business-level tasks with resource costs.
+Meanwhile alignment with lower level traces enables a fully vertical view of
+application performance to root out the most difficult and subtle issues.
+
+Thanks to work in the Go 1.21 cycle, the execution tracer's run-time overhead
+was reduced from roughly 10% lost throughput and 10% added request latency in
+web services to about 1% in both for most applications.
+This reduced overhead in conjunction with making traces more scalable enables
+some [exciting and powerful new opportunities](#use-cases) for the diagnostic.
+
+Lastly, the implementation of the execution tracer has evolved organically over
+time and it shows.
+The codebase also has many old warts and some age-old bugs that make collecting
+traces difficult, and that make the resulting traces seem broken.
+Furthermore, many significant design decisions were made over the years but
+weren't thoroughly documented; those decisions largely exist solely in old
+commit messages and breadcrumbs left in comments within the codebase itself.
+
+## Goals
+
+The goal of this document is to define an alternative implementation for Go
+execution traces that scales up to large Go deployments.
+
+Specifically, the design presented aims to:
+
+- Make trace parsing require a small fraction of the memory it requires today.
+- Make traces streamable, to enable analysis without storage.
+- Fix age-old bugs and present a path to clean up the implementation.
+
+Furthermore, this document will present the existing state of the tracer in
+detail and explain why it's like that to justify the changes being made.
+
+## Design
+
+### Overview
+
+The design is broken down into four parts:
+
+- Timestamps and sequencing.
+- Orienting by threads (Ms) instead of Ps.
+- Partitioning.
+- Wire format cleanup.
+
+These four parts are roughly ordered by how fundamental they are to the trace
+design, and so the former sections are more like concrete proposals, while the
+latter sections are more like design suggestions that would benefit from
+prototyping.
+The earlier parts can also be implemented without considering the latter parts.
+
+Each section also includes the history and design of the existing system, both
+to document the current system in one place and to more easily compare it to
+the newly proposed system.
+That requires, however, a lot of prose, which can obscure the bigger picture.
+Here are the highlights of each section without that additional context.
+
+**Timestamps and sequencing**.
+
+- Compute timestamps from the OS's monotonic clock (`nanotime`).
+- Use per-goroutine sequence numbers for establishing a partial order of events
+  (as before).
+
+**Orienting by threads (Ms) instead of Ps**.
+
+- Batch trace events by M instead of by P.
+- Use lightweight M synchronization for trace start and stop.
+- Simplify syscall handling.
+- All syscalls have a full duration in the trace.
+
+**Partitioning**.
+
+- Traces are sequences of fully self-contained partitions that may be streamed.
+  - Each partition has its own stack table and string table.
+- Partitions are purely logical: consecutive batches with the same ID.
+  - In general, parsers need state from the previous partition to get accurate
+    timing information.
+  - Partitions have an "earliest possible" timestamp to allow for imprecise
+    analysis without a previous partition.
+- Partitions are bound by both a maximum wall time and a maximum size
+  (determined empirically).
+- Traces contain an optional footer delineating partition boundaries as byte
+  offsets.
+- Emit batch lengths to allow for rapidly identifying all batches within a
+  partition.
+
+**Wire format cleanup**.
+
+- More consistent naming scheme for event types.
+- Separate out "reasons" that a goroutine can block or stop from the event
+  types.
+- Put trace stacks, strings, and CPU samples in dedicated batches.
+
+### Timestamps and sequencing
+
+For years, the Go execution tracer has used the `cputicks` runtime function for
+obtaining a timestamp.
+On most platforms, this function queries the CPU for a tick count with a single
+instruction.
+(Intuitively a "tick" goes by roughly every CPU clock period, but in practice
+this clock usually has a constant rate that's independent of CPU frequency
+entirely.) Originally, the execution tracer used this timestamp exclusively for
+ordering trace events.
+Unfortunately, many [modern
+CPUs](https://docs.google.com/spreadsheets/d/1jpw5aO3Lj0q23Nm_p9Sc8HrHO1-lm9qernsGpR0bYRg/edit#gid=0)
+don't provide such a clock that is stable across CPU cores, meaning even though
+cores might synchronize with one another, the clock read-out on each CPU is not
+guaranteed to be ordered in the same direction as that synchronization.
+This led to traces with inconsistent timestamps.
+
+To combat this, the execution tracer was modified to use a global sequence
+counter that was incremented synchronously for each event.
+Each event would then have a sequence number that could be used to order it
+relative to other events on other CPUs, and the timestamps could just be used
+solely as a measure of duration on the same CPU.
+However, this approach led to tremendous overheads, especially on multiprocessor
+systems.
+
+That's why in Go 1.7 the [implementation
+changed](https://go-review.googlesource.com/c/go/+/21512) so that each goroutine
+had its own sequence counter.
+The implementation also cleverly avoids including the sequence number in the
+vast majority of events by observing that running goroutines can't actually be
+taken off their thread until they're synchronously preempted or yield
+themselves.
+Any event emitted while the goroutine is running is trivially ordered after the
+start.
+The only non-trivial ordering cases left are where an event is emitted by or on
+behalf of a goroutine that is not actively running (Note: I like to summarize
+this case as "emitting events at a distance" because the scheduling resource
+itself is not emitting the event bound to it.) These cases need to be ordered
+with respect to each other and with a goroutine starting to run (i.e.
+the `GoStart` event).
+In the current trace format, there are only two such cases: the `GoUnblock`
+event, which indicates that a blocked goroutine may start running again and is
+useful for identifying scheduling latencies, and `GoSysExit`, which is used to
+determine the duration of a syscall but may be emitted from a different P than
+the one the syscall was originally made on.
+(For more details on the `GoSysExit` case see in the [next section](#batching).)
+Furthermore, there are versions of the `GoStart`, `GoUnblock`, and `GoSysExit`
+events that omit a sequence number to save space if the goroutine just ran on
+the same P as the last event, since that's also a case where the events are
+trivially serialized.
+In the end, this approach successfully reduced the trace overhead from over 3x
+to 10-20%.
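+
+To make the sequencing scheme concrete, here's a sketch of how a consumer might
+apply these sequence numbers; the event representation and field names below
+are illustrative, not the actual wire format:
+
+```go
+// event is a hypothetical decoded trace event.
+type event struct {
+    kind string // e.g. "GoStart", "GoUnblock", "GoSysExit"
+    goid uint64 // goroutine the event is bound to
+    seq  uint64 // only meaningful on events emitted "at a distance"
+}
+
+// admissible reports whether ev may be admitted to the ordered event
+// frontier, given nextSeq, the next expected sequence number per
+// goroutine. Inadmissible events are held back until the intervening
+// events arrive from other batches.
+func admissible(nextSeq map[uint64]uint64, ev event) bool {
+    switch ev.kind {
+    case "GoStart", "GoUnblock", "GoSysExit":
+        // Sequenced events must be consumed in per-goroutine order.
+        if ev.seq != nextSeq[ev.goid] {
+            return false
+        }
+        nextSeq[ev.goid]++
+        return true
+    default:
+        // Events emitted by a running goroutine are trivially
+        // ordered after its GoStart.
+        return true
+    }
+}
+```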
+
+However, it turns out that the trace parser still requires timestamps to be in
+order, leading to the infamous ["time stamps out of
+order"](https://github.com/golang/go/issues/16755) error when `cputicks`
+inevitably doesn't actually emit timestamps in order.
+Ps are a purely virtual resource; they don't actually map down directly to
+physical CPU cores at all, so it's not even reasonable to assume that the same P
+runs on the same CPU for any length of time.
+
+Although we can work around this issue, I propose we try to rely on the
+operating system's clock instead and fix up timestamps as a fallback.
+The main motivation behind this change is alignment with other tracing systems.
+It's already difficult enough to try to internally align the `cputicks` clock,
+but making it work with clocks used by other traces such as those produced by
+distributed tracing systems is even more difficult.
+
+On Linux, the `clock_gettime` syscall, called `nanotime` in the runtime, takes
+around 15ns on average when called in a loop.
+Compare this to roughly 10 ns for `cputicks`.
+Trivially replacing all `cputicks` calls in the current tracer with `nanotime`
+reveals a small performance difference that depends largely on the granularity
+of each result.
+Today, the result of `cputicks` is divided by 64.
+On a 3 GHz processor, this amounts to a granularity of about 20 ns.
+Replacing that with `nanotime` and no time division (i.e. nanosecond
+granularity) results in a 0.22% geomean regression in throughput in the Sweet
+benchmarks.
+The trace size also increases by 17%.
+Dividing `nanotime` by 64, we see approximately no regression and a trace size
+decrease of 1.7%.
+Overall, there seems to be little performance downside to using `nanotime`,
+provided we pick an appropriate timing granularity: what we lose by calling
+`nanotime`, we can easily regain by sacrificing a small amount of precision.
+And it's likely that most of the precision below 128 nanoseconds or so is noise,
+given the average cost of a single call into the Go scheduler (~250
+nanoseconds).
+To give us plenty of precision, I propose a target timestamp granularity of 64
+nanoseconds.
+This should be plenty to give us fairly fine-grained insights into Go program
+behavior while also keeping timestamps small.
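+
+Concretely, the timestamp read might look like the following sketch;
+`traceClockNow` is a hypothetical name, and `nanotime` refers to the runtime's
+monotonic clock function:
+
+```go
+// nanotime is the runtime's monotonic clock read (clock_gettime on
+// Linux); shown here as a declaration only.
+func nanotime() int64
+
+// traceClockNow returns a trace timestamp with 64-nanosecond
+// granularity. Division by a power of two compiles down to a shift.
+func traceClockNow() uint64 {
+    return uint64(nanotime()) / 64
+}
+```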
+
+As for sequencing, I believe we must retain the per-goroutine sequence number
+design as-is.
+Relying solely on a timestamp, even a good one, has significant drawbacks.
+For one, issues arise when timestamps are identical: the parser needs to decide
+on a valid ordering and has no choice but to consider every possible ordering of
+those events without additional information.
+While such a case is unlikely with nanosecond-level timestamp granularity, it
+totally precludes making timestamp granularity more coarse, as suggested in the
+previous paragraph.
+A sequencing system that's independent of the system's clock also retains the
+ability of the tracing system to function despite a broken clock (modulo
+returning an error when timestamps are out of order, which again I think we
+should just work around).
+Even `clock_gettime` might be broken on some machines!
+
+How would a tracing system continue to function despite a broken clock? For that
+I propose making the trace parser fix up timestamps that don't line up with the
+partial order.
+The basic idea is that if the parser discovers a partial order edge between two
+events A and B, and A's timestamp is later than B's, then the parser applies A's
+timestamp to B.
+B's new timestamp is in turn propagated later on in the algorithm in case
+another partial order edge is discovered between B and some other event C, and
+those events' timestamps are also out-of-order.
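+
+A minimal sketch of that fixup rule, assuming the parser materializes
+happens-before edges and visits them in an order consistent with the partial
+order itself (so corrected timestamps propagate forward):
+
+```go
+// Event is a minimal stand-in for a decoded trace event.
+type Event struct {
+    Time int64 // timestamp; possibly inconsistent with the partial order
+}
+
+// fixupTimestamps repairs timestamps that disagree with the partial
+// order. Each edge (a, b) means a is known to happen before b.
+func fixupTimestamps(edges [][2]*Event) {
+    for _, e := range edges {
+        a, b := e[0], e[1]
+        if a.Time > b.Time {
+            // The clock misbehaved: b cannot precede a, so pull
+            // b's timestamp forward to match a's.
+            b.Time = a.Time
+        }
+    }
+}
+```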
+
+There's one last issue with timestamps here on platforms for which the runtime
+doesn't have an efficient nanosecond-precision clock at all, like Windows.
+(Ideally, we'd make use of system calls to obtain a higher-precision clock, like
+[QPC](https://learn.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps),
+but [calls to this API can take upwards of a hundred nanoseconds on modern
+hardware, and even then the resolution is on the order of a few hundred
+nanoseconds](https://github.com/golang/go/issues/8687#issuecomment-694498710).)
+On Windows we can just continue to use `cputicks` to get the precision and rely
+on the timestamp fixup logic in the parser, at the cost of being unable to
+reliably align traces with other diagnostic data that uses the system clock.
+
+### Orienting by threads (Ms) instead of Ps
+
+Today the Go runtime batches trace events by P.
+That is, trace events are grouped in batches that all happened, in order, on the
+same P.
+A batch is represented in the runtime as a buffer, 32 KiB in size, which is
+attached to a P when it's actively being written to.
+Events are written to this P's buffer in their encoded form.
+This design choice allows most event writes to elide synchronization with the
+rest of the runtime and linearizes the trace with respect to a P, which is
+crucial to [sequencing without requiring a global total
+order](#timestamps-and-sequencing).
+
+Batching traces by any core scheduling resource (G, M, or P) could in principle
+have similar properties.
+At a glance, there are a few reasons Ps make a better choice.
+One reason is that there are generally a small number of Ps compared to Ms and
+Gs, minimizing the maximum number of buffers that can be active at any given
+time.
+Another reason is convenience.
+When batching, tracing generally requires some kind of synchronization with all
+instances of its companion resource type to get a consistent trace.
+(Think of a buffer which has events written to it very infrequently.
+It needs to be flushed at some point before the trace can be considered
+complete, because it may contain critical information needed to sequence events
+in other buffers.) Furthermore, synchronization when starting a trace is also
+generally useful, as that provides an opportunity to inspect the state of the
+world and write down details about it, simplifying validation.
+Stopping the world is a convenient way to get the attention of all Ps in the Go
+runtime.
+
+However, there are problems with batching by P that make traces more complex
+than necessary.
+
+The core of these problems lies with the `GoSysExit` event.
+This event requires special arrangements in both the runtime and when validating
+traces to ensure a consistent trace.
+The difficulty with this event is that it's emitted by a goroutine that was
+blocked in a syscall and lost its P, and because it doesn't have a P, it might
+race with the runtime enabling and disabling tracing.
+Therefore it needs to wait until it has a P to avoid that race.
+(Note: the tracing system does have a place to put events when there is no P
+available, but that doesn't help in this case.
+The tracing system uses the fact that it stops the world to synchronize with
+`GoSysExit` by preventing it from writing an event until the trace system can
+finish initialization.)
+
+The problem with `GoSysExit` stems from a fundamental mismatch: Ms emit events,
+but only Ps are preemptible by the Go scheduler.
+This really extends to any situation where we'd like an M to emit an event when
+it doesn't have a P, say for example when it goes to sleep after dropping its P
+and not finding any work in the scheduler, and it's one reason why we don't have
+any M-specific events at all today.
+
+So, suppose we batch trace events by M instead.
+
+In the case of `GoSysExit`, it would always be valid to write to a trace buffer,
+because any synchronization would have to happen on the M instead of the P, so
+no races with stopping the world.
+However, this also means the tracer _can't_ simply stop the world, because
+stopping the world is built around stopping user Go code, which runs with a P.
+So, the tracer would have to use something else (more on this later).
+
+Although `GoSysExit` is simpler, `GoSysBlock` becomes slightly more complex in
+the case where the P is retaken by `sysmon`.
+In the per-P world it could be written into the buffer of the taken P, so it
+didn't need any synchronization.
+In the per-M world, it becomes an event that happens "at a distance" and so
+needs a sequence number from the syscalling goroutine for the trace consumer to
+establish a partial order.
+
+However, we can do better by reformulating the syscall events altogether.
+
+Firstly, I propose always emitting the full time range for each syscall.
+This is a quality-of-life choice that may increase trace overheads slightly, but
+also provides substantially more insight into how much time is spent in
+syscalls.
+Syscalls already take ~250 nanoseconds at a baseline on Linux and are unlikely
+to get faster in the future for security reasons (due to Spectre and Meltdown)
+and the additional event would never contain a stack trace, so writing it out
+should be quite fast.
+(Nonetheless, we may want to reconsider emitting these events for cgo calls.)
+The new events would be called, for example, `GoSyscallBegin` and
+`GoSyscallEnd`.
+If a syscall blocks, `GoSyscallEnd` is replaced with `GoSyscallEndBlocked`
+(more on this later).
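+
+In pseudocode, the emission points would look roughly like the sketch below;
+the helper names are illustrative, and the real hooks would live in the
+runtime's `entersyscall` and `exitsyscall` paths:
+
+```go
+type eventType int
+
+const (
+    GoSyscallBegin eventType = iota
+    GoSyscallEnd
+    GoSyscallEndBlocked
+)
+
+// traceEvent stands in for writing an event to the current M's buffer.
+func traceEvent(ev eventType) { /* ... */ }
+
+func syscallEnter() {
+    // Always emitted; this is the event that may carry a stack trace.
+    traceEvent(GoSyscallBegin)
+}
+
+func syscallExit(pRetained bool) {
+    if pRetained {
+        // Fast path: we kept our P, so the syscall didn't block.
+        // No stack trace, so this write is very cheap.
+        traceEvent(GoSyscallEnd)
+    } else {
+        // Our P was stolen while we were blocked. The parser orders
+        // this event after the thief's ProcSteal event.
+        traceEvent(GoSyscallEndBlocked)
+    }
+}
+```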
+
+Secondly, I propose adding an explicit event for stealing a P.
+In the per-M world, keeping just `GoSysBlock` to represent both a goroutine's
+state transition and the state of a P is not feasible, because we're revealing
+the fact that fundamentally two threads are racing with one another on the
+syscall exit.
+An explicit P-stealing event, for example `ProcSteal`, would be required to
+be ordered against a `GoSyscallEndBlocked`, with the former always happening
+before.
+The precise semantics of the `ProcSteal` event would be a `ProcStop` but one
+performed by another thread.
+Because this means events can now happen to a P "at a distance," the `ProcStart`,
+`ProcStop`, and `ProcSteal` events all need sequence numbers.
+
+Note that the naive emission of `ProcSteal` and `GoSyscallEndBlocked` will
+cause them to race, but the interaction of the two represents an explicit
+synchronization within the runtime, so the parser can always safely wait for the
+`ProcSteal` to emerge in the frontier before proceeding.
+The timestamp order may also not be right, but since [we already committed to
+fixing broken timestamps](#timestamps-and-sequencing) in general, this skew will
+be fixed up by that mechanism for presentation.
+(In practice, I expect the skew to be quite small, since it only happens if the
+retaking of a P races with a syscall exit.)
+
+Per-M batching might also incur a higher memory cost for tracing, since there
+are generally more Ms than Ps.
+I suspect this isn't actually too big of an issue since the number of Ms is
+usually close to the number of Ps.
+In the worst case, there may be as many Ms as Gs! However, if we also [partition
+the trace](#partitioning), then the number of active buffers will only be
+proportional to the number of Ms that actually ran in a given time window, which
+is unlikely to be an issue.
+Still, if this does become a problem, a reasonable mitigation would be to simply
+shrink the size of each trace buffer compared to today.
+The overhead of the tracing slow path is vanishingly small, so doubling its
+frequency would likely not incur a meaningful compute cost.
+
+Other than those three details, per-M batching should function identically to
+the current per-P batching: trace events may already be safely emitted without a
+P (modulo `GoSysExit` synchronization), so we're not losing anything else with
+the change.
+
+Instead, however, what we gain is a deeper insight into thread execution.
+Thread information is currently present in execution traces, but difficult to
+interpret because it's always tied to P start and stop events.
+A switch to per-M batching forces traces to treat Ps and Ms orthogonally.
+
+Given all this, I propose switching to per-M batching.
+The only remaining question to resolve is trace synchronization for Ms.
+
+(As an additional minor consideration, I would also like to propose adding the
+batch length to the beginning of each batch.
+Currently (and in the foreseeable future), the trace consumer needs to iterate
+over the entire trace once to collect all the batches for ordering.
+We can speed up this process tremendously by allowing the consumer to _just_
+collect the information it needs.)
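+
+To illustrate the payoff, here's a sketch of a consumer indexing a trace by
+skipping batch bodies; the header fields shown (generation, writer M, length)
+are an assumption for illustration:
+
+```go
+import (
+    "bufio"
+    "encoding/binary"
+    "io"
+)
+
+// batchHeader is a hypothetical batch header: the generation
+// (partition) number, the M that wrote the batch, and the body length.
+type batchHeader struct {
+    gen, m, length uint64
+}
+
+// indexBatches skims a trace and collects every batch header without
+// decoding any events, skipping each batch body via its length.
+func indexBatches(r *bufio.Reader) ([]batchHeader, error) {
+    var index []batchHeader
+    for {
+        var h batchHeader
+        var err error
+        if h.gen, err = binary.ReadUvarint(r); err != nil {
+            if err == io.EOF {
+                return index, nil // clean end of trace
+            }
+            return nil, err
+        }
+        if h.m, err = binary.ReadUvarint(r); err != nil {
+            return nil, err
+        }
+        if h.length, err = binary.ReadUvarint(r); err != nil {
+            return nil, err
+        }
+        index = append(index, h)
+        if _, err := r.Discard(int(h.length)); err != nil {
+            return nil, err
+        }
+    }
+}
+```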
+
+#### M synchronization
+
+The runtime already contains a mechanism to execute code on every M via signals,
+`doAllThreadsSyscall`.
+However, traces have different requirements than `doAllThreadsSyscall`, and I
+think we can exploit the specifics of these requirements to achieve a more
+lightweight alternative.
+
+First, observe that getting the attention of every M for tracing is not strictly
+necessary: Ms that never use the tracer need not be mentioned in the trace at
+all.
+This observation allows us to delegate trace state initialization to the M
+itself, so we can synchronize with Ms at trace start simply by atomically
+setting a single global flag.
+If an M ever writes into the trace buffer, then it will initialize its state by
+emitting an event indicating what it was doing when tracing started.
+For example, if the first event the M is going to emit is `GoBlock`, then it
+will emit an additional event beforehand that indicates the goroutine was
+running since the start of the trace (a hypothetical `GoRunning` event).
+Disabling tracing is slightly more complex, as the tracer needs to flush every
+M's buffer or identify that its buffer was never written to.
+However, this is fairly straightforward to do with per-M seqlocks.
+
+Specifically, Ms would double-check the `trace.enabled` flag under the seqlock,
+and anything trying to stop a trace would first disable the flag, then iterate
+over every M to make sure it has _observed_ each M's seqlock in the unlocked
+state.
+This guarantees that every M observed or will observe the new flag state.
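+
+Here's a sketch of that protocol using the `sync/atomic` package for
+illustration (the runtime itself would use its internal atomics):
+
+```go
+import "sync/atomic"
+
+// mTrace is per-M trace state. An odd seq means the M is inside a
+// trace-event write; even means it is outside.
+type mTrace struct {
+    seq atomic.Uint64
+}
+
+var traceEnabled atomic.Bool
+
+// writeEvent is the M-side write path.
+func (m *mTrace) writeEvent(emit func()) {
+    m.seq.Add(1)             // enter: seq becomes odd
+    if traceEnabled.Load() { // double-check under the seqlock
+        emit()
+    }
+    m.seq.Add(1) // exit: seq becomes even
+}
+
+// stopTrace is the tracer-side stop path.
+func stopTrace(ms []*mTrace) {
+    traceEnabled.Store(false)
+    for _, m := range ms {
+        // Wait until this M is observed outside an event write.
+        // Any write it begins after this point re-checks the flag
+        // and sees that tracing is disabled.
+        for m.seq.Load()%2 != 0 {
+            // spin, or sleep on a per-M note
+        }
+    }
+    // All per-M buffers may now be flushed safely.
+}
+```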
+
+There's just one problem with all this: an M might be running and never emit
+an event.
+This case is critical to capturing system dynamics.
+As far as I can tell, there are three cases here: the M is running user code
+without emitting any trace events (e.g. a tight loop), the M is in a system call
+or C code the whole time, or the M is `sysmon`, a special thread that always
+runs without a P.
+
+The first case is fairly easy to deal with: because it's running with a P, we
+can just preempt it and establish an invariant that any preemption always emits
+an event if the M's trace state has not been initialized.
+
+The second case is a little more subtle, but luckily not very complex to
+implement.
+The tracer can identify whether the M has a G blocked in a syscall (or
+equivalently has called into C code) just before disabling tracing globally by
+checking `readgstatus(m.curg) == _Gsyscall` on each M's G and writing it down.
+If the tracer can see that the M never wrote to its buffer _after_ it disables
+tracing, it can safely conclude that the M was still in a syscall at the moment
+when tracing was disabled, since otherwise it would have written to the buffer.
+
+Note that since we're starting to read `gp.atomicstatus` via `readgstatus`, we
+now need to ensure consistency between the G's internal status and the events
+we emit representing those status changes, from the tracer's perspective.
+Thus, we need to make sure we hold the M's trace seqlock across any
+`gp.atomicstatus` transitions, which will require a small refactoring of the
+runtime and the tracer.
+
+For the third case, we can play basically the same trick as the second case, but
+instead check to see if sysmon is blocked on a note.
+If it's not, it's running.
+If it doesn't emit any events (e.g. to stop on a note) and its buffer is
+empty, then the tracer can assume that it's been running the whole time.
+
+A big change with this synchronization mechanism is that the tracer no longer
+obtains the state of _all_ goroutines when a trace starts, only those that run
+or execute syscalls.
+The remedy to this is to simply add it back in.
+Since the only cases not covered are goroutines in `_Gwaiting` or `_Grunnable`,
+the tracer can set the `_Gscan` bit on the status to ensure the goroutine can't
+transition out.
+At that point, it collects a stack trace and writes out a status event for the
+goroutine into a buffer that isn't attached to any particular M.
+To avoid ABA problems with a goroutine transitioning out of `_Gwaiting` or
+`_Grunnable` and then back in, we also need an atomically-accessed flag for each
+goroutine that indicates whether an event has been written for that goroutine
+yet.
+
+The result of all this synchronization is that the tracer only lightly perturbs
+application execution to start and stop traces.
+In theory, a stop-the-world is no longer required at all, but in practice
+(especially with [partitioning](#partitioning)), a brief stop-the-world to begin
+tracing dramatically simplifies some of the additional synchronization necessary.
+Even so, starting and stopping a trace will now be substantially more lightweight
+than before.
+
+### Partitioning
+
+The structure of the execution trace's binary format limits what you can do with
+traces.
+
+In particular, to view just a small section of the trace, the entire trace needs
+to be parsed, since a batch containing an early event may appear at the end of
+the trace.
+This can happen if, for example, a P that stays relatively idle throughout the
+trace wakes up once at the beginning and once at the end, but never writes
+enough to the trace buffer to flush it.
+
+To remedy this, I propose restructuring traces into a stream of self-contained
+partitions, called "generations."
+More specifically, a generation is a collection of trace batches that represents
+a complete trace.
+In practice this means each generation boundary is a global buffer flush.
+Each trace batch will have a generation number associated with it, and
+generations will not interleave.
+Generation boundaries are identified in the trace by the appearance of a batch
+with a new generation number.
+
+The size of each generation will be roughly constant: new generation boundaries
+will be created when either the partition reaches some threshold size _or_ some
+maximum amount of wall-clock time has elapsed.
+The exact size threshold and wall-clock limit will be determined empirically,
+though my general expectation for the size threshold is around 16 MiB.
+
+Generation boundaries will add a substantial amount of implementation
+complexity, but the cost is worth it, since it'll fundamentally enable new
+use-cases.
+
+The trace parser will need to be able to "stitch" together events across
+generation boundaries, and for that I propose the addition of a new `GoStatus`
+event.
+This event is a generalization of the `GoInSyscall` and `GoWaiting` events.
+This event will be emitted the first time a goroutine is about to be mentioned
+in a trace, and will carry a state enum argument that indicates what state the
+goroutine was in *before* the next event to be emitted.
+To more tightly couple events to states, the state emitted for a goroutine
+will be derived from the event about to be emitted (except for the "waiting"
+and "syscall" cases, where it's explicit).
+The trace parser will be able to use these events to validate continuity across
+generations.
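+
+As a sketch, the consumer-side continuity check might reduce to something like
+the following; the state enum and names are illustrative:
+
+```go
+import "fmt"
+
+// gState is a goroutine scheduling state carried across generations.
+type gState int
+
+const (
+    gRunnable gState = iota
+    gRunning
+    gSyscall
+    gWaiting
+)
+
+// checkStatus validates a GoStatus event seen at the start of a new
+// generation against the state the previous generation ended in.
+func checkStatus(prev map[uint64]gState, goid uint64, status gState) error {
+    old, ok := prev[goid]
+    if !ok {
+        return nil // first mention of this goroutine in the trace
+    }
+    if old != status {
+        return fmt.Errorf("goroutine %d: generation starts in state %v, but previous generation ended in %v", goid, status, old)
+    }
+    return nil
+}
+```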
+
+This global buffer flush can be implemented as an extension to the
+aforementioned [M synchronization](#m-synchronization) design by replacing the
+`enabled` flag with a generation counter and doubling up much of the trace
+state.
+The generation counter will be used to index into the trace state, allowing us
+to dump and then toss old trace state, preparing it for the next generation
+while the current generation is still being produced.
+In other words, it will work akin to ending a trace, but for just one half of
+the trace state.
+
+One final subtlety created by partitioning is that now we may no longer have a
+stop-the-world in between complete traces (since a generation is a complete
+trace).
+This means some activities (ranges of time) may be in progress when a trace is
+starting, and we'll lose that information.
+For user-controlled event types (user tasks and user regions), I propose doing
+nothing special.
+It's already true that if a task or region was in progress we lose that
+information, and it's difficult to track that efficiently given how the API is
+structured.
+For range event types we control (sweep, mark assist, STW, etc.), I propose
+adding the concept of "active" events which are used to indicate that a
+goroutine or P has been actively in one of these ranges since the start of a
+generation.
+These events will be emitted alongside the `GoStatus` event for a goroutine.
+(This wasn't mentioned above, but we'll also need a `ProcStatus` event, and
+we'll emit these active-range events as part of that process as well.)
+
+### Event cleanup
+
+Since we're modifying the trace anyway, this is a good opportunity to clean up
+the event types, making them simpler to implement, simpler to parse and
+understand, and more uniform overall.
+
+The three biggest changes I would like to propose are:
+
+1. A uniform naming scheme.
+1. Separation of the reasons that goroutines might stop or block from the event
+   type.
+1. Placing strings, stacks, and CPU samples in their own dedicated batch types.
+
+Firstly, for the naming scheme, I propose the following rules (which are not
+strictly enforced, only guidelines for documentation):
+
+- Scheduling resources, such as threads and goroutines, have events related to
+  them prefixed with the related resource (i.e. "Thread," "Go," or "Proc").
+  - Scheduling resources have "Create" and "Destroy" events.
+  - Scheduling resources have generic "Status" events used for indicating their
+    state at the start of a partition (replaces `GoInSyscall` and `GoWaiting`).
+  - Scheduling resources have "Start" and "Stop" events to indicate when that
+    resource is in use.
+    The connection between resources is understood through context today.
+  - Goroutines also have "Block" events which are like "Stop", but require an
+    "Unblock" before the goroutine can "Start" again.
+  - Note: Ps are exempt since they aren't a true resource, more like a
+    best-effort semaphore in practice.
+    There's only "Start" and "Stop" for them.
+- Events representing ranges of time come in pairs with the start event having
+  the "Begin" suffix and the end event having the "End" suffix.
+- Events have a prefix corresponding to the deepest resource they're associated
+  with.
+
+Secondly, I propose moving the reasons why a goroutine resource might stop or
+block into an argument.
+This choice is useful for backwards compatibility, because the most likely
+change to the trace format at any given time will be the addition of more
+detail, for example in the form of a new reason to block.
+
+Thirdly, I propose placing strings, stacks, and CPU samples in their own
+dedicated batch types.
+This, in combination with batch sizes in each batch header, will allow the trace
+parser to quickly skim over a generation and find all the strings, stacks, and
+CPU samples.
+This makes certain tasks faster, but more importantly, it simplifies parsing and
+validation, since the full tables are just available up-front.
+It also means that the event batches are able to have a much more uniform format
+(one byte event type, followed by several varints) and we can keep all format
+deviations separate.
+
+Beyond these big three, there are a myriad of additional tweaks I would like to
+propose:
+
+- Remove the `FutileWakeup` event, which is no longer used (even in the current
+  implementation).
+- Remove the `TimerGoroutine` event, which is no longer used (even in the
+  current implementation).
+- Rename `GoMaxProcs` to `ProcsChange`.
+  - This needs to be written out at trace startup, as before.
+- Redefine `GoSleep` as a `GoBlock` and `GoUnblock` pair.
+- Break out `GoStartLabel` into a more generic `GoLabel` event that applies a
+  label to a goroutine the first time it emits an event in a partition.
+- Change the suffixes of pairs of events representing some activity to be
+  `Begin` and `End`.
+- Ensure all GC-related events contain the string `GC`.
+- Add [sequence numbers](#timestamps-and-sequencing) to events where necessary:
+  - `GoStart` still needs a sequence number.
+  - `GoUnblock` still needs a sequence number.
+  - Eliminate the `GoStartLocal`, `GoUnblockLocal`, and `GoSysExitLocal`
+    events for uniformity.
+- Because traces are no longer bound by stopping the world, and can indeed now
+  start during the GC mark phase, we need a way to identify that a GC mark is
+  currently in progress.
+  Add the `GCMarkActive` event.
+- Eliminate inline strings and make all strings referenced from a table for
+  uniformity.
+  - Because of partitioning, the size of the string table is unlikely to become
+    large in practice.
+  - This means all events can have a fixed size (expressed in terms of
+    the number of arguments) and we can drop the argument count bits from the
+    event type.
+
+A number of these tweaks above will likely make traces bigger, mostly due to
+additional information and the elimination of some optimizations for uniformity.
+
+One idea to regain some of this space is to compress goroutine IDs in each
+generation by maintaining a lightweight mapping.
+Each time a goroutine would emit an event that needs the current goroutine
+ID, it'll check a g-local cache.
+This cache will contain the goroutine's alias ID for the current partition
+and the partition number.
+If the goroutine does not have an alias for the current partition, it
+increments a global counter to acquire one and writes down the mapping of
+that counter to its full goroutine ID in a global table.
+This table is then written out at the end of each partition.
+But this is optional.
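+
+A sketch of that fast path follows; all names are illustrative, and the alias
+counter and table would be reset at each partition boundary:
+
+```go
+import (
+    "sync"
+    "sync/atomic"
+)
+
+// gTraceCache is goroutine-local state used to compress goroutine IDs.
+type gTraceCache struct {
+    gen   uint64 // partition this alias is valid for
+    alias uint64 // compact per-partition goroutine ID
+}
+
+var (
+    aliasCounter atomic.Uint64 // reset at each partition boundary
+    aliasTable   sync.Map      // alias -> full goroutine ID; dumped per partition
+)
+
+// traceGoID returns the compact ID to write into events for this
+// goroutine, allocating one on first use within partition gen.
+func traceGoID(c *gTraceCache, goid, gen uint64) uint64 {
+    if c.gen != gen {
+        c.alias = aliasCounter.Add(1)
+        c.gen = gen
+        aliasTable.Store(c.alias, goid)
+    }
+    return c.alias
+}
+```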
+
+## Implementation
+
+As discussed in the overview, this design is intended to be implemented in
+parts.
+Some parts are easy to integrate into the existing tracer, such as the switch to
+nanotime and the traceback-related improvements.
+Others, such as per-M batching, partitioning, and the change to the event
+encoding, are harder.
+
+While in theory all could be implemented separately as an evolution of the
+existing tracer, it's simpler and safer to hide the changes behind a feature
+flag, like a `GOEXPERIMENT` flag, at least to start with.
+Especially for per-M batching and partitioning, it's going to be quite
+complicated to try to branch in all the right spots in the existing tracer.
+
+Instead, we'll basically have two trace implementations temporarily living
+side-by-side, with a mostly shared interface.
+Where they differ (and there are really only a few spots), the runtime can
+explicitly check for the `GOEXPERIMENT` flag.
+This also gives us an opportunity to polish the new trace implementation.
+
+The implementation will also require updating the trace parser to make tests
+work.
+I suspect this will result in what is basically a separate trace parser that's
+invoked when the new header version is identified.
+Once we have a basic trace parser that exports the same API, we can work on
+improving the internal trace parser API to take advantage of the new features
+described in this document.
+
+## Use-cases
+
+This document presents a large change to the existing tracer under the banner of
+"traces at scale" which is a bit abstract.
+To better understand what that means, let's explore a few concrete use-cases
+that this design enables.
+
+### Flight recording
+
+Because the trace in this design is partitioned, it's now possible for the
+runtime to carry around the most recent trace partition.
+The runtime could then expose this partition to the application as a snapshot of
+what the application recently did.
+This is useful for debugging at scale, because it enables the application to
+collect data exactly when something goes wrong.
+For instance:
+
+- A server's health check fails.
+  The server takes a snapshot of recent execution and puts it into storage
+  before exiting.
+- Handling a request exceeds a certain latency threshold.
+  The server takes a snapshot of the most recent execution state and uploads it
+  to storage for future inspection.
+- A program crashes when executing in a deployed environment and leaves behind a
+  core dump.
+  That core dump contains the recent execution state.
+
+Here's a quick design sketch:
+
+```go
+package trace
+
+// FlightRecorder represents a flight recording configuration.
+type FlightRecorder struct { ... }
+
+// NewFlightRecorder creates a new flight recording configuration.
+func NewFlightRecorder() *FlightRecorder
+
+// Start begins process-wide flight recording. Only one FlightRecorder
+// may be started at once. If another FlightRecorder is started, this
+// function will return an error.
+//
+// This function can be used concurrently with Start and Stop.
+func (fr *FlightRecorder) Start() error
+
+// TakeSnapshot collects information about the execution of this program
+// from the last few seconds and writes it to w. This FlightRecorder must
+// have been started to take a snapshot.
+func (fr *FlightRecorder) TakeSnapshot(w io.Writer) error
+
+// Stop ends process-wide flight recording. Stop must be called on the
+// same FlightRecorder that started recording.
+func (fr *FlightRecorder) Stop() error
+```
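+
+As a usage sketch against this API (assuming it lands in `runtime/trace`;
+`handle`, `latencyThreshold`, and `uploadToStorage` are placeholders), the
+latency-threshold scenario above might look like:
+
+```go
+import (
+    "bytes"
+    "log"
+    "net/http"
+    "runtime/trace"
+    "time"
+)
+
+var latencyThreshold = 500 * time.Millisecond // placeholder policy
+
+func handle(w http.ResponseWriter, r *http.Request) { /* application logic */ }
+func uploadToStorage(data []byte)                   { /* placeholder */ }
+
+func main() {
+    fr := trace.NewFlightRecorder()
+    if err := fr.Start(); err != nil {
+        log.Fatal(err)
+    }
+    defer fr.Stop()
+
+    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
+        start := time.Now()
+        handle(w, r)
+        if time.Since(start) > latencyThreshold {
+            // Snapshot the last partition or two of execution.
+            var buf bytes.Buffer
+            if err := fr.TakeSnapshot(&buf); err == nil {
+                uploadToStorage(buf.Bytes())
+            }
+        }
+    })
+    log.Fatal(http.ListenAndServe(":8080", nil))
+}
+```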
+
+- The runtime accumulates trace buffers as usual, but when it makes a partition,
+  it puts the trace data aside as it starts a new one.
+- Once the new partition is finished, the previous is discarded (buffers are
+  reused).
+- At any point, the application can request a snapshot of the current trace
+  state.
+  - The runtime immediately creates a partition from whatever data it has, puts
+    that together with the previously-accumulated partition, and hands it off to
+    the application for reading.
+  - The runtime then continues accumulating trace data in a new partition while
+    the application reads the trace data.
+- The application is almost always guaranteed two partitions, with one being as
+  large as a partition would be in a regular trace (the second one is likely to
+  be smaller).
+- Support for flight recording in the `/debug/pprof/trace` endpoint.
+
+### Fleetwide trace collection
+
+Today some power users collect traces at scale by setting a very modest sampling
+rate, e.g. 1 second out of every 1000 seconds that a service is up.
+This has resulted in collecting a tremendous amount of useful data about
+execution at scale, but this use-case is currently severely limited by trace
+performance properties.
+
+With this design, it should be reasonable to increase the sampling rate by an
+order of magnitude and collect much larger traces for offline analysis.
+
+### Online analysis
+
+Because the new design allows for partitions to be streamed and the trace
+encoding is much faster to process, it's conceivable that a service could
+process its own trace continuously and aggregate and filter only what it needs.
+This online processing avoids the need to store traces for future inspection.
+
+The kinds of processing I imagine for this includes:
+
+- Detailed task latency breakdowns over long time horizons.
+- CPU utilization estimates for different task types.
+- Semantic processing and aggregation of user log events.
+- Fully customizable flight recording (at a price).
+
+### Linux sched trace association
+
+Tooling can be constructed that interleaves Go execution traces with traces
+produced via:
+
+`perf sched record --clockid CLOCK_MONOTONIC <command>`
+
+## Prior art
+
+### JFR
+
+[JFR or "Java Flight
+Recorder](https://developers.redhat.com/blog/2020/08/25/get-started-with-jdk-flight-recorder-in-openjdk-8u#using_jdk_flight_recorder_with_jdk_mission_control)"
+is an execution tracing system for Java applications.
+JFR is highly customizable with a sophisticated wire format.
+It supports very detailed configuration of events via an XML configuration file,
+including the ability to enable/disable stack traces per event and set a latency
+threshold an event needs to exceed in order to be sampled (e.g. "only sample
+file reads if they exceed 20ms").
+It also supports custom user-defined binary-encoded events.
+
+By default, JFR offers a low overhead configuration (<1%) for "continuous
+tracing" and slightly higher overhead configuration (1-2%) for more detailed
+"profile tracing."
+These configurations are just default configuration files that ship with the
+JVM.
+
+Continuous tracing is special in that it accumulates data into an internal
+global ring buffer which may be dumped at any time.
+Notably this data cannot be recovered from a crashed VM, though it can be
+recovered in a wide variety of other cases.
+Continuous tracing is disabled by default.
+
+The JFR encoding scheme is quite complex, but achieves low overhead partially
+through varints.
+JFR traces are also partitioned, enabling scalable analysis.
+
+In many ways the existing Go execution tracer looks a lot like JFR, just without
+partitioning, and with a simpler encoding.
+
+### KUTrace
+
+[KUTrace](https://github.com/dicksites/KUtrace) is a system-wide execution
+tracing system for Linux.
+It achieves low overhead, high resolution, and staggering simplicity by cleverly
+choosing to trace on each kernel transition (the "goldilocks" point in the
+design space, as the author puts it).
+
+It uses a fairly simple 8-byte-word-based encoding scheme to keep trace writing
+fast, and exploits the very common case of a system call returning quickly
+(>90%) to pack two events into each word when possible.
+
+This project's low-overhead tracing inspired this effort, but in the end we
+didn't take too many of its insights.
+
+### CTF
+
+[CTF or "Common Trace Format](https://diamon.org/ctf/#spec7)" is more of a meta
+tracing system.
+It defines a binary format, a metadata format (to describe that binary format),
+as well as a description language.
+In essence, it contains all the building blocks of a trace format without
+defining one for a specific use-case.
+
+Traces in CTF are defined as a series of streams of events.
+Events in a stream are grouped into self-contained packets.
+CTF contains many of the same concepts that we define here (packets are batches;
+the streams described in this document are per-thread, etc.), though it largely
+leaves interpretation of the trace data up to the one defining a CTF-based trace
+format.
+
+## Future work
+
+### Event encoding
+
+The trace format relies heavily on LEB128-encoded integers.
+While this choice makes the trace quite compact (achieving 4-6 bytes/event, with
+some tricks to keep each encoded integer relatively small), it comes at a cost
+to decoding speed since LEB128 is well-known for being relatively slow to decode
+compared to similar integer encodings.
+(Note: this is only a hunch at present; this needs to be measured.) (The cost at
+encoding time is dwarfed by other costs, like tracebacks, so it's not on the
+critical path for this redesign.)
+
+This section proposes a possible future trace encoding that is simpler, but
+without any change would produce a much larger trace.
+It then proposes two possible methods of compressing the trace from there.
+
+#### Encoding format
+
+To start with, let's redefine the format as a series of 4-byte words of data.
+
+Each event has a single header word that includes the event type (8 bits), space
+for a 4-bit event reason, and the timestamp delta (20 bits).
+Note that every event requires a timestamp.
+At a granularity of 64 nanoseconds, this gives us a timestamp range of roughly
+67 milliseconds, which is more than enough for most cases.
+In the case where we can't fit the delta in 20 bits, we'll emit a new
+"timestamp" event which is 2 words wide, containing the timestamp delta from the
+start of the partition.
+This gives us a 7-byte delta which should be plenty for any trace.
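+
+Packing and unpacking such a header word is then trivial; here's a sketch with
+the field widths described above:
+
+```go
+// packHeader packs an event header word: 8-bit event type, 4-bit
+// reason, and 20-bit timestamp delta.
+func packHeader(evType uint8, reason uint8, tsDelta uint32) uint32 {
+    return uint32(evType)<<24 | uint32(reason&0xf)<<20 | tsDelta&0xfffff
+}
+
+// unpackHeader reverses packHeader.
+func unpackHeader(w uint32) (evType, reason uint8, tsDelta uint32) {
+    return uint8(w >> 24), uint8(w>>20) & 0xf, w & 0xfffff
+}
+```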
+
+Each event's header is then followed by a number of 4-byte arguments.
+The minimum number of arguments is fixed per the event type, and will be made
+evident in self-description.
+A self-description table at the start of the trace could indicate (1) whether
+there are a variable number of additional arguments, (2) which argument
+indicates how many there are, and (3) whether any arguments are byte arrays
+(integer followed by data).
+Byte array data lengths are always rounded up to 4 bytes.
+Possible arguments include, for example, per-goroutine sequence numbers,
+goroutine IDs, thread IDs, P IDs, stack IDs, and string IDs.
+Within a partition, most of these trivially fit in 32 bits:
+
+- Per-goroutine sequence numbers easily fit provided each partition causes a
+  global sequence number reset, which is straightforward to arrange.
+- Thread IDs come from the OS as `uint32` on all platforms we support.
+- P IDs trivially fit within 32 bits.
+- Stack IDs are local to a partition and as a result trivially fit in 32 bits.
+  Assuming a partition is O(MiB) in size (so, O(millions) of events), even if
+  each event has a unique stack ID, it'll fit.
+- String IDs follow the same logic as stack IDs.
+- Goroutine IDs, when compressed as described in the event cleanup section, will
+  easily fit in 32 bits.
+
+This format change, before any compression, will result in an encoded trace
+size increase of 2-3x.
+Plugging in the size of each proposed event above into the breakdown of a trace
+such as [this 288 MiB one produced by
+felixge@](https://gist.github.com/felixge/a79dad4c30e41a35cb62271e50861edc)
+reveals an increase in encoded trace size by a little over 2x.
+The aforementioned 288 MiB trace grows to around 640 MiB in size, not including
+additional timestamp events, the batch header in this scheme, or any tables at
+the end of each partition.
+
+This increase in trace size is likely unacceptable since there are several cases
+where holding the trace in memory is desirable.
+Luckily, the data in this format is very compressible.
+
+This design notably doesn't handle inline strings.
+The easy thing to do would be to always put them in the string table, but then
+they're held onto until at least the end of a partition, which isn't great for
+memory use in the face of many non-cacheable strings.
+This design would have to be extended to support inline strings, perhaps
+rounding up their length to a multiple of 4 bytes.
+
+#### On-line integer compression
+
+One possible route from here is to simply choose a different integer compression
+scheme.
+There exist many fast block-based integer compression schemes, ranging from [GVE
+and
+PrefixVarint](https://en.wikipedia.org/wiki/Variable-length_quantity#Group_Varint_Encoding)
+to SIMD-friendly bit-packing schemes.
+
+Inline strings would likely have to be excepted from this compression scheme,
+and would likely need to be sent out-of-band.
+
+#### On-demand user compression
+
+Another alternative compression scheme is to have none, and instead expect the
+trace consumer to perform their own compression, if they want the trace to be
+compressed.
+This certainly simplifies the runtime implementation, and would give trace
+consumers a choice between trace encoding speed and trace size.
+For a sufficiently fast compression scheme (perhaps LZ4), it's possible this
+could rival integer compression in encoding overhead.
+
+More investigation on this front is required.
+
+### Becoming a CPU profile superset
+
+Traces already contain CPU profile samples, but are missing some information
+compared to what Go CPU profile pprof protos contain.
+We should consider bridging that gap, such that it is straightforward to extract
+a CPU profile from an execution trace.
+
+This might look like just emitting another section to the trace that contains
+basically the entire pprof proto.
+This can likely be added as a footer to the trace.