Proposal: API for unstable runtime metrics

Author: Michael Knyszek

Background & Motivation

The need for a new API for unstable metrics was already summarized quite well by @aclements, so I'll quote that here:

The runtime currently exposes heap-related metrics through runtime.ReadMemStats (which can be used programmatically) and GODEBUG=gctrace=1 (which is difficult to read programmatically). These metrics are critical to understanding runtime behavior, but have some serious limitations:
MemStats is hard to evolve because it must obey the Go 1 compatibility rules. The existing metrics are confusing, but we can't change them. Some of the metrics are now meaningless (like EnableGC and DebugGC), and several have aged poorly (like hard-coding the number of size classes at 61, or only having a single pause duration per GC cycle). Hence, we tend to shy away from adding anything to this because we'll have to maintain it for the rest of time.
The gctrace format is unspecified, which means we can evolve it (and have completely changed it several times). But it's a pain to collect programmatically because it only comes out on stderr and, even if you can capture that, you have to parse a text format that changes. Hence, automated metric collection systems ignore gctrace. There have been requests to make this programmatically accessible (#28623).

There are many metrics I would love to expose from the runtime memory manager and scheduler, but our current approach forces me to choose between two bad options: programmatically expose metrics that are so fundamental they'll make sense for the rest of time, or expose unstable metrics in a way that's difficult to collect and process programmatically.

Other problems with ReadMemStats include performance, such as the need to stop-the-world. While many of the metrics in MemStats are difficult to collect without stopping the world, not all of them require it, and it would be nice to be able to acquire some subset of metrics without a global application penalty.

Requirements

Conversing with @aclements, we agree that:

The API should be easily extendable with new metrics.
The API should be easily retractable, to deprecate old metrics.
- Removing a metric should not break any Go applications as per the Go 1 compatibility promise.
The API should be discoverable, to obtain a list of currently relevant metrics.
The API should be rich, allowing a variety of metrics (e.g. distributions).
The API implementation should minimize CPU/memory usage, such that it does not appreciably affect any of the metrics being measured.
The API should include useful existing metrics already exposed by the runtime.

Goals

Given the requirements, I suggest we prioritize the following concerns, in this order, when designing the API.

Extensibility.
- Metrics are “unstable” and therefore it should always be compatible to add or remove metrics.
- Since metrics will tend to be implementation-specific, this feature is critical.
Discoverability.
- Because these metrics are “unstable,” there must be a way for the application, and for the human writing the application, to discover the set of usable metrics and be able to do something useful with that information (e.g. log the metric).
- The API should enable collecting a subset of metrics programmatically. For example, one might want to “collect all memory-related metrics” or “collect all metrics which are efficient to collect”.
Performance.
- Must have a minimal effect on the metrics it returns in the steady-state.
- Should scale up to 100s of metrics, an amount that a human might consider “a lot.”
  - Note that picking the right types to expose can limit the amount of metrics we need to expose. For example, a distribution type would significantly reduce the number of metrics.
Ergonomics.
- The API should be as easy to use as it can be, given the above.

Design

I propose we add a new standard library package to support a new runtime metrics API to avoid polluting the namespace of existing packages. The proposed name of the package is the runtime/metrics package.

I propose that this package expose a sampling-based API for acquiring runtime metrics, in the same vein as runtime.ReadMemStats, that meets this proposal's stated goals. The sampling approach is taken in opposition to a stream-based (or event-based) API. Many of the metrics currently exposed by the runtime are “continuous” in the sense that they're cheap to update and are updated frequently enough that emitting an event for every update would be quite expensive, and would require scaffolding to allow the user to control the emission rate. Unless noted otherwise, this document will assume a sampling-based API.

With that said, I believe that in the future it will be worthwhile to expose an event-based API as well, taking a hybrid approach, much like Linux's perf tool. See “Time series metrics” for a discussion of such an extension.

Representation of metrics

First, it probably makes the most sense to interact with a set of metrics, rather than one metric at a time. Many metrics require that the runtime reach some safe state to collect, so naturally it makes sense to collect all such metrics at this time for performance. For the rest of this document, we're going to consider “sets of metrics” as the unit of our API instead of individual metrics for this reason.

Second, the extendability and retractability requirements imply a less rigid data structure to represent and interact with a set of metrics. Perhaps the least rigid data structure in Go is something like a byte slice, but this is decidedly too low-level to use from within a Go application because it would need to have an encoding. Simply defining a new encoding for this would be a non-trivial undertaking with its own complexities.

The next least-rigid data structure is probably a Go map, which allows us to associate some key for a metric with a sampled metric value. The two most useful properties of maps here are that their set of keys is completely dynamic, and that they allow efficient random access. The inconvenience of a map though is its undefined iteration order. While this might not matter if we're just constructing an RPC message to hit an API, it does matter if one just wants to print statistics to STDERR every once in a while for debugging.

A slightly more rigid data structure that would be useful for managing an unstable set of metrics is a slice of structs, with each struct containing a key (the metric name) and a value. This allows us to have a well-defined iteration order, and it's up to the user if they want efficient random access. For example, they could keep the slice sorted by metric keys, and do a binary search over them, or even have a map on the side.
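As a rough illustration of the "sorted slice plus binary search" option, here is a minimal sketch. The `sample` type, with a plain uint64 value, and the `sortByName` and `lookup` helpers are hypothetical stand-ins chosen for brevity; they are not part of the proposed API.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a hypothetical stand-in for one key-value pair in a metric set.
// A plain uint64 value keeps the sketch short.
type sample struct {
	Name  string
	Value uint64
}

// sortByName orders the samples by metric name so that lookups can use
// binary search.
func sortByName(s []sample) {
	sort.Slice(s, func(i, j int) bool { return s[i].Name < s[j].Name })
}

// lookup finds a sample by name in a slice already sorted by sortByName.
func lookup(s []sample, name string) (sample, bool) {
	i := sort.Search(len(s), func(i int) bool { return s[i].Name >= name })
	if i < len(s) && s[i].Name == name {
		return s[i], true
	}
	return sample{}, false
}

func main() {
	set := []sample{
		{Name: "/memory/heap/free:bytes", Value: 4096},
		{Name: "/gc/heap/goal:bytes", Value: 1 << 20},
		{Name: "/gc/cycles/completed:gc-cycles", Value: 7},
	}
	sortByName(set)
	if s, ok := lookup(set, "/gc/heap/goal:bytes"); ok {
		fmt.Println(s.Name, s.Value)
	}
}
```

Iteration still walks the slice in a well-defined order, and the user only pays for sorting if they actually want random access.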

There are several variants of this slice approach (e.g. struct of keys slice and values slice), but I think the general idea of using slices of key-value pairs strikes the right balance between flexibility and usability. Go any further in terms of rigidity and we end up right where we don't want to be: with a MemStats-like struct.

Third, I propose the metric key be something abstract but still useful for humans, such as a string. An alternative might be an integral ID, where we provide a function to obtain a metric's name from its ID. However, using an ID pollutes the API. Since we want to allow a user to ask for specific metrics, we would be required to provide named constants for each metric which would later be deprecated. It's also unclear that this would give any performance benefit at all.

Finally, we want the metric value to be able to take on a variety of forms. Many metrics might work great as uint64 values, but most do not. For example we might want to collect a distribution of values (size classes are one such example). Distributions in particular can take on many different forms, for example if we wanted to have an HDR histogram of STW pause times. In the interest of being as extensible as possible, something like an empty interface value could work here.

However, an empty interface value has implications for performance. How do we efficiently populate that empty interface value without allocating? One idea is to only use pointer types, for example it might contain *float64 or *uint64 values. While this strategy allows us to re-use allocations between samples, it's starting to rely on the internal details of Go interface types for efficiency.

Fundamentally, the problem we have here is that we want to include a fixed set of valid types as possible values. This concept maps well to the notion of a sum type in other languages. While Go lacks such a facility, we can emulate one. Consider the following representation for a value:

type Kind int

const (
	KindBad Kind = iota
	KindUint64
	KindFloat64
	KindFloat64Histogram
)

type Value struct {
	// unexported fields
}

func (v Value) Kind() Kind

// panics if v.Kind() != KindUint64
func (v Value) Uint64() uint64

// panics if v.Kind() != KindFloat64
func (v Value) Float64() float64

// panics if v.Kind() != KindFloat64Histogram
func (v Value) Float64Histogram() *Float64Histogram

The advantage of such a representation is that we can hide away details about how each metric sample value is actually represented. For example, we could embed a uint64 slot into the Value which is used to hold either a uint64, a float64, or an int64, and which is populated directly by the runtime without any additional allocations at all. For types which will require an indirection, such as histograms, we could also hold an unsafe.Pointer or interface{} value as an unexported field and pull out the correct type as needed. In these cases we would still need to allocate once up-front (the histogram needs to contain a slice for counts, for example).

The downside of such a structure is mainly ergonomics. In order to use it effectively, one needs to switch on the result of the Kind() method, then call the appropriate method to get the underlying value. While in that case we lose some type safety as opposed to using an interface{} and a type-switch construct, there is some precedent for such a structure. In particular, a Value mimics the API of reflect.Value in some ways.
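To make the "single uint64 slot" idea concrete, here is one possible sketch under stated assumptions. The lowercase `kind` and `value` types are simplified local stand-ins for the Kind and Value types above, modeling only the two scalar kinds. Storing a float64 through `math.Float64bits` is one way to reuse the slot, not necessarily what an implementation would do.

```go
package main

import (
	"fmt"
	"math"
)

// kind and value are hypothetical, simplified stand-ins for the Kind and
// Value types sketched above; only the two scalar kinds are modeled.
type kind int

const (
	kindBad kind = iota
	kindUint64
	kindFloat64
)

type value struct {
	kind   kind
	scalar uint64 // holds a uint64 directly, or the IEEE 754 bits of a float64
}

func uint64Value(x uint64) value { return value{kind: kindUint64, scalar: x} }

func float64Value(x float64) value {
	return value{kind: kindFloat64, scalar: math.Float64bits(x)}
}

func (v value) Kind() kind { return v.kind }

// Uint64 panics if v.Kind() != kindUint64.
func (v value) Uint64() uint64 {
	if v.kind != kindUint64 {
		panic("called Uint64 on non-uint64 metric value")
	}
	return v.scalar
}

// Float64 panics if v.Kind() != kindFloat64.
func (v value) Float64() float64 {
	if v.kind != kindFloat64 {
		panic("called Float64 on non-float64 metric value")
	}
	return math.Float64frombits(v.scalar)
}

// describe shows the switch-on-Kind pattern a caller would use.
func describe(v value) string {
	switch v.Kind() {
	case kindUint64:
		return fmt.Sprintf("uint64 %d", v.Uint64())
	case kindFloat64:
		return fmt.Sprintf("float64 %g", v.Float64())
	default:
		return "bad value"
	}
}

func main() {
	fmt.Println(describe(uint64Value(42)))
	fmt.Println(describe(float64Value(0.25)))
	fmt.Println(describe(value{}))
}
```

Neither constructor nor accessor allocates, which is the property the scalar slot is meant to provide; the zero value naturally reports the "bad" kind.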

Putting this all together, I propose sampled metric values look like

// Sample captures a single metric sample.
type Sample struct {
  Name string
  Value Value
}

Furthermore, I propose that we use a slice of these Sample structures to represent our “snapshot” of the current state of the system (i.e. the counterpart to runtime.MemStats).

Discoverability

To support discovering which metrics the system supports, we must provide a function that returns the set of supported metric keys.

I propose that the discovery API return a slice of “metric descriptions” which contain a “Name” field referring to a metric key. Using a slice here mirrors the sampling API.

Metric naming

Choosing a naming scheme for each metric will significantly influence its usage, since these are the names that will eventually be surfaced to the user. There are two important properties we would like to have such that these metric names may be smoothly and correctly exposed to the user.

The first, and perhaps most important, of these properties is that a metric's semantics be tied to its name. If the semantics (including the type of each sample value) of a metric changes, then the name should too.

The second is that the name should be easily parsable and mechanically rewritable, since different metric collection systems have different naming conventions.

Putting these two together, I propose that the metric name be built from two components: a forward-slash-separated path to a metric where each component is lowercase words separated by hyphens (the “name”, e.g. “/memory/heap/free”), and its unit (e.g. bytes, seconds). I propose we separate the two components of “name” and “unit” by a colon (“:”) and provide a well-defined format for the unit (e.g. “/memory/heap/free:bytes”).

Representing the metric name as a path is intended to provide a mechanism for namespacing metrics. Many metrics naturally group together, and this provides a straightforward way of filtering out only a subset of metrics, or perhaps matching on them. The use of lower-case and hyphenated path components is intended to make the name easy to translate to most common naming conventions used in metrics collection systems. The introduction of this new API is also a good time to rename some of the more vaguely named statistics, and perhaps to introduce a better namespacing convention.
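As a small illustration of how path-style names make subset selection straightforward, the sketch below filters names by a path prefix. The `withPrefix` helper is hypothetical, and the metric names are examples taken from this document.

```go
package main

import (
	"fmt"
	"strings"
)

// withPrefix returns the names that live under the given path prefix.
// The prefix is expected to end in "/" so that "/gc/" does not also match
// a sibling namespace such as "/gcx/".
func withPrefix(names []string, prefix string) []string {
	var out []string
	for _, n := range names {
		if strings.HasPrefix(n, prefix) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{
		"/memory/heap/free:bytes",
		"/memory/other:bytes",
		"/gc/heap/goal:bytes",
		"/sched/goroutines:goroutines",
	}
	// Select only the memory-related metrics.
	for _, n := range withPrefix(names, "/memory/") {
		fmt.Println(n)
	}
}
```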

Including the unit in the name may be a bit surprising at first. First of all, why should the unit even be a string? One alternative way to represent the unit is to use some structured format, but this has the potential to lock us into some bad decisions or limit us to only a certain subset of units. Using a string gives us more flexibility to extend the units we support in the future. Thus, I propose that no matter what we do, we should definitely keep the unit as a string.

In terms of a format for this string, I think we should keep the unit closely aligned with the Go benchmark output format to facilitate a nice user experience for measuring these metrics within the Go testing framework. This goal suggests the following very simple format: a series of all-lowercase common base unit names, singular or plural, without SI prefixes (such as “seconds” or “bytes”, not “nanoseconds” or “MiB”), potentially containing hyphens (e.g. “cpu-seconds”), delimited by either * or / characters. A regular expression is sufficient to describe the format, and ignoring the restriction of common base unit names, would look like ^[a-z-]+(?:[*\/][a-z-]+)*$.
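The regular expression above can be checked mechanically. The following sketch compiles it with Go's regexp package (writing the delimiter class as `[*/]`, which is equivalent) and tries it on a few candidate units. The `validUnit` helper is a hypothetical name, and like the expression itself it does not enforce the "common base unit names" restriction.

```go
package main

import (
	"fmt"
	"regexp"
)

// unitRE is the unit format from the text, ignoring the restriction to
// common base unit names.
var unitRE = regexp.MustCompile(`^[a-z-]+(?:[*/][a-z-]+)*$`)

// validUnit reports whether u matches the proposed unit format.
func validUnit(u string) bool { return unitRE.MatchString(u) }

func main() {
	candidates := []string{
		"bytes", "cpu-seconds", "bytes/second", "byte*cpu-seconds",
		"MiB",    // rejected: uppercase letters
		"bytes/", // rejected: trailing delimiter
		"",       // rejected: empty
	}
	for _, u := range candidates {
		fmt.Printf("%q valid=%t\n", u, validUnit(u))
	}
}
```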

Why should the unit be a part of the name? Mainly to help maintain the first property mentioned above. If we decide to change a metric's unit, which represents a semantic change, then the name must also change. Also, in this situation, it's much more difficult for a user to forget to include the unit. If their metric collection system has no rules about names, then great, they can just use whatever Go gives them. If they do (and most seem to be fairly opinionated) it forces the user to account for the unit when dealing with the name and it lessens the chance that it would be forgotten. Furthermore, splitting a string is typically less computationally expensive than combining two strings.
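To show what "mechanically rewritable" might look like in practice, here is a sketch that splits a full name at the colon and rewrites it for a collection system with its own naming rules. The target convention (a "go" prefix, underscores, unit as a suffix, "/" in units spelled "per") is purely hypothetical, as are the `splitName` and `rewrite` helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// splitName separates a full metric name into its path and unit at the colon.
// ok is false if the name contains no colon.
func splitName(full string) (path, unit string, ok bool) {
	i := strings.IndexByte(full, ':')
	if i < 0 {
		return "", "", false
	}
	return full[:i], full[i+1:], true
}

// rewrite maps a name to a hypothetical convention: underscore-separated
// words, a "go" prefix, and the unit appended as a suffix.
func rewrite(full string) (string, bool) {
	path, unit, ok := splitName(full)
	if !ok {
		return "", false
	}
	pathRepl := strings.NewReplacer("/", "_", "-", "_")
	unitRepl := strings.NewReplacer("/", "_per_", "*", "_", "-", "_")
	return "go" + pathRepl.Replace(path) + "_" + unitRepl.Replace(unit), true
}

func main() {
	for _, n := range []string{
		"/memory/heap/free:bytes",
		"/gc/cycles/completed:gc-cycles",
	} {
		if out, ok := rewrite(n); ok {
			fmt.Println(n, "->", out)
		}
	}
}
```

Because the unit travels with the name, the rewriting code has to decide what to do with it, which is exactly the forcing function described above.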

Metric Descriptions

Firstly, any metric description must contain the name of the metric. No matter which way we choose to store a set of descriptions, it is both useful and necessary to carry this information around. Another useful field is an English description of the metric. This description may then be propagated into metrics collection systems dynamically.

The metric description should also indicate the performance sensitivity of the metric. Today ReadMemStats forces the user to endure a stop-the-world to collect all metrics. There are a number of pieces of information we could add, but one good one for now would be “does this metric require a stop-the-world event?”. The intended use of such information would be to collect certain metrics less often, or to exclude them altogether from metrics collection. While this is fairly implementation-specific for metadata, the majority of tracing GC designs involve a stop-the-world event at one point or another.

Another useful aspect of a metric description would be to indicate whether the metric is a “gauge” or a “counter” (i.e. it increases monotonically). We have examples of both in the runtime and this information is often useful to bubble up to metrics collection systems to influence how they're displayed and what operations are valid on them (e.g. counters are often more usefully viewed as rates). By including whether a metric is a gauge or a counter in the descriptions, metrics collection systems don't have to try to guess, and users don't have to annotate exported metrics manually; they can do so programmatically.

Finally, metric descriptions should allow users to filter out metrics that their application can't understand. The most common situation in which this can happen is if a user upgrades or downgrades the Go version their application is built with, but they do not update their code. Another situation in which this can happen is if a user switches to a different Go runtime (e.g. TinyGo). There may be a new metric in this Go version represented by a type which was not used in previous versions. For this case, it's useful to include type information in the metric description so that applications can programmatically filter these metrics out. In this case, I propose we add a Kind field to the description.
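A sketch of how an application might use this metadata follows. The `description` type is a local stand-in carrying only the fields discussed in this section, and `usable` is a hypothetical helper that drops metrics whose kind the application does not understand or which require a stop-the-world event.

```go
package main

import "fmt"

type kind int

const (
	kindBad kind = iota
	kindUint64
	kindFloat64
	kindFloat64Histogram
)

// description is a hypothetical stand-in for a metric description carrying
// the fields discussed above.
type description struct {
	Name         string
	Kind         kind
	StopTheWorld bool
}

// usable keeps the descriptions whose Kind the application understands and
// which do not require a stop-the-world event to collect.
func usable(all []description, understood map[kind]bool) []description {
	var out []description
	for _, d := range all {
		if understood[d.Kind] && !d.StopTheWorld {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	all := []description{
		{Name: "/gc/heap/goal:bytes", Kind: kindUint64},
		{Name: "/gc/pause-latency-distribution:seconds", Kind: kindFloat64Histogram},
		// Made-up entries: a kind this application predates, and a costly metric.
		{Name: "/hypothetical/new-metric:things", Kind: kind(99)},
		{Name: "/hypothetical/expensive:bytes", Kind: kindUint64, StopTheWorld: true},
	}
	understood := map[kind]bool{kindUint64: true, kindFloat64: true, kindFloat64Histogram: true}
	for _, d := range usable(all, understood) {
		fmt.Println(d.Name)
	}
}
```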

Documentation

While the metric descriptions allow an application to programmatically discover the available set of metrics at runtime, it's tedious for humans to write an application just to dump the set of metrics available to them.

For ReadMemStats, the documentation is on the MemStats struct itself. For gctrace it is in the runtime package's top-level comment. Because this proposal doesn't tie metrics to Go variables or struct fields, the best we can do is what gctrace does and document it in the metrics package-level documentation. A test in the runtime/metrics package will ensure that the documentation always matches the metric's English description.

Furthermore, the documentation should contain a record of when metrics were added and when metrics were removed (such as a note like “(since Go 1.X)” in the English description). Users who are using an old version of Go but looking at up-to-date documentation, such as the documentation exported to golang.org, will be able to more easily discover information relevant to their application. If a metric is removed, the documentation should note which version removed it.

Time series metrics

The API as described so far has been a sampling-based API, but many metrics are updated at well-defined (and relatively infrequent) intervals, such as many of the metrics found in the gctrace output. These metrics, which I'll call “time series metrics,” may be sampled, but the sampling operation is inherently lossy. In many cases it's very useful for performance debugging to have precise information of how a metric might change e.g. from GC cycle to GC cycle.

Measuring such metrics thus fits better in an event-based, or stream-based API, which emits a stream of metric values (tagged with precise timestamps) which are then ingested by the application and logged someplace.

While we stated earlier that considering such time series metrics is outside of the scope of this proposal, it's worth noting that buying into a sampling-based API today does not close any doors toward exposing precise time series metrics in the future. A straightforward way of extending the API would be to add the time series metrics to the total list of metrics, allowing the usual sampling-based approach if desired, while also tagging some metrics with a “time series” flag in their descriptions. The event-based API, in that form, could then just be a pure addition.

A feasible alternative in this space is to only expose a sampling API, but to include a timestamp on event metrics to allow users to correlate metrics with specific events. For example, if metrics came from the previous GC, they would be tagged with the timestamp of that GC, and if the metric and timestamp hadn't changed, the user could identify that.

One interesting consequence of having an event-based API which is prompt is that users could then react to Go runtime state on-the-fly, such as for detecting when the GC is running. On the one hand, this could provide value to some users of Go, who require fine-grained feedback from the runtime system. On the other hand, the supported metrics will still always be unstable, so relying on a metric for feedback in one release might no longer be possible in a future release.

Draft API Specification

Given the discussion of the design above, I propose the following draft API specification.

package metrics

// Float64Histogram represents a distribution of float64 values.
type Float64Histogram struct {
	// Counts contains the weights for each histogram bucket. The length of
	// Counts is equal to the length of Buckets plus one to account for the
	// implicit minimum bucket.
	//
	// Given N = len(Buckets) boundaries, the following is the mathematical
	// relationship between Counts and Buckets.
	// Counts[0] is the weight of the range (-inf, Buckets[0])
	// Counts[n] is the weight of the range [Buckets[n-1], Buckets[n]), for 0 < n < N
	// Counts[N] is the weight of the range [Buckets[N-1], inf)
	Counts []uint64

	// Buckets contains the boundaries between histogram buckets, in increasing order.
	//
	// Because this slice contains boundaries, there are len(Buckets)+1 total buckets:
	// a bucket for all values less than the first boundary, a bucket covering each
	// [Buckets[i], Buckets[i+1]) interval, and a bucket for all values greater than or
	// equal to the last boundary.
	Buckets []float64
}

// Clone generates a deep copy of the Float64Histogram.
func (f *Float64Histogram) Clone() *Float64Histogram

// Kind is a tag for a metric Value which indicates its type.
type Kind int

const (
	// KindBad indicates that the Value has no type and should not be used.
	KindBad Kind = iota

	// KindUint64 indicates that the type of the Value is a uint64.
	KindUint64

	// KindFloat64 indicates that the type of the Value is a float64.
	KindFloat64

	// KindFloat64Histogram indicates that the type of the Value is a *Float64Histogram.
	KindFloat64Histogram
)

// Value represents a metric value returned by the runtime.
type Value struct {
	kind    Kind
	scalar  uint64         // contains scalar values for scalar Kinds.
	pointer unsafe.Pointer // contains non-scalar values.
}

// Value returns a value of one of the types mentioned by Kind.
//
// This function may allocate memory.
func (v Value) Value() interface{}

// Kind returns a tag representing the kind of value this is.
func (v Value) Kind() Kind

// Uint64 returns the internal uint64 value for the metric.
//
// If v.Kind() != KindUint64, this method panics.
func (v Value) Uint64() uint64

// Float64 returns the internal float64 value for the metric.
//
// If v.Kind() != KindFloat64, this method panics.
func (v Value) Float64() float64

// Float64Histogram returns the internal *Float64Histogram value for the metric.
//
// The returned value may be reused by calls to Read, so the user should clone
// it if they intend to use it across calls to Read.
//
// If v.Kind() != KindFloat64Histogram, this method panics.
func (v Value) Float64Histogram() *Float64Histogram

// Description describes a runtime metric.
type Description struct {
	// Name is the full name of the metric, including the unit.
	//
	// The format of the metric may be described by the following regular expression.
	// ^(?P<name>/[^:]+):(?P<unit>[^:*\/]+(?:[*\/][^:*\/]+)*)$
	//
	// The format splits the name into two components, separated by a colon: a path which always
	// starts with a /, and a machine-parseable unit. The name may contain any valid Unicode
	// codepoint in between / characters, but by convention will try to stick to lowercase
	// characters and hyphens. An example of such a path might be "/memory/heap/free".
	//
	// The unit is by convention a series of lowercase English unit names (singular or plural)
	// without prefixes delimited by '*' or '/'. The unit names may contain any valid Unicode
	// codepoint that is not a delimiter.
	// Examples of units might be "seconds", "bytes", "bytes/second", "cpu-seconds",
	// "byte*cpu-seconds", and "bytes/second/second".
	//
	// A complete name might look like "/memory/heap/free:bytes".
	Name string

	// Cumulative is whether or not the metric is cumulative. If a cumulative metric is just
	// a single number, then it increases monotonically. If the metric is a distribution,
	// then each bucket count increases monotonically.
	//
	// This flag thus indicates whether or not it's useful to compute a rate from this value.
	Cumulative bool

	// Kind is the kind of value for this metric.
	//
	// The purpose of this field is to allow users to filter out metrics whose values are
	// types which their application may not understand.
	Kind Kind

	// StopTheWorld is whether or not the metric requires a stop-the-world
	// event in order to collect it.
	StopTheWorld bool
}

// All returns a slice containing metric descriptions for all supported metrics.
func All() []Description

// Sample captures a single metric sample.
type Sample struct {
	// Name is the name of the metric sampled.
	//
	// It must correspond to a name in one of the metric descriptions
	// returned by All.
	Name string

	// Value is the value of the metric sample.
	Value Value
}

// Read populates each Value element in the given slice of metric samples.
//
// Desired metrics should be present in the slice with the appropriate name.
// The user of this API is encouraged to re-use the same slice between calls.
//
// Metric values with names not appearing in the value returned by All
// will simply be left untouched (Value.Kind() == KindBad).
func Read(m []Sample)
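
The note on Float64Histogram about reuse across calls to Read is why Clone exists. As a rough sketch of what a deep copy involves, and of the aliasing problem it avoids, consider the following. The lowercase `float64Histogram` type and `clone` method are local stand-ins that mirror the draft; they are not the actual implementation.

```go
package main

import "fmt"

// float64Histogram is a local stand-in mirroring the draft Float64Histogram.
type float64Histogram struct {
	Counts  []uint64
	Buckets []float64
}

// clone returns a deep copy, so that later writes to h's slices are not
// visible through the copy.
func (h *float64Histogram) clone() *float64Histogram {
	if h == nil {
		return nil
	}
	c := &float64Histogram{
		Counts:  make([]uint64, len(h.Counts)),
		Buckets: make([]float64, len(h.Buckets)),
	}
	copy(c.Counts, h.Counts)
	copy(c.Buckets, h.Buckets)
	return c
}

func main() {
	h := &float64Histogram{
		Counts:  []uint64{1, 2, 3},
		Buckets: []float64{0.001, 0.01},
	}
	saved := h.clone()
	// Simulate a later Read overwriting the histogram in place.
	h.Counts[1] = 50
	fmt.Println(saved.Counts[1], h.Counts[1]) // prints "2 50"
}
```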

The usage of the API we have in mind for collecting specific metrics is the following:

var stats = []metrics.Sample{
	{Name: "/gc/heap/goal:bytes"},
	{Name: "/gc/pause-latency-distribution:seconds"},
}

// Somewhere...
...
	go statsLoop(stats, 30*time.Second)
...

func statsLoop(stats []metrics.Sample, d time.Duration) {
	// Read and print stats once every d.
	ticker := time.NewTicker(d)
	for {
		metrics.Read(stats)
		for _, sample := range stats {
			// Split the full metric name into its path and unit at the colon.
			split := strings.IndexByte(sample.Name, ':')
			name, unit := sample.Name[:split], sample.Name[split+1:]
			value := sample.Value
			switch value.Kind() {
			case metrics.KindUint64:
				log.Printf("%s: %d %s", name, value.Uint64(), unit)
			case metrics.KindFloat64:
				log.Printf("%s: %f %s", name, value.Float64(), unit)
			case metrics.KindFloat64Histogram:
				v := value.Float64Histogram()
				m := computeMean(v)
				log.Printf("%s: %f avg %s", name, m, unit)
			default:
				log.Printf("unknown value %s:%s: %v", name, unit, value.Value())
			}
		}
		<-ticker.C
	}
}
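
The computeMean helper used above is not defined by this proposal. One possible sketch is shown below, assuming the layout described on Float64Histogram (len(Buckets)+1 counts, with unbounded first and last buckets). It approximates each bounded bucket by its midpoint and each unbounded bucket by its nearest boundary, so the result is only an estimate. The lowercase `float64Histogram` type is a local stand-in.

```go
package main

import "fmt"

// float64Histogram is a local stand-in mirroring the draft Float64Histogram.
type float64Histogram struct {
	Counts  []uint64
	Buckets []float64
}

// computeMean estimates the mean of the distribution. Bounded buckets are
// represented by their midpoint; the two unbounded edge buckets are
// represented by their nearest boundary. It returns 0 for an empty or
// malformed histogram.
func computeMean(h *float64Histogram) float64 {
	n := len(h.Buckets)
	if n == 0 || len(h.Counts) != n+1 {
		return 0
	}
	var total uint64
	var sum float64
	for i, c := range h.Counts {
		var rep float64 // representative value for bucket i
		switch {
		case i == 0:
			rep = h.Buckets[0]
		case i == n:
			rep = h.Buckets[n-1]
		default:
			rep = (h.Buckets[i-1] + h.Buckets[i]) / 2
		}
		sum += rep * float64(c)
		total += c
	}
	if total == 0 {
		return 0
	}
	return sum / float64(total)
}

func main() {
	h := &float64Histogram{
		Buckets: []float64{0, 2, 4},
		Counts:  []uint64{0, 1, 1, 0},
	}
	fmt.Println(computeMean(h))
}
```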

I believe common usage will be to simply slurp up all metrics, which would look like this:

...
	// Generate a sample array for all the metrics.
	desc := metrics.All()
	stats := make([]metrics.Sample, len(desc))
	for i := range desc {
		stats[i] = metrics.Sample{Name: desc[i].Name}
	}
	go statsLoop(stats, 30*time.Second)
...

Proposed initial list of metrics

Existing metrics

/memory/heap/free:bytes        KindUint64 // (== HeapIdle - HeapReleased)
/memory/heap/uncommitted:bytes KindUint64 // (== HeapReleased)
/memory/heap/objects:bytes     KindUint64 // (== HeapAlloc)
/memory/heap/unused:bytes      KindUint64 // (== HeapInuse - HeapAlloc)
/memory/heap/stacks:bytes      KindUint64 // (== StackInuse)

/memory/metadata/mspan/inuse:bytes             KindUint64 // (== MSpanInuse)
/memory/metadata/mspan/free:bytes              KindUint64 // (== MSpanSys - MSpanInuse)
/memory/metadata/mcache/inuse:bytes            KindUint64 // (== MCacheInuse)
/memory/metadata/mcache/free:bytes             KindUint64 // (== MCacheSys - MCacheInuse)
/memory/metadata/other:bytes                   KindUint64 // (== GCSys)
/memory/metadata/profiling/buckets-inuse:bytes KindUint64 // (== BuckHashSys)

/memory/other:bytes        KindUint64 // (== OtherSys)
/memory/native-stack:bytes KindUint64 // (== StackSys - StackInuse)

/aggregates/total-virtual-memory:bytes KindUint64 // (== sum over everything in /memory/**)

/gc/heap/objects:objects       KindUint64 // (== HeapObjects)
/gc/heap/goal:bytes            KindUint64 // (== NextGC)
/gc/cycles/completed:gc-cycles KindUint64 // (== NumGC)
/gc/cycles/forced:gc-cycles    KindUint64 // (== NumForcedGC)

New GC metrics

// Distribution of pause times, replaces PauseNs and PauseTotalNs.
/gc/pause-latency-distribution:seconds KindFloat64Histogram

// Distribution of unsmoothed trigger ratio.
/gc/pacer/trigger-ratio-distribution:ratio KindFloat64Histogram

// Distribution of what fraction of CPU time was spent on GC in each GC cycle.
/gc/pacer/utilization-distribution:cpu-percent KindFloat64Histogram

// Distribution of objects by size.
// Buckets correspond directly to size classes up to 32 KiB,
// after that it's approximated by an HDR histogram.
// allocs-by-size replaces BySize, TotalAlloc, and Mallocs.
// frees-by-size replaces BySize and Frees.
/malloc/allocs-by-size:bytes KindFloat64Histogram
/malloc/frees-by-size:bytes  KindFloat64Histogram

// How many hits and misses in the mcache.
/malloc/cache/hits:allocations   KindUint64
/malloc/cache/misses:allocations KindUint64

// Distribution of sampled object lifetimes in number of GC cycles.
/malloc/lifetime-distribution:gc-cycles KindFloat64Histogram

// How many page cache hits and misses there were.
/malloc/page/cache/hits:allocations   KindUint64
/malloc/page/cache/misses:allocations KindUint64

// Distribution of stack scanning latencies. HDR histogram.
/gc/stack-scan-latency-distribution:seconds KindFloat64Histogram

Scheduler metrics

/sched/goroutines:goroutines     KindUint64
/sched/preempt/async:preemptions KindUint64
/sched/preempt/sync:preemptions  KindUint64

// Distribution of how long goroutines stay in runnable
// before transitioning to running. HDR histogram.
/sched/time-to-run-distribution:seconds KindFloat64Histogram

Backwards Compatibility

Note that although the set of metrics the runtime exposes will not be stable across Go versions, the API to discover and access those metrics will be.

Therefore, this proposal strictly increases the API surface of the Go standard library without changing any existing functionality and is therefore Go 1 compatible.