Authors: Russ Cox, Austin Clements
Last updated: February 2016
Discussion at golang.org/issue/14313.
We propose to make the current output of go test -bench the defined format for recording all Go benchmark data. Having a defined format allows benchmark measurement programs and benchmark analysis programs to interoperate while evolving independently.
We are unaware of any standard formats for recording raw benchmark data, and we've been unable to find any using web searches. One might expect that a standard benchmark suite such as SPEC CPU2006 would have defined a format for raw results, but that appears not to be the case. The collection of published results includes only analyzed data (example), not raw data.
Go has a de facto standard format for benchmark data: the lines generated by the testing package when using go test -bench. For example, running compress/flate's benchmarks produces this output:
BenchmarkDecodeDigitsSpeed1e4-8 100 154125 ns/op 64.88 MB/s 40418 B/op 7 allocs/op
BenchmarkDecodeDigitsSpeed1e5-8 10 1367632 ns/op 73.12 MB/s 41356 B/op 14 allocs/op
BenchmarkDecodeDigitsSpeed1e6-8 1 13879794 ns/op 72.05 MB/s 52056 B/op 94 allocs/op
BenchmarkDecodeDigitsDefault1e4-8 100 147551 ns/op 67.77 MB/s 40418 B/op 8 allocs/op
BenchmarkDecodeDigitsDefault1e5-8 10 1197672 ns/op 83.50 MB/s 41508 B/op 13 allocs/op
BenchmarkDecodeDigitsDefault1e6-8 1 11808775 ns/op 84.68 MB/s 53800 B/op 80 allocs/op
BenchmarkDecodeDigitsCompress1e4-8 100 143348 ns/op 69.76 MB/s 40417 B/op 8 allocs/op
BenchmarkDecodeDigitsCompress1e5-8 10 1185527 ns/op 84.35 MB/s 41508 B/op 13 allocs/op
BenchmarkDecodeDigitsCompress1e6-8 1 11740304 ns/op 85.18 MB/s 53800 B/op 80 allocs/op
BenchmarkDecodeTwainSpeed1e4-8 100 143665 ns/op 69.61 MB/s 40849 B/op 15 allocs/op
BenchmarkDecodeTwainSpeed1e5-8 10 1390359 ns/op 71.92 MB/s 45700 B/op 31 allocs/op
BenchmarkDecodeTwainSpeed1e6-8 1 12128469 ns/op 82.45 MB/s 89336 B/op 221 allocs/op
BenchmarkDecodeTwainDefault1e4-8 100 141916 ns/op 70.46 MB/s 40849 B/op 15 allocs/op
BenchmarkDecodeTwainDefault1e5-8 10 1076669 ns/op 92.88 MB/s 43820 B/op 28 allocs/op
BenchmarkDecodeTwainDefault1e6-8 1 10106485 ns/op 98.95 MB/s 71096 B/op 172 allocs/op
BenchmarkDecodeTwainCompress1e4-8 100 138516 ns/op 72.19 MB/s 40849 B/op 15 allocs/op
BenchmarkDecodeTwainCompress1e5-8 10 1227964 ns/op 81.44 MB/s 43316 B/op 25 allocs/op
BenchmarkDecodeTwainCompress1e6-8 1 10040347 ns/op 99.60 MB/s 72120 B/op 173 allocs/op
BenchmarkEncodeDigitsSpeed1e4-8 30 482808 ns/op 20.71 MB/s
BenchmarkEncodeDigitsSpeed1e5-8 5 2685455 ns/op 37.24 MB/s
BenchmarkEncodeDigitsSpeed1e6-8 1 24966055 ns/op 40.05 MB/s
BenchmarkEncodeDigitsDefault1e4-8 20 655592 ns/op 15.25 MB/s
BenchmarkEncodeDigitsDefault1e5-8 1 13000839 ns/op 7.69 MB/s
BenchmarkEncodeDigitsDefault1e6-8 1 136341747 ns/op 7.33 MB/s
BenchmarkEncodeDigitsCompress1e4-8 20 668083 ns/op 14.97 MB/s
BenchmarkEncodeDigitsCompress1e5-8 1 12301511 ns/op 8.13 MB/s
BenchmarkEncodeDigitsCompress1e6-8 1 137962041 ns/op 7.25 MB/s
The testing package always reports ns/op, and directly supports the addition of MB/s (throughput) and also B/op and allocs/op (allocation rates). Benchmarks can report additional metrics with any custom unit using B.ReportMetric.
Multiple tools have been written that process this format, most notably benchcmp and its more statistically valid successor benchstat. There is also benchmany's plot subcommand and likely more unpublished programs.
Multiple tools have also been written that generate this format. In addition to the standard Go testing package, compilebench generates this data format based on runs of the Go compiler, and Austin's unpublished shellbench generates this data format after running an arbitrary shell command.
The golang.org/x/benchmarks benchmarks are notable for not originally generating this format, which made all analysis of those results more complex than we believe it should be. Part of the motivation for the proposal is to avoid the need to process custom output formats in future benchmarks.
A Go benchmark data file is a UTF-8 textual file consisting of a sequence of lines. Configuration lines, benchmark result lines, and unit metadata lines, described below, have semantic meaning in the reporting of benchmark results.
All other lines in the data file, including but not limited to blank lines and lines beginning with a # character, are ignored. For example, the testing package prints test results above benchmark data, usually the text PASS. That line is neither a configuration line nor a benchmark result line, so it is ignored.
A configuration line is a key-value pair of the form
key: value
where key begins with a lower case character (as defined by unicode.IsLower), contains no space characters (as defined by unicode.IsSpace) nor upper case characters (as defined by unicode.IsUpper), and one or more ASCII space or tab characters separate “key:” from “value.” Conventionally, multiword keys are written with the words separated by hyphens, as in cpu-speed. There are no restrictions on value, except that it cannot contain a newline character. Value can be omitted entirely, in which case the colon must still be present, but need not be followed by a space.
The interpretation of a key/value pair is up to tooling, but the key/value pair is considered to describe all benchmark results that follow, until overwritten by a configuration line with the same key.
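The rules above can be sketched in Go as follows. This is a minimal illustration, not a reference parser: the function name parseConfigLine is hypothetical, and the sketch assumes that the key ends at the first colon on the line, which the text above does not spell out.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// parseConfigLine reports whether line is a configuration line and,
// if so, returns its key and value.
// Assumption: the key ends at the first colon on the line.
func parseConfigLine(line string) (key, value string, ok bool) {
	i := strings.Index(line, ":")
	if i <= 0 {
		return "", "", false // no colon, or empty key
	}
	key, rest := line[:i], line[i+1:]
	first, _ := utf8.DecodeRuneInString(key)
	if !unicode.IsLower(first) {
		return "", "", false
	}
	for _, r := range key {
		if unicode.IsSpace(r) || unicode.IsUpper(r) {
			return "", "", false
		}
	}
	if rest == "" {
		return key, "", true // value omitted; the colon alone is enough
	}
	if rest[0] != ' ' && rest[0] != '\t' {
		return "", "", false // "key:" must be followed by ASCII space or tab
	}
	return key, strings.TrimLeft(rest, " \t"), true
}

func main() {
	// cfg holds the configuration in effect for the result lines that follow;
	// a later line with the same key overwrites the earlier value.
	cfg := map[string]string{}
	for _, line := range []string{"goos: darwin", "PASS", "cpu-count: 8", "goos: linux"} {
		if k, v, ok := parseConfigLine(line); ok {
			cfg[k] = v
		}
	}
	fmt.Println(cfg)
}
```

A stream-oriented reader would snapshot cfg each time it encounters a benchmark result line, so that every result carries the configuration that preceded it.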
A benchmark result line has the general form
<name> <iterations> <value> <unit> [<value> <unit>...]
The fields are separated by runs of space characters (as defined by unicode.IsSpace), so the line can be parsed with strings.Fields. The line must have an even number of fields, and at least four.
The first field is the benchmark name, which must begin with Benchmark followed by an upper case character (as defined by unicode.IsUpper) or the end of the field, as in BenchmarkReverseString or just Benchmark. Tools displaying benchmark data conventionally omit the Benchmark prefix. The same benchmark name can appear on multiple result lines, indicating that the benchmark was run multiple times.
The second field gives the number of iterations run. For most processing this number can be ignored, although it may give some indication of the expected accuracy of the measurements that follow.
The remaining fields report value/unit pairs in which the value is a float64 that can be parsed by strconv.ParseFloat and the unit explains the value, as in “64.88 MB/s”. The units reported are typically normalized so that they can be interpreted without considering the number of iterations. In the example, the CPU cost is reported per-operation and the throughput is reported per-second; neither is a total that depends on the number of iterations.
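As an illustration, a result line might be parsed roughly as follows. The type and function names (Measurement, Result, parseResultLine) are hypothetical, and treating the iteration count as a base-10 integer is an assumption of this sketch rather than something the format text states.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
	"unicode/utf8"
)

// Measurement is one value/unit pair from a result line.
type Measurement struct {
	Value float64
	Unit  string
}

// Result is a parsed benchmark result line.
type Result struct {
	Name         string
	Iterations   int64
	Measurements []Measurement
}

// parseResultLine reports whether line is a benchmark result line and,
// if so, returns its parsed form.
func parseResultLine(line string) (Result, bool) {
	f := strings.Fields(line) // splits on runs of unicode.IsSpace characters
	if len(f) < 4 || len(f)%2 != 0 {
		return Result{}, false
	}
	if !strings.HasPrefix(f[0], "Benchmark") {
		return Result{}, false
	}
	// "Benchmark" must be followed by an upper case character or end of field.
	if rest := f[0][len("Benchmark"):]; rest != "" {
		r, _ := utf8.DecodeRuneInString(rest)
		if !unicode.IsUpper(r) {
			return Result{}, false
		}
	}
	// Assumption: the iteration count is a base-10 integer.
	n, err := strconv.ParseInt(f[1], 10, 64)
	if err != nil {
		return Result{}, false
	}
	res := Result{Name: f[0], Iterations: n}
	for i := 2; i < len(f); i += 2 {
		v, err := strconv.ParseFloat(f[i], 64)
		if err != nil {
			return Result{}, false
		}
		res.Measurements = append(res.Measurements, Measurement{Value: v, Unit: f[i+1]})
	}
	return res, true
}

func main() {
	line := "BenchmarkDecodeDigitsSpeed1e4-8 100 154125 ns/op 64.88 MB/s 40418 B/op 7 allocs/op"
	if r, ok := parseResultLine(line); ok {
		fmt.Printf("%s ran %d iterations: %v\n", r.Name, r.Iterations, r.Measurements)
	}
}
```

Lines for which parseResultLine returns false are not necessarily errors; they may be configuration lines or lines the format says to ignore.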
A value's unit string is expected to specify not only the measurement unit but also, as needed, a description of what is being measured. For example, a benchmark might report its overall execution time as well as cache miss times with three units “ns/op,” “L1-miss-ns/op,” and “L2-miss-ns/op.”
Tooling can expect that the unit strings are identical for all runs to be compared; for example, a result reporting “ns/op” need not be considered comparable to one reporting “µs/op.”
However, tooling may assume that the measurement unit is the final of the hyphen-separated words in the unit string and may recognize and rescale known measurement units. For example, consistently large “ns/op” or “L1-miss-ns/op” might be rescaled to “ms/op” or “L1-miss-ms/op” for display.
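A display tool might apply that convention along these lines. The helper names splitUnit and rescaleNsToMs are hypothetical, and the sketch handles only the ns/op to ms/op conversion mentioned above; a real tool would decide per unit whether values are consistently large enough to rescale.

```go
package main

import (
	"fmt"
	"strings"
)

// splitUnit splits a unit string into its descriptive prefix (including
// the trailing hyphen, if any) and its measurement unit, taken to be the
// final hyphen-separated word.
func splitUnit(unit string) (prefix, measure string) {
	i := strings.LastIndex(unit, "-") // -1 when there is no hyphen
	return unit[:i+1], unit[i+1:]
}

// rescaleNsToMs converts a value whose measurement unit is "ns/op" to
// "ms/op", keeping any descriptive prefix. Other units are returned unchanged.
func rescaleNsToMs(value float64, unit string) (float64, string) {
	prefix, measure := splitUnit(unit)
	if measure != "ns/op" {
		return value, unit
	}
	return value / 1e6, prefix + "ms/op"
}

func main() {
	for _, u := range []string{"ns/op", "L1-miss-ns/op", "MB/s"} {
		v, scaled := rescaleNsToMs(13879794, u)
		fmt.Printf("%s -> %g %s\n", u, v, scaled)
	}
}
```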
In the current testing package, benchmark names correspond to Go identifiers: each benchmark must be written as a different Go function. Work targeted for Go 1.7 will allow tests and benchmarks to define sub-tests and sub-benchmarks programmatically, in particular to vary interesting parameters both when testing and when benchmarking. That work uses a slash to separate the name of a benchmark collection from the description of a sub-benchmark.
We propose that sub-benchmarks adopt the convention of choosing names that are key=value pairs, and that benchmark data processors treat slash-prefixed key=value pairs in the benchmark name as per-benchmark configuration values.
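A benchmark data processor might extract those per-benchmark configuration values roughly as follows. The helper names are hypothetical. The sketch also assumes that a trailing "-N" on the name is the GOMAXPROCS suffix appended by the testing package and strips it first; how tools should treat that suffix is not specified here.

```go
package main

import (
	"fmt"
	"strings"
)

// trimProcs removes a trailing "-N" (N all ASCII digits) from name.
// Assumption: such a suffix is the GOMAXPROCS value added by the testing
// package, not part of the last key=value pair.
func trimProcs(name string) string {
	i := strings.LastIndex(name, "-")
	if i < 0 || i == len(name)-1 {
		return name
	}
	for _, r := range name[i+1:] {
		if r < '0' || r > '9' {
			return name
		}
	}
	return name[:i]
}

// splitName separates a benchmark name into its base name and the
// slash-prefixed key=value pairs it contains. Slash-separated parts
// that are not key=value pairs stay in the base name.
func splitName(name string) (base string, config map[string]string) {
	parts := strings.Split(name, "/")
	base = parts[0]
	config = map[string]string{}
	for _, p := range parts[1:] {
		k, v, ok := strings.Cut(p, "=")
		if !ok || k == "" {
			base += "/" + p
			continue
		}
		config[k] = v
	}
	return base, config
}

func main() {
	name := "BenchmarkDecode/text=digits/level=speed/size=1e4-8"
	base, cfg := splitName(trimProcs(name))
	fmt.Println(base, cfg)
}
```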
When a benchmark reports units outside the standard units implemented by the testing package, it can be useful for tools to understand additional metadata about those units.
A unit metadata line has the form
Unit <unit> <key>=<value> <key>=<value> ...
The fields are separated by runs of space characters (as defined by unicode.IsSpace), and space characters are not allowed within unit, key, or value. Keys must not contain =.
It is an error to specify different values for any given unit and key, even on different unit metadata lines. That is, once unit metadata is specified, it can't be overridden. Specifying the same value for a key multiple times is not an error.
Unit metadata applies to all following benchmark result lines, though it is unspecified whether it applies to earlier benchmark results lines. This allows for stream-oriented processing of benchmark results.
Keys are not constrained, but the following keys have predefined meanings:
better={higher,lower} indicates whether higher or lower values of this unit are better (indicate an improvement). By default, ns/op, B/op, and allocs/op are better=lower, and MB/s is better=higher. Other units do not assume a default.
assume={nothing,exact} indicates what statistical assumption to make when considering distributions of values. nothing means to make no statistical assumptions (e.g., use non-parametric methods) and exact means to assume measurements are exact (repeated measurement does not increase confidence). The default is nothing. In the future we may also support normal, but that's almost never the right assumption for benchmarks.
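The unit metadata rules can be sketched as follows. The UnitMeta type and its methods are hypothetical names chosen for illustration. The sketch keeps the built-in better defaults separate from explicitly specified metadata, which is one possible reading of how defaults and the no-override rule interact.

```go
package main

import (
	"fmt"
	"strings"
)

// UnitMeta records explicitly specified metadata as unit -> key -> value.
type UnitMeta map[string]map[string]string

// defaultBetter holds the defaults given above for the testing package's units.
var defaultBetter = map[string]string{
	"ns/op": "lower", "B/op": "lower", "allocs/op": "lower", "MB/s": "higher",
}

// Add records the metadata on line. It reports whether line is a unit
// metadata line, and returns an error if the line is malformed or gives
// a different value for a unit and key that was already specified.
func (m UnitMeta) Add(line string) (bool, error) {
	f := strings.Fields(line)
	if len(f) < 2 || f[0] != "Unit" {
		return false, nil
	}
	unit := f[1]
	for _, kv := range f[2:] {
		// Cutting at the first "=" guarantees the key contains no "=".
		k, v, ok := strings.Cut(kv, "=")
		if !ok {
			return true, fmt.Errorf("unit %s: malformed field %q", unit, kv)
		}
		if old, seen := m[unit][k]; seen && old != v {
			return true, fmt.Errorf("unit %s: key %s has conflicting values %q and %q", unit, k, old, v)
		}
		if m[unit] == nil {
			m[unit] = map[string]string{}
		}
		m[unit][k] = v
	}
	return true, nil
}

// Better returns "higher", "lower", or "" if the direction is unknown.
func (m UnitMeta) Better(unit string) string {
	if v, ok := m[unit]["better"]; ok {
		return v
	}
	return defaultBetter[unit]
}

func main() {
	m := UnitMeta{}
	_, err := m.Add("Unit L1-miss-ns/op better=lower assume=nothing")
	fmt.Println(m.Better("L1-miss-ns/op"), m.Better("MB/s"), err)
	_, err = m.Add("Unit L1-miss-ns/op better=higher")
	fmt.Println(err)
}
```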
The benchmark output given in the background section above is already in the format proposed here. That is a key feature of the proposal.
However, a future run of the benchmark might add configuration lines, and the benchmark might be rewritten to use sub-benchmarks, producing this output:
commit: 7cd9055
commit-time: 2016-02-11T13:25:45-0500
goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
cpu-count: 8
cpu-physical-count: 4
os: Mac OS X 10.11.3
mem: 16 GB

BenchmarkDecode/text=digits/level=speed/size=1e4-8 100 154125 ns/op 64.88 MB/s 40418 B/op 7 allocs/op
BenchmarkDecode/text=digits/level=speed/size=1e5-8 10 1367632 ns/op 73.12 MB/s 41356 B/op 14 allocs/op
BenchmarkDecode/text=digits/level=speed/size=1e6-8 1 13879794 ns/op 72.05 MB/s 52056 B/op 94 allocs/op
BenchmarkDecode/text=digits/level=default/size=1e4-8 100 147551 ns/op 67.77 MB/s 40418 B/op 8 allocs/op
BenchmarkDecode/text=digits/level=default/size=1e5-8 10 1197672 ns/op 83.50 MB/s 41508 B/op 13 allocs/op
BenchmarkDecode/text=digits/level=default/size=1e6-8 1 11808775 ns/op 84.68 MB/s 53800 B/op 80 allocs/op
BenchmarkDecode/text=digits/level=best/size=1e4-8 100 143348 ns/op 69.76 MB/s 40417 B/op 8 allocs/op
BenchmarkDecode/text=digits/level=best/size=1e5-8 10 1185527 ns/op 84.35 MB/s 41508 B/op 13 allocs/op
BenchmarkDecode/text=digits/level=best/size=1e6-8 1 11740304 ns/op 85.18 MB/s 53800 B/op 80 allocs/op
BenchmarkDecode/text=twain/level=speed/size=1e4-8 100 143665 ns/op 69.61 MB/s 40849 B/op 15 allocs/op
BenchmarkDecode/text=twain/level=speed/size=1e5-8 10 1390359 ns/op 71.92 MB/s 45700 B/op 31 allocs/op
BenchmarkDecode/text=twain/level=speed/size=1e6-8 1 12128469 ns/op 82.45 MB/s 89336 B/op 221 allocs/op
BenchmarkDecode/text=twain/level=default/size=1e4-8 100 141916 ns/op 70.46 MB/s 40849 B/op 15 allocs/op
BenchmarkDecode/text=twain/level=default/size=1e5-8 10 1076669 ns/op 92.88 MB/s 43820 B/op 28 allocs/op
BenchmarkDecode/text=twain/level=default/size=1e6-8 1 10106485 ns/op 98.95 MB/s 71096 B/op 172 allocs/op
BenchmarkDecode/text=twain/level=best/size=1e4-8 100 138516 ns/op 72.19 MB/s 40849 B/op 15 allocs/op
BenchmarkDecode/text=twain/level=best/size=1e5-8 10 1227964 ns/op 81.44 MB/s 43316 B/op 25 allocs/op
BenchmarkDecode/text=twain/level=best/size=1e6-8 1 10040347 ns/op 99.60 MB/s 72120 B/op 173 allocs/op
BenchmarkEncode/text=digits/level=speed/size=1e4-8 30 482808 ns/op 20.71 MB/s
BenchmarkEncode/text=digits/level=speed/size=1e5-8 5 2685455 ns/op 37.24 MB/s
BenchmarkEncode/text=digits/level=speed/size=1e6-8 1 24966055 ns/op 40.05 MB/s
BenchmarkEncode/text=digits/level=default/size=1e4-8 20 655592 ns/op 15.25 MB/s
BenchmarkEncode/text=digits/level=default/size=1e5-8 1 13000839 ns/op 7.69 MB/s
BenchmarkEncode/text=digits/level=default/size=1e6-8 1 136341747 ns/op 7.33 MB/s
BenchmarkEncode/text=digits/level=best/size=1e4-8 20 668083 ns/op 14.97 MB/s
BenchmarkEncode/text=digits/level=best/size=1e5-8 1 12301511 ns/op 8.13 MB/s
BenchmarkEncode/text=digits/level=best/size=1e6-8 1 137962041 ns/op 7.25 MB/s
Using sub-benchmarks has benefits beyond this proposal, namely that it would avoid the current repetitive code:
func BenchmarkDecodeDigitsSpeed1e4(b *testing.B) { benchmarkDecode(b, digits, speed, 1e4) }
func BenchmarkDecodeDigitsSpeed1e5(b *testing.B) { benchmarkDecode(b, digits, speed, 1e5) }
func BenchmarkDecodeDigitsSpeed1e6(b *testing.B) { benchmarkDecode(b, digits, speed, 1e6) }
func BenchmarkDecodeDigitsDefault1e4(b *testing.B) { benchmarkDecode(b, digits, default_, 1e4) }
func BenchmarkDecodeDigitsDefault1e5(b *testing.B) { benchmarkDecode(b, digits, default_, 1e5) }
func BenchmarkDecodeDigitsDefault1e6(b *testing.B) { benchmarkDecode(b, digits, default_, 1e6) }
func BenchmarkDecodeDigitsCompress1e4(b *testing.B) { benchmarkDecode(b, digits, compress, 1e4) }
func BenchmarkDecodeDigitsCompress1e5(b *testing.B) { benchmarkDecode(b, digits, compress, 1e5) }
func BenchmarkDecodeDigitsCompress1e6(b *testing.B) { benchmarkDecode(b, digits, compress, 1e6) }
func BenchmarkDecodeTwainSpeed1e4(b *testing.B) { benchmarkDecode(b, twain, speed, 1e4) }
func BenchmarkDecodeTwainSpeed1e5(b *testing.B) { benchmarkDecode(b, twain, speed, 1e5) }
func BenchmarkDecodeTwainSpeed1e6(b *testing.B) { benchmarkDecode(b, twain, speed, 1e6) }
func BenchmarkDecodeTwainDefault1e4(b *testing.B) { benchmarkDecode(b, twain, default_, 1e4) }
func BenchmarkDecodeTwainDefault1e5(b *testing.B) { benchmarkDecode(b, twain, default_, 1e5) }
func BenchmarkDecodeTwainDefault1e6(b *testing.B) { benchmarkDecode(b, twain, default_, 1e6) }
func BenchmarkDecodeTwainCompress1e4(b *testing.B) { benchmarkDecode(b, twain, compress, 1e4) }
func BenchmarkDecodeTwainCompress1e5(b *testing.B) { benchmarkDecode(b, twain, compress, 1e5) }
func BenchmarkDecodeTwainCompress1e6(b *testing.B) { benchmarkDecode(b, twain, compress, 1e6) }
More importantly for this proposal, using sub-benchmarks also makes the possible comparison axes clear: digits vs twain, speed vs default vs best, size 1e4 vs 1e5 vs 1e6.
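With sub-benchmarks, the eighteen decode functions might collapse into a single function along these lines. This is a sketch only: it relies on the Run method that the sub-benchmark work added to testing.B, the body of benchmarkDecode is a placeholder, and the string-typed text and level parameters and the subName helper are hypothetical stand-ins for the real compress/flate code. In practice the function would live in a _test.go file; it is written as package main here only so the sketch is self-contained.

```go
package main

import (
	"fmt"
	"testing"
)

// benchmarkDecode is a placeholder for the real helper, whose body is
// not shown in this document.
func benchmarkDecode(b *testing.B, text, level string, size int) {
	for i := 0; i < b.N; i++ {
		// Decode size bytes of the named text at the named level here.
	}
}

// The three comparison axes. The size name is kept as a string so the
// sub-benchmark name reads "size=1e4" rather than "size=10000".
var (
	texts  = []string{"digits", "twain"}
	levels = []string{"speed", "default", "best"}
	sizes  = []struct {
		name string
		n    int
	}{{"1e4", 1e4}, {"1e5", 1e5}, {"1e6", 1e6}}
)

// subName builds a sub-benchmark name out of key=value pairs.
func subName(text, level, size string) string {
	return "text=" + text + "/level=" + level + "/size=" + size
}

// BenchmarkDecode runs one sub-benchmark per combination of the axes.
func BenchmarkDecode(b *testing.B) {
	for _, text := range texts {
		for _, level := range levels {
			for _, size := range sizes {
				b.Run(subName(text, level, size.name), func(b *testing.B) {
					benchmarkDecode(b, text, level, size.n)
				})
			}
		}
	}
}

func main() {
	// List the names the sub-benchmarks would be reported under,
	// without the GOMAXPROCS suffix.
	for _, text := range texts {
		for _, level := range levels {
			for _, size := range sizes {
				fmt.Println("BenchmarkDecode/" + subName(text, level, size.name))
			}
		}
	}
}
```

Adding a new text, level, or size then means adding one slice element rather than several new functions.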
As discussed in the background section, we have already developed a number of analysis programs that assume this proposal's format, as well as a number of programs that generate this format. Standardizing the format should encourage additional work on both kinds of programs.
Issue 12826 suggests a different approach, namely the addition of a new go test option -benchformat, to control the format of benchmark output. In fact it gives the lack of standardization as the main justification for a new option:
Currently go test -bench . prints out benchmark results in a certain format, but there is no guarantee that this format will not change. Thus a tool that parses go test output may break if an incompatible change to the output format is made.
Our approach is instead to guarantee that the format will not change, or rather that it will only change in ways allowed by this design. An analysis tool that parses the output specified here will not break in future versions of Go, and a tool that generates the output specified here will work with all such analysis tools. Having one agreed-upon format enables broad interoperation; the ability for one tool to generate arbitrarily many different formats does not achieve the same result.
The proposed format also seems to be extensible enough to accommodate anticipated future work on benchmark reporting.
The main known issue with the current go test -bench is that we'd like to emit finer-grained detail about runs, for linearity testing and more robust statistics (see issue 10669). This proposal allows that by simply printing more result lines.
Another known issue is that we may want to add custom outputs such as garbage collector statistics to certain benchmark runs. This proposal allows that by adding more value-unit pairs.
Tools consuming existing benchmark format may need trivial changes to ignore non-benchmark result lines or to cope with additional value-unit pairs in benchmark results.
The benchmark format described here is already generated by go test -bench and expected by tools like benchcmp and benchstat.
The format is trivial to generate, and it is straightforward but not quite trivial to parse.
We anticipate that the new x/perf subrepo will include a library for loading benchmark data from files, although the format is also simple enough that tools that want a different in-memory representation might reasonably write separate parsers.