# Proposal: Go Benchmark Data Format
Authors: Russ Cox, Austin Clements
Last updated: February 2016
Discussion at [golang.org/issue/14313](https://golang.org/issue/14313).
## Abstract
We propose to make the current output of `go test -bench` the defined format for recording all Go benchmark data.
Having a defined format allows benchmark measurement programs
and benchmark analysis programs to interoperate while
evolving independently.
## Background
### Benchmark data formats
We are unaware of any standard formats for recording raw benchmark data,
and we've been unable to find any using web searches.
One might expect that a standard benchmark suite such as SPEC CPU2006 would have
defined a format for raw results, but that appears not to be the case.
The [collection of published results](https://www.spec.org/cpu2006/results/)
includes only analyzed data ([example](https://www.spec.org/cpu2006/results/res2011q3/cpu2006-20110620-17230.txt)), not raw data.
Go has a de facto standard format for benchmark data:
the lines generated by the testing package when using `go test -bench`.
For example, running compress/flate's benchmarks produces this output:
```
BenchmarkDecodeDigitsSpeed1e4-8       100    154125 ns/op  64.88 MB/s  40418 B/op    7 allocs/op
BenchmarkDecodeDigitsSpeed1e5-8        10   1367632 ns/op  73.12 MB/s  41356 B/op   14 allocs/op
BenchmarkDecodeDigitsSpeed1e6-8         1  13879794 ns/op  72.05 MB/s  52056 B/op   94 allocs/op
BenchmarkDecodeDigitsDefault1e4-8     100    147551 ns/op  67.77 MB/s  40418 B/op    8 allocs/op
BenchmarkDecodeDigitsDefault1e5-8      10   1197672 ns/op  83.50 MB/s  41508 B/op   13 allocs/op
BenchmarkDecodeDigitsDefault1e6-8       1  11808775 ns/op  84.68 MB/s  53800 B/op   80 allocs/op
BenchmarkDecodeDigitsCompress1e4-8    100    143348 ns/op  69.76 MB/s  40417 B/op    8 allocs/op
BenchmarkDecodeDigitsCompress1e5-8     10   1185527 ns/op  84.35 MB/s  41508 B/op   13 allocs/op
BenchmarkDecodeDigitsCompress1e6-8      1  11740304 ns/op  85.18 MB/s  53800 B/op   80 allocs/op
BenchmarkDecodeTwainSpeed1e4-8        100    143665 ns/op  69.61 MB/s  40849 B/op   15 allocs/op
BenchmarkDecodeTwainSpeed1e5-8         10   1390359 ns/op  71.92 MB/s  45700 B/op   31 allocs/op
BenchmarkDecodeTwainSpeed1e6-8          1  12128469 ns/op  82.45 MB/s  89336 B/op  221 allocs/op
BenchmarkDecodeTwainDefault1e4-8      100    141916 ns/op  70.46 MB/s  40849 B/op   15 allocs/op
BenchmarkDecodeTwainDefault1e5-8       10   1076669 ns/op  92.88 MB/s  43820 B/op   28 allocs/op
BenchmarkDecodeTwainDefault1e6-8        1  10106485 ns/op  98.95 MB/s  71096 B/op  172 allocs/op
BenchmarkDecodeTwainCompress1e4-8     100    138516 ns/op  72.19 MB/s  40849 B/op   15 allocs/op
BenchmarkDecodeTwainCompress1e5-8      10   1227964 ns/op  81.44 MB/s  43316 B/op   25 allocs/op
BenchmarkDecodeTwainCompress1e6-8       1  10040347 ns/op  99.60 MB/s  72120 B/op  173 allocs/op
BenchmarkEncodeDigitsSpeed1e4-8        30    482808 ns/op  20.71 MB/s
BenchmarkEncodeDigitsSpeed1e5-8         5   2685455 ns/op  37.24 MB/s
BenchmarkEncodeDigitsSpeed1e6-8         1  24966055 ns/op  40.05 MB/s
BenchmarkEncodeDigitsDefault1e4-8      20    655592 ns/op  15.25 MB/s
BenchmarkEncodeDigitsDefault1e5-8       1  13000839 ns/op   7.69 MB/s
BenchmarkEncodeDigitsDefault1e6-8       1 136341747 ns/op   7.33 MB/s
BenchmarkEncodeDigitsCompress1e4-8     20    668083 ns/op  14.97 MB/s
BenchmarkEncodeDigitsCompress1e5-8      1  12301511 ns/op   8.13 MB/s
BenchmarkEncodeDigitsCompress1e6-8      1 137962041 ns/op   7.25 MB/s
```
The testing package always reports ns/op, and each benchmark can request the addition of MB/s (throughput) and also B/op and allocs/op (allocation rates).
### Benchmark processors
Multiple tools have been written that process this format,
most notably [benchcmp](https://godoc.org/golang.org/x/tools/cmd/benchcmp)
and its more statistically valid successor [benchstat](https://godoc.org/rsc.io/benchstat).
There is also [benchmany](https://godoc.org/github.com/aclements/go-misc/benchmany)'s plot subcommand
and likely more unpublished programs.
### Benchmark runners
Multiple tools have also been written that generate this format.
In addition to the standard Go testing package,
[compilebench](https://godoc.org/rsc.io/compilebench)
generates this data format based on runs of the Go compiler,
and Austin's unpublished shellbench generates this data format
after running an arbitrary shell command.
The [golang.org/x/benchmarks/bench](https://golang.org/x/benchmarks/bench) benchmarks
are notable for _not_ generating this format,
which has made all analysis of those results
more complex than we believe it should be.
We intend to update those benchmarks to generate the standard format,
once a standard format is defined.
Part of the motivation for the proposal is to avoid
the need to process custom output formats in future benchmarks.
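The format is also cheap for a runner to emit. As an illustration (not any published tool's code; `formatResult` and the timed loop are invented for this sketch), a minimal shellbench-style generator could print result lines like so:

```go
package main

import (
	"fmt"
	"time"
)

// formatResult renders one result line in the de facto format:
// name, iteration count, then value/unit pairs. The "Benchmark"
// prefix is what marks the line as a benchmark result.
func formatResult(name string, n int, nsPerOp float64) string {
	return fmt.Sprintf("Benchmark%s\t%d\t%.0f ns/op", name, n, nsPerOp)
}

func main() {
	// Time a trivial operation n times, the way a simple runner might.
	n := 1000
	start := time.Now()
	for i := 0; i < n; i++ {
		_ = fmt.Sprintf("%d", i)
	}
	elapsed := time.Since(start)
	fmt.Println(formatResult("SprintfInt", n, float64(elapsed.Nanoseconds())/float64(n)))
}
```

Any analysis tool that understands the standard format can then consume this output directly.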
## Proposal
A Go benchmark data file is a UTF-8 textual file consisting of a sequence of lines.
Configuration lines and benchmark result lines, described below,
have semantic meaning in the reporting of benchmark results.
All other lines in the data file, including but not limited to
blank lines and lines beginning with a # character, are ignored.
For example, the testing package prints test results above benchmark data,
usually the text `PASS`. That line is neither a configuration line nor a benchmark
result line, so it is ignored.
### Configuration Lines
A configuration line is a key-value pair of the form
```
key: value
```
where key begins with a lower case character (as defined by `unicode.IsLower`),
contains no space characters (as defined by `unicode.IsSpace`)
nor upper case characters (as defined by `unicode.IsUpper`),
and one or more ASCII space or tab characters separate “key:” from “value.”
Conventionally, multiword keys are written with the words
separated by hyphens, as in cpu-speed.
There are no restrictions on value, except that it cannot contain a newline character.
Value can be omitted entirely, in which case the colon must still be
present, but need not be followed by a space.
The interpretation of a key/value pair is up to tooling, but the key/value pair
is considered to describe all benchmark results that follow,
until overwritten by a configuration line with the same key.
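These rules translate directly into code. A sketch of a configuration-line parser, assuming the rules above (the function name `parseConfigLine` is ours, not part of any proposed API):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// parseConfigLine reports whether line is a configuration line and,
// if so, returns its key and value.
func parseConfigLine(line string) (key, value string, ok bool) {
	i := strings.Index(line, ":")
	if i <= 0 {
		return "", "", false // no colon, or empty key
	}
	key = line[:i]
	for j, r := range key {
		if j == 0 && !unicode.IsLower(r) {
			return "", "", false // key must begin with a lower case character
		}
		if unicode.IsSpace(r) || unicode.IsUpper(r) {
			return "", "", false // no spaces or upper case in keys
		}
	}
	rest := line[i+1:]
	value = strings.TrimLeft(rest, " \t")
	if value != "" && len(value) == len(rest) {
		return "", "", false // a non-empty value needs a space or tab after the colon
	}
	return key, value, true
}

func main() {
	for _, line := range []string{
		"commit: 7cd9055",
		"empty-value:",
		"PASS", // ignored: not a configuration line
	} {
		if k, v, ok := parseConfigLine(line); ok {
			fmt.Printf("key=%q value=%q\n", k, v)
		}
	}
}
```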
### Benchmark Results
A benchmark result line has the general form
```
<name> <iterations> <value> <unit> [<value> <unit>...]
```
The fields are separated by runs of space characters (as defined by `unicode.IsSpace`),
so the line can be parsed with `strings.Fields`.
The line must have an even number of fields, and at least four.
The first field is the benchmark name, which must begin with `Benchmark`
followed by an upper case character (as defined by `unicode.IsUpper`)
or the end of the field,
as in `BenchmarkReverseString` or just `Benchmark`.
Tools displaying benchmark data conventionally omit the `Benchmark` prefix.
The same benchmark name can appear on multiple result lines,
indicating that the benchmark was run multiple times.
The second field gives the number of iterations run.
For most processing this number can be ignored, although
it may give some indication of the expected accuracy
of the measurements that follow.
The remaining fields report value/unit pairs in which the value
is a float64 that can be parsed by `strconv.ParseFloat`
and the unit explains the value, as in “64.88 MB/s”.
The units reported are typically normalized so that they can be
interpreted without reference to the number of iterations.
In the example, the CPU cost is reported per-operation and the
throughput is reported per-second; neither is a total that
depends on the number of iterations.
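Putting the rules together, a result line can be parsed with `strings.Fields` and `strconv.ParseFloat` as described. In this sketch, the `Result` type and `parseResultLine` are illustrative names, not a proposed API; the value/unit pairs are keyed by unit:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
	"unicode/utf8"
)

// Result holds one parsed benchmark result line.
type Result struct {
	Name       string
	Iterations int
	Values     map[string]float64 // unit -> value, e.g. "ns/op" -> 154125
}

// parseResultLine reports whether line is a benchmark result line and,
// if so, returns the parsed result.
func parseResultLine(line string) (*Result, bool) {
	f := strings.Fields(line)
	if len(f) < 4 || len(f)%2 != 0 {
		return nil, false // even number of fields, at least four
	}
	name := f[0]
	if !strings.HasPrefix(name, "Benchmark") {
		return nil, false
	}
	if rest := name[len("Benchmark"):]; rest != "" {
		if r, _ := utf8.DecodeRuneInString(rest); !unicode.IsUpper(r) {
			return nil, false // "Benchmark" must be followed by upper case or end
		}
	}
	n, err := strconv.Atoi(f[1])
	if err != nil {
		return nil, false
	}
	res := &Result{Name: name, Iterations: n, Values: make(map[string]float64)}
	for i := 2; i < len(f); i += 2 {
		v, err := strconv.ParseFloat(f[i], 64)
		if err != nil {
			return nil, false
		}
		res.Values[f[i+1]] = v
	}
	return res, true
}

func main() {
	r, ok := parseResultLine("BenchmarkDecodeDigitsSpeed1e4-8 100 154125 ns/op 64.88 MB/s 40418 B/op 7 allocs/op")
	fmt.Println(ok, r.Name, r.Iterations, r.Values["ns/op"], r.Values["MB/s"])
}
```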
### Value Units
A value's unit string is expected to specify not only the measurement unit
but also, as needed, a description of what is being measured.
For example, a benchmark might report its overall execution time
as well as cache miss times with three units “ns/op,” “L1-miss-ns/op,” and “L2-miss-ns/op.”
Tooling can expect that the unit strings are identical for all runs to be compared;
for example, a result reporting “ns/op” need not be considered comparable
to one reporting “µs/op.”
However, tooling may assume that the measurement unit is the final
hyphen-separated word of the unit string and may recognize
and rescale known measurement units.
For example, consistently large “ns/op” or “L1-miss-ns/op”
might be rescaled to “ms/op” or “L1-miss-ms/op” for display.
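A sketch of that rescaling convention (the helper names `measurementUnit` and `rescale` are invented for illustration): the measurement unit is taken as the final hyphen-separated word, and consistently large ns/op-style values are rescaled for display:

```go
package main

import (
	"fmt"
	"strings"
)

// measurementUnit returns the final hyphen-separated word of a unit
// string, e.g. "ns/op" for "L1-miss-ns/op".
func measurementUnit(unit string) string {
	if i := strings.LastIndex(unit, "-"); i >= 0 {
		return unit[i+1:]
	}
	return unit
}

// rescale converts a large ns/op-style value to ms/op for display,
// preserving any descriptive prefix such as "L1-miss-".
func rescale(value float64, unit string) (float64, string) {
	if measurementUnit(unit) == "ns/op" && value >= 1e6 {
		return value / 1e6, strings.TrimSuffix(unit, "ns/op") + "ms/op"
	}
	return value, unit
}

func main() {
	fmt.Println(rescale(13879794, "ns/op"))         // rescaled to ms/op
	fmt.Println(rescale(13879794, "L1-miss-ns/op")) // rescaled to L1-miss-ms/op
	fmt.Println(rescale(64.88, "MB/s"))             // unchanged
}
```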
### Benchmark Name Configuration
In the current testing package, benchmark names correspond to Go identifiers:
each benchmark must be written as a different Go function.
[Work targeted for Go 1.7](https://github.com/golang/proposal/blob/master/design/12166-subtests.md) will allow tests and benchmarks
to define sub-tests and sub-benchmarks programmatically,
in particular to vary interesting parameters both when
testing and when benchmarking.
That work uses a slash to separate the name of a benchmark
collection from the description of a sub-benchmark.
We propose that sub-benchmarks adopt the convention of
choosing names that are key=value pairs,
and that benchmark data processors treat slash-prefixed
key=value pairs in the benchmark name as per-benchmark
configuration values.
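Under that convention, a processor can recover per-benchmark configuration from the name itself. A sketch (the helper name `nameConfig` is invented; the testing package's GOMAXPROCS suffix, e.g. `-8`, is left attached to the final value here for simplicity):

```go
package main

import (
	"fmt"
	"strings"
)

// nameConfig splits a benchmark name into its base name and the
// per-benchmark configuration implied by slash-prefixed key=value pairs.
// Slash components without an "=" are simply skipped.
func nameConfig(name string) (base string, config map[string]string) {
	parts := strings.Split(name, "/")
	base = parts[0]
	config = make(map[string]string)
	for _, part := range parts[1:] {
		if i := strings.Index(part, "="); i >= 0 {
			config[part[:i]] = part[i+1:]
		}
	}
	return base, config
}

func main() {
	base, cfg := nameConfig("BenchmarkDecode/text=digits/level=speed/size=1e4")
	fmt.Println(base, cfg["text"], cfg["level"], cfg["size"])
}
```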
### Example
The benchmark output given in the background section above
is already in the format proposed here.
That is a key feature of the proposal.
However, a future run of the benchmark might add configuration lines,
and the benchmark might be rewritten to use sub-benchmarks,
producing this output:
```
commit: 7cd9055
commit-time: 2016-02-11T13:25:45-0500
goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
cpu-count: 8
cpu-physical-count: 4
os: Mac OS X 10.11.3
mem: 16 GB

BenchmarkDecode/text=digits/level=speed/size=1e4-8     100    154125 ns/op  64.88 MB/s  40418 B/op    7 allocs/op
BenchmarkDecode/text=digits/level=speed/size=1e5-8      10   1367632 ns/op  73.12 MB/s  41356 B/op   14 allocs/op
BenchmarkDecode/text=digits/level=speed/size=1e6-8       1  13879794 ns/op  72.05 MB/s  52056 B/op   94 allocs/op
BenchmarkDecode/text=digits/level=default/size=1e4-8   100    147551 ns/op  67.77 MB/s  40418 B/op    8 allocs/op
BenchmarkDecode/text=digits/level=default/size=1e5-8    10   1197672 ns/op  83.50 MB/s  41508 B/op   13 allocs/op
BenchmarkDecode/text=digits/level=default/size=1e6-8     1  11808775 ns/op  84.68 MB/s  53800 B/op   80 allocs/op
BenchmarkDecode/text=digits/level=best/size=1e4-8      100    143348 ns/op  69.76 MB/s  40417 B/op    8 allocs/op
BenchmarkDecode/text=digits/level=best/size=1e5-8       10   1185527 ns/op  84.35 MB/s  41508 B/op   13 allocs/op
BenchmarkDecode/text=digits/level=best/size=1e6-8        1  11740304 ns/op  85.18 MB/s  53800 B/op   80 allocs/op
BenchmarkDecode/text=twain/level=speed/size=1e4-8      100    143665 ns/op  69.61 MB/s  40849 B/op   15 allocs/op
BenchmarkDecode/text=twain/level=speed/size=1e5-8       10   1390359 ns/op  71.92 MB/s  45700 B/op   31 allocs/op
BenchmarkDecode/text=twain/level=speed/size=1e6-8        1  12128469 ns/op  82.45 MB/s  89336 B/op  221 allocs/op
BenchmarkDecode/text=twain/level=default/size=1e4-8    100    141916 ns/op  70.46 MB/s  40849 B/op   15 allocs/op
BenchmarkDecode/text=twain/level=default/size=1e5-8     10   1076669 ns/op  92.88 MB/s  43820 B/op   28 allocs/op
BenchmarkDecode/text=twain/level=default/size=1e6-8      1  10106485 ns/op  98.95 MB/s  71096 B/op  172 allocs/op
BenchmarkDecode/text=twain/level=best/size=1e4-8       100    138516 ns/op  72.19 MB/s  40849 B/op   15 allocs/op
BenchmarkDecode/text=twain/level=best/size=1e5-8        10   1227964 ns/op  81.44 MB/s  43316 B/op   25 allocs/op
BenchmarkDecode/text=twain/level=best/size=1e6-8         1  10040347 ns/op  99.60 MB/s  72120 B/op  173 allocs/op
BenchmarkEncode/text=digits/level=speed/size=1e4-8      30    482808 ns/op  20.71 MB/s
BenchmarkEncode/text=digits/level=speed/size=1e5-8       5   2685455 ns/op  37.24 MB/s
BenchmarkEncode/text=digits/level=speed/size=1e6-8       1  24966055 ns/op  40.05 MB/s
BenchmarkEncode/text=digits/level=default/size=1e4-8    20    655592 ns/op  15.25 MB/s
BenchmarkEncode/text=digits/level=default/size=1e5-8     1  13000839 ns/op   7.69 MB/s
BenchmarkEncode/text=digits/level=default/size=1e6-8     1 136341747 ns/op   7.33 MB/s
BenchmarkEncode/text=digits/level=best/size=1e4-8       20    668083 ns/op  14.97 MB/s
BenchmarkEncode/text=digits/level=best/size=1e5-8        1  12301511 ns/op   8.13 MB/s
BenchmarkEncode/text=digits/level=best/size=1e6-8        1 137962041 ns/op   7.25 MB/s
```
Using sub-benchmarks has benefits beyond this proposal, namely that it would
avoid the current repetitive code:
```go
func BenchmarkDecodeDigitsSpeed1e4(b *testing.B)   { benchmarkDecode(b, digits, speed, 1e4) }
func BenchmarkDecodeDigitsSpeed1e5(b *testing.B)   { benchmarkDecode(b, digits, speed, 1e5) }
func BenchmarkDecodeDigitsSpeed1e6(b *testing.B)   { benchmarkDecode(b, digits, speed, 1e6) }
func BenchmarkDecodeDigitsDefault1e4(b *testing.B) { benchmarkDecode(b, digits, default_, 1e4) }
func BenchmarkDecodeDigitsDefault1e5(b *testing.B) { benchmarkDecode(b, digits, default_, 1e5) }
func BenchmarkDecodeDigitsDefault1e6(b *testing.B) { benchmarkDecode(b, digits, default_, 1e6) }
func BenchmarkDecodeDigitsCompress1e4(b *testing.B) { benchmarkDecode(b, digits, compress, 1e4) }
func BenchmarkDecodeDigitsCompress1e5(b *testing.B) { benchmarkDecode(b, digits, compress, 1e5) }
func BenchmarkDecodeDigitsCompress1e6(b *testing.B) { benchmarkDecode(b, digits, compress, 1e6) }
func BenchmarkDecodeTwainSpeed1e4(b *testing.B)    { benchmarkDecode(b, twain, speed, 1e4) }
func BenchmarkDecodeTwainSpeed1e5(b *testing.B)    { benchmarkDecode(b, twain, speed, 1e5) }
func BenchmarkDecodeTwainSpeed1e6(b *testing.B)    { benchmarkDecode(b, twain, speed, 1e6) }
func BenchmarkDecodeTwainDefault1e4(b *testing.B)  { benchmarkDecode(b, twain, default_, 1e4) }
func BenchmarkDecodeTwainDefault1e5(b *testing.B)  { benchmarkDecode(b, twain, default_, 1e5) }
func BenchmarkDecodeTwainDefault1e6(b *testing.B)  { benchmarkDecode(b, twain, default_, 1e6) }
func BenchmarkDecodeTwainCompress1e4(b *testing.B) { benchmarkDecode(b, twain, compress, 1e4) }
func BenchmarkDecodeTwainCompress1e5(b *testing.B) { benchmarkDecode(b, twain, compress, 1e5) }
func BenchmarkDecodeTwainCompress1e6(b *testing.B) { benchmarkDecode(b, twain, compress, 1e6) }
```
More importantly for this proposal, using sub-benchmarks also makes the possible
comparison axes clear: digits vs twain, speed vs default vs best, size 1e4 vs 1e5 vs 1e6.
## Rationale
As discussed in the background section,
we have already developed a number of analysis programs
that assume this proposal's format,
as well as a number of programs that generate this format.
Standardizing the format should encourage additional work
on both kinds of programs.
[Issue 12826](https://golang.org/issue/12826) suggests a different approach,
namely the addition of a new `go test` option `-benchformat`, to control
the format of benchmark output. In fact it gives the lack of standardization
as the main justification for a new option:
> Currently `go test -bench .` prints out benchmark results in a
> certain format, but there is no guarantee that this format will not
> change. Thus a tool that parses go test output may break if an
> incompatible change to the output format is made.
Our approach is instead to guarantee that the format will not change,
or rather that it will only change in ways allowed by this design.
An analysis tool that parses the output specified here will not break
in future versions of Go,
and a tool that generates the output specified here will work
with all such analysis tools.
Having one agreed-upon format enables broad interoperation;
the ability for one tool to generate arbitrarily many different formats
does not achieve the same result.
The proposed format also seems to be extensible enough to accommodate
anticipated future work on benchmark reporting.
The main known issue with the current `go test -bench` output is that
we'd like to emit finer-grained detail about runs, for linearity testing
and more robust statistics (see [issue 10669](https://golang.org/issue/10669)).
This proposal allows that by simply printing more result lines.
Another known issue is that we may want to add custom outputs
such as garbage collector statistics to certain benchmark runs.
This proposal allows that by adding more value-unit pairs.
## Compatibility
Tools consuming the existing benchmark format may need trivial changes
to ignore non-benchmark result lines or to cope with additional value-unit pairs
in benchmark results.
## Implementation
The benchmark format described here is already generated by `go test -bench`
and expected by tools like `benchcmp` and `benchstat`.
The format is trivial to generate, and it is
straightforward but not quite trivial to parse.
We anticipate that the [new x/perf subrepo](https://github.com/golang/go/issues/14304) will include a library for loading
benchmark data from files, although the format is also simple enough that
tools that want a different in-memory representation might reasonably
write separate parsers.