| # Proposal: Go Benchmark Data Format |
| |
| Authors: Russ Cox, Austin Clements |
| |
| Last updated: February 2016 |
| |
| Discussion at [golang.org/issue/14313](https://golang.org/issue/14313). |
| |
| ## Abstract |
| |
| We propose to make the current output of `go test -bench` the defined format for recording all Go benchmark data. |
| Having a defined format allows benchmark measurement programs |
| and benchmark analysis programs to interoperate while |
| evolving independently. |
| |
| ## Background |
| |
| ### Benchmark data formats |
| |
| We are unaware of any standard formats for recording raw benchmark data, |
| and we've been unable to find any using web searches. |
| One might expect that a standard benchmark suite such as SPEC CPU2006 would have |
| defined a format for raw results, but that appears not to be the case. |
| The [collection of published results](https://www.spec.org/cpu2006/results/) |
| includes only analyzed data ([example](https://www.spec.org/cpu2006/results/res2011q3/cpu2006-20110620-17230.txt)), not raw data. |
| |
| Go has a de facto standard format for benchmark data: |
| the lines generated by the testing package when using `go test -bench`. |
| For example, running compress/flate's benchmarks produces this output: |
| |
| BenchmarkDecodeDigitsSpeed1e4-8 100 154125 ns/op 64.88 MB/s 40418 B/op 7 allocs/op |
| BenchmarkDecodeDigitsSpeed1e5-8 10 1367632 ns/op 73.12 MB/s 41356 B/op 14 allocs/op |
| BenchmarkDecodeDigitsSpeed1e6-8 1 13879794 ns/op 72.05 MB/s 52056 B/op 94 allocs/op |
| BenchmarkDecodeDigitsDefault1e4-8 100 147551 ns/op 67.77 MB/s 40418 B/op 8 allocs/op |
| BenchmarkDecodeDigitsDefault1e5-8 10 1197672 ns/op 83.50 MB/s 41508 B/op 13 allocs/op |
| BenchmarkDecodeDigitsDefault1e6-8 1 11808775 ns/op 84.68 MB/s 53800 B/op 80 allocs/op |
| BenchmarkDecodeDigitsCompress1e4-8 100 143348 ns/op 69.76 MB/s 40417 B/op 8 allocs/op |
| BenchmarkDecodeDigitsCompress1e5-8 10 1185527 ns/op 84.35 MB/s 41508 B/op 13 allocs/op |
| BenchmarkDecodeDigitsCompress1e6-8 1 11740304 ns/op 85.18 MB/s 53800 B/op 80 allocs/op |
| BenchmarkDecodeTwainSpeed1e4-8 100 143665 ns/op 69.61 MB/s 40849 B/op 15 allocs/op |
| BenchmarkDecodeTwainSpeed1e5-8 10 1390359 ns/op 71.92 MB/s 45700 B/op 31 allocs/op |
| BenchmarkDecodeTwainSpeed1e6-8 1 12128469 ns/op 82.45 MB/s 89336 B/op 221 allocs/op |
| BenchmarkDecodeTwainDefault1e4-8 100 141916 ns/op 70.46 MB/s 40849 B/op 15 allocs/op |
| BenchmarkDecodeTwainDefault1e5-8 10 1076669 ns/op 92.88 MB/s 43820 B/op 28 allocs/op |
| BenchmarkDecodeTwainDefault1e6-8 1 10106485 ns/op 98.95 MB/s 71096 B/op 172 allocs/op |
| BenchmarkDecodeTwainCompress1e4-8 100 138516 ns/op 72.19 MB/s 40849 B/op 15 allocs/op |
| BenchmarkDecodeTwainCompress1e5-8 10 1227964 ns/op 81.44 MB/s 43316 B/op 25 allocs/op |
| BenchmarkDecodeTwainCompress1e6-8 1 10040347 ns/op 99.60 MB/s 72120 B/op 173 allocs/op |
| BenchmarkEncodeDigitsSpeed1e4-8 30 482808 ns/op 20.71 MB/s |
| BenchmarkEncodeDigitsSpeed1e5-8 5 2685455 ns/op 37.24 MB/s |
| BenchmarkEncodeDigitsSpeed1e6-8 1 24966055 ns/op 40.05 MB/s |
| BenchmarkEncodeDigitsDefault1e4-8 20 655592 ns/op 15.25 MB/s |
| BenchmarkEncodeDigitsDefault1e5-8 1 13000839 ns/op 7.69 MB/s |
| BenchmarkEncodeDigitsDefault1e6-8 1 136341747 ns/op 7.33 MB/s |
| BenchmarkEncodeDigitsCompress1e4-8 20 668083 ns/op 14.97 MB/s |
| BenchmarkEncodeDigitsCompress1e5-8 1 12301511 ns/op 8.13 MB/s |
| BenchmarkEncodeDigitsCompress1e6-8 1 137962041 ns/op 7.25 MB/s |
| |
The testing package always reports ns/op; each benchmark can additionally request MB/s (throughput) as well as B/op and allocs/op (allocation statistics).
| |
| ### Benchmark processors |
| |
| Multiple tools have been written that process this format, |
| most notably [benchcmp](https://godoc.org/golang.org/x/tools/cmd/benchcmp) |
| and its more statistically valid successor [benchstat](https://godoc.org/rsc.io/benchstat). |
| There is also [benchmany](https://godoc.org/github.com/aclements/go-misc/benchmany)'s plot subcommand |
| and likely more unpublished programs. |
| |
| ### Benchmark runners |
| |
| Multiple tools have also been written that generate this format. |
| In addition to the standard Go testing package, |
| [compilebench](https://godoc.org/rsc.io/compilebench) |
| generates this data format based on runs of the Go compiler, |
| and Austin's unpublished shellbench generates this data format |
| after running an arbitrary shell command. |
| |
| The [golang.org/x/benchmarks/bench](https://golang.org/x/benchmarks/bench) benchmarks |
| are notable for _not_ generating this format, |
| which has made all analysis of those results |
| more complex than we believe it should be. |
| We intend to update those benchmarks to generate the standard format, |
| once a standard format is defined. |
| Part of the motivation for the proposal is to avoid |
| the need to process custom output formats in future benchmarks. |
| |
| ## Proposal |
| |
| A Go benchmark data file is a UTF-8 textual file consisting of a sequence of lines. |
| Configuration lines and benchmark result lines, described below, |
| have semantic meaning in the reporting of benchmark results. |
| |
| All other lines in the data file, including but not limited to |
| blank lines and lines beginning with a # character, are ignored. |
| For example, the testing package prints test results above benchmark data, |
| usually the text `PASS`. That line is neither a configuration line nor a benchmark |
| result line, so it is ignored. |
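
To make this concrete, here is a minimal, non-normative sketch of a reader that dispatches on the two semantic line types and ignores everything else. The two checks are deliberately crude, ASCII-only simplifications of the rules defined in the next two sections:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        scanner := bufio.NewScanner(os.Stdin)
        for scanner.Scan() {
            line := scanner.Text()
            switch {
            case strings.HasPrefix(line, "Benchmark"):
                fmt.Println("result line:", line) // see Benchmark Results below
            case len(line) > 0 && line[0] >= 'a' && line[0] <= 'z' && strings.Contains(line, ":"):
                fmt.Println("configuration line:", line) // see Configuration Lines below
            default:
                // Ignored: blank lines, PASS, lines beginning with #, and so on.
            }
        }
    }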
| |
| ### Configuration Lines |
| |
| A configuration line is a key-value pair of the form |
| |
| key: value |
| |
| where key begins with a lower case character (as defined by `unicode.IsLower`), |
contains no space characters (as defined by `unicode.IsSpace`)
or upper case characters (as defined by `unicode.IsUpper`),
| and one or more ASCII space or tab characters separate “key:” from “value.” |
| Conventionally, multiword keys are written with the words |
| separated by hyphens, as in cpu-speed. |
| There are no restrictions on value, except that it cannot contain a newline character. |
| Value can be omitted entirely, in which case the colon must still be |
| present, but need not be followed by a space. |
| |
| The interpretation of a key/value pair is up to tooling, but the key/value pair |
| is considered to describe all benchmark results that follow, |
| until overwritten by a configuration line with the same key. |
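
As a sketch of how tooling might apply these rules, the following function parses one configuration line; the function name and signature are illustrative, not part of the proposal:

    import (
        "strings"
        "unicode"
        "unicode/utf8"
    )

    // parseConfigLine splits a configuration line into its key and value.
    // It reports ok=false if the line is not a configuration line.
    func parseConfigLine(line string) (key, value string, ok bool) {
        i := strings.Index(line, ":")
        if i < 1 {
            return "", "", false
        }
        key = line[:i]
        if r, _ := utf8.DecodeRuneInString(key); !unicode.IsLower(r) {
            return "", "", false
        }
        for _, r := range key {
            if unicode.IsUpper(r) || unicode.IsSpace(r) {
                return "", "", false
            }
        }
        // The value is whatever follows the colon and any ASCII spaces
        // or tabs; it may be empty.
        value = strings.TrimLeft(line[i+1:], " \t")
        return key, value, true
    }

For example, `parseConfigLine("commit: 7cd9055")` returns `("commit", "7cd9055", true)`.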
| |
| ### Benchmark Results |
| |
| A benchmark result line has the general form |
| |
| <name> <iterations> <value> <unit> [<value> <unit>...] |
| |
| The fields are separated by runs of space characters (as defined by `unicode.IsSpace`), |
| so the line can be parsed with `strings.Fields`. |
| The line must have an even number of fields, and at least four. |
| |
| The first field is the benchmark name, which must begin with `Benchmark` |
| followed by an upper case character (as defined by `unicode.IsUpper`) |
| or the end of the field, |
| as in `BenchmarkReverseString` or just `Benchmark`. |
| Tools displaying benchmark data conventionally omit the `Benchmark` prefix. |
| The same benchmark name can appear on multiple result lines, |
| indicating that the benchmark was run multiple times. |
| |
| The second field gives the number of iterations run. |
| For most processing this number can be ignored, although |
| it may give some indication of the expected accuracy |
| of the measurements that follow. |
| |
| The remaining fields report value/unit pairs in which the value |
| is a float64 that can be parsed by `strconv.ParseFloat` |
| and the unit explains the value, as in “64.88 MB/s”. |
The units reported are typically normalized so that they can be
interpreted without reference to the number of iterations.
| In the example, the CPU cost is reported per-operation and the |
| throughput is reported per-second; neither is a total that |
| depends on the number of iterations. |
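
A corresponding sketch for parsing one result line, again with illustrative names, might look like this:

    import (
        "fmt"
        "strconv"
        "strings"
        "unicode"
        "unicode/utf8"
    )

    // parseResultLine parses a benchmark result line into its name,
    // iteration count, and value/unit pairs.
    func parseResultLine(line string) (name string, iters int, values map[string]float64, err error) {
        f := strings.Fields(line)
        if len(f) < 4 || len(f)%2 != 0 {
            return "", 0, nil, fmt.Errorf("wrong number of fields: %q", line)
        }
        name = f[0]
        const prefix = "Benchmark"
        if !strings.HasPrefix(name, prefix) {
            return "", 0, nil, fmt.Errorf("not a benchmark name: %q", name)
        }
        if rest := name[len(prefix):]; rest != "" {
            if r, _ := utf8.DecodeRuneInString(rest); !unicode.IsUpper(r) {
                return "", 0, nil, fmt.Errorf("not a benchmark name: %q", name)
            }
        }
        if iters, err = strconv.Atoi(f[1]); err != nil {
            return "", 0, nil, fmt.Errorf("bad iteration count: %q", f[1])
        }
        values = make(map[string]float64)
        for i := 2; i < len(f); i += 2 {
            v, err := strconv.ParseFloat(f[i], 64)
            if err != nil {
                return "", 0, nil, fmt.Errorf("bad value: %q", f[i])
            }
            values[f[i+1]] = v // e.g. values["ns/op"] = 154125
        }
        return name, iters, values, nil
    }

Keying the map by unit assumes each unit appears at most once per line, which is true of the testing package's output.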
| |
| ### Value Units |
| |
| A value's unit string is expected to specify not only the measurement unit |
| but also, as needed, a description of what is being measured. |
| For example, a benchmark might report its overall execution time |
as well as cache miss times with three units “ns/op,” “L1-miss-ns/op,” and “L2-miss-ns/op.”
| |
| Tooling can expect that the unit strings are identical for all runs to be compared; |
| for example, a result reporting “ns/op” need not be considered comparable |
| to one reporting “µs/op.” |
| |
However, tooling may assume that the measurement unit is the final
hyphen-separated word of the unit string and may recognize
and rescale known measurement units.
| For example, consistently large “ns/op” or “L1-miss-ns/op” |
| might be rescaled to “ms/op” or “L1-miss-ms/op” for display. |
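
For instance, a display tool might rescale nanosecond measurements along these lines; the function name and the rescaling threshold are arbitrary choices for the sketch:

    import "strings"

    // rescale converts consistently large "ns/op" values, including
    // prefixed variants such as "L1-miss-ns/op", to milliseconds.
    func rescale(value float64, unit string) (float64, string) {
        prefix, meas := "", unit
        if i := strings.LastIndex(unit, "-"); i >= 0 {
            prefix, meas = unit[:i+1], unit[i+1:] // e.g. "L1-miss-", "ns/op"
        }
        if meas == "ns/op" && value >= 1e6 {
            return value / 1e6, prefix + "ms/op"
        }
        return value, unit
    }

Here `rescale(13879794, "ns/op")` returns `13.879794, "ms/op"`.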
| |
| ### Benchmark Name Configuration |
| |
| In the current testing package, benchmark names correspond to Go identifiers: |
| each benchmark must be written as a different Go function. |
| [Work targeted for Go 1.7](https://github.com/golang/proposal/blob/master/design/12166-subtests.md) will allow tests and benchmarks |
to define sub-tests and sub-benchmarks programmatically,
| in particular to vary interesting parameters both when |
| testing and when benchmarking. |
| That work uses a slash to separate the name of a benchmark |
| collection from the description of a sub-benchmark. |
| |
We propose that sub-benchmarks adopt the convention of
using key=value pairs as their names,
and that benchmark data processors treat the slash-prefixed
key=value pairs in a benchmark name as per-benchmark
configuration values.
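
Under that convention, a processor could recover per-benchmark configuration from a name with a sketch like the following. Note that it first strips a trailing GOMAXPROCS suffix such as "-8", which the testing package appends to the full name; the helper name is illustrative:

    import "strings"

    // splitNameConfig splits a sub-benchmark name such as
    // "BenchmarkDecode/text=digits/level=speed/size=1e4-8" into its
    // base name and key=value configuration pairs.
    func splitNameConfig(name string) (base string, config map[string]string) {
        // Strip a trailing GOMAXPROCS suffix like "-8", if present.
        if i := strings.LastIndex(name, "-"); i >= 0 {
            if n := name[i+1:]; n != "" && strings.Trim(n, "0123456789") == "" {
                name = name[:i]
            }
        }
        parts := strings.Split(name, "/")
        base = parts[0]
        config = make(map[string]string)
        for _, part := range parts[1:] {
            if i := strings.Index(part, "="); i >= 0 {
                config[part[:i]] = part[i+1:]
            }
        }
        return base, config
    }

For the name above, the sketch returns base "BenchmarkDecode" and the pairs text=digits, level=speed, size=1e4.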
| |
| ### Example |
| |
| The benchmark output given in the background section above |
| is already in the format proposed here. |
| That is a key feature of the proposal. |
| |
| However, a future run of the benchmark might add configuration lines, |
| and the benchmark might be rewritten to use sub-benchmarks, |
| producing this output: |
| |
| commit: 7cd9055 |
| commit-time: 2016-02-11T13:25:45-0500 |
| goos: darwin |
| goarch: amd64 |
| cpu: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz |
| cpu-count: 8 |
| cpu-physical-count: 4 |
| os: Mac OS X 10.11.3 |
| mem: 16 GB |
| |
| BenchmarkDecode/text=digits/level=speed/size=1e4-8 100 154125 ns/op 64.88 MB/s 40418 B/op 7 allocs/op |
| BenchmarkDecode/text=digits/level=speed/size=1e5-8 10 1367632 ns/op 73.12 MB/s 41356 B/op 14 allocs/op |
| BenchmarkDecode/text=digits/level=speed/size=1e6-8 1 13879794 ns/op 72.05 MB/s 52056 B/op 94 allocs/op |
| BenchmarkDecode/text=digits/level=default/size=1e4-8 100 147551 ns/op 67.77 MB/s 40418 B/op 8 allocs/op |
| BenchmarkDecode/text=digits/level=default/size=1e5-8 10 1197672 ns/op 83.50 MB/s 41508 B/op 13 allocs/op |
| BenchmarkDecode/text=digits/level=default/size=1e6-8 1 11808775 ns/op 84.68 MB/s 53800 B/op 80 allocs/op |
| BenchmarkDecode/text=digits/level=best/size=1e4-8 100 143348 ns/op 69.76 MB/s 40417 B/op 8 allocs/op |
| BenchmarkDecode/text=digits/level=best/size=1e5-8 10 1185527 ns/op 84.35 MB/s 41508 B/op 13 allocs/op |
| BenchmarkDecode/text=digits/level=best/size=1e6-8 1 11740304 ns/op 85.18 MB/s 53800 B/op 80 allocs/op |
| BenchmarkDecode/text=twain/level=speed/size=1e4-8 100 143665 ns/op 69.61 MB/s 40849 B/op 15 allocs/op |
| BenchmarkDecode/text=twain/level=speed/size=1e5-8 10 1390359 ns/op 71.92 MB/s 45700 B/op 31 allocs/op |
| BenchmarkDecode/text=twain/level=speed/size=1e6-8 1 12128469 ns/op 82.45 MB/s 89336 B/op 221 allocs/op |
| BenchmarkDecode/text=twain/level=default/size=1e4-8 100 141916 ns/op 70.46 MB/s 40849 B/op 15 allocs/op |
| BenchmarkDecode/text=twain/level=default/size=1e5-8 10 1076669 ns/op 92.88 MB/s 43820 B/op 28 allocs/op |
| BenchmarkDecode/text=twain/level=default/size=1e6-8 1 10106485 ns/op 98.95 MB/s 71096 B/op 172 allocs/op |
| BenchmarkDecode/text=twain/level=best/size=1e4-8 100 138516 ns/op 72.19 MB/s 40849 B/op 15 allocs/op |
| BenchmarkDecode/text=twain/level=best/size=1e5-8 10 1227964 ns/op 81.44 MB/s 43316 B/op 25 allocs/op |
| BenchmarkDecode/text=twain/level=best/size=1e6-8 1 10040347 ns/op 99.60 MB/s 72120 B/op 173 allocs/op |
| BenchmarkEncode/text=digits/level=speed/size=1e4-8 30 482808 ns/op 20.71 MB/s |
| BenchmarkEncode/text=digits/level=speed/size=1e5-8 5 2685455 ns/op 37.24 MB/s |
| BenchmarkEncode/text=digits/level=speed/size=1e6-8 1 24966055 ns/op 40.05 MB/s |
| BenchmarkEncode/text=digits/level=default/size=1e4-8 20 655592 ns/op 15.25 MB/s |
| BenchmarkEncode/text=digits/level=default/size=1e5-8 1 13000839 ns/op 7.69 MB/s |
| BenchmarkEncode/text=digits/level=default/size=1e6-8 1 136341747 ns/op 7.33 MB/s |
| BenchmarkEncode/text=digits/level=best/size=1e4-8 20 668083 ns/op 14.97 MB/s |
| BenchmarkEncode/text=digits/level=best/size=1e5-8 1 12301511 ns/op 8.13 MB/s |
| BenchmarkEncode/text=digits/level=best/size=1e6-8 1 137962041 ns/op 7.25 MB/s |
| |
| Using sub-benchmarks has benefits beyond this proposal, namely that it would |
| avoid the current repetitive code: |
| |
| func BenchmarkDecodeDigitsSpeed1e4(b *testing.B) { benchmarkDecode(b, digits, speed, 1e4) } |
| func BenchmarkDecodeDigitsSpeed1e5(b *testing.B) { benchmarkDecode(b, digits, speed, 1e5) } |
| func BenchmarkDecodeDigitsSpeed1e6(b *testing.B) { benchmarkDecode(b, digits, speed, 1e6) } |
| func BenchmarkDecodeDigitsDefault1e4(b *testing.B) { benchmarkDecode(b, digits, default_, 1e4) } |
| func BenchmarkDecodeDigitsDefault1e5(b *testing.B) { benchmarkDecode(b, digits, default_, 1e5) } |
| func BenchmarkDecodeDigitsDefault1e6(b *testing.B) { benchmarkDecode(b, digits, default_, 1e6) } |
| func BenchmarkDecodeDigitsCompress1e4(b *testing.B) { benchmarkDecode(b, digits, compress, 1e4) } |
| func BenchmarkDecodeDigitsCompress1e5(b *testing.B) { benchmarkDecode(b, digits, compress, 1e5) } |
| func BenchmarkDecodeDigitsCompress1e6(b *testing.B) { benchmarkDecode(b, digits, compress, 1e6) } |
| func BenchmarkDecodeTwainSpeed1e4(b *testing.B) { benchmarkDecode(b, twain, speed, 1e4) } |
| func BenchmarkDecodeTwainSpeed1e5(b *testing.B) { benchmarkDecode(b, twain, speed, 1e5) } |
| func BenchmarkDecodeTwainSpeed1e6(b *testing.B) { benchmarkDecode(b, twain, speed, 1e6) } |
| func BenchmarkDecodeTwainDefault1e4(b *testing.B) { benchmarkDecode(b, twain, default_, 1e4) } |
| func BenchmarkDecodeTwainDefault1e5(b *testing.B) { benchmarkDecode(b, twain, default_, 1e5) } |
| func BenchmarkDecodeTwainDefault1e6(b *testing.B) { benchmarkDecode(b, twain, default_, 1e6) } |
| func BenchmarkDecodeTwainCompress1e4(b *testing.B) { benchmarkDecode(b, twain, compress, 1e4) } |
| func BenchmarkDecodeTwainCompress1e5(b *testing.B) { benchmarkDecode(b, twain, compress, 1e5) } |
| func BenchmarkDecodeTwainCompress1e6(b *testing.B) { benchmarkDecode(b, twain, compress, 1e6) } |
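
For comparison, a sub-benchmark version of the decode benchmarks might look roughly like the sketch below, assuming a variant of `benchmarkDecode` that takes the text, level, and size as ordinary parameters:

    func BenchmarkDecode(b *testing.B) {
        sizes := []struct {
            name string
            n    int
        }{{"1e4", 1e4}, {"1e5", 1e5}, {"1e6", 1e6}}
        for _, text := range []string{"digits", "twain"} {
            for _, level := range []string{"speed", "default", "best"} {
                for _, size := range sizes {
                    name := "text=" + text + "/level=" + level + "/size=" + size.name
                    b.Run(name, func(b *testing.B) {
                        benchmarkDecode(b, text, level, size.n)
                    })
                }
            }
        }
    }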
| |
More importantly for this proposal, using sub-benchmarks also makes the possible
comparison axes clear: digits vs twain, speed vs default vs best, size 1e4 vs 1e5 vs 1e6.
| |
| ## Rationale |
| |
| As discussed in the background section, |
| we have already developed a number of analysis programs |
| that assume this proposal's format, |
| as well as a number of programs that generate this format. |
| Standardizing the format should encourage additional work |
| on both kinds of programs. |
| |
| [Issue 12826](https://golang.org/issue/12826) suggests a different approach, |
| namely the addition of a new `go test` option `-benchformat`, to control |
| the format of benchmark output. In fact it gives the lack of standardization |
| as the main justification for a new option: |
| |
| > Currently `go test -bench .` prints out benchmark results in a |
| > certain format, but there is no guarantee that this format will not |
| > change. Thus a tool that parses go test output may break if an |
| > incompatible change to the output format is made. |
| |
| Our approach is instead to guarantee that the format will not change, |
| or rather that it will only change in ways allowed by this design. |
| An analysis tool that parses the output specified here will not break |
| in future versions of Go, |
| and a tool that generates the output specified here will work |
| with all such analysis tools. |
| Having one agreed-upon format enables broad interoperation; |
| the ability for one tool to generate arbitrarily many different formats |
| does not achieve the same result. |
| |
| The proposed format also seems to be extensible enough to accommodate |
| anticipated future work on benchmark reporting. |
| |
The main known issue with the current `go test -bench` output is that
we'd like to emit finer-grained detail about runs, for linearity testing
and more robust statistics (see [issue 10669](https://golang.org/issue/10669)).
| This proposal allows that by simply printing more result lines. |
| |
| Another known issue is that we may want to add custom outputs |
| such as garbage collector statistics to certain benchmark runs. |
| This proposal allows that by adding more value-unit pairs. |
| |
| ## Compatibility |
| |
Tools consuming the existing benchmark format may need trivial changes
to ignore lines that are not benchmark results or to cope with additional value-unit pairs
in benchmark results.
| |
| ## Implementation |
| |
| The benchmark format described here is already generated by `go test -bench` |
| and expected by tools like `benchcmp` and `benchstat`. |
| |
| The format is trivial to generate, and it is |
| straightforward but not quite trivial to parse. |
| |
| We anticipate that the [new x/perf subrepo](https://github.com/golang/go/issues/14304) will include a library for loading |
| benchmark data from files, although the format is also simple enough that |
| tools that want a different in-memory representation might reasonably |
| write separate parsers. |
| |