Author: Marcel van Lohuizen
With input from Sameer Ajmani, Austin Clements, Damien Neil, and Bryan Mills.
Last updated: September 2, 2015
Discussion at https://golang.org/issue/12166.
This proposal enhances table-driven support in the testing package by means of a Run method for starting subtests and subbenchmarks.
Adding a Run method for spawning subtests and subbenchmarks addresses a variety of needs, including selecting subtests and subbenchmarks from the command line, creating tests and benchmarks programmatically from tables, and gaining finer control over setup, teardown, and parallelism.
The proposals for tests and for benchmarks are discussed separately below. A separate section explains logging and how to select subbenchmarks and subtests on the command line.
T gets the following method:
```go
// Run runs f as a subtest of t called name. It panics if name is not unique
// among t's subtests and reports whether f succeeded.
// Run will block until all its parallel subtests have completed.
func (t *T) Run(name string, f func(t *testing.T)) bool
```
Several methods get further clarification on their behavior for subfunctions. Changes are shown between square brackets:
```go
// Fail marks the function [and its calling functions] as having failed but
// continues execution.
func (c *common) Fail()

// FailNow marks the function as having failed, stops its execution
// [and aborts pending parallel subtests].
// Execution will continue at the next test or benchmark.
// FailNow must be called from the goroutine running the
// test or benchmark function, not from other goroutines
// created during the test. Calling FailNow does not stop
// those other goroutines.
func (c *common) FailNow()

// SkipNow marks the test as having been skipped, stops its execution
// [and aborts pending parallel subtests].
// ... (analogous to FailNow)
func (c *common) SkipNow()
```
A NumFailed method might be useful as well.
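For illustration only, such a hypothetical NumFailed method, reporting the number of failed subtests, could let a test react to earlier failures before running further checks. A minimal sketch, assuming that method exists (it is not part of this proposal's API):

```go
// Sketch only: NumFailed is hypothetical and not part of the proposed API.
func TestWithVerification(t *testing.T) {
	t.Run("Test1", test1)
	t.Run("Test2", test2)
	if t.NumFailed() > 0 {
		t.Skip("skipping verification: earlier subtests failed")
	}
	// Verification or teardown that only makes sense if all subtests passed.
}
```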
A typical use case:
```go
var tests = []struct {
	A, B int
	Sum  int
}{
	{1, 2, 3},
	{1, 1, 2},
	{2, 1, 3},
}

func TestSum(t *testing.T) {
	for _, tc := range tests {
		t.Run(fmt.Sprint(tc.A, "+", tc.B), func(t *testing.T) {
			if got := tc.A + tc.B; got != tc.Sum {
				t.Errorf("got %d; want %d", got, tc.Sum)
			}
		})
	}
}
```
Note that we write `t.Errorf("got %d; want %d")` instead of something like `t.Errorf("%d+%d = %d; want %d")`: the subtest name already uniquely identifies the test.
Select (sub)tests from the command line using -test.run:
```
go test --run=TestFoo,1+2  # selects the first test in TestFoo
go test --run=TestFoo,1+   # selects tests for which A == 1 in TestFoo
go test --run=,1+          # for any top-level test, select subtests matching "1+"
```
Skipping a subtest will not terminate subsequent tests in the calling test:
```go
func TestFail(t *testing.T) {
	for i, tc := range tests {
		t.Run(fmt.Sprint(tc.A, "+", tc.B), func(t *testing.T) {
			if tc.A < 0 {
				t.Skip(i) // terminate test i, but proceed with test i+1.
			}
			if got := tc.A + tc.B; got != tc.Sum {
				t.Errorf("got %d; want %d", got, tc.Sum)
			}
		})
	}
}
```
Run a subtest in parallel:
```go
func TestParallel(t *testing.T) {
	for _, tc := range tests {
		tc := tc // Must capture the range variable.
		t.Run(tc.Name, func(t *testing.T) {
			t.Parallel()
			...
		})
	}
}
```
Run teardown code after a few tests:
```go
func TestTeardown(t *testing.T) {
	t.Run("Test1", test1)
	t.Run("Test2", test2)
	t.Run("Test3", test3)
	// teardown code.
}
```
Run teardown code after parallel tests:
```go
func TestTeardownParallel(t *testing.T) {
	// By definition, this Run will not return until the parallel tests finish.
	t.Run("block", func(t *testing.T) {
		t.Run("Test1", parallelTest1)
		t.Run("Test2", parallelTest2)
		t.Run("Test3", parallelTest3)
	})
	// teardown code.
}
```
Test1-Test3 will run in parallel with each other, but not with any other parallel tests. This follows from the fact that Run will block until all subtests have completed and that both TestTeardownParallel and “block” are sequential.
The -test.run flag allows filtering of subtests given a comma-separated list of regular expressions: the nth regular expression in the list is matched against the names of tests at level n. A top-level test (e.g. func TestFoo) has level 1, a test invoked with Run by such a test has level 2, a test invoked by such a subtest has level 3, and so forth. We use a “,” to separate the regular expressions because it is a common way to define a list, because it is a character not frequently printed by %+v, and because it plays well with regular expressions.
Rules for -test.bench are analogous.
#### Examples

Select top-level test “TestFoo” and all its subtests:

```
go test --run=TestFoo
```
Select top-level tests that contain the string “Foo” and all their subtests that contain the string “A:3”.
```
go test --run=Foo,A:3
```
Select all subtests of level 2 whose name contains the string “A:1 B:2”, for any top-level test:

```
go test --run=",A:1 B:2"
```
The latter could match, for example, struct{A, B int}{1, 2} printed with %+v.
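To make such patterns matchable, a test can derive its subtest names directly from the case value. A minimal sketch, using an illustrative struct that is not from the proposal:

```go
// Sketch: naming subtests by printing each case with %+v yields names
// like "{A:1 B:2}", which the level-2 patterns above can match.
func TestFoo(t *testing.T) {
	for _, tc := range []struct{ A, B int }{{1, 2}, {1, 3}} {
		t.Run(fmt.Sprintf("%+v", tc), func(t *testing.T) {
			// test body using tc.
		})
	}
}
```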
The following method would be added to B:
```go
// Run benchmarks f as a subbenchmark with the given name. It panics if name
// is not unique within b's scope and reports whether there were any failures.
//
// A subbenchmark is like any other benchmark. A benchmark that calls Run at
// least once will not be measured itself and will only run for one iteration.
func (b *B) Run(name string, f func(b *testing.B)) bool
```
The uniqueness requirement helps tools that rely on benchmarks having unique names; benchcmp is an example of such a tool.
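A minimal sketch of that rule as proposed (the benchmark bodies are placeholders):

```go
func BenchmarkDuplicate(b *testing.B) {
	b.Run("case", func(b *testing.B) { /* ... */ })
	// Panics under the proposed semantics: "case" is not unique within b's scope.
	b.Run("case", func(b *testing.B) { /* ... */ })
}
```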
The Benchmark function gets an additional clarification (addition between square brackets):
```go
// Benchmark benchmarks a single function. Useful for creating
// custom benchmarks that do not use the "go test" command.
// [
// If f calls Run, the result will be an estimate of running all its
// subbenchmarks that don't call Run in sequence in a single benchmark.]
func Benchmark(f func(b *B)) BenchmarkResult
```
See the Rationale section for an explanation.
The following code shows the use of two levels of subbenchmarks. It is based on a possible rewrite of golang.org/x/text/unicode/norm/normalize_test.go.
```go
func BenchmarkMethod(b *testing.B) {
	for _, tt := range allMethods {
		b.Run(tt.name, func(b *testing.B) {
			for _, d := range textdata {
				fn := tt.f(NFC, []byte(d.data)) // initialize the test
				b.Run(d.name, func(b *testing.B) {
					b.SetBytes(int64(len(d.data)))
					for i := 0; i < b.N; i++ {
						fn()
					}
				})
			}
		})
	}
}

var allMethods = []struct {
	name string
	f    func(to Form, b []byte) func()
}{
	{"Transform", func(f Form, b []byte) func() {
		buf := make([]byte, 4*len(b))
		return func() { f.Transform(buf, b, true) }
	}},
	{"Iter", func(f Form, b []byte) func() {
		iter := Iter{}
		return func() {
			for iter.Init(f, b); !iter.Done(); iter.Next() {
			}
		}
	}},
	{ ... },
}

var textdata = []struct {
	name, data string
}{
	{"small_change", "No\u0308rmalization"},
	{"small_no_change", "nörmalization"},
	{"ascii", ascii},
	{"all", txt_all},
}
```
Note that there is some initialization code above the second Run. Because it is outside of Run, there is no need to call ResetTimer. As Run starts a new benchmark, it is not possible to hoist the SetBytes call in a similar manner.
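For contrast, a sketch of what the inner benchmark from the example above would look like if the initialization were moved inside Run; the setup would then fall inside the measured region and a ResetTimer call would be needed:

```go
b.Run(d.name, func(b *testing.B) {
	fn := tt.f(NFC, []byte(d.data)) // setup now runs inside the measured benchmark,
	b.ResetTimer()                  // so the timer must be reset to exclude it.
	b.SetBytes(int64(len(d.data)))
	for i := 0; i < b.N; i++ {
		fn()
	}
})
```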
The output for BenchmarkMethod above, without additional logging, could look something like this:
```
BenchmarkMethod-8,Transform,small_change        200000       668 ns/op   22.43 MB/s
BenchmarkMethod-8,Transform,small_no_change    1000000       100 ns/op  139.13 MB/s
BenchmarkMethod-8,Transform,ascii                10000     22430 ns/op  735.60 MB/s
BenchmarkMethod-8,Transform,all                   1000    128511 ns/op   43.82 MB/s
BenchmarkMethod-8,Iter,small_change             200000       701 ns/op   21.39 MB/s
BenchmarkMethod-8,Iter,small_no_change          500000       321 ns/op   43.52 MB/s
BenchmarkMethod-8,Iter,ascii                      1000    210633 ns/op   78.33 MB/s
BenchmarkMethod-8,Iter,all                        1000    235950 ns/op   23.87 MB/s
BenchmarkMethod-8,ToLower,small_change          300000       475 ns/op   31.57 MB/s
BenchmarkMethod-8,ToLower,small_no_change       500000       239 ns/op   58.44 MB/s
BenchmarkMethod-8,ToLower,ascii                    500    297486 ns/op   55.46 MB/s
BenchmarkMethod-8,ToLower,all                     1000    151722 ns/op   37.12 MB/s
BenchmarkMethod-8,QuickSpan,small_change       2000000      70.0 ns/op  214.20 MB/s
BenchmarkMethod-8,QuickSpan,small_no_change    1000000       115 ns/op  120.94 MB/s
BenchmarkMethod-8,QuickSpan,ascii                 5000     25418 ns/op  649.13 MB/s
BenchmarkMethod-8,QuickSpan,all                   1000    175954 ns/op   32.01 MB/s
BenchmarkMethod-8,Append,small_change           200000       721 ns/op   20.78 MB/s
ok      golang.org/x/text/unicode/norm  5.601s
```
The only change to the output format is the set of characters allowed in the benchmark name. The output is identical in the absence of subbenchmarks. This format is compatible with tools like benchstat.
Logs for tests are printed hierarchically. Example:
```
--- FAIL: TestFoo (0.03s)
    display_test.go:75: setup issue
    --- FAIL: TestFoo,{Alpha:1_Beta:1} (0.01s)
        display_test.go:75: Foo(Beta) = 5; want 6
    --- FAIL: TestFoo,{Alpha:1_Beta:3} (0.01s)
        display_test.go:75: Foo(Beta) = 5; want 6
        display_test.go:75: Foo(Beta) = 5; want 6
        display_test.go:75: Foo(Beta) = 5; want 6
        display_test.go:75: Foo(Beta) = 5; want 6
    display_test.go:75: setup issue
    --- FAIL: TestFoo,{Alpha:1_Beta:4} (0.01s)
        display_test.go:75: Foo(Beta) = 5; want 6
        display_test.go:75: Foo(Beta) = 5; want 6
        display_test.go:75: Foo(Beta) = 5; want 6
        display_test.go:75: Foo(Beta) = 5; want 6
    --- FAIL: TestFoo,{Alpha:1_Beta:8} (0.01s)
        display_test.go:75: Foo(Beta) = 5; want 6
        display_test.go:75: Foo(Beta) = 5; want 6
        display_test.go:75: Foo(Beta) = 5; want 6
    --- FAIL: TestFoo,{Alpha:1_Beta:9} (0.03s)
        display_test.go:75: Foo(Beta) = 5; want 6
```
For each header, we include the full name, thereby repeating the name of the parent. This makes it easier to identify the specific test from within the local context and obviates the need for tools to keep track of context.
For benchmarks we adopt a different strategy. For benchmarks, it is important to relate the logs that might have influenced performance to the respective benchmark. This means we should ideally interleave the logs with the benchmark results. It also means we should distinguish between logs of unmeasured, enclosing benchmarks and logs written during actual benchmarks. For the latter purpose we introduce the LOG tag in addition to the FAIL and BENCH tags. For example:
```
--- LOG: BenchmarkForm,from␣NFC-8
    normalize_test.go:768: Some message
BenchmarkForm,from␣NFC,canonical,to␣NFC-8     10000   15914 ns/op  166.64 MB/s
--- LOG: BenchmarkForm,from␣NFC-8
    normalize_test.go:768: Some message
--- LOG: BenchmarkForm,from␣NFC,canonical-8
    normalize_test.go:776: Some message.
BenchmarkForm,from␣NFC,canonical,to␣NFD-8     10000   15914 ns/op  166.64 MB/s
--- LOG: BenchmarkForm,from␣NFC,canonical-8
    normalize_test.go:776: Some message.
--- BENCH: BenchmarkForm,from␣NFC,canonical,to␣NFD-8
    normalize_test.go:789: Some message.
    normalize_test.go:789: Some message.
    normalize_test.go:789: Some message.
BenchmarkForm,from␣NFC,canonical,to␣NFKC-8    10000   15170 ns/op  174.82 MB/s
--- LOG: BenchmarkForm,from␣NFC,canonical-8
    normalize_test.go:776: Some message.
BenchmarkForm,from␣NFC,canonical,to␣NFKD-8    10000   15881 ns/op  166.99 MB/s
--- LOG: BenchmarkForm,from␣NFC,canonical-8
    normalize_test.go:776: Some message.
BenchmarkForm,from␣NFC,ext_latin,to␣NFC-8      5000   30720 ns/op   52.86 MB/s
--- LOG: BenchmarkForm,from␣NFC-8
    normalize_test.go:768: Some message
BenchmarkForm,from␣NFC,ext_latin,to␣NFD-8      2000   71258 ns/op   22.79 MB/s
--- BENCH: BenchmarkForm,from␣NFC,ext_latin,to␣NFD-8
    normalize_test.go:789: Some message.
    normalize_test.go:789: Some message.
    normalize_test.go:789: Some message.
BenchmarkForm,from␣NFC,ext_latin,to␣NFKC-8     5000   32233 ns/op   50.38 MB/s
```
Only logs marked BENCH influenced benchmark results. No bench results are printed for “parent” benchmarks.
One alternative to the given proposal is to define variants of tests as top-level tests or benchmarks that call helper functions. For example, the use case explained above could be written as:
```go
func doSum(t *testing.T, a, b, sum int) {
	if got := a + b; got != sum {
		t.Errorf("got %d; want %d", got, sum)
	}
}

func TestSumA1B2(t *testing.T) { doSum(t, 1, 2, 3) }
func TestSumA1B1(t *testing.T) { doSum(t, 1, 1, 2) }
func TestSumA2B1(t *testing.T) { doSum(t, 2, 1, 3) }
```
This approach can work well for smaller sets, but starts to get tedious for larger sets. Some disadvantages of this approach:
Some of these objections can be addressed by generating the test cases. It seems, though, that addressing anything beyond points 1 and 2 with generation would require more complexity than the addition of Run introduces. Overall, the benefits of the proposed addition outweigh those of an approach combining generation with hand-expanded tests.
A subtest refers to a call to Run and a test function refers to the function f passed to Run. A subtest will be like any other test. In fact, top-level tests are semantically equivalent to subtests of a single main test function.
The following holds for all subtests:
The combination of 3 and 4 means that all subtests marked as Parallel run after the enclosing test function returns but before the Run method invoking this test function returns. This corresponds to the semantics of Parallel as it exists today.
These semantics enhance consistency: a call to FailNow will always terminate the same set of subtests.
These semantics also guarantee that sequential tests are always run exclusively, while only parallel tests can run together. Also, parallel tests created by one sequentially running test will never run in parallel with parallel tests created by another sequentially running test. These simple rules allow for fairly extensive control over parallelism.
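A sketch of this grouping (the test bodies are placeholders): the parallel subtests of Group1 all complete before its Run returns, so they can never overlap with those of Group2.

```go
func TestGroups(t *testing.T) {
	t.Run("Group1", func(t *testing.T) {
		t.Run("A", func(t *testing.T) { t.Parallel() /* ... */ })
		t.Run("B", func(t *testing.T) { t.Parallel() /* ... */ })
	}) // Blocks until A and B have completed.
	t.Run("Group2", func(t *testing.T) {
		t.Run("C", func(t *testing.T) { t.Parallel() /* ... */ }) // Never overlaps A or B.
	})
}
```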
The Benchmark function defines the BenchmarkResult to be the result of running all of its subbenchmarks in sequence. This is equivalent to returning N == 1 and the sum of all values for all benchmarks, normalized to a single iteration. It may be more appropriate to use a geometric mean, but since some of the values may be zero, using one is somewhat problematic. The proposed definition is meaningful, and the user can still compute geometric means by replacing calls to Run with calls to Benchmark if needed. The main purpose of this definition is to give well-defined semantics to using Run in functions passed to Benchmark.
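A sketch of what this definition means for a caller of Benchmark, using the proposed Run method (the loop bodies are placeholders):

```go
package main

import (
	"fmt"
	"testing"
)

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		b.Run("fast", func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				// fast operation under test.
			}
		})
		b.Run("slow", func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				// slow operation under test.
			}
		})
	})
	// Under the proposed definition, res reports N == 1 with the summed
	// per-iteration cost of running "fast" and "slow" in sequence.
	fmt.Println(res)
}
```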
The rules for logging subtests are:
These rules are consistent with the subtest semantics presented earlier. Combined with these semantics, logs have the following properties:
Printing hierarchically makes the relation between tests visually clear. It also avoids repeatedly printing the same headers.
For benchmarks the priorities for logging are different. It is important to visually correlate the logs with the benchmark lines. It is also relatively rare to log a lot during benchmarking, so repeating some headers is less of an issue. The proposed logging scheme for benchmarks takes this into account.
As an alternative, we could use the same approach for benchmarks as for tests. In that case, logs would only be printed after each top-level benchmark. For example:
```
BenchmarkForm,from␣NFC,canonical,to␣NFC-8     10000    23609 ns/op  112.33 MB/s
BenchmarkForm,from␣NFC,canonical,to␣NFD-8     10000    16597 ns/op  159.78 MB/s
BenchmarkForm,from␣NFC,canonical,to␣NFKC-8    10000    17188 ns/op  154.29 MB/s
BenchmarkForm,from␣NFC,canonical,to␣NFKD-8    10000    16082 ns/op  164.90 MB/s
BenchmarkForm,from␣NFD,overflow,to␣NFC-8        300   441589 ns/op   38.34 MB/s
BenchmarkForm,from␣NFD,overflow,to␣NFD-8        300   483748 ns/op   35.00 MB/s
BenchmarkForm,from␣NFD,overflow,to␣NFKC-8       300   467694 ns/op   36.20 MB/s
BenchmarkForm,from␣NFD,overflow,to␣NFKD-8       300   515475 ns/op   32.85 MB/s
--- FAIL: BenchmarkForm
    --- FAIL: BenchmarkForm,from␣NFC
        normalize_test.go:768: Some failure.
        --- BENCH: BenchmarkForm,from␣NFC,canonical
            normalize_test.go:776: Just a message.
            normalize_test.go:776: Just a message.
            --- BENCH: BenchmarkForm,from␣NFC,canonical,to␣NFD-8
                normalize_test.go:789: Some message
                normalize_test.go:789: Some message
                normalize_test.go:789: Some message
            normalize_test.go:776: Just a message.
            normalize_test.go:776: Just a message.
        normalize_test.go:768: Some failure.
        …
    --- FAIL: BenchmarkForm-8,from␣NFD
        normalize_test.go:768: Some failure.
        …
        normalize_test.go:768: Some failure.
        --- BENCH: BenchmarkForm-8,from␣NFD,overflow
            normalize_test.go:776: Just a message.
            normalize_test.go:776: Just a message.
            --- BENCH: BenchmarkForm-8,from␣NFD,overflow,to␣NFD
                normalize_test.go:789: Some message
                normalize_test.go:789: Some message
                normalize_test.go:789: Some message
            normalize_test.go:776: Just a message.
            normalize_test.go:776: Just a message.
BenchmarkMethod,Transform,small_change-8       100000     1165 ns/op   12.87 MB/s
BenchmarkMethod,Transform,small_no_change-8   1000000      103 ns/op  135.26 MB/s
…
```
It is still easy to see which logs influenced results (those marked BENCH), but the user will have to align the logs with the result lines to correlate the data.
The API changes are fully backward compatible. They introduce several minor changes in the logs:
No changes are required to benchstat and benchcmp.
Most of the work would be done by the author of this proposal.
The first step consists of some minor refactorings to make the diffs for implementing T.Run and B.Run as small as possible. Subsequently, T.Run and B.Run can be implemented individually.
Although the capability for parallel subtests will be implemented in the first iteration, calls to Parallel will initially be allowed only in top-level tests. Once we have a good way to detect improper usage of range variables, we could open up parallelism by introducing Go or by enabling calls to Parallel on subtests.
It should be possible to land the first implementations of T.Run and B.Run before Go 1.6.
Using Parallel in combination with closures is prone to the “forgetting to capture a range variable” problem. We could add a Go method, analogous to Run, defined as follows:
```go
func (t *T) Go(name string, f func(t *T)) {
	t.Run(name, func(t *T) {
		t.Parallel()
		f(t)
	})
}
```
This suffers from the same problem, but at least it would a) make it more explicit that a range variable requires capturing and b) make it easier for go vet to detect misuse. If go vet can detect whether t.Parallel is in the call graph of t.Run and whether the closure refers to a range variable, this would be sufficient and the Go method might not be necessary.
At first we could prohibit calls to Parallel from within subtests until we decide on one of these methods or find a better solution.
We showed how it is possible to insert teardown code after running a few parallel tests. Though not difficult, it is a bit clumsy. We could at some point add the following method to make this easier:
```go
// Wait blocks until all parallel subtests have finished. It will Skip
// the current test if more than n subtests have failed. If n < 0 it will
// wait for all subtests to complete.
func (t *T) Wait(n numFailures)
```
The documentation for Run would have to be changed slightly to say that Run calls Wait(-1) before returning. The parallel teardown example could then be written as:
```go
func TestTeardownParallel(t *testing.T) {
	t.Go("Test1", parallelTest1)
	t.Go("Test2", parallelTest2)
	t.Go("Test3", parallelTest3)
	t.Wait(-1)
	// teardown code.
}
```
This could be added later if there seems to be a need for it. The introduction of Wait would only require a minimal and backward compatible change to the subtest semantics.