encoding/protowire: micro-optimize SizeVarint (-20% on Intel)
SizeVarint is of strategic importance for Protobuf encoding,
but I want to be clear: This change, on its own, does not
measurably improve real-world Protobuf usages in my testing.
It does, however, improve performance within the context of
another, larger project. I don’t want to sequence this optimization
behind the bigger project, but would rather test and submit it in isolation.
As the detailed comment in the source code explains,
this implementation follows C++ Protobuf’s approach.
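In a nutshell, the change looks like this (condensed from the diff
below; see wire.go for the full comments):

    // before: via bits.Len64
    return int(9*uint32(bits.Len64(v))+64) / 64

    // after: via bits.LeadingZeros64, matching C++ Protobuf
    v |= 1 // never 0, so the bit twiddling below stays valid
    log2value := uint32(bits.LeadingZeros64(v)) ^ 63 // == 63 - LeadingZeros
    return int((log2value*9 + (64 + 9)) / 64)        // same result as before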
For your convenience, here is a godbolt Compiler Explorer link
that shows what the Go compiler makes of the old and new versions:
https://godbolt.org/z/4erW1EY4r
When compiling with GOAMD64=v1 (the default), the new version
is roughly performance-neutral (a little faster on some, a little
slower on other CPU architectures — probably within the noise floor):
.fullname: SizeVarint-4
│ head │ micro │
│ sec/op │ sec/op vs base │
conan-altra 2.174µ ± 0% 2.156µ ± 0% -0.83% (p=0.000 n=10)
arcadia-rome 3.519µ ± 2% 3.558µ ± 0% ~ (p=0.060 n=10)
indus-skylake 2.143µ ± 3% 2.192µ ± 7% ~ (p=0.448 n=10)
izumi-sapphirerapids 974.9n ± 0% 1020.0n ± 0% +4.63% (p=0.000 n=10)
geomean 1.999µ 2.035µ +1.78%
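The tables in this message come from the standard benchstat tool; a
typical invocation, with illustrative file names, looks something like:

    go test -run='^$' -bench=SizeVarint -count=10 ./encoding/protowire > head.txt
    # ...apply this change, then:
    go test -run='^$' -bench=SizeVarint -count=10 ./encoding/protowire > micro.txt
    benchstat head.txt micro.txt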
By setting GOAMD64=v3, we unlock the full feature set of our CPUs.
If we build the old version with GOAMD64=v3, run time already drops by almost 50%
on AMD Zen 2 CPUs (due to switching from the slow BSRQ to the fast LZCNTQ):
.fullname: SizeVarint-4
│ head │ head-goamd64v3 │
│ sec/op │ sec/op vs base │
conan-altra 2.174µ ± 0% 2.174µ ± 0% ~ (p=1.000 n=10)
arcadia-rome 3.519µ ± 2% 1.789µ ± 0% -49.15% (p=0.000 n=10)
indus-skylake 2.143µ ± 3% 2.165µ ± 9% ~ (p=0.739 n=10)
izumi-sapphirerapids 974.9n ± 0% 980.5n ± 3% +0.58% (p=0.007 n=10)
geomean 1.999µ 1.695µ -15.22%
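One way to confirm which bit-scan instruction a given build uses
(commands illustrative, assuming an amd64 machine) is to disassemble
the compiled test binary:

    GOAMD64=v3 go test -c -o protowire.test ./encoding/protowire
    go tool objdump -s SizeVarint protowire.test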
And if we benchmark the new version with GOAMD64=v3, we see a further speed-up
on ARM and Intel (almost 20% on Skylake!):
.fullname: SizeVarint-4
│ head-goamd64v3 │ micro-goamd64v3 │
│ sec/op │ sec/op vs base │
conan-altra 2.174µ ± 0% 2.156µ ± 0% -0.83% (p=0.000 n=10)
arcadia-rome 1.789µ ± 0% 1.836µ ± 1% +2.63% (p=0.000 n=10)
indus-skylake 2.165µ ± 9% 1.753µ ± 7% -19.05% (p=0.000 n=10)
izumi-sapphirerapids 980.5n ± 3% 959.1n ± 0% -2.19% (p=0.000 n=10)
geomean 1.695µ 1.606µ -5.25%
In summary, I believe this version of SizeVarint is currently the fastest
on the relevant CPUs, and leaves the path open to squeeze out a little more
performance by changing the Go compiler.
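As a sanity check, not part of this change, the following standalone
program (names illustrative) verifies both the ^63 identity and that
the new formula matches the old one for all interesting inputs:

    package main

    import (
        "fmt"
        "math/bits"
    )

    func main() {
        // Identity used by the new code: for x in [0, 63], x^63 == 63-x,
        // because 63 is all-ones in the low six bits.
        for x := uint32(0); x < 64; x++ {
            if x^63 != 63-x {
                fmt.Println("identity broken at", x)
            }
        }
        // The old and new formulas agree for every bit length,
        // including v == 0 (covered by v |= 1).
        check := func(v uint64) {
            oldSize := int(9*uint32(bits.Len64(v))+64) / 64
            w := v | 1
            log2value := uint32(bits.LeadingZeros64(w)) ^ 63
            newSize := int((log2value*9 + (64 + 9)) / 64)
            if oldSize != newSize {
                fmt.Printf("mismatch for %d: %d vs %d\n", v, oldSize, newSize)
            }
        }
        check(0)
        for i := 0; i < 64; i++ {
            check(uint64(1)<<i - 1) // just below each bit boundary
            check(uint64(1) << i)   // at each bit boundary
        }
    }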
Change-Id: Ibc2629f8dcf9f2f4eb0a09fe37f923829ee3165b
Reviewed-on: https://go-review.googlesource.com/c/protobuf/+/683955
Reviewed-by: Nicolas Hillegeer <aktau@google.com>
Auto-Submit: Nicolas Hillegeer <aktau@google.com>
Reviewed-by: Christian Höppner <hoeppi@google.com>
Reviewed-by: Damien Neil <dneil@google.com>
Commit-Queue: Nicolas Hillegeer <aktau@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
diff --git a/encoding/protowire/wire.go b/encoding/protowire/wire.go
index e942bc9..743bfb8 100644
--- a/encoding/protowire/wire.go
+++ b/encoding/protowire/wire.go
@@ -371,7 +371,31 @@
func SizeVarint(v uint64) int {
// This computes 1 + (bits.Len64(v)-1)/7.
// 9/64 is a good enough approximation of 1/7
- return int(9*uint32(bits.Len64(v))+64) / 64
+ //
+ // The Go compiler can translate the bits.LeadingZeros64 call into the LZCNT
+ // instruction, which is very fast on CPUs from the last few years. The
+ // specific way of expressing the calculation matches C++ Protobuf, see
+ // https://godbolt.org/z/4P3h53oM4 for the C++ code and how gcc/clang
+ // optimize that function for GOAMD64=v1 and GOAMD64=v3 (-march=haswell).
+
+ // By OR'ing v with 1, we guarantee that v is never 0, without changing the
+ // result of SizeVarint. LZCNT is defined for 0 (it returns 64), but the BSR
+ // instruction used under GOAMD64=v1 is not, meaning the compiler needs to
+ // add extra instructions to handle that case.
+ //
+ // The Go compiler currently (go1.24.4) does not make use of this knowledge.
+ // Seizing this opportunity (removing the instructions that handle the 0
+ // case) would result in a small (1%) performance win across CPU
+ // architectures.
+ //
+ // Independently of the 0-case instructions, we need the v |= 1 line for
+ // correctness: the ^ 63 trick below is only valid for v != 0 (for v == 0,
+ // log2value would come out as 127 instead of 0).
+ v |= 1
+
+ // It would be clearer to write log2value := 63 - uint32(...), but
+ // writing uint32(...) ^ 63 is much more efficient (-14% ARM, -20% Intel).
+ // Proof of identity for our value range [0..63]:
+ // https://go.dev/play/p/Pdn9hEWYakX
+ log2value := uint32(bits.LeadingZeros64(v)) ^ 63
+ return int((log2value*9 + (64 + 9)) / 64)
}
// AppendFixed32 appends v to b as a little-endian uint32.
diff --git a/encoding/protowire/wire_test.go b/encoding/protowire/wire_test.go
index 486d183..f0281c3 100644
--- a/encoding/protowire/wire_test.go
+++ b/encoding/protowire/wire_test.go
@@ -678,3 +678,31 @@
}
}
}
+
+// TODO(go1.23): use slices.Repeat
+var testvals = func() []uint64 {
+ // These values are representative of the values that we observe when
+ // running benchmarks extracted from Google production workloads.
+ vals := []uint64{
+ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
+ 55, 66, 77, 88, 99, 100,
+ 123456789, 98765432,
+ }
+ newslice := make([]uint64, 100*len(vals))
+ n := copy(newslice, vals)
+ for n < len(newslice) {
+ n += copy(newslice[n:], newslice[:n])
+ }
+ return newslice
+}()
+
+func BenchmarkSizeVarint(b *testing.B) {
+ var total int
+ for range b.N {
+ for _, val := range testvals {
+ total += SizeVarint(val)
+ }
+ }
+ // Prevent the Go compiler from optimizing out the SizeVarint call:
+ b.Logf("total: %d", total)
+}