encoding/protowire: micro-optimize SizeVarint (-20% on Intel)

SizeVarint is of strategic importance for Protobuf encoding,
but I want to be clear: this change, on its own, does not
measurably improve real-world Protobuf usage in my testing.
It does, however, improve performance within the context of
another, larger project. I don’t want to gate this optimization
on the bigger project, but would rather test and submit it in isolation.

As the detailed comment in the source code explains,
this implementation follows C++ Protobuf’s approach.
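
For reviewers who want to sanity-check the arithmetic, here is a small
stand-alone sketch (not part of this change; helper names are mine) that
compares the old expression, the new expression from the diff below, and
the plain 1 + (bits.Len64(v)-1)/7 definition around every power of two,
which covers all varint size boundaries:

    package main

    import (
        "fmt"
        "math/bits"
    )

    // sizeOld mirrors the current expression in wire.go.
    func sizeOld(v uint64) int { return int(9*uint32(bits.Len64(v))+64) / 64 }

    // sizeNew mirrors the expression introduced by this change.
    func sizeNew(v uint64) int {
        v |= 1
        log2value := uint32(bits.LeadingZeros64(v)) ^ 63
        return int((log2value*9 + (64 + 9)) / 64)
    }

    // sizeRef is the obvious definition: one byte per started group of
    // 7 bits, with v=0 taking one byte.
    func sizeRef(v uint64) int { return 1 + (bits.Len64(v|1)-1)/7 }

    func main() {
        for i := range 64 {
            for _, v := range []uint64{0, 1<<i - 1, 1 << i, 1<<i + 1} {
                if sizeOld(v) != sizeRef(v) || sizeNew(v) != sizeRef(v) {
                    fmt.Println("mismatch at", v)
                }
            }
        }
        fmt.Println("all checked boundaries agree")
    }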

For your convenience, here is a godbolt Compiler Explorer link
that shows what the Go compiler makes of the old and new versions:
https://godbolt.org/z/4erW1EY4r

When compiling with GOAMD64=v1 (the default), the new version
is roughly performance-neutral (a little faster on some, a little
slower on other CPU architectures — probably within the noise floor):

.fullname: SizeVarint-4
                     │    head     │                micro                │
                     │   sec/op    │    sec/op     vs base               │
conan-altra            2.174µ ± 0%    2.156µ ± 0%  -0.83% (p=0.000 n=10)
arcadia-rome           3.519µ ± 2%    3.558µ ± 0%       ~ (p=0.060 n=10)
indus-skylake          2.143µ ± 3%    2.192µ ± 7%       ~ (p=0.448 n=10)
izumi-sapphirerapids   974.9n ± 0%   1020.0n ± 0%  +4.63% (p=0.000 n=10)
geomean                1.999µ         2.035µ       +1.78%

By setting GOAMD64=v3, we unlock the full feature set of our CPUs.
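
For reference, GOAMD64 is a build-time setting of the Go toolchain, so the
v3 numbers correspond to rebuilding and rerunning the benchmark with the
environment variable set; the exact invocation below is only illustrative:

    GOAMD64=v3 go test -run='^$' -bench=SizeVarint -count=10 ./encoding/protowire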

If we build the old version with GOAMD64=v3, the time already drops by almost 50%
on AMD Zen 2 CPUs (due to switching from the slow BSRQ to the fast LZCNTQ instruction):

.fullname: SizeVarint-4
                     │    head     │           head-goamd64v3            │
                     │   sec/op    │   sec/op     vs base                │
conan-altra            2.174µ ± 0%   2.174µ ± 0%        ~ (p=1.000 n=10)
arcadia-rome           3.519µ ± 2%   1.789µ ± 0%  -49.15% (p=0.000 n=10)
indus-skylake          2.143µ ± 3%   2.165µ ± 9%        ~ (p=0.739 n=10)
izumi-sapphirerapids   974.9n ± 0%   980.5n ± 3%   +0.58% (p=0.007 n=10)
geomean                1.999µ        1.695µ       -15.22%

And if we benchmark the new version with GOAMD64=v3, we see a further speed-up
on ARM and Intel — as high as 20% on Skylake!

.fullname: SizeVarint-4
                     │ head-goamd64v3 │           micro-goamd64v3           │
                     │     sec/op     │   sec/op     vs base                │
conan-altra               2.174µ ± 0%   2.156µ ± 0%   -0.83% (p=0.000 n=10)
arcadia-rome              1.789µ ± 0%   1.836µ ± 1%   +2.63% (p=0.000 n=10)
indus-skylake             2.165µ ± 9%   1.753µ ± 7%  -19.05% (p=0.000 n=10)
izumi-sapphirerapids      980.5n ± 3%   959.1n ± 0%   -2.19% (p=0.000 n=10)
geomean                   1.695µ        1.606µ        -5.25%

In summary, I believe this version of SizeVarint is currently the fastest
on the relevant CPUs, and leaves the path open to squeeze out a little more
performance by changing the Go compiler.

Change-Id: Ibc2629f8dcf9f2f4eb0a09fe37f923829ee3165b
Reviewed-on: https://go-review.googlesource.com/c/protobuf/+/683955
Reviewed-by: Nicolas Hillegeer <aktau@google.com>
Auto-Submit: Nicolas Hillegeer <aktau@google.com>
Reviewed-by: Christian Höppner <hoeppi@google.com>
Reviewed-by: Damien Neil <dneil@google.com>
Commit-Queue: Nicolas Hillegeer <aktau@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
diff --git a/encoding/protowire/wire.go b/encoding/protowire/wire.go
index e942bc9..743bfb8 100644
--- a/encoding/protowire/wire.go
+++ b/encoding/protowire/wire.go
@@ -371,7 +371,31 @@
 func SizeVarint(v uint64) int {
 	// This computes 1 + (bits.Len64(v)-1)/7.
 	// 9/64 is a good enough approximation of 1/7
-	return int(9*uint32(bits.Len64(v))+64) / 64
+	//
+	// The Go compiler can translate the bits.LeadingZeros64 call into the LZCNT
+	// instruction, which is very fast on CPUs from the last few years. The
+	// specific way of expressing the calculation matches C++ Protobuf, see
+	// https://godbolt.org/z/4P3h53oM4 for the C++ code and how gcc/clang
+	// optimize that function for GOAMD64=v1 and GOAMD64=v3 (-march=haswell).
+
+	// By OR'ing v with 1, we guarantee that v is never 0, without changing the
+	// result of SizeVarint. BSR, the pre-LZCNT fallback, is undefined for 0,
+	// meaning the compiler needs to add extra instructions to handle that case.
+	//
+	// The Go compiler currently (go1.24.4) does not make use of this knowledge.
+	// Exploiting that opportunity (removing the XOR instruction, which handles
+	// the 0 case) would yield a small (1%) performance win across CPU architectures.
+	//
+	// Independently of avoiding the 0 case, we need the v |= 1 line because
+	// it allows the Go compiler to eliminate an extra XCHGL barrier.
+	v |= 1
+
+	// It would be clearer to write log2value := 63 - uint32(...), but
+	// writing uint32(...) ^ 63 is much more efficient (-14% ARM, -20% Intel).
+	// Proof of identity for our value range [0..63]:
+	// https://go.dev/play/p/Pdn9hEWYakX
+	log2value := uint32(bits.LeadingZeros64(v)) ^ 63
+	return int((log2value*9 + (64 + 9)) / 64)
 }
 
 // AppendFixed32 appends v to b as a little-endian uint32.
diff --git a/encoding/protowire/wire_test.go b/encoding/protowire/wire_test.go
index 486d183..f0281c3 100644
--- a/encoding/protowire/wire_test.go
+++ b/encoding/protowire/wire_test.go
@@ -678,3 +678,31 @@
 		}
 	}
 }
+
+// TODO(go1.23): use slices.Repeat
+var testvals = func() []uint64 {
+	// These values are representative of the values that we observe when
+	// running benchmarks extracted from Google production workloads.
+	vals := []uint64{
+		1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
+		55, 66, 77, 88, 99, 100,
+		123456789, 98765432,
+	}
+	newslice := make([]uint64, 100*len(vals))
+	n := copy(newslice, vals)
+	for n < len(newslice) {
+		n += copy(newslice[n:], newslice[:n])
+	}
+	return newslice
+}()
+
+func BenchmarkSizeVarint(b *testing.B) {
+	var total int
+	for range b.N {
+		for _, val := range testvals {
+			total += SizeVarint(val)
+		}
+	}
+	// Prevent the Go compiler from optimizing out the SizeVarint call:
+	b.Logf("total: %d", total)
+}