math/big: tune addVW/subVW performance on arm64

Add an optimization for addVW and subVW over large-sized vectors, it switches
from add/sub with carry to copy the rest of the vector when we are done with
carries. Consistent performance improvement are observed on various arm64
machines.

Add additional tests and benchmarks to increase the test coverage.
TestFunVWExt:
  Testing with various types of input vector, using the result from go-version
  addVW/subVW as golden reference.
BenchmarkAddVWext and BenchmarkSubVWext:
  Benchmarking using input vector having all 1s or all 0s, for evaluating the
  overhead of worst case.

1. Perf. comparison over randomly generated input vectors:

Server 1:
name             old time/op    new time/op    delta
AddVW/1            12.3ns ± 3%    12.0ns ± 0%    -2.60%  (p=0.001 n=10+8)
AddVW/2            12.5ns ± 2%    12.3ns ± 0%    -1.84%  (p=0.001 n=10+8)
AddVW/3            12.6ns ± 2%    12.3ns ± 0%    -1.91%  (p=0.009 n=10+10)
AddVW/4            13.1ns ± 3%    12.7ns ± 0%    -2.98%  (p=0.006 n=10+8)
AddVW/5            14.4ns ± 1%    13.9ns ± 0%    -3.81%  (p=0.000 n=10+10)
AddVW/10           11.7ns ± 0%    11.7ns ± 0%      ~     (all equal)
AddVW/100          47.8ns ± 0%    29.9ns ± 2%   -37.38%  (p=0.000 n=10+9)
AddVW/1000          446ns ± 0%     207ns ± 0%   -53.59%  (p=0.000 n=10+10)
AddVW/10000        4.35µs ± 1%    2.92µs ± 0%   -32.85%  (p=0.000 n=10+10)
AddVW/100000       43.6µs ± 0%    29.7µs ± 0%   -31.92%  (p=0.000 n=8+10)
SubVW/1            12.6ns ± 0%    12.3ns ± 2%    -2.22%  (p=0.000 n=7+10)
SubVW/2            12.7ns ± 0%    12.6ns ± 1%    -0.39%  (p=0.046 n=8+10)
SubVW/3            12.7ns ± 1%    12.6ns ± 1%      ~     (p=0.410 n=10+10)
SubVW/4            13.3ns ± 3%    13.1ns ± 3%      ~     (p=0.072 n=10+10)
SubVW/5            14.2ns ± 0%    14.1ns ± 1%    -0.63%  (p=0.046 n=8+10)
SubVW/10           11.7ns ± 0%    11.7ns ± 0%      ~     (all equal)
SubVW/100          47.8ns ± 0%    33.1ns ±19%   -30.71%  (p=0.000 n=10+10)
SubVW/1000          446ns ± 0%     207ns ± 0%   -53.59%  (p=0.000 n=10+10)
SubVW/10000        4.33µs ± 1%    2.92µs ± 0%   -32.66%  (p=0.000 n=10+6)
SubVW/100000       43.4µs ± 0%    29.6µs ± 0%   -31.90%  (p=0.000 n=10+9)

Server 2:
name             old time/op    new time/op    delta
AddVW/1            5.49ns ± 0%    5.53ns ± 2%     ~     (p=1.000 n=9+10)
AddVW/2            5.96ns ± 2%    5.92ns ± 1%   -0.69%  (p=0.039 n=10+10)
AddVW/3            6.72ns ± 0%    6.73ns ± 0%     ~     (p=0.078 n=10+10)
AddVW/4            7.07ns ± 0%    6.75ns ± 2%   -4.55%  (p=0.000 n=10+10)
AddVW/5            8.14ns ± 0%    8.17ns ± 0%   +0.46%  (p=0.003 n=8+8)
AddVW/10           10.0ns ± 0%    10.1ns ± 1%   +0.70%  (p=0.003 n=10+10)
AddVW/100          43.0ns ± 0%    33.5ns ± 0%  -22.09%  (p=0.000 n=9+9)
AddVW/1000          394ns ± 0%     278ns ± 0%  -29.44%  (p=0.000 n=10+10)
AddVW/10000        4.18µs ± 0%    3.14µs ± 0%  -24.81%  (p=0.000 n=8+8)
AddVW/100000       68.3µs ± 3%    62.1µs ± 5%   -9.13%  (p=0.000 n=10+10)
SubVW/1            5.37ns ± 2%    5.42ns ± 1%     ~     (p=0.990 n=10+10)
SubVW/2            5.89ns ± 0%    5.92ns ± 1%   +0.58%  (p=0.000 n=8+10)
SubVW/3            6.64ns ± 1%    6.82ns ± 3%   +2.63%  (p=0.000 n=9+10)
SubVW/4            7.17ns ± 0%    6.69ns ± 2%   -6.74%  (p=0.000 n=10+9)
SubVW/5            8.22ns ± 0%    8.18ns ± 0%   -0.46%  (p=0.001 n=8+9)
SubVW/10           10.0ns ± 1%    10.1ns ± 1%     ~     (p=0.341 n=10+10)
SubVW/100          43.0ns ± 0%    33.5ns ± 0%  -22.09%  (p=0.000 n=7+10)
SubVW/1000          394ns ± 0%     278ns ± 0%  -29.44%  (p=0.000 n=10+10)
SubVW/10000        4.18µs ± 0%    3.15µs ± 0%  -24.62%  (p=0.000 n=9+9)
SubVW/100000       67.7µs ± 4%    62.4µs ± 2%   -7.92%  (p=0.000 n=10+10)

2. Perf. comparison over input vectors of all 1s or all 0s

Server 1:
name             old time/op    new time/op    delta
AddVWext/1         12.6ns ± 0%    12.0ns ± 0%    -4.76%  (p=0.000 n=6+10)
AddVWext/2         12.7ns ± 0%    12.4ns ± 1%    -2.52%  (p=0.000 n=10+10)
AddVWext/3         12.7ns ± 0%    12.4ns ± 0%    -2.36%  (p=0.000 n=9+7)
AddVWext/4         13.2ns ± 4%    12.7ns ± 0%    -3.71%  (p=0.001 n=10+9)
AddVWext/5         14.6ns ± 0%    13.9ns ± 0%    -4.79%  (p=0.000 n=10+8)
AddVWext/10        11.7ns ± 0%    11.7ns ± 0%      ~     (all equal)
AddVWext/100       47.8ns ± 0%    47.4ns ± 0%    -0.84%  (p=0.000 n=10+10)
AddVWext/1000       446ns ± 0%     399ns ± 0%   -10.54%  (p=0.000 n=10+10)
AddVWext/10000     4.34µs ± 1%    3.90µs ± 0%   -10.12%  (p=0.000 n=10+10)
AddVWext/100000    43.9µs ± 1%    39.4µs ± 0%   -10.18%  (p=0.000 n=10+10)
SubVWext/1         12.6ns ± 0%    12.3ns ± 2%    -2.70%  (p=0.000 n=7+10)
SubVWext/2         12.6ns ± 1%    12.6ns ± 2%      ~     (p=0.234 n=10+10)
SubVWext/3         12.7ns ± 0%    12.6ns ± 2%    -0.71%  (p=0.033 n=10+10)
SubVWext/4         13.4ns ± 0%    13.1ns ± 3%    -2.01%  (p=0.006 n=8+10)
SubVWext/5         14.2ns ± 0%    14.1ns ± 1%    -0.85%  (p=0.003 n=10+10)
SubVWext/10        11.7ns ± 0%    11.7ns ± 0%      ~     (all equal)
SubVWext/100       47.8ns ± 0%    47.4ns ± 0%    -0.84%  (p=0.000 n=10+10)
SubVWext/1000       446ns ± 0%     399ns ± 0%   -10.54%  (p=0.000 n=10+10)
SubVWext/10000     4.33µs ± 1%    3.90µs ± 0%   -10.02%  (p=0.000 n=10+10)
SubVWext/100000    43.5µs ± 0%    39.5µs ± 1%    -9.16%  (p=0.000 n=7+10)

Server 2:
name             old time/op    new time/op    delta
AddVWext/1         5.48ns ± 0%    5.43ns ± 1%   -0.97%  (p=0.000 n=9+9)
AddVWext/2         5.99ns ± 2%    5.93ns ± 1%     ~     (p=0.054 n=10+10)
AddVWext/3         6.74ns ± 0%    6.79ns ± 1%   +0.80%  (p=0.000 n=9+10)
AddVWext/4         7.18ns ± 0%    7.21ns ± 1%   +0.36%  (p=0.034 n=9+10)
AddVWext/5         7.93ns ± 3%    8.18ns ± 0%   +3.18%  (p=0.000 n=10+8)
AddVWext/10        10.0ns ± 0%    10.1ns ± 1%   +0.60%  (p=0.011 n=10+10)
AddVWext/100       43.0ns ± 0%    47.7ns ± 0%  +10.93%  (p=0.000 n=9+10)
AddVWext/1000       394ns ± 0%     399ns ± 0%   +1.27%  (p=0.000 n=10+10)
AddVWext/10000     4.18µs ± 0%    4.50µs ± 0%   +7.73%  (p=0.000 n=9+10)
AddVWext/100000    67.6µs ± 2%    68.4µs ± 3%     ~     (p=0.139 n=9+8)
SubVWext/1         5.46ns ± 1%    5.43ns ± 0%   -0.55%  (p=0.002 n=9+9)
SubVWext/2         5.89ns ± 0%    5.93ns ± 1%   +0.68%  (p=0.000 n=8+10)
SubVWext/3         6.72ns ± 1%    6.79ns ± 1%   +1.07%  (p=0.000 n=10+10)
SubVWext/4         6.98ns ± 1%    7.21ns ± 0%   +3.25%  (p=0.000 n=10+10)
SubVWext/5         8.22ns ± 0%    7.99ns ± 3%   -2.83%  (p=0.000 n=8+10)
SubVWext/10        10.0ns ± 1%    10.1ns ± 1%     ~     (p=0.239 n=10+10)
SubVWext/100       43.0ns ± 0%    47.7ns ± 0%  +10.93%  (p=0.000 n=8+10)
SubVWext/1000       394ns ± 0%     399ns ± 0%   +1.27%  (p=0.000 n=10+10)
SubVWext/10000     4.18µs ± 0%    4.51µs ± 0%   +7.86%  (p=0.000 n=8+8)
SubVWext/100000    68.3µs ± 2%    68.0µs ± 3%     ~     (p=0.515 n=10+8)

Change-Id: I134a5194b8a2deaaebbaa2b771baf72846971d58
Reviewed-on: https://go-review.googlesource.com/c/go/+/229739
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Robert Griesemer <gri@golang.org>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2 files changed
tree: 6c1195a807d207f75bcf5b201fde66e785af127e
  1. .gitattributes
  2. .github/
  3. .gitignore
  4. AUTHORS
  5. CONTRIBUTING.md
  6. CONTRIBUTORS
  7. LICENSE
  8. PATENTS
  9. README.md
  10. SECURITY.md
  11. api/
  12. doc/
  13. favicon.ico
  14. lib/
  15. misc/
  16. robots.txt
  17. src/
  18. test/
README.md

The Go Programming Language

Go is an open source programming language that makes it easy to build simple, reliable, and efficient software.

Gopher image Gopher image by Renee French, licensed under Creative Commons 3.0 Attributions license.

Our canonical Git repository is located at https://go.googlesource.com/go. There is a mirror of the repository at https://github.com/golang/go.

Unless otherwise noted, the Go source files are distributed under the BSD-style license found in the LICENSE file.

Download and Install

Binary Distributions

Official binary distributions are available at https://golang.org/dl/.

After downloading a binary release, visit https://golang.org/doc/install or load doc/install.html in your web browser for installation instructions.

Install From Source

If a binary distribution is not available for your combination of operating system and architecture, visit https://golang.org/doc/install/source or load doc/install-source.html in your web browser for source installation instructions.

Contributing

Go is the work of thousands of contributors. We appreciate your help!

To contribute, please read the contribution guidelines: https://golang.org/doc/contribute.html

Note that the Go project uses the issue tracker for bug reports and proposals only. See https://golang.org/wiki/Questions for a list of places to ask questions about the Go language.