poly1305: rewrite the Go implementation with 64-bit limbs

The new code is meant to be readable without external references for
Poly1305, and explains the field logic. The generic code is now 30-50%
faster on an Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, and even better
on a 3.1 GHz i7 MacBook.

name        old time/op    new time/op    delta
64-48          126ns ± 0%      80ns ± 1%  -36.24%  (p=0.000 n=16+20)
1K-48         1.07µs ± 0%    0.81µs ± 2%  -23.63%  (p=0.000 n=19+20)
2M-48         2.07ms ± 0%    1.61ms ± 1%  -22.31%  (p=0.000 n=20+20)
Write64-48    79.3ns ± 0%    58.0ns ± 1%  -26.89%  (p=0.000 n=20+19)
Write1K-48    1.02µs ± 0%    0.79µs ± 1%  -22.91%  (p=0.000 n=19+19)
Write2M-48    2.07ms ± 0%    1.61ms ± 2%  -22.33%  (p=0.000 n=17+20)

name        old speed      new speed      delta
64-48        508MB/s ± 0%   797MB/s ± 1%  +56.95%  (p=0.000 n=16+20)
1K-48        960MB/s ± 0%  1257MB/s ± 2%  +30.94%  (p=0.000 n=18+20)
2M-48       1.01GB/s ± 0%  1.30GB/s ± 1%  +28.73%  (p=0.000 n=20+20)
Write64-48   807MB/s ± 0%  1104MB/s ± 1%  +36.78%  (p=0.000 n=18+19)
Write1K-48  1.00GB/s ± 0%  1.30GB/s ± 1%  +29.71%  (p=0.000 n=18+19)
Write2M-48  1.01GB/s ± 0%  1.31GB/s ± 2%  +28.77%  (p=0.000 n=17+20)
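
As a rough illustration of what working with 64-bit limbs involves, the
sketch below multiplies two limbs into a 128-bit product and accumulates
two such products with carry propagation, using math/bits. The uint128
type and the mul64 and add128 helpers are hypothetical names chosen for
this example; they are not claimed to match the code in this change.

```go
package main

import (
	"fmt"
	"math/bits"
)

// uint128 holds a 128-bit value as two 64-bit halves.
// Hypothetical helper type for this sketch.
type uint128 struct {
	lo, hi uint64
}

// mul64 returns the full 128-bit product of two 64-bit limbs.
func mul64(a, b uint64) uint128 {
	hi, lo := bits.Mul64(a, b)
	return uint128{lo: lo, hi: hi}
}

// add128 returns a + b, discarding any carry out of bit 127.
func add128(a, b uint128) uint128 {
	lo, carry := bits.Add64(a.lo, b.lo, 0)
	hi, _ := bits.Add64(a.hi, b.hi, carry)
	return uint128{lo: lo, hi: hi}
}

func main() {
	// Accumulate two limb products, as one column of a schoolbook
	// multiplication would.
	sum := add128(mul64(1<<63, 4), mul64(^uint64(0), 2))
	fmt.Printf("hi=%d lo=%d\n", sum.hi, sum.lo)
}
```

On amd64 the compiler can typically lower bits.Mul64 and bits.Add64 to
single instructions, which is part of why wide limbs can pay off in
plain Go.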

The assembly is still 50-90% faster on the Xeon, and 30-60% faster on the
MacBook. The Go code does not use all the arithmetic tricks the assembly
does, and it does not have access to the three-operand wide shift
instruction.
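
To make the wide shift point concrete: a double-width shift instruction
(presumably something like x86 SHRD, which is an assumption here) moves
bits from a high word into a low word in one step. In portable Go the
same effect takes two shifts and an OR, roughly as in this sketch. The
function name shiftRightBy2 and its signature are illustrative only.

```go
package main

import "fmt"

// shiftRightBy2 shifts the 128-bit value hi:lo right by two bits.
// The two low bits of hi move into the two top bits of lo.
func shiftRightBy2(hi, lo uint64) (uint64, uint64) {
	return hi >> 2, lo>>2 | hi<<62
}

func main() {
	hi, lo := shiftRightBy2(3, 5)
	fmt.Printf("hi=%#x lo=%#x\n", hi, lo)
}
```

Whether the compiler fuses this pattern into a single instruction
depends on the Go version and target, so it should not be relied upon.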

name        old time/op    new time/op    delta
64-48         80.3ns ± 1%    54.2ns ± 0%  -32.50%  (p=0.000 n=20+17)
1K-48          815ns ± 2%     446ns ± 1%  -45.27%  (p=0.000 n=20+20)
2M-48         1.61ms ± 1%    0.86ms ± 0%  -46.54%  (p=0.000 n=20+17)
Write64-48    58.0ns ± 1%    34.0ns ± 0%  -41.34%  (p=0.000 n=19+20)
Write1K-48     790ns ± 1%     427ns ± 0%  -45.92%  (p=0.000 n=19+17)
Write2M-48    1.61ms ± 2%    0.86ms ± 0%  -46.51%  (p=0.000 n=20+20)

name        old speed      new speed      delta
64-48        797MB/s ± 1%  1180MB/s ± 0%  +48.09%  (p=0.000 n=20+19)
1K-48       1.26GB/s ± 2%  2.30GB/s ± 1%  +82.71%  (p=0.000 n=20+20)
2M-48       1.30GB/s ± 1%  2.44GB/s ± 0%  +87.04%  (p=0.000 n=20+17)
Write64-48  1.10GB/s ± 1%  1.88GB/s ± 0%  +70.52%  (p=0.000 n=19+18)
Write1K-48  1.30GB/s ± 1%  2.40GB/s ± 0%  +84.84%  (p=0.000 n=19+18)
Write2M-48  1.31GB/s ± 2%  2.44GB/s ± 0%  +86.93%  (p=0.000 n=20+20)
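
The time/op and speed tables are two views of the same measurements. As
a sanity check, the sketch below recomputes throughput from time/op,
assuming the benchmark name gives the input size in bytes (64, and 1K
taken as 1024) and that 1 MB is 10^6 bytes, as the testing package
reports it. The mbPerSec helper is a hypothetical name for this example.

```go
package main

import "fmt"

// mbPerSec converts a per-operation time in nanoseconds into MB/s,
// for an operation that processes the given number of bytes.
// bytes/ns equals GB/s, so multiplying by 1e3 yields MB/s.
func mbPerSec(bytes int, nsPerOp float64) float64 {
	return float64(bytes) / nsPerOp * 1e3
}

func main() {
	// Values taken from the assembly column of the tables above.
	fmt.Printf("64: %.0f MB/s\n", mbPerSec(64, 54.2))
	fmt.Printf("1K: %.0f MB/s\n", mbPerSec(1024, 446))
}
```

The results land close to the reported 1180MB/s and 2.30GB/s; small
differences are expected because the tables show rounded means.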

Hopefully this will also avoid the need for an arm64 assembly
implementation.

Now that the Go code and the amd64/ppc64le assembly use the same limb
schedule, drop the assembly initialize and finalize implementations
and make the wrapper code match. This comes with a minor slowdown.

name        old time/op    new time/op    delta
64-48         50.3ns ± 0%    54.2ns ± 0%  +7.73%  (p=0.000 n=20+17)
1K-48          441ns ± 0%     446ns ± 1%  +1.10%  (p=0.000 n=19+20)
2M-48          860µs ± 0%     859µs ± 0%    ~     (p=0.178 n=19+17)
Write64-48    34.0ns ± 0%    34.0ns ± 0%    ~     (all equal)
Write1K-48     424ns ± 0%     427ns ± 0%  +0.71%  (p=0.000 n=17+17)
Write2M-48     860µs ± 0%     859µs ± 0%  -0.04%  (p=0.000 n=19+20)

name        old speed      new speed      delta
64-48       1.27GB/s ± 0%  1.18GB/s ± 0%  -7.20%  (p=0.000 n=20+19)
1K-48       2.32GB/s ± 0%  2.30GB/s ± 1%  -1.07%  (p=0.000 n=18+20)
2M-48       2.44GB/s ± 0%  2.44GB/s ± 0%    ~     (p=0.173 n=19+17)
Write64-48  1.88GB/s ± 0%  1.88GB/s ± 0%  +0.04%  (p=0.000 n=19+18)
Write1K-48  2.41GB/s ± 0%  2.40GB/s ± 0%  -0.67%  (p=0.000 n=19+18)
Write2M-48  2.44GB/s ± 0%  2.44GB/s ± 0%  +0.04%  (p=0.000 n=19+20)

Since poly1305/sum_generic.go was almost entirely rewritten, it's
probably best reviewed on gitiles.

This is the implementation published at
https://blog.filippo.io/a-literate-go-implementation-of-poly1305/

Updates #31470

Change-Id: I74f9011d3ee317a43b05ae7f05d96081d08bffd3
Reviewed-on: https://go-review.googlesource.com/c/crypto/+/169037
Reviewed-by: Katie Hockman <katie@golang.org>
12 files changed