x/crypto/poly1305: optimize amd64 assembly performance

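Speed up Poly1305 on amd64 with a faster assembly implementation.
Throughput improves by roughly 2.4x on 64-byte inputs and 1.9x on
1K inputs, for both aligned and unaligned buffers.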

name           old time/op  new time/op  delta
64-8           101ns ± 4%   42ns ± 3%    -58.31%  (p=0.002 n=6+6)
1K-8           887ns ± 1%   456ns ± 1%   -48.53%  (p=0.002 n=6+6)
64Unaligned-8  98.1ns ± 1%  41.1ns ± 1%  -58.06%  (p=0.002 n=6+6)
1KUnaligned-8  885ns ± 2%   460ns ± 3%   -48.04%  (p=0.002 n=6+6)

name           old speed      new speed      delta
64-8           635MB/s ± 4%   1525MB/s ± 3%  +140.15%  (p=0.002 n=6+6)
1K-8           1.15GB/s ± 1%  2.24GB/s ± 1%  +94.22%   (p=0.002 n=6+6)
64Unaligned-8  653MB/s ± 1%   1557MB/s ± 1%  +138.58%  (p=0.002 n=6+6)
1KUnaligned-8  1.16GB/s ± 2%  2.23GB/s ± 3%  +92.46%   (p=0.002 n=6+6)

Change-Id: Ia3be8e7ff012f8a9b451d728a646f29f809ba665
Reviewed-on: https://go-review.googlesource.com/29993
Reviewed-by: Adam Langley <agl@golang.org>
2 files changed