blake2b: add AVX assembly

Add an AVX implementation and improve the SSE4.1 assembly.
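
For readers unfamiliar with how several assembly variants usually coexist, the following is a minimal sketch of runtime selection between them. It is not taken from this change: the `cpuFeatures` struct, the `chooseImpl` function, and the returned labels are hypothetical names used only for illustration, and the preference order (AVX2, then AVX, then SSE4.1, then a generic fallback) is an assumption.

```go
package main

import "fmt"

// cpuFeatures is a hypothetical stand-in for whatever CPU feature
// detection the package performs at init time.
type cpuFeatures struct {
	SSE41 bool
	AVX   bool
	AVX2  bool
}

// chooseImpl returns a label for the widest instruction set available.
// The ordering is an assumption: prefer AVX2, then AVX, then SSE4.1,
// and fall back to portable code otherwise.
func chooseImpl(f cpuFeatures) string {
	switch {
	case f.AVX2:
		return "avx2"
	case f.AVX:
		return "avx"
	case f.SSE41:
		return "sse4.1"
	default:
		return "generic"
	}
}

func main() {
	// A CPU with AVX but without AVX2 would take the new code path.
	f := cpuFeatures{SSE41: true, AVX: true}
	fmt.Println("selected implementation:", chooseImpl(f))
}
```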

AVX vs SSE4.1 (old: SSE4.1, new: AVX)
name 		old time/op 		new time/op 	delta
Write128-8 	249ns ± 0% 		220ns ± 0% 	-11.85% (p=0.029 n=4+4)
Write1K-8 	1.68µs ± 1% 		1.56µs ± 1% 	-6.71% (p=0.029 n=4+4)
Write32K-8 	52.6µs ± 0% 		48.7µs ± 0% 	-7.40% (p=0.029 n=4+4)
Sum128-8 	264ns ± 0% 		241ns ± 1% 	-8.52% (p=0.029 n=4+4)
Sum1K-8 	1.70µs ± 0% 		1.57µs ± 0% 	-7.79% (p=0.029 n=4+4)
Sum32K-8 	54.1µs ± 3% 		49.5µs ± 1% 	-8.36% (p=0.029 n=4+4)

name 		old speed 		new speed 	delta
Write128-8 	513MB/s ± 0% 		582MB/s ± 0% 	+13.38% (p=0.029 n=4+4)
Write1K-8 	610MB/s ± 1% 		654MB/s ± 1% 	+7.22% (p=0.029 n=4+4)
Write32K-8 	622MB/s ± 0% 		672MB/s ± 0% 	+7.99% (p=0.029 n=4+4)
Sum128-8 	484MB/s ± 1% 		529MB/s ± 0% 	+9.21% (p=0.029 n=4+4)
Sum1K-8 	602MB/s ± 0% 		653MB/s ± 0% 	+8.42% (p=0.029 n=4+4)
Sum32K-8 	607MB/s ± 3% 		662MB/s ± 1% 	+9.03% (p=0.029 n=4+4)

AVX2 vs AVX (old: AVX, new: AVX2)
name 		old time/op 		new time/op 	delta
Write128-4 	192ns ± 0% 		166ns ± 0% 	-14.03% (p=0.029 n=4+4)
Write1K-4 	1.37µs ± 0% 		1.19µs ± 0% 	-12.65% (p=0.029 n=4+4)
Write32K-4 	42.5µs ± 0% 		37.3µs ± 0% 	-12.33% (p=0.029 n=4+4)
Sum128-4 	213ns ± 0% 		188ns ± 0% 	-11.97% (p=0.029 n=4+4)
Sum1K-4 	1.40µs ± 0% 		1.22µs ± 0% 	-12.85% (p=0.029 n=4+4)
Sum32K-4 	42.8µs ± 0% 		37.3µs ± 0% 	-12.94% (p=0.029 n=4+4)

name 		old speed 		new speed 	delta
Write128-4 	662MB/s ± 0% 		771MB/s ± 0% 	+16.47% (p=0.029 n=4+4)
Write1K-4 	748MB/s ± 0% 		857MB/s ± 0% 	+14.49% (p=0.029 n=4+4)
Write32K-4 	771MB/s ± 0% 		879MB/s ± 0% 	+14.07% (p=0.029 n=4+4)
Sum128-4 	600MB/s ± 0% 		680MB/s ± 0% 	+13.49% (p=0.029 n=4+4)
Sum1K-4 	733MB/s ± 0% 		841MB/s ± 0% 	+14.72% (p=0.029 n=4+4)
Sum32K-4 	765MB/s ± 0% 		879MB/s ± 0% 	+14.85% (p=0.029 n=4+4)
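
The speed and delta columns can be approximately cross-checked against the time/op columns. The sketch below assumes 1 MB = 1e6 bytes and that the numeric suffix of each benchmark name is the number of bytes hashed per operation; `throughputMBps` and `deltaPercent` are hypothetical helper names, not part of the package. Results computed from the rounded values printed above will differ slightly from the table, presumably because the tool works from the unrounded samples.

```go
package main

import "fmt"

// throughputMBps converts a per-operation time in nanoseconds into
// throughput in MB/s, taking 1 MB as 1e6 bytes.
// bytes/ns equals GB/s, so multiplying by 1e3 yields MB/s.
func throughputMBps(bytesPerOp int, nsPerOp float64) float64 {
	return float64(bytesPerOp) / nsPerOp * 1e3
}

// deltaPercent returns the relative change from oldV to newV in percent.
// A negative value for time/op means the new code is faster.
func deltaPercent(oldV, newV float64) float64 {
	return (newV - oldV) / oldV * 100
}

func main() {
	// Write128 from the first table: 249ns before, 220ns after.
	fmt.Printf("old speed: %.0f MB/s\n", throughputMBps(128, 249))
	fmt.Printf("new speed: %.0f MB/s\n", throughputMBps(128, 220))
	fmt.Printf("time delta: %.2f%%\n", deltaPercent(249, 220))
}
```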

Change-Id: Idf85742e952c07b76c0c7fb5404ed9b0caf0f6eb
Reviewed-on: https://go-review.googlesource.com/34319
Reviewed-by: Adam Langley <agl@golang.org>
5 files changed