internal/chacha20: refactor for readability and consistency

Separated the complex buffering logic from key stream generation more
clearly, added plenty of comments and generally refactored the Go
implementation for readability. Made the interface with the
generic/assembly cores smaller and more consistent, according to
golang.org/wiki/TargetSpecific.

The benchmarks below show a small slowdown on unaligned calls. We will
recover it by caching 3/4 of the first round across XORKeyStream
invocations, which we now have the complexity budget for.
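
One way to read "3/4 of the first round", sketched below under the
assumption of the usual state layout (constants, key, a 32-bit counter
in word 12, nonce in words 13-15): only one of the four column
quarter-rounds touches the counter, so the other three give the same
result for every block. The helper names are hypothetical and this is
not the planned implementation, only a check of that property.

```go
package main

import (
	"fmt"
	"math/bits"
)

// quarterRound is the standard ChaCha quarter-round.
func quarterRound(a, b, c, d uint32) (uint32, uint32, uint32, uint32) {
	a += b
	d ^= a
	d = bits.RotateLeft32(d, 16)
	c += d
	b ^= c
	b = bits.RotateLeft32(b, 12)
	a += b
	d ^= a
	d = bits.RotateLeft32(d, 8)
	c += d
	b ^= c
	b = bits.RotateLeft32(b, 7)
	return a, b, c, d
}

// firstColumnRound (hypothetical helper) builds the initial state and
// applies the four column quarter-rounds that make up the first round.
func firstColumnRound(key [8]uint32, counter uint32, nonce [3]uint32) [16]uint32 {
	x := [16]uint32{
		0x61707865, 0x3320646e, 0x79622d32, 0x6b206574, // "expand 32-byte k"
		key[0], key[1], key[2], key[3],
		key[4], key[5], key[6], key[7],
		counter, nonce[0], nonce[1], nonce[2],
	}
	x[0], x[4], x[8], x[12] = quarterRound(x[0], x[4], x[8], x[12]) // uses counter
	x[1], x[5], x[9], x[13] = quarterRound(x[1], x[5], x[9], x[13])
	x[2], x[6], x[10], x[14] = quarterRound(x[2], x[6], x[10], x[14])
	x[3], x[7], x[11], x[15] = quarterRound(x[3], x[7], x[11], x[15])
	return x
}

// counterFree (hypothetical helper) returns the twelve words of columns
// 1 to 3, the part that could be cached across blocks.
func counterFree(x [16]uint32) [12]uint32 {
	var out [12]uint32
	n := 0
	for i, w := range x {
		if i%4 != 0 {
			out[n] = w
			n++
		}
	}
	return out
}

func main() {
	key := [8]uint32{1, 2, 3, 4, 5, 6, 7, 8}
	nonce := [3]uint32{9, 10, 11}

	block0 := firstColumnRound(key, 0, nonce)
	block1 := firstColumnRound(key, 1, nonce)

	// Columns 1 to 3 are identical for both counters.
	fmt.Println(counterFree(block0) == counterFree(block1))
}
```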

name                old speed     new speed     delta
ChaCha20/64-4       435MB/s ± 2%  429MB/s ± 2%  -1.47%  (p=0.013 n=10+9)
ChaCha20/256-4      496MB/s ± 1%  493MB/s ± 2%    ~     (p=0.280 n=10+10)
ChaCha20/10x25-4    283MB/s ± 1%  274MB/s ± 2%  -3.13%  (p=0.000 n=10+10)
ChaCha20/4096-4     494MB/s ± 1%  493MB/s ± 5%    ~     (p=0.631 n=10+10)
ChaCha20/100x40-4   421MB/s ± 3%  408MB/s ± 1%  -3.14%  (p=0.003 n=9+9)
ChaCha20/65536-4    515MB/s ± 1%  519MB/s ± 3%    ~     (p=0.161 n=7+10)
ChaCha20/1000x65-4  501MB/s ± 2%  501MB/s ± 3%    ~     (p=0.497 n=9+10)

Also applied a fix for a lingering bug in the ppc64le assembly written
by Lynn Boger <laboger@linux.vnet.ibm.com>.

Updates golang/go#24485

Change-Id: I10cf24a7f10359b1b4ae63c9bb1946735b98ac9b
Reviewed-on: https://go-review.googlesource.com/c/crypto/+/185439
Reviewed-by: Michael Munday <mike.munday@ibm.com>
diff --git a/internal/chacha20/chacha_noasm.go b/internal/chacha20/chacha_noasm.go
index fc26825..ec609ed 100644
--- a/internal/chacha20/chacha_noasm.go
+++ b/internal/chacha20/chacha_noasm.go
@@ -6,11 +6,8 @@
 
 package chacha20
 
-const (
-	bufSize = 64
-	haveAsm = false
-)
+const bufSize = blockSize
 
-func (*Cipher) xorKeyStreamAsm(dst, src []byte) {
-	panic("not implemented")
+func (s *Cipher) xorKeyStreamBlocks(dst, src []byte) {
+	s.xorKeyStreamBlocksGeneric(dst, src)
 }