sha3: use unaligned reads and xors on x86 and x64

Speedup of about 1.4x on x64. Added benchmarks that use the
ShakeHash interface, which doesn't require copying the state.

Unaligned or generic xorIn and copyOut functions chosen via
buildline, but both are tested.

Substantial contributions from Eric Eisner.

See golang.org/cl/151630044 for the previous CR.

(There are also some minor edits/additions to the documentation.)

Change-Id: I9500c25682457c82487512b9b8c66df7d75bff5d
Reviewed-on: https://go-review.googlesource.com/2132
Reviewed-by: Adam Langley <agl@golang.org>
6 files changed