sha3: use unaligned reads and xors on x86 and x64

Speedup of about 1.4x on x64. Added benchmarks that use the
ShakeHash interface, which doesn't require copying the state.

Unaligned or generic xorIn and copyOut functions are chosen via
build tags, but both are tested.

Substantial contributions from Eric Eisner.

See golang.org/cl/151630044 for the previous CR.

(There are also some minor edits/additions to the documentation.)

Change-Id: I9500c25682457c82487512b9b8c66df7d75bff5d
Reviewed-on: https://go-review.googlesource.com/2132
Reviewed-by: Adam Langley <agl@golang.org>
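For context, the build-tag selection mentioned above comes down to package-level function variables that different files assign; a minimal self-contained sketch (a stand-in state type and illustrative values; in the real change the assignments live in xor.go and xor_unaligned.go behind build tags):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// state stands in for the package's sponge state (25 64-bit lanes).
type state struct{ a [25]uint64 }

// xorInGeneric XORs buf (a whole number of 8-byte lanes) into the
// state, decoding each lane as little-endian, as the portable path does.
func xorInGeneric(d *state, buf []byte) {
	for i := 0; len(buf) >= 8; i++ {
		d.a[i] ^= binary.LittleEndian.Uint64(buf)
		buf = buf[8:]
	}
}

// In the real package this variable is assigned in separate files
// guarded by build tags (e.g. "+build amd64 386"), so the faster
// implementation is chosen at build time but both remain testable.
var xorIn = xorInGeneric

func main() {
	var d state
	buf := make([]byte, 16)
	buf[0], buf[8] = 1, 2 // lane 0 = 1, lane 1 = 2 (little-endian)
	xorIn(&d, buf)
	fmt.Println(d.a[0], d.a[1]) // 1 2
}
```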
diff --git a/sha3/doc.go b/sha3/doc.go
index 027c8ad..a0ee3ae 100644
--- a/sha3/doc.go
+++ b/sha3/doc.go
@@ -12,7 +12,8 @@
 // Guidance
 //
 // If you aren't sure what function you need, use SHAKE256 with at least 64
-// bytes of output.
+// bytes of output. The SHAKE instances are faster than the SHA3 instances;
+// the latter have to allocate memory to conform to the hash.Hash interface.
 //
 // If you need a secret-key MAC (message authentication code), prepend the
 // secret key to the input, hash with SHAKE256 and read at least 32 bytes of
@@ -21,45 +22,42 @@
 //
 // Security strengths
 //
-// The SHA3-x functions have a security strength against preimage attacks of x
-// bits. Since they only produce x bits of output, their collision-resistance
-// is only x/2 bits.
+// The SHA3-x (x equals 224, 256, 384, or 512) functions have a security
+// strength against preimage attacks of x bits. Since they only produce x
+// bits of output, their collision-resistance is only x/2 bits.
 //
-// The SHAKE-x functions have a generic security strength of x bits against
-// all attacks, provided that at least 2x bits of their output is used.
-// Requesting more than 2x bits of output does not increase the collision-
-// resistance of the SHAKE functions.
+// The SHAKE-256 and -128 functions have a generic security strength of 256 and
+// 128 bits against all attacks, provided that at least 64 or 32 bytes of their
+// output are used, respectively. Requesting more than that does not increase
+// the collision-resistance of the SHAKE functions.
 //
 //
 // The sponge construction
 //
-// A sponge builds a pseudo-random function from a pseudo-random permutation,
-// by applying the permutation to a state of "rate + capacity" bytes, but
-// hiding "capacity" of the bytes.
+// A sponge builds a pseudo-random function from a public pseudo-random
+// permutation, by applying the permutation to a state of "rate + capacity"
+// bytes, but hiding "capacity" of the bytes.
 //
 // A sponge starts out with a zero state. To hash an input using a sponge, up
 // to "rate" bytes of the input are XORed into the sponge's state. The sponge
-// has thus been "filled up" and the permutation is applied. This process is
+// is then "full" and the permutation is applied to "empty" it. This process is
 // repeated until all the input has been "absorbed". The input is then padded.
-// The digest is "squeezed" from the sponge by the same method, except that
-// output is copied out.
+// The digest is "squeezed" from the sponge in the same way, except that output
+// is copied out instead of input being XORed in.
 //
 // A sponge is parameterized by its generic security strength, which is equal
 // to half its capacity; capacity + rate is equal to the permutation's width.
-//
 // Since the KeccakF-1600 permutation is 1600 bits (200 bytes) wide, this means
-// that security_strength == (1600 - bitrate) / 2.
+// that the security strength of a sponge instance is equal to (1600 - bitrate) / 2.
 //
 //
-// Recommendations, detailed
+// Recommendations
 //
 // The SHAKE functions are recommended for most new uses. They can produce
 // output of arbitrary length. SHAKE256, with an output length of at least
-// 64 bytes, provides 256-bit security against all attacks.
-//
-// The Keccak team recommends SHAKE256 for most applications upgrading from
-// SHA2-512. (NIST chose a much stronger, but much slower, sponge instance
-// for SHA3-512.)
+// 64 bytes, provides 256-bit security against all attacks.  The Keccak team
+// recommends it for most applications upgrading from SHA2-512. (NIST chose a
+// much stronger, but much slower, sponge instance for SHA3-512.)
 //
 // The SHA-3 functions are "drop-in" replacements for the SHA-2 functions.
 // They produce output of the same length, with the same security strengths
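The absorb/squeeze cycle described in the comments above can be sketched with a toy permutation standing in for KeccakF-1600 (illustrative only: the mixing function and parameters here are invented and NOT cryptographic, and the final-block padding is omitted):

```go
package main

import "fmt"

const rate = 8 // toy rate in bytes; the capacity is hidden in state[rate:]

// permute is a stand-in for KeccakF-1600: any mixing function
// illustrates the sponge structure (this one is NOT cryptographic).
func permute(state *[16]byte) {
	var x byte = 0x5a
	for i := range state {
		x = x<<1 | x>>7
		state[i] = state[i]*31 + x + state[(i+7)%16]
	}
}

func spongeHash(input []byte, outLen int) []byte {
	var state [16]byte
	// Absorb: XOR up to "rate" bytes of input in, permuting when full.
	for len(input) > 0 {
		n := len(input)
		if n > rate {
			n = rate
		}
		for i := 0; i < n; i++ {
			state[i] ^= input[i]
		}
		input = input[n:]
		permute(&state)
	}
	// (Real SHA-3 pads the final block; omitted in this toy.)
	// Squeeze: copy "rate" bytes out, permuting between blocks.
	var out []byte
	for len(out) < outLen {
		out = append(out, state[:rate]...)
		permute(&state)
	}
	return out[:outLen]
}

func main() {
	fmt.Printf("%x\n", spongeHash([]byte("abc"), 16))
}
```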
diff --git a/sha3/sha3.go b/sha3/sha3.go
index 8d77568..c8fd31c 100644
--- a/sha3/sha3.go
+++ b/sha3/sha3.go
@@ -4,10 +4,6 @@
 
 package sha3
 
-import (
-	"encoding/binary"
-)
-
 // spongeDirection indicates the direction bytes are flowing through the sponge.
 type spongeDirection int
 
@@ -30,25 +26,25 @@
 	buf  []byte     // points into storage
 	rate int        // the number of bytes of state to use
 
-	// dsbyte contains the "domain separation" value and the first bit of
-	// the padding. In sections 6.1 and 6.2 of [1], the SHA-3 and SHAKE
-	// functions are defined with bits appended to the message: SHA-3
-	// functions have 01 and SHAKE functions have 1111. Because of the way
-	// that bits are numbered from the LSB upwards, that ends up as
-	// 00000010b and 00001111b, respectively. Then the padding rule from
-	// section 5.1 is applied to pad to a multiple of the rate, which
-	// involves adding a 1 bit, zero or more zero bits and then a final one
-	// bit. The first one bit from the padding is merged into the dsbyte
-	// value giving 00000110b (0x06) and 00011111b (0x1f), respectively.
-	//
-	// [1] http://csrc.nist.gov/publications/drafts/fips-202/fips_202_draft.pdf,
+	// dsbyte contains the "domain separation" bits and the first bit of
+	// the padding. Sections 6.1 and 6.2 of [1] separate the outputs of the
+	// SHA-3 and SHAKE functions by appending bitstrings to the message.
+	// Using a little-endian bit-ordering convention, these are "01" for SHA-3
+	// and "1111" for SHAKE, or 00000010b and 00001111b, respectively. Then the
+	// padding rule from section 5.1 is applied to pad the message to a multiple
+	// of the rate, which involves adding a "1" bit, zero or more "0" bits, and
+	// a final "1" bit. We merge the first "1" bit from the padding into dsbyte,
+	// giving 00000110b (0x06) and 00011111b (0x1f).
+	// [1] http://csrc.nist.gov/publications/drafts/fips-202/fips_202_draft.pdf
+	//     "Draft FIPS 202: SHA-3 Standard: Permutation-Based Hash and
+	//      Extendable-Output Functions (May 2014)"
 	dsbyte  byte
 	storage [maxRate]byte
 
 	// Specific to SHA-3 and SHAKE.
 	fixedOutput bool            // whether this is a fixed-output-length instance
 	outputLen   int             // the default output size in bytes
-	state       spongeDirection // current direction of the sponge
+	state       spongeDirection // whether the sponge is absorbing or squeezing
 }
 
 // BlockSize returns the rate of the sponge underlying this hash function.
@@ -79,35 +75,6 @@
 	return &ret
 }
 
-// xorIn xors a buffer into the state, byte-swapping to
-// little-endian as necessary; it returns the number of bytes
-// copied, including any zeros appended to the bytestring.
-func (d *state) xorIn(buf []byte) {
-	n := len(buf) / 8
-
-	for i := 0; i < n; i++ {
-		a := binary.LittleEndian.Uint64(buf)
-		d.a[i] ^= a
-		buf = buf[8:]
-	}
-	if len(buf) != 0 {
-		// XOR in the last partial ulint64.
-		a := uint64(0)
-		for i, v := range buf {
-			a |= uint64(v) << uint64(8*i)
-		}
-		d.a[n] ^= a
-	}
-}
-
-// copyOut copies ulint64s to a byte buffer.
-func (d *state) copyOut(b []byte) {
-	for i := 0; len(b) >= 8; i++ {
-		binary.LittleEndian.PutUint64(b, d.a[i])
-		b = b[8:]
-	}
-}
-
 // permute applies the KeccakF-1600 permutation. It handles
 // any input-output buffering.
 func (d *state) permute() {
@@ -115,7 +82,7 @@
 	case spongeAbsorbing:
 		// If we're absorbing, we need to xor the input into the state
 		// before applying the permutation.
-		d.xorIn(d.buf)
+		xorIn(d, d.buf)
 		d.buf = d.storage[:0]
 		keccakF1600(&d.a)
 	case spongeSqueezing:
@@ -123,7 +90,7 @@
 		// copying more output.
 		keccakF1600(&d.a)
 		d.buf = d.storage[:d.rate]
-		d.copyOut(d.buf)
+		copyOut(d, d.buf)
 	}
 }
 
@@ -151,7 +118,7 @@
 	d.permute()
 	d.state = spongeSqueezing
 	d.buf = d.storage[:d.rate]
-	d.copyOut(d.buf)
+	copyOut(d, d.buf)
 }
 
 // Write absorbs more data into the hash's state. It produces an error
@@ -168,7 +135,7 @@
 	for len(p) > 0 {
 		if len(d.buf) == 0 && len(p) >= d.rate {
 			// The fast path; absorb a full "rate" bytes of input and apply the permutation.
-			d.xorIn(p[:d.rate])
+			xorIn(d, p[:d.rate])
 			p = p[d.rate:]
 			keccakF1600(&d.a)
 		} else {
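The dsbyte comment can be made concrete. Under the pad10*1 rule it describes, the final block XORed into the state carries dsbyte right after the message bytes and 0x80 in the last byte of the rate; a sketch (padBlock is a hypothetical helper, not a function in this package):

```go
package main

import "fmt"

// padBlock returns the rate-sized block that gets XORed into the state
// after msgLen message bytes have been absorbed into the current block.
// dsbyte carries the domain-separation bits plus the first padding "1"
// bit (0x06 for SHA-3, 0x1f for SHAKE, per the comment in sha3.go).
func padBlock(rate, msgLen int, dsbyte byte) []byte {
	block := make([]byte, rate)
	block[msgLen] = dsbyte // domain separation + first "1" of pad10*1
	block[rate-1] |= 0x80  // final "1" bit of pad10*1; may merge with dsbyte
	return block
}

func main() {
	// SHAKE128's rate is 168 bytes; an empty message pads a whole block.
	b := padBlock(168, 0, 0x1f)
	fmt.Printf("%#02x ... %#02x\n", b[0], b[167]) // 0x1f ... 0x80
}
```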
diff --git a/sha3/sha3_test.go b/sha3/sha3_test.go
index 6f84863..79d962e 100644
--- a/sha3/sha3_test.go
+++ b/sha3/sha3_test.go
@@ -7,8 +7,8 @@
 // Tests include all the ShortMsgKATs provided by the Keccak team at
 // https://github.com/gvanas/KeccakCodePackage
 //
-// They only include the zero-bit case of the utterly useless bitwise
-// testvectors published by NIST in the draft of FIPS-202.
+// They only include the zero-bit case of the bitwise testvectors
+// published by NIST in the draft of FIPS-202.
 
 import (
 	"bytes"
@@ -46,14 +46,14 @@
 	"SHAKE256": newHashShake256,
 }
 
-// testShakes contains functions returning ShakeHash instances for
+// testShakes contains functions that return ShakeHash instances for
 // testing the ShakeHash-specific interface.
 var testShakes = map[string]func() ShakeHash{
 	"SHAKE128": NewShake128,
 	"SHAKE256": NewShake256,
 }
 
-// decodeHex converts an hex-encoded string into a raw byte string.
+// decodeHex converts a hex-encoded string into a raw byte string.
 func decodeHex(s string) []byte {
 	b, err := hex.DecodeString(s)
 	if err != nil {
@@ -71,135 +71,146 @@
 	}
 }
 
+func testUnalignedAndGeneric(t *testing.T, testf func(impl string)) {
+	xorInOrig, copyOutOrig := xorIn, copyOut
+	xorIn, copyOut = xorInGeneric, copyOutGeneric
+	testf("generic")
+	if xorImplementationUnaligned != "generic" {
+		xorIn, copyOut = xorInUnaligned, copyOutUnaligned
+		testf("unaligned")
+	}
+	xorIn, copyOut = xorInOrig, copyOutOrig
+}
+
 // TestKeccakKats tests the SHA-3 and Shake implementations against all the
 // ShortMsgKATs from https://github.com/gvanas/KeccakCodePackage
 // (The testvectors are stored in keccakKats.json.deflate due to their length.)
 func TestKeccakKats(t *testing.T) {
-	// Read the KATs.
-	deflated, err := os.Open(katFilename)
-	if err != nil {
-		t.Errorf("Error opening %s: %s", katFilename, err)
-	}
-	file := flate.NewReader(deflated)
-	dec := json.NewDecoder(file)
-	var katSet KeccakKats
-	err = dec.Decode(&katSet)
-	if err != nil {
-		t.Errorf("%s", err)
-	}
+	testUnalignedAndGeneric(t, func(impl string) {
+		// Read the KATs.
+		deflated, err := os.Open(katFilename)
+		if err != nil {
+			t.Errorf("error opening %s: %s", katFilename, err)
+		}
+		file := flate.NewReader(deflated)
+		dec := json.NewDecoder(file)
+		var katSet KeccakKats
+		err = dec.Decode(&katSet)
+		if err != nil {
+			t.Errorf("error decoding KATs: %s", err)
+		}
 
-	// Do the KATs.
-	for functionName, kats := range katSet.Kats {
-		d := testDigests[functionName]()
-		t.Logf("%s", functionName)
-		for _, kat := range kats {
-			d.Reset()
-			in, err := hex.DecodeString(kat.Message)
-			if err != nil {
-				t.Errorf("%s", err)
-			}
-			d.Write(in[:kat.Length/8])
-			got := strings.ToUpper(hex.EncodeToString(d.Sum(nil)))
-			want := kat.Digest
-			if got != want {
-				t.Errorf("function=%s, length=%d\nmessage:\n  %s\ngot:\n  %s\nwanted:\n %s",
-					functionName, kat.Length, kat.Message, got, want)
-				t.Logf("wanted %+v", kat)
-				t.FailNow()
+		// Do the KATs.
+		for functionName, kats := range katSet.Kats {
+			d := testDigests[functionName]()
+			for _, kat := range kats {
+				d.Reset()
+				in, err := hex.DecodeString(kat.Message)
+				if err != nil {
+					t.Errorf("error decoding KAT: %s", err)
+				}
+				d.Write(in[:kat.Length/8])
+				got := strings.ToUpper(hex.EncodeToString(d.Sum(nil)))
+				if got != kat.Digest {
+					t.Errorf("function=%s, implementation=%s, length=%d\nmessage:\n  %s\ngot:\n  %s\nwanted:\n %s",
+						functionName, impl, kat.Length, kat.Message, got, kat.Digest)
+					t.Logf("wanted %+v", kat)
+					t.FailNow()
+				}
 			}
 		}
-	}
+	})
 }
 
 // TestUnalignedWrite tests that writing data in an arbitrary pattern with
 // small input buffers works.
-func TestUnalignedWrite(t *testing.T) {
-	buf := sequentialBytes(0x10000)
-	for alg, df := range testDigests {
-		d := df()
-		d.Reset()
-		d.Write(buf)
-		want := d.Sum(nil)
-		d.Reset()
-		for i := 0; i < len(buf); {
-			// Cycle through offsets which make a 137 byte sequence.
-			// Because 137 is prime this sequence should exercise all corner cases.
-			offsets := [17]int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1}
-			for _, j := range offsets {
-				if v := len(buf) - i; v < j {
-					j = v
+func TestUnalignedWrite(t *testing.T) {
+	testUnalignedAndGeneric(t, func(impl string) {
+		buf := sequentialBytes(0x10000)
+		for alg, df := range testDigests {
+			d := df()
+			d.Reset()
+			d.Write(buf)
+			want := d.Sum(nil)
+			d.Reset()
+			for i := 0; i < len(buf); {
+				// Cycle through offsets which make a 137 byte sequence.
+				// Because 137 is prime this sequence should exercise all corner cases.
+				offsets := [17]int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1}
+				for _, j := range offsets {
+					if v := len(buf) - i; v < j {
+						j = v
+					}
+					d.Write(buf[i : i+j])
+					i += j
 				}
-				d.Write(buf[i : i+j])
-				i += j
+			}
+			got := d.Sum(nil)
+			if !bytes.Equal(got, want) {
+				t.Errorf("Unaligned writes, implementation=%s, alg=%s\ngot %q, want %q", impl, alg, got, want)
 			}
 		}
-		got := d.Sum(nil)
-		if !bytes.Equal(got, want) {
-			t.Errorf("Unaligned writes, alg=%s\ngot %q, want %q", alg, got, want)
-		}
-	}
+	})
 }
 
-// Test that appending works when reallocation is necessary.
+// TestAppend checks that appending works when reallocation is necessary.
 func TestAppend(t *testing.T) {
-	d := New224()
+	testUnalignedAndGeneric(t, func(impl string) {
+		d := New224()
 
-	for capacity := 2; capacity < 64; capacity += 64 {
-		// The first time around the loop, Sum will have to reallocate.
-		// The second time, it will not.
-		buf := make([]byte, 2, capacity)
-		d.Reset()
+		for capacity := 2; capacity < 64; capacity += 64 {
+			// The first time around the loop, Sum will have to reallocate.
+			// The second time, it will not.
+			buf := make([]byte, 2, capacity)
+			d.Reset()
+			d.Write([]byte{0xcc})
+			buf = d.Sum(buf)
+			expected := "0000DF70ADC49B2E76EEE3A6931B93FA41841C3AF2CDF5B32A18B5478C39"
+			if got := strings.ToUpper(hex.EncodeToString(buf)); got != expected {
+				t.Errorf("got %s, want %s", got, expected)
+			}
+		}
+	})
+}
+
+// TestAppendNoRealloc tests that appending works when no reallocation is necessary.
+func TestAppendNoRealloc(t *testing.T) {
+	testUnalignedAndGeneric(t, func(impl string) {
+		buf := make([]byte, 1, 200)
+		d := New224()
 		d.Write([]byte{0xcc})
 		buf = d.Sum(buf)
-		expected := "0000DF70ADC49B2E76EEE3A6931B93FA41841C3AF2CDF5B32A18B5478C39"
+		expected := "00DF70ADC49B2E76EEE3A6931B93FA41841C3AF2CDF5B32A18B5478C39"
 		if got := strings.ToUpper(hex.EncodeToString(buf)); got != expected {
-			t.Errorf("got %s, want %s", got, expected)
+			t.Errorf("%s: got %s, want %s", impl, got, expected)
 		}
-	}
-}
-
-// Test that appending works when no reallocation is necessary.
-func TestAppendNoRealloc(t *testing.T) {
-	buf := make([]byte, 1, 200)
-	d := New224()
-	d.Write([]byte{0xcc})
-	buf = d.Sum(buf)
-	expected := "00DF70ADC49B2E76EEE3A6931B93FA41841C3AF2CDF5B32A18B5478C39"
-	if got := strings.ToUpper(hex.EncodeToString(buf)); got != expected {
-		t.Errorf("got %s, want %s", got, expected)
-	}
+	})
 }
 
 // TestSqueezing checks that squeezing the full output a single time produces
 // the same output as repeatedly squeezing the instance.
 func TestSqueezing(t *testing.T) {
-	for functionName, newShakeHash := range testShakes {
-		t.Logf("%s", functionName)
-		d0 := newShakeHash()
-		d0.Write([]byte(testString))
-		ref := make([]byte, 32)
-		d0.Read(ref)
+	testUnalignedAndGeneric(t, func(impl string) {
+		for functionName, newShakeHash := range testShakes {
+			d0 := newShakeHash()
+			d0.Write([]byte(testString))
+			ref := make([]byte, 32)
+			d0.Read(ref)
 
-		d1 := newShakeHash()
-		d1.Write([]byte(testString))
-		var multiple []byte
-		for _ = range ref {
-			one := make([]byte, 1)
-			d1.Read(one)
-			multiple = append(multiple, one...)
+			d1 := newShakeHash()
+			d1.Write([]byte(testString))
+			var multiple []byte
+			for range ref {
+				one := make([]byte, 1)
+				d1.Read(one)
+				multiple = append(multiple, one...)
+			}
+			if !bytes.Equal(ref, multiple) {
+				t.Errorf("%s (%s): squeezing %d bytes one at a time failed", functionName, impl, len(ref))
+			}
 		}
-		if !bytes.Equal(ref, multiple) {
-			t.Errorf("squeezing %d bytes one at a time failed", len(ref))
-		}
-	}
-}
-
-func TestReadSimulation(t *testing.T) {
-	d := NewShake256()
-	d.Write(nil)
-	dwr := make([]byte, 32)
-	d.Read(dwr)
-
+	})
 }
 
 // sequentialBytes produces a buffer of size consecutive bytes 0x00, 0x01, ..., used for testing.
@@ -221,29 +232,75 @@
 	}
 }
 
-// benchmarkBulkHash tests the speed to hash a buffer of buflen.
-func benchmarkBulkHash(b *testing.B, h hash.Hash, size int) {
+// benchmarkHash tests the speed to hash num buffers of size bytes each.
+func benchmarkHash(b *testing.B, h hash.Hash, size, num int) {
 	b.StopTimer()
 	h.Reset()
 	data := sequentialBytes(size)
-	b.SetBytes(int64(size))
+	b.SetBytes(int64(size * num))
 	b.StartTimer()
 
 	var state []byte
 	for i := 0; i < b.N; i++ {
-		h.Write(data)
+		for j := 0; j < num; j++ {
+			h.Write(data)
+		}
 		state = h.Sum(state[:0])
 	}
 	b.StopTimer()
 	h.Reset()
 }
 
-func BenchmarkSha3_512_MTU(b *testing.B) { benchmarkBulkHash(b, New512(), 1350) }
-func BenchmarkSha3_384_MTU(b *testing.B) { benchmarkBulkHash(b, New384(), 1350) }
-func BenchmarkSha3_256_MTU(b *testing.B) { benchmarkBulkHash(b, New256(), 1350) }
-func BenchmarkSha3_224_MTU(b *testing.B) { benchmarkBulkHash(b, New224(), 1350) }
-func BenchmarkShake256_MTU(b *testing.B) { benchmarkBulkHash(b, newHashShake256(), 1350) }
-func BenchmarkShake128_MTU(b *testing.B) { benchmarkBulkHash(b, newHashShake128(), 1350) }
+// benchmarkShake is specialized to the Shake instances, which don't
+// require a copy on reading output.
+func benchmarkShake(b *testing.B, h ShakeHash, size, num int) {
+	b.StopTimer()
+	h.Reset()
+	data := sequentialBytes(size)
+	d := make([]byte, 32)
 
-func BenchmarkSha3_512_1MiB(b *testing.B) { benchmarkBulkHash(b, New512(), 1<<20) }
-func BenchmarkShake256_1MiB(b *testing.B) { benchmarkBulkHash(b, newHashShake256(), 1<<20) }
+	b.SetBytes(int64(size * num))
+	b.StartTimer()
+
+	for i := 0; i < b.N; i++ {
+		h.Reset()
+		for j := 0; j < num; j++ {
+			h.Write(data)
+		}
+		h.Read(d)
+	}
+}
+
+func BenchmarkSha3_512_MTU(b *testing.B) { benchmarkHash(b, New512(), 1350, 1) }
+func BenchmarkSha3_384_MTU(b *testing.B) { benchmarkHash(b, New384(), 1350, 1) }
+func BenchmarkSha3_256_MTU(b *testing.B) { benchmarkHash(b, New256(), 1350, 1) }
+func BenchmarkSha3_224_MTU(b *testing.B) { benchmarkHash(b, New224(), 1350, 1) }
+
+func BenchmarkShake128_MTU(b *testing.B)  { benchmarkShake(b, NewShake128(), 1350, 1) }
+func BenchmarkShake256_MTU(b *testing.B)  { benchmarkShake(b, NewShake256(), 1350, 1) }
+func BenchmarkShake256_16x(b *testing.B)  { benchmarkShake(b, NewShake256(), 16, 1024) }
+func BenchmarkShake256_1MiB(b *testing.B) { benchmarkShake(b, NewShake256(), 1024, 1024) }
+
+func BenchmarkSha3_512_1MiB(b *testing.B) { benchmarkHash(b, New512(), 1024, 1024) }
+
+func Example_sum() {
+	buf := []byte("some data to hash")
+	// A hash needs to be 64 bytes long to have 256-bit collision resistance.
+	h := make([]byte, 64)
+	// Compute a 64-byte hash of buf and put it in h.
+	ShakeSum256(h, buf)
+}
+
+func Example_mac() {
+	k := []byte("this is a secret key; you should generate a strong random key that's at least 32 bytes long")
+	buf := []byte("and this is some data to authenticate")
+	// A MAC with 32 bytes of output has 256-bit security strength -- if you use at least a 32-byte-long key.
+	h := make([]byte, 32)
+	d := NewShake256()
+	// Write the key into the hash.
+	d.Write(k)
+	// Now write the data.
+	d.Write(buf)
+	// Read 32 bytes of output from the hash into h.
+	d.Read(h)
+}
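The testUnalignedAndGeneric helper above works by temporarily reassigning package-level function variables; the pattern in isolation, with hypothetical names:

```go
package main

import "fmt"

// double is a package-level function variable, the same shape
// as the package's xorIn and copyOut.
var double = func(x int) int { return 2 * x }

// withImpl runs testf once per implementation and restores the
// original afterwards, mirroring testUnalignedAndGeneric.
func withImpl(testf func(impl string)) {
	orig := double
	double = func(x int) int { return x + x } // alternate implementation
	testf("alternate")
	double = orig
	testf("original")
}

func main() {
	withImpl(func(impl string) {
		fmt.Println(impl, double(21)) // both implementations print 42
	})
}
```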
diff --git a/sha3/xor.go b/sha3/xor.go
new file mode 100644
index 0000000..d622979
--- /dev/null
+++ b/sha3/xor.go
@@ -0,0 +1,16 @@
+// Copyright 2015 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+// +build !amd64,!386 appengine
+
+package sha3
+
+var (
+	xorIn            = xorInGeneric
+	copyOut          = copyOutGeneric
+	xorInUnaligned   = xorInGeneric
+	copyOutUnaligned = copyOutGeneric
+)
+
+const xorImplementationUnaligned = "generic"
diff --git a/sha3/xor_generic.go b/sha3/xor_generic.go
new file mode 100644
index 0000000..fd35f02
--- /dev/null
+++ b/sha3/xor_generic.go
@@ -0,0 +1,28 @@
+// Copyright 2015 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package sha3
+
+import "encoding/binary"
+
+// xorInGeneric xors the bytes in buf into the state; it
+// makes no non-portable assumptions about memory layout
+// or alignment.
+func xorInGeneric(d *state, buf []byte) {
+	n := len(buf) / 8
+
+	for i := 0; i < n; i++ {
+		a := binary.LittleEndian.Uint64(buf)
+		d.a[i] ^= a
+		buf = buf[8:]
+	}
+}
+
+// copyOutGeneric copies uint64s from the state to a byte buffer.
+func copyOutGeneric(d *state, b []byte) {
+	for i := 0; len(b) >= 8; i++ {
+		binary.LittleEndian.PutUint64(b, d.a[i])
+		b = b[8:]
+	}
+}
diff --git a/sha3/xor_unaligned.go b/sha3/xor_unaligned.go
new file mode 100644
index 0000000..c7851a1
--- /dev/null
+++ b/sha3/xor_unaligned.go
@@ -0,0 +1,58 @@
+// Copyright 2015 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+// +build amd64 386
+// +build !appengine
+
+package sha3
+
+import "unsafe"
+
+func xorInUnaligned(d *state, buf []byte) {
+	bw := (*[maxRate / 8]uint64)(unsafe.Pointer(&buf[0]))
+	n := len(buf)
+	if n >= 72 {
+		d.a[0] ^= bw[0]
+		d.a[1] ^= bw[1]
+		d.a[2] ^= bw[2]
+		d.a[3] ^= bw[3]
+		d.a[4] ^= bw[4]
+		d.a[5] ^= bw[5]
+		d.a[6] ^= bw[6]
+		d.a[7] ^= bw[7]
+		d.a[8] ^= bw[8]
+	}
+	if n >= 104 {
+		d.a[9] ^= bw[9]
+		d.a[10] ^= bw[10]
+		d.a[11] ^= bw[11]
+		d.a[12] ^= bw[12]
+	}
+	if n >= 136 {
+		d.a[13] ^= bw[13]
+		d.a[14] ^= bw[14]
+		d.a[15] ^= bw[15]
+		d.a[16] ^= bw[16]
+	}
+	if n >= 144 {
+		d.a[17] ^= bw[17]
+	}
+	if n >= 168 {
+		d.a[18] ^= bw[18]
+		d.a[19] ^= bw[19]
+		d.a[20] ^= bw[20]
+	}
+}
+
+func copyOutUnaligned(d *state, buf []byte) {
+	ab := (*[maxRate]uint8)(unsafe.Pointer(&d.a[0]))
+	copy(buf, ab[:])
+}
+
+var (
+	xorIn   = xorInUnaligned
+	copyOut = copyOutUnaligned
+)
+
+const xorImplementationUnaligned = "unaligned"