crypto/rc4: naïve ARM assembly implementation

On an 800MHz Cortex-A8:
benchmark           old ns/op    new ns/op    delta
BenchmarkRC4_128         9395         2838  -69.79%
BenchmarkRC4_1K         74497        22120  -70.31%
BenchmarkRC4_8K        587243       171435  -70.81%

benchmark            old MB/s     new MB/s  speedup
BenchmarkRC4_128        13.62        45.09    3.31x
BenchmarkRC4_1K         13.75        46.29    3.37x
BenchmarkRC4_8K         13.79        47.22    3.42x
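
A minimal sketch of how figures of this shape (ns/op and MB/s) can be gathered with the standard testing package. It is not the benchmark code from this change: the helper name benchRC4, the key, and the buffer sizes are assumptions made for illustration, and the sizes actually used by BenchmarkRC4_128/1K/8K may differ.

```go
package main

import (
	"crypto/rc4"
	"fmt"
	"testing"
)

// benchRC4 is a hypothetical helper: it measures XORKeyStream over a
// buffer of the given size and returns the raw benchmark result.
func benchRC4(size int) testing.BenchmarkResult {
	// Arbitrary key chosen for the sketch; RC4 speed does not depend on it.
	c, err := rc4.NewCipher([]byte("illustrative key"))
	if err != nil {
		panic(err)
	}
	buf := make([]byte, size)
	return testing.Benchmark(func(b *testing.B) {
		// SetBytes records bytes per iteration, which is what lets
		// the testing package report throughput as well as ns/op.
		b.SetBytes(int64(size))
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			c.XORKeyStream(buf, buf) // encrypt in place
		}
	})
}

func main() {
	// Assumed sizes; they only mirror the benchmark names above.
	for _, size := range []int{128, 1024, 8192} {
		r := benchRC4(size)
		// Throughput in MB/s, with 1 MB = 1e6 bytes.
		mbps := float64(r.Bytes) * float64(r.N) / 1e6 / r.T.Seconds()
		fmt.Printf("%5d bytes  %10d ns/op  %8.2f MB/s\n", size, r.NsPerOp(), mbps)
	}
}
```

The delta and speedup columns are then derived by comparing two such runs, one before and one after the change.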

For comparison, result for "OpenSSL 1.0.1c 10 May 2012" from Debian/armhf sid
(figures are in 1000s of bytes per second processed):
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4              39553.81k    46522.39k    49336.11k    50085.63k    50258.06k
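
For orientation, the work measured by all of the figures above is RC4's keystream loop. The following portable Go sketch shows the textbook algorithm that an assembly version would implement. It is not the code from this change, and the names state, newState and xorKeyStream are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// state is a hypothetical stand-in for an RC4 cipher state:
// a 256-byte permutation plus the two running indices.
type state struct {
	s    [256]byte
	i, j uint8
}

// newState runs the RC4 key-scheduling algorithm.
func newState(key []byte) (*state, error) {
	if len(key) < 1 || len(key) > 256 {
		return nil, errors.New("rc4 sketch: key must be 1..256 bytes")
	}
	c := &state{}
	for i := range c.s {
		c.s[i] = byte(i)
	}
	var j uint8
	for i := 0; i < 256; i++ {
		// uint8 arithmetic wraps, giving the required mod 256.
		j += c.s[i] + key[i%len(key)]
		c.s[i], c.s[j] = c.s[j], c.s[i]
	}
	return c, nil
}

// xorKeyStream XORs src with the keystream into dst.
// dst must be at least as long as src.
func (c *state) xorKeyStream(dst, src []byte) {
	i, j := c.i, c.j
	for k, v := range src {
		i++
		j += c.s[i]
		c.s[i], c.s[j] = c.s[j], c.s[i]
		dst[k] = v ^ c.s[c.s[i]+c.s[j]]
	}
	c.i, c.j = i, j
}

func main() {
	c, err := newState([]byte("Key"))
	if err != nil {
		panic(err)
	}
	msg := []byte("Plaintext")
	out := make([]byte, len(msg))
	c.xorKeyStream(out, msg)
	fmt.Printf("%x\n", out)
}
```

Each output byte costs several dependent loads and stores into the 256-byte table, so this loop is the natural target for hand-written assembly.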

R=golang-dev, agl, dave
CC=golang-dev
https://golang.org/cl/7310051
3 files changed