all: remove 'extern register M *m' from runtime

The runtime has historically held two dedicated values g (current goroutine)
and m (current thread) in 'extern register' slots (TLS on x86, real registers
backed by TLS on ARM).

This CL removes the extern register m; code now uses g->m.

On ARM, this frees up the register that formerly held m (R9).
This is important for NaCl, because NaCl ARM code cannot use R9 at all.

The Go 1 macrobenchmarks (those with per-op times >= 10 µs) are unaffected:

BenchmarkBinaryTree17              5491374955     5471024381     -0.37%
BenchmarkFannkuch11                4357101311     4275174828     -1.88%
BenchmarkGobDecode                 11029957       11364184       +3.03%
BenchmarkGobEncode                 6852205        6784822        -0.98%
BenchmarkGzip                      650795967      650152275      -0.10%
BenchmarkGunzip                    140962363      141041670      +0.06%
BenchmarkHTTPClientServer          71581          73081          +2.10%
BenchmarkJSONEncode                31928079       31913356       -0.05%
BenchmarkJSONDecode                117470065      113689916      -3.22%
BenchmarkMandelbrot200             6008923        5998712        -0.17%
BenchmarkGoParse                   6310917        6327487        +0.26%
BenchmarkRegexpMatchMedium_1K      114568         114763         +0.17%
BenchmarkRegexpMatchHard_1K        168977         169244         +0.16%
BenchmarkRevcomp                   935294971      914060918      -2.27%
BenchmarkTemplate                  145917123      148186096      +1.55%

Minux previous reported larger variations, but these were caused by
run-to-run noise, not repeatable slowdowns.

Actual code changes by Minux.
I only did the docs and the benchmarking.

LGTM=dvyukov, iant, minux
R=minux, josharian, iant, dave, bradfitz, dvyukov
CC=golang-codereviews
https://golang.org/cl/109050043
diff --git a/doc/asm.html b/doc/asm.html
index d44cb79..f4ef1e6 100644
--- a/doc/asm.html
+++ b/doc/asm.html
@@ -149,7 +149,7 @@
 <p>
 Instructions, registers, and assembler directives are always in UPPER CASE to remind you
 that assembly programming is a fraught endeavor.
-(Exceptions: the <code>m</code> and <code>g</code> register renamings on ARM.)
+(Exception: the <code>g</code> register renaming on ARM.)
 </p>
 
 <p>
@@ -344,7 +344,7 @@
 <h3 id="x86">32-bit Intel 386</h3>
 
 <p>
-The runtime pointers to the <code>m</code> and <code>g</code> structures are maintained
+The runtime pointer to the <code>g</code> structure is maintained
 through the value of an otherwise unused (as far as Go is concerned) register in the MMU.
 A OS-dependent macro <code>get_tls</code> is defined for the assembler if the source includes
 an architecture-dependent header file, like this:
@@ -356,14 +356,15 @@
 
 <p>
 Within the runtime, the <code>get_tls</code> macro loads its argument register
-with a pointer to a pair of words representing the <code>g</code> and <code>m</code> pointers.
+with a pointer to the <code>g</code> pointer, and the <code>g</code> struct
+contains the <code>m</code> pointer.
 The sequence to load <code>g</code> and <code>m</code> using <code>CX</code> looks like this:
 </p>
 
 <pre>
 get_tls(CX)
-MOVL	g(CX), AX	// Move g into AX.
-MOVL	m(CX), BX	// Move m into BX.
+MOVL	g(CX), AX     // Move g into AX.
+MOVL	g_m(AX), BX   // Move g->m into BX.
 </pre>
 
 <h3 id="amd64">64-bit Intel 386 (a.k.a. amd64)</h3>
@@ -376,22 +377,21 @@
 
 <pre>
 get_tls(CX)
-MOVQ	g(CX), AX	// Move g into AX.
-MOVQ	m(CX), BX	// Move m into BX.
+MOVQ	g(CX), AX     // Move g into AX.
+MOVQ	g_m(AX), BX   // Move g->m into BX.
 </pre>
 
 <h3 id="arm">ARM</h3>
 
 <p>
-The registers <code>R9</code>, <code>R10</code>, and <code>R11</code>
+The registers <code>R10</code> and <code>R11</code>
 are reserved by the compiler and linker.
 </p>
 
 <p>
-<code>R9</code> and <code>R10</code> point to the <code>m</code> (machine) and <code>g</code>
-(goroutine) structures, respectively.
-Within assembler source code, these pointers must be referred to as <code>m</code> and <code>g</code>;
-the names <code>R9</code> and <code>R10</code> are not recognized.
+<code>R10</code> points to the <code>g</code> (goroutine) structure.
+Within assembler source code, this pointer must be referred to as <code>g</code>;
+the name <code>R10</code> is not recognized.
 </p>
 
 <p>