cmd/compile: automatically handle commuting ops in rewrite rules

Note that this is a redo of the original CL 38666, which was buggy
and was reverted.

We have lots of rewrite rules that come in pairs, differing only in
the argument order of a commuting op. For example:

(ADDL x (MOVLconst [c])) -> (ADDLconst [c] x)
(ADDL (MOVLconst [c]) x) -> (ADDLconst [c] x)

It gets unwieldy quickly: each additional commuting op doubles the
number of orderings, so a rule with two commuting ops already needs
four hand-written variants.
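
For illustration (these patterns are just the shape of the problem,
not rules from the tree), a single pattern containing a commutative
Add64 and a commutative Mul64 would have to be written out as all
four orderings:

(Add64 x (Mul64 y z))
(Add64 x (Mul64 z y))
(Add64 (Mul64 y z) x)
(Add64 (Mul64 z y) x)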

Our existing "fix" for this problem is to have rules that
canonicalize the operations first. For example:

(Eq64 x (Const64 <t> [c])) && x.Op != OpConst64 -> (Eq64 (Const64 <t> [c]) x)

Subsequent rules can then assume that if there is a constant arg to
Eq64, it will be the first one. This fix kind of works, but it is
fragile: it helps only when we remember to include the required
extra canonicalization rules.

The fundamental problem is that the rule matcher doesn't
know anything about commuting ops. This CL fixes that.

We already have information about which ops commute. (The register
allocator takes advantage of commutativity.) The rule generator now
automatically generates multiple rules for a single source rule when
there are commutative ops in the rule; a sketch of the expansion
appears below. We can now drop all of our almost-duplicate
source-level rules and the canonicalization rules.
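
As a rough sketch of the idea (a toy standalone program, not the
actual generator code in gen/rulegen.go; pattern and orderings are
made up for illustration), expanding a two-argument pattern into the
argument orders the matcher must try looks like this:

package main

import "fmt"

// pattern is a toy stand-in for a parsed rule pattern: an op name,
// whether the op commutes, and its two argument subpatterns.
type pattern struct {
	op          string
	commutative bool
	args        [2]string
}

// orderings returns the argument orders the generated matcher must
// try: just the written order for a non-commutative op, both orders
// for a commutative one. With several commutative ops in one rule,
// applying this at each op doubles the variants per op.
func orderings(p pattern) [][2]string {
	if !p.commutative {
		return [][2]string{p.args}
	}
	return [][2]string{p.args, {p.args[1], p.args[0]}}
}

func main() {
	addl := pattern{op: "ADDL", commutative: true, args: [2]string{"x", "(MOVLconst [c])"}}
	for _, a := range orderings(addl) {
		fmt.Printf("(%s %s %s)\n", addl.op, a[0], a[1])
	}
	// Prints:
	// (ADDL x (MOVLconst [c]))
	// (ADDL (MOVLconst [c]) x)
}

With this in place, only the first of the two ADDL rules above needs
to be written in the source .rules file.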

I have some CLs in progress that will be a lot less verbose now that
the rule generator handles commutativity for me.

I had to reorganize the load-combining rules a bit. The 8-way OR
rules contain seven commutative ORs each, so a single source rule
expanded to 2^7 = 128 reorderings, which was causing the generator
to put too much code in the rewrite*.go files (the big ones were
going from 25K lines to 132K lines). Instead I reorganized the rules
to combine pairs of loads at a time, as sketched below. The generated
rule files are now actually a bit (5%) smaller.
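
Schematically (simplified, not the literal rules; byteload and
wideload stand in for the real MOVBload/MOVWload-style ops), instead
of matching all eight byte loads at once, a rule now merges two
adjacent loads into one wider load and the rewrite applies
repeatedly:

(OR (SHLconst [8] x1:(byteload [i+1] p mem)) x0:(byteload [i] p mem))
  -> (wideload [i] p mem)

Two bytes become a 16-bit load, two 16-bit loads become a 32-bit
load, and so on, so each rule contains far fewer commutative ops and
the generated matchers stay small.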

Make.bash times are ~unchanged.

Compiler benchmarks are not observably different, probably because
we don't spend much compiler time in rule matching anyway.

I've also done a pass over all of our ops, adding commutative
markings to ops that lacked them previously.

Fixes #18292

Change-Id: Ic1c0e43fbf579539f459971625f69690c9ab8805
Reviewed-on: https://go-review.googlesource.com/38801
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
diff --git a/src/cmd/compile/internal/ssa/gen/genericOps.go b/src/cmd/compile/internal/ssa/gen/genericOps.go
index a5ac628..7991f32 100644
--- a/src/cmd/compile/internal/ssa/gen/genericOps.go
+++ b/src/cmd/compile/internal/ssa/gen/genericOps.go
@@ -28,8 +28,8 @@
 	{name: "Add32", argLength: 2, commutative: true},
 	{name: "Add64", argLength: 2, commutative: true},
 	{name: "AddPtr", argLength: 2}, // For address calculations.  arg0 is a pointer and arg1 is an int.
-	{name: "Add32F", argLength: 2},
-	{name: "Add64F", argLength: 2},
+	{name: "Add32F", argLength: 2, commutative: true},
+	{name: "Add64F", argLength: 2, commutative: true},
 
 	{name: "Sub8", argLength: 2}, // arg0 - arg1
 	{name: "Sub16", argLength: 2},
@@ -43,24 +43,25 @@
 	{name: "Mul16", argLength: 2, commutative: true},
 	{name: "Mul32", argLength: 2, commutative: true},
 	{name: "Mul64", argLength: 2, commutative: true},
-	{name: "Mul32F", argLength: 2},
-	{name: "Mul64F", argLength: 2},
+	{name: "Mul32F", argLength: 2, commutative: true},
+	{name: "Mul64F", argLength: 2, commutative: true},
 
 	{name: "Div32F", argLength: 2}, // arg0 / arg1
 	{name: "Div64F", argLength: 2},
 
-	{name: "Hmul32", argLength: 2},
-	{name: "Hmul32u", argLength: 2},
-	{name: "Hmul64", argLength: 2},
-	{name: "Hmul64u", argLength: 2},
+	{name: "Hmul32", argLength: 2, commutative: true},
+	{name: "Hmul32u", argLength: 2, commutative: true},
+	{name: "Hmul64", argLength: 2, commutative: true},
+	{name: "Hmul64u", argLength: 2, commutative: true},
 
-	{name: "Mul32uhilo", argLength: 2, typ: "(UInt32,UInt32)"}, // arg0 * arg1, returns (hi, lo)
-	{name: "Mul64uhilo", argLength: 2, typ: "(UInt64,UInt64)"}, // arg0 * arg1, returns (hi, lo)
+	{name: "Mul32uhilo", argLength: 2, typ: "(UInt32,UInt32)", commutative: true}, // arg0 * arg1, returns (hi, lo)
+	{name: "Mul64uhilo", argLength: 2, typ: "(UInt64,UInt64)", commutative: true}, // arg0 * arg1, returns (hi, lo)
 
 	// Weird special instructions for use in the strength reduction of divides.
 	// These ops compute unsigned (arg0 + arg1) / 2, correct to all
 	// 32/64 bits, even when the intermediate result of the add has 33/65 bits.
 	// These ops can assume arg0 >= arg1.
+	// Note: these ops aren't commutative!
 	{name: "Avg32u", argLength: 2, typ: "UInt32"}, // 32-bit platforms only
 	{name: "Avg64u", argLength: 2, typ: "UInt64"}, // 64-bit platforms only
 
@@ -159,8 +160,8 @@
 	{name: "EqPtr", argLength: 2, commutative: true, typ: "Bool"},
 	{name: "EqInter", argLength: 2, typ: "Bool"}, // arg0 or arg1 is nil; other cases handled by frontend
 	{name: "EqSlice", argLength: 2, typ: "Bool"}, // arg0 or arg1 is nil; other cases handled by frontend
-	{name: "Eq32F", argLength: 2, typ: "Bool"},
-	{name: "Eq64F", argLength: 2, typ: "Bool"},
+	{name: "Eq32F", argLength: 2, commutative: true, typ: "Bool"},
+	{name: "Eq64F", argLength: 2, commutative: true, typ: "Bool"},
 
 	{name: "Neq8", argLength: 2, commutative: true, typ: "Bool"}, // arg0 != arg1
 	{name: "Neq16", argLength: 2, commutative: true, typ: "Bool"},
@@ -169,8 +170,8 @@
 	{name: "NeqPtr", argLength: 2, commutative: true, typ: "Bool"},
 	{name: "NeqInter", argLength: 2, typ: "Bool"}, // arg0 or arg1 is nil; other cases handled by frontend
 	{name: "NeqSlice", argLength: 2, typ: "Bool"}, // arg0 or arg1 is nil; other cases handled by frontend
-	{name: "Neq32F", argLength: 2, typ: "Bool"},
-	{name: "Neq64F", argLength: 2},
+	{name: "Neq32F", argLength: 2, commutative: true, typ: "Bool"},
+	{name: "Neq64F", argLength: 2, commutative: true, typ: "Bool"},
 
 	{name: "Less8", argLength: 2, typ: "Bool"},  // arg0 < arg1, signed
 	{name: "Less8U", argLength: 2, typ: "Bool"}, // arg0 < arg1, unsigned