The Go 1.11 release introduces [AVX-512](https://en.wikipedia.org/wiki/AVX-512) support.
This page describes how to use the new features as well as some important encoder details.

### Terminology

Most terminology comes from the [Intel Software Developer's Manual](https://software.intel.com/en-us/articles/intel-sdm).
Suffixes originate from Go assembler syntax, which is close to AT&T syntax and likewise uses size suffixes.

Some terms are listed to avoid ambiguity (for example, "opcode" can have different meanings).

<table>
  <tr>
    <th>Term</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>Operand</td>
    <td>
      Same as "instruction argument".
    </td>
  </tr>
  <tr>
    <td>Opcode</td>
    <td>
      Name that refers to an instruction group. For example, <code>VADDPD</code> is an opcode.<br>
      It refers to both VEX and EVEX encoded forms and all operand combinations.<br>
      Most Go assembler opcodes for AVX-512 match Intel manual entries, with exceptions for cases<br>
      where an additional size suffix is used (e.g. Go's <code>VCVTTPD2DQY</code> corresponds to Intel's <code>VCVTTPD2DQ</code>).
    </td>
  </tr>
  <tr>
    <td>Opcode suffix</td>
    <td>
      Suffix that overrides some opcode properties. Listed after "." (dot).<br>
      For example, <code>VADDPD.Z</code> has the "Z" opcode suffix.<br>
      There can be multiple dot-separated opcode suffixes.
    </td>
  </tr>
  <tr>
    <td>Size suffix</td>
    <td>
      Suffix that specifies the instruction operand size if it can't be inferred from the operands alone.<br>
      For example, <code>VCVTSS2USIL</code> has the "L" size suffix.
    </td>
  </tr>
  <tr>
    <td>Opmask</td>
    <td>
      Used both for the <code>{k1}</code> notation and to describe instructions that have <code>K</code> register operands.<br>
      Related to masking support in the EVEX prefix.
    </td>
  </tr>
  <tr>
    <td>Register block</td>
    <td>
      Multi-source operand that encodes a register range.<br>
      The Intel manual uses <code>+n</code> notation for register blocks.<br>
      For example, <code>+3</code> is a register block of 4 registers.
    </td>
  </tr>
  <tr>
    <td>FP</td>
    <td>Floating-point</td>
  </tr>
</table>

### New registers

EVEX-enabled instructions can access 16 additional `X` (128-bit xmm) and `Y` (256-bit ymm) registers, plus 32 new `Z` (512-bit zmm) registers in 64-bit mode. 32-bit mode only gets `Z0-Z7`.

New opmask registers are named `K0-K7`.
They can be used both for masking and for special opmask instructions (like `KADDB`).

### Masking support

Instructions that support masking can omit the `K` register operand.
In this case, the `K0` register is implied ("all ones") and merging-masking is performed.
This is effectively "no masking".

`K1-K7` registers can be used to override the default opmask.
The `K` register should be placed right before the destination operand.

Zeroing-masking can be activated with the `Z` opcode suffix.

For example, `VADDPD.Z (AX), Z30, K3, Z10` uses zeroing-masking and an explicit `K` register.
- If the `Z` opcode suffix is removed, it's merging-masking.
- If the `K3` operand is removed, the `K0` operand is implied.

It's a compile-time error to use the `K0` register for `{k1}` operands (consult the [manuals](https://software.intel.com/en-us/articles/intel-sdm) for details).

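The difference between merging-masking and zeroing-masking can be modelled in plain Go. This is only a sketch of the lane semantics, not how the assembler or the CPU implements them; `maskedAdd` and its parameters are hypothetical names chosen for illustration. It adds eight float64 lanes under an 8-bit mask, where bit `i` of the mask selects lane `i`.

```go
package main

import "fmt"

// maskedAdd models a masked 8-lane packed double addition (hypothetical helper).
// For lanes whose mask bit is set, the sum is written. For lanes whose bit is
// clear, merging-masking keeps the old dst value and zeroing-masking writes 0.
func maskedAdd(dst, a, b [8]float64, k uint8, zeroing bool) [8]float64 {
	for i := range dst {
		switch {
		case k&(1<<uint(i)) != 0:
			dst[i] = a[i] + b[i]
		case zeroing:
			dst[i] = 0
		}
	}
	return dst
}

func main() {
	dst := [8]float64{9, 9, 9, 9, 9, 9, 9, 9}
	a := [8]float64{1, 2, 3, 4, 5, 6, 7, 8}
	b := [8]float64{10, 10, 10, 10, 10, 10, 10, 10}

	fmt.Println("merging:", maskedAdd(dst, a, b, 0x05, false))
	fmt.Println("zeroing:", maskedAdd(dst, a, b, 0x05, true))
	// An all-ones mask behaves like the implied K0: every lane is written.
	fmt.Println("no mask:", maskedAdd(dst, a, b, 0xFF, false))
}
```

With mask `0x05` only lanes 0 and 2 receive the sum; the remaining lanes keep `9` under merging-masking and become `0` under zeroing-masking.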
### EVEX broadcast/rounding/SAE support

Embedded broadcast, rounding and SAE are activated through opcode suffixes.

For reg-reg FP instructions with `{er}` enabled, a rounding opcode suffix can be specified:

* `RU_SAE` to round towards +Inf
* `RD_SAE` to round towards -Inf
* `RZ_SAE` to round towards zero
* `RN_SAE` to round towards nearest

> To read more about rounding modes, see [MXCSR.RC info](http://qcd.phys.cmu.edu/QCDcluster/intel/vtune/reference/vc148.htm).

For reg-reg FP instructions with `{sae}` enabled, exception suppression can be specified with the `SAE` opcode suffix.

For reg-mem instructions with an `m32bcst/m64bcst` operand, broadcasting can be turned on with the `BCST` opcode suffix.

The zeroing opcode suffix can be combined with any of these.
For example, `VMAXPD.SAE.Z Z3, Z2, Z1` uses both the `Z` and `SAE` opcode suffixes.
It is important to put the zeroing opcode suffix last, otherwise it is a compilation error.

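To make the effect of embedded broadcast concrete, here is a small Go model of how a 512-bit instruction sees its memory operand with and without the `BCST` suffix. It is an illustrative sketch only; `load512` and `broadcast64` are hypothetical helpers, and the model assumes 64-bit elements (an `m64bcst` operand).

```go
package main

import "fmt"

// load512 models a regular full-width memory operand:
// 8 consecutive float64 values are read.
func load512(mem []float64) [8]float64 {
	var v [8]float64
	copy(v[:], mem)
	return v
}

// broadcast64 models an m64bcst operand used with the BCST suffix:
// a single float64 is read and replicated into all 8 lanes.
func broadcast64(mem []float64) [8]float64 {
	var v [8]float64
	for i := range v {
		v[i] = mem[0]
	}
	return v
}

func main() {
	mem := []float64{1.5, 2, 3, 4, 5, 6, 7, 8}
	fmt.Println("full load:", load512(mem))
	fmt.Println("broadcast:", broadcast64(mem))
}
```

Because only one element is read, a broadcast operand touches 8 bytes of memory here rather than 64.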
### Register block (multi-source) operands

Register blocks are specified using register range syntax.

It would be enough to specify just the first (low) register, but the Go assembler requires
an explicit range with both ends for readability reasons.

For example, instructions with a `+3` range can be used like `VP4DPWSSD Z25, [Z0-Z3], (AX)`.
The range `[Z0-Z3]` reads like "register block of Z0, Z1, Z2, Z3".
Invalid ranges result in a compilation error.

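The kind of check a register range needs can be sketched in Go. The rule used here is an assumption made for illustration (both ends in the same register class, covering exactly `n+1` consecutive registers for a `+n` operand); the real assembler may apply different or additional checks. The `reg` type and `checkBlock` function are hypothetical.

```go
package main

import "fmt"

// reg is a hypothetical register description: class 'X', 'Y' or 'Z' plus index.
type reg struct {
	class byte
	index int
}

// checkBlock validates a register block written as [lo-hi] for a "+n" operand.
// Assumed rule (illustration only): both ends share a class, indexes lie in
// 0..31, and the range covers exactly n+1 consecutive registers.
func checkBlock(lo, hi reg, n int) error {
	switch {
	case lo.class != hi.class:
		return fmt.Errorf("mixed register classes %c and %c", lo.class, hi.class)
	case lo.index < 0 || hi.index > 31:
		return fmt.Errorf("register index outside 0..31")
	case hi.index-lo.index != n:
		return fmt.Errorf("range must cover %d registers, got %d", n+1, hi.index-lo.index+1)
	}
	return nil
}

func main() {
	fmt.Println(checkBlock(reg{'Z', 0}, reg{'Z', 3}, 3)) // [Z0-Z3]
	fmt.Println(checkBlock(reg{'Z', 0}, reg{'Z', 2}, 3)) // [Z0-Z2]
	fmt.Println(checkBlock(reg{'Z', 0}, reg{'Y', 3}, 3)) // [Z0-Y3]
}
```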
### AVX1 and AVX2 instructions with EVEX prefix

Pre-existing opcodes that can be encoded with the EVEX prefix can now access AVX-512 features such as the wider register file and zeroing/merging masking. For example, `VADDPD` can now use 512-bit vector registers.

See [encoder details](#encoder-details) for more info.

### Supported extensions

The best way to get an up-to-date list of supported extensions is to run `ls -1` inside the [test suite](https://github.com/golang/go/tree/master/src/cmd/asm/internal/asm/testdata/avx512enc) directory.

The latest list includes:
```
aes_avx512f
avx512_4fmaps
avx512_4vnniw
avx512_bitalg
avx512_ifma
avx512_vbmi
avx512_vbmi2
avx512_vnni
avx512_vpopcntdq
avx512bw
avx512cd
avx512dq
avx512er
avx512f
avx512pf
gfni_avx512f
vpclmulqdq_avx512f
```

128-bit and 256-bit instructions additionally require `avx512vl`.
That is, even though `VADDPD` is available in `avx512f`, you can't use it with `X` and `Y` arguments
without `avx512vl`.

Filenames follow `GNU as` (gas) conventions.
[avx512extmap.csv](https://gist.github.com/Quasilyte/92321dadcc3f86b05c1aeda2c13c851f) can make the naming scheme more apparent.

### Instructions with size suffix

Some opcodes do not match Intel manual entries.
This section is provided for search convenience.

| Intel opcode | Go assembler opcodes |
|--------------|----------------------|
| `VCVTPD2DQ` | `VCVTPD2DQX`, `VCVTPD2DQY` |
| `VCVTPD2PS` | `VCVTPD2PSX`, `VCVTPD2PSY` |
| `VCVTTPD2DQ` | `VCVTTPD2DQX`, `VCVTTPD2DQY` |
| `VCVTQQ2PS` | `VCVTQQ2PSX`, `VCVTQQ2PSY` |
| `VCVTUQQ2PS` | `VCVTUQQ2PSX`, `VCVTUQQ2PSY` |
| `VCVTPD2UDQ` | `VCVTPD2UDQX`, `VCVTPD2UDQY` |
| `VCVTTPD2UDQ` | `VCVTTPD2UDQX`, `VCVTTPD2UDQY` |
| `VFPCLASSPD` | `VFPCLASSPDX`, `VFPCLASSPDY`, `VFPCLASSPDZ` |
| `VFPCLASSPS` | `VFPCLASSPSX`, `VFPCLASSPSY`, `VFPCLASSPSZ` |
| `VCVTSD2SI` | `VCVTSD2SI`, `VCVTSD2SIQ` |
| `VCVTTSD2SI` | `VCVTTSD2SI`, `VCVTTSD2SIQ` |
| `VCVTTSS2SI` | `VCVTTSS2SI`, `VCVTTSS2SIQ` |
| `VCVTSS2SI` | `VCVTSS2SI`, `VCVTSS2SIQ` |
| `VCVTSD2USI` | `VCVTSD2USIL`, `VCVTSD2USIQ` |
| `VCVTSS2USI` | `VCVTSS2USIL`, `VCVTSS2USIQ` |
| `VCVTTSD2USI` | `VCVTTSD2USIL`, `VCVTTSD2USIQ` |
| `VCVTTSS2USI` | `VCVTTSS2USIL`, `VCVTTSS2USIQ` |
| `VCVTUSI2SD` | `VCVTUSI2SDL`, `VCVTUSI2SDQ` |
| `VCVTUSI2SS` | `VCVTUSI2SSL`, `VCVTUSI2SSQ` |
| `VCVTSI2SD` | `VCVTSI2SDL`, `VCVTSI2SDQ` |
| `VCVTSI2SS` | `VCVTSI2SSL`, `VCVTSI2SSQ` |
| `ANDN` | `ANDNL`, `ANDNQ` |
| `BEXTR` | `BEXTRL`, `BEXTRQ` |
| `BLSI` | `BLSIL`, `BLSIQ` |
| `BLSMSK` | `BLSMSKL`, `BLSMSKQ` |
| `BLSR` | `BLSRL`, `BLSRQ` |
| `BZHI` | `BZHIL`, `BZHIQ` |
| `MULX` | `MULXL`, `MULXQ` |
| `PDEP` | `PDEPL`, `PDEPQ` |
| `PEXT` | `PEXTL`, `PEXTQ` |
| `RORX` | `RORXL`, `RORXQ` |
| `SARX` | `SARXL`, `SARXQ` |
| `SHLX` | `SHLXL`, `SHLXQ` |
| `SHRX` | `SHRXL`, `SHRXQ` |

### Encoder details

A bitwise comparison with the output of the older encoder may fail for VEX-encoded instructions because the encoder tables are ordered slightly differently.

This difference may arise in the reg-reg case for instructions that have both `{reg, reg/mem}` and `{reg/mem, reg}` forms. One such instruction is `VMOVUPS`.

This does not affect code behavior, nor does it make the code bigger or less efficient.
The new encoding selection scheme is borrowed from [Intel XED](https://github.com/intelxed/xed).

EVEX encoding is used when any of the following is true:

* Instruction uses new registers (high 16 `X`/`Y`, `Z` or `K` registers)
* Instruction uses EVEX-related opcode suffixes like `BCST`
* Instruction uses an operand combination that is only available for AVX-512

In all other cases VEX encoding is used.
This means that VEX is used whenever possible, and EVEX whenever required.

Compressed disp8 is applied whenever possible for EVEX-encoded instructions.
This also covers broadcasting disp8, which sometimes has a different N multiplier.

Experienced readers can inspect [avx_optabs.go](https://github.com/golang/go/blob/master/src/cmd/internal/obj/x86/avx_optabs.go) to learn about N multipliers for any instruction.

For example, `VADDPD` has these:
* `N=64` for 512-bit form; `N=8` when broadcasting
* `N=32` for 256-bit form; `N=8` when broadcasting
* `N=16` for 128-bit form; `N=8` when broadcasting

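The compressed disp8 rule can be sketched in Go: a displacement fits into a single byte only when it is a multiple of N and the quotient fits in a signed 8-bit value. `compressDisp8` is a hypothetical helper written for this page, not a function taken from the assembler; the N values in `main` are the `VADDPD` multipliers listed above.

```go
package main

import "fmt"

// compressDisp8 reports whether disp can be encoded as an EVEX compressed
// disp8 with multiplier n, and returns the byte that would be stored.
// The CPU reconstructs the real displacement as stored*n.
func compressDisp8(disp, n int32) (int8, bool) {
	if n <= 0 || disp%n != 0 {
		return 0, false // not a multiple of N: a full disp32 is needed
	}
	q := disp / n
	if q < -128 || q > 127 {
		return 0, false // quotient does not fit in a signed byte
	}
	return int8(q), true
}

func main() {
	// 512-bit VADDPD form, N=64.
	fmt.Println(compressDisp8(128, 64))
	fmt.Println(compressDisp8(8, 64))
	// Same displacement with broadcasting, N=8.
	fmt.Println(compressDisp8(8, 8))
	// Quotient 128 is out of range for a signed byte.
	fmt.Println(compressDisp8(8192, 64))
}
```

This shows why the multiplier matters: a displacement of 8 cannot be compressed for the plain 512-bit form, but it can when the same instruction uses a broadcast operand.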
### Examples

An exhaustive set of examples can be found in the Go assembler [test suite](https://github.com/golang/go/tree/master/src/cmd/asm/internal/asm/testdata/avx512enc).

Each file provides several examples for every supported instruction form in a particular AVX-512 extension.
Every example also includes the generated machine code.

Here is "Vectorized Histogram Update Using AVX-512CD", adapted from the
[Intel® Optimization Manual](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf):

```go
for i := 0; i < 512; i++ {
	histo[key[i]] += 1
}
```

```asm
top:
    VMOVUPS 0x40(SP)(DX*4), Z4      //; vmovups zmm4, [rsp+rdx*4+0x40]
    VPXORD Z1, Z1, Z1               //; vpxord zmm1, zmm1, zmm1
    KMOVW K1, K2                    //; kmovw k2, k1
    VPCONFLICTD Z4, Z2              //; vpconflictd zmm2, zmm4
    VPGATHERDD (AX)(Z4*4), K2, Z1   //; vpgatherdd zmm1{k2}, [rax+zmm4*4]
    VPTESTMD histo<>(SB), Z2, K0    //; vptestmd k0, zmm2, [rip+0x185c]
    KMOVW K0, CX                    //; kmovw ecx, k0
    VPADDD Z0, Z1, Z3               //; vpaddd zmm3, zmm1, zmm0
    TESTL CX, CX                    //; test ecx, ecx
    JZ noConflicts                  //; jz noConflicts
    VMOVUPS histo<>(SB), Z1         //; vmovups zmm1, [rip+0x1884]
    VPTESTMD histo<>(SB), Z2, K0    //; vptestmd k0, zmm2, [rip+0x18ba]
    VPLZCNTD Z2, Z5                 //; vplzcntd zmm5, zmm2
    XORB BX, BX                     //; xor bl, bl
    KMOVW K0, CX                    //; kmovw ecx, k0
    VPSUBD Z5, Z1, Z1               //; vpsubd zmm1, zmm1, zmm5
    VPSUBD Z5, Z1, Z1               //; vpsubd zmm1, zmm1, zmm5

resolveConflicts:
    VPBROADCASTD CX, Z5             //; vpbroadcastd zmm5, ecx
    KMOVW CX, K2                    //; kmovw k2, ecx
    VPERMD Z3, Z1, K2, Z3           //; vpermd zmm3{k2}, zmm1, zmm3
    VPADDD Z0, Z3, K2, Z3           //; vpaddd zmm3{k2}, zmm3, zmm0
    VPTESTMD Z2, Z5, K2, K0         //; vptestmd k0{k2}, zmm5, zmm2
    KMOVW K0, SI                    //; kmovw esi, k0
    ANDL SI, CX                     //; and ecx, esi
    JZ noConflicts                  //; jz noConflicts
    ADDB $1, BX                     //; add bl, 0x1
    CMPB BX, $16                    //; cmp bl, 0x10
    JB resolveConflicts             //; jb resolveConflicts

noConflicts:
    KMOVW K1, K2                    //; kmovw k2, k1
    VPSCATTERDD Z3, K2, (AX)(Z4*4)  //; vpscatterdd [rax+zmm4*4]{k2}, zmm3
    ADDL $16, DX                    //; add edx, 0x10
    CMPL DX, $1024                  //; cmp edx, 0x400
    JB top                          //; jb top
```
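
The listing relies on `VPCONFLICTD` to find lanes that would update the same histogram bucket. The following Go sketch models what that instruction computes for 32-bit lanes: bit `j` of result lane `i` is set when an earlier lane `j` holds the same value as lane `i`. `conflictD` is a hypothetical helper used only to illustrate these semantics, and it assumes at most 32 lanes (a `Z` register holds 16).

```go
package main

import "fmt"

// conflictD models VPCONFLICTD on 32-bit lanes: for every lane i, the result
// has bit j set when j < i and keys[j] == keys[i]. Lanes with a zero result
// do not collide with any earlier lane.
func conflictD(keys []uint32) []uint32 {
	out := make([]uint32, len(keys))
	for i := range keys {
		for j := 0; j < i; j++ {
			if keys[j] == keys[i] {
				out[i] |= 1 << uint(j)
			}
		}
	}
	return out
}

func main() {
	keys := []uint32{3, 5, 3, 3}
	for i, c := range conflictD(keys) {
		fmt.Printf("lane %d key %d conflicts %04b\n", i, keys[i], c)
	}
}
```

When every result lane is zero, all keys in the vector are distinct, which corresponds to the `noConflicts` path of the listing.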