The Go 1.11 release introduces [AVX-512](https://en.wikipedia.org/wiki/AVX-512) support.
This page describes how to use the new features and covers some important encoder details.

### Terminology

Most terminology comes from the [Intel Software Developer's Manual](https://software.intel.com/en-us/articles/intel-sdm).
Suffixes originate from Go assembler syntax, which is close to AT&T syntax, which also uses size suffixes.

Some terms are listed to avoid ambiguity (for example, "opcode" can have different meanings).

<table>
  <tr>
    <th>Term</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>Operand</td>
    <td>
      Same as "instruction argument".
    </td>
  </tr>
  <tr>
    <td>Opcode</td>
    <td>
      Name that refers to an instruction group. For example, <code>VADDPD</code> is an opcode.<br>
      It refers to both VEX and EVEX encoded forms and all operand combinations.<br>
      Most Go assembler opcodes for AVX-512 match Intel manual entries, with exceptions for cases<br>
      where an additional size suffix is used (e.g. Go's <code>VCVTTPD2DQY</code> corresponds to Intel's <code>VCVTTPD2DQ</code>).
    </td>
  </tr>
  <tr>
    <td>Opcode suffix</td>
    <td>
      Suffix that overrides some opcode properties. Listed after "." (dot).<br>
      For example, <code>VADDPD.Z</code> has the "Z" opcode suffix.<br>
      There can be multiple dot-separated opcode suffixes.
    </td>
  </tr>
  <tr>
    <td>Size suffix</td>
    <td>
      Suffix that specifies the instruction operand size if it can't be inferred from the operands alone.<br>
      For example, <code>VCVTSS2USIL</code> has the "L" size suffix.
    </td>
  </tr>
  <tr>
    <td>Opmask</td>
    <td>
      Refers both to the <code>{k1}</code> notation and to instructions that have <code>K</code> register operands.<br>
      Related to masking support in the EVEX prefix.
    </td>
  </tr>
  <tr>
    <td>Register block</td>
    <td>
      Multi-source operand that encodes a register range.<br>
      The Intel manual uses <code>+n</code> notation for register blocks.<br>
      For example, <code>+3</code> is a register block of 4 registers.
    </td>
  </tr>
  <tr>
    <td>FP</td>
    <td>Floating-point</td>
  </tr>
</table>

### New registers

In 64-bit mode, EVEX-enabled instructions can access 16 additional `X` (128-bit xmm) and `Y` (256-bit ymm) registers, plus 32 new `Z` (512-bit zmm) registers. 32-bit mode only gets `Z0-Z7`.

The new opmask registers are named `K0-K7`.
They can be used both for masking and as operands of special opmask instructions (like `KADDB`).

### Masking support

Instructions that support masking can omit the `K` register operand.
In this case, the `K0` register is implied ("all ones") and merging-masking is performed.
This is effectively "no masking".

`K1-K7` registers can be used to override the default opmask.
The `K` register should be placed right before the destination operand.

Zeroing-masking can be activated with the `Z` opcode suffix.

For example, `VADDPD.Z (AX), Z30, K3, Z10` uses zeroing-masking and an explicit `K` register.
- If the `Z` opcode suffix is removed, merging-masking is used.
- If the `K3` operand is removed, `K0` is implied.

It's a compile-time error to use the `K0` register for `{k1}` operands (consult the [manuals](https://software.intel.com/en-us/articles/intel-sdm) for details).

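The difference between merging-masking and zeroing-masking can be modeled in plain Go. The sketch below is only a software model of the lane semantics, assuming 8 `float64` lanes (as in a 512-bit `VADDPD`); the function name `maskedAdd` and the mask value are made up for illustration.

```go
package main

import "fmt"

// maskedAdd models one masked vector addition over 8 float64 lanes.
// Bit i of mask selects lane i. Lanes whose mask bit is clear keep the
// old dst value (merging-masking) or become zero (zeroing-masking).
func maskedAdd(dst, a, b [8]float64, mask uint8, zeroing bool) [8]float64 {
	for i := range dst {
		switch {
		case mask&(1<<uint(i)) != 0:
			dst[i] = a[i] + b[i]
		case zeroing:
			dst[i] = 0
		}
		// Otherwise: merging-masking, dst[i] is left untouched.
	}
	return dst
}

func main() {
	dst := [8]float64{9, 9, 9, 9, 9, 9, 9, 9}
	a := [8]float64{1, 2, 3, 4, 5, 6, 7, 8}
	b := [8]float64{10, 20, 30, 40, 50, 60, 70, 80}

	const mask = 0x0F // hypothetical K3 contents: low four lanes enabled

	fmt.Println(maskedAdd(dst, a, b, mask, false)) // merging: [11 22 33 44 9 9 9 9]
	fmt.Println(maskedAdd(dst, a, b, mask, true))  // zeroing: [11 22 33 44 0 0 0 0]
	fmt.Println(maskedAdd(dst, a, b, 0xFF, false)) // all ones, as with implied K0
}
```

With an all-ones mask every lane is written, which is why the implied `K0` behaves as "no masking".
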
### EVEX broadcast/rounding/SAE support

Embedded broadcast, rounding and SAE are activated through opcode suffixes.

For reg-reg FP instructions with `{er}` enabled, a rounding opcode suffix can be specified:

* `RU_SAE` to round towards +Inf
* `RD_SAE` to round towards -Inf
* `RZ_SAE` to round towards zero
* `RN_SAE` to round towards nearest

> To read more about rounding modes, see [MXCSR.RC info](http://qcd.phys.cmu.edu/QCDcluster/intel/vtune/reference/vc148.htm).

For reg-reg FP instructions with `{sae}` enabled, exception suppression can be specified with the `SAE` opcode suffix.

For reg-mem instructions with an `m32bcst/m64bcst` operand, broadcasting can be turned on with the `BCST` opcode suffix.

The zeroing opcode suffix can be combined with any of these.
For example, `VMAXPD.SAE.Z Z3, Z2, Z1` uses both `Z` and `SAE` opcode suffixes.
The zeroing opcode suffix must be placed last; otherwise, compilation fails.

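The suffix ordering rule can be expressed as a small checker. This is an illustrative model written for this page, not the assembler's actual code: the function name `checkSuffixes` is hypothetical, and the restriction to at most one rounding/SAE/BCST suffix per instruction is an assumption based on the cases listed above.

```go
package main

import (
	"fmt"
	"strings"
)

// checkSuffixes validates the dot-separated opcode suffixes of an
// instruction name such as "VMAXPD.SAE.Z".
// Assumption: at most one rounding/SAE/BCST suffix, optionally followed by Z.
func checkSuffixes(op string) error {
	parts := strings.Split(op, ".")[1:] // drop the opcode itself
	seenMode := false                   // a rounding, SAE or BCST suffix was seen
	for i, s := range parts {
		switch s {
		case "RU_SAE", "RD_SAE", "RZ_SAE", "RN_SAE", "SAE", "BCST":
			if seenMode {
				return fmt.Errorf("%s: more than one rounding/SAE/BCST suffix", op)
			}
			seenMode = true
		case "Z":
			if i != len(parts)-1 {
				return fmt.Errorf("%s: Z suffix must be last", op)
			}
		default:
			return fmt.Errorf("%s: unknown suffix %q", op, s)
		}
	}
	return nil
}

func main() {
	ops := []string{"VMAXPD.SAE.Z", "VMAXPD.Z.SAE", "VADDPD.BCST", "VADDPD"}
	for _, op := range ops {
		fmt.Printf("%-14s %v\n", op, checkSuffixes(op))
	}
}
```

Only `VMAXPD.Z.SAE` is rejected here, because `Z` is not the last suffix.
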
### Register block (multi-source) operands

Register blocks are specified using register range syntax.

Specifying just the first (low) register would be enough, but the Go assembler requires
an explicit range with both ends for readability.

For example, instructions with a `+3` range can be used like `VP4DPWSSD (AX), [Z0-Z3], Z25`.
The range `[Z0-Z3]` reads as "register block of Z0, Z1, Z2, Z3".
Invalid ranges result in a compilation error.

### AVX1 and AVX2 instructions with EVEX prefix

Existing opcodes that can be encoded with the EVEX prefix can now access AVX-512 features such as the wider register file and zeroing/merging masking. For example, `VADDPD` can now use 512-bit vector registers.

See [encoder details](#encoder-details) for more info.

### Supported extensions

The best way to get an up-to-date list of supported extensions is to run `ls -1` inside the [test suite](https://github.com/golang/go/tree/master/src/cmd/asm/internal/asm/testdata/avx512enc) directory.

The latest list includes:
```
aes_avx512f
avx512_4fmaps
avx512_4vnniw
avx512_bitalg
avx512_ifma
avx512_vbmi
avx512_vbmi2
avx512_vnni
avx512_vpopcntdq
avx512bw
avx512cd
avx512dq
avx512er
avx512f
avx512pf
gfni_avx512f
vpclmulqdq_avx512f
```

128-bit and 256-bit instructions additionally require `avx512vl`.
That is, even if `VADDPD` is available in `avx512f`, you can't use it with `X` and `Y` arguments
without `avx512vl`.

Filenames follow `GNU as` (gas) conventions.
[avx512extmap.csv](https://gist.github.com/Quasilyte/92321dadcc3f86b05c1aeda2c13c851f) can make the naming scheme more apparent.

### Instructions with size suffix

Some opcodes do not match Intel manual entries.
This section is provided for search convenience.

| Intel opcode | Go assembler opcodes |
|--------------|----------------------|
| `VCVTPD2DQ` | `VCVTPD2DQX`, `VCVTPD2DQY` |
| `VCVTPD2PS` | `VCVTPD2PSX`, `VCVTPD2PSY` |
| `VCVTTPD2DQ` | `VCVTTPD2DQX`, `VCVTTPD2DQY` |
| `VCVTQQ2PS` | `VCVTQQ2PSX`, `VCVTQQ2PSY` |
| `VCVTUQQ2PS` | `VCVTUQQ2PSX`, `VCVTUQQ2PSY` |
| `VCVTPD2UDQ` | `VCVTPD2UDQX`, `VCVTPD2UDQY` |
| `VCVTTPD2UDQ` | `VCVTTPD2UDQX`, `VCVTTPD2UDQY` |
| `VFPCLASSPD` | `VFPCLASSPDX`, `VFPCLASSPDY`, `VFPCLASSPDZ` |
| `VFPCLASSPS` | `VFPCLASSPSX`, `VFPCLASSPSY`, `VFPCLASSPSZ` |
| `VCVTSD2SI` | `VCVTSD2SI`, `VCVTSD2SIQ` |
| `VCVTTSD2SI` | `VCVTTSD2SI`, `VCVTTSD2SIQ` |
| `VCVTTSS2SI` | `VCVTTSS2SI`, `VCVTTSS2SIQ` |
| `VCVTSS2SI` | `VCVTSS2SI`, `VCVTSS2SIQ` |
| `VCVTSD2USI` | `VCVTSD2USIL`, `VCVTSD2USIQ` |
| `VCVTSS2USI` | `VCVTSS2USIL`, `VCVTSS2USIQ` |
| `VCVTTSD2USI` | `VCVTTSD2USIL`, `VCVTTSD2USIQ` |
| `VCVTTSS2USI` | `VCVTTSS2USIL`, `VCVTTSS2USIQ` |
| `VCVTUSI2SD` | `VCVTUSI2SDL`, `VCVTUSI2SDQ` |
| `VCVTUSI2SS` | `VCVTUSI2SSL`, `VCVTUSI2SSQ` |
| `VCVTSI2SD` | `VCVTSI2SDL`, `VCVTSI2SDQ` |
| `VCVTSI2SS` | `VCVTSI2SSL`, `VCVTSI2SSQ` |
| `ANDN` | `ANDNL`, `ANDNQ` |
| `BEXTR` | `BEXTRL`, `BEXTRQ` |
| `BLSI` | `BLSIL`, `BLSIQ` |
| `BLSMSK` | `BLSMSKL`, `BLSMSKQ` |
| `BLSR` | `BLSRL`, `BLSRQ` |
| `BZHI` | `BZHIL`, `BZHIQ` |
| `MULX` | `MULXL`, `MULXQ` |
| `PDEP` | `PDEPL`, `PDEPQ` |
| `PEXT` | `PEXTL`, `PEXTQ` |
| `RORX` | `RORXL`, `RORXQ` |
| `SARX` | `SARXL`, `SARXQ` |
| `SHLX` | `SHLXL`, `SHLXQ` |
| `SHRX` | `SHRXL`, `SHRXQ` |

### Encoder details

Bitwise comparison with the output of the older encoder may fail for VEX-encoded instructions because the encoder tables are ordered slightly differently.

This difference may arise in the reg-reg case for instructions that have both `{reg, reg/mem}` and `{reg/mem, reg}` forms. One such instruction is `VMOVUPS`.

This does not change code behavior, and it makes the code neither bigger nor less efficient.
The new encoding selection scheme is borrowed from [Intel XED](https://github.com/intelxed/xed).

EVEX encoding is used when any of the following is true:

* The instruction uses new registers (high 16 `X`/`Y`, `Z` or `K` registers)
* The instruction uses EVEX-related opcode suffixes like `BCST`
* The instruction uses an operand combination that is only available in AVX-512

In all other cases, VEX encoding is used.
This means that VEX is used whenever possible, and EVEX whenever required.

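The selection rule above amounts to a three-way "or". The following sketch restates it in Go for an instruction that has both VEX and EVEX forms; the `insn` type, its field names and the `encoding` function are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// insn lists the properties of one instruction that matter when
// choosing between VEX and EVEX. All names are hypothetical.
type insn struct {
	usesNewRegs   bool // high 16 X/Y registers, or any Z or K register
	hasEVEXSuffix bool // opcode suffix such as BCST
	avx512Only    bool // operand combination exists only in AVX-512
}

// encoding returns "EVEX" if any EVEX-only feature is required,
// and "VEX" otherwise.
func encoding(in insn) string {
	if in.usesNewRegs || in.hasEVEXSuffix || in.avx512Only {
		return "EVEX"
	}
	return "VEX"
}

func main() {
	// VADDPD X1, X2, X3: nothing EVEX-specific.
	fmt.Println(encoding(insn{}))
	// VADDPD X1, X2, X17: X17 is one of the high 16 registers.
	fmt.Println(encoding(insn{usesNewRegs: true}))
	// VADDPD.BCST (AX), X2, X3: embedded broadcast needs EVEX.
	fmt.Println(encoding(insn{hasEVEXSuffix: true}))
}
```
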
Compressed disp8 is applied whenever possible for EVEX-encoded instructions.
This also covers broadcasting disp8, which sometimes has a different N multiplier.

Experienced readers can inspect [avx_optabs.go](https://github.com/golang/go/blob/master/src/cmd/internal/obj/x86/avx_optabs.go) to learn the N multipliers for any instruction.

For example, `VADDPD` has these:
* `N=64` for 512-bit form; `N=8` when broadcasting
* `N=32` for 256-bit form; `N=8` when broadcasting
* `N=16` for 128-bit form; `N=8` when broadcasting

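With compressed disp8, the stored byte is multiplied by N to obtain the effective displacement, so a displacement qualifies only if it is a multiple of N and the quotient fits in a signed byte. A minimal sketch of that check, using the `VADDPD` multipliers listed above (the function name `compressDisp8` is made up for this example):

```go
package main

import "fmt"

// compressDisp8 reports whether disp can be stored as a compressed
// 8-bit displacement with multiplier n, and returns the stored byte.
func compressDisp8(disp, n int32) (int8, bool) {
	if disp%n != 0 {
		return 0, false // not a multiple of N
	}
	q := disp / n
	if q < -128 || q > 127 {
		return 0, false // quotient does not fit in a signed byte
	}
	return int8(q), true
}

func main() {
	fmt.Println(compressDisp8(64, 64))   // 512-bit form: stored as 1
	fmt.Println(compressDisp8(8, 64))    // 512-bit form: not a multiple of 64
	fmt.Println(compressDisp8(8, 8))     // broadcasting: stored as 1
	fmt.Println(compressDisp8(8192, 64)) // 8192/64 = 128 is out of range
}
```

When the check fails, a full 32-bit displacement has to be used instead.
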
### Examples

An exhaustive set of examples can be found in the Go assembler [test suite](https://github.com/golang/go/tree/master/src/cmd/asm/internal/asm/testdata/avx512enc).

Each file provides several examples for every supported instruction form in a particular AVX-512 extension.
Every example also includes the generated machine code.

Here is "Vectorized Histogram Update Using AVX-512CD", adapted from the
[Intel® Optimization Manual](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf):

```go
for i := 0; i < 512; i++ {
	histo[key[i]] += 1
}
```

```asm
top:
        VMOVUPS 0x40(SP)(DX*4), Z4      //; vmovups zmm4, [rsp+rdx*4+0x40]
        VPXORD Z1, Z1, Z1               //; vpxord zmm1, zmm1, zmm1
        KMOVW K1, K2                    //; kmovw k2, k1
        VPCONFLICTD Z4, Z2              //; vpconflictd zmm2, zmm4
        VPGATHERDD (AX)(Z4*4), K2, Z1   //; vpgatherdd zmm1{k2}, [rax+zmm4*4]
        VPTESTMD histo<>(SB), Z2, K0    //; vptestmd k0, zmm2, [rip+0x185c]
        KMOVW K0, CX                    //; kmovw ecx, k0
        VPADDD Z0, Z1, Z3               //; vpaddd zmm3, zmm1, zmm0
        TESTL CX, CX                    //; test ecx, ecx
        JZ noConflicts                  //; jz noConflicts
        VMOVUPS histo<>(SB), Z1         //; vmovups zmm1, [rip+0x1884]
        VPTESTMD histo<>(SB), Z2, K0    //; vptestmd k0, zmm2, [rip+0x18ba]
        VPLZCNTD Z2, Z5                 //; vplzcntd zmm5, zmm2
        XORB BX, BX                     //; xor bl, bl
        KMOVW K0, CX                    //; kmovw ecx, k0
        VPSUBD Z5, Z1, Z1               //; vpsubd zmm1, zmm1, zmm5
        VPSUBD Z5, Z1, Z1               //; vpsubd zmm1, zmm1, zmm5

resolveConflicts:
        VPBROADCASTD CX, Z5             //; vpbroadcastd zmm5, ecx
        KMOVW CX, K2                    //; kmovw k2, ecx
        VPERMD Z3, Z1, K2, Z3           //; vpermd zmm3{k2}, zmm1, zmm3
        VPADDD Z0, Z3, K2, Z3           //; vpaddd zmm3{k2}, zmm3, zmm0
        VPTESTMD Z2, Z5, K2, K0         //; vptestmd k0{k2}, zmm5, zmm2
        KMOVW K0, SI                    //; kmovw esi, k0
        ANDL SI, CX                     //; and ecx, esi
        JZ noConflicts                  //; jz noConflicts
        ADDB $1, BX                     //; add bl, 0x1
        CMPB BX, $16                    //; cmp bl, 0x10
        JB resolveConflicts             //; jb resolveConflicts

noConflicts:
        KMOVW K1, K2                    //; kmovw k2, k1
        VPSCATTERDD Z3, K2, (AX)(Z4*4)  //; vpscatterdd [rax+zmm4*4]{k2}, zmm3
        ADDL $16, DX                    //; add edx, 0x10
        CMPL DX, $1024                  //; cmp edx, 0x400
        JB top                          //; jb top
```