Go 1.11 release introduces [AVX-512](https://en.wikipedia.org/wiki/AVX-512) support.
This page describes how to use the new features as well as some important encoder details.

### Terminology

Most terminology comes from the [Intel Software Developer's Manual](https://software.intel.com/en-us/articles/intel-sdm).
Suffixes originate from Go assembler syntax, which is close to AT&T syntax and, like it, uses size suffixes.

Some terms are listed to avoid ambiguity (for example, "opcode" can have different meanings).

<table>
<tr>
<th>Term</th>
<th>Description</th>
</tr>
<tr>
<td>Operand</td>
<td>
Same as "instruction argument".
</td>
</tr>
<tr>
<td>Opcode</td>
<td>
Name that refers to an instruction group. For example, <code>VADDPD</code> is an opcode.<br>
It refers to both VEX and EVEX encoded forms and all operand combinations.<br>
Most Go assembler opcodes for AVX-512 match Intel manual entries, with exceptions for cases<br>
where an additional size suffix is used (e.g. Go <code>VCVTTPD2DQY</code> corresponds to Intel <code>VCVTTPD2DQ</code>).
</td>
</tr>
<tr>
<td>Opcode suffix</td>
<td>
Suffix that overrides some opcode properties. Listed after "." (dot).<br>
For example, <code>VADDPD.Z</code> has the "Z" opcode suffix.<br>
There can be multiple dot-separated opcode suffixes.
</td>
</tr>
<tr>
<td>Size suffix</td>
<td>
Suffix that specifies the instruction operand size if it can't be inferred from the operands alone.<br>
For example, <code>VCVTSS2USIL</code> has the "L" size suffix.
</td>
</tr>
<tr>
<td>Opmask</td>
<td>
Used both for the <code>{k1}</code> notation and to describe instructions that have <code>K</code> register operands.<br>
Related to masking support in the EVEX prefix.
</td>
</tr>
<tr>
<td>Register block</td>
<td>
Multi-source operand that encodes a register range.<br>
The Intel manual uses <code>+n</code> notation for register blocks.<br>
For example, <code>+3</code> is a register block of 4 registers.
</td>
</tr>
<tr>
<td>FP</td>
<td>Floating-point</td>
</tr>
</table>

### New registers

In 64-bit mode, EVEX-enabled instructions can access 16 additional `X` (128-bit xmm) and `Y` (256-bit ymm) registers, plus 32 new `Z` (512-bit zmm) registers. 32-bit mode only gets `Z0-Z7`.

New opmask registers are named `K0-K7`.
They can be used both for masking and for special opmask instructions (like `KADDB`).

### Masking support

Instructions that support masking can omit the `K` register operand.
In this case, the `K0` register is implied ("all ones") and merging-masking is performed.
This is effectively "no masking".

`K1-K7` registers can be used to override the default opmask.
The `K` register should be placed right before the destination operand.

Zeroing-masking can be activated with the `Z` opcode suffix.

For example, `VADDPD.Z (AX), Z30, K3, Z10` uses zeroing-masking and an explicit `K` register.
- If the `Z` opcode suffix is removed, it's merging-masking.
- If the `K3` operand is removed, the `K0` operand is implied.

It's a compile-time error to use the `K0` register for `{k1}` operands (consult the [manuals](https://software.intel.com/en-us/articles/intel-sdm) for details).

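The difference between merging-masking and zeroing-masking can be modeled in plain Go. The sketch below is only an illustration of the semantics described above, not how the assembler or the CPU implements them; `maskedAdd` and `lanes` are hypothetical names, and a 512-bit vector is assumed to hold 8 `float64` lanes.

```go
package main

import "fmt"

// lanes is the number of float64 elements in a 512-bit vector.
const lanes = 8

// maskedAdd is a hypothetical scalar model of an opmask-controlled packed add.
// Bit i of mask selects lane i. A lane whose mask bit is clear either keeps
// the old destination value (merging-masking) or becomes zero (zeroing-masking).
func maskedAdd(dst, a, b [lanes]float64, mask uint8, zeroing bool) [lanes]float64 {
	for i := 0; i < lanes; i++ {
		switch {
		case mask>>uint(i)&1 == 1:
			dst[i] = a[i] + b[i]
		case zeroing:
			dst[i] = 0
		}
	}
	return dst
}

func main() {
	dst := [lanes]float64{9, 9, 9, 9, 9, 9, 9, 9}
	a := [lanes]float64{1, 2, 3, 4, 5, 6, 7, 8}
	b := [lanes]float64{10, 10, 10, 10, 10, 10, 10, 10}

	fmt.Println(maskedAdd(dst, a, b, 0x0F, false)) // merging-masking
	fmt.Println(maskedAdd(dst, a, b, 0x0F, true))  // zeroing-masking
	fmt.Println(maskedAdd(dst, a, b, 0xFF, false)) // all ones: effectively no masking
}
```

With mask `0x0F`, the upper four lanes keep the value 9 under merging-masking and become 0 under zeroing-masking.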
### EVEX broadcast/rounding/SAE support

Embedded broadcast, rounding and SAE are activated through opcode suffixes.

For reg-reg FP instructions with `{er}` enabled, a rounding opcode suffix can be specified:

* `RU_SAE` to round towards +Inf
* `RD_SAE` to round towards -Inf
* `RZ_SAE` to round towards zero
* `RN_SAE` to round towards nearest

> To read more about rounding modes, see [MXCSR.RC info](http://qcd.phys.cmu.edu/QCDcluster/intel/vtune/reference/vc148.htm).

For reg-reg FP instructions with `{sae}` enabled, exception suppression can be specified with the `SAE` opcode suffix.

For reg-mem instructions with an `m32bcst/m64bcst` operand, broadcasting can be turned on with the `BCST` opcode suffix.

The zeroing opcode suffix can be combined with any of these.
For example, `VMAXPD.SAE.Z Z3, Z2, Z1` uses both `Z` and `SAE` opcode suffixes.
It is important to put the zeroing opcode suffix last, otherwise it is a compilation error.

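The ordering rule for the `Z` suffix can be illustrated with a small checker. This is a minimal sketch and not the assembler's actual parser: `checkSuffixes` is a hypothetical helper, and the error text is made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// checkSuffixes is a hypothetical helper (not part of the Go toolchain).
// It splits an opcode such as "VMAXPD.SAE.Z" into its name and its
// dot-separated suffixes and reports an error if "Z" is not the last suffix.
func checkSuffixes(opcode string) (name string, suffixes []string, err error) {
	parts := strings.Split(opcode, ".")
	name, suffixes = parts[0], parts[1:]
	for i, s := range suffixes {
		if s == "Z" && i != len(suffixes)-1 {
			return name, suffixes, fmt.Errorf("%s: Z suffix must be last", opcode)
		}
	}
	return name, suffixes, nil
}

func main() {
	for _, op := range []string{"VMAXPD.SAE.Z", "VMAXPD.Z.SAE", "VADDPD"} {
		name, suffixes, err := checkSuffixes(op)
		fmt.Println(name, suffixes, err)
	}
}
```

Only the second opcode is rejected, because its `Z` suffix is followed by `SAE`.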
### Register block (multi-source) operands

Register blocks are specified using register range syntax.

It would be enough to specify just the first (low) register, but the Go assembler requires
an explicit range with both ends for readability reasons.

For example, instructions with a `+3` range can be used like `VP4DPWSSD Z25, [Z0-Z3], (AX)`.
The range `[Z0-Z3]` reads like "register block of Z0, Z1, Z2, Z3".
Invalid ranges result in a compilation error.

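To make the range notation concrete, the sketch below expands a range into the registers it covers. It is an illustration under stated assumptions, not the assembler's validation logic: `expandBlock` is a hypothetical helper, and it assumes that a range is valid when both ends are in `0..31` and exactly `n` apart for a `+n` block.

```go
package main

import (
	"errors"
	"fmt"
)

// expandBlock is a hypothetical helper. Given a register class letter
// ('X', 'Y' or 'Z'), the low and high indexes of a range such as [Z0-Z3],
// and n from the Intel "+n" notation, it returns the names of all registers
// in the block. Assumption: a range is valid when its ends are n apart.
func expandBlock(class byte, low, high, n int) ([]string, error) {
	if low < 0 || high > 31 {
		return nil, errors.New("register index out of range")
	}
	if high-low != n {
		return nil, fmt.Errorf("range [%c%d-%c%d] does not match +%d block", class, low, class, high, n)
	}
	regs := make([]string, 0, n+1)
	for i := low; i <= high; i++ {
		regs = append(regs, fmt.Sprintf("%c%d", class, i))
	}
	return regs, nil
}

func main() {
	fmt.Println(expandBlock('Z', 0, 3, 3)) // the [Z0-Z3] example above
	fmt.Println(expandBlock('Z', 0, 2, 3)) // too short for a +3 block
}
```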
### AVX1 and AVX2 instructions with EVEX prefix

Pre-existing opcodes that can be encoded with the EVEX prefix can now access AVX-512 features such as the wider register file and zeroing/merging masking. For example, `VADDPD` can now use 512-bit vector registers.

See [encoder details](#encoder-details) for more info.

### Supported extensions

The best way to get an up-to-date list of supported extensions is to run `ls -1` inside the [test suite](https://github.com/golang/go/tree/master/src/cmd/asm/internal/asm/testdata/avx512enc) directory.

Latest list includes:
```
aes_avx512f
avx512_4fmaps
avx512_4vnniw
avx512_bitalg
avx512_ifma
avx512_vbmi
avx512_vbmi2
avx512_vnni
avx512_vpopcntdq
avx512bw
avx512cd
avx512dq
avx512er
avx512f
avx512pf
gfni_avx512f
vpclmulqdq_avx512f
```

128-bit and 256-bit instructions additionally require `avx512vl`.
That is, even if `VADDPD` is available in `avx512f`, you can't use it with `X` and `Y` arguments
without `avx512vl`.

Filenames follow `GNU as` (gas) conventions.
[avx512extmap.csv](https://gist.github.com/Quasilyte/92321dadcc3f86b05c1aeda2c13c851f) can make the naming scheme more apparent.

### Instructions with size suffix

Some opcodes do not match Intel manual entries.
This section is provided for search convenience.

| Intel opcode | Go assembler opcodes |
|--------------|----------------------|
| `VCVTPD2DQ` | `VCVTPD2DQX`, `VCVTPD2DQY` |
| `VCVTPD2PS` | `VCVTPD2PSX`, `VCVTPD2PSY` |
| `VCVTTPD2DQ` | `VCVTTPD2DQX`, `VCVTTPD2DQY` |
| `VCVTQQ2PS` | `VCVTQQ2PSX`, `VCVTQQ2PSY` |
| `VCVTUQQ2PS` | `VCVTUQQ2PSX`, `VCVTUQQ2PSY` |
| `VCVTPD2UDQ` | `VCVTPD2UDQX`, `VCVTPD2UDQY` |
| `VCVTTPD2UDQ` | `VCVTTPD2UDQX`, `VCVTTPD2UDQY` |
| `VFPCLASSPD` | `VFPCLASSPDX`, `VFPCLASSPDY`, `VFPCLASSPDZ` |
| `VFPCLASSPS` | `VFPCLASSPSX`, `VFPCLASSPSY`, `VFPCLASSPSZ` |
| `VCVTSD2SI` | `VCVTSD2SI`, `VCVTSD2SIQ` |
| `VCVTTSD2SI` | `VCVTTSD2SI`, `VCVTTSD2SIQ` |
| `VCVTTSS2SI` | `VCVTTSS2SI`, `VCVTTSS2SIQ` |
| `VCVTSS2SI` | `VCVTSS2SI`, `VCVTSS2SIQ` |
| `VCVTSD2USI` | `VCVTSD2USIL`, `VCVTSD2USIQ` |
| `VCVTSS2USI` | `VCVTSS2USIL`, `VCVTSS2USIQ` |
| `VCVTTSD2USI` | `VCVTTSD2USIL`, `VCVTTSD2USIQ` |
| `VCVTTSS2USI` | `VCVTTSS2USIL`, `VCVTTSS2USIQ` |
| `VCVTUSI2SD` | `VCVTUSI2SDL`, `VCVTUSI2SDQ` |
| `VCVTUSI2SS` | `VCVTUSI2SSL`, `VCVTUSI2SSQ` |
| `VCVTSI2SD` | `VCVTSI2SDL`, `VCVTSI2SDQ` |
| `VCVTSI2SS` | `VCVTSI2SSL`, `VCVTSI2SSQ` |
| `ANDN` | `ANDNL`, `ANDNQ` |
| `BEXTR` | `BEXTRL`, `BEXTRQ` |
| `BLSI` | `BLSIL`, `BLSIQ` |
| `BLSMSK` | `BLSMSKL`, `BLSMSKQ` |
| `BLSR` | `BLSRL`, `BLSRQ` |
| `BZHI` | `BZHIL`, `BZHIQ` |
| `MULX` | `MULXL`, `MULXQ` |
| `PDEP` | `PDEPL`, `PDEPQ` |
| `PEXT` | `PEXTL`, `PEXTQ` |
| `RORX` | `RORXL`, `RORXQ` |
| `SARX` | `SARXL`, `SARXQ` |
| `SHLX` | `SHLXL`, `SHLXQ` |
| `SHRX` | `SHRXL`, `SHRXQ` |

### Encoder details

A bitwise comparison against the output of the older encoder may fail for VEX-encoded instructions because the encoder tables are ordered slightly differently.

The difference can arise in the reg-reg case for instructions that have both `{reg, reg/mem}` and `{reg/mem, reg}` forms. `VMOVUPS` is one such instruction.

This does not affect code behavior, and it makes the code neither bigger nor less efficient.
The new encoding selection scheme is borrowed from [Intel XED](https://github.com/intelxed/xed).

EVEX encoding is used when any of the following is true:

* Instruction uses new registers (high 16 `X`/`Y` registers, `Z` or `K` registers)
* Instruction uses EVEX-related opcode suffixes like `BCST`
* Instruction uses an operand combination that is only available for AVX-512

In all other cases VEX encoding is used.
This means that VEX is used whenever possible, and EVEX whenever required.

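The selection rule above boils down to a three-way "or". The following sketch restates it in Go; `insnTraits` and `chooseEncoding` are hypothetical names used for illustration only and do not mirror the encoder's real data structures.

```go
package main

import "fmt"

// insnTraits is a hypothetical summary of the three conditions listed above.
type insnTraits struct {
	usesNewRegs    bool // high 16 X/Y registers, or any Z or K register
	usesEVEXSuffix bool // EVEX-related opcode suffix such as BCST
	avx512OnlyForm bool // operand combination that exists only in AVX-512
}

// chooseEncoding returns "EVEX" when any condition holds and "VEX" otherwise,
// so VEX is used whenever possible and EVEX whenever required.
func chooseEncoding(t insnTraits) string {
	if t.usesNewRegs || t.usesEVEXSuffix || t.avx512OnlyForm {
		return "EVEX"
	}
	return "VEX"
}

func main() {
	fmt.Println(chooseEncoding(insnTraits{}))                     // no AVX-512 features used
	fmt.Println(chooseEncoding(insnTraits{usesNewRegs: true}))    // e.g. a Z register operand
	fmt.Println(chooseEncoding(insnTraits{usesEVEXSuffix: true})) // e.g. a BCST suffix
}
```
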
Compressed disp8 is applied whenever possible for EVEX-encoded instructions.
This also covers broadcasting disp8, which sometimes has a different N multiplier.

Experienced readers can inspect [avx_optabs.go](https://github.com/golang/go/blob/master/src/cmd/internal/obj/x86/avx_optabs.go) to learn about N multipliers for any instruction.

For example, `VADDPD` has these:
* `N=64` for 512-bit form; `N=8` when broadcasting
* `N=32` for 256-bit form; `N=8` when broadcasting
* `N=16` for 128-bit form; `N=8` when broadcasting

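The effect of the N multiplier can be shown with a short calculation. The sketch below assumes the usual disp8*N rule: a displacement can be stored in one byte only if it is an exact multiple of N and the quotient fits in a signed 8-bit value. `compressDisp8` is a hypothetical helper, not a function from the Go encoder.

```go
package main

import "fmt"

// compressDisp8 is a hypothetical helper. It reports whether disp can be
// stored as a compressed disp8 for multiplier n and, if so, returns the
// stored byte, which is disp/n.
func compressDisp8(disp, n int32) (int8, bool) {
	if n <= 0 || disp%n != 0 {
		return 0, false // not an exact multiple of N
	}
	q := disp / n
	if q < -128 || q > 127 {
		return 0, false // quotient does not fit in a signed byte
	}
	return int8(q), true
}

func main() {
	// N values are the VADDPD ones listed above.
	fmt.Println(compressDisp8(128, 64)) // 512-bit form
	fmt.Println(compressDisp8(128, 8))  // broadcasting
	fmt.Println(compressDisp8(100, 64)) // not a multiple of 64
}
```

A displacement that fails either check cannot use the one-byte form and needs a full 32-bit displacement instead.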
### Examples

An exhaustive set of examples can be found in the Go assembler [test suite](https://github.com/golang/go/tree/master/src/cmd/asm/internal/asm/testdata/avx512enc).

Each file provides several examples for every supported instruction form in a particular AVX-512 extension.
Every example also includes the generated machine code.

Here is the "Vectorized Histogram Update Using AVX-512CD" example, adapted from the
[Intel® Optimization Manual](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf):

```go
for i := 0; i < 512; i++ {
	histo[key[i]] += 1
}
```

```asm
top:
	VMOVUPS      0x40(SP)(DX*4), Z4   //; vmovups zmm4, [rsp+rdx*4+0x40]
	VPXORD       Z1, Z1, Z1           //; vpxord zmm1, zmm1, zmm1
	KMOVW        K1, K2               //; kmovw k2, k1
	VPCONFLICTD  Z4, Z2               //; vpconflictd zmm2, zmm4
	VPGATHERDD   (AX)(Z4*4), K2, Z1   //; vpgatherdd zmm1{k2}, [rax+zmm4*4]
	VPTESTMD     histo<>(SB), Z2, K0  //; vptestmd k0, zmm2, [rip+0x185c]
	KMOVW        K0, CX               //; kmovw ecx, k0
	VPADDD       Z0, Z1, Z3           //; vpaddd zmm3, zmm1, zmm0
	TESTL        CX, CX               //; test ecx, ecx
	JZ           noConflicts          //; jz noConflicts
	VMOVUPS      histo<>(SB), Z1      //; vmovups zmm1, [rip+0x1884]
	VPTESTMD     histo<>(SB), Z2, K0  //; vptestmd k0, zmm2, [rip+0x18ba]
	VPLZCNTD     Z2, Z5               //; vplzcntd zmm5, zmm2
	XORB         BX, BX               //; xor bl, bl
	KMOVW        K0, CX               //; kmovw ecx, k0
	VPSUBD       Z5, Z1, Z1           //; vpsubd zmm1, zmm1, zmm5
	VPSUBD       Z5, Z1, Z1           //; vpsubd zmm1, zmm1, zmm5

resolveConflicts:
	VPBROADCASTD CX, Z5               //; vpbroadcastd zmm5, ecx
	KMOVW        CX, K2               //; kmovw k2, ecx
	VPERMD       Z3, Z1, K2, Z3       //; vpermd zmm3{k2}, zmm1, zmm3
	VPADDD       Z0, Z3, K2, Z3       //; vpaddd zmm3{k2}, zmm3, zmm0
	VPTESTMD     Z2, Z5, K2, K0       //; vptestmd k0{k2}, zmm5, zmm2
	KMOVW        K0, SI               //; kmovw esi, k0
	ANDL         SI, CX               //; and ecx, esi
	JZ           noConflicts          //; jz noConflicts
	ADDB         $1, BX               //; add bl, 0x1
	CMPB         BX, $16              //; cmp bl, 0x10
	JB           resolveConflicts     //; jb resolveConflicts

noConflicts:
	KMOVW        K1, K2               //; kmovw k2, k1
	VPSCATTERDD  Z3, K2, (AX)(Z4*4)   //; vpscatterdd [rax+zmm4*4]{k2}, zmm3
	ADDL         $16, DX              //; add edx, 0x10
	CMPL         DX, $1024            //; cmp edx, 0x400
	JB           top                  //; jb top
```