User Guide for AMDGPU Backend
Introduction
The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
llvm/lib/Target/AMDGPU directory.
LLVM
Target Triples
Use the Clang option -target <Architecture>-<Vendor>-<OS>-<Environment>
to specify the target triple:
Table 18 AMDGPU Architectures
| Architecture | Description |
|---|---|
| r600 | AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders. |
| amdgcn | AMD GPUs GCN GFX6 onwards for graphics and compute shaders. |
Table 19 AMDGPU Vendors
| Vendor | Description |
|---|---|
| amd | Can be used for all AMD GPU usage. |
| mesa | Can be used if the OS is mesa3d. |
Table 20 AMDGPU Operating Systems
| OS | Description |
|---|---|
| <empty> | Defaults to the unknown OS. |
| amdhsa | Compute kernels executed on HSA [HSA] compatible runtimes such as: AMD’s ROCm™ runtime [AMD-ROCm] using the rocm-amdhsa loader on Linux (see AMD ROCm Platform Release Notes [AMD-ROCm-Release-Notes] for supported hardware and software); AMD’s PAL runtime using the pal-amdhsa loader on Windows. |
| amdpal | Graphic shaders and compute kernels executed on AMD’s PAL runtime using the pal-amdpal loader on Windows and Linux Pro. |
| mesa3d | Graphic shaders and compute kernels executed on AMD’s Mesa 3D runtime using the mesa-mesa3d loader on Linux. |
Table 21 AMDGPU Environments
| Environment | Description |
|---|---|
| <empty> | Default. |
Processors
Use the Clang options -mcpu=<target-id> or --offload-arch=<target-id> to
specify the AMDGPU processor together with optional target features. See
Target ID and Target Features for AMD GPU target
specific information.
Every processor supports every OS ABI (see AMDGPU Operating Systems) with the following exceptions:
- amdhsa is not supported in the r600 architecture (see AMDGPU Architectures).
Table 22 AMDGPU Processors
Processor
Alternative Processor
Target Triple Architecture
dGPU/ APU
Target Features Supported
Target Properties
OS Support (see amdgpu-os and corresponding runtime release notes for current information and level of support)
Example Products
Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]
r600r600dGPU
Does not support generic address space
r630r600dGPU
Does not support generic address space
rs880r600dGPU
Does not support generic address space
rv670r600dGPU
Does not support generic address space
Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]
rv710r600dGPU
Does not support generic address space
rv730r600dGPU
Does not support generic address space
rv770r600dGPU
Does not support generic address space
Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]
cedarr600dGPU
Does not support generic address space
cypressr600dGPU
Does not support generic address space
juniperr600dGPU
Does not support generic address space
redwoodr600dGPU
Does not support generic address space
sumor600dGPU
Does not support generic address space
Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]
bartsr600dGPU
Does not support generic address space
caicosr600dGPU
Does not support generic address space
caymanr600dGPU
Does not support generic address space
turksr600dGPU
Does not support generic address space
GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]
gfx600tahiti
amdgcndGPU
Does not support generic address space
pal-amdpal
gfx601pitcairnverde
amdgcndGPU
Does not support generic address space
pal-amdpal
gfx602hainanoland
amdgcndGPU
Does not support generic address space
pal-amdpal
GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]
gfx700kaveri
amdgcnAPU
Offset flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
A6-7000
A6 Pro-7050B
A8-7100
A8 Pro-7150B
A10-7300
A10 Pro-7350B
FX-7500
A8-7200P
A10-7400P
FX-7600P
gfx701hawaii
amdgcndGPU
Offset flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
FirePro W8100
FirePro W9100
FirePro S9150
FirePro S9170
gfx702amdgcndGPU
Offset flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
Radeon R9 290
Radeon R9 290x
Radeon R390
Radeon R390x
gfx703kabinimullins
amdgcnAPU
Offset flat scratch
pal-amdhsa
pal-amdpal
E1-2100
E1-2200
E1-2500
E2-3000
E2-3800
A4-5000
A4-5100
A6-5200
A4 Pro-3340B
gfx704bonaire
amdgcndGPU
Offset flat scratch
pal-amdhsa
pal-amdpal
Radeon HD 7790
Radeon HD 8770
R7 260
R7 260X
gfx705amdgcnAPU
Offset flat scratch
pal-amdhsa
pal-amdpal
TBA
GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]
gfx801carrizo
amdgcnAPU
xnack
Offset flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
A6-8500P
Pro A6-8500B
A8-8600P
Pro A8-8600B
FX-8800P
Pro A12-8800B
A10-8700P
Pro A10-8700B
A10-8780P
A10-9600P
A10-9630P
A12-9700P
A12-9730P
FX-9800P
FX-9830P
E2-9010
A6-9210
A9-9410
gfx802icelandtonga
amdgcndGPU
Offset flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
Radeon R9 285
Radeon R9 380
Radeon R9 385
gfx803fiji
amdgcndGPU
rocm-amdhsa
pal-amdhsa
pal-amdpal
Radeon R9 Nano
Radeon R9 Fury
Radeon R9 FuryX
Radeon Pro Duo
FirePro S9300x2
Radeon Instinct MI8
polaris10
amdgcndGPU
Offset flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
Radeon RX 470
Radeon RX 480
Radeon Instinct MI6
polaris11
amdgcndGPU
Offset flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
Radeon RX 460
gfx805tongapro
amdgcndGPU
Offset flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
FirePro S7150
FirePro S7100
FirePro W7100
Mobile FirePro M7170
gfx810stoney
amdgcnAPU
xnack
Offset flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
TBA
GCN GFX9 (Vega) [AMD-GCN-GFX900-GFX904-VEGA] [AMD-GCN-GFX906-VEGA7NM] [AMD-GCN-GFX908-CDNA1] [AMD-GCN-GFX90A-CDNA2] [AMD-GCN-GFX940-GFX942-CDNA3]
gfx900amdgcndGPU
xnack
Absolute flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
Radeon Vega Frontier Edition
Radeon RX Vega 56
Radeon RX Vega 64
Radeon RX Vega 64 Liquid
Radeon Instinct MI25
gfx902amdgcnAPU
xnack
Absolute flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
Ryzen 3 2200G
Ryzen 5 2400G
gfx904amdgcndGPU
xnack
rocm-amdhsa
pal-amdhsa
pal-amdpal
TBA
gfx906amdgcndGPU
sramecc
xnack
Absolute flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
Radeon Instinct MI50
Radeon Instinct MI60
Radeon VII
Radeon Pro VII
gfx908amdgcndGPU
sramecc
xnack
Absolute flat scratch
rocm-amdhsa
AMD Instinct MI100 Accelerator
gfx909amdgcnAPU
xnack
Absolute flat scratch
pal-amdpal
TBA
gfx90aamdgcndGPU
sramecc
tgsplit
xnack
kernarg preload
Absolute flat scratch
Packed work-item IDs
rocm-amdhsa
rocm-amdhsa
rocm-amdhsa
AMD Instinct MI210 Accelerator
AMD Instinct MI250 Accelerator
AMD Instinct MI250X Accelerator
gfx90camdgcnAPU
xnack
Absolute flat scratch
pal-amdpal
Ryzen 7 4700G
Ryzen 7 4700GE
Ryzen 5 4600G
Ryzen 5 4600GE
Ryzen 3 4300G
Ryzen 3 4300GE
Ryzen Pro 4000G
Ryzen 7 Pro 4700G
Ryzen 7 Pro 4750GE
Ryzen 5 Pro 4650G
Ryzen 5 Pro 4650GE
Ryzen 3 Pro 4350G
Ryzen 3 Pro 4350GE
gfx940amdgcndGPU
sramecc
tgsplit
xnack
kernarg preload
Architected flat scratch
Packed work-item IDs
TBA
gfx941amdgcndGPU
sramecc
tgsplit
xnack
kernarg preload
Architected flat scratch
Packed work-item IDs
TBA
gfx942amdgcndGPU
sramecc
tgsplit
xnack
kernarg preload
Architected flat scratch
Packed work-item IDs
TBA
GCN GFX10.1 (RDNA 1) [AMD-GCN-GFX10-RDNA1]
gfx1010amdgcndGPU
cumode
wavefrontsize64
xnack
Absolute flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
Radeon RX 5700
Radeon RX 5700 XT
Radeon Pro 5600 XT
Radeon Pro 5600M
gfx1011amdgcndGPU
cumode
wavefrontsize64
xnack
Absolute flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
Radeon Pro V520
gfx1012amdgcndGPU
cumode
wavefrontsize64
xnack
Absolute flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
Radeon RX 5500
Radeon RX 5500 XT
gfx1013amdgcnAPU
cumode
wavefrontsize64
xnack
Absolute flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
TBA
GCN GFX10.3 (RDNA 2) [AMD-GCN-GFX10-RDNA2]
gfx1030amdgcndGPU
cumode
wavefrontsize64
Absolute flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
Radeon RX 6800
Radeon RX 6800 XT
Radeon RX 6900 XT
gfx1031amdgcndGPU
cumode
wavefrontsize64
Absolute flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
Radeon RX 6700 XT
gfx1032amdgcndGPU
cumode
wavefrontsize64
Absolute flat scratch
rocm-amdhsa
pal-amdhsa
pal-amdpal
TBA
gfx1033amdgcnAPU
cumode
wavefrontsize64
Absolute flat scratch
pal-amdpal
TBA
gfx1034amdgcndGPU
cumode
wavefrontsize64
Absolute flat scratch
pal-amdpal
TBA
gfx1035amdgcnAPU
cumode
wavefrontsize64
Absolute flat scratch
pal-amdpal
TBA
gfx1036amdgcnAPU
cumode
wavefrontsize64
Absolute flat scratch
pal-amdpal
TBA
GCN GFX11 (RDNA 3) [AMD-GCN-GFX11-RDNA3]
gfx1100amdgcndGPU
cumode
wavefrontsize64
Architected flat scratch
Packed work-item IDs
pal-amdpal
TBA
gfx1101amdgcndGPU
cumode
wavefrontsize64
Architected flat scratch
Packed work-item IDs
TBA
gfx1102amdgcndGPU
cumode
wavefrontsize64
Architected flat scratch
Packed work-item IDs
TBA
gfx1103amdgcnAPU
cumode
wavefrontsize64
Architected flat scratch
Packed work-item IDs
TBA
gfx1150amdgcnAPU
cumode
wavefrontsize64
Architected flat scratch
Packed work-item IDs
TBA
gfx1151amdgcnAPU
cumode
wavefrontsize64
Architected flat scratch
Packed work-item IDs
TBA
gfx1152amdgcnAPU
cumode
wavefrontsize64
Architected flat scratch
Packed work-item IDs
TBA
gfx1200amdgcndGPU
cumode
wavefrontsize64
Architected flat scratch
Packed work-item IDs
TBA
gfx1201amdgcndGPU
cumode
wavefrontsize64
Architected flat scratch
Packed work-item IDs
TBA
Generic processors allow execution of a single code object on any of the processors that it supports. Such code objects may not perform as well as those for the non-generic processors.
Generic processors are only available on code object V6 and above (see ELF Code Object).
Generic processor code objects are versioned. See Generic Processor Versioning for more information on how versioning works.
Table 23 AMDGPU Generic Processors
Processor
Target Triple Architecture
Supported Processors
Target Features Supported
Target Properties
Target Restrictions
gfx9-generic
amdgcn
gfx900
gfx902
gfx904
gfx906
gfx909
gfx90c
xnack
Absolute flat scratch
v_mad_mix instructions are not available on gfx900, gfx902, gfx909, gfx90c
v_fma_mix instructions are not available on gfx904
sramecc is not available on gfx906
The following instructions are not available on gfx906:
v_fmac_f32
v_xnor_b32
v_dot4_i32_i8
v_dot8_i32_i4
v_dot2_i32_i16
v_dot2_u32_u16
v_dot4_u32_u8
v_dot8_u32_u4
v_dot2_f32_f16
gfx10-1-generic
amdgcn
gfx1010
gfx1011
gfx1012
gfx1013
xnack
wavefrontsize64
cumode
Absolute flat scratch
The following instructions are not available on gfx1011 and gfx1012:
v_dot4_i32_i8
v_dot8_i32_i4
v_dot2_i32_i16
v_dot2_u32_u16
v_dot2c_f32_f16
v_dot4c_i32_i8
v_dot4_u32_u8
v_dot8_u32_u4
v_dot2_f32_f16
BVH Ray Tracing instructions are not available on gfx1013
gfx10-3-generic
amdgcn
gfx1030
gfx1031
gfx1032
gfx1033
gfx1034
gfx1035
gfx1036
wavefrontsize64
cumode
Absolute flat scratch
No restrictions.
gfx11-generic
amdgcn
gfx1100
gfx1101
gfx1102
gfx1103
gfx1150
gfx1151
gfx1152
wavefrontsize64
cumode
Architected flat scratch
Packed work-item IDs
Various codegen pessimizations are applied to work around some hazards specific to some targets within this family.
Not all VGPRs can be used on:
gfx1100
gfx1101
gfx1151
gfx1152
SALU floating point instructions and single-use VGPR hint instructions are not available on:
gfx1150
gfx1151
gfx1152
SGPRs are not supported for src1 in dpp instructions for:
gfx1150
gfx1151
gfx1152
gfx12-generic
amdgcn
gfx1200
gfx1201
wavefrontsize64
cumode
Architected flat scratch
Packed work-item IDs
No restrictions.
Generic Processor Versioning
Generic processor (see AMDGPU Generic Processors) code objects are versioned (see AMDGPU ELF Header e_flags for Code Object V6 and After) between 1 and 255. The version of non-generic code objects is always set to 0.
For a generic code object, adding a new supported processor may require the code generated for the generic target to be changed so it can continue to execute on the previously supported processors as well as on the new one. When this happens, the generic code object version number is incremented at the same time as the generic target is updated.
Each supported processor of a generic target is mapped to the version it was introduced in. A generic code object can execute on a supported processor if the version of the code object being loaded is greater than or equal to the version in which the processor was added to the generic target.
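The compatibility rule above can be sketched as a small check. This is only an illustration: the `INTRODUCED_IN` table and its version numbers are hypothetical and do not reflect the real versions assigned to any processor.

```python
# Hypothetical mapping from a supported processor to the generic target
# version in which it was introduced. The numbers are made up.
INTRODUCED_IN = {"gfx900": 1, "gfx906": 1, "gfx90c": 2}


def can_execute(code_object_version, processor):
    """Return True if a generic code object of the given version may run
    on the given processor, following the rule described in the text."""
    # Generic code objects are versioned between 1 and 255; 0 is reserved
    # for non-generic code objects.
    if not 1 <= code_object_version <= 255:
        raise ValueError("generic code object versions range from 1 to 255")
    introduced = INTRODUCED_IN.get(processor)
    # Unknown processors are not supported by this generic target at all.
    return introduced is not None and code_object_version >= introduced
```

With the hypothetical table, a version 1 code object would be accepted on `gfx900` but rejected on `gfx90c`, which was only added in version 2.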
Target Features
Target features control how code is generated to support certain processor specific features. Not all target features are supported by all processors. The runtime must ensure that the features supported by the device used to execute the code match the features enabled when generating the code. A mismatch of features may result in incorrect execution, or a reduction in performance.
The target features supported by each processor are listed in Processors.
Target features are controlled by exactly one of the following Clang options:
-mcpu=<target-id> or --offload-arch=<target-id>
The -mcpu and --offload-arch options can specify target features as optional components of the target ID. If omitted, the target feature has the any value. See Target ID.
-m[no-]<target-feature>
Target features not specified by the target ID are specified using a separate option. These target features can have an on or off value. on is specified by omitting the no- prefix, and off is specified by including the no- prefix. The default if not specified is off.
For example:
-mcpu=gfx908:xnack+
Enable the xnack feature.
-mcpu=gfx908:xnack-
Disable the xnack feature.
-mcumode
Enable the cumode feature.
-mno-cumode
Disable the cumode feature.
Table 24 AMDGPU Target Features
Target Feature Name
Clang Option to Control
Description
cumode
-m[no-]cumode
Control the wavefront execution mode used when generating code for kernels. When disabled native WGP wavefront execution mode is used, when enabled CU wavefront execution mode is used (see Memory Model).
sramecc
-mcpu, --offload-arch
If specified, generate code that can only be loaded and executed in a process that has a matching setting for SRAMECC.
If not specified for code object V2 to V3, generate code that can be loaded and executed in a process with SRAMECC enabled.
If not specified for code object V4 or above, generate code that can be loaded and executed in a process with either setting of SRAMECC.
tgsplit
-m[no-]tgsplit
Enable/disable generating code that assumes work-groups are launched in threadgroup split mode. When enabled the waves of a work-group may be launched in different CUs.
wavefrontsize64
-m[no-]wavefrontsize64
Control the wavefront size used when generating code for kernels. When disabled native wavefront size 32 is used, when enabled wavefront size 64 is used.
xnack
-mcpu, --offload-arch
If specified, generate code that can only be loaded and executed in a process that has a matching setting for XNACK replay.
If not specified for code object V2 to V3, generate code that can be loaded and executed in a process with XNACK replay enabled.
If not specified for code object V4 or above, generate code that can be loaded and executed in a process with either setting of XNACK replay.
XNACK replay can be used for demand paging and page migration. If enabled in the device, then if a page fault occurs the code may execute incorrectly unless generated with XNACK replay enabled, or generated for code object V4 or above without specifying XNACK replay. Executing code that was generated with XNACK replay enabled, or generated for code object V4 or above without specifying XNACK replay, on a device that does not have XNACK replay enabled will execute correctly but may be less performant than code generated for XNACK replay disabled.
Target ID
AMDGPU supports target IDs. See Clang Offload Bundler for a general description. The AMDGPU target specific information is:
- processor
Is an AMDGPU processor or alternative processor name specified in AMDGPU Processors. The non-canonical form target ID allows both the primary processor and alternative processor names. The canonical form target ID only allows the primary processor name.
- target-feature
Is a target feature name specified in AMDGPU Target Features that is supported by the processor. The target features supported by each processor are specified in AMDGPU Processors. Those that can be specified in a target ID are marked as being controlled by -mcpu and --offload-arch. Each target feature must appear at most once in a target ID. The non-canonical form target ID allows the target features to be specified in any order. The canonical form target ID requires the target features to be specified in alphabetic order.
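As an illustration of the ordering and uniqueness rules, the following sketch rewrites a target ID such as `gfx908:xnack+:sramecc-` with its features in alphabetic order. The function name is hypothetical, and the sketch does not map alternative processor names to primary names or check that a feature is supported by the processor, both of which a full canonicalization would also need.

```python
def canonicalize_target_id(target_id):
    """Sort the target features of a target ID alphabetically.

    Expects the colon-separated form "<processor>(:<feature>(+|-))*".
    Alternative processor names are NOT translated (out of scope here).
    """
    processor, *features = target_id.split(":")
    settings = {}
    for feature in features:
        # Each feature is a name followed by "+" (on) or "-" (off).
        if len(feature) < 2 or feature[-1] not in "+-":
            raise ValueError("malformed target feature: " + repr(feature))
        name, setting = feature[:-1], feature[-1]
        # Each target feature must appear at most once.
        if name in settings:
            raise ValueError("duplicate target feature: " + name)
        settings[name] = setting
    ordered = [name + settings[name] for name in sorted(settings)]
    return ":".join([processor] + ordered)
```

For example, `canonicalize_target_id("gfx908:xnack+:sramecc-")` yields `gfx908:sramecc-:xnack+`, and a target ID without features is returned unchanged.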
Code Object V2 to V3 Target ID
The target ID syntax for code object V2 to V3 is the same as defined in Clang Offload Bundler except when used in the .amdgcn_target <target-triple> “-” <target-id> assembler directive and the bundle entry ID. In those cases it has the following BNF syntax:
<target-id> ::= <processor> ( "+" <target-feature> )*
Where a target feature is omitted if Off and present if On or Any.
Note
The code object V2 to V3 cannot represent Any and treats it the same as On.
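A minimal sketch of this encoding, assuming each feature setting is given as one of the strings `"on"`, `"off"` or `"any"`. The function name and the dictionary representation are hypothetical, and emitting features in sorted order is an assumption made here for determinism, not something the syntax above requires.

```python
def v2_v3_target_id(processor, features):
    """Build the code object V2 to V3 form "<processor>(+<feature>)*".

    `features` maps a target feature name to "on", "off" or "any".
    """
    present = []
    for name, setting in features.items():
        if setting not in ("on", "off", "any"):
            raise ValueError("unknown setting for " + name + ": " + setting)
        # A feature is present if On or Any, and omitted if Off.
        if setting in ("on", "any"):
            present.append(name)
    # Sorted order is an assumption of this sketch.
    return processor + "".join("+" + name for name in sorted(present))
```

Because Any and On produce the same string, the original distinction cannot be recovered from the result, which matches the note above.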
Embedding Bundled Code Objects
AMDGPU supports the HIP and OpenMP languages that perform code object embedding as described in Clang Offload Bundler.
Note
The target ID syntax used for code object V2 to V3 for a bundle entry ID differs from that used elsewhere. See Code Object V2 to V3 Target ID.
Address Spaces
The AMDGPU architecture supports a number of memory address spaces. The address space names use the OpenCL standard names, with some additions.
The AMDGPU address spaces correspond to target architecture specific LLVM address space numbers used in LLVM IR.
The AMDGPU address spaces are described in
AMDGPU Address Spaces. Only 64-bit process address spaces are
supported for the amdgcn target.
Table 25 AMDGPU Address Spaces
| Address Space Name | LLVM IR Address Space Number | HSA Segment Name | Hardware Name | Address Size (64-bit process address space) | NULL Value (64-bit process address space) |
|---|---|---|---|---|---|
| Generic | 0 | flat | flat | 64 | 0x0000000000000000 |
| Global | 1 | global | global | 64 | 0x0000000000000000 |
| Region | 2 | N/A | GDS | 32 | not implemented for AMDHSA |
| Local | 3 | group | LDS | 32 | 0xFFFFFFFF |
| Constant | 4 | constant | same as global | 64 | 0x0000000000000000 |
| Private | 5 | private | scratch | 32 | 0xFFFFFFFF |
| Constant 32-bit | 6 | TODO | | | 0x00000000 |
| Buffer Fat Pointer | 7 | N/A | N/A | 160 | 0 |
| Buffer Resource | 8 | N/A | V# | 128 | 0x00000000000000000000000000000000 |
| Buffer Strided Pointer (experimental) | 9 | TODO | | | |
| Streamout Registers | 128 | N/A | GS_REGS | | |
- Generic
The generic address space is supported unless the Target Properties column of AMDGPU Processors specifies Does not support generic address space.
The generic address space uses the hardware flat address support for two fixed ranges of virtual addresses (the private and local apertures), that are outside the range of addressable global memory, to map from a flat address to a private or local address. This uses FLAT instructions that can take a flat address and access global, private (scratch), and group (LDS) memory depending on if the address is within one of the aperture ranges.
Flat access to scratch requires hardware aperture setup and setup in the kernel prologue (see Flat Scratch). Flat access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup (see M0).
To convert between a private or group address space address (termed a segment address) and a flat address the base address of the corresponding aperture can be used. For GFX7-GFX8 these are available in the HSA AQL Queue the address of which can be obtained with Queue Ptr SGPR (see Initial Kernel Execution State). For GFX9-GFX11 the aperture base addresses are directly available as inline constant registers SRC_SHARED_BASE/LIMIT and SRC_PRIVATE_BASE/LIMIT. In 64-bit address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32 which makes it easier to convert from flat to segment or segment to flat.
A global address space address has the same value when used as a flat address so no conversion is needed.
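Because each aperture is 2^32 bytes and its base is aligned to 2^32, the conversion reduces to combining or masking the low 32 bits. The sketch below models that arithmetic only; the function names are hypothetical, any aperture base value passed in is an assumption of the caller, and the special handling of NULL values (which differ between segment and flat address spaces) is deliberately left out.

```python
APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode


def segment_to_flat(aperture_base, segment_address):
    """Place a 32-bit private or local address inside its aperture."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    # The base has zero low bits, so OR and addition are equivalent.
    return aperture_base | segment_address


def is_in_aperture(aperture_base, flat_address):
    """Check whether a flat address falls inside the given aperture."""
    return (flat_address >> 32) == (aperture_base >> 32)


def flat_to_segment(flat_address):
    """Recover the segment address by dropping the aperture base bits."""
    return flat_address & (APERTURE_SIZE - 1)
```

A flat address outside both apertures is a global address and, as the text notes, needs no conversion.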
- Global and Constant
The global and constant address spaces both use global virtual addresses, which are the same virtual address space used by the CPU. However, some virtual addresses may only be accessible to the CPU, some only accessible by the GPU, and some by both.
Using the constant address space indicates that the data will not change during the execution of the kernel. This allows scalar read instructions to be used. As the constant address space could only be modified on the host side, a generic pointer loaded from the constant address space is safe to be assumed as a global pointer since only the device global memory is visible and managed on the host side. The vector and scalar L1 caches are invalidated of volatile data before each kernel dispatch execution to allow constant memory to change values between kernel dispatches.
- Region
The region address space uses the hardware Global Data Store (GDS). All wavefronts executing on the same device will access the same memory for any given region address. However, the same region address accessed by wavefronts executing on different devices will access different memory. It is higher performance than global memory. It is allocated by the runtime. The data store (DS) instructions can be used to access it.
- Local
The local address space uses the hardware Local Data Store (LDS) which is automatically allocated when the hardware creates the wavefronts of a work-group, and freed when all the wavefronts of a work-group have terminated. All wavefronts belonging to the same work-group will access the same memory for any given local address. However, the same local address accessed by wavefronts belonging to different work-groups will access different memory. It is higher performance than global memory. The data store (DS) instructions can be used to access it.
- Private
The private address space uses the hardware scratch memory support which automatically allocates memory when it creates a wavefront and frees it when a wavefront terminates. The memory accessed by a lane of a wavefront for any given private address will be different to the memory accessed by another lane of the same or different wavefront for the same private address.
If a kernel dispatch uses scratch, then the hardware allocates memory from a pool of backing memory allocated by the runtime for each wavefront. The lanes of the wavefront access this using dword (4 byte) interleaving. The mapping used from private address to backing memory address is:
wavefront-scratch-base + ((private-address / 4) * wavefront-size * 4) + (wavefront-lane-id * 4) + (private-address % 4)
If each lane of a wavefront accesses the same private address, the interleaving results in adjacent dwords being accessed and hence requires fewer cache lines to be fetched.
There are different ways that the wavefront scratch base address is determined by a wavefront (see Initial Kernel Execution State).
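The dword-interleaved mapping from a private address to a backing memory address can be written out directly. This is a sketch of the formula given above, with a hypothetical function name; the scratch base and wavefront size used in any call are example inputs, not values taken from real hardware.

```python
def scratch_backing_address(wavefront_scratch_base, private_address,
                            lane_id, wavefront_size):
    """Map a lane's private address to its backing memory address."""
    # Split the private address into a dword index and a byte within it.
    dword_index, byte_in_dword = divmod(private_address, 4)
    return (wavefront_scratch_base
            + dword_index * wavefront_size * 4  # one row of dwords per index
            + lane_id * 4                       # this lane's dword in the row
            + byte_in_dword)
```

For a wavefront of size 64 and a scratch base of 0, private address 0 maps to byte offsets 0, 4, 8, ... for lanes 0, 1, 2, ..., while private address 4 in lane 0 maps to offset 256, the start of the next row of dwords.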
Scratch memory can be accessed in an interleaved manner using buffer instructions with the scratch buffer descriptor and per wavefront scratch offset, by the scratch instructions, or by flat instructions. Multi-dword access is not supported except by flat and scratch instructions in GFX9-GFX11.
Code that manipulates the stack values in other lanes of a wavefront, such as by addrspacecast-ing stack pointers to generic ones and taking offsets that reach other lanes or by explicitly constructing the scratch buffer descriptor, triggers undefined behavior when it modifies the scratch values of other lanes. The compiler may assume that such modifications do not occur. When using code object V5, LIBOMPTARGET_STACK_SIZE may be used to provide the private segment size in bytes, for cases where a dynamic stack is used.
- Constant 32-bit
TODO
- Buffer Fat Pointer
The buffer fat pointer is an experimental address space that is currently unsupported in the backend. It exposes a non-integral pointer that is in the future intended to support the modelling of 128-bit buffer descriptors plus a 32-bit offset into the buffer (in total encapsulating a 160-bit pointer), allowing normal LLVM load/store/atomic operations to be used to model the buffer descriptors used heavily in graphics workloads targeting the backend.
The buffer descriptor used to construct a buffer fat pointer must be raw: the stride must be 0, the “add tid” flag must be 0, the swizzle enable bits must be off, and the extent must be measured in bytes. (On subtargets where bounds checking may be disabled, buffer fat pointers may choose to enable it or not).
- Buffer Resource
The buffer resource pointer, in address space 8, is the newer form for representing buffer descriptors in AMDGPU IR, replacing their previous representation as <4 x i32>. It is a non-integral pointer that represents a 128-bit buffer descriptor resource (V#).
Since, in general, a buffer resource supports complex addressing modes that cannot be easily represented in LLVM (such as implicit swizzled access to structured buffers), it is illegal to perform non-trivial address computations, such as getelementptr operations, on buffer resources. They may be passed to AMDGPU buffer intrinsics, and they may be converted to and from i128.
Casting a buffer resource to a buffer fat pointer is permitted and adds an offset of 0.
Buffer resources can be created from 64-bit pointers (which should be either generic or global) using the llvm.amdgcn.make.buffer.rsrc intrinsic, which takes the pointer, which becomes the base of the resource, the 16-bit stride (and swizzle control) field stored in bits 63:48 of a V#, the 32-bit NumRecords/extent field (bits 95:64), and the 32-bit flags field (bits 127:96). The specific interpretation of these fields varies by the target architecture and is detailed in the ISA descriptions.
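The field layout described above can be modelled as bit packing into a 128-bit integer. This is only a sketch of the layout, not of the intrinsic itself: the function name is hypothetical, and placing the base pointer in bits 47:0 is an assumption, since the text only states the positions of the other three fields.

```python
def pack_buffer_rsrc(base, stride, num_records, flags):
    """Pack the four fields into a 128-bit V#-style integer."""
    # Assumption: the base occupies bits 47:0, below the stride field.
    if not 0 <= base < (1 << 48):
        raise ValueError("base must fit in bits 47:0 for this sketch")
    if not 0 <= stride < (1 << 16):
        raise ValueError("stride field is 16 bits (bits 63:48)")
    if not 0 <= num_records < (1 << 32):
        raise ValueError("NumRecords field is 32 bits (bits 95:64)")
    if not 0 <= flags < (1 << 32):
        raise ValueError("flags field is 32 bits (bits 127:96)")
    return base | (stride << 48) | (num_records << 64) | (flags << 96)
```

How the stride, extent and flags values are interpreted depends on the target architecture, so the sketch treats them as opaque integers.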
- Buffer Strided Pointer
The buffer index pointer is an experimental address space. It represents a 128-bit buffer descriptor and a 32-bit offset, like the Buffer Fat Pointer. Additionally, it contains an index into the buffer, which allows the direct addressing of structured elements. These components appear in that order, i.e., the descriptor comes first, then the 32-bit offset followed by the 32-bit index.
The bits in the buffer descriptor must meet the following requirements: the stride is the size of a structured element, the “add tid” flag must be 0, and the swizzle enable bits must be off.
- Streamout Registers
Dedicated registers used by the GS NGG Streamout Instructions. The register file is modelled as a memory in a distinct address space because it is indexed by an address-like offset in place of named registers, and because register accesses affect LGKMcnt. This is an internal address space used only by the compiler. Do not use this address space for IR pointers.
Memory Scopes
This section provides LLVM memory synchronization scopes supported by the AMDGPU
backend memory model when the target triple OS is amdhsa (see
Memory Model and Target Triples).
The memory model supported is based on the HSA memory model [HSA] which is based in turn on HRF-indirect with scope inclusion [HRF]. The happens-before relation is transitive over the synchronizes-with relation independent of scope and synchronizes-with allows the memory scope instances to be inclusive (see table AMDHSA LLVM Sync Scopes).
This is different to the OpenCL [OpenCL] memory model which does not have scope inclusion and requires the memory scopes to exactly match. However, this is conservatively correct for OpenCL.
Table 26 AMDHSA LLVM Sync Scopes
LLVM Sync Scope
Description
none
The default: system.
Synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) for all address spaces (except private, or generic that accesses private) provided the other operation’s sync scope is:
- system.
- agent and executed by a thread on the same agent.
- workgroup and executed by a thread in the same work-group.
- wavefront and executed by a thread in the same wavefront.
agent
Synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) for all address spaces (except private, or generic that accesses private) provided the other operation’s sync scope is:
- system or agent and executed by a thread on the same agent.
- workgroup and executed by a thread in the same work-group.
- wavefront and executed by a thread in the same wavefront.
workgroup
Synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) for all address spaces (except private, or generic that accesses private) provided the other operation’s sync scope is:
- system, agent or workgroup and executed by a thread in the same work-group.
- wavefront and executed by a thread in the same wavefront.
wavefront
Synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) for all address spaces (except private, or generic that accesses private) provided the other operation’s sync scope is:
- system, agent, workgroup or wavefront and executed by a thread in the same wavefront.
singlethread
Only synchronizes with and participates in modification and seq_cst total orderings with, other operations (except image operations) running in the same thread for all address spaces (for example, in signal handlers).
one-as
Same as system but only synchronizes with other operations within the same address space.
agent-one-as
Same as agent but only synchronizes with other operations within the same address space.
workgroup-one-as
Same as workgroup but only synchronizes with other operations within the same address space.
wavefront-one-as
Same as wavefront but only synchronizes with other operations within the same address space.
singlethread-one-as
Same as singlethread but only synchronizes with other operations within the same address space.
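One way to read the scope inclusion rules in the table is that two operations can synchronize when both threads belong to the same instance of the narrower of the two scopes. The sketch below encodes that reading for the `system`, `agent`, `workgroup` and `wavefront` scopes only. The function name and boolean parameters are hypothetical, and `singlethread`, the `one-as` variants, address space exclusions and image operations are not modelled.

```python
# Scopes ordered from narrowest to widest.
SCOPES = ("wavefront", "workgroup", "agent", "system")


def may_synchronize(scope_a, scope_b, same_agent, same_workgroup,
                    same_wavefront):
    """Return True if operations with the two sync scopes can synchronize,
    given how the executing threads are related."""
    for scope in (scope_a, scope_b):
        if scope not in SCOPES:
            raise ValueError("unsupported sync scope: " + scope)
    # The narrower scope decides which instance both threads must share.
    narrowest = min(scope_a, scope_b, key=SCOPES.index)
    required = {
        "system": True,
        "agent": same_agent,
        "workgroup": same_workgroup,
        "wavefront": same_wavefront,
    }
    return bool(required[narrowest])
```

For instance, a `system` operation and an `agent` operation can synchronize only when both threads run on the same agent, which matches the first two rows of the table.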
LLVM IR Intrinsics
The AMDGPU backend implements the following LLVM IR intrinsics.
This section is WIP.
LLVM Intrinsic |
Description |
|---|---|
llvm.amdgcn.sqrt |
Provides direct access to v_sqrt_f64, v_sqrt_f32 and v_sqrt_f16 (on targets with half support). Performs sqrt function. |
llvm.amdgcn.log |
Provides direct access to v_log_f32 and v_log_f16 (on targets with half support). Performs log2 function. |
llvm.amdgcn.exp2 |
Provides direct access to v_exp_f32 and v_exp_f16 (on targets with half support). Performs exp2 function. |
Implemented for half, float and double. |
|
Implemented for float and half (and vectors of float or half). Not implemented for double. Hardware provides 1ULP accuracy for float, and 0.51ULP for half. Float instruction does not natively support denormal inputs. |
|
Implemented for double, float and half (and vectors). |
|
Implemented for float and half (and vectors). |
|
Implemented for float and half (and vectors). |
|
Implemented for float and half (and vectors). |
|
Implemented for float and half (and vectors of float or half). Not implemented for double. Hardware provides 1ULP accuracy for float, and 0.51ULP for half. Float instruction does not natively support denormal inputs. |
|
Implemented, must use the alloca address space. |
|
Implemented, must use the alloca address space. |
|
The natural floating-point mode type is i32. This is implemented by extracting relevant bits out of the MODE register with s_getreg_b32. The first 10 bits are the core floating-point mode. Bits 12:18 are the exception mask. On gfx9+, bit 23 is FP16_OVFL. Bitfields not relevant to floating-point instructions are 0s. |
|
AMDGPU supports two separately controllable rounding modes depending on the floating-point type. One controls float, and the other controls both double and half operations. If both modes are the same, returns one of the standard return values. If the modes are different, returns one of 12 extended values describing the two modes. To nearest, ties away from zero is not a supported mode. The raw rounding mode values in the MODE register do not exactly match the FLT_ROUNDS values, so a conversion is performed. |
|
Input value expected to be one of the valid results
from ‘ |
|
Returns the current value of the AMDGPU floating point environment. This stores information related to the current rounding mode, denormalization mode, enabled traps, and floating point exceptions. The format is a 64-bit concatenation of the MODE and TRAPSTS registers. |
|
Sets the floating point environment to the specified state. |
|
llvm.amdgcn.readfirstlane |
Provides direct access to v_readfirstlane_b32. Returns the value in the lowest active lane of the input operand. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors. |
llvm.amdgcn.readlane |
Provides direct access to v_readlane_b32. Returns the value in the specified lane of the first input operand. The second operand specifies the lane to read from. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors. |
llvm.amdgcn.writelane |
Provides direct access to v_writelane_b32. Writes value in the first input operand to the specified lane of divergent output. The second operand specifies the lane to write. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors. |
llvm.amdgcn.wave.reduce.umin |
Performs an arithmetic unsigned min reduction on the unsigned values provided by each lane in the wavefront. The second operand is a hint for the reduction strategy: 0 for the target default preference, 1 for the iterative strategy, and 2 for DPP. If the target does not support DPP operations (e.g. gfx6/7), the reduction is performed using the default iterative strategy. The intrinsic is currently only implemented for i32. |
llvm.amdgcn.wave.reduce.umax |
Performs an arithmetic unsigned max reduction on the unsigned values provided by each lane in the wavefront. The second operand is a hint for the reduction strategy: 0 for the target default preference, 1 for the iterative strategy, and 2 for DPP. If the target does not support DPP operations (e.g. gfx6/7), the reduction is performed using the default iterative strategy. The intrinsic is currently only implemented for i32. |
llvm.amdgcn.permlane16 |
Provides direct access to v_permlane16_b32. Performs arbitrary gather-style operation within a row (16 contiguous lanes) of the second input operand. The third and fourth inputs must be scalar values. These are combined into a single 64-bit value representing lane selects used to swizzle within each row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors. |
llvm.amdgcn.permlanex16 |
Provides direct access to v_permlanex16_b32. Performs arbitrary gather-style operation across two rows of the second input operand (each row is 16 contiguous lanes). The third and fourth inputs must be scalar values. These are combined into a single 64-bit value representing lane selects used to swizzle within each row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors. |
llvm.amdgcn.permlane64 |
Provides direct access to v_permlane64_b32. Performs a specific permutation across lanes of the input operand where the high half and low half of a wave64 are swapped. Performs no operation in wave32 mode. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors. |
llvm.amdgcn.udot2 |
Provides direct access to v_dot2_u32_u16 across targets which support such instructions. This performs unsigned dot product with two v2i16 operands, summed with the third i32 operand. The i1 fourth operand is used to clamp the output. |
llvm.amdgcn.udot4 |
Provides direct access to v_dot4_u32_u8 across targets which support such instructions. This performs unsigned dot product with two i32 operands (holding a vector of 4 8bit values), summed with the third i32 operand. The i1 fourth operand is used to clamp the output. |
llvm.amdgcn.udot8 |
Provides direct access to v_dot8_u32_u4 across targets which support such instructions. This performs unsigned dot product with two i32 operands (holding a vector of 8 4bit values), summed with the third i32 operand. The i1 fourth operand is used to clamp the output. |
llvm.amdgcn.sdot2 |
Provides direct access to v_dot2_i32_i16 across targets which support such instructions. This performs signed dot product with two v2i16 operands, summed with the third i32 operand. The i1 fourth operand is used to clamp the output. When applicable (e.g. no clamping), this is lowered into v_dot2c_i32_i16 for targets which support it. |
llvm.amdgcn.sdot4 |
Provides direct access to v_dot4_i32_i8 across targets which support such instructions. This performs signed dot product with two i32 operands (holding a vector of 4 8bit values), summed with the third i32 operand. The i1 fourth operand is used to clamp the output. When applicable (i.e. no clamping / operand modifiers), this is lowered into v_dot4c_i32_i8 for targets which support it. RDNA3 does not offer v_dot4_i32_i8, and rather offers v_dot4_i32_iu8 which has operands to hold the signedness of the vector operands. Thus, this intrinsic lowers to the signed version of this instruction for gfx11 targets. |
llvm.amdgcn.sdot8 |
Provides direct access to v_dot8_i32_i4 across targets which support such instructions. This performs signed dot product with two i32 operands (holding a vector of 8 4bit values), summed with the third i32 operand. The i1 fourth operand is used to clamp the output. When applicable (i.e. no clamping / operand modifiers), this is lowered into v_dot8c_i32_i4 for targets which support it. RDNA3 does not offer v_dot8_i32_i4, and rather offers v_dot8_i32_iu4 which has operands to hold the signedness of the vector operands. Thus, this intrinsic lowers to the signed version of this instruction for gfx11 targets. |
llvm.amdgcn.sudot4 |
Provides direct access to v_dot4_i32_iu8 on gfx11 targets. This performs dot product with two i32 operands (holding a vector of 4 8bit values), summed with the fifth i32 operand. The i1 sixth operand is used to clamp the output. The i1s preceding the vector operands decide the signedness. |
llvm.amdgcn.sudot8 |
Provides direct access to v_dot8_i32_iu4 on gfx11 targets. This performs dot product with two i32 operands (holding a vector of 8 4bit values), summed with the fifth i32 operand. The i1 sixth operand is used to clamp the output. The i1s preceding the vector operands decide the signedness. |
llvm.amdgcn.sched_barrier |
Controls the types of instructions that may be allowed to cross the intrinsic during instruction scheduling. The parameter is a mask for the instruction types that can cross the intrinsic.
|
llvm.amdgcn.sched_group_barrier |
Creates schedule groups with specific properties to create custom scheduling pipelines. The ordering between groups is enforced by the instruction scheduler. The intrinsic applies to the code that precedes the intrinsic. The intrinsic takes three values that control the behavior of the schedule groups.
The mask can include multiple instruction types. It is undefined behavior to set values beyond the range of valid masks. Combining multiple sched_group_barrier intrinsics enables an ordering of specific instruction types during instruction scheduling. For example, the following enforces a sequence of 1 VMEM read, followed by 1 VALU instruction, followed by 5 MFMA instructions.
// 1 VMEM read
__builtin_amdgcn_sched_group_barrier(32, 1, 0)
// 1 VALU
__builtin_amdgcn_sched_group_barrier(2, 1, 0)
// 5 MFMA
__builtin_amdgcn_sched_group_barrier(8, 5, 0) |
llvm.amdgcn.iglp_opt |
An experimental intrinsic for instruction group level parallelism. The intrinsic implements predefined instruction scheduling orderings. The intrinsic applies to the surrounding scheduling region. The intrinsic takes a value that specifies the strategy. The compiler implements two strategies.
Only one iglp_opt intrinsic may be used in a scheduling region. The iglp_opt intrinsic cannot be combined with sched_barrier or sched_group_barrier. The iglp_opt strategy implementations are subject to change. |
llvm.amdgcn.atomic.cond.sub.u32 |
Provides direct access to flat_atomic_cond_sub_u32, global_atomic_cond_sub_u32 and ds_cond_sub_u32 based on address space on gfx12 targets. This performs subtraction only if the memory value is greater than or equal to the data value. |
llvm.amdgcn.s.getpc |
Provides access to the s_getpc_b64 instruction, but with the return value sign-extended from the width of the underlying PC hardware register even on processors where the s_getpc_b64 instruction returns a zero-extended value. |
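The packed dot-product intrinsics above (llvm.amdgcn.udot4 and llvm.amdgcn.sdot4) can be modelled on the host with a few lines of Python. This is a minimal sketch for illustration only, not the backend's lowering. It assumes that the unclamped result wraps to 32 bits and that clamping saturates to the 32-bit range; the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def unpack_u8x4(word):
    """Split a 32-bit word into four unsigned 8-bit lanes, lowest byte first."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def to_signed(value, bits):
    """Reinterpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def udot4(a, b, c, clamp):
    """Unsigned 4 x 8-bit dot product accumulated into the unsigned 32-bit c."""
    total = sum(x * y for x, y in zip(unpack_u8x4(a), unpack_u8x4(b)))
    total += c & MASK32
    # Assumption: clamp saturates, otherwise the result wraps modulo 2**32.
    return min(total, MASK32) if clamp else total & MASK32


def sdot4(a, b, c, clamp):
    """Signed 4 x 8-bit dot product accumulated into the signed 32-bit c."""
    lanes_a = [to_signed(x, 8) for x in unpack_u8x4(a)]
    lanes_b = [to_signed(x, 8) for x in unpack_u8x4(b)]
    total = sum(x * y for x, y in zip(lanes_a, lanes_b)) + to_signed(c, 32)
    if clamp:
        # Assumption: saturate to the signed 32-bit range.
        return max(-(2 ** 31), min(total, 2 ** 31 - 1))
    return to_signed(total, 32)
```

The 2-lane and 8-lane variants follow the same pattern with 16-bit and 4-bit lanes respectively.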
LLVM IR Metadata¶
The AMDGPU backend implements the following target custom LLVM IR metadata.
‘amdgpu.last.use’ Metadata¶
Sets TH_LOAD_LU temporal hint on load instructions that support it. Takes priority over nontemporal hint (TH_LOAD_NT). This takes no arguments.
%val = load i32, ptr %in, align 4, !amdgpu.last.use !{}
‘amdgpu.no.remote.memory’ Metadata¶
Asserts a memory operation does not access bytes in host memory, or remote connected peer device memory (the address must be device local). This is intended for use with atomicrmw and other atomic instructions. This is required to emit a native hardware instruction for some system scope atomic operations on some subtargets. For most integer atomic operations, this is a sufficient restriction to emit a native atomic instruction.
An atomicrmw without metadata will be treated conservatively as required to preserve the operation behavior in all cases. This will typically be used in conjunction with !amdgpu.no.fine.grained.memory.
; Indicates the atomic does not access fine-grained memory, or
; remote device memory.
%old0 = atomicrmw sub ptr %ptr0, i32 1 acquire, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory !0
; Indicates the atomic does not access peer device memory.
%old2 = atomicrmw sub ptr %ptr2, i32 1 acquire, !amdgpu.no.remote.memory !0
!0 = !{}
‘amdgpu.no.fine.grained.memory’ Metadata¶
Asserts a memory access does not access bytes allocated in fine-grained allocated memory. This is intended for use with atomicrmw and other atomic instructions. This is required to emit a native hardware instruction for some system scope atomic operations on some subtargets. An atomicrmw without metadata will be treated conservatively as required to preserve the operation behavior in all cases. This will typically be used in conjunction with !amdgpu.no.remote.memory.
; Indicates the access does not access fine-grained memory, or
; remote device memory.
%old0 = atomicrmw sub ptr %ptr0, i32 1 acquire, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory !0
; Indicates the access does not access fine-grained memory
%old2 = atomicrmw sub ptr %ptr2, i32 1 acquire, !amdgpu.no.fine.grained.memory !0
!0 = !{}
‘amdgpu.ignore.denormal.mode’ Metadata¶
For use with atomicrmw floating-point operations. Indicates the handling of denormal inputs and results is insignificant and may be inconsistent with the expected floating-point mode. This is necessary to emit a native atomic instruction on some targets for some address spaces where float denormals are unconditionally flushed. This is typically used in conjunction with !amdgpu.no.remote.memory and !amdgpu.no.fine.grained.memory.
%res0 = atomicrmw fadd ptr addrspace(1) %ptr, float %value seq_cst, align 4, !amdgpu.ignore.denormal.mode !0
%res1 = atomicrmw fadd ptr addrspace(1) %ptr, float %value seq_cst, align 4, !amdgpu.ignore.denormal.mode !0, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory !0
!0 = !{}
LLVM IR Attributes¶
The AMDGPU backend supports the following LLVM IR attributes.
Table 28 AMDGPU LLVM IR Attributes¶ LLVM Attribute
Description
“amdgpu-flat-work-group-size”=”min,max”
Specify the minimum and maximum flat work group sizes that will be specified when the kernel is dispatched. Generated by the amdgpu_flat_work_group_size CLANG attribute [CLANG-ATTR]. The IR implied default value is 1,1024. Clang may emit this attribute with more restrictive bounds depending on language defaults. If the actual block or workgroup size exceeds the limit at any point during the execution, the behavior is undefined. For example, even if there is only one active thread but the thread local id exceeds the limit, the behavior is undefined.
“amdgpu-implicitarg-num-bytes”=”n”
Number of kernel argument bytes to add to the kernel argument block size for the implicit arguments. This varies by OS and language (for OpenCL see OpenCL kernel implicit arguments appended for AMDHSA OS).
“amdgpu-num-sgpr”=”n”
Specifies the number of SGPRs to use. Generated by the amdgpu_num_sgpr CLANG attribute [CLANG-ATTR].
“amdgpu-num-vgpr”=”n”
Specifies the number of VGPRs to use. Generated by the amdgpu_num_vgpr CLANG attribute [CLANG-ATTR].
“amdgpu-waves-per-eu”=”m,n”
Specify the minimum and maximum number of waves per execution unit. Generated by the amdgpu_waves_per_eu CLANG attribute [CLANG-ATTR]. This is an optimization hint, and the backend may not be able to satisfy the request. If the specified range is incompatible with the function’s “amdgpu-flat-work-group-size” value, the occupancy bounds implied by the workgroup size take precedence.
“amdgpu-ieee” true/false.
GFX6-GFX11 only. Specify whether the function expects the IEEE field of the mode register to be set on entry. Overrides the default for the calling convention.
“amdgpu-dx10-clamp” true/false.
GFX6-GFX11 only. Specify whether the function expects the DX10_CLAMP field of the mode register to be set on entry. Overrides the default for the calling convention.
“amdgpu-no-workitem-id-x”
Indicates the function does not depend on the value of the llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this attribute, or reached through a call site marked with this attribute, and that intrinsic is called, the behavior of the program is undefined. (Whole-program undefined behavior is used here because, for example, the absence of a required workitem ID in the preloaded register set can mean that all other preloaded registers are earlier than the compilation assumed they would be.) The backend can generally infer this during code generation, so typically there is no benefit to frontends marking functions with this.
“amdgpu-no-workitem-id-y”
The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.workitem.id.y intrinsic.
“amdgpu-no-workitem-id-z”
The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.workitem.id.z intrinsic.
“amdgpu-no-workgroup-id-x”
The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.workgroup.id.x intrinsic.
“amdgpu-no-workgroup-id-y”
The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.workgroup.id.y intrinsic.
“amdgpu-no-workgroup-id-z”
The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.workgroup.id.z intrinsic.
“amdgpu-no-dispatch-ptr”
The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.dispatch.ptr intrinsic.
“amdgpu-no-implicitarg-ptr”
The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.implicitarg.ptr intrinsic.
“amdgpu-no-dispatch-id”
The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.dispatch.id intrinsic.
“amdgpu-no-queue-ptr”
Similar to amdgpu-no-workitem-id-x, except for the llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint attributes, the queue pointer may be required in situations where the intrinsic call does not directly appear in the program. Some subtargets require the queue pointer to handle some addrspacecasts, as well as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and llvm.debug intrinsics.
“amdgpu-no-hostcall-ptr”
Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit kernel argument that holds the pointer to the hostcall buffer. If this attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
“amdgpu-no-heap-ptr”
Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit kernel argument that holds the pointer to an initialized memory buffer that conforms to the requirements of the malloc/free device library V1 version implementation. If this attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
“amdgpu-no-multigrid-sync-arg”
Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit kernel argument that holds the multigrid synchronization pointer. If this attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
“amdgpu-no-default-queue”
Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit kernel argument that holds the default queue pointer. If this attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
“amdgpu-no-completion-action”
Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit kernel argument that holds the completion action pointer. If this attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
“amdgpu-lds-size”=”min[,max]”
Min is the minimum number of bytes that will be allocated in the Local Data Store at address zero. Variables are allocated within this frame using absolute symbol metadata, primarily by the AMDGPULowerModuleLDS pass. Optional max is the maximum number of bytes that will be allocated. Note that min==max indicates that no further variables can be added to the frame. This is an internal detail of how LDS variables are lowered; language front ends should not set this attribute.
“amdgpu-gds-size”
Bytes expected to be allocated at the start of GDS memory at entry.
“amdgpu-git-ptr-high”
The hard-wired high half of the address of the global information table for AMDPAL OS type. 0xffffffff represents no hard-wired high half, since current hardware only allows a 16 bit value.
“amdgpu-32bit-address-high-bits”
Assumed high 32-bits for 32-bit address spaces which are really truncated 64-bit addresses (i.e., addrspace(6))
“amdgpu-color-export”
Indicates shader exports color information if set to 1. Defaults to 1 for amdgpu_ps, and 0 for other calling conventions. Determines the necessity and type of null exports when a shader terminates early by killing lanes.
“amdgpu-depth-export”
Indicates shader exports depth information if set to 1. Determines the necessity and type of null exports when a shader terminates early by killing lanes. A depth-only shader will export to depth channel when no null export target is available (GFX11+).
“InitialPSInputAddr”
Set the initial value of the spi_ps_input_addr register for amdgpu_ps shaders. Any bits enabled by this value will be enabled in the final register value.
“amdgpu-wave-priority-threshold”
VALU instruction count threshold for adjusting wave priority. If exceeded, temporarily raise the wave priority at the start of the shader function until its last VMEM instructions to allow younger waves to issue their VMEM instructions as well.
“amdgpu-memory-bound”
Set internally by backend
“amdgpu-wave-limiter”
Set internally by backend
“amdgpu-unroll-threshold”
Set base cost threshold preference for loop unrolling within this function, default is 300. Actual threshold may be varied by per-loop metadata or reduced by heuristics.
“amdgpu-max-num-workgroups”=”x,y,z”
Specify the maximum number of work groups for the kernel dispatch in the X, Y, and Z dimensions. Generated by the amdgpu_max_num_work_groups CLANG attribute [CLANG-ATTR]. Clang only emits this attribute when all three numbers are >= 1.
“amdgpu-no-agpr”
Indicates the function will not require allocating AGPRs. This is only relevant on subtargets with AGPRs. The behavior is undefined if a function which requires AGPRs is reached through any function marked with this attribute.
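Several of the attributes above carry a comma-separated range as their string value. The following is a minimal sketch of how a tool might read two of them, not the backend's own parser. The function names and the `min <= max` check are assumptions made for illustration; the 1,1024 default is the IR implied default stated in the table.

```python
def parse_flat_work_group_size(value=None):
    """Parse an "amdgpu-flat-work-group-size" string of the form "min,max"."""
    if value is None:
        return (1, 1024)  # IR implied default when the attribute is absent
    min_text, max_text = value.split(",")
    minimum, maximum = int(min_text), int(max_text)
    if minimum > maximum:
        # Assumption of this sketch: an inverted range is rejected.
        raise ValueError("minimum exceeds maximum: " + value)
    return (minimum, maximum)


def parse_lds_size(value):
    """Parse an "amdgpu-lds-size" string of the form "min[,max]"."""
    parts = value.split(",")
    minimum = int(parts[0])
    maximum = int(parts[1]) if len(parts) > 1 else None
    return (minimum, maximum)
```

A result such as `parse_lds_size("512,512")` has min equal to max, which the table describes as a frame that can accept no further variables.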
Calling Conventions¶
The AMDGPU backend supports the following calling conventions:
Table 29 AMDGPU Calling Conventions¶ Calling Convention
Description
ccc
The C calling convention. Used by default. See Non-Kernel Functions for more details.
fastcc
The fast calling convention. Mostly the same as the ccc.
coldcc
The cold calling convention. Mostly the same as the ccc.
amdgpu_cs
Used for Mesa/AMDPAL compute shaders. ..TODO:: Describe.
amdgpu_cs_chain
Similar to amdgpu_cs, with differences described below.
Functions with this calling convention cannot be called directly. They must instead be launched via the llvm.amdgcn.cs.chain intrinsic.
Arguments are passed in SGPRs, starting at s0, if they have the inreg attribute, and in VGPRs otherwise, starting at v8. Using more SGPRs or VGPRs than available in the subtarget is not allowed. On subtargets that use a scratch buffer descriptor (as opposed to scratch_{load,store}_* instructions), the scratch buffer descriptor is passed in s[48:51]. This limits the SGPR / inreg arguments to the equivalent of 48 dwords; using more than that is not allowed.
The return type must be void. Varargs, sret, byval, byref, inalloca, preallocated are not supported.
Values in scalar registers as well as v0-v7 are not preserved. Values in VGPRs starting at v8 are not preserved for the active lanes, but must be saved by the callee for inactive lanes when using WWM.
Wave scratch is “empty” at function boundaries. There is no stack pointer input or output value, but functions are free to use scratch starting from an initial stack pointer. Calls to amdgpu_gfx functions are allowed and behave like they do in amdgpu_cs functions.
All counters (lgkmcnt, vmcnt, storecnt, etc.) are presumed in an unknown state at function entry.
A function may have multiple exits (e.g. one chain exit and one plain ret void for when the wave ends), but all llvm.amdgcn.cs.chain exits must be in uniform control flow.
amdgpu_cs_chain_preserve
Same as amdgpu_cs_chain, but active lanes for VGPRs starting at v8 are preserved. Calls to amdgpu_gfx functions are not allowed, and any calls to llvm.amdgcn.cs.chain must not pass more VGPR arguments than the caller’s VGPR function parameters.
amdgpu_es
Used for AMDPAL shader stage before geometry shader if geometry is in use. So either the domain (= tessellation evaluation) shader if tessellation is in use, or otherwise the vertex shader. ..TODO:: Describe.
amdgpu_gfx
Used for AMD graphics targets. Functions with this calling convention cannot be used as entry points. ..TODO:: Describe.
amdgpu_gs
Used for Mesa/AMDPAL geometry shaders. ..TODO:: Describe.
amdgpu_hs
Used for Mesa/AMDPAL hull shaders (= tessellation control shaders). ..TODO:: Describe.
amdgpu_kernel
See Kernel Functions
amdgpu_ls
Used for AMDPAL vertex shader if tessellation is in use. ..TODO:: Describe.
amdgpu_ps
Used for Mesa/AMDPAL pixel shaders. ..TODO:: Describe.
amdgpu_vs
Used for Mesa/AMDPAL last shader stage before rasterization (vertex shader if tessellation and geometry are not in use, or otherwise copy shader if one is needed). ..TODO:: Describe.
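The argument passing rule described for amdgpu_cs_chain can be sketched as a small register assignment routine. This is an illustrative model only: it assumes arguments are packed contiguously in whole dwords with no alignment padding, and it does not model the subtarget's total register counts. The function and parameter names are hypothetical.

```python
def reg_range(prefix, first, dwords):
    """Format a register or register range such as s3 or v[8:9]."""
    last = first + dwords - 1
    return f"{prefix}{first}" if dwords == 1 else f"{prefix}[{first}:{last}]"


def assign_chain_args(args, uses_scratch_descriptor=True):
    """Assign registers to (name, dwords, inreg) arguments.

    inreg arguments go to SGPRs starting at s0, all others to VGPRs
    starting at v8. When a scratch buffer descriptor occupies s[48:51],
    inreg arguments are limited to 48 dwords.
    """
    next_sgpr, next_vgpr = 0, 8
    assignment = {}
    for name, dwords, inreg in args:
        if inreg:
            assignment[name] = reg_range("s", next_sgpr, dwords)
            next_sgpr += dwords
            if uses_scratch_descriptor and next_sgpr > 48:
                raise ValueError("inreg arguments exceed 48 dwords")
        else:
            assignment[name] = reg_range("v", next_vgpr, dwords)
            next_vgpr += dwords
    return assignment
```

For example, a 4-dword inreg descriptor followed by a 1-dword VGPR argument would be placed in s[0:3] and v8 under these assumptions.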
AMDGPU MCExpr¶
As part of the AMDGPU MC layer, AMDGPU provides the following target specific
MCExprs.
Table 30 AMDGPU MCExpr types:¶ MCExpr
Operands
Return value
max(arg, ...)
1 or more
Variadic signed operation that returns the maximum value of all its arguments.
or(arg, ...)
1 or more
Variadic signed operation that returns the bitwise-or result of all its arguments.
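The semantics of the two MCExprs above can be illustrated with plain Python integers. This is only a sketch of the arithmetic, under the assumption that operands are already resolved to signed integers; it does not model the MC layer's expression width or symbol resolution, and the function names are hypothetical.

```python
from functools import reduce
from operator import or_


def mcexpr_max(*args):
    """Maximum of one or more signed operands."""
    if not args:
        raise ValueError("max requires at least one operand")
    return max(args)


def mcexpr_or(*args):
    """Bitwise-or of one or more signed operands."""
    if not args:
        raise ValueError("or requires at least one operand")
    return reduce(or_, args)
```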
ELF Code Object¶
The AMDGPU backend generates a standard ELF [ELF] relocatable code object that
can be linked by lld to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.
Header¶
The AMDGPU backend uses the following ELF header:
Table 31 AMDGPU ELF Header¶ Field
Value
e_ident[EI_CLASS]
ELFCLASS64
e_ident[EI_DATA]
ELFDATA2LSB
e_ident[EI_OSABI]
ELFOSABI_NONE
ELFOSABI_AMDGPU_HSA
ELFOSABI_AMDGPU_PAL
ELFOSABI_AMDGPU_MESA3D
e_ident[EI_ABIVERSION]
ELFABIVERSION_AMDGPU_HSA_V2
ELFABIVERSION_AMDGPU_HSA_V3
ELFABIVERSION_AMDGPU_HSA_V4
ELFABIVERSION_AMDGPU_HSA_V5
ELFABIVERSION_AMDGPU_HSA_V6
ELFABIVERSION_AMDGPU_PAL
ELFABIVERSION_AMDGPU_MESA3D
e_type
ET_REL
ET_DYN
e_machine
EM_AMDGPU
e_entry
0
e_flags
See AMDGPU ELF Header e_flags for Code Object V2, AMDGPU ELF Header e_flags for Code Object V3, AMDGPU ELF Header e_flags for Code Object V4 and V5, and AMDGPU ELF Header e_flags for Code Object V6 and After
Table 32 AMDGPU ELF Header Enumeration Values¶ Name
Value
EM_AMDGPU
224
ELFOSABI_NONE
0
ELFOSABI_AMDGPU_HSA
64
ELFOSABI_AMDGPU_PAL
65
ELFOSABI_AMDGPU_MESA3D
66
ELFABIVERSION_AMDGPU_HSA_V2
0
ELFABIVERSION_AMDGPU_HSA_V3
1
ELFABIVERSION_AMDGPU_HSA_V4
2
ELFABIVERSION_AMDGPU_HSA_V5
3
ELFABIVERSION_AMDGPU_HSA_V6
4
ELFABIVERSION_AMDGPU_PAL
0
ELFABIVERSION_AMDGPU_MESA3D
0
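The enumeration values above are enough to recognise an AMDGPU code object from the first 20 bytes of a file. The following is a minimal sketch that relies on the standard ELF header layout (identification bytes at offsets 4 to 8, e_type at offset 16, e_machine at offset 18); the function name and the returned dictionary keys are hypothetical.

```python
import struct

EM_AMDGPU = 224
OSABI_NAMES = {
    0: "ELFOSABI_NONE",
    64: "ELFOSABI_AMDGPU_HSA",
    65: "ELFOSABI_AMDGPU_PAL",
    66: "ELFOSABI_AMDGPU_MESA3D",
}
HSA_CODE_OBJECT_VERSIONS = {0: "V2", 1: "V3", 2: "V4", 3: "V5", 4: "V6"}


def describe_amdgpu_elf_header(header):
    """Summarise the identification fields of an AMDGPU ELF header."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    ei_class, ei_data, _ei_version, ei_osabi, ei_abiversion = header[4:9]
    if ei_data != 1:  # ELFDATA2LSB
        raise ValueError("AMDGPU code objects are little-endian")
    e_type, e_machine = struct.unpack_from("<HH", header, 16)
    if e_machine != EM_AMDGPU:
        raise ValueError("e_machine is not EM_AMDGPU")
    info = {
        "class": {1: "ELFCLASS32", 2: "ELFCLASS64"}.get(ei_class, "unknown"),
        "osabi": OSABI_NAMES.get(ei_osabi, "unknown"),
        "type": {1: "ET_REL", 3: "ET_DYN"}.get(e_type, "other"),
    }
    if ei_osabi == 64:
        # The ABI version only maps to a code object version for amdhsa.
        info["hsa_code_object"] = HSA_CODE_OBJECT_VERSIONS.get(ei_abiversion, "unknown")
    return info
```

Reading the first 20 bytes of a file opened in binary mode and passing them to this function is sufficient; the remaining header fields are not needed for identification.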
e_ident[EI_CLASS]
The ELF class is:
ELFCLASS32 for r600 architecture.
ELFCLASS64 for amdgcn architecture which only supports 64-bit process address space applications.
e_ident[EI_DATA]
All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
e_ident[EI_OSABI]
One of the following AMDGPU target architecture specific OS ABIs (see AMDGPU Operating Systems):
ELFOSABI_NONE for unknown OS.
ELFOSABI_AMDGPU_HSA for amdhsa OS.
ELFOSABI_AMDGPU_PAL for amdpal OS.
ELFOSABI_AMDGPU_MESA3D for mesa3d OS.
e_ident[EI_ABIVERSION]
The ABI version of the AMDGPU target architecture specific OS ABI to which the code object conforms:
ELFABIVERSION_AMDGPU_HSA_V2 is used to specify the version of AMD HSA runtime ABI for code object V2. Can no longer be emitted by this version of LLVM.
ELFABIVERSION_AMDGPU_HSA_V3 is used to specify the version of AMD HSA runtime ABI for code object V3. Can no longer be emitted by this version of LLVM.
ELFABIVERSION_AMDGPU_HSA_V4 is used to specify the version of AMD HSA runtime ABI for code object V4. Specify using the Clang option -mcode-object-version=4.
ELFABIVERSION_AMDGPU_HSA_V5 is used to specify the version of AMD HSA runtime ABI for code object V5. Specify using the Clang option -mcode-object-version=5. This is the default code object version if not specified.
ELFABIVERSION_AMDGPU_HSA_V6 is used to specify the version of AMD HSA runtime ABI for code object V6. Specify using the Clang option -mcode-object-version=6.
ELFABIVERSION_AMDGPU_PAL is used to specify the version of AMD PAL runtime ABI.
ELFABIVERSION_AMDGPU_MESA3D is used to specify the version of AMD MESA 3D runtime ABI.
e_type
Can be one of the following values:
ET_REL
The type produced by the AMDGPU backend compiler as it is a relocatable code object.
ET_DYN
The type produced by the linker as it is a shared code object.
The AMD HSA runtime loader requires an ET_DYN code object.
e_machine
The value EM_AMDGPU is used for the machine for all processors supported by the r600 and amdgcn architectures (see AMDGPU Processors). The specific processor is specified in the NT_AMD_HSA_ISA_VERSION note record for code object V2 (see Code Object V2 Note Records) and in the EF_AMDGPU_MACH bit field of the e_flags for code object V3 and above (see AMDGPU ELF Header e_flags for Code Object V3, AMDGPU ELF Header e_flags for Code Object V4 and V5 and AMDGPU ELF Header e_flags for Code Object V6 and After).
e_entry
The entry point is 0 as the entry points for individual kernels must be selected in order to invoke them through AQL packets.
e_flags
The AMDGPU backend uses the following ELF header flags:
Table 33 AMDGPU ELF Header e_flags for Code Object V2¶ Name
Value
Description
EF_AMDGPU_FEATURE_XNACK_V2
0x01
Indicates if the xnack target feature is enabled for all code contained in the code object. If the processor does not support the xnack target feature then must be 0. See Target Features.
EF_AMDGPU_FEATURE_TRAP_HANDLER_V2
0x02
Indicates if the trap handler is enabled for all code contained in the code object. If the processor does not support a trap handler then must be 0. See Target Features.
Table 34 AMDGPU ELF Header e_flags for Code Object V3¶ Name
Value
Description
EF_AMDGPU_MACH
0x0ff
AMDGPU processor selection mask for EF_AMDGPU_MACH_xxx values defined in AMDGPU EF_AMDGPU_MACH Values.
EF_AMDGPU_FEATURE_XNACK_V3
0x100
Indicates if the xnack target feature is enabled for all code contained in the code object. If the processor does not support the xnack target feature then must be 0. See Target Features.
EF_AMDGPU_FEATURE_SRAMECC_V3
0x200
Indicates if the sramecc target feature is enabled for all code contained in the code object. If the processor does not support the sramecc target feature then must be 0. See Target Features.
Table 35 AMDGPU ELF Header e_flags for Code Object V4 and V5¶ Name
Value
Description
EF_AMDGPU_MACH
0x0ff
AMDGPU processor selection mask for EF_AMDGPU_MACH_xxx values defined in AMDGPU EF_AMDGPU_MACH Values.
EF_AMDGPU_FEATURE_XNACK_V4
0x300
XNACK selection mask for EF_AMDGPU_FEATURE_XNACK_*_V4 values.
EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4
0x000
XNACK unsupported.
EF_AMDGPU_FEATURE_XNACK_ANY_V4
0x100
XNACK can have any value.
EF_AMDGPU_FEATURE_XNACK_OFF_V4
0x200
XNACK disabled.
EF_AMDGPU_FEATURE_XNACK_ON_V4
0x300
XNACK enabled.
EF_AMDGPU_FEATURE_SRAMECC_V4
0xc00
SRAMECC selection mask for EF_AMDGPU_FEATURE_SRAMECC_*_V4 values.
EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4
0x000
SRAMECC unsupported.
EF_AMDGPU_FEATURE_SRAMECC_ANY_V4
0x400
SRAMECC can have any value.
EF_AMDGPU_FEATURE_SRAMECC_OFF_V4
0x800
SRAMECC disabled.
EF_AMDGPU_FEATURE_SRAMECC_ON_V4
0xc00
SRAMECC enabled.
Table 36 AMDGPU ELF Header e_flags for Code Object V6 and After¶ Name
Value
Description
EF_AMDGPU_MACH
0x0ff
AMDGPU processor selection mask for EF_AMDGPU_MACH_xxx values defined in AMDGPU EF_AMDGPU_MACH Values.
EF_AMDGPU_FEATURE_XNACK_V4
0x300
XNACK selection mask for EF_AMDGPU_FEATURE_XNACK_*_V4 values.
EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4
0x000
XNACK unsupported.
EF_AMDGPU_FEATURE_XNACK_ANY_V4
0x100
XNACK can have any value.
EF_AMDGPU_FEATURE_XNACK_OFF_V4
0x200
XNACK disabled.
EF_AMDGPU_FEATURE_XNACK_ON_V4
0x300
XNACK enabled.
EF_AMDGPU_FEATURE_SRAMECC_V4
0xc00
SRAMECC selection mask for EF_AMDGPU_FEATURE_SRAMECC_*_V4 values.
EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4
0x000
SRAMECC unsupported.
EF_AMDGPU_FEATURE_SRAMECC_ANY_V4
0x400
SRAMECC can have any value.
EF_AMDGPU_FEATURE_SRAMECC_OFF_V4
0x800
SRAMECC disabled.
EF_AMDGPU_FEATURE_SRAMECC_ON_V4
0xc00
SRAMECC enabled.
EF_AMDGPU_GENERIC_VERSION_V
0xff000000
Generic code object version selection mask. This is a value between 1 and 255, stored in the most significant byte of EFLAGS. See Generic Processor Versioning.
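The masks in the V4 and later tables can be applied directly to a raw e_flags word. The following is a minimal sketch using only the mask and field values listed above; the function name and dictionary keys are hypothetical, and a generic version of 0 is assumed to mean that the field is not in use (it is only defined for code object V6 and after).

```python
XNACK_STATES = {0x000: "unsupported", 0x100: "any", 0x200: "off", 0x300: "on"}
SRAMECC_STATES = {0x000: "unsupported", 0x400: "any", 0x800: "off", 0xC00: "on"}


def decode_e_flags_v4_plus(e_flags):
    """Split an e_flags word for code object V4 and later into its fields."""
    return {
        "mach": e_flags & 0x0FF,                      # EF_AMDGPU_MACH
        "xnack": XNACK_STATES[e_flags & 0x300],       # EF_AMDGPU_FEATURE_XNACK_V4
        "sramecc": SRAMECC_STATES[e_flags & 0xC00],   # EF_AMDGPU_FEATURE_SRAMECC_V4
        "generic_version": (e_flags >> 24) & 0xFF,    # most significant byte
    }
```

The `mach` value can then be looked up in the EF_AMDGPU_MACH table to obtain the processor name.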
Table 37 AMDGPU EF_AMDGPU_MACH Values¶ Name
Value
Description (see AMDGPU Processors)
EF_AMDGPU_MACH_NONE
0x000
not specified
EF_AMDGPU_MACH_R600_R600
0x001
r600
EF_AMDGPU_MACH_R600_R630
0x002
r630
EF_AMDGPU_MACH_R600_RS880
0x003
rs880
EF_AMDGPU_MACH_R600_RV670
0x004
rv670
EF_AMDGPU_MACH_R600_RV710
0x005
rv710
EF_AMDGPU_MACH_R600_RV730
0x006
rv730
EF_AMDGPU_MACH_R600_RV770
0x007
rv770
EF_AMDGPU_MACH_R600_CEDAR
0x008
cedar
EF_AMDGPU_MACH_R600_CYPRESS
0x009
cypress
EF_AMDGPU_MACH_R600_JUNIPER
0x00a
juniper
EF_AMDGPU_MACH_R600_REDWOOD
0x00b
redwood
EF_AMDGPU_MACH_R600_SUMO
0x00c
sumo
EF_AMDGPU_MACH_R600_BARTS
0x00d
barts
EF_AMDGPU_MACH_R600_CAICOS
0x00e
caicos
EF_AMDGPU_MACH_R600_CAYMAN
0x00f
cayman
EF_AMDGPU_MACH_R600_TURKS
0x010
turks
reserved
0x011 - 0x01f
Reserved for r600 architecture processors.
EF_AMDGPU_MACH_AMDGCN_GFX600
0x020
gfx600
EF_AMDGPU_MACH_AMDGCN_GFX601
0x021
gfx601
EF_AMDGPU_MACH_AMDGCN_GFX700
0x022
gfx700
EF_AMDGPU_MACH_AMDGCN_GFX701
0x023
gfx701
EF_AMDGPU_MACH_AMDGCN_GFX702
0x024
gfx702
EF_AMDGPU_MACH_AMDGCN_GFX703
0x025
gfx703
EF_AMDGPU_MACH_AMDGCN_GFX704
0x026
gfx704
reserved
0x027
Reserved.
EF_AMDGPU_MACH_AMDGCN_GFX801
0x028
gfx801
EF_AMDGPU_MACH_AMDGCN_GFX802
0x029
gfx802
EF_AMDGPU_MACH_AMDGCN_GFX803
0x02a
gfx803
EF_AMDGPU_MACH_AMDGCN_GFX810
0x02b
gfx810
EF_AMDGPU_MACH_AMDGCN_GFX900
0x02c
gfx900
EF_AMDGPU_MACH_AMDGCN_GFX902
0x02d
gfx902
EF_AMDGPU_MACH_AMDGCN_GFX904
0x02e
gfx904
EF_AMDGPU_MACH_AMDGCN_GFX906
0x02f
gfx906
EF_AMDGPU_MACH_AMDGCN_GFX908
0x030
gfx908
EF_AMDGPU_MACH_AMDGCN_GFX909
0x031
gfx909
EF_AMDGPU_MACH_AMDGCN_GFX90C
0x032
gfx90c
EF_AMDGPU_MACH_AMDGCN_GFX1010
0x033
gfx1010
EF_AMDGPU_MACH_AMDGCN_GFX1011
0x034
gfx1011
EF_AMDGPU_MACH_AMDGCN_GFX1012
0x035
gfx1012
EF_AMDGPU_MACH_AMDGCN_GFX1030
0x036
gfx1030
EF_AMDGPU_MACH_AMDGCN_GFX1031
0x037
gfx1031
EF_AMDGPU_MACH_AMDGCN_GFX1032
0x038
gfx1032
EF_AMDGPU_MACH_AMDGCN_GFX1033
0x039
gfx1033
EF_AMDGPU_MACH_AMDGCN_GFX602
0x03a
gfx602
EF_AMDGPU_MACH_AMDGCN_GFX705
0x03b
gfx705
EF_AMDGPU_MACH_AMDGCN_GFX805
0x03c
gfx805
EF_AMDGPU_MACH_AMDGCN_GFX1035
0x03d
gfx1035
EF_AMDGPU_MACH_AMDGCN_GFX1034
0x03e
gfx1034
EF_AMDGPU_MACH_AMDGCN_GFX90A
0x03f
gfx90a
EF_AMDGPU_MACH_AMDGCN_GFX940
0x040
gfx940
EF_AMDGPU_MACH_AMDGCN_GFX1100
0x041
gfx1100
EF_AMDGPU_MACH_AMDGCN_GFX1013
0x042
gfx1013
EF_AMDGPU_MACH_AMDGCN_GFX1150
0x043
gfx1150
EF_AMDGPU_MACH_AMDGCN_GFX1103
0x044
gfx1103
EF_AMDGPU_MACH_AMDGCN_GFX1036
0x045
gfx1036
EF_AMDGPU_MACH_AMDGCN_GFX1101
0x046
gfx1101
EF_AMDGPU_MACH_AMDGCN_GFX1102
0x047
gfx1102
EF_AMDGPU_MACH_AMDGCN_GFX1200
0x048
gfx1200
reserved
0x049
Reserved.
EF_AMDGPU_MACH_AMDGCN_GFX1151
0x04a
gfx1151
EF_AMDGPU_MACH_AMDGCN_GFX941
0x04b
gfx941
EF_AMDGPU_MACH_AMDGCN_GFX942
0x04c
gfx942
reserved
0x04d
Reserved.
EF_AMDGPU_MACH_AMDGCN_GFX1201
0x04e
gfx1201
reserved
0x04f
Reserved.
reserved
0x050
Reserved.
EF_AMDGPU_MACH_AMDGCN_GFX9_GENERIC
0x051
gfx9-generic
EF_AMDGPU_MACH_AMDGCN_GFX10_1_GENERIC
0x052
gfx10-1-generic
EF_AMDGPU_MACH_AMDGCN_GFX10_3_GENERIC
0x053
gfx10-3-generic
EF_AMDGPU_MACH_AMDGCN_GFX11_GENERIC
0x054
gfx11-generic
EF_AMDGPU_MACH_AMDGCN_GFX1152
0x055
gfx1152
reserved
0x056
Reserved.
reserved
0x057
Reserved.
reserved
0x058
Reserved.
EF_AMDGPU_MACH_AMDGCN_GFX12_GENERIC
0x059
gfx12-generic
Sections¶
An AMDGPU target ELF code object has the standard ELF sections which include:
Table 38 AMDGPU ELF Sections¶ Name
Type
Attributes
.bss
SHT_NOBITS
SHF_ALLOC+SHF_WRITE
.data
SHT_PROGBITS
SHF_ALLOC+SHF_WRITE
.debug_*
SHT_PROGBITS
none
.dynamic
SHT_DYNAMIC
SHF_ALLOC
.dynstr
SHT_PROGBITS
SHF_ALLOC
.dynsym
SHT_PROGBITS
SHF_ALLOC
.got
SHT_PROGBITS
SHF_ALLOC+SHF_WRITE
.hash
SHT_HASH
