User Guide for AMDGPU Backend

Introduction

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the R600 family up until the current GCN families. It lives in the llvm/lib/Target/AMDGPU directory.

LLVM

Target Triples

Use the Clang option -target <Architecture>-<Vendor>-<OS>-<Environment> to specify the target triple:

Table 18 AMDGPU Architectures

Architecture

Description

r600

AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.

amdgcn

AMD GPUs GCN GFX6 onwards for graphics and compute shaders.

Table 19 AMDGPU Vendors

Vendor

Description

amd

Can be used for all AMD GPU usage.

mesa

Can be used if the OS is mesa3d.

Table 20 AMDGPU Operating Systems

OS

Description

<empty>

Defaults to the unknown OS.

amdhsa

Compute kernels executed on HSA [HSA] compatible runtimes such as:

  • AMD’s ROCm™ runtime [AMD-ROCm] using the rocm-amdhsa loader on Linux. See AMD ROCm Platform Release Notes [AMD-ROCm-Release-Notes] for supported hardware and software.

  • AMD’s PAL runtime using the pal-amdhsa loader on Windows.

amdpal

Graphics shaders and compute kernels executed on AMD’s PAL runtime using the pal-amdpal loader on Windows and Linux Pro.

mesa3d

Graphics shaders and compute kernels executed on AMD’s Mesa 3D runtime using the mesa-mesa3d loader on Linux.

Table 21 AMDGPU Environments

Environment

Description

<empty>

Default.
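As a rough illustration of how the components in the tables above fit together, the following Python sketch joins them into a triple string. The helper name make_triple is hypothetical, the value sets are copied from the tables, and dropping trailing empty components is an assumption of this sketch rather than something the tables state.

```python
# Values copied from the Architectures, Vendors and Operating Systems tables.
ARCHITECTURES = {"r600", "amdgcn"}
VENDORS = {"amd", "mesa"}
OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}


def make_triple(arch, vendor, os="", environment=""):
    """Join components as <Architecture>-<Vendor>-<OS>-<Environment>."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"unknown architecture: {arch!r}")
    if vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {vendor!r}")
    if os not in OPERATING_SYSTEMS:
        raise ValueError(f"unknown OS: {os!r}")
    # The Vendors table says "mesa" can be used if the OS is mesa3d.
    if vendor == "mesa" and os != "mesa3d":
        raise ValueError("vendor 'mesa' is only listed for OS 'mesa3d'")
    parts = [arch, vendor, os, environment]
    # Assumption of this sketch: trailing empty components are omitted.
    while parts and parts[-1] == "":
        parts.pop()
    return "-".join(parts)
```

For example, make_triple("amdgcn", "amd", "amdhsa") returns amdgcn-amd-amdhsa, which could then be passed to Clang as -target amdgcn-amd-amdhsa.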

Processors

Use the Clang options -mcpu=<target-id> or --offload-arch=<target-id> to specify the AMDGPU processor together with optional target features. See Target ID and Target Features for AMD GPU target specific information.

Every processor supports every OS ABI (see AMDGPU Operating Systems) with the following exceptions:

  • amdhsa is not supported in the r600 architecture (see AMDGPU Architectures).

    Table 22 AMDGPU Processors

    Processor

    Alternative Processor

    Target Triple Architecture

    dGPU/APU

    Target Features Supported

    Target Properties

    OS Support (see amdgpu-os and corresponding runtime release notes for current information and level of support)

    Example Products

    Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]

    r600

    r600

    dGPU

    • Does not support generic address space

    r630

    r600

    dGPU

    • Does not support generic address space

    rs880

    r600

    dGPU

    • Does not support generic address space

    rv670

    r600

    dGPU

    • Does not support generic address space

    Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]

    rv710

    r600

    dGPU

    • Does not support generic address space

    rv730

    r600

    dGPU

    • Does not support generic address space

    rv770

    r600

    dGPU

    • Does not support generic address space

    Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]

    cedar

    r600

    dGPU

    • Does not support generic address space

    cypress

    r600

    dGPU

    • Does not support generic address space

    juniper

    r600

    dGPU

    • Does not support generic address space

    redwood

    r600

    dGPU

    • Does not support generic address space

    sumo

    r600

    dGPU

    • Does not support generic address space

    Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]

    barts

    r600

    dGPU

    • Does not support generic address space

    caicos

    r600

    dGPU

    • Does not support generic address space

    cayman

    r600

    dGPU

    • Does not support generic address space

    turks

    r600

    dGPU

    • Does not support generic address space

    GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]

    gfx600

    • tahiti

    amdgcn

    dGPU

    • Does not support generic address space

    • pal-amdpal

    gfx601

    • pitcairn

    • verde

    amdgcn

    dGPU

    • Does not support generic address space

    • pal-amdpal

    gfx602

    • hainan

    • oland

    amdgcn

    dGPU

    • Does not support generic address space

    • pal-amdpal

    GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]

    gfx700

    • kaveri

    amdgcn

    APU

    • Offset flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • A6-7000

    • A6 Pro-7050B

    • A8-7100

    • A8 Pro-7150B

    • A10-7300

    • A10 Pro-7350B

    • FX-7500

    • A8-7200P

    • A10-7400P

    • FX-7600P

    gfx701

    • hawaii

    amdgcn

    dGPU

    • Offset flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • FirePro W8100

    • FirePro W9100

    • FirePro S9150

    • FirePro S9170

    gfx702

    amdgcn

    dGPU

    • Offset flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • Radeon R9 290

    • Radeon R9 290x

    • Radeon R390

    • Radeon R390x

    gfx703

    • kabini

    • mullins

    amdgcn

    APU

    • Offset flat scratch

    • pal-amdhsa

    • pal-amdpal

    • E1-2100

    • E1-2200

    • E1-2500

    • E2-3000

    • E2-3800

    • A4-5000

    • A4-5100

    • A6-5200

    • A4 Pro-3340B

    gfx704

    • bonaire

    amdgcn

    dGPU

    • Offset flat scratch

    • pal-amdhsa

    • pal-amdpal

    • Radeon HD 7790

    • Radeon HD 8770

    • R7 260

    • R7 260X

    gfx705

    amdgcn

    APU

    • Offset flat scratch

    • pal-amdhsa

    • pal-amdpal

    TBA

    GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]

    gfx801

    • carrizo

    amdgcn

    APU

    • xnack

    • Offset flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • A6-8500P

    • Pro A6-8500B

    • A8-8600P

    • Pro A8-8600B

    • FX-8800P

    • Pro A12-8800B

    • A10-8700P

    • Pro A10-8700B

    • A10-8780P

    • A10-9600P

    • A10-9630P

    • A12-9700P

    • A12-9730P

    • FX-9800P

    • FX-9830P

    • E2-9010

    • A6-9210

    • A9-9410

    gfx802

    • iceland

    • tonga

    amdgcn

    dGPU

    • Offset flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • Radeon R9 285

    • Radeon R9 380

    • Radeon R9 385

    gfx803

    • fiji

    amdgcn

    dGPU

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • Radeon R9 Nano

    • Radeon R9 Fury

    • Radeon R9 FuryX

    • Radeon Pro Duo

    • FirePro S9300x2

    • Radeon Instinct MI8

    • polaris10

    amdgcn

    dGPU

    • Offset flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • Radeon RX 470

    • Radeon RX 480

    • Radeon Instinct MI6

    • polaris11

    amdgcn

    dGPU

    • Offset flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • Radeon RX 460

    gfx805

    • tongapro

    amdgcn

    dGPU

    • Offset flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • FirePro S7150

    • FirePro S7100

    • FirePro W7100

    • Mobile FirePro M7170

    gfx810

    • stoney

    amdgcn

    APU

    • xnack

    • Offset flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    TBA

    GCN GFX9 (Vega) [AMD-GCN-GFX900-GFX904-VEGA] [AMD-GCN-GFX906-VEGA7NM] [AMD-GCN-GFX908-CDNA1] [AMD-GCN-GFX90A-CDNA2] [AMD-GCN-GFX940-GFX942-CDNA3]

    gfx900

    amdgcn

    dGPU

    • xnack

    • Absolute flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • Radeon Vega Frontier Edition

    • Radeon RX Vega 56

    • Radeon RX Vega 64

    • Radeon RX Vega 64 Liquid

    • Radeon Instinct MI25

    gfx902

    amdgcn

    APU

    • xnack

    • Absolute flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • Ryzen 3 2200G

    • Ryzen 5 2400G

    gfx904

    amdgcn

    dGPU

    • xnack

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    TBA

    gfx906

    amdgcn

    dGPU

    • sramecc

    • xnack

    • Absolute flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • Radeon Instinct MI50

    • Radeon Instinct MI60

    • Radeon VII

    • Radeon Pro VII

    gfx908

    amdgcn

    dGPU

    • sramecc

    • xnack

    • Absolute flat scratch

    • rocm-amdhsa

    • AMD Instinct MI100 Accelerator

    gfx909

    amdgcn

    APU

    • xnack

    • Absolute flat scratch

    • pal-amdpal

    TBA

    gfx90a

    amdgcn

    dGPU

    • sramecc

    • tgsplit

    • xnack

    • kernarg preload

    • Absolute flat scratch

    • Packed work-item IDs

    • rocm-amdhsa

    • AMD Instinct MI210 Accelerator

    • AMD Instinct MI250 Accelerator

    • AMD Instinct MI250X Accelerator

    gfx90c

    amdgcn

    APU

    • xnack

    • Absolute flat scratch

    • pal-amdpal

    • Ryzen 7 4700G

    • Ryzen 7 4700GE

    • Ryzen 5 4600G

    • Ryzen 5 4600GE

    • Ryzen 3 4300G

    • Ryzen 3 4300GE

    • Ryzen Pro 4000G

    • Ryzen 7 Pro 4700G

    • Ryzen 7 Pro 4750GE

    • Ryzen 5 Pro 4650G

    • Ryzen 5 Pro 4650GE

    • Ryzen 3 Pro 4350G

    • Ryzen 3 Pro 4350GE

    gfx940

    amdgcn

    dGPU

    • sramecc

    • tgsplit

    • xnack

    • kernarg preload

    • Architected flat scratch

    • Packed work-item IDs

    TBA

    gfx941

    amdgcn

    dGPU

    • sramecc

    • tgsplit

    • xnack

    • kernarg preload

    • Architected flat scratch

    • Packed work-item IDs

    TBA

    gfx942

    amdgcn

    dGPU

    • sramecc

    • tgsplit

    • xnack

    • kernarg preload

    • Architected flat scratch

    • Packed work-item IDs

    TBA

    GCN GFX10.1 (RDNA 1) [AMD-GCN-GFX10-RDNA1]

    gfx1010

    amdgcn

    dGPU

    • cumode

    • wavefrontsize64

    • xnack

    • Absolute flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • Radeon RX 5700

    • Radeon RX 5700 XT

    • Radeon Pro 5600 XT

    • Radeon Pro 5600M

    gfx1011

    amdgcn

    dGPU

    • cumode

    • wavefrontsize64

    • xnack

    • Absolute flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • Radeon Pro V520

    gfx1012

    amdgcn

    dGPU

    • cumode

    • wavefrontsize64

    • xnack

    • Absolute flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • Radeon RX 5500

    • Radeon RX 5500 XT

    gfx1013

    amdgcn

    APU

    • cumode

    • wavefrontsize64

    • xnack

    • Absolute flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    TBA

    GCN GFX10.3 (RDNA 2) [AMD-GCN-GFX10-RDNA2]

    gfx1030

    amdgcn

    dGPU

    • cumode

    • wavefrontsize64

    • Absolute flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • Radeon RX 6800

    • Radeon RX 6800 XT

    • Radeon RX 6900 XT

    gfx1031

    amdgcn

    dGPU

    • cumode

    • wavefrontsize64

    • Absolute flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    • Radeon RX 6700 XT

    gfx1032

    amdgcn

    dGPU

    • cumode

    • wavefrontsize64

    • Absolute flat scratch

    • rocm-amdhsa

    • pal-amdhsa

    • pal-amdpal

    TBA

    gfx1033

    amdgcn

    APU

    • cumode

    • wavefrontsize64

    • Absolute flat scratch

    • pal-amdpal

    TBA

    gfx1034

    amdgcn

    dGPU

    • cumode

    • wavefrontsize64

    • Absolute flat scratch

    • pal-amdpal

    TBA

    gfx1035

    amdgcn

    APU

    • cumode

    • wavefrontsize64

    • Absolute flat scratch

    • pal-amdpal

    TBA

    gfx1036

    amdgcn

    APU

    • cumode

    • wavefrontsize64

    • Absolute flat scratch

    • pal-amdpal

    TBA

    GCN GFX11 (RDNA 3) [AMD-GCN-GFX11-RDNA3]

    gfx1100

    amdgcn

    dGPU

    • cumode

    • wavefrontsize64

    • Architected flat scratch

    • Packed work-item IDs

    • pal-amdpal

    TBA

    gfx1101

    amdgcn

    dGPU

    • cumode

    • wavefrontsize64

    • Architected flat scratch

    • Packed work-item IDs

    TBA

    gfx1102

    amdgcn

    dGPU

    • cumode

    • wavefrontsize64

    • Architected flat scratch

    • Packed work-item IDs

    TBA

    gfx1103

    amdgcn

    APU

    • cumode

    • wavefrontsize64

    • Architected flat scratch

    • Packed work-item IDs

    TBA

    gfx1150

    amdgcn

    APU

    • cumode

    • wavefrontsize64

    • Architected flat scratch

    • Packed work-item IDs

    TBA

    gfx1151

    amdgcn

    APU

    • cumode

    • wavefrontsize64

    • Architected flat scratch

    • Packed work-item IDs

    TBA

    gfx1152

    amdgcn

    APU

    • cumode

    • wavefrontsize64

    • Architected flat scratch

    • Packed work-item IDs

    TBA

    gfx1200

    amdgcn

    dGPU

    • cumode

    • wavefrontsize64

    • Architected flat scratch

    • Packed work-item IDs

    TBA

    gfx1201

    amdgcn

    dGPU

    • cumode

    • wavefrontsize64

    • Architected flat scratch

    • Packed work-item IDs

    TBA

Generic processors allow execution of a single code object on any of the processors that it supports. Such code objects may not perform as well as those for the non-generic processors.

Generic processors are only available on code object V6 and above (see ELF Code Object).

Generic processor code objects are versioned. See Generic Processor Versioning for more information on how versioning works.

Table 23 AMDGPU Generic Processors

Processor

Target Triple Architecture

Supported Processors

Target Features Supported

Target Properties

Target Restrictions

gfx9-generic

amdgcn

  • gfx900

  • gfx902

  • gfx904

  • gfx906

  • gfx909

  • gfx90c

  • xnack

  • Absolute flat scratch

  • v_mad_mix instructions are not available on gfx900, gfx902, gfx909, gfx90c

  • v_fma_mix instructions are not available on gfx904

  • sramecc is not available on gfx906

  • The following instructions are not available on gfx906:

    • v_fmac_f32

    • v_xnor_b32

    • v_dot4_i32_i8

    • v_dot8_i32_i4

    • v_dot2_i32_i16

    • v_dot2_u32_u16

    • v_dot4_u32_u8

    • v_dot8_u32_u4

    • v_dot2_f32_f16

gfx10-1-generic

amdgcn

  • gfx1010

  • gfx1011

  • gfx1012

  • gfx1013

  • xnack

  • wavefrontsize64

  • cumode

  • Absolute flat scratch

  • The following instructions are not available on gfx1011 and gfx1012:

    • v_dot4_i32_i8

    • v_dot8_i32_i4

    • v_dot2_i32_i16

    • v_dot2_u32_u16

    • v_dot2c_f32_f16

    • v_dot4c_i32_i8

    • v_dot4_u32_u8

    • v_dot8_u32_u4

    • v_dot2_f32_f16

  • BVH Ray Tracing instructions are not available on gfx1013

gfx10-3-generic

amdgcn

  • gfx1030

  • gfx1031

  • gfx1032

  • gfx1033

  • gfx1034

  • gfx1035

  • gfx1036

  • wavefrontsize64

  • cumode

  • Absolute flat scratch

No restrictions.

gfx11-generic

amdgcn

  • gfx1100

  • gfx1101

  • gfx1102

  • gfx1103

  • gfx1150

  • gfx1151

  • gfx1152

  • wavefrontsize64

  • cumode

  • Architected flat scratch

  • Packed work-item IDs

Various codegen pessimizations are applied to work around some hazards specific to some targets within this family.

Not all VGPRs can be used on:

  • gfx1100

  • gfx1101

  • gfx1151

  • gfx1152

SALU floating point instructions and single-use VGPR hint instructions are not available on:

  • gfx1150

  • gfx1151

  • gfx1152

SGPRs are not supported for src1 in dpp instructions for:

  • gfx1150

  • gfx1151

  • gfx1152

gfx12-generic

amdgcn

  • gfx1200

  • gfx1201

  • wavefrontsize64

  • cumode

  • Architected flat scratch

  • Packed work-item IDs

No restrictions.
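The Supported Processors column of the table above can be read as a simple lookup. The Python sketch below transcribes that column and finds the generic processor covering a given processor; the names GENERIC_PROCESSORS and generic_processor_for are hypothetical, and the sketch ignores the listed target restrictions entirely.

```python
# Supported Processors column of the AMDGPU Generic Processors table.
GENERIC_PROCESSORS = {
    "gfx9-generic": ["gfx900", "gfx902", "gfx904", "gfx906", "gfx909", "gfx90c"],
    "gfx10-1-generic": ["gfx1010", "gfx1011", "gfx1012", "gfx1013"],
    "gfx10-3-generic": ["gfx1030", "gfx1031", "gfx1032", "gfx1033",
                        "gfx1034", "gfx1035", "gfx1036"],
    "gfx11-generic": ["gfx1100", "gfx1101", "gfx1102", "gfx1103",
                      "gfx1150", "gfx1151", "gfx1152"],
    "gfx12-generic": ["gfx1200", "gfx1201"],
}


def generic_processor_for(processor):
    """Return the generic processor that lists `processor`, or None."""
    for generic, supported in GENERIC_PROCESSORS.items():
        if processor in supported:
            return generic
    return None
```

Processors that do not appear in the table, such as gfx908 or gfx90a, yield None in this sketch.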

Generic Processor Versioning

Generic processor (see AMDGPU Generic Processors) code objects are versioned (see AMDGPU ELF Header e_flags for Code Object V6 and After) between 1 and 255. The version of non-generic code objects is always set to 0.

For a generic code object, adding a new supported processor may require the code generated for the generic target to be changed so it can continue to execute on the previously supported processors as well as on the new one. When this happens, the generic code object version number is incremented at the same time as the generic target is updated.

Each supported processor of a generic target is mapped to the version it was introduced in. A generic code object can execute on a supported processor if the version of the code object being loaded is greater than or equal to the version in which the processor was added to the generic target.
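The compatibility rule in the preceding paragraph can be sketched as a single comparison. In this Python sketch the function name and the idea of passing the processor's introduction version as a plain integer are illustrative assumptions; the actual per-processor version mapping is not given in this section.

```python
def can_execute(code_object_version, processor_added_in_version):
    """Apply the generic code object version rule.

    code_object_version: version of the generic code object (1..255).
    processor_added_in_version: generic target version in which the
        processor was added (hypothetical input for this sketch).
    """
    if not 1 <= code_object_version <= 255:
        # Version 0 is reserved for non-generic code objects.
        raise ValueError("generic code object versions are between 1 and 255")
    return code_object_version >= processor_added_in_version
```

Under this rule a version 2 code object can execute on a processor added in version 1, while a version 1 code object cannot execute on a processor added in version 2.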

Target Features

Target features control how code is generated to support certain processor specific features. Not all target features are supported by all processors. The runtime must ensure that the features supported by the device used to execute the code match the features enabled when generating the code. A mismatch of features may result in incorrect execution, or a reduction in performance.

The target features supported by each processor are listed in Processors.

Target features are controlled by exactly one of the following Clang options:

-mcpu=<target-id> or --offload-arch=<target-id>

The -mcpu and --offload-arch options can specify target features as optional components of the target ID. If omitted, the target feature has the any value. See Target ID.

-m[no-]<target-feature>

Target features not specified by the target ID are specified using a separate option. These target features can have an on or off value. on is specified by omitting the no- prefix, and off is specified by including the no- prefix. The default if not specified is off.

For example:

-mcpu=gfx908:xnack+

Enable the xnack feature.

-mcpu=gfx908:xnack-

Disable the xnack feature.

-mcumode

Enable the cumode feature.

-mno-cumode

Disable the cumode feature.
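To make the two option styles concrete, the Python sketch below assembles Clang options from a processor name and a set of feature settings. The function name feature_options is hypothetical, the split between target ID features and -m[no-] features follows the AMDGPU Target Features table, and the sketch does not check whether the processor actually supports a feature.

```python
# Features given as part of the target ID (-mcpu / --offload-arch).
TARGET_ID_FEATURES = {"sramecc", "xnack"}
# Features given with a separate -m[no-]<target-feature> option.
SEPARATE_OPTION_FEATURES = {"cumode", "tgsplit", "wavefrontsize64"}


def feature_options(processor, features):
    """Build Clang options; `features` maps feature name -> bool."""
    target_id = processor
    flags = []
    for name in sorted(features):
        enabled = features[name]
        if name in TARGET_ID_FEATURES:
            # "+" enables and "-" disables, as in gfx908:xnack+.
            target_id += f":{name}{'+' if enabled else '-'}"
        elif name in SEPARATE_OPTION_FEATURES:
            flags.append(f"-m{'' if enabled else 'no-'}{name}")
        else:
            raise ValueError(f"unknown target feature: {name!r}")
    return [f"-mcpu={target_id}"] + flags
```

For instance, feature_options("gfx908", {"xnack": True}) returns ["-mcpu=gfx908:xnack+"], and feature_options("gfx1030", {"cumode": False}) returns ["-mcpu=gfx1030", "-mno-cumode"].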

Table 24 AMDGPU Target Features

Target Feature Name

Clang Option to Control

Description

cumode

  • -m[no-]cumode

Control the wavefront execution mode used when generating code for kernels. When disabled, native WGP wavefront execution mode is used; when enabled, CU wavefront execution mode is used (see Memory Model).

sramecc

  • -mcpu

  • --offload-arch

If specified, generate code that can only be loaded and executed in a process that has a matching setting for SRAMECC.

If not specified for code object V2 to V3, generate code that can be loaded and executed in a process with SRAMECC enabled.

If not specified for code object V4 or above, generate code that can be loaded and executed in a process with either setting of SRAMECC.

tgsplit

-m[no-]tgsplit

Enable/disable generating code that assumes work-groups are launched in threadgroup split mode. When enabled, the waves of a work-group may be launched in different CUs.

wavefrontsize64

  • -m[no-]wavefrontsize64

Control the wavefront size used when generating code for kernels. When disabled, native wavefront size 32 is used; when enabled, wavefront size 64 is used.

xnack

  • -mcpu

  • --offload-arch

If specified, generate code that can only be loaded and executed in a process that has a matching setting for XNACK replay.

If not specified for code object V2 to V3, generate code that can be loaded and executed in a process with XNACK replay enabled.

If not specified for code object V4 or above, generate code that can be loaded and executed in a process with either setting of XNACK replay.

XNACK replay can be used for demand paging and page migration. If XNACK replay is enabled in the device and a page fault occurs, the code may execute incorrectly unless it was generated with XNACK replay enabled, or was generated for code object V4 or above without specifying XNACK replay. Code generated with XNACK replay enabled, or generated for code object V4 or above without specifying XNACK replay, executes correctly on a device that does not have XNACK replay enabled, but may be less performant than code generated for XNACK replay disabled.
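The loading rules given for sramecc and xnack share the same shape and can be summarized in a small decision function. The Python sketch below is one reading of those rules, not the loader's implementation: the function name is hypothetical, and treating "not specified" on code object V2 to V3 as requiring the feature to be enabled in the process is an interpretation of the text above.

```python
def can_load(code_object_version, feature_setting, process_setting):
    """Decide whether code is loadable under the sramecc/xnack rules.

    feature_setting: True (on), False (off) or None (not specified).
    process_setting: True if the feature is enabled in the process.
    """
    if feature_setting is not None:
        # Specified: the process must have a matching setting.
        return feature_setting == process_setting
    if code_object_version in (2, 3):
        # Not specified on V2 to V3: generated for the enabled case.
        return process_setting
    if code_object_version >= 4:
        # Not specified on V4 or above: either setting is acceptable.
        return True
    raise ValueError("unsupported code object version")
```

For example, code built for V4 without specifying xnack loads under either process setting, while code built with xnack- loads only when XNACK replay is disabled in the process.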

Target ID

AMDGPU supports target IDs. See Clang Offload Bundler for a general description. The AMDGPU target specific information is:

processor

Is an AMDGPU processor or alternative processor name specified in AMDGPU Processors. The non-canonical form target ID allows both the primary processor and alternative processor names. The canonical form target ID only allows the primary processor name.

target-feature

Is a target feature name specified in AMDGPU Target Features that is supported by the processor. The target features supported by each processor are specified in AMDGPU Processors. Those that can be specified in a target ID are marked as being controlled by -mcpu and --offload-arch. Each target feature must appear at most once in a target ID. The non-canonical form target ID allows the target features to be specified in any order. The canonical form target ID requires the target features to be specified in alphabetic order.
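The difference between the non-canonical and canonical forms can be sketched as a normalization step. The Python below assumes the colon-separated spelling seen in examples such as gfx908:xnack+ (the full grammar is defined in Clang Offload Bundler); the function name and the alternative_names parameter are hypothetical, and no check is made that the processor supports the features.

```python
def canonicalize_target_id(target_id, alternative_names=None):
    """Return the canonical form of a target ID.

    alternative_names: optional dict mapping an alternative processor
        name to its primary name (for example {"tahiti": "gfx600"}).
    """
    alternative_names = alternative_names or {}
    processor, *parts = target_id.split(":")
    # The canonical form only allows the primary processor name.
    processor = alternative_names.get(processor, processor)
    features = {}
    for part in parts:
        if len(part) < 2 or part[-1] not in "+-":
            raise ValueError(f"malformed target feature: {part!r}")
        name, sign = part[:-1], part[-1]
        if name in features:
            raise ValueError(f"target feature {name!r} appears more than once")
        features[name] = sign
    # The canonical form lists target features in alphabetic order.
    ordered = [f"{name}{features[name]}" for name in sorted(features)]
    return ":".join([processor] + ordered)
```

For example, canonicalize_target_id("gfx90a:xnack-:sramecc+") returns "gfx90a:sramecc+:xnack-".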

Code Object V2 to V3 Target ID

The target ID syntax for code object V2 to V3 is the same as defined in Clang Offload Bundler except when used in the .amdgcn_target <target-triple> “-” <target-id> assembler directive and the bundle entry ID. In those cases it has the following BNF syntax:

<target-id> ::= <processor> ( "+" <target-feature> )*

Where a target feature is omitted if Off and present if On or Any.

Note

Code object V2 to V3 cannot represent Any and treats it the same as On.
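A minimal Python sketch of the code object V2 to V3 spelling follows. The function name and the "on"/"off"/"any" strings are hypothetical conventions of this example, and emitting the features in sorted order is an assumption, since the BNF above does not state an order.

```python
def v2_v3_target_id(processor, features):
    """Render <processor> ( "+" <target-feature> )* for code object V2 to V3.

    features: dict mapping feature name to "on", "off" or "any".
    """
    for name, value in features.items():
        if value not in ("on", "off", "any"):
            raise ValueError(f"bad value for {name!r}: {value!r}")
    # A feature is present if On or Any, and omitted if Off.
    present = sorted(n for n, v in features.items() if v in ("on", "any"))
    return processor + "".join(f"+{name}" for name in present)
```

With these conventions, v2_v3_target_id("gfx906", {"xnack": "any", "sramecc": "off"}) returns "gfx906+xnack", which shows why Any and On cannot be told apart in this form.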

Embedding Bundled Code Objects

AMDGPU supports the HIP and OpenMP languages that perform code object embedding as described in Clang Offload Bundler.

Note

The target ID syntax used for code object V2 to V3 for a bundle entry ID differs from that used elsewhere. See Code Object V2 to V3 Target ID.

Address Spaces

The AMDGPU architecture supports a number of memory address spaces. The address space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to target architecture specific LLVM address space numbers used in LLVM IR.

The AMDGPU address spaces are described in AMDGPU Address Spaces. Only 64-bit process address spaces are supported for the amdgcn target.

Table 25 AMDGPU Address Spaces

                                                                                                       64-Bit Process Address Space
Address Space Name                     LLVM IR Address Space Number  HSA Segment Name  Hardware Name   Address Size  NULL Value
Generic                                0                             flat              flat            64            0x0000000000000000
Global                                 1                             global            global          64            0x0000000000000000
Region                                 2                             N/A               GDS             32            not implemented for AMDHSA
Local                                  3                             group             LDS             32            0xFFFFFFFF
Constant                               4                             constant          same as global  64            0x0000000000000000
Private                                5                             private           scratch         32            0xFFFFFFFF
Constant 32-bit                        6                             TODO                                            0x00000000
Buffer Fat Pointer                     7                             N/A               N/A             160           0
Buffer Resource                        8                             N/A               V#              128           0x00000000000000000000000000000000
Buffer Strided Pointer (experimental)  9                             TODO
Streamout Registers                    128                           N/A               GS_REGS

Generic

The generic address space is supported unless the Target Properties column of AMDGPU Processors specifies Does not support generic address space.

The generic address space uses the hardware flat address support for two fixed ranges of virtual addresses (the private and local apertures) that are outside the range of addressable global memory to map from a flat address to a private or local address. This uses FLAT instructions that can take a flat address and access global, private (scratch), and group (LDS) memory depending on whether the address is within one of the aperture ranges.

Flat access to scratch requires hardware aperture setup and setup in the kernel prologue (see Flat Scratch). Flat access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup (see M0).

To convert between a private or group address space address (termed a segment address) and a flat address, the base address of the corresponding aperture can be used. For GFX7-GFX8 these are available in the HSA AQL Queue, the address of which can be obtained with Queue Ptr SGPR (see Initial Kernel Execution State). For GFX9-GFX11 the aperture base addresses are directly available as inline constant registers SRC_SHARED_BASE/LIMIT and SRC_PRIVATE_BASE/LIMIT. In 64-bit address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32, which makes it easier to convert from flat to segment or segment to flat.

A global address space address has the same value when used as a flat address so no conversion is needed.
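Because the apertures are 2^32 bytes in size and aligned to 2^32, the conversion between segment and flat addresses reduces to combining or splitting the upper and lower 32 bits. The Python sketch below illustrates only that arithmetic; the function names and any aperture base value used with them are hypothetical, and NULL handling is deliberately left out even though the address space table lists different NULL values for the generic, local and private address spaces.

```python
APERTURE_SIZE = 1 << 32  # aperture size and alignment in 64-bit address mode


def segment_to_flat(aperture_base, segment_address):
    """Place a 32-bit segment address inside the aperture."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    return aperture_base | segment_address


def is_in_aperture(aperture_base, flat_address):
    """True if the flat address falls inside the aperture."""
    return (flat_address >> 32) == (aperture_base >> 32)


def flat_to_segment(aperture_base, flat_address):
    """Recover the 32-bit segment address from a flat address."""
    if not is_in_aperture(aperture_base, flat_address):
        raise ValueError("flat address is outside the aperture")
    return flat_address & (APERTURE_SIZE - 1)
```

A flat address that falls in neither aperture is a global address and, as stated above, needs no conversion.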

Global and Constant

The global and constant address spaces both use global virtual addresses, which are the same virtual address space used by the CPU. However, some virtual addresses may only be accessible to the CPU, some only accessible by the GPU, and some by both.

Using the constant address space indicates that the data will not change during the execution of the kernel. This allows scalar read instructions to be used. As the constant address space can only be modified on the host side, a generic pointer loaded from the constant address space can safely be assumed to be a global pointer, since only the device global memory is visible and managed on the host side. The vector and scalar L1 caches are invalidated of volatile data before each kernel dispatch execution to allow constant memory to change values between kernel dispatches.

Region

The region address space uses the hardware Global Data Store (GDS). All wavefronts executing on the same device will access the same memory for any given region address. However, the same region address accessed by wavefronts executing on different devices will access different memory. It is higher performance than global memory. It is allocated by the runtime. The data store (DS) instructions can be used to access it.

Local

The local address space uses the hardware Local Data Store (LDS) which is automatically allocated when the hardware creates the wavefronts of a work-group, and freed when all the wavefronts of a work-group have terminated. All wavefronts belonging to the same work-group will access the same memory for any given local address. However, the same local address accessed by wavefronts belonging to different work-groups will access different memory. It is higher performance than global memory. The data store (DS) instructions can be used to access it.

Private

The private address space uses the hardware scratch memory support which automatically allocates memory when it creates a wavefront and frees it when a wavefront terminates. The memory accessed by a lane of a wavefront for any given private address will be different to the memory accessed by another lane of the same or different wavefront for the same private address.

If a kernel dispatch uses scratch, then the hardware allocates memory from a pool of backing memory allocated by the runtime for each wavefront. The lanes of the wavefront access this using dword (4 byte) interleaving. The mapping used from private address to backing memory address is:

wavefront-scratch-base + ((private-address / 4) * wavefront-size * 4) + (wavefront-lane-id * 4) + (private-address % 4)

If each lane of a wavefront accesses the same private address, the interleaving results in adjacent dwords being accessed and hence requires fewer cache lines to be fetched.
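The mapping above can be transcribed directly into code to see the interleaving. In this Python sketch the function name and the example base address are hypothetical; the arithmetic follows the formula term by term, with integer division for private-address / 4.

```python
def backing_address(wavefront_scratch_base, private_address,
                    wavefront_lane_id, wavefront_size):
    """Map a private address of one lane to its backing memory address."""
    return (wavefront_scratch_base
            + (private_address // 4) * wavefront_size * 4  # dword row
            + wavefront_lane_id * 4                        # lane's dword in the row
            + private_address % 4)                         # byte within the dword


# Lanes 0..3 of a 64-wide wavefront all accessing private address 8.
addresses = [backing_address(0x1000, 8, lane, 64) for lane in range(4)]
```

In this example the four lanes land on consecutive dwords (each address is 4 bytes after the previous one), which is the adjacency the preceding paragraph describes.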

There are different ways that the wavefront scratch base address is determined by a wavefront (see Initial Kernel Execution State).

Scratch memory can be accessed in an interleaved manner using buffer instructions with the scratch buffer descriptor and per wavefront scratch offset, by the scratch instructions, or by flat instructions. Multi-dword access is not supported except by flat and scratch instructions in GFX9-GFX11.

Code that manipulates the stack values in other lanes of a wavefront, such as by addrspacecast-ing stack pointers to generic ones and taking offsets that reach other lanes or by explicitly constructing the scratch buffer descriptor, triggers undefined behavior when it modifies the scratch values of other lanes. The compiler may assume that such modifications do not occur. When using code object V5, LIBOMPTARGET_STACK_SIZE may be used to provide the private segment size in bytes for cases where a dynamic stack is used.

Constant 32-bit

TODO

Buffer Fat Pointer

The buffer fat pointer is an experimental address space that is currently unsupported in the backend. It exposes a non-integral pointer that is intended in the future to support the modelling of 128-bit buffer descriptors plus a 32-bit offset into the buffer (in total encapsulating a 160-bit pointer), allowing normal LLVM load/store/atomic operations to be used to model the buffer descriptors used heavily in graphics workloads targeting the backend.

The buffer descriptor used to construct a buffer fat pointer must be raw: the stride must be 0, the “add tid” flag must be 0, the swizzle enable bits must be off, and the extent must be measured in bytes. (On subtargets where bounds checking may be disabled, buffer fat pointers may choose to enable it or not.)

Buffer Resource

The buffer resource pointer, in address space 8, is the newer form for representing buffer descriptors in AMDGPU IR, replacing their previous representation as <4 x i32>. It is a non-integral pointer that represents a 128-bit buffer descriptor resource (V#).

Since, in general, a buffer resource supports complex addressing modes that cannot be easily represented in LLVM (such as implicit swizzled access to structured buffers), it is illegal to perform non-trivial address computations, such as getelementptr operations, on buffer resources. They may be passed to AMDGPU buffer intrinsics, and they may be converted to and from i128.

Casting a buffer resource to a buffer fat pointer is permitted and adds an offset of 0.

Buffer resources can be created from 64-bit pointers (which should be either generic or global) using the llvm.amdgcn.make.buffer.rsrc intrinsic. The intrinsic takes the pointer, which becomes the base of the resource; the 16-bit stride (and swizzle control) field stored in bits 63:48 of a V#; the 32-bit NumRecords/extent field (bits 95:64); and the 32-bit flags field (bits 127:96). The specific interpretation of these fields varies by the target architecture and is detailed in the ISA descriptions.
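The bit positions quoted above can be illustrated by packing the four fields into a 128-bit integer. This Python sketch is only a picture of the layout, not an emulation of the intrinsic: the function name is hypothetical, and placing the base pointer in bits 47:0 is an inference from the stride occupying bits 63:48.

```python
def pack_buffer_resource(base, stride, num_records, flags):
    """Pack the fields into a 128-bit V#-style integer."""
    if not 0 <= base < (1 << 48):
        # Assumption: the base occupies bits 47:0, below the stride field.
        raise ValueError("base must fit in 48 bits in this sketch")
    if not 0 <= stride < (1 << 16):
        raise ValueError("stride field is 16 bits")
    if not 0 <= num_records < (1 << 32):
        raise ValueError("NumRecords/extent field is 32 bits")
    if not 0 <= flags < (1 << 32):
        raise ValueError("flags field is 32 bits")
    return (base
            | (stride << 48)       # bits 63:48
            | (num_records << 64)  # bits 95:64
            | (flags << 96))       # bits 127:96
```

Shifting the packed value right by 64 and masking with 0xFFFFFFFF recovers the NumRecords/extent field, and shifting by 96 recovers the flags.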

Buffer Strided Pointer

The buffer strided pointer is an experimental address space. It represents a 128-bit buffer descriptor and a 32-bit offset, like the Buffer Fat Pointer. Additionally, it contains an index into the buffer, which allows the direct addressing of structured elements. These components appear in that order, i.e., the descriptor comes first, then the 32-bit offset followed by the 32-bit index.

The bits in the buffer descriptor must meet the following requirements: the stride is the size of a structured element, the “add tid” flag must be 0, and the swizzle enable bits must be off.

Streamout Registers

Dedicated registers used by the GS NGG Streamout Instructions. The register file is modelled as a memory in a distinct address space because it is indexed by an address-like offset in place of named registers, and because register accesses affect LGKMcnt. This is an internal address space used only by the compiler. Do not use this address space for IR pointers.

Memory Scopes

This section describes the LLVM memory synchronization scopes supported by the AMDGPU backend memory model when the target triple OS is amdhsa (see Memory Model and Target Triples).

The memory model supported is based on the HSA memory model [HSA] which is based in turn on HRF-indirect with scope inclusion [HRF]. The happens-before relation is transitive over the synchronizes-with relation independent of scope and synchronizes-with allows the memory scope instances to be inclusive (see table AMDHSA LLVM Sync Scopes).

This is different to the OpenCL [OpenCL] memory model which does not have scope inclusion and requires the memory scopes to exactly match. However, this is conservatively correct for OpenCL.

Table 26 AMDHSA LLVM Sync Scopes

LLVM Sync Scope

Description

none

The default: system.

Synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) for all address spaces (except private, or generic that accesses private) provided the other operation’s sync scope is:

  • system.

  • agent and executed by a thread on the same agent.

  • workgroup and executed by a thread in the same work-group.

  • wavefront and executed by a thread in the same wavefront.

agent

Synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) for all address spaces (except private, or generic that accesses private) provided the other operation’s sync scope is:

  • system or agent and executed by a thread on the same agent.

  • workgroup and executed by a thread in the same work-group.

  • wavefront and executed by a thread in the same wavefront.

workgroup

Synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) for all address spaces (except private, or generic that accesses private) provided the other operation’s sync scope is:

  • system, agent or workgroup and executed by a thread in the same work-group.

  • wavefront and executed by a thread in the same wavefront.

wavefront

Synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) for all address spaces (except private, or generic that accesses private) provided the other operation’s sync scope is:

  • system, agent, workgroup or wavefront and executed by a thread in the same wavefront.

singlethread

Only synchronizes with, and participates in modification and seq_cst total orderings with, other operations (except image operations) running in the same thread for all address spaces (for example, in signal handlers).

one-as

Same as system but only synchronizes with other operations within the same address space.

agent-one-as

Same as agent but only synchronizes with other operations within the same address space.

workgroup-one-as

Same as workgroup but only synchronizes with other operations within the same address space.

wavefront-one-as

Same as wavefront but only synchronizes with other operations within the same address space.

singlethread-one-as

Same as singlethread but only synchronizes with other operations within the same address space.
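The inclusion rules in the table above can be read as a single rule: two operations can synchronize only when both threads fall inside the same instance of the narrower of the two scopes. The Python sketch below models that reading for the system, agent, workgroup and wavefront scopes. It is an illustration only and not backend code: the Thread record, SCOPE_RANK, and both function names are hypothetical, and the model leaves out singlethread, the one-as variants, and the address space exceptions (private, image operations).

```python
from dataclasses import dataclass

# Scopes ordered from narrowest to widest. "system" is the default sync scope.
SCOPE_RANK = {"wavefront": 0, "workgroup": 1, "agent": 2, "system": 3}


@dataclass(frozen=True)
class Thread:
    """Hypothetical position of a thread in the execution hierarchy."""
    agent: int
    workgroup: int  # work-group ID within the agent
    wavefront: int  # wavefront ID within the work-group


def same_scope_instance(a: Thread, b: Thread, scope: str) -> bool:
    """Return True if both threads belong to the same instance of `scope`."""
    if scope == "system":
        return True
    if a.agent != b.agent:
        return False
    if scope == "agent":
        return True
    if a.workgroup != b.workgroup:
        return False
    if scope == "workgroup":
        return True
    return a.wavefront == b.wavefront  # scope == "wavefront"


def can_synchronize(scope_a: str, thread_a: Thread,
                    scope_b: str, thread_b: Thread) -> bool:
    """Model scope inclusion: the narrower of the two scopes decides."""
    narrower = min(scope_a, scope_b, key=SCOPE_RANK.__getitem__)
    return same_scope_instance(thread_a, thread_b, narrower)
```

For example, an agent scope operation and a workgroup scope operation can synchronize in this model only when both threads are in the same work-group, which matches the agent row of the table.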

LLVM IR Intrinsics

The AMDGPU backend implements the following LLVM IR intrinsics.

This section is WIP.

Table 27 AMDGPU LLVM IR Intrinsics

LLVM Intrinsic

Description

llvm.amdgcn.sqrt

Provides direct access to v_sqrt_f64, v_sqrt_f32 and v_sqrt_f16 (on targets with half support). Performs sqrt function.

llvm.amdgcn.log

Provides direct access to v_log_f32 and v_log_f16 (on targets with half support). Performs log2 function.

llvm.amdgcn.exp2

Provides direct access to v_exp_f32 and v_exp_f16 (on targets with half support). Performs exp2 function.

llvm.frexp

Implemented for half, float and double.

llvm.log2

Implemented for float and half (and vectors of float or half). Not implemented for double. Hardware provides 1ULP accuracy for float, and 0.51ULP for half. Float instruction does not natively support denormal inputs.

llvm.sqrt

Implemented for double, float and half (and vectors).

llvm.log

Implemented for float and half (and vectors).

llvm.exp

Implemented for float and half (and vectors).

llvm.log10

Implemented for float and half (and vectors).

llvm.exp2

Implemented for float and half (and vectors of float or half). Not implemented for double. Hardware provides 1ULP accuracy for float, and 0.51ULP for half. Float instruction does not natively support denormal inputs.

llvm.stacksave.p5

Implemented, must use the alloca address space.

llvm.stackrestore.p5

Implemented, must use the alloca address space.

llvm.get.fpmode.i32

The natural floating-point mode type is i32. This is implemented by extracting relevant bits out of the MODE register with s_getreg_b32. The first 10 bits are the core floating-point mode. Bits 12:18 are the exception mask. On gfx9+, bit 23 is FP16_OVFL. Bitfields not relevant to floating-point instructions are 0s.

llvm.get.rounding

AMDGPU supports two separately controllable rounding modes depending on the floating-point type. One controls float, and the other controls both double and half operations. If both modes are the same, returns one of the standard return values. If the modes are different, returns one of 12 extended values describing the two modes.

To nearest, ties away from zero is not a supported mode. The raw rounding mode values in the MODE register do not exactly match the FLT_ROUNDS values, so a conversion is performed.

llvm.set.rounding

Input value expected to be one of the valid results from ‘llvm.get.rounding’. Rounding mode is undefined if not passed a valid input. This should be a wave uniform value. In case of a divergent input value, the first active lane’s value will be used.

llvm.get.fpenv

Returns the current value of the AMDGPU floating point environment. This stores information related to the current rounding mode, denormalization mode, enabled traps, and floating point exceptions. The format is a 64-bit concatenation of the MODE and TRAPSTS registers.

llvm.set.fpenv

Sets the floating point environment to the specified state.

llvm.amdgcn.readfirstlane

Provides direct access to v_readfirstlane_b32. Returns the value in the lowest active lane of the input operand. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.

llvm.amdgcn.readlane

Provides direct access to v_readlane_b32. Returns the value in the specified lane of the first input operand. The second operand specifies the lane to read from. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.

llvm.amdgcn.writelane

Provides direct access to v_writelane_b32. Writes value in the first input operand to the specified lane of divergent output. The second operand specifies the lane to write. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.

llvm.amdgcn.wave.reduce.umin

Performs an arithmetic unsigned min reduction on the unsigned values provided by each lane in the wavefront. The second operand is a hint for the reduction strategy: 0 selects the target default preference, 1 the iterative strategy, and 2 DPP. If the target does not support DPP operations (e.g. gfx6/7), the reduction is performed using the default iterative strategy. The intrinsic is currently only implemented for i32.

llvm.amdgcn.wave.reduce.umax

Performs an arithmetic unsigned max reduction on the unsigned values provided by each lane in the wavefront. The second operand is a hint for the reduction strategy: 0 selects the target default preference, 1 the iterative strategy, and 2 DPP. If the target does not support DPP operations (e.g. gfx6/7), the reduction is performed using the default iterative strategy. The intrinsic is currently only implemented for i32.

llvm.amdgcn.permlane16

Provides direct access to v_permlane16_b32. Performs arbitrary gather-style operation within a row (16 contiguous lanes) of the second input operand. The third and fourth inputs must be scalar values. These are combined into a single 64-bit value representing lane selects used to swizzle within each row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.

llvm.amdgcn.permlanex16

Provides direct access to v_permlanex16_b32. Performs arbitrary gather-style operation across two rows of the second input operand (each row is 16 contiguous lanes). The third and fourth inputs must be scalar values. These are combined into a single 64-bit value representing lane selects used to swizzle within each row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.

llvm.amdgcn.permlane64

Provides direct access to v_permlane64_b32. Performs a specific permutation across lanes of the input operand where the high half and low half of a wave64 are swapped. Performs no operation in wave32 mode. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.

llvm.amdgcn.udot2

Provides direct access to v_dot2_u32_u16 across targets which support such instructions. This performs unsigned dot product with two v2i16 operands, summed with the third i32 operand. The i1 fourth operand is used to clamp the output.

llvm.amdgcn.udot4

Provides direct access to v_dot4_u32_u8 across targets which support such instructions. This performs unsigned dot product with two i32 operands (holding a vector of 4 8bit values), summed with the third i32 operand. The i1 fourth operand is used to clamp the output.

llvm.amdgcn.udot8

Provides direct access to v_dot8_u32_u4 across targets which support such instructions. This performs unsigned dot product with two i32 operands (holding a vector of 8 4bit values), summed with the third i32 operand. The i1 fourth operand is used to clamp the output.

llvm.amdgcn.sdot2

Provides direct access to v_dot2_i32_i16 across targets which support such instructions. This performs signed dot product with two v2i16 operands, summed with the third i32 operand. The i1 fourth operand is used to clamp the output. When applicable (e.g. no clamping), this is lowered into v_dot2c_i32_i16 for targets which support it.

llvm.amdgcn.sdot4

Provides direct access to v_dot4_i32_i8 across targets which support such instructions. This performs signed dot product with two i32 operands (holding a vector of 4 8bit values), summed with the third i32 operand. The i1 fourth operand is used to clamp the output. When applicable (i.e. no clamping / operand modifiers), this is lowered into v_dot4c_i32_i8 for targets which support it. RDNA3 does not offer v_dot4_i32_i8, and rather offers v_dot4_i32_iu8 which has operands to hold the signedness of the vector operands. Thus, this intrinsic lowers to the signed version of this instruction for gfx11 targets.

llvm.amdgcn.sdot8

Provides direct access to v_dot8_i32_i4 across targets which support such instructions. This performs signed dot product with two i32 operands (holding a vector of 8 4bit values), summed with the third i32 operand. The i1 fourth operand is used to clamp the output. When applicable (i.e. no clamping / operand modifiers), this is lowered into v_dot8c_i32_i4 for targets which support it. RDNA3 does not offer v_dot8_i32_i4, and rather offers v_dot8_i32_iu4 which has operands to hold the signedness of the vector operands. Thus, this intrinsic lowers to the signed version of this instruction for gfx11 targets.

llvm.amdgcn.sudot4

Provides direct access to v_dot4_i32_iu8 on gfx11 targets. This performs dot product with two i32 operands (holding a vector of 4 8bit values), summed with the fifth i32 operand. The i1 sixth operand is used to clamp the output. The i1s preceding the vector operands decide the signedness.

llvm.amdgcn.sudot8

Provides direct access to v_dot8_i32_iu4 on gfx11 targets. This performs dot product with two i32 operands (holding a vector of 8 4bit values), summed with the fifth i32 operand. The i1 sixth operand is used to clamp the output. The i1s preceding the vector operands decide the signedness.

llvm.amdgcn.sched_barrier

Controls the types of instructions that may be allowed to cross the intrinsic during instruction scheduling. The parameter is a mask for the instruction types that can cross the intrinsic.

  • 0x0000: No instructions may be scheduled across sched_barrier.

  • 0x0001: All non-memory, non-side-effect-producing instructions may be scheduled across sched_barrier, i.e. allow ALU instructions to pass.

  • 0x0002: VALU instructions may be scheduled across sched_barrier.

  • 0x0004: SALU instructions may be scheduled across sched_barrier.

  • 0x0008: MFMA/WMMA instructions may be scheduled across sched_barrier.

  • 0x0010: All VMEM instructions may be scheduled across sched_barrier.

  • 0x0020: VMEM read instructions may be scheduled across sched_barrier.

  • 0x0040: VMEM write instructions may be scheduled across sched_barrier.

  • 0x0080: All DS instructions may be scheduled across sched_barrier.

  • 0x0100: All DS read instructions may be scheduled across sched_barrier.

  • 0x0200: All DS write instructions may be scheduled across sched_barrier.

  • 0x0400: All Transcendental (e.g. V_EXP) instructions may be scheduled across sched_barrier.

llvm.amdgcn.sched_group_barrier

Creates schedule groups with specific properties to create custom scheduling pipelines. The ordering between groups is enforced by the instruction scheduler. The intrinsic applies to the code that precedes the intrinsic. The intrinsic takes three values that control the behavior of the schedule groups.

  • Mask : Classify instruction groups using the llvm.amdgcn.sched_barrier mask values.

  • Size : The number of instructions that are in the group.

  • SyncID : Order is enforced between groups with matching values.

The mask can include multiple instruction types. It is undefined behavior to set values beyond the range of valid masks.

Combining multiple sched_group_barrier intrinsics enables an ordering of specific instruction types during instruction scheduling. For example, the following enforces a sequence of 1 VMEM read, followed by 1 VALU instruction, followed by 5 MFMA instructions.

// 1 VMEM read (mask 0x0020)
__builtin_amdgcn_sched_group_barrier(32, 1, 0);
// 1 VALU (mask 0x0002)
__builtin_amdgcn_sched_group_barrier(2, 1, 0);
// 5 MFMA (mask 0x0008)
__builtin_amdgcn_sched_group_barrier(8, 5, 0);

llvm.amdgcn.iglp_opt

An experimental intrinsic for instruction group level parallelism. The intrinsic implements predefined instruction scheduling orderings. The intrinsic applies to the surrounding scheduling region. The intrinsic takes a value that specifies the strategy. The compiler implements two strategies.

  1. Interleave DS and MFMA instructions for small GEMM kernels.

  2. Interleave DS and MFMA instructions for single wave small GEMM kernels.

Only one iglp_opt intrinsic may be used in a scheduling region. The iglp_opt intrinsic cannot be combined with sched_barrier or sched_group_barrier.

The iglp_opt strategy implementations are subject to change.

llvm.amdgcn.atomic.cond.sub.u32

Provides direct access to flat_atomic_cond_sub_u32, global_atomic_cond_sub_u32 and ds_cond_sub_u32 based on address space on gfx12 targets. This performs subtraction only if the memory value is greater than or equal to the data value.

llvm.amdgcn.s.getpc

Provides access to the s_getpc_b64 instruction, but with the return value sign-extended from the width of the underlying PC hardware register even on processors where the s_getpc_b64 instruction returns a zero-extended value.
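The mask values listed for llvm.amdgcn.sched_barrier are individual bits, so a mask that covers several instruction types is the bitwise OR of the corresponding values. The Python sketch below names the bits and formats the builtin calls used in the sched_group_barrier example in the table above. The SchedMask enum, its member names, and the helper function are hypothetical conveniences for illustration and are not part of LLVM or Clang.

```python
from enum import IntFlag


class SchedMask(IntFlag):
    """Mask bits listed for llvm.amdgcn.sched_barrier (names are illustrative)."""
    NONE = 0x0000        # no instructions may cross
    ALU = 0x0001         # non-memory, non-side-effect instructions
    VALU = 0x0002
    SALU = 0x0004
    MFMA_WMMA = 0x0008
    VMEM = 0x0010
    VMEM_READ = 0x0020
    VMEM_WRITE = 0x0040
    DS = 0x0080
    DS_READ = 0x0100
    DS_WRITE = 0x0200
    TRANS = 0x0400       # transcendental instructions such as V_EXP


def sched_group_barrier_call(mask: SchedMask, size: int, sync_id: int) -> str:
    """Format a __builtin_amdgcn_sched_group_barrier call as C source text."""
    if size < 0:
        raise ValueError("group size must not be negative")
    return f"__builtin_amdgcn_sched_group_barrier({int(mask)}, {size}, {sync_id});"


# Reproduce the pipeline from the example: 1 VMEM read, 1 VALU, 5 MFMA.
pipeline = [
    sched_group_barrier_call(SchedMask.VMEM_READ, 1, 0),
    sched_group_barrier_call(SchedMask.VALU, 1, 0),
    sched_group_barrier_call(SchedMask.MFMA_WMMA, 5, 0),
]
```

A combined mask such as SchedMask.VALU | SchedMask.SALU evaluates to 6, which is the value that would be passed to let both VALU and SALU instructions cross a sched_barrier.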

LLVM IR Metadata

The AMDGPU backend implements the following target custom LLVM IR metadata.

‘amdgpu.last.use’ Metadata

Sets TH_LOAD_LU temporal hint on load instructions that support it. Takes priority over nontemporal hint (TH_LOAD_NT). This takes no arguments.

%val = load i32, ptr %in, align 4, !amdgpu.last.use !{}

‘amdgpu.no.remote.memory’ Metadata

Asserts a memory operation does not access bytes in host memory, or remote connected peer device memory (the address must be device local). This is intended for use with atomicrmw and other atomic instructions. This is required to emit a native hardware instruction for some system scope atomic operations on some subtargets. For most integer atomic operations, this is a sufficient restriction to emit a native atomic instruction.

An atomicrmw without metadata will be treated conservatively as required to preserve the operation behavior in all cases. This will typically be used in conjunction with !amdgpu.no.fine.grained.memory.

; Indicates the atomic does not access fine-grained memory, or
; remote device memory.
%old0 = atomicrmw sub ptr %ptr0, i32 1 acquire, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory !0

; Indicates the atomic does not access peer device memory.
%old2 = atomicrmw sub ptr %ptr2, i32 1 acquire, !amdgpu.no.remote.memory !0

!0 = !{}

‘amdgpu.no.fine.grained.memory’ Metadata

Asserts a memory access does not access bytes allocated in fine-grained allocated memory. This is intended for use with atomicrmw and other atomic instructions. This is required to emit a native hardware instruction for some system scope atomic operations on some subtargets. An atomicrmw without metadata will be treated conservatively as required to preserve the operation behavior in all cases. This will typically be used in conjunction with !amdgpu.no.remote.memory.

; Indicates the access does not access fine-grained memory, or
; remote device memory.
%old0 = atomicrmw sub ptr %ptr0, i32 1 acquire, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory !0

; Indicates the access does not access fine-grained memory
%old2 = atomicrmw sub ptr %ptr2, i32 1 acquire, !amdgpu.no.fine.grained.memory !0

!0 = !{}

‘amdgpu.ignore.denormal.mode’ Metadata

For use with atomicrmw floating-point operations. Indicates the handling of denormal inputs and results is insignificant and may be inconsistent with the expected floating-point mode. This is necessary to emit a native atomic instruction on some targets for some address spaces where float denormals are unconditionally flushed. This is typically used in conjunction with !amdgpu.no.remote.memory and !amdgpu.no.fine.grained.memory.

%res0 = atomicrmw fadd ptr addrspace(1) %ptr, float %value seq_cst, align 4, !amdgpu.ignore.denormal.mode !0
%res1 = atomicrmw fadd ptr addrspace(1) %ptr, float %value seq_cst, align 4, !amdgpu.ignore.denormal.mode !0, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory !0

!0 = !{}

LLVM IR Attributes

The AMDGPU backend supports the following LLVM IR attributes.

Table 28 AMDGPU LLVM IR Attributes

LLVM Attribute

Description

“amdgpu-flat-work-group-size”=”min,max”

Specify the minimum and maximum flat work group sizes that will be specified when the kernel is dispatched. Generated by the amdgpu_flat_work_group_size CLANG attribute [CLANG-ATTR]. The IR implied default value is 1,1024. Clang may emit this attribute with more restrictive bounds depending on language defaults. If the actual block or workgroup size exceeds the limit at any point during the execution, the behavior is undefined. For example, even if there is only one active thread but the thread local id exceeds the limit, the behavior is undefined.

“amdgpu-implicitarg-num-bytes”=”n”

Number of kernel argument bytes to add to the kernel argument block size for the implicit arguments. This varies by OS and language (for OpenCL see OpenCL kernel implicit arguments appended for AMDHSA OS).

“amdgpu-num-sgpr”=”n”

Specifies the number of SGPRs to use. Generated by the amdgpu_num_sgpr CLANG attribute [CLANG-ATTR].

“amdgpu-num-vgpr”=”n”

Specifies the number of VGPRs to use. Generated by the amdgpu_num_vgpr CLANG attribute [CLANG-ATTR].

“amdgpu-waves-per-eu”=”m,n”

Specify the minimum and maximum number of waves per execution unit. Generated by the amdgpu_waves_per_eu CLANG attribute [CLANG-ATTR]. This is an optimization hint, and the backend may not be able to satisfy the request. If the specified range is incompatible with the function’s “amdgpu-flat-work-group-size” value, the occupancy bounds implied by the workgroup size take precedence.

“amdgpu-ieee” true/false.

GFX6-GFX11 only. Specifies whether the function expects the IEEE field of the mode register to be set on entry. Overrides the default for the calling convention.

“amdgpu-dx10-clamp” true/false.

GFX6-GFX11 only. Specifies whether the function expects the DX10_CLAMP field of the mode register to be set on entry. Overrides the default for the calling convention.

“amdgpu-no-workitem-id-x”

Indicates the function does not depend on the value of the llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this attribute, or reached through a call site marked with this attribute, and that intrinsic is called, the behavior of the program is undefined. (Whole-program undefined behavior is used here because, for example, the absence of a required workitem ID in the preloaded register set can mean that all other preloaded registers are earlier than the compilation assumed they would be.) The backend can generally infer this during code generation, so typically there is no benefit to frontends marking functions with this.

“amdgpu-no-workitem-id-y”

The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.workitem.id.y intrinsic.

“amdgpu-no-workitem-id-z”

The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.workitem.id.z intrinsic.

“amdgpu-no-workgroup-id-x”

The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.workgroup.id.x intrinsic.

“amdgpu-no-workgroup-id-y”

The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.workgroup.id.y intrinsic.

“amdgpu-no-workgroup-id-z”

The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.workgroup.id.z intrinsic.

“amdgpu-no-dispatch-ptr”

The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.dispatch.ptr intrinsic.

“amdgpu-no-implicitarg-ptr”

The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.implicitarg.ptr intrinsic.

“amdgpu-no-dispatch-id”

The same as amdgpu-no-workitem-id-x, except for the llvm.amdgcn.dispatch.id intrinsic.

“amdgpu-no-queue-ptr”

Similar to amdgpu-no-workitem-id-x, except for the llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint attributes, the queue pointer may be required in situations where the intrinsic call does not directly appear in the program. Some subtargets require the queue pointer to handle some addrspacecasts, as well as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and llvm.debug intrinsics.

“amdgpu-no-hostcall-ptr”

Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit kernel argument that holds the pointer to the hostcall buffer. If this attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.

“amdgpu-no-heap-ptr”

Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit kernel argument that holds the pointer to an initialized memory buffer that conforms to the requirements of the malloc/free device library V1 version implementation. If this attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.

“amdgpu-no-multigrid-sync-arg”

Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit kernel argument that holds the multigrid synchronization pointer. If this attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.

“amdgpu-no-default-queue”

Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit kernel argument that holds the default queue pointer. If this attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.

“amdgpu-no-completion-action”

Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit kernel argument that holds the completion action pointer. If this attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.

“amdgpu-lds-size”=”min[,max]”

Min is the minimum number of bytes that will be allocated in the Local Data Store at address zero. Variables are allocated within this frame using absolute symbol metadata, primarily by the AMDGPULowerModuleLDS pass. Optional max is the maximum number of bytes that will be allocated. Note that min==max indicates that no further variables can be added to the frame. This is an internal detail of how LDS variables are lowered, language front ends should not set this attribute.

“amdgpu-gds-size”

Bytes expected to be allocated at the start of GDS memory at entry.

“amdgpu-git-ptr-high”

The hard-wired high half of the address of the global information table for AMDPAL OS type. 0xffffffff represents no hard-wired high half, since current hardware only allows a 16 bit value.

“amdgpu-32bit-address-high-bits”

Assumed high 32-bits for 32-bit address spaces which are really truncated 64-bit addresses (i.e., addrspace(6))

“amdgpu-color-export”

Indicates shader exports color information if set to 1. Defaults to 1 for amdgpu_ps, and 0 for other calling conventions. Determines the necessity and type of null exports when a shader terminates early by killing lanes.

“amdgpu-depth-export”

Indicates shader exports depth information if set to 1. Determines the necessity and type of null exports when a shader terminates early by killing lanes. A depth-only shader will export to depth channel when no null export target is available (GFX11+).

“InitialPSInputAddr”

Set the initial value of the spi_ps_input_addr register for amdgpu_ps shaders. Any bits enabled by this value will be enabled in the final register value.

“amdgpu-wave-priority-threshold”

VALU instruction count threshold for adjusting wave priority. If exceeded, temporarily raise the wave priority at the start of the shader function until its last VMEM instructions to allow younger waves to issue their VMEM instructions as well.

“amdgpu-memory-bound”

Set internally by backend

“amdgpu-wave-limiter”

Set internally by backend

“amdgpu-unroll-threshold”

Set base cost threshold preference for loop unrolling within this function, default is 300. Actual threshold may be varied by per-loop metadata or reduced by heuristics.

“amdgpu-max-num-workgroups”=”x,y,z”

Specify the maximum number of work groups for the kernel dispatch in the X, Y, and Z dimensions. Generated by the amdgpu_max_num_work_groups CLANG attribute [CLANG-ATTR]. Clang only emits this attribute when all three numbers are >= 1.

“amdgpu-no-agpr”

Indicates the function will not require allocating AGPRs. This is only relevant on subtargets with AGPRs. The behavior is undefined if a function which requires AGPRs is reached through any function marked with this attribute.
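As an illustration of the “min,max” form used by “amdgpu-flat-work-group-size”, the Python sketch below parses the attribute value and checks a dispatch size against it, falling back to the IR implied default of 1,1024 when the attribute is absent. The function names are hypothetical, and rejecting a range whose maximum is below its minimum is an assumption of this sketch and not a documented backend check.

```python
from typing import Optional, Tuple

# IR implied default stated for "amdgpu-flat-work-group-size".
IR_DEFAULT_FLAT_WORK_GROUP_SIZE = (1, 1024)


def parse_flat_work_group_size(value: Optional[str]) -> Tuple[int, int]:
    """Parse a "min,max" attribute string; None means the attribute is absent."""
    if value is None:
        return IR_DEFAULT_FLAT_WORK_GROUP_SIZE
    min_text, max_text = value.split(",")
    minimum, maximum = int(min_text), int(max_text)
    if maximum < minimum:
        # Assumption of this sketch: treat an inverted range as an error.
        raise ValueError("max must not be smaller than min")
    return minimum, maximum


def dispatch_is_within_bounds(flat_size: int, value: Optional[str] = None) -> bool:
    """Return True if a dispatched flat work-group size respects the bounds."""
    minimum, maximum = parse_flat_work_group_size(value)
    return minimum <= flat_size <= maximum
```

With the value “64,256”, a dispatch of 256 work-items per work-group is within bounds, while a dispatch of 512 is outside them and, per the table above, has undefined behavior.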

Calling Conventions

The AMDGPU backend supports the following calling conventions:

Table 29 AMDGPU Calling Conventions

Calling Convention

Description

ccc

The C calling convention. Used by default. See Non-Kernel Functions for more details.

fastcc

The fast calling convention. Mostly the same as the ccc.

coldcc

The cold calling convention. Mostly the same as the ccc.

amdgpu_cs

Used for Mesa/AMDPAL compute shaders. TODO: Describe.

amdgpu_cs_chain

Similar to amdgpu_cs, with differences described below.

Functions with this calling convention cannot be called directly. They must instead be launched via the llvm.amdgcn.cs.chain intrinsic.

Arguments are passed in SGPRs, starting at s0, if they have the inreg attribute, and in VGPRs otherwise, starting at v8. Using more SGPRs or VGPRs than available in the subtarget is not allowed. On subtargets that use a scratch buffer descriptor (as opposed to scratch_{load,store}_* instructions), the scratch buffer descriptor is passed in s[48:51]. This limits the SGPR / inreg arguments to the equivalent of 48 dwords; using more than that is not allowed.

The return type must be void. Varargs, sret, byval, byref, inalloca, preallocated are not supported.

Values in scalar registers as well as v0-v7 are not preserved. Values in VGPRs starting at v8 are not preserved for the active lanes, but must be saved by the callee for inactive lanes when using WWM.

Wave scratch is “empty” at function boundaries. There is no stack pointer input or output value, but functions are free to use scratch starting from an initial stack pointer. Calls to amdgpu_gfx functions are allowed and behave like they do in amdgpu_cs functions.

All counters (lgkmcnt, vmcnt, storecnt, etc.) are presumed in an unknown state at function entry.

A function may have multiple exits (e.g. one chain exit and one plain ret void for when the wave ends), but all llvm.amdgcn.cs.chain exits must be in uniform control flow.

amdgpu_cs_chain_preserve

Same as amdgpu_cs_chain, but active lanes for VGPRs starting at v8 are preserved. Calls to amdgpu_gfx functions are not allowed, and any calls to llvm.amdgcn.cs.chain must not pass more VGPR arguments than the caller’s VGPR function parameters.

amdgpu_es

Used for the AMDPAL shader stage before the geometry shader if geometry is in use. This is either the domain (= tessellation evaluation) shader if tessellation is in use, or otherwise the vertex shader. TODO: Describe.

amdgpu_gfx

Used for AMD graphics targets. Functions with this calling convention cannot be used as entry points. TODO: Describe.

amdgpu_gs

Used for Mesa/AMDPAL geometry shaders. TODO: Describe.

amdgpu_hs

Used for Mesa/AMDPAL hull shaders (= tessellation control shaders). TODO: Describe.

amdgpu_kernel

See Kernel Functions

amdgpu_ls

Used for AMDPAL vertex shader if tessellation is in use. TODO: Describe.

amdgpu_ps

Used for Mesa/AMDPAL pixel shaders. TODO: Describe.

amdgpu_vs

Used for Mesa/AMDPAL last shader stage before rasterization (vertex shader if tessellation and geometry are not in use, or otherwise copy shader if one is needed). TODO: Describe.
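The argument passing rule described for amdgpu_cs_chain (inreg arguments in SGPRs from s0, all other arguments in VGPRs from v8, and at most the equivalent of 48 dwords of SGPR arguments when the scratch buffer descriptor occupies s[48:51]) can be sketched as follows. This Python model is a simplification under stated assumptions: arguments are described only by a size in dwords, registers are handed out back to back with no alignment or packing rules, and subtarget register limits are not checked. The function and constant names are hypothetical.

```python
from typing import Dict, List, Tuple

FIRST_SGPR = 0                # inreg arguments start at s0
FIRST_VGPR = 8                # all other arguments start at v8
SCRATCH_DESC_FIRST_SGPR = 48  # descriptor in s[48:51] on some subtargets


def assign_chain_args(
    args: List[Tuple[str, int, bool]], uses_scratch_descriptor: bool
) -> Dict[str, Tuple[str, int, int]]:
    """Map (name, size in dwords, inreg) to (bank, first reg, last reg)."""
    next_reg = {"s": FIRST_SGPR, "v": FIRST_VGPR}
    assignment = {}
    for name, dwords, inreg in args:
        bank = "s" if inreg else "v"
        first = next_reg[bank]
        assignment[name] = (bank, first, first + dwords - 1)
        next_reg[bank] = first + dwords
    if uses_scratch_descriptor and next_reg["s"] > SCRATCH_DESC_FIRST_SGPR:
        raise ValueError("inreg arguments exceed the equivalent of 48 dwords")
    return assignment


# Example: two 64-bit inreg values and two VGPR arguments.
example = assign_chain_args(
    [("exec_mask", 2, True), ("callee", 2, True), ("x", 1, False), ("v4", 4, False)],
    uses_scratch_descriptor=True,
)
```

In the example, the two inreg arguments land in s[0:1] and s[2:3], and the remaining arguments land in v8 and v[9:12].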

AMDGPU MCExpr

As part of the AMDGPU MC layer, AMDGPU provides the following target specific MCExprs.

Table 30 AMDGPU MCExpr types:

MCExpr

Operands

Return value

max(arg, ...)

1 or more

Variadic signed operation that returns the maximum value of all its arguments.

or(arg, ...)

1 or more

Variadic signed operation that returns the bitwise-or result of all its arguments.
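The two expressions in the table above can be modelled in a few lines of Python. This is only a sketch of the described semantics: the function names are hypothetical, Python integers are unbounded, and the fixed width that the MC layer uses when evaluating expressions is not modelled here.

```python
from functools import reduce
from operator import or_


def mcexpr_max(*args: int) -> int:
    """Signed maximum of one or more arguments."""
    if not args:
        raise ValueError("max requires at least one argument")
    return max(args)


def mcexpr_or(*args: int) -> int:
    """Bitwise OR of one or more arguments."""
    if not args:
        raise ValueError("or requires at least one argument")
    return reduce(or_, args)
```

For instance, mcexpr_max(3, -7, 5) evaluates to 5 and mcexpr_or(0x1, 0x4, 0x10) evaluates to 0x15.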

ELF Code Object

The AMDGPU backend generates a standard ELF [ELF] relocatable code object that can be linked by lld to produce a standard ELF shared code object which can be loaded and executed on an AMDGPU target.

Sections

An AMDGPU target ELF code object has the standard ELF sections which include:

Table 38 AMDGPU ELF Sections

Name

Type

Attributes

.bss

SHT_NOBITS

SHF_ALLOC + SHF_WRITE

.data

SHT_PROGBITS

SHF_ALLOC + SHF_WRITE

.debug_*

SHT_PROGBITS

none

.dynamic

SHT_DYNAMIC

SHF_ALLOC

.dynstr

SHT_STRTAB

SHF_ALLOC

.dynsym

SHT_DYNSYM

SHF_ALLOC

.got

SHT_PROGBITS

SHF_ALLOC + SHF_WRITE

.hash

SHT_HASH