Convergence And Uniformity

Introduction

In some environments, groups of threads execute the same program in parallel, where efficient communication within a group is established using special primitives called convergent operations. The outcome of a convergent operation is sensitive to the set of threads that participate in it.

The intuitive picture of convergence is built around threads executing in “lock step” — a set of threads is thought of as converged if they are all executing “the same sequence of instructions together”. Such threads may diverge at a divergent branch, and they may later reconverge at some common program point.

In this intuitive picture, when converged threads execute an instruction, the resulting value is said to be uniform if it is the same in those threads, and divergent otherwise. Correspondingly, a branch is said to be a uniform branch if its condition is uniform, and it is a divergent branch otherwise.

But the assumption of lock-step execution is not necessary for describing communication at convergent operations. It also constrains the implementation (compiler as well as hardware) by overspecifying how threads execute in such a parallel environment. To eliminate this assumption:

  • We define convergence as a relation between the execution of each instruction by different threads and not as a relation between the threads themselves. This definition is reasonable for known targets and is compatible with the semantics of convergent operations in LLVM IR.

  • We also define uniformity in terms of this convergence. The output of an instruction can be examined for uniformity across multiple threads only if the corresponding executions of that instruction are converged.

This document describes a static analysis for determining convergence at each instruction in a function. The analysis extends previous work on divergence analysis [DivergenceSPMD] to cover irreducible control flow. The described analysis is used in LLVM to implement a UniformityAnalysis that determines the uniformity of value(s) computed at each instruction in an LLVM IR or MIR function.

[DivergenceSPMD]

Julian Rosemann, Simon Moll, and Sebastian Hack. 2021. An Abstract Interpretation for SPMD Divergence on Reducible Control Flow Graphs. Proc. ACM Program. Lang. 5, POPL, Article 31 (January 2021), 35 pages. https://doi.org/10.1145/3434312

Motivation

Divergent branches constrain program transforms such as changing the CFG or moving a convergent operation to a different point of the CFG. Performing these transformations across a divergent branch can change the sets of threads that execute convergent operations convergently. While these constraints are out of scope for this document, uniformity analysis allows these transformations to identify uniform branches where these constraints do not hold.

Uniformity is also useful by itself on targets that execute threads in groups with shared execution resources (e.g. waves, warps, or subgroups):

  • Uniform outputs can potentially be computed or stored on shared resources.

  • These targets must “linearize” a divergent branch to ensure that each side of the branch is followed by the corresponding threads in the same group. But linearization is unnecessary at uniform branches, since the whole group of threads follows either one side of the branch or the other.

Terminology

Cycles

Described in LLVM Cycle Terminology.

Closed path

Described in Closed Paths and Cycles.

Disjoint paths

Two paths in a CFG are said to be disjoint if the only nodes common to both are the start node or the end node, or both.

Join node

A join node of a branch is a node reachable along disjoint paths starting from that branch.
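As a rough illustration of the two definitions above, the following Python sketch finds the join nodes of a branch by brute force. The CFG encoding (a dict from node name to successor list), the diamond-shaped example graph, and the helper names `simple_paths`, `are_disjoint`, and `join_nodes` are all assumptions made for this example. It enumerates simple paths only, so it is suitable for tiny graphs and is not the algorithm used by LLVM.

```python
from itertools import combinations

def simple_paths(cfg, src, dst):
    """Yield every path from src to dst that visits no intermediate node twice."""
    stack = [[src]]
    while stack:
        path = stack.pop()
        for succ in cfg[path[-1]]:
            if succ == dst:
                yield path + [succ]
            elif succ not in path:
                stack.append(path + [succ])

def are_disjoint(p, q):
    """Distinct paths with shared endpoints are disjoint if their interiors do not overlap."""
    return p != q and set(p[1:-1]).isdisjoint(q[1:-1])

def join_nodes(cfg, branch):
    """Return the nodes reachable from `branch` along two disjoint paths."""
    joins = set()
    for node in cfg:
        paths = list(simple_paths(cfg, branch, node))
        if any(are_disjoint(p, q) for p, q in combinations(paths, 2)):
            joins.add(node)
    return joins

# Hypothetical diamond: A branches to B and C, which meet again at D.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(join_nodes(diamond, "A"))  # {'D'}
```

Node E is not reported because every path from A to E passes through D, so no two such paths are disjoint.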

Diverged path

A diverged path is a path that starts from a divergent branch and either reaches a join node of the branch or reaches the end of the function without passing through any join node of the branch.

Threads and Dynamic Instances

Each occurrence of an instruction in the program source is called a static instance. When a thread executes a program, each execution of a static instance produces a distinct dynamic instance of that instruction.

Each thread produces a unique sequence of dynamic instances:

  • The sequence is generated along branch decisions and loop traversals.

  • It starts with a dynamic instance of a “first” instruction.

  • It continues with dynamic instances of successive “next” instructions.
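The distinction between static and dynamic instances can be sketched in a few lines of Python. The path below is a hypothetical trace of one thread through a loop, and `dynamic_instances` is a name invented for this example. It numbers occurrences per thread, which differs from the tables in this document, where instance labels are numbered across threads.

```python
from collections import Counter

def dynamic_instances(path):
    """Label each executed static instance with its per-thread occurrence number."""
    seen = Counter()
    labelled = []
    for static in path:
        seen[static] += 1
        labelled.append((static, seen[static]))
    return labelled

# Hypothetical thread that runs the loop header H twice and the block B once.
thread = ["Entry", "H", "B", "L", "H", "L", "Exit"]
print(dynamic_instances(thread))
# [('Entry', 1), ('H', 1), ('B', 1), ('L', 1), ('H', 2), ('L', 2), ('Exit', 1)]
```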

Threads are independent; some targets may choose to execute them in groups in order to share resources when possible.

_images/convergence-natural-loop.png

|          | 1      | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11   |
| Thread 1 | Entry1 | H1 | B1 | L1 | H3 |    | L3 |    |    |    | Exit |
| Thread 2 | Entry2 | H2 |    | L2 | H4 | B2 | L4 | H5 | B3 | L5 | Exit |

In the above table, each row is a different thread, listing the dynamic instances produced by that thread from left to right. Each thread executes the same program that starts with an Entry node and ends with an Exit node, but different threads may take different paths through the control flow of the program. The columns are numbered merely for convenience, and empty cells have no special meaning. Dynamic instances listed in the same column are converged.

Convergence

Convergence-before is a strict partial order over dynamic instances that is defined as the transitive closure of:

  1. If dynamic instance P is executed strictly before Q in the same thread, then P is convergence-before Q.

  2. If dynamic instance P is executed strictly before Q1 in the same thread, and Q1 is converged-with Q2, then P is convergence-before Q2.

  3. If dynamic instance P1 is converged-with P2, and P2 is executed strictly before Q in the same thread, then P1 is convergence-before Q.

|          | 1     | 2 | 3 | 4  | 5 | 6  | 7 | 8 | 9    |
| Thread 1 | Entry |   |   |    |   | S2 | T |   | Exit |
| Thread 2 | Entry |   |   | Q2 | R | S1 |   |   | Exit |
| Thread 3 | Entry |   | P | Q1 |   |    |   |   |      |

The above table shows partial sequences of dynamic instances from different threads. Dynamic instances in the same column are assumed to be converged (i.e., related to each other in the converged-with relation). The resulting convergence order includes the edges P -> Q2, Q1 -> R, P -> R, P -> T, etc.
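The three rules and their transitive closure can be checked mechanically on this example. The sketch below is a minimal illustration, not LLVM code: the function name `convergence_before`, the list-of-lists encoding of threads, and the `.1`/`.2`/`.3` suffixes that distinguish each thread's Entry and Exit instances are assumptions made here. The converged groups encode the assumption stated above that instances in the same column are converged.

```python
from itertools import product

def convergence_before(threads, converged_groups):
    """Return the set of (P, Q) pairs such that P is convergence-before Q."""
    # Symmetric lookup: instance -> instances it is converged-with.
    cw = {}
    for group in converged_groups:
        for x in group:
            cw.setdefault(x, set()).update(y for y in group if y != x)

    edges = set()
    for seq in threads:
        for i, p in enumerate(seq):
            for q in seq[i + 1:]:
                edges.add((p, q))                              # rule 1
                edges.update((p, q2) for q2 in cw.get(q, ()))  # rule 2
                edges.update((p1, q) for p1 in cw.get(p, ()))  # rule 3

    # Transitive closure by repeated composition (fine for small examples).
    while True:
        new = {(a, d) for (a, b), (c, d) in product(edges, repeat=2) if b == c}
        if new <= edges:
            return edges
        edges |= new

threads = [
    ["Entry.1", "S2", "T", "Exit.1"],
    ["Entry.2", "Q2", "R", "S1", "Exit.2"],
    ["Entry.3", "P", "Q1"],
]
converged = [
    {"Entry.1", "Entry.2", "Entry.3"},
    {"Q1", "Q2"},
    {"S1", "S2"},
    {"Exit.1", "Exit.2"},
]
cb = convergence_before(threads, converged)
for edge in [("P", "Q2"), ("Q1", "R"), ("P", "R"), ("P", "T")]:
    print(edge, edge in cb)  # each of the four edges is present
```

The result contains no pair of the form (X, X), which is what the strict partial order requires of this particular converged-with relation.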

Converged-with is a transitive symmetric relation over dynamic instances produced by different threads for the same static instance.

It is impractical to provide any one definition for the converged-with relation, since different environments may wish to relate dynamic instances in different ways. The fact that convergence-before is a strict partial order is a constraint on the converged-with relation. It is trivially satisfied if different dynamic instances are never converged. Below, we provide a relation called maximal converged-with, which satisfies this constraint and is suitable for known targets.

Note

  1. The convergence-before relation is not directly observable. Program transforms are in general free to change the order of instructions, even though that obviously changes the convergence-before relation.

  2. Converged dynamic instances need not be executed at the same time or even on the same resource. Converged dynamic instances of a convergent operation may appear to do so but that is an implementation detail.

  3. The fact that P is convergence-before Q does not automatically imply that P happens-before Q in a memory model sense.

Maximal Convergence

This section defines a constraint that may be used to produce a maximal converged-with relation without violating the strict convergence-before order. This maximal converged-with relation is reasonable for real targets and is compatible with convergent operations.

The maximal converged-with relation is defined in terms of cycle headers, with the assumption that threads converge at the header on every “iteration” of the cycle. Informally, two threads execute the same iteration of a cycle if they both previously executed the cycle header the same number of times after they entered that cycle. In general, this needs to account for the iterations of parent cycles as well.

Maximal converged-with:

Dynamic instances X1 and X2 produced by different threads for the same static instance X are converged in the maximal converged-with relation if and only if:

  • X is not contained in any cycle, or,

  • For every cycle C with header H that contains X:

    • every dynamic instance H1 of H that precedes X1 in the respective thread is convergence-before X2, and,

    • every dynamic instance H2 of H that precedes X2 in the respective thread is convergence-before X1,

    • without assuming that X1 is converged with X2.

Note

Cycle headers may not be unique to a given CFG if it is irreducible. Each cycle hierarchy for the same CFG results in a different maximal converged-with relation.

For brevity, the rest of the document restricts the term converged to mean “related under the maximal converged-with relation for the given cycle hierarchy”.

Maximal convergence can now be demonstrated in the earlier example as follows:

|          | 1      | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11   |
| Thread 1 | Entry1 | H1 | B1 | L1 | H3 |    | L3 |    |    |    | Exit |
| Thread 2 | Entry2 | H2 |    | L2 | H4 | B2 | L4 | H5 | B3 | L5 | Exit |

  • Entry1 and Entry2 are converged.

  • H1 and H2 are converged.

  • B1 and B2 are not converged due to H4 which is not convergence-before B1.

  • H3 and H4 are converged.

  • H3 is not converged with H5 due to H4 which is not convergence-before H3.

  • L1 and L2 are converged.

  • L3 and L4 are converged.

  • L3 is not converged with L5 due to H5 which is not convergence-before L3.
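For a single cycle that each thread enters once, as in this example, the informal "same iteration" view is enough to reproduce the list above. The following sketch is a simplification under that assumption and does not implement the general definition. The cycle membership `{"H", "B", "L"}` and the helper names are assumed for illustration. It reports converged instances as index pairs into the two paths.

```python
def iteration_keys(path, cycle_nodes, header):
    """Pair each executed node with the number of header executions so far.

    Nodes outside the cycle get None, since they are converged unconditionally.
    """
    keys, count = [], 0
    for node in path:
        if node == header:
            count += 1
        keys.append((node, count if node in cycle_nodes else None))
    return keys

def converged_positions(path_a, path_b, cycle_nodes, header):
    """Return (i, j) index pairs whose dynamic instances are converged."""
    keys_b = iteration_keys(path_b, cycle_nodes, header)
    where_b = {key: j for j, key in enumerate(keys_b)}
    keys_a = iteration_keys(path_a, cycle_nodes, header)
    return [(i, where_b[key]) for i, key in enumerate(keys_a) if key in where_b]

loop = {"H", "B", "L"}  # assumed cycle membership, with header H
thread1 = ["Entry", "H", "B", "L", "H", "L", "Exit"]
thread2 = ["Entry", "H", "L", "H", "B", "L", "H", "B", "L", "Exit"]
print(converged_positions(thread1, thread2, loop, "H"))
# [(0, 0), (1, 1), (3, 2), (4, 3), (5, 5), (6, 9)]
```

Index 2 of the first path (B1) has no partner because the second thread executes B only in its second and third iterations.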

Dependence on Cycle Headers

Contradictions in convergence-before are possible only between two nodes that are inside some cycle. The dynamic instances of such nodes may be interleaved in the same thread, and this interleaving may be different for different threads.

When a thread executes a node X once and then executes it again, it must have followed a closed path in the CFG that includes X. Such a path must pass through the header of at least one cycle — the smallest cycle that includes the entire closed path. In a given thread, two dynamic instances of X are either separated by the execution of at least one cycle header, or X itself is a cycle header.

In reducible cycles (natural loops), each execution of the header is equivalent to the start of a new iteration of the cycle. But this analogy breaks down in the presence of explicit constraints on the converged-with relation, such as those described in future work. Instead, cycle headers should be treated as implicit points of convergence in a maximal converged-with relation.

Consider a sequence of nested cycles C1, C2, …, Ck such that C1 is the outermost cycle and Ck is the innermost cycle, with headers H1, H2, …, Hk respectively. When a thread enters the cycle Ck, any of the following is possible:

  1. The thread directly entered cycle Ck without having executed any of the headers H1 to Hk.

  2. The thread executed some or all of the nested headers one or more times.

The maximal converged-with relation captures the following intuition about cycles:

  1. When two threads enter a top-level cycle C1, they execute converged dynamic instances of every node that is a child of C1.

  2. When two threads enter a nested cycle Ck, they execute converged dynamic instances of every node that is a child of Ck, until either thread exits Ck, if and only if they executed converged dynamic instances of the last nested header that either thread encountered.

    Note that when a thread exits a nested cycle Ck, it must follow a closed path outside Ck to reenter it. This requires executing the header of some outer cycle, as described earlier.

Consider two dynamic instances X1 and X2 produced by threads T1 and T2 for a node X that is a child of nested cycle Ck. Maximal convergence relates X1 and X2 as follows:

  1. If neither thread executed any header from H1 to Hk, then X1 and X2 are converged.

  2. Otherwise, if there are no converged dynamic instances Q1 and Q2 of any header Q from H1 to Hk (where Q is possibly the same as X), such that Q1 precedes X1 and Q2 precedes X2 in the respective threads, then X1 and X2 are not converged.

  3. Otherwise, consider the pair Q1 and Q2 of converged dynamic instances of a header Q from H1 to Hk that occur most recently before X1 and X2 in the respective threads. Then X1 and X2 are converged if and only if there is no dynamic instance of any header from H1 to Hk that occurs between Q1 and X1 in thread T1, or between Q2 and X2 in thread T2. In other words, Q1 and Q2 represent the last point of convergence, with no other header being executed before executing X.

Example:

_images/convergence-both-diverged-nested.png

The above figure shows two nested irreducible cycles with headers R and S. The nodes Entry and Q have divergent branches. The table below shows the convergence between three threads taking different paths through the CFG. Dynamic instances listed in the same column are converged.

|          | 1     | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9    |
| Thread 1 | Entry | P1 | Q1 | S1 | P3 | Q3 | R1 | S2 | Exit |
| Thread 2 | Entry | P2 | Q2 |    |    |    | R2 | S3 | Exit |
| Thread 3 | Entry |    |    |    |    |    | R3 | S4 | Exit |

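  • P2 and P3 are not converged due to S1, which is not convergence-before P2.

  • Q2 and Q3 are not converged due to S1, which is not convergence-before Q2.

  • S1 and S3 are not converged due to R2, which is not convergence-before S1.

  • S1 and S4 are not converged due to R3, which is not convergence-before S1.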

Informally, T1 and T2 execute the inner cycle a different number of times, without executing the header of the outer cycle. All threads converge in the outer cycle when they first execute the header of the outer cycle.
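The formal condition for maximal converged-with can be used as a checker on this example: given a candidate relation, test every pair of dynamic instances of the same node against the header condition. The sketch below does that under several assumptions that are not stated in the text and should be read as such: the inner cycle (header S) is taken to contain P, Q and S, the outer cycle (header R) to contain P, Q, R and S, Entry and Exit instances are abbreviated E1..E3 and X1..X3, and all function names are invented for this example.

```python
from itertools import combinations, product

def convergence_before(threads, pairs):
    """Transitive closure of the three convergence-before rules."""
    cw = {}
    for a, b in pairs:
        cw.setdefault(a, set()).add(b)
        cw.setdefault(b, set()).add(a)
    edges = set()
    for seq in threads:
        for i, p in enumerate(seq):
            for q in seq[i + 1:]:
                edges.add((p, q))
                edges.update((p, q2) for q2 in cw.get(q, ()))
                edges.update((p1, q) for p1 in cw.get(p, ()))
    while True:
        new = {(a, d) for (a, b), (c, d) in product(edges, repeat=2) if b == c}
        if new <= edges:
            return edges
        edges |= new

def satisfies_header_condition(x1, x2, threads, pairs, static_of, headers_of):
    """Check the maximal converged-with condition for one candidate pair."""
    # Do not assume that x1 and x2 are themselves converged.
    others = {p for p in pairs if p != frozenset((x1, x2))}
    cb = convergence_before(threads, others)
    for x, y in ((x1, x2), (x2, x1)):
        seq = next(s for s in threads if x in s)
        for h in seq[:seq.index(x)]:
            if static_of[h] in headers_of[static_of[x]] and (h, y) not in cb:
                return False
    return True

threads = [
    ["E1", "P1", "Q1", "S1", "P3", "Q3", "R1", "S2", "X1"],
    ["E2", "P2", "Q2", "R2", "S3", "X2"],
    ["E3", "R3", "S4", "X3"],
]
# Candidate relation: one group per table column with more than one entry.
columns = [{"E1", "E2", "E3"}, {"P1", "P2"}, {"Q1", "Q2"},
           {"R1", "R2", "R3"}, {"S2", "S3", "S4"}, {"X1", "X2", "X3"}]
pairs = {frozenset(p) for col in columns for p in combinations(sorted(col), 2)}
static_of = {d: d[0] for seq in threads for d in seq}
# Assumed cycle membership: headers of every cycle containing each node.
headers_of = {"E": set(), "X": set(), "R": {"R"},
              "P": {"R", "S"}, "Q": {"R", "S"}, "S": {"R", "S"}}

mismatches = [
    (x1, x2)
    for s, t in combinations(threads, 2)
    for x1, x2 in product(s, t)
    if static_of[x1] == static_of[x2]
    and satisfies_header_condition(x1, x2, threads, pairs, static_of, headers_of)
    != (frozenset((x1, x2)) in pairs)
]
print(mismatches)  # [] when the table agrees with the definition
```

Under these assumptions the columns of the table agree with the definition for every pair, including the four non-converged pairs listed above.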

Uniformity

  1. The output of two converged dynamic instances is uniform if and only if it compares equal for those two dynamic instances.

  2. The output of a static instance X is uniform for a given set of threads if and only if it is uniform for every pair of converged dynamic instances of X produced by those threads.
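The two definitions above can be illustrated with a small check. This is a sketch only: `is_uniform`, the instance names, and the output values are hypothetical. Because converged-with is transitive and symmetric, comparing every converged pair is equivalent to requiring one value per converged group, which is what the function tests.

```python
def is_uniform(outputs, converged_groups):
    """Return True if every group of converged dynamic instances agrees on its output.

    `outputs` maps a dynamic instance of one static instance X to its value.
    """
    return all(len({outputs[d] for d in group}) == 1 for group in converged_groups)

# Hypothetical instances of X: X1 and X2 are converged, as are X3 and X4.
groups = [{"X1", "X2"}, {"X3", "X4"}]

print(is_uniform({"X1": 7, "X2": 7, "X3": 9, "X4": 9}, groups))  # True
print(is_uniform({"X1": 7, "X2": 7, "X3": 9, "X4": 8}, groups))  # False
```

In the first call the value differs between groups but not within any group, so the output still counts as uniform: only converged dynamic instances are compared.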

A non-uniform value is said to be