9.7.15.4. Async Proxy

The wgmma.mma_async operations are performed in the asynchronous proxy (or async proxy).

Accessing the same memory location across multiple proxies needs a cross-proxy fence. For the async proxy, fence.proxy.async should be used to synchronize memory between generic proxy and the async proxy.

The completion of a wgmma.mma_async operation is followed by an implicit generic-async proxy fence. So the result of the asynchronous operation is made visible to the generic proxy as soon as its completion is observed. wgmma.commit_group and wgmma.wait_group operations must be used to wait for the completion of the wgmma.mma_async instructions.

9.7.15.7. Asynchronous wgmma Proxy Operations

This section describes warpgroup level wgmma.fence, wgmma.commit_group and wgmma.wait_group instructions.

9.7.15.7.1. Asynchronous Multiply-and-Accumulate Instruction: wgmma.fence

wgmma.fence

Enforce an ordering of register accesses between wgmma.mma_async and other operations.

Syntax

ptx

wgmma.fence.sync.aligned;

Description

wgmma.fence instruction establishes an ordering between prior accesses to any warpgroup registers and subsequent accesses to the same registers by a wgmma.mma_async instruction. Only the accumulator register and the input registers containing the fragments of matrix A require this ordering.

The wgmma.fence instruction must be issued by all warps of the warpgroup at the following locations:

Before the first wgmma.mma_async operation in a warpgroup.
Between a register access by a thread in the warpgroup and any wgmma.mma_async instruction that accesses the same registers, either as accumulator or input register containing fragments of matrix A, except when these are accumulator register accesses across multiple wgmma.mma_async instructions of the same shape. In the latter case, an ordering guarantee is provided by default.

Otherwise, the behavior is undefined.

An async proxy fence must be used to establish an ordering between prior writes to shared memory matrices and subsequent reads of the same matrices in a wgmma.mma_async instruction.

The mandatory .sync qualifier indicates that wgmma.fence instruction causes the executing thread to wait until all threads in the warp execute the same wgmma.fence instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same wgmma.fence instruction. In conditionally executed code, a wgmma.fence instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically, otherwise the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90a.

Examples

ptx

// Example 1, first use example:
wgmma.fence.sync.aligned;    // Establishes an ordering w.r.t. prior accesses to the registers s32d<0-3>
wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8  {s32d0, s32d1, s32d2, s32d3},
                                                  descA, descB, scaleD;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;

// Example 2, use-case with the input value updated in between:
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8  {s32d0, s32d1, s32d2, s32d3},
                                                  descA, descB, scaleD;
...
mov.b32 s32d0, new_val;
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8  {s32d4, s32d5, s32d6, s32d7},
                                                 {s32d0, s32d1, s32d2, s32d3},
                                                  descB, scaleD;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;

9.7.15.7.2. Asynchronous Multiply-and-Accumulate Instruction: wgmma.commit_group

wgmma.commit_group

Commits all prior uncommitted wgmma.mma_async operations into a wgmma-group.

Syntax

ptx

wgmma.commit_group.sync.aligned;

Description

wgmma.commit_group instruction creates a new wgmma-group per warpgroup and batches all prior wgmma.mma_async instructions initiated by the executing warp but not committed to any wgmma-group into the new wgmma-group. If there are no uncommitted wgmma.mma_async instructions then wgmma.commit_group results in an empty wgmma-group.

An executing thread can wait for the completion of all wgmma.mma_async operations in a wgmma-group by using wgmma.wait_group.

The mandatory .sync qualifier indicates that wgmma.commit_group instruction causes the executing thread to wait until all threads in the warp execute the same wgmma.commit_group instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same wgmma.commit_group instruction. In conditionally executed code, a wgmma.commit_group instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically, otherwise the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90a.

Examples

ptx

wgmma.commit_group.sync.aligned;

9.7.15.7.3. Asynchronous Multiply-and-Accumulate Instruction: wgmma.wait_group

wgmma.wait_group

Signal the completion of a preceding warpgroup operation.

Syntax

ptx

wgmma.wait_group.sync.aligned N;

Description

wgmma.wait_group instruction will cause the executing thread to wait until only N or fewer of the most recent wgmma-groups are pending and all the prior wgmma-groups committed by the executing threads are complete. For example, when N is 0, the executing thread waits on all the prior wgmma-groups to complete. Operand N is an integer constant.

Accessing the accumulator register or the input register containing the fragments of matrix A of a wgmma.mma_async instruction without first performing a wgmma.wait_group instruction that waits on a wgmma-group including that wgmma.mma_async instruction is undefined behavior.

The mandatory .sync qualifier indicates that wgmma.wait_group instruction causes the executing thread to wait until all threads in the warp execute the same wgmma.wait_group instruction before resuming execution.

The mandatory .aligned qualifier indicates that all threads in the warpgroup must execute the same wgmma.wait_group instruction. In conditionally executed code, a wgmma.wait_group instruction should only be used if it is known that all threads in the warpgroup evaluate the condition identically, otherwise the behavior is undefined.

PTX ISA Notes

Introduced in PTX ISA version 8.0.

Target ISA Notes

Requires sm_90a.

Examples

ptx

wgmma.fence.sync.aligned;

wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8  {s32d0, s32d1, s32d2, s32d3},
                                                  descA, descB, scaleD;
wgmma.commit_group.sync.aligned;

wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16 {f32d0, f32d1, f32d2, f32d3},
                                                  {f16a0, f16a1, f16a2, f16a3},
                                                   descB, 1, -1, -1, 1;
wgmma.commit_group.sync.aligned;

wgmma.wait_group.sync.aligned 0;