PTX ISA v9.2

9.7.14.5. Matrix Fragments for mma.m16n8k16

9.7.14.5.8. Matrix Fragments for mma.m16n8k16 with floating point type

A warp executing mma.m16n8k16 with floating point types computes an MMA operation of shape .m16n8k16.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

Multiplicand A:

.f16 and .bf16:

.atype Fragment Elements (low to high)
.f16 / .bf16 A vector expression containing four .f16x2 registers, with each register containing two .f16 / .bf16 elements from the matrix A. a0, a1, a2, a3, a4, a5, a6, a7

The layout of the fragments held by different threads is shown in Figure 79.

Figure 79 MMA .m16n8k16 fragment layout for matrix A with .f16 / .bf16 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  groupID                                      for ai where  0 <= i < 2 || 4 <= i < 6
       groupID + 8                                  Otherwise

col =  (threadID_in_group * 2) + (i & 0x1)          for ai where i <  4
       (threadID_in_group * 2) + (i & 0x1) + 8      for ai where i >= 4
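
As a minimal host-side sketch of the formulas above, the following C++ function computes the coordinates of element ai for a given lane. The names `CoordA16` and `a_f16_coord` are hypothetical helpers introduced for illustration; only the arithmetic is taken from the text.

```cpp
#include <cassert>

// Hypothetical helper type: a (row, col) position in the 16x16 matrix A.
struct CoordA16 { int row; int col; };

// Position of element a_i (0 <= i < 8) held by lane `laneid` (0..31)
// for .f16 / .bf16 multiplicand A, following the formulas above.
inline CoordA16 a_f16_coord(int laneid, int i) {
    assert(laneid >= 0 && laneid < 32 && i >= 0 && i < 8);
    int groupID = laneid >> 2;
    int threadID_in_group = laneid % 4;
    // a0, a1, a4, a5 sit in rows 0..7; a2, a3, a6, a7 sit in rows 8..15.
    bool upper_rows = (i < 2) || (i >= 4 && i < 6);
    int row = upper_rows ? groupID : groupID + 8;
    // Elements a4..a7 are shifted to columns 8..15.
    int col = threadID_in_group * 2 + (i & 0x1) + (i >= 4 ? 8 : 0);
    return {row, col};
}
```

For example, lane 5 has groupID 1 and threadID_in_group 1, so a6 maps to row 9, column 10.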

.f64:

.atype Fragment Elements (low to high)
.f64 A vector expression containing eight .f64 registers, with each register containing one .f64 element from the matrix A. a0, a1, a2, a3, a4, a5, a6, a7

The layout of the fragments held by different threads is shown in Figure 80.

Figure 80 MMA .m16n8k16 fragment layout for matrix A with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  groupID                               for ai where  i % 2 = 0
       groupID + 8                           Otherwise

col =  (i * 2) + threadID_in_group           for ai where i % 2 = 0
       (i * 2) - 2 + threadID_in_group       Otherwise
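
One way to sanity-check this mapping is to confirm that the 32 lanes, each holding eight elements, cover the 16x16 matrix A exactly once. The sketch below does that on the host, reading the second column case as (i * 2) - 2 + threadID_in_group. The names `CoordF64A`, `a_f64_coord` and `a_f64_covers_matrix_once` are hypothetical.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper type: a (row, col) position in the 16x16 matrix A.
struct CoordF64A { int row; int col; };

// Position of element a_i (0 <= i < 8) held by lane `laneid` for .f64 A.
inline CoordF64A a_f64_coord(int laneid, int i) {
    assert(laneid >= 0 && laneid < 32 && i >= 0 && i < 8);
    int groupID = laneid >> 2;
    int threadID_in_group = laneid % 4;
    bool even = (i % 2 == 0);
    int row = even ? groupID : groupID + 8;
    int col = even ? (i * 2) + threadID_in_group
                   : (i * 2) - 2 + threadID_in_group;
    return {row, col};
}

// True when every entry of the 16x16 matrix is owned by exactly one (lane, i) pair.
inline bool a_f64_covers_matrix_once() {
    std::array<int, 16 * 16> hits{};
    for (int lane = 0; lane < 32; ++lane) {
        for (int i = 0; i < 8; ++i) {
            CoordF64A c = a_f64_coord(lane, i);
            if (c.row < 0 || c.row >= 16 || c.col < 0 || c.col >= 16) return false;
            ++hits[c.row * 16 + c.col];
        }
    }
    for (int h : hits) {
        if (h != 1) return false;
    }
    return true;
}
```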

Multiplicand B:

.f16 and .bf16:

.btype Fragment Elements (low to high)
.f16 / .bf16 A vector expression containing two .f16x2 registers, with each register containing two .f16 / .bf16 elements from the matrix B. b0, b1, b2, b3

The layout of the fragments held by different threads is shown in Figure 81.

Figure 81 MMA .m16n8k16 fragment layout for matrix B with .f16 / .bf16 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  (threadID_in_group * 2) + (i & 0x1)           for bi where i <  2
       (threadID_in_group * 2) + (i & 0x1) + 8       for bi where i >= 2

col = groupID
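
The sketch below pairs the coordinate formulas above with the register slot that holds each element. It assumes, from the "low to high" ordering in the table, that b0 and b1 occupy the low and high halves of the first .f16x2 register and b2 and b3 those of the second. The names `BSlotF16` and `b_f16_slot` are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper type: matrix position plus register slot of one B element.
struct BSlotF16 {
    int row;   // row in the 16x8 matrix B
    int col;   // column in the 16x8 matrix B
    int reg;   // index of the .f16x2 register (0 or 1)
    int half;  // 0 = low half, 1 = high half (assumed from "low to high")
};

// Slot and position of element b_i (0 <= i < 4) held by lane `laneid`.
inline BSlotF16 b_f16_slot(int laneid, int i) {
    assert(laneid >= 0 && laneid < 32 && i >= 0 && i < 4);
    int groupID = laneid >> 2;
    int threadID_in_group = laneid % 4;
    int row = threadID_in_group * 2 + (i & 0x1) + (i >= 2 ? 8 : 0);
    int col = groupID;
    return {row, col, i / 2, i % 2};
}
```

For lane 6 (groupID 1, threadID_in_group 2), b3 is the element at row 13, column 1, and under the stated assumption it sits in the high half of the second register.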

.f64:

.btype Fragment Elements (low to high)
.f64 A vector expression containing four .f64 registers, with each register containing one .f64 element from the matrix B. b0, b1, b2, b3

The layout of the fragments held by different threads is shown in Figure 82.

Figure 82 MMA .m16n8k16 fragment layout for matrix B with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  threadID_in_group + (i * 4)           for bi where  i < 4

col =  groupID
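
To illustrate how a lane would collect its fragment, the following host-side sketch gathers b0..b3 from a full matrix B. The row-major storage of B as 16 rows by 8 columns is an assumption made only for this example, and `gather_b_f64` and `demo_b_f64` are hypothetical names.

```cpp
#include <array>
#include <cassert>

// Demo matrix (illustrative only): entry (r, c) holds the value r * 100 + c.
inline std::array<double, 16 * 8> demo_b_f64() {
    std::array<double, 16 * 8> B{};
    for (int r = 0; r < 16; ++r) {
        for (int c = 0; c < 8; ++c) {
            B[r * 8 + c] = r * 100 + c;
        }
    }
    return B;
}

// Gathers the four .f64 elements b0..b3 owned by `laneid` from a
// row-major 16x8 matrix B (row-major storage is an assumption here).
inline std::array<double, 4> gather_b_f64(const std::array<double, 16 * 8>& B,
                                          int laneid) {
    assert(laneid >= 0 && laneid < 32);
    int groupID = laneid >> 2;
    int threadID_in_group = laneid % 4;
    std::array<double, 4> frag{};
    for (int i = 0; i < 4; ++i) {
        int row = threadID_in_group + i * 4;
        int col = groupID;
        frag[i] = B[row * 8 + col];
    }
    return frag;
}
```

With the demo matrix, lane 6 (groupID 1, threadID_in_group 2) gathers rows 2, 6, 10 and 14 of column 1.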

Accumulators (C or D):

.ctype / .dtype Fragment Elements (low to high)
.f64 A vector expression containing four .f64 registers, containing four .f64 elements from the matrix C (or D). c0, c1, c2, c3
.f32 A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D). c0, c1, c2, c3
.f16 A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D). c0, c1, c2, c3

The layout of the fragments held by different threads is shown in Figure 83.

Figure 83 MMA .m16n8k16 fragment layout for accumulator matrix C/D.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  groupID                                   for ci where i <  2
       groupID + 8                               for ci where i >= 2

col =  (threadID_in_group * 2) + (i & 0x1)        for ci where i = {0,..,3}
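
The reverse direction, writing per-lane accumulator fragments back into a full matrix, can be sketched on the host as below. The row-major 16x8 output layout, the `Fragments` alias, and the functions `scatter_accumulators` and `tagged_fragments` are assumptions for illustration; the example uses .f32 values.

```cpp
#include <array>
#include <cassert>

// Hypothetical container: four accumulator elements for each of the 32 lanes.
using Fragments = std::array<std::array<float, 4>, 32>;

// Demo input (illustrative only): element c_i of lane L holds the tag L * 10 + i.
inline Fragments tagged_fragments() {
    Fragments frags{};
    for (int lane = 0; lane < 32; ++lane) {
        for (int i = 0; i < 4; ++i) {
            frags[lane][i] = static_cast<float>(lane * 10 + i);
        }
    }
    return frags;
}

// Scatters every lane's c0..c3 into a row-major 16x8 matrix (assumed layout).
inline std::array<float, 16 * 8> scatter_accumulators(const Fragments& frags) {
    std::array<float, 16 * 8> D{};
    for (int laneid = 0; laneid < 32; ++laneid) {
        int groupID = laneid >> 2;
        int threadID_in_group = laneid % 4;
        for (int i = 0; i < 4; ++i) {
            int row = (i < 2) ? groupID : groupID + 8;
            int col = threadID_in_group * 2 + (i & 0x1);
            D[row * 8 + col] = frags[laneid][i];
        }
    }
    return D;
}
```

With the tagged input, the entry at row 9, column 2 receives c2 of lane 5, and the entry at row 15, column 7 receives c3 of lane 31.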

9.7.14.5.9. Matrix Fragments for mma.m16n8k16 with integer type

A warp executing mma.m16n8k16 will compute an MMA operation of shape .m16n8k16.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

Multiplicand A:

.atype Fragment Elements (low to high)
.u8 / .s8 A vector expression containing two .b32 registers, with each register containing four .u8 / .s8 elements from the matrix A. a0, a1, a2, a3, a4, a5, a6, a7
.e4m3 / .e5m2 A vector expression containing two .b32 registers, with each register containing four .e4m3 / .e5m2 elements from the matrix A. a0, a1, a2, a3, a4, a5, a6, a7

The layout of the fragments held by different threads is shown in Figure 84.

Figure 84 MMA .m16n8k16 fragment layout for matrix A with .u8 / .s8 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  groupID                                for ai where i < 4
       groupID + 8                            for ai where i >= 4

col =  (threadID_in_group * 4) + (i & 0x3)    for ai where i = {0,..,7}
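
The following host-side sketch packs one lane's eight 8-bit elements into two 32-bit words according to the formulas above. It assumes that, within each .b32 register, the lowest-indexed element occupies the least significant byte (a reading of "low to high" that the text does not spell out) and that A is stored row-major. The names `demo_a_u8` and `pack_a_u8` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Demo matrix (illustrative only): entry (r, c) of the 16x16 matrix holds r * 16 + c.
inline std::array<std::uint8_t, 16 * 16> demo_a_u8() {
    std::array<std::uint8_t, 16 * 16> A{};
    for (int r = 0; r < 16; ++r) {
        for (int c = 0; c < 16; ++c) {
            A[r * 16 + c] = static_cast<std::uint8_t>(r * 16 + c);
        }
    }
    return A;
}

// Packs a0..a3 into regs[0] and a4..a7 into regs[1] for lane `laneid`.
// Byte order inside each register is an assumption: a_i with the lowest
// index goes into the least significant byte.
inline std::array<std::uint32_t, 2> pack_a_u8(
        const std::array<std::uint8_t, 16 * 16>& A, int laneid) {
    assert(laneid >= 0 && laneid < 32);
    int groupID = laneid >> 2;
    int threadID_in_group = laneid % 4;
    std::array<std::uint32_t, 2> regs{};
    for (int i = 0; i < 8; ++i) {
        int row = (i < 4) ? groupID : groupID + 8;
        int col = threadID_in_group * 4 + (i & 0x3);
        std::uint32_t byte = A[row * 16 + col];
        regs[i / 4] |= byte << (8 * (i % 4));
    }
    return regs;
}
```

For lane 5 (groupID 1, threadID_in_group 1), the first register collects row 1, columns 4 to 7, and the second collects row 9, columns 4 to 7.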

Multiplicand B:

.btype Fragment Elements (low to high)
.u8 / .s8 A vector expression containing a single .b32 register, containing four .u8 / .s8 elements from the matrix B. b0, b1, b2, b3
.e4m3 / .e5m2 A vector expression containing a single .b32 register, containing four .e4m3 / .e5m2 elements from the matrix B. b0, b1, b2, b3

The layout of the fragments held by different threads is shown in Figure 85.

Figure 85 MMA .m16n8k16 fragment layout for matrix B with .u8 / .s8 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  (threadID_in_group * 4) + i         for bi where i = {0,..,3}

col = groupID
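
A consequence of these formulas is that the four elements of a lane occupy consecutive rows of a single column. If the 16x8 matrix B is stored column-major (an assumption for this sketch), element bi of lane L therefore lands at byte offset L * 4 + i, so each lane's fragment is one contiguous 32-bit word. The function names below are hypothetical.

```cpp
#include <cassert>

// Byte offset of element b_i (0 <= i < 4) of lane `laneid` inside a
// 16x8 matrix B of 8-bit elements stored column-major (assumed storage).
inline int b_u8_colmajor_offset(int laneid, int i) {
    assert(laneid >= 0 && laneid < 32 && i >= 0 && i < 4);
    int groupID = laneid >> 2;
    int threadID_in_group = laneid % 4;
    int row = threadID_in_group * 4 + i;
    int col = groupID;
    return col * 16 + row;  // 16 rows per column
}

// True when every lane's four elements sit at offsets laneid * 4 + i.
inline bool b_u8_offsets_are_contiguous() {
    for (int lane = 0; lane < 32; ++lane) {
        for (int i = 0; i < 4; ++i) {
            if (b_u8_colmajor_offset(lane, i) != lane * 4 + i) return false;
        }
    }
    return true;
}
```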

Accumulators (C or D):

.ctype / .dtype Fragment Elements (low to high)
.s32 A vector expression containing four .s32 registers, containing four .s32 elements from the matrix C (or D). c0, c1, c2, c3
.f32 A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D). c0, c1, c2, c3
.f16 A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D). c0, c1, c2, c3

The layout of the fragments held by different threads is shown in Figure 86.

Figure 86 MMA .m16n8k16 fragment layout for accumulator matrix C/D with .s32 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  groupID                               for ci where i <  2
       groupID + 8                           for ci where i >= 2

col =  (threadID_in_group * 2) + (i & 0x1)    for ci where i = {0,..,3}
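
It is sometimes more convenient to go the other way and ask which lane and which element index own a given accumulator entry. The sketch below inverts the formulas above; `AccumOwner` and `accum_owner` are hypothetical names, and the inversion is derived here rather than stated in the text.

```cpp
#include <cassert>

// Hypothetical helper type: the lane and element index that own one entry of C/D.
struct AccumOwner { int laneid; int i; };

// Inverts row = groupID (+ 8 when i >= 2) and
// col = threadID_in_group * 2 + (i & 0x1) for the 16x8 accumulator.
inline AccumOwner accum_owner(int row, int col) {
    assert(row >= 0 && row < 16 && col >= 0 && col < 8);
    int groupID = row % 8;            // rows r and r + 8 belong to the same group
    int threadID_in_group = col / 2;  // each thread owns two adjacent columns
    int laneid = groupID * 4 + threadID_in_group;
    int i = (col & 0x1) + (row >= 8 ? 2 : 0);
    return {laneid, i};
}
```

For example, the entry at row 9, column 2 is c2 of lane 5, which matches the forward formulas for groupID 1 and threadID_in_group 1.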