9.7.14.5. Matrix Fragments for mma.m8n8k16
A warp executing mma.m8n8k16 will compute an MMA operation of shape .m8n8k16.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
Multiplicand A:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.s8 / .u8 |
A vector expression containing a single .b32 register, containing four .s8 or .u8 elements from the matrix A. |
a0, a1, a2, a3 |
The layout of the fragments held by different threads is shown in Figure 56.

Figure 56 MMA .m8n8k16 fragment layout for matrix A with .u8/.s8 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID
col = (threadID_in_group * 4) + i for ai where i = {0,..,3}Multiplicand B:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.s8 / .u8 |
A vector expression containing a single .b32 register, containing four .s8 or .u8 elements from the matrix B. |
b0, b1, b2, b3 |
The layout of the fragments held by different threads is shown in Figure 57.

Figure 57 MMA .m8n8k16 fragment layout for matrix B with .u8/.s8 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 4) + i for bi where i = {0,..,3}
col = groupIDAccumulators (C or D):
.ctype / .dtype |
Fragment | Elements (low to high) |
|---|---|---|
.s32 |
A vector expression containing of two .s32 registers. |
c0, c1 |
The layout of the fragments held by different threads is shown in Figure 58.

Figure 58 MMA .m8n8k16 fragment layout for accumulator matrix C/D with .s32 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID
col = (threadID_in_group * 2) + i for ci where i = {0, 1}