9.7.14.5. Matrix Fragments for mma.m8n8k32
A warp executing mma.m8n8k32 will compute an MMA operation of shape .m8n8k32.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
Multiplicand A:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.s4 / .u4 |
A vector expression containing a single .b32 register, containing eight .s4 or .u4 elements from the matrix A. |
a0, a1, a2, a3, a4, a5, a6, a7 |
The layout of the fragments held by different threads is shown in Figure 59.

Figure 59 MMA .m8n8k32 fragment layout for matrix A with .u4/.s4 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID
col = (threadID_in_group * 8) + i for ai where i = {0,..,7}Multiplicand B:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.s4 / .u4 |
A vector expression containing a single .b32 register, containing eight .s4 or .u4 elements from the matrix B. |
b0, b1, b2, b3, b4, b5, b6, b7 |
The layout of the fragments held by different threads is shown in Figure 60.

Figure 60 MMA .m8n8k32 fragment layout for matrix B with .u4/.s4 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 8) + i for bi where i = {0,..,7}
col = groupIDAccumulators (C or D):
.ctype / .dtype |
Fragment | Elements (low to high) |
|---|---|---|
.s32 |
A vector expression of two .s32 registers. |
c0, c1 |
The layout of the fragments held by different threads is shown in Figure 61:

Figure 61 MMA .m8n8k32 fragment layout for accumulator matrix C/D with .s32 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID
col = (threadID_in_group * 2) + i for ci where i = {0, 1}