9.7.14.5. Matrix Fragments for mma.m16n8k8
A warp executing mma.m16n8k8 will compute an MMA operation of shape .m16n8k8.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
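Every layout in this section is indexed by two values derived from a thread's lane ID: groupID = %laneid >> 2 and threadID_in_group = %laneid % 4. The following is a minimal host-side sketch of that decomposition, showing that the 32 lanes of a warp split into 8 groups of 4 threads. The type and function names (LaneCoord, decompose_lane) are illustrative assumptions and are not part of PTX.

```cpp
// Hypothetical helper type; only the two formulas come from this section.
struct LaneCoord {
    unsigned groupID;            // %laneid >> 2, in [0, 7]
    unsigned threadID_in_group;  // %laneid % 4,  in [0, 3]
};

// Decompose a lane ID (0..31) the same way the fragment formulas do.
constexpr LaneCoord decompose_lane(unsigned laneid) {
    return LaneCoord{laneid >> 2, laneid % 4};
}
```

For instance, lane 13 decomposes into groupID 3 and threadID_in_group 1.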
Multiplicand A:
.f16 and .bf16:
| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .f16 / .bf16 | A vector expression containing two .f16x2 registers, with each register containing two .f16 / .bf16 elements from the matrix A. | a0, a1, a2, a3 |
The layout of the fragments held by different threads is shown in Figure 71.

Figure 71 MMA .m16n8k8 fragment layout for matrix A with .f16 / .bf16 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID      for a0 and a1
      groupID + 8  for a2 and a3
col = threadID_in_group * 2 + (i & 0x1)  for ai where i = {0,..,3}
.tf32:
| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .tf32 | A vector expression containing four .b32 registers, containing four .tf32 elements from the matrix A. | a0, a1, a2, a3 |
The layout of the fragments held by different threads is shown in Figure 72.

Figure 72 MMA .m16n8k8 fragment layout for matrix A with .tf32 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID      for a0 and a2
      groupID + 8  for a1 and a3
col = threadID_in_group      for a0 and a1
      threadID_in_group + 4  for a2 and a3
.f64:
| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression containing four .f64 registers, containing four .f64 elements from the matrix A. | a0, a1, a2, a3 |
The layout of the fragments held by different threads is shown in Figure 73.

Figure 73 MMA .m16n8k8 fragment layout for matrix A with .f64 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID      for a0 and a2
      groupID + 8  for a1 and a3
col = threadID_in_group      for a0 and a1
      threadID_in_group + 4  for a2 and a3
Multiplicand B:
.f16 and .bf16:
| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .f16 / .bf16 | A vector expression containing a single .f16x2 register, containing two .f16 / .bf16 elements from the matrix B. | b0, b1 |
The layout of the fragments held by different threads is shown in Figure 74.

Figure 74 MMA .m16n8k8 fragment layout for matrix B with .f16 / .bf16 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 2) + i for bi where i = {0, 1}
col = groupID
.tf32:
| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .tf32 | A vector expression containing two .b32 registers, containing two .tf32 elements from the matrix B. | b0, b1 |
The layout of the fragments held by different threads is shown in Figure 75.

Figure 75 MMA .m16n8k8 fragment layout for matrix B with .tf32 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = threadID_in_group      for b0
      threadID_in_group + 4  for b1
col = groupID
.f64:
| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression containing two .f64 registers, containing two .f64 elements from the matrix B. | b0, b1 |
The layout of the fragments held by different threads is shown in Figure 76.

Figure 76 MMA .m16n8k8 fragment layout for matrix B with .f64 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = threadID_in_group      for b0
      threadID_in_group + 4  for b1
col = groupID
Accumulators (C or D):
.f16, .bf16 and .tf32:
| .ctype / .dtype | Fragment | Elements (low to high) |
|---|---|---|
| .f16 | A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D). | c0, c1, c2, c3 |
| .f32 | A vector expression of four .f32 registers. | c0, c1, c2, c3 |
The layout of the fragments held by different threads is shown in Figure 77.

Figure 77 MMA .m16n8k8 fragment layout for accumulator matrix C/D with .f16x2/.f32 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID      for c0 and c1
      groupID + 8  for c2 and c3
col = (threadID_in_group * 2) + (i & 0x1)  for ci where i = {0,..,3}
.f64:
| .ctype / .dtype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression of four .f64 registers containing four .f64 elements from the matrix C (or D). | c0, c1, c2, c3 |
The layout of the fragments held by different threads is shown in Figure 78.

Figure 78 MMA .m16n8k8 fragment layout for accumulator matrix C/D with .f64 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID      for c0 and c1
      groupID + 8  for c2 and c3
col = (threadID_in_group * 2) + (i & 0x1)  for ci where i = {0,..,3}
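The row and column formulas in this section can be collected into a small host-side sketch that maps (lane, element index) to matrix coordinates. It assumes the usual operand shapes for .m16n8k8: A is 16x8 (M x K), B is 8x8 (K x N), and C/D is 16x8 (M x N). All names here (Coord, Layout, a_coord, b_coord, c_coord, covers_exactly_once, check_all_layouts) are hypothetical helpers for illustration. Only the index arithmetic comes from the formulas above.

```cpp
#include <vector>

struct Coord { int row; int col; };

// .f16/.bf16 share one layout; .tf32 and .f64 share another (for A and B).
enum class Layout { F16_BF16, TF32_F64 };

// A fragment: i selects a0..a3.
inline Coord a_coord(Layout layout, int laneid, int i) {
    const int g = laneid >> 2;  // groupID
    const int t = laneid % 4;   // threadID_in_group
    if (layout == Layout::F16_BF16)
        return { g + (i >= 2 ? 8 : 0), t * 2 + (i & 0x1) };
    // .tf32 / .f64: a1, a3 use rows 8..15; a2, a3 use columns 4..7.
    return { g + ((i & 0x1) ? 8 : 0), t + (i >= 2 ? 4 : 0) };
}

// B fragment: i selects b0..b1.
inline Coord b_coord(Layout layout, int laneid, int i) {
    const int g = laneid >> 2;
    const int t = laneid % 4;
    if (layout == Layout::F16_BF16)
        return { t * 2 + i, g };
    return { t + (i == 1 ? 4 : 0), g };
}

// C/D fragment: i selects c0..c3; the same formula is given for all accumulator types.
inline Coord c_coord(int laneid, int i) {
    const int g = laneid >> 2;
    const int t = laneid % 4;
    return { g + (i >= 2 ? 8 : 0), t * 2 + (i & 0x1) };
}

// Returns true if 32 lanes, each holding elems_per_thread elements,
// touch every cell of a rows x cols matrix exactly once.
template <typename F>
bool covers_exactly_once(int rows, int cols, int elems_per_thread, F coord_of) {
    std::vector<int> hits(rows * cols, 0);
    for (int lane = 0; lane < 32; ++lane) {
        for (int i = 0; i < elems_per_thread; ++i) {
            const Coord c = coord_of(lane, i);
            if (c.row < 0 || c.row >= rows || c.col < 0 || c.col >= cols) return false;
            ++hits[c.row * cols + c.col];
        }
    }
    for (int h : hits)
        if (h != 1) return false;
    return true;
}

// Sanity check over every layout described in this section.
inline bool check_all_layouts() {
    const Layout layouts[] = { Layout::F16_BF16, Layout::TF32_F64 };
    for (Layout l : layouts) {
        if (!covers_exactly_once(16, 8, 4, [l](int lane, int i) { return a_coord(l, lane, i); })) return false;
        if (!covers_exactly_once(8, 8, 2, [l](int lane, int i) { return b_coord(l, lane, i); })) return false;
    }
    return covers_exactly_once(16, 8, 4, [](int lane, int i) { return c_coord(lane, i); });
}
```

As a usage example, a_coord(Layout::F16_BF16, 5, 3) yields row 9, column 3: lane 5 has groupID 1 and threadID_in_group 1, and a3 lies in rows 8 to 15. The coverage check is a quick way to confirm that a transcription of the formulas assigns each matrix element to exactly one thread.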