9.7.14.5. Matrix Fragments for mma.m8n8k4
9.7.14.5.1. Matrix Fragments for mma.m8n8k4 with .f16 floating point type
A warp executing mma.m8n8k4 with .f16 floating point type will compute 4 MMA operations of shape .m8n8k4.
Elements of 4 matrices need to be distributed across the threads in a warp. The following table shows distribution of matrices for MMA operations.
| MMA Computation | Threads participating in MMA computation |
|---|---|
| MMA computation 1 | Threads with %laneid 0–3 (low group) and 16–19 (high group) |
| MMA computation 2 | Threads with %laneid 4–7 (low group) and 20–23 (high group) |
| MMA computation 3 | Threads with %laneid 8–11 (low group) and 24–27 (high group) |
| MMA computation 4 | Threads with %laneid 12–15 (low group) and 28–31 (high group) |
For each of the individual MMA computation shown above, each of the required thread holds a fragment of the matrix for performing mma operation as follows:
Multiplicand A:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.f16 |
A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix A. |
a0, a1, a2, a3 |
The layout of the fragments held by different threads is shown below:
Fragment layout for Row Major matrix A is shown in Figure 46.

Figure 46 MMA .m8n8k4 fragment layout for row-major matrix A with .f16 type
The row and column of a matrix fragment can be computed as:
row = %laneid % 4 if %laneid < 16
(%laneid % 4) + 4 otherwise
col = i for ai where i = {0,..,3}Fragment layout for Column Major matrix A is shown in Figure 47.
The layout of the fragments held by different threads is shown below:

Figure 47 MMA .m8n8k4 fragment layout for column-major matrix A with .f16 type
The row and column of a matrix fragment can be computed as:
row = i % 4 for ai where i = {0,..,3} if %laneid < 16
(i % 4) + 4 for ai where i = {0,..,3} otherwise
col = %laneid % 4Multiplicand B:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.f16 |
A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix B. |
b0, b1, b2, b3 |
The layout of the fragments held by different threads is shown below:
Fragment layout for Row Major matrix B is shown in Figure 48.

Figure 48 MMA .m8n8k4 fragment layout for row-major matrix B with .f16 type
The row and column of a matrix fragment can be computed as:
row = %laneid % 4
col = i for bi where i = {0,..,3} if %laneid < 16
i+4 for bi where i = {0,..,3} otherwiseFragment layout for Column Major matrix B is shown in Figure 49.

Figure 49 MMA .m8n8k4 fragment layout for column-major matrix B with .f16 type
The row and column of a matrix fragment can be computed as:
row = i for bi where i = {0,..,3}
col = %laneid % 4 if %laneid < 16
(%laneid % 4) + 4 otherwiseAccumulators C (or D):
.ctype / .dtype |
Fragment | Elements (low to high) |
|---|---|---|
.f16 |
A vector expression containing four .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D). |
c0, c1, c2, c3, c4, c5, c6, c7 |
.f32 |
A vector expression of eight .f32 registers. |
The layout of the fragments held by different threads is shown below:
Fragment layout for accumulator matrix when .ctype is .f16 is shown in Figure 50.

Figure 50 MMA .m8n8k4 fragment layout for matrix C/D with .ctype = .f16
The row and column of a matrix fragment can be computed as:
row = %laneid % 4 if %laneid < 16
(%laneid % 4) + 4 otherwise
col = i for ci where i = {0,..,7}Fragment layout for accumulator matrix when .ctype is .f32 is shown in Figure 51 and Figure 52.

Figure 51 MMA .m8n8k4 computation 1 and 2 fragment layout for matrix C/D with .ctype = .f32

Figure 52 MMA .m8n8k4 computation 3 and 4 fragment layout for matrix C/D with .ctype = .f32
The row and column of a matrix fragment can be computed as:
row = X if %laneid < 16
X + 4 otherwise
where X = (%laneid & 0b1) + (i & 0b10) for ci where i = {0,..,7}
col = (i & 0b100) + (%laneid & 0b10) + (i & 0b1) for ci where i = {0,..,7}9.7.14.5.2. Matrix Fragments for mma.m8n8k4 with .f64 floating point type
A warp executing mma.m8n8k4 with .f64 floating point type will compute an MMA operation of shape .m8n8k4.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
Multiplicand A:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.f64 |
A vector expression containing a single .f64 register, containing single .f64 element from the matrix A. |
a0 |
The layout of the fragments held by different threads is shown in Figure 53.

Figure 53 MMA .m8n8k4 fragment layout for matrix A with .f64 type
The row and column of a matrix fragment can be computed as:
row = %laneid >> 2
col = %laneid % 4Multiplicand B:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.f64 |
A vector expression containing a single .f64 register, containing a single .f64 element from the matrix B. |
b0 |
The layout of the fragments held by different threads is shown in Figure 54.

Figure 54 MMA .m8n8k4 fragment layout for matrix B with .f64 type
The row and column of a matrix fragment can be computed as:
row = %laneid % 4
col = %laneid >> 2Accumulators (C or D):
.ctype / .dtype |
Fragment | Elements (low to high) |
|---|---|---|
.f64 |
A vector expression containing of two .f64 registers containing two .f64 elements from the matrix C. |
c0, c1 |
The layout of the fragments held by different threads is shown in Figure 55.

Figure 55 MMA .m8n8k4 fragment layout for accumulator matrix C/D with .f64 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID
col = (threadID_in_group * 2) + (i & 0x1) for ci where i = {0, 1}