9.7.14.5. Matrix multiply-accumulate operation using mma instruction
This section describes warp-level mma, ldmatrix, stmatrix, and movmatrix instructions and the organization of various matrices involved in these instructions.
9.7.14.5.1. Matrix Fragments for mma.m8n8k4 with .f16 floating point type
A warp executing mma.m8n8k4 with .f16 floating point type will compute 4 MMA operations of shape .m8n8k4.
Elements of 4 matrices need to be distributed across the threads in a warp. The following table shows distribution of matrices for MMA operations.
| MMA Computation | Threads participating in MMA computation |
|---|---|
| MMA computation 1 | Threads with %laneid 0–3 (low group) and 16–19 (high group) |
| MMA computation 2 | Threads with %laneid 4–7 (low group) and 20–23 (high group) |
| MMA computation 3 | Threads with %laneid 8–11 (low group) and 24–27 (high group) |
| MMA computation 4 | Threads with %laneid 12–15 (low group) and 28–31 (high group) |
For each of the individual MMA computation shown above, each of the required thread holds a fragment of the matrix for performing mma operation as follows:
Multiplicand A:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.f16 |
A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix A. |
a0, a1, a2, a3 |
The layout of the fragments held by different threads is shown below:
Fragment layout for Row Major matrix A is shown in Figure 46.
!MMA .m8n8k4 fragment layout for row-major matrix A with .f16 type
Figure 46 MMA .m8n8k4 fragment layout for row-major matrix A with .f16 type
The row and column of a matrix fragment can be computed as:
row = %laneid % 4 if %laneid < 16
(%laneid % 4) + 4 otherwise
col = i for ai where i = {0,..,3}Fragment layout for Column Major matrix A is shown in Figure 47.
The layout of the fragments held by different threads is shown below:
!MMA .m8n8k4 fragment layout for column-major matrix A with .f16 type
Figure 47 MMA .m8n8k4 fragment layout for column-major matrix A with .f16 type
The row and column of a matrix fragment can be computed as:
row = i % 4 for ai where i = {0,..,3} if %laneid < 16
(i % 4) + 4 for ai where i = {0,..,3} otherwise
col = %laneid % 4Multiplicand B:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.f16 |
A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix B. |
b0, b1, b2, b3 |
The layout of the fragments held by different threads is shown below:
Fragment layout for Row Major matrix B is shown in Figure 48.
!MMA .m8n8k4 fragment layout for row-major matrix B with .f16 type
Figure 48 MMA .m8n8k4 fragment layout for row-major matrix B with .f16 type
The row and column of a matrix fragment can be computed as:
row = %laneid % 4
col = i for bi where i = {0,..,3} if %laneid < 16
i+4 for bi where i = {0,..,3} otherwiseFragment layout for Column Major matrix B is shown in Figure 49.
!MMA .m8n8k4 fragment layout for column-major matrix B with .f16 type
Figure 49 MMA .m8n8k4 fragment layout for column-major matrix B with .f16 type
The row and column of a matrix fragment can be computed as:
row = i for bi where i = {0,..,3}
col = %laneid % 4 if %laneid < 16
(%laneid % 4) + 4 otherwiseAccumulators C (or D):
.ctype / .dtype |
Fragment | Elements (low to high) |
|---|---|---|
.f16 |
A vector expression containing four .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D). |
c0, c1, c2, c3, c4, c5, c6, c7 |
.f32 |
A vector expression of eight .f32 registers. |
The layout of the fragments held by different threads is shown below:
Fragment layout for accumulator matrix when .ctype is .f16 is shown in Figure 50.
!MMA .m8n8k4 fragment layout for matrix C/D with .ctype = .f16
Figure 50 MMA .m8n8k4 fragment layout for matrix C/D with .ctype = .f16
The row and column of a matrix fragment can be computed as:
row = %laneid % 4 if %laneid < 16
(%laneid % 4) + 4 otherwise
col = i for ci where i = {0,..,7}Fragment layout for accumulator matrix when .ctype is .f32 is shown in Figure 51 and Figure 52.
!MMA .m8n8k4 computation 1 and 2 fragment layout for matrix C/D with .ctype = .f32
Figure 51 MMA .m8n8k4 computation 1 and 2 fragment layout for matrix C/D with .ctype = .f32
!MMA .m8n8k4 computation 3 and 4 fragment layout for matrix C/D with .ctype = .f32
Figure 52 MMA .m8n8k4 computation 3 and 4 fragment layout for matrix C/D with .ctype = .f32
The row and column of a matrix fragment can be computed as:
row = X if %laneid < 16
X + 4 otherwise
where X = (%laneid & 0b1) + (i & 0b10) for ci where i = {0,..,7}
col = (i & 0b100) + (%laneid & 0b10) + (i & 0b1) for ci where i = {0,..,7}9.7.14.5.2. Matrix Fragments for mma.m8n8k4 with .f64 floating point type
A warp executing mma.m8n8k4 with .f64 floating point type will compute an MMA operation of shape .m8n8k4.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
Multiplicand A:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.f64 |
A vector expression containing a single .f64 register, containing single .f64 element from the matrix A. |
a0 |
The layout of the fragments held by different threads is shown in Figure 53.
!MMA .m8n8k4 fragment layout for matrix A with .f64 type
Figure 53 MMA .m8n8k4 fragment layout for matrix A with .f64 type
The row and column of a matrix fragment can be computed as:
row = %laneid >> 2
col = %laneid % 4Multiplicand B:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.f64 |
A vector expression containing a single .f64 register, containing a single .f64 element from the matrix B. |
b0 |
The layout of the fragments held by different threads is shown in Figure 54.
!MMA .m8n8k4 fragment layout for matrix B with .f64 type
Figure 54 MMA .m8n8k4 fragment layout for matrix B with .f64 type
The row and column of a matrix fragment can be computed as:
row = %laneid % 4
col = %laneid >> 2Accumulators (C or D):
.ctype / .dtype |
Fragment | Elements (low to high) |
|---|---|---|
.f64 |
A vector expression containing of two .f64 registers containing two .f64 elements from the matrix C. |
c0, c1 |
The layout of the fragments held by different threads is shown in Figure 55.
!MMA .m8n8k4 fragment layout for accumulator matrix C/D with .f64 type
Figure 55 MMA .m8n8k4 fragment layout for accumulator matrix C/D with .f64 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID
col = (threadID_in_group * 2) + (i & 0x1) for ci where i = {0, 1}9.7.14.5.3. Matrix Fragments for mma.m8n8k16
A warp executing mma.m8n8k16 will compute an MMA operation of shape .m8n8k16.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
Multiplicand A:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.s8 / .u8 |
A vector expression containing a single .b32 register, containing four .s8 or .u8 elements from the matrix A. |
a0, a1, a2, a3 |
The layout of the fragments held by different threads is shown in Figure 56.
!MMA .m8n8k16 fragment layout for matrix A with .u8/.s8 type
Figure 56 MMA .m8n8k16 fragment layout for matrix A with .u8/.s8 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID
col = (threadID_in_group * 4) + i for ai where i = {0,..,3}Multiplicand B:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.s8 / .u8 |
A vector expression containing a single .b32 register, containing four .s8 or .u8 elements from the matrix B. |
b0, b1, b2, b3 |
The layout of the fragments held by different threads is shown in Figure 57.
!MMA .m8n8k16 fragment layout for matrix B with .u8/.s8 type
Figure 57 MMA .m8n8k16 fragment layout for matrix B with .u8/.s8 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 4) + i for bi where i = {0,..,3}
col = groupIDAccumulators (C or D):
.ctype / .dtype |
Fragment | Elements (low to high) |
|---|---|---|
.s32 |
A vector expression containing of two .s32 registers. |
c0, c1 |
The layout of the fragments held by different threads is shown in Figure 58.
!MMA .m8n8k16 fragment layout for accumulator matrix C/D with .s32 type
Figure 58 MMA .m8n8k16 fragment layout for accumulator matrix C/D with .s32 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID
col = (threadID_in_group * 2) + i for ci where i = {0, 1}9.7.14.5.4. Matrix Fragments for mma.m8n8k32
A warp executing mma.m8n8k32 will compute an MMA operation of shape .m8n8k32.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
Multiplicand A:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.s4 / .u4 |
A vector expression containing a single .b32 register, containing eight .s4 or .u4 elements from the matrix A. |
a0, a1, a2, a3, a4, a5, a6, a7 |
The layout of the fragments held by different threads is shown in Figure 59.
!MMA .m8n8k32 fragment layout for matrix A with .u4/.s4 type
Figure 59 MMA .m8n8k32 fragment layout for matrix A with .u4/.s4 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID
col = (threadID_in_group * 8) + i for ai where i = {0,..,7}Multiplicand B:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.s4 / .u4 |
A vector expression containing a single .b32 register, containing eight .s4 or .u4 elements from the matrix B. |
b0, b1, b2, b3, b4, b5, b6, b7 |
The layout of the fragments held by different threads is shown in Figure 60.
!MMA .m8n8k32 fragment layout for matrix B with .u4/.s4 type
Figure 60 MMA .m8n8k32 fragment layout for matrix B with .u4/.s4 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 8) + i for bi where i = {0,..,7}
col = groupIDAccumulators (C or D):
.ctype / .dtype |
Fragment | Elements (low to high) |
|---|---|---|
.s32 |
A vector expression of two .s32 registers. |
c0, c1 |
The layout of the fragments held by different threads is shown in Figure 61:
!MMA .m8n8k32 fragment layout for accumulator matrix C/D with .s32 type
Figure 61 MMA .m8n8k32 fragment layout for accumulator matrix C/D with .s32 type
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID
col = (threadID_in_group * 2) + i for ci where i = {0, 1}