9.7.14.5.12. Matrix Fragments for mma.m16n8k128
A warp executing mma.m16n8k128 will compute an MMA operation of shape .m16n8k128.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
Multiplicand A:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.b1 |
A vector expression containing two .b32 registers, with each register containing thirty two .b1 elements from the matrix A. |
a0, a1, …, a62, a63 |
The layout of the fragments held by different threads is shown in Figure 97.
!MMA .m16n8k128 fragment layout for matrix A with .b1 type
Figure 97 MMA .m16n8k128 fragment layout for matrix A with .b1 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ai where i < 32
groupID + 8 for ai where i >= 32
col = (threadID_in_group * 32) + (i & 0x1F) for ai where i = {0, ...,63}Multiplicand B:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.b1 |
A vector expression containing a single .b32 register containing thirty two .b1 elements from the matrix B. |
b0, b1, … , b30, b31 |
The layout of the fragments held by different threads is shown in Figure 98.
!MMA .m16n8k128 fragment layout for matrix B with .b1 type
Figure 98 MMA .m16n8k128 fragment layout for matrix B with .b1 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 32) + i for bi where i = {0,...,31}
col = groupIDAccumulators (C or D):
.ctype / .dtype |
Fragment | Elements (low to high) |
|---|---|---|
.s32 |
A vector expression containing four .s32 registers, containing four .s32 elements from the matrix C (or D). |
c0, c1, c2, c3 |
The layout of the fragments held by different threads is shown in Figure 99.
!MMA .m16n8k128 fragment layout for accumulator matrix C/D with .s32 type
Figure 99 MMA .m16n8k128 fragment layout for accumulator matrix C/D with .s32 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ci where i < 2
groupID + 8 for ci where i >= 2
col = (threadID_in_group * 2) + (i & 0x1) for ci where i = {0, 1, 2, 3}9.7.14.5.13. Matrix Fragments for mma.m16n8k256
A warp executing mma.m16n8k256 will compute an MMA operation of shape .m16n8k256.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
Multiplicand A:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.b1 |
A vector expression containing four .b32 registers, with each register containing thirty two .b1 elements from the matrix A. |
a0, a1, …, a126, a127 |
The layout of the fragments held by different threads is shown in Figure 100.
!MMA .m16n8k256 fragment layout for matrix A with .b1 type
Figure 100 MMA .m16n8k256 fragment layout for matrix A with .b1 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ai where 0 <= i < 32 || 64 <= i < 96
groupID + 8 otherwise
col = (threadID_in_group * 32) + i for ai where i < 64
(threadID_in_group * 32) + (i & 0x1F) + 128 for ai where i >= 64Multiplicand B:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.b1 |
A vector expression containing two .b32 registers, with each register containing thirty two .b1 elements from the matrix B. |
b0, b1, …, b62, b63 |
The layout of the fragments held by different threads is shown in Figure 101 and Figure 102.
!MMA .m16n8k256 fragment layout for rows 0–127 of matrix B with .b1 type
Figure 101 MMA .m16n8k256 fragment layout for rows 0–127 of matrix B with .b1 type.
!MMA .m16n8k256 fragment layout for rows 128–255 of matrix B with .b1 type
Figure 102 MMA .m16n8k256 fragment layout for rows 128–255 of matrix B with .b1 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 32) + (i & 0x1F) for bi where i < 32
(threadID_in_group * 32) + (i & 0x1F) + 128 for bi where i >= 32
col = groupIDAccumulators (C or D):
.ctype / .dtype |
Fragment | Elements (low to high) |
|---|---|---|
.s32 |
A vector expression containing four .s32 registers, containing four .s32 elements from the matrix C (or D). |
c0, c1, c2, c3 |
The layout of the fragments held by different threads is shown in Figure 103.
!MMA .m16n8k256 fragment layout for accumulator matrix C/D with .s32 type
Figure 103 MMA .m16n8k256 fragment layout for accumulator matrix C/D with .s32 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ci where i < 2
groupID + 8 for ci where i >= 2
col = (threadID_in_group * 2) + (i & 0x1) for ci where i = {0, 1, 2, 3}