9.7.14.5.5. Matrix Fragments for mma.m8n8k128
A warp executing mma.m8n8k128 will compute an MMA operation of shape .m8n8k128.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
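As an informal illustration of this distribution for the .b1 multiplicands, the following host-side C++ sketch applies the row and column formulas listed in this subsection for matrices A and B. The names `Coord`, `a_coord_m8n8k128`, `b_coord_m8n8k128`, `BitMatrixA`, and `pack_a_fragment` are hypothetical and are not part of PTX. The sketch assumes that element ai occupies bit i of the .b32 register, following the low-to-high element order given in the fragment tables.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Matrix coordinate of one fragment element (hypothetical helper type).
struct Coord { unsigned row; unsigned col; };

// Coordinate of element a_i (i = 0..31) held by lane `laneid`
// in the 8x128 matrix A with .b1 type.
Coord a_coord_m8n8k128(unsigned laneid, unsigned i) {
    assert(laneid < 32 && i < 32);
    unsigned groupID = laneid >> 2;
    unsigned threadID_in_group = laneid % 4;
    return {groupID, threadID_in_group * 32 + i};
}

// Coordinate of element b_i (i = 0..31) held by lane `laneid`
// in the 128x8 matrix B with .b1 type.
Coord b_coord_m8n8k128(unsigned laneid, unsigned i) {
    assert(laneid < 32 && i < 32);
    unsigned groupID = laneid >> 2;
    unsigned threadID_in_group = laneid % 4;
    return {threadID_in_group * 32 + i, groupID};
}

// Host-side view of matrix A as 8 rows of 128 single-bit elements.
using BitMatrixA = std::array<std::array<bool, 128>, 8>;

// Builds the 32-bit value one lane would hold for its A fragment.
// Assumption: a_i is stored in bit i (a0 in the least significant bit).
std::uint32_t pack_a_fragment(const BitMatrixA& A, unsigned laneid) {
    std::uint32_t reg = 0;
    for (unsigned i = 0; i < 32; ++i) {
        Coord c = a_coord_m8n8k128(laneid, i);
        if (A[c.row][c.col]) {
            reg |= (std::uint32_t{1} << i);
        }
    }
    return reg;
}
```

For example, under these assumptions lane 5 has groupID 1 and threadID_in_group 1, so its element a3 corresponds to row 1, column 35 of A, and its element b3 corresponds to row 35, column 1 of B.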
Multiplicand A:
| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .b1 | A vector expression containing a single .b32 register, containing thirty-two .b1 elements from the matrix A. | a0, a1, …, a30, a31 |

The layout of the fragments held by different threads is shown in Figure 62.
Figure 62 MMA .m8n8k128 fragment layout for matrix A with .b1 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID
col = (threadID_in_group * 32) + i for ai where i = {0,..,31}
Multiplicand B:
| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .b1 | A vector expression containing a single .b32 register, containing thirty-two .b1 elements from the matrix B. | b0, b1, …, b30, b31 |

The layout of the fragments held by different threads is shown in Figure 63.
Figure 63 MMA .m8n8k128 fragment layout for matrix B with .b1 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 32) + i for bi where i = {0,..,31}
col = groupID
Accumulators (C or D):
| .ctype / .dtype | Fragment | Elements (low to high) |
|---|---|---|
| .s32 | A vector expression containing two .s32 registers, containing two .s32 elements from the matrix C (or D). | c0, c1 |

The layout of the fragments held by different threads is shown in Figure 64.
Figure 64 MMA .m8n8k128 fragment layout for accumulator matrix C/D with .s32 type.
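As a rough illustration of the accumulator layout, the following host-side C++ sketch applies the row and column formulas stated for this fragment to reassemble the 8x8 matrix from per-lane values, and to look up which lane holds a given element. The names `AccumFragments`, `AccumMatrix`, `FragOwner`, `gather_accumulator_m8n8k128`, and `c_owner_m8n8k128` are hypothetical and are not part of PTX.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// frags[laneid][i] is element c_i (i = 0, 1) held by lane `laneid`.
using AccumFragments = std::array<std::array<std::int32_t, 2>, 32>;
// Row-major 8x8 accumulator matrix C (or D).
using AccumMatrix = std::array<std::array<std::int32_t, 8>, 8>;

// Reassembles the 8x8 accumulator from the per-lane fragments.
AccumMatrix gather_accumulator_m8n8k128(const AccumFragments& frags) {
    AccumMatrix D{};
    for (unsigned laneid = 0; laneid < 32; ++laneid) {
        unsigned groupID = laneid >> 2;
        unsigned threadID_in_group = laneid % 4;
        for (unsigned i = 0; i < 2; ++i) {
            unsigned row = groupID;
            unsigned col = threadID_in_group * 2 + i;
            D[row][col] = frags[laneid][i];
        }
    }
    return D;
}

// Lane and element index that hold a given accumulator element.
struct FragOwner { unsigned laneid; unsigned i; };

// Inverse of the mapping above: row selects the group, col / 2 selects
// the thread within the group, and col % 2 selects c0 or c1.
FragOwner c_owner_m8n8k128(unsigned row, unsigned col) {
    assert(row < 8 && col < 8);
    return {row * 4 + col / 2, col % 2};
}
```

For example, lane 13 has groupID 3 and threadID_in_group 1, so its c1 corresponds to row 3, column 3 of the accumulator.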
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID
col = (threadID_in_group * 2) + i for ci where i = {0, 1}
9.7.14.5.6. Matrix Fragments for mma.m16n8k4
A warp executing mma.m16n8k4 will compute an MMA operation of shape .m16n8k4.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
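As an informal illustration of this distribution for the multiplicands, the following host-side C++ sketch applies the row and column formulas listed in this subsection for matrices A (16x4) and B (4x8). The formulas are the same for the .tf32 and .f64 variants, so the sketch does not depend on the element type. The names `Coord`, `a_coord_m16n8k4`, `b_coord_m16n8k4`, and `a_fragments_cover_matrix_once` are hypothetical and are not part of PTX.

```cpp
#include <array>
#include <cassert>

// Matrix coordinate of one fragment element (hypothetical helper type).
struct Coord { unsigned row; unsigned col; };

// Coordinate of element a_i (i = 0, 1) held by lane `laneid` in matrix A:
// a0 lies in row groupID, a1 in row groupID + 8.
Coord a_coord_m16n8k4(unsigned laneid, unsigned i) {
    assert(laneid < 32 && i < 2);
    unsigned groupID = laneid >> 2;
    unsigned threadID_in_group = laneid % 4;
    return {groupID + 8 * i, threadID_in_group};
}

// Coordinate of the single element b0 held by lane `laneid` in matrix B.
Coord b_coord_m16n8k4(unsigned laneid) {
    assert(laneid < 32);
    unsigned groupID = laneid >> 2;
    unsigned threadID_in_group = laneid % 4;
    return {threadID_in_group, groupID};
}

// Checks that the 32 lanes x 2 elements cover each of the 16x4 elements
// of A exactly once.
bool a_fragments_cover_matrix_once() {
    std::array<std::array<int, 4>, 16> hits{};
    for (unsigned laneid = 0; laneid < 32; ++laneid) {
        for (unsigned i = 0; i < 2; ++i) {
            Coord c = a_coord_m16n8k4(laneid, i);
            ++hits[c.row][c.col];
        }
    }
    for (const auto& row : hits) {
        for (int h : row) {
            if (h != 1) return false;
        }
    }
    return true;
}
```

For example, lane 7 has groupID 1 and threadID_in_group 3, so it holds A elements at (row 1, column 3) and (row 9, column 3), and the B element at (row 3, column 1).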
Multiplicand A:
.tf32:
| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .tf32 | A vector expression containing two .b32 registers, containing two .tf32 elements from the matrix A. | a0, a1 |

The layout of the fragments held by different threads is shown in Figure 65.
Figure 65 MMA .m16n8k4 fragment layout for matrix A with .tf32 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID          for a0
      groupID + 8      for a1
col = threadID_in_group
.f64:
| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression containing two .f64 registers, containing two .f64 elements from the matrix A. | a0, a1 |

The layout of the fragments held by different threads is shown in Figure 66.
Figure 66 MMA .m16n8k4 fragment layout for matrix A with .f64 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID          for a0
      groupID + 8      for a1
col = threadID_in_group
Multiplicand B:
.tf32:
| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .tf32 | A vector expression of a single .b32 register, containing a single .tf32 element from the matrix B. | b0 |

The layout of the fragments held by different threads is shown in Figure 67.
Figure 67 MMA .m16n8k4 fragment layout for matrix B with .tf32 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = threadID_in_group
col = groupID
.f64:
| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression of a single .f64 register, containing a single .f64 element from the matrix B. | b0 |

The layout of the fragments held by different threads is shown in Figure 68.
Figure 68 MMA .m16n8k4 fragment layout for matrix B with .f64 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = threadID_in_group
col = groupID
Accumulators (C or D):
.tf32:
| .ctype / .dtype | Fragment | Elements (low to high) |
|---|---|---|
| .f32 | A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D). | c0, c1, c2, c3 |

The layout of the fragments held by different threads is shown in Figure 69.
Figure 69 MMA .m16n8k4 fragment layout for accumulator matrix C/D with .f32 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID          for c0 and c1
      groupID + 8      for c2 and c3
col = (threadID_in_group * 2) + (i & 0x1) for ci where i = {0,..,3}
.f64:
| .ctype / .dtype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression containing four .f64 registers, containing four .f64 elements from the matrix C (or D). | c0, c1, c2, c3 |

The layout of the fragments held by different threads is shown in Figure 70.
Figure 70 MMA .m16n8k4 fragment layout for accumulator matrix C/D with .f64 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID          for c0 and c1
      groupID + 8      for c2 and c3
col = (threadID_in_group * 2) + (i & 0x1) for ci where i = {0,..,3}
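The accumulator mapping above can be sketched in host-side C++ as follows. The formulas are identical for the .f32 and .f64 accumulators, so the gather helper is written as a template over the element type. The names `Coord`, `c_coord_m16n8k4`, and `gather_accumulator_m16n8k4` are hypothetical and are not part of PTX; the sketch only models the index arithmetic, not the instruction itself.

```cpp
#include <array>
#include <cassert>

// Matrix coordinate of one fragment element (hypothetical helper type).
struct Coord { unsigned row; unsigned col; };

// Coordinate of element c_i (i = 0..3) held by lane `laneid`
// in the 16x8 accumulator matrix C (or D).
Coord c_coord_m16n8k4(unsigned laneid, unsigned i) {
    assert(laneid < 32 && i < 4);
    unsigned groupID = laneid >> 2;
    unsigned threadID_in_group = laneid % 4;
    // c0 and c1 lie in row groupID; c2 and c3 lie in row groupID + 8.
    unsigned row = (i < 2) ? groupID : groupID + 8;
    unsigned col = threadID_in_group * 2 + (i & 0x1);
    return {row, col};
}

// Reassembles the row-major 16x8 accumulator from per-lane fragments,
// where frags[laneid][i] is element c_i held by lane `laneid`.
// T would be float for .f32 or double for .f64.
template <typename T>
std::array<std::array<T, 8>, 16>
gather_accumulator_m16n8k4(const std::array<std::array<T, 4>, 32>& frags) {
    std::array<std::array<T, 8>, 16> D{};
    for (unsigned laneid = 0; laneid < 32; ++laneid) {
        for (unsigned i = 0; i < 4; ++i) {
            Coord c = c_coord_m16n8k4(laneid, i);
            D[c.row][c.col] = frags[laneid][i];
        }
    }
    return D;
}
```

For example, lane 22 has groupID 5 and threadID_in_group 2, so its c3 corresponds to row 13, column 5 of the accumulator.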