9.7.14.5.7. Matrix Fragments for mma.m16n8k8
A warp executing mma.m16n8k8 will compute an MMA operation of shape .m16n8k8.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
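Every layout in this section is indexed by the same two values derived from %laneid: groupID = %laneid >> 2 and threadID_in_group = %laneid % 4. The following Python sketch illustrates that decomposition only. The name `split_lane` and the constant `WARP_SIZE` are illustrative and not part of PTX, and a 32-thread warp is assumed.

```python
WARP_SIZE = 32  # assumed warp width

def split_lane(laneid):
    """Return (groupID, threadID_in_group) for a lane index 0..31."""
    group_id = laneid >> 2            # 8 groups of 4 consecutive lanes
    thread_id_in_group = laneid % 4   # position of the lane inside its group
    return group_id, thread_id_in_group

# Collect the lanes that belong to each group.
groups = {}
for lane in range(WARP_SIZE):
    g, t = split_lane(lane)
    groups.setdefault(g, []).append(lane)
```

With this split, lanes 12 to 15 form group 3, and lane 13 is thread 1 within that group.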
Multiplicand A:
.f16 and .bf16:
| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .f16 / .bf16 | A vector expression containing two .f16x2 registers, with each register containing two .f16 / .bf16 elements from the matrix A. | a0, a1, a2, a3 |
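As a rough illustration of the register packing described for this fragment, the sketch below builds the two 32-bit register values from four .f16 elements. It assumes that "low to high" means the lower-numbered element occupies bits 0..15 of its register. The helper names `f16_bits`, `pack_f16x2` and `pack_a_fragment` are hypothetical, and .bf16 is not covered because Python's `struct` module has no bfloat16 format.

```python
import struct

def f16_bits(x):
    """IEEE binary16 bit pattern of x, as an unsigned integer."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def pack_f16x2(low, high):
    """Pack two half-precision values into one 32-bit word (assumed low-to-high order)."""
    return (f16_bits(high) << 16) | f16_bits(low)

def pack_a_fragment(a0, a1, a2, a3):
    """Two .f16x2 register values holding {a0, a1} and {a2, a3}."""
    return [pack_f16x2(a0, a1), pack_f16x2(a2, a3)]

regs = pack_a_fragment(1.0, 2.0, 0.5, -1.0)
```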
The layout of the fragments held by different threads is shown in Figure 71.
Figure 71 MMA .m16n8k8 fragment layout for matrix A with .f16 / .bf16 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for a0 and a1
      groupID + 8 for a2 and a3
col = threadID_in_group * 2 + (i & 0x1) for ai where i = {0,..,3}
.tf32:
| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .tf32 | A vector expression containing four .b32 registers, containing four .tf32 elements from the matrix A. | a0, a1, a2, a3 |
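Each of a0..a3 corresponds to one position in the 16x8 matrix A. The sketch below follows the row and column rule stated for the .tf32 layout in this section and checks that the 32 lanes together cover every element exactly once. The function name `a_tf32_coord` is illustrative, not a PTX facility.

```python
def a_tf32_coord(laneid, i):
    """(row, col) in A of element a<i> held by lane <laneid>, .tf32 layout."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if i in (0, 2) else group_id + 8              # a1, a3 sit 8 rows lower
    col = thread_id_in_group if i in (0, 1) else thread_id_in_group + 4
    return row, col

# All (row, col) pairs held by the warp: 32 lanes x 4 elements = 128 = 16 x 8.
cells = [a_tf32_coord(lane, i) for lane in range(32) for i in range(4)]
```

For example, lane 5 (groupID 1, threadID_in_group 1) holds a3 at row 9, column 5.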
The layout of the fragments held by different threads is shown in Figure 72.
Figure 72 MMA .m16n8k8 fragment layout for matrix A with .tf32 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for a0 and a2
      groupID + 8 for a1 and a3
col = threadID_in_group for a0 and a1
      threadID_in_group + 4 for a2 and a3
.f64:
| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression containing four .f64 registers, containing four .f64 elements from the matrix A. | a0, a1, a2, a3 |
The layout of the fragments held by different threads is shown in Figure 73.
Figure 73 MMA .m16n8k8 fragment layout for matrix A with .f64 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for a0 and a2
      groupID + 8 for a1 and a3
col = threadID_in_group for a0 and a1
      threadID_in_group + 4 for a2 and a3
Multiplicand B:
.f16 and .bf16:
| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .f16 / .bf16 | A vector expression containing a single .f16x2 register, containing two .f16 / .bf16 elements from the matrix B. | b0, b1 |
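For the 8x8 matrix B, each lane holds two consecutive rows of one column. The sketch below applies the row and column rule stated for the .f16 / .bf16 layout of B in this section and checks the coverage. The function name `b_f16_coord` is hypothetical.

```python
def b_f16_coord(laneid, i):
    """(row, col) in B of element b<i> held by lane <laneid>, .f16 / .bf16 layout."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = thread_id_in_group * 2 + i   # b0 and b1 are vertically adjacent
    col = group_id                     # one column of B per group
    return row, col

# 32 lanes x 2 elements = 64 = 8 x 8 positions.
cells = [b_f16_coord(lane, i) for lane in range(32) for i in range(2)]
```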
The layout of the fragments held by different threads is shown in Figure 74.
Figure 74 MMA .m16n8k8 fragment layout for matrix B with .f16 / .bf16 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 2) + i for bi where i = {0, 1}
col = groupID
.tf32:
| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .tf32 | A vector expression containing two .b32 registers, containing two .tf32 elements from the matrix B. | b0, b1 |
The layout of the fragments held by different threads is shown in Figure 75.
Figure 75 MMA .m16n8k8 fragment layout for matrix B with .tf32 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = threadID_in_group for b0
      threadID_in_group + 4 for b1
col = groupID
.f64:
| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression containing two .f64 registers, containing two .f64 elements from the matrix B. | b0, b1 |
The layout of the fragments held by different threads is shown in Figure 76.
Figure 76 MMA .m16n8k8 fragment layout for matrix B with .f64 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = threadID_in_group for b0
      threadID_in_group + 4 for b1
col = groupID
Accumulators (C or D):
.f16, .bf16 and .tf32:
| .ctype / .dtype | Fragment | Elements (low to high) |
|---|---|---|
| .f16 | A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D). | c0, c1, c2, c3 |
| .f32 | A vector expression of four .f32 registers. | c0, c1, c2, c3 |
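To make the accumulator distribution concrete, the sketch below scatters a 16x8 matrix into per-lane fragments [c0, c1, c2, c3] using the row and column rule stated for this layout. It models only the index mapping, not the register types, and the names `c_coord` and `scatter` are illustrative.

```python
def c_coord(laneid, i):
    """(row, col) in C (or D) of element c<i> held by lane <laneid>."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if i < 2 else group_id + 8      # c2, c3 sit 8 rows lower
    col = thread_id_in_group * 2 + (i & 0x1)       # two adjacent columns per lane
    return row, col

def scatter(matrix):
    """Per-lane fragments [c0, c1, c2, c3] taken from a 16x8 matrix."""
    frags = []
    for lane in range(32):
        frag = []
        for i in range(4):
            row, col = c_coord(lane, i)
            frag.append(matrix[row][col])
        frags.append(frag)
    return frags

# Test matrix whose entries encode their own position: C[r][c] = r * 8 + c.
C = [[r * 8 + c for c in range(8)] for r in range(16)]
frags = scatter(C)
```

Lane 5 ends up with the elements at (1, 2), (1, 3), (9, 2) and (9, 3).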
The layout of the fragments held by different threads is shown in Figure 77.
Figure 77 MMA .m16n8k8 fragment layout for accumulator matrix C/D with .f16x2/.f32 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for c0 and c1
      groupID + 8 for c2 and c3
col = (threadID_in_group * 2) + (i & 0x1) for ci where i = {0,..,3}
.f64:
| .ctype / .dtype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression of four .f64 registers containing four .f64 elements from the matrix C (or D). | c0, c1, c2, c3 |
The layout of the fragments held by different threads is shown in Figure 78.
Figure 78 MMA .m16n8k8 fragment layout for accumulator matrix C/D with .f64 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for c0 and c1
      groupID + 8 for c2 and c3
col = (threadID_in_group * 2) + (i & 0x1) for ci where i = {0,..,3}
9.7.14.5.8. Matrix Fragments for mma.m16n8k16 with floating point type
A warp executing mma.m16n8k16 with floating point types will compute an MMA operation of shape .m16n8k16.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
Multiplicand A:
.f16 and .bf16:
| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .f16 / .bf16 | A vector expression containing four .f16x2 registers, with each register containing two .f16 / .bf16 elements from the matrix A. | a0, a1, a2, a3, a4, a5, a6, a7 |
The layout of the fragments held by different threads is shown in Figure 79.
Figure 79 MMA .m16n8k16 fragment layout for matrix A with .f16 / .bf16 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ai where 0 <= i < 2 || 4 <= i < 6
      groupID + 8 otherwise
col = (threadID_in_group * 2) + (i & 0x1) for ai where i < 4
      (threadID_in_group * 2) + (i & 0x1) + 8 for ai where i >= 4
.f64:
| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression containing eight .f64 registers, with each register containing one .f64 element from the matrix A. | a0, a1, a2, a3, a4, a5, a6, a7 |
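The eight .f64 elements per lane tile the 16x16 matrix A in pairs of rows that are 8 apart. The sketch below follows the row and column rule stated for this layout, reading the second column case as (i * 2) - 2 + threadID_in_group, and checks that the warp covers all 256 positions exactly once. The function name `a_f64_coord` is hypothetical.

```python
def a_f64_coord(laneid, i):
    """(row, col) in A of element a<i> held by lane <laneid>, .f64 layout, k = 16."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    if i % 2 == 0:
        row = group_id
        col = i * 2 + thread_id_in_group
    else:
        row = group_id + 8
        col = i * 2 - 2 + thread_id_in_group   # same columns as the preceding even element
    return row, col

# 32 lanes x 8 elements = 256 = 16 x 16 positions.
cells = [a_f64_coord(lane, i) for lane in range(32) for i in range(8)]
```

Under this reading, each odd element a(i) lies directly 8 rows below a(i-1) in the same column.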
The layout of the fragments held by different threads is shown in Figure 80.
Figure 80 MMA .m16n8k16 fragment layout for matrix A with .f64 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ai where i % 2 = 0
      groupID + 8 otherwise
col = (i * 2) + threadID_in_group for ai where i % 2 = 0
      (i * 2) - 2 + threadID_in_group otherwise
Multiplicand B:
.f16 and .bf16:
| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .f16 / .bf16 | A vector expression containing two .f16x2 registers, with each register containing two .f16 / .bf16 elements from the matrix B. | b0, b1, b2, b3 |
The layout of the fragments held by different threads is shown in Figure 81.
Figure 81 MMA .m16n8k16 fragment layout for matrix B with .f16 / .bf16 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 2) + (i & 0x1) for bi where i < 2
      (threadID_in_group * 2) + (i & 0x1) + 8 for bi where i >= 2
col = groupID
.f64:
| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression containing four .f64 registers, with each register containing one .f64 element from the matrix B. | b0, b1, b2, b3 |
The layout of the fragments held by different threads is shown in Figure 82.
Figure 82 MMA .m16n8k16 fragment layout for matrix B with .f64 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = threadID_in_group + (i * 4) for bi where i < 4
col = groupID
Accumulators (C or D):
| .ctype / .dtype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression containing four .f64 registers containing four .f64 elements from the matrix C (or D). | c0, c1, c2, c3 |
| .f32 | A vector expression containing four .f32 registers containing four .f32 elements from the matrix C (or D). | c0, c1, c2, c3 |
| .f16 | A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D). | c0, c1, c2, c3 |
The layout of the fragments held by different threads is shown in Figure 83.
Figure 83 MMA .m16n8k16 fragment layout for accumulator matrix C/D.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ci where i < 2
      groupID + 8 for ci where i >= 2
col = (threadID_in_group * 2) + (i & 0x1) for ci where i = {0,..,3}
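The accumulator rule above determines which four elements of D a lane holds after the operation. The sketch below is a plain reference computation of D = A * B + C on whole matrices, followed by a lookup of the entries belonging to one lane. It says nothing about how the hardware performs the computation, and the names `acc_coord` and `reference_d_fragment` are illustrative only.

```python
M, N, K = 16, 8, 16  # shape .m16n8k16

def acc_coord(laneid, i):
    """(row, col) in C (or D) of element c<i> / d<i> held by lane <laneid>."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if i < 2 else group_id + 8
    col = thread_id_in_group * 2 + (i & 0x1)
    return row, col

def reference_d_fragment(A, B, C, laneid):
    """Values d0..d3 that the given lane would hold, computed from full matrices."""
    frag = []
    for i in range(4):
        row, col = acc_coord(laneid, i)
        acc = C[row][col]
        for k in range(K):
            acc += A[row][k] * B[k][col]
        frag.append(acc)
    return frag

# Small integer example: A is the 16x16 identity, so D = B + C.
A = [[1 if k == r else 0 for k in range(K)] for r in range(M)]
B = [[k * N + c for c in range(N)] for k in range(K)]
C = [[1] * N for _ in range(M)]
frag = reference_d_fragment(A, B, C, laneid=5)
```

Because A is the identity here, lane 5 holds B + 1 at positions (1, 2), (1, 3), (9, 2) and (9, 3).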