9.7.14.5.9. Matrix Fragments for mma.m16n8k16 with integer type
A warp executing mma.m16n8k16 will compute an MMA operation of shape .m16n8k16.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
Multiplicand A:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.u8 / .s8 |
A vector expression containing two .b32 registers, with each register containing four .u8 / .s8 elements from the matrix A. |
a0, a1, a2, a3, a4, a5, a6, a7 |
.e4m3 / .e5m2 |
A vector expression containing two .b32 registers, with each register containing four .e4m3 / .e5m2 elements from the matrix A. |
a0, a1, a2, a3, a4, a5, a6, a7 |
The layout of the fragments held by different threads is shown in Figure 84.
!MMA .m16n8k16 fragment layout for matrix A with .u8 / .s8 type
Figure 84 MMA .m16n8k16 fragment layout for matrix A with .u8 / .s8 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ai where i < 4
groupID + 8 for ai where i >= 4
col = (threadID_in_group * 4) + (i & 0x3) for ai where i = {0,..,7}Multiplicand B:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.u8 / .s8 |
A vector expression containing a single .b32 register, containing four .u8 / .s8 elements from the matrix B. |
b0, b1, b2, b3 |
.e4m3 / .e5m2 |
A vector expression containing a single .b32 register, containing four .e4m3 / .e5m2 elements from the matrix B. |
b0, b1. b2. b3 |
The layout of the fragments held by different threads is shown in Figure 85.
!MMA .m16n8k16 fragment layout for matrix B with .u8 / .s8 type
Figure 85 MMA .m16n8k16 fragment layout for matrix B with .u8 / .s8 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 4) + i for bi where i = {0,..,3}
col = groupIDAccumulators (C or D):
.ctype / .dtype |
Fragment | Elements (low to high) |
|---|---|---|
.s32 |
A vector expression containing four .s32 registers, containing four .s32 elements from the matrix C (or D). |
c0, c1, c2, c3 |
.f32 |
A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D). |
c0, c1, c2, c3 |
.f16 |
A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D). |
c0, c1, c1, c2 |
The layout of the fragments held by different threads is shown in Figure 86.
!MMA .m16n8k16 fragment layout for accumulator matrix C/D with .s32 type
Figure 86 MMA .m16n8k16 fragment layout for accumulator matrix C/D with .s32 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ci where i < 2
groupID + 8 for ci where i >= 2
col = (threadID_in_group * 2) + (i & 0x1) for ci where i = {0,..,3}9.7.14.5.10. Matrix Fragments for mma.m16n8k32
A warp executing mma.m16n8k32 will compute an MMA operation of shape .m16n8k32.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
Multiplicand A:
.s4 or .u4:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.s4 / .u4 |
A vector expression containing two .b32 registers, with each register containing eight .u4 / .s4 elements from the matrix A. |
a0, a1, …, a14, a15 |
The layout of the fragments held by different threads is shown in Figure 87.
!MMA .m16n8k32 fragment layout for matrix A with .u4 / .s4 type
Figure 87 MMA .m16n8k32 fragment layout for matrix A with .u4 / .s4 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ai where i < 8
groupID + 8 for ai where i >= 8
col = (threadID_in_group * 8) + (i & 0x7) for ai where i = {0,..,15}.s8 or .u8 or .e4m3 or .e5m2 or .e3m2 or .e2m3 or .e2m1:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.s8 / .u8 |
A vector expression containing four .b32 registers, with each register containing four .s8 / .u8 elements from the matrix A. |
a0, a1, .., a14, a15 |
.e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 |
A vector expression containing four .b32 registers, with each register containing four .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 elements from the matrix A. |
a0, a1, …, a14, a15 |
The layout of the fragments held by different threads is shown in Figure 88.
Figure 88 MMA .m16n8k32 fragment layout for matrix A with .u8 / .s8 / .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ai where 0 <= i < 4 || 8 <= i < 12
groupID + 8 otherwise
col = (threadID_in_group * 4) + (i & 0x3) for ai where i < 8
(threadID_in_group * 4) + (i & 0x3) + 16 for ai where i >= 8Multiplicand B:
.s4 or .u4:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.s4 / .u4 |
A vector expression containing a single .b32 register, containing eight .s4 / .u4 elements from the matrix B. |
b0, b1, b2, b3, b4, b5, b6, b7 |
The layout of the fragments held by different threads is shown in Figure 89.
!MMA .m16n8k32 fragment layout for matrix B with .u4 / .s4 type
Figure 89 MMA .m16n8k32 fragment layout for matrix B with .u4 / .s4 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 8) + (i & 0x7) for bi where i = {0,..,7}
col = groupID.s8 or .u8 or .e4m3 or .e5m2 or .e3m2 or .e2m3 or .e2m1:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.s8 / .u8 |
A vector expression containing two .b32 registers, with each register containing four .s8 / .u8 elements from the matrix B. |
b0, b1, b2, b3, b4, b5, b6, b7 |
.e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 |
A vector expression containing two .b32 registers, with each register containing four .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 elements from the matrix B. |
b0, b1, b2, b3, b4, b5, b6, b7 |
The layout of the fragments held by different threads is shown in Figure 90 and Figure 91.
!MMA .m16n8k32 fragment layout for rows 0–15 of matrix B
Figure 90 MMA .m16n8k32 fragment layout for rows 0–15 of matrix B with .u8 / .s8 / .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 type.
!MMA .m16n8k32 fragment layout for rows 16–31 of matrix B
Figure 91 MMA .m16n8k32 fragment layout for rows 16–31 of matrix B with .u8 / .s8 / .e4m3 / .e5m2 / .e3m2 / .e2m3 / .e2m1 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 4) + (i & 0x3) for bi where i < 4
(threadID_in_group * 4) + (i & 0x3) + 16 for bi where i >= 4
col = groupIDAccumulators (C or D):
.ctype / .dtype |
Fragment | Elements (low to high) |
|---|---|---|
.s32 |
A vector expression containing four .s32 registers, containing four .s32 elements from the matrix C (or D). |
c0, c1, c2, c3 |
.f32 |
A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D). |
c0, c1, c2, c3 |
.f16 |
A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D). |
c0, c1, c2, c3 |
The layout of the fragments held by different threads is shown in Figure 92.
!MMA .m16n8k32 fragment layout for accumulator matrix C/D with .s32 / .f32 / .f16 type
Figure 92 MMA .m16n8k32 fragment layout for accumulator matrix C/D with .s32 / .f32 / .f16 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ci where i < 2
groupID + 8 for ci where i >= 2
col = (threadID_in_group * 2) + (i & 0x1) for ci where i = {0,..,3}9.7.14.5.11. Matrix Fragments for mma.m16n8k64
A warp executing mma.m16n8k64 will compute an MMA operation of shape .m16n8k64.
Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.
Multiplicand A:
.atype |
Fragment | Elements (low to high) |
|---|---|---|
.s4 / .u4 |
A vector expression containing four .b32 registers, with each register containing eight .s4 / .u4 elements from the matrix A. |
a0, a1, …, a30, a31 |
.e2m1 |
A vector expression containing four .b32 registers, with each register containing eight .e2m1 elements from the matrix A. |
a0, a1, …, a30, a31 |
The layout of the fragments held by different threads is shown in Figure 93.
!MMA .m16n8k64 fragment layout for matrix A with .u4 / .s4 / .e2m1 type
Figure 93 MMA .m16n8k64 fragment layout for matrix A with .u4 / .s4 / .e2m1 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ai where 0 <= i < 8 || 16 <= i < 24
groupID + 8 otherwise
col = (threadID_in_group * 8) + (i & 0x7) for ai where i < 16
(threadID_in_group * 8) + (i & 0x7) + 32 for ai where i >= 16Multiplicand B:
.btype |
Fragment | Elements (low to high) |
|---|---|---|
.s4 / .u4 |
A vector expression containing two .b32 registers, with each register containing eight .s4 / .u4 elements from the matrix B. |
b0, b1, …, b14, b15 |
.e2m1 |
A vector expression containing two .b32 registers, with each register containing eight .e2m1 elements from the matrix B. |
b0, b1, …, b14, b15 |
The layout of the fragments held by different threads is shown in Figure 94 and Figure 95.
!MMA .m16n8k64 fragment layout for rows 0–31 of matrix B with .u4 / .s4 / .e2m1 type
Figure 94 MMA .m16n8k64 fragment layout for rows 0–31 of matrix B with .u4 / .s4 / .e2m1 type.
!MMA .m16n8k64 fragment layout for rows 32–63 of matrix B with .u4 / .s4 / .e2m1 type
Figure 95 MMA .m16n8k64 fragment layout for rows 32–63 of matrix B with .u4 / .s4 / .e2m1 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = (threadID_in_group * 8) + (i & 0x7) for bi where i < 8
(threadID_in_group * 8) + (i & 0x7) + 32 for bi where i >= 8
col = groupIDAccumulators (C or D):
.ctype / .dtype |
Fragment | Elements (low to high) |
|---|---|---|
.s32 |
A vector expression containing four .s32 registers, containing four .s32 elements from the matrix C (or D). |
c0, c1, c2, c3 |
.f32 |
A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D). |
c0, c1, c2, c3 |
The layout of the fragments held by different threads is shown in Figure 96.
!MMA .m16n8k64 fragment layout for accumulator matrix C/D with .s32 / .f32 type
Figure 96 MMA .m16n8k64 fragment layout for accumulator matrix C/D with .s32 / .f32 type.
The row and column of a matrix fragment can be computed as:
groupID = %laneid >> 2
threadID_in_group = %laneid % 4
row = groupID for ci where i < 2
groupID + 8 for ci where i >= 2
col = (threadID_in_group * 2) + (i & 0x1) for ci where i = {0,..,3}