9.7.14.5.5. Matrix Fragments for mma.m8n8k128

A warp executing mma.m8n8k128 will compute an MMA operation of shape .m8n8k128.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

Multiplicand A:

| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .b1 | A vector expression containing a single .b32 register, containing thirty-two .b1 elements from the matrix A. | a0, a1, …, a30, a31 |

The layout of the fragments held by different threads is shown in Figure 62.

Figure 62 MMA .m8n8k128 fragment layout for matrix A with .b1 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  groupID

col =  (threadID_in_group * 32) + i       for ai where i = {0,..,31}
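The mapping above can be sketched in Python to see which matrix element each thread holds. This is a minimal host-side illustration, not device code; the function name `a_fragment_coords` is hypothetical, and `laneid` stands in for the value of `%laneid`.

```python
def a_fragment_coords(laneid, i):
    """Return (row, col) in the 8x128 matrix A for element a_i of lane `laneid`."""
    if not (0 <= laneid < 32 and 0 <= i < 32):
        raise ValueError("laneid and i must both be in 0..31")
    group_id = laneid >> 2            # 0..7, selects the row
    thread_id_in_group = laneid % 4   # 0..3, selects a 32-column block
    row = group_id
    col = thread_id_in_group * 32 + i
    return row, col

# Every element of the 8x128 matrix is owned by exactly one (lane, i) pair.
owners = {a_fragment_coords(lane, i) for lane in range(32) for i in range(32)}
assert len(owners) == 8 * 128

# Lane 5 has groupID 1 and threadID_in_group 1, so a3 sits at row 1, column 35.
print(a_fragment_coords(5, 3))  # (1, 35)
```

Each group of four consecutive lanes therefore covers one full 128-element row of A, 32 elements per lane.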

Multiplicand B:

| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .b1 | A vector expression containing a single .b32 register, containing thirty-two .b1 elements from the matrix B. | b0, b1, …, b30, b31 |

The layout of the fragments held by different threads is shown in Figure 63.

Figure 63 MMA .m8n8k128 fragment layout for matrix B with .b1 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row = (threadID_in_group * 32) + i         for bi where i = {0,..,31}

col = groupID
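As an illustration of how a thread's single .b32 register could be assembled from a full B matrix, the sketch below packs the 32 one-bit elements selected by the formulas above. It assumes that "low to high" means b0 occupies bit 0 and b31 occupies bit 31; that bit order is an interpretation of the table heading, not something the formulas state. The names `b_fragment_coords` and `pack_b_register` are hypothetical.

```python
def b_fragment_coords(laneid, i):
    """Return (row, col) in the 128x8 matrix B for element b_i of lane `laneid`."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = thread_id_in_group * 32 + i
    col = group_id
    return row, col

def pack_b_register(b_matrix, laneid):
    """Pack the 32 one-bit elements owned by `laneid` into a 32-bit integer.

    Assumption: b0 goes to bit 0 (least significant) and b31 to bit 31.
    """
    reg = 0
    for i in range(32):
        row, col = b_fragment_coords(laneid, i)
        reg |= (b_matrix[row][col] & 1) << i
    return reg

# Hypothetical test matrix: B[k][n] is 1 exactly when the row index k is even.
b_matrix = [[(k + 1) % 2 for _ in range(8)] for k in range(128)]

# Each lane starts at a multiple of 32, so even i means even row: bits 0, 2, 4, ...
print(hex(pack_b_register(b_matrix, 13)))  # 0x55555555
```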

Accumulators (C or D):

| .ctype / .dtype | Fragment | Elements (low to high) |
|---|---|---|
| .s32 | A vector expression containing two .s32 registers, containing two .s32 elements from the matrix C (or D). | c0, c1 |

The layout of the fragments held by different threads is shown in Figure 64.

Figure 64 MMA .m8n8k128 fragment layout for accumulator matrix C/D with .s32 type.

The row and column of a matrix fragment can be computed as:

groupID = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID

col =  (threadID_in_group * 2) + i    for ci where i = {0, 1}
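To check that the accumulator layout above partitions the 8x8 matrix cleanly, the following sketch splits a matrix into per-lane fragments and reassembles it. It is a host-side illustration only; `c_fragment_coords`, `scatter_c`, and `gather_c` are hypothetical helper names.

```python
def c_fragment_coords(laneid, i):
    """Return (row, col) in the 8x8 matrix C/D for element c_i of lane `laneid`."""
    if i not in (0, 1):
        raise ValueError("i must be 0 or 1")
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    return group_id, thread_id_in_group * 2 + i

def scatter_c(c_matrix):
    """Split an 8x8 matrix into 32 per-lane fragments [c0, c1]."""
    fragments = []
    for lane in range(32):
        frag = []
        for i in range(2):
            row, col = c_fragment_coords(lane, i)
            frag.append(c_matrix[row][col])
        fragments.append(frag)
    return fragments

def gather_c(fragments):
    """Rebuild the 8x8 matrix from the 32 per-lane fragments."""
    c_matrix = [[None] * 8 for _ in range(8)]
    for lane, frag in enumerate(fragments):
        for i, value in enumerate(frag):
            row, col = c_fragment_coords(lane, i)
            c_matrix[row][col] = value
    return c_matrix

c_matrix = [[r * 8 + c for c in range(8)] for r in range(8)]
fragments = scatter_c(c_matrix)
assert gather_c(fragments) == c_matrix

# Lane 6 has groupID 1 and threadID_in_group 2: row 1, columns 4 and 5.
print(fragments[6])  # [12, 13]
```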

9.7.14.5.6. Matrix Fragments for mma.m16n8k4

A warp executing mma.m16n8k4 will compute an MMA operation of shape .m16n8k4.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

Multiplicand A:

.tf32:

| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .tf32 | A vector expression containing two .b32 registers, containing two .tf32 elements from the matrix A. | a0, a1 |

The layout of the fragments held by different threads is shown in Figure 65.

Figure 65 MMA .m16n8k4 fragment layout for matrix A with .tf32 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID            for a0
           groupID + 8        for a1

col =  threadID_in_group
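The 16x4 layout of A can be explored by building an ownership table that records which lane and which fragment element hold each matrix entry. This is a minimal sketch of the formulas above; the function name `a_fragment_coords_m16n8k4` and the `owner` dictionary are hypothetical.

```python
def a_fragment_coords_m16n8k4(laneid, i):
    """Return (row, col) in the 16x4 matrix A for element a_i of lane `laneid`."""
    if not (0 <= laneid < 32) or i not in (0, 1):
        raise ValueError("laneid must be in 0..31 and i must be 0 or 1")
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if i == 0 else group_id + 8   # a0: upper half, a1: lower half
    col = thread_id_in_group
    return row, col

# Map every (row, col) of A to the (lane, i) pair that holds it.
owner = {}
for lane in range(32):
    for i in range(2):
        owner[a_fragment_coords_m16n8k4(lane, i)] = (lane, i)

assert len(owner) == 16 * 4   # all 64 elements covered exactly once

# Row 10, column 2 is element a1 of lane 10 (groupID 2, threadID_in_group 2).
print(owner[(10, 2)])  # (10, 1)
```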

.f64:

| .atype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression containing two .f64 registers, containing two .f64 elements from the matrix A. | a0, a1 |

The layout of the fragments held by different threads is shown in Figure 66.

Figure 66 MMA .m16n8k4 fragment layout for matrix A with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID            for a0
           groupID + 8        for a1

col =  threadID_in_group

Multiplicand B:

.tf32:

| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .tf32 | A vector expression of a single .b32 register, containing a single .tf32 element from the matrix B. | b0 |

The layout of the fragments held by different threads is shown in Figure 67.

Figure 67 MMA .m16n8k4 fragment layout for matrix B with .tf32 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  threadID_in_group

col =  groupID
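Because each lane holds exactly one element of the 4x8 matrix B, the mapping above can be inverted directly: the lane that owns B[row][col] is `col * 4 + row`. The sketch below checks that round trip; both function names are hypothetical.

```python
def b_fragment_coords_m16n8k4(laneid):
    """Return (row, col) in the 4x8 matrix B for the single element b0 of a lane."""
    if not 0 <= laneid < 32:
        raise ValueError("laneid must be in 0..31")
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    return thread_id_in_group, group_id

def b_owner_lane(row, col):
    """Inverse mapping: the lane whose b0 is B[row][col]."""
    # laneid == groupID * 4 + threadID_in_group, with groupID == col
    # and threadID_in_group == row.
    return col * 4 + row

# The forward and inverse mappings agree for every lane.
for lane in range(32):
    row, col = b_fragment_coords_m16n8k4(lane)
    assert b_owner_lane(row, col) == lane

print(b_fragment_coords_m16n8k4(22))  # (2, 5)
print(b_owner_lane(2, 5))             # 22
```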

.f64:

| .btype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression of a single .f64 register, containing a single .f64 element from the matrix B. | b0 |

The layout of the fragments held by different threads is shown in Figure 68.

Figure 68 MMA .m16n8k4 fragment layout for matrix B with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  threadID_in_group

col =  groupID

Accumulators (C or D):

.tf32:

| .ctype / .dtype | Fragment | Elements (low to high) |
|---|---|---|
| .f32 | A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D). | c0, c1, c2, c3 |

The layout of the fragments held by different threads is shown in Figure 69.

Figure 69 MMA .m16n8k4 fragment layout for accumulator matrix C/D with .f32 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  groupID                                for c0 and c1
       groupID + 8                            for c2 and c3

col =  (threadID_in_group * 2) + (i & 0x1)    for ci   where i = {0,..,3}
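The accumulator formulas above combine a row offset for c2 and c3 with a column offset taken from the low bit of i. A small sketch makes the resulting 2x2 pattern per lane explicit; the function name `c_fragment_coords_m16n8k4` is hypothetical.

```python
def c_fragment_coords_m16n8k4(laneid, i):
    """Return (row, col) in the 16x8 matrix C/D for element c_i of lane `laneid`."""
    if not (0 <= laneid < 32 and 0 <= i < 4):
        raise ValueError("laneid must be in 0..31 and i in 0..3")
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if i < 2 else group_id + 8      # c0, c1 vs. c2, c3
    col = thread_id_in_group * 2 + (i & 0x1)       # even i: left, odd i: right
    return row, col

# Lane 7 has groupID 1 and threadID_in_group 3.
print([c_fragment_coords_m16n8k4(7, i) for i in range(4)])
# [(1, 6), (1, 7), (9, 6), (9, 7)]

# The 32 lanes x 4 elements cover all 16 * 8 = 128 entries exactly once.
covered = {c_fragment_coords_m16n8k4(lane, i) for lane in range(32) for i in range(4)}
assert len(covered) == 16 * 8
```

Each lane thus holds two horizontally adjacent elements in one row of the upper half and the two elements at the same columns eight rows below.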

.f64:

| .ctype / .dtype | Fragment | Elements (low to high) |
|---|---|---|
| .f64 | A vector expression containing four .f64 registers, containing four .f64 elements from the matrix C (or D). | c0, c1, c2, c3 |

The layout of the fragments held by different threads is shown in Figure 70.

Figure 70 MMA .m16n8k4 fragment layout for accumulator matrix C/D with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  groupID                                for c0 and c1
       groupID + 8                            for c2 and c3

col =  (threadID_in_group * 2) + (i & 0x1)    for ci   where i = {0,..,3}