9.7.14.5.7. Matrix Fragments for mma.m16n8k8

A warp executing mma.m16n8k8 will compute an MMA operation of shape .m16n8k8.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

Multiplicand A:

.f16 and .bf16:

.atype Fragment Elements (low to high)
.f16 / .bf16 A vector expression containing two .f16x2 registers, with each register containing two .f16 / .bf16 elements from the matrix A. a0, a1, a2, a3

The layout of the fragments held by different threads is shown in Figure 71.

Figure 71 MMA .m16n8k8 fragment layout for matrix A with .f16 / .bf16 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID            for a0 and a1
           groupID + 8        for a2 and a3

col =  threadID_in_group * 2 + (i & 0x1)    for ai     where i = {0,..,3}
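
The mapping above can be checked on the host. The following Python sketch evaluates the formulas for every lane and element index and confirms that the 32 threads together cover each element of the 16x8 tile of A exactly once. The function name `a_f16_m16n8k8_coord` is illustrative only and is not part of PTX.

```python
def a_f16_m16n8k8_coord(laneid, i):
    """(row, col) in the 16x8 A tile of element a_i held by lane `laneid`."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if i < 2 else group_id + 8      # a0, a1 vs. a2, a3
    col = thread_id_in_group * 2 + (i & 0x1)
    return row, col

# 32 lanes x 4 elements should tile the 16x8 matrix with no gaps or overlaps.
coords = [a_f16_m16n8k8_coord(lane, i) for lane in range(32) for i in range(4)]
assert sorted(coords) == [(r, c) for r in range(16) for c in range(8)]
```

For example, lane 5 has groupID 1 and threadID_in_group 1, so its a3 is the element at row 9, column 3.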

.tf32:

.atype Fragment Elements (low to high)
.tf32 A vector expression containing four .b32 registers, containing four .tf32 elements from the matrix A. a0, a1, a2, a3

The layout of the fragments held by different threads is shown in Figure 72.

Figure 72 MMA .m16n8k8 fragment layout for matrix A with .tf32 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID            for a0 and a2
           groupID + 8        for a1 and a3

col =  threadID_in_group       for a0 and a1
       threadID_in_group + 4   for a2 and a3
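
When staging data for this layout it is often the inverse question that matters: which lane and which element index hold a given (row, col) of A. The sketch below derives that inverse from the formulas above and checks that it round-trips. Both function names are hypothetical helpers, not PTX or CUDA APIs.

```python
def a_tf32_m16n8k8_coord(laneid, i):
    """(row, col) in the 16x8 A tile of element a_i held by lane `laneid`."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id + (8 if i in (1, 3) else 0)              # a1, a3 are in rows 8..15
    col = thread_id_in_group + (4 if i in (2, 3) else 0)    # a2, a3 are in cols 4..7
    return row, col

def a_tf32_m16n8k8_owner(row, col):
    """Inverse mapping: (laneid, i) such that a_i of that lane is A[row][col]."""
    laneid = (row % 8) * 4 + (col % 4)
    i = (1 if row >= 8 else 0) + (2 if col >= 4 else 0)
    return laneid, i

# Round trip over every lane and element index.
for lane in range(32):
    for i in range(4):
        assert a_tf32_m16n8k8_owner(*a_tf32_m16n8k8_coord(lane, i)) == (lane, i)
```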

.f64:

.atype Fragment Elements (low to high)
.f64 A vector expression containing four .f64 registers, containing four .f64 elements from the matrix A. a0, a1, a2, a3

The layout of the fragments held by different threads is shown in Figure 73.

Figure 73 MMA .m16n8k8 fragment layout for matrix A with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID            for a0 and a2
           groupID + 8        for a1 and a3

col =  threadID_in_group       for a0 and a1
       threadID_in_group + 4   for a2 and a3

Multiplicand B:

.f16 and .bf16:

.btype Fragment Elements (low to high)
.f16 / .bf16 A vector expression containing a single .f16x2 register, containing two .f16 / .bf16 elements from the matrix B. b0, b1

The layout of the fragments held by different threads is shown in Figure 74.

Figure 74 MMA .m16n8k8 fragment layout for matrix B with .f16 / .bf16 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row = (threadID_in_group * 2) + i       for bi    where i = {0, 1}

col =  groupID
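
As an illustration of how b0 and b1 share one register, the sketch below builds the 32-bit .f16x2 value a lane would hold for a small host-side B tile. It assumes that "low to high" means b0 occupies bits 0-15 and b1 occupies bits 16-31, and it covers .f16 only (.bf16 uses a different bit encoding). The helper names and the test tile are hypothetical.

```python
import struct

def f16_bits(x):
    """IEEE 754 binary16 bit pattern of x as an unsigned 16-bit integer."""
    return struct.unpack('<H', struct.pack('<e', x))[0]

def b_f16_m16n8k8_register(B, laneid):
    """.f16x2 register of lane `laneid` (assumed: b0 in bits 0-15, b1 in bits 16-31)."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    col = group_id
    b0 = B[thread_id_in_group * 2 + 0][col]
    b1 = B[thread_id_in_group * 2 + 1][col]
    return (f16_bits(b1) << 16) | f16_bits(b0)

# Hypothetical 8x8 B tile with B[k][n] = k, so every column reads 0, 1, ..., 7.
B = [[float(k)] * 8 for k in range(8)]

# Lane 1 holds rows 2 and 3 of column 0: 2.0 is 0x4000 and 3.0 is 0x4200 in binary16.
print(hex(b_f16_m16n8k8_register(B, 1)))   # 0x42004000
```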

.tf32:

.btype Fragment Elements (low to high)
.tf32 A vector expression containing two .b32 registers, containing two .tf32 elements from the matrix B. b0, b1

The layout of the fragments held by different threads is shown in Figure 75.

Figure 75 MMA .m16n8k8 fragment layout for matrix B with .tf32 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =    threadID_in_group         for b0
       threadID_in_group + 4       for b1

col =  groupID
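
A minimal host-side sketch of this layout: given an 8x8 (k x n) matrix B stored as a list of rows, gather the two values each lane would hold. The function name and the test tile are assumptions made for illustration.

```python
def b_tf32_m16n8k8_fragment(B, laneid):
    """[b0, b1] for lane `laneid`, read from an 8x8 (k x n) matrix B."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    col = group_id
    return [B[thread_id_in_group][col], B[thread_id_in_group + 4][col]]

# Hypothetical test tile: encode each position as 10 * row + col.
B = [[10 * k + n for n in range(8)] for k in range(8)]
fragments = [b_tf32_m16n8k8_fragment(B, lane) for lane in range(32)]
```

With this tile, lane 5 (groupID 1, threadID_in_group 1) holds the values 11 and 51, that is, rows 1 and 5 of column 1.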

.f64:

.btype Fragment Elements (low to high)
.f64 A vector expression containing two .f64 registers, containing two .f64 elements from the matrix B. b0, b1

The layout of the fragments held by different threads is shown in Figure 76.

Figure 76 MMA .m16n8k8 fragment layout for matrix B with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =    threadID_in_group         for b0
       threadID_in_group + 4       for b1

col =  groupID

Accumulators (C or D):

.f16, .bf16 and .tf32:

.ctype / .dtype Fragment Elements (low to high)
.f16 A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D). c0, c1, c2, c3
.f32 A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D). c0, c1, c2, c3

The layout of the fragments held by different threads is shown in Figure 77.

Figure 77 MMA .m16n8k8 fragment layout for accumulator matrix C/D with .f16x2/.f32 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID                            for c0 and c1
         groupID + 8                          for c2 and c3

col =  (threadID_in_group * 2) + (i & 0x1)    for ci   where i = {0,..,3}
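
After the operation, the result is spread over the warp in the same pattern. The sketch below reassembles a full 16x8 tile from per-lane accumulator fragments, which is one way to compare device output against a host reference. `scatter_accumulators` and the labelled input are hypothetical.

```python
def scatter_accumulators(fragments):
    """Rebuild the 16x8 C/D tile from fragments[laneid] = [c0, c1, c2, c3]."""
    tile = [[None] * 8 for _ in range(16)]
    for laneid, frag in enumerate(fragments):
        group_id = laneid >> 2
        thread_id_in_group = laneid % 4
        for i, value in enumerate(frag):
            row = group_id if i < 2 else group_id + 8
            col = thread_id_in_group * 2 + (i & 0x1)
            tile[row][col] = value
    return tile

# Hypothetical input: label each value with its (lane, i) origin.
fragments = [[(lane, i) for i in range(4)] for lane in range(32)]
tile = scatter_accumulators(fragments)
```

Here `tile[3][5]` comes out as `(14, 1)`: row 3 belongs to groupID 3, and column 5 is the odd element of threadID_in_group 2, so the value is c1 of lane 14.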

.f64:

.ctype / .dtype Fragment Elements (low to high)
.f64 A vector expression of four .f64 registers containing four .f64 elements from the matrix C (or D). c0, c1, c2, c3

The layout of the fragments held by different threads is shown in Figure 78.

Figure 78 MMA .m16n8k8 fragment layout for accumulator matrix C/D with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID                            for c0 and c1
         groupID + 8                          for c2 and c3

col =  (threadID_in_group * 2) + (i & 0x1)    for ci   where i = {0,..,3}

9.7.14.5.8. Matrix Fragments for mma.m16n8k16 with floating point type

A warp executing mma.m16n8k16 with floating point types will compute an MMA operation of shape .m16n8k16.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

Multiplicand A:

.f16 and .bf16:

.atype Fragment Elements (low to high)
.f16 / .bf16 A vector expression containing four .f16x2 registers, with each register containing two .f16 / .bf16 elements from the matrix A. a0, a1, a2, a3, a4, a5, a6, a7

The layout of the fragments held by different threads is shown in Figure 79.

Figure 79 MMA .m16n8k16 fragment layout for matrix A with .f16 / .bf16 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  groupID                                      for ai where  0 <= i < 2 || 4 <= i < 6
       groupID + 8                                  otherwise

col =  (threadID_in_group * 2) + (i & 0x1)          for ai where i <  4
       (threadID_in_group * 2) + (i & 0x1) + 8      for ai where i >= 4
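
The sketch below evaluates this mapping for the 16x16 A tile and relates each element to its register. It assumes the eight elements are packed in order, two per .f16x2 register with the even-numbered element in the low half; the helper names are illustrative only.

```python
def a_f16_m16n8k16_coord(laneid, i):
    """(row, col) in the 16x16 A tile of element a_i held by lane `laneid`."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if (0 <= i < 2 or 4 <= i < 6) else group_id + 8
    col = thread_id_in_group * 2 + (i & 0x1) + (0 if i < 4 else 8)
    return row, col

def a_f16_m16n8k16_slot(i):
    """(register index, half) of a_i under the packing assumption stated above."""
    return i // 2, "low" if i % 2 == 0 else "high"

# Each register then holds two horizontally adjacent elements of one row.
for lane in range(32):
    for reg in range(4):
        r0, c0 = a_f16_m16n8k16_coord(lane, 2 * reg)
        r1, c1 = a_f16_m16n8k16_coord(lane, 2 * reg + 1)
        assert r0 == r1 and c1 == c0 + 1
```

For lane 0, the four registers cover (0, 0..1), (8, 0..1), (0, 8..9) and (8, 8..9) in that order.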

.f64:

.atype Fragment Elements (low to high)
.f64 A vector expression containing eight .f64 registers, with each register containing one .f64 element from the matrix A. a0, a1, a2, a3, a4, a5, a6, a7

The layout of the fragments held by different threads is shown in Figure 80.

Figure 80 MMA .m16n8k16 fragment layout for matrix A with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  groupID                               for ai where i % 2 = 0
       groupID + 8                           otherwise

col =  (i * 2) + threadID_in_group           for ai where i % 2 = 0
       (i * 2) - 2 + threadID_in_group       otherwise
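
Reading the second branch of col as (i * 2) - 2 + threadID_in_group, the two cases collapse into one expression, because both reduce to 4 * floor(i / 2) + threadID_in_group. The sketch below checks that equivalence for every lane and element. Both function names are hypothetical.

```python
def a_f64_m16n8k16_coord(laneid, i):
    """(row, col) of a_i in the 16x16 A tile, written as the two-branch formula."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    if i % 2 == 0:
        row = group_id
        col = (i * 2) + thread_id_in_group
    else:
        row = group_id + 8
        col = (i * 2) - 2 + thread_id_in_group
    return row, col

def a_f64_m16n8k16_coord_compact(laneid, i):
    """Same mapping without branches (a derived form, not taken from the text)."""
    return (laneid >> 2) + 8 * (i % 2), 4 * (i // 2) + (laneid % 4)

assert all(
    a_f64_m16n8k16_coord(lane, i) == a_f64_m16n8k16_coord_compact(lane, i)
    for lane in range(32) for i in range(8)
)
```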

Multiplicand B:

.f16 and .bf16:

.btype Fragment Elements (low to high)
.f16 / .bf16 A vector expression containing two .f16x2 registers, with each register containing two .f16 / .bf16 elements from the matrix B. b0, b1, b2, b3

The layout of the fragments held by different threads is shown in Figure 81.

Figure 81 MMA .m16n8k16 fragment layout for matrix B with .f16 / .bf16 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  (threadID_in_group * 2) + (i & 0x1)           for bi where i <  2
       (threadID_in_group * 2) + (i & 0x1) + 8       for bi where i >= 2

col = groupID
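
To find which lane must supply a particular element of the 16x8 B tile, the mapping can be inverted. The sketch below pairs the forward formulas with a derived inverse and verifies the round trip. The function names are illustrative, not part of any API.

```python
def b_f16_m16n8k16_coord(laneid, i):
    """(row, col) in the 16x8 B tile of element b_i held by lane `laneid`."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = thread_id_in_group * 2 + (i & 0x1) + (0 if i < 2 else 8)
    col = group_id
    return row, col

def b_f16_m16n8k16_owner(row, col):
    """Inverse mapping: (laneid, i) such that b_i of that lane is B[row][col]."""
    thread_id_in_group = (row % 8) // 2
    i = (row & 0x1) + (2 if row >= 8 else 0)
    return col * 4 + thread_id_in_group, i

# Round trip over every lane and element index.
for lane in range(32):
    for i in range(4):
        assert b_f16_m16n8k16_owner(*b_f16_m16n8k16_coord(lane, i)) == (lane, i)
```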

.f64:

.btype Fragment Elements (low to high)
.f64 A vector expression containing four .f64 registers, with each register containing one .f64 element from the matrix B. b0, b1, b2, b3

The layout of the fragments held by different threads is shown in Figure 82.

Figure 82 MMA .m16n8k16 fragment layout for matrix B with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  threadID_in_group + (i * 4)           for bi where  i < 4

col =  groupID
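
Because col equals groupID, each column of B lives entirely in the four lanes of one group, with rows interleaved across those lanes in steps of four. A short sketch that rebuilds one 16-element column from per-lane fragments makes the interleaving visible. The function name and the labelled input are hypothetical.

```python
def b_f64_m16n8k16_column(fragments, col):
    """Rebuild column `col` (16 values) of B from fragments[laneid] = [b0, b1, b2, b3]."""
    column = [None] * 16
    for thread_id_in_group in range(4):
        laneid = col * 4 + thread_id_in_group      # groupID == col
        for i in range(4):
            column[thread_id_in_group + i * 4] = fragments[laneid][i]
    return column

# Hypothetical input: label each value with its (lane, i) origin.
fragments = [[(lane, i) for i in range(4)] for lane in range(32)]
column2 = b_f64_m16n8k16_column(fragments, 2)
```

Column 2 is supplied by lanes 8 to 11: rows 0 to 3 are the b0 values of those lanes, rows 4 to 7 their b1 values, and so on.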

Accumulators (C or D):

.ctype / .dtype Fragment Elements (low to high)
.f64 A vector expression containing four .f64 registers, containing four .f64 elements from the matrix C (or D). c0, c1, c2, c3
.f32 A vector expression containing four .f32 registers, containing four .f32 elements from the matrix C (or D). c0, c1, c2, c3
.f16 A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D). c0, c1, c2, c3

The layout of the fragments held by different threads is shown in Figure 83.

Figure 83 MMA .m16n8k16 fragment layout for accumulator matrix C/D.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID                               for ci where i <  2
         groupID + 8                             for ci where i >= 2

col =  (threadID_in_group * 2) + (i & 0x1)        for ci where i = {0,..,3}
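
The accumulator mapping determines which four results of D = A * B + C each lane ends up with. The sketch below is a host-side reference under simple assumptions: matrices are plain Python lists of rows, A is 16x16, B and C are 16x8, and arithmetic is ordinary Python arithmetic, so it does not model the precision or rounding of the hardware operation. All names are hypothetical.

```python
def c_m16n8k16_coord(laneid, i):
    """(row, col) in the 16x8 C/D tile of element c_i held by lane `laneid`."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if i < 2 else group_id + 8
    col = thread_id_in_group * 2 + (i & 0x1)
    return row, col

def mma_reference_fragments(A, B, C):
    """Per-lane [d0, d1, d2, d3] for D = A * B + C (A: 16x16, B: 16x8, C: 16x8)."""
    fragments = []
    for laneid in range(32):
        frag = []
        for i in range(4):
            row, col = c_m16n8k16_coord(laneid, i)
            acc = C[row][col]
            for k in range(16):
                acc += A[row][k] * B[k][col]
            frag.append(acc)
        fragments.append(frag)
    return fragments

# Hypothetical inputs: A is the identity, so D equals B and positions are easy to read.
A = [[1 if r == k else 0 for k in range(16)] for r in range(16)]
B = [[10 * k + n for n in range(8)] for k in range(16)]
C = [[0] * 8 for _ in range(16)]
d = mma_reference_fragments(A, B, C)
```

With these inputs lane 0 receives `[0, 1, 80, 81]`, the values at (0, 0), (0, 1), (8, 0) and (8, 1), which matches the row and column formulas above.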