Matrix Fragments for mma.m16n8k8

9.7.14.5.7. Matrix Fragments for mma.m16n8k8

A warp executing mma.m16n8k8 will compute an MMA operation of shape .m16n8k8.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

Multiplicand A:

`.f16` and `.bf16`:

`.atype`	Fragment	Elements (low to high)
`.f16` / `.bf16`	A vector expression containing two `.f16x2` registers, with each register containing two `.f16` / `.bf16` elements from the matrix A.	a0, a1, a2, a3

The layout of the fragments held by different threads is shown in Figure 71.

Figure 71 MMA .m16n8k8 fragment layout for matrix A with .f16 / .bf16 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID            for a0 and a1
           groupID + 8        for a2 and a3

col =  threadID_in_group * 2 + (i & 0x1)    for ai     where i = {0,..,3}
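As a sanity check of the formula above, the following Python sketch maps a lane and element index to a position in the 16x8 matrix A and confirms that every element is owned exactly once. The function name `a_frag_f16_m16n8k8` is illustrative and not part of PTX.

```python
def a_frag_f16_m16n8k8(laneid, i):
    """Return (row, col) of element a_i held by lane `laneid` (hypothetical helper)."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if i < 2 else group_id + 8   # a0, a1 vs. a2, a3
    col = thread_id_in_group * 2 + (i & 0x1)
    return row, col

# Every element of the 16x8 matrix A should be owned by exactly one (lane, i) pair.
owners = {a_frag_f16_m16n8k8(lane, i) for lane in range(32) for i in range(4)}
assert owners == {(r, c) for r in range(16) for c in range(8)}

print(a_frag_f16_m16n8k8(5, 3))  # (9, 3)
```

Lane 5 has groupID 1 and threadID_in_group 1, so its a3 lands at row 1 + 8 and column 1 * 2 + 1.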

`.tf32`:

`.atype`	Fragment	Elements (low to high)
`.tf32`	A vector expression containing four `.b32` registers, containing four `.tf32` elements from the matrix A.	a0, a1, a2, a3

The layout of the fragments held by different threads is shown in Figure 72.

Figure 72 MMA .m16n8k8 fragment layout for matrix A with .tf32 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID            for a0 and a2
           groupID + 8        for a1 and a3

col =  threadID_in_group       for a0 and a1
       threadID_in_group + 4   for a2 and a3
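One way to visualize the `.tf32` layout is to build an ownership grid from the formula above. The sketch below is a rough text stand-in for the figure; the names `a_frag_tf32_m16n8k8` and `owner` are assumptions made for illustration.

```python
def a_frag_tf32_m16n8k8(laneid, i):
    """Return (row, col) of element a_i for .tf32 matrix A (hypothetical helper)."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id + 8 * (i & 0x1)           # a1 and a3 sit 8 rows below a0 and a2
    col = thread_id_in_group + 4 * (i >> 1)  # a2 and a3 sit 4 columns right of a0 and a1
    return row, col

# owner[row][col] records which thread and element index hold A[row][col].
owner = [[None] * 8 for _ in range(16)]
for lane in range(32):
    for i in range(4):
        row, col = a_frag_tf32_m16n8k8(lane, i)
        assert owner[row][col] is None       # no element is claimed twice
        owner[row][col] = f"T{lane}:a{i}"

print(" ".join(owner[0]))  # T0:a0 T1:a0 T2:a0 T3:a0 T0:a2 T1:a2 T2:a2 T3:a2
print(" ".join(owner[8]))  # T0:a1 T1:a1 T2:a1 T3:a1 T0:a3 T1:a3 T2:a3 T3:a3
```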

`.f64`:

`.atype`	Fragment	Elements (low to high)
`.f64`	A vector expression containing four `.f64` registers, containing four `.f64` elements from the matrix A.	a0, a1, a2, a3

The layout of the fragments held by different threads is shown in Figure 73.

Figure 73 MMA .m16n8k8 fragment layout for matrix A with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID            for a0 and a2
           groupID + 8        for a1 and a3

col =  threadID_in_group       for a0 and a1
       threadID_in_group + 4   for a2 and a3
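When loading data it is often more convenient to ask the inverse question: which lane and element index hold a given matrix position? The sketch below derives that inverse from the formula above and checks it by round trip. Both function names are hypothetical.

```python
def a_frag_f64_m16n8k8(laneid, i):
    """Forward map: (laneid, i) -> (row, col) for .f64 matrix A."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id + 8 * (i & 0x1)
    col = thread_id_in_group + 4 * (i >> 1)
    return row, col

def a_owner_f64_m16n8k8(row, col):
    """Inverse map: (row, col) -> (laneid, i), derived from the forward map."""
    group_id = row % 8
    thread_id_in_group = col % 4
    laneid = group_id * 4 + thread_id_in_group
    i = (row // 8) + 2 * (col // 4)
    return laneid, i

# Round trip over every lane and element index.
for lane in range(32):
    for i in range(4):
        assert a_owner_f64_m16n8k8(*a_frag_f64_m16n8k8(lane, i)) == (lane, i)

print(a_owner_f64_m16n8k8(10, 5))  # (9, 3)
```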

Multiplicand B:

`.f16` and `.bf16`:

`.btype`	Fragment	Elements (low to high)
`.f16` / `.bf16`	A vector expression containing a single `.f16x2` register, containing two `.f16` / `.bf16` elements from the matrix B.	b0, b1

The layout of the fragments held by different threads is shown in Figure 74.

Figure 74 MMA .m16n8k8 fragment layout for matrix B with .f16 / .bf16 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row = (threadID_in_group * 2) + i       for bi    where i = {0, 1}

col =  groupID
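The following sketch illustrates how a lane's two `.f16` elements could be packed into the single 32-bit register, with b0 in the low half and b1 in the high half as the "low to high" column suggests. It covers `.f16` only (the `.bf16` encoding differs), and the helper names and the sample matrix are assumptions made for illustration.

```python
import struct

def b_frag_f16_m16n8k8(laneid, i):
    """Return (row, col) of element b_i held by lane `laneid` (hypothetical helper)."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = thread_id_in_group * 2 + i
    col = group_id
    return row, col

def f16_bits(x):
    """IEEE half-precision bit pattern of x as an unsigned 16-bit integer."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def pack_b_register(B, laneid):
    """Pack this lane's b0 (low half) and b1 (high half) into one 32-bit value."""
    r0, c0 = b_frag_f16_m16n8k8(laneid, 0)
    r1, c1 = b_frag_f16_m16n8k8(laneid, 1)
    return f16_bits(B[r0][c0]) | (f16_bits(B[r1][c1]) << 16)

# Sample 8x8 matrix B: column 0 holds 1.0, 2.0, ..., 8.0; everything else is zero.
B = [[float(k + 1) if n == 0 else 0.0 for n in range(8)] for k in range(8)]

print(hex(pack_b_register(B, 0)))  # 0x40003c00
```

Lane 0 holds B[0][0] = 1.0 (0x3C00) and B[1][0] = 2.0 (0x4000), which gives the packed value shown.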

`.tf32`:

`.btype`	Fragment	Elements (low to high)
`.tf32`	A vector expression containing two `.b32` registers, containing two `.tf32` elements from the matrix B.	b0, b1

The layout of the fragments held by different threads is shown in Figure 75.

Figure 75 MMA .m16n8k8 fragment layout for matrix B with .tf32 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =    threadID_in_group         for b0
       threadID_in_group + 4       for b1

col =  groupID
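As an illustration of the formula above, the sketch below gathers one lane's fragment from an 8x8 matrix B stored as a flat row-major list. The row-major storage, the sample values, and the helper names are assumptions chosen for the example.

```python
def b_frag_tf32_m16n8k8(laneid, i):
    """Return (row, col) of element b_i for .tf32 matrix B (hypothetical helper)."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = thread_id_in_group + 4 * i   # b0 -> rows 0..3, b1 -> rows 4..7
    col = group_id
    return row, col

K, N = 8, 8
B_flat = list(range(K * N))            # sample data: B[row][col] stored at row * N + col

def load_b_fragment(laneid):
    """Collect [b0, b1] for one lane from the flat row-major buffer."""
    return [B_flat[r * N + c]
            for r, c in (b_frag_tf32_m16n8k8(laneid, i) for i in range(2))]

print(load_b_fragment(13))  # [11, 43]
```

Lane 13 has groupID 3 and threadID_in_group 1, so it reads B[1][3] and B[5][3].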

`.f64`:

`.btype`	Fragment	Elements (low to high)
`.f64`	A vector expression containing two `.f64` registers, containing two `.f64` elements from the matrix B.	b0, b1

The layout of the fragments held by different threads is shown in Figure 76.

Figure 76 MMA .m16n8k8 fragment layout for matrix B with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =    threadID_in_group         for b0
       threadID_in_group + 4       for b1

col =  groupID

Accumulators (C or D):

`.f16`, `.bf16` and `.tf32`:

`.ctype` / `.dtype`	Fragment	Elements (low to high)
`.f16`	A vector expression containing two `.f16x2` registers, with each register containing two `.f16` elements from the matrix C (or D).	c0, c1, c2, c3
`.f32`	A vector expression of four `.f32` registers, containing four `.f32` elements from the matrix C (or D).	c0, c1, c2, c3

The layout of the fragments held by different threads is shown in Figure 77.

Figure 77 MMA .m16n8k8 fragment layout for accumulator matrix C/D with .f16x2/.f32 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID                            for c0 and c1
         groupID + 8                          for c2 and c3

col =  (threadID_in_group * 2) + (i & 0x1)    for ci   where i = {0,..,3}
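A consequence of the formula above is that c0 and c1 are horizontally adjacent, c2 and c3 likewise, and the second pair sits eight rows below the first. The sketch below checks this for every lane; the function name is illustrative only.

```python
def c_frag_m16n8k8(laneid, i):
    """Return (row, col) of accumulator element c_i (hypothetical helper)."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if i < 2 else group_id + 8
    col = thread_id_in_group * 2 + (i & 0x1)
    return row, col

for lane in range(32):
    (r0, c0), (r1, c1), (r2, c2), (r3, c3) = (c_frag_m16n8k8(lane, i) for i in range(4))
    assert (r1, c1) == (r0, c0 + 1)   # c1 is the right-hand neighbour of c0
    assert (r3, c3) == (r2, c2 + 1)   # c3 is the right-hand neighbour of c2
    assert (r2, c2) == (r0 + 8, c0)   # the second pair is 8 rows lower

print([c_frag_m16n8k8(7, i) for i in range(4)])  # [(1, 6), (1, 7), (9, 6), (9, 7)]
```

Because each pair is contiguous within a row, a lane could plausibly write each pair to a row-major result with one two-element store, although that is an observation about the layout rather than a requirement of the instruction.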

`.f64`:

`.ctype` / `.dtype`	Fragment	Elements (low to high)
`.f64`	A vector expression of four `.f64` registers containing four `.f64` elements from the matrix C (or D).	c0, c1, c2, c3

The layout of the fragments held by different threads is shown in Figure 78.

Figure 78 MMA .m16n8k8 fragment layout for accumulator matrix C/D with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID                            for c0 and c1
         groupID + 8                          for c2 and c3

col =  (threadID_in_group * 2) + (i & 0x1)    for ci   where i = {0,..,3}
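To see how the three `.f64` layouts of this shape fit together, the sketch below emulates the warp-wide operation D = A * B + C on the host. It distributes full matrices into per-lane fragments using the formulas above, reassembles them, multiplies, and redistributes D. This is only a functional model under those assumptions, not a description of how the hardware computes; all names are hypothetical.

```python
def a_pos(lane, i):   # multiplicand A, .f64, 16x8
    return (lane >> 2) + 8 * (i & 1), (lane % 4) + 4 * (i >> 1)

def b_pos(lane, i):   # multiplicand B, .f64, 8x8
    return (lane % 4) + 4 * i, lane >> 2

def c_pos(lane, i):   # accumulators C/D, .f64, 16x8
    return (lane >> 2) + (0 if i < 2 else 8), (lane % 4) * 2 + (i & 1)

def to_fragments(M, pos, n):
    """Split matrix M into 32 per-lane fragments of n elements each."""
    return [[M[r][c] for r, c in (pos(lane, i) for i in range(n))] for lane in range(32)]

def from_fragments(frags, pos, rows, cols):
    """Reassemble a rows x cols matrix from per-lane fragments."""
    M = [[None] * cols for _ in range(rows)]
    for lane, frag in enumerate(frags):
        for i, value in enumerate(frag):
            r, c = pos(lane, i)
            M[r][c] = value
    return M

def emulate_mma(a_frags, b_frags, c_frags):
    A = from_fragments(a_frags, a_pos, 16, 8)
    B = from_fragments(b_frags, b_pos, 8, 8)
    C = from_fragments(c_frags, c_pos, 16, 8)
    D = [[C[m][n] + sum(A[m][k] * B[k][n] for k in range(8)) for n in range(8)]
         for m in range(16)]
    return to_fragments(D, c_pos, 4)

A = [[float(m + k) for k in range(8)] for m in range(16)]
B = [[1.0 if k == n else 0.0 for n in range(8)] for k in range(8)]   # identity
C = [[0.5] * 8 for _ in range(16)]

d_frags = emulate_mma(to_fragments(A, a_pos, 4),
                      to_fragments(B, b_pos, 2),
                      to_fragments(C, c_pos, 4))
print(d_frags[0])  # [0.5, 1.5, 8.5, 9.5]
```

With B as the identity, D equals A + 0.5, so lane 0 ends up with the values at (0, 0), (0, 1), (8, 0) and (8, 1).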

9.7.14.5.8. Matrix Fragments for mma.m16n8k16 with floating point type

A warp executing mma.m16n8k16 with floating point types will compute an MMA operation of shape .m16n8k16.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

Multiplicand A:

`.f16` and `.bf16`:

`.atype`	Fragment	Elements (low to high)
`.f16` / `.bf16`	A vector expression containing four `.f16x2` registers, with each register containing two `.f16` / `.bf16` elements from the matrix A.	a0, a1, a2, a3, a4, a5, a6, a7

The layout of the fragments held by different threads is shown in Figure 79.

Figure 79 MMA .m16n8k16 fragment layout for matrix A with .f16 / .bf16 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID            for ai where  0 <= i < 2 || 4 <= i < 6
          groupID + 8         Otherwise

col =  (threadID_in_group * 2) + (i & 0x1)          for ai where i <  4
       (threadID_in_group * 2) + (i & 0x1) + 8      for ai where i >= 4
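The sketch below evaluates the formula above for all eight elements and groups them by register. It assumes that register r holds the consecutive pair a(2r), a(2r+1), which is how the "low to high" ordering is read here; the helper names are hypothetical.

```python
def a_frag_f16_m16n8k16(laneid, i):
    """Return (row, col) of element a_i for .f16/.bf16 matrix A (hypothetical helper)."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if (0 <= i < 2 or 4 <= i < 6) else group_id + 8
    col = thread_id_in_group * 2 + (i & 0x1) + (8 if i >= 4 else 0)
    return row, col

def a_registers(laneid):
    """Positions of the two elements in each of the four registers (assumed pairing)."""
    return [(a_frag_f16_m16n8k16(laneid, 2 * r), a_frag_f16_m16n8k16(laneid, 2 * r + 1))
            for r in range(4)]

# The 32 lanes x 8 elements should tile the 16x16 matrix A exactly once.
owners = {a_frag_f16_m16n8k16(lane, i) for lane in range(32) for i in range(8)}
assert owners == {(r, c) for r in range(16) for c in range(16)}

print(a_registers(0))
# [((0, 0), (0, 1)), ((8, 0), (8, 1)), ((0, 8), (0, 9)), ((8, 8), (8, 9))]
```

Under that pairing, each register covers two adjacent columns of one row, and the four registers of a lane touch the four 8x8 quadrants of A.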

`.f64`:

`.atype`	Fragment	Elements (low to high)
`.f64`	A vector expression containing eight `.f64` registers, with each register containing one `.f64` element from the matrix A.	a0, a1, a2, a3, a4, a5, a6, a7

The layout of the fragments held by different threads is shown in Figure 80.

Figure 80 MMA .m16n8k16 fragment layout for matrix A with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  groupID                               for ai where  i % 2 = 0
       groupID + 8                           Otherwise

col =  (i * 2) + threadID_in_group           for ai where i % 2 = 0
       (i * 2) - 2 + threadID_in_group       Otherwise
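The sketch below evaluates this layout, reading the odd-index column case as `(i * 2) - 2 + threadID_in_group`, and checks that the result tiles the 16x16 matrix A exactly once. The function name is an assumption made for illustration.

```python
def a_frag_f64_m16n8k16(laneid, i):
    """Return (row, col) of element a_i for .f64 matrix A (hypothetical helper)."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    if i % 2 == 0:
        return group_id, (i * 2) + thread_id_in_group
    return group_id + 8, (i * 2) - 2 + thread_id_in_group

# 32 lanes x 8 elements should cover all 256 positions without overlap.
owners = {a_frag_f64_m16n8k16(lane, i) for lane in range(32) for i in range(8)}
assert owners == {(r, c) for r in range(16) for c in range(16)}

print([a_frag_f64_m16n8k16(0, i) for i in range(8)])
# [(0, 0), (8, 0), (0, 4), (8, 4), (0, 8), (8, 8), (0, 12), (8, 12)]
```

Each even/odd pair of elements shares a column, with the odd element eight rows lower, and successive pairs step four columns to the right.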

Multiplicand B:

`.f16` and `.bf16`:

`.btype`	Fragment	Elements (low to high)
`.f16` / `.bf16`	A vector expression containing two `.f16x2` registers, with each register containing two `.f16` / `.bf16` elements from the matrix B.	b0, b1, b2, b3

The layout of the fragments held by different threads is shown in Figure 81.

Figure 81 MMA .m16n8k16 fragment layout for matrix B with .f16 / .bf16 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  (threadID_in_group * 2) + (i & 0x1)           for bi where i <  2
       (threadID_in_group * 2) + (i & 0x1) + 8       for bi where i >= 2

col = groupID
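Since all four elements of a lane come from one column of B, their memory offsets are easy to compute when B is stored column-major. The sketch below does this; the column-major storage and the helper names are assumptions of the example, not something this section prescribes.

```python
def b_frag_f16_m16n8k16(laneid, i):
    """Return (row, col) of element b_i for .f16/.bf16 matrix B (hypothetical helper)."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = thread_id_in_group * 2 + (i & 0x1) + (8 if i >= 2 else 0)
    col = group_id
    return row, col

K = 16  # number of rows of B for shape .m16n8k16

def b_offsets_col_major(laneid):
    """Element offsets of b0..b3 in a column-major 16x8 buffer (assumed storage)."""
    return [c * K + r for r, c in (b_frag_f16_m16n8k16(laneid, i) for i in range(4))]

print(b_offsets_col_major(5))  # [18, 19, 26, 27]
```

Under that storage assumption, b0 and b1 are adjacent in memory, as are b2 and b3, and the two pairs are eight elements apart.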

`.f64`:

`.btype`	Fragment	Elements (low to high)
`.f64`	A vector expression containing four `.f64` registers, with each register containing one `.f64` element from the matrix B.	b0, b1, b2, b3

The layout of the fragments held by different threads is shown in Figure 82.

Figure 82 MMA .m16n8k16 fragment layout for matrix B with .f64 type.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  threadID_in_group + (i * 4)           for bi where  i < 4

col =  groupID
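The formula above interleaves the rows of one column across the four threads of a group with a stride of four. A short sketch makes this visible for column 0 (lanes 0 to 3); the function name is illustrative only.

```python
def b_frag_f64_m16n8k16(laneid, i):
    """Return (row, col) of element b_i for .f64 matrix B (hypothetical helper)."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    return thread_id_in_group + (i * 4), group_id

# Rows of column 0 held by each of the four lanes in group 0.
rows_by_lane = {lane: [b_frag_f64_m16n8k16(lane, i)[0] for i in range(4)]
                for lane in range(4)}
print(rows_by_lane)
# {0: [0, 4, 8, 12], 1: [1, 5, 9, 13], 2: [2, 6, 10, 14], 3: [3, 7, 11, 15]}

# All 32 lanes together cover the 16x8 matrix B exactly once.
owners = {b_frag_f64_m16n8k16(lane, i) for lane in range(32) for i in range(4)}
assert owners == {(r, c) for r in range(16) for c in range(8)}
```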

Accumulators (C or D):

`.ctype` / `.dtype`	Fragment	Elements (low to high)
`.f64`	A vector expression containing four `.f64` registers containing four `.f64` elements from the matrix C (or D).	c0, c1, c2, c3
`.f32`	A vector expression containing four `.f32` registers containing four `.f32` elements from the matrix C (or D).	c0, c1, c2, c3
`.f16`	A vector expression containing two `.f16x2` registers, with each register containing two `.f16` elements from the matrix C (or D).	c0, c1, c2, c3

The layout of the fragments held by different threads is shown in Figure 83.

Figure 83 MMA .m16n8k16 fragment layout for accumulator matrix C/D.

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID                               for ci where i <  2
         groupID + 8                             for ci where i >= 2

col =  (threadID_in_group * 2) + (i & 0x1)        for ci where i = {0,..,3}
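After the operation, the per-lane accumulator fragments usually need to be written back to a full 16x8 result. The sketch below scatters sample fragments into a matrix using the formula above; the sample values (`lane * 4 + i`) and the helper names are assumptions chosen so that the origin of each entry is easy to read off.

```python
def c_frag_m16n8k16(laneid, i):
    """Return (row, col) of accumulator element c_i (hypothetical helper)."""
    group_id = laneid >> 2
    thread_id_in_group = laneid % 4
    row = group_id if i < 2 else group_id + 8
    col = thread_id_in_group * 2 + (i & 0x1)
    return row, col

def scatter_d(d_frags):
    """Place 32 fragments of four elements each into a 16x8 matrix."""
    D = [[None] * 8 for _ in range(16)]
    for lane, frag in enumerate(d_frags):
        for i, value in enumerate(frag):
            row, col = c_frag_m16n8k16(lane, i)
            D[row][col] = value
    return D

# Sample fragments: element i of lane L carries the tag L * 4 + i.
d_frags = [[lane * 4 + i for i in range(4)] for lane in range(32)]
D = scatter_d(d_frags)

print(D[0])  # [0, 1, 4, 5, 8, 9, 12, 13]
print(D[8])  # [2, 3, 6, 7, 10, 11, 14, 15]
```

Row 0 is filled by c0 and c1 of lanes 0 to 3, and row 8 by c2 and c3 of the same lanes.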
