9.7.14.5. Matrix multiply-accumulate operation using mma instruction

This section describes warp-level mma, ldmatrix, stmatrix, and movmatrix instructions and the organization of various matrices involved in these instructions.

9.7.14.5.1. Matrix Fragments for mma.m8n8k4 with .f16 floating point type

A warp executing mma.m8n8k4 with .f16 floating point type will compute 4 MMA operations of shape .m8n8k4.

Elements of 4 matrices need to be distributed across the threads in a warp. The following table shows distribution of matrices for MMA operations.

MMA Computation Threads participating in MMA computation
MMA computation 1 Threads with %laneid 0–3 (low group) and 16–19 (high group)
MMA computation 2 Threads with %laneid 4–7 (low group) and 20–23 (high group)
MMA computation 3 Threads with %laneid 8–11 (low group) and 24–27 (high group)
MMA computation 4 Threads with %laneid 12–15 (low group) and 28–31 (high group)

For each of the individual MMA computation shown above, each of the required thread holds a fragment of the matrix for performing mma operation as follows:

Multiplicand A:

.atype Fragment Elements (low to high)
.f16 A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix A. a0, a1, a2, a3

The layout of the fragments held by different threads is shown below:

Fragment layout for Row Major matrix A is shown in Figure 46.

!MMA .m8n8k4 fragment layout for row-major matrix A with .f16 type

Figure 46 MMA .m8n8k4 fragment layout for row-major matrix A with .f16 type

The row and column of a matrix fragment can be computed as:

row =            %laneid % 4          if %laneid < 16
                (%laneid % 4) + 4     otherwise

col =            i                    for ai where i = {0,..,3}

Fragment layout for Column Major matrix A is shown in Figure 47.

The layout of the fragments held by different threads is shown below:

!MMA .m8n8k4 fragment layout for column-major matrix A with .f16 type

Figure 47 MMA .m8n8k4 fragment layout for column-major matrix A with .f16 type

The row and column of a matrix fragment can be computed as:

row =        i % 4            for ai  where i = {0,..,3}   if %laneid < 16
            (i % 4) + 4       for ai  where i = {0,..,3}   otherwise

col =        %laneid % 4

Multiplicand B:

.btype Fragment Elements (low to high)
.f16 A vector expression containing two .f16x2 registers, with each register containing two .f16 elements from the matrix B. b0, b1, b2, b3

The layout of the fragments held by different threads is shown below:

Fragment layout for Row Major matrix B is shown in Figure 48.

!MMA .m8n8k4 fragment layout for row-major matrix B with .f16 type

Figure 48 MMA .m8n8k4 fragment layout for row-major matrix B with .f16 type

The row and column of a matrix fragment can be computed as:

row =        %laneid % 4

col =         i      for bi   where i = {0,..,3}   if %laneid < 16
             i+4     for bi   where i = {0,..,3}   otherwise

Fragment layout for Column Major matrix B is shown in Figure 49.

!MMA .m8n8k4 fragment layout for column-major matrix B with .f16 type

Figure 49 MMA .m8n8k4 fragment layout for column-major matrix B with .f16 type

The row and column of a matrix fragment can be computed as:

row =       i                 for bi   where i = {0,..,3}

col =      %laneid % 4        if %laneid < 16
          (%laneid % 4) + 4   otherwise

Accumulators C (or D):

.ctype / .dtype Fragment Elements (low to high)
.f16 A vector expression containing four .f16x2 registers, with each register containing two .f16 elements from the matrix C (or D). c0, c1, c2, c3, c4, c5, c6, c7
.f32 A vector expression of eight .f32 registers.

The layout of the fragments held by different threads is shown below:

Fragment layout for accumulator matrix when .ctype is .f16 is shown in Figure 50.

!MMA .m8n8k4 fragment layout for matrix C/D with .ctype = .f16

Figure 50 MMA .m8n8k4 fragment layout for matrix C/D with .ctype = .f16

The row and column of a matrix fragment can be computed as:

row =       %laneid % 4         if %laneid < 16
           (%laneid % 4) + 4    otherwise

col =          i                for ci   where i = {0,..,7}

Fragment layout for accumulator matrix when .ctype is .f32 is shown in Figure 51 and Figure 52.

!MMA .m8n8k4 computation 1 and 2 fragment layout for matrix C/D with .ctype = .f32

Figure 51 MMA .m8n8k4 computation 1 and 2 fragment layout for matrix C/D with .ctype = .f32

!MMA .m8n8k4 computation 3 and 4 fragment layout for matrix C/D with .ctype = .f32

Figure 52 MMA .m8n8k4 computation 3 and 4 fragment layout for matrix C/D with .ctype = .f32

The row and column of a matrix fragment can be computed as:

row =     X           if %laneid < 16
        X + 4         otherwise

          where X = (%laneid & 0b1) + (i & 0b10)  for ci where i = {0,..,7}

col = (i & 0b100) + (%laneid & 0b10) + (i & 0b1)  for ci where i = {0,..,7}

9.7.14.5.2. Matrix Fragments for mma.m8n8k4 with .f64 floating point type

A warp executing mma.m8n8k4 with .f64 floating point type will compute an MMA operation of shape .m8n8k4.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

Multiplicand A:

.atype Fragment Elements (low to high)
.f64 A vector expression containing a single .f64 register, containing single .f64 element from the matrix A. a0

The layout of the fragments held by different threads is shown in Figure 53.

!MMA .m8n8k4 fragment layout for matrix A with .f64 type

Figure 53 MMA .m8n8k4 fragment layout for matrix A with .f64 type

The row and column of a matrix fragment can be computed as:

row =        %laneid >> 2

col =        %laneid % 4

Multiplicand B:

.btype Fragment Elements (low to high)
.f64 A vector expression containing a single .f64 register, containing a single .f64 element from the matrix B. b0

The layout of the fragments held by different threads is shown in Figure 54.

!MMA .m8n8k4 fragment layout for matrix B with .f64 type

Figure 54 MMA .m8n8k4 fragment layout for matrix B with .f64 type

The row and column of a matrix fragment can be computed as:

row =        %laneid % 4

col =        %laneid >> 2

Accumulators (C or D):

.ctype / .dtype Fragment Elements (low to high)
.f64 A vector expression containing of two .f64 registers containing two .f64 elements from the matrix C. c0, c1

The layout of the fragments held by different threads is shown in Figure 55.

!MMA .m8n8k4 fragment layout for accumulator matrix C/D with .f64 type

Figure 55 MMA .m8n8k4 fragment layout for accumulator matrix C/D with .f64 type

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID

col =      (threadID_in_group * 2) + (i & 0x1)       for ci   where i = {0, 1}

9.7.14.5.3. Matrix Fragments for mma.m8n8k16

A warp executing mma.m8n8k16 will compute an MMA operation of shape .m8n8k16.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

Multiplicand A:

.atype Fragment Elements (low to high)
.s8 / .u8 A vector expression containing a single .b32 register, containing four .s8 or .u8 elements from the matrix A. a0, a1, a2, a3

The layout of the fragments held by different threads is shown in Figure 56.

!MMA .m8n8k16 fragment layout for matrix A with .u8/.s8 type

Figure 56 MMA .m8n8k16 fragment layout for matrix A with .u8/.s8 type

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row = groupID

col =  (threadID_in_group * 4) + i       for ai    where i = {0,..,3}

Multiplicand B:

.btype Fragment Elements (low to high)
.s8 / .u8 A vector expression containing a single .b32 register, containing four .s8 or .u8 elements from the matrix B. b0, b1, b2, b3

The layout of the fragments held by different threads is shown in Figure 57.

!MMA .m8n8k16 fragment layout for matrix B with .u8/.s8 type

Figure 57 MMA .m8n8k16 fragment layout for matrix B with .u8/.s8 type

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  (threadID_in_group * 4) + i         for bi    where i = {0,..,3}

col =    groupID

Accumulators (C or D):

.ctype / .dtype Fragment Elements (low to high)
.s32 A vector expression containing of two .s32 registers. c0, c1

The layout of the fragments held by different threads is shown in Figure 58.

!MMA .m8n8k16 fragment layout for accumulator matrix C/D with .s32 type

Figure 58 MMA .m8n8k16 fragment layout for accumulator matrix C/D with .s32 type

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row = groupID

col = (threadID_in_group * 2) + i         for ci    where i = {0, 1}

9.7.14.5.4. Matrix Fragments for mma.m8n8k32

A warp executing mma.m8n8k32 will compute an MMA operation of shape .m8n8k32.

Elements of the matrix are distributed across the threads in a warp so each thread of the warp holds a fragment of the matrix.

Multiplicand A:

.atype Fragment Elements (low to high)
.s4 / .u4 A vector expression containing a single .b32 register, containing eight .s4 or .u4 elements from the matrix A. a0, a1, a2, a3, a4, a5, a6, a7

The layout of the fragments held by different threads is shown in Figure 59.

!MMA .m8n8k32 fragment layout for matrix A with .u4/.s4 type

Figure 59 MMA .m8n8k32 fragment layout for matrix A with .u4/.s4 type

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =      groupID

col = (threadID_in_group * 8) + i         for ai    where i = {0,..,7}

Multiplicand B:

.btype Fragment Elements (low to high)
.s4 / .u4 A vector expression containing a single .b32 register, containing eight .s4 or .u4 elements from the matrix B. b0, b1, b2, b3, b4, b5, b6, b7

The layout of the fragments held by different threads is shown in Figure 60.

!MMA .m8n8k32 fragment layout for matrix B with .u4/.s4 type

Figure 60 MMA .m8n8k32 fragment layout for matrix B with .u4/.s4 type

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row = (threadID_in_group * 8) + i         for bi   where i = {0,..,7}

col = groupID

Accumulators (C or D):

.ctype / .dtype Fragment Elements (low to high)
.s32 A vector expression of two .s32 registers. c0, c1

The layout of the fragments held by different threads is shown in Figure 61:

!MMA .m8n8k32 fragment layout for accumulator matrix C/D with .s32 type

Figure 61 MMA .m8n8k32 fragment layout for accumulator matrix C/D with .s32 type

The row and column of a matrix fragment can be computed as:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =   groupID
col = (threadID_in_group * 2) + i         for ci   where i = {0, 1}