9.7.16.2.3. Data Movement Shape

The data movement shape specifies the dimensions of the data to be moved to or from the Tensor Memory. Each shape is written as a pair lane x size, where:

  • lane indicates the number of rows in the Tensor Memory; and
  • size indicates the amount of data, in units of bits (b), across the columns in the Tensor Memory.
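As a rough illustration of the lane x size notation, the sketch below splits a qualifier such as .16x64b into its two numbers. The names MoveShape, parse_positive, parse_shape and shape_examples are hypothetical helpers invented for this example; they are not part of PTX or of any NVIDIA API. Suffixed forms such as .16x32bx2 are deliberately rejected, since the definition above covers only the plain lane x size form.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper type; not part of PTX or any NVIDIA API.
struct MoveShape {
    int lanes;      // number of rows in the Tensor Memory
    int size_bits;  // amount of data across the columns, in bits
};

// Parses a strictly positive decimal integer that occupies all of `s`.
inline std::optional<int> parse_positive(std::string_view s) {
    if (s.empty()) return std::nullopt;
    int value = 0;
    const char* first = s.data();
    const char* last = s.data() + s.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last || value <= 0) return std::nullopt;
    return value;
}

// Parses a qualifier of the form ".<lane>x<size>b", e.g. ".16x64b".
// Anything else (including suffixed forms like ".16x32bx2") yields nullopt.
inline std::optional<MoveShape> parse_shape(std::string_view q) {
    if (q.size() < 2 || q.front() != '.' || q.back() != 'b') return std::nullopt;
    q.remove_prefix(1);  // drop the leading '.'
    q.remove_suffix(1);  // drop the trailing 'b' (unit: bits)
    const auto x = q.find('x');
    if (x == std::string_view::npos) return std::nullopt;
    const auto lanes = parse_positive(q.substr(0, x));
    const auto bits = parse_positive(q.substr(x + 1));
    if (!lanes || !bits) return std::nullopt;
    return MoveShape{*lanes, *bits};
}

// Usage: ".16x64b" reads as 16 lanes by 64 bits.
inline void shape_examples() {
    const auto s = parse_shape(".16x64b");
    assert(s && s->lanes == 16 && s->size_bits == 64);
    assert(!parse_shape(".16x32bx2"));  // suffixed form is not handled here
}
```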

The following shapes are supported by various tcgen05 operations:

| Shape | tcgen05.<op> |
|---|---|
| .16x64b, .16x128b, .16x256b, .16x32bx2, .32x32b | .ld / .st |
| .4x256b, .32x128b, .64x128b, .128x256b, .128x128b | .cp |
| .31x256b (implicit) | .shift |
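The shape table above can also be expressed as a small lookup, which may be handy when validating generated instruction strings. This is only a sketch: ShapeEntry, kShapeTable, supports and support_examples are hypothetical names chosen for the example, and the entries are copied from the table rather than taken from any toolchain API.

```cpp
#include <array>
#include <cassert>
#include <string_view>

// Hypothetical table row: one (operation, shape) pair from the table above.
struct ShapeEntry {
    std::string_view op;
    std::string_view shape;
};

constexpr std::array<ShapeEntry, 16> kShapeTable = {{
    // .ld and .st share the same set of shapes.
    {".ld", ".16x64b"}, {".ld", ".16x128b"}, {".ld", ".16x256b"},
    {".ld", ".16x32bx2"}, {".ld", ".32x32b"},
    {".st", ".16x64b"}, {".st", ".16x128b"}, {".st", ".16x256b"},
    {".st", ".16x32bx2"}, {".st", ".32x32b"},
    {".cp", ".4x256b"}, {".cp", ".32x128b"}, {".cp", ".64x128b"},
    {".cp", ".128x256b"}, {".cp", ".128x128b"},
    // The table marks this shape as implicit for .shift.
    {".shift", ".31x256b"},
}};

// Returns true if the table lists `shape` for tcgen05<op>.
constexpr bool supports(std::string_view op, std::string_view shape) {
    for (const auto& entry : kShapeTable) {
        if (entry.op == op && entry.shape == shape) return true;
    }
    return false;
}

// Usage: shapes are specific to the operation they are listed under.
inline void support_examples() {
    assert(supports(".ld", ".32x32b"));
    assert(!supports(".cp", ".32x32b"));
}
```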

9.7.16.2.3.1. Memory Layout

The following subsections show the layout of the matrix fragments across the threads of a warp.

9.7.16.2.3.1.1. Matrix fragments for shape .32x32b

A tcgen05{.ld,.st}.32x32b instruction has the following data vector register.

| Fragment | Elements (low to high) |
|---|---|
| A vector expression containing .num number of .b32 registers, as listed in Table 49. | r0, r1, … |

A warp executing tcgen05{.ld,.st}.32x32b accesses 32 lanes of the Tensor Memory. For each lane, it loads or stores (32 * .num) bits of data, as shown in Figure 183.

Figure 183 Matrix Fragment for shape .32x32b

9.7.16.2.3.1.2. Matrix fragments for shape .16x64b

A tcgen05{.ld,.st}.16x64b instruction has the following data vector register.

| Fragment | Elements (low to high) |
|---|---|
| A vector expression containing .num number of .b32 registers, as listed in Table 49. | r0, r1, … |

A warp executing tcgen05{.ld,.st}.16x64b accesses 16 lanes of the Tensor Memory. For each lane, it loads or stores (64 * .num) bits of data, as shown in Figure 184.

Figure 184 Matrix Fragment for shape .16x64b

9.7.16.2.3.1.3. Matrix fragments for shape .16x128b

A tcgen05{.ld,.st}.16x128b instruction has the following data vector register.

| Fragment | Elements (low to high) |
|---|---|
| A vector expression containing .num number of .b32 registers, as listed in Table 49. | r0, r1, … |

A warp executing tcgen05{.ld,.st}.16x128b accesses 16 lanes of the Tensor Memory. For each lane, it loads or stores (128 * .num) bits of data, as shown in Figure 185.

Figure 185 Matrix Fragment for shape .16x128b

9.7.16.2.3.1.4. Matrix fragments for shape .16x256b

A tcgen05{.ld,.st}.16x256b instruction has the following data vector register.

| Fragment | Elements (low to high) |
|---|---|
| A vector expression containing .num number of .b32 registers, as listed in Table 49. | r0, r1, r2, r3, … |

A warp executing tcgen05{.ld,.st}.16x256b accesses 16 lanes of the Tensor Memory. For each lane, it loads or stores (256 * .num) bits of data, as shown in Figure 186.

Figure 186 Matrix Fragment for shape .16x256b

9.7.16.2.3.1.5. Matrix fragments for shape .16x32bx2

A tcgen05{.ld,.st}.16x32bx2 instruction has the following data vector register.

| Fragment | Elements (low to high) |
|---|---|
| A vector expression containing .num number of .b32 registers, as listed in Table 49. | r0, r1, … |

A warp executing tcgen05{.ld,.st}.16x32bx2 accesses 16 lanes of the Tensor Memory. For each lane, it loads or stores (32 * .num) bits of data, as shown in Figure 187.

Figure 187 Matrix Fragment for shape .16x32bx2
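
The lane counts and per-lane sizes stated in the subsections above can be collected into one small table for quick arithmetic. The sketch below applies the stated formulas literally; LdStShape, kLdStShapes, find_shape and bits_per_lane are hypothetical names for this example, and the value passed as num is assumed to be the same .num that appears in those formulas (its permitted values are given in Table 49, not here).

```cpp
#include <cassert>
#include <string_view>

// Hypothetical record of what each .ld/.st subsection states.
struct LdStShape {
    std::string_view name;
    int lanes;      // Tensor Memory lanes accessed by one warp
    int size_bits;  // multiplier of .num in the per-lane formula
};

constexpr LdStShape kLdStShapes[] = {
    {".32x32b", 32, 32},    // (32 * .num) bits per lane
    {".16x64b", 16, 64},    // (64 * .num) bits per lane
    {".16x128b", 16, 128},  // (128 * .num) bits per lane
    {".16x256b", 16, 256},  // (256 * .num) bits per lane
    {".16x32bx2", 16, 32},  // (32 * .num) bits per lane
};

// Looks up a .ld/.st shape by qualifier; returns nullptr if it is not listed.
inline const LdStShape* find_shape(std::string_view name) {
    for (const auto& s : kLdStShapes) {
        if (s.name == name) return &s;
    }
    return nullptr;
}

// Bits loaded from or stored to each lane: size_bits * num.
inline long bits_per_lane(const LdStShape& s, int num) {
    assert(num > 0);  // assumption: .num is a positive count
    return static_cast<long>(s.size_bits) * num;
}
```

For instance, with a hypothetical num of 4, the .16x128b entry gives 128 * 4 = 512 bits per lane across 16 lanes.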