9.7.15.5.1.2. Shared Memory Matrix Layout
If the argument imm-trans-a / imm-trans-b of the instruction wgmma.mma_async{.sp} is 0, then K-major is used for matrix A / B respectively. If the value of argument imm-trans-a is 1 then M-major is used for matrix A. If the value of the argument imm-trans-b is 1, then N-major is used for matrix B.
In a column-major default BLAS library such as cuBLAS, the matrices A and B with and without transpose can be classified as either K-Major or M-or-N-Major as shown in the following table:
| Non-Transposed | Transposed | |
|---|---|---|
| A | K-major | M-major |
| B | K-major | N-major |
To avoid confusion with A, B, row-major, col-major, transpose, and non-transpose, we will use MN-Major and K-Major throughout this section.
The matrices in the shared memory are made up of one or more "swizzle layout atom". The exact layout of these swizzle atoms depends on the swizzling mode, swizzle-atomicity, and the leading dimension. The layout of the swizzle are shown in Table 38.
Table 38 Various combinations of swizzling mode, leading dimension and swizzle-atom layout
| Swizzling mode | Leading Dimension / Major-ness | Swizzle atom layout (128b element) |
|---|---|---|
| 128B Swizzling Mode | M/N | 8x8 |
| 128B Swizzling Mode | K | 8x8 |
| 64B Swizzling Mode | M/N | 4x8 |
| 64B Swizzling Mode | K | 8x4 |
| 32B Swizzling Mode | M/N | 2x8 |
| 32B Swizzling Mode | K | 8x2 |
| None | M/N | 1x8 |
| None | K | 8x1 |
The above shapes are for elements of size 128 bits. For smaller elements sizes, the same shapes would get multiplied along the leading dimension by a factor of 128/sizeof_bits(Element). For example, 128B MN major swizzle atom would have a shape of (8*(128/32))x8 = 32x8 for tf32 tensor core inputs.
Examples
The following are some example layouts of MxK or KxN matrices with various swizzling modes, and are in units of 128b elements as shown by each colored cell as shown in Figure 156, Figure 157, Figure 158, Figure 159, Figure 160, Figure 161, Figure 162, Figure 163.
Figure 156 MN major 128B swizzling
Figure 157 K major 128B swizzling
Figure 158 MN major 64B swizzling
Figure 159 K major 64B swizzling
Figure 160 MN major 32B swizzling
Figure 161 K major 32B swizzling
Figure 162 MN major interleaved
Figure 163 K major interleaved
Following are some of the examples of the 128B swizzling layout for tf32 element type.
K-Major: Figure 164
Figure 164 K major
MN-Major: Figure 165
Figure 165 MN major
9.7.15.5.1.2.1. Major-ness supported by Strides
There are two strides involved while accessing a matrix from shared memory:
- Leading dimension byte offset
- Stride dimension byte offset
9.7.15.5.1.2.1.1. Leading Dimension Byte Offset
The leading dimension byte offset is defined differently for transposed and non-transposed matrices. The leading byte offset is defined as follows for matrices whose element types are normalized to 128-bits:
| Major-ness | Definition |
|---|---|
| K-Major | No-Swizzling: the offset from the first column to the second columns of the 8x2 tile in the 128-bit element type normalized matrix.<br>Swizzled layouts: not used, assumed to be 1. |
| MN-Major | Interleave: offset from the first 8 columns to the next 8 columns.<br>Swizzled layouts: offset from the first (swizzle-byte-size/16) rows to the next (swizzle-byte-size/16) rows. |
9.7.15.5.1.2.1.2. Stride Dimension Byte Offset
The stride dimension byte offset is defined differently for transposed and non-transposed matrices. The stride dimension byte offset is defined as follows for matrices whose element types are normalized to 128-bits:
| Major-ness | Definition |
|---|---|
| K-Major | The offset from the first 8 rows to the next 8 rows. |
| MN-Major | Interleave: offset from the first row to the next row.<br>Swizzled layout: offset from the first 8 columns to the next 8 columns |
9.7.15.5.1.2.1.3. Canonical Layouts
In terms of CuTe layouts the canonical layout can be expressed as follows:
| Major-ness | Swizzling mode | Canonical Layout without swizzling | Swizzling on the previous column |
|---|---|---|---|
| MN-major | No-swizzling or Interleaved | ((T,1,m),(8,k)):((1,T,SBO),(1T,LBO)) |
Swizzle<0, 4, 3> |
| MN-major | 32B Swizzling | ((T,2,m),(8,k)):((1,T,LBO),(2T,SBO)) |
Swizzle<1, 4, 3> |
| MN-major | 64B Swizzling | ((T,4,m),(8,k)):((1,T,LBO),(4T,SBO)) |
Swizzle<2, 4, 3> |
| MN-major | 128B Swizzling | ((T,8,m),(8,k)):((1,T,LBO),(8T,SBO)) |
Swizzle<3, 4, 3> |
| K-major | No-swizzling or Interleaved | ((8,m),(T,2k)):((1T,SBO),(1,LBO)) |
Swizzle<0, 4, 3> |
| K-major | 32B Swizzling | ((8,m),(T,2k)):((2T,SBO),(1,T)) |
Swizzle<1, 4, 3> |
| K-major | 64B Swizzling | ((8,m),(T,2k)):((4T,SBO),(1,T)) |
Swizzle<2, 4, 3> |
| K-major | 128B Swizzling | ((8,m),(T,2k)):((8T,SBO),(1,T)) |
Swizzle<3, 4, 3> |
where
- T = 128 / sizeof-elements-in-bits — T represents scale factor which normalizes matrix element types to 128-bits.
- m represents the number of repeating patterns across rows.
- k represents the number of repeating patterns across columns.
Examples
K-Major, no-swizzling and tf32 type: Figure 166
!K major, no-swizzling and tf32 type
Figure 166 K major, no-swizzling and tf32 type
the strides and related details are as follows:
Exact layout : Swizzle<0,4,3> o ((8,2),(4,4)):((4,32),(1,64))
Canonical Layout :Swizzle<0,4,3> o ((8,m),(T,2k)):((1T,SBO),(1,LBO))| Parameters | Value |
|---|---|
| T | 4 |
| m | 2 |
| k | 2 |
| LBO | 64\*sizeof(tf32) |
| SBO | 32\*sizeof(tf32) |
| Encoding of LBO in descriptor | (LBO) >> 4 = 16 |
| Encoding of SBO in descriptor | (SBO) >> 4 = 8 |
K-Major, 32B swizzling and tf32 type: Figure 167
!K major, 32B swizzling and tf32 type
Figure 167 K major, 32B swizzling and tf32 type
the strides and related details are as follows:
Exact layout : Swizzle<1,4,3> o ((8,2),(4,4)):((8,64),(1,4))
Canonical Layout :Swizzle<1,4,3> o ((8,m),(T,2k)):((2T,SBO),(1,T))| Parameters | Value |
|---|---|
| T | 4 |
| m | 2 |
| k | 2 |
| LBO | NA |
| SBO | 64\*sizeof(tf32) |
| Encoding of LBO in descriptor | 1 (assumed) |
| Encoding of SBO in descriptor | (SBO) >> 4 = 16 |
MN-Major, no-swizzling and bf16 type: Figure 168
!MN major, no-swizzling and bf16 type
Figure 168 MN major, no-swizzling and bf16 type
the strides and related details are as follows:
Exact layout : Swizzle<0,4,3> o ((8,1,2),(8,2)):((1,8,64),(8,128))
Canonical Layout :Swizzle<0,4,3> o ((T,1,m),(8,k)):((1,T,SBO),(1T,LBO))| Parameters | Value |
|---|---|
| T | 8 |
| m | 2 |
| k | 2 |
| LBO | 128\*sizeof(bf16) |
| SBO | 64\*sizeof(bf16) |
| Encoding of LBO in descriptor | (LBO) >> 4 = 16 |
| Encoding of SBO in descriptor | (SBO) >> 4 = 8 |
MN-Major, 32B swizzling and bf16 type: Figure 169
!MN major, 32B swizzling and bf16 type
Figure 169 MN major, 32B swizzling and bf16 type
the strides and related details are as follows:
Exact layout : Swizzle<1,4,3> o ((8,2,2),(8,2)):((1,8,128),(16,256))
Canonical Layout :Swizzle<1,4,3> o ((T,2,m),(8,k)):((1,T,LBO),(2T,SBO))| Parameters | Value |
|---|---|
| T | 8 |
| m | 2 |
| k | 2 |
| LBO | 128\*sizeof(bf16) |
| SBO | 256\*sizeof(bf16) |
| Encoding of LBO in descriptor | (LBO) >> 4 = 16 |
| Encoding of SBO in descriptor | (SBO) >> 4 = 32 |
MN-Major, 64B swizzling and bf16 type: Figure 170
!MN major, 64B swizzling and bf16 type
Figure 170 MN major, 64B swizzling and bf16 type
the strides and related details are as follows:
Exact layout : Swizzle<2,4,3> o ((8,4,2),(8,2)):((1,8,256),(32,512))
Canonical Layout :Swizzle<2,4,3> o ((T,4,m),(8,k)):((1,T,LBO),(4T,SBO))| Parameters | Value |
|---|---|
| T | 8 |
| m | 2 |
| k | 2 |
| LBO | 256\*sizeof(bf16) |
| SBO | 512\*sizeof(bf16) |
| Encoding of LBO in descriptor | (LBO) >> 4 = 32 |
| Encoding of SBO in descriptor | (SBO) >> 4 = 64 |