9.7.5. Mixed Precision Floating-Point Instructions

Mixed precision floating-point instructions operate on operands of differing floating-point precision. Before the specified operation is executed, operands of different precision need to be converted so that all instruction operands are represented in a consistent floating-point precision. The register variable used to hold a particular operand depends on the combination of instruction types. Refer to Fundamental Types and Alternate Floating-Point Data Formats for details on the exact register operand to use for a given data type.

The mixed precision floating-point instructions are:

  • add
  • sub
  • fma

Mixed precision add, sub, and fma support saturation of results to the range [0.0, 1.0], with NaN results flushed to positive zero.
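The saturation rule can be modelled on the host in a few lines of Python. This is an illustrative sketch rather than PTX, and the function name `saturate` is hypothetical. The treatment of negative zero is not described in this section, so the sketch makes no claim about it.

```python
import math

def saturate(x: float) -> float:
    """Illustrative model of .sat: NaN -> +0.0, otherwise clamp to [0.0, 1.0]."""
    if math.isnan(x):
        return 0.0  # NaN results are flushed to positive zero
    # Clamp into the closed interval [0.0, 1.0]; infinities clamp like any other value.
    # Note: the sign of a negative-zero input is not modelled here.
    return min(max(x, 0.0), 1.0)

for value in (float("nan"), -2.5, 0.25, 3.0, float("inf")):
    print(value, "->", saturate(value))
```

In this model the clamp is applied to the already rounded result, which is an assumption about ordering that the text does not state explicitly.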

9.7.5.1. Mixed Precision Floating Point Instructions: add

add

Add two values.

Syntax

add{.rnd}{.sat}.f32.atype  d, a, c;

.atype = { .f16, .bf16};
.rnd   = { .rn, .rz, .rm, .rp };

Description

Converts input operand a from .atype into .f32 type. The converted value is then used for the addition. The resulting value is stored in the destination operand d.

Semantics

d = convert(a) + c;
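A host-side Python sketch of these semantics may help make the conversion step concrete. It is a reference model under stated assumptions, not the hardware implementation: the helper names (`bf16_bits_to_float`, `f16_bits_to_float`, `to_f32`, `add_f32_mixed`) are hypothetical, operand a is passed as a raw 16-bit pattern, and only the default round-to-nearest-even behavior is modelled.

```python
import math
import struct

def bf16_bits_to_float(bits: int) -> float:
    """bf16 is the upper 16 bits of an IEEE-754 binary32 value, so widening is exact."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def f16_bits_to_float(bits: int) -> float:
    """Decode an IEEE-754 binary16 bit pattern; every f16 value is representable in f32."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value, ties to even."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        # struct refuses finite values that round past the binary32 range
        return math.copysign(math.inf, x)

def add_f32_mixed(a_bits: int, c: float, atype: str = "f16") -> float:
    """Model of d = convert(a) + c with the result held as binary32."""
    convert = f16_bits_to_float if atype == "f16" else bf16_bits_to_float
    # The sum is formed in binary64 and then rounded to binary32 (assumption of this sketch).
    return to_f32(convert(a_bits) + to_f32(c))

print(add_f32_mixed(0x3E00, 0.25, "f16"))   # 0x3E00 encodes 1.5 in f16
print(add_f32_mixed(0x3FC0, 0.25, "bf16"))  # 0x3FC0 encodes 1.5 in bf16
```

Both calls add 0.25 to an operand that decodes to 1.5, so the two results should agree even though the 16-bit encodings differ.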

Notes

Rounding modifiers:

  • .rn — mantissa LSB rounds to nearest even
  • .rz — mantissa LSB rounds towards zero
  • .rm — mantissa LSB rounds towards negative infinity
  • .rp — mantissa LSB rounds towards positive infinity

The default rounding modifier is .rn. An add instruction with an explicit rounding modifier is treated conservatively by the code optimizer. An add instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/add sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.
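The effect of the four rounding modifiers can be illustrated with a small Python model that rounds an exact rational sum to binary32. This is a sketch under assumptions: all helper names are hypothetical, the exact value is assumed to be finite and well inside the binary32 range, and `Fraction` stands in for the unrounded result of the addition.

```python
from fractions import Fraction
import struct

def _bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]

def _from_bits(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b))[0]

def next_up(x: float) -> float:
    """Smallest binary32 value greater than the finite binary32 value x."""
    if x == 0.0:
        return _from_bits(1)
    b = _bits(x)
    return _from_bits(b + 1 if x > 0 else b - 1)

def next_down(x: float) -> float:
    return -next_up(-x)

def round_to_f32(exact: Fraction, rnd: str = "rn") -> float:
    """Round an exact value to binary32 using the named rounding modifier."""
    near = _from_bits(_bits(float(exact)))  # one of the binary32 neighbours of `exact`
    if Fraction(near) == exact:
        return near                          # already representable, no rounding needed
    if Fraction(near) < exact:
        lo, hi = near, next_up(near)
    else:
        lo, hi = next_down(near), near
    if rnd == "rm":                          # towards negative infinity
        return lo
    if rnd == "rp":                          # towards positive infinity
        return hi
    if rnd == "rz":                          # towards zero
        return lo if exact > 0 else hi
    if rnd == "rn":                          # to nearest, ties to even mantissa LSB
        below, above = exact - Fraction(lo), Fraction(hi) - exact
        if below != above:
            return lo if below < above else hi
        return lo if _bits(lo) % 2 == 0 else hi
    raise ValueError(f"unknown rounding modifier: {rnd}")

a = Fraction(1.0)         # value of an f16 operand after exact widening
c = Fraction(2.0 ** -30)  # an f32 operand far below the result's LSB
for rnd in ("rn", "rz", "rm", "rp"):
    print(rnd, round_to_f32(a + c, rnd), round_to_f32(a - c, rnd))
```

With these inputs only .rp moves the sum `a + c` away from 1.0, while .rz and .rm move the difference `a - c` below 1.0, which is the kind of divergence the modifiers are meant to control.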

Subnormal numbers: By default, subnormal numbers are supported.

Saturation modifier: add.sat clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

PTX ISA Notes

add.f32.{f16/bf16} introduced in PTX ISA version 8.6.

Target ISA Notes

add.f32.{f16/bf16} requires sm_100 or higher.

Examples

.reg .f32 fc, fd;
.reg .b16 ba;
add.rz.sat.f32.bf16   fd, ba, fc;

9.7.5.2. Mixed Precision Floating Point Instructions: sub

sub

Subtract one value from another.

Syntax

sub{.rnd}{.sat}.f32.atype  d, a, c;

.atype = { .f16, .bf16};
.rnd   = { .rn, .rz, .rm, .rp };

Description

Converts input operand a from .atype into .f32 type. The converted value is then used for the subtraction. The resulting value is stored in the destination operand d.

Semantics

d = convert(a) - c;

Notes

Rounding modifiers:

  • .rn — mantissa LSB rounds to nearest even
  • .rz — mantissa LSB rounds towards zero
  • .rm — mantissa LSB rounds towards negative infinity
  • .rp — mantissa LSB rounds towards positive infinity

The default rounding modifier is .rn. A sub instruction with an explicit rounding modifier is treated conservatively by the code optimizer. A sub instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/sub sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.

Subnormal numbers: By default, subnormal numbers are supported.

Saturation modifier: sub.sat clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

PTX ISA Notes

sub.f32.{f16/bf16} introduced in PTX ISA version 8.6.

Target ISA Notes

sub.f32.{f16/bf16} requires sm_100 or higher.

Examples

.reg .f32 fc, fd;
.reg .f16 ha;
sub.rz.sat.f32.f16   fd, ha, fc;

9.7.5.3. Mixed Precision Floating Point Instructions: fma

fma

Fused multiply-add.

Syntax

fma.rnd{.sat}.f32.abtype  d, a, b, c;

.abtype = { .f16, .bf16};
.rnd    = { .rn, .rz, .rm, .rp };

Description

Converts input operands a and b from .abtype into .f32 type. The converted values are then used to perform a fused multiply-add operation with no loss of precision in the intermediate product and addition. The resulting value is stored in the destination operand d.

Semantics

d = convert(a) * convert(b) + c;

Notes

fma.f32.{f16/bf16} computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to single precision using the rounding mode specified by .rnd.
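The single-rounding behavior can be sketched in Python by forming the exact result with rational arithmetic and rounding once at the end. This is an illustrative model only: the helper names are hypothetical, only the .rn case with .f16 inputs is covered, and the sketch rejects inputs whose exact result does not fit in a binary64 value. For contrast, it also rounds the product to f16, as a pure half-precision multiply would.

```python
from fractions import Fraction
import struct

def f16_bits_to_float(bits: int) -> float:
    """Decode an IEEE-754 binary16 bit pattern (widening is exact)."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]

def round_f16(x: float) -> float:
    """Round to the nearest binary16 value, ties to even."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_f32(x: float) -> float:
    """Round to the nearest binary32 value, ties to even."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def fma_f32_f16_rn(a_bits: int, b_bits: int, c: float) -> float:
    """Model of d = convert(a) * convert(b) + c with a single final rounding (.rn)."""
    a = Fraction(f16_bits_to_float(a_bits))
    b = Fraction(f16_bits_to_float(b_bits))
    exact = a * b + Fraction(c)   # product and sum held exactly, no intermediate rounding
    as_double = float(exact)
    if Fraction(as_double) != exact:
        # Limitation of this sketch: avoid rounding twice (binary64, then binary32).
        raise ValueError("exact result does not fit in binary64")
    return round_f32(as_double)

a_bits = 0x3C01                   # 1 + 2**-10, the f16 value just above 1.0
a = f16_bits_to_float(a_bits)
fused = fma_f32_f16_rn(a_bits, a_bits, 2.0 ** -30)
f16_product = round_f16(a * a)    # what a pure f16 multiply would keep

print(fused == 1 + 2.0 ** -9 + 2.0 ** -20)   # the 2**-20 term of the product survives
print(f16_product == 1 + 2.0 ** -9)          # the 2**-20 term is lost in f16
```

The 2**-20 term comes from squaring 1 + 2**-10. It is below the f16 mantissa LSB but within f32 precision, so the mixed precision result retains it while the addend 2**-30 is absorbed by the final rounding.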

Rounding modifiers (no default):

  • .rn — mantissa LSB rounds to nearest even
  • .rz — mantissa LSB rounds towards zero
  • .rm — mantissa LSB rounds towards negative infinity
  • .rp — mantissa LSB rounds towards positive infinity

Subnormal numbers: By default, subnormal numbers are supported.

Saturation modifier: fma.sat clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.

PTX ISA Notes

fma.f32.{f16/bf16} introduced in PTX ISA version 8.6.

Target ISA Notes

fma.f32.{f16/bf16} requires sm_100 or higher.

Examples

.reg .f32 fc, fd;
.reg .f16 ha, hb;
fma.rz.sat.f32.f16   fd, ha, hb, fc;