9.7.5. Mixed Precision Floating-Point Instructions
Mixed precision floating-point instructions operate on data with varied floating-point precision. Before executing the specified operation, operands with different precisions need to be converted so that all the instruction operands can be represented with a consistent floating-point precision. The register variable used to hold a particular operand depends on the combination of the instruction types. Refer to Fundamental Types and Alternate Floating-Point Data Formats for details on the exact register operand to use for a given data type.
The mixed precision floating-point instructions are:
- add
- sub
- fma
Mixed precision add, sub, fma support saturation of results to the range [0.0, 1.0], with NaN being flushed to positive zero.
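The saturation rule above can be modeled on the host as a small sketch. This is Python, not PTX, and the helper name sat is hypothetical; it only illustrates the clamp and the NaN flush described in the text.

```python
import math

def sat(x: float) -> float:
    """Host-side model of the .sat modifier (hypothetical helper name).

    Clamps x to [0.0, 1.0]; a NaN result is flushed to positive zero.
    The text does not say how -0.0 is treated, so this sketch passes it through.
    """
    if math.isnan(x):
        return 0.0          # NaN -> +0.0
    if x < 0.0:
        return 0.0          # below the range (includes -inf)
    if x > 1.0:
        return 1.0          # above the range (includes +inf)
    return x

print(sat(1.5), sat(-2.0), sat(0.25), sat(float("nan")))
```

Values already inside [0.0, 1.0] are returned unchanged; only out-of-range and NaN results are affected.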
9.7.5.1. Mixed Precision Floating Point Instructions: add
add
Add 2 values.
Syntax
add{.rnd}{.sat}.f32.atype d, a, c;
.atype = { .f16, .bf16};
.rnd = { .rn, .rz, .rm, .rp };
Description
Converts input operand a from .atype into .f32 type. The converted value is then used for the addition. The resulting value is stored in the destination operand d.
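As an illustration of the conversion step, the following host-side Python sketch models the .bf16 case with the default .rn rounding. It assumes the standard bf16 layout (the upper 16 bits of an IEEE 754 binary32 value); the function names are hypothetical and this is a reference model, not PTX.

```python
import struct

def bf16_bits_to_f32(bits: int) -> float:
    """Widen a 16-bit bf16 pattern to f32 (assumes bf16 = top half of an f32)."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def round_to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest f32 value.

    struct.pack raises OverflowError if a finite x is outside the f32 range.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0]

def add_f32_bf16(a_bits: int, c: float) -> float:
    """Model of d = convert(a) + c for .atype = .bf16 with round-to-nearest-even.

    c is assumed to already be an f32-representable value.
    """
    return round_to_f32(bf16_bits_to_f32(a_bits) + c)

# 0x3FC0 is the bf16 pattern for 1.5, so 1.5 + 0.25 = 1.75.
print(add_f32_bf16(0x3FC0, 0.25))
```

The widening is exact, since every bf16 value is representable in f32; only the addition can round.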
Semantics
d = convert(a) + c;
Notes
Rounding modifiers:
- .rn: mantissa LSB rounds to nearest even
- .rz: mantissa LSB rounds towards zero
- .rm: mantissa LSB rounds towards negative infinity
- .rp: mantissa LSB rounds towards positive infinity
The default value of rounding modifier is .rn. Note that an add instruction with an explicit rounding modifier is treated conservatively by the code optimizer. An add instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/add sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.
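The effect of the four rounding modifiers can be seen with a small host-side model. The Python sketch below computes the sum exactly with rationals and then rounds once to f32 under the chosen mode. The names round_to_f32 and add_model are hypothetical, inputs are assumed finite, and overflow beyond the f32 range is not handled.

```python
from fractions import Fraction
import math

def round_to_f32(x: Fraction, rnd: str = "rn") -> float:
    """Round an exact rational to an f32 value under rn, rz, rm, or rp."""
    if x == 0:
        return 0.0
    negative = x < 0
    mag = abs(x)
    # Find e with 2**e <= mag < 2**(e + 1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    # Spacing of f32 values around mag (subnormals use the 2**-149 spacing).
    ulp = Fraction(2) ** (max(e, -126) - 23)
    scaled = mag / ulp
    lo = math.floor(scaled)
    rem = scaled - lo
    if rnd == "rz":
        up = False                               # truncate the magnitude
    elif rnd == "rm":
        up = negative and rem != 0               # towards negative infinity
    elif rnd == "rp":
        up = (not negative) and rem != 0         # towards positive infinity
    elif rnd == "rn":
        half = Fraction(1, 2)
        up = rem > half or (rem == half and lo % 2 == 1)  # ties to even
    else:
        raise ValueError(f"unknown rounding modifier: {rnd}")
    result = float((lo + 1 if up else lo) * ulp)
    return -result if negative else result

def add_model(a: float, c: float, rnd: str = "rn") -> float:
    """Exact a + c, rounded once to f32 (a is the already-converted operand)."""
    return round_to_f32(Fraction(a) + Fraction(c), rnd)

# 1 + 2**-25 lies between the f32 values 1.0 and 1 + 2**-23.
for rnd in ("rn", "rz", "rm", "rp"):
    print(rnd, add_model(1.0, 2.0 ** -25, rnd))
```

For this positive input only .rp moves up to the next f32 value; for the mirrored negative input only .rm moves away from zero.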
Subnormal numbers: By default, subnormal numbers are supported.
Saturation modifier: add.sat clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
PTX ISA Notes
add.f32.{f16/bf16} introduced in PTX ISA version 8.6.
Target ISA Notes
add.f32.{f16/bf16} requires sm_100 or higher.
Examples
.reg .f32 fc, fd;
.reg .b16 ba;
add.rz.sat.f32.bf16 fd, ba, fc;
9.7.5.2. Mixed Precision Floating Point Instructions: sub
sub
Subtract one value from another.
Syntax
sub{.rnd}{.sat}.f32.atype d, a, c;
.atype = { .f16, .bf16};
.rnd = { .rn, .rz, .rm, .rp };
Description
Converts input operand a from .atype into .f32 type. The converted value is then used for the subtraction. The resulting value is stored in the destination operand d.
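A host-side Python sketch of the .f16 case follows, using the IEEE 754 binary16 support in the standard struct module. The function names are hypothetical, and the model covers only the default .rn rounding. It also shows that the operand order matters: the converted a is the minuend and c is the subtrahend.

```python
import struct

def f16_bits_to_f32(bits: int) -> float:
    """Widen a 16-bit IEEE 754 binary16 pattern to f32 (always exact)."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]

def sub_f32_f16(a_bits: int, c: float) -> float:
    """Model of d = convert(a) - c for .atype = .f16 with round-to-nearest-even.

    c is assumed to already be an f32-representable value.
    """
    d = f16_bits_to_f32(a_bits) - c
    return struct.unpack("<f", struct.pack("<f", d))[0]

# 0x4000 is the f16 pattern for 2.0 and 0x3800 is 0.5.
print(sub_f32_f16(0x4000, 0.5))   # 2.0 - 0.5
print(sub_f32_f16(0x3800, 2.0))   # 0.5 - 2.0
```

An f16 subnormal such as 0x0001 (2^-24) widens to a normal f32 value, so the conversion itself loses nothing.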
Semantics
d = convert(a) - c;
Notes
Rounding modifiers:
- .rn: mantissa LSB rounds to nearest even
- .rz: mantissa LSB rounds towards zero
- .rm: mantissa LSB rounds towards negative infinity
- .rp: mantissa LSB rounds towards positive infinity
The default value of rounding modifier is .rn. Note that a sub instruction with an explicit rounding modifier is treated conservatively by the code optimizer. A sub instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer. In particular, mul/sub sequences with no rounding modifiers may be optimized to use fused-multiply-add instructions on the target device.
Subnormal numbers: By default, subnormal numbers are supported.
Saturation modifier: sub.sat clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
PTX ISA Notes
sub.f32.{f16/bf16} introduced in PTX ISA version 8.6.
Target ISA Notes
sub.f32.{f16/bf16} requires sm_100 or higher.
Examples
.reg .f32 fc, fd;
.reg .f16 ha;
sub.rz.sat.f32.f16 fd, ha, fc;
9.7.5.3. Mixed Precision Floating Point Instructions: fma
fma
Fused multiply-add.
Syntax
fma.rnd{.sat}.f32.abtype d, a, b, c;
.abtype = { .f16, .bf16};
.rnd = { .rn, .rz, .rm, .rp };
Description
Converts input operands a and b from .abtype into .f32 type. The converted values are then used to perform a fused multiply-add operation with no loss of precision in the intermediate product and addition. The resulting value is stored in the destination operand d.
Semantics
d = convert(a) * convert(b) + c;
Notes
fma.f32.{f16/bf16} computes the product of a and b to infinite precision and then adds c to this product, again in infinite precision. The resulting value is then rounded to single precision using the rounding mode specified by .rnd.
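The single-rounding behavior described above can be sketched on the host with exact rational arithmetic. The Python model below is illustrative only: the names are hypothetical, inputs are assumed finite, results are assumed to stay within the f32 range, and .atype is taken to be .f16.

```python
from fractions import Fraction
import math
import struct

def f16_bits_to_fraction(bits: int) -> Fraction:
    """Exact value of a finite IEEE 754 binary16 bit pattern."""
    return Fraction(struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0])

def round_to_f32(x: Fraction, rnd: str) -> float:
    """Round an exact rational to an f32 value under rn, rz, rm, or rp."""
    if x == 0:
        return 0.0
    negative = x < 0
    mag = abs(x)
    # Find e with 2**e <= mag < 2**(e + 1), then the f32 spacing at mag.
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (max(e, -126) - 23)
    lo = math.floor(mag / ulp)
    rem = mag / ulp - lo
    if rnd == "rn":
        up = rem > Fraction(1, 2) or (rem == Fraction(1, 2) and lo % 2 == 1)
    elif rnd == "rz":
        up = False
    elif rnd == "rm":
        up = negative and rem != 0
    elif rnd == "rp":
        up = (not negative) and rem != 0
    else:
        raise ValueError(f"unknown rounding modifier: {rnd}")
    result = float((lo + 1 if up else lo) * ulp)
    return -result if negative else result

def fma_f32_f16(rnd: str, a_bits: int, b_bits: int, c: float) -> float:
    """Exact a * b + c, rounded to f32 exactly once (rnd is mandatory)."""
    exact = f16_bits_to_fraction(a_bits) * f16_bits_to_fraction(b_bits) + Fraction(c)
    return round_to_f32(exact, rnd)

# 0x3C01 is the f16 pattern for 1 + 2**-10; c = 2**-30 is below half an f32 ulp.
for rnd in ("rn", "rz", "rm", "rp"):
    print(rnd, fma_f32_f16(rnd, 0x3C01, 0x3C01, 2.0 ** -30))
```

Here the exact result is 1 + 2^-9 + 2^-20 + 2^-30, so .rn, .rz, and .rm all drop the 2^-30 term, while .rp rounds up by one f32 ulp (2^-23).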
Rounding modifiers (no default):
- .rn: mantissa LSB rounds to nearest even
- .rz: mantissa LSB rounds towards zero
- .rm: mantissa LSB rounds towards negative infinity
- .rp: mantissa LSB rounds towards positive infinity
Subnormal numbers: By default, subnormal numbers are supported.
Saturation modifier: fma.sat clamps the result to [0.0, 1.0]. NaN results are flushed to +0.0f.
PTX ISA Notes
fma.f32.{f16/bf16} introduced in PTX ISA version 8.6.
Target ISA Notes
fma.f32.{f16/bf16} requires sm_100 or higher.
Examples
.reg .f32 fc, fd;
.reg .f16 ha, hb;
fma.rz.sat.f32.f16 fd, ha, hb, fc;