Floating-point arithmetic

* The TensorFloat-32<ref name="Kharya_2020"/> format combines the 8-bit exponent of bfloat16 with the 10-bit trailing significand field of the half-precision format, resulting in a size of 19 bits. This format was introduced by [[Nvidia]], which provides hardware support for it in the Tensor Cores of its [[Graphics processing unit|GPUs]] based on the Nvidia Ampere architecture. The drawback of this format is its size, which is not a power of 2. However, according to Nvidia, this format should only be used internally by hardware to speed up computations, while inputs and outputs should be stored in the 32-bit single-precision IEEE 754 format (a sketch of the corresponding significand truncation appears after this list).<ref name="Kharya_2020"/>
* The [[Hopper (microarchitecture)|Hopper]] architecture GPUs provide two FP8 formats: one with the same numerical range as half precision (E5M2) and one with higher precision but a narrower range (E4M3).<ref name="NVIDIA_Hopper"/><ref name="Micikevicius_2022"/>
* The [[Blackwell (microarchitecture)|Blackwell]] GPU architecture includes support for FP6 (E3M2 and E2M3) and FP4 (E2M1) formats; a generic bit-level decoding sketch covering these small formats also appears after this list.
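To make the bit-width relationship concrete, the following is a minimal Python sketch, not Nvidia's implementation, that emulates TensorFloat-32 precision by keeping a binary32 value's 8-bit exponent and only the 10 most significant bits of its 23-bit trailing significand. The function name <code>tf32_truncate</code> is illustrative; the low 13 significand bits are simply cleared (round toward zero), whereas actual Tensor Core hardware may round differently.

<syntaxhighlight lang="python">
# Illustrative only: emulate TensorFloat-32 precision by truncating a binary32
# value's 23-bit trailing significand to its 10 most significant bits.
# This sketch rounds toward zero; real hardware rounding may differ.
import struct

def tf32_truncate(x: float) -> float:                    # name chosen for this example
    bits, = struct.unpack("<I", struct.pack("<f", x))    # reinterpret binary32 as uint32
    bits &= 0xFFFF_E000                                  # clear the 13 low significand bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(tf32_truncate(1.0009765625))   # 1.0009765625  (1 + 2**-10 survives)
print(tf32_truncate(1.00048828125))  # 1.0           (1 + 2**-11 is dropped)
</syntaxhighlight>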
 
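The mini-formats named above (E5M2, E4M3, E3M2, E2M3, E2M1, as well as TF32) all share the same sign / exponent / trailing-significand layout, so a single generic decoder can illustrate how their bit patterns map to values. The sketch below assumes plain IEEE-754-style semantics (bias 2<sup>''w''−1</sup> − 1, subnormals, all-ones exponent reserved for infinities and NaNs) and does not model format-specific deviations such as E4M3's reuse of some all-ones-exponent encodings for finite values; the names <code>decode</code> and <code>FORMATS</code> are illustrative only.

<syntaxhighlight lang="python">
# Illustrative only: decode a bit pattern of a small IEEE-754-style binary
# format given its exponent width and trailing-significand width.  Format-
# specific special-value rules (e.g. E4M3 has no infinities) are not modelled.

def decode(bits: int, exp_bits: int, man_bits: int) -> float:
    bias = (1 << (exp_bits - 1)) - 1
    mantissa = bits & ((1 << man_bits) - 1)
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    sign = -1.0 if (bits >> (man_bits + exp_bits)) & 1 else 1.0
    if exponent == (1 << exp_bits) - 1:                  # all-ones exponent: inf / NaN
        return sign * float("inf") if mantissa == 0 else float("nan")
    if exponent == 0:                                    # subnormal: no implicit leading 1
        return sign * mantissa * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + mantissa / (1 << man_bits)) * 2.0 ** (exponent - bias)

# (exponent bits, trailing significand bits) for the formats discussed above
FORMATS = {"TF32": (8, 10), "E5M2": (5, 2), "E4M3": (4, 3),
           "E3M2": (3, 2), "E2M3": (2, 3), "E2M1": (2, 1)}

print(decode(0b0_0111_110, *FORMATS["E4M3"]))   # 1.75 (exponent = bias, fraction 0.75)
print(decode(0b0_11110_11, *FORMATS["E5M2"]))   # 57344.0 = 1.75 * 2**15
</syntaxhighlight>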
{| class="wikitable"
!Sign
!Exponent
!Trailing significand field
!Total bits
|-