* The [[bfloat16 floating-point format|bfloat16 format]] requires the same amount of memory (16 bits) as the [[Half-precision floating-point format|IEEE 754 half-precision format]], but allocates 8 bits to the exponent instead of 5, thus providing the same range as an [[Single-precision floating-point format|IEEE 754 single-precision]] number. The tradeoff is reduced precision, as the trailing significand field is reduced from 10 to 7 bits (see the illustrative sketch after this list). This format is mainly used in the training of [[machine learning]] models, where range is more valuable than precision. Many machine learning accelerators provide hardware support for this format.
* The TensorFloat-32<ref name="Kharya_2020"/> format combines the 8-bit exponent of bfloat16 with the 10-bit trailing significand field of the half-precision format, resulting in a size of 19 bits. This format was introduced by [[Nvidia]], which provides hardware support for it in the Tensor Cores of its [[Graphics processing unit|GPUs]] based on the Nvidia Ampere architecture. The drawback of this format is its size, which is not a power of 2. However, according to Nvidia, this format should only be used internally by hardware to speed up computations, while inputs and outputs should be stored in the 32-bit single-precision IEEE 754 format.<ref name="Kharya_2020"/>
* The [[Hopper (microarchitecture)|Hopper]] and [[CDNA 3]] architecture GPUs provide two FP8 formats: one with the same numerical range as half-precision (E5M2) and one with higher precision but a smaller range (E4M3).<ref name="NVIDIA_Hopper"/><ref name="Micikevicius_2022"/>
* The [[Blackwell (microarchitecture)|Blackwell]] and [[CDNA (microarchitecture)|CDNA 4]] GPU architectures include support for FP6 (E3M2 and E2M3) and FP4 (E2M1) formats. FP4 is the smallest floating-point format that still follows the IEEE 754 principles (see [[minifloat]]).
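As a rough illustration of how the exponent/significand split determines range and precision, the Python sketch below computes the largest finite value, the smallest normal value and the machine epsilon for several of the formats above. It assumes IEEE 754-style conventions, in which the all-ones exponent is reserved for infinities and NaNs; the FP8, FP6 and FP4 variants mentioned above relax this convention, so their actual limits differ slightly. The function and format list are illustrative only, not part of any particular library.

<syntaxhighlight lang="python">
# Illustrative sketch: derive range and precision from an exponent/significand
# bit split, assuming IEEE 754-style conventions (the all-ones exponent is
# reserved for infinities and NaNs). The FP8/FP6/FP4 variants described above
# relax this convention, so this formula does not apply to them exactly.

def format_stats(exponent_bits: int, significand_bits: int):
    bias = 2 ** (exponent_bits - 1) - 1   # IEEE 754 exponent bias
    e_max = bias                          # largest exponent not reserved for Inf/NaN
    e_min = 1 - bias                      # smallest normal exponent
    max_finite = (2 - 2 ** -significand_bits) * 2.0 ** e_max
    min_normal = 2.0 ** e_min
    epsilon = 2.0 ** -significand_bits    # spacing of values just above 1.0
    return max_finite, min_normal, epsilon

# (name, exponent bits, trailing significand bits)
for name, e, m in [("half (binary16)", 5, 10),
                   ("bfloat16", 8, 7),
                   ("TensorFloat-32", 8, 10),
                   ("single (binary32)", 8, 23)]:
    hi, lo, eps = format_stats(e, m)
    print(f"{name:18s} max ~ {hi:.3e}   min normal ~ {lo:.3e}   epsilon = {eps:.3e}")
</syntaxhighlight>

Running the sketch shows that bfloat16 and TensorFloat-32 reach the same maximum magnitude as single precision (about 3.4&times;10<sup>38</sup>), but with the coarser precision of their 7- and 10-bit significands, whereas half precision tops out near 65504.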
{| class="wikitable"
|