Floating-point arithmetic

* The TensorFloat-32<ref name="Kharya_2020"/> format combines the 8-bit exponent of bfloat16 with the 10-bit trailing significand field of the half-precision format, resulting in a size of 19 bits. This format was introduced by [[Nvidia]], which provides hardware support for it in the Tensor Cores of its [[Graphics processing unit|GPUs]] based on the Nvidia Ampere architecture. The drawback of this format is its size, which is not a power of 2. However, according to Nvidia, this format should only be used internally by hardware to speed up computations, while inputs and outputs should be stored in the 32-bit single-precision IEEE 754 format (a sketch of the corresponding significand truncation appears after this list).<ref name="Kharya_2020"/>
* The [[Hopper (microarchitecture)|Hopper]] architecture GPUs provide two FP8 formats: one with the same numerical range as half precision (E5M2) and one with higher precision but a narrower range (E4M3).<ref name="NVIDIA_Hopper"/><ref name="Micikevicius_2022"/>
* The [[Blackwell (microarchitecture)|Blackwell]] GPU architecture includes support for FP6 (E3M2 and E2M3) and FP4 (E2M1) formats; a generic bit-level decoding sketch covering these small formats also appears after this list.
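To make the bit-width relationship concrete, the following is a minimal Python sketch, not Nvidia's implementation, that emulates TensorFloat-32 precision by keeping a binary32 value's 8-bit exponent and only the 10 most significant bits of its 23-bit trailing significand. The function name <code>tf32_truncate</code> is illustrative; the low 13 significand bits are simply cleared (round toward zero), whereas actual Tensor Core hardware may round differently.

<syntaxhighlight lang="python">
# Illustrative only: emulate TensorFloat-32 precision by truncating a binary32
# value's 23-bit trailing significand to its 10 most significant bits.
# This sketch rounds toward zero; real hardware rounding may differ.
import struct

def tf32_truncate(x: float) -> float:                    # name chosen for this example
    bits, = struct.unpack("<I", struct.pack("<f", x))    # reinterpret binary32 as uint32
    bits &= 0xFFFF_E000                                  # clear the 13 low significand bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(tf32_truncate(1.0009765625))   # 1.0009765625  (1 + 2**-10 survives)
print(tf32_truncate(1.00048828125))  # 1.0           (1 + 2**-11 is dropped)
</syntaxhighlight>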
 
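The mini-formats named above (E5M2, E4M3, E3M2, E2M3, E2M1, as well as TF32) all share the same sign / exponent / trailing-significand layout, so a single generic decoder can illustrate how their bit patterns map to values. The sketch below assumes plain IEEE-754-style semantics (bias 2<sup>''w''−1</sup> − 1, subnormals, all-ones exponent reserved for infinities and NaNs) and does not model format-specific deviations such as E4M3's reuse of some all-ones-exponent encodings for finite values; the names <code>decode</code> and <code>FORMATS</code> are illustrative only.

<syntaxhighlight lang="python">
# Illustrative only: decode a bit pattern of a small IEEE-754-style binary
# format given its exponent width and trailing-significand width.  Format-
# specific special-value rules (e.g. E4M3 has no infinities) are not modelled.

def decode(bits: int, exp_bits: int, man_bits: int) -> float:
    bias = (1 << (exp_bits - 1)) - 1
    mantissa = bits & ((1 << man_bits) - 1)
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    sign = -1.0 if (bits >> (man_bits + exp_bits)) & 1 else 1.0
    if exponent == (1 << exp_bits) - 1:                  # all-ones exponent: inf / NaN
        return sign * float("inf") if mantissa == 0 else float("nan")
    if exponent == 0:                                    # subnormal: no implicit leading 1
        return sign * mantissa * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + mantissa / (1 << man_bits)) * 2.0 ** (exponent - bias)

# (exponent bits, trailing significand bits) for the formats discussed above
FORMATS = {"TF32": (8, 10), "E5M2": (5, 2), "E4M3": (4, 3),
           "E3M2": (3, 2), "E2M3": (2, 3), "E2M1": (2, 1)}

print(decode(0b0_0111_110, *FORMATS["E4M3"]))   # 1.75 (exponent = bias, fraction 0.75)
print(decode(0b0_11110_11, *FORMATS["E5M2"]))   # 57344.0 = 1.75 * 2**15
</syntaxhighlight>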
{| class="wikitable"
!Sign
!Exponent
!Trailing significand field
!Total bits
|-