The speed of floating-point operations, commonly measured in terms of [[FLOPS]], is an important characteristic of a [[computer system]], especially for applications that involve intensive mathematical calculations.
== Overview ==
When this is stored in memory using the IEEE 754 encoding, this becomes the [[significand]] {{mvar|s}}. The significand is assumed to have a binary point to the right of the leftmost bit. So, the binary representation of π is calculated from left-to-right as follows:
<math display=block>\begin{align}
&\left(1 + \sum_{n=1}^{p-1} \text{bit}_n\times 2^{-n}\right)\times 2^e \\
={} &\left(1 + 1\times 2^{-1} + 0\times 2^{-2} + 0\times 2^{-3} + 1\times 2^{-4} + \cdots + 1\times 2^{-23}\right) \times 2^1 \\
\approx{} &1.5707964 \times 2 \\
\approx{} &3.1415928
\end{align}</math><!-- Ensure correct rounding by taking one more digit for the intermediate decimal approximation. -->
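The encoding described above can be inspected directly. The following Python sketch (illustrative only; the variable names are not part of any standard API) unpacks the binary32 bit pattern of π and reconstructs the rounded value from the sign, exponent, and significand fields:

```python
import math
import struct

# Round pi to IEEE 754 binary32 (single precision) and read back the raw bits.
bits = struct.unpack('>I', struct.pack('>f', math.pi))[0]

sign = bits >> 31                    # 1 sign bit
biased_exponent = (bits >> 23) & 0xFF  # 8 exponent bits, stored with bias 127
fraction = bits & 0x7FFFFF           # 23 stored fraction bits

exponent = biased_exponent - 127     # remove the bias
significand = 1 + fraction / 2**23   # restore the hidden leading 1
value = (-1)**sign * significand * 2**exponent

print(hex(bits))   # 0x40490fdb
print(value)       # 3.1415927410125732, pi rounded to 24 bits of precision
```

The reconstructed value matches the result of the left-to-right summation shown above: 24 bits of precision give about 7 correct decimal digits of π.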
[[File:Quevedo 1917.jpg|thumb|upright=0.7|[[Leonardo Torres Quevedo]], in 1914, published an analysis of floating point based on the [[analytical engine]].]]
In 1914, the Spanish engineer [[Leonardo Torres Quevedo]] published ''Essays on Automatics'',<ref>Torres Quevedo, Leonardo. [https://quickclick.es/rop/pdf/publico/1914/1914_tomoI_2043_01.pdf Automática: Complemento de la Teoría de las Máquinas, (pdf)], pp. 575–583, Revista de Obras Públicas, 19 November 1914.</ref> where he designed a special-purpose electromechanical calculator based on [[Charles Babbage]]'s [[analytical engine]] and described a way to store floating-point numbers in a consistent manner. He stated that numbers will be stored in exponential format as ''n'' × 10<sup>''m''</sup>, and offered three rules by which consistent manipulation of floating-point numbers by machines could be implemented.
[[File:Konrad Zuse (1992).jpg|thumb|upright=0.7|right|[[Konrad Zuse]], architect of the [[Z3 (computer)|Z3]] computer, which uses a 22-bit binary floating-point representation]]
In 1989, mathematician and computer scientist [[William Kahan]] was honored with the [[Turing Award]] for being the primary architect behind this proposal; he was aided by his student Jerome Coonen and a visiting professor, [[Harold S. Stone|Harold Stone]].<ref name="Severance_1998"/>
Among the innovations of the x87 (more specifically, the Intel 8087) are these:
* A precisely specified floating-point representation at the bit-string level, so that all compliant computers interpret bit patterns the same way. This makes it possible to accurately and efficiently transfer floating-point numbers from one computer to another (after accounting for [[endianness]]).
* A precisely specified behavior for the arithmetic operations: A result is required to be produced as if infinitely precise arithmetic were used to yield a value that is then rounded according to specific rules. This means that a compliant computer program would always produce the same result when given a particular input, thus mitigating the almost mystical reputation that floating-point computation had developed for its hitherto seemingly non-deterministic behavior.
* The ability of [[IEEE 754#Exception handling|exceptional conditions]] (overflow, [[Division by zero|divide by zero]], etc.) to propagate through a computation in a benign manner and then be handled by the software in a controlled fashion.
These features would be inherited into IEEE 754-1985 (with the exception of the encoding of special values and exceptions), though the extended internal precision of x87 means it requires explicit rounding of exact results directly to the destination precision in order to match standard IEEE 754 results.<ref name="Goldberg_1991"/> However, the behavior may not be the same as a rounding to the destination format due to a possible wider exponent range of the extended format.
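The requirement that results be correctly rounded makes compliant floating-point arithmetic reproducible, as a quick Python check (Python floats are IEEE 754 binary64) illustrates:

```python
# IEEE 754 requires each basic operation to behave as if the exact result
# were computed and then rounded once to the destination format, so the
# same inputs give bit-identical results on every compliant platform.
a, b = 0.1, 0.2
s = a + b
print(s == 0.3)  # False: a and b were already rounded on conversion from decimal
print(s.hex())   # 0x1.3333333333334p-2, one ULP above 0.3 = 0x1.3333333333333p-2
```

The result is not "random noise": it is exactly the correctly rounded sum of the two rounded inputs, and every IEEE 754 binary64 implementation produces the same bit pattern.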
== Range of floating-point numbers ==
=== Internal representation ===
Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand or mantissa, from left to right.
{| class="wikitable" style="text-align:right; border:0"
!rowspan="2" |Format
!colspan="4" |Bits for the encoding<!-- Since this is about the encoding, it should be clear that the number given for the significand below excludes the implicit bit, when this is used. -->
!rowspan="2" |Exponent<br>bias
!rowspan="2" |Bits<br>precision
|~15.9
|-
|[[Extended precision#x86 extended-precision format|x86 extended precision]]
|1
|15
|~19.2
|-
|[[Quadruple-precision floating-point format|Quadruple (binary128)]]
|1
|15
|113
|~34.0
|-
|[[Octuple-precision floating-point format|Octuple]] (binary256)
|1
|19
|236
|256
|262143
|237
|~71.3
|}
While the exponent can be positive or negative, in binary formats it is stored as an unsigned number that has a fixed "bias" added to it. Values of all 0s in this field are reserved for the zeros and [[subnormal number]]s; values of all 1s are reserved for the infinities and [[NaN]]s.
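These reserved exponent patterns can be observed by unpacking the raw bits of binary64 values. The following Python sketch assumes the standard binary64 field layout (1 sign bit, 11 exponent bits, 52 fraction bits); the helper name is illustrative:

```python
import struct

def fields(x):
    # Split a binary64 value into its (sign, biased exponent, fraction) fields.
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

print(fields(5e-324))        # (0, 0, 1): exponent all 0s -> smallest subnormal
print(fields(float('inf')))  # (0, 2047, 0): exponent all 1s, fraction 0 -> infinity
print(fields(float('nan')))  # exponent all 1s with a nonzero fraction -> NaN
```

A biased exponent of 0 with a nonzero fraction is a subnormal; a biased exponent of 2047 (all 1s) encodes an infinity when the fraction is zero and a NaN otherwise.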
In the IEEE binary interchange formats the leading bit of a normalized significand is not actually stored in the computer datum, since it is always 1. It is called the "hidden" or "implicit" bit. Because of this, the single-precision format actually has a significand with 24 bits of precision, the double-precision format has 53, and quad has 113.
For example, it was shown above that π, rounded to 24 bits of precision, has:
== Other notable floating-point formats ==
In addition to the widely used [[IEEE 754]] standard formats, other floating-point formats are used, or have been used, in certain ___domain-specific areas.
* The [[Microsoft Binary Format|Microsoft Binary Format (MBF)]] was developed for the Microsoft BASIC language products, including Microsoft's first product, [[Altair BASIC]] (1975), [[TRS-80|TRS-80 LEVEL II]], [[CP/M]]'s [[MBASIC]], the [[IBM PC 5150]]'s [[BASICA]], [[MS-DOS]]'s [[GW-BASIC]], and [[QuickBASIC]] prior to version 4.00. QuickBASIC versions 4.00 and 4.50 switched to the IEEE 754-1985 format but can revert to the MBF format using the /MBF command option. MBF was designed and developed on a simulated [[Intel 8080]] by [[Monte Davidoff]], a dormmate of [[Bill Gates]], during the spring of 1975 for the [[MITS Altair 8800]]. The initial release of July 1975 supported a single-precision (32-bit) format due to the cost of the MITS Altair 8800's 4 kilobytes of memory. In December 1975, the 8-kilobyte version added a double-precision (64-bit) format. A single-precision (40-bit) variant format was adopted for other CPUs, notably the [[MOS 6502]] ([[Apple II]]).
* The [[bfloat16 floating-point format|bfloat16 format]] requires the same amount of memory (16 bits) as the [[Half-precision floating-point format|IEEE 754 half-precision format]], but allocates 8 bits to the exponent instead of 5, thus providing the same range as a [[Single-precision floating-point format|single-precision]] number. The tradeoff is reduced precision, as the trailing significand field is reduced from 10 to 7 bits. This format is mainly used in the training of [[machine learning]] models, where range is more valuable than precision.
* The TensorFloat-32<ref name="Kharya_2020"/> format combines the 8 bits of exponent of the bfloat16 format with the 10 bits of trailing significand field of half-precision formats, resulting in a size of 19 bits. It was introduced by [[Nvidia]] for the Tensor Cores of its [[Ampere (microarchitecture)|Ampere]] GPU architecture.
* The [[Hopper (microarchitecture)|Hopper]] and [[CDNA 3]] architecture GPUs provide two FP8 formats: one with the same numerical range as half-precision (E5M2) and one with higher precision, but less range (E4M3).<ref name="NVIDIA_Hopper"/><ref name="Micikevicius_2022"/>
* The [[Blackwell (microarchitecture)|Blackwell]] and [[CDNA (microarchitecture)|CDNA 4]] GPU architectures include support for FP6 (E3M2 and E2M3) and FP4 (E2M1) formats. FP4 is the smallest floating-point format that allows for all IEEE 754 principles (see [[minifloat]]).
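All of these formats share the same sign/exponent/significand structure and differ only in field widths and bias. A generic decoder can be sketched in Python (a hypothetical helper for illustration; it ignores the NaN and infinity encodings, some of which real formats such as E4M3 assign differently):

```python
def decode_minifloat(bits, exp_bits, frac_bits, bias=None):
    """Decode an unsigned integer as a simple minifloat value.

    Illustrative sketch only: special values (NaN, infinity) are not handled.
    """
    if bias is None:
        bias = (1 << (exp_bits - 1)) - 1        # standard IEEE-style bias
    sign = bits >> (exp_bits + frac_bits)
    exp = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    frac = bits & ((1 << frac_bits) - 1)
    if exp == 0:                                 # subnormal: no hidden bit
        value = frac / (1 << frac_bits) * 2.0 ** (1 - bias)
    else:                                        # normal: hidden leading 1
        value = (1 + frac / (1 << frac_bits)) * 2.0 ** (exp - bias)
    return -value if sign else value

# FP8 E4M3 (bias 7): 0b0_0111_100 -> (1 + 4/8) * 2**0 = 1.5
print(decode_minifloat(0b00111100, exp_bits=4, frac_bits=3))  # 1.5
```

The same function decodes FP4 (E2M1), FP6 (E2M3 and E3M2), and bfloat16 by changing the field widths, which is exactly the design space the table below compares.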
{| class="wikitable"
|+ Comparison of common floating-point formats
!Type
!Sign
!Exponent
!Significand
!Total bits
|-
|FP4
|1
|2
|1
|4
|-
|FP6 (E2M3)
|1
|2
|3
|6
|-
|FP6 (E3M2)
|1
|3
|2
|6
|-
|FP8 (E4M3)
|16
|-
|[[Bfloat16 floating-point format|Bfloat16]]
|1
|8
|16
|-
|[[TensorFloat-32]]
|1
|8
|23
|32
|-
|[[Double-precision floating-point format|Double-precision]]
|1
|11
|52
|64
|-
|[[Quadruple-precision floating-point format|Quadruple-precision]]
|1
|15
|112
|128
|-
|[[Octuple-precision floating-point format|Octuple-precision]]
|1
|19
|236
|256
|}
The result of rounding differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the [[discretization error]] and is limited by the [[machine epsilon]].
The arithmetical difference between two consecutive representable floating-point numbers which have the same exponent is called a [[unit in the last place]] (ULP). For example, if there is no representable number lying between the representable numbers 1.45a70c22<sub>hex</sub> and 1.45a70c24<sub>hex</sub>, the ULP is 2×16<sup>−8</sup>, or 2<sup>−31</sup>.
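Both quantities can be inspected directly in Python, whose floats are binary64 (the <code>math.ulp</code> function requires Python 3.9 or later):

```python
import math
import sys

# The ULP of 1.0 in binary64 is the machine epsilon, 2**-52.
print(sys.float_info.epsilon)  # 2.220446049250313e-16
print(math.ulp(1.0))           # same value

# ULPs scale with the exponent: between 2**53 and 2**54 the spacing
# of representable numbers is 2.0, so not every integer near 1e16
# has its own floating-point representation.
print(math.ulp(1e16))          # 2.0
```

This is why the rounding error of a single operation is bounded by half a ULP of the result (with round-to-nearest), an absolute bound that grows with the magnitude of the result.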
=== Rounding modes ===
=== Incidents ===
* On 25 February 1991, a [[loss of significance]] in a [[MIM-104 Patriot]] missile battery [[MIM-104 Patriot#Failure at Dhahran|prevented it from intercepting]] an incoming [[Al Hussein (missile)|Scud]] missile in [[Dhahran]], [[Saudi Arabia]], contributing to the death of 28 soldiers from the U.S. Army's [[14th Quartermaster Detachment]].<ref name="GAO report IMTEC 92-26"/> The system's internal clock counted time in tenths of a second, and converting that count to seconds required multiplying by 0.1, which has no exact binary representation; the chopped 24-bit approximation introduced a tiny per-tick error that, after about 100 hours of continuous operation, had accumulated to roughly 0.34 seconds, enough to shift the predicted position of the incoming missile outside the tracking window.
* {{Clarify|date=November 2024|reason=It is not clear how this is an incident (the section title may have to be modified to cover more than incidents) and how this is due to floating-point arithmetic (rather than number approximations in general).}}
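The magnitude of the Patriot clock drift can be reproduced roughly in Python, under the common reconstruction of the GAO report's description that the 24-bit register effectively kept 23 fractional bits of 0.1 (the exact register layout is an assumption of this sketch):

```python
# 0.1 has an infinite binary expansion, so chopping it to a fixed number of
# bits introduces a small per-tick error that grows linearly with uptime.
tenth_chopped = int(0.1 * 2**23) / 2**23    # 0.1 truncated to 23 fractional bits
ticks = 100 * 3600 * 10                     # clock ticks of 0.1 s in 100 hours
drift = 100 * 3600 - ticks * tenth_chopped  # accumulated error, in seconds
print(round(drift, 4))                      # 0.3433
```

A drift of about a third of a second corresponds to several hundred metres at the speed of an incoming Scud, consistent with the missed intercept described above.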
=== Machine precision and backward error analysis ===
<ref name="OpenEXR-half">{{cite web |url=https://openexr.com/en/latest/TechnicalIntroduction.html#the-half-data-type |title=Technical Introduction to OpenEXR – The half Data Type |publisher=openEXR |access-date=2024-04-16}}</ref>
<ref name="IEEE-754_Analysis">{{cite web|url=https://christophervickery.com/IEEE-754/|title=IEEE-754 Analysis|access-date=2024-08-29}}</ref>
<ref name="Goldberg_1991">{{cite journal |first=David |last=Goldberg |title=What Every Computer Scientist Should Know About Floating-Point Arithmetic |journal=[[ACM Computing Surveys]] |volume=23 |issue=1 |pages=5–48 |date=March 1991 |doi=10.1145/103162.103163}}</ref>
<ref name="Harris">{{Cite journal |title=You're Going To Have To Think! |first=Richard |last=Harris |journal=[[Overload (magazine)|Overload]] |issue=99 |date=October 2010 |issn=1354-3172 |pages=5–10 |url=http://accu.org/index.php/journals/1702 |access-date=2011-09-24 |quote=Far more worrying is cancellation error which can yield catastrophic loss of precision.}} [http://accu.org/var/uploads/journals/overload99.pdf]</ref>
<ref name="GAO report IMTEC 92-26">{{cite web |url=http://www.gao.gov/products/IMTEC-92-26 |title=Patriot missile defense, Software problem led to system failure at Dharhan, Saudi Arabia |id=GAO report IMTEC 92-26 |publisher=[[US Government Accounting Office]]}}</ref>