Floating-point arithmetic: Difference between revisions


Revision as of 16:53, 15 June 2025 by Jacobolus (m →Floating-point numbers: re-align long equation following typographic conventions, and manually size parens around summation sign (LaTeX default is too big))
Latest revision as of 13:43, 25 August 2025 by Cloudream (m →Other notable floating-point formats; Tag: Visual edit)
(10 intermediate revisions by 5 users not shown)
Line 37:
The speed of floating-point operations, commonly measured in terms of [[FLOPS]], is an important characteristic of a [[computer system]], especially for applications that involve intensive mathematical calculations. ~~A~~ Floating-point numbers can be computed using software implementations (softfloat) or hardware implementations (hardfloat). [[floating-point unit|Floating-point units]] (~~FPU~~ FPUs, colloquially ~~a~~ math [[coprocessor|coprocessors]]) ~~is a part of a computer system~~ are specially designed to carry out operations on floating-point numbers and are part of most computer systems. When FPUs are not available, software implementations can be used instead.
== Overview ==

Line 268:
* The [[bfloat16 floating-point format|bfloat16 format]] requires the same amount of memory (16 bits) as the [[Half-precision floating-point format|IEEE 754 half-precision format]], but allocates 8 bits to the exponent instead of 5, thus providing the same range as a [[Single-precision floating-point format|IEEE 754 single-precision]] number. The tradeoff is a reduced precision, as the trailing significand field is reduced from 10 to 7 bits. This format is mainly used in the training of [[machine learning]] models, where range is more valuable than precision. Many machine learning accelerators provide hardware support for this format.
* The TensorFloat-32<ref name="Kharya_2020"/> format combines the 8 bits of exponent of the bfloat16 with the 10 bits of trailing significand field of half-precision formats, resulting in a size of 19 bits. This format was introduced by [[Nvidia]], which provides hardware support for it in the Tensor Cores of its [[Graphics processing unit|GPUs]] based on the Nvidia Ampere architecture. The drawback of this format is its size, which is not a power of 2.
However, according to Nvidia, this format should only be used internally by hardware to speed up computations, while inputs and outputs should be stored in the 32-bit single-precision IEEE 754 format.<ref name="Kharya_2020"/>
* The [[Hopper (microarchitecture)|Hopper]] and [[CDNA 3]] architecture GPUs provide two FP8 formats: one with the same numerical range as half-precision (E5M2) and one with higher precision, but less range (E4M3).<ref name="NVIDIA_Hopper"/><ref name="Micikevicius_2022"/>
* The [[Blackwell (microarchitecture)|Blackwell]] and [[CDNA (microarchitecture)|CDNA 4]] GPU architectures include support for FP6 (E3M2 and E2M3) and FP4 (E2M1) formats. FP4 is the smallest floating-point format that allows for all IEEE 754 principles (see [[minifloat]]).
{| class="wikitable"

Line 575:
=== Incidents ===
* On 25 February 1991, a [[loss of significance]] in a [[MIM-104 Patriot]] missile battery [[MIM-104 Patriot#Failure at Dhahran|prevented it from intercepting]] an incoming [[Al Hussein (missile)|Scud]] missile in [[Dhahran]], [[Saudi Arabia]], contributing to the death of 28 soldiers from the U.S. Army's [[14th Quartermaster Detachment]].<ref name="GAO report IMTEC 92-26"/> ~~The error was actually introduced by a [[Fixed-point arithmetic|fixed-point]] computation,<ref name="Skeel"/> but the underlying issue would have been the same with floating-point arithmetic.~~ The weapons control computer counted time in an integer number of tenths of a second since boot. For conversion to a floating-point number of seconds in velocity and position calculations, the software originally multiplied this number by a 24-bit [[Fixed-point arithmetic|fixed-point]] binary approximation to 0.1, specifically <math display="block">0.00011001100110011001100_2 = 0.1 \times (1 - 2^{-20}).</math> Some parts of the software were later adapted to use a more accurate conversion to floating-point, but some parts were not updated and still used the 24-bit approximation.<ref name="Skeel"/> These parts of the software drifted from one another by about 3.43 milliseconds per hour. After 20 hours, the discrepancy of about 68.7 ms was enough for the radar tracking system to lose track of Scuds; the control system in the Dhahran missile battery had been running for about 100 hours when it failed to track and intercept an incoming Scud.<ref name="GAO report IMTEC 92-26"/> The failure to intercept arose not from using floating point specifically, but from subtracting two different approximations to unit conversion with different errors when representing time, so the unit conversion error in the difference did not cancel out but rather grew indefinitely with uptime.<ref name="Skeel"/>
* {{Clarify|date=November 2024|reason=It is not clear how this is an incident (the section title may have to be modified to cover more than incidents) and how this is due to floating-point arithmetic (rather than number approximations in general). The term 'invisible' may also be misleading without following explanations.
|text=[[Salami slicing tactics#Financial schemes|Salami slicing]] is the practice of removing the 'invisible' part of a transaction into a separate account.}}
=== Machine precision and backward error analysis ===

Line 771:
<ref name="OpenEXR-half">{{cite web |url=https://openexr.com/en/latest/TechnicalIntroduction.html#the-half-data-type |title=Technical Introduction to OpenEXR – The half Data Type |publisher=openEXR |access-date=2024-04-16}}</ref>
<ref name="IEEE-754_Analysis">{{cite web|url=https://christophervickery.com/IEEE-754/|title=IEEE-754 Analysis|access-date=2024-08-29}}</ref>
<ref name="Goldberg_1991">{{cite journal |first=David |last=Goldberg ~~|author-link=David Goldberg (PARC)~~ |title=What Every Computer Scientist Should Know About Floating-Point Arithmetic |journal=[[ACM Computing Surveys]] |date=March 1991 |volume=23 |issue=1 |pages=5–48 |doi=10.1145/103162.103163 |doi-access=free |s2cid=222008826}} (With the addendum "Differences Among IEEE 754 Implementations": [https://web.archive.org/web/20171011072644/http://www.cse.msu.edu/~cse320/Documents/FloatingPoint.pdf], [https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html])</ref>
<ref name="Harris">{{Cite journal |title=You're Going To Have To Think! |first=Richard |last=Harris |journal=[[Overload (magazine)|Overload]] |issue=99 |date=October 2010 |issn=1354-3172 |pages=5–10 |url=http://accu.org/index.php/journals/1702 |access-date=2011-09-24 |quote=Far more worrying is cancellation error which can yield catastrophic loss of precision.}} [http://accu.org/var/uploads/journals/overload99.pdf]</ref>
<ref name="GAO report IMTEC 92-26">{{cite web |url=http://www.gao.gov/products/IMTEC-92-26 |title=Patriot missile defense, Software problem led to system failure at Dhahran, Saudi Arabia |id=GAO report IMTEC 92-26 |publisher=[[US Government Accounting Office]]}}</ref>
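
The drift figures quoted in the Patriot entry under "Incidents" can be checked with exact rational arithmetic. The following is a minimal sketch in Python, not taken from the article or its sources. It assumes the "more accurate conversion" can be treated as exactly 0.1 for the purpose of the estimate; the names `approx_tenth`, `exact_tenth`, and `drift_seconds` are hypothetical and chosen here for illustration.

```python
from fractions import Fraction

# Fixed-point approximation to 0.1 quoted in the Patriot entry:
# 0.00011001100110011001100 in binary (23 digits after the point).
FRACTION_BITS = "00011001100110011001100"
approx_tenth = Fraction(int(FRACTION_BITS, 2), 2 ** len(FRACTION_BITS))

# Assumption: the more accurate conversion is modelled as exactly 1/10.
exact_tenth = Fraction(1, 10)

# Relative error of the approximation; the quoted equation gives 2**-20.
relative_error = (exact_tenth - approx_tenth) / exact_tenth

# The clock counts tenths of a second, so one hour is 36,000 ticks.
TICKS_PER_HOUR = 10 * 60 * 60


def drift_seconds(hours):
    """Seconds by which the truncated conversion lags the exact one."""
    return (exact_tenth - approx_tenth) * TICKS_PER_HOUR * hours


print(relative_error == Fraction(1, 2 ** 20))                  # True
print(f"{float(drift_seconds(1)) * 1e3:.2f} ms per hour")      # 3.43 ms per hour
print(f"{float(drift_seconds(20)) * 1e3:.1f} ms after 20 hours")  # 68.7 ms after 20 hours
```

Using `Fraction` keeps every intermediate value exact, so the only rounding happens when the results are formatted for printing. The error grows linearly with uptime because each tick contributes the same fixed shortfall.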