Content deleted Content added
m link should be to radix not base (exponentiation). also directly link to power of 10 |
|||
(46 intermediate revisions by 13 users not shown) | |||
Line 5:
{{Floating-point}}
In [[computing]], '''floating-point arithmetic''' ('''FP''') is [[arithmetic]] on subsets of [[real number]]s formed by a ''[[significand]]'' (a [[Sign (mathematics)|signed]] sequence of a fixed number of digits in some [[Radix|base]]
Numbers of this form are called '''floating-point numbers'''.<ref name="Muller_2010"/>{{rp|3}}<ref name="sterbenz1974fpcomp">{{cite book
|last1=Sterbenz
Line 23:
In practice, most floating-point systems use [[Binary number|base two]], though base ten ([[decimal floating point]]) is also common.
Floating-point arithmetic operations, such as addition and division, approximate the corresponding real number arithmetic operations by [[rounding]] any result that is not a floating-point number itself to a nearby floating-point number.<ref name="Muller_2010"/>{{rp|22}}<ref name="sterbenz1974fpcomp"/>{{rp|10}}
For example, in a floating-point arithmetic with five base-ten digits, the sum 12.345 + 1.0001 = 13.3451 might be rounded to 13.345.
The term ''floating point'' refers to the fact that the number's [[radix point]] can "float" anywhere to the left, right, or between the [[significant digits]] of the number. This position is indicated by the exponent, so floating point can be considered a form of [[scientific notation]].
A floating-point system can be used to represent, with a fixed number of digits, numbers of very different [[Orders of magnitude (numbers)|orders of magnitude]] — such as the number of meters [[Orders of magnitude (length)#100_zettametres|between galaxies]] or [[Orders of magnitude (length)#10_femtometres|between protons in an atom]]. For this reason, floating-point arithmetic is often used to allow very small and very large real numbers that require fast processing times. The result of this [[dynamic range]] is that the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers varies with their exponent.<ref name="Smith_1997"/>
Line 37:
The speed of floating-point operations, commonly measured in terms of [[FLOPS]], is an important characteristic of a [[computer system]], especially for applications that involve intensive mathematical calculations.
== Overview ==
Line 73:
When this is stored in memory using the IEEE 754 encoding, this becomes the [[significand]] {{mvar|s}}. The significand is assumed to have a binary point to the right of the leftmost bit. So, the binary representation of π is calculated from left-to-right as follows:
<math display=block>\begin{align}
\end{align}</math><!-- Ensure correct rounding by taking one more digit for the intermediate decimal approximation. -->
Line 96:
[[File:Quevedo 1917.jpg|thumb|upright=0.7|[[Leonardo Torres Quevedo]], in 1914, published an analysis of floating point based on the [[analytical engine]].]]
In 1914, the Spanish engineer [[Leonardo Torres Quevedo]] published ''Essays on Automatics'',<ref>Torres Quevedo, Leonardo. [https://quickclick.es/rop/pdf/publico/1914/1914_tomoI_2043_01.pdf Automática: Complemento de la Teoría de las Máquinas, (pdf)], pp. 575–583, Revista de Obras Públicas, 19 November 1914.</ref> where he designed a special-purpose electromechanical calculator based on [[Charles Babbage]]'s [[analytical engine]] and described a way to store floating-point numbers in a consistent manner. He stated that numbers will be stored in exponential format as ''n''
[[File:Konrad Zuse (1992).jpg|thumb|upright=0.7|right|[[Konrad Zuse]], architect of the [[Z3 (computer)|Z3]] computer, which uses a 22-bit binary floating-point representation]]
Line 122:
In 1989, mathematician and computer scientist [[William Kahan]] was honored with the [[Turing Award]] for being the primary architect behind this proposal; he was aided by his student Jerome Coonen and a visiting professor, [[Harold S. Stone|Harold Stone]].<ref name="Severance_1998"/>
Among the x86 (more specifically i8087) innovations are these:
* A precisely specified floating-point representation at the bit-string level, so that all compliant computers interpret bit patterns the same way. This makes it possible to accurately and efficiently transfer floating-point numbers from one computer to another (after accounting for [[endianness]]).
* A precisely specified behavior for the arithmetic operations: A result is required to be produced as if infinitely precise arithmetic were used to yield a value that is then rounded according to specific rules. This means that a compliant computer program would always produce the same result when given a particular input, thus mitigating the almost mystical reputation that floating-point computation had developed for its hitherto seemingly non-deterministic behavior.
* The ability of [[IEEE 754#Exception handling|exceptional conditions]] (overflow, [[Division by zero|divide by zero]], etc.) to propagate through a computation in a benign manner and then be handled by the software in a controlled fashion.
These features would be inherited into IEEE 754-1985 (with the exception of the encoding of special values and exceptions), though the extended internal precision of x87 means it requires explicit rounding of exact results directly to the destination precision in order to match standard IEEE 754 results.<ref name="Goldberg_1991"/> However, the behavior may not be the same as a rounding to the destination format due to a possible wider exponent range of the extended format.
== Range of floating-point numbers ==
Line 177 ⟶ 179:
=== Internal representation ===
Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and
{| class="wikitable" style="text-align:right; border:0"
|-
!rowspan="2" |
!colspan="4" |Bits for the encoding<!-- Since this is about the encoding, it should be clear that the number given for the significand below excludes the implicit bit, when this is used. -->
| rowspan="
!rowspan="2" |Exponent<br>bias
!rowspan="2" |Bits<br>precision
Line 193 ⟶ 195:
!Total
|-
|[[Half-precision
|1
|5
Line 202 ⟶ 204:
|~3.3
|-
|[[Single
|1
|8
Line 211 ⟶ 213:
|~7.2
|-
|[[Double
|1
|11
Line 220 ⟶ 222:
|~15.9
|-
|[[Extended precision#x86 extended
|1
|15
Line 229 ⟶ 231:
|~19.2
|-
|[[
|1
|15
Line 237 ⟶ 239:
|113
|~34.0
|-
|[[Octuple-precision floating-point format|Octuple]] (binary256)
|1
|19
|236
|256
|262143
|237
|~71.3
|}
While the exponent can be positive or negative, in binary formats it is stored as an unsigned number that has a fixed "bias" added to it. Values of all 0s in this field are reserved for the zeros and [[subnormal
In the IEEE binary interchange formats the leading
For example, it was shown above that π, rounded to 24 bits of precision, has:
Line 254 ⟶ 265:
== Other notable floating-point formats ==
In addition to the widely used [[IEEE 754]] standard formats, other floating-point formats are used, or have been used, in certain ___domain-specific areas.
* The [[Microsoft Binary Format|Microsoft Binary Format (MBF)]] was developed for the Microsoft BASIC language products, including Microsoft's first ever product the [[Altair BASIC]] (1975), [[TRS-80|TRS-80 LEVEL II]], [[CP/M]]'s [[MBASIC]], [[IBM PC 5150]]'s [[BASICA]], [[MS-DOS]]'s [[GW-BASIC]] and [[QuickBASIC]] prior to version 4.00. QuickBASIC version 4.00 and 4.50 switched to the IEEE 754-1985 format but can revert to the MBF format using the /MBF command option. MBF was designed and developed on a simulated [[Intel 8080]] by [[Monte Davidoff]], a dormmate of [[Bill Gates]], during spring of 1975 for the [[MITS Altair 8800]]. The initial release of July 1975 supported a single-precision (32 bits) format due to cost of the [[MITS Altair 8800]] 4-kilobytes memory. In December 1975, the 8-kilobytes version added a double-precision (64 bits) format. A single-precision (40 bits) variant format was adopted for other CPU's, notably the [[MOS 6502]] ([[Apple
* The [[
* The TensorFloat-32<ref name="Kharya_2020"/> format combines the 8 bits of exponent of the
* The [[Hopper (microarchitecture)|Hopper]] and [[CDNA 3]] architecture GPUs provide two FP8 formats: one with the same numerical range as half-precision (E5M2) and one with higher precision, but less range (E4M3).<ref name="NVIDIA_Hopper"/><ref name="Micikevicius_2022"/>
* The [[Blackwell (microarchitecture)|Blackwell]] and [[CDNA (microarchitecture)|CDNA 4]] GPU architecture includes support for FP6 (E3M2 and E2M3) and FP4 (E2M1) formats. FP4 is the smallest floating-point format which allows for all IEEE 754 principles (see [[minifloat]]).
{| class="wikitable"
|+ Comparison of common floating-point formats
!Type
!Sign
!Exponent
!Significand
!Total bits
|-
|FP4
|1
|2
|1
|4
|-
|FP6 (E2M3)
|1
|2
|3
|6
|-
|FP6 (E3M2)
|1
|3
|2
|6
|-
|FP8 (E4M3)
Line 285 ⟶ 315:
|16
|-
|[[
|1
|8
Line 291 ⟶ 321:
|16
|-
|[[TensorFloat-32]]
|1
|8
Line 302 ⟶ 332:
|23
|32
|-
|[[Double-precision floating-point format|Double-precision]]
|1
|11
|52
|64
|-
|[[Quadruple-precision floating-point format|Quadruple-precision]]
|1
|15
|112
|128
|-
|[[Octuple-precision floating-point format|Octuple-precision]]
|1
|19
|236
|256
|}
==Representable numbers, conversion and rounding {{anchor|Representable numbers}}==
By their nature, all numbers expressed in floating-point format are [[rational number]]s with a terminating expansion in the relevant base (for example, a terminating decimal expansion in base-10, or a terminating binary expansion in base-2). Irrational numbers, such as [[Pi|π]] or
When a number is represented in some format (such as a character string) which is not a native floating-point representation supported in a computer implementation, then it will require a conversion before it can be used in that implementation. If the number can be represented exactly in the floating-point format then the conversion is exact. If there is not an exact representation then the conversion requires a choice of which floating-point number to use to represent the original value. The representation chosen will have a different value from the original, and the value thus adjusted is called the ''rounded value''.
Line 333 ⟶ 381:
The result of rounding differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the [[discretization error]] and is limited by the [[machine epsilon]].
The arithmetical difference between two consecutive representable floating-point numbers which have the same exponent is called a [[unit in the last place]] (ULP). For example, if there is no representable number lying between the representable numbers 1.
=== Rounding modes ===
Line 527 ⟶ 575:
=== Incidents ===
* On 25 February 1991, a [[loss of significance]] in a [[MIM-104 Patriot]] missile battery [[MIM-104 Patriot#Failure at Dhahran|prevented it from intercepting]] an incoming [[Al Hussein (missile)|Scud]] missile in [[Dhahran]], [[Saudi Arabia]], contributing to the death of 28 soldiers from the U.S. Army's [[14th Quartermaster Detachment]].<ref name="GAO report IMTEC 92-26"/> The
* {{Clarify|date=November 2024|reason=It is not clear how this is an incident (the section title may have to be modified to cover more than incidents) and how this is due to floating-point arithmetic (rather than number approximations in general). The term
=== Machine precision and backward error analysis ===
Line 598 ⟶ 646:
Expectations from mathematics may not be realized in the field of floating-point computation. For example, it is known that <math>(x+y)(x-y) = x^2-y^2\,</math>, and that <math>\sin^2{\theta}+\cos^2{\theta} = 1\,</math>, however these facts cannot be relied on when the quantities involved are the result of floating-point computation.
The use of the equality test (<code>if (x==y) ...</code>) requires care when dealing with floating-point numbers. Even simple expressions like <code>0.6/0.2-3==0</code> will, on most computers, fail to be true<ref name="Christiansen_Perl"/> (in IEEE 754 double precision, for example, <code>0.6/0.2 - 3</code> is approximately equal to {{val|-4.44089209850063e-16}}). Consequently, such tests are sometimes replaced with "fuzzy" comparisons (<code>if (abs(x-y) < epsilon) ...</code>, where epsilon is sufficiently small and tailored to the application, such as 1.0E−13). The wisdom of doing this varies greatly, and can require numerical analysis to bound epsilon.<ref name="Higham_2002"/> Values derived from the primary data representation and their comparisons should be performed in a wider, extended, precision to minimize the risk of such inconsistencies due to round-off errors.<ref name="Kahan_2000_Marketing"/> It is often better to organize the code in such a way that such tests are unnecessary. For example, in [[computational geometry]], exact tests of whether a point lies off or on a line or plane defined by other points can be performed using adaptive precision or exact arithmetic methods.<ref name="Shewchuk"/>
Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are [[matrix inversion]], [[eigenvector]] computation, and differential equation solving. These algorithms must be very carefully designed, using numerical approaches such as [[iterative refinement]], if they are to work well.<ref name="Kahan_1997_Cantilever"/>
Line 723 ⟶ 771:
<ref name="OpenEXR-half">{{cite web |url=https://openexr.com/en/latest/TechnicalIntroduction.html#the-half-data-type |title=Technical Introduction to OpenEXR – The half Data Type |publisher=openEXR |access-date=2024-04-16}}</ref>
<ref name="IEEE-754_Analysis">{{cite web|url=https://christophervickery.com/IEEE-754/|title=IEEE-754 Analysis|access-date=2024-08-29}}</ref>
<ref name="Goldberg_1991">{{cite journal |first=David |last=Goldberg
<ref name="Harris">{{Cite journal |title=You're Going To Have To Think! |first=Richard |last=Harris |journal=[[Overload (magazine)|Overload]] |issue=99 |date=October 2010 |issn=1354-3172 |pages=5–10 |url=http://accu.org/index.php/journals/1702 |access-date=2011-09-24 |quote=Far more worrying is cancellation error which can yield catastrophic loss of precision.}} [http://accu.org/var/uploads/journals/overload99.pdf]</ref>
<ref name="GAO report IMTEC 92-26">{{cite web |url=http://www.gao.gov/products/IMTEC-92-26 |title=Patriot missile defense, Software problem led to system failure at Dharhan, Saudi Arabia |id=GAO report IMTEC 92-26 |publisher=[[US Government Accounting Office]]}}</ref>
|