Quadruple-precision floating-point format: Difference between revisions

structured explanation of sign, exponent and significand bits
Tags: Reverted Visual edit
Restored revision 1266771956 by MrOllie (talk): Rv more editorializing, see WP:NOR
Line 2:
{{Floating-point}}
{{Computer architecture bit widths}}
In [[computing]], '''quadruple precision''' (or '''quad precision''') is a binary [[Floating-point arithmetic|floating-point]]–based [[computer number format]] that occupies 16 bytes (128 bits) with precision at least twice the 53-bit [[Double-precision floating-point format|double precision]].
 
This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision,<ref>{{cite web |last1=Bailey |first1=David H. |last2=Borwein |first2=Jonathan M. |date=July 6, 2009 |title=High-Precision Computation and Mathematical Physics |url=https://www.davidhbailey.com/dhbpapers/dhb-jmb-acat08.pdf}}</ref> but also, as a primary function, to allow the computation of double precision results more reliably and accurately by minimising overflow and [[round-off error]]s in intermediate calculations and scratch variables. [[William Kahan]], primary architect of the original IEEE 754 floating-point standard, noted, "For now the [[extended precision#x86 Architecture Extended Precision Format|10-byte Extended format]] is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when [[IEEE 754|IEEE Standard 754 for Floating-Point Arithmetic]] was framed."<ref>{{cite book |last=Higham |first=Nicholas |title="Designing stable algorithms" in Accuracy and Stability of Numerical Algorithms (2 ed) |publisher=SIAM |year=2002 |pages=43}}</ref>
The quadruple (base-2, 128-bit) format in the standard [[IEEE 754]] is named '''binary128'''.
 
In [[IEEE 754-2008]] the 128-bit base-2 format is officially referred to as '''binary128'''.
== Purpose and use ==
This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision,<ref>{{cite web |last1=Bailey |first1=David H. |last2=Borwein |first2=Jonathan M. |date=July 6, 2009 |title=High-Precision Computation and Mathematical Physics |url=https://www.davidhbailey.com/dhbpapers/dhb-jmb-acat08.pdf}}</ref> but also, as a primary function, to allow the computation of double precision results more reliably and accurately by minimising overflow and [[round-off error]]s in intermediate calculations and scratch variables.
 
Note that on today's common 64-bit hardware, 128-bit computations are often '''significantly slower''' than computations on smaller data types.
 
== Range and precision ==
binary128 provides 113 bits (about 34 decimal digits) of precision for normal values and an enormous range: from the smallest 'denormal' magnitude of about ±6.5E-4966, through the smallest normal magnitude of ±3.3621031431120935062626778173217526E-4932 (from which full precision is available), up to a maximum of ±1.189731495357231765085759326628007E+4932.
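
The limits quoted above follow from the format parameters (15 exponent bits with a bias of 16383, and 112 trailing significand bits). The following is a minimal sketch that recomputes them with Python's standard <code>decimal</code> module; the function name <code>binary128_limits</code> and the 40-digit working precision are assumptions made for illustration, not part of any standard.

```python
from decimal import Decimal, localcontext

BIAS = 16383          # exponent bias of binary128
TRAILING_BITS = 112   # explicitly stored significand bits


def binary128_limits(digits: int = 40):
    """Return (min subnormal, min normal, max finite) as Decimal approximations.

    `digits` is the working precision of the decimal context (an
    illustrative choice); the last digit or so may be affected by rounding.
    """
    with localcontext() as ctx:
        ctx.prec = digits
        two = Decimal(2)
        # Smallest subnormal: hidden bit 0, only the last trailing bit set.
        min_subnormal = two ** (1 - BIAS - TRAILING_BITS)   # 2**-16494
        # Smallest normal: hidden bit 1, effective exponent -16382.
        min_normal = two ** (1 - BIAS)                       # 2**-16382
        # Largest finite: all significand bits set, effective exponent +16383.
        max_finite = (two - two ** -TRAILING_BITS) * two ** BIAS
    return min_subnormal, min_normal, max_finite


names = ("min subnormal", "min normal", "max finite")
for name, value in zip(names, binary128_limits()):
    print(f"{name:>13}: {value}")
```

The printed values can be compared with the decimal figures given in the paragraph above.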
 
== IEEE 754 quadruple-precision binary floating-point format: binary128 ==
Line 21 ⟶ 15:
* [[Significand]] [[precision (arithmetic)|precision]]: 113 bits (112 explicitly stored)
<!-- "significand", with a d at the end, is a technical term, please do not confuse with "significant" -->
{| class="wikitable"
|+ binary128 encoding
|-
! <u>S</u>ign !! <u>E</u>xponent
! <u>H</u>idden leading significand bit !! <u>T</u>railing significand bits
|-
! 1 bit !! {{val|15|u=bits}}
!1 bit!! {{val|112|u=bits}}
|-
| <code>s</code> || eeeeeeeeeeeeeee
|h|| tt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
|}
 
* The sign bit determines the sign of the number (including when this number is zero, which is [[Signed zero|signed]]), "1" stands for negative.
 
* The exponent bits encode a binary integer from 0 to 32767, from which a 'bias' of 16383 is subtracted to get the 'effective exponent', which lies between −16382 and +16383 for normal values. The biased values 0 and 32767 (de-biased −16383 and +16384) are reserved: biased 0 encodes zeroes and 'denormals', and biased 32767 encodes infinities and NaNs (Not a Number).
 
* The sign bit determines the sign of the number (including when this number is zero, which is [[Signed zero|signed]]). "1" stands for negative.
* The hidden (implicit) bit combined with the trailing significand bits encodes a binary value which is (mostly, see 'integral view') understood as h.tt tttttttttt .. tttttttttt<sub>b</sub>. The hidden bit is "1" for 'normal' values. 'Denormal' values, which fill the gap between the smallest value with full 113-bit precision and zero with gracefully degrading relative precision, are encoded with a biased exponent of 0 (de-biased −16383) and are then calculated with a hidden bit of "0" and an effective exponent of −16382.
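
The field rules in the list above can be sketched as a small decoder. This is an illustrative sketch only: the function name <code>decode_binary128</code> is hypothetical, the bit pattern is assumed to be supplied as an unsigned 128-bit Python integer with the sign bit as the most significant bit, and exact values are returned as <code>fractions.Fraction</code>, which cannot represent the sign of a negative zero.

```python
from fractions import Fraction

EXP_BITS = 15
TRAILING_BITS = 112
BIAS = 16383


def decode_binary128(bits: int):
    """Decode a 128-bit pattern into an exact Fraction, or a float inf/NaN."""
    if not 0 <= bits < (1 << 128):
        raise ValueError("expected an unsigned 128-bit integer")

    sign = bits >> 127
    biased = (bits >> TRAILING_BITS) & ((1 << EXP_BITS) - 1)
    trailing = bits & ((1 << TRAILING_BITS) - 1)

    # Biased exponent 32767 is reserved for infinities and NaNs.
    if biased == (1 << EXP_BITS) - 1:
        if trailing:
            return float("nan")
        return float("-inf") if sign else float("inf")

    if biased == 0:
        # Zero or 'denormal': hidden bit 0, effective exponent -16382.
        hidden, exponent = 0, 1 - BIAS
    else:
        # Normal value: hidden bit 1, exponent is the de-biased field.
        hidden, exponent = 1, biased - BIAS

    # h.ttt...t read as a binary fraction with 112 bits after the point.
    significand = Fraction((hidden << TRAILING_BITS) | trailing, 1 << TRAILING_BITS)
    magnitude = significand * Fraction(2) ** exponent
    return -magnitude if sign else magnitude


print(decode_binary128(0x3FFF << 112))   # biased exponent 16383, significand 1.0
print(decode_binary128(0xC000 << 112))   # sign bit set, biased exponent 16384
```

Returning exact rationals avoids any dependence on a native 128-bit floating-point type, which Python's built-in <code>float</code> does not provide.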
 
This gives from 33 to 36 significant decimal digits precision. If a decimal string with at most 33 significant digits is converted to the IEEE 754 quadruple-precision format, giving a normal number, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 quadruple-precision number is converted to a decimal string with at least 36 significant digits, and then converted back to quadruple-precision representation, the final result must match the original number.<ref name="whyieee">{{cite web |author=Kahan |first=William |date=1 October 1987 |title=Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic |url=http://www.cs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF}}</ref>
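
The bounds 33 and 36 can be reproduced from the significand width ''p'' = 113 using the commonly cited round-trip formulas floor((''p'' − 1)·log<sub>10</sub> 2) and ceil(1 + ''p''·log<sub>10</sub> 2). The sketch below assumes those formulas; the variable names are illustrative.

```python
import math

P = 113  # significand precision of binary128 in bits (112 stored + 1 hidden)

# Decimal digits that survive decimal -> binary128 -> decimal.
digits_preserved = math.floor((P - 1) * math.log10(2))

# Decimal digits needed so that binary128 -> decimal -> binary128 is exact.
digits_required = math.ceil(1 + P * math.log10(2))

print(digits_preserved, digits_required)   # 33 36
```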
Line 189 ⟶ 167:
 
Quadruple-precision (128-bit) hardware implementation should not be confused with "128-bit FPUs" that implement [[Single instruction, multiple data|SIMD]] instructions, such as [[Streaming SIMD Extensions]] or [[AltiVec]], which refers to 128-bit [[Vector processor|vectors]] of four 32-bit single-precision or two 64-bit double-precision values that are operated on simultaneously.
 
== History, additional info ==
[[William Kahan]], primary architect of the original IEEE 754 floating-point standard, noted, "For now the [[extended precision#x86 Architecture Extended Precision Format|10-byte Extended format]] is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when [[IEEE 754|IEEE Standard 754 for Floating-Point Arithmetic]] was framed."<ref>{{cite book |last=Higham |first=Nicholas |title="Designing stable algorithms" in Accuracy and Stability of Numerical Algorithms (2 ed) |publisher=SIAM |year=2002 |pages=43}}</ref>
 
== See also ==
* [[IEEE 754]], IEEE standard for floating-point arithmetic
* [[ISO/IEC 10967]], Language independent arithmetic
* [[Primitive data type]]