Quadruple-precision floating-point format: Difference between revisions

Revision as of 17:48, 28 December 2024, by Vincent Lefèvre: →IEEE 754 quadruple-precision binary floating-point format: binary128: added some details; use standard terminology.
Latest revision as of 01:32, 18 August 2025, by Vincent Lefèvre: Undid revision 1306475879 by 192.52.240.206 (talk). This example does not bring anything new. Tag: Undo
(47 intermediate revisions by 20 users not shown)
Line 2:
{{Floating-point}}
{{Computer architecture bit widths}}
In [[computing]], '''quadruple precision''' (or '''quad precision''') is a binary [[Floating-point arithmetic\|floating-point]]–based [[computer number format]] that occupies 16 bytes ([[128-bit computing\|128 bits]]) with precision at least twice the 53-bit [[Double-precision floating-point format\|double precision]]. This 128-bit quadruple precision is designed ~~not only~~ for applications ~~requiring~~needing results in higher than double precision,<ref>{{cite web \|last1=Bailey \|first1=David H. \|last2=Borwein \|first2=Jonathan M. \|date=July 6, 2009 \|title=High-Precision Computation and Mathematical Physics \|url=https://www.davidhbailey.com/dhbpapers/dhb-jmb-acat08.pdf}}</ref> ~~but also,~~and as a primary function, to allow ~~the computation of~~computing double precision results more reliably and accurately by minimising overflow and [[round-off error]]s in intermediate calculations and scratch variables. [[William Kahan]], primary architect of the original IEEE 754 floating-point standard, noted, "For now the [[extended precision#x86 Architecture Extended Precision Format\|10-byte Extended format]] is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when [[IEEE 754\|IEEE Standard 754 for Floating-Point Arithmetic]] was framed."<ref>{{cite book\|first=Nicholas \| last=Higham \|title="Designing stable algorithms" in Accuracy and Stability of Numerical Algorithms (2 ed)\| publisher=SIAM\|year=2002 \| pages=43 }}</ref> In [[IEEE 754-2008]] the 128-bit base-2 format is officially referred to as '''binary128'''.

Line 35:
The stored exponents 0000<sub>16</sub> and 7FFF<sub>16</sub> are interpreted specially.

{\| class="wikitable" style="text-align: center;"
\|-
! 
Exponent !! Significand zero !! Significand non-zero !! Equation
\|-
Line 42 ⟶ 43:
\| 0001<sub>16</sub>, ..., 7FFE<sub>16</sub> \|\|colspan=2\| normalized value \|\| (−1)<sup>signbit</sup> × 2<sup>exponentbits<sub>2</sub> − 16383</sup> × 1.significandbits<sub>2</sub>
\|-
\| 7FFF<sub>16</sub> \|\| ±[[infinity\|∞]] \|\| [[NaN]] (quiet, ~~signalling~~signaling)
\|}

Line 49 ⟶ 50:
=== Quadruple precision examples ===
These examples are given in bit ''representation'', in [[hexadecimal]], of the floating-point value. This includes the sign, (biased) exponent, and significand.

{\| style="font-family: monospace, monospace;"
\|-
\|
0000 0000 0000 0000 0000 0000 0000 0001<sub>16</sub> = 2<sup>−16382</sup> × 2<sup>−112</sup> = 2<sup>−16494</sup><br />
{{spaces\|42}}≈ 6.4751751194380251109244389582276465525 × 10<sup>−4966</sup><br />
{{spaces\|42}}(smallest positive subnormal number)

0000 ffff ffff ffff ffff ffff ffff ffff<sub>16</sub> = 2<sup>−16382</sup> × (1 − 2<sup>−112</sup>)<br />
{{spaces\|42}}≈ 3.3621031431120935062626778173217519551 × 10<sup>−4932</sup><br />
{{spaces\|42}}(largest subnormal number)

0001 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = 2<sup>−16382</sup><br />
{{spaces\|42}}≈ 3.3621031431120935062626778173217526026 × 10<sup>−4932</sup><br />
{{spaces\|42}}(smallest positive normal number)

7ffe ffff ffff ffff ffff ffff ffff ffff<sub>16</sub> = 2<sup>16383</sup> × (2 − 2<sup>−112</sup>)<br />
{{spaces\|42}}≈ 1.1897314953572317650857593266280070162 × 10<sup>4932</sup><br />
{{spaces\|42}}(largest normal number)

3ffe ffff ffff ffff ffff ffff ffff ffff<sub>16</sub> = 1 − 2<sup>−113</sup><br />
{{spaces\|42}}≈ 0.9999999999999999999999999999999999037<br />
{{spaces\|42}}(largest number less than one)

3fff 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = 1 (one)

3fff 0000 0000 0000 0000 0000 0000 0001<sub>16</sub> = 1 + 2<sup>−112</sup><br />
{{spaces\|42}}≈ 1.0000000000000000000000000000000001926<br />
{{spaces\|42}}(smallest number larger than one)

4000 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = 2<br />
c000 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = −2

0000 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = 0<br />
8000 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = −0

7fff 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = infinity<br />
ffff 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = −infinity

4000 921f b544 42d1 8469 898c c517 01b8<sub>16</sub> ≈ 3.1415926535897932384626433832795027975<br />
{{spaces\|42}}(closest approximation to π)

3ffd 5555 5555 5555 5555 5555 5555 5555<sub>16</sub> ≈ 0.3333333333333333333333333333333333173<br />
{{spaces\|42}}(closest approximation to 1/3)

4008 74d9 9564 5aa0 0c11 d0cc 9770 5e5b<sub>16</sub> ≈ 745.69987158227021999999999999999997147<br />
{{spaces\|42}}(closest approximation to the number of<br />
{{spaces\|42}}Watts corresponding to 1 [[horsepower]])
\|}

By default, 1/3 rounds down like [[double precision]], because of the odd number of bits in the significand. Thus, the bits beyond the rounding point are <code>0101...</code> which is less than 1/2 of a [[unit in the last place]].

== Double-double arithmetic ==
A common software technique to implement nearly quadruple precision using ''pairs'' of [[double-precision]] values is sometimes called '''double-double arithmetic'''.<ref name=Hida>Yozo Hida, X. Li, and D. H. Bailey, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5769 Quad-Double Arithmetic: Algorithms, Implementation, and Application], Lawrence Berkeley National Laboratory Technical Report LBNL-46996 (2000). Also Y. Hida et al., [~~http~~https://web.mit.edu/tabbott/Public/quaddouble-debian/qd-2.3.4-old/docs/qd.pdf Library for double-double and quad-double arithmetic] (2007).</ref><ref name="Shewchuk">J. R. Shewchuk, [https://www.cs.cmu.edu/~quake/robust.html Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates], [[Discrete & Computational Geometry]] 18: 305–363, 1997.</ref><ref name="Knuth-4.2.3-pr9">{{cite book \|last=Knuth \|first=D. E. \|title=The Art of Computer Programming \|edition=2nd \|at=chapter 4.2.3. problem 9. }}</ref> Using pairs of IEEE double-precision values with 53-bit significands, double-double arithmetic provides operations on numbers with significands of at least<ref name=Hida/> {{nowrap\|1=2 × 53 = 106 bits}} (actually 107 bits<ref>Robert Munafo. [~~http~~https://mrob.com/pub/math/f161.html F107 and F161 High-Precision Floating-Point Data Types] (2011).</ref> except for some of the largest values, due to the limited exponent range), only slightly less precise than the 113-bit significand of IEEE binary128 quadruple precision.
The range of a double-double remains essentially the same as the double-precision format because the exponent still has 11 bits,<ref name=Hida /> significantly fewer than the 15 exponent bits of IEEE quadruple precision (a range of {{nowrap\|1.8 × 10<sup>308</sup>}} for double-double versus {{nowrap\|1.2 × 10<sup>4932</sup>}} for binary128). In particular, a double-double/quadruple-precision value ''q'' in the double-double technique is represented implicitly as a sum {{nowrap\|1=''q'' = ''x'' + ''y''}} of two double-precision values ''x'' and ''y'', each of which supplies half of ''q''<nowiki/>'s significand.<ref name=Shewchuk/> That is, the pair {{nowrap\|(''x'', ''y'')}} is stored in place of ''q'', and operations on ''q'' values {{nowrap\|(+, −, ×, ...)}} are transformed into equivalent (but more complicated) operations on the ''x'' and ''y'' values. Thus, arithmetic in this technique reduces to a sequence of double-precision operations; since double-precision arithmetic is commonly implemented in hardware, double-double arithmetic is typically substantially faster than more general [[arbitrary-precision arithmetic]] techniques.<ref name=Hida/><ref name=Shewchuk/>

Line 99 ⟶ 109:
* As the magnitude of the value decreases, the amount of extra precision also decreases. Therefore, the normalized range is narrower than in double precision. The smallest number with full precision is {{nowrap\|1000...0<sub>2</sub> (106 zeros) × 2<sup>−1074</sup>}}, or {{nowrap\|1.000...0<sub>2</sub> (106 zeros) × 2<sup>−968</sup>}}. Numbers whose magnitude is smaller than 2<sup>−1021</sup> will not have additional precision compared with double precision.
* The actual number of bits of precision can vary. In general, the magnitude of the low-order part of the number is no greater than half a [[Unit in the last place\|ULP]] of the high-order part. If the low-order part is less than half a ULP of the high-order part, significant bits (either all 0s or all 1s) are implied between the ~~significant~~significand of the high-order and low-order numbers. Certain algorithms that rely on having a fixed number of bits in the significand can fail when using 128-bit long double numbers.
* For the reason above, it is possible to represent values like {{nowrap\|1 + 2<sup>−1074</sup>}}, which is the smallest representable number greater than 1.

In addition to double-double arithmetic, it is also possible to build triple-double or quad-double arithmetic if higher precision is required without any higher-precision floating-point library. These values are represented as a sum of three (or four) double-precision values respectively, and can represent operations with at least 159/161 and 212/215 bits respectively. A natural extension to an arbitrary number of terms (though limited by the exponent range) is called ''floating-point expansions''. A similar technique can be used to produce a '''double-quad arithmetic''', in which a value is represented as a sum of two quadruple-precision values. It can represent operations with at least 226 (or 227) bits.<ref>sourceware.org [~~http~~https://sourceware.org/legacy-ml/libc-alpha/2012-03/msg01024.html Re: The state of glibc libm]</ref>

== Implementations ==
Quadruple precision is often implemented in software by a variety of techniques (such as the double-double technique above, although that technique does not implement IEEE quadruple precision), since direct hardware support for quadruple precision is, {{as of\|2016\|lc=on}}, less common (see "[[#Hardware support\|Hardware support]]" below). One can use general [[arbitrary-precision arithmetic]] libraries to obtain quadruple (or higher) precision, but specialized quadruple-precision implementations may achieve higher performance.
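
As a rough illustration of the double-double technique described above, the following Python sketch stores a value ''q'' as a pair (''x'', ''y'') of doubles and adds two such pairs using only double-precision operations. The names <code>two_sum</code>, <code>quick_two_sum</code>, <code>dd_add</code> and <code>dd_value</code> are illustrative and not taken from any particular library. The addition shown is a simple variant rather than the most accurate one, overflow and special values are ignored, and the sketch assumes that Python floats are IEEE 754 double precision with round-to-nearest.

```python
from fractions import Fraction
from math import ldexp
from typing import Tuple

DD = Tuple[float, float]  # (high part x, low part y); represented value is x + y


def two_sum(a: float, b: float) -> DD:
    """Error-free addition: s is the rounded sum, e the exact rounding error."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def quick_two_sum(a: float, b: float) -> DD:
    """Cheaper error-free addition, intended for |a| >= |b|."""
    s = a + b
    e = b - (s - a)
    return s, e


def dd_add(p: DD, q: DD) -> DD:
    """Add two double-double values (simple variant, assumed for illustration)."""
    s, e = two_sum(p[0], q[0])   # exact sum of the two high parts
    e += p[1] + q[1]             # fold in the low parts (this step rounds)
    return quick_two_sum(s, e)   # renormalize into (high, low)


def dd_value(p: DD) -> Fraction:
    """Exact rational value of a double-double pair, for checking results."""
    return Fraction(p[0]) + Fraction(p[1])


# 1 + 2**-1074 is not representable as a double, but it is as a double-double.
tiny = ldexp(1.0, -1074)  # smallest positive subnormal double
one_plus_tiny = dd_add((1.0, 0.0), (tiny, 0.0))

# The rounding error of 0.1 + 0.2 in double precision is recovered exactly.
s, e = two_sum(0.1, 0.2)
```

Here <code>one_plus_tiny</code> is the pair (1.0, 2<sup>−1074</sup>), whereas the plain double-precision sum <code>1.0 + tiny</code> rounds back to 1.0. <code>dd_value</code> uses exact rational arithmetic only to check results; it is not part of the technique itself.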
=== Computer-language support === A separate question is the extent to which quadruple-precision types are directly incorporated into computer [[programming language]]s. Line 116 ⟶ 126: For the [[C (programming language)\|C programming language]], ISO/IEC TS 18661-3 (floating-point extensions for C, interchange and extended types) specifies <code>_Float128</code> as the type implementing the IEEE 754 quadruple-precision format (binary128).<ref>{{cite web\|title=ISO/IEC TS 18661-3\|url=https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1945.pdf\|date=2015-06-10\|access-date=2019-09-22}}</ref> Alternatively, in [[C (programming language)\|C]]/[[C++]] with a few systems and compilers, quadruple precision may be specified by the [[long double]] type, but this is not required by the language (which only requires <code>long double</code> to be at least as precise as <code>double</code>), nor is it common. As of [[C++23]], the C++ language defines a <code><stdfloat></code> header that contains fixed-width floating-point types. Implementations of these are optional, but if supported, <code>std::float128_t</code> corresponds to quadruple precision. On x86 and x86-64, the most common C/C++ compilers implement <code>long double</code> as either 80-bit [[extended precision]] (e.g. the [[GNU C Compiler]] gcc<ref>[https://web.archive.org/web/20080713131713/https://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html i386 and x86-64 Options (archived copy on web.archive.org)], ''Using the GNU Compiler Collection''.</ref> and the [[Intel C++ Compiler]] with a <code>/Qlong‑double</code> switch<ref>[http://software.intel.com/en-us/articles/size-of-long-integer-type-on-different-architecture-and-os/ Intel Developer Site].</ref>) or simply as being synonymous with double precision (e.g. [[Microsoft Visual C++]]<ref>[http://msdn.microsoft.com/en-us/library/9cx8xs15.aspx MSDN homepage, about Visual C++ compiler].</ref>), rather than as quadruple precision. 
The procedure call standard for the [[ARM architecture#AArch64\|ARM 64-bit architecture]] (AArch64) specifies that <code>long double</code> corresponds to the IEEE 754 quadruple-precision format.<ref>{{cite web\|title=Procedure Call Standard for the ARM 64-bit Architecture (AArch64)\|url=http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf\|date=2013-05-22\|access-date=2019-09-22\|archive-url=https://web.archive.org/web/20191016000704/http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf\|archive-date=2019-10-16\|url-status=dead}}</ref> On a few other architectures, some C/C++ compilers implement <code>long double</code> as quadruple precision, e.g. gcc on [[PowerPC]] (as double-double<ref>[https://gcc.gnu.org/onlinedocs/gcc/RS_002f6000-and-PowerPC-Options.html RS/6000 and PowerPC Options], ''Using the GNU Compiler Collection''.</ref><ref>[https://developer.apple.com/legacy/mac/library/documentation/Performance/Conceptual/Mac_OSX_Numerics/Mac_OSX_Numerics.pdf Inside Macintosh – PowerPC Numerics]. 
{{webarchive\|url=https://web.archive.org/web/20121009191824/http://developer.apple.com/legacy/mac/library/documentation/Performance/Conceptual/Mac_OSX_Numerics/Mac_OSX_Numerics.pdf\|date=October 9, 2012}}.</ref><ref>[https://opensource.apple.com/source/gcc/gcc-5646/gcc/config/rs6000/darwin-ldouble.c 128-bit long double support routines for Darwin] {{Webarchive\|url=https://web.archive.org/web/20171107030443/https://opensource.apple.com/source/gcc/gcc-5646/gcc/config/rs6000/darwin-ldouble.c \|date=2017-11-07 }}.</ref>) and [[SPARC]],<ref>[https://gcc.gnu.org/onlinedocs/gcc/SPARC-Options.html SPARC Options], ''Using the GNU Compiler Collection''.</ref> or the [[Sun Studio (software)\|Sun Studio compilers]] on SPARC.<ref>[http://docs.oracle.com/cd/E19422-01/819-3693/ncg_lib.html The Math Libraries], Sun Studio 11 ''Numerical Computation Guide'' (2005).</ref> Even if <code>long double</code> is not quadruple precision, however, some C/C++ compilers provide a nonstandard quadruple-precision type as an extension. 
For example, gcc provides a quadruple-precision type called <code>__float128</code> for x86, x86-64 and [[Itanium]] CPUs,<ref>[https://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html Additional Floating Types], ''Using the GNU Compiler Collection''</ref> and on [[PowerPC]] as IEEE 128-bit floating-point using the <code>-mfloat128-hardware</code> or <code>-mfloat128</code> options;<ref name=gcc6changes>{{cite web\|title=GCC 6 Release Series - Changes, New Features, and Fixes\|url=https://gcc.gnu.org/gcc-6/changes.html\|access-date=2016-09-13}}</ref> and some versions of Intel's C/C++ compiler for x86 and x86-64 supply a nonstandard quadruple-precision type called <code>_Quad</code>.<ref>[http://software.intel.com/en-us/forums/showthread.php?t=56359 Intel C++ Forums] (2007).</ref>

[[Zig (programming language)\|Zig]] supports quadruple precision with its <code>f128</code> type.<ref>{{cite web \|title=Floats \|url=https://ziglang.org/documentation/master/#Floats \|website=ziglang.org \|access-date=7 January 2024}}</ref>

Google's work-in-progress language [[Carbon (programming language)\|Carbon]] supports quadruple precision with a type called <code>f128</code>.<ref>{{cite web \|url=https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/README.md#floating-point-types \|title=Carbon Language's main repository - Language design \|date=2022-08-09 \|website=GitHub \|access-date=2022-09-22}}</ref>

As of 2024, [[Rust (programming language)\|Rust]] is working on adding a new <code>f128</code> type for IEEE quadruple-precision 128-bit floats.<ref>{{cite web \|last1=Cross \|first1=Travis \|title=Tracking Issue for f16 and f128 float types \|url=https://github.com/rust-lang/rust/issues/116909 \|website=GitHub \|access-date=2024-07-05}}</ref>

=== Libraries and toolboxes ===
* The [[GNU Compiler Collection\|GCC]] quad-precision math library, [https://gcc.gnu.org/onlinedocs/libquadmath libquadmath], provides <code>__float128</code> and <code>__complex128</code> operations.
* The [[Boost (C++ libraries)\|Boost]] multiprecision library Boost.Multiprecision provides a unified cross-platform C++ interface for the <code>__float128</code> and <code>_Quad</code> types, and includes a custom implementation of the standard math library.<ref>{{cite web \|title=Boost.Multiprecision – float128 \|url=http://www.boost.org/doc/libs/1_58_0/libs/multiprecision/doc/html/boost_multiprecision/tut/floats/float128.html \|access-date=2015-06-22}}</ref>

Line 134 ⟶ 146:
=== Hardware support ===
IEEE quadruple precision was added to the [[IBM System/390]] G5 in 1998,<ref>{{cite journal \|last1=Schwarz \|first1=E. M. \|last2=Krygowski \|first2=C. A. \|date=September 1999 \|title=The S/390 G5 floating-point unit \|journal=IBM Journal of Research and Development \|volume=43 \|issue=5/6 \|pages=707–721 \|doi=10.1147/rd.435.0707 \|citeseerx=10.1.1.117.6711 }}</ref> and is supported in hardware in subsequent [[z/Architecture]] processors.<ref>{{cite news \|author=Gerwig \|first1=G. \|last2=Wetter \|first2=H. \|last3=Schwarz \|first3=E. M. \|last4=Haess \|first4=J. \|last5=Krygowski \|first5=C. A. \|last6=Fleischer \|first6=B. M. \|last7=Kroener \|first7=M. \|date=May 2004 \|title=The IBM eServer z990 floating-point unit. IBM J. Res. Dev. 48 \|pages=311–322}}</ref><ref>{{cite web \|author=Schwarz \|first=Eric \|date=June 22, 2015 \|title=The IBM z13 SIMD Accelerators for Integer, String, and Floating-Point \|url=http://arith22.gforge.inria.fr/slides/s1-schwarz.pdf \|access-date=July 13, 2015 \|archive-date=July 13, 2015 \|archive-url=https://web.archive.org/web/20150713231116/http://arith22.gforge.inria.fr/slides/s1-schwarz.pdf \|url-status=dead }}</ref> The IBM [[POWER9]] CPU ([[Power ISA#Power ISA v.3.0\|Power ISA 3.0]]) has native 128-bit hardware support.<ref name=gcc6changes/> Native support of IEEE 128-bit floats is defined in [[PA-RISC]] 1.0,<ref>{{cite web \|url=http://grouper.ieee.org/groups//754/email/msg04128.html \|title=Implementor support for the binary interchange formats \|website=[[IEEE]] \|archive-url=https://web.archive.org/web/20171027202715/https://grouper.ieee.org/groups//754/email/msg04128.html \|archive-date=2017-10-27 \|access-date=2021-07-15}}</ref> and in [[SPARC]] V8<ref>{{cite book

Line 167 ⟶ 179:
Quadruple-precision (128-bit) hardware implementation should not be confused with "128-bit FPUs" that implement [[Single instruction, multiple data\|SIMD]] instructions, such as [[Streaming SIMD Extensions]] or [[AltiVec]]; there, the term refers to 128-bit [[Vector processor\|vectors]] of four 32-bit single-precision or two 64-bit double-precision values that are operated on simultaneously.

~~== Add. info and curiosities ==~~
The IEEE 754 standard allows two different views (decodings) of the numbers. The one described above treats the significand as a fraction and uses an exponent bias of 16383. The other treats the significand as a binary integer, 2<sup>112</sup> times larger, and in turn uses an exponent bias that is 112 larger, 16495; this produces smaller effective exponents and therefore the same final value. The fractional view is common for the binaryxxx datatypes, while the integral view is common for the decimalxxx datatypes. See Section 3.3, "Sets of floating-point data", in the 2019 version of the standard.

== See also ==