Quadruple-precision floating-point format

These examples are given in bit ''representation'', in [[hexadecimal]], of the floating-point value. This includes the sign, (biased) exponent, and significand.
{| style="font-family: monospace, monospace;"
|-
|
0000 0000 0000 0000 0000 0000 0000 0001<sub>16</sub> = 2<sup>−16382</sup> × 2<sup>−112</sup> = 2<sup>−16494</sup><br />
{{spaces|42}}≈ 6.4751751194380251109244389582276465525 × 10<sup>−4966</sup><br />
{{spaces|42}}(smallest positive subnormal number)
 
0000 ffff ffff ffff ffff ffff ffff ffff<sub>16</sub> = 2<sup>−16382</sup> × (1 − 2<sup>−112</sup>)<br />
{{spaces|42}}≈ 3.3621031431120935062626778173217519551 × 10<sup>−4932</sup><br />
{{spaces|42}}(largest subnormal number)
 
0001 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = 2<sup>−16382</sup><br />
{{spaces|42}}≈ 3.3621031431120935062626778173217526026 × 10<sup>−4932</sup><br />
{{spaces|42}}(smallest positive normal number)
 
7ffe ffff ffff ffff ffff ffff ffff ffff<sub>16</sub> = 2<sup>16383</sup> × (2 − 2<sup>−112</sup>)<br />
{{spaces|42}}≈ 1.1897314953572317650857593266280070162 × 10<sup>4932</sup><br />
{{spaces|42}}(largest normal number)
 
3ffe ffff ffff ffff ffff ffff ffff ffff<sub>16</sub> = 1 − 2<sup>−113</sup><br />
{{spaces|42}}≈ 0.9999999999999999999999999999999999037<br />
{{spaces|42}}(largest number less than one)
 
3fff 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = 1 (one)
 
3fff 0000 0000 0000 0000 0000 0000 0001<sub>16</sub> = 1 + 2<sup>−112</sup><br />
{{spaces|42}}≈ 1.0000000000000000000000000000000001926<br />
{{spaces|42}}(smallest number larger than one)
 
4000 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = 2<br />
c000 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = −2
 
0000 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = 0<br />
8000 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = −0
 
7fff 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = infinity<br />
ffff 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = −infinity
 
4000 921f b544 42d1 8469 898c c517 01b8<sub>16</sub> ≈ 3.1415926535897932384626433832795027975<br />
{{spaces|42}}(closest approximation to π)
 
3ffd 5555 5555 5555 5555 5555 5555 5555<sub>16</sub> ≈ 0.3333333333333333333333333333333333173<br />
{{spaces|42}}(closest approximation to 1/3)
 
4008 74d9 9564 5aa0 0c11 d0cc 9770 5e5b<sub>16</sub> ≈ 745.69987158227021999999999999999997147<br />
{{spaces|42}}(closest approximation to the number of<br />
{{spaces|42}}watts corresponding to 1 [[horsepower]])
|}
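The bit patterns in the table above can be checked mechanically. The following Python sketch (the function name <code>decode_binary128</code> is illustrative, not a standard API) decodes a binary128 hex pattern into an exact rational value using the field widths described in this article: 1 sign bit, a 15-bit exponent with bias 16383, and a 112-bit significand field.

```python
from fractions import Fraction

def decode_binary128(hex_str):
    """Decode a binary128 bit pattern (32 hex digits) into an exact Fraction."""
    bits = int(hex_str.replace(" ", ""), 16)
    sign = bits >> 127
    exponent = (bits >> 112) & 0x7FFF
    significand = bits & ((1 << 112) - 1)
    if exponent == 0:          # subnormal: implicit leading 0, exponent -16382
        value = Fraction(significand, 1 << 112) * Fraction(2) ** -16382
    elif exponent == 0x7FFF:   # all-ones exponent encodes infinity or NaN
        raise ValueError("infinity or NaN")
    else:                      # normal: implicit leading 1, biased exponent
        value = (1 + Fraction(significand, 1 << 112)) * Fraction(2) ** (exponent - 16383)
    return -value if sign else value

# A few entries from the table above:
assert decode_binary128("3fff" + "0" * 28) == 1                        # one
assert decode_binary128("3fff" + "0" * 27 + "1") == 1 + Fraction(1, 2**112)
assert decode_binary128("c000" + "0" * 28) == -2
assert decode_binary128("0" * 31 + "1") == Fraction(1, 2**16494)       # smallest subnormal
```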
 
 
== Double-double arithmetic ==
A common software technique to implement nearly quadruple precision using ''pairs'' of [[double-precision]] values is sometimes called '''double-double arithmetic'''.<ref name=Hida>Yozo Hida, X. Li, and D. H. Bailey, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5769 Quad-Double Arithmetic: Algorithms, Implementation, and Application], Lawrence Berkeley National Laboratory Technical Report LBNL-46996 (2000). Also Y. Hida et al., [https://web.mit.edu/tabbott/Public/quaddouble-debian/qd-2.3.4-old/docs/qd.pdf Library for double-double and quad-double arithmetic] (2007).</ref><ref name="Shewchuk">J. R. Shewchuk, [https://www.cs.cmu.edu/~quake/robust.html Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates], [[Discrete & Computational Geometry]] 18: 305–363, 1997.</ref><ref name="Knuth-4.2.3-pr9">{{cite book |last=Knuth |first=D. E. |title=The Art of Computer Programming |edition=2nd |at=chapter 4.2.3. problem 9. }}</ref> Using pairs of IEEE double-precision values with 53-bit significands, double-double arithmetic provides operations on numbers with significands of at least<ref name=Hida/> {{nowrap|1=2 × 53 = 106 bits}} (actually 107 bits<ref>Robert Munafo. [https://mrob.com/pub/math/f161.html F107 and F161 High-Precision Floating-Point Data Types] (2011).</ref> except for some of the largest values, due to the limited exponent range), only slightly less precise than the 113-bit significand of IEEE binary128 quadruple precision. The range of a double-double remains essentially the same as the double-precision format because the exponent still has 11 bits,<ref name=Hida /> significantly lower than the 15-bit exponent of IEEE quadruple precision (a range of {{nowrap|1.8 × 10<sup>308</sup>}} for double-double versus {{nowrap|1.2 × 10<sup>4932</sup>}} for binary128).
 
In particular, a double-double/quadruple-precision value ''q'' in the double-double technique is represented implicitly as a sum {{nowrap|1=''q'' = ''x'' + ''y''}} of two double-precision values ''x'' and ''y'', each of which supplies half of ''q''<nowiki/>'s significand.<ref name=Shewchuk/> That is, the pair {{nowrap|(''x'', ''y'')}} is stored in place of ''q'', and operations on ''q'' values {{nowrap|(+, −, ×, ...)}} are transformed into equivalent (but more complicated) operations on the ''x'' and ''y'' values. Thus, arithmetic in this technique reduces to a sequence of double-precision operations; since double-precision arithmetic is commonly implemented in hardware, double-double arithmetic is typically substantially faster than more general [[arbitrary-precision arithmetic]] techniques.<ref name=Hida/><ref name=Shewchuk/>
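The transformation of operations on ''q'' into operations on the pair {{nowrap|(''x'', ''y'')}} can be sketched in Python, whose floats are IEEE doubles. This is a simplified illustration built on Knuth's error-free two-sum; the function names are illustrative, and production double-double libraries perform more careful renormalization:

```python
def two_sum(a, b):
    """Knuth's error-free transformation: returns (s, e) such that
    s = fl(a + b) and a + b = s + e exactly, for IEEE doubles a, b."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def dd_add(x, y, a, b):
    """Simplified double-double addition: (x, y) + (a, b) -> (hi, lo)."""
    s, e = two_sum(x, a)
    e += y + b               # fold in the low-order components
    return two_sum(s, e)     # renormalize so |lo| is tiny relative to hi

# 1 + 2**-60 rounds to 1.0 as a single double (ulp(1) = 2**-52),
# but the double-double pair retains the low-order part exactly:
hi, lo = dd_add(1.0, 0.0, 2.0**-60, 0.0)
assert hi == 1.0 and lo == 2.0**-60
```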
* For the same reason, it is possible to represent values such as {{nowrap|1 + 2<sup>−1074</sup>}}, the smallest representable number greater than 1.
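This can be illustrated in Python (whose floats are IEEE doubles): a single double rounds {{nowrap|1 + 2<sup>−1074</sup>}} back to 1, while the double-double pair keeps both components exactly.

```python
import math
from fractions import Fraction

# The smallest positive subnormal double is 2**-1074 (ldexp is exact here).
tiny = math.ldexp(1.0, -1074)

# Added to 1.0 in a single double, it is rounded away entirely:
assert 1.0 + tiny == 1.0

# Stored as the double-double pair (1.0, tiny), the exact value survives
# and is strictly greater than 1 (Fraction conversion from float is exact):
x, y = 1.0, tiny
assert Fraction(x) + Fraction(y) == 1 + Fraction(1, 2**1074)
```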
 
In addition to double-double arithmetic, triple-double or quad-double arithmetic can be constructed in the same way when higher precision is required without a general higher-precision floating-point library. A value is represented as a sum of three (or four) double-precision values, giving significands of at least 159/161 and 212/215 bits respectively. A natural extension to an arbitrary number of terms (though still limited by the exponent range) is called ''floating-point expansions''.
 
A similar technique can be used to produce '''double-quad arithmetic''', in which a value is represented as a sum of two quadruple-precision values, giving a significand of at least 226 (or 227) bits.<ref>sourceware.org [https://sourceware.org/legacy-ml/libc-alpha/2012-03/msg01024.html Re: The state of glibc libm]</ref>
 
== Implementations ==
As of [[C++23]], the C++ language defines a <code><stdfloat></code> header that contains fixed-width floating-point types. Implementations of these are optional, but if supported, <code>std::float128_t</code> corresponds to quadruple precision.
 
On x86 and x86-64, the most common C/C++ compilers implement <code>long double</code> as either 80-bit [[extended precision]] (e.g. the [[GNU C Compiler]] gcc<ref>[https://web.archive.org/web/20080713131713/https://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html i386 and x86-64 Options (archived copy on web.archive.org)], ''Using the GNU Compiler Collection''.</ref> and the [[Intel C++ Compiler]] with a <code>/Qlong&#8209;double</code> switch<ref>[http://software.intel.com/en-us/articles/size-of-long-integer-type-on-different-architecture-and-os/ Intel Developer Site].</ref>) or simply as being synonymous with double precision (e.g. [[Microsoft Visual C++]]<ref>[http://msdn.microsoft.com/en-us/library/9cx8xs15.aspx MSDN homepage, about Visual C++ compiler].</ref>), rather than as quadruple precision. The procedure call standard for the [[ARM architecture#AArch64|ARM 64-bit architecture]] (AArch64) specifies that <code>long double</code> corresponds to the IEEE 754 quadruple-precision format.<ref>{{cite web|title=Procedure Call Standard for the ARM 64-bit Architecture (AArch64)|url=http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf|date=2013-05-22|access-date=2019-09-22|archive-url=https://web.archive.org/web/20191016000704/http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf|archive-date=2019-10-16|url-status=dead}}</ref> On a few other architectures, some C/C++ compilers implement <code>long double</code> as quadruple precision, e.g. gcc on [[PowerPC]] (as double-double<ref>[https://gcc.gnu.org/onlinedocs/gcc/RS_002f6000-and-PowerPC-Options.html RS/6000 and PowerPC Options], ''Using the GNU Compiler Collection''.</ref><ref>[https://developer.apple.com/legacy/mac/library/documentation/Performance/Conceptual/Mac_OSX_Numerics/Mac_OSX_Numerics.pdf Inside Macintosh – PowerPC Numerics]. 
{{webarchive|url=https://web.archive.org/web/20121009191824/http://developer.apple.com/legacy/mac/library/documentation/Performance/Conceptual/Mac_OSX_Numerics/Mac_OSX_Numerics.pdf|date=October 9, 2012}}.</ref><ref>[https://opensource.apple.com/source/gcc/gcc-5646/gcc/config/rs6000/darwin-ldouble.c 128-bit long double support routines for Darwin] {{Webarchive|url=https://web.archive.org/web/20171107030443/https://opensource.apple.com/source/gcc/gcc-5646/gcc/config/rs6000/darwin-ldouble.c |date=2017-11-07 }}.</ref>) and [[SPARC]],<ref>[https://gcc.gnu.org/onlinedocs/gcc/SPARC-Options.html SPARC Options], ''Using the GNU Compiler Collection''.</ref> or the [[Sun Studio (software)|Sun Studio compilers]] on SPARC.<ref>[http://docs.oracle.com/cd/E19422-01/819-3693/ncg_lib.html The Math Libraries], Sun Studio 11 ''Numerical Computation Guide'' (2005).</ref> Even if <code>long double</code> is not quadruple precision, however, some C/C++ compilers provide a nonstandard quadruple-precision type as an extension. For example, gcc provides a quadruple-precision type called <code>__float128</code> for x86, x86-64 and [[Itanium]] CPUs,<ref>[https://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html Additional Floating Types], ''Using the GNU Compiler Collection''</ref> and on [[PowerPC]] as IEEE 128-bit floating-point using the -mfloat128-hardware or -mfloat128 options;<ref name=gcc6changes>{{cite web|title=GCC 6 Release Series - Changes, New Features, and Fixes|url=https://gcc.gnu.org/gcc-6/changes.html|access-date=2016-09-13}}</ref> and some versions of Intel's C/C++ compiler for x86 and x86-64 supply a nonstandard quadruple-precision type called <code>_Quad</code>.<ref>[http://software.intel.com/en-us/forums/showthread.php?t=56359 Intel C++ Forums] (2007).</ref>
 
[[Zig (programming language)|Zig]] provides support for it with its <code>f128</code> type.<ref>{{cite web |title=Floats |url=https://ziglang.org/documentation/master/#Floats |website=ziglang.org |access-date=7 January 2024}}</ref>
 
=== Hardware support ===
IEEE quadruple precision was added to the [[IBM System/390]] G5 in 1998,<ref>{{cite journal |last1=Schwarz |first1=E. M. |last2=Krygowski |first2=C. A. |date=September 1999 |title=The S/390 G5 floating-point unit |journal=IBM Journal of Research and Development |volume=43 |issue=5/6 |pages=707–721 |doi=10.1147/rd.435.0707 |citeseerx=10.1.1.117.6711 }}</ref> and is supported in hardware in subsequent [[z/Architecture]] processors.<ref>{{cite news |author=Gerwig |first1=G. |last2=Wetter |first2=H. |last3=Schwarz |first3=E. M. |last4=Haess |first4=J. |last5=Krygowski |first5=C. A. |last6=Fleischer |first6=B. M. |last7=Kroener |first7=M. |date=May 2004 |title=The IBM eServer z990 floating-point unit. IBM J. Res. Dev. 48 |pages=311–322}}</ref><ref>{{cite web |author=Schwarz |first=Eric |date=June 22, 2015 |title=The IBM z13 SIMD Accelerators for Integer, String, and Floating-Point |url=http://arith22.gforge.inria.fr/slides/s1-schwarz.pdf |access-date=July 13, 2015 |archive-date=July 13, 2015 |archive-url=https://web.archive.org/web/20150713231116/http://arith22.gforge.inria.fr/slides/s1-schwarz.pdf |url-status=dead }}</ref> The IBM [[POWER9]] CPU ([[Power ISA#Power ISA v.3.0|Power ISA 3.0]]) has native 128-bit hardware support.<ref name=gcc6changes/>
 
Native support of IEEE 128-bit floats is defined in [[PA-RISC]] 1.0,<ref>{{cite web |url=http://grouper.ieee.org/groups//754/email/msg04128.html |title=Implementor support for the binary interchange formats |website=[[IEEE]] |archive-url=https://web.archive.org/web/20171027202715/https://grouper.ieee.org/groups//754/email/msg04128.html |archive-date=2017-10-27 |access-date=2021-07-15}}</ref> and in [[SPARC]] V8<ref>{{cite book