Quadruple-precision floating-point format: Difference between revisions

Tag: Reverted
Fix Linter errors.
Line 50:
These examples are given in bit ''representation'', in [[hexadecimal]], of the floating-point value. This includes the sign, (biased) exponent, and significand.
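
As an illustration of how the three fields are laid out, the following C sketch splits a 128-bit pattern, held as two 64-bit halves, into its sign bit, 15-bit biased exponent, and 112-bit trailing significand, and then classifies the encoding. The type <code>b128_bits</code> and the <code>b128_*</code> helper functions are hypothetical names chosen for this example only; they are not part of any standard library.

```c
#include <stdint.h>

/* Hypothetical container: a binary128 bit pattern as two 64-bit halves. */
typedef struct {
    uint64_t hi; /* sign (1 bit), biased exponent (15 bits), top 48 significand bits */
    uint64_t lo; /* low 64 bits of the trailing significand */
} b128_bits;

typedef enum {
    B128_ZERO, B128_SUBNORMAL, B128_NORMAL, B128_INFINITY, B128_NAN
} b128_class;

static inline b128_bits b128_from_halves(uint64_t hi, uint64_t lo) {
    b128_bits x = { hi, lo };
    return x;
}

/* Sign bit: 0 for positive, 1 for negative. */
static inline int b128_sign(b128_bits x) {
    return (int)(x.hi >> 63);
}

/* 15-bit exponent field, still carrying the bias of 16383. */
static inline unsigned b128_biased_exponent(b128_bits x) {
    return (unsigned)((x.hi >> 48) & 0x7fffu);
}

/* Upper 48 bits of the 112-bit trailing significand. */
static inline uint64_t b128_fraction_hi(b128_bits x) {
    return x.hi & 0xffffffffffffULL;
}

/* Classify the encoding from the exponent and trailing significand fields. */
static inline b128_class b128_classify(b128_bits x) {
    unsigned e = b128_biased_exponent(x);
    int fraction_is_zero = (b128_fraction_hi(x) == 0 && x.lo == 0);
    if (e == 0)
        return fraction_is_zero ? B128_ZERO : B128_SUBNORMAL;
    if (e == 0x7fffu)
        return fraction_is_zero ? B128_INFINITY : B128_NAN;
    return B128_NORMAL;
}

/* Power of two that scales the significand: e - 16383 for normal numbers,
   fixed at -16382 for zeros and subnormals. Not meaningful for infinity/NaN. */
static inline int b128_unbiased_exponent(b128_bits x) {
    unsigned e = b128_biased_exponent(x);
    return e == 0 ? -16382 : (int)e - 16383;
}
```

For instance, <code>b128_from_halves(0x3fff000000000000, 0)</code> has sign 0 and biased exponent 16383 (3fff<sub>16</sub>), so its unbiased exponent is 0 and the encoded value is 1.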
 
<pre<includeonly />>
0000 0000 0000 0000 0000 0000 0000 0001<sub>16</sub> = 2<sup>−16382</sup> × 2<sup>−112</sup> = 2<sup>−16494</sup>
≈ 6.4751751194380251109244389582276465525 × 10<sup>−4966</sup>
Line 56:
</pre>
 
<pre<includeonly />>
0000 ffff ffff ffff ffff ffff ffff ffff<sub>16</sub> = 2<sup>−16382</sup> × (1 − 2<sup>−112</sup>)
≈ 3.3621031431120935062626778173217519551 × 10<sup>−4932</sup>
Line 62:
</pre>
 
<pre<includeonly />>
0001 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = 2<sup>−16382</sup>
≈ 3.3621031431120935062626778173217526026 × 10<sup>−4932</sup>
Line 68:
</pre>
 
<pre<includeonly />>
7ffe ffff ffff ffff ffff ffff ffff ffff<sub>16</sub> = 2<sup>16383</sup> × (2 − 2<sup>−112</sup>)
≈ 1.1897314953572317650857593266280070162 × 10<sup>4932</sup>
Line 74:
</pre>
 
<pre<includeonly />>
3ffe ffff ffff ffff ffff ffff ffff ffff<sub>16</sub> = 1 − 2<sup>−113</sup>
≈ 0.9999999999999999999999999999999999037
Line 80:
</pre>
 
<pre<includeonly />>
3fff 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = 1 (one)
</pre>
 
<pre<includeonly />>
3fff 0000 0000 0000 0000 0000 0000 0001<sub>16</sub> = 1 + 2<sup>−112</sup>
≈ 1.0000000000000000000000000000000001926
Line 90:
</pre>
 
<pre<includeonly />>
4000 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = 2
c000 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = −2
</pre>
 
<pre<includeonly />>
0000 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = 0
8000 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = −0
</pre>
 
<pre<includeonly />>
7fff 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = infinity
ffff 0000 0000 0000 0000 0000 0000 0000<sub>16</sub> = −infinity
</pre>
 
<pre<includeonly />>
4000 921f b544 42d1 8469 898c c517 01b8<sub>16</sub> ≈ 3.1415926535897932384626433832795027975
(closest approximation to π)
</pre>
 
<pre<includeonly />>
3ffd 5555 5555 5555 5555 5555 5555 5555<sub>16</sub> ≈ 0.3333333333333333333333333333333333173
(closest approximation to 1/3)
Line 142:
For the [[C (programming language)|C programming language]], ISO/IEC TS 18661-3 (floating-point extensions for C, interchange and extended types) specifies <code>_Float128</code> as the type implementing the IEEE 754 quadruple-precision format (binary128).<ref>{{cite web|title=ISO/IEC TS 18661-3|url=https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1945.pdf|date=2015-06-10|access-date=2019-09-22}}</ref> Alternatively, in [[C (programming language)|C]]/[[C++]] on a few systems and compilers, quadruple precision may be specified by the [[long double]] type, but this is not required by the language (which only requires <code>long double</code> to be at least as precise as <code>double</code>), nor is it common.
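
Since <code>_Float128</code> is an optional type, a program may want to test for it before relying on it. The following is a minimal sketch, assuming an implementation that follows the TS 18661-3 convention of defining <code>FLT128_MANT_DIG</code> in <code>&lt;float.h&gt;</code> when <code>__STDC_WANT_IEC_60559_TYPES_EXT__</code> is defined before the header is first included. The function name <code>float128_mant_dig</code> is hypothetical.

```c
/* Request the TS 18661-3 interchange/extended type macros.
   This must appear before the first inclusion of <float.h>. */
#define __STDC_WANT_IEC_60559_TYPES_EXT__ 1
#include <float.h>

/* Returns the significand precision of _Float128 in bits (113 for binary128),
   or 0 if the implementation does not advertise the type through <float.h>. */
static inline int float128_mant_dig(void) {
#ifdef FLT128_MANT_DIG
    return FLT128_MANT_DIG;
#else
    return 0; /* _Float128 not advertised by this implementation */
#endif
}
```

A result of 0 indicates only that the macro is absent; a compiler may still offer a nonstandard quadruple-precision type under another name.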
 
As of [[C++23]], the C++ language defines a <code>&lt;stdfloat&gt;</code> header that contains fixed-width floating-point types. Support for these types is optional, but where they are provided, <code>std::float128_t</code> corresponds to quadruple precision.
 
On x86 and x86-64, the most common C/C++ compilers implement <code>long double</code> as either 80-bit [[extended precision]] (e.g. the [[GNU C Compiler]] gcc<ref>[https://web.archive.org/web/20080713131713/https://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html i386 and x86-64 Options (archived copy on web.archive.org)], ''Using the GNU Compiler Collection''.</ref> and the [[Intel C++ Compiler]] with a <code>/Qlong&#8209;double</code> switch<ref>[http://software.intel.com/en-us/articles/size-of-long-integer-type-on-different-architecture-and-os/ Intel Developer Site].</ref>) or simply as being synonymous with double precision (e.g. [[Microsoft Visual C++]]<ref>[http://msdn.microsoft.com/en-us/library/9cx8xs15.aspx MSDN homepage, about Visual C++ compiler].</ref>), rather than as quadruple precision. The procedure call standard for the [[ARM architecture#AArch64|ARM 64-bit architecture]] (AArch64) specifies that <code>long double</code> corresponds to the IEEE 754 quadruple-precision format.<ref>{{cite web|title=Procedure Call Standard for the ARM 64-bit Architecture (AArch64)|url=http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf|date=2013-05-22|access-date=2019-09-22|archive-url=https://web.archive.org/web/20191016000704/http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf|archive-date=2019-10-16|url-status=dead}}</ref> On a few other architectures, some C/C++ compilers implement <code>long double</code> as quadruple precision, e.g. gcc on [[PowerPC]] (as double-double<ref>[https://gcc.gnu.org/onlinedocs/gcc/RS_002f6000-and-PowerPC-Options.html RS/6000 and PowerPC Options], ''Using the GNU Compiler Collection''.</ref><ref>[https://developer.apple.com/legacy/mac/library/documentation/Performance/Conceptual/Mac_OSX_Numerics/Mac_OSX_Numerics.pdf Inside Macintosh – PowerPC Numerics]. 
{{webarchive|url=https://web.archive.org/web/20121009191824/http://developer.apple.com/legacy/mac/library/documentation/Performance/Conceptual/Mac_OSX_Numerics/Mac_OSX_Numerics.pdf|date=October 9, 2012}}.</ref><ref>[https://opensource.apple.com/source/gcc/gcc-5646/gcc/config/rs6000/darwin-ldouble.c 128-bit long double support routines for Darwin].</ref>) and [[SPARC]],<ref>[https://gcc.gnu.org/onlinedocs/gcc/SPARC-Options.html SPARC Options], ''Using the GNU Compiler Collection''.</ref> or the [[Sun Studio (software)|Sun Studio compilers]] on SPARC.<ref>[http://docs.oracle.com/cd/E19422-01/819-3693/ncg_lib.html The Math Libraries], Sun Studio 11 ''Numerical Computation Guide'' (2005).</ref> Even if <code>long double</code> is not quadruple precision, however, some C/C++ compilers provide a nonstandard quadruple-precision type as an extension. For example, gcc provides a quadruple-precision type called <code>__float128</code> for x86, x86-64 and [[Itanium]] CPUs,<ref>[https://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html Additional Floating Types], ''Using the GNU Compiler Collection''</ref> and on [[PowerPC]] as IEEE 128-bit floating-point using the -mfloat128-hardware or -mfloat128 options;<ref name=gcc6changes>{{cite web|title=GCC 6 Release Series - Changes, New Features, and Fixes|url=https://gcc.gnu.org/gcc-6/changes.html|access-date=2016-09-13}}</ref> and some versions of Intel's C/C++ compiler for x86 and x86-64 supply a nonstandard quadruple-precision type called <code>_Quad</code>.<ref>[http://software.intel.com/en-us/forums/showthread.php?t=56359 Intel C++ Forums] (2007).</ref>
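
Because the format behind <code>long double</code> differs between platforms in this way, portable code sometimes inspects the limits in <code>&lt;float.h&gt;</code> rather than assuming a format. The following C sketch distinguishes the cases mentioned above by the pair (<code>LDBL_MANT_DIG</code>, <code>LDBL_MAX_EXP</code>): 53 and 1024 for double precision, 64 and 16384 for 80-bit extended precision, 106 and 1024 for double-double, and 113 and 16384 for quadruple precision. The names <code>ld_format</code>, <code>classify_long_double</code>, and <code>long_double_format</code> are hypothetical, and the sketch assumes a radix-2 implementation.

```c
#include <float.h>

typedef enum {
    LD_BINARY64,      /* same format as double */
    LD_X87_EXTENDED,  /* 80-bit extended precision */
    LD_DOUBLE_DOUBLE, /* pair of doubles, not an IEEE 754 format */
    LD_BINARY128,     /* IEEE 754 quadruple precision */
    LD_OTHER          /* anything not recognized here */
} ld_format;

/* Map a (significand bits, maximum exponent) pair to a known format. */
static inline ld_format classify_long_double(int mant_dig, int max_exp) {
    if (mant_dig == 53 && max_exp == 1024)
        return LD_BINARY64;
    if (mant_dig == 64 && max_exp == 16384)
        return LD_X87_EXTENDED;
    if (mant_dig == 106 && max_exp == 1024)
        return LD_DOUBLE_DOUBLE;
    if (mant_dig == 113 && max_exp == 16384)
        return LD_BINARY128;
    return LD_OTHER;
}

/* Format of long double on the platform this code is compiled for. */
static inline ld_format long_double_format(void) {
    return classify_long_double(LDBL_MANT_DIG, LDBL_MAX_EXP);
}
```

Under these assumptions, <code>long_double_format()</code> would be expected to yield <code>LD_BINARY128</code> on an AArch64 system that follows the procedure call standard cited above, and <code>LD_X87_EXTENDED</code> with gcc on x86-64 in its default configuration.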