Octuple-precision floating-point format: Difference between revisions

No edit summary
Line 15:
<!-- "significand", with a d at the end, is a technical term, please do not confuse with "significant" -->
 
This gives 71 to 73 significant decimal digits of precision (if a decimal string with at most 71 significant decimal digits is converted to IEEE 754 octuple precision and then converted back to the same number of significant decimal digits, then the final string should match the original; and if an IEEE 754 octuple-precision value is converted to a decimal string with at least 73 significant decimal digits and then converted back to octuple precision, then the final number must match the original<ref name=whyieee>{{cite web|url=http://www.cs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF|title=Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic| author=William Kahan |date=1 October 1987}}</ref>).
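The two digit counts in such a round-trip statement depend only on the significand precision ''p'' (in bits, including the implicit bit): about floor((''p'' − 1)·log<sub>10</sub>2) digits survive a decimal → binary → decimal round trip, and ceil(1 + ''p''·log<sub>10</sub>2) digits suffice for binary → decimal → binary. The following is a minimal sketch of that calculation, using exact integer arithmetic; the function name <code>round_trip_digits</code> is illustrative only and not taken from any library.

```python
def round_trip_digits(p: int) -> tuple[int, int]:
    """Return (guaranteed, sufficient) decimal digit counts for a p-bit significand."""
    # floor((p - 1) * log10(2)): 2**(p - 1) is never a power of ten for p > 1,
    # so its floor(log10) is simply its decimal length minus one.
    guaranteed = len(str(2 ** (p - 1))) - 1
    # ceil(1 + p * log10(2)): the ceiling of a non-integer log10 equals the
    # decimal length of 2**p, and one more digit is added on top.
    sufficient = len(str(2 ** p)) + 1
    return guaranteed, sufficient


if __name__ == "__main__":
    for name, p in [("binary64", 53), ("binary128", 113), ("binary256", 237)]:
        print(name, round_trip_digits(p))
```

For comparison, the sketch yields 15 and 17 digits for double precision (''p'' = 53), 33 and 36 for quadruple precision (''p'' = 113), and 71 and 73 for octuple precision (''p'' = 237).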
 
The format is written with an implicit lead bit with value 1 unless the exponent is stored with all zeros. Thus only 236 bits of the [[significand]] appear in the memory format, but the total precision is 237 bits (approximately 71 decimal digits, <math>\log_{10}(2^{237}) \approx 71.344</math>). The bits are laid out as follows:
Line 47:
The maximum representable value is 2<sup>262144</sup> − 2<sup>261907</sup> ≈ 1.6113 × 10<sup>78913</sup>.
 
=== Octuple-precision examples ===
 
These examples are given in bit ''representation'', in [[hexadecimal]],
Line 60:
By default, 1/3 rounds down like [[double precision]], because of the odd number of bits in the significand.
So the bits beyond the rounding point are <code>0101...</code> which is less than 1/2 of a [[unit in the last place]].
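The parity argument can be checked with exact rational arithmetic. The sketch below is an illustration, not part of any standard library: it scales 1/3 so that a ''p''-bit significand becomes an integer and compares the nearest integer with the exact value. The function name <code>one_third_rounds_down</code> is hypothetical.

```python
from fractions import Fraction


def one_third_rounds_down(p: int) -> bool:
    """True if 1/3 rounded to nearest with a p-bit significand is below 1/3."""
    # 1/3 lies in [1/4, 1/2), so multiplying by 2**(p + 1) puts the p
    # significand bits in the integer part of the scaled value.
    scaled = Fraction(2 ** (p + 1), 3)
    # round() on a Fraction rounds half to even; a tie cannot occur here
    # because the fractional part is always 1/3 or 2/3.
    nearest = round(scaled)
    return nearest < scaled


if __name__ == "__main__":
    for p in (24, 53, 113, 237):
        print(p, "rounds down" if one_third_rounds_down(p) else "rounds up")
```

For odd ''p'' (53, 113, 237) the discarded fraction is 1/3 of a unit in the last place, so the value rounds down; for even ''p'' (such as the 24 bits of single precision) it is 2/3, so the value rounds up.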
 
== Double-double arithmetic ==
 
A common software technique to implement nearly quadruple precision using ''pairs'' of [[double-precision]] values is sometimes called '''double-double arithmetic'''.<ref name=Hida>Yozo Hida, X. Li, and D. H. Bailey, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5769 Quad-Double Arithmetic: Algorithms, Implementation, and Application], Lawrence Berkeley National Laboratory Technical Report LBNL-46996 (2000). Also Y. Hida et al., [http://web.mit.edu/tabbott/Public/quaddouble-debian/qd-2.3.4-old/docs/qd.pdf Library for double-double and quad-double arithmetic] (2007).</ref><ref name=Shewchuk>J. R. Shewchuk, [http://www.cs.cmu.edu/~quake/robust.html Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates], Discrete & Computational Geometry 18:305-363, 1997.</ref><ref name="Knuth-4.2.3-pr9">{{cite book |last=Knuth |first=D. E. |title=The Art of Computer Programming |edition=2nd |at=chapter 4.2.3. problem 9. }}</ref> Using pairs of IEEE double-precision values with 53-bit significands, double-double arithmetic can represent operations with at least<ref name=Hida/> a 2&times;53=106-bit significand (actually 107 bits<ref>Robert Munafo [http://mrob.com/pub/math/f161.html F107 and F161 High-Precision Floating-Point Data Types] (2011).</ref> except for some of the largest values, due to the limited exponent range), only slightly less precise than the 113-bit significand of IEEE binary128 quadruple precision. The range of a double-double remains essentially the same as the double-precision format because the exponent still has 11 bits,<ref name=Hida /> significantly fewer than the 15-bit exponent of IEEE quadruple precision (a range of <math>1.8\times10^{308}</math> for double-double versus <math>1.2\times10^{4932}</math> for binary128).
 
In particular, a double-double/quadruple-precision value ''q'' in the double-double technique is represented implicitly as a sum ''q''=''x''+''y'' of two double-precision values ''x'' and ''y'', each of which supplies half of ''q'''s significand.<ref name=Shewchuk/> That is, the pair (''x'',''y'') is stored in place of ''q'', and operations on ''q'' values (+,&minus;,&times;,...) are transformed into equivalent (but more complicated) operations on the ''x'' and ''y'' values. Thus, arithmetic in this technique reduces to a sequence of double-precision operations; since double-precision arithmetic is commonly implemented in hardware, double-double arithmetic is typically substantially faster than more general [[arbitrary-precision arithmetic]] techniques.<ref name=Hida/><ref name=Shewchuk/>
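How an operation on ''q'' reduces to double-precision operations on ''x'' and ''y'' can be sketched with the classic error-free addition step (often called TwoSum, credited to Knuth) followed by a simplified double-double addition. This is an illustrative sketch under stated assumptions, not the implementation of any particular library: the names <code>two_sum</code> and <code>dd_add</code> are hypothetical, Python's <code>float</code> is assumed to be an IEEE double, and overflow and special values are ignored.

```python
from fractions import Fraction


def two_sum(a: float, b: float) -> tuple[float, float]:
    """Return (s, err) with s = fl(a + b) and s + err == a + b exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def dd_add(x: tuple[float, float], y: tuple[float, float]) -> tuple[float, float]:
    """Add two double-double values given as (high, low) pairs (simplified variant)."""
    s, e = two_sum(x[0], y[0])   # exact sum of the high parts
    e += x[1] + y[1]             # fold in the low parts (rounded)
    return two_sum(s, e)         # renormalize into a (high, low) pair


if __name__ == "__main__":
    tiny = 2.0 ** -80
    print(1.0 + tiny == 1.0)               # the plain double sum loses the small term
    print(dd_add((1.0, 0.0), (tiny, 0.0)))  # the pair keeps it in the low part
    s, err = two_sum(0.1, 0.2)
    print(Fraction(s) + Fraction(err) == Fraction(0.1) + Fraction(0.2))
```

Every step is an ordinary double-precision addition or subtraction, which is why the technique runs at hardware speed.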
 
Note that double-double arithmetic has the following special characteristics:<ref>[http://pic.dhe.ibm.com/infocenter/aix/v7r1/index.jsp?topic=%2Fcom.ibm.aix.genprogc%2Fdoc%2Fgenprogc%2F128bit_long_double_floating-point_datatype.htm 128-Bit Long Double Floating-Point Data Type]</ref>
 
* As the magnitude of the value decreases, the amount of extra precision also decreases. Therefore, the range over which full precision is available is narrower than the normalized range of double precision. The smallest number with full precision is 1000...0<sub>2</sub> (106 zeros) × 2<sup>−1074</sup>, or 1.000...0<sub>2</sub> (106 zeros) × 2<sup>−968</sup>. Numbers whose magnitude is smaller than 2<sup>−1021</sup> will not have additional precision compared with double precision.
* The actual number of bits of precision can vary. In general, the magnitude of the low-order part of the number is no greater than half a ULP of the high-order part. If the low-order part is less than half a ULP of the high-order part, significant bits (either all 0's or all 1's) are implied between the significands of the high-order and low-order numbers. Certain algorithms that rely on having a fixed number of bits in the significand can fail when using 128-bit long double numbers.
* For the reason above, it is possible to represent values like 1 + 2<sup>−1074</sup>, which is the smallest representable number greater than 1.
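The last two points can be illustrated with exact rational arithmetic. The sketch below assumes Python's <code>float</code> is an IEEE double and treats a double-double value simply as a (high, low) pair; the variable names are illustrative only.

```python
from fractions import Fraction

# High part 1.0, low part the smallest positive subnormal double, 2**-1074.
hi, lo = 1.0, 2.0 ** -1074

# In plain double precision the low part is lost entirely.
print(hi + lo == 1.0)

# As a pair, the exact value represented is 1 + 2**-1074.
exact = Fraction(hi) + Fraction(lo)
print(exact == 1 + Fraction(1, 2 ** 1074))

# Writing that value as a single binary significand would take 1075 bits,
# far more than the nominal 106: the bits between the two parts are implied zeros.
print(exact.numerator.bit_length())
```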
 
In addition to the double-double arithmetic, it is also possible to generate triple-double or quad-double arithmetic if higher precision is required without any higher precision floating-point library. They are represented as a sum of three (or four) double-precision values respectively. They can represent operations with at least 159/161 and 212/215 bits respectively.
 
A similar technique can be used to produce '''double-quad arithmetic''', in which a value is represented as a sum of two quadruple-precision values. It can represent operations with at least 226 (or 227) bits.<ref>sourceware.org [http://sourceware.org/ml/libc-alpha/2012-03/msg01024.html Re: The state of glibc libm]</ref>
 
==Implementations==
Octuple precision is rarely, if ever, implemented in software, since it is very seldom used. One can use general [[arbitrary-precision arithmetic]] libraries to obtain octuple (or higher) precision, but specialized octuple-precision implementations may achieve higher performance.
 
Quadruple precision is almost always implemented in software by a variety of techniques (such as the double-double technique above, although that technique does not implement IEEE quadruple precision), since direct hardware support for quadruple precision is extremely rare. One can use general [[arbitrary-precision arithmetic]] libraries to obtain quadruple (or higher) precision, but specialized quadruple-precision implementations may achieve higher performance.
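As a rough illustration of the arbitrary-precision route, the sketch below uses Python's exact <code>fractions.Fraction</code> type and rounds a rational value to a chosen number of significand bits (237 for octuple precision, 113 for quadruple). It is a minimal sketch under stated assumptions, not a conforming IEEE 754 implementation: the function name <code>round_to_bits</code> is hypothetical, only round-to-nearest-even is modelled, and the exponent range, subnormals, infinities and NaNs are ignored.

```python
from fractions import Fraction


def round_to_bits(x: Fraction, p: int) -> Fraction:
    """Round x to the nearest value with a p-bit significand (ties to even)."""
    if x == 0:
        return Fraction(0)
    sign = -1 if x < 0 else 1
    x = abs(x)
    # Estimate e so that x / 2**e lies in [2**(p - 1), 2**(p + 1)).
    e = x.numerator.bit_length() - x.denominator.bit_length() - p
    scaled = x / Fraction(2) ** e
    if scaled >= 2 ** p:          # at most one correction step is needed
        e += 1
        scaled /= 2
    significand = round(scaled)   # Fraction rounding is half-to-even
    return sign * significand * Fraction(2) ** e


if __name__ == "__main__":
    third = Fraction(1, 3)
    print(round_to_bits(third, 53) == Fraction(1 / 3))   # matches hardware double
    octuple_third = round_to_bits(third, 237)
    print(float(abs(octuple_third - third)))             # error is roughly 1e-72 or smaller
```

A specialized implementation would store the 237-bit significand in a fixed number of machine words instead of using unbounded rationals, which is where the performance difference mentioned above comes from.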
 
===Computer-language support===
In C++, it is possible to write a library that handles octuple-precision floating-point arithmetic. In theory, the arithmetic can also be carried out directly on the binary representation, but doing so by hand would be extremely laborious.
A separate question is the extent to which quadruple-precision types are directly incorporated into computer [[programming language]]s.
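One of the first things any such library has to do is interpret the 256-bit pattern. The sketch below decodes a bit pattern, given as a Python integer, into an exact rational value. It assumes the IEEE 754-2008 binary256 interchange layout (1 sign bit, 19 exponent bits with bias 262143, and 236 stored significand bits); the function name <code>decode_binary256</code> is hypothetical, and infinities and NaNs are rejected instead of being modelled.

```python
from fractions import Fraction

EXP_BITS, FRAC_BITS = 19, 236        # assumed binary256 field widths
BIAS = 2 ** (EXP_BITS - 1) - 1       # 262143


def decode_binary256(bits: int) -> Fraction:
    """Decode a 256-bit pattern into its exact rational value."""
    sign = -1 if bits >> (EXP_BITS + FRAC_BITS) else 1
    exponent = (bits >> FRAC_BITS) & (2 ** EXP_BITS - 1)
    fraction = bits & (2 ** FRAC_BITS - 1)
    if exponent == 2 ** EXP_BITS - 1:
        raise ValueError("infinity or NaN has no rational value")
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1, fixed minimum exponent.
        return sign * fraction * Fraction(2) ** (1 - BIAS - FRAC_BITS)
    # Normal number: restore the implicit leading 1 bit.
    return sign * (2 ** FRAC_BITS + fraction) * Fraction(2) ** (exponent - BIAS - FRAC_BITS)


if __name__ == "__main__":
    print(decode_binary256(0x3FFFF << 236))   # top 20 bits 0x3FFFF: the value 1
    print(decode_binary256(0xC0000 << 236))   # top 20 bits 0xC0000: the value -2
```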
 
Quadruple precision is specified in [[Fortran]] by <code>real(real128)</code> (the module <code>iso_fortran_env</code> from Fortran 2008 must be used; the constant <code>real128</code> is equal to 16 on most processors), or as <code>real(selected_real_kind(33, 4931))</code>, or in a non-standard way as <code>REAL*16</code>. (Quadruple-precision <code>REAL*16</code> is supported by the [[Intel Fortran Compiler]]<ref>{{cite web|title= Intel Fortran Compiler Product Brief |url=http://h21007.www2.hp.com/portal/download/files/unprot/intel/product_brief_Fortran_Linux.pdf|work=|publisher=Su|date=|accessdate=2010-01-23}}</ref> and by the [[GNU Fortran]] compiler<ref>{{cite web|title= GCC 4.6 Release Series - Changes, New Features, and Fixes |url=http://gcc.gnu.org/gcc-4.6/changes.html|work=|publisher=|date=|accessdate=2010-02-06}}</ref> on [[x86]], [[x86-64]], and [[Itanium]] architectures, for example.)
 
In [[C (programming language)|C]]/[[C++]], on a few systems and compilers, quadruple precision may be specified by the [[long double]] type, but this is not required by the language (which only requires <code>long double</code> to be at least as precise as <code>double</code>), nor is it common. On x86 and x86-64, the most common C/C++ compilers implement <code>long double</code> as either 80-bit [[extended precision]] (e.g. the [[GNU C Compiler]] gcc<ref>[http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html i386 and x86-64 Options], ''Using the GNU Compiler Collection''.</ref> and the [[Intel C++ compiler]] with a <code>/Qlong&#8209;double</code> switch<ref>[http://software.intel.com/en-us/articles/size-of-long-integer-type-on-different-architecture-and-os/ Intel Developer Site]</ref>) or simply as being synonymous with double precision (e.g. [[Microsoft Visual C++]]<ref>[http://msdn.microsoft.com/en-us/library/9cx8xs15.aspx MSDN homepage, about Visual C++ compiler]</ref>), rather than as quadruple precision. On a few other architectures, some C/C++ compilers implement <code>long double</code> as quadruple precision, e.g.
gcc on [[PowerPC]] (as double-double<ref>[http://gcc.gnu.org/onlinedocs/gcc/RS_002f6000-and-PowerPC-Options.html RS/6000 and PowerPC Options], ''Using the GNU Compiler Collection''.</ref><ref>[http://developer.apple.com/legacy/mac/library/documentation/Performance/Conceptual/Mac_OSX_Numerics/Mac_OSX_Numerics.pdf Inside Macintosh - PowerPC Numerics]</ref><ref>[http://www.opensource.apple.com/source/gcc/gcc-5646/gcc/config/rs6000/darwin-ldouble.c 128-bit long double support routines for Darwin]</ref>) and [[SPARC]],<ref>[http://gcc.gnu.org/onlinedocs/gcc/SPARC-Options.html SPARC Options], ''Using the GNU Compiler Collection''.</ref> or the [[Sun Studio (software)|Sun Studio compilers]] on SPARC.<ref>[http://download.oracle.com/docs/cd/E19422-01/819-3693/ncg_lib.html The Math Libraries], Sun Studio 11 ''Numerical Computation Guide'' (2005).</ref> Even if <code>long double</code> is not quadruple precision, however, some C/C++ compilers provide a nonstandard quadruple-precision type as an extension. For example, gcc provides a quadruple-precision type called <code>__float128</code> for x86, x86-64 and [[Itanium]] CPUs,<ref>[http://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html Additional Floating Types], ''Using the GNU Compiler Collection''</ref> and some versions of Intel's C/C++ compiler for x86 and x86-64 supply a nonstandard quadruple-precision type called <code>_Quad</code>.<ref>[http://software.intel.com/en-us/forums/showthread.php?t=56359 Intel C++ Forums] (2007).</ref>
 
=== Hardware support ===
There is little to no hardware support for octuple precision arithmetic.
 
== See also ==
* [[IEEE 754-2008|IEEE Standard for Floating-Point Arithmetic (IEEE 754)]]