{{short description|32-bit computer number format}}
{{Cleanup|reason=<br/>{{*}} This article doesn't provide a good structure to lead readers from an easy to a deeper understanding,<br/>{{*}} Some points are 'explained' by lengthy examples instead of a concise description of the concept; please discuss improvements on the talk page instead of reverting.|date=January 2025}}
 
'''Single-precision floating-point format''' (sometimes called '''FP32''' or '''float32''') is a [[computer number format]], usually occupying [[32 bits]] in [[computer memory]]; it represents a wide [[dynamic range]] of numeric values by using a [[floating point|floating radix point]].

One of the first [[programming language]]s to provide single- and double-precision floating-point data types was [[Fortran]]. Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the [[computer manufacturer]] and computer model, and upon decisions made by programming-language designers. For example, [[GW-BASIC]]'s single-precision data type was the [[32-bit MBF]] floating-point format.
 
Single precision is termed ''REAL(4)'' or ''REAL*4'' in [[Fortran]];<ref>{{cite web|url=http://scc.ustc.edu.cn/zlsc/sugon/intel/compiler_f/main_for/lref_for/source_files/rfreals.htm|title=REAL Statement|website=scc.ustc.edu.cn|access-date=2013-02-28|archive-date=2021-02-24|archive-url=https://web.archive.org/web/20210224045812/http://scc.ustc.edu.cn/zlsc/sugon/intel/compiler_f/main_for/lref_for/source_files/rfreals.htm|url-status=dead}}</ref> ''SINGLE-FLOAT'' in [[Common Lisp]];<ref>{{Cite web|url=https://www.lispworks.com/documentation/HyperSpec/Body/t_short_.htm|title=CLHS: Type SHORT-FLOAT, SINGLE-FLOAT, DOUBLE-FLOAT...|website=www.lispworks.com}}</ref> ''float binary(p)'' with p&le;21, ''float decimal(p)'' with the maximum value of p depending on whether the DFP (IEEE 754 DFP) attribute applies, in PL/I; ''float'' in [[C (programming language)|C]] with IEEE 754 support, [[C++]] (if it is in C), [[C Sharp (programming language)|C#]] and [[Java (programming language)|Java]];<ref>{{cite web|url=https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html|title=Primitive Data Types|website=Java Documentation}}</ref> ''Float'' in [[Haskell (programming language)|Haskell]]<ref>{{cite web|url=https://www.haskell.org/onlinereport/haskell2010/haskellch6.html#x13-1350006.4|title=6 Predefined Types and Classes|date=20 July 2010|website=haskell.org}}</ref> and [[Swift (programming language)|Swift]];<ref>{{cite web|url=https://developer.apple.com/documentation/swift/float|title=Float|website=Apple Developer Documentation}}</ref> and ''Single'' in [[Object Pascal]] ([[Delphi (programming language)|Delphi]]), [[Visual Basic]], and [[MATLAB]]. However, ''float'' in [[Python (programming language)|Python]], [[Ruby (programming language)|Ruby]], [[PHP]], and [[OCaml]] and ''single'' in versions of [[GNU Octave|Octave]] before 3.2 refer to [[double-precision floating-point format|double-precision]] numbers. In most implementations of [[PostScript]], and some [[embedded systems]], the only supported precision is single.
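
For instance, in C (with IEEE 754 support, as noted above) the single-precision type is <code>float</code>. The following minimal sketch, which assumes an IEEE 754 platform, shows its size and precision via the standard <code>&lt;float.h&gt;</code> constants:

<syntaxhighlight lang="c">
#include <float.h>
#include <stdio.h>

/* In C with IEEE 754 support, "float" is the single-precision (binary32) type. */
int main(void)
{
    float one_third = 1.0f / 3.0f;    /* the "f" suffix keeps the literal and the division in single precision */

    printf("%zu\n", sizeof(float));   /* 4 bytes = 32 bits on such platforms  */
    printf("%d\n", FLT_MANT_DIG);     /* 24 bits of significand precision     */
    printf("%.9g\n", one_third);      /* roughly 7 significant decimal digits */
    return 0;
}
</syntaxhighlight>
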
{{Floating-point}}
 
== IEEE 754 standard: binary32<span class="anchor" id="IEEE 754 single-precision binary floating-point format: binary32"></span> ==
 
The IEEE 754 standard specifies a ''binary32'' as having:
* [[Sign bit]]: 1 bit
* [[Exponent]] width: 8 bits
* [[Significand]] precision: 24 bits (23 explicitly stored)
[[Image:Float example.svg]]
 
The real value assumed by a given 32-bit ''binary32'' data with a given ''sign'', biased exponent ''E'' (the 8-bit unsigned integer), and a ''23-bit fraction'' is
: <math>(-1)^{b_{31}} \times 2^{(b_{30}b_{29} \dots b_{23})_2 - 127} \times (1.b_{22}b_{21} \dots b_0)_2</math>,
which yields
: <math>\text{value} = (-1)^\text{sign} \times 2^{E-127} \times \left(1 + \sum_{i=1}^{23} b_{23-i} 2^{-i} \right).</math>
The stored exponents 00<sub>H</sub> and FF<sub>H</sub> are interpreted specially.
 
{|class="wikitable" style="text-align: center;"
|-
! Exponent !! fraction = 0 !! fraction ≠ 0 !! Equation
|-
| 00<sub>H</sub> = 00000000<sub>2</sub> || [[zero]], [[signed zero|−0]] || [[subnormal number]]s || <math>(-1)^\text{sign}\times2^{-126}\times 0.\text{fraction}</math>
|-
| 01<sub>H</sub>, ..., FE<sub>H</sub> = 00000001<sub>2</sub>, ..., 11111110<sub>2</sub>||colspan=2| normal value || <math>(-1)^\text{sign}\times2^{\text{exponent}-127}\times1.\text{fraction}</math>
|-
| FF<sub>H</sub> = 11111111<sub>2</sub>|| ±[[infinity]] || [[NaN]] (quiet, signaling) ||
|}
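
As an illustration of the formula and the table above, the following C sketch (illustrative only, not part of the standard; the function name is arbitrary) decodes a raw binary32 bit pattern, treating the stored exponents 00<sub>H</sub> and FF<sub>H</sub> specially:

<syntaxhighlight lang="c">
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Decode a raw binary32 bit pattern into its real value, following
   (-1)^sign * 2^(E-127) * 1.fraction plus the special cases for the
   stored exponents 00H and FFH. */
static double decode_binary32(uint32_t bits)
{
    int      sign     = (bits >> 31) & 0x1;       /* bit 31      */
    int      exponent = (bits >> 23) & 0xFF;      /* bits 30..23 */
    uint32_t fraction =  bits        & 0x7FFFFFu; /* bits 22..0  */
    double   s        = sign ? -1.0 : 1.0;

    if (exponent == 0x00)                  /* zeros and subnormals: 2^-126 * 0.fraction */
        return s * ldexp((double)fraction, -149);
    if (exponent == 0xFF)                  /* infinities (fraction = 0) and NaNs (fraction != 0) */
        return fraction != 0 ? NAN : s * INFINITY;
    return s * ldexp(1.0 + fraction / 8388608.0, exponent - 127);  /* normal numbers */
}

int main(void)
{
    printf("%.9g\n", decode_binary32(0x40490FDBu));  /* ~3.14159274 (see the examples below)    */
    printf("%g\n",   decode_binary32(0x00000001u));  /* 2^-149, the smallest positive subnormal */
    return 0;
}
</syntaxhighlight>
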
 
Each of the 24 bits of the significand (including the implicit 24th bit), bit 23 to bit 0, represents a value, starting at 1 and halving for each subsequent bit, as follows:
 
<pre>
bit 23 = 1
bit 22 = 0.5
bit 21 = 0.25
bit 20 = 0.125
bit 19 = 0.0625
bit 18 = 0.03125
bit 17 = 0.015625
.
.
.
bit 6 = 0.00000762939453125
bit 5 = 0.000003814697265625
bit 4 = 0.0000019073486328125
bit 3 = 0.00000095367431640625
bit 2 = 0.000000476837158203125
bit 1 = 0.0000002384185791015625
bit 0 = 0.00000011920928955078125
</pre>
 
The significand in this example has three bits set: bit 23, bit 22, and bit 19. We can now decode the significand by adding the values represented by these bits.
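
A short C sketch of that summation (the loop and names are purely illustrative) over the three set bits:

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>

/* Add up the values contributed by the set significand bits:
   bit i contributes 2^(i-23), so bits 23, 22 and 19 give 1 + 0.5 + 0.0625. */
int main(void)
{
    uint32_t significand = (1u << 23) | (1u << 22) | (1u << 19);
    double   value = 0.0;

    for (int i = 23; i >= 0; i--)
        if (significand & (1u << i))
            value += 1.0 / (double)(1u << (23 - i));   /* 2^(i-23) */

    printf("%f\n", value);   /* prints 1.562500 */
    return 0;
}
</syntaxhighlight>
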
* Decimals between 4 and 8: fixed interval 2<sup>−21</sup>
* ...
* Decimals between 2<sup>n</sup> and 2<sup>n+1</sup>: fixed interval 2<sup>n−23</sup>
* ...
* Decimals between 2<sup>22</sup>=4194304 and 2<sup>23</sup>=8388608: fixed interval 2<sup>−1</sup>=0.5
Similarly, because the spacing exceeds 1 above 2<sup>24</sup>, not every integer beyond that point can be represented exactly:
* Integers between 2<sup>24</sup> and 2<sup>25</sup> round to a multiple of 2
* Integers between 2<sup>25</sup> and 2<sup>26</sup> round to a multiple of 4
* ...
* Integers between 2<sup>n</sup> and 2<sup>n+1</sup> round to a multiple of 2<sup>n−23</sup> (see the sketch after this list)
* ...
* Integers between 2<sup>127</sup> and 2<sup>128</sup> round to a multiple of 2<sup>104</sup>
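
The spacings and rounding effects in the two lists above can be checked with a few lines of C (using the standard <code>nextafterf</code> function; this sketch assumes <code>float</code> is IEEE 754 binary32):

<syntaxhighlight lang="c">
#include <math.h>
#include <stdio.h>

/* Spacing between consecutive binary32 values, and rounding of large integers. */
int main(void)
{
    /* interval in [4, 8) is 2^-21 */
    printf("%g\n", (double)(nextafterf(4.0f, 8.0f) - 4.0f));
    /* interval in [2^22, 2^23) is 0.5 */
    printf("%g\n", (double)(nextafterf(4194304.0f, 8388608.0f) - 4194304.0f));
    /* 2^24 + 1 = 16777217 rounds to a multiple of 2 */
    printf("%.1f\n", (double)(float)16777217);
    /* 2^25 + 1 = 33554433 rounds to a multiple of 4 */
    printf("%.1f\n", (double)(float)33554433);
    return 0;
}
</syntaxhighlight>
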
These examples are given in bit ''representation'', in [[hexadecimal]] and [[Binary number|binary]], of the floating-point value. This includes the sign, (biased) exponent, and significand.
 
{| style="font-family: monospace, monospace;"
0 00000000 00000000000000000000001<sub>2</sub> = 0000 0001<sub>16</sub> = 2<sup>−126</sup> × 2<sup>−23</sup> = 2<sup>−149</sup> ≈ 1.4012984643 × 10<sup>−45</sup>
|-
(smallest positive subnormal number)
|
0 00000000 00000000000000000000001<sub>2</sub> = 0000 0001<sub>16</sub> = 2<sup>−126</sup> × 2<sup>−23</sup> = 2<sup>−149</sup> ≈ 1.4012984643 × 10<sup>−45</sup><br />
{{spaces|38}}(smallest positive subnormal number)
 
0 00000000 11111111111111111111111<sub>2</sub> = 007f ffff<sub>16</sub> = 2<sup>−126</sup> × (1 − 2<sup>−23</sup>) ≈ 1.1754942107 ×10<sup>−38</sup><br />
{{spaces|38}}(largest subnormal number)
 
0 00000001 00000000000000000000000<sub>2</sub> = 0080 0000<sub>16</sub> = 2<sup>−126</sup> ≈ 1.1754943508 × 10<sup>−38</sup><br />
{{spaces|38}}(smallest positive normal number)
 
0 11111110 11111111111111111111111<sub>2</sub> = 7f7f ffff<sub>16</sub> = 2<sup>127</sup> × (2 − 2<sup>−23</sup>) ≈ 3.4028234664 × 10<sup>38</sup><br />
{{spaces|38}}(largest normal number)
 
0 01111110 11111111111111111111111<sub>2</sub> = 3f7f ffff<sub>16</sub> = 1 − 2<sup>−24</sup> ≈ 0.999999940395355225<br />
{{spaces|38}}(largest number less than one)
 
0 01111111 00000000000000000000000<sub>2</sub> = 3f80 0000<sub>16</sub> = 1 (one)
 
0 01111111 00000000000000000000001<sub>2</sub> = 3f80 0001<sub>16</sub> = 1 + 2<sup>−23</sup> ≈ 1.00000011920928955<br />
{{spaces|38}}(smallest number larger than one)
 
1 10000000 00000000000000000000000<sub>2</sub> = c000 0000<sub>16</sub> = −2<br />
0 00000000 00000000000000000000000<sub>2</sub> = 0000 0000<sub>16</sub> = 0<br />
1 00000000 00000000000000000000000<sub>2</sub> = 8000 0000<sub>16</sub> = −0
0 11111111 00000000000000000000000<sub>2</sub> = 7f80 0000<sub>16</sub> = infinity
1 11111111 00000000000000000000000<sub>2</sub> = ff80 0000<sub>16</sub> = −infinity
0 10000000 10010010000111111011011<sub>2</sub> = 4049 0fdb<sub>16</sub> ≈ 3.14159274101257324 ≈ π ( pi )
0 01111101 01010101010101010101011<sub>2</sub> = 3eaa aaab<sub>16</sub> ≈ 0.333333343267440796 ≈ 1/3
x 11111111 10000000000000000000001<sub>2</sub> = ffc0 0001<sub>16</sub> = qNaN (on x86 and ARM processors)
x 11111111 00000000000000000000001<sub>2</sub> = ff80 0001<sub>16</sub> = sNaN (on x86 and ARM processors)
 
0 11111111 00000000000000000000000<sub>2</sub> = 7f80 0000<sub>16</sub> = infinity<br />
By default, 1/3 rounds up, instead of down like [[double precision]], because of the even number of bits in the significand. The bits of 1/3 beyond the rounding point are <code>1010...</code> which is more than 1/2 of a [[unit in the last place]].
1 11111111 00000000000000000000000<sub>2</sub> = ff80 0000<sub>16</sub> = −infinity
 
0 10000000 10010010000111111011011<sub>2</sub> = 4049 0fdb<sub>16</sub> ≈ 3.14159274101257324 ≈ π ( pi )<br />
Encodings of qNaN and sNaN are not specified in [[IEEE floating point|IEEE 754]] and implemented differently on different processors. The [[x86]] family and the [[ARM architecture|ARM]] family processors use the most significant bit of the significand field to indicate a [[NaN#Quiet_NaN|quiet NaN]]. The [[PA-RISC]] processors use the bit to indicate a [[NaN#Signaling_NaN|signaling NaN]].
0 01111101 01010101010101010101011<sub>2</sub> = 3eaa aaab<sub>16</sub> ≈ 0.333333343267440796 ≈ 1/3
x 11111111 10000000000000000000001<sub>2</sub> = ffc0 0001<sub>16</sub> = qNaN (on x86 and ARM processors)<br />
x 11111111 00000000000000000000001<sub>2</sub> = ff80 0001<sub>16</sub> = sNaN (on x86 and ARM processors)
|}
 
By default, 1/3 rounds up, instead of down like [[Double-precision floating-point format|double precision]], because of the even number of bits in the significand. The bits of 1/3 beyond the rounding point are <code>1010...</code> which is more than 1/2 of a [[unit in the last place]].
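
A two-line C check makes the difference visible (assuming IEEE 754 <code>float</code> and <code>double</code>):

<syntaxhighlight lang="c">
#include <stdio.h>

/* 1/3 rounds up in single precision and down in double precision. */
int main(void)
{
    float  f = 1.0f / 3.0f;   /* 0x3eaaaaab (see the table above), slightly above 1/3 */
    double d = 1.0  / 3.0;    /* slightly below 1/3                                   */
    printf("%.9f\n%.17f\n", f, d);   /* 0.333333343 vs 0.33333333333333331 */
    return 0;
}
</syntaxhighlight>
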
 
Encodings of qNaN and sNaN are not specified in [[IEEE floating point|IEEE 754]] and are implemented differently on different processors. The [[x86]] and [[ARM architecture family|ARM]] processor families use the most significant bit of the significand field to indicate a [[NaN#Quiet_NaN|quiet NaN]]. The [[PA-RISC]] processors use that bit to indicate a [[NaN#Signaling_NaN|signaling NaN]].
 
=== Optimizations ===
The design of the floating-point format allows various optimizations, resulting from the easy generation of a [[base-2 logarithm]] approximation from an integer view of the raw bit pattern. Integer arithmetic and bit-shifting can yield an approximation to the [[reciprocal square root]] ([[fast inverse square root]]), commonly required in [[computer graphics]].
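
A minimal C sketch of this technique, modeled on the widely documented fast inverse square root routine (the function name is illustrative; <code>memcpy</code> is used instead of pointer casts to keep the bit reinterpretation well defined):

<syntaxhighlight lang="c">
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) from the integer view of the binary32 bit pattern.
   The shifted integer acts as a crude base-2 logarithm; the constant
   0x5f3759df is the one popularized by Quake III Arena, and one
   Newton-Raphson step refines the estimate to within a fraction of a percent. */
static float fast_rsqrt(float x)
{
    uint32_t i;
    float    y = x;

    memcpy(&i, &y, sizeof i);    /* reinterpret the float's bits as an integer */
    i = 0x5f3759df - (i >> 1);   /* halve and negate the approximate log2      */
    memcpy(&y, &i, sizeof y);    /* back to float: first approximation         */

    return y * (1.5f - 0.5f * x * y * y);   /* one Newton-Raphson refinement */
}
</syntaxhighlight>
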
 
== Additional information and curiosities ==
The IEEE 754 standard allows two equivalent ways of decoding a number: the one described above, which treats the significand as a fraction (1.''fraction'') with an exponent bias of 127, and an alternative that treats the significand as a binary integer, 2<sup>23</sup> times larger, with a correspondingly larger exponent bias of 150. The smaller effective exponent exactly compensates for the larger significand, so both decodings yield the same value. The fractional view is the usual description of the binary formats, while the integer view is used for the decimal formats.
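
In formula form, the two decodings of the same stored fields (sign ''s'', biased exponent ''E'', and 23-bit fraction ''f'') agree because the factor 2<sup>23</sup> absorbed into the integer significand is exactly cancelled by the larger bias:
: <math>(-1)^s \times 2^{E-127} \times (1.f)_2 \;=\; (-1)^s \times 2^{E-150} \times \left((1.f)_2 \times 2^{23}\right).</math>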
 
== See also ==