Single-precision floating-point format: Difference between revisions

{{short description|32-bit computer number format}}
{{Cleanup|reason=<br/>{{*}} This article doesn't provide a good structure to lead users from easy to deeper understanding,<br/>{{*}} Some points are 'explained' by lengthy examples instead of a concise description of the concept; will try to improve, pls. avoid silly 'no-no reverts', instead let's argue about it on the talk page.|date=January 2025}}
 
'''Single-precision floating-point format''' (sometimes called '''FP32''' or '''float32''') is a [[computer number format]], usually occupying [[32 bits]] in [[computer memory]]; it represents a wide [[dynamic range]] of numeric values by using a [[floating point|floating radix point]].
One of the first [[programming language]]s to provide single- and double-precision floating-point data types was [[Fortran]]. Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the [[computer manufacturer]] and computer model, and upon decisions made by programming-language designers. E.g., [[GW-BASIC]]'s single-precision data type was the [[32-bit MBF]] floating-point format.
 
Single precision is termed ''REAL(4)'' or ''REAL*4'' in [[Fortran]];<ref>{{cite web|url=http://scc.ustc.edu.cn/zlsc/sugon/intel/compiler_f/main_for/lref_for/source_files/rfreals.htm|title=REAL Statement|website=scc.ustc.edu.cn|access-date=2013-02-28|archive-date=2021-02-24|archive-url=https://web.archive.org/web/20210224045812/http://scc.ustc.edu.cn/zlsc/sugon/intel/compiler_f/main_for/lref_for/source_files/rfreals.htm|url-status=dead}}</ref> ''SINGLE-FLOAT'' in [[Common Lisp]];<ref>{{Cite web|url=https://www.lispworks.com/documentation/HyperSpec/Body/t_short_.htm|title=CLHS: Type SHORT-FLOAT, SINGLE-FLOAT, DOUBLE-FLOAT...|website=www.lispworks.com}}</ref> ''float binary(p)'' with p&le;21, ''float decimal(p)'' with the maximum value of p depending on whether the DFP (IEEE 754 DFP) attribute applies, in PL/I; ''float'' in [[C (programming language)|C]] with IEEE 754 support, [[C++]] (if it is in C), [[C Sharp (programming language)|C#]] and [[Java (programming language)|Java]];<ref>{{cite web|url=https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html|title=Primitive Data Types|website=Java Documentation}}</ref> ''Float'' in [[Haskell (programming language)|Haskell]]<ref>{{cite web|url=https://www.haskell.org/onlinereport/haskell2010/haskellch6.html#x13-1350006.4|title=6 Predefined Types and Classes|date=20 July 2010|website=haskell.org}}</ref> and [[Swift (programming language)|Swift]];<ref>{{cite web|url=https://developer.apple.com/documentation/swift/float|title=Float|website=Apple Developer Documentation}}</ref> and ''Single'' in [[Object Pascal]] ([[Delphi (programming language)|Delphi]]), [[Visual Basic]], and [[MATLAB]]. However, ''float'' in [[Python (programming language)|Python]], [[Ruby (programming language)|Ruby]], [[PHP]], and [[OCaml]] and ''single'' in versions of [[GNU Octave|Octave]] before 3.2 refer to [[double-precision floating-point format|double-precision]] numbers. 
In most implementations of [[PostScript]], and some [[embedded systems]], the only supported precision is single.
{{Floating-point}}
 
== IEEE 754 standard: binary32<span class="anchor" id="IEEE 754 single-precision binary floating-point format: binary32"></span> ==
 
The IEEE 754 standard specifies a ''binary32'' as having:
* [[Sign bit]]: 1 bit
[[Image:Float example.svg]]
 
The real value assumed by a given 32-bit ''binary32'' data with a given ''sign'', biased exponent ''E'' (the 8-bit unsigned integer), and a ''23-bit fraction'' is
: <math>(-1)^{b_{31}} \times 2^{(b_{30}b_{29} \dots b_{23})_2 - 127} \times (1.b_{22}b_{21} \dots b_0)_2</math>,
which yields
The stored exponents 00<sub>H</sub> and FF<sub>H</sub> are interpreted specially.
 
{|class="wikitable" style="text-align: center;"
|-
! Exponent !! fraction = 0 !! fraction ≠ 0 !! Equation
|-
| 00<sub>H</sub> = 00000000<sub>2</sub>|| [[signed zero|±zero]] || [[subnormal number]] || <math>(-1)^\text{sign}\times2^{-126}\times0.\text{fraction}</math>
| 01<sub>H</sub>, ..., FE<sub>H</sub> = 00000001<sub>2</sub>, ..., 11111110<sub>2</sub>||colspan=2| normal value || <math>(-1)^\text{sign}\times2^{\text{exponent}-127}\times1.\text{fraction}</math>
|-
| FF<sub>H</sub> = 11111111<sub>2</sub>|| ±[[infinity]] || [[NaN]] (quiet, signaling) ||
|}
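The three cases in the table can be sketched in Python (an illustrative decoder, not part of the standard; the function name is invented here):

```python
import struct

def decode_binary32(bits):
    """Decode a raw 32-bit integer as an IEEE 754 binary32 value."""
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF    # 8-bit biased exponent
    fraction = bits & 0x7FFFFF        # 23-bit fraction field
    if exponent == 0xFF:              # FF_H: infinity or NaN
        return float('nan') if fraction else (-1.0) ** sign * float('inf')
    if exponent == 0x00:              # 00_H: signed zero or subnormal
        return (-1.0) ** sign * 2.0 ** -126 * (fraction / 2.0 ** 23)
    # 01_H ... FE_H: normal value with an implicit leading 1
    return (-1.0) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2.0 ** 23)

# Cross-check a few patterns against the platform's native decoding:
assert decode_binary32(0x3F800000) == 1.0
assert decode_binary32(0xC0000000) == -2.0
assert decode_binary32(0x00000001) == 2.0 ** -149   # smallest subnormal
assert decode_binary32(0x40490FDB) == struct.unpack('>f', b'\x40\x49\x0f\xdb')[0]
```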
 
Each of the 24 bits of the significand (including the implicit 24th bit), bit 23 to bit 0, represents a value starting at 1 and halving for each bit, as follows:
 
<pre>
bit 23 = 1
bit 22 = 0.5
bit 21 = 0.25
bit 20 = 0.125
bit 19 = 0.0625
bit 18 = 0.03125
bit 17 = 0.015625
.
.
.
bit 6 = 0.00000762939453125
bit 5 = 0.000003814697265625
bit 4 = 0.0000019073486328125
bit 3 = 0.00000095367431640625
bit 2 = 0.000000476837158203125
bit 1 = 0.0000002384185791015625
bit 0 = 0.00000011920928955078125
</pre>
 
The significand in this example has three bits set: bit 23, bit 22, and bit 19. We can now decode the significand by adding the values represented by these bits.
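The sum for this example can be spelled out directly; significand bit ''n'' contributes 2<sup>''n''−23</sup>:

```python
# Significand bit n (bit 23 being the implicit leading 1) contributes 2**(n - 23).
bits_set = [23, 22, 19]
significand = sum(2.0 ** (n - 23) for n in bits_set)
assert significand == 1.5625   # 1 + 0.5 + 0.0625
```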
* Decimals between 4 and 8: fixed interval 2<sup>−21</sup>
* ...
* Decimals between 2<sup>n</sup> and 2<sup>n+1</sup>: fixed interval 2<sup>n−23</sup>
* ...
* Decimals between 2<sup>22</sup>=4194304 and 2<sup>23</sup>=8388608: fixed interval 2<sup>−1</sup>=0.5
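The fixed spacing can be demonstrated by incrementing the stored bit pattern (a sketch using Python's <code>struct</code> module; the helper name is invented here):

```python
import struct

def next_float32(x):
    """Return the next representable binary32 value above a positive x."""
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    (nxt,) = struct.unpack('<f', struct.pack('<I', bits + 1))
    return nxt

# Between 1 and 2, consecutive values are 2**-23 apart:
assert next_float32(1.0) - 1.0 == 2.0 ** -23
# Between 2**22 = 4194304 and 2**23, consecutive values are 0.5 apart:
assert next_float32(4194304.0) - 4194304.0 == 0.5
```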
* Integers between 2<sup>25</sup> and 2<sup>26</sup> round to a multiple of 4
* ...
* Integers between 2<sup>n</sup> and 2<sup>n+1</sup> round to a multiple of 2<sup>n−23</sup>
* ...
* Integers between 2<sup>127</sup> and 2<sup>128</sup> round to a multiple of 2<sup>104</sup>
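This rounding of large integers can be observed by round-tripping through a binary32 (a sketch; Python's <code>struct</code> performs the narrowing conversion):

```python
import struct

def to_float32(x):
    """Round x to the nearest binary32 value (round half to even)."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

assert to_float32(16777216) == 16777216.0   # 2**24 is exactly representable
assert to_float32(16777217) == 16777216.0   # 2**24 + 1 rounds to 2**24
assert to_float32(33554435) == 33554436.0   # between 2**25 and 2**26,
                                            # integers round to a multiple of 4
```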
These examples are given in bit ''representation'', in [[hexadecimal]] and [[Binary number|binary]], of the floating-point value. This includes the sign, (biased) exponent, and significand.
 
{| style="font-family: monospace, monospace;"
|
0 00000000 00000000000000000000001<sub>2</sub> = 0000 0001<sub>16</sub> = 2<sup>−126</sup> × 2<sup>−23</sup> = 2<sup>−149</sup> ≈ 1.4012984643 × 10<sup>−45</sup><br />
{{spaces|38}}(smallest positive subnormal number)
 
0 00000000 11111111111111111111111<sub>2</sub> = 007f ffff<sub>16</sub> = 2<sup>−126</sup> × (1 − 2<sup>−23</sup>) ≈ 1.1754942107 × 10<sup>−38</sup><br />
{{spaces|38}}(largest subnormal number)
 
0 00000001 00000000000000000000000<sub>2</sub> = 0080 0000<sub>16</sub> = 2<sup>−126</sup> ≈ 1.1754943508 × 10<sup>−38</sup><br />
{{spaces|38}}(smallest positive normal number)
 
0 11111110 11111111111111111111111<sub>2</sub> = 7f7f ffff<sub>16</sub> = 2<sup>127</sup> × (2 − 2<sup>−23</sup>) ≈ 3.4028234664 × 10<sup>38</sup><br />
{{spaces|38}}(largest normal number)
 
0 01111110 11111111111111111111111<sub>2</sub> = 3f7f ffff<sub>16</sub> = 1 − 2<sup>−24</sup> ≈ 0.999999940395355225<br />
{{spaces|38}}(largest number less than one)
 
0 01111111 00000000000000000000000<sub>2</sub> = 3f80 0000<sub>16</sub> = 1 (one)
 
0 01111111 00000000000000000000001<sub>2</sub> = 3f80 0001<sub>16</sub> = 1 + 2<sup>−23</sup> ≈ 1.00000011920928955<br />
{{spaces|38}}(smallest number larger than one)
 
1 10000000 00000000000000000000000<sub>2</sub> = c000 0000<sub>16</sub> = −2<br />
0 00000000 00000000000000000000000<sub>2</sub> = 0000 0000<sub>16</sub> = 0<br />
1 00000000 00000000000000000000000<sub>2</sub> = 8000 0000<sub>16</sub> = −0
 
0 11111111 00000000000000000000000<sub>2</sub> = 7f80 0000<sub>16</sub> = infinity<br />
1 11111111 00000000000000000000000<sub>2</sub> = ff80 0000<sub>16</sub> = −infinity
 
0 10000000 10010010000111111011011<sub>2</sub> = 4049 0fdb<sub>16</sub> ≈ 3.14159274101257324 ≈ π ( pi )<br />
0 01111101 01010101010101010101011<sub>2</sub> = 3eaa aaab<sub>16</sub> ≈ 0.333333343267440796 ≈ 1/3

x 11111111 10000000000000000000001<sub>2</sub> = ffc0 0001<sub>16</sub> = qNaN (on x86 and ARM processors)<br />
x 11111111 00000000000000000000001<sub>2</sub> = ff80 0001<sub>16</sub> = sNaN (on x86 and ARM processors)
|}
 
By default, 1/3 rounds up, instead of down like [[double-precision floating-point format|double-precision]], because of the even number of bits in the significand. The bits of 1/3 beyond the rounding point are <code>1010...</code>, which is more than 1/2 of a [[unit in the last place]].
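The round-up can be verified by inspecting the stored bits (a sketch assuming Python floats are IEEE 754 doubles, which holds on common platforms):

```python
import struct

# Narrow the double-precision 1/3 to binary32 and look at the raw bits.
(bits,) = struct.unpack('<I', struct.pack('<f', 1 / 3))
assert bits == 0x3EAAAAAB   # the pattern listed above for 1/3

# The stored value is slightly *greater* than the double-precision 1/3,
# i.e. the conversion rounded up.
assert struct.unpack('<f', struct.pack('<I', bits))[0] > 1 / 3
```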
 
Encodings of qNaN and sNaN are not specified in [[IEEE floating point|IEEE 754]] and are implemented differently on different processors. The [[x86]] family and the [[ARM architecture family|ARM]] family processors use the most significant bit of the significand field to indicate a [[NaN#Quiet_NaN|quiet NaN]]. The [[PA-RISC]] processors use the bit to indicate a [[NaN#Signaling_NaN|signaling NaN]].
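The quiet-NaN convention can be observed from a high-level language, though the result is platform-dependent (a sketch; CPython on x86 and ARM produces the pattern below):

```python
import struct

# On x86 and ARM the default NaN is quiet: the most significant
# fraction bit is set. (Platform-dependent; PA-RISC differs.)
(bits,) = struct.unpack('<I', struct.pack('<f', float('nan')))
assert bits & 0x7F800000 == 0x7F800000   # exponent field is all ones
assert bits & 0x00400000 != 0            # quiet bit set on x86/ARM
```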
 
=== Optimizations ===