Binary scaling: Difference between revisions

Content deleted Content added
top: ce
Overview: ce; still needs rewrite
Line 3:
 
==Overview==
A representation of a floating point value using binary scaling is more precise than a floating point representation occupying the same number of bits, but cannot represent values beyond the range that it represents, thus more easily leading to [[arithmetic overflow]] during computation. Implementation of operations using integer arithmetic instructions is often (but not always) faster than the corresponding floating point instructions.
Binary scaling is both faster and more accurate than directly using floating point instructions; however, care must be taken not to cause an [[arithmetic overflow]].
 
A position for the virtual 'binary point' is takenchosen for each variable to be represented, and thenbinary shifts associated subsequentwith arithmetic operations determineare theadjusted 'binary point'accordingly.
 
Binary points obey the mathematical laws of [[exponentiation]].
 
To give an example, a common way to use [[Arbitrary-precision arithmetic|integer arithmetic]] to simulate floating point, using 32 bit numbers, is to multiply the coefficients by 65536.
 
Using [[binary scientific notation]], this will place the binary point at B16. That is to say, therethe aremost significant 16 binarybits integerrepresent bitsthe andinteger part the remainder are represent the fractional part. This means, as a signed two's complement integer B16 number can hold a highest value of <math> \approx 32767.9999847 </math> and a lowest value of <math> -32768−32768.0</math> . Put another way, the B number, is the number of integer bits used to represent the number which defines its value range. Remaining low bits (i.e. the non-integer bits) are used to store fractional quantities and supply more accuracy.
 
For instance, to represent 1.2 and 5.6 floating point real numbers as B16 one multiplies them by 2<sup>16</sup>, giving 78643 and 367001.
 
Multiplying these together gives
Line 21 ⟶ 19:
To convert it back to B16, divide it by 2<sup>16</sup>.
 
This gives 440400B16, which when converted back to a floating point number (by dividing again by 2<sup>16</sup>, but holding the result as floating point) gives 6.71999. The correct floating point result is 6.72.
The correct floating point result is 6.72.
 
==Re-scaling after multiplication==