Revision as of 20:55, 3 March 2018 edit Quondum (talk \| contribs) Extended confirmed users 38,936 edits →top: ce ← Previous edit		Revision as of 21:14, 3 March 2018 edit undo Quondum (talk \| contribs) Extended confirmed users 38,936 edits →Overview: ce; still needs rewrite Next edit →
Line 3: ==Overview== A representation of a floating point value using binary scaling is more precise than a floating point representation occupying the same number of bits, but cannot represent values beyond the range that it represents, thus more easily leading to [[arithmetic overflow]] during computation. Implementation of operations using integer arithmetic instructions is often (but not always) faster than the corresponding floating point instructions. ~~Binary scaling is both faster and more accurate than directly using floating point instructions; however, care must be taken not to cause an [[arithmetic overflow]].~~ A position for the ~~virtual~~ 'binary point' is ~~taken~~chosen for each variable to be represented, and ~~then~~binary shifts associated ~~subsequent~~with arithmetic operations ~~determine~~are ~~the~~adjusted ~~'binary point'~~accordingly. ~~Binary points obey the mathematical laws of [[exponentiation]].~~ To give an example, a common way to use [[Arbitrary-precision arithmetic\|integer arithmetic]] to simulate floating point, using 32 bit numbers, is to multiply the coefficients by 65536. Using [[binary scientific notation]], this will place the binary point at B16. That is to say, ~~there~~the ~~are~~most significant 16 ~~binary~~bits ~~integer~~represent ~~bits~~the ~~and~~integer part the remainder are represent the fractional part. This means, as a signed two's complement integer B16 number can hold a highest value of <math> \approx 32767.9999847 </math> and a lowest value of ~~<math> -32768~~−32768.0~~</math>~~ . Put another way, the B number, is the number of integer bits used to represent the number which defines its value range. Remaining low bits (i.e. the non-integer bits) are used to store fractional quantities and supply more accuracy. For instance, to represent 1.2 and 5.6 ~~floating point real numbers~~ as B16 one multiplies them by 2<sup>16</sup>, giving 78643 and 367001. Multiplying these together gives Line 21 ⟶ 19: To convert it back to B16, divide it by 2<sup>16</sup>. This gives 440400B16, which when converted back to a floating point number (by dividing again by 2<sup>16</sup>, but holding the result as floating point) gives 6.71999. The correct floating point result is 6.72. ~~The correct floating point result is 6.72.~~ ==Re-scaling after multiplication==

Binary scaling: Difference between revisions