:::::::Thanks. Actually, I believe that "d) Provide direct support for: execution-time diagnosis of anomalies" is referring to this use of directed rounding to diagnose numerical instability. Certainly Kahan makes it clear that he considered it a key usage from the early design of the x87. I agree that its use for interval arithmetic was also considered from the beginning. [[User:Brianbjparker|Brianbjparker]] ([[User talk:Brianbjparker|talk]]) 02:11, 22 February 2012 (UTC)
::::::::No, that refers to the identification of the various exceptions, the methods of notifying them, and the handling of signalling and quiet NaNs. Your reference from 2007 does not support in any way that arbitrarily jiggling the calculations using directed rounding was considered as a reason to include directed rounding in the specification. He'd just have been laughed at if he had justified spending money on the 8087 for such a purpose when there are easy ways of doing something like that without any hardware assistance. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 08:23, 22 February 2012 (UTC)
== Trivia removed ==
I removed the note that the full precision of extended precision is attained when extended precision is used. The point about the algorithm is that it converges at whatever precision is used. We don't need to put in the precisions of the single, double and extended precision versions of the algorithm. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 23:23, 23 February 2012 (UTC)
:::I disagree that it is trivia-- it is a good example that also illustrates the earlier discussion on the usage of extended precision. In any case, to make it easier to find for those who may be interested in the information, the footnote to the final example, giving the precision using double extended for internal calculations, is included here:
:::"As the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision. Footnote: if intermediate calculations are carried at a higher precision using double extended (x87 80 bit) format, it reaches 18 digits of precision, which is the full target double precision." [[User:Brianbjparker|Brianbjparker]] ([[User talk:Brianbjparker|talk]]) 23:37, 23 February 2012 (UTC)
:It just has nothing to do with extended precision. The first algorithm would go wrong just as badly with extended precision, and the second one behaves exactly like double. There is nothing of note here. Why should it have all the various precisions in? The same thing would happen with float or quad precision. All it says is that the precision for different precisions is different. Also, a double cannot hold 18 digits of precision; used as an intermediate for double you'd at most get one bit of precision extra. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 00:50, 25 February 2012 (UTC)
::::Agreed that the footnote does nothing to clarify the particular point being made by that example-- that wasn't the aim, though. The intention was to also utilise the example to demonstrate the utility of computing intermediate values to higher precision than needed by the final destination format, to limit the effects of round-off. In that sense it is an example for the earlier discussion on extended precision (and also the section on approaches to improve accuracy). Perhaps the text "Footnote: if intermediate calculations are carried at a higher precision using double extended (x87 80 bit) format, it reaches 18 digits of precision, which is the full target double precision (see discussion on extended precision above)." would be clearer. Agreed it is not the most striking example of this, but it still demonstrates the idea-- perhaps a separate, more striking and specific example would be preferable; I will see what I can find. [[User:Brianbjparker|Brianbjparker]] ([[User talk:Brianbjparker|talk]]) 04:52, 25 February 2012 (UTC)
:::::It does not illustrate that. What gives you the idea it does? If anything it is an argument against what was said before. Using extended precision in the intermediate calculation and storing back as double does not give increased precision in the final result. The 18 digits only apply to the extended precision; they do not apply to the double result. The 18 digits are not the target precision of a double. A double can only hold 15 digits accurately. There is no way to stick the extra precision of the extended precision into the target double. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 09:53, 25 February 2012 (UTC)
::::::IEEE 754 double precision gives from 15 to 17 decimal digits of precision (17 digits if round-tripping from double to text back to double). When the example is computed with extended precision it gives 17 decimal digits of precision, so if the returned double were to be used for further computation it would have less roundoff error, in ULPs (at least one extra decimal digit's worth). Although, as you say, if the double result is printed to 15 decimal digits this extra precision will be lost. I agree that it is not a compelling example-- a better example could show a difference of many significant decimal digits due to internal extended precision. [[Special:Contributions/121.45.205.130|121.45.205.130]] ([[User talk:121.45.205.130|talk]]) 23:21, 25 February 2012 (UTC)
:::::::The 17 digits for a round trip are only needed to cope with making certain that rounding works okay. The actual precision is just less than 16 digits, about 15.95 if one cranks the figures. Printing has nothing to do with it; I was just talking about the 53 bits of precision information held within the double precision format, expressed as decimal digits. You can't shove any more information into the bits. The value there is about 1 ulp out, and using extended precision would gain that back. This is what I was saying about extended precision being very useful for getting accurate maths functions: straightforward implementations in double will very often be 1 ulp out without special work, whereas the extended precision result will very often give the value given by rounding the exact value. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 00:08, 26 February 2012 (UTC)
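The 15-versus-17 digit distinction is easy to demonstrate (a Python sketch; the sample value is arbitrary):

<syntaxhighlight lang="python">
x = 2.0 / 3.0
print(format(x, '.15g'))               # 0.666666666666667
print(format(x, '.17g'))               # 0.66666666666666663
print(float(format(x, '.15g')) == x)   # False: 15 digits lose the last bits here
print(float(format(x, '.17g')) == x)   # True: 17 digits always round-trip a double
</syntaxhighlight>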
::::::::::Ideally, what should be added is a more striking example of using excess precision in intermediate computations to protect against numerical instability. The current one can indeed demonstrate this if excess precision is carried to IEEE quad precision, in which case the numerically unstable version gives good results. I have added notes to that effect which will do as an example for now. There are many examples also showing this using only double extended (e.g. even as simple as computing the roots of a quadratic equation), and I will add such an example in the future... but not for a while (by the way, I think double extended adds more than 1 ULP, but I haven't checked that). [[User:Brianbjparker|Brianbjparker]] ([[User talk:Brianbjparker|talk]]) 06:54, 26 February 2012 (UTC)
:::::::::::That's not true either because how does one know when to stop? Using quadruple precision would still diverge. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 11:45, 26 February 2012 (UTC)
::::::::::::::::Yes, that is so: once it does reach the correct value it stays there for several iterations (at double precision), but it does eventually diverge from it again, so a stopping criterion of when the value does not change at double precision could be used. But yes, I am not completely happy with that example for that reason-- feel free to remove it if you feel it is misleading. Actually, Kahan has several very compelling examples in his notes-- I will post one here in the next week or so. [[User:Brianbjparker|Brianbjparker]] ([[User talk:Brianbjparker|talk]]) 14:41, 26 February 2012 (UTC)
The use of extra precision can be illustrated easily using differentiation. If the result is to be single precision then using double precision for all the calculations is a good idea because of the loss of significance when subtracting two values of the function. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 12:00, 26 February 2012 (UTC)
::: ok yes, that could be a good example-- I will see what I can come up with. [[User:Brianbjparker|Brianbjparker]] ([[User talk:Brianbjparker|talk]]) 14:41, 26 February 2012 (UTC)
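A sketch along the lines Dmcq suggests, assuming a forward difference on sin with numpy's float32 standing in for single precision (the helper name is made up):

<syntaxhighlight lang="python">
import numpy as np

def fwd_diff(f, x, h, dtype):
    x, h = dtype(x), dtype(h)
    return (f(x + h) - f(x)) / h  # the subtraction loses significance as h shrinks

for h in (1e-2, 1e-4, 1e-6):
    print(h,
          fwd_diff(np.sin, 1.0, h, np.float32),  # single precision throughout
          fwd_diff(np.sin, 1.0, h, np.float64),  # double-precision intermediates
          np.cos(1.0))                           # the exact derivative at x = 1
</syntaxhighlight>

By h = 1e-6 the float32 estimate has lost most of its digits to cancellation, while the float64 one is still close to cos(1); rounding the double result back to single gives a far better single-precision answer.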
: I have added an example from Kahan's publications-- I think this is a good example, as it demonstrates the massive roundoff error (up to half the significant digits lost) that can occur with even innocuous-looking formulae, and shows the two main methods to correct or improve that: increased internal precision, or numerical analysis. [[User:Brianbjparker|Brianbjparker]] ([[User talk:Brianbjparker|talk]]) 07:03, 28 February 2012 (UTC)
::Yes, it is definitely better to source something like that to a good source like him. I may not agree with every last word he says about it, but he definitely is the premier source for anything on floating point. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 14:14, 28 February 2012 (UTC)
== 01010111 01101000 01100001 01110100 00101110 00101110 00101110 00111111 (What...?) ==
The section on internal representation does not explain how decimals are converted to floating-point values. I think it would be helpful if we added a step-by-step procedure that the computer follows. Thanks! [[Special:Contributions/68.173.113.106|68.173.113.106]] ([[User talk:68.173.113.106|talk]]) 02:16, 25 February 2012 (UTC)
:This gives an example of conversion and the articles on the particular formats give other examples. Wikipedia does not in general provide step by step procedures, it describes things, see [[WP:NOTHOWTO]]. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 02:24, 25 February 2012 (UTC)
::I just thought it was kind of unclear. Besides, doing so might actually help this article get to GA status.
::You see, I'm trying to design an algorithm for getting the mantissa, the exponent, and the sign of a <code>float</code> or <code>double</code>, in case anyone else actually cares about that stuff. For the record, the storage is little-endian, so you have to reverse the byte order. [[Special:Contributions/68.173.113.106|68.173.113.106]] ([[User talk:68.173.113.106|talk]]) 02:50, 25 February 2012 (UTC)
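For what it's worth, here is a minimal sketch of one way to pull those fields out in Python (not from the discussion; the helper name is made up). The struct module handles byte order, so no manual reversal is needed:

<syntaxhighlight lang="python">
import struct

def decompose(x):
    # Reinterpret the double's 8 bytes as a 64-bit unsigned integer
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # biased by 1023
    fraction = bits & ((1 << 52) - 1)     # implicit leading 1 for normal numbers
    return sign, exponent, fraction

print(decompose(-6.25))  # (1, 1025, 2533274790395904), i.e. -1.5625 * 2**2
</syntaxhighlight>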
:::It would stop FA status. Have a look at the articles about the individual formats; they describe the format in quite enough detail. Any particular algorithm is up to the user; such algorithms are not interesting or discussed in secondary sources. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 10:01, 25 February 2012 (UTC)
:::The closest thing in Wikipedia to the sort of stuff you're talking about would be if somebody wrote something for Wikibooks. Have you had a look at the various external sites? Really, to me what you're talking about sounds like a homework exercise, and we shouldn't help with those except perhaps to give hints. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 10:20, 25 February 2012 (UTC)
== imho, "real numbers" is didactically misleading ==
I'd like to propose changing the beginning of the first sentence, because the limited number of bits in the significand only allows for storing rational binary numbers. Because two is a prime factor of ten, this means only rational decimal numbers can be stored as well. In conclusion, I'd like to propose replacing "real" with "rational" there.
[[User:Drgst|Drgst]] ([[User talk:Drgst|talk]]) 13:17, 25 February 2012 (UTC)
:Definitely not. That is a bad idea. They are approximations to real numbers. The concept of rational number just doesn't come into it. That they are rational is just a side effect. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 14:32, 25 February 2012 (UTC)
::In the section 'Some other computer representations for non-integral numbers' there are some systems that can represent some irrational numbers. For instance, a logarithmic system does not necessarily represent rational numbers. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 14:36, 25 February 2012 (UTC)
:::Sorry for the delayed answer, Dmcq; it seems I forgot to tick the "watch page" checkbox... Now for the content: IEEE FP numbers definitely are rational numbers. Even the simplest irrational number in the world, sqrt(2), cannot be represented, for example. Any mathematical theorem that really depends on the existence of irrational numbers does not hold for the set of FP numbers. Nevertheless, you are right in stating that FP numbers are meant to approximate real numbers. Yet, as no non-rational number can be represented, transcendental numbers are far from being representable. Of course, this has serious consequences: for example, none of these nice trigonometric identities involving pi or pi/2 can be used naively without introducing large errors. This is just a simple example of why I think people should be warned against associating floating point numbers with real numbers. [[User:Drgst|Drgst]] ([[User talk:Drgst|talk]]) 21:14, 27 June 2012 (UTC)
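The point about identities involving pi is easy to check numerically (a one-line Python illustration):

<syntaxhighlight lang="python">
import math

# math.pi is only the double nearest to pi, so sin of it is not exactly 0;
# the result, about 1.2246e-16, is roughly the representation error pi - math.pi
print(math.sin(math.pi))
</syntaxhighlight>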
::::"Irrational numbers are those real numbers that cannot be represented as terminating or repeating decimals." --[[Irrational number]] Therefore, irrational numbers ''cannot be exactly represented on any digital computer''. However, you can get arbitrarily close. It really doesn't take all that many bits to handle a Planck length (~10^-35m) and the estimated size of the universe (~10^26m) in the same calculation.
::::The key point here is that floating point really is a method of representing (not perfectly but arbitrarily close) real numbers. Yes, it just so happens that some of them are represented exactly and others are not, but that's not relevant to the fact that FP is a method of representing (imperfectly) real numbers. All of this is covered quite nicely in the "Representable numbers, conversion and rounding" section. No need to make the lead confusing and misleading. --[[User:Guy Macon|Guy Macon]] ([[User talk:Guy Macon|talk]]) 22:48, 27 June 2012 (UTC)
:::::I don't think this is correct "floating point really is a method of representing (not perfectly but arbitrarily close) real numbers". We talk about the "representable numbers" as those real numbers which can be represented exactly within the system. Other real numbers are rounded to some representable number. So I think we should either speak in terms of "working with real numbers" (which seems a little vague) or "representing approximations to real numbers" (as we do later in the article). --[[User:JakeVortex|Jake]] ([[User talk:JakeVortex|talk]]) 08:50, 22 October 2012 (UTC)
::::::You make a good point, but while "working with real numbers" is inexact and vague, "representing approximations to real numbers" is wordy and clumsy. Perhaps we can devise a third alternative? --[[User:Guy Macon|Guy Macon]] ([[User talk:Guy Macon|talk]]) 12:57, 22 October 2012 (UTC)
:::::::What about "approximating real numbers"? But IMHO, "real numbers" is slightly incorrect, because floating point can also be used for complex arithmetic (though a complex number is here seen as a pair of two real numbers). Moreover a floating-point arithmetic is not just about the representation, but also the behavior when doing an operation (e.g. how the result is rounded). So, I would prefer something like: "a method of doing numerical computations" [[User:Vincent Lefèvre|Vincent Lefèvre]] ([[User talk:Vincent Lefèvre|talk]]) 22:09, 22 October 2012 (UTC)
== Guard bits ==
Anybody know where the business of needing three extra bits comes from? For addition one only needs a guard/round digit plus a sticky bit, as the sticky bit will always be zero if subtraction means you have to shift up. And for multiplication one needs the double length to cope with carry properly before rounding - but one can still cut that down to two bits before applying the particular rounding. The literature talks about guard and round and sticky, so I'm not disputing putting it in the text, just wondering why people got the idea in their heads in the first place. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 13:03, 8 March 2012 (UTC)
:Somewhat related: Take a look at "2 vs 3 guard bits" here:
:http://www.engineering.uiowa.edu/~carch/lectures07/55035-070404-prn.pdf
:Also interesting:
:http://www.google.com/patents/US4282582.pdf
:These two searches turn up some interesting pages:
:[http://www.google.com/search?q=%22floating+point%22+%2240+bits%22 <nowiki>http://www.google.com/search?q="floating+point"+"40+bits"</nowiki>]
:[http://www.google.com/search?q=%22floating+point%22+%22eight+guard+bits%22+%22DSP%22 <nowiki>http://www.google.com/search?q="floating+point"+"eight+guard+bits"+"DSP"</nowiki>]
:--[[User:Guy Macon|Guy Macon]] ([[User talk:Guy Macon|talk]]) 00:39, 9 March 2012 (UTC)
::Goldberg gives a discussion of the need for two guard digits in http://www.validlab.com/goldberg/paper.pdf (page 195). There is a very clear description with example cases in: Michael L. Overton (2001). Numerical Computing with IEEE Floating Point Arithmetic. SIAM. [[User:Brianbjparker|Brianbjparker]] ([[User talk:Brianbjparker|talk]]) 06:17, 9 March 2012 (UTC)
:::Very good reference. It should be noted that he not only covers base 10 and guard (decimal) digits but also base 2 and guard bits. --[[User:Guy Macon|Guy Macon]] ([[User talk:Guy Macon|talk]]) 07:02, 9 March 2012 (UTC)
:::I just looked at an implementation of the whole business that I did ages ago, and I did actually use three bits! Just me forgetting what I'd done, sorry. Yes, the subtraction does actually require them all. [[User:Dmcq|Dmcq]] ([[User talk:Dmcq|talk]]) 11:33, 9 March 2012 (UTC)
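For the archive, a small simulation of the base-10, 3-digit guard-digit case along the lines of Goldberg's paper (Python's decimal module rounds correctly, so the "no guard digit" step is faked by truncating the shifted operand by hand):

<syntaxhighlight lang="python">
from decimal import Decimal, getcontext

getcontext().prec = 3
x = Decimal('10.1')  # 1.01 * 10^1
y = Decimal('9.93')  # 9.93 * 10^0

# Without a guard digit, aligning y to x's exponent truncates it to 0.99 * 10^1
print(x - Decimal('9.9'))  # 0.2  -- large relative error
print(x - y)               # 0.17 -- exact; decimal keeps the guard digit
</syntaxhighlight>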
== edit : computation in page is correct after all ==
Sorry for the confusion: I used t_(i+1) instead of t_i; for that reason I missed a factor of 2: 2^(i+1) = 2 * 2^i. <small><span class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:KeesLem|KeesLem]] ([[User talk:KeesLem|talk]] • [[Special:Contributions/KeesLem|contribs]]) 14:36, 21 February 2013 (UTC)</span></small><!-- Template:Unsigned --> <!--Autosigned by SineBot-->
== Justification for division by zero definition ==
I [http://en.wikipedia.org/w/index.php?title=Division_by_zero&diff=511812597&oldid=510158610 recently added] to [[division by zero]] this statement with an appropriate source:
:"The justification for this definition is to preserve the sign of the result in case of [[arithmetic underflow]]. For example, in the double-precision computation 1/(''x''/2), where ''x'' = ±2<sup>−149</sup>, the computation ''x''/2 underflows and produces ±0 with sign matching ''x'', and the result will be ±∞ with sign matching ''x''. The sign will match that of the exact result ±2<sup>150</sup>, but the magnitude of the exact result is too large to represent, so infinity is used to indicate overflow."
Provided this is valid, I wonder if it could also be added in some relevant ___location in the body of floating point related articles. In general I'd like to see more information on design rationales. Thanks! [[User:Dcoetzee|Dcoetzee]] 07:42, 11 September 2012 (UTC)
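The quoted rationale can be checked directly, assuming numpy's float32 as the single-precision type:

<syntaxhighlight lang="python">
import numpy as np

x = np.float32(2.0 ** -149)  # smallest positive single-precision subnormal
with np.errstate(divide='ignore', under='ignore'):
    half = x / np.float32(2)           # underflows to +0.0, keeping the sign
    print(half, np.float32(1) / half)  # 0.0 inf
    half = -x / np.float32(2)          # underflows to -0.0
    print(half, np.float32(1) / half)  # -0.0 -inf
</syntaxhighlight>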
== Signed zero section, branch cuts ==
The section on signed zero (under Internal representation >> Special values >> Signed zero) says the following:
"The difference between +0 and −0 is mostly noticeable for complex operations at so-called [[Branch cut|branch cuts]]."
In a strictly mathematical sense, +0/-0 ''can'' be interpreted as describing the limiting behaviors of a function, but that's not actually what's happening here. Moreover, branch cuts are not the only situation where these exceptional limiting behaviors appear, one can have branch cuts without exceptional limiting behaviors of this sort, and none of the examples given in the section are actually branch cuts. As far as I can tell, there is absolutely no significance to the relationship between branch cuts in complex analysis and signed zero in floating point numerical representations, but I wanted to make sure there wasn't a good reason for this being here. Thoughts? [[Special:Contributions/71.227.119.236|71.227.119.236]] ([[User talk:71.227.119.236|talk]]) 15:25, 29 September 2012 (UTC)
:Result of a quick Google search:
:"A system with signed zero can distinguish between asin(5+0i) and asin(5-0i) and pick the appropriate branch cut continuous with quadrant I or quadrant IV, respectively. A system without signed zero cannot distinguish and, according to the choses the branch cut such that it is continuous with quadrant IV (consistent with the rule of CCC). So, for asin(5+0i) it will return the same value as a system with signed zero would for asin(5-0i)." -Richard B. Kreckel ( [ http://www.ginac.de/~kreckel/ ] [ http://lists.gnu.org/archive/html/bug-gsl/2011-12/msg00004.html ] ).
:I think that when he wrote "according to the" he meant "accordingly" (probably not a native English speaker). --[[User:Guy Macon|Guy Macon]] ([[User talk:Guy Macon|talk]]) 23:34, 29 September 2012 (UTC)
::Somewhat straying from the subject but still quite interesting; the "Signed Zero" section of "What Every Computer Scientist Should Know About Floating-Point Arithmetic" ( [ http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html ] ) --[[User:Guy Macon|Guy Macon]] ([[User talk:Guy Macon|talk]]) 23:41, 29 September 2012 (UTC)
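Kreckel's asin example can be reproduced directly; Python's cmath follows the sign of zero at its branch cuts (on platforms with signed zeros):

<syntaxhighlight lang="python">
import cmath

# For real arguments beyond +1, asin has a branch cut; the sign of the imaginary
# zero selects which side of the cut the result is continuous with
print(cmath.asin(complex(5, 0.0)))   # approx (1.5707963+2.2924317j), quadrant I side
print(cmath.asin(complex(5, -0.0)))  # approx (1.5707963-2.2924317j), quadrant IV side
</syntaxhighlight>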
== imho, the computation for Pi as shown actually computes only Pi/2 ==
The algorithm as shown to compute an approximation of Pi actually computes, imo, only Pi/2 in this form, even though the output shown contains an approximation of Pi. I think either the values should be halved or the formula should be changed to: 12 * 2^i * t_i
[[User:KeesLem|KeesLem]] ([[User talk:KeesLem|talk]]) 15:16, 21 February 2013 (UTC) <span style="font-size: smaller;" class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/130.161.210.156|130.161.210.156]] ([[User talk:130.161.210.156|talk]]) </span><!-- Template:Unsigned IP --> <!--Autosigned by SineBot-->
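(Consistent with the retraction in the section above: a quick numerical check, assuming the article's recurrence with t_0 = 1/sqrt(3), that the factor 6 already yields Pi rather than Pi/2.)

<syntaxhighlight lang="python">
import math

t = 1.0 / math.sqrt(3.0)  # t_0 = tan(pi/6)
for i in range(10):
    t = t / (math.sqrt(t * t + 1.0) + 1.0)  # angle halving: t_i = tan(pi/(6*2**i))
print(6 * 2 ** 10 * t, math.pi)  # both print as approximately 3.14159...
</syntaxhighlight>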