Content deleted Content added
Undid revision 1277724214 by Stevepang79 (talk) completely off-topic - The Talk page is not a forum. |
|||
(8 intermediate revisions by 2 users not shown) | |||
Line 41:
::I agree that [[Floating-point_arithmetic#Alternatives_to_floating-point_numbers|the subsection]] could be moved. For example, from [[Floating-point_arithmetic#Overview|the overview]] to the end of the article, just before [[Floating-point_arithmetic#See_also|See also]] as a separate section. [[User:Korektysta|Korektysta]] ([[User talk:Korektysta|talk]]) 22:56, 1 August 2024 (UTC)
::Ah, OK. Effectively, the arithmetic builds the computation tree, but it is opaque for the user. I guess that the treatment of leafs in the tree is also different because there is no special constant for <math>\pi</math>. <math>\pi</math> is just another function. [[User:Korektysta|Korektysta]] ([[User talk:Korektysta|talk]]) 04:49, 2 August 2024 (UTC)
== Patriot missile incident ==
The other day I made an edit clarifying the nature of the Patriot missile incident, based on the public sources already cited. [[User:Vincent Lefèvre]] reverted two parts of them:
First, I replaced the link to [[loss of significance]] by the simpler word ‘error’, because [[loss of significance]] now just redirects to [[catastrophic cancellation]] since the old article was [https://en.wikipedia.org/w/index.php?title=Loss_of_significance&diff=prev&oldid=1107845106 deleted]. I was [https://en.wikipedia.org/wiki/Talk:Catastrophic_cancellation#Proposed_merge_of_Loss_of_significance_into_Catastrophic_cancellation loosely involved] in this deletion but I don't feel strongly about this; I think the term ‘loss of significance’ is unnecessarily fancy without saying anything more than ‘error’ does, but it's fine, and the error is essentially catastrophic cancellation after all.
Second, I added the text:
: The error arose not from the use of floating-point, but from the use of two different unit conversions when representing time in different parts of a calculation.
This text was [https://en.wikipedia.org/w/index.php?title=Floating-point_arithmetic&curid=11376&diff=1299610166&oldid=1299537597 deleted] on the grounds that:
: The intent is that there is a single time unit: 0.1s. The issue is that the software assumed that its accuracy did not matter; Skeel says: "this time difference should be in error by only 0.0001%, a truly insignificant amount". Something that may remain true... until a cancellation occurs like here.
But I don't think that is the whole story. The Skeel citation<ref name="Skeel">{{citation |url=https://www-users.cse.umn.edu/~arnold/disasters/Patriot-dharan-skeel-siam.pdf |title=Roundoff Error and the Patriot Missile |last=Skeel |first=Robert |journal=SIAM News |volume=25 |issue=4 |page=11 |date=July 1992 |access-date=2024-11-15}}</ref> says (emphasis added):
: When Patriot systems were brought into the Gulf conflict, the software was modified (several times) to cope with the high speed of ballistic missiles, for which the system was not originally designed. <p>At least one of these software modifications was the introduction of a subroutine for converting clock-time more accurately into floating-point. This calculation was needed in about half a dozen places in the program, but the call to the subroutine was not inserted at every point where it was needed. '''Hence, with a less accurate truncated time of one radar pulse being subtracted from a more accurate time of another radar pulse, the error no longer cancelled.'''
The designers certainly didn't assume that its accuracy did not matter—if they did assume that, why would they have written a new conversion subroutine for more accurate conversion?
Suppose the floating-point system on the control computer had 30-bit precision (a low estimate for a 48-bit floating-point format). The logic computed something like <math>C_1(t_1) - C_0(t_0)</math>, where <math>C_1(t)</math> is (say) the ''new'' higher-precision conversion from fixed-point to floating-point giving <math>0.1\times t\times(1 - 2^{-30})</math>, and <math>C_0(t)</math> is (say) the ''old'' lower-precision conversion giving <math>0.1\times t\times(1 - 2^{-20})</math>. There may be an additional ''floating-point rounding error'' of about one ulp, but that pales in comparison to the ''discrepancy between conversion subroutines'' of about <math>2^{10} \approx 1000</math> ulps in this hypothesis of 30-bit precision (if it were 40-bit precision, then it would be <math>2^{20} \approx 10^6</math> ulps, and so on).
In brief, this was a much more mundane software engineering mistake—updating a unit conversion subroutine call in one place but not another, so the units are no longer commensurate—rather than anything you can rightly blame floating-point for.
It's possible that, after long enough uptime, computing <math>C(t_1) - C(t_0)</math> rather than <math>C(t_1 - t_0)</math> with the ''same'' conversion subroutine <math>C</math> ''could'' lose enough significant bits due to floating-point rounding error to cause the same problem. But in this case, the problem was using ''different'' conversion subroutines <math>C_1</math> and <math>C_0</math>. And, with at least 30-bit precision, the floating-point rounding error would take a thousand times as long to cause the same problem—over twenty thousand hours before a problem, or about two years and four months of continuous uptime. (I would also guess the format has >30 bits of precision, so it's likely much longer than that.)
This cautionary tale is often used to blame the designers for using floating-point to represent time and to argue that floating-point numbers are incomprehensible black magic where reasoning goes out the window (e.g., on [https://news.ycombinator.com/item?id=1667060 Hacker News] and [https://old.reddit.com/r/programming/comments/6npfz/the_patriot_missile_failure/ Reddit]), even though the underlying story justifies neither of these conclusions. So that's why I think it is important to spell out the actual bug here—incomplete software change caused subtraction of incommensurate (but similar) units. [[User:Taylor Riastradh Campbell|Taylor Riastradh Campbell]] ([[User talk:Taylor Riastradh Campbell|talk]]) 04:16, 17 July 2025 (UTC)
:
:{{Re|Taylor Riastradh Campbell}}
:Just saying "error" would be misleading because in general, one has an error at almost each floating-point operation, and this is often not a major issue (with carefully designed code). What matters here is that the (relative) error is very large due to a [[catastrophic cancellation]] as described in the document.
:Saying that there are "two different unit conversions" is incorrect, as the time unit is the same in both routines (0.1s), contrary to the [[Mars Climate Orbiter#Cause of failure|Mars Climate Orbiter failure]], where the pound-force second and newton-second units were mixed up (so, even with an infinite precision, the failure would still have occurred); the issue here is that there are different approximations in the time calculation, i.e. with different accuracy (see the term "accurate" used by Skeel). This is really related to error analysis (with an infinite precision, there would have been no issues).
:— [[User:Vincent Lefèvre|Vincent Lefèvre]] ([[User talk:Vincent Lefèvre|talk]]) 12:30, 17 July 2025 (UTC)
::{{Quote|Just saying "error" would be misleading because in general, one has an error at almost each floating-point operation, and this is often not a major issue (with carefully designed code). What matters here is that the (relative) error is very large due to a catastrophic cancellation as described in the document.}}
::Right, I have no objection to your reverting that part of the change. I was just explaining why I changed the text/link ‘loss of significance’ in the first place.
::{{Quote|Saying that there are "two different unit conversions" is incorrect, as the time unit is the same in both routines (0.1s)…the issue here is that there are different approximations in the time calculation, i.e. with different accuracy (see the term "accurate" used by Skeel).}}
::I'm not attached to phrasing it in terms of unit conversions (though I think a better analogy is yards/meters, which differ by about 9%, or quarts/liters, which differ by about 6%, rather than pounds/newtons, which differ by a factor of four). The part that is important to emphasize is the error from subtracting different approximations to the time conversion—because of an incomplete software change that didn't update all the subroutine calls—rather than the error from using floating-point.
::Had the control computer used ''either'' <math>C_0(t_1) - C_0(t_0)</math> ''or'' <math>C_1(t_1) - C_1(t_0)</math> ''consistently'', whether the older and worse approximation or the newer and better approximation, the incident likely wouldn't have happened even though it used floating-point (until several years of uptime, rather than several dozen hours of uptime). And subtracting different conversion approximations <math>C_0</math> and <math>C_1</math> even in exclusively fixed-point arithmetic, or infinite-precision arithmetic, would also have caused the same error. [[User:Taylor Riastradh Campbell|Taylor Riastradh Campbell]] ([[User talk:Taylor Riastradh Campbell|talk]]) 13:13, 17 July 2025 (UTC)
:::The old code had to be rewritten to use a more accurate time computation ''to cope with the high speed of ballistic missiles''. So, anyway, even the old code, which used a ''consistent'' conversion, was globally not accurate enough. And you do not necessarily need to use the same conversion; two different conversion routines with sufficient accuracy would be enough. With infinite precision, you would have <math>C_0 = C_1</math>, so there would be no errors. — [[User:Vincent Lefèvre|Vincent Lefèvre]] ([[User talk:Vincent Lefèvre|talk]]) 13:46, 17 July 2025 (UTC)
|