Unless someone can show some external documentation, I remain convinced that two-phase commit cannot recover reliably from a failure in the commit phase. [[User:Dubwai|Dubwai]] 21:11, 25 January 2007 (UTC)
The "correct" 2PC is blocking. The coordinator resends on timeout, but NEVER gives up on waiting. This is how 2PC ensures that it always recovers reliably.
Once a cohort agrees to commit, it is assumed that it can eventually commit (the commit itself cannot fail). The machine can fail, but it will come back up, read its log, and wait for the coordinator to resend the message (to see whether it is a commit or an abort). The same goes for the coordinator: if the machine fails, it will read its log, find that it was supposed to commit or abort, and start resending the messages; cohorts which have already finalized will also respond to these messages.
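To make the cohort side concrete, here is a minimal Python sketch of what I mean. Everything in it is hypothetical and only for illustration: the class name <code>Cohort</code>, the state names, and the use of a plain list to stand in for a durable log are my assumptions, not anything taken from the article or the referenced paper.

```python
# Hypothetical sketch: a cohort whose protocol state survives a crash
# because every state change is appended to a log before it takes effect.
FINAL = {"COMMIT": "COMMITTED", "ABORT": "ABORTED"}


class Cohort:
    def __init__(self, log):
        # `log` is a plain list standing in for a durable on-disk log (assumption).
        self.log = log
        # Recovery: the last record written before the crash is the current state.
        self.state = log[-1] if log else "INIT"

    def _write(self, record):
        self.log.append(record)   # log first, then change in-memory state
        self.state = record

    def on_prepare(self):
        # Voting yes is a promise that a later commit cannot fail.
        if self.state == "INIT":
            self._write("PREPARED")
        return "YES" if self.state != "ABORTED" else "NO"

    def on_decision(self, decision):
        # Apply the coordinator's decision once; a resent message after
        # finalizing is simply acknowledged again.
        if self.state not in FINAL.values():
            self._write(FINAL[decision])
        return "ACK"
```

Creating a new <code>Cohort</code> over the same log models a restart: a cohort that crashed after voting comes back in the <code>PREPARED</code> state and just waits for the resent decision, and one that had already finalized answers the resend with another acknowledgement.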
The article isn't clear about the waiting and resending - the coordinator waits for all cohorts to finalize, resending the message on timeout.
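A rough sketch of the coordinator's waiting-and-resending loop, again in Python and again with made-up names (<code>finalize</code>, <code>make_flaky_send</code>, the <code>send</code> callback returning <code>True</code> when an acknowledgement arrives before the timeout). It assumes the vote phase is already over and the decision is known; it is only meant to show the blocking behaviour, not a real implementation.

```python
# Hypothetical sketch of the blocking finalize step on the coordinator.
def finalize(decision, cohorts, send, log):
    """Log the decision, then resend it until every cohort has acknowledged."""
    log.append(decision)          # decision is durable before any message goes out
    pending = set(cohorts)
    rounds = 0
    while pending:                # never gives up: this is the blocking part
        rounds += 1
        for cohort in list(pending):
            if send(cohort, decision):   # True = ACK received before timeout
                pending.discard(cohort)
    log.append("END")             # all cohorts finalized, transaction can be forgotten
    return rounds


def make_flaky_send(failures):
    """Build a send() that times out a fixed number of times per cohort (test helper)."""
    remaining = dict(failures)

    def send(cohort, decision):
        if remaining.get(cohort, 0) > 0:
            remaining[cohort] -= 1
            return False          # simulated timeout or crashed cohort
        return True

    return send
```

For example, with cohort <code>b</code> timing out twice and <code>c</code> five times, <code>finalize("COMMIT", ["a", "b", "c"], make_flaky_send({"b": 2, "c": 5}), log)</code> keeps resending and returns after six rounds. The loop has no retry limit on purpose: a cohort that never comes back makes it wait forever, which is exactly the blocking cost being discussed.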
Two-phase commit is actually resilient to any number of failures. Once the coordinator decides what to do and writes it to the log file, no matter how many failures occur, at some point the transaction will finalize. The wiki article is incorrect: the referenced paper does not say that 2PC FAILS, it says that in some cases everything must block for a potentially long time (which is not acceptable). But (the basic, blocking) 2PC IS resilient to multiple failures. The problem is that it is blocking, and yes, if you remove the blocking part, it is no longer resilient (duh). Hope this helps. I think the article should be corrected, but the author might want to make the corrections himself.