Lockstep (computing): Difference between revisions

Revision as of 00:05, 25 January 2015 by Dsimic (→Lockstep memory: Slightly more readable)
Latest revision as of 10:03, 22 September 2024 by Citation bot (Added publisher. \| Suggested by Spinixster \| Category:Classes of computers \| #UCB_Category 9/91)
(27 intermediate revisions by 20 users not shown)
Line 1:
{{Short description\|Fault-tolerant computer system}}
{{otheruses\|Lockstep (disambiguation)}}
{{More references\|date=September 2014}}

'''Lockstep''' systems are [[fault-tolerant computer system]]s that run the same set of operations at the same time in [[Parallel computing\|parallel]].<ref>{{cite ~~web~~book \| url = ~~http~~https://books.google.com/books?id=UVlq7SFDCVUC&~~pg=PA80&lpg=PA80&dq~~q=lockstep+fault+tolerance&~~source~~pg=~~bl&ots=SCW40duM-x&sig=64PqQP6qaCJlkuNg16Nn1wLtjuM&hl=en&sa=X&ei=DRoOVNqyI6HiywOs3oKIBw&redir_esc=y#v=onepage&q=lockstep%20fault%20tolerance&f=false~~PA80 \| title = Fault-Tolerant Real-Time Systems: The Problem of Replica Determinism \| year = 1996 \| accessdate = 2014-09-08 \| author = Stefan Poledna \| ~~website~~page = ~~books.google.com~~80 \| ~~page~~publisher = ~~80~~Springer \| isbn = 9780585295800 }}</ref> The [[Redundancy (engineering)\|redundancy]] (duplication) allows error detection and error correction: with at least two systems ([[dual modular redundancy]], DMR), the outputs of lockstep operations can be compared to determine whether there has been a fault, and with at least three systems ([[triple modular redundancy]], TMR), the error can be automatically corrected via majority vote.

The term "[[lockstep]]" originates ~~in the~~from army usage, where it refers to ~~the~~ synchronized walking, in which ~~the~~ marchers walk as closely together as physically practical.

To run in lockstep, each system is set up to progress from one well-defined state to the next well-defined state. When a new set of inputs reaches the system, it processes them, generates new outputs and updates its state. This set of changes (new inputs, new outputs, new state) is considered to define that step, and must be treated as an atomic transaction; in other words, either all of it happens, or none of it happens, but not something in between.
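As a rough illustration of the step model and the DMR/TMR comparison described above, the following Python sketch runs one step on several replicas and commits the result only when a strict majority of replicas agree. It is a minimal sketch and is not taken from any particular system: the names `lockstep_step` and `FaultDetected`, and the assumption that a step is a pure function from (state, inputs) to (new state, outputs), are hypothetical and chosen only for this example.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, Tuple

# Assumption for this sketch: a step is a pure function that maps
# (current state, inputs) to (new state, outputs). All values are hashable
# so that replica results can be compared and counted.
StepFn = Callable[[Hashable, Hashable], Tuple[Hashable, Hashable]]


class FaultDetected(Exception):
    """Raised when the replicas disagree and no strict majority exists."""


def lockstep_step(
    replicas: Sequence[StepFn], state: Hashable, inputs: Hashable
) -> Tuple[Hashable, Hashable]:
    """Run one step on every replica and return the agreed (state, outputs).

    The step is all-or-nothing: the caller's state is only replaced by the
    returned value, so a detected fault leaves the previous state untouched.
    """
    # Every replica processes the same inputs from the same committed state.
    results = [replica(state, inputs) for replica in replicas]

    # Count identical results and pick the most common one.
    winner, votes = Counter(results).most_common(1)[0]

    # A strict majority is required. With two replicas (DMR) this means both
    # must agree, so a mismatch is detected but cannot be corrected. With
    # three replicas (TMR) a single faulty replica is outvoted.
    if 2 * votes > len(results):
        return winner
    raise FaultDetected(f"replicas disagree: {results!r}")


# Hypothetical step functions: a running sum, and a replica with a fault.
def good(state: int, x: int) -> Tuple[int, int]:
    new_state = state + x
    return new_state, new_state


def faulty(state: int, x: int) -> Tuple[int, int]:
    new_state = state + x + 1  # simulated hardware fault
    return new_state, new_state
```

With these definitions, `lockstep_step([good, good, faulty], 10, 5)` returns the majority result, whereas `lockstep_step([good, faulty], 10, 5)` raises `FaultDetected`, because two replicas can reveal a mismatch but cannot decide which one is correct.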
Sometimes a timeshift (delay) is set between systems, which increases the detection probability of errors induced by external influences (e.g. [[voltage spike]]s, [[ionizing radiation]], or [[in situ]] [[reverse engineering]]).

== {{Anchor\|MEMORY}}Lockstep memory ==
{{See also\|Chipkill}} ▼

Some vendors, including Intel, use the term ''lockstep memory'' to describe a [[Multi-channel memory architecture\|multi-channel]] memory layout in which [[cache line]]s are distributed between two memory channels, so one half of the cache line is stored in a [[DIMM]] on the first channel, while the second half goes to a DIMM on the second channel. By combining the [[single error correction and double error detection]] (SECDED) capabilities of two [[ECC memory\|ECC]]-enabled DIMMs in a lockstep layout, their ''single-device data correction'' (SDDC) nature can be extended into ''double-device data correction'' (DDDC), providing protection against the failure of any single memory chip.<ref name="intel-xeon-e7-v2">{{cite web \| url = https://software.intel.com/en-us/articles/intel-xeon-processor-e7-v2-family-technical-overview#c104 \| title = Intel Xeon Processor E7 V2 Family Technical Overview, Section 3.1: Intel C104/102 Scalable Memory Buffer \| date = 2014-02-18 \| accessdate = 2014-09-09
Line 20 ⟶ 24:
}}</ref><ref name="intel-lockstep-mode">{{cite web \| url = https://software.intel.com/en-us/blogs/2014/07/11/independent-channel-vs-lockstep-mode-drive-you-memory-faster-or-safer \| title = Independent Channel vs.
Lockstep Mode~~{{snd}}~~ – Drive your Memory Faster or Safer \| date = 2014-07-11 \| accessdate = 2014-09-09 \| author = Thomas Willhalm \| publisher = [[Intel]] }}</ref><ref name="hp-proliant-guidelines">{{cite web \| url = ~~ftp~~http://ftp.hp.com/pub/c-products/servers/options/Memory-Config-Recommendations-for-Intel-Xeon-5500-Series-Servers-Rev1.pdf#page=8 \| title = Best Practice Guidelines for ProLiant Servers with the Intel Xeon 5500 processor series Engineering Whitepaper, 1st Edition \| date = May 2009 \| accessdate = 2014-09-09 \| publisher = [[Hewlett-Packard\|HP]] \| format = PDF ~~\| format = PDF~~ \| pages = 8–9 }}</ref><ref>{{cite web \| url = http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/c102-c104-scalable-memory-buffer-datasheet.pdf#page=9 \| title = Intel C102/C104 Scalable Memory Buffer Datasheet, Section 1.3.1.2.2: 1:1 Sub-channel Lockstep Mode \| date = February 2014 \| accessdate = 2015-01-25 \| publisher = [[Intel]] \| format = PDF \| page = 9 }}</ref>

Downsides of Intel's lockstep memory layout are a reduction in the effectively usable amount of RAM (in the case of a triple-channel memory layout, the maximum amount of memory reduces to one third of the physically available maximum) and reduced performance of the memory subsystem.<ref name="intel-xeon-e7-v2~~" /><ref name="intel-lockstep-mode~~" /><ref name="hp-proliant-guidelines" />

== Dual modular redundancy ==
Line 38 ⟶ 48:
Where the computing systems are duplicated, but both actively process each step, it is difficult to arbitrate between them if their outputs differ at the end of a step. For this reason, it is common practice to run DMR systems as "master/slave" configurations with the slave as a "hot-standby" to the master, rather than in lockstep. Since there is no advantage in having the slave unit actively process each step, a common method of working is for the master to copy its state at the end of each step's processing to the slave.
Should the master fail at some point, the slave is ready to continue from the previous known good step.

While either the lockstep or the DMR approach (when combined with some means of detecting errors in the master) can provide redundancy against hardware failure in the master, neither protects against software ~~failure~~error. If the master fails because of a software error, it is highly likely that the slave, in attempting to repeat the execution of the step which failed, will simply repeat the same error and fail in the same way, an example of a [[common mode failure]].

== Triple modular redundancy ==
Line 46 ⟶ 56:
== See also ==
▲ * [[Chipkill]]
* [[~~NonStop~~Master-checker]]
* [[~~Stratus~~NonStop ~~Technologies~~(server computers)]]
* [[Stratus VOS]]
* [[VAXft]]

== References ==
{{Reflist\|30em}}

== External links ==
* [http://www.dell.com/downloads/global/power/ps3q05-20050176-Patel-OE.pdf Enabling Memory Reliability, Availability, and Serviceability Features on Dell PowerEdge Servers], 2005
* [https://web.archive.org/web/20150923233016/http://www.ece.umd.edu/courses/enee759h.S2003/references/chipkill.pdf Chipkill correct memory architecture], August 2000, by David Locklear

[[Category:Classes of computers]]