{{Short description|Code added to allow recovery of lost data}}
{{More citations needed|date=August 2023}}
In [[coding theory]], an '''erasure code''' is a [[forward error correction]] (FEC) code under the assumption of bit erasures (rather than bit errors), which transforms a message of ''k'' symbols into a longer message (code word) with ''n'' symbols such that the original message can be recovered from a subset of the ''n'' symbols. The fraction ''r'' = ''k''/''n'' is called the [[code rate]]. The fraction ''k’/k'', where ''k’'' denotes the number of symbols required for recovery, is called [[reception efficiency]]. The recovery algorithm expects that it is known which of the ''n'' symbols are lost.
 
== History ==
 
Erasure coding was invented by [[Irving Reed]] and [[Gustave Solomon]] in 1960.<ref name="xinnor">
{{Cite web |date=2023-09-03 |title=RAID vs. Erasure Coding. What's the Difference? {{!}} Blog {{!}} Xinnor |url=https://xinnor.io/blog/a-guide-to-raid-pt-4-raid-vs-erasure-coding-whats-the-difference/ |access-date=2024-09-18 |website=The Fastest and Most Reliable Software RAID {{!}} Xinnor |language=en}}
</ref>
 
There are many different erasure coding schemes. The most popular erasure codes are
[[Reed-Solomon coding]],
[[Low-density parity-check code]] (LDPC codes), and
[[Turbo codes]].<ref name="xinnor" />
 
As of 2023, modern data storage systems can be designed to tolerate the complete failure of a few disks without data loss, using one of three approaches:<ref>
{{Cite web |date=2014-04-07 |title=Ceph.io — Erasure Coding in Ceph |url=https://ceph.io/en/news/blog/2014/erasure-coding-in-ceph-2/ |access-date=2024-09-18 |website=ceph.io |language=en}}
</ref><ref name="bdr">
{{Cite web |last=Lee |first=Brandon |date=2023-12-26 |title=RAID vs Erasure Coding vs Replication |url=https://www.bdrsuite.com/blog/raid-vs-erasure-coding-vs-replication/ |access-date=2024-09-18 |website=BDRSuite |language=en-US}}
</ref><ref name="jor">
{{Cite web |last=O'Reilly |first=Jim |title=RAID Vs. Erasure Coding |url=https://www.networkcomputing.com/data-center-networking/raid-vs-erasure-coding |access-date=2024-09-18 |website=www.networkcomputing.com |language=en}}
</ref>
* Replication
* RAID
* Erasure Coding
 
While technically RAID can be seen as a kind of erasure code,<ref>
Dimitri Pertin, Alexandre van Kempen, Benoît Parrein, Nicolas Normand. [https://hal.science/hal-01162047/document "Comparison of RAID-6 Erasure Codes"].
The third Sino-French Workshop on Information and Communication Technologies,
SIFWICT 2015, Jun 2015, Nantes, France. ffhal-01162047f
</ref>
"RAID" is generally applied to an array attached to a single host computer (which is a single point of failure), while "erasure coding" generally implies multiple hosts,<ref name="bdr" />
sometimes called a [[Redundant Array of Inexpensive Servers]] (RAIS).
The erasure code allows operations to continue when any one of those hosts stops.<ref name="jor" /><ref>
{{Cite web |title=Understanding IBM Spectrum Scale Erasure Code Edition fault tolerance |url=https://www.ibm.com/docs/en/storage-scale-ece/5.1.3?topic=issece-understanding-spectrum-scale-erasure-code-edition-fault-tolerance |access-date=2024-09-18 |website=www.ibm.com |language=en-us}}
</ref>
 
Compared to block-level RAID systems, object storage erasure coding has some significant differences that make it more resilient.<ref>
{{Cite web |date=2021-07-27 |title=Object Storage Erasure Coding vs. Block Storage RAID |url=https://blog.min.io/erasure-coding-vs-raid/ |access-date=2024-09-18 |website=MinIO Blog |language=en}}
</ref><ref>
{{Cite web |title=Erasure coding vs Raid as a data protection method {{!}} Computer Weekly |url=https://www.computerweekly.com/feature/Erasure-coding-versus-RAID-as-a-data-protection-method |access-date=2024-09-18 |website=ComputerWeekly.com |language=en}}
</ref><ref>
{{Cite web |last=Kruth |first=Peter |date=2023-10-04 |title=Erasure Code: RAID As It Should Be – Huawei BLOG |url=https://blog.huawei.com/2021/11/30/erasure-code-raid-as-it-should-be/ |access-date=2024-09-18 |archive-url=https://web.archive.org/web/20231004193238/https://blog.huawei.com/2021/11/30/erasure-code-raid-as-it-should-be/ |archive-date=2023-10-04 }}
</ref><ref>
{{Cite web |date=2022-04-25 |title=Erasure Coding 101 |url=https://blog.min.io/erasure-coding/ |access-date=2024-09-18 |website=MinIO Blog |language=en}}
</ref><ref>
{{Cite web |last=Bhaskaran |first=Dinesh Kumar |date=July 6, 2018 |title=Why erasure coding is the future of data resiliency |url=https://www.datacenterdynamics.com/en/opinions/why-erasure-coding-is-the-future-of-data-resiliency/ |url-status=dead |archive-url=https://web.archive.org/web/20200807015110/https://www.datacenterdynamics.com/en/opinions/why-erasure-coding-is-the-future-of-data-resiliency/ |archive-date=August 7, 2020}}
</ref>
 
==Optimal erasure codes==
{{Technical|section|date=August 2023}}
Optimal erasure codes have the property that any ''k'' out of the ''n'' code word symbols are sufficient to recover the original message (i.e., they have optimal reception efficiency). Optimal erasure codes are [[Maximum distance separable code#MDS codes|maximum distance separable codes]] (MDS codes).
Optimal codes are often costly (in terms of memory usage, CPU time, or both) when ''n'' is large.
 
===Parity check===
{{see also|Parity bit}}
Parity check is the special case where ''n''&nbsp;=&nbsp;''k''&nbsp;+&nbsp;1. From a set of ''k'' values <math>\{v_i\}_{1\leq i \leq k}</math>, a checksum is computed and appended to the ''k'' source values:
:<math>v_{k+1}= - \sum_{i=1}^k v_i.</math>
The set of ''k''&nbsp;+&nbsp;1 values <math>\{v_i\}_{1\leq i \leq k+1}</math> is now consistent with regard to the checksum.
If one of these values, <math>v_e</math>, is erased, it can be easily recovered by summing the remaining variables:
:<math>v_{e}= - \sum_{i=1 ,i \neq e }^{k+1}v_i.</math>
 
[[RAID 5]] is a widely used application of the parity check erasure code.<ref name="xinnor" />
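
The parity-check construction above can be illustrated with a short Python sketch. It uses ordinary integer arithmetic for readability; practical implementations usually work bytewise over a finite field (where addition is XOR), and the function names here are illustrative only.

<syntaxhighlight lang="python">
def encode(values):
    """Append the checksum v_{k+1} = -(v_1 + ... + v_k) to the k source values."""
    return values + [-sum(values)]

def recover(received):
    """Fill in a single erased symbol, marked as None, by summing the remaining ones."""
    missing = received.index(None)
    value = -sum(v for v in received if v is not None)
    return received[:missing] + [value] + received[missing + 1:]

codeword = encode([3, 1, 4, 1, 5])   # [3, 1, 4, 1, 5, -14]
damaged = [3, 1, None, 1, 5, -14]    # one symbol erased in transit
print(recover(damaged))              # [3, 1, 4, 1, 5, -14]
</syntaxhighlight>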
 
===Polynomial oversampling===
 
==== Example: Err-mail (''k''&nbsp;=&nbsp;2) ====
In the simple case where ''k''&nbsp;=&nbsp;2, redundancy symbols may be created by sampling different points along the line between the two original symbols. This is pictured with a simple example, called err-mail:
 
[[Alice and Bob|Alice]] wants to send her telephone number (555629) to [[Alice and Bob|Bob]] using err-mail. Err-mail works just like e-mail, except
#About half of all the mail gets lost.
#Messages longer than 5 characters are illegal.
#It is very expensive (similar to air-mail).
 
Instead of asking Bob to acknowledge the messages she sends, Alice devises the following scheme.
 
#She breaks her telephone number up into two parts ''a'' = 555, ''b'' = 629, and sends 2 messages – "A=555" and "B=629" – to Bob.
#She constructs a linear function, <math>f(i) = a + (b-a)(i-1)</math>, in this case <math>f(i) = 555 + 74(i-1)</math>, such that <math>f(1) = 555</math> and <math>f(2) = 629</math>.
#She computes the values ''f''(3), ''f''(4), and ''f''(5), and then transmits three redundant messages: "C=703", "D=777" and "E=851".
[[File:Code d'effacement optimal 1.gif]]
 
Bob knows that the form of ''f''(''i'') is <math>f(i) = a + (b-a)(i-1)</math>, where ''a'' and ''b'' are the two parts of the telephone number.
Now suppose Bob receives "D=777" and "E=851".
 
[[File:Code d'effacement optimal 2.gif]]
 
Bob can reconstruct Alice's phone number by computing the values of ''a'' and ''b'' from the values (''f''(4) and ''f''(5)) he has received.
Bob can perform this procedure using any two err-mails, so the erasure code in this example has a rate of 40%.
 
Note that Alice cannot encode her telephone number in just one err-mail, because it contains six characters, and that the maximum length of one err-mail message is five characters. If she sent her phone number in pieces, asking Bob to acknowledge receipt of each piece, at least four messages would have to be sent anyway (two from Alice, and two acknowledgments from Bob). So the erasure code in this example, which requires five messages, is quite economical.
 
This example is a little bit contrived, in that the linear function itself effectively contains Alice's telephone number (555&nbsp;+&nbsp;74&nbsp;=&nbsp;629). For truly generic erasure codes that work over any data set, we would need something other than the ''f''(''i'') given.
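
The err-mail scheme above can be written out as a minimal Python sketch; the function names are illustrative only, and any two received samples suffice to recover the message.

<syntaxhighlight lang="python">
def encode(a, b, n=5):
    """Evaluate f(i) = a + (b - a)(i - 1) for i = 1..n."""
    return {i: a + (b - a) * (i - 1) for i in range(1, n + 1)}

def decode(i, fi, j, fj):
    """Recover (a, b) from any two received samples (i, f(i)) and (j, f(j))."""
    slope = (fj - fi) // (j - i)      # this is b - a
    a = fi - slope * (i - 1)
    return a, a + slope

messages = encode(555, 629)           # {1: 555, 2: 629, 3: 703, 4: 777, 5: 851}
print(decode(4, messages[4], 5, messages[5]))   # (555, 629)
</syntaxhighlight>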
 
====General case====
The linear construction above can be generalized to [[polynomial interpolation]]. Additionally, points are now computed over a [[finite field]].
 
First we choose a finite field ''F'' with order of at least ''n'', but usually a power of 2. The sender numbers the data symbols from 0 to ''k''&nbsp;&minus;&nbsp;1 and sends them. He then constructs a [[Lagrange polynomial|(Lagrange) polynomial]] ''p''(''x'') of degree less than ''k'' such that ''p''(''i'') is equal to data symbol ''i''. He then sends ''p''(''k''), ..., ''p''(''n''&nbsp;&minus;&nbsp;1). The receiver can now also use polynomial interpolation to recover the lost packets, provided he receives ''k'' symbols successfully. If the order of ''F'' is less than 2<sup>''b''</sup>, where ''b'' is the number of bits in a symbol, then multiple polynomials can be used.
 
The sender can construct symbols ''k'' to ''n''&nbsp;&minus;&nbsp;1 'on the fly', i.e., distribute the workload evenly between transmission of the symbols. If the receiver wants to do his calculations 'on the fly', he can construct a new polynomial ''q'', such that ''q''(''i'') = ''p''(''i'') if symbol ''i'' < ''k'' was received successfully and ''q''(''i'')&nbsp;=&nbsp;0 when symbol ''i''&nbsp;<&nbsp;''k'' was not received. Now let ''r''(''i'') = ''p''(''i'')&nbsp;&minus;&nbsp;''q''(''i''). Firstly we know that ''r''(''i'')&nbsp;=&nbsp;0 if symbol ''i'' < ''k'' has been received successfully. Secondly, if symbol ''i''&nbsp;≥&nbsp;''k'' has been received successfully, then ''r''(''i'')&nbsp;=&nbsp;''p''(''i'')&nbsp;&minus;&nbsp;''q''(''i'') can be calculated. So we have enough data points to construct ''r'' and evaluate it to find the lost packets. So both the sender and the receiver require ''O''(''n'' (''n''&nbsp;&minus;&nbsp;''k'')) operations and only ''O''(''n''&nbsp;&minus;&nbsp;''k'') space for operating 'on the fly'.
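
A minimal Python sketch of this general construction uses Lagrange interpolation over a prime field rather than the GF(2<sup>''b''</sup>) fields used in practice; the prime modulus and function names below are illustrative assumptions, chosen only to keep the arithmetic readable (requires Python 3.8+ for the modular inverse via <code>pow</code>).

<syntaxhighlight lang="python">
P = 257  # illustrative prime of order >= n; data symbols must be smaller than P

def interpolate(points, x, p=P):
    """Evaluate the unique polynomial through `points` at position x, mod p."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, -1, p)) % p
    return total

def encode(data, n, p=P):
    """Data symbols are p(0)..p(k-1); redundancy symbols are p(k)..p(n-1)."""
    pts = list(enumerate(data))
    return data + [interpolate(pts, x, p) for x in range(len(data), n)]

def decode(received, k, p=P):
    """`received` maps symbol index -> symbol value; any k entries suffice."""
    pts = list(received.items())[:k]
    return [interpolate(pts, x, p) for x in range(k)]

codeword = encode([10, 20, 30], n=5)                                  # k = 3, n = 5
print(decode({0: codeword[0], 3: codeword[3], 4: codeword[4]}, k=3))  # [10, 20, 30]
</syntaxhighlight>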
 
====Real-world implementation====
This process is implemented by [[Reed–Solomon error correction|Reed–Solomon codes]], with code words constructed over a [[finite field]] using a [[Vandermonde matrix]].
 
Most practical erasure codes are [[systematic code]]s &ndash; each one of the original ''k'' symbols can be found copied, unencoded, as one of the ''n'' message symbols.<ref name="vinayak" />
(Erasure codes that support [[secret sharing]] never use a systematic code.)
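
Evaluating the message polynomial at ''n'' fixed points is equivalent to multiplying the ''k'' message symbols by an ''n''&nbsp;×&nbsp;''k'' [[Vandermonde matrix]]. The sketch below again uses a small prime field rather than the GF(2<sup>8</sup>) arithmetic of real Reed–Solomon implementations; the field, evaluation points, and names are illustrative assumptions. In practice, a systematic layout can be obtained by transforming the matrix so that its first ''k'' rows form the identity.

<syntaxhighlight lang="python">
P = 257                      # illustrative prime field; real codes use GF(2^8)
K, N = 3, 5                  # k data symbols, n total symbols

# n-by-k Vandermonde generator matrix: row x is (x^0, x^1, ..., x^(k-1)).
G = [[pow(x, j, P) for j in range(K)] for x in range(N)]

def encode(message):
    """Code word = G * message (mod P); any k rows of G form an invertible matrix."""
    return [sum(g * m for g, m in zip(row, message)) % P for row in G]

print(encode([10, 20, 30]))   # 5 symbols, any 3 of which determine the message
</syntaxhighlight>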
 
==Near-optimal erasure codes==
''Near-optimal erasure codes'' produce ''n''&nbsp;=&nbsp;''k''/''r'' code symbols from ''k'' source symbols and require (1&nbsp;+&nbsp;ε)''k'' of those symbols to recover the message (where ε&nbsp;>&nbsp;0). Reducing ε can be done at the cost of CPU time.
Near-optimal erasure codes trade correction capabilities for computational complexity: practical algorithms can encode and decode with linear time complexity.
 
[[Fountain code]]s (also known as ''rateless erasure codes'') are notable examples of near-optimal erasure codes. They can transform a ''k''-symbol message into a practically infinite encoded form, i.e., they can generate an arbitrary amount of redundancy symbols that can all be used for error correction. Receivers can start decoding after they have received slightly more than ''k'' encoded symbols.
 
[[Regenerating code]]s address the issue of rebuilding (also called repairing) lost encoded fragments from existing encoded fragments. This issue occurs in distributed
storage systems where communication to maintain encoded redundancy is a problem.<ref name="vinayak" />
 
== Applications of erasure coding in storage systems ==
 
Erasure coding is now standard practice for reliable data storage.<ref>
[https://www.usenix.org/conference/fast16/training-program/session/erasure-encoding%E2%80%94practice-and-principles "Erasure Encoding—Practice and Principles"].
2016.
</ref><ref>
Matt Sarrel.
[https://blog.min.io/erasure-coding/ "Erasure Coding 101"].
2022.
</ref><ref name="bbeach" >
Brian Beach.
[https://www.backblaze.com/blog/reed-solomon/ "Backblaze Open-sources Reed-Solomon Erasure Coding Source Code"].
2015.
</ref>
In particular, various implementations of Reed-Solomon erasure coding are used by [[Apache Hadoop]], the [[RAID-6]] built into Linux, Microsoft Azure, Facebook cold storage, and Backblaze Vaults.<ref name="bbeach" /><ref name="vinayak" >
Rashmi Vinayak.
[https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-155.html "Erasure Coding for Big-data Systems: Theory and Practice"].
2016.
p. 2: section "Abstract".
p. 9: section "Systematic codes".
p. 12: section "Regenerating codes".
</ref>
 
The classical way to recover from failures in storage systems was to use replication. However, replication incurs significant overhead in terms of wasted bytes. Therefore, increasingly large storage systems, such as those used in data centers, use erasure-coded storage. The most common form of erasure coding used in storage systems is the [[Reed–Solomon error correction|Reed–Solomon (RS) code]], which regenerates missing data from pieces of known data together with redundant pieces called parity blocks. In a (''k'',&nbsp;''m'') RS code, a given set of ''k'' data blocks, called "chunks", is encoded into (''k''&nbsp;+&nbsp;''m'') chunks. The total set of chunks comprises a ''stripe''. The coding is done such that as long as at least ''k'' out of the (''k''&nbsp;+&nbsp;''m'') chunks are available, one can recover the entire data. This means a (''k'',&nbsp;''m'') RS-encoded storage system can tolerate up to ''m'' failures.
 
'''Example:''' In the RS (10,&nbsp;4) code, which is used by Facebook for their [[HDFS]],<ref>{{Cite book|last1=Xia|first1=Mingyuan|last2=Saxena|first2=Mohit|last3=Blaum|first3=Mario|last4=Pease|first4=David A.|date=2015|title=A Tale of Two Erasure Codes in HDFS|url=https://www.usenix.org/conference/fast15/technical-sessions/presentation/xia|language=en|pages=213–226|isbn=978-1-931971-20-1}}</ref> 10 MB of user data is divided into ten 1&nbsp;MB blocks. Then, four additional 1&nbsp;MB parity blocks are created to provide redundancy. This can tolerate up to 4 concurrent failures. The storage overhead here is 14/10&nbsp;=&nbsp;1.4×.
 
In the case of a fully replicated system, the 10 MB of user data will have to be replicated 4 times to tolerate up to 4 concurrent failures. The storage overhead in that case will be 50/10&nbsp;=&nbsp;5 times.
 
This gives an idea of the lower storage overhead of erasure-coded storage compared to full replication, and thus its attraction in today's storage systems.
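
The overhead arithmetic above can be checked with a small back-of-the-envelope sketch; the function names are illustrative only.

<syntaxhighlight lang="python">
def erasure_overhead(k, m):
    """A (k, m) erasure-coded stripe stores k + m chunks for k chunks of data."""
    return (k + m) / k

def replication_overhead(m):
    """Tolerating m concurrent failures with full copies needs m + 1 replicas."""
    return m + 1

print(erasure_overhead(10, 4))    # 1.4 -> 14 MB stored per 10 MB of user data
print(replication_overhead(4))    # 5   -> 50 MB stored per 10 MB of user data
</syntaxhighlight>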
 
Initially, erasure codes were used to reduce the cost of storing "cold" (rarely-accessed) data efficiently; but erasure codes can also be used to improve performance serving "hot" (more-frequently-accessed) data.<ref name="vinayak" />
 
[[RAID]] N+M divides data blocks across N+M drives, and can recover all the data when any M drives fail.<ref name="xinnor" />
In particular, RAID 7.3 refers to triple-parity RAID, and can recover all the data when any 3 drives fail.<ref>
Adam Leventhal.
[https://dl.acm.org/doi/pdf/10.1145/1661785.1670144 "Triple-parity RAID and beyond"].
2009.
</ref>
 
==Examples==
 
Here are some examples of implementations of the various codes:
 
===Near optimal erasure codes===
*[[Tornado codes]]
*[[Low-density parity-check codes]]
 
===Near optimal [[Fountain code|fountain]] (rateless erasure) codes===
*[[Fountain code]]
*[[Online codes]]
*[[LT codes]]
*[[Raptor codes]]
*[[Linear_network_coding|Network Codes]]
*[[Tornado codes]]
*[[Triangular network coding|Triangular codes]]
 
===Optimal erasure codes===
*[[Parity (telecommunication)|Parity]]: used in [[RAID]] storage systems.
*[[Parchive]]
*[[Optimal erasure codes with arbitrary parameters]] are surprisingly simple.
*[[Tahoe-LAFS]] includes [https://pypi.python.org/pypi/zfec zfec]
*[[Reed–Solomon|Reed–Solomon codes]]
*[http://switzernet.com/people/emin-gabrielyan/051103-erasure-9-5-resilient/ Erasure Resilient Systematic Code], an MDS code outperforming Reed–Solomon in the maximal number of redundant packets,{{According to whom|date=August 2023}} see [http://switzernet.com/people/emin-gabrielyan/051025-erasure-resilient/ RS(4,2) with 2 bits] or [http://switzernet.com/people/emin-gabrielyan/051027-erasure-9-2-resilient/ RS(9,2) with 3 bits]
*[[Regenerating Codes]]<ref name="regcodes">{{cite journal|last1=Dimakis|first1=Alexandros G.|last2=Godfrey|first2=P. Brighten|last3=Wu|first3=Yunnan|last4=Wainwright|first4=Martin J.|last5=Ramchandran|first5=Kannan|title=Network Coding for Distributed Storage Systems|journal=IEEE Transactions on Information Theory|date=September 2010|volume=56|issue=9|pages=4539–4551|doi=10.1109/TIT.2010.2054295|arxiv=cs/0702015|citeseerx=10.1.1.117.6892|s2cid=260559901 }}</ref><ref>{{Cite web |date=2017-07-31 |title=home [Erasure Coding for Distributed Storage Wiki] |url=http://storagewiki.ece.utexas.edu/doku.php |access-date=2023-08-20 |archive-url=https://web.archive.org/web/20170731144507/http://storagewiki.ece.utexas.edu/doku.php |archive-date=2017-07-31 }}</ref>
*any other [[maximum distance separable code]]
 
==See also==
*[[Forward error correction]] codes.
*[[Secret sharing]] (differs in that the original secret is encrypted and obscured until the decode quorum is reached)
*[[Spelling alphabet]]
* [[Binary erasure channel]]
 
==References==
{{Reflist}}
 
==External links==
*[http://jerasure.org/ Jerasure] is a Free Software library implementing Reed-Solomon and Cauchy erasure code techniques with SIMD optimisations.
*[https://web.archive.org/web/20160303184031/http://info.iet.unipi.it/~luigi/fec.html Software FEC in computer communications] by Luigi Rizzo describes optimal erasure correction codes
*[http://feclib.sourceforge.net Feclib] is a near optimal extension to Luigi Rizzo's work that uses band matrices. Many parameters can be set, like the size of the width of the band and size of the finite field. It also successfully exploits the large [[Processor register|register]] size of modern CPUs. How it compares to the near optimal codes mentioned above is unknown.
*[https://archive.today/20140620164304/http://storagewiki.ece.utexas.edu/?id=wiki:definitions:repair_problem Coding for Distributed Storage wiki] for regenerating codes and rebuilding erasure codes.
*[http://ecip.org/ ECIP "Erasure Code Internet Protocol"], developed in 1996, was the first use of FEC ("Forward Error Correction") on the Internet. It was first used commercially to [http://www.dnull.com/~sokol/clarke.html stream live video of Sir Arthur C. Clarke from Sri Lanka to UIUC in Illinois].
 
[[Category:Coding theory]]
 
[[zh:抹除碼]]