Burrows–Wheeler transform

==Description==
The transform is done by constructing a matrix (known as the Burrows-Wheeler Matrix<ref name="langmead">{{cite web |last1=Langmead |first1=Ben |title=Burrows-Wheeler Transform and FM Index |url=https://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf |website=Johns Hopkins Whiting School of Engineering |access-date=23 April 2025}}</ref>) whose rows are the [[circular shift]]s of the input text, [[sorted]] in [[lexicographic order]], then taking the final column of that matrix.
When a [[character string (computer science)|character string]] is transformed by the BWT, the transformation [[Permutation|permutes]] the order of the characters. If the original string had several substrings that occurred often, then the transformed string will have several places where a single character is repeated multiple times in a row.
 
To allow the transform to be reversed, one additional step is necessary: either the index of the original string in the Burrows-Wheeler Matrix must be returned along with the transformed string (the approach shown in the original paper by Burrows and Wheeler<ref name=Burrows1994/>) or a special end-of-text character must be added at the start or end of the input text before the transform is executed.<ref name="langmead"/>
For example:
{| class="wikitable"
! Input
| <code>SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES</code>
|-
! Output
| <code>TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT</code><ref>{{cite web|url=https://github.com/adrien-mogenet/scala-bwt/blob/master/src/main/scala/me/algos/bwt/BurrowsWheelerCodec.scala|title=adrien-mogenet/scala-bwt|website=GitHub|access-date=19 April 2018}}</ref>
|}
The output is easier to compress because it has many repeated characters.
In this example the transformed string contains six runs of identical characters:
<code>XX</code>,
<code>SS</code>,
<code>PP</code>,
<code>..</code>,
<code>II</code>,
and
<code>III</code>, which together make 13 out of the 44 characters.
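
This construction can be sketched directly in code. The following Python sketch (a naive illustration, not an optimized implementation; the function names are chosen here for clarity and are not from the original paper) computes the transform in the index-returning variant described above, and inverts it by repeatedly prepending the transformed column and re-sorting:

```python
def bwt_with_index(s: str) -> tuple[str, int]:
    """Naive BWT: sort all rotations of s, return the last column and
    the row index of the original string (Burrows and Wheeler's variant)."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(rotation[-1] for rotation in rotations)
    return last_column, rotations.index(s)

def inverse_bwt(last_column: str, index: int) -> str:
    """Invert the transform by repeatedly prepending the last column to
    the table and re-sorting, then reading off the row at `index`."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    return table[index]
```

The matrix is never needed explicitly for the forward transform: sorting the rotations and reading off their final characters is enough. This naive inverse performs ''n'' sorts of ''n'' strings, so it is only suitable for small inputs.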
 
===Example===
The transform is done by [[sorting]] all the [[circular shift]]s of a text in [[lexicographic order]], extracting the last column, and recording the index of the original string in the list of sorted rotations of <code>S</code>.
 
Given an input string <code>S = <span style="color:red;">^</span>BANANA<span style="color:red;">$</span></code> (step 1 in the table below), rotate it ''N'' times (step 2), where <code>N = 8</code> is the length of the <code>S</code> string considering also the red <code><span style="color:red;">^</span></code> character representing the start of the string and the red <code><span style="color:red;">$</span></code> character representing the '[[End-of-file|EOF]]' pointer; these rotations, or circular shifts, are then sorted lexicographically (step 3). The output of the encoding phase is the last column <code>L = BNN<span style="color:red;">^</span>AA<span style="color:red;">$</span>A</code> after step 3, and the index (0-based) <code>I</code> of the row containing the original string <code>S</code>, in this case <code>I = 6</code>.
{| class="wikitable"
! 1. Input !! 2. All rotations !! 3. Sorted rotations !! Output (last column)
|-
| <code><span style="color:red;">^</span>BANANA<span style="color:red;">$</span></code>
| <code>^BANANA$<br />$^BANANA<br />A$^BANAN<br />NA$^BANA<br />ANA$^BAN<br />NANA$^BA<br />ANANA$^B<br />BANANA$^</code>
| <code>ANANA$^B<br />ANA$^BAN<br />A$^BANAN<br />BANANA$^<br />NANA$^BA<br />NA$^BANA<br />'''^BANANA$'''<br />$^BANANA</code>
| <code>BNN<span style="color:red;">^</span>AA<span style="color:red;">$</span>A</code>
|}
 
===Pseudocode===
 
The following [[pseudocode]] gives a simple (though inefficient) way to calculate the BWT and its inverse. It assumes that the input string <code>s</code> contains a special character 'EOF' which is the last character and occurs nowhere else in the text.
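
The same idea can be sketched in Python (a naive illustration under the stated assumption of a unique final EOF character; here <code>$</code> plays that role, and Python's default code-point ordering is used, so the exact output can differ from implementations that order characters differently):

```python
EOF = "$"  # assumed unique end-of-text marker; must not occur elsewhere in the input

def bwt(s: str) -> str:
    """Transform s (which must end with the unique EOF character)
    by sorting all rotations of s and taking the last column."""
    assert s.endswith(EOF) and s.count(EOF) == 1
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(r: str) -> str:
    """Rebuild the sorted-rotations table one column at a time;
    the row ending in EOF is the original string."""
    table = [""] * len(r)
    for _ in range(len(r)):
        table = sorted(r[i] + table[i] for i in range(len(r)))
    return next(row for row in table if row.endswith(EOF))
```

Under this ordering, <code>bwt("BANANA$")</code> returns <code>"ANNB$AA"</code>, which differs from the table in the Example section because that example orders <code>^</code> and <code>$</code> differently relative to the letters.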
 
==Explanation==
 
If the original string has several substrings that occur often, then the transformed string will have several places where a single character is repeated many times in a row, creating more easily compressible data.<ref>{{cite web|url=https://github.com/adrien-mogenet/scala-bwt/blob/master/src/main/scala/me/algos/bwt/BurrowsWheelerCodec.scala|title=adrien-mogenet/scala-bwt|website=GitHub|access-date=19 April 2018}}</ref> For example, consider transforming an English text that frequently contains the word "the":
{| class="wikitable"
! Input
| <code>THE.MAN.AND.THE.DOG.WAITED.AT.THE.STATION.FOR.THE.TRAIN.TO.THE.CITY</code>
|-
! Output
| <code>NDEENEEODTRNEGRWM..T.EN.HHHHHT.OTTTTTATAC.AOIATDIFOT.ASI..Y..A..I.T</code>
|}
 
Sorting the rotations of this text groups rotations starting with "he " together, and the last character of such a rotation (which is also the character before the "he ") will usually be "t" (though perhaps occasionally not, such as if the text contained "ache "), so the result of the transform will contain a run, or runs, of many consecutive "t" characters. Similarly, rotations beginning with "e " are grouped together, but "e " is often preceded by "h", so we see the output above contains a run of five consecutive "h" characters.
 
The success of this transform thus depends on one value having a high probability of occurring before a given sequence, so in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).
 
The remarkable thing about the BWT is not that it generates a more easily encoded output—an ordinary sort would do that—but that it does this ''reversibly'', allowing the original document to be re-generated from the last column data.
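
Reversal does not require rebuilding the whole matrix. A common technique, the last-to-first (LF) mapping, recovers the text from the last column alone after a single stable sort of its positions. The sketch below is illustrative (it assumes the index-returning variant of the transform) and runs in O(''n'' log ''n''):

```python
def inverse_bwt_lf(last_column: str, index: int) -> str:
    """Invert the BWT using the last-to-first (LF) mapping instead of
    rebuilding the full rotation matrix."""
    n = len(last_column)
    # Stable sort of positions by character: T[j] is the position in the
    # last column of the character that begins row j of the sorted matrix.
    T = sorted(range(n), key=lambda i: last_column[i])
    out = []
    row = index  # start at the row holding the original string
    for _ in range(n):
        # Step to the row holding the left rotation of the current row,
        # whose final character is the current row's first character.
        row = T[row]
        out.append(last_column[row])
    return "".join(out)
```

The stability of the sort is what makes this work: the ''i''-th occurrence of a character in the last column corresponds to the ''i''-th occurrence of that character in the first column, so repeated characters are always matched to the correct rows.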