Content deleted Content added
"More importantly" -> "Importantly". For one thing, it's not remotely clear which preceding fact the reversibility of BWT is meant to be "more" important than, but whatever the answer to that, it's a dubious subjective claim. |
Rearrange and flesh out early content a bit. The "Description" section actually didn't contain a full description of the algorithm before, just an example, before the description finally came in the "Example" section, which was weird! Also we had multiple sections explaining why the algo tends to produce runs of a single character repeated many times, so I've merged them into a single such explanation. Tag: Disambiguation links added |
||
Line 26:
==Description==
The transform is done by constructing a matrix (known as the Burrows-Wheeler Matrix<ref name="langmead">{{cite web |last1=Langmead |first1=Ben |title=Burrows-Wheeler Transform and FM Index |url=https://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf |website=Johns Hopkins Whiting School of Engineering |access-date=23 April 2025}}</ref>) whose rows are the [[circular shift]]s of the input text, [[sorted]] in [[lexicographic order]], then taking the final column of that matrix.
To allow the transform to be reversed, one additional step is necessary: either the index of the original string in the Burrows-Wheeler Matrix must be returned along with the transformed string (the approach shown in the original paper by Burrows and Wheeler<ref name=Burrows1994/>) or a special end-of-text character must be added at the start or end of the input text before the transform is executed.<ref name="langmead"/>
For example:▼
{| class="wikitable"▼
! Input▼
|-▼
! Output▼
| <code>TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT</code><ref>{{cite web|url=https://github.com/adrien-mogenet/scala-bwt/blob/master/src/main/scala/me/algos/bwt/BurrowsWheelerCodec.scala|title=adrien-mogenet/scala-bwt|website=GitHub|access-date=19 April 2018}}</ref>▼
|}▼
===Example===
Given an input string <code>S = <span style="color:red;">^</span>BANANA<span style="color:red;">$</span></code> (step 1 in the table below), rotate it ''N'' times (step 2), where <code>N = 8</code> is the length of the <code>S</code> string considering also the red <code><span style="color:red;">^</span></code> character representing the start of the string and the red <code><span style="color:red;">$</span></code> character representing the '[[End-of-file|EOF]]' pointer; these rotations, or circular shifts, are then sorted lexicographically (step 3). The output of the encoding phase is the last column <code>L = BNN<span style="color:red;">^</span>AA<span style="color:red;">$</span>A</code> after step 3, and the index (0-based) <code>I</code> of the row containing the original string <code>S</code>, in this case <code>I = 6</code>.
Line 96 ⟶ 79:
BNN<span style="color:red;">^</span>AA<span style="color:red;">$</span>A
|}
===Pseudocode===
The following [[pseudocode]] gives a simple (though inefficient) way to calculate the BWT and its inverse. It assumes that the input string <code>s</code> contains a special character 'EOF' which is the last character and occurs nowhere else in the text.
Line 113 ⟶ 98:
==Explanation==
▲
▲For example:
▲{| class="wikitable"
▲! Input
| <code>THE.MAN.AND.THE.DOG.WAITED.AT.THE.STATION.FOR.THE.TRAIN.TO.THE.CITY</code>
▲|-
▲! Output
| <code>NDEENEEODTRNEGRWM..T.EN.HHHHHT.OTTTTTATAC.AOIATDIFOT.ASI..Y..A..I.T</code>
▲|}
Sorting the rotations of this text groups rotations starting with "he " together, and the last character of such a rotation (which is also the character before the "he ") will usually be "t" (though perhaps occasionally not, such as if the text contained "ache "), so the result of the transform will contain a run, or runs, of many consecutive "t" characters. Similarly, rotations beginning with "e " are grouped together, but "e " is often preceded by "h", so we see the output above contains a run of five consecutive "h" characters.
Thus it can be seen that the success of this transform depends upon one value having a high probability of occurring before a sequence, so that in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).
The remarkable thing about the BWT is not that it generates a more easily encoded output—an ordinary sort would do that—but that it does this ''reversibly'', allowing the original document to be re-generated from the last column data.
|