Burrows–Wheeler transform: Difference between revisions

Content deleted Content added
Tweak intro to and 1. correct claim about the inventor and date of invention - B&W's own paper says the transform was "discovered" by Wheeler all the way back in 1983, and simply wasn't published until later. 2. fix conflation of BSLDCA with BWT - see Talk:Burrows–Wheeler_transform#compression? and 3. reduce word count
Unlinked ambiguous—common term used in an ordinary sense
 
(5 intermediate revisions by 5 users not shown)
Line 23:
| archiveurl=http://web.archive.org/web/20030105080431/https://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.html
| archivedate=January 5, 2003
}}</ref><ref name="u822">{{cite conference | lastlast1=Arnavut | firstfirst1=Z. | last2=Magliveras | first2=S.S. | title=Proceedings DCC '97. Data Compression Conference | chapter=Block sorting and compression | publisher=IEEE Comput. Soc. Press | date=1997 | page= | isbn=978-0-8186-7761-8 | doi=10.1109/DCC.1997.582009 | url=https://ieeexplore.ieee.org/document/582009/ | access-date=2025-05-07 | pagepages=181–190}}</ref>
 
==Description==
The transform is done by constructing a matrix (known as the Burrows-Wheeler Matrix<ref name="langmead">{{cite web |last1=Langmead |first1=Ben |title=Burrows-Wheeler Transform and FM Index |url=https://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf |website=Johns Hopkins Whiting School of Engineering |access-date=23 April 2025}}</ref>) whose rows are the [[circular shift]]s of the input text, [[sorted]] in [[lexicographic order]], then taking the final column of that matrix.
 
To allow the transform to be reversed, one additional step is necessary: either the index of the original string in the Burrows-Wheeler Matrix must be returned along with the transformed string (the approach shown in the original paper by Burrows and Wheeler<ref name=Burrows1994/>) or a special end-of-text character must be added at the start or end of the input text before the transform is executed.<ref name="langmead"/>
Line 99:
==Explanation==
 
If the original string had several substrings that occurred often, then the BWT-transformed string will have several places where a single character is repeated many times in a row,<ref>{{cite web|url=https://github.com/adrien-mogenet/scala-bwt/blob/master/src/main/scala/me/algos/bwt/BurrowsWheelerCodec.scala|title=adrien-mogenet/scala-bwt|website=GitHub|access-date=19 April 2018}}</ref>, creating more-easily-compressible data. For instance, consider transforming an English text frequently containing the word "the":
 
For example:
Line 301:
 
==Optimization==
A number of [[Optimization (computer science)|optimizations]] can make these algorithms run more efficiently without changing the output. There is no need to represent the table in either the encoder or decoder. In the encoder, each row of the table can be represented by a single pointer into the strings, and the sort performed using the indices. In the decoder, there is also no need to store the table, and inthe factdecoded nostring sortcan isbe neededgenerated atone all.character Inat a time proportionalfrom left to theright. alphabetComparative sizesorting andcan stringeven length,be theavoided decodedin stringfavor mayof belinear generatedsorting, onewith characterperformance atproportional ato timethe fromalphabet rightsize toand leftstring length. A "character" in the algorithm can be a byte, or a bit, or any other convenient size.
 
One may also make the observation that mathematically, the encoded string can be computed as a simple modification of the [[suffix array]], and suffix arrays can be computed with linear time and memory. The BWT can be defined with regards to the suffix array SA of text T as (1-based indexing):
Line 549:
 
===BWT for sequence alignment===
The advent of [[next-generation sequencing]] (NGS) techniques at the end of the 2000s decade has led to another application of the Burrows–Wheeler transformation. In NGS, [[DNA]] is fragmented into small pieces, of which the first few bases are [[DNA sequencing|sequenced]], yielding several millions of "reads", each 30 to 500 [[base pair]]s ("DNA characters") long. In many experiments, e.g., in [[ChIP-Seq]], the task is now to align these reads to a reference [[genome]], i.e., to the known, nearly complete sequence of the organism in question (which may be up to several billion base pairs long). A number of alignment programs, specialized for this task, were published, which initially relied on [[Hash function|hashing]] (e.g., [[Eland (software)|Eland]], SOAP,<ref name="Li, R2008">{{cite journal |author=Li R |title=SOAP: short oligonucleotide alignment program |journal=Bioinformatics |year=2008 |volume=24 |issue=5 |pages=713–714 |pmid=18227114 |doi=10.1093/bioinformatics/btn025|display-authors=etal|doi-access=free }}</ref> or Maq<ref name="Li, H2008">{{cite journal |vauthors=Li H, Ruan J, Durbin R |title=Mapping short DNA sequencing reads and calling variants using mapping quality scores |journal=Genome Research |volume=18 |issue=11 |pages=1851–1858 |date=2008-08-19 |pmid=18714091 |doi=10.1101/gr.078212.108 |pmc=2577856}}</ref>). In an effort to reduce the memory requirement for sequence alignment, several alignment programs were developed ([[Bowtie (sequence analysis)|Bowtie]],<ref name="Langmead2009">{{cite journal |vauthors=Langmead B, Trapnell C, Pop M, Salzberg SL |title=Ultrafast and memory-efficient alignment of short DNA sequences to the human genome |journal=Genome Biology |year=2009 |volume=10 |issue=3 |pagearticle-number=R25 |pmid=19261174 |doi=10.1186/gb-2009-10-3-r25 |pmc=2690996 |doi-access=free }}</ref> BWA,<ref name="Li, H2009">{{cite journal |vauthors=Li H, Durbin R |title=Fast and accurate short read alignment with Burrows–Wheeler Transform |journal=Bioinformatics |year=2009 |pmid=19451168 |volume=25 |issue=14 |pages=1754–1760 |doi=10.1093/bioinformatics/btp324 |pmc=2705234}}</ref> and SOAP2<ref name="Li, R2009">{{cite journal |author=Li R |title=SOAP2: an improved ultrafast tool for short read alignment |journal=Bioinformatics |year=2009 |pmid=19497933 |volume=25 |issue=15 |pages=1966–1967 |doi=10.1093/bioinformatics/btp336 |display-authors=etal|doi-access= }}</ref>) that use the Burrows–Wheeler transform.
 
===BWT for image compression===
The Burrows–Wheeler transformation has proved to be fundamental for [[image compression]] applications. For example,<ref name="Collin, P2019">{{cite book |vauthors=Collin P, Arnavut Z, Koc B |chapter=Lossless compression of medical images using Burrows–Wheeler Transformation with Inversion Coder |chapter-url=https://ieeexplore.ieee.org/document/7319012| title=2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)|journal=<!-- --> |date=2015 |volume=2015 |pages=2956–2959 |pmid=26736912 |doi=10.1109/EMBC.2015.7319012 |isbn=978-1-4244-9271-8 |s2cid=4460328 }}</ref> Showed a compression pipeline based on the application of the Burrows–Wheeler transformation followed by inversion, run-length, and arithmetic encoders. The pipeline developed in this case is known as Burrows–Wheeler transform with an inversion encoder (BWIC). The results shown by BWIC are shown to outperform the compression performance of well-known and widely used algorithms like [[Lossless JPEG]] and [[JPEG 2000]]. BWIC is shown to outperform those in terms of final compression size of radiography medical images on the order of 5.1% and 4.1% respectively. The improvements are achieved by combining BWIC and a pre-BWIC scan of the image in a vertical snake order fashion. More recently, additional works have shown the implementation of the Burrows–Wheeler Transform in conjunction with the known [[move-to-front transform]] (MTF) achieve near lossless compression of images. <ref name="Devadoss, CP2019">{{cite journal |vauthors=Devadoss CP, Sankaragomathi B |title=Near lossless medical image compression using block BWT–MTF and hybrid fractal compression techniques |url=https://link.springer.com/article/10.1007/s10586-018-1801-3| journal=Cluster Computing| date=2019| volume=22 |pages=12929–12937 |doi=10.1007/s10586-018-1801-3|s2cid=33687086 |url-access=subscription }}</ref>
 
===BWT for compression of genomic databases===
Line 568:
* [https://web.archive.org/web/20170306035431/https://encode.ru/attachment.php?attachmentid=959&d=1249146089 Yuta's openbwt-v1.5.zip contains source code for various BWT routines including BWTS for bijective version]
* [https://arxiv.org/abs/0908.0239 On Bijective Variants of the Burrows–Wheeler Transform, by Kufleitner]
* [httphttps://google-opensource.blogspot.com/2008/06/debuting-dcs-bwt-experimental-burrows.html Blog post] and [https://code.google.com/p/dcs-bwt-compressor/ project page] for an open-source compression program and library based on the Burrows–Wheeler algorithm
* [https://www.youtube.com/watch?v=P3ORBMon8aw MIT open courseware lecture on BWT (Foundations of Computational and Systems Biology)]
* [https://github.com/abderrahimh/ARahim League Table Sort (LTS) or The Weighting algorithm to BWT by Abderrahim Hechachena]