Content deleted Content added
Tweak intro to and 1. correct claim about the inventor and date of invention - B&W's own paper says the transform was "discovered" by Wheeler all the way back in 1983, and simply wasn't published until later. 2. fix conflation of BSLDCA with BWT - see Talk:Burrows–Wheeler_transform#compression? and 3. reduce word count |
ShelfSkewed (talk | contribs) Unlinked ambiguous—common term used in an ordinary sense |
||
(5 intermediate revisions by 5 users not shown) | |||
Line 23:
| archiveurl=http://web.archive.org/web/20030105080431/https://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.html
| archivedate=January 5, 2003
}}</ref><ref name="u822">{{cite conference |
==Description==
The transform is done by constructing a matrix (known as the Burrows-Wheeler Matrix<ref name="langmead">{{cite web |last1=Langmead |first1=Ben |title=Burrows-Wheeler Transform and FM Index |url=https://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf |website=Johns Hopkins Whiting School of Engineering |access-date=23 April 2025}}</ref>) whose rows are the [[circular shift]]s of the input text,
To allow the transform to be reversed, one additional step is necessary: either the index of the original string in the Burrows-Wheeler Matrix must be returned along with the transformed string (the approach shown in the original paper by Burrows and Wheeler<ref name=Burrows1994/>) or a special end-of-text character must be added at the start or end of the input text before the transform is executed.<ref name="langmead"/>
Line 99:
==Explanation==
If the original string had several substrings that occurred often, then the BWT-transformed string will have several places where a single character is repeated many times in a row,<ref>{{cite web|url=https://github.com/adrien-mogenet/scala-bwt/blob/master/src/main/scala/me/algos/bwt/BurrowsWheelerCodec.scala|title=adrien-mogenet/scala-bwt|website=GitHub|access-date=19 April 2018}}</ref>
For example:
Line 301:
==Optimization==
A number of [[Optimization (computer science)|optimizations]] can make these algorithms run more efficiently without changing the output. There is no need to represent the table in either the encoder or decoder. In the encoder, each row of the table can be represented by a single pointer into the strings, and the sort performed using the indices. In the decoder, there is also no need to store the table, and
One may also make the observation that mathematically, the encoded string can be computed as a simple modification of the [[suffix array]], and suffix arrays can be computed with linear time and memory. The BWT can be defined with regards to the suffix array SA of text T as (1-based indexing):
Line 549:
===BWT for sequence alignment===
The advent of [[next-generation sequencing]] (NGS) techniques at the end of the 2000s decade has led to another application of the Burrows–Wheeler transformation. In NGS, [[DNA]] is fragmented into small pieces, of which the first few bases are [[DNA sequencing|sequenced]], yielding several millions of "reads", each 30 to 500 [[base pair]]s ("DNA characters") long. In many experiments, e.g., in [[ChIP-Seq]], the task is now to align these reads to a reference [[genome]], i.e., to the known, nearly complete sequence of the organism in question (which may be up to several billion base pairs long). A number of alignment programs, specialized for this task, were published, which initially relied on [[Hash function|hashing]] (e.g., [[Eland (software)|Eland]], SOAP,<ref name="Li, R2008">{{cite journal |author=Li R |title=SOAP: short oligonucleotide alignment program |journal=Bioinformatics |year=2008 |volume=24 |issue=5 |pages=713–714 |pmid=18227114 |doi=10.1093/bioinformatics/btn025|display-authors=etal|doi-access=free }}</ref> or Maq<ref name="Li, H2008">{{cite journal |vauthors=Li H, Ruan J, Durbin R |title=Mapping short DNA sequencing reads and calling variants using mapping quality scores |journal=Genome Research |volume=18 |issue=11 |pages=1851–1858 |date=2008-08-19 |pmid=18714091 |doi=10.1101/gr.078212.108 |pmc=2577856}}</ref>). In an effort to reduce the memory requirement for sequence alignment, several alignment programs were developed ([[Bowtie (sequence analysis)|Bowtie]],<ref name="Langmead2009">{{cite journal |vauthors=Langmead B, Trapnell C, Pop M, Salzberg SL |title=Ultrafast and memory-efficient alignment of short DNA sequences to the human genome |journal=Genome Biology |year=2009 |volume=10 |issue=3 |
===BWT for image compression===
The Burrows–Wheeler transformation has proved to be fundamental for [[image compression]] applications. For example,<ref name="Collin, P2019">{{cite book |vauthors=Collin P, Arnavut Z, Koc B |chapter=Lossless compression of medical images using Burrows–Wheeler Transformation with Inversion Coder
===BWT for compression of genomic databases===
Line 568:
* [https://web.archive.org/web/20170306035431/https://encode.ru/attachment.php?attachmentid=959&d=1249146089 Yuta's openbwt-v1.5.zip contains source code for various BWT routines including BWTS for bijective version]
* [https://arxiv.org/abs/0908.0239 On Bijective Variants of the Burrows–Wheeler Transform, by Kufleitner]
* [
* [https://www.youtube.com/watch?v=P3ORBMon8aw MIT open courseware lecture on BWT (Foundations of Computational and Systems Biology)]
* [https://github.com/abderrahimh/ARahim League Table Sort (LTS) or The Weighting algorithm to BWT by Abderrahim Hechachena]
|