Content deleted Content added
XOR'easter (talk | contribs) Undid revision 1260742911 by JulioISalazarG (talk): a source for the claim "The phrase '''full-text index''' is also often used" needs, at a bare minimum, to use the phrase "full-text index" |
expand; rm non-bluelinked entries |
||
Line 1:
{{Short description|Data structure}}
{{more references|date=December 2021}}
In [[computer science]], a '''substring index''' is a [[data structure]] which gives [[substring]] search in a text or text collection in [[sublinear]] time. If you have a document <math>S</math> of length <math>n</math>, or a set of documents <math>D=\{S^1,S^2, \dots, S^d\}</math>,
These data structures typically treat their text and pattern as [[string (computer science)|strings]] over a fixed alphabet, and search for locations where the pattern occurs as a substring of the text. The alphabet may consist of characters (for instance in [[Unicode]]) but in practical applications for [[text retrieval]] it may be preferable to treat the ([[Stemming|stemmed]]) words of a document as the elements of the alphabet, because doing this reduces the lengths of both the text and pattern as measured in letters of their alphabet.
The phrase '''full-text index''' is also often used for an index of all substrings of a text. But this is ambiguous, as it is also used for regular word indexes such as [[inverted file]]s and [[document retrieval]]. See [[full text search]].▼
▲The phrase '''full-text index''' is
Specific data structures that can be used as substring indexes include:
* The [[suffix tree]], a [[radix tree]] of the suffixes of the string, allowing substring search to be performed character-by-character<ref name=gv/>
* The [[suffix array]], a sorted array of the starting positions of suffixes of the string, allowing substring search to be performed by [[binary search]]<ref name=gv/>
* The [[compressed suffix array]], a data structure that combines [[data compression]] with the suffix array, allowing the structure to be stored in space sublinear in the text length<ref name=gv>{{citation
| last1 = Grossi | first1 = Roberto
| last2 = Vitter | first2 = Jeffrey Scott | author2-link = Jeffrey Vitter
| doi = 10.1137/S0097539702402354
| issue = 2
| journal = SIAM Journal on Computing
| mr = 2191449
| pages = 378–407
| title = Compressed suffix arrays and suffix trees with applications to text indexing and string matching
| url = https://www.di.unipi.it/~grossi/PAPERS/sicomp05.pdf
| volume = 35
| year = 2005}}</ref>
* The [[FM-index]], another compressed substring index based on the [[Burrows–Wheeler transform]] and closely related to the suffix array
== References ==
|