Substring index: Difference between revisions

Content deleted Content added
Undid revision 1260742911 by JulioISalazarG (talk): a source for the claim "The phrase '''full-text index''' is also often used" needs, at a bare minimum, to use the phrase "full-text index"
Citation bot (talk | contribs)
Add: hdl, isbn. | Use this bot. Report bugs. | Suggested by Dominic3203 | Category:String data structures | #UCB_Category 2/16
 
(13 intermediate revisions by one other user not shown)
Line 1:
{{Short description|Data structure}}
In [[computer science]], a '''substring index''' is a [[data structure]] which gives [[substring]] search in a text or text collection in [[sublinear]] time. IfOnce youconstructed havefrom a document <math>S</math> of length <math>n</math>, or a set of documents <math>D=\{S^1,S^2, \dots,a S^d\}</math>substring ofindex totalcan lengthbe <math>n</math>,used you canto locate all occurrences of a pattern <math>P</math>in time linear or near-linear in <math>o(n)</math>the time.pattern (Seesize, [[Bigwith Ono notation|Bigdependence ''O''or notation]]only logarithmic dependence on the document size.)<ref name=bst>{{citation
{{more references|date=December 2021}}
| last1 = Barsky | first1 = Marina
In [[computer science]], a '''substring index''' is a [[data structure]] which gives [[substring]] search in a text or text collection in [[sublinear]] time. If you have a document <math>S</math> of length <math>n</math>, or a set of documents <math>D=\{S^1,S^2, \dots, S^d\}</math> of total length <math>n</math>, you can locate all occurrences of a pattern <math>P</math> in <math>o(n)</math> time. (See [[Big O notation|Big ''O'' notation]].)
| last2 = Stege | first2 = Ulrike
| last3 = Thomo | first3 = Alex
| contribution = Chapter 1: Structures for Indexing Substrings
| doi = 10.1007/978-3-031-01885-5_1
| isbn = 9783031018855
| pages = 1–15
| publisher = Springer International Publishing
| series = Synthesis Lectures on Data Management
| title = Full-Text (Substring) Indexes in External Memory
| year = 2012}}</ref>
 
The phrase '''full-text index''' is also often used for an index of all substrings of asubstring textindexes. But this is ambiguous, as it is also used for regular word indexes such as [[inverted file]]s and [[document retrieval]]. See [[full text search]].
 
==General considerations==
Substring indexes include:
These data structures typically treat their text and pattern as [[string (computer science)|strings]] over a fixed alphabet, and search for locations where the pattern occurs as a substring of the text. The symbols of the alphabet may be characters (for instance in [[Unicode]]) but in practical applications for [[text retrieval]] it may be preferable to treat the ([[Stemming|stemmed]]) words of a document as the symbols of its alphabet, because doing this reduces the lengths of both the text and pattern as measured in numbers of symbols.<ref>{{citation
| last = Risvik | first = Knut Magne
| editor-last = Farach-Colton | editor-first = Martin | editor-link = Martin Farach-Colton
| contribution = Approximate word sequence matching over sparse suffix trees
| doi = 10.1007/BFB0030781
| pages = 65–79
| publisher = Springer
| series = Lecture Notes in Computer Science
| title = Combinatorial Pattern Matching, 9th Annual Symposium, CPM 98, Piscataway, New Jersey, USA, July 20–22, 1998, Proceedings
| volume = 1448
| year = 1998| isbn = 978-3-540-64739-3
}}</ref>
 
==Examples==
* [[Suffix tree]]
Specific data structures that can be used as substring indexes include:
* [[Suffix array]]
* The [[suffix tree]], a [[radix tree]] of the suffixes of the string, allowing substring search to be performed symbol-by-symbol<ref name=bst/><ref name=gv/>
* N-gram index, an [[inverted file]] for all [[N-gram]]s of the text
* The [[suffix automaton]], the minimal [[deterministic finite automaton]] that recognizes substrings of a given text, closely related to the suffix tree and constructable by variants of the same algorithms.<ref>{{citation
* [[Compressed suffix array]]<ref>R. Grossi and J. S. Vitter, [http://www.di.unipi.it/~grossi/PAPERS/sicomp05.pdf Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching], ''SIAM Journal on Computing,'' 35(2), 2005, 378–407.</ref>
| last1 = Blumer | first1 = Anselm
* [[FM-index]]
| last2 = Blumer | first2 = J.
* [[LZ-index]]
| last3 = Ehrenfeucht | first3 = Andrzej | author3-link = Andrzej Ehrenfeucht
| last4 = Haussler | first4 = David | author4-link = David Haussler
| last5 = McConnell | first5 = Ross M.
| editor-last = Paredaens | editor-first = Jan
| contribution = Building the minimal DFA for the set of all subwords of a word on-line in linear time
| doi = 10.1007/3-540-13345-3_9
| pages = 109–118
| publisher = Springer
| series = Lecture Notes in Computer Science
| title = Automata, Languages and Programming, 11th Colloquium, Antwerp, Belgium, July 16–20, 1984, Proceedings
| volume = 172
| year = 1984| isbn = 978-3-540-13345-2
}}</ref>
* The [[suffix array]], a sorted array of the starting positions of suffixes of the string, allowing substring search to be performed by [[binary search]]<ref name=bst/><ref name=gv/> Augmenting a suffix array with an [[LCP array]] of the lengths of common prefixes of consecutive suffixes allows the search to be performed symbol-by-symbol, matching the search time of the suffix tree.<ref>{{citation
| last1 = Manber | first1 = Udi | author1-link = Udi Manber
| last2 = Myers | first2 = Gene | author2-link = Eugene Myers
| doi = 10.1137/0222058
| issue = 5
| journal = [[SIAM Journal on Computing]]
| mr = 1237156
| pages = 935–948
| title = Suffix arrays: a new method for on-line string searches
| volume = 22
| year = 1993}}</ref>
* The [[compressed suffix array]], a data structure that combines [[data compression]] with the suffix array, allowing the structure to be stored in space sublinear in the text length<ref name=bst/><ref name=gv>{{citation
| last1 = Grossi | first1 = Roberto
| last2 = Vitter | first2 = Jeffrey Scott | author2-link = Jeffrey Vitter
| doi = 10.1137/S0097539702402354
| issue = 2
| journal = SIAM Journal on Computing
| mr = 2191449
| pages = 378–407
| title = Compressed suffix arrays and suffix trees with applications to text indexing and string matching
| url = https://www.di.unipi.it/~grossi/PAPERS/sicomp05.pdf
| volume = 35
| year = 2005| hdl = 1808/18962
}}</ref>
* The [[FM-index]], another compressed substring index based on the [[Burrows–Wheeler transform]] and closely related to the suffix array<ref>{{citation
| last1 = Ferragina | first1 = Paolo
| last2 = Manzini | first2 = Giovanni
| doi = 10.1145/1082036.1082039
| issue = 4
| journal = [[Journal of the ACM]]
| mr = 2164632
| pages = 552–581
| title = Indexing compressed text
| volume = 52
| year = 2005}}</ref>
 
== References ==
{{reflist}}
 
{{Strings|state=collapsed}}
[[Category:Algorithms on strings]]
[[Category:String data structures]]