Substring index: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 17:36, 21 November 2009 edit Andreas Kaufmann (talk \| contribs) Extended confirmed users, Pending changes reviewers 7,180 edits m Removed category Data structures; Quick-adding category String data structures (using HotCat) ← Previous edit		Latest revision as of 08:19, 10 January 2025 edit undo Citation bot (talk \| contribs) Bots 5,865,271 edits Add: hdl, isbn. \| Use this bot. Report bugs. \| Suggested by Dominic3203 \| Category:String data structures \| #UCB_Category 2/16
(26 intermediate revisions by 12 users not shown)
Line 1: {{Short description\|Data structure}} A '''substring index''' is a [[data structure]] which gives [[substring]] search in a text or text collection in [[sublinear]] time. If you have a document <math>S</math> of length <math>n</math>, or a set of documents <math>D=\{S^1,S^2, \dots, S^d\}</math> of total length <math>n</math>, you can locate all occurrences of a pattern <math>P</math> in <math>o(n)</math> time. (<math>o(n)</math> means less than <math>O(n)</math>. See [[Big O notation]].) In [[computer science]], a '''substring index''' is a [[data structure]] which gives [[substring]] search in a text or text collection in [[sublinear]] time. Once constructed from a document or set of documents, a substring index can be used to locate all occurrences of a pattern in time linear or near-linear in the pattern size, with no dependence or only logarithmic dependence on the document size.<ref name=bst>{{citation \| last1 = Barsky \| first1 = Marina \| last2 = Stege \| first2 = Ulrike \| last3 = Thomo \| first3 = Alex \| contribution = Chapter 1: Structures for Indexing Substrings \| doi = 10.1007/978-3-031-01885-5_1 \| isbn = 9783031018855 \| pages = 1–15 \| publisher = Springer International Publishing \| series = Synthesis Lectures on Data Management \| title = Full-Text (Substring) Indexes in External Memory \| year = 2012}}</ref> The phrase '''full-text index''' is ~~also~~ often used for ansubstring ~~index of all substrings of a text~~indexes. But this is ambiguous, as it is also used for regular word indexes such as [[inverted file]]s and [[~~signature~~document ~~file~~retrieval]]s. See [[full text search]]. ==General considerations== ~~Substring indexes include:~~ These data structures typically treat their text and pattern as [[string (computer science)\|strings]] over a fixed alphabet, and search for locations where the pattern occurs as a substring of the text. The symbols of the alphabet may be characters (for instance in [[Unicode]]) but in practical applications for [[text retrieval]] it may be preferable to treat the ([[Stemming\|stemmed]]) words of a document as the symbols of its alphabet, because doing this reduces the lengths of both the text and pattern as measured in numbers of symbols.<ref>{{citation \| last = Risvik \| first = Knut Magne \| editor-last = Farach-Colton \| editor-first = Martin \| editor-link = Martin Farach-Colton \| contribution = Approximate word sequence matching over sparse suffix trees \| doi = 10.1007/BFB0030781 \| pages = 65–79 \| publisher = Springer \| series = Lecture Notes in Computer Science \| title = Combinatorial Pattern Matching, 9th Annual Symposium, CPM 98, Piscataway, New Jersey, USA, July 20–22, 1998, Proceedings \| volume = 1448 \| year = 1998\| isbn = 978-3-540-64739-3 }}</ref> ==Examples== * [[Suffix tree]] Specific data structures that can be used as substring indexes include: * [[Suffix array]] * The [[suffix tree]], a [[radix tree]] of the suffixes of the string, allowing substring search to be performed symbol-by-symbol<ref name=bst/><ref name=gv/> * N-gram index, an [[inverted file]] for all [[N-gram]]s of the text * The [[suffix automaton]], the minimal [[deterministic finite automaton]] that recognizes substrings of a given text, closely related to the suffix tree and constructable by variants of the same algorithms.<ref>{{citation * [[Compressed suffix array]]<ref>R. Grossi and J. S. Vitter, [http://faculty.cse.tamu.edu/jsv/Papers/GrV00.text_indexing.pdf Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching], ''SIAM Journal on Computing,'' 35(2), 2005, 378-407.</ref> \| last1 = Blumer \| first1 = Anselm * [[FM-index]] \| last2 = Blumer \| first2 = J. * [[LZ-index]] \| last3 = Ehrenfeucht \| first3 = Andrzej \| author3-link = Andrzej Ehrenfeucht \| last4 = Haussler \| first4 = David \| author4-link = David Haussler \| last5 = McConnell \| first5 = Ross M. \| editor-last = Paredaens \| editor-first = Jan \| contribution = Building the minimal DFA for the set of all subwords of a word on-line in linear time \| doi = 10.1007/3-540-13345-3_9 \| pages = 109–118 \| publisher = Springer \| series = Lecture Notes in Computer Science \| title = Automata, Languages and Programming, 11th Colloquium, Antwerp, Belgium, July 16–20, 1984, Proceedings \| volume = 172 \| year = 1984\| isbn = 978-3-540-13345-2 }}</ref> * The [[suffix array]], a sorted array of the starting positions of suffixes of the string, allowing substring search to be performed by [[binary search]]<ref name=bst/><ref name=gv/> Augmenting a suffix array with an [[LCP array]] of the lengths of common prefixes of consecutive suffixes allows the search to be performed symbol-by-symbol, matching the search time of the suffix tree.<ref>{{citation \| last1 = Manber \| first1 = Udi \| author1-link = Udi Manber \| last2 = Myers \| first2 = Gene \| author2-link = Eugene Myers \| doi = 10.1137/0222058 \| issue = 5 \| journal = [[SIAM Journal on Computing]] \| mr = 1237156 \| pages = 935–948 \| title = Suffix arrays: a new method for on-line string searches \| volume = 22 \| year = 1993}}</ref> * The [[compressed suffix array]], a data structure that combines [[data compression]] with the suffix array, allowing the structure to be stored in space sublinear in the text length<ref name=bst/><ref name=gv>{{citation \| last1 = Grossi \| first1 = Roberto \| last2 = Vitter \| first2 = Jeffrey Scott \| author2-link = Jeffrey Vitter \| doi = 10.1137/S0097539702402354 \| issue = 2 \| journal = SIAM Journal on Computing \| mr = 2191449 \| pages = 378–407 \| title = Compressed suffix arrays and suffix trees with applications to text indexing and string matching \| url = https://www.di.unipi.it/~grossi/PAPERS/sicomp05.pdf \| volume = 35 \| year = 2005\| hdl = 1808/18962 }}</ref> * The [[FM-index]], another compressed substring index based on the [[Burrows–Wheeler transform]] and closely related to the suffix array<ref>{{citation \| last1 = Ferragina \| first1 = Paolo \| last2 = Manzini \| first2 = Giovanni \| doi = 10.1145/1082036.1082039 \| issue = 4 \| journal = [[Journal of the ACM]] \| mr = 2164632 \| pages = 552–581 \| title = Indexing compressed text \| volume = 52 \| year = 2005}}</ref> == References == {{reflist}} {{Strings\|state=collapsed}} [[Category:Algorithms on strings]] [[Category:String data structures]] [[Category:Database index techniques]] [[Category:Substring indices\| ]]