Revision as of 01:39, 12 December 2024 edit XOR'easter (talk \| contribs) Extended confirmed users 33,333 edits Undid revision 1260742911 by JulioISalazarG (talk): a source for the claim "The phrase '''full-text index''' is also often used" needs, at a bare minimum, to use the phrase "full-text index" Tag: Undo ← Previous edit		Revision as of 08:10, 12 December 2024 edit undo David Eppstein (talk \| contribs) Autopatrolled, Administrators 235,647 edits expand; rm non-bluelinked entries Next edit →
Line 1: {{Short description\|Data structure}} {{more references\|date=December 2021}} In [[computer science]], a '''substring index''' is a [[data structure]] which gives [[substring]] search in a text or text collection in [[sublinear]] time. If you have a document <math>S</math> of length <math>n</math>, or a set of documents <math>D=\{S^1,S^2, \dots, S^d\}</math>, ofa ~~total~~substring ~~length~~index ~~<math>n</math>,~~can ~~you~~be ~~can~~used to locate all occurrences of a pattern ~~<math>P</math>~~in time linear or near-linear in ~~<math>o(n)</math>~~the ~~time.~~pattern size, with no dependence or only ~~(See~~logarithmic ~~[[Big~~dependence Oon ~~notation\|Big~~the ~~''O''~~document ~~notation]]~~size.) These data structures typically treat their text and pattern as [[string (computer science)\|strings]] over a fixed alphabet, and search for locations where the pattern occurs as a substring of the text. The alphabet may consist of characters (for instance in [[Unicode]]) but in practical applications for [[text retrieval]] it may be preferable to treat the ([[Stemming\|stemmed]]) words of a document as the elements of the alphabet, because doing this reduces the lengths of both the text and pattern as measured in letters of their alphabet. The phrase '''full-text index''' is also often used for an index of all substrings of a text. But this is ambiguous, as it is also used for regular word indexes such as [[inverted file]]s and [[document retrieval]]. See [[full text search]].▼ ▲The phrase '''full-text index''' is ~~also~~ often used for ~~an index of all substrings of a~~substring ~~text~~indexes. But this is ambiguous, as it is also used for regular word indexes such as [[inverted file]]s and [[document retrieval]]. See [[full text search]]. ~~Substring indexes include:~~ Specific data structures that can be used as substring indexes include: * [[Suffix tree]] * The [[suffix tree]], a [[radix tree]] of the suffixes of the string, allowing substring search to be performed character-by-character<ref name=gv/> * [[Suffix array]] * The [[suffix array]], a sorted array of the starting positions of suffixes of the string, allowing substring search to be performed by [[binary search]]<ref name=gv/> * N-gram index, an [[inverted file]] for all [[N-gram]]s of the text * The [[compressed suffix array]], a data structure that combines [[data compression]] with the suffix array, allowing the structure to be stored in space sublinear in the text length<ref name=gv>{{citation * [[Compressed suffix array]]<ref>R. Grossi and J. S. Vitter, [http://www.di.unipi.it/~grossi/PAPERS/sicomp05.pdf Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching], ''SIAM Journal on Computing,'' 35(2), 2005, 378–407.</ref> \| last1 = Grossi \| first1 = Roberto * [[FM-index]] \| last2 = Vitter \| first2 = Jeffrey Scott \| author2-link = Jeffrey Vitter * [[LZ-index]] \| doi = 10.1137/S0097539702402354 \| issue = 2 \| journal = SIAM Journal on Computing \| mr = 2191449 \| pages = 378–407 \| title = Compressed suffix arrays and suffix trees with applications to text indexing and string matching \| url = https://www.di.unipi.it/~grossi/PAPERS/sicomp05.pdf \| volume = 35 \| year = 2005}}</ref> * The [[FM-index]], another compressed substring index based on the [[Burrows–Wheeler transform]] and closely related to the suffix array == References ==

Substring index: Difference between revisions