Content deleted Content added
ce |
this chapter looks like a good general reference |
||
Line 1:
{{Short description|Data structure}}
{{more references|date=December 2021}}
In [[computer science]], a '''substring index''' is a [[data structure]] which gives [[substring]] search in a text or text collection in [[sublinear]] time. Once constructed from a document or set of documents, a substring index can be used to locate all occurrences of a pattern in time linear or near-linear in the pattern size, with no dependence or only logarithmic dependence on the document size.<ref name=bst>{{citation
| last1 = Barsky | first1 = Marina
| last2 = Stege | first2 = Ulrike
| last3 = Thomo | first3 = Alex
| contribution = Chapter 1: Structures for Indexing Substrings
| doi = 10.1007/978-3-031-01885-5_1
| isbn = 9783031018855
| pages = 1–15
| publisher = Springer International Publishing
| series = Synthesis Lectures on Data Management
| title = Full-Text (Substring) Indexes in External Memory
| year = 2012}}</ref>
These data structures typically treat their text and pattern as [[string (computer science)|strings]] over a fixed alphabet, and search for locations where the pattern occurs as a substring of the text. The symbols of the alphabet may be characters (for instance in [[Unicode]]) but in practical applications for [[text retrieval]] it may be preferable to treat the ([[Stemming|stemmed]]) words of a document as the symbols of its alphabet, because doing this reduces the lengths of both the text and pattern as measured in letters of their alphabet.
Line 8 ⟶ 19:
Specific data structures that can be used as substring indexes include:
* The [[suffix tree]], a [[radix tree]] of the suffixes of the string, allowing substring search to be performed symbol-by-symbol<ref name=bst/><ref name=gv/>
* The [[suffix array]], a sorted array of the starting positions of suffixes of the string, allowing substring search to be performed by [[binary search]]<ref name=bst/><ref name=gv/>
* The [[compressed suffix array]], a data structure that combines [[data compression]] with the suffix array, allowing the structure to be stored in space sublinear in the text length<ref name=bst/><ref name=gv>{{citation
| last1 = Grossi | first1 = Roberto
| last2 = Vitter | first2 = Jeffrey Scott | author2-link = Jeffrey Vitter
|