Content deleted Content added
GraziePrego (talk | contribs) Adding local short description: "Method for data management", overriding Wikidata description "processing and storage of data to enable fast information querying" |
→Inverted indices: phrase search |
||
Line 65:
The inverted index is a [[sparse matrix]], since not all words are present in each document. To reduce [[Computer Storage|computer storage]] memory requirements, it is stored differently from a two dimensional [[Array data structure|array]]. The index is similar to the [[document-term matrix|term document matrices]] employed by [[latent semantic analysis]]. The inverted index can be considered a form of a hash table. In some cases the index is a form of a [[binary tree]], which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a [[distributed hash table]].<ref>Tang, Hunqiang. [[Sandhya Dwarkadas|Dwarkadas, Sandhya]]. "Hybrid Global Local Indexing for Efficient
Peer to Peer Information Retrieval". University of Rochester. Pg 1. http://www.cs.rochester.edu/u/sandhya/papers/nsdi04.ps</ref>
==== Implementation of Phrase Search Using an Inverted Index ====
For phrase searching, a specialized form of an inverted index called a positional index is used. A positional index not only stores the ID of the document containing the token but also the exact position(s) of the token within the document in the [[postings list]]. The occurrences of the phrase specified in the query are retrieved by navigating these postings list and identifying the indexes at which the desired terms occur in the expected order (the same as the order in the phrase). So if we are searching for occurrence of the phrase "First Witch", we would:
# Retrieve the postings list for "first" and "witch"
# Identify the first time that "with" occurs after "first"
# Check that this occurrence is immediately after the occurrence of "first".
# If not, continue to the next occurrence of "first".
The postings lists can be navigated using a binary search in order to minimize the time complexity of this procedure.<ref>{{Cite book |last=Büttcher |first=Stefan |title=Information retrieval: implementing and evaluating search engines |last2=Clarke |first2=Charles L. A. |last3=Cormack |first3=Gordon V. |date=2016 |publisher=The MIT Press |isbn=978-0-262-52887-0 |edition=First MIT Press paperback edition |___location=Cambridge, Massachusetts London, England}}</ref>
===Index merging===
|