Content deleted Content added
link to related articles; mention the 2 most popular implementation algorithms; etc. |
|||
Line 1:
{{technical|date=June 2018}}<!-- Needs a plain-English definition without mathematical symbols -->
The (standard) '''Boolean model of information retrieval (BIR)'''<ref>{{citation | author1=Lancaster, F.W. | author2=Fayen, E.G. | title=Information Retrieval On-Line | publisher=Melville Publishing Co., Los Angeles, California | year=1973}}</ref> is a classical [[information retrieval]] (IR) model and, at the same time, the first and most-adopted one. It is used by many IR systems to this day.{{citation needed|date=November 2010}} The BIR is based on [[Boolean logic]] and classical [[set theory]] in that both the documents to be searched and the user's query are conceived as sets of terms (a [[bag-of-words model]]). Retrieval is based on whether or not the documents contain the query terms.
==Definitions==
Line 79:
=== Hash sets ===
{{ main | feature hashing }}
Another possibility is to use [[hash set]]s. Each document is represented by a hash table which contains every single term of that document. Since hash table size increases and decreases in real time with the addition and removal of terms, each document will occupy much less space in memory. However, it will have a slowdown in performance because the operations are more complex than with [[bit vector]]s. On the worst-case performance can degrade from O(''n'') to O(''n''<sup>2</sup>). On the average case, the performance slowdown will not be that much worse than bit vectors and the space usage is much more efficient.
=== Signature file ===
Each document can be summarized by [[Bloom filter]] representing the set of words in that document, stored in a fixed-length bitstring, called a signature.
The signature file contains one such [[superimposed code]] bitstring for every document in the collection.
Each query can also be summarized by a [[Bloom filter]] representing the set of words in the query, stored in a bitstring of the same fixed length.
The query bitstring is tested against each signature.<ref name="zobel" >
Justin Zobel; Alistair Moffat; and Kotagiri Ramamohanarao.
[https://people.eng.unimelb.edu.au/jzobel/fulltext/acmtods98.pdf "Inverted Files Versus Signature Files for Text Indexing"].
</ref><ref name="goodwin" >
Bob Goodwin; et. al.
[https://safari.ethz.ch/architecture/fall2018/lib/exe/fetch.php?media=p605-goodwin.pdf "BitFunnel: Revisiting Signatures for Search"].
2017.
</ref><ref>
Richard Startin.
[https://richardstartin.github.io/posts/bit-sliced-signatures-and-bloom-filters "Bit-Sliced Signatures and Bloom Filters"].
</ref>
The signature file approached is used in [[BitFunnel]].
=== Inverted file ===
{{ main | inverted index }}
An inverted index file contains two parts:
a vocabulary containing all the terms used in the collection,
and for each distinct term an inverted index that lists every document that mentions that term.<ref name="zobel" /><ref name="goodwin" />
== References ==
|