{{short description|Classical information retrieval model}}
{{technical|date=June 2018}}The (standard) '''Boolean model of information retrieval''' ('''BIR''')<ref>{{citation | author1=Lancaster, F.W. | author2=Fayen, E.G. | title=Information Retrieval On-Line | publisher=Melville Publishing Co., Los Angeles, California | year=1973}}</ref> is a classical [[information retrieval]] (IR) model and, at the same time, the first and most widely adopted one.<ref>{{Cite web |title=Information Retrieval |url=https://mitpress.mit.edu/9780262528870/information-retrieval/ |access-date=2023-12-09 |website=MIT Press |language=en-US}}</ref> The BIR is based on [[Boolean logic]] and classical [[set theory]]: both the documents to be searched and the user's query are conceived as sets of terms (a [[bag-of-words model]]). Retrieval is based on whether or not the documents contain the query terms and whether they satisfy the Boolean conditions described by the query.
==Definitions==
In the Boolean model, documents and queries are represented using concepts from [[set theory]]. A document is seen as a simple collection (a set) of terms, and a query is a formal statement (a Boolean expression) that specifies which terms must or must not be present in a retrieved document.
* An '''index term''' (or ''term'') is a [[keyword]] that characterizes the content of a document. Terms are the fundamental units of the model. Common, low-information words (called [[stop words]]) like "a", "the", and "is" are typically excluded from being used as index terms.
* A '''document''' is represented as a [[set]] of index terms. This is a [[bag-of-words model]], meaning the order and frequency of terms in the original document are ignored. For example, a document about Bayes' theorem might be represented simply as the set <math>\{\text{Bayes' theorem, probability, decision-making}\}</math>.
* A '''query''' is a formal expression of the user's information need, written using index terms and Boolean operators (AND, OR, NOT). The model retrieves every document that is considered a "match" for this logical expression.
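For illustration, the sketch below shows how a document's text can be reduced to such a set of index terms. It is only an informal aid: the tokenization rule and the stop-word list are simplifying assumptions, not part of the model.

<syntaxhighlight lang="python">
# Illustrative sketch: reduce a document's text to a set of index terms.
# The stop-word list and the tokenization rule are simplifying assumptions.
STOP_WORDS = {"a", "the", "is", "of", "and", "to"}

def index_terms(text: str) -> set[str]:
    """Return the bag-of-words set representation of a document."""
    tokens = text.lower().split()
    # Strip simple punctuation; order and term frequency are discarded.
    stripped = {t.strip(".,:;()\"") for t in tokens}
    return {t for t in stripped if t and t not in STOP_WORDS}

doc = "Bayes' theorem relates probability to decision-making."
print(index_terms(doc))
# e.g. {"bayes'", 'theorem', 'relates', 'probability', 'decision-making'}
</syntaxhighlight>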
===Formal representation===
The model can be defined formally as follows:
* The set of all index terms is <math>T = \{t_1, t_2, \ldots, t_n\}</math>.
* A document <math>D_j</math> is any subset of <math>T</math>.
* A query <math>Q</math> is a Boolean expression, typically in [[conjunctive normal form]]:<math display="block">Q = (t_a \lor t_b) \land (\lnot t_c \lor t_d) \land \dots</math>where <math>t_a, t_b, \dots \in T</math>.
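A query of this form can be evaluated directly against the documents' term sets. The following sketch is a minimal illustration: the representation of a CNF query as a list of clauses, each clause a list of <code>(term, negated)</code> pairs, is a convention chosen here for readability, not standard notation.

<syntaxhighlight lang="python">
# Minimal sketch of Boolean retrieval for a query in conjunctive normal form.
# A query is a list of clauses; each clause is a list of (term, negated) literals.
# A document matches if every clause contains at least one satisfied literal.

def matches(document: set[str], query: list[list[tuple[str, bool]]]) -> bool:
    return all(
        any((term not in document) if negated else (term in document)
            for term, negated in clause)
        for clause in query
    )

def retrieve(documents: dict[str, set[str]], query) -> set[str]:
    """Return the names of all documents that satisfy the Boolean query."""
    return {name for name, terms in documents.items() if matches(terms, query)}

# (t_a OR t_b) AND (NOT t_c OR t_d), with illustrative terms:
query = [[("probability", False), ("statistics", False)],
         [("sports", True), ("decision-making", False)]]
docs = {"D1": {"probability", "Bayes' principle"},
        "D2": {"probability", "decision-making"}}
print(retrieve(docs, query))  # {'D1', 'D2'} (order may vary)
</syntaxhighlight>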
==Example==
Consider a collection of three documents: <math display="inline">D_1</math>, a text on Bayes' principle and probability; <math display="inline">D_2</math>, a text on probability and decision-making; and <math display="inline">D_3</math>, given in full below.
<math display="inline">D_3</math> = "Bayesian [[epistemology]]: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power."
Let the set <math display="inline">T</math> of terms be:<math display="block">T = \{t_1=\text{Bayes' principle}, t_2=\text{probability}, t_3=\text{decision-making}, t_4=\text{Bayesian epistemology}\}</math>Then, the set <math display="inline">D</math> of documents is as follows:<math display="block">D = \{D_1,\ D_2,\ D_3\}</math>where <math display="block">\begin{align}
D_1 &= \{\text{probability},\ \text{Bayes' principle}\} \\
D_2 &= \{\text{probability},\ \text{decision-making}\} \\
D_3 &= \{\text{probability},\ \text{Bayesian epistemology}\}
\end{align}</math>Let the query <math display="inline">Q</math> be ("probability" AND "decision-making"):<math display="block">Q = \text{probability} \land \text{decision-making}</math>Then, to retrieve the relevant documents:
# Firstly, the following sets <math display="inline">S_1</math> and <math display="inline">S_2</math> of documents <math display="inline">D_i</math> are obtained (retrieved):<math display="block">\begin{align}
S_1 &= \{D_1,\ D_2,\ D_3\} \\
S_2 &= \{D_2\}
\end{align}</math>
# Finally, the following set <math display="inline">S</math> of documents is obtained in response to the query:<math display="block">S = S_1 \cap S_2 = \{D_2\}</math>
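The steps above amount to ordinary set operations. The short sketch below reproduces this example in Python; the document names and term strings are simply those used above.

<syntaxhighlight lang="python">
# Reproducing the worked example with ordinary set operations.
docs = {
    "D1": {"probability", "Bayes' principle"},
    "D2": {"probability", "decision-making"},
    "D3": {"probability", "Bayesian epistemology"},
}

# Q = "probability" AND "decision-making"
S1 = {name for name, terms in docs.items() if "probability" in terms}
S2 = {name for name, terms in docs.items() if "decision-making" in terms}

print(S1)       # {'D1', 'D2', 'D3'} (order may vary)
print(S2)       # {'D2'}
print(S1 & S2)  # {'D2'} -- the set S of documents retrieved for Q
</syntaxhighlight>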
== Data structures and algorithms ==
Several [[data structure]]s and algorithms can be used to evaluate Boolean queries efficiently.

=== Hash sets ===
{{ main | feature hashing }}
One possibility is to represent each document by a [[Set (abstract data type)|hash set]] of its index terms, so that testing whether a query term occurs in a given document is an average-case constant-time membership check.
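As a rough sketch of this idea (Python's built-in <code>set</code> is itself a hash set; the document contents and query are illustrative), a conjunctive query can be answered by per-document membership tests:

<syntaxhighlight lang="python">
# Sketch: each document is stored as a hash set of its terms, so testing
# whether a query term occurs in a document is an expected O(1) lookup.
doc_sets = {
    "D1": {"probability", "Bayes' principle"},
    "D2": {"probability", "decision-making"},
}

def satisfies_and_query(doc_terms: set[str], query_terms: list[str]) -> bool:
    # Conjunctive (AND) query: every query term must be present.
    return all(term in doc_terms for term in query_terms)

hits = [name for name, terms in doc_sets.items()
        if satisfies_and_query(terms, ["probability", "decision-making"])]
print(hits)  # ['D2']
</syntaxhighlight>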
=== Signature file ===
Each document can be summarized by a [[Bloom filter]] representing the set of words in that document, stored in a fixed-length bitstring, called a signature. The signature file contains one such [[superimposed code]] bitstring for every document in the collection. Each query can also be summarized by a [[Bloom filter]] representing the set of words in the query, stored in a bitstring of the same fixed length. The query bitstring is tested against each signature.<ref name="zobel">
Justin Zobel; Alistair Moffat; Kotagiri Ramamohanarao.
[https://people.eng.unimelb.edu.au/jzobel/fulltext/acmtods98.pdf "Inverted Files Versus Signature Files for Text Indexing"].
''ACM Transactions on Database Systems'', 1998.</ref>
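The following sketch illustrates the signature-file idea; it is not the scheme analysed in the reference above, and the signature width, the number of hash functions, and the example documents are arbitrary assumptions made for the illustration.

<syntaxhighlight lang="python">
# Illustrative sketch of a signature file: each document is summarized by a
# fixed-width, Bloom-filter-style bitstring (its "signature"). A query matches
# a document only if every bit set in the query signature is also set in the
# document signature; false positives are possible, so candidates must still
# be checked against the actual document text.
import hashlib

WIDTH = 64       # signature length in bits (arbitrary choice for this sketch)
NUM_HASHES = 3   # bit positions set per word (arbitrary choice)

def signature(words: set[str]) -> int:
    sig = 0
    for word in words:
        for i in range(NUM_HASHES):
            digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
            sig |= 1 << (int.from_bytes(digest[:4], "big") % WIDTH)
    return sig

doc_signatures = {
    "D1": signature({"probability", "bayes"}),
    "D2": signature({"probability", "decision-making"}),
}
query_sig = signature({"probability", "decision-making"})

# A document is a candidate match if its signature contains all query bits.
candidates = [name for name, sig in doc_signatures.items()
              if sig & query_sig == query_sig]
print(candidates)  # ['D2'] (false positives are possible in general)
</syntaxhighlight>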