{{short description|Classical information retrieval model}}
{{technical|date=June 2018}}<!-- Needs a plain-English definition without mathematical symbols -->
The (standard) '''Boolean model of information retrieval''' ('''BIR''')<ref>{{citation | author1=Lancaster, F.W. | author2=Fayen, E.G. | title=Information Retrieval On-Line | publisher=Melville Publishing Co., Los Angeles, California | year=1973}}</ref> is a classical [[information retrieval]] (IR) model, where documents are retrieved based on whether they satisfy the conditions of a query that uses [[Boolean logic]]. As the first and most-adopted information retrieval model,<ref>{{Cite web |title=Information Retrieval |url=https://mitpress.mit.edu/9780262528870/information-retrieval/ |access-date=2023-12-09 |website=MIT Press |language=en-US}}</ref> it treats each document as a set of words, or ''terms'' (a [[bag-of-words model]]). A user's query uses logical operators like AND, OR, and NOT to create a rule for retrieval. The system then returns all documents that match the rule.
 
==Definitions==
In the Boolean model, documents and queries are represented using concepts from [[set theory]]. A document is seen as a simple collection (a set) of terms, and a query is a formal statement (a Boolean expression) that specifies which terms must or must not be present in a retrieved document.
 
* An '''index term''' (or ''term'') is a [[keyword]] that characterizes the content of a document. Terms are the fundamental units of the model. Common, low-information words (called [[stop words]]) like "a", "the", and "is" are typically excluded from being used as index terms.
* A '''document''' is represented as a [[set]] of index terms. This is a [[bag-of-words model]], meaning the order and frequency of terms in the original document are ignored. For example, a document about Bayes' theorem might be represented simply as the set <math>\{\text{Bayes' theorem, probability, decision-making}\}</math>.
* A '''query''' is a formal expression of the user's information need, written using index terms and Boolean operators (AND, OR, NOT). The model retrieves every document that is considered a "match" for this logical expression.
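The set-based view above can be sketched in a few lines of Python. This is an illustrative toy only: the documents, terms, and query below are invented, and the query is written directly as a Boolean predicate over term membership.

```python
# Boolean retrieval sketch: each document is a set of index terms,
# and a query is a Boolean condition over term-membership tests.

documents = {
    "D1": {"Bayes' theorem", "probability", "decision-making"},
    "D2": {"probability", "logic"},
    "D3": {"probability", "Bayesian epistemology"},
}

# Query: probability AND (decision-making OR logic).
def query(terms):
    return "probability" in terms and ("decision-making" in terms or "logic" in terms)

# Retrieve exactly those documents whose term set satisfies the query.
retrieved = sorted(name for name, terms in documents.items() if query(terms))
print(retrieved)  # → ['D1', 'D2']
```

Note that a document either matches or it does not; the model assigns no ranking or degree of relevance.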
 
===Formal representation===
An ''index term'' is a word or expression, which may be [[stemming|stemmed]], that describes or characterizes a document, such as a keyword given for a journal article.
The model can be defined formally as follows:
 
* Let <math>T = \{t_1, t_2,\ \ldots,\ t_k\}</math> be a set of index terms.
* A document <math>D_j</math> is any subset of <math>T</math>.
 
* A query <math>Q</math> is a Boolean expression, typically in [[conjunctive normal form]]:<math display="block">Q = (t_a \lor t_b) \land (\lnot t_c \lor t_d) \land \dots</math>where <math>t_a, t_b, \dots \in T</math>.
 
Retrieval is the process of identifying the set of all documents <math>\{D_j\}</math> that satisfy the query <math>Q</math>. For example, for the simple query <math>Q = t_a \land t_b</math>, the system would retrieve all documents whose set of terms contains both <math>t_a</math> and <math>t_b</math>.
 
 
Index terms should generally be words that carry meaning and correspond to what the content of a document is about. Terms like "the" and "like" appear in nearly all documents, whereas a term like "Bayesian" appears in only a small fraction of them; rarer terms are therefore better choices for the set <math>T</math>. This relates to [[Entropy (information theory)|entropy]]. Several operations can be applied to index terms to make queries more generic and more relevant, one of which is [[stemming]].
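This selection criterion can be sketched as follows. The toy documents are invented for illustration; the filter simply drops any term whose document frequency equals the collection size, which removes stop-word-like terms that cannot discriminate between documents.

```python
from collections import Counter

# Toy collection: each document is already reduced to a set of terms.
docs = [
    {"the", "probability", "bayesian"},
    {"the", "probability", "decision"},
    {"the", "weather", "forecast"},
]

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in doc)

# Terms occurring in every document (here "the") carry no information
# for retrieval, so exclude them from the index-term set T.
index_terms = sorted(t for t, n in df.items() if n < len(docs))
print(index_terms)  # → ['bayesian', 'decision', 'forecast', 'probability', 'weather']
```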
 
 
A ''query'' is a Boolean expression <math display="inline">Q</math> in normal form:<math display="block">Q = (W_1\ \or\ W_2\ \or\ \cdots) \and\ \cdots\ \and\ (W_i\ \or\ W_{i+1}\ \or\ \cdots)</math>where <math display="inline">W_i</math> is true for <math>D_j</math> when <math>t_i \in D_j</math>. (Equivalently, <math display="inline">Q</math> could be expressed in [[disjunctive normal form]].)
 
A query thus combines index terms from <math>T</math> using [[Boolean algebra#Operations|Boolean operators]] to form a set of conditions, which are then applied to the documents in <math>D</math>.
 
We seek to find the set of documents that satisfy <math display="inline">Q</math>. This operation is called ''retrieval'' and consists of the following two steps:
 
: 1. For each <math display="inline">W_j</math> in <math display="inline">Q</math>, find the set <math display="inline">S_j</math> of documents that satisfy <math display="inline">W_j</math>:<math display="block">S_j = \{D_i\mid W_j\}</math>
: 2. Then the set of documents that satisfy <math display="inline">Q</math> is given by:<math display="block">(S_1 \cup S_2 \cup \cdots) \cap \cdots \cap (S_i \cup S_{i+1} \cup \cdots)</math>
where <math>\cup</math> denotes ''OR'' and <math>\cap</math> denotes ''AND''.
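The two-step procedure can be sketched as follows. This is a toy illustration with invented documents and terms; the query is given directly in conjunctive normal form as a list of clauses, and negated literals are omitted for brevity.

```python
# Two-step Boolean retrieval over a query in conjunctive normal form.
documents = {
    "D1": {"t1", "t2"},
    "D2": {"t2", "t3"},
    "D3": {"t2", "t4"},
}

# CNF query (t1 OR t3) AND (t2): a list of clauses, each a list of terms.
cnf_query = [["t1", "t3"], ["t2"]]

def retrieve(docs, cnf):
    result = set(docs)                  # start from the whole collection
    for clause in cnf:
        # Step 1: S_j for each literal; union within a clause implements OR.
        s = set()
        for term in clause:
            s |= {name for name, terms in docs.items() if term in terms}
        # Step 2: intersection across clauses implements AND.
        result &= s
    return result

print(sorted(retrieve(documents, cnf_query)))  # → ['D1', 'D2']
```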
 
==Example==
<math display="inline">D_3</math> = "Bayesian [[epistemology]]: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power."
 
Let the set <math display="inline">T</math> of terms be:

<math display="block">T = \{t_1=\text{Bayes' principle}, t_2=\text{probability}, t_3=\text{decision-making}, t_4=\text{Bayesian epistemology}\}</math>

Then, the set <math display="inline">D</math> of documents is as follows:

<math display="block">D = \{D_1,\ D_2,\ D_3\}</math>

where
<math display="block">\begin{align}
D_1 &= \{\text{probability},\ \text{Bayes' principle}\} \\
D_2 &= \{\text{probability},\ \text{decision-making}\} \\
D_3 &= \{\text{probability},\ \text{Bayesian epistemology}\}
\end{align}</math>

Let the query <math display="inline">Q</math> be ("probability" AND "decision-making"):

<math display="block">Q = \text{probability} \land \text{decision-making}</math>

Then, to retrieve the relevant documents:
# Firstly, the following sets <math display="inline">S_1</math> and <math display="inline">S_2</math> of documents <math display="inline">D_i</math> are obtained (retrieved):<math display="block">\begin{align}
S_1 &= \{D_1,\ D_2,\ D_3\} \\
S_2 &= \{D_2\}
\end{align}</math>
# Finally, the following documents are retrieved in response to <math display="inline">Q</math>:<math display="block">\{D_1,\ D_2,\ D_3\}\ \cap\ \{D_2\} = \{D_2\}</math>
{{ main | feature hashing }}
 
Another possibility is to use [[Set (abstract data type)|hash set]]s. Each document is represented by a hash table which contains every single term of that document. Since hash table size increases and decreases in real time with the addition and removal of terms, each document will occupy much less space in memory. However, it will have a slowdown in performance because the operations are more complex than with [[bit vector]]s. In the worst case, performance can degrade from O(''n'') to O(''n''<sup>2</sup>). In the average case, the slowdown is not much worse than with bit vectors, and the space usage is much more efficient.
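The contrast between the two representations can be sketched as follows. The vocabulary and document are invented for illustration: the bit vector (here packed into an integer) reserves one bit per vocabulary term, while the hash set stores only the terms actually present in the document.

```python
# Bit-vector versus hash-set representation of one document's terms.
vocabulary = ["bayes", "probability", "decision", "epistemology"]
term_id = {t: i for i, t in enumerate(vocabulary)}

doc_terms = {"probability", "decision"}

# Bit-vector representation: one bit per vocabulary term.
bits = 0
for t in doc_terms:
    bits |= 1 << term_id[t]

# Hash-set representation: stores only the present terms.
hashed = set(doc_terms)

# Term-membership test, both ways:
print(bool(bits & (1 << term_id["decision"])))  # → True (bit-vector lookup)
print("decision" in hashed)                     # → True (hash-set lookup)
```

The bit vector's size is fixed by the vocabulary, whereas the hash set grows and shrinks with the document, which is why it is usually more space-efficient for sparse documents.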
 
=== Signature file ===
Each document can be summarized by a [[Bloom filter]] representing the set of words in that document, stored in a fixed-length bitstring called a signature. The signature file contains one such [[superimposed code]] bitstring for every document in the collection. Each query can also be summarized by a Bloom filter representing the set of words in the query, stored in a bitstring of the same fixed length. The query bitstring is tested against each signature.<ref name="zobel" >
Justin Zobel; Alistair Moffat; and Kotagiri Ramamohanarao.
[https://people.eng.unimelb.edu.au/jzobel/fulltext/acmtods98.pdf "Inverted Files Versus Signature Files for Text Indexing"].