{{technical|date=June 2018}}<!-- Needs a plain-English definition without mathematical symbols -->
The (standard) '''Boolean model of information retrieval''' ('''BIR''')<ref>{{citation | author1=Lancaster, F.W. | author2=Fayen, E.G. | title=Information Retrieval On-Line | publisher=Melville Publishing Co., Los Angeles, California | year=1973}}</ref> is a classical [[information retrieval]] (IR) model and, at the same time, the first and most widely adopted one. It is used by many IR systems to this day.{{citation needed|date=November 2010}} The BIR is based on [[Boolean logic]] and classical [[set theory]]: both the documents to be searched and the user's query are conceived as sets of terms, and retrieval is based on whether or not the documents contain the query terms.
 
==Definitions==
We seek to find the set of documents that satisfy <math display="inline">Q</math>. This operation is called ''retrieval'' and consists of the following two steps:
 
: 1. For each <math display="inline">W_j</math> in <math display="inline">Q</math>, find the set <math display="inline">S_j</math> of documents that satisfy <math display="inline">W_j</math>: <math display="block">S_j = \{D_i \mid W_j\}</math>
: 2. Then the set of documents that satisfy <math display="inline">Q</math> is given by: <math display="block">(S_1 \cup S_2 \cup \cdots) \cap \cdots \cap (S_i \cup S_{i+1} \cup \cdots)</math>
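The two steps above can be sketched in Python. This is an illustrative sketch, not part of the original model description; it assumes the query is given in conjunctive normal form as a list of clauses, where each clause is a set of terms joined by OR and the clauses themselves are joined by AND:

```python
def retrieve(documents, query_cnf):
    """Boolean retrieval over documents given as sets of index terms.

    documents: dict mapping a document id to its set of terms (the D_i).
    query_cnf: query in conjunctive normal form, as a list of clauses;
               each clause is a set of terms joined by OR, and the
               clauses themselves are joined by AND.
    """
    result = set(documents)
    for clause in query_cnf:
        # Step 1: S_j -- the documents satisfying at least one term W_j
        # of this clause (non-empty intersection with the clause).
        s_j = {d for d, terms in documents.items() if terms & clause}
        # Step 2: AND the per-clause results together.
        result &= s_j
    return result
```

For example, with <code>documents = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}}</code>, the query <code>[{"b"}, {"a", "c"}]</code> (i.e. ''b'' AND (''a'' OR ''c'')) returns <code>{1, 2}</code>.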
 
==Example==
Let the set of original (real) documents be, for example
 
: <math>O = \{O_1,\ O_2,\ O_3\}</math>
 
where
<math display="inline">O_1</math> = "Bayes' Principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution)."
 
<math display="inline">O_2</math> = "[[Bayes' theorem|Bayesian decision theory]]: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest subjective expected utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision."
 
<math display="inline">O_3</math> = "Bayesian [[epistemology]]: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power."
 
Let the set <math display="inline">T</math> of terms be:
 
<math display="block">T = \{t_1=\text{Bayes' principle}, t_2=\text{probability}, t_3=\text{decision-making}, t_4=\text{Bayesian epistemology}\}</math>
 
Then, the set <math display="inline">D</math> of documents is as follows:
 
<math display="block">D = \{D_1,\ D_2,\ D_3\}</math>
 
where <math display="block">\begin{align}
D_1 &= \{\text{probability},\ \text{Bayes' principle}\} \\
D_2 &= \{\text{probability},\ \text{decision-making}\} \\
D_3 &= \{\text{probability},\ \text{Bayesian epistemology}\}
\end{align}</math>
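The example can be checked with a short Python snippet. The query here is hypothetical, chosen for illustration so that both the union and the intersection step are exercised: retrieve the documents that contain ''probability'' and either ''decision-making'' or ''Bayesian epistemology''.

```python
# The three documents above, represented as sets of index terms.
D = {
    "D1": {"probability", "Bayes' principle"},
    "D2": {"probability", "decision-making"},
    "D3": {"probability", "Bayesian epistemology"},
}

# S_j: the set of documents satisfying each query term.
S_prob = {d for d, terms in D.items() if "probability" in terms}
S_dec = {d for d, terms in D.items() if "decision-making" in terms}
S_epi = {d for d, terms in D.items() if "Bayesian epistemology" in terms}

# probability AND (decision-making OR Bayesian epistemology)
answer = S_prob & (S_dec | S_epi)
print(sorted(answer))  # ['D2', 'D3']
```

All three documents satisfy the ''probability'' term, so the union of the other two term sets determines the final answer.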
 
=== Hash sets ===
 
Another possibility is to use [[hash set]]s. Each document is represented by a hash table containing every term of that document. Since a hash table grows and shrinks as terms are added and removed, each document occupies much less space in memory. However, the operations are more complex than with [[bit vector]]s, so there is a performance penalty: in the worst case, performance can degrade from O(''n'') to O(''n''<sup>2</sup>), while in the average case the slowdown is modest and the space usage is much more efficient.
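As a minimal sketch of this representation (Python's built-in <code>set</code> is hash-based), each document becomes a hash set of its terms, and the Boolean operators map directly onto set operations:

```python
# Each document as a hash set of its terms: membership tests are O(1)
# on average, and AND / OR / NOT correspond to set intersection,
# union and difference.
d1 = {"probability", "Bayes' principle"}
d2 = {"probability", "decision-making"}

print("probability" in d2)  # True -- average-case O(1) lookup
print(d1 & d2)              # terms common to both documents (AND)
print(d1 | d2)              # terms in either document (OR)
print(d1 - d2)              # terms in d1 but not in d2 (NOT)
```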
 
== References ==