Boolean model of information retrieval: Difference between revisions

Content deleted Content added
info r
Converted plain text formulas into math formulas
Line 1:
 
The '''Boolean model''' of [[information retrieval]] (BIR)<ref>{{citation | author1=Lancaster, F.W. | author2=Fayen, E.G. | title=Information Retrieval On-Line | publisher=Melville Publishing Co., Los Angeles, California | year=1973}}</ref> is a classical [[information retrieval]] (IR) model and, at the same time, the first and most adopted one. It is used by many IR systems to this day.{{citation needed|date=November 2010}}
 
==Definitions==
 
The BIR is based on [[Boolean logic]] and classical [[set theory]] in that both the documents to be searched and the user's query are conceived as sets of terms. Retrieval is based on whether or not the documents contain the query terms. Given a finite set<math display="block">T = \{t_1, t_2,\ ...,\ t_m\}</math>of elements called index terms (e.g. words or expressions - which may be [[stemming|stemmed]] - describing or characterizing documents such as keywords given for a journal article), a finite set<math display="block">D = \{D_1,\ ...\ ,D_n\} \text{ where } D_i \in Powerset(T)</math>of elements called documents, and a Boolean expression - in a normal form - <math display="inline">Q</math><math display="block">Q = (W_1\ \or\ W_2\ \or\ \ldots) \and\ \ldots\ \and\ (W_i\ \or\ W_j\ \or\ \ldots)</math>called a query where <math display="inline">W_i</math> either implies that <math>t_i \in D_j</math> or that <math>t_i \notin D_j</math> (equivalently, <math display="inline">Q</math> can be given in a [[disjunctive normal form]] as well). Then an operation called retrieval, consisting of two steps, is defined as follows:
 
: 1. Find the sets <math display="inline">S_j</math> of documents that satisfy <math display="inline">W_j</math> (that is, whether <math display="inline">t_j \in D_k</math> or <math>t_j \notin D_k</math> ) :<math display="block">S_j = \{D_i\ |\ W_j\}</math>2. Those documents are retrieved in response to Q which are the result of the corresponding sets operations, i.e. the answer to <math display="inline">Q</math> is as follows:<math display="block">(S_1 \cup S_2 \cup \ \ldots) \cap\ \ldots\ \cap\ (S_i \cup S_j \cup\ \ldots)</math>
: T = {t1, t2, ..., tj, ..., tm}
 
of elements called index terms (e.g. words or expressions - which may be [[stemming|stemmed]] - describing or characterising documents such as keywords given for a journal article), a finite set
 
: D = {D1, ..., Di, ..., Dn}, where Di is an element of the powerset of T
 
of elements called documents. Given a Boolean expression - in a normal form - Q called a query as follows:
 
:Q = (Wi OR Wk OR ...) AND ... AND (Wj OR Ws OR ...),
:with Wi=ti, Wk=tk, Wj=tj, Ws=ts, or Wi=NON ti, Wk=NON tk, Wj=NON tj, Ws=NON ts
 
where ti means that the term ti is present in document Di, whereas NON ti means that it is not.
 
Equivalently, Q can be given in a [[disjunctive normal form]], too. An operation called retrieval, consisting of two steps, is defined as follows:
 
: 1. The sets Sj of documents are obtained that contain or not term tj (depending on whether Wj=tj or Wj=NON tj) :
 
:: Sj = {Di|Wj element of Di}
 
: 2. Those documents are retrieved in response to Q which are the result of the corresponding sets operations, i.e. the answer to Q is as follows:
 
:: UNION ( [[Intersection|INTERSECTION]] Sj)
 
==Example==
Line 32 ⟶ 12:
Let the set of original (real) documents be, for example
 
<math>O = \{O1O_1,\ O2O_2,\ O3O_3\}</math>
 
where
 
O1 = Bayes' Principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution).
 
O2 = [[Bayes' theorem|Bayesian Decision Theory]]: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest subjective expected utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision.
 
O3 = Bayesian [[Epistemology]]: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power.
 
Let the set T of terms be:
 
T = {t1 = Bayes' Principle, t2 = probability, t3 = decision-making, t4 =
Bayesian Epistemology}
 
Then, the set D of documents is as follows:
 
D = {D1, D2, D3}
 
where
 
O1<math display="inline">O_1</math> = "Bayes' Principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution)."
D1 = {Bayes' Principle, probability}
 
O2<math display="inline">O_2</math> = "[[Bayes' theorem|Bayesian Decision Theory]]: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest subjective expected utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision."
D2 = {probability, decision-making}
 
O3<math display="inline">O_3</math> = "Bayesian [[Epistemology]]: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power."
D3 = {probability, Bayesian Epistemology}
 
Let the queryset Q<math display="inline">T</math> of terms be:
 
<math display="block">T = \{t_1=\text{Bayes' Principle}, t_2=\text{probability}, t_3=\text{decision-making}, t_4=\text{Bayesian Epistemology}\}</math>
Q = probability AND decision-making
 
1. FirstlyThen, the followingset sets<math S1 and S2display="inline">D</math> of documents Di areis obtainedas (retrieved)follows:
 
<math display="block">D = \{D_1,\ D_2,\ D_3\}</math>
S1 = {D1, D2, D3}
 
where<math display="block">\begin{align}
S2 = {D2}
D_1 &= \{\text{probability},\ \text{Bayes' Principle}\} \\
D2D_2 &= \{\text{probability},\ \text{decision-making}\} \\
D3D_3 &= \{\text{probability},\ \text{Bayesian Epistemology}\}
\end{align}</math>
 
Let the query <math display="inline">Q</math> be:
2. Finally, the following documents Di are retrieved in response to Q:
{D1, D2, D3} INTERSECTION {D2} = {D2}
 
<math display="block">Q = \text{probability} \and \text{decision-making}</math>Then to retrieve the relevant documents:
This means that the original document O2 (corresponding to D2) is the answer to Q.
# Firstly, the following sets <math display="inline">S_1</math> and <math display="inline">S_2</math> of documents <math display="inline">D_i</math> are obtained (retrieved):<math display="block">\begin{align}
S_1 &= \{D_1,\ D_2,\ D_3\} \\
S_2 &= \{D_2\}
\end{align}</math>
# Finally, the following documents <math display="inline">D_i</math> are retrieved in response to <math display="inline">Q</math> <math display="block">Q: \{D_1,\ D_2,\ D_3\}\ \cap\ \{D_2\}\ =\ \{D_2\}</math>
This means that the original document O2<math display="inline">O_2</math> (corresponding to D2<math display="inline">D_2</math>) is the answer to <math display="inline">Q</math>.
 
Obviously, if there is more than one document with the same representation, every such document is retrieved. Such documents are, indistinguishable in the BIR, indistinguishable (or, in other words, equivalent).
 
== Advantages ==