Boolean model of information retrieval: Difference between revisions

Jatzuli (talk | contribs)
Definitions: Explained in simpler terms the more mathematical definitions. I welcome other editors to re-arrange these explanations to have a better flow in the article. I am a Computer Science student in his third year specializing in data.
==Definitions==
 
 
An ''index term'' is a word or expression, which may be [[stemming|stemmed]], describing or characterizing a document, such as a keyword given for a journal article. Let<math display="block">T = \{t_1, t_2,\ \ldots,\ t_m\}</math>be the set of all such index terms.
 
A ''document'' is any subset of <math>T</math>. Let<math display="block">D = \{D_1,\ \ldots\ ,D_n\}</math>be the set of all documents.
 
 
In simpler terms, <math>T</math> is a list of words or short phrases (the index terms). Each term is denoted <math>t_i</math>, where <math>i</math> is the position of the term in the list. You can think of <math>T</math> as "Terms" and <math>t_i</math> as "index term ''i''".
 
Index terms can occur in documents. The documents form a list <math>D</math>, in which each individual document is denoted <math>D_j</math>. A document <math>D_j</math> is represented by the index terms it contains; for example, <math>D_1</math> ''could'' contain the terms <math>t_1</math> and <math>t_2</math> from <math>T</math>. An example is given in the following section.
 
Index terms are generally chosen to be words that carry more meaning and reflect what a document is about. Terms like "the" and "like" would appear in nearly all documents, whereas "Bayesian" would appear in only a small fraction of them. Therefore, rarer terms like "Bayesian" are a better choice for inclusion in <math>T</math>. This relates to [[Entropy (information theory)|entropy]]. Several operations can be applied to index terms to make them more general and more relevant to queries; one of these is [[stemming]].
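
The effect of term rarity can be illustrated with a minimal Python sketch that computes, for each word, the fraction of documents in which it appears. The toy collection and the function name <code>document_frequency</code> are assumptions made for illustration; they are not part of the model's definition.

```python
from collections import Counter


def document_frequency(documents):
    """Return, for each word, the fraction of documents that contain it."""
    counts = Counter()
    for text in documents:
        # Count each word at most once per document.
        counts.update(set(text.lower().split()))
    return {word: n / len(documents) for word, n in counts.items()}


# Hypothetical toy collection, not taken from the article.
docs = [
    "the Bayesian view of the evidence",
    "the weather is like the ocean",
    "the cat sat on the mat",
]
df = document_frequency(docs)

# "the" occurs in every document; "bayesian" occurs in only one of three.
print(df["the"], df["bayesian"])
```

Under this sketch, a word whose fraction is close to 1 distinguishes documents poorly, whereas a word with a small fraction is a more plausible candidate for <math>T</math>.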
 
 
A ''query'' is a Boolean expression <math display="inline">Q</math> in normal form:<math display="block">Q = (W_1\ \or\ W_2\ \or\ \cdots) \and\ \cdots\ \and\ (W_i\ \or\ W_{i+1}\ \or\ \cdots)</math>where <math display="inline">W_i</math> is true for <math>D_j</math> when <math>t_i \in D_j</math>. (Equivalently, <math display="inline">Q</math> could be expressed in [[disjunctive normal form]].)
 
Any query <math>Q</math> is a selection of index terms <math>t_i</math> (tested by the conditions <math>W_i</math>) picked from the set <math>T</math> and combined using [[Boolean algebra#Operations|Boolean operators]] to form a set of conditions.
 
These conditions are then applied to the set <math>D</math> of documents, each of which contains index terms from the same set <math>T</math>.
 
We seek to find the set of documents that satisfy <math display="inline">Q</math>. This operation is called ''retrieval'' and consists of the following two steps:
 
# For each <math display="inline">W_j</math> in <math display="inline">Q</math>, find the set <math display="inline">S_j</math> of documents that satisfy <math display="inline">W_j</math>:<math display="block">S_j = \{D_i\mid W_j\}</math>
# Then the set of documents that satisfy <math display="inline">Q</math> is given by:<math display="block">(S_1 \cup S_2 \cup \cdots) \cap \cdots \cap (S_i \cup S_{i+1} \cup \cdots)</math>where the union <math>\cup</math> corresponds to the Boolean operator ''OR'' and the intersection <math>\cap</math> corresponds to ''AND''.
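
The two retrieval steps can be sketched in Python using built-in sets. This is a minimal illustration, assuming that each document is stored as a set of index terms and that the query is given as a list of OR-clauses joined by AND. The function name <code>retrieve</code> and the sample data are hypothetical.

```python
def retrieve(query, documents):
    """Evaluate a query given as an AND of OR-clauses.

    query: list of clauses; each clause is a list of terms joined by OR.
    documents: dict mapping a document name to its set of index terms.
    """
    result = set(documents)  # start from all documents
    for clause in query:
        clause_docs = set()
        for term in clause:
            # Step 1: S_j is the set of documents that contain the term.
            s_j = {name for name, terms in documents.items() if term in terms}
            clause_docs |= s_j  # OR corresponds to set union
        # Step 2: AND corresponds to set intersection.
        result &= clause_docs
    return result


# Hypothetical data: three documents over the terms "a", "b" and "c".
documents = {"D1": {"a", "b"}, "D2": {"b", "c"}, "D3": {"c"}}

# Query: ("a" OR "c") AND "b"
answer = retrieve([["a", "c"], ["b"]], documents)
print(sorted(answer))  # ['D1', 'D2']
```

The sketch scans every document for every term, which is adequate for illustration only.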
 
==Example==
Let the set of original (real) documents be, for example
 
: <math>O = \{O_1,\ O_2,\ O_3\}</math>
 
where
 
<math display="inline">O_1</math> = "Bayes' principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution)."
 
<math display="inline">O_2</math> = "[[Bayes' theorem|Bayesian decision theory]]: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest subjective expected utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision."
 
<math display="inline">O_3</math> = "Bayesian [[epistemology]]: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power."
 
Let the set <math display="inline">T</math> of terms be:
Line 43 ⟶ 56:
\end{align}</math>
 
Let the query <math display="inline">Q</math> be ("probability" AND "decision-making"):
 
<math display="block">Q = \text{probability} \and \text{decision-making}</math>Then to retrieve the relevant documents:
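# First, the sets <math display="inline">S_1</math> and <math display="inline">S_2</math> of documents are obtained:<math display="block">\begin{align}
S_1 &= \{D_1,\ D_2,\ D_3\} \\
S_2 &= \{D_2\}
\end{align}</math>where <math>S_1</math> is the set of documents that contain the term "probability" and <math>S_2</math> is the set of documents that contain the term "decision-making".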
# Finally, the following documents <math display="inline">D_i</math> are retrieved in response to <math display="inline">Q</math>: <math display="block">Q: \{D_1,\ D_2,\ D_3\}\ \cap\ \{D_2\}\ =\ \{D_2\}</math>where the intersection operator selects the documents that belong to both <math>S_1</math> and <math>S_2</math>.
This means that the original document <math display="inline">O_2</math> (corresponding to <math display="inline">D_2</math>) is the answer to <math display="inline">Q</math>.
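
The same result can be reproduced with a short Python sketch. The term sets below are assumptions: they list only the two query terms and are chosen to agree with <math>S_1</math> and <math>S_2</math> as given in the example, not to reproduce the full document representations.

```python
# Assumed term-set representations, restricted to the two query terms.
documents = {
    "D1": {"probability"},
    "D2": {"probability", "decision-making"},
    "D3": {"probability"},
}

# Step 1: documents that contain each query term.
s1 = {name for name, terms in documents.items() if "probability" in terms}
s2 = {name for name, terms in documents.items() if "decision-making" in terms}

# Step 2: AND corresponds to set intersection.
answer = s1 & s2

print(sorted(s1))      # ['D1', 'D2', 'D3']
print(sorted(s2))      # ['D2']
print(sorted(answer))  # ['D2']
```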
 
If there is more than one document with the same representation (the same subset of index terms <math>t_n</math>), every such document is retrieved. Such documents are indistinguishable in the BIR (in other words, equivalent).
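
As a minimal sketch of this equivalence, documents can be grouped by their term sets, so that documents with identical representations fall into the same group. The function name <code>equivalence_classes</code> and the sample data are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(documents):
    """Group document names that share exactly the same set of index terms."""
    groups = defaultdict(list)
    for name, terms in documents.items():
        # frozenset is hashable, so it can serve as a dictionary key.
        groups[frozenset(terms)].append(name)
    return sorted(sorted(names) for names in groups.values())


# Hypothetical data: "A" and "B" have the same representation.
documents = {"A": {"x", "y"}, "B": {"y", "x"}, "C": {"x"}}
classes = equivalence_classes(documents)
print(classes)  # [['A', 'B'], ['C']]
```

Any Boolean query over these terms returns either both "A" and "B" or neither of them.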
 
== Advantages ==