Boolean model of information retrieval: Difference between revisions

Content deleted Content added
revise for clarity
 
(13 intermediate revisions by 10 users not shown)
Line 1:
{{short description|Classical information retrieval model}}
{{technical|date=June 2018}}<!-- Needs a plain-English definition without mathematical symbols -->
{{technical|date=June 2018}}The (standard) '''Boolean model of information retrieval''' ('''BIR)''')<ref>{{citation | author1=Lancaster, F.W. | author2=Fayen, E.G. | title=Information Retrieval On-Line | publisher=Melville Publishing Co., Los Angeles, California | year=1973}}</ref> is a classical [[information retrieval]] (IR) model and,where documents are retrieved based on whether they atsatisfy the sameconditions time,of a query that uses [[Boolean logic]]. As the first and most-adopted one.information Itretrieval ismodel,<ref>{{Cite usedweb by|title=Information many IR systemsRetrieval to this day|url=https://mitpress.{{citationmit.edu/9780262528870/information-retrieval/ needed|access-date=November2023-12-09 2010|website=MIT Press |language=en-US}}</ref> Theit BIRtreats iseach baseddocument onas [[Booleana logic]]set andof classicalwords, [[setor theory]]''terms''. inA thatuser's bothquery theuses documentslogical tooperators belike searchedAND, OR, and theNOT user'sto querycreate area conceivedrule asfor sets of termsretrieval. RetrievalThe issystem basedthen onreturns whetherall ordocuments notthat the documents containmatch the query termsrule.
 
==Definitions==
In the Boolean model, documents and queries are represented using concepts from [[set theory]]. A document is seen as a simple collection (a set) of terms, and a query is a formal statement (a Boolean expression) that specifies which terms must or must not be present in a retrieved document.
 
* An '''index term''' is a word (or expression'',term'') whichis may bea [[stemming|stemmedkeyword]], describingthat orcharacterizes characterizingthe content of a document,. Terms suchare asthe afundamental keywordunits givenof forthe amodel. journalCommon, article.low-information Let<mathwords display="block">T(called =[[stop \{t_1,words]]) t_2like "a",\ \ldots"the",\ t_m\}</math>beand the"is" setare oftypically allexcluded suchfrom being used as index terms.
* A '''document''' is represented as a [[set]] of index terms. This is a [[bag-of-words model]], meaning the order and frequency of terms in the original document are ignored. For example, a document about Bayes' theorem might be represented simply as the set <math>\{\text{Bayes' theorem, probability, decision-making}\}</math>.
* A '''query''' is a formal expression of the user's information need, written using index terms and Boolean operators (AND, OR, NOT). The model retrieves every document that is considered a "match" for this logical expression.
 
===Formal representation===
A ''document'' is any subset of <math>T</math>. Let<math display="block">D = \{D_1,\ \ldots\ ,D_n\}</math>be the set of all documents.
The model can be defined formally as follows:
 
* Let <math>T = \{t_1, t_2, \ldots, t_k\}</math> be a set of all index terms.
A ''query'' is a Boolean expression <math display="inline">Q</math> in normal form:<math display="block">Q = (W_1\ \or\ W_2\ \or\ \cdots) \and\ \cdots\ \and\ (W_i\ \or\ W_{i+1}\ \or\ \cdots)</math>where <math display="inline">W_i</math> is true for <math>D_j</math> when <math>t_i \in D_j</math>. (Equivalently, <math display="inline">Q</math> could be expressed in [[disjunctive normal form]].)
* A document <math>D_j</math> is any subset of <math>T</math>.
 
* A query <math>Q</math> is a Boolean expression, typically in [[conjunctive normal form]]:<math display="block">Q = (t_a \lor t_b) \land (\lnot t_c \lor t_d) \land \dots</math>where <math>t_a, t_b, \dots \in T</math>.
We seek to find the set of documents that satisfy <math display="inline">Q</math>. This operation is called ''retrieval'' and consists of the following two steps:
Retrieval is the process of identifying the set of all documents <math>\{D_j\}</math> that satisfy the query <math>Q</math>. For example, for the simple query <math>Q = t_a \land t_b</math>, the system would retrieve all documents whose set of terms contains both <math>t_a</math> and <math>t_b</math>.
 
: 1. For each <math display="inline">W_j</math> in <math display="inline">Q</math>, find the set <math display="inline">S_j</math> of documents that satisfy <math display="inline">W_j</math>:<math display="block">S_j = \{D_i\mid W_j\}</math>2. Then the set of documents that satisfy Q is given by:<math display="block">(S_1 \cup S_2 \cup \cdots) \cap \cdots \cap (S_i \cup S_{i+1} \cup \cdots)</math>
 
==Example==
Line 18 ⟶ 20:
Let the set of original (real) documents be, for example
 
: <math>OD = \{O_1D_1,\ O_2D_2,\ O_3D_3\}</math>
 
where
 
<math display="inline">O_1D_1</math> = "Bayes' principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution)."
 
<math display="inline">O_2D_2</math> = "[[Bayes' theorem|Bayesian decision theory]]: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest subjective expected utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision."
 
<math display="inline">O_3D_3</math> = "Bayesian [[epistemology]]: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power."
 
Let the set <math display="inline">T</math> of terms be:<math display="block">T = \{t_1=\text{Bayes' principle}, t_2=\text{probability}, t_3=\text{decision-making}, t_4=\text{Bayesian epistemology}\}</math>Then, the set <math display="inline">D</math> of documents is as follows:<math display="block">D = \{D_1,\ D_2,\ D_3\}</math>where <math display="block">\begin{align}
Let the set <math display="inline">T</math> of terms be:
 
<math display="block">T = \{t_1=\text{Bayes' principle}, t_2=\text{probability}, t_3=\text{decision-making}, t_4=\text{Bayesian epistemology}\}</math>
 
Then, the set <math display="inline">D</math> of documents is as follows:
 
: <math display="block">D = \{D_1,\ D_2,\ D_3\}</math>
 
where <math display="block">\begin{align}
D_1 &= \{\text{probability},\ \text{Bayes' principle}\} \\
D_2 &= \{\text{probability},\ \text{decision-making}\} \\
D_3 &= \{\text{probability},\ \text{Bayesian epistemology}\}
\end{align}</math>Let the query <math display="inline">Q</math> be ("probability" AND "decision-making"):<math display="block">Q = \text{probability} \and \text{decision-making}</math>Then to retrieve the relevant documents:
\end{align}</math>
 
Let the query <math display="inline">Q</math> be:
 
<math display="block">Q = \text{probability} \and \text{decision-making}</math>Then to retrieve the relevant documents:
# Firstly, the following sets <math display="inline">S_1</math> and <math display="inline">S_2</math> of documents <math display="inline">D_i</math> are obtained (retrieved):<math display="block">\begin{align}
S_1 &= \{D_1,\ D_2,\ D_3\} \\
S_2 &= \{D_2\}
\end{align}</math>Where <math>S_1</math> corresponds to the documents which contain the term "probability" and <math>S_2</math> contain the term "decision-making".
\end{align}</math>
# Finally, the following documents <math display="inline">D_i</math> are retrieved in response to <math display="inline">Q</math>: <math display="block">Q: \{D_1,\ D_2,\ D_3\}\ \cap\ \{D_2\}\ =\ \{D_2\}</math>Where the query looks for documents that are contained in both sets <math>S</math> using the intersection operator.
This means that the original document <math display="inline">O_2</math> (corresponding to <math display="inline">D_2</math>) is the answer to <math display="inline">Q</math>.
 
Obviously, ifIf there is more than one document with the same representation (the same subset of index terms <math>t_n</math>), every such document is retrieved. Such documents are indistinguishable in the BIR (in other words, equivalent).
 
== Advantages ==
Line 59 ⟶ 49:
* Easy to implement
* Intuitive concept
* If the resulting document set is either too small or too big, it is directly clear which operators will produce respectively a bigger or smaller set.
*It gives (expert) users a sense of control over the system. It is immediately clear why a document has been retrieved given a query.
 
== Disadvantages ==
Line 64 ⟶ 56:
* [[String search algorithm|Exact matching]] may retrieve too few or too many documents
* Hard to translate a query into a Boolean expression
* Ineffective for Search-Resistant Concepts<ref>{{cite journal |last1=Shokraneh |first1=Farhad |title=Stop searching and you will find it: Search-Resistant Concepts in systematic review searches |journal=BMJ Evidence-Based Medicine |date=6 August 2024 |pages=bmjebm–2023–112798 |doi=10.1136/bmjebm-2023-112798}}</ref>
* All terms are equally weighted
* More like ''[[data retrieval]]'' than ''information retrieval''
* Retrieval based on binary decision criteria with no notion of partial matching
* No ranking of the documents is provided (absence of a grading scale)
* Information need has to be translated into a Boolean expression, which most users find awkward
* The Boolean queries formulated by the users are most often too simplistic
* The model frequently returns either too few or too many documents in response to a user query
 
== Data structures and algorithms ==
Line 72 ⟶ 70:
 
=== Hash sets ===
{{ main | feature hashing }}
 
Another possibility is to use [[Set (abstract data type)|hash setsets]]s. Each document is represented by a hash table which contains every single term of that document. Since hash table size increases and decreases in real time with the addition and removal of terms, each document will occupy much less space in memory. However, it will have a slowdown in performance because the operations are more complex than with [[bit vector]]s. On the worst-case performance can degrade from O(''n'') to O(''n''<sup>2</sup>). On the average case, the performance slowdown will not be that much worse than bit vectors and the space usage is much more efficient.
 
=== Signature file ===
Each document can be summarized by [[Bloom filter]] representing the set of words in that document, stored in a fixed-length bitstring, called a signature. The signature file contains one such [[superimposed code]] bitstring for every document in the collection. Each query can also be summarized by a [[Bloom filter]] representing the set of words in the query, stored in a bitstring of the same fixed length. The query bitstring is tested against each signature.<ref name="zobel" >
Justin Zobel; Alistair Moffat; and Kotagiri Ramamohanarao.
[https://people.eng.unimelb.edu.au/jzobel/fulltext/acmtods98.pdf "Inverted Files Versus Signature Files for Text Indexing"].
</ref><ref name="goodwin" >
Bob Goodwin; et al.
[https://safari.ethz.ch/architecture/fall2018/lib/exe/fetch.php?media=p605-goodwin.pdf "BitFunnel: Revisiting Signatures for Search"].
2017.
</ref><ref>
Richard Startin.
[https://richardstartin.github.io/posts/bit-sliced-signatures-and-bloom-filters "Bit-Sliced Signatures and Bloom Filters"].
</ref>
 
The signature file approached is used in [[BitFunnel]].
 
=== Inverted file ===
Another possibility is to use [[hash set]]s. Each document is represented by a hash table which contains every single term of that document. Since hash table size increases and decreases in real time with the addition and removal of terms, each document will occupy much less space in memory. However, it will have a slowdown in performance because the operations are more complex than with [[bit vector]]s. On the worst-case performance can degrade from O(''n'') to O(''n''<sup>2</sup>). On the average case, the performance slowdown will not be that much worse than bit vectors and the space usage is much more efficient.
{{ main | inverted index }}
An inverted index file contains two parts:
a vocabulary containing all the terms used in the collection,
and for each distinct term an inverted index that lists every document that mentions that term.<ref name="zobel" /><ref name="goodwin" />
 
== References ==
{{reflist}}
* {{ citation | last1= Lashkari | first1=A.H. | last2=Mahdavi | first2= F.| last3=Ghomi | first3= V. | title=2009 International Conference on Information Management and Engineering | doi= 10.1109/ICIME.2009.101 | titlechapter= A Boolean Model in Information Retrieval for Search Engines | year=2009 | pages=385–389 | isbn=978-0-7695-3595-1 | s2cid=18147603 }}
 
[[Category:Mathematical modeling]]