Content deleted Content added
cite |
TokenByToken (talk | contribs) revise for clarity Tags: Visual edit Disambiguation links added |
||
(3 intermediate revisions by 3 users not shown) | |||
Line 1:
{{short description|Classical information retrieval model}}
{{technical|date=June 2018}}The (standard) '''Boolean model of information retrieval''' ('''BIR''')<ref>{{citation | author1=Lancaster, F.W. | author2=Fayen, E.G. | title=Information Retrieval On-Line | publisher=Melville Publishing Co., Los Angeles, California | year=1973}}</ref> is a classical [[information retrieval]] (IR) model
▲The (standard) '''Boolean model of information retrieval''' ('''BIR''')<ref>{{citation | author1=Lancaster, F.W. | author2=Fayen, E.G. | title=Information Retrieval On-Line | publisher=Melville Publishing Co., Los Angeles, California | year=1973}}</ref> is a classical [[information retrieval]] (IR) model and, at the same time, the first and most-adopted one.<ref>{{Cite web |title=Information Retrieval |url=https://mitpress.mit.edu/9780262528870/information-retrieval/ |access-date=2023-12-09 |website=MIT Press |language=en-US}}</ref> The BIR is based on [[Boolean logic]] and classical [[set theory]] in that both the documents to be searched and the user's query are conceived as sets of terms (a [[bag-of-words model]]). Retrieval is based on whether or not the documents contain the query terms.
==Definitions==
In the Boolean model, documents and queries are represented using concepts from [[set theory]]. A document is seen as a simple collection (a set) of terms, and a query is a formal statement (a Boolean expression) that specifies which terms must or must not be present in a retrieved document.
* An '''index term'''
* A '''document''' is represented as a [[set]] of index terms. This is a [[bag-of-words model]], meaning the order and frequency of terms in the original document are ignored. For example, a document about Bayes' theorem might be represented simply as the set <math>\{\text{Bayes' theorem, probability, decision-making}\}</math>.
* A '''query''' is a formal expression of the user's information need, written using index terms and Boolean operators (AND, OR, NOT). The model retrieves every document that is considered a "match" for this logical expression.
===Formal representation===
The model can be defined formally as follows:
* Let <math>T = \{t_1, t_2, \ldots, t_k\}</math> be a set of all index terms.
* A document <math>D_j</math> is any subset of <math>T</math>.
* A query <math>Q</math> is a Boolean expression, typically in [[conjunctive normal form]]:<math display="block">Q = (t_a \lor t_b) \land (\lnot t_c \lor t_d) \land \dots</math>where <math>t_a, t_b, \dots \in T</math>.
Retrieval is the process of identifying the set of all documents <math>\{D_j\}</math> that satisfy the query <math>Q</math>. For example, for the simple query <math>Q = t_a \land t_b</math>, the system would retrieve all documents whose set of terms contains both <math>t_a</math> and <math>t_b</math>.
==Example==
Line 19 ⟶ 20:
Let the set of original (real) documents be, for example
: <math>
where
<math display="inline">
<math display="inline">
<math display="inline">
Let the set <math display="inline">T</math> of terms be:<math display="block">T = \{t_1=\text{Bayes' principle}, t_2=\text{probability}, t_3=\text{decision-making}, t_4=\text{Bayesian epistemology}\}</math>Then, the set <math display="inline">D</math> of documents is as follows:<math display="block">D = \{D_1,\ D_2,\ D_3\}</math>where <math display="block">\begin{align}▼
▲<math display="block">T = \{t_1=\text{Bayes' principle}, t_2=\text{probability}, t_3=\text{decision-making}, t_4=\text{Bayesian epistemology}\}</math>
D_1 &= \{\text{probability},\ \text{Bayes' principle}\} \\
D_2 &= \{\text{probability},\ \text{decision-making}\} \\
D_3 &= \{\text{probability},\ \text{Bayesian epistemology}\}
\end{align}</math>Let the query <math display="inline">Q</math> be ("probability" AND "decision-making"):<math display="block">Q = \text{probability} \and \text{decision-making}</math>Then to retrieve the relevant documents:▼
▲<math display="block">Q = \text{probability} \and \text{decision-making}</math>Then to retrieve the relevant documents:
# Firstly, the following sets <math display="inline">S_1</math> and <math display="inline">S_2</math> of documents <math display="inline">D_i</math> are obtained (retrieved):<math display="block">\begin{align}
S_1 &= \{D_1,\ D_2,\ D_3\} \\
S_2 &= \{D_2\}
\end{align}</math>Where <math>S_1</math> corresponds to the documents which contain the term "probability" and <math>S_2</math> contain the term "decision-making".
# Finally, the following documents <math display="inline">D_i</math> are retrieved in response to <math display="inline">Q</math>: <math display="block">Q: \{D_1,\ D_2,\ D_3\}\ \cap\ \{D_2\}\ =\ \{D_2\}</math>Where the query looks for documents that are contained in both sets <math>S</math> using the intersection operator.
This means that the original document <math
== Advantages ==
Line 67 ⟶ 56:
* [[String search algorithm|Exact matching]] may retrieve too few or too many documents
* Hard to translate a query into a Boolean expression
* Ineffective for Search-Resistant Concepts<ref>{{cite journal |last1=Shokraneh |first1=Farhad |title=Stop searching and you will find it: Search-Resistant Concepts in systematic review searches |journal=BMJ Evidence-Based Medicine |date=6 August 2024 |pages=bmjebm–2023–112798 |doi=10.1136/bmjebm-2023-112798}}</ref>
* All terms are equally weighted
* More like ''[[data retrieval]]'' than ''information retrieval''
Line 82 ⟶ 72:
{{ main | feature hashing }}
Another possibility is to use [[Set (abstract data type)|hash
=== Signature file ===
Each document can be summarized by [[Bloom filter]] representing the set of words in that document, stored in a fixed-length bitstring, called a signature. The signature file contains one such [[superimposed code]] bitstring for every document in the collection. Each query can also be summarized by a [[Bloom filter]] representing the set of words in the query, stored in a bitstring of the same fixed length. The query bitstring is tested against each signature.<ref name="zobel" >
Justin Zobel; Alistair Moffat; and Kotagiri Ramamohanarao.
[https://people.eng.unimelb.edu.au/jzobel/fulltext/acmtods98.pdf "Inverted Files Versus Signature Files for Text Indexing"].
|