Vector space model: Difference between revisions

dash
Line 2:
 
==Definitions==
 
Documents and queries are represented as vectors.
 
Line 15 ⟶ 14:
 
==Applications==
[[File:vector space model.jpg|right|250px]]
 
[[Relevance (information retrieval)|Relevance]] [[ranking]]s of documents in a keyword search can be calculated, using the assumptions of [[semantic similarity|document similarities]] theory, by comparing the angle between each document vector and the original query vector, where the query is represented as a vector with the same dimension as the document vectors.
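
The angle comparison described above is commonly implemented as a cosine similarity between the query vector and each document vector. The following is a minimal sketch under stated assumptions, not a reference implementation: the whitespace tokenizer, the raw term counts used as vector components, and the names <code>term_vector</code>, <code>cosine_similarity</code> and <code>rank_documents</code> are illustrative choices that do not come from the article.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Map a text to a vector of raw term counts over a fixed vocabulary.

    Assumption: lowercase whitespace tokenization, no stemming or stop words.
    """
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query, documents):
    """Return (score, document index) pairs, highest cosine similarity first."""
    # The query and all documents share one vocabulary, so every vector
    # has the same dimension.
    vocabulary = sorted(
        {term for text in documents + [query] for term in text.lower().split()}
    )
    query_vector = term_vector(query, vocabulary)
    scored = [
        (cosine_similarity(term_vector(doc, vocabulary), query_vector), index)
        for index, doc in enumerate(documents)
    ]
    # Ties are broken by the original document order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


documents = [
    "the cat sat on the mat",
    "dogs chase cats",
    "the stock market fell",
]
ranking = rank_documents("cat on mat", documents)
```

In this sketch only the first document shares terms with the query, so it is ranked first. The second document scores zero because "cats" and "cat" are treated as different terms, which illustrates why practical systems usually add stemming or other normalization.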
Line 39 ⟶ 37:
 
==Example: tf-idf weights==
In the classic vector space model proposed by [[Gerard Salton|Salton]], Wong and Yang <ref>[http://doi.acm.org/10.1145/361219.361220 G. Salton, A. Wong, C. S. Yang, A vector space model for automatic indexing], Communications of the ACM, v.18 n.11, p.613–620, Nov. 1975</ref> the term-specific weights in the document vectors are products of local and global parameters. The model is known as the [[tf-idf|term frequency-inverse document frequency]] model. The weight vector for document ''d'' is <math>\mathbf{v}_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T</math>, where
 
:<math>
Line 51 ⟶ 48:
 
==Advantages==
 
The vector space model has the following advantages over the [[Standard Boolean model]]:
 
Line 62 ⟶ 58:
 
==Limitations==
 
The vector space model has the following limitations:
 
Line 70 ⟶ 65:
#The order in which the terms appear in the document is lost in the vector space representation.
#The model theoretically assumes that terms are statistically independent.
#Term weighting is intuitive but not very formal.
 
Many of these difficulties can, however, be overcome by the integration of various tools, including mathematical techniques such as [[singular value decomposition]] and [[lexical database]]s such as [[WordNet]].
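
The loss of term order listed among the limitations above can be seen directly in a bag-of-words representation. The snippet below is only an illustrative sketch: the helper name <code>bag_of_words</code> and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def bag_of_words(text):
    """Count terms in a text; the position of each term is discarded."""
    return Counter(text.lower().split())


first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")

# Both sentences contain the same terms with the same counts, so their
# vector space representations are identical although their meanings differ.
same_representation = first == second
```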
 
==Models based on and extending the vector space model==
 
Models based on and extending the vector space model include:
* [[Generalized vector space model]]
Line 84 ⟶ 78:
 
==Software that implements the vector space model==
 
The following software packages may be of interest to those wishing to experiment with vector models and implement search services based upon them.
 
===Free open source software===
 
* [[Apache Lucene]]. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
* [[Elasticsearch]]. Elasticsearch is another high-performance, full-featured text search engine, built on Lucene.
Line 96 ⟶ 88:
 
==Further reading==
* [[Gerard Salton|G. Salton]] (1962), "[https://dl.acm.org/citation.cfm?id=1461544 Some experiments in the generation of word and document associations]" ''Proceeding AFIPS '62 (Fall) Proceedings of the December 4–6, 1962, fall joint computer conference'', pages 234–250. ''(Early paper of Salton using the term-document matrix formalization)''
* [[Gerard Salton|G. Salton]], A. Wong, and C. S. Yang (1975), "[https://dl.acm.org/citation.cfm?id=361220 A Vector Space Model for Automatic Indexing]" ''Communications of the ACM'', vol. 18, nr. 11, pages 613–620. ''(Article in which a vector space model was presented)''
* David Dubin (2004), [http://www.ideals.uiuc.edu/bitstream/2142/1697/2/Dubin748764.pdf The Most Influential Paper Gerard Salton Never Wrote] ''(Explains the history of the Vector Space Model and the non-existence of a frequently cited publication)''
Line 115 ⟶ 106:
 
==References==
{{reflist}}
 
[[Category:Vector space model|* ]]