Vector space model: Difference between revisions

use standard variant
 
(2 intermediate revisions by 2 users not shown)
Line 1:
{{Short description|Model for representing text documents}}
'''Vector space model''' or '''term vector model''' is an algebraic model for representing text documents (or more generally, items) as [[vector space|vectors]] such that the distance between vectors represents the relevance between the documents. It is used in [[information filtering]], [[information retrieval]], [[index (search engine)|index]]ing and relevance rankings. Its first use was in the [[SMART Information Retrieval System]].<ref>{{cite journal
| last1 = Berry | first1 = Michael W.
| last2 = Drmac | first2 = Zlatko
Line 47:
As all vectors under consideration by this model are element-wise nonnegative, a cosine value of zero means that the query and document vectors are [[orthogonal]] and have no match (i.e. no query term occurs in the document being considered). See [[cosine similarity]] for further information.<ref name=":0" />
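
The orthogonality case can be illustrated with a minimal Python sketch. It assumes raw term counts as vector components and whitespace tokenization; the helper names <code>term_vector</code> and <code>cosine_similarity</code> and the sample documents are hypothetical, chosen only for this illustration, and are not part of any particular system.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Map a text to raw term counts over a fixed vocabulary (assumed weighting)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample data: d1 shares terms with the query, d2 shares none.
documents = {
    "d1": "the cat sat on the mat",
    "d2": "stock markets fell sharply",
}
query = "cat mat"

# One vector dimension per distinct term in the documents and the query.
vocabulary = sorted(
    {w for text in documents.values() for w in text.split()} | set(query.split())
)
query_vec = term_vector(query, vocabulary)
scores = {
    name: cosine_similarity(term_vector(text, vocabulary), query_vec)
    for name, text in documents.items()
}
print(scores)
```

Because every component is nonnegative, each score lies between 0 and 1. Here the score for <code>d2</code> is exactly zero, since it shares no term with the query, while <code>d1</code> receives a positive score.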
 
== Term frequency–inverse document frequency (tf–idf) weights ==
In the classic vector space model proposed by [[Gerard Salton|Salton]], Wong and Yang,<ref>[http://doi.acm.org/10.1145/361219.361220 G. Salton, A. Wong, C. S. Yang, A vector space model for automatic indexing], Communications of the ACM, v.18 n.11, p.613–620, Nov. 1975</ref> the term-specific weights in the document vectors are products of local and global parameters. The model is known as the [[tf-idf|term frequency–inverse document frequency]] (tf–idf) model. The weight vector for document ''d'' is <math>\mathbf{v}_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T</math>, where
 
:<math>
Line 88:
===Free open source software===
* [[Apache Lucene]]. Apache Lucene is a high-performance, open source, full-featured text search engine library written entirely in Java.
* [[OpenSearch (software)]], [[Elasticsearch]] and [[Apache Solr|Solr]]: the three best-known search engine programs based on Lucene (many smaller ones exist).
* [[Gensim]] is a Python+[[NumPy]] framework for vector space modelling. It contains incremental (memory-efficient) algorithms for [[tf–idf|term frequency–inverse document frequency]], [[Latent Semantic Indexing|latent semantic indexing]], [[Locality sensitive hashing#Random projection|random projections]] and [[Latent Dirichlet Allocation|latent Dirichlet allocation]].
* [[Weka (machine learning)|Weka]]. Weka is a popular data mining package for Java including WordVectors and [[Bag-of-words model|bag-of-words models]].