Vector space model: Difference between revisions

Revision as of 11:13, 8 May 2024 by 151.236.179.198 (edit summary: "Do we know each other?"; tags: Reverted, Visual edit, Mobile edit, Mobile web edit)
Latest revision as of 16:58, 17 August 2025 by Macrakis (edit summary: "use standard variant"; tag: Visual edit)
(8 intermediate revisions by 8 users not shown)
Line 1:

{{Short description|Model for representing text documents}}
'''Vector space model''' or '''term vector model''' is an algebraic model for representing text documents (or more generally, items) as [[vector space|vectors]] such that the distance between vectors represents the relevance between the documents. It is used in [[information filtering]], [[information retrieval]], [[index (search engine)|index]]ing and ~~relevancy~~relevance rankings. Its first use was in the [[SMART Information Retrieval System]].~~{{citation needed|date=December 2023}}.Do we know each other?~~<ref>{{cite journal | last1 = Berry | first1 = Michael W. | last2 = Drmac | first2 = Zlatko | last3 = Jessup | first3 = Elizabeth R. | date = January 1999 | doi = 10.1137/s0036144598347035 | issue = 2 | journal = SIAM Review | pages = 335–362 | title = Matrices, Vector Spaces, and Information Retrieval | volume = 41}}</ref>

==Definitions==

Line 37 ⟶ 47:

As all vectors under consideration by this model are element-wise nonnegative, a cosine value of zero means that the query and document vector are [[orthogonal]] and have no match (i.e. the query term does not exist in the document being considered). See [[cosine similarity]] for further information.<ref name=":0" />

==Term ~~frequency-inverse~~frequency–inverse document frequency (tf–idf) weights==
In the classic vector space model proposed by [[Gerard Salton|Salton]], Wong and Yang,<ref>[http://doi.acm.org/10.1145/361219.361220 G. Salton, A. Wong, C. S. Yang, A vector space model for automatic indexing], Communications of the ACM, v.18 n.11, p.613–620, Nov. 1975</ref> the term-specific weights in the document vectors are products of local and global parameters. The model is known as the [[~~tf-idf|~~term ~~frequency-inverse~~frequency–inverse document frequency]] (tf–idf) model.
The weight vector for document ''d'' is <math>\mathbf{v}_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T</math>, where
:<math>

Line 73 ⟶ 83:

==Software that implements the vector space model==
{{further information|Vector database}}
The following software packages may be of interest to those wishing to experiment with vector models and implement search services based upon them.

===Free open source software===
* [[Apache Lucene]]. Apache Lucene is a high-performance, open source, full-featured text search engine library written entirely in Java.
* [[OpenSearch (software)]], [[Elasticsearch]] and [[Apache Solr|Solr]]: the ~~2~~three most ~~famous~~well-known search engine ~~software (many smaller exist)~~programs based on Lucene. Others are also available.
* [[Gensim]] is a Python+[[NumPy]] framework for vector space modelling. It contains incremental (memory-efficient) algorithms for [[tf–idf|term frequency–inverse document frequency]], [[Latent Semantic Indexing|latent semantic indexing]], [[Locality sensitive hashing#Random projection|~~Random~~random ~~Projections~~projections]] and [[Latent Dirichlet Allocation|latent Dirichlet allocation]].
* [[Weka (machine learning)|Weka]]. Weka is a popular data mining package for Java including WordVectors and [[Bag-of-words model|bag-of-words models]].
* [[Word2vec]]. Word2vec uses vector spaces for word embeddings.
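
For readers who want to experiment without any of the packages listed above, the following is a minimal Python sketch of the tf–idf weighting and cosine comparison discussed earlier. The weight formula is cut off in this excerpt, so the sketch assumes the common form tf × log(N / df). The function names `tfidf_vectors` and `cosine`, the whitespace tokenizer, and the sample documents are illustrative assumptions, not taken from the article or from any listed library.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, weight vectors), one vector per document.

    Assumption: w[t, d] = tf[t, d] * log(N / df[t]), a common tf-idf form.
    Tokenization is a plain lowercase whitespace split (a simplification).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # missing terms count as 0
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "vector space model",
    "vector database search",
    "cooking pasta recipes",
]
vocab, vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # positive: both documents contain "vector"
print(cosine(vecs[0], vecs[2]))  # no shared terms, so 0.0
```

Because every weight is nonnegative, the cosine stays between 0 and 1, and documents with no terms in common score exactly zero, matching the orthogonality remark in the cosine similarity paragraph.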