Talk:Document-term matrix: Difference between revisions

{{WikiProject banner shell|class=Stub|
{{WikiProject Linguistics|importance=Low|applied=Yes|applied-importance=|auto=Yes}}
}}
==Comments==
I need some help here:
 
 
We definitely need more applications.
[[User:Kh251|Kh251]]
 
I don't agree with the last changes. Performing eigenvalue decomposition reduces the size of the matrix, thus improving speed but decreasing accuracy. I know I might be wrong, but I'd like to understand...
[[User:Kh251|KH251]] 09:32, 21 July 2005 (UTC)
 
: Not necessarily: what you say is one valid interpretation of the reduction, but the reduction can also be interpreted as creating a "better" matrix, since the operation tends to "soften" the representation and reduce possible noise.
: Also, it's not always true that this makes it easier on the computational side; for instance, LSA is rather ''heavier'' than just leaving the thing alone (I have a reference for that somewhere, I am just rather busy at the moment...). Hope it helps! Cheers! [[User:Rama|Rama]] 12:14, 21 July 2005 (UTC)
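: The rank reduction being debated above can be sketched with a truncated SVD (the decomposition underlying LSA) on a toy document-term matrix. The corpus below is a made-up 4-document, 5-term example, and the use of numpy is my own choice, not something stated in the discussion:

```python
import numpy as np

# Toy document-term matrix: rows = documents, columns = terms.
# Entries are raw term counts for a hypothetical 4-document, 5-term corpus.
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 0, 3, 1],
    [0, 0, 1, 1, 2],
], dtype=float)

# Truncated SVD: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# X_k is the best rank-k approximation of X in the least-squares sense;
# this is the "softening" interpretation: small singular values, which
# tend to carry noise, are discarded. Documents can now be compared in
# the k-dimensional latent space instead of the full 5-term space.
doc_vectors = U[:, :k] * s[:k]
print(doc_vectors.shape)  # (4, 2)
```

: The accuracy/size trade-off KH251 raises is the choice of `k`: smaller `k` gives smaller document vectors but a coarser approximation of the original matrix.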
 
:: Yes, but LSA is computed once; the important part is having real-time answers to ''queries''. Once the matrix is smaller, this will be faster, won't it? [[User:Kh251|KH251]] 12:37, 21 July 2005 (UTC)
 
:::LSA produces a very serious computation burden on a search engine. Right now, if you type a word at a search engine, it looks the word up in a [[trie]] and finds documents that contain that word in O(1) time (independent of the number of documents in the collection). If you had a search engine that looked up documents in the LSA latent space, it would have to perform high-dimensional nearest neighbor search. LSA is typically used with 100+ dimensions, so none of the [[computational geometry]] speed-ups for nearest neighbor search apply. Therefore, the search would be O(N), where N is the ''number of documents in the collection''. For Google, that would be 8,000,000,000. As you can see, this is disastrous for searching the web. -- [[User:Hike395|hike395]] 06:14, July 22, 2005 (UTC)
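::: The cost difference hike395 describes can be illustrated side by side. The inverted index below stands in for the trie-backed lookup (one lookup per query term, independent of collection size), while the LSA-style path must scan all N document vectors. The corpus, vocabulary, and sizes are invented for illustration:

```python
import numpy as np

# Keyword search: an inverted index maps each term to its posting list.
# Answering a one-word query is a single hash/trie lookup; the cost does
# not grow with the number of documents in the collection.
index = {"matrix": [0, 2], "latent": [1], "search": [2, 3]}
hits = index.get("matrix", [])

# LSA-style retrieval: every document is a dense k-dimensional vector in
# the latent space, and answering a query means a nearest-neighbor scan
# over ALL N documents -- O(N * k) work per query, since at 100+
# dimensions the usual geometric speed-ups no longer help.
rng = np.random.default_rng(0)
N, k = 10_000, 100                      # N documents, 100 latent dimensions
docs = rng.standard_normal((N, k))      # stand-in latent document vectors
query = rng.standard_normal(k)          # stand-in projected query vector

# Cosine similarity against every single document.
sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
best = int(np.argmax(sims))
```

::: With N = 10,000 the linear scan is harmless; the point of the thread above is that at web scale (N in the billions) it is not.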
 
:::: Oh ! That's how ! Thank you very much for the explanation. You made my day. [[User:Kh251|KH251]] 09:02, 22 July 2005 (UTC)
 
Since several of us seem to have a taste for the subject, would anyone fancy creating an "NLP project" on Wikipedia? [[User:Rama|Rama]] 12:18, 22 July 2005 (UTC)
 
== Intro Improvement Request ==
 
I encountered this term for the first time just a few minutes ago. I read the intro, but I still don't have a clear idea of what a document-term matrix is, other than it is a mathematical matrix and that it is related to a body of text. [[User:Danielx|Danielx]] ([[User talk:Danielx|talk]]) 01:42, 2 November 2009 (UTC)