=== Derive an appropriate feature representation for Web queries ===
Many queries are short, and query terms are often noisy.{{Clarify|reason=what does "noisy" mean in this context|date=December 2024|text=|post-text=What does "noisy" mean here?}} For instance, in the KDDCUP 2005 dataset, three-word queries are the most frequent (22%), and 79% of queries contain no more than four words. A user query also frequently has multiple meanings. For example, "apple" could refer to a type of fruit or a computer company, while "Java" could signify a programming language or an island in Indonesia. In the KDDCUP 2005 dataset, a majority of queries have more than one meaning. Therefore, building a [[vector space model]] for classification from the query keywords alone is not appropriate.
Query-enrichment-based methods<ref>Shen et al. [http://www.sigkdd.org/sites/default/files/issues/7-2-2005-12/KDDCUP2005Report_Shen.pdf "Q2C@UST: Our Winning Solution to Query Classification"]. ''ACM SIGKDD Exploration, December 2005, Volume 7, Issue 2''.</ref><ref>Shen et al. [http://portal.acm.org/ft_gateway.cfm?id=1165776 "Query Enrichment for Web-query Classification"]. ''ACM TOIS, Vol. 24, No. 3, July 2006''.</ref> start by enriching each user query into a collection of text documents obtained through [[search engines]]. Each query is thus represented by a pseudo-document consisting of the snippets of the top-ranked result pages retrieved by a search engine. The pseudo-documents are then classified into the target categories using a synonym-based classifier or statistical classifiers such as [[Naive Bayes]] (NB) and [[Support Vector Machines]] (SVMs).
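The enrichment pipeline described above can be illustrated with a minimal sketch. It is not the implementation from the cited papers: the <code>FAKE_SNIPPETS</code> table is an invented stand-in for a real search engine, the three categories and their training text are hypothetical, and the classifier is a plain multinomial Naive Bayes with add-one smoothing chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for a search engine: maps a query to result snippets.
# A real system would query a search service; these strings are invented.
FAKE_SNIPPETS = {
    "apple": [
        "Apple unveils new laptop computer and phone software",
        "Apple computer company reports quarterly software sales",
    ],
    "java": [
        "Java programming language tutorial for software developers",
        "Learn Java programming and write object oriented software",
    ],
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def enrich(query, top_k=2):
    """Build a pseudo-document from the snippets of the top_k results."""
    snippets = FAKE_SNIPPETS.get(query.lower(), [])[:top_k]
    return " ".join(snippets)

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.class_counts = Counter(labels)      # label -> number of docs
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        tokens = tokenize(doc)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus summed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                # Smoothing keeps unseen tokens from zeroing the score.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical target categories with a tiny invented training set.
TRAIN_DOCS = [
    "computer software programming language developers hardware laptop",
    "fruit orchard recipe pie juice harvest sweet",
    "island travel beach volcano tourism hotel",
]
TRAIN_LABELS = ["Computers", "Food", "Travel"]

clf = NaiveBayes().fit(TRAIN_DOCS, TRAIN_LABELS)
pseudo_doc = enrich("java")   # the one-word query becomes a longer text
print(clf.predict(pseudo_doc))
```

The one-word query "java" has no words in common with the training text, but its pseudo-document does, which is what lets a statistical classifier assign a category. In this sketch the invented snippets all lean toward the computing sense, so the sketch does not show how the other meaning of an ambiguous query would be handled.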