Web query classification: Difference between revisions

Revision as of 21:55, 15 August 2024 by RJFJR (→How to use the unlabeled query logs to help with query classification?: === Using unlabeled query logs to help with query classification ===)
Latest revision as of 22:29, 3 January 2025 by Eurohunter (→top: -capitals)
(4 intermediate revisions by 2 users not shown)
{{Cleanup|date=March 2011}}

'''Web query topic classification/categorization''' is a problem in [[information science]]. The task is to assign a [[web search query]] to one or more predefined [[Categorization|categories]], based on its topics. The importance of query classification is underscored by the many services that Web search provides. A direct application is to provide better search result pages for users with interests in different categories. For example, users issuing a Web query such as "''apple''" might expect to see Web pages related to the fruit, or they may prefer to see products or news related to the computer company. Online advertisement services can rely on query classification results to promote different products more accurately. Search result pages can be grouped according to the categories predicted by a query classification algorithm. However, the computation of query classification is non-trivial. Unlike the documents in [[document classification]] tasks, queries submitted by Web search users are usually short and ambiguous, and their meanings evolve over time. Therefore, query topic classification is much more difficult than traditional document classification.

The KDDCUP 2005 competition<ref>[http://www.kdd.org/kdd-cup/view/kdd-cup-2005 KDDCUP 2005 dataset]</ref> highlighted the interest in query classification. The objective of this competition was to classify 800,000 real user queries into 67 target categories. Each query can belong to more than one target category. As an example of a query classification (QC) task, the query "apple" should be classified into the ranked categories "Computers \ Hardware; Living \ Food & Cooking".

{| class="wikitable"
|-
! Query
! Categories
|-
| apple
| Computers \ Hardware<br />Living \ Food & Cooking
|-
| FIFA 2006
| Sports \ Soccer<br />Sports \ Schedules & Tickets<br />Entertainment \ Games & Toys
|-
| cheesecake recipes
| Living \ Food & Cooking<br />Information \ Arts & Humanities
|-
| friendships poem
| Information \ Arts & Humanities<br />Living \ Dating & Relationships
|}

== Difficulties ==

=== Derive an appropriate feature representation for Web queries ===

Many queries are short, and query terms are often noisy.{{Clarify|reason=what does "noisy" mean in this context|date=December 2024|text=|post-text=What does "noisy" mean here?}} For instance, in the KDDCUP 2005 dataset, queries containing 3 words are the most frequent (22%). Additionally, 79% of queries consist of no more than 4 words.

A user query frequently carries multiple meanings. For example, "''apple''" could refer to a type of fruit or a computer company, while "''Java''" could signify a programming language or an island in Indonesia. In the KDDCUP 2005 dataset, a majority of queries contain more than one meaning. Therefore, using only the keywords of the query to set up a [[vector space model]] for classification is not appropriate.

Query-enrichment based methods<ref>Shen et al. [http://www.sigkdd.org/sites/default/files/issues/7-2-2005-12/KDDCUP2005Report_Shen.pdf "Q2C@UST: Our Winning Solution to Query Classification"]. ''ACM SIGKDD Exploration, December 2005, Volume 7, Issue 2''.</ref><ref>Shen et al. [http://portal.acm.org/ft_gateway.cfm?id=1165776 "Query Enrichment for Web-query Classification"]. ''ACM TOIS, Vol. 24, No. 3, July 2006''.</ref> start by enriching each user query into a collection of text documents through [[search engines]].
Thus, each query is represented by a pseudo-document that consists of the snippets of the top-ranked result pages retrieved by a search engine. Subsequently, the pseudo-documents are classified into the target categories using a synonym-based classifier or statistical classifiers, such as [[Naive Bayes]] (NB) and [[Support Vector Machines]] (SVMs).
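
The enrichment-then-classify pipeline described above can be illustrated with a minimal Python sketch. It is not the implementation from the cited papers. The snippet table <code>FAKE_SNIPPETS</code> stands in for a real search engine, the training texts in <code>TRAIN</code> are invented for illustration, and the helper names <code>enrich</code> and <code>classify</code> are hypothetical. The classifier is a small multinomial Naive Bayes with add-one smoothing.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for a search engine: query -> result snippets.
# A real system would query a search API; these snippets are invented.
FAKE_SNIPPETS = {
    "apple": [
        "Apple unveils new laptop hardware and computer chips processor",
        "Apple pie recipe with fresh fruit",
    ],
    "cheesecake recipes": [
        "Easy cheesecake recipe with baking and cooking tips",
    ],
}

# Invented training documents, one per target category (illustration only).
TRAIN = [
    ("laptop computer hardware chips processor", "Computers \\ Hardware"),
    ("recipe fruit cooking pie baking", "Living \\ Food & Cooking"),
]


def enrich(query):
    """Build a pseudo-document (list of tokens) from the result snippets."""
    snippets = FAKE_SNIPPETS.get(query.lower(), [query])
    return " ".join(snippets).lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def scores(self, tokens):
        """Return the log-score of each category for a token list."""
        total_docs = sum(self.doc_counts.values())
        result = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            logp = math.log(n_docs / total_docs)
            for token in tokens:
                if token in self.vocab:  # skip words unseen in training
                    logp += math.log((counts[token] + 1) / denom)
            result[label] = logp
        return result

    def rank(self, tokens):
        """Return categories ordered from most to least likely."""
        s = self.scores(tokens)
        return sorted(s, key=s.get, reverse=True)


model = NaiveBayes().fit(
    [text.split() for text, _ in TRAIN],
    [label for _, label in TRAIN],
)


def classify(query):
    """Enrich the query into a pseudo-document, then rank the categories."""
    return model.rank(enrich(query))


if __name__ == "__main__":
    for q in ("apple", "cheesecake recipes"):
        print(q, "->", classify(q))
```

The bare keyword "apple" does not occur in either training text, so a keyword-only vector would give the classifier nothing to work with. The pseudo-document contributes words such as "laptop" and "recipe" that do carry category evidence, which is the motivation for enrichment given above.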