Web query classification: Difference between revisions

Content deleted Content added
EvanXW (talk | contribs)
No edit summary
top: -capitals
 
(72 intermediate revisions by 39 users not shown)
Line 1:
{{Cleanup|date=March 2011}}
'''Web query topic classification/categorization''' is a problem in [[information science]]. The task is to assign a [[Web search query]] to one or more predefined [[Categorization|categories]], based on its topics. Different from the traditional [[document classification]] tasks, queries submitted by Web search users are usually short and ambiguous; also the meanings of the queries are evolving over time. Therefore, some canonical document classification techniques cannot be directly applied to the query topic classification tasks.
A '''web query topic classification/categorization''' is a problem in [[information science]]. The task is to assign a [[web search query]] to one or more predefined [[Categorization|categories]], based on its topics. The importance of query classification is underscored by many services provided by Web search. A direct application is to provide better search result pages for users with interests in different categories. For example, users issuing a Web query such as "apple" might expect to see Web pages related to the fruit apple, or they may prefer to see products or news related to the computer company. Online advertisement services can rely on the query classification results to promote different products more accurately. Search result pages can be grouped according to the categories predicted by a query classification algorithm. However, the computation of query classification is non-trivial. Different from the [[document classification]] tasks, queries submitted by Web search users are usually short and ambiguous; also the meanings of the queries are evolving over time. Therefore, query topic classification is much more difficult than traditional document classification tasks.
 
== Difficulties ==
 
Web query topic classification is to automatically assign a query to some predefined categories. Different from the traditional document classification tasks, there are several major difficulties which hinder the progress of Web [[query understanding]]:
== Problem ==
 
=== Derive an appropriate feature representation for Web queries ===
Web query topic classification is to automatically assign a query to some predefined categories. Different from the traditional document classification tasks, there are several major difficulties which hinder the progress of Web query understanding:
 
* '''Short and Noisy''' - Many queries are short, and query terms are often noisy.{{Clarify|reason=what Asdoes an"noisy" example,mean in thethis KDDCUPcontext|date=December 20052024|text=|post-text=What dataset<ref>[http://www.sigkdd.org/kdd2005/kddcup.htmldoes "noisy" mean here?}} For instance, in the KDDCUP 2005 dataset]</ref>, queries containing 3 words are the most frequent (22%). FurthermoreAdditionally, 79% of queries haveconsist of no more than 4 words. EachA user query isfrequently acarries combinationmultiple ofmeanings. wordsFor example, names"apple" could refer to a type of personsfruit or locations,a URLscomputer company, specialwhile acronyms,"Java" programcould segmentssignify anda maliciousprogramming codeslanguage or an island in Indonesia. SomeIn the KDDCUP 2005 dataset, a majority of queries contain themore wordsthan whichone aremeaning. veryTherefore, cleanonly whileusing othersthe maykeywords containof typosthe orquery meaninglessto stringsset whichup area totally[[vector noisy.space <brmodel]] />for classification is not appropriate.
* '''Ambiguous''' - A user query often has multiple meanings. For example, "Apple" can mean a kind of fruit or a computer company. "Java" can mean a programming language or an island in Indonesia. In the KDDCUP 2005 dataset, most of the queries contain more than one meaning. <br />
* '''Concept Drift''' - The meanings of queries may also evolve over time. For example, the word "Barcelona" has a new meaning of the new micro-processor of AMD, while it refers to a city or football club before 2007. The distribution of the meanings of this term is therefore a function of time on the Web.<br />
 
SinceQuery-enrichment thebased keywordsmethods<ref>Shen ofet theal. queries are[http://www.sigkdd.org/sites/default/files/issues/7-2-2005-12/KDDCUP2005Report_Shen.pdf too"Q2C@UST: shortOur andWinning ambiguous,Solution to Query-enrichment basedClassification"]. methods''ACM SIGKDD Exploration, December 2005, Volume 7, Issue 2''.</ref><ref>Dou Shen. et al. [http://lbxmlportal.ustacm.hkorg/th/th_searchft_gateway.plcfm?smodeid=VIEWBYCALLNUM&skeywords=CSED%202007%20Shen1165776 "Learning-basedQuery WebEnrichment Queryfor Web-query UnderstandingClassification"]. ''PhdACM Thesis''TOIS, ''HKUST''Vol. 24, JuneNo. 20073, July 2006''.</ref> start by enriching user queries to a collection of text documents through [[search engines]]. Thus, each query is represented by a pseudo-document which consists of the snippets of top ranked webresult pages retrieved by search engine. Subsequently, the text documents are classified into the target categories using synonym based classifier or statistical classifiers, such as [[Naive Bayes]] (NB) and [[Support Vector Machines]] (SVMs), or similarity computation through a group of intermediate categories.
 
=== Adapting to changes of the queries and categories over time ===
== Methodology ==
 
* '''Concept Drift''' - The meanings of queries may also evolve over time. Therefore, the old labeled training queries may be out-of-data and useless soon. How to make the classifier adaptive over time becomes a big issue. For example, the word "''Barcelona''" has a new meaning of the new micro-processor of AMD, while it refers to a city or football club before 2007. The distribution of the meanings of this term is therefore a function of time on the Web.<br />
 
Intermediate taxonomy based method<ref>Shen et al. [http://portal.acm.org/ft_gateway.cfm?id=1148196 "Building bridges for web query classification"]. ''ACM SIGIR, 2006''.</ref> first builds a bridging classifier on an intermediate taxonomy, such as [[Open Directory Project]] (ODP), in an offline mode. This classifier is then used in an online mode to map user queries to the target categories via the intermediate taxonomy. The advantage of this approach is that the bridging classifier needs to be trained only once and is adaptive for each new set of target categories and incoming queries.
=== Query-enrichment based method ===
 
=== Using unlabeled query logs to help with query classification ===
Since the keywords of the queries are too short and ambiguous, Query-enrichment based methods<ref>Dou Shen. [http://lbxml.ust.hk/th/th_search.pl?smode=VIEWBYCALLNUM&skeywords=CSED%202007%20Shen "Learning-based Web Query Understanding"]. ''Phd Thesis'', ''HKUST'', June 2007.</ref> start by enriching user queries to a collection of text documents through [[search engines]]. Thus, each query is represented by a pseudo-document which consists of the top ranked web pages retrieved by search engine. Subsequently, the text documents are classified into the target categories using statistical classifiers, such as [[Naive Bayes]] (NB) and [[Support Vector Machines]] (SVMs), or similarity computation through a group of intermediate categories.
 
Since the manually labeled training data for query classification is expensive, how to use a very large web search engine query log as a source of unlabeled data to aid in automatic query classification becomes a hot issue. These logs record the Web users' behavior when they search for information via a search engine. Over the years, query logs have become a rich resource which contains Web users' knowledge about the World Wide Web.
 
Query clustering method<ref>Wen et al. [http://portal.acm.org/ft_gateway.cfm?id=503108 "Query Clustering Using User Logs"], ''ACM TOIS, Volume 20, Issue 1, January 2002''.</ref> tries to associate related queries by clustering "session data", which contain multiple queries and click-through information from a single user interaction. They take into account terms from result documents that a set of queries has in common. The use of query keywords together with session data is shown to be the most effective method of performing query clustering.
=== Selectional preference based method ===
 
Selectional preference based methods<ref>Steven M. Beitzel, [http://ir.iit.edu/~steve/beitzel_phd_thesis.pdf "On Understanding and Classifying Web Queries"], ''Phd Thesis'', ''IIT'', May 2006.</ref> try to exploit some [[association rules]] between the query terms to help with the query classification. Given the training data, they exploit several classification approaches including exact-match using labeled data, N-Gram match using labeled data and classifiers based on perceptron. They emphasize on an approach adapted from computational linguistics named selectional preferences. If x and y form a pair (x; y) and y belongs to category c, then all other pairs (x; z) headed by x belong to c. They use unlabeled query log data to mine these rules and validate the effectiveness of their approaches on some labeled queries.
 
Selectional preference based methodsmethod<ref>StevenBeitzel Met al. Beitzel, [http://irportal.iitacm.eduorg/~steve/beitzel_phd_thesisft_gateway.pdfcfm?id=1229183 "OnAutomatic UnderstandingClassification and Classifyingof Web Queries Using Very Large Unlabeled Query Logs"], ''PhdACM Thesis''TOIS, ''IIT''Volume 25, MayIssue 20062, April 2007''.</ref> trytries to exploit some [[association rules]] between the query terms to help with the query classification. Given the training data, they exploit several classification approaches including exact-match using labeled data, N-Gram match using labeled data and classifiers based on perceptronperception. They emphasize on an approach adapted from computational linguistics named selectional preferences. If x and y form a pair (x; y) and y belongs to category c, then all other pairs (x; z) headed by x belong to c. They use unlabeled query log data to mine these rules and validate the effectiveness of their approaches on some labeled queries.
 
== Applications ==
 
* '''[[metasearch|Metasearch engines]]''' send a user's query to multiple search engines and blend the top results from each into one overall list. The search engine can organize the large number of Web pages in the search results, according to the potential categories of the issued query, for the convenience of Web users' navigation.<br />
After decades of development, Web search is moving into a new era which attempts are being made to serve Web users more intelligently. These attempts include [[metasearch]], [[vertical search]], [[online advertising]] and so on.
* '''[[Vertical search]]''', compared to general search, focuses on specific domains and addresses the particular information needs of niche audiences and professions. Once the search engine can predict the category of information a Web user is looking for, it can select a certain vertical search engine automatically, without forcing the user to access the vertical search engine explicitly. <br />
 
* '''[[Online advertising]]'''<ref>[http://www.kdd2007.com/workshops.html#adkdd Data Mining and Audience Intelligence for Advertising (ADKDD'07)], KDD workshop 2007</ref><ref>[http://research.yahoo.com/workshops/troa-2008/ Targeting and Ranking for Online Advertising (TROA'08)], WWW workshop 2008</ref> aims at providing interesting advertisements to Web users during their search activities. The search engine can provide relevant advertising to Web users according to their interests, so that the Web users can save time and effort in research while the advertisers can reduce their advertising costs.
* '''Metasearch engines''' send a user's query to multiple search engines and blend the top results from each into one overall list. The search engine can organize the large number of Web pages in the search results, according to the potential categories of the issued query, for the convenience of Web users' navigation.<br />
* '''Vertical search''', compared to general search, focuses on specific domains and addresses the particular information needs of niche audiences and professions. Once the search engine can predict the category of information a Web user is looking for, it can select a certain vertical search engine automatically, without forcing the user to access the vertical search engine explicitly. <br />
* '''Online advertising''' aims at providing interesting advertisements to Web users during their search activities. The search engine can provide relevant advertising to Web users according to their interests, so that the Web users can save time and effort in research while the advertisers can reduce their advertising costs.
All these services rely on the understanding Web users' search intents through their Web queries.
 
 
== See also ==
 
* [[Document classification]]
* [[Web search query]]
* [[Information retrieval]]
* [[Query expansion]]
* [[Naive Bayes classifier]]
* [[Support vector machines]]
* [[Meta search]]
* [[Vertical search]]
* [[Online advertising]]
 
== NotesReferences ==
 
{{reflist}}
 
== Further reading ==
* Shen. [http://lbxml.ust.hk/th/th_search.pl?smode=VIEWBYCALLNUM&skeywords=CSED%202007%20Shen "Learning-based Web Query Understanding"]. ''Phd Thesis'', ''HKUST'', June 2007.
{{Internet search}}
 
{{DEFAULTSORT:Web Query Classification}}
[[Category:Information retrieval techniques]]
[[Category:Internet search]]