'''Web query topic classification/categorization''' is a problem in [[information science]]. The task is to assign a [[Web search query]] to one or more predefined [[Categorization|categories]], based on its topics. Unlike traditional [[document classification]] tasks, queries submitted by Web search users are usually short and ambiguous, and their meanings evolve over time. Therefore, some canonical document classification techniques cannot be applied directly to query topic classification.
* '''Ambiguous''' - A user query often has multiple meanings. For example, "Apple" can mean a kind of fruit or a computer company. "Java" can mean a programming language or an island in Indonesia. In the KDDCUP 2005 dataset, most of the queries have more than one meaning. <br />
* '''Concept Drift''' - The meanings of queries may also evolve over time. For example, before 2007 the word "Barcelona" referred to a city or a football club; it has since also come to denote a microprocessor from AMD. The distribution of the meanings of this term on the Web is therefore a function of time.<br />
== Methodology ==
=== Query-enrichment based method ===
Since queries are too short and ambiguous to classify directly, query-enrichment based methods[2] start by expanding each user query into a collection of text documents through [[search engines]]. Each query is thus represented by a pseudo-document consisting of the top-ranked web pages retrieved by the search engine. Subsequently, the pseudo-documents are classified into the target categories using statistical classifiers, such as [[Naive Bayes]] (NB) and [[Support Vector Machines]] (SVMs), or by similarity computation through a group of intermediate categories.
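
The enrichment step can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the in-memory <code>PAGES</code> list and the term-overlap <code>retrieve</code> function stand in for a real search engine, the training documents and category names are invented, and the small add-one-smoothed Naive Bayes classifier is only one of the classifiers mentioned above.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for a search engine index (illustration only).
PAGES = [
    "apple iphone macbook computer company products",
    "apple fruit orchard recipes pie",
    "java programming language compiler code",
    "java island indonesia travel volcano",
    "python programming language code tutorial",
    "banana fruit recipes smoothie",
]

def retrieve(query, k=2):
    """Toy retrieval: rank pages by the number of terms shared with the query."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(page.split())), page) for page in PAGES]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [page for _, page in ranked[:k]]

def enrich(query, k=2):
    """Build the pseudo-document: tokens of the top-k retrieved pages."""
    return " ".join(retrieve(query, k)).split()

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        known = [t for t in tokens if t in self.vocab]  # ignore unseen words

        def log_score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            prior = math.log(self.class_counts[label] / total)
            return prior + sum(math.log((counts[t] + 1) / denom) for t in known)

        return max(self.class_counts, key=log_score)

# Invented labeled documents for two illustrative target categories.
TRAIN = [
    ("programming language code compiler software".split(), "Computers"),
    ("computer company products hardware".split(), "Computers"),
    ("fruit recipes pie orchard".split(), "Food"),
    ("smoothie fruit cooking".split(), "Food"),
]
model = NaiveBayes().fit([d for d, _ in TRAIN], [c for _, c in TRAIN])

def classify(query):
    """Classify a query via its pseudo-document rather than its own keywords."""
    return model.predict(enrich(query))

print(classify("java compiler"))
print(classify("banana smoothie"))
```

In this sketch the query "banana smoothie" shares only one word with the training data, but its pseudo-document adds "fruit" and "recipes", which gives the classifier more evidence to work with.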
=== Selectional preference based method ===
Selectional preference based methods[3] try to exploit [[association rules]] between query terms to help with query classification. Given the training data, they apply several classification approaches, including exact match using labeled data, N-gram match using labeled data, and perceptron-based classifiers. They emphasize an approach adapted from computational linguistics called selectional preferences: if x and y form a pair (x; y) and y belongs to category c, then all other pairs (x; z) headed by x belong to c. They use unlabeled query log data to mine these rules and validate the effectiveness of their approaches on a set of labeled queries.
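
The rule-mining idea can be sketched in Python as follows. This is a simplified illustration, not the published method: the seed <code>LEXICON</code>, the <code>QUERY_LOG</code>, the category names and the <code>min_support</code> threshold are all hypothetical, queries are assumed to split into a head word x and a remainder y, and the hard rule is relaxed to a majority vote over the categories observed for each head.

```python
from collections import Counter, defaultdict

# Hypothetical seed lexicon: terms whose category is already known.
LEXICON = {
    "madonna": "Music",
    "beatles": "Music",
    "seattle": "Travel",
    "paris": "Travel",
}

# Hypothetical unlabeled query log.
QUERY_LOG = [
    "lyrics madonna",
    "lyrics beatles",
    "hotels seattle",
    "hotels paris",
    "lyrics paris",
]

def split_pair(query):
    """Split a query into (head x, remainder y); None if it has a single term."""
    parts = query.lower().split(maxsplit=1)
    return tuple(parts) if len(parts) == 2 else None

def mine_preferences(log, lexicon):
    """Count, for each head x, the categories of the known terms y it occurs with."""
    prefs = defaultdict(Counter)
    for query in log:
        pair = split_pair(query)
        if pair and pair[1] in lexicon:
            prefs[pair[0]][lexicon[pair[1]]] += 1
    return prefs

def classify(query, prefs, min_support=2):
    """Assign (x; z) the category preferred by head x, if seen often enough."""
    pair = split_pair(query)
    if pair is None or pair[0] not in prefs:
        return None
    category, count = prefs[pair[0]].most_common(1)[0]
    return category if count >= min_support else None

prefs = mine_preferences(QUERY_LOG, LEXICON)
print(classify("lyrics coldplay", prefs))
print(classify("hotels tokyo", prefs))
print(classify("weather tokyo", prefs))
```

The support threshold is a design choice of this sketch: it keeps a single noisy log entry such as "lyrics paris" from overriding the dominant preference of the head "lyrics".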
== Applications ==
After decades of development, Web search is moving into a new era in which attempts are being made to serve Web users more intelligently. These attempts include [[metasearch]], [[vertical search]], [[online advertising]] and so on.
* '''Metasearch engines''' send a user's query to multiple search engines and blend the top results from each into one overall list. The search engine can organize the large number of Web pages in the search results, according to the potential categories of the issued query, for the convenience of Web users' navigation.<br />
* '''Vertical search''', compared to general search, focuses on specific domains and addresses the particular information needs of niche audiences and professions. Once the search engine can predict the category of information a Web user is looking for, it can select a certain vertical search engine automatically, without forcing the user to access the vertical search engine explicitly. <br />
* '''Online advertising''' aims at providing interesting advertisements to Web users during their search activities. The search engine can provide relevant advertising to Web users according to their interests, so that the Web users can save time and effort in research while the advertisers can reduce their advertising costs.
All these services rely on understanding Web users' search intents through their Web queries.