* [[Latent semantic indexing]] – indexing and retrieval technique that uses [[singular value decomposition]] to identify patterns in the relationships between the terms and concepts contained in a collection of text.
* [[Lemmatisation]] – groups together the inflected forms of a word that share the same lemma, so that they can be analysed as a single item.
* [[Morphology (linguistics)|Morphological segmentation]] – separates words into individual [[morphemes]] and identifies the class of the morphemes. The difficulty of this task depends greatly on the complexity of the [[morphology (linguistics)|morphology]] (i.e. the structure of words) of the language being considered.
* [[Named-entity recognition]] (NER) – given a stream of text, determines which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, ___location, organization). Although [[capitalization]] can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case is often inaccurate or insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many languages written in non-Western scripts (e.g. Chinese or Arabic) have no capitalization at all, and even languages with capitalization may not use it consistently to distinguish names.
* [[Ontology learning]] – automatic or semi-automatic creation of [[Ontology (information science)|ontologies]], including extracting a ___domain's terms and the relationships between the concepts those terms represent from a corpus of natural-language text, and encoding them with an [[ontology language]] for easy retrieval. Also called "ontology extraction", "ontology generation", and "ontology acquisition".
* [[Parsing]] – determines the [[parse tree]] (grammatical analysis) of a given sentence. The [[grammar]] for [[natural language]]s is [[ambiguous]], and typical sentences have multiple possible analyses; for a typical sentence there may be thousands of potential parses, most of which will seem completely nonsensical to a human (see the toy-grammar sketch following this list).
** [[Shallow parsing]] – also called chunking; identifies the constituents of a sentence (such as noun phrases and verb phrases) without fully specifying their internal structure or their role in the main sentence.
* [[Part-of-speech tagging]] – given a sentence, determines the [[part of speech]] for each word. Many words, especially common ones, can serve as multiple [[parts of speech]]. For example, "book" can be a [[noun]] ("the book on the table") or [[verb]] ("to book a flight"); "set" can be a [[noun]], [[verb]] or [[adjective]]; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little [[inflectional morphology]], such as [[English language|English]], are particularly prone to such ambiguity (see the tagging sketch following this list).
* [[Query expansion]] – reformulates a search query, for example by adding synonyms or related terms, to improve the recall of [[information retrieval]].
* [[Relationship extraction]] – given a chunk of text, identifies the relationships among named entities (e.g. who is the wife of whom).
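The sketch below illustrates part-of-speech tagging, named-entity recognition and lemmatisation in a few lines of Python. The use of [[Natural Language Toolkit|NLTK]], the example sentence and the listed data packages are assumptions made for illustration; any comparable toolkit could be substituted.

<syntaxhighlight lang="python">
# Illustrative sketch only: assumes NLTK is installed and its data packages
# (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words, wordnet)
# have been downloaded, e.g. via nltk.download().
import nltk
from nltk.stem import WordNetLemmatizer

sentence = "Alice booked a flight to Paris after reading the books."

tokens = nltk.word_tokenize(sentence)   # tokenization (trivial word segmentation for English)
tagged = nltk.pos_tag(tokens)           # part-of-speech tagging: "booked" -> verb, "books" -> noun
print(tagged)

entities = nltk.ne_chunk(tagged)        # named-entity recognition: "Alice" and "Paris" become entities
print(entities)

lemmatizer = WordNetLemmatizer()        # lemmatisation groups inflected forms under one lemma
print(lemmatizer.lemmatize("booked", pos="v"))  # -> "book"
print(lemmatizer.lemmatize("books", pos="n"))   # -> "book"
</syntaxhighlight>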
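As a sketch of parsing ambiguity, the toy context-free grammar below (an invented example, parsed here with NLTK's chart parser) yields two distinct trees for a single sentence, depending on where the prepositional phrase attaches.

<syntaxhighlight lang="python">
# Illustrative sketch only: a deliberately tiny, invented grammar.
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'the'
N  -> 'man' | 'telescope'
V  -> 'saw'
P  -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "I saw the man with the telescope".split()

# Two parse trees are printed: "with the telescope" modifies either
# the seeing (VP attachment) or the man (NP attachment).
for tree in parser.parse(sentence):
    print(tree)
</syntaxhighlight>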
* [[Topic segmentation]] and recognition – given a chunk of text, separates it into segments, each of which is devoted to a single topic, and identifies the topic of each segment.
* [[Truecasing]] – restores the standard capitalization of text in which that information is missing or unreliable, such as all-lowercase or all-uppercase text.
* [[Word segmentation]] – separates a chunk of continuous text into separate words. For a language like English this is fairly trivial, since words are usually separated by spaces, but some written languages, such as Chinese, Japanese and Thai, do not mark word boundaries in this way, and segmenting their text requires knowledge of the vocabulary and morphology of the language (see the segmentation sketch following this list).
* [[Word-sense disambiguation]] (WSD) – because many words have more than one [[Meaning (linguistics)|meaning]], word-sense disambiguation is used to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as [[WordNet]] (see the Lesk sketch following this list).
** [[Word-sense induction]] – open problem of natural-language processing, which concerns the automatic identification of the senses of a word (i.e. its meanings). Since the output of word-sense induction is a set of senses for the target word (a sense inventory), the task is closely related to word-sense disambiguation (WSD), which relies on a predefined sense inventory and aims to resolve the ambiguity of words in context.
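A minimal sketch of dictionary-based word segmentation is given below. The greedy longest-match strategy, the toy vocabulary and the unspaced input are all invented for illustration; practical segmenters for Chinese, Japanese or Thai combine larger dictionaries with statistical models.

<syntaxhighlight lang="python">
# Illustrative sketch only: greedy longest-match ("maximum matching") segmentation
# over an invented vocabulary; real systems also handle unknown words and ambiguity.
def segment(text, vocabulary, max_word_len=5):
    """Split unspaced text by repeatedly taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                i += length
                break
    return words

vocab = {"the", "book", "on", "table"}
print(segment("thebookonthetable", vocab))  # ['the', 'book', 'on', 'the', 'table']
</syntaxhighlight>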
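For word-sense disambiguation, the sketch below applies the simplified Lesk algorithm shipped with NLTK against the WordNet sense inventory. The library choice and example sentence are assumptions, and Lesk is only a simple baseline whose chosen sense may differ from human judgement.

<syntaxhighlight lang="python">
# Illustrative sketch only: assumes NLTK with the 'punkt' and 'wordnet' data packages.
from nltk import word_tokenize
from nltk.corpus import wordnet
from nltk.wsd import lesk

context = word_tokenize("I went to the bank to deposit my money")
sense = lesk(context, "bank")                 # picks one WordNet synset for "bank" given the context
print(sense, "->", sense.definition())

# The sense inventory itself: the candidate meanings WordNet lists for "bank".
for synset in wordnet.synsets("bank")[:3]:
    print(synset.name(), "->", synset.definition())
</syntaxhighlight>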
! style="background-color:#EEF6D6;" | Reference
|-
|[[Georgetown–IBM experiment]]
|1954
|[[Georgetown University]] and [[IBM]]
* [[Brill tagger]] – rule-based part-of-speech tagger that learns a sequence of transformation rules from a pre-tagged training corpus.
* [[Cache language model]] – statistical language model that adapts its probability estimates using a cache of recently observed words.
* [[ChaSen]], [[MeCab]] – provide morphological analysis and word splitting for [[Japanese language|Japanese]] text (see the sketch following this list).
* [[Classic monolingual WSD]] –
* [[ClearForest]] –
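A minimal usage sketch for MeCab via its Python binding is shown below; it assumes the mecab-python3 package and a dictionary such as unidic-lite are installed, and the example sentence is an arbitrary Japanese phrase chosen for illustration.

<syntaxhighlight lang="python">
# Illustrative sketch only: assumes the mecab-python3 package and a MeCab
# dictionary (e.g. unidic-lite) are installed.
import MeCab

tagger = MeCab.Tagger()
# MeCab splits the unspaced Japanese sentence into morphemes and annotates
# each one with part-of-speech and related features, one morpheme per line.
print(tagger.parse("すもももももももものうち"))
</syntaxhighlight>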
|[[Gensim]]||[[Python (programming language)|Python]]||[[LGPL]]|| Radim Řehůřek
|-
|[[LinguaStream]]||[[Java (programming language)|Java]]||Free for research ||[[University of Caen]], France
|-
|[[Mallet (software project)|Mallet]]||[[Java (programming language)|Java]]||[[Common Public License]]||[[University of Massachusetts Amherst]]
|