Outline of natural language processing: Difference between revisions

Revision as of 21:38, 2 May 2022 by Comp.arch (Extended confirmed user, 41,488 edits); minor edit, no edit summary; Tag: 2017 wikitext editor
Latest revision as of 00:00, 15 July 2025 by Citation bot (Bot, 5,865,707 edits); edit summary: Altered template type. Add: journal, publisher, authors 1-1. \| Suggested by Abductive \| Category:Outlines \| #UCB_Category 560/928
(9 intermediate revisions by 7 users not shown)
Line 2:
<!--... Attention: THIS IS AN OUTLINE part of the set of ~~740~~830+ outlines listed at [[~~Portal~~Wikipedia:Contents/Outlines]]. Wikipedia outlines are
Line 10:
content navigation systems See [[Wikipedia:Outlines]] and [[Wikipedia:WikiProject Outlines]] for more details. Further improvements to this outline are on the way
Line 25:
Natural-language processing can be described as all of the following:
* A field of [[science]] – systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.<ref>"... modern science is a discovery as well as an invention. It was a discovery that nature generally acts regularly enough to be described by laws and even by mathematics; and required invention to devise the techniques, abstractions, apparatus, and organization for exhibiting the regularities and securing their law-like descriptions." —p.vii, [[J. L. Heilbron]], (2003, editor-in-chief) ''The Oxford Companion to the History of Modern Science'' New York: Oxford University Press {{ISBN\|0-19-511229-6}} {{cite ~~dictionary~~encyclopedia \|encyclopedia=Merriam-Webster Online Dictionary \|title=science \|url=http://www.merriam-webster.com/dictionary/science \|access-date=2011-10-16 \|publisher=[[Merriam-Webster]], Inc \|quote='''3 a:''' knowledge or a system of knowledge covering general truths or the operation of general laws especially as obtained and tested through scientific method '''b:''' such knowledge or such a system of knowledge concerned with the physical world and its phenomena }} <!--{{sfn\|Popper\|2002\|p=3}}--></ref>
* An [[applied science]] – field that applies human knowledge to build or design useful things.
Line 32:
** A subfield of [[computational linguistics]] – interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective.
* An application of [[engineering]] – science, skill, and profession of acquiring and applying scientific, economic, social, and practical knowledge, in order to design and also build structures, machines, devices, systems, materials and processes.
* An application of [[software engineering]] – application of a systematic, disciplined, quantifiable approach to the design, development, operation, and maintenance of software, and the study of these approaches; that is, the application of engineering to software.<ref name="BoDu04">[[Software Engineering Body of Knowledge\|SWEBOK]] {{Cite book\|editor1= Pierre Bourque \|editor2=Robert Dupuis \| title = Guide to the Software Engineering Body of Knowledge - 2004 Version \| publisher = [[IEEE Computer Society]] \| year = 2004 \| pages = 1 \| isbn = 0-7695-2330-7 \| url = http://www.swebok.org \| others = executive editors, Alain Abran, James W. Moore ; editors, Pierre Bourque, Robert Dupuis.}}</ref><ref>{{cite web \| last = ACM \| year = 2006
Line 208:
[[Text simplification]] –
* [[Deep linguistic processing]] –
* [[Discourse analysis]] – includes a number of related tasks. One task is identifying the [[discourse]] structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the [[speech act]]s in a chunk of text (e.g. ~~yes-no~~yes–no questions, content questions, statements, assertions, orders, suggestions, etc.).
* [[Information extraction]] –
** [[Text mining]] – process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.
Line 217:
* [[Latent semantic indexing]] –
* [[Lemmatisation]] – groups together all like terms that share a same lemma such that they are classified as a single item.
* [[Morphology (linguistics)\|Morphological segmentation]] – separates words into individual [[morphemes]] and identifies the class of the morphemes. The difficulty of this task depends greatly on the complexity of the [[morphology (linguistics)\|morphology]] (i.e. the structure of words) of the language being considered. [[English ~~language\|English]]~~ has fairly simple morphology, especially [[inflectional morphology]], and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened, opening") as separate words. In languages such as [[Turkish language\|Turkish]], however, such an approach is not possible, as each dictionary entry has thousands of possible word forms.
* [[Named-entity recognition]] (NER) – given a stream of text, determines which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although [[capitalization]] can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case is often inaccurate or insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. [[Chinese ~~language\|Chinese]]~~ or [[Arabic language\|Arabic]]) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, [[German ~~language\|German]]~~ capitalizes all [[noun]]s, regardless of whether they refer to names, and [[French ~~language\|French]]~~ and [[Spanish ~~language\|Spanish]]~~ do not capitalize names that serve as [[adjective]]s.
* [[Ontology learning]] – automatic or semi-automatic creation of [[Ontology (information science)\|ontologies]], including extracting the corresponding domain's terms and the relationships between those concepts from a corpus of natural-language text, and encoding them with an [[ontology language]] for easy retrieval. Also called "ontology extraction", "ontology generation", and "ontology acquisition".
* [[Parsing]] – determines the [[parse tree]] (grammatical analysis) of a given sentence. The [[grammar]] for [[natural language]]s is [[ambiguous]] and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human).
** [[Shallow parsing]] –
* [[Part-of-speech tagging]] – given a sentence, determines the [[part of speech]] for each word. Many words, especially common ones, can serve as multiple [[parts of speech]]. For example, "book" can be a [[noun]] ("the book on the table") or [[verb]] ("to book a flight"); "set" can be a [[noun]], [[verb]] or [[adjective]]; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little [[inflectional morphology]], such as [[English ~~language\|English]]~~ are particularly prone to such ambiguity. ~~[[Chinese language\|~~Chinese]] is prone to such ambiguity because it is a [[tonal language]] during verbalization. Such inflection is not readily conveyed via the entities employed within the orthography to convey intended meaning.
* [[Query expansion]] –
* [[Relationship extraction]] – given a chunk of text, identifies the relationships among named entities (e.g. who is the wife of whom).
Line 236:
* [[Topic segmentation]] and recognition – given a chunk of text, separates it into segments each of which is devoted to a topic, and identifies the topic of the segment.
* [[Truecasing]] –
* [[Word segmentation]] – separates a chunk of continuous text into separate words. For a language like [[English ~~language\|English]]~~, this is fairly trivial, since words are usually separated by spaces. However, some written languages like [[Chinese ~~language\|Chinese]]~~, ~~[[Japanese language\|~~Japanese]] and [[Thai language\|Thai]] do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the [[vocabulary]] and [[morphology (linguistics)\|morphology]] of words in the language.
* [[Word-sense disambiguation]] (WSD) – because many words have more than one [[Meaning (linguistics)\|meaning]], word-sense disambiguation is used to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as [[WordNet]].
** [[Word-sense induction]] – open problem of natural-language processing, which concerns the automatic identification of the senses of a word (i.e. meanings). Given that the output of word-sense induction is a set of senses for the target word (sense inventory), this task is strictly related to that of word-sense disambiguation (WSD), which relies on a predefined sense inventory and aims to solve the ambiguity of words in context.
Line 271:
! style="background-color:#EEF6D6;" \| Reference
\|-
\|[[~~Georgetown-IBM~~Georgetown–IBM experiment\|Georgetown experiment]]
\|1954
\|[[Georgetown University]] and [[IBM]]
Line 422:
* [[Brill tagger]] –
* [[Cache language model]] –
* [[ChaSen]], [[MeCab]] – provide morphological analysis and word splitting for [[Japanese ~~language\|Japanese]]~~
* [[Classic monolingual WSD]] –
* [[ClearForest]] –
Line 449:
* [[Language Computer Corporation]] –
* [[Language model]] –
* [[~~Languageware~~LanguageWare]] –
* [[Latent semantic mapping]] –
* [[Legal information retrieval]] –
Line 472:
* [[Naive semantics]] –
* [[Natural language]] –
* [[Natural -language user interface\|Natural-language interface]] –
* [[Natural-language user interface]] –
* [[News analytics]] –
Line 527:
\|[[Gensim]]\|\|[[Python (programming language)\|Python]]\|\|[[LGPL]]\|\| Radim Řehůřek
\|-
\|[[LinguaStream]]\|\|[[Java (programming language)\|Java]]\|\|Free for research \|\|[[University of Caen]], [[France]]
\|-
\|[[Mallet (software project)\|Mallet]]\|\|[[Java (programming language)\|Java]]\|\|[[Common Public License]]\|\|[[University of Massachusetts Amherst]]
Line 558:
[[DeepL]]
[[Linguee]] – web service that provides an online dictionary for a number of language pairs. Unlike similar services, such as LEO, Linguee incorporates a search engine that provides access to large amounts of bilingual, translated sentence pairs, which come from the World Wide Web. As a translation aid, Linguee therefore differs from machine translation services like Babelfish and is more similar in function to a translation memory.
[[Hindi-to-Punjabi Machine Translation System]]
[[Universal Networking Language\|UNL]] Universal Networking Language
** [[Yahoo! Babel Fish]]
Line 643 ⟶ 642:
\| url = http://www.foo.be/docs/tpj/issues/vol3_2/tpj0302-0002.html
}}</ref>
* [[Negobot]], a bot designed to catch online pedophiles by posing as a young girl and attempting to elicit personal details from people it speaks to.<ref>{{cite book\|last1=Laorden\|first1=Carlos\|last2=Galan-Garcia\|first2=Patxi\|last3=Santos\|first3=Igor\|last4=Sanz\|first4=Borja\|last5=Hidalgo\|first5=Jose Maria Gomez\|last6=Bringas\|first6=Pablo G.\|title=Negobot: A conversational agent based on game theory for the detection of paedophile behaviour\|date=23 August 2012\|publisher=Springer \|url=http://paginaspersonales.deusto.es/isantos/publications/2012/Laorden_2012_CISIS_Negobot.pdf\|isbn=978-3-642-33018-6\|url-status=dead\|archive-url=https://web.archive.org/web/20130917013039/http://paginaspersonales.deusto.es/isantos/publications/2012/Laorden_2012_CISIS_Negobot.pdf\|archive-date=2013-09-17}}</ref>
== Natural-language processing organizations ==
Line 695 ⟶ 694:
* [[William Aaron Woods]] –
* [[Maurice Gross]] – author of the concept of local grammar,<ref name="AHI">[http://hdl.handle.net/2042/14456 Ibrahim, Amr Helmy. 2002. "Maurice Gross (1934-2001). À la mémoire de Maurice Gross". ''Hermès'' 34.]</ref> taking finite automata as the competence model of language.<ref name="RD">[http://www.nyu.edu/pages/linguistics/kaliedoscope/mauricegross13.pdf Dougherty, Ray. 2001. ''Maurice Gross Memorial Letter''.]</ref>
* [[Stephen Wolfram]] – CEO and founder of [[Wolfram Research]], creator of the programming language (natural-language understanding) [[Wolfram Language]], and natural-language processing computation engine [[Wolfram Alpha]].<ref>{{cite ~~web~~journal\|last1=Wolfram \|first1=Stephen \|url=https://blog.wolfram.com/2010/11/16/programming-with-natural-language-is-actually-going-to-work/\|title=Programming with Natural Language Is Actually Going to Work—Wolfram Blog\|journal=Stephen Wolfram Writings \|date=16 November 2010 }}</ref>
* [[Victor Yngve]] –
Line 733 ⟶ 732:
== External links ==
{{~~Sisterlinks~~Sister project links\|Natural language processing}}
{{Outline footer}}
Line 739 ⟶ 738:
[[Category:Natural language processing\|*]]
[[Category:Outlines of applied sciences\|Natural language processing]]
[[Category:~~Wikipedia outlines~~Outlines\|Natural language processing]]