{{Use dmy dates|date=July 2022}}
A '''language model''' is a [[Model#Conceptual model|model]] of the human brain's ability to produce [[natural language]].<ref>{{cite journal |last1=Blank |first1=Idan A. |title=What are large language models supposed to model? |journal=Trends in Cognitive Sciences |date=November 2023 |volume=27 |issue=11 |pages=987–989 |doi=10.1016/j.tics.2023.08.006 |pmid=37659920 |doi-access=free}} "LLMs are supposed to model how utterances behave."</ref><ref>{{cite book |last1=Jurafsky |first1=Dan |last2=Martin |first2=James H. |title=Speech and Language Processing |date=2021 |edition=3rd |url=https://web.stanford.edu/~jurafsky/slp3/ |access-date=24 May 2022 |chapter=N-gram Language Models |chapter-url=https://web.stanford.edu/~jurafsky/slp3/3.pdf |archive-date=22 May 2022 |archive-url=https://web.archive.org/web/20220522005855/https://web.stanford.edu/~jurafsky/slp3/ |url-status=live}}</ref> Language models are useful for a variety of tasks, including [[speech recognition]]<ref>Kuhn, Roland, and Renato De Mori (1990). [https://www.researchgate.net/profile/Roland_Kuhn2/publication/3191800_Cache-based_natural_language_model_for_speech_recognition/links/004635184ee5b2c24f000000.pdf "A cache-based natural language model for speech recognition"]. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'' 12.6: 570–583.</ref> (helping prevent predictions of low-probability (e.g. nonsense) sequences), [[machine translation]],<ref name="Semantic parsing as machine translation">Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013). [https://www.aclweb.org/anthology/P13-2009 "Semantic parsing as machine translation"] {{Webarchive|url=https://web.archive.org/web/20200815080932/https://www.aclweb.org/anthology/P13-2009/ |date=15 August 2020 }}. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).</ref> [[natural language generation]] (generating more human-like text), [[optical character recognition]], [[route optimization]],<ref>{{cite journal |last1=Liu |first1=Yang |last2=Wu |first2=Fanyou |last3=Liu |first3=Zhiyuan |last4=Wang |first4=Kai |last5=Wang |first5=Feiyue |last6=Qu |first6=Xiaobo |title=Can language models be used for real-world urban-delivery route optimization? |journal=The Innovation |date=2023 |volume=4 |issue=6 |pages=100520 |doi=10.1016/j.xinn.2023.100520 |doi-access=free}}</ref> [[handwriting recognition]],<ref>Pham, Vu, et al. (2014). [https://arxiv.org/abs/1312.4569 "Dropout improves recurrent neural networks for handwriting recognition"] {{Webarchive|url=https://web.archive.org/web/20201111170554/https://arxiv.org/abs/1312.4569 |date=11 November 2020 }}. 14th International Conference on Frontiers in Handwriting Recognition. IEEE.</ref> [[grammar induction]],<ref>Htut, Phu Mon, Kyunghyun Cho, and Samuel R. Bowman (2018). [https://arxiv.org/pdf/1808.10000.pdf?source=post_page--------------------------- "Grammar induction with neural language models: An unusual replication"] {{Webarchive|url=https://web.archive.org/web/20220814010528/https://arxiv.org/pdf/1808.10000.pdf?source=post_page--------------------------- |date=14 August 2022 }}. {{arXiv|1808.10000}}.</ref> and [[information retrieval]].<ref name=ponte1998>{{cite conference |first1=Jay M. |last1=Ponte |first2=W. Bruce |last2=Croft |title=A language modeling approach to information retrieval |conference=Proceedings of the 21st ACM SIGIR Conference |year=1998 |publisher=ACM |place=Melbourne, Australia |pages=275–281 |doi=10.1145/290941.291008}}</ref><ref name=hiemstra1998>{{cite conference |first=Djoerd |last=Hiemstra |year=1998 |title=A linguistically motivated probabilistic model of information retrieval |conference=Proceedings of the 2nd European conference on Research and Advanced Technology for Digital Libraries |publisher=LNCS, Springer |pages=569–584 |doi=10.1007/3-540-49653-X_34}}</ref>

[[Large language model]]s (LLMs), currently their most advanced form, are a combination of larger datasets (frequently using words [[Web scraping|scraped]] from the public internet), [[feedforward neural network]]s, and [[transformer (machine learning)|transformer]]s. They have superseded [[recurrent neural network]]-based models, which had previously superseded the pure statistical models, such as [[Word n-gram language model|word ''n''-gram language model]]s.

== History ==

[[Noam Chomsky]] did pioneering work on language models in the 1950s by developing a theory of [[formal grammar]]s.<ref>{{Cite journal |last=Chomsky |first=N. |date=September 1956 |title=Three models for the description of language |journal=IRE Transactions on Information Theory |volume=2 |issue=3 |pages=113–124 |doi=10.1109/TIT.1956.1056813 |issn=2168-2712}}</ref>

In 1980, the first significant statistical language model was proposed, and during the decade IBM performed '[[Claude Shannon|Shannon]]-style' experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.<ref>{{cite journal |last1=Rosenfeld |first1=Ronald |year=2000 |title=Two decades of statistical language modeling: Where do we go from here? |journal=Proceedings of the IEEE |volume=88 |issue=8 |pages=1270–1278 |doi=10.1109/5.880083 |s2cid=10959945 |url=https://figshare.com/articles/journal_contribution/6611138 }}</ref> Statistical approaches proved more useful for many purposes than rule-based formal grammars, and discrete representations like [[Word n-gram language model|word ''n''-gram language models]], which assign probabilities to discrete combinations of words, made significant advances.

In the 2000s, continuous representations for words, such as [[Word2vec|word embeddings]], began to replace discrete representations.<ref>{{Cite news |date=2022-02-22 |title=The Nature Of Life, The Nature Of Thinking: Looking Back On Eugene Charniak's Work And Life |url=https://cs.brown.edu/news/2022/02/22/the-nature-of-life-the-nature-of-thinking-looking-back-on-eugene-charniaks-work-and-life/ |archive-url=https://web.archive.org/web/20241103134558/https://cs.brown.edu/news/2022/02/22/the-nature-of-life-the-nature-of-thinking-looking-back-on-eugene-charniaks-work-and-life/ |archive-date=3 November 2024 |access-date=2025-02-05 |language=en |url-status=live }}</ref> Typically, the representation is a [[Real number|real-valued]] vector that encodes the meaning of the word in such a way that words closer together in the vector space are expected to be similar in meaning, and common relationships between pairs of words, such as plurality or gender, correspond to consistent offsets between their vectors.
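The offset relationships between word vectors can be illustrated with a toy sketch (the 3-dimensional vectors below are hand-picked for illustration only; real embeddings are learned from corpora and typically have hundreds of dimensions):

```python
import numpy as np

# Hypothetical toy embeddings; real models learn these from text
# (e.g. with word2vec) rather than using hand-picked values.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.2, 0.8]),
}

def cosine(u, v):
    # Cosine similarity: 1.0 means identical direction, 0 means orthogonal.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# The gender relationship appears as a roughly constant vector offset,
# so king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
```

In this toy space `best` comes out as `"queen"`, mirroring the analogy behavior reported for learned embeddings.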
== Pure statistical models ==
=== Models based on word ''n''-grams ===
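A word ''n''-gram model estimates the probability of a word from counts of short word sequences in a training corpus. A minimal, unsmoothed bigram sketch (the corpus and estimates below are illustrative only; real systems apply smoothing to avoid zero probabilities):

```python
from collections import Counter

# Toy corpus; a real model would be trained on millions of words.
corpus = "the cat sat on the mat the cat ran".split()

# Maximum-likelihood bigram estimate:
# P(w_m | w_{m-1}) = count(w_{m-1}, w_m) / count(w_{m-1}).
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])  # every word that has a successor

def p(word, prev):
    # Unsmoothed: unseen bigrams get probability zero.
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
```

For the toy corpus above, `p("cat", "the")` is 2/3 because "the" occurs three times as a context and is followed by "cat" twice.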
=== Exponential ===
<math display="block"> P(w_m \mid w_1,\ldots,w_{m-1}) = \frac{1}{Z(w_1,\ldots,w_{m-1})} \exp (a^T f(w_1,\ldots,w_m))</math>
where <math>Z(w_1,\ldots,w_{m-1})</math> is the [[Partition function (mathematics)|partition function]], <math>a</math> is the parameter vector, and <math>f(w_1,\ldots,w_m)</math> is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain ''n''-gram. It is helpful to use a prior on <math>a</math> or some form of [[Regularization (mathematics)|regularization]].
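The model above can be sketched directly in code. The tracked bigram features and weights below are hypothetical; in practice the parameter vector <math>a</math> is fit by maximizing the likelihood of a training corpus:

```python
import math

vocab = ["cat", "dog", "mat"]

def features(history, word):
    # Simplest case from the text: each feature is an indicator of the
    # presence of a certain n-gram (here, hand-chosen bigrams).
    tracked = [("the", "cat"), ("the", "dog"), ("on", "mat")]
    return [1.0 if (history[-1], word) == t else 0.0 for t in tracked]

a = [1.2, 0.4, 2.0]  # parameter vector (illustrative, untrained)

def prob(word, history):
    # P(w | history) = exp(a . f(history, w)) / Z(history)
    score = lambda w: math.exp(sum(ai * fi for ai, fi in zip(a, features(history, w))))
    z = sum(score(w) for w in vocab)  # partition function Z(history)
    return score(word) / z
```

Because the partition function sums over the whole vocabulary, the outputs form a valid probability distribution for each history.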
The log-bilinear model is another example of an exponential language model.
== Neural models ==
=== Recurrent neural network ===
Continuous representations or [[Word embedding|embeddings of words]] are produced in [[recurrent neural network]]-based language models (known also as ''continuous space language models'').<ref>{{cite web |last1=Karpathy |first1=Andrej |title=The Unreasonable Effectiveness of Recurrent Neural Networks |url=https://karpathy.github.io/2015/05/21/rnn-effectiveness/ |access-date=27 January 2019 |archive-date=1 November 2020 |archive-url=https://web.archive.org/web/20201101215448/http://karpathy.github.io/2015/05/21/rnn-effectiveness/ |url-status=live }}</ref> Such continuous space embeddings help to alleviate the [[curse of dimensionality]], which is the consequence of the number of possible sequences of words increasing [[Exponential growth|exponentially]] with the size of the vocabulary.
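As a sketch of how such a model compresses an unbounded history into a fixed-size state rather than enumerating the exponentially many word sequences (all sizes and parameters below are toy values, untrained):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 50, 8, 16   # vocabulary size, embedding dim, hidden dim (toy sizes)

# Parameters (random here; trained by backpropagation in practice).
E  = rng.normal(size=(V, d)) * 0.1   # word embeddings, one row per word
Wx = rng.normal(size=(h, d)) * 0.1   # input-to-hidden weights
Wh = rng.normal(size=(h, h)) * 0.1   # hidden-to-hidden weights (the recurrence)
Wo = rng.normal(size=(V, h)) * 0.1   # hidden-to-vocabulary logits

def next_word_distribution(word_ids):
    # One forward pass of a vanilla RNN language model: the hidden state s
    # summarizes the entire history in h numbers.
    s = np.zeros(h)
    for i in word_ids:
        s = np.tanh(Wx @ E[i] + Wh @ s)
    logits = Wo @ s
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()               # distribution over the vocabulary
```

Whatever the history length, the output is a proper probability distribution over the `V` vocabulary entries.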
=== Large language models ===
{{excerpt|Large language model}}
Although they sometimes match human performance, it is not clear whether they are plausible [[cognitive model]]s.
== Evaluation and benchmarks ==
Various data sets have been developed for use in evaluating language processing systems.<ref name=":0">{{cite arXiv|last1=Devlin|first1=Jacob|last2=Chang|first2=Ming-Wei|last3=Lee|first3=Kenton|last4=Toutanova|first4=Kristina|date=2018-10-10|title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding|eprint=1810.04805|class=cs.CL}}</ref> These include:
* [[Massive Multitask Language Understanding]] (MMLU)<ref>{{Citation |last=Hendrycks |first=Dan |title=Measuring Massive Multitask Language Understanding |date=2023-03-14 |url=https://github.com/hendrycks/test |archive-url=https://web.archive.org/web/20230315011614/https://github.com/hendrycks/test |archive-date=15 March 2023 |url-status=live |accessdate=2023-03-15}}</ref>
* Corpus of Linguistic Acceptability<ref>{{Cite web|url=https://nyu-mll.github.io/CoLA/|title=The Corpus of Linguistic Acceptability (CoLA)|website=nyu-mll.github.io|access-date=2019-02-25|archive-date=7 December 2020|archive-url=https://web.archive.org/web/20201207081834/https://nyu-mll.github.io/CoLA/|url-status=live}}</ref>
* GLUE benchmark<ref>{{Cite web|url=https://gluebenchmark.com/|title=GLUE Benchmark|website=gluebenchmark.com|language=en|access-date=2019-02-25|archive-date=4 November 2020|archive-url=https://web.archive.org/web/20201104161928/https://gluebenchmark.com/|url-status=live}}</ref>
* Stanford Sentiment [[Treebank]]<ref>{{Cite web|url=https://nlp.stanford.edu/sentiment/treebank.html|title=Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank|website=nlp.stanford.edu|access-date=2019-02-25|archive-date=27 October 2020|archive-url=https://web.archive.org/web/20201027125825/https://nlp.stanford.edu/sentiment/treebank.html|url-status=live}}</ref>
* Winograd NLI
* BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE
== See also ==
{{portal |Linguistics |Mathematics |Technology}}
{{div col}}
* {{Annotated link|Artificial intelligence and elections}}
* [[Cache language model]]
* [[Deep linguistic processing]]
* [[Ethics of artificial intelligence]]
* [[Factored language model]]
* [[Generative pre-trained transformer]]
* [[Katz's back-off model]]
* [[Language technology]]
* [[Semantic similarity network]]
* [[Statistical model]]
{{div col end}}
{{Natural language processing}}
{{Artificial intelligence navbox}}
[[Category:Language modeling|*]]