{{Use dmy dates|date=July 2022}}
 
A '''language model''' is a [[Model#Conceptual model|model]] of the human brain's ability to produce [[natural language]].<ref>{{cite journal |last1=Blank |first1=Idan A. |title=What are large language models supposed to model? |journal=Trends in Cognitive Sciences |date=November 2023 |volume=27 |issue=11 |pages=987–989 |doi=10.1016/j.tics.2023.08.006 |pmid=37659920 |doi-access=free |quote=LLMs are supposed to model how utterances behave.}}</ref><ref>{{cite book |last1=Jurafsky |first1=Dan |last2=Martin |first2=James H. |title=Speech and Language Processing |date=2021 |edition=3rd |url=https://web.stanford.edu/~jurafsky/slp3/ |access-date=24 May 2022 |chapter=N-gram Language Models |chapter-url=https://web.stanford.edu/~jurafsky/slp3/3.pdf |archive-date=22 May 2022 |archive-url=https://web.archive.org/web/20220522005855/https://web.stanford.edu/~jurafsky/slp3/ |url-status=live }}</ref> Language models are useful for a variety of tasks, including [[speech recognition]]<ref>Kuhn, Roland, and Renato De Mori (1990). [https://www.researchgate.net/profile/Roland_Kuhn2/publication/3191800_Cache-based_natural_language_model_for_speech_recognition/links/004635184ee5b2c24f000000.pdf "A cache-based natural language model for speech recognition"]. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'' 12.6: 570–583.</ref> (helping prevent predictions of low-probability (e.g. nonsense) sequences), [[machine translation]],<ref name="Semantic parsing as machine translation">Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013). [https://www.aclweb.org/anthology/P13-2009 "Semantic parsing as machine translation"] {{Webarchive|url=https://web.archive.org/web/20200815080932/https://www.aclweb.org/anthology/P13-2009/ |date=15 August 2020 }}. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).</ref> [[natural language generation]] (generating more human-like text), [[optical character recognition]], [[route optimization]],<ref>{{cite journal |last1=Liu |first1=Yang |last2=Wu |first2=Fanyou |last3=Liu |first3=Zhiyuan |last4=Wang |first4=Kai |last5=Wang |first5=Feiyue |last6=Qu |first6=Xiaobo |title=Can language models be used for real-world urban-delivery route optimization? |journal=The Innovation |date=2023 |volume=4 |issue=6 |pages=100520 |doi=10.1016/j.xinn.2023.100520 |doi-access=free |pmid=37869471 |pmc=10587631 |bibcode=2023Innov...400520L }}</ref> [[handwriting recognition]],<ref>Pham, Vu, et al. (2014). [https://arxiv.org/abs/1312.4569 "Dropout improves recurrent neural networks for handwriting recognition"] {{Webarchive|url=https://web.archive.org/web/20201111170554/https://arxiv.org/abs/1312.4569 |date=11 November 2020 }}. 14th International Conference on Frontiers in Handwriting Recognition. IEEE.</ref> [[grammar induction]],<ref>Htut, Phu Mon, Kyunghyun Cho, and Samuel R. Bowman (2018). [https://arxiv.org/pdf/1808.10000.pdf "Grammar induction with neural language models: An unusual replication"] {{Webarchive|url=https://web.archive.org/web/20220814010528/https://arxiv.org/pdf/1808.10000.pdf?source=post_page--------------------------- |date=14 August 2022 }}. {{arXiv|1808.10000}}.</ref> and [[information retrieval]].<ref name="ponte1998">{{cite conference |first1=Jay M. |last1=Ponte |first2=W. Bruce |last2=Croft |title=A language modeling approach to information retrieval |conference=Proceedings of the 21st ACM SIGIR Conference |year=1998 |publisher=ACM |place=Melbourne, Australia |pages=275–281 |doi=10.1145/290941.291008}}</ref><ref name="hiemstra1998">{{cite conference |first=Djoerd |last=Hiemstra |year=1998 |title=A linguistically motivated probabilistic model of information retrieval |conference=Proceedings of the 2nd European conference on Research and Advanced Technology for Digital Libraries |publisher=LNCS, Springer |pages=569–584 |doi=10.1007/3-540-49653-X_34}}</ref>
 
[[Large language model]]s (LLMs), currently their most advanced form, are predominantly based on [[Transformer (machine learning)|transformers]] trained on larger datasets (frequently using texts [[Web scraping|scraped]] from the public [[internet]]). They have superseded [[recurrent neural network]]-based models, which had previously superseded the purely statistical models, such as the [[Word n-gram language model|word ''n''-gram language model]].
 
== History ==
[[Noam Chomsky]] did pioneering work on language models in the 1950s by developing a theory of [[formal grammar]]s.<ref>{{Cite journal |last=Chomsky |first=N. |date=September 1956 |title=Three models for the description of language |journal=IRE Transactions on Information Theory |volume=2 |issue=3 |pages=113–124 |doi=10.1109/TIT.1956.1056813 |issn=2168-2712}}</ref>
 
In 1980, statistical approaches were explored and found to be more useful for many purposes than rule-based formal grammars. Discrete representations like [[Word n-gram language model|word ''n''-gram language models]], with probabilities for discrete combinations of words, made significant advances.
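
As a minimal illustration of the idea (the corpus and names below are hypothetical, not drawn from the cited sources), a word bigram model can be estimated from raw counts by maximum likelihood:

<syntaxhighlight lang="python">
from collections import Counter

# Toy corpus; any tokenized text would do (illustrative example only).
corpus = "the cat sat on the mat the cat ate".split()

# Count unigrams and adjacent word pairs (bigrams).
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))  # 2/3: "the" occurs 3 times, "the cat" twice
</syntaxhighlight>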
 
In the 2000s, continuous representations for words, such as [[Word2vec|word embeddings]], began to replace discrete representations.<ref>{{Cite news |date=2022-02-22 |title=The Nature Of Life, The Nature Of Thinking: Looking Back On Eugene Charniak's Work And Life |url=https://cs.brown.edu/news/2022/02/22/the-nature-of-life-the-nature-of-thinking-looking-back-on-eugene-charniaks-work-and-life/ |archive-url=https://web.archive.org/web/20241103134558/https://cs.brown.edu/news/2022/02/22/the-nature-of-life-the-nature-of-thinking-looking-back-on-eugene-charniaks-work-and-life/ |archive-date=3 November 2024 |access-date=2025-02-05 |language=en |url-status=live }}</ref> Typically, the representation is a [[Real number|real-valued]] vector that encodes the meaning of the word in such a way that words closer together in the vector space are expected to be similar in meaning, and common relationships between pairs of words, such as plurality or grammatical gender, are reflected in the geometry of the space.
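
The following minimal sketch illustrates this property with hypothetical toy vectors (real embeddings are learned from data and typically have hundreds of dimensions):

<syntaxhighlight lang="python">
import numpy as np

# Toy 3-dimensional embeddings (illustrative values only).
emb = {
    "king":  np.array([0.8, 0.3, 0.1]),
    "queen": np.array([0.8, 0.3, 0.9]),
    "man":   np.array([0.6, 0.1, 0.1]),
    "woman": np.array([0.6, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: closeness in the vector space."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words close in the space are expected to be similar in meaning...
print(cosine(emb["king"], emb["queen"]))

# ...and relationships appear as vector offsets: king - man + woman ≈ queen.
target = emb["king"] - emb["man"] + emb["woman"]
print(max(emb, key=lambda w: cosine(emb[w], target)))  # "queen"
</syntaxhighlight>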
 
== Pure statistical models ==
A language model is a probabilistic model of a natural language.<ref>{{cite book |last1=Jurafsky |first1=Dan |last2=Martin |first2=James H. |title=Speech and Language Processing |date=2021 |edition=3rd |url=https://web.stanford.edu/~jurafsky/slp3/ |access-date=24 May 2022 |chapter=N-gram Language Models |archive-date=22 May 2022 |archive-url=https://web.archive.org/web/20220522005855/https://web.stanford.edu/~jurafsky/slp3/ |url-status=live }}</ref> In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘[[Claude Shannon|Shannon]]-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.<ref>{{cite journal |last1=Rosenfeld |first1=Ronald |year=2000 |title=Two decades of statistical language modeling: Where do we go from here? |journal=Proceedings of the IEEE |volume=88 |issue=8 |pages=1270–1278 |doi=10.1109/5.880083 |s2cid=10959945 |url=https://figshare.com/articles/journal_contribution/6611138 }}</ref>
 
=== Models based on word ''n''-grams ===
<math display="block"> P(w_m \mid w_1,\ldots,w_{m-1}) = \frac{1}{Z(w_1,\ldots,w_{m-1})} \exp (a^T f(w_1,\ldots,w_m))</math>
 
where <math>Z(w_1,\ldots,w_{m-1})</math> is the [[Partition function (mathematics)|partition function]], <math>a</math> is the parameter vector, and <math>f(w_1,\ldots,w_m)</math> is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain ''n''-gram. It is helpful to use a prior on <math>a</math> or some form of [[Regularization (mathematics)|regularization]].
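
A minimal sketch of this model, assuming bigram indicator features and hand-set illustrative weights (in practice the parameters would be learned from data):

<syntaxhighlight lang="python">
import numpy as np

# One indicator feature per (previous word, next word) pair; toy vocabulary.
vocab = ["the", "cat", "sat", "mat"]
pairs = [(h, w) for h in vocab for w in vocab]
index = {p: i for i, p in enumerate(pairs)}

a = np.zeros(len(pairs))               # parameter vector a (toy weights)
a[index[("the", "cat")]] = 2.0
a[index[("the", "mat")]] = 1.0

def f(history, w):
    """Indicator feature vector: 1 for the (last word, w) bigram, else 0."""
    vec = np.zeros(len(pairs))
    vec[index[(history[-1], w)]] = 1.0
    return vec

def prob(w, history):
    """P(w | history) = exp(a . f(history, w)) / Z(history)."""
    z = sum(np.exp(a @ f(history, v)) for v in vocab)   # partition function Z
    return float(np.exp(a @ f(history, w)) / z)

print(prob("cat", ["the"]))  # the highest-weighted continuation of "the"
</syntaxhighlight>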
 
The log-bilinear model is another example of an exponential language model.
== Neural models ==
=== Recurrent neural network ===
Continuous representations or [[Word embedding|embeddings of words]] are produced in [[recurrent neural network]]-based language models (also known as ''continuous space language models'').<ref>{{cite web |last1=Karpathy |first1=Andrej |title=The Unreasonable Effectiveness of Recurrent Neural Networks |url=https://karpathy.github.io/2015/05/21/rnn-effectiveness/ |access-date=27 January 2019 |archive-date=1 November 2020 |archive-url=https://web.archive.org/web/20201101215448/http://karpathy.github.io/2015/05/21/rnn-effectiveness/ |url-status=live }}</ref> Such continuous space embeddings help to alleviate the [[curse of dimensionality]], which is the consequence of the number of possible sequences of words increasing [[Exponential growth|exponentially]] with the size of the vocabulary, further causing a data sparsity problem. Neural networks avoid this problem by representing words as non-linear combinations of weights in a neural net.<ref name="bengio">{{cite encyclopedia|title=Neural net language models|first=Yoshua|last=Bengio|year=2008|encyclopedia=[[Scholarpedia]]|volume=3|issue=1|page=3881|url=http://www.scholarpedia.org/article/Neural_net_language_models|doi=10.4249/scholarpedia.3881|bibcode=2008SchpJ...3.3881B|doi-access=free|access-date=28 August 2015|archive-date=26 October 2020|archive-url=https://web.archive.org/web/20201026161505/http://www.scholarpedia.org/article/Neural_net_language_models|url-status=live}}</ref>
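
A minimal sketch of such a model (hypothetical layer sizes, using the PyTorch library): each word is mapped to a continuous embedding, a recurrent layer summarizes the history, and a linear layer scores every vocabulary word as the next token.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # continuous word representations
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq_len) word ids
        hidden_states, _ = self.rnn(self.embed(tokens))
        return self.out(hidden_states)          # (batch, seq_len, vocab_size) logits

model = RNNLanguageModel()
tokens = torch.randint(0, 10_000, (1, 12))      # a dummy 12-token sequence
next_word_logits = model(tokens)[:, -1, :]      # scores for the next word
</syntaxhighlight>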
 
=== Large language models ===
{{excerpt|Large language model}}
 
Although they sometimes match human performance, it is not clear whether they are plausible [[cognitive model]]s. At least for recurrent neural networks, it has been shown that they sometimes learn patterns that humans do not learn, but fail to learn patterns that humans typically do learn.<ref>{{Cite book|last1=Hornstein|first1=Norbert|url=https://books.google.com/books?id=XoxsDwAAQBAJ&dq=adger+%22goldilocks%22&pg=PA153|title=Syntactic Structures after 60 Years: The Impact of the Chomskyan Revolution in Linguistics|last2=Lasnik|first2=Howard|last3=Patel-Grosz|first3=Pritty|last4=Yang|first4=Charles|date=2018-01-09|publisher=Walter de Gruyter GmbH & Co KG|isbn=978-1-5015-0692-5|language=en|access-date=11 December 2021|archive-date=16 April 2023|archive-url=https://web.archive.org/web/20230416160343/https://books.google.com/books?id=XoxsDwAAQBAJ&dq=adger+%22goldilocks%22&pg=PA153|url-status=live}}</ref>
 
== Evaluation and benchmarks ==
 
Evaluation of the quality of language models is mostly done by comparison to human-created sample benchmarks created from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from data they see, some proposed models investigate the rate of learning, e.g., through inspection of learning curves.<ref>{{Citation|last1=Karlgren|first1=Jussi|last2=Schutze|first2=Hinrich|chapter=Evaluating Learning Language Representations|date=2015|pages=254–260|publisher=Springer International Publishing|isbn=9783319642055|doi=10.1007/978-3-319-64206-2_8|title=International Conference of the Cross-Language Evaluation Forum|series=Lecture Notes in Computer Science}}</ref>
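
One widely used intrinsic measure (given here for illustration; it is not tied to any particular benchmark below) is [[perplexity]], the exponentiated average negative log-probability a model assigns to held-out text. A minimal sketch, with hypothetical per-token probabilities:

<syntaxhighlight lang="python">
import math

def perplexity(log_probs):
    """Perplexity = exp(-mean log P(token)); log_probs are the natural-log
    probabilities a model assigns to each token of a held-out text."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-token probabilities from some language model:
print(perplexity([math.log(0.2), math.log(0.5), math.log(0.1)]))  # ≈ 4.64
</syntaxhighlight>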
 
Various data sets have been developed for use in evaluating language processing systems.<ref name=":0">{{cite arXiv|last1=Devlin|first1=Jacob|last2=Chang|first2=Ming-Wei|last3=Lee|first3=Kenton|last4=Toutanova|first4=Kristina|date=2018-10-10|title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding|eprint=1810.04805|class=cs.CL}}</ref> These include:
 
* [[Massive Multitask Language Understanding]] (MMLU)<ref>{{Citation |last=Hendrycks |first=Dan |title=Measuring Massive Multitask Language Understanding |date=2023-03-14 |url=https://github.com/hendrycks/test |archive-url=https://web.archive.org/web/20230315011614/https://github.com/hendrycks/test |archive-date=15 March 2023 |url-status=live |accessdate=2023-03-15}}</ref>
* Corpus of Linguistic Acceptability<ref>{{Cite web|url=https://nyu-mll.github.io/CoLA/|title=The Corpus of Linguistic Acceptability (CoLA)|website=nyu-mll.github.io|access-date=2019-02-25|archive-date=7 December 2020|archive-url=https://web.archive.org/web/20201207081834/https://nyu-mll.github.io/CoLA/|url-status=live}}</ref>
* GLUE benchmark<ref>{{Cite web|url=https://gluebenchmark.com/|title=GLUE Benchmark|website=gluebenchmark.com|language=en|access-date=2019-02-25|archive-date=4 November 2020|archive-url=https://web.archive.org/web/20201104161928/https://gluebenchmark.com/|url-status=live}}</ref>
* Stanford Sentiment [[Treebank]]<ref>{{Cite web|url=https://nlp.stanford.edu/sentiment/treebank.html|title=Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank|website=nlp.stanford.edu|access-date=2019-02-25|archive-date=27 October 2020|archive-url=https://web.archive.org/web/20201027125825/https://nlp.stanford.edu/sentiment/treebank.html|url-status=live}}</ref>
* Winograd NLI
* BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, MMLU (Massive Multitask Language Understanding), BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs.<ref>{{Cite web |title=llama/MODEL_CARD.md at main · meta-llama/llama |url=https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md |website=GitHub |language=en |access-date=2024-12-28}}</ref> ([https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md LLaMa Benchmark])
 
== See also ==
{{portal |Linguistics |Mathematics |Technology}}
{{div col|colwidth=15em}}
* {{Annotated link|Artificial intelligence and elections}}
* [[Cache language model]]
* [[Deep linguistic processing]]
* [[Ethics of artificial intelligence]]
* [[Factored language model]]
* [[Generative pre-trained transformer]]
* [[Katz's back-off model]]
* [[Language technology]]
* [[Semantic similarity network]]
* [[Statistical model]]
 
{{div col end}}
{{refbegin}}
 
* {{cite conference |author1=Jay M. Ponte |author2=W. Bruce Croft |citeseerx=10.1.1.117.4237 |doi=10.1145/290941.291008 |doi-access=free |title=A Language Modeling Approach to Information Retrieval |book-title=Research and Development in Information Retrieval |year=1998 |pages=275–281 }}
* {{cite conference |author1=Fei Song |author2=W. Bruce Croft |citeseerx=10.1.1.21.6467 |doi=10.1145/319950.320022 |doi-access=free |title=A General Language Model for Information Retrieval |book-title=Research and Development in Information Retrieval |year=1999 |pages=279–280 }}
* {{cite tech report |first=Stanley F. |last=Chen |author2=Joshua Goodman |title=An Empirical Study of Smoothing Techniques for Language Modeling |institution=Harvard University |year=1998 |citeseerx=10.1.1.131.5458 |url=https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=273adbdb43097636aa9260d9ecd60d0787b0ef4d }}
 
{{refend}}
{{Natural language processing}}
{{Artificial intelligence navbox}}
 
[[Category:Language modeling|*]]