{{Use dmy dates|date=July 2022}}
A '''language model''' is a [[Model#Conceptual model|model]] of the human brain's ability to produce [[natural language]].<ref>{{cite journal |last1=Blank |first1=Idan A. |title=What are large language models supposed to model? |journal=Trends in Cognitive Sciences |date=November 2023 |volume=27 |issue=11 |pages=987–989 |doi=10.1016/j.tics.2023.08.006 |pmid=37659920 |doi-access=free}} "LLMs are supposed to model how utterances behave."</ref><ref>{{cite book |last1=Jurafsky |first1=Dan |last2=Martin |first2=James H. |title=Speech and Language Processing |date=2021 |edition=3rd |url=https://web.stanford.edu/~jurafsky/slp3/ |access-date=24 May 2022 |chapter=N-gram Language Models |chapter-url=https://web.stanford.edu/~jurafsky/slp3/3.pdf |archive-date=22 May 2022 |archive-url=https://web.archive.org/web/20220522005855/https://web.stanford.edu/~jurafsky/slp3/ |url-status=live}}</ref> Language models are useful for a variety of tasks, including [[speech recognition]]<ref>Kuhn, Roland, and Renato De Mori (1990). [https://www.researchgate.net/profile/Roland_Kuhn2/publication/3191800_Cache-based_natural_language_model_for_speech_recognition/links/004635184ee5b2c24f000000.pdf "A cache-based natural language model for speech recognition"]. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'' 12.6: 570–583.</ref> (helping prevent predictions of low-probability (e.g. nonsense) sequences), [[machine translation]],<ref name="Semantic parsing as machine translation">Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013). [https://www.aclweb.org/anthology/P13-2009 "Semantic parsing as machine translation"] {{Webarchive|url=https://web.archive.org/web/20200815080932/https://www.aclweb.org/anthology/P13-2009/ |date=15 August 2020 }}. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).</ref> [[natural language generation]] (generating more human-like text), [[optical character recognition]], [[route optimization]],<ref>{{cite journal |last1=Liu |first1=Yang |last2=Wu |first2=Fanyou |last3=Liu |first3=Zhiyuan |last4=Wang |first4=Kai |last5=Wang |first5=Feiyue |last6=Qu |first6=Xiaobo |title=Can language models be used for real-world urban-delivery route optimization? |journal=The Innovation |date=2023 |volume=4 |issue=6 |pages=100520 |doi=10.1016/j.xinn.2023.100520 |doi-access=free}}</ref> [[handwriting recognition]],<ref>Pham, Vu, et al. (2014). [https://arxiv.org/abs/1312.4569 "Dropout improves recurrent neural networks for handwriting recognition"] {{Webarchive|url=https://web.archive.org/web/20201111170554/https://arxiv.org/abs/1312.4569 |date=11 November 2020 }}. 14th International Conference on Frontiers in Handwriting Recognition. IEEE.</ref> [[grammar induction]],<ref>Htut, Phu Mon, Kyunghyun Cho, and Samuel R. Bowman (2018). [https://arxiv.org/pdf/1808.10000.pdf?source=post_page--------------------------- "Grammar induction with neural language models: An unusual replication"] {{Webarchive|url=https://web.archive.org/web/20220814010528/https://arxiv.org/pdf/1808.10000.pdf?source=post_page--------------------------- |date=14 August 2022 }}. {{arXiv|1808.10000}}.</ref> and [[information retrieval]].<ref name=ponte1998>{{cite conference |first1=Jay M. |last1=Ponte |first2=W. Bruce |last2=Croft |title=A language modeling approach to information retrieval |conference=Proceedings of the 21st ACM SIGIR Conference |year=1998 |publisher=ACM |place=Melbourne, Australia |pages=275–281 |doi=10.1145/290941.291008}}</ref><ref name=hiemstra1998>{{cite conference |first=Djoerd |last=Hiemstra |year=1998 |title=A linguistically motivated probabilistic model of information retrieval |conference=Proceedings of the 2nd European conference on Research and Advanced Technology for Digital Libraries |publisher=LNCS, Springer |pages=569–584 |doi=10.1007/3-540-49653-X_34}}</ref>

[[Large language model]]s (LLMs), currently their most advanced form, are a combination of larger datasets (frequently using words [[Web scraping|scraped]] from the public internet), [[feedforward neural network]]s, and [[transformer (machine learning)|transformer]]s. They have superseded [[recurrent neural network]]-based models, which had previously superseded the pure statistical models, such as [[Word n-gram language model|word ''n''-gram language model]]s.

== History ==

[[Noam Chomsky]] did pioneering work on language models in the 1950s by developing a theory of [[formal grammar]]s.<ref>{{Cite journal |last=Chomsky |first=N. |date=September 1956 |title=Three models for the description of language |journal=IRE Transactions on Information Theory |volume=2 |issue=3 |pages=113–124 |doi=10.1109/TIT.1956.1056813 |issn=2168-2712}}</ref>

In 1980, the first significant statistical language model was proposed, and during the decade IBM performed '[[Claude Shannon|Shannon]]-style' experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.<ref>{{cite journal |last1=Rosenfeld |first1=Ronald |year=2000 |title=Two decades of statistical language modeling: Where do we go from here? |journal=Proceedings of the IEEE |volume=88 |issue=8 |pages=1270–1278 |doi=10.1109/5.880083 |s2cid=10959945 |url=https://figshare.com/articles/journal_contribution/6611138 }}</ref> Statistical approaches proved more useful for many purposes than rule-based formal grammars, and discrete representations like [[Word n-gram language model|word ''n''-gram language models]], which assign probabilities to discrete combinations of words, made significant advances.

In the 2000s, continuous representations for words, such as [[Word2vec|word embeddings]], began to replace discrete representations.<ref>{{Cite news |date=2022-02-22 |title=The Nature Of Life, The Nature Of Thinking: Looking Back On Eugene Charniak's Work And Life |url=https://cs.brown.edu/news/2022/02/22/the-nature-of-life-the-nature-of-thinking-looking-back-on-eugene-charniaks-work-and-life/ |archive-url=https://web.archive.org/web/20241103134558/https://cs.brown.edu/news/2022/02/22/the-nature-of-life-the-nature-of-thinking-looking-back-on-eugene-charniaks-work-and-life/ |archive-date=3 November 2024 |access-date=2025-02-05 |language=en |url-status=live }}</ref> Typically, the representation is a [[Real number|real-valued]] vector that encodes the meaning of the word in such a way that words closer together in the vector space are expected to be similar in meaning, and common relationships between pairs of words, such as plurality or gender, correspond to consistent offsets between their vectors.
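The offset relationships between word vectors can be illustrated with a toy sketch (the 3-dimensional vectors below are hand-picked for illustration only; real embeddings are learned from corpora and typically have hundreds of dimensions):

```python
import numpy as np

# Hypothetical toy embeddings; real models learn these from text
# (e.g. with word2vec) rather than using hand-picked values.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.2, 0.8]),
}

def cosine(u, v):
    # Cosine similarity: 1.0 means identical direction, 0 means orthogonal.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# The gender relationship appears as a roughly constant vector offset,
# so king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
```

In this toy space `best` comes out as `"queen"`, mirroring the analogy behavior reported for learned embeddings.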
== Pure statistical models ==
=== Models based on word ''n''-grams ===
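A word ''n''-gram model estimates the probability of a word from counts of short word sequences in a training corpus. A minimal, unsmoothed bigram sketch (the corpus and estimates below are illustrative only; real systems apply smoothing to avoid zero probabilities):

```python
from collections import Counter

# Toy corpus; a real model would be trained on millions of words.
corpus = "the cat sat on the mat the cat ran".split()

# Maximum-likelihood bigram estimate:
# P(w_m | w_{m-1}) = count(w_{m-1}, w_m) / count(w_{m-1}).
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])  # every word that has a successor

def p(word, prev):
    # Unsmoothed: unseen bigrams get probability zero.
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
```

For the toy corpus above, `p("cat", "the")` is 2/3 because "the" occurs three times as a context and is followed by "cat" twice.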
=== Exponential ===
<math display="block"> P(w_m \mid w_1,\ldots,w_{m-1}) = \frac{1}{Z(w_1,\ldots,w_{m-1})} \exp (a^T f(w_1,\ldots,w_m))</math>
where <math>Z(w_1,\ldots,w_{m-1})</math> is the [[Partition function (mathematics)|partition function]], <math>a</math> is the parameter vector, and <math>f(w_1,\ldots,w_m)</math> is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain ''n''-gram. It is helpful to use a prior on <math>a</math> or some form of [[Regularization (mathematics)|regularization]].
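The model above can be sketched directly in code. The tracked bigram features and weights below are hypothetical; in practice the parameter vector <math>a</math> is fit by maximizing the likelihood of a training corpus:

```python
import math

vocab = ["cat", "dog", "mat"]

def features(history, word):
    # Simplest case from the text: each feature is an indicator of the
    # presence of a certain n-gram (here, hand-chosen bigrams).
    tracked = [("the", "cat"), ("the", "dog"), ("on", "mat")]
    return [1.0 if (history[-1], word) == t else 0.0 for t in tracked]

a = [1.2, 0.4, 2.0]  # parameter vector (illustrative, untrained)

def prob(word, history):
    # P(w | history) = exp(a . f(history, w)) / Z(history)
    score = lambda w: math.exp(sum(ai * fi for ai, fi in zip(a, features(history, w))))
    z = sum(score(w) for w in vocab)  # partition function Z(history)
    return score(word) / z
```

Because the partition function sums over the whole vocabulary, the outputs form a valid probability distribution for each history.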
The log-bilinear model is another example of an exponential language model.
== Neural models ==
=== Recurrent neural network ===
Continuous representations or [[Word embedding|embeddings of words]] are produced in [[recurrent neural network]]-based language models (known also as ''continuous space language models'').<ref>{{cite web |last1=Karpathy |first1=Andrej |title=The Unreasonable Effectiveness of Recurrent Neural Networks |url=https://karpathy.github.io/2015/05/21/rnn-effectiveness/ |access-date=27 January 2019 |archive-date=1 November 2020 |archive-url=https://web.archive.org/web/20201101215448/http://karpathy.github.io/2015/05/21/rnn-effectiveness/ |url-status=live }}</ref> Such continuous space embeddings help to alleviate the [[curse of dimensionality]], which is the consequence of the number of possible sequences of words increasing [[Exponential growth|exponentially]] with the size of the vocabulary.
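As a sketch of how such a model compresses an unbounded history into a fixed-size state rather than enumerating the exponentially many word sequences (all sizes and parameters below are toy values, untrained):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 50, 8, 16   # vocabulary size, embedding dim, hidden dim (toy sizes)

# Parameters (random here; trained by backpropagation in practice).
E  = rng.normal(size=(V, d)) * 0.1   # word embeddings, one row per word
Wx = rng.normal(size=(h, d)) * 0.1   # input-to-hidden weights
Wh = rng.normal(size=(h, h)) * 0.1   # hidden-to-hidden weights (the recurrence)
Wo = rng.normal(size=(V, h)) * 0.1   # hidden-to-vocabulary logits

def next_word_distribution(word_ids):
    # One forward pass of a vanilla RNN language model: the hidden state s
    # summarizes the entire history in h numbers.
    s = np.zeros(h)
    for i in word_ids:
        s = np.tanh(Wx @ E[i] + Wh @ s)
    logits = Wo @ s
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()               # distribution over the vocabulary
```

Whatever the history length, the output is a proper probability distribution over the `V` vocabulary entries.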
=== Large language models ===
{{excerpt|Large language model}}
Although they sometimes match human performance, it is not clear whether they are plausible [[cognitive model]]s.
== Evaluation and benchmarks ==
Various data sets have been developed for use in evaluating language processing systems.<ref name=":0">{{cite arXiv|last1=Devlin|first1=Jacob|last2=Chang|first2=Ming-Wei|last3=Lee|first3=Kenton|last4=Toutanova|first4=Kristina|date=2018-10-10|title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding|eprint=1810.04805|class=cs.CL}}</ref> These include:
* [[Massive Multitask Language Understanding]] (MMLU)<ref>{{Citation |last=Hendrycks |first=Dan |title=Measuring Massive Multitask Language Understanding |date=2023-03-14 |url=https://github.com/hendrycks/test |archive-url=https://web.archive.org/web/20230315011614/https://github.com/hendrycks/test |archive-date=15 March 2023 |url-status=live |accessdate=2023-03-15}}</ref>
* Corpus of Linguistic Acceptability<ref>{{Cite web|url=https://nyu-mll.github.io/CoLA/|title=The Corpus of Linguistic Acceptability (CoLA)|website=nyu-mll.github.io|access-date=2019-02-25|archive-date=7 December 2020|archive-url=https://web.archive.org/web/20201207081834/https://nyu-mll.github.io/CoLA/|url-status=live}}</ref>
* GLUE benchmark<ref>{{Cite web|url=https://gluebenchmark.com/|title=GLUE Benchmark|website=gluebenchmark.com|language=en|access-date=2019-02-25|archive-date=4 November 2020|archive-url=https://web.archive.org/web/20201104161928/https://gluebenchmark.com/|url-status=live}}</ref>
* Stanford Sentiment [[Treebank]]<ref>{{Cite web|url=https://nlp.stanford.edu/sentiment/treebank.html|title=Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank|website=nlp.stanford.edu|access-date=2019-02-25|archive-date=27 October 2020|archive-url=https://web.archive.org/web/20201027125825/https://nlp.stanford.edu/sentiment/treebank.html|url-status=live}}</ref>
* Winograd NLI
* BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE
== See also ==
{{portal |Linguistics |Mathematics |Technology}}
{{div col}}
* {{Annotated link|Artificial intelligence and elections}}
* [[Cache language model]]
* [[Deep linguistic processing]]
* [[Ethics of artificial intelligence]]
* [[Factored language model]]
* [[Generative pre-trained transformer]]
* [[Katz's back-off model]]
* [[Language technology]]
* [[Semantic similarity network]]
* [[Statistical model]]
{{div col end}}
{{Natural language processing}}
{{Artificial intelligence navbox}}
[[Category:Language modeling|*]]