Large language model: Difference between revisions

Content deleted Content added
Yoderj (talk | contribs)
History: Again emphasize the strong link between LLMs and transformers by moving "transformer" a little earlier in the history discussion.
Added citations and content on self-consistency, least-to-most prompting, reflection and tool-augmented reasoning. Expanded reasoning section with follow-up methods (self-consistency, least-to-most, reflection, tool use) + sources. Added sources on reasoning methods: self-consistency, least-to-most prompting, reflection, tool-augmented reasoning.
Tags: possible AI-generated citations Visual edit
 
(657 intermediate revisions by more than 100 users not shown)
Line 1:
{{Short description|Type of artificialmachine neurallearning networkmodel}}
{{MachineDistinguish|Logic learning|Artificial neural networkmachine}}
{{redirect|LLM}}
A '''large language model''' ('''LLM''') is a [[language model]] notable for its ability to achieve general-purpose language generation. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive [[self-supervised learning|self-supervised]] and [[semi-supervised learning|semi-supervised]] training process.<ref name=":7">{{Cite web |date=2019-02-14 |title=Better Language Models and Their Implications |url=https://openai.com/blog/better-language-models/ |url-status=live |archive-url= https://web.archive.org/web/20201219132206/https://openai.com/blog/better-language-models/ |archive-date=2020-12-19 |access-date=2019-08-25 |website=OpenAI}}</ref> LLMs are [[artificial neural network]]s, with the largest and most capable LLMs (such as [[ChatGPT]]) built with a [[Transformer (machine learning model)|transformer]]-based architecture, although other architectures, such as [[recurrent neural network]] variants and [[Mamba (deep learning)|Mamba]] (a [[state-space representation|state space]] model) are used as well.<ref>{{cite arXiv |eprint=2305.13048 |last1=Peng |first1=Bo |last2=Alcaide |first2=Eric |last3=Anthony |first3=Quentin |last4=Albalak |first4=Alon |last5=Arcadinho |first5=Samuel |last6=Biderman |first6=Stella |last7=Cao |first7=Huanqi |last8=Cheng |first8=Xin |last9=Chung |first9=Michael |last10=Grella |first10=Matteo |author11=Kranthi Kiran GV |last12=He |first12=Xuzheng |last13=Hou |first13=Haowen |last14=Lin |first14=Jiaju |last15=Kazienko |first15=Przemyslaw |last16=Kocon |first16=Jan |last17=Kong |first17=Jiaming |last18=Koptyra |first18=Bartlomiej |last19=Lau |first19=Hayden |author20=Krishna Sri Ipsit Mantri |last21=Mom |first21=Ferdinand |last22=Saito |first22=Atsushi |last23=Song |first23=Guangyu |last24=Tang |first24=Xiangru |last25=Wang |first25=Bolun |last26=Wind |first26=Johan S. |last27=Wozniak |first27=Stanislaw |last28=Zhang |first28=Ruichong |last29=Zhang |first29=Zhenyuan |last30=Zhao |first30=Qihang |title=RWKV: Reinventing RNNS for the Transformer Era |date=2023 |class=cs.CL |display-authors=1 }}</ref><ref>{{Cite web |last=Merritt |first=Rick |date=2022-03-25 |title=What Is a Transformer Model? |url=https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/ |access-date=2023-07-25 |website=NVIDIA Blog |language=en-US}}</ref><ref>{{Citation |last1=Gu |first1=Albert |title=Mamba: Linear-Time Sequence Modeling with Selective State Spaces |date=2023-12-01 |arxiv=2312.00752 |last2=Dao |first2=Tri}}</ref>
{{Unreliable sources|date=August 2025}}
{{Machine learning|Artificial neural network}}A '''large language model''' ('''LLM''') is a [[language model]] trained with [[Self-supervised learning|self-supervised]] [[machine learning]] on a vast amount of text, designed for [[natural language processing]] tasks, especially [[Natural language generation|language generation]].
 
LLMsThe canlargest beand usedmost forcapable textLLMs generation, a form ofare [[Generative artificialpre-trained intelligencetransformer|generative AIpretrained transformers]] (GPTs), bybased takingon ana input[[transformer textarchitecture]], andwhich repeatedlyare predictinglargely theused nextin token[[Generative or word.<ref name="Bowman">{{cite arXivartificial intelligence|eprint=2304.00612generative]] [[Chatbot|class=cs.CLchatbots]] |first=Samuelsuch R.as |last=Bowman[[ChatGPT]], [[Gemini (chatbot)|title=EightGemini]] Thingsand to[[Claude Know about Large Language Models(language model)|year=2023}}</ref>Claude]]. UpLLMs tocan 2020,be [[Fine-tuning (machinedeep learning)|fine tuning-tuned]] was the only way a model could be adapted to be able to accomplishfor specific tasks. Largeror sizedguided models, such as [[GPT-3]], however, can beby [[prompt engineering|prompt-engineered]] to achieve similar results.<ref name="few-shot-learnerslearners2">{{cite journal |last1=Brown |first1=Tom B. |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |date=Dec 2020 |editor1-last=Larochelle |editor1-first=H. |editor2-last=Ranzato |editor2-first=M. |editor3-last=Hadsell |editor3-first=R. |editor4-last=Balcan |editor4-first=M.F. |editor5-last=Lin |editor5-first=H. |title=Language Models are Few-Shot Learners |url=https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf |url-status=live |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=33 |pages=1877–1901 |archive-url=https://web.archive.org/web/20231117204007/https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf |archive-date=2023-11-17 |access-date=2023-03-14 |last25=Chess |last20=Hesse |first20=Christopher |last21=Chen |first21=Mark |last22=Sigler |first22=Eric |last23=Litwin |first23=Mateusz |last24=Gray |first24=Scott |first26=Jack |first25=Benjamin |last26=Clark |last19=Winter |last27=Berner |first27=Christopher |last28=McCandlish |first28=Sam |last29=Radford |first29=Alec |last30=Sutskever |first30=Ilya |last31=Amodei |first31=Dario |first19=Clemens |first18=Jeffrey |last18=Wu |last16=Ramesh |first16=Aditya |last17=Ziegler |first17=Daniel M.|arxiv=2005.14165}}</ref> TheyThese aremodels thoughtacquire to[[Predictive acquirelearning|predictive knowledgepower]] aboutregarding [[syntax]], [[semantics]], and "ontology"[[Ontology (information science)|ontologies]]<ref>{{cite conference |last1=Fathallah |first1=Nadeen |last2=Das |first2=Arunav |last3=De Giorgis |first3=Stefano |last4=Poltronieri |first4=Andrea |last5=Haase |first5=Peter |last6=Kovriguina |first6=Liubov |date=2024-05-26 |title=NeOn-GPT: A Large Language Model-Powered Pipeline for Ontology Learning |url=https://2024.eswc-conferences.org/wp-content/uploads/2024/05/77770034.pdf |conference=Extended Semantic Web Conference 2024 |___location=Hersonissos, Greece}}</ref> inherent in human [[Text corpus|language corpora]], but they also inherit inaccuracies and [[Algorithmic bias|biases]] present in the corpora[[Training, validation, and test data sets|data]] they are trained on.<ref name="Manning-2022">{{cite journal |last=Manning |first=Christopher D. |author-link=Christopher D. Manning |year=2022 |title=Human Language Understanding & Reasoning |url=https://www.amacad.org/publication/human-language-understanding-reasoning |url-status=live |journal=Daedalus |volume=151 |issue=2 |pages=127–138 |doi=10.1162/daed_a_01905 |s2cid=248377870 |archive-url=https://web.archive.org/web/20231117205531/https://www.amacad.org/publication/human-language-understanding-reasoning |archive-date=2023-11-17 |access-date=2023-03-09 |doi-access=free }}</ref>
 
Some notable LLMs are [[OpenAI]]'s [[Generative pre-trained transformer|GPT]] models (e.g., [[GPT-3.5]] and [[GPT-4]], used in [[ChatGPT]]), [[Google]]'s [[PaLM]] and [[Gemini (language model)|Gemini]] (used in [[Google Bard|Bard]]), [[Microsoft Copilot|Microsoft's Copilot]], [[Meta Platforms|Meta]]'s [[LLaMA]] family of open-source models, and [[Anthropic]]'s [[Claude (language model)|Claude]] models.
 
==History==
[[File:Trends_in_AI_training_FLOP_over_time_(2010-2025).svg|thumb|The training compute of notable large models in FLOPs vs publication date over the period 2010–2024. For overall notable models (top left), frontier models (top right), top language models (bottom left) and top models within leading companies (bottom right). The majority of these models are language models.]]
[[File:The-Transformer-model-architecture.png|thumb|upright=1.3|An illustration of main components of the transformer model from the original paper, where layers were normalized after (instead of before) multiheaded attention.]]
[[File:Large-scale_AI_training_compute_(FLOP)_vs_Publication_date_(2017-2024).svg|thumb|The training compute of notable large AI models in FLOPs vs publication date over the period 2017–2024. The majority of large models are language models or multimodal models with language capacity.]]
At the 2017 [[NeurIPS]] conference, Google researchers introduced the [[transformer architecture]] in their landmark paper "[[Attention Is All You Need]]". This paper's goal was to improve upon 2014 [[Seq2seq]] technology, <ref>{{cite journal |last1=Vaswani |first1=Ashish |author1-link= Ashish Vaswani |last2=Shazeer |first2=Noam |last3=Parmar |first3=Niki |last4=Uszkoreit |first4=Jakob |last5=Jones |first5=Llion |last6=Gomez |first6=Aidan N |author6-link= Aidan Gomez |last7=Kaiser |first7=Łukasz |last8=Polosukhin |first8=Illia |title=Attention is All you Need |journal=Advances in Neural Information Processing Systems |date=2017 |volume=30 |url=https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf |publisher=Curran Associates, Inc.}}</ref> and was based mainly on the [[attention (machine learning)|attention]] mechanism developed by Bahdanau et. al. in 2014.<ref>{{cite arxiv |last1=Bahdanau |first1=Dzmitry |last2=Cho |first2=Kyunghyun |last3=Bengio |first3=Yoshua |title=Neural Machine Translation by Jointly Learning to Align and Translate |date=2014 |arxiv=1409.0473}}</ref> The following year in 2018, [[BERT (language model)|BERT]] was introduced and quickly became "ubiquitous".<ref>{{Cite journal|last1=Rogers|first1=Anna|last2=Kovaleva|first2=Olga|last3=Rumshisky|first3=Anna|date=2020|title=A Primer in BERTology: What We Know About How BERT Works|url=https://aclanthology.org/2020.tacl-1.54|journal=Transactions of the Association for Computational Linguistics|volume=8|pages=842–866|doi=10.1162/tacl_a_00349|arxiv=2002.12327|s2cid=211532403}}</ref> Though the original transformer has both encoder and decoder blocks, BERT is an encoder-only model.
Before the emergence of transformer-based models in 2017, some [[Language model|language models]] were considered large relative to the computational and data constraints of their time. In the early 1990s, [[IBM]]'s statistical models pioneered [[Bitext word alignment|word alignment]] techniques for machine translation, laying the groundwork for [[Construction grammar|corpus-based language modeling]]. A smoothed [[:Word n-gram language model|n-gram model]] in 2001, such as those employing [[Kneser-Ney smoothing]], trained on 300 million words achieved state-of-the-art [[perplexity]] on benchmark tests at the time.<ref>{{cite journal |journal=Computer Speech and Language|last=Goodman |first=Joshua |title=A Bit of Progress in Language Modeling |date=2001-08-09 |url=https://arxiv.org/abs/cs/0108005}}</ref> During the 2000s, with the rise of widespread internet access, researchers began compiling massive text datasets from the web ("web as corpus"<ref>{{Cite journal |last1=Kilgarriff |first1=Adam |last2=Grefenstette |first2=Gregory |date=September 2003 |title=Introduction to the Special Issue on the Web as Corpus |url=https://direct.mit.edu/coli/article/29/3/333-347/1816 |journal=Computational Linguistics |volume=29 |issue=3 |pages=333–347 |doi=10.1162/089120103322711569 |issn=0891-2017}}</ref>) to train statistical language models.<ref>{{Cite journal |last1=Banko |first1=Michele |last2=Brill |first2=Eric |date=2001 |title=Scaling to very very large corpora for natural language disambiguation |url=http://dx.doi.org/10.3115/1073012.1073017 |journal=Proceedings of the 39th Annual Meeting on Association for Computational Linguistics - ACL '01 |pages=26–33 |___location=Morristown, NJ, USA |publisher=Association for Computational Linguistics |doi=10.3115/1073012.1073017}}</ref><ref>{{Cite journal |last1=Resnik |first1=Philip |last2=Smith |first2=Noah A. |date=September 2003 |title=The Web as a Parallel Corpus |url=https://direct.mit.edu/coli/article/29/3/349-380/1809 |journal=Computational Linguistics |volume=29 |issue=3 |pages=349–380 |doi=10.1162/089120103322711578 |issn=0891-2017 |doi-access=free |access-date=2024-06-07 |archive-date=2024-06-07 |archive-url=https://web.archive.org/web/20240607172811/https://direct.mit.edu/coli/article/29/3/349-380/1809 |url-status=live |url-access=subscription }}</ref>
 
Moving beyond n-gram models, researchers started in 2000 to use neural networks to learn language models.<ref>{{Cite book |last1=Xu |first1=Wei |last2=Rudnicky |first2=Alex |chapter=Can artificial neural networks learn language models? |date=2000-10-16 |title=6th International Conference on Spoken Language Processing (ICSLP 2000) |chapter-url=https://www.isca-archive.org/icslp_2000/xu00b_icslp.html |publisher=ISCA |volume=1 |page= |doi=10.21437/icslp.2000-50}}</ref> Following the breakthrough of [[Deep learning|deep neural networks]] in image classification around 2012,<ref>{{cite journal | doi=10.3390/rs13224712 | doi-access=free | title=Review of Image Classification Algorithms Based on Convolutional Neural Networks | date=2021 | last1=Chen | first1=Leiyu | last2=Li | first2=Shaobo | last3=Bai | first3=Qiang | last4=Yang | first4=Jing | last5=Jiang | first5=Sanlong | last6=Miao | first6=Yanming | journal=Remote Sensing | volume=13 | issue=22 | page=4712 | bibcode=2021RemS...13.4712C }}</ref> similar architectures were adapted for language tasks. This shift was marked by the development of [[Word embedding|word embeddings]] (eg, [[Word2vec|Word2Vec]] by Mikolov in 2013) and sequence-to-sequence ([[seq2seq]]) models using [[Long short-term memory|LSTM]]. In 2016, Google transitioned its translation service to [[neural machine translation]] (NMT), replacing statistical phrase-based models with deep [[Recurrent neural network|recurrent neural networks]]. These early NMT systems used LSTM-based [[Encoder-decoder model|encoder-decoder architectures]], as they preceded the invention of [[Transformer (deep learning architecture)|transformers]]. [[File:The-Transformer-model-architecture.png|thumb|upright=1.3|An illustration of main components of the transformer model from the original paper, where layers were normalized after (instead of before) multiheaded attention]]
Although decoder-only [[GPT-1]] was introduced in 2018, it was [[GPT-2]] in 2019 that caught widespread attention because [[OpenAI]] at first deemed it too powerful to release publicly, out of fear of malicious use.<ref>{{cite web |url=https://www.theguardian.com/technology/2019/feb/14/elon-musk-backed-ai-writes-convincing-news-fiction |title=New AI fake text generator may be too dangerous to release, say creators |last=Hern |first=Alex |work=[[The Guardian]] |date=14 February 2019
At the 2017 [[NeurIPS]] conference, Google researchers introduced the transformer architecture in their landmark paper "[[Attention Is All You Need]]". This paper's goal was to improve upon 2014 seq2seq technology,<ref>{{cite journal |last1=Vaswani |first1=Ashish |author1-link=Ashish Vaswani |last2=Shazeer |first2=Noam |last3=Parmar |first3=Niki |last4=Uszkoreit |first4=Jakob |last5=Jones |first5=Llion |last6=Gomez |first6=Aidan N |author6-link=Aidan Gomez |last7=Kaiser |first7=Łukasz |last8=Polosukhin |first8=Illia |title=Attention is All you Need |journal=Advances in Neural Information Processing Systems |date=2017 |volume=30 |url=https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf |publisher=Curran Associates, Inc. |access-date=2024-01-21 |archive-date=2024-02-21 |archive-url=https://web.archive.org/web/20240221141113/https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf |url-status=live }}</ref> and was based mainly on the [[attention (machine learning)|attention]] mechanism developed by Bahdanau et al. in 2014.<ref>{{cite journal |journal=ICLR |last1=Bahdanau |first1=Dzmitry |last2=Cho |first2=Kyunghyun |last3=Bengio |first3=Yoshua |title=Neural Machine Translation by Jointly Learning to Align and Translate |date=2014 |url=https://arxiv.org/abs/1409.0473}}</ref> The following year in 2018, [[BERT (language model)|BERT]] was introduced and quickly became "ubiquitous".<ref>{{Cite journal|last1=Rogers|first1=Anna|last2=Kovaleva|first2=Olga|last3=Rumshisky|first3=Anna|date=2020|title=A Primer in BERTology: What We Know About How BERT Works|url=https://aclanthology.org/2020.tacl-1.54|journal=Transactions of the Association for Computational Linguistics|volume=8|pages=842–866|doi=10.1162/tacl_a_00349|arxiv=2002.12327|s2cid=211532403|access-date=2024-01-21|archive-date=2022-04-03|archive-url=https://web.archive.org/web/20220403103310/https://aclanthology.org/2020.tacl-1.54/|url-status=live}}</ref> Though the original transformer has both encoder and decoder blocks, BERT is an encoder-only model. Academic and research usage of BERT began to decline in 2023, following rapid improvements in the abilities of decoder-only models (such as GPT) to solve tasks via [[Prompt engineering|prompting]].<ref name="auto">{{Cite book|last1=Movva|first1=Rajiv|last2=Balachandar|first2=Sidhika|last3=Peng|first3=Kenny|last4=Agostini|first4=Gabriel|last5=Garg|first5=Nikhil|last6=Pierson|first6=Emma|chapter=Topics, Authors, and Institutions in Large Language Model Research: Trends from 17K arXiv Papers |date=2024|title=Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)|chapter-url=https://aclanthology.org/2024.naacl-long.67|volume=|pages=1223–1243|doi=10.18653/v1/2024.naacl-long.67|arxiv=2307.10700 |access-date=2024-12-08}}</ref>
|access-date = 20 January 2024}}</ref> [[GPT-3]] in 2020 went a step further and {{as of|2024|lc=y}} is available only via [[Web API|API]] with no offering of downloading the model to execute locally. But it was the 2022 consumer-facing browser-based [[ChatGPT]] that captured the imaginations of the general population and "completely changed the world".<ref>{{cite web |url=https://www.euronews.com/next/2023/11/30/chatgpt-a-year-on-3-ways-the-ai-chatbot-has-completely-changed-the-world-in-12-months |title=ChatGPT a year on: 3 ways the AI chatbot has completely changed the world in 12 months |author=<!--Not stated--> |date=November 30, 2023 |publisher=[[Euronews]] |access-date=January 20, 2024}}</ref> The 2023 [[GPT-4]] was praised for its increased accuracy and as a "holy grail" for its [[Multimodal learning|multimodal]] capabilities.<ref>{{cite web |url=https://www.technologyreview.com/2023/03/14/1069823/gpt-4-is-bigger-and-better-chatgpt-openai/ |title=GPT-4 is bigger and better than ChatGPT—but OpenAI won’t say why |last=Heaven |first=Will |date=March 14, 2023 |publisher=[[MIT Technology Review]] |access-date=January 20, 2024}}</ref> OpenAI did not reveal high-level architecture and the number of [[Parameter#Artificial Intelligence|parameters]] of GPT-4.
 
Although decoder-only [[GPT-1]] was introduced in 2018, it was [[GPT-2]] in 2019 that caught widespread attention because [[OpenAI]] claimed to have initially deemed it too powerful to release publicly, out of fear of malicious use.<ref>{{cite web |url=https://www.theguardian.com/technology/2019/feb/14/elon-musk-backed-ai-writes-convincing-news-fiction |title=New AI fake text generator may be too dangerous to release, say creators |last=Hern |first=Alex |work=[[The Guardian]] |date=14 February 2019 |access-date=20 January 2024 |archive-date=14 February 2019 |archive-url=https://web.archive.org/web/20190214173112/https://www.theguardian.com/technology/2019/feb/14/elon-musk-backed-ai-writes-convincing-news-fiction |url-status=live }}</ref> [[GPT-3]] in 2020 went a step further and {{as of|2025|lc=y}} is available only via [[Web API|API]] with no offering of downloading the model to execute locally. But it was the 2022 consumer-facing chatbot [[ChatGPT]] that received extensive media coverage and public attention.<ref>{{cite web |url=https://www.euronews.com/next/2023/11/30/chatgpt-a-year-on-3-ways-the-ai-chatbot-has-completely-changed-the-world-in-12-months |title=ChatGPT a year on: 3 ways the AI chatbot has completely changed the world in 12 months |author=<!--Not stated--> |date=November 30, 2023 |publisher=[[Euronews]] |access-date=January 20, 2024 |archive-date=January 14, 2024 |archive-url=https://web.archive.org/web/20240114025250/https://www.euronews.com/next/2023/11/30/chatgpt-a-year-on-3-ways-the-ai-chatbot-has-completely-changed-the-world-in-12-months |url-status=live }}</ref> The 2023 [[GPT-4]] was praised for its increased accuracy and as a "holy grail" for its [[Multimodal learning|multimodal]] capabilities.<ref>{{cite web |url=https://www.technologyreview.com/2023/03/14/1069823/gpt-4-is-bigger-and-better-chatgpt-openai/ |title=GPT-4 is bigger and better than ChatGPT—but OpenAI won't say why |last=Heaven |first=Will |date=March 14, 2023 |publisher=[[MIT Technology Review]] |access-date=January 20, 2024 |archive-date=March 17, 2023 |archive-url=https://web.archive.org/web/20230317224201/https://www.technologyreview.com/2023/03/14/1069823/gpt-4-is-bigger-and-better-chatgpt-openai/ |url-status=live }}</ref> OpenAI did not reveal the high-level architecture and the number of [[Parameter#Artificial intelligence|parameters]] of GPT-4. The release of ChatGPT led to an uptick in LLM usage across several research subfields of computer science, including robotics, software engineering, and societal impact work.<ref name="auto"/> In 2024 OpenAI released the [[Reasoning language model|reasoning model]] [[OpenAI o1]], which generates long chains of thought before returning a final answer.<ref name="NYTimesInfo">{{Cite web |last=Metz |first=Cade |date=September 12, 2024 |title=OpenAI Unveils New ChatGPT That Can Reason Through Math and Science |url=https://www.nytimes.com/2024/09/12/technology/openai-chatgpt-math.html |access-date=September 12, 2024 |work=[[The New York Times]]}}</ref> Many LLMs with parameter counts comparable to those of OpenAI's GPT series have been developed.<ref>{{cite web |url=https://ourworldindata.org/grapher/artificial-intelligence-parameter-count?time=2017-09-05..latest |title=Parameters in notable artificial intelligence systems |author=<!--Not stated--> |date=November 30, 2023 |website=ourworldindata.org |access-date=January 20, 2024}}</ref>
In the meantime, competing language models have for the most part been playing catch-up to the GPT series, at least in terms of number of parameters.<ref>{{cite web |url=https://ourworldindata.org/grapher/artificial-intelligence-parameter-count?time=2017-09-05..latest |title=Parameters in notable artificial intelligence systems |author=<!--Not stated--> |date=November 30, 2023 |website=ourworldindata.org |access-date=January 20, 2024}}</ref> Notable exceptions in terms of number of parameters included Google's 2019 [[Transformer (machine learning model)#Pretrain-finetune|T5-11B]] and 2022 [[PaLM|PaLM-E]]. In terms of [[Elo rating system|Elo ratings]], on January 26, 2024, Google's Bard (Gemini Pro) surpassed the regular GPT-4, but not the [[Software release life cycle#Beta|limited-availability]] GPT-4-Turbo.<ref>{{cite web |url=https://analyticsindiamag.com/googles-gemini-pro-beats-gpt-4/ |title=Google’s Gemini Pro Beats GPT-4 |author=<!--Not stated--> |date=January 27, 2024 |website=analyticsindiamag.com |access-date=January 29, 2024}}</ref>
 
Since 2022, [[Source-available software|source-available]] models have been gaining popularity, especially at first with [[BLOOM (language model)|BLOOM]] and [[LLaMA]], though both have restrictions on the field of use. [[Mistral AI]]'s models Mistral 7B and Mixtral 8x7b have the more permissive [[Apache License]]. {{AsIn of|2024|1}}January 2025, Mixtral[[DeepSeek]] 8x7breleased isDeepSeek theR1, mosta powerful671-billion-parameter open-weight LLMmodel accordingthat toperforms thecomparably LMSYSto ChatbotOpenAI Arena Leaderboard, being more powerful than GPT-3.5o1 but notat asa powerfulmuch aslower GPT-4cost.<ref>{{citeCite web |urllast=https://huggingface.co/spaces/lmsys/chatbotSharma |first=Shubham |date=2025-arena01-leaderboard20 |title=LMSYSOpen-source ChatbotDeepSeek-R1 Arenauses Leaderboardpure reinforcement learning to match OpenAI o1 — at 95% less cost |authorurl=<!https://venturebeat.com/ai/open-source-Not stateddeepseek-r1-> |website=huggingface.couses-pure-reinforcement-learning-to-match-openai-o1-at-95-less-cost/ |access-date=January2025-01-26 20,|website=VentureBeat 2024|language=en-US}}</ref>
 
Since 2023, many LLMs have been trained to be [[Multimodal learning|multimodal]], having the ability to also process or generate other types of data, such as images or audio. These LLMs are also called large multimodal models (LMMs).<ref>{{Cite web |last=Zia |first=Dr Tehseen |date=2024-01-08 |title=Unveiling of Large Multimodal Models: Shaping the Landscape of Language Models in 2024 |url=https://www.unite.ai/unveiling-of-large-multimodal-models-shaping-the-landscape-of-language-models-in-2024/ |access-date=2024-12-28 |website=Unite.AI |language=en-US}}</ref>
 
As of 2024, the largest and most capable models are all based on the transformer architecture. Some recent implementations are based on other architectures, such as [[recurrent neural network]] variants and [[Mamba (deep learning architecture)|Mamba]] (a [[state-space representation|state space]] model).<ref>{{cite journal |journal=EMNLP |eprint=2305.13048 |last1=Peng |first1=Bo |last2=Alcaide |first2=Eric |last3=Anthony |first3=Quentin |last4=Albalak |first4=Alon |last5=Arcadinho |first5=Samuel |last6=Biderman |first6=Stella |last7=Cao |first7=Huanqi |last8=Cheng |first8=Xin |last9=Chung |first9=Michael |last10=Grella |first10=Matteo |author11=Kranthi Kiran GV |last12=He |first12=Xuzheng |last13=Hou |first13=Haowen |last14=Lin |first14=Jiaju |last15=Kazienko |first15=Przemyslaw |last16=Kocon |first16=Jan |last17=Kong |first17=Jiaming |last18=Koptyra |first18=Bartlomiej |last19=Lau |first19=Hayden |author20=Krishna Sri Ipsit Mantri |last21=Mom |first21=Ferdinand |last22=Saito |first22=Atsushi |last23=Song |first23=Guangyu |last24=Tang |first24=Xiangru |last25=Wang |first25=Bolun |last26=Wind |first26=Johan S. |last27=Wozniak |first27=Stanislaw |last28=Zhang |first28=Ruichong |last29=Zhang |first29=Zhenyuan |last30=Zhao |first30=Qihang |title=RWKV: Reinventing RNNS for the Transformer Era |date=2023 |url=https://aclanthology.org/2023.findings-emnlp.936/ |display-authors=1 }}</ref><ref>{{Cite web |last=Merritt |first=Rick |date=2022-03-25 |title=What Is a Transformer Model? |url=https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/ |access-date=2023-07-25 |website=NVIDIA Blog |archive-date=2023-11-17 |archive-url=https://web.archive.org/web/20231117203924/https://blogs.nvidia.com/blog/what-is-a-transformer-model/ |url-status=live }}</ref><ref>{{cite journal |journal=COLM|last1=Gu |first1=Albert |title=Mamba: Linear-Time Sequence Modeling with Selective State Spaces |date=2023-12-01 |eprint=2312.00752 |last2=Dao |first2=Tri |url=https://arxiv.org/abs/2312.00752 }}</ref>
 
== Dataset preprocessing ==
{{See also|List of datasets for machine-learning research#Internet}}
 
===Tokenization===
===Probabilistic tokenization===
{{Anchor|Tokenization}}
Using a modification of [[byte pair encoding|byte-pair encoding]], in the first step, all unique characters (including blanks and [[punctuation mark]]s) are treated as an initial set of [[n-gram|''n''-grams]] (i.e. initial set of uni-grams). Successively the most frequent pair of adjacent characters is merged into a bi-gram and all instances of the pair are replaced by it. All occurrences of adjacent pairs of (previously merged) ''n''-grams that most frequently occur together are then again merged into even lengthier ''n''-gram repeatedly until a vocabulary of prescribed size is obtained (in case of [[GPT-3]], the size is 50257).<ref name="xbiWb">{{Cite web |title=OpenAI API |url=https://platform.openai.com/ |archive-url=https://web.archive.org/web/20230423211308/https://platform.openai.com/tokenizer |archive-date=April 23, 2023 |access-date=2023-04-30 |website=platform.openai.com |language=en}}</ref> Token vocabulary consists of [[integers]], spanning from zero up to the size of the token vocabulary. New words can always be interpreted as combinations of the tokens and the initial-set uni-grams.<ref name="2022Book_">{{cite book |last1=Paaß |first1=Gerhard |chapter-url= https://link.springer.com/chapter/10.1007/978-3-031-23190-2_2 |title=Foundation Models for Natural Language Processing |last2=Giesselbach |first2=Sven |chapter=Pre-trained Language Models |series=Artificial Intelligence: Foundations, Theory, and Algorithms |date= 2022 |pages=19–78 |doi=10.1007/978-3-031-23190-2_2 |isbn=9783031231902 |access-date=3 August 2023}}</ref>
 
As [[machine learning]] algorithms process numbers rather than text, the text must be converted to numbers. In the first step, a vocabulary is decided upon, then integer indices are arbitrarily but uniquely assigned to each vocabulary entry, and finally, an [[Word embedding|embedding]] is associated to the integer index. Algorithms include [[byte pair encoding|byte-pair encoding]] (BPE) and WordPiece. There are also special tokens serving as [[Control character|control characters]], such as <code>[MASK]</code> for masked-out token (as used in [[BERT (language model)|BERT]]), and <code>[UNK]</code> ("unknown") for characters not appearing in the vocabulary. Also, some special symbols are used to denote special text formatting. For example, "Ġ" denotes a preceding whitespace in RoBERTa and GPT. "##" denotes continuation of a preceding word in BERT.<ref>{{cite journal |journal=NAACL |last1=Kaushal |first1=Ayush |title=What do tokens know about their characters and how do they know it? |date=2022-06-06 |url=https://aclanthology.org/2022.naacl-main.179.pdf |last2=Mahowald |first2=Kyle }}</ref>
A token vocabulary based on the frequencies extracted from mainly English corpora uses as few tokens as possible for an average English word. An average word in another language encoded by such an English-optimized tokenizer is however split into suboptimal amount of tokens.
 
For example, the BPE tokenizer used by [[GPT-3]] (Legacy) would split <small><code>tokenizer: texts -> series of numerical "tokens"</code></small> may be split into:as
{| cellpadding="0;" cellspacing="0;" style="border:1px solid black"
 
{| class="wikitable"
|-
| style="border-left: 2px green; border-right: 2px green" |token
| ''n''-grams: ||bgcolor="#e0e0e0" | ''token'' || ''izer'' || bgcolor="#e0e0e0" | ''''':''''' || ''texts'' || bgcolor="#e0e0e0" | <code>-></code> || ''series'' || bgcolor="#e0e0e0" | ''of'' || ''numerical'' || bgcolor="#e0e0e0" | " || ''t'' || bgcolor="#e0e0e0" | ''ok'' || ''ens'' || bgcolor="#e0e0e0" | "
| style="background-color: grey; color: white; border-left: 2px green; border-right: 2px green" |izer
|-
| style="border-left: 2px green; border-right: 2px green" |:
| numbers as "tokens": || bgcolor="#e0e0e0" | 30001 || 7509 || bgcolor="#e0e0e0" | 25 || 13399 || bgcolor="#e0e0e0" | 4613 || 2168 || bgcolor="#e0e0e0" | 286 || 29052 || bgcolor="#e0e0e0" | 366 || 83 || bgcolor="#e0e0e0" | 482 || 641 || bgcolor="#e0e0e0" | 1
| style="background-color: grey; color: white; border-left: 2px green; border-right: 2px green" |&nbsp;texts
| style="border-left: 2px green; border-right: 2px green" |&nbsp;->
| style="background-color: grey; color: white; border-left: 2px green; border-right: 2px green" |series
| style="border-left: 2px green; border-right: 2px green" |&nbsp;of
| style="background-color: grey; color: white; border-left: 2px green; border-right: 2px green" |&nbsp;numerical
| style="border-left: 2px green; border-right: 2px green" |&nbsp;"
| style="background-color: grey; color: white; border-left: 2px green; border-right: 2px green" |t
| style="border-left: 2px green; border-right: 2px green" |ok
| style="background-color: grey; color: white; border-left: 2px green; border-right: 2px green" |ens
| style="border-left: 2px green; border-right: 2px green" |"
|}
 
Probabilistic tokenizationTokenization also [[Data compression|compress]]es the datasets, which is the reason for using the [[byte pair encoding]] algorithm as a tokenizer. Because LLMs generally require input to be an [[Array (data structure)|array]] that is not [[Jagged array|jagged]], the shorter texts must be "padded" until they match the length of the longest one. HowThe manyaverage tokensnumber are,of on average, neededwords per wordtoken depends on the language of the dataset.<ref>{{Cite web |author=Yennie Jun |date=2023-05-03 |title=All languages are NOT created (tokenized) equal |url=https://blog.yenniejun.com/p/all-languages-are-not-created-tokenized |titleurl-status=Alldead |archive-url=https://web.archive.org/web/20230817165705/https://blog.yenniejun.com/p/all-languages -are NOT -not-created (-tokenized) equal |author=Yennie Jun |archive-date=2023-0508-0317 |access-date=2023-08-17 |website=Language models cost much more in some languages than others |quote=In other words, to express the same sentiment, some languages require up to 10 times more tokens.|website=Language models cost much more in some languages than others}}</ref><ref name="LangModelTokenizsersUnfairness">{{Cite journal |last1=Petrov |first1=Aleksandar |last2=Malfa |first2=Emanuele La |last3=Torr |first3=Philip |last4=Bibi |first4=Adel |date=June 23, 2023 |title=Language Model Tokenizers Introduce Unfairness Between Languages |url=https://openreview.net/forum?id=Pj4YYuxTq9 |url-status=live |journal=NeurIPS |arxiv=2305.15425 |archive-url=https://web.archive.org/web/20231215212906/https://openreview.net/forum?id=Pj4YYuxTq9 |archive-date=December 15, 2023 |access-date=September 16, 2023 |via=openreview.net}}</ref> In English, the ratio is typically around 0.75 words per token, with 4 characters per token on average.<ref>{{Cite web |last=Sutherland |first=Richard |date=2024-12-19 |title=Claude AI Pricing: How Much Does Anthropic's AI Cost? |url=https://tech.co/news/how-much-does-claude-ai-cost |access-date=2025-08-16 |website=Tech.co |language=en-US}}</ref>
 
==== BPE ====
{{Main|Byte pair encoding}}
As an example, consider a tokenizer based on byte-pair encoding. In the first step, all unique characters (including blanks and [[punctuation mark]]s) are treated as an initial set of [[n-gram|''n''-grams]] (i.e. initial set of uni-grams). Successively the most frequent pair of adjacent characters is merged into a bi-gram and all instances of the pair are replaced by it. All occurrences of adjacent pairs of (previously merged) ''n''-grams that most frequently occur together are then again merged into even lengthier ''n''-gram, until a vocabulary of prescribed size is obtained. After a tokenizer is trained, any text can be tokenized by it, as long as it does not contain characters not appearing in the initial-set of uni-grams.<ref name="2022Book_">{{cite book |last1=Paaß |first1=Gerhard |title=Foundation Models for Natural Language Processing |last2=Giesselbach |first2=Sven |date=2022 |isbn=9783031231902 |series=Artificial Intelligence: Foundations, Theory, and Algorithms |pages=19–78 |chapter=Pre-trained Language Models |doi=10.1007/978-3-031-23190-2_2 |doi-access=free }}</ref>
 
==== Problems ====
A token vocabulary based on the frequencies extracted from mainly English corpora uses as few tokens as possible for an average English word. However, an average word in another language encoded by such an English-optimized tokenizer is split into a suboptimal amount of tokens. GPT-2 tokenizer can use up to 15 times more tokens per word for some languages, for example for the [[Shan language]] from [[Myanmar]]. Even more widespread languages such as [[Portuguese language|Portuguese]] and [[German language|German]] have "a premium of 50%" compared to English.<ref name="LangModelTokenizsersUnfairness"/>
 
===Dataset cleaning===
{{Main|Data cleansing}}
In the context of training LLMs, datasets are typically cleaned by removing toxic passages from the dataset, discarding low-quality, dataduplicated, andor de-duplicationtoxic data.<ref name="aYNg4">{{Cite arXivjournal |eprintjournal=2104.08758 |class=cs.CLEMNLP |first1=Jesse |last1=Dodge |first2=Maarten |last2=Sap |title=Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus |last3=Marasović |first3=Ana |last4=Agnew |first4=William |last5=Ilharco |first5=Gabriel |last6=Groeneveld |first6=Dirk |last7=Mitchell |first7=Margaret |last8=Gardner |first8=Matt |year=2021 |url=https://aclanthology.org/2021.emnlp-main.98.pdf}}</ref> Cleaned datasets can increase training efficiency and lead to improved downstream performance.<ref>{{cite journalbook |last1=Lee |first1=Katherine |last2=Ippolito |first2=Daphne |last3=Nystrom |first3=Andrew |last4=Zhang |first4=Chiyuan |last5=Eck |first5=Douglas |last6=Callison-Burch |first6=Chris |last7=Carlini |first7=Nicholas |datetitle=MayProceedings 2022of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |titlechapter=Deduplicating Training Data Makes Language Models Better |author7-link=Nicholas Carlini |date=May 2022 |chapter-url=https://aclanthology.org/2022.acl-long.577.pdf |journal=Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics |volume=1: Long Papers |pages=8424–8445 |doi=10.18653/v1/2022.acl-long.577}}</ref><ref>{{Citationcite arXiv |lastlast1=Li |firstfirst1=Yuanzhi |title=Textbooks Are All You Need II: phi-1.5 technical report |date=2023-09-11 |urleprint=http://arxiv.org/abs/2309.05463 |access-date=2024-01-20 |doi=10.48550/arXiv.2309.05463 |last2=Bubeck |first2=Sébastien |last3=Eldan |first3=Ronen |last4=Del Giorno |first4=Allie |last5=Gunasekar |first5=Suriya |last6=Lee |first6=Yin Tat |class=cs.CL }}</ref> A trained LLM can be used to clean datasets for training a further LLM.<ref>{{cite journal |journal=NeurIPS |first1=Zhenghao |last1=Lin |first2=Zhibin |last2=Gou |title=Rho-1: Not All Tokens Are What You Need |date=2024-04-11 |last3=Gong |first3=Yeyun |last4=Liu |first4=Xiao |last5=Shen |first5=Yelong |last6=Xu |first6=Ruochen |last7=Lin |first7=Chen |last8=Yang |first8=Yujiu |last9=Jiao |first9=Jian |url=https://dl.acm.org/doi/10.5555/3737916.3738830}}</ref>
 
With the increasing proportion of LLM-generated content on the web, data cleaning in the future may include filtering out such content. LLM-generated content can pose a problem if the content is similar to human text (making filtering difficult) but of lower quality (degrading performance of models trained on it).<ref name="qbFw1">{{Cite arXiv |eprint=2005.14165 |class=cs.CL |first1=Tom B. |last1=Brown |first2=Benjamin |last2=Mann |title=Language Models are Fewfew-Shot Learners |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbertshot-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |last16=Ramesh |first16=Aditya |last17=Ziegler |first17=Daniel M. |last18=Wu |first18=Jeffrey |last19=Winter |first19=Clemens |last20=Hesse |first20=Christopher |last21=Chen |first21=Mark |last22=Sigler |first22=Eric |last23=Litwin |first23=Mateusz |last24=Gray |first24=Scott |last25=Chess |first25=Benjamin |last26=Clark |first26=Jack |last27=Berner |first27=Christopher |last28=McCandlish |first28=Sam |last29=Radford |first29=Alec |last30=Sutskever |first30=Ilya |year=2020 |display-authors=1}}<learners2"/ref>
 
=== TrainingSynthetic and architecturedata ===
{{Main|Synthetic data}}
Training of largest language models might need more linguistic data than naturally available, or that the naturally occurring data is of insufficient quality. In these cases, synthetic data might be used. Microsoft's [[Phi (LLM)|Phi]] series of LLMs is trained on textbook-like data generated by another LLM.<ref>{{cite journal |journal=CoRR |first1=Marah |last1=Abdin |first2=Sam Ade |last2=Jacobs |title=Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone |date=2024-04-23 |last3=Awan |first3=Ammar Ahmad |last4=Aneja |first4=Jyoti |last5=Awadallah |first5=Ahmed |last6=Awadalla |first6=Hany |last7=Bach |first7=Nguyen |last8=Bahree |first8=Amit |last9=Bakhtiari |first9=Arash |url=https://arxiv.org/abs/2404.14219}}</ref>
 
== Training ==
{{See also|Fine-tuning (machine learning)}}
 
An LLM is a type of [[foundation model]] (large X model) trained on language. LLMs can be trained in different ways. In particular, GPT models are first pretrained to predict the next word on a large amount of data, before being fine-tuned.{{cn|date=August 2025}}
=== Reinforcement learning from human feedback (RLHF)===
[[Reinforcement learning from human feedback]] (RLHF) through algorithms, such as [[Proximal Policy Optimization|proximal policy optimization]], is used to further fine-tune a model based on a dataset of human preferences.<ref name="instructGPT-paper">{{Cite arXiv |eprint=2203.02155 |class=cs.CL |first1=Long |last1=Ouyang |first2=Jeff |last2=Wu |title=Training language models to follow instructions with human feedback |date=2022 |last3=Jiang |first3=Xu |last4=Almeida |first4=Diogo |last5=Wainwright |first5=Carroll L. |last6=Mishkin |first6=Pamela |last7=Zhang |first7=Chong |last8=Agarwal |first8=Sandhini |last9=Slama |first9=Katarina |last10=Ray |first10=Alex |last11=Schulman |first11=John |last12=Hilton |first12=Jacob |last13=Kelton |first13=Fraser |last14=Miller |first14=Luke |last15=Simens |first15=Maddie |last16=Askell |first16=Amanda |last17=Welinder |first17=Peter |last18=Christiano |first18=Paul |last19=Leike |first19=Jan |last20=Lowe |first20=Ryan}}</ref>
 
=== Instruction tuningCost ===
[[File:Estimated_training_cost_of_some_AI_models_-_2024_AI_index.jpg|thumb|right|upright=1.5]]
Using "self-instruct" approaches, LLMs have been able to [[Bootstrapping|bootstrap]] correct responses, replacing any naive responses, starting from human-generated corrections of a few cases. For example, in the instruction "Write an essay about the main themes represented in Hamlet," an initial naive completion might be 'If you submit the essay after March 17, your grade will be reduced by 10% for each day of delay," based on the frequency of this textual sequence in the corpus.<ref name="self-instruct-paper">{{Cite arXiv |eprint=2212.10560 |class=cs.CL |first1=Yizhong |last1=Wang |first2=Yeganeh |last2=Kordi |title=Self-Instruct: Aligning Language Model with Self Generated Instructions |date=2022 |last3=Mishra |first3=Swaroop |last4=Liu |first4=Alisa |last5=Smith |first5=Noah A. |last6=Khashabi |first6=Daniel |last7=Hajishirzi |first7=Hannaneh}}</ref>
Substantial infrastructure is necessary for training the largest models. The tendency towards larger models is visible in the [[list of large language models]]. For example, the training of GPT-2 (i.e. a 1.5-billion-parameters model) in 2019 cost $50,000, while training of the PaLM (i.e. a 540-billion-parameters model) in 2022 cost $8 million, and Megatron-Turing NLG 530B (in 2021) cost around $11 million. The qualifier "large" in "large language model" is inherently vague, as there is no definitive threshold for the number of parameters required to qualify as "large". [[GPT-1]] of 2018 has 117 million parameters.{{cn|date=August 2025}}
 
=== Fine-tuning ===
Before being [[Fine-tuning (deep learning)|fine-tuned]], most LLMs are next-token predictors. The fine-tuning adjust the output of an LLM to seem more conversational via techniques like [[reinforcement learning from human feedback]] (RLHF) or [[constitutional AI]].<ref>{{Cite web |last=Edwards |first=Benj |date=2023-05-09 |title=AI gains "values" with Anthropic's new Constitutional AI chatbot approach |url=https://arstechnica.com/information-technology/2023/05/ai-with-a-moral-compass-anthropic-outlines-constitutional-ai-in-its-claude-chatbot/ |access-date=2025-06-30 |website=Ars Technica |language=en}}</ref>
 
Instruction fine-tuning is a form of [[supervised learning]] used to teach LLMs to follow user instructions. In 2022, OpenAI demonstrated InstructGPT, a version of GPT-3 similarly fine-tuned to follow instructions.<ref>{{Cite web |last=Snyder |first=Alison |date=2022-01-27 |title=Next generation AI can follow a person's instructions and intentions |url=https://www.axios.com/2022/01/27/ai-instructions-learning-algorithm |access-date=2025-08-07 |website=Axios |language=en}}</ref>
 
Reinforcement learning from human feedback (RLHF) involves training a reward model to predict which text humans prefer. Then, the LLM can be fine-tuned through [[reinforcement learning]] to better satisfy this reward model. Since humans typically prefer truthful, helpful and harmless answers, RLHF favors such answers.{{CN|date=August 2025}}
 
== Architecture ==
LLMs are generally based on the [[Transformer (deep learning architecture)|transformer]] architecture, which leverages an [[Attention (machine learning)|attention]] mechanism that enables the model to process relationships between all elements in a sequence simultaneously, regardless of their distance from each other.{{cn|date=August 2025}}
 
=== Attention mechanism and context window ===
{{See also|Attention (machine learning)}}
[[File:Multiple attention heads.png|upright=1.3|thumb | When each head calculates, according to its own criteria, how much other tokens are relevant for the "it_" token, note that the second attention head, represented by the second column, is focusing most on the first two rows, i.e. the tokens "The" and "animal", while the third column is focusing most on the bottom two rows, i.e. on "tired", which has been tokenized into two tokens.<ref name="Jay_Allamar">{{Cite web | last=Allamar | first=Jay | title=Illustrated transformer | url=https://jalammar.github.io/illustrated-transformer/ | access-date=2023-07-29 | archive-date=2023-07-25 | archive-url=https://web.archive.org/web/20230725230033/http://jalammar.github.io/illustrated-transformer/ | url-status=live }}</ref>]]
 
In order to find out which tokens are relevant to each other within the scope of the context window, the attention mechanism calculates "soft" weights for each token, more precisely for its embedding, by using multiple attention heads, each with its own "relevance" for calculating its own soft weights. For example, the small (i.e. 117M parameter sized) [[GPT-2]] model has had twelve attention heads and a context window of only 1k tokens.<ref name="Jay_Allamar_GPT2">{{Cite web | last=Allamar | first=Jay | title=The Illustrated GPT-2 (Visualizing Transformer Language Models) |url=https://jalammar.github.io/illustrated-gpt2/ |access-date=2023-08-01 }}</ref> In its medium version it has 345M parameters and contains 24 layers, each with 12 attention heads. For the training with gradient descent a batch size of 512 was utilized.<ref name="2022Book_" />
 
Google's [[Gemini (language model)|Gemini 1.5]], presented in February 2024, can have a context window sized up to 1 million.<ref>{{Cite web |last=Yeung |first=Ken |date=2024-05-14 |title=Google announces Gemini 1.5 Flash, a rapid multimodal model with a 1M context window |url=https://venturebeat.com/ai/google-gemini-1-5-flash-rapid-multimodal-model-announced/ |access-date=2025-08-26 |website=VentureBeat |language=en-US}}</ref>
 
A model may be pre-trained either to predict how the segment continues, or what is missing in the segment, given a segment from its training dataset.<ref name="ioUpE">{{cite book |last1=Zaib |first1=Munazza |last2=Sheng |first2=Quan Z. |last3=Emma Zhang |first3=Wei |title=Proceedings of the Australasian Computer Science Week Multiconference |chapter=A Short Survey of Pre-trained Language Models for Conversational AI-A New Age in NLP |date=4 February 2020 |url=https://www.researchgate.net/publication/338931711 |pages=1–4 |arxiv=2104.10810 |doi=10.1145/3373017.3373028 |isbn=9781450376976 |s2cid=211040895}}</ref> It can be either
* autoregressive (i.e. predicting how the segment continues, as [[Generative pretrained transformer|GPTs]] do): for example given a segment "I like to eat", the model predicts "ice cream", or "sushi".
* "[[Cloze test|masked]]" (i.e. filling in the parts missing from the segment, the way "BERT"<ref name="jm">{{cite book |last1=Jurafsky |first1=Dan |url=https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf |title=Speech and Language Processing |last2=Martin |first2=James H. |date=7 January 2023 |edition=3rd edition draft |access-date=24 May 2022 |archive-date=23 March 2023 |archive-url=https://web.archive.org/web/20230323210221/https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf |url-status=live }}</ref> does it): for example, given a segment "I like to <code>[__] [__]</code> cream", the model predicts that "eat" and "ice" are missing.
 
Models may be trained on auxiliary tasks which test their understanding of the data distribution, such as Next Sentence Prediction (NSP), in which pairs of sentences are presented and the model must predict whether they appear consecutively in the training corpus.<ref name="jm" /> During training, [[Regularization (mathematics)|regularization]] loss is also used to stabilize training. However regularization loss is usually not used during [[Training, validation, and test data sets|testing]] and evaluation.
 
=== Mixture of experts ===
{{Main|Mixture of experts}}
The largest LLM may be too expensive to train and use directly. For such models, [[mixture of experts]] (MoE) can be applied, a line of research pursued by Google researchers since 2017 to train models reaching up to 1 trillion parameters.<ref name="HGZCJ">{{Cite arXiv |eprint=1701.06538 |class=cs.LG |first1=Noam |last1=Shazeer |first2=Azalia |last2=Mirhoseini |title=Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer |date=2017-01-01 |last3=Maziarz |first3=Krzysztof |last4=Davis |first4=Andy |last5=Le |first5=Quoc |last6=Hinton |first6=Geoffrey |last7=Dean |first7=Jeff}}</ref><ref name="R9Qq5">{{Cite arXiv |eprint=2006.16668 |class=cs.CL |first1=Dmitry |last1=Lepikhin |first2=HyoukJoong |last2=Lee |title=GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding  |date=2021-01-12 |language=en |last3=Xu |first3=Yuanzhong |last4=Chen |first4=Dehao |last5=Firat |first5=Orhan |last6=Huang |first6=Yanping |last7=Krikun |first7=Maxim |last8=Shazeer |first8=Noam |last9=Chen |first9=Zhifeng}}</ref><ref name="glam-blog" />
 
A [[mixture of experts]] (MoE) is a [[machine learning]] architecture in which multiple specialized neural networks ("experts") work together, with a gating mechanism that routes each input to the most appropriate expert(s). Mixtures of experts can reduce inference costs, as only a fraction of the parameters are used for each input. The approach was introduced in 2017 by Google researchers.<ref name="HGZCJ">{{Cite journal |journal=ICLR |url=https://arxiv.org/abs/1701.06538 |first1=Noam |last1=Shazeer |first2=Azalia |last2=Mirhoseini |title=Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer |date=2017-01-01 |last3=Maziarz |first3=Krzysztof |last4=Davis |first4=Andy |last5=Le |first5=Quoc |last6=Hinton |first6=Geoffrey |last7=Dean |first7=Jeff}}</ref><ref name="R9Qq5">{{Cite journal |journal=ICLR |url=https://arxiv.org/abs/2006.16668 |first1=Dmitry |last1=Lepikhin |first2=HyoukJoong |last2=Lee |title=GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding |date=2021-01-12 |last3=Xu |first3=Yuanzhong |last4=Chen |first4=Dehao |last5=Firat |first5=Orhan |last6=Huang |first6=Yanping |last7=Krikun |first7=Maxim |last8=Shazeer |first8=Noam |last9=Chen |first9=Zhifeng}}</ref><ref name="glam-blog">{{Cite web |last1=Dai |first1=Andrew M |last2=Du |first2=Nan |date=December 9, 2021 |title=More Efficient In-Context Learning with GLaM |url=https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html |access-date=2023-03-09 |website=ai.googleblog.com |archive-date=2023-03-12 |archive-url=https://web.archive.org/web/20230312072042/https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html |url-status=live}}</ref>
=== Prompt engineering, attention mechanism, and context window ===
{{See also|Prompt engineering|Attention (machine learning)}}
Most results previously achievable only by (costly) fine-tuning, can be achieved through [[prompt engineering]], although limited to the scope of a single conversation (more precisely, limited to the scope of a context window).<ref name="emergentpaper" />
[[File:Multiple_attention_heads.png |300px|thumb | When each head calculates, according to its own criteria, how much other tokens are relevant for the "it_" token, note that the second attention head, represented by the second column, is focusing most on the first two rows, i.e. the tokens "The" and "animal", while the third column is focusing most on the bottom two rows, i.e. on "tired", which has been tokenized into two tokens.<ref name="Jay_Allamar">{{Cite web | last=Allamar | first=Jay | title=Illustrated transformer |url=https://jalammar.github.io/illustrated-transformer/ |access-date=2023-07-29 |language=en}}</ref>]]
 
=== Parameter size ===
In order to find out which tokens are relevant to each other within the scope of the context window, the attention mechanism calculates "soft" weights for each token, more precisely for its embedding, by using multiple attention heads, each with its own "relevance" for calculating its own soft weights. For example, the small (i.e. 117M parameter sized) [[GPT-2]] model, has had twelve attention heads and a context window of only 1k token.<ref name="Jay_Allamar_GPT2">{{Cite web | last=Allamar | first=Jay | title=The Illustrated GPT-2 (Visualizing Transformer Language Models) |url=https://jalammar.github.io/illustrated-gpt2/ |access-date=2023-08-01 |language=en}}</ref>
{{see also|1.58-bit large language model}}
In its medium version it has 345M parameters and contains 24 layers,
Typically, LLMs are trained with single- or half-precision [[floating point numbers]] (float32 and float16). One float16 has 16 bits, or 2 bytes, and so one billion parameters require 2 gigabytes. The largest models typically have 100 billion parameters, requiring 200 gigabytes to load, which places them outside the range of most consumer electronics.<ref>{{Cite news |last=Mann |first=Tobias |title=How to run an LLM locally on your PC in less than 10 minutes |url=https://www.theregister.com/2024/03/17/ai_pc_local_llm/ |access-date=2024-05-17 |website=www.theregister.com}}</ref>
each with 12 attention heads. For the training with gradient descent a batch size of 512 was utilized.<ref name="2022Book_"/>
 
==== Quantization ====
The largest models can have a context window sized up to 200k (for example, [[Claude (language model)|Claude 2.1]]).<ref>{{cite web |url=https://www.anthropic.com/news/claude-2-1-prompting |title=Long context prompting for Claude 2.1 |date=December 6, 2023 |access-date=January 20, 2024}}</ref> Other models with large context windows includes GPT-4 Turbo, with a context window of up to 128k tokens.<ref>{{cite web | url=https://help.openai.com/en/articles/8555510-gpt-4-turbo |title=GPT-4 Turbo: Our latest model |last=Schade |first=Michael |access-date=January 20, 2024 }}</ref> Note that this maximum refers to the number of input tokens and that the maximum number of output tokens differs from the input and is often smaller. For example, the GPT-4 Turbo model has a maximum output of 4096 tokens. Also, {{as of|2024|1}}, GPT-4 Turbo is, for all tiers of service, "currently under preview with restrictive [[Rate limiting|rate limits]] that make them suitable for testing and evaluations, but not for production usage".<ref>{{cite web |url=https://platform.openai.com/docs/guides/rate-limits |title=Rate limits |author=<!--Not stated--> |website=openai.com |access-date=January 20, 2024}}</ref>
''Post-training [[Quantization (signal processing)|quantization]]''<ref name="LS2Go">{{Cite journal |last1=Nagel |first1=Markus |last2=Amjad |first2=Rana Ali |last3=Baalen |first3=Mart Van |last4=Louizos |first4=Christos |last5=Blankevoort |first5=Tijmen |date=2020-11-21 |title=Up or Down? Adaptive Rounding for Post-Training Quantization |url=https://proceedings.mlr.press/v119/nagel20a.html |url-status=live |journal=Proceedings of the 37th International Conference on Machine Learning |publisher=PMLR |pages=7197–7206 |archive-url=https://web.archive.org/web/20230614080854/https://proceedings.mlr.press/v119/nagel20a.html |archive-date=2023-06-14 |access-date=2023-06-14}}</ref> aims to decrease the space requirement by lowering precision of the parameters of a trained model, while preserving most of its performance. Quantization can be further classified as ''static quantization'' if the quantization parameters are determined beforehand (typically during a calibration phase), and ''dynamic quantization'' if the quantization is applied during inference. The simplest form of quantization simply truncates all the parameters to a given number of bits: this is applicable to static as well as dynamic quantization, but loses much precision. Dynamic quantization allows for the use of a different quantization [[Codebook#Data_compression|codebook]] per layer, either a lookup table of values or a linear mapping (scaling factor and bias), at the cost of foregoing the possible speed improvements from using lower-precision arithmetic.{{cn|date=August 2025}}
 
Quantized models are typically seen as frozen with modification of weights (e.g. fine-tuning) only applied to the original model. It is possible to fine-tune quantized models using [[LoRA|low-rank adaptation]].{{cn|date=August 2025}}
Length of a conversation that the model can take into account when generating its next answer is limited by the size of a context window, as well. If the length of a conversation, for example with [[Chat-GPT]], is longer than its context window, only the parts inside the context window are taken into account when generating the next answer, or the model needs to apply some algorithm to summarize the too distant parts of conversation.
 
== Extensibility ==
The shortcomings of making a context window larger include higher computational cost and possibly diluting the focus on local context, while making it smaller can cause a model to miss an important long-range dependency. Balancing them are a matter of experimentation and ___domain-specific considerations.
Beyond basic text generation, various techniques have been developed to extend LLM capabilities, including the use of external tools and data sources, improved reasoning on complex problems, and enhanced instruction-following or autonomy through prompting methods.
 
=== Prompt engineering ===
A model may be pre-trained either to predict how the segment continues, or what is missing in the segment, given a segment from its training dataset.<ref name="ioUpE">{{cite book |last1=Zaib |first1=Munazza |last2=Sheng |first2=Quan Z. |last3=Emma Zhang |first3=Wei |title=Proceedings of the Australasian Computer Science Week Multiconference |chapter=A Short Survey of Pre-trained Language Models for Conversational AI-A New Age in NLP |date=4 February 2020 |chapter-url=https://www.researchgate.net/publication/338931711 |pages=1–4 |arxiv=2104.10810 |doi=10.1145/3373017.3373028 |isbn=9781450376976 |s2cid=211040895}}</ref> It can be either
In 2020, [[OpenAI]] researchers demonstrated that their new model [[GPT-3]] could understand what format to use given a few rounds of Q and A (or other type of task) in the input data as example, thanks in part due to the RLHF technique. This technique, called ''few-shot prompting'', allows LLMs to be adapted to any task without requiring fine-tuning.<ref name="few-shot-learners2"/> Also in 2022, it was found that the base GPT-3 model can generate an instruction based on user input. The generated instruction along with user input is then used as input to another instance of the model under a "Instruction: [...], Input: [...], Output:" format. The other instance is able to complete the output and often produces the correct answer in doing so. The ability to "self-instruct" makes LLMs able to [[Bootstrapping|bootstrap]] themselves toward a correct answer.<ref name="self-instruct-paper">{{Cite journal |journal=ACL |url=https://aclanthology.org/2023.acl-long.754/ |first1=Yizhong |last1=Wang |first2=Yeganeh |last2=Kordi |title=Self-Instruct: Aligning Language Model with Self Generated Instructions |date=2022 |last3=Mishra |first3=Swaroop |last4=Liu |first4=Alisa |last5=Smith |first5=Noah A. |last6=Khashabi |first6=Daniel |last7=Hajishirzi |first7=Hannaneh}}</ref>
 
=== Dialogue processing (chatbot) ===
* autoregressive (i.e. predicting how the segment continues, the way [[Generative pretrained transformer|GPTs]] do it): for example given a segment "I like to eat", the model predicts "ice cream", or "sushi".
An LLM can be turned into a chatbot or a "dialog assistant" by specializing it for conversation. In essence, user input is prefixed with a marker such as "Q:" or "User:" and the LLM is asked to predict the output after a fixed "A:" or "Assistant:". This type of model became commercially available in 2022 with ChatGPT, a sibling model of InstructGPT fine-tuned to accept and produce dialog-formatted text based on GPT-3.5. It could similarly follow user instructions.<ref>{{cite web |title=Introducing ChatGPT |url=https://openai.com/index/chatgpt/ |website=openai.com |date=13 March 2024}}</ref> Before the stream of User and Assistant lines, a chat context usually start with a few lines of overarching instructions, from a role called "developer" or "system" to convey a higher authority than the user's input. This is called a "system prompt".<ref>{{cite web |title=OpenAI Platform |url=https://platform.openai.com/docs/guides/text?api-mode=responses |website=platform.openai.com |language=en}}</ref><ref>{{cite web |title=Giving Claude a role with a system prompt |url=https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/system-prompts |website=Anthropic |language=en}}</ref>
* "[[Cloze test|masked]]" (i.e. filling in the parts missing from the segment, the way "BERT"<ref name="jm">{{cite book |last1=Jurafsky |first1=Dan |url=https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf |title=Speech and Language Processing |last2=Martin |first2=James H. |date=7 January 2023 |edition=3rd edition draft |access-date=24 May 2022}}</ref> does it): for example, given a segment "I like to <code>[__] [__]</code> cream", the model predicts that "eat" and "ice" are missing.
 
=== Retrieval-augmented generation ===
Models may be trained on auxiliary tasks which test their understanding of the data distribution, such as Next Sentence Prediction (NSP), in which pairs of sentences are presented and the model must predict whether they appear consecutively in the training corpus.<ref name="jm" /> During training, [[Regularization (mathematics)|regularization]] loss is also used to stabilize training. However regularization loss is usually not used during [[Training, validation, and test data sets|testing]] and evaluation.
[[Retrieval-augmented generation]] (RAG) is an approach that enhances LLMs by integrating them with [[document retrieval]] systems. Given a query, a document retriever is called to retrieve the most relevant documents. This is usually done by encoding the query and the documents into vectors, then finding the documents with vectors (usually stored in a [[vector database]]) most similar to the vector of the query. The LLM then generates an output based on both the query and context included from the retrieved documents.<ref name="BUZBP">{{Cite journal |last1=Lewis |first1=Patrick |last2=Perez |first2=Ethan |last3=Piktus |first3=Aleksandra |last4=Petroni |first4=Fabio |last5=Karpukhin |first5=Vladimir |last6=Goyal |first6=Naman |last7=Küttler |first7=Heinrich |last8=Lewis |first8=Mike |last9=Yih |first9=Wen-tau |last10=Rocktäschel |first10=Tim |last11=Riedel |first11=Sebastian |last12=Kiela |first12=Douwe |date=2020 |title=Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks |url=https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=33 |pages=9459–9474 |arxiv=2005.11401 |access-date=2023-06-12 |archive-date=2023-06-12 |archive-url=https://web.archive.org/web/20230612171229/https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html |url-status=live }}</ref><ref>{{Cite web |last=Kiela |first=Douwe |last2=Riedel |first2=Sebastian |last3=Lewis |first3=Patrick |last4=Piktus |first4=Aleksandra |date=September 28, 2020 |title=Retrieval Augmented Generation: Streamlining the creation of intelligent natural language processing models |url=https://ai.meta.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/ |website=Meta}}</ref>
 
=== TrainingTool costuse ===
Tool use is a mechanism that enables LLMs to interact with external systems, applications, or data sources. It can allow for example to fetch real-time information from an API or to execute code. A program separate from the LLM watches the output stream of the LLM for a special tool-calling syntax. When these special tokens appear, the program calls the tool accordingly and feeds its output back into the LLM's input stream.<ref>{{Cite web |last=Dickson |first=Ben |date=2025-04-02 |title=The tool integration problem that's holding back enterprise AI (and how CoTools solves it) |url=https://venturebeat.com/ai/the-tool-integration-problem-thats-holding-back-enterprise-ai-and-how-cotools-solves-it/ |access-date=2025-05-26 |website=VentureBeat |language=en-US}}</ref>
Advances in software and hardware have reduced the cost substantially since 2020, such that in 2023 training of a 12-billion-parameter LLM computational cost is 72,300 [[Ampere (microarchitecture)|A100-GPU]]-hours, while in 2020 the cost of training a 1.5-billion-parameter LLM (which was two orders of magnitude smaller than the state of the art in 2020) was between $80 thousand and $1.6 million.<ref name="Wiggers">{{cite web |last=Wiggers |first=Kyle |date=28 April 2022 |title=The emerging types of language models and why they matter |url=https://techcrunch.com/2022/04/28/the-emerging-types-of-language-models-and-why-they-matter/ |work=TechCrunch}}</ref><ref name="xaytj">{{cite arXiv |eprint=2004.08900 |class=cs.CL |first1=Or |last1=Sharir |first2=Barak |last2=Peleg |title=The Cost of Training NLP Models: A Concise Overview |last3=Shoham |first3=Yoav |year=2020}}</ref><ref name="Pythia">{{cite arXiv |eprint=2304.01373 |class=cs.CL |first1=Stella |last1=Biderman |first2=Hailey |last2=Schoelkopf |title=Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling |date=April 2023 |last3=Anthony |first3=Quentin |last4=Bradley |first4=Herbie |last5=Khan |first5=Mohammad Aflah |last6=Purohit |first6=Shivanshu |last7=Prashanth |first7=USVSN Sai}}</ref> Since 2020, large sums were invested in increasingly large models. For example, training of the GPT-2 (i.e. a 1.5-billion-parameters model) in 2019 cost $50,000, while training of the PaLM (i.e. a 540-billion-parameters model) in 2022 cost $8 million.<ref name="0BrVG">{{cite news |last1=Vincent |first1=James |date=3 April 2023 |title=AI is entering an era of corporate control |work=The Verge |url=https://www.theverge.com/23667752/ai-progress-2023-report-stanford-corporate-control |access-date=19 June 2023}}</ref>
 
Early tool-using LLMs were fine-tuned on the use of specific tools. But fine-tuning LLMs for the ability to read [[API]] documentation and call API correctly has greatly expanded the range of tools accessible to an LLM.<ref name="lLrda">{{Cite journal |journal=Science |url=https://spj.science.org/doi/10.34133/icomputing.0063 |first1=Yaobo |last1=Liang |first2=Chenfei |last2=Wu |title=TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs |date=2023-03-01 |last3=Song |first3=Ting |last4=Wu |first4=Wenshan |last5=Xia |first5=Yan |last6=Liu |first6=Yu |last7=Ou |first7=Yang |last8=Lu |first8=Shuai |last9=Ji |first9=Lei |last10=Mao |first10=Shaoguang |last11=Wang |first11=Yun |last12=Shou |first12=Linjun |last13=Gong |first13=Ming |last14=Duan |first14=Nan}}</ref><ref name="4Xzrs">{{Cite journal |last1=Patil |first1=Shishir G. |last2=Zhang |first2=Tianjun |last3=Wang |first3=Xin |last4=Gonzalez |first4=Joseph E. |date=2023-05-01 |title=Gorilla: Large Language Model Connected with Massive APIs |journal=NeurIPS |url=https://proceedings.neurips.cc/paper_files/paper/2024/hash/e4c61f578ff07830f5c37378dd3ecb0d-Abstract-Conference.html}}</ref> Describing available tools in the system prompt can also make an LLM able to use tools. A system prompt instructing ChatGPT (GPT-4) to use multiple types of tools can be found online.<ref>{{Cite web|url=https://github.com/spdustin/ChatGPT-AutoExpert/blob/835baae768870aa9747663c24d8216820d24fd74/_system-prompts/all_tools.md|title=ChatGPT-AutoExpert/_system-prompts/all_tools.md at 835baae768870aa9747663c24d8216820d24fd74 · spdustin/ChatGPT-AutoExpert|website=GitHub}}</ref>
For Transformer-based LLM, training cost is much higher than inference cost. It costs 6 [[FLOPS|FLOPs]] per parameter to train on one token, whereas it costs 1 to 2 FLOPs per parameter to infer on one token.<ref name="kaplan-scaling">Section 2.1 and Table 1,
 
=== Agency ===
{{Cite arXiv |eprint=2001.08361 |class=cs.LG |first1=Jared |last1=Kaplan |first2=Sam |last2=McCandlish |title=Scaling Laws for Neural Language Models |last3=Henighan |first3=Tom |last4=Brown |first4=Tom B. |last5=Chess |first5=Benjamin |last6=Child |first6=Rewon |last7=Gray |first7=Scott |last8=Radford |first8=Alec |last9=Wu |first9=Jeffrey |last10=Amodei |first10=Dario |year=2020}}</ref>
{{Main article|AI agent}}
 
An LLM is typically not an [[autonomous agent]] by itself, as it lacks the ability to interact with dynamic environments, recall past behaviors, and plan future actions. But it can be transformed into an agent by adding supporting elements: the role (profile) and the surrounding environment of an agent can be additional inputs to the LLM, while memory can be integrated as a tool or provided as additional input. Instructions and input patterns are used to make the LLM plan actions and tool use is used to potentially carry out these actions.<ref>{{cite journal |last1=Wang |first1=Lei |last2=Ma |first2=Chen |last3=Feng |first3=Xueyang |last4=Zhang |first4=Zeyu |last5=Yang |first5=Hao |last6=Zhang |first6=Jingsen |last7=Chen |first7=Zhiyuan |last8=Tang |first8=Jiakai |last9=Chen |first9=Xu |last10=Lin |first10=Yankai |last11=Zhao |first11=Wayne Xin |last12=Wei |first12=Zhewei |last13=Wen |first13=Jirong |title=A survey on large language model based autonomous agents |journal=Frontiers of Computer Science |date=December 2024 |volume=18 |issue=6 |article-number=186345 |doi=10.1007/s11704-024-40231-1|arxiv=2308.11432}}</ref>
== Tool use ==
There are certain tasks that, in principle, cannot be solved by any LLM, at least not without the use of external tools or additional software. An example of such a task is responding to the user's input '354 * 139 = ', provided that the LLM has not already encountered a continuation of this calculation in its training corpus. In such cases, the LLM needs to resort to running program code that calculates the result, which can then be included in its response. Another example is 'What is the time now? It is ', where a separate program interpreter would need to execute a code to get system time on the computer, so LLM could include it in its reply.<ref name="PI1fW">{{Cite arXiv |eprint=2211.10435 |class=cs.CL |first1=Luyu |last1=Gao |first2=Aman |last2=Madaan |title=PAL: Program-aided Language Models |date=2022-11-01 |last3=Zhou |first3=Shuyan |last4=Alon |first4=Uri |last5=Liu |first5=Pengfei |last6=Yang |first6=Yiming |last7=Callan |first7=Jamie |last8=Neubig |first8=Graham}}</ref><ref name="J5OW5">{{Cite web |title=PAL: Program-aided Language Models |url=https://reasonwithpal.com/ |access-date=2023-06-12 |website=reasonwithpal.com}}</ref> This basic strategy can be sophisticated with multiple attempts of generated programs, and other sampling strategies.<ref name="gQxzq">{{Cite arXiv |eprint=2303.09014 |class=cs.CL |first1=Bhargavi |last1=Paranjape |first2=Scott |last2=Lundberg |title=ART: Automatic multi-step reasoning and tool-use for large language models |date=2023-03-01 |last3=Singh |first3=Sameer |last4=Hajishirzi |first4=Hannaneh |last5=Zettlemoyer |first5=Luke |last6=Tulio Ribeiro |first6=Marco}}</ref>
 
GenerallyThe [[ReAct pattern]], ina orderportmanteau toof get"Reason&nbsp;+&nbsp;Act", constructs an LLM[[Intelligent toagent|agent]] useout toolsof an LLM, oneusing mustthe finetuneLLM itas fora tool-useplanner. IfThe theLLM numberis ofprompted toolsto is"think out loud". finiteSpecifically, thenthe finetuninglanguage maymodel beis doneprompted justwith once.a Iftextual the numberdescription of toolsthe canenvironment, growa arbitrarilygoal, asa withlist onlineof [[API]]possible servicesactions, thenand a record of the LLMactions canand beobservations finetunedso tofar. beIt ablegenerates toone reador APImore documentationthoughts andbefore callgenerating an action, APIwhich correctlyis then executed in the environment.<ref name="lLrdaDmvNE">{{Cite arXiv |eprint=23032210.1643403629 |class=cs.AICL |first1=YaoboShunyu |last1=LiangYao |first2=ChenfeiJeffrey |last2=WuZhao |title=TaskMatrix.AIReAct: CompletingSynergizing TasksReasoning byand ConnectingActing Foundationin Language Models with Millions of APIs |date=20232022-0310-01 |last3=SongYu |first3=TingDian |last4=WuDu |first4=WenshanNan |last5=XiaShafran |first5=YanIzhak |last6=LiuNarasimhan |first6=YuKarthik |last7=OuCao |first7=Yang |last8=Lu |first8=Shuai |last9=Ji |first9=Lei |last10=Mao |first10=Shaoguang |last11=Wang |first11=Yun |last12=Shou |first12=Linjun |last13=Gong |first13=Ming |last14=Duan |first14=Nan}}</ref><ref name="4Xzrs">{{Cite arXiv |last1=Patil |first1=Shishir G. |last2=Zhang |first2=Tianjun |last3=Wang |first3=Xin |last4=Gonzalez |first4=Joseph E. |date=2023-05-01 |title=Gorilla: Large Language Model Connected with Massive APIs |class=cs.CL |eprint=2305.15334Yuan}}</ref>
 
In the DEPS ("Describe, Explain, Plan and Select") method, an LLM is first connected to the visual world via image descriptions. It is then prompted to produce plans for complex tasks and behaviors based on its pretrained knowledge and the environmental feedback it receives.<ref>{{Cite arXiv |eprint=2302.01560 |class=cs.AI |first1=Zihao |last1=Wang |first2=Shaofei |last2=Cai |title=Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents |date=2023-02-03 |last3=Liu |first3=Anji |last4=Ma |first4=Xiaojian |last5=Liang |first5=Yitao}}</ref>
A simpler form of tool use is ''Retrieval Augmented Generation'': augment an LLM with [[document retrieval]], sometimes using a [[vector database]]. Given a query, a document retriever is called to retrieve the most relevant (usually measured by first encoding the query and the documents into vectors, then finding the documents with vectors closest in Euclidean norm to the query vector). The LLM then generates an output based on both the query and the retrieved documents.<ref name="BUZBP">{{Cite journal |last1=Lewis |first1=Patrick |last2=Perez |first2=Ethan |last3=Piktus |first3=Aleksandra |last4=Petroni |first4=Fabio |last5=Karpukhin |first5=Vladimir |last6=Goyal |first6=Naman |last7=Küttler |first7=Heinrich |last8=Lewis |first8=Mike |last9=Yih |first9=Wen-tau |last10=Rocktäschel |first10=Tim |last11=Riedel |first11=Sebastian |last12=Kiela |first12=Douwe |date=2020 |title=Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks |url=https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=33 |pages=9459–9474 |arxiv=2005.11401}}</ref>
 
The Reflexion method<ref name="sbB2T">{{Cite journal |journal=NeurIPS |last1=Shinn |first1=Noah |last2=Cassano |first2=Federico |last3=Labash |first3=Beck |last4=Gopinath |first4=Ashwin |last5=Narasimhan |first5=Karthik |last6=Yao |first6=Shunyu |date=2023-03-01 |title=Reflexion: Language Agents with Verbal Reinforcement Learning |url=https://dl.acm.org/doi/10.5555/3666122.3667602}}</ref> constructs an agent that learns over multiple episodes. At the end of each episode, the LLM is given the record of the episode, and prompted to think up "lessons learned", which would help it perform better at a subsequent episode. These "lessons learned" are stored as a form of long-term memory and given to the agent in the subsequent episodes.<ref name="sbB2T" />
== Agency ==
An LLM is a language model, which is not an agent as it has no goal, but it can be used as a component of an [[intelligent agent]].<ref name="CFuti">{{Cite journal |last1=Huang |first1=Wenlong |last2=Abbeel |first2=Pieter |last3=Pathak |first3=Deepak |last4=Mordatch |first4=Igor |date=2022-06-28 |title=Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents |url=https://proceedings.mlr.press/v162/huang22a.html |journal=Proceedings of the 39th International Conference on Machine Learning |language=en |publisher=PMLR |pages=9118–9147|arxiv=2201.07207 }}</ref> Researchers have described several methods for such integrations.
 
The[[Monte ReActCarlo ("Reason&nbsp;+&nbsp;Act")tree method constructs an [[Intelligent agent|agentsearch]] outcan ofuse an LLM, usingas therollout LLMheuristic. asWhen a planner.programmatic Theworld LLMmodel is promptednot toavailable, "thinkan outLLM loud".can Specifically,also the language model isbe prompted with a textual description of the environment, ato goal,act aas listworld of possible actions, and a record of the actions and observations so far. It generates one or more thoughts before generating an action, which is then executed in the environmentmodel.<ref name="DmvNEltTer">{{Cite arXivjournal |eprintjournal=2210.03629EMNLP |classurl=cshttps://aclanthology.CLorg/2023.emnlp-main.507/ |first1=ShunyuShibo |last1=YaoHao |first2=JeffreyYi |last2=ZhaoGu |title=ReAct:Reasoning Synergizingwith ReasoningLanguage andModel Actingis inPlanning Languagewith ModelsWorld Model |date=20222023-1005-01 |last3=YuMa |first3=DianHaodi |last4=DuJiahua Hong |first4=NanJoshua |last5=ShafranWang |first5=IzhakZhen |last6=NarasimhanZhe Wang |first6=KarthikDaisy |last7=CaoHu |first7=Yuan}}</ref> The linguistic description of the environment given to the LLM planner can even be the LaTeX code of a paper describing the environment.<ref name="JS8Vd">{{Cite arXiv |eprint=2305.15486 |class=cs.AI |first1=Yue |last1=Wu |first2=Shrimai |last2=Prabhumoye |title=SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning |date=24 May 2023 |last3=Min |first3=So YeonZhiting}}</ref>
 
For open-ended exploration, an LLM can be used to score observations for their "interestingness", which can be used as a reward signal to guide a normal (non-LLM) reinforcement learning agent.<ref name="mBvD9">{{Cite arXiv |eprint=2306.01711 |class=cs.AI |first1=Jenny |last1=Zhang |first2=Joel |last2=Lehman |title=OMNI: Open-endedness via Models of human Notions of Interestingness |date=2 June 2023 |last3=Stanley |first3=Kenneth |last4=Clune |first4=Jeff}}</ref> Alternatively, it can [[Zone of proximal development|propose increasingly difficult tasks]] for [[curriculum learning]].<ref name=":0">{{Cite web |title=Voyager {{!}} An Open-Ended Embodied Agent with Large Language Models |url=https://voyager.minedojo.org/ |access-date=2023-06-09 |website=voyager.minedojo.org |archive-date=2023-06-08 |archive-url=https://web.archive.org/web/20230608225054/https://voyager.minedojo.org/ |url-status=live }}</ref> Instead of outputting individual actions, an LLM planner can also construct "skills", or [[Function (computer programming)|functions]] for complex action sequences. The skills can be stored and later invoked, allowing increasing levels of abstraction in planning.<ref name=":0" />
In the DEPS ("Describe, Explain, Plan and Select") method, an LLM is first connected to the visual world via image descriptions, then it is prompted to produce plans for complex tasks and behaviors based on its pretrained knowledge and environmental feedback it receives.<ref>{{Cite arXiv |eprint=2302.01560 |class=cs.AI |first1=Zihao |last1=Wang |first2=Shaofei |last2=Cai |title=Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents |date=2023-02-03 |last3=Liu |first3=Anji |last4=Ma |first4=Xiaojian |last5=Liang |first5=Yitao}}</ref>
 
TheMultiple Reflexionagents methodwith memory can interact socially.<ref name="sbB2TXuvjF">{{Cite arXivconference |conference=UIST |last1=ShinnPark |first1=NoahJoon Sung |last2=CassanoO'Brien |first2=FedericoJoseph C. |last3=LabashCai |first3=BeckCarrie J. |last4=GopinathRingel Morris |first4=AshwinMeredith |last5=NarasimhanLiang |first5=KarthikPercy |last6=YaoBernstein |first6=ShunyuMichael S. |date=2023-0304-01 |title=Reflexion: LanguageGenerative Agents: withInteractive VerbalSimulacra Reinforcementof LearningHuman Behavior |classurl=cshttps://dl.AI |eprint=2303acm.11366org/doi/10.1145/3586183.3606763}}</ref> constructs an agent that learns over multiple episodes. At the end of each episode, the LLM is given the record of the episode, and prompted to think up "lessons learned", which would help it perform better at a subsequent episode. These "lessons learned" are given to the agent in the subsequent episodes.
 
=== Reasoning ===
[[Monte Carlo tree search]] can use an LLM as rollout heuristic. When a programmatic world model is not available, an LLM can also be prompted with a description of the environment to act as world model.<ref name="ltTer">{{Cite arXiv |eprint=2305.14992 |class=cs.CL |first1=Shibo |last1=Hao |first2=Yi |last2=Gu |title=Reasoning with Language Model is Planning with World Model |date=2023-05-01 |last3=Ma |first3=Haodi |last4=Jiahua Hong |first4=Joshua |last5=Wang |first5=Zhen |last6=Zhe Wang |first6=Daisy |last7=Hu |first7=Zhiting}}</ref>
 
LLMs are conventionally trained to generate an output without generating intermediate steps. As a result, their performance tends to be subpar on complex questions requiring (at least in humans) intermediate steps of thought. Early research demonstrated that inserting intermediate “scratchpad” computations could improve performance on such tasks.<ref>{{Cite web |last=Nye |first=Maxwell |last2=Anders |first2=Andreassen Johan |last3=Gur-Ari |first3=Guy |last4=Michalewski |first4=Henryk |last5=Austin |first5=Jacob |last6=Bieber |first6=David |last7=Dohan |first7=David |last8=Lewkowycz |first8=Aitor |last9=Bosma |first9=Maarten |last10=Luan |first10=David |last11=Sutton |first11=Charles |last12=Odena |first12=Augustus |date=30 November 2021 |title=Show Your Work: Scratchpads for Intermediate Computation with Language Models |url=https://arxiv.org/abs/2112.00114 |website=arxiv |arxiv=2112.00114}}</ref> Later methods overcame this deficiency more systematically by breaking tasks into smaller steps for the LLM, either manually or automatically.
For open-ended exploration, an LLM can be used to score observations for their "interestingness", which can be used as a reward signal to guide a normal (non-LLM) reinforcement learning agent.<ref name="mBvD9">{{Cite arXiv |eprint=2306.01711 |class=cs.AI |first1=Jenny |last1=Zhang |first2=Joel |last2=Lehman |title=OMNI: Open-endedness via Models of human Notions of Interestingness |date=2 June 2023 |last3=Stanley |first3=Kenneth |last4=Clune |first4=Jeff}}</ref> Alternatively, it can [[Zone of proximal development|propose increasingly difficult tasks]] for [[curriculum learning]].<ref name=":0">{{Cite web |title=Voyager {{!}} An Open-Ended Embodied Agent with Large Language Models |url=https://voyager.minedojo.org/ |access-date=2023-06-09 |website=voyager.minedojo.org}}</ref> Instead of outputting individual actions, an LLM planner can also construct "skills", or [[Function (computer programming)|functions]] for complex action sequences. The skills can be stored and later invoked, allowing increasing levels of abstraction in planning.<ref name=":0" />
 
==== Chaining ====
LLM-powered agents can keep a long-term memory of its previous contexts, and the memory can be retrieved in the same way as Retrieval Augmented Generation. Multiple such agents can interact socially.<ref name="XuvjF">{{Cite arXiv |last1=Park |first1=Joon Sung |last2=O'Brien |first2=Joseph C. |last3=Cai |first3=Carrie J. |last4=Ringel Morris |first4=Meredith |last5=Liang |first5=Percy |last6=Bernstein |first6=Michael S. |date=2023-04-01 |title=Generative Agents: Interactive Simulacra of Human Behavior |class=cs.HC |eprint=2304.03442}}</ref>
The "prompt chaining" paradigm was published in 2021.<ref name="auto2">{{cite journal| journal=NeurIPS| last1 = Wei| first1 = Jason| last2 = Wang| first2 = Xuezhi| last3 = Schuurmans| first3 = Dale| last4 = Bosma| first4 = Maarten| last5 = Ichter| first5 = Brian| last6 = Xia| first6 = Fei| last7 = Chi| first7 = Ed| last8 = Le| first8 = Quoc| last9 = Zhou| first9 = Denny| title = Chain-of-Thought Prompting Elicits Reasoning in Large Language Models| date = 2023-01-10| url=https://dl.acm.org/doi/10.5555/3600270.3602070}}</ref> In this method, a user manually breaks a complex problem down into several steps. In each step, the LLM receives as input a prompt telling it what to do and some results from preceeding steps. The result from one step is then reused in a next step, until a final answer is reached. The ability of an LLM to follow instructions means that even non-experts can write a successful collection of step-wise prompts given a few rounds of trial and error.<ref>{{cite conference |conference=CHI Conference on Human Factors in Computing Systems| last1 = Wu| first1 = Tongshuang| last2 = Jiang| first2 = Ellen| last3 = Donsbach| first3 = Aaron| last4 = Gray| first4 = Jeff| last5 = Molina| first5 = Alejandra| last6 = Terry| first6 = Michael| last7 = Cai| first7 = Carrie J.| title = PromptChainer: Chaining Large Language Model Prompts through Visual Programming| date = 2022-03-13| url=https://dl.acm.org/doi/10.1145/3491101.3519729}}</ref><ref>{{cite web |date=23 April 2024 |title=What is prompt chaining? |url=https://www.ibm.com/think/topics/prompt-chaining |website=IBM |language=en}}</ref>
 
A 2022 paper demonstrated a separate technique called "[[Chain-of-thought prompting|Chain-of-Thought]] Prompting", which makes the LLM break the question down autonomously. An LLM is given some examples where the "assistant" verbally breaks down the thought process before arriving at an answer. The LLM mimics these examples and also tries to spend some time generating intermediate steps before providing the final answer. This additional step elicited by prompting improves the correctness of the LLM on relatively complex questions. On math word questions, a prompted model can exceed even fine-tuned GPT-3 with a verifier.<ref name="auto2" /><ref>{{cite web |date=23 April 2025 |title=What is chain of thought (CoT) prompting? |url=https://www.ibm.com/think/topics/chain-of-thoughts |website=IBM |language=en}}</ref> Chain-of-thought can also be elicited by simply adding an instruction like "Let's think step by step" to the prompt, in order to encourage the LLM to proceed methodically instead of trying to directly guess the answer.<ref>{{Cite web |last=Schreiner |first=Maximilian |date=2022-09-27 |title=Deeper insights into AI language models - chain of thought prompting as a success factor |url=https://the-decoder.com/deeper-insights-for-ai-language-models-chain-of-thought-prompting-as-a-key-factor/ |access-date=2025-06-30 |website=The Decoder |language=en-US}}</ref>
== Compression ==
Typically, LLM are trained with full- or half-precision floating point numbers (float32 and float16). One float16 has 16 bits, or 2 bytes, and so one billion parameters require 2 gigabytes. The largest models typically have 100 billion parameters, requiring 200 gigabytes to load, which places them outside the range of most consumer electronics.
 
Follow-up methods included ''self-consistency'' prompting, which samples multiple reasoning paths and selects the most common answer,<ref>{{Cite web |last=Wang |first=Xuezhi |last2=Wei |first2=Jason |last3=Schuurmans |first3=Dale |last4=Le |first4=Quoc |last5=Chi |first5=Ed |last6=Narang |first6=Sharan |last7=Chowdhery |first7=Aakanksha |last8=Zhou |first8=Denny |date=21 March 2022 |title=Self-Consistency Improves Chain of Thought Reasoning in Language Models |url=https://arxiv.org/abs/2203.11171?utm_source=chatgpt.com |website=arxiv |arxiv=2203.11171}}</ref> and ''least-to-most prompting'', which decomposes complex problems into simpler subproblems that the model solves sequentially.<ref>{{Cite web |last=Zhou |first=Denny |last2=Schärli |first2=Nathanael |last3=Hou |first3=Le |last4=Wei |first4=Jason |last5=Scales |first5=Nathan |last6=Wang |first6=Xuezhi |last7=Schuurmans |first7=Dale |last8=Cui |first8=Claire |last9=Bousquet |first9=Olivier |last10=Le |first10=Quoc |last11=Chi |first11=Ed |date=21 May 2022 |title=Least-to-Most Prompting Enables Complex Reasoning in Large Language Models |url=https://arxiv.org/abs/2205.10625 |archive-url= |website=arxiv |arxiv=2205.10625}}</ref>
''Post-training [[Quantization (signal processing)|quantization]]''<ref name="LS2Go">{{Cite journal |last1=Nagel |first1=Markus |last2=Amjad |first2=Rana Ali |last3=Baalen |first3=Mart Van |last4=Louizos |first4=Christos |last5=Blankevoort |first5=Tijmen |date=2020-11-21 |title=Up or Down? Adaptive Rounding for Post-Training Quantization |url=https://proceedings.mlr.press/v119/nagel20a.html |journal=Proceedings of the 37th International Conference on Machine Learning |language=en |publisher=PMLR |pages=7197–7206}}</ref> aims to decrease the space requirement by lowering precision of the parameters of a trained model, while preserving most of its performance.<ref name="cpzcK">{{Cite arXiv |eprint=1802.05668 |class=cs.NE |first1=Antonio |last1=Polino |first2=Razvan |last2=Pascanu |title=Model compression via distillation and quantization |date=2018-02-01 |last3=Alistarh |first3=Dan}}</ref><ref name="QVU95">{{Cite arXiv |eprint=2210.17323 |class=cs.LG |first1=Elias |last1=Frantar |first2=Saleh |last2=Ashkboos |title=GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers |date=2022-10-01 |last3=Hoefler |first3=Torsten |last4=Alistarh |first4=Dan}}</ref> The simplest form of quantization simply truncates all numbers to a given number of bits. It can be improved by using a different quantization [[Block cipher|codebook]] per layer. Further improvement can be done by applying [[Mixed-precision arithmetic|different precisions]] to different parameters, with higher precision for particularly important parameters ("outlier weights").<ref name="dU9Bu">{{Cite arXiv |eprint=2306.03078 |class=cs.CL |first1=Tim |last1=Dettmers |first2=Ruslan |last2=Svirschevski |title=SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression |date=2023-06-01 |last3=Egiazarian |first3=Vage |last4=Kuznedelev |first4=Denis |last5=Frantar |first5=Elias |last6=Ashkboos |first6=Saleh |last7=Borzunov |first7=Alexander |last8=Hoefler |first8=Torsten |last9=Alistarh |first9=Dan}}</ref>
 
Subsequent research also explored ''reflection'', where models iteratively critique and improve their own reasoning,<ref name="sbB2T" /> and ''tool-augmented reasoning'', where models make use of external systems such as retrievers or calculators to support problem-solving.
While quantized models are typically frozen, and only pre-quantized models are finetuned, quantized models can still be finetuned.<ref name="D0nFA">{{Cite arXiv |eprint=2305.14314 |class=cs.LG |first1=Tim |last1=Dettmers |first2=Artidoro |last2=Pagnoni |title=QLoRA: Efficient Finetuning of Quantized LLMs |date=2023-05-01 |last3=Holtzman |first3=Ari | author-link3=Ari Holtzman |last4=Zettlemoyer |first4=Luke}}</ref>
 
==== Model-native reasoning ====
== Multimodality ==
{{Main article|Reasoning language model|Reflection (artificial intelligence)}}
Multimodality means "having several modalities", and a [[Modality (human–computer interaction)|"modality"]] refers to a type of input or output, such as video, image, audio, text, [[proprioception]], etc.<ref>{{Cite journal |last1=Kiros |first1=Ryan |last2=Salakhutdinov |first2=Ruslan |last3=Zemel |first3=Rich |date=2014-06-18 |title=Multimodal Neural Language Models |url=https://proceedings.mlr.press/v32/kiros14.html |journal=Proceedings of the 31st International Conference on Machine Learning |language=en |publisher=PMLR |pages=595–603}}</ref> There have been many AI models trained specifically to ingest one modality and output another modality, such as [[AlexNet]] for image to label,<ref>{{Cite journal |last1=Krizhevsky |first1=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey E |date=2012 |title=ImageNet Classification with Deep Convolutional Neural Networks |url=https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=25}}</ref> [[visual question answering]] for image-text to text,<ref>{{Cite journal |last1=Antol |first1=Stanislaw |last2=Agrawal |first2=Aishwarya |last3=Lu |first3=Jiasen |last4=Mitchell |first4=Margaret |last5=Batra |first5=Dhruv |last6=Zitnick |first6=C. Lawrence |last7=Parikh |first7=Devi |date=2015 |title=VQA: Visual Question Answering |url=https://openaccess.thecvf.com/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html |journal=ICCV |pages=2425–2433}}</ref> and [[speech recognition]] for speech to text.
 
In late 2024 "reasoning models" were released. These were trained to spend more time generating step-by-step solutions before providing final answers, which was intended to be similar to human problem-solving processes.{{cn|date=August 2025}} OpenAI introduced this concept with their [[OpenAI o1|o1]] model in September 2024, followed by [[OpenAI o3|o3]] in April 2025. On the [[International Mathematical Olympiad|International Mathematics Olympiad]] qualifying exam problems, [[GPT-4o]] achieved 13% accuracy while o1 reached 83%.<ref name="nyt-o3">{{cite news |last=Metz |first=Cade |title=OpenAI Unveils New A.I. That Can 'Reason' Through Math and Science Problems |url=https://www.nytimes.com/2024/12/20/technology/openai-new-ai-math-science.html |work=The New York Times |date=2024-12-20 |access-date=2025-02-03}}</ref>
A common method to create multimodal models out of an LLM is to "tokenize" the output of a trained encoder. Concretely, one can construct a LLM that can understand images as follows: take a trained LLM, and take a trained image encoder <math>E</math>. Make a small multilayered perceptron <math>f</math>, so that for any image <math>y</math>, the post-processed vector <math>f(E(y))</math> has the same dimensions as an encoded token. That is an "image token". Then, one can interleave text tokens and image tokens. The compound model is then finetuned on an image-text dataset. This basic construction can be applied with more sophistication to improve the model. The image encoder may be frozen to improve stability.<ref>{{Cite arXiv |last1=Li |first1=Junnan |last2=Li |first2=Dongxu |last3=Savarese |first3=Silvio |last4=Hoi |first4=Steven |date=2023-01-01 |title=BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |class=cs.CV |eprint=2301.12597 }}</ref>
 
In January 2025, the Chinese company DeepSeek released DeepSeek-R1, a 671-billion-parameter open-weight reasoning model that achieved comparable performance to OpenAI's o1 while being significantly more cost-effective to operate. Unlike proprietary models from OpenAI, DeepSeek-R1's open-weight nature allowed researchers to study and build upon the algorithm, though its training data remained private.<ref name="nature-deepseek">{{cite news |last=Gibney |first=Elizabeth |title=China's cheap, open AI model DeepSeek thrills scientists |url=https://www.nature.com/articles/d41586-025-00229-6 |work=Nature |date=2025-01-30 |access-date=2025-02-03}}</ref>
Flamingo demonstrated the effectiveness of the tokenization method, finetuning a pair of pretrained language model and image encoder to perform better on visual question answering than models trained from scratch.<ref>{{Cite journal |last1=Alayrac |first1=Jean-Baptiste |last2=Donahue |first2=Jeff |last3=Luc |first3=Pauline |last4=Miech |first4=Antoine |last5=Barr |first5=Iain |last6=Hasson |first6=Yana |last7=Lenc |first7=Karel |last8=Mensch |first8=Arthur |last9=Millican |first9=Katherine |last10=Reynolds |first10=Malcolm |last11=Ring |first11=Roman |last12=Rutherford |first12=Eliza |last13=Cabi |first13=Serkan |last14=Han |first14=Tengda |last15=Gong |first15=Zhitao |date=2022-12-06 |title=Flamingo: a Visual Language Model for Few-Shot Learning |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/960a172bc7fbf0177ccccbb411a7d800-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=23716–23736|arxiv=2204.14198 }}</ref> [[Pathways Language Model|Google PaLM]] model was finetuned into a multimodal model PaLM-E using the tokenization method, and applied to robotic control.<ref>{{Cite arXiv |last1=Driess |first1=Danny |last2=Xia |first2=Fei |last3=Sajjadi |first3=Mehdi S. M. |last4=Lynch |first4=Corey |last5=Chowdhery |first5=Aakanksha |last6=Ichter |first6=Brian |last7=Wahid |first7=Ayzaan |last8=Tompson |first8=Jonathan |last9=Vuong |first9=Quan |last10=Yu |first10=Tianhe |last11=Huang |first11=Wenlong |last12=Chebotar |first12=Yevgen |last13=Sermanet |first13=Pierre |last14=Duckworth |first14=Daniel |last15=Levine |first15=Sergey |date=2023-03-01 |title=PaLM-E: An Embodied Multimodal Language Model |class=cs.LG |eprint=2303.03378 }}</ref> [[LLaMA]] models have also been turned multimodal using the tokenization method, to allow image inputs,<ref>{{Cite arXiv|last1=Liu |first1=Haotian |last2=Li |first2=Chunyuan |last3=Wu |first3=Qingyang |last4=Lee |first4=Yong Jae |date=2023-04-01 |title=Visual Instruction Tuning |class=cs.CV |eprint=2304.08485 }}</ref> and video inputs.<ref>{{Cite arXiv|last1=Zhang |first1=Hang |last2=Li |first2=Xin |last3=Bing |first3=Lidong |date=2023-06-01 |title=Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding |class=cs.CL |eprint=2306.02858 }}</ref>
 
These reasoning models typically require more computational resources per query compared to traditional LLMs, as they perform more extensive processing to work through problems step-by-step.<ref name="nyt-o3" />
[[GPT-4]] can use both text and image as inputs<ref>{{Cite arXiv |eprint=2303.08774 |class=cs.CL |last=OpenAI |title=GPT-4 Technical Report |date=2023-03-27}}</ref> (although the vision component wasn't released to the public until GPT-4V<ref>{{Cite web |last=OpenAI |date=September 25, 2023 |title=GPT-4V(ision) System Card |url=https://cdn.openai.com/papers/GPTV_System_Card.pdf}}</ref>); [[Google DeepMind]]'s [[Gemini (language model)|Gemini]] is also multimodal.<ref>{{Citation |last=Pichai |first=Sundar |title=Google Keynote (Google I/O '23) |url=https://www.youtube.com/watch?v=cNfINi5CNbY&t=931s |access-date=2023-07-02 |at=timestamp 15:31 |language=en}}</ref> <!-- update this in 2024 -->
 
=== Inference optimization ===
Inference optimization refers to techniques that improve LLM performance by applying additional computational resources during the inference process, rather than requiring model retraining. These approaches implement various state-of-the-art reasoning and decision-making strategies to enhance accuracy and capabilities.
 
'''OptiLLM''' is an [[OpenAI]] API-compatible optimizing inference proxy that implements multiple inference optimization techniques simultaneously.<ref>{{Cite web |author=Sharma, Asankhaya |title=OptiLLM: Optimizing inference proxy for LLMs |url=https://github.com/codelion/optillm |website=GitHub |access-date=2025-08-05}}</ref> The system acts as a transparent proxy that can work with any LLM provider, implementing techniques such as [[Monte Carlo tree search]] (MCTS), [[Mixture of experts|mixture of agents]] (MOA), best-of-N sampling, and chain-of-thought reflection. OptiLLM demonstrates that strategic application of computational resources at inference time can substantially improve model performance across diverse tasks, achieving significant improvements on benchmarks such as the AIME 2024 mathematics competition and various coding challenges.<ref>{{Cite web |title=OptiLLM: An OpenAI API Compatible Optimizing Inference Proxy which Implements Several State-of-the-Art Techniques that can Improve the Accuracy and Performance of LLMs |url=https://www.marktechpost.com/2024/11/18/optillm-an-openai-api-compatible-optimizing-inference-proxy-which-implements-several-state-of-the-art-techniques-that-can-improve-the-accuracy-and-performance-of-llms/ |website=MarkTechPost |date=2024-11-18 |access-date=2025-08-05}}</ref>
 
These inference optimization approaches represent a growing category of tools that enhance existing LLMs without requiring access to model weights or retraining, making advanced reasoning capabilities more accessible across different model providers and use cases.
 
== Forms of input and output ==
 
=== Multimodality ===
{{See also|Multimodal learning}}
Multimodality means having multiple modalities, where a "[[Modality (human–computer interaction)|modality]]" refers to a type of input or output, such as video, image, audio, text, [[proprioception]], etc.<ref>{{Cite journal |last1=Kiros |first1=Ryan |last2=Salakhutdinov |first2=Ruslan |last3=Zemel |first3=Rich |date=2014-06-18 |title=Multimodal Neural Language Models |url=https://proceedings.mlr.press/v32/kiros14.html |journal=Proceedings of the 31st International Conference on Machine Learning |publisher=PMLR |pages=595–603 |access-date=2023-07-02 |archive-date=2023-07-02 |archive-url=https://web.archive.org/web/20230702195952/https://proceedings.mlr.press/v32/kiros14.html |url-status=live }}</ref> For example, [[Pathways Language Model|Google PaLM]] model was fine-tuned into a multimodal model and applied to [[Robot control|robotic control]].<ref>{{Cite journal |journal=ICML |url=https://dl.acm.org/doi/10.5555/3618408.3618748 |first1=Danny |last1=Driess |first2=Fei |last2=Xia |title=PaLM-E: An Embodied Multimodal Language Model |date=2023-03-01 |last3=Sajjadi |first3=Mehdi S. M. |last4=Lynch |first4=Corey |last5=Chowdhery |first5=Aakanksha |last6=Ichter |first6=Brian |last7=Wahid |first7=Ayzaan |last8=Tompson |first8=Jonathan |last9=Vuong |first9=Quan |last10=Yu |first10=Tianhe |last11=Huang |first11=Wenlong |last12=Chebotar |first12=Yevgen |last13=Sermanet |first13=Pierre |last14=Duckworth |first14=Daniel |last15=Levine |first15=Sergey}}</ref> [[LLaMA]] models have also been turned multimodal using the tokenization method, to allow image inputs,<ref>{{Cite journal |journal=NeurIPS |first1=Haotian |last1=Liu |first2=Chunyuan |last2=Li |title=Visual Instruction Tuning |date=2023-04-01 |last3=Wu |first3=Qingyang |last4=Lee |first4=Yong Jae}}</ref> and video inputs.<ref>{{Cite journal |journal=EMNLP |first1=Hang |last1=Zhang |first2=Xin |last2=Li |title=Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding |date=2023-06-01 |last3=Bing |first3=Lidong}}</ref> [[GPT-4o]] can process and generate text, audio and images.<ref>{{Cite news |date=2024-05-13 |title=OpenAI says natively multimodal GPT-4o eats text, visuals, sound – and emits the same |url=https://www.theregister.com/2024/05/13/openai_gpt4o/ |work=The Register}}</ref> Such models are sometimes called large multimodal models (LMMs).<ref>{{Cite web |last=Zia |first=Dr Tehseen |date=2024-01-08 |title=Unveiling of Large Multimodal Models: Shaping the Landscape of Language Models in 2024 |url=https://www.unite.ai/unveiling-of-large-multimodal-models-shaping-the-landscape-of-language-models-in-2024/ |access-date=2025-05-30 |website=Unite.AI |language=en-US}}</ref>
 
A common method to create multimodal models out of an LLM is to "tokenize" the output of a trained encoder. Concretely, one can construct an LLM that can understand images as follows: take a trained LLM, and take a trained image encoder <math>E</math>. Make a small multilayered perceptron <math>f</math>, so that for any image <math>y</math>, the post-processed vector <math>f(E(y))</math> has the same dimensions as an encoded token. That is an "image token". Then, one can interleave text tokens and image tokens. The compound model is then fine-tuned on an image-text dataset. This basic construction can be applied with more sophistication to improve the model. The image encoder may be frozen to improve stability.<ref>{{Cite journal |journal=ICML |url=https://dl.acm.org/doi/10.5555/3618408.3619222 |last1=Li |first1=Junnan |last2=Li |first2=Dongxu |last3=Savarese |first3=Silvio |last4=Hoi |first4=Steven |date=2023-01-01 |title=BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models }}</ref> This type of method, where embeddings from multiple modalities are fused and the predictor is trained on the combined embeddings, is called early fusion.
 
Another method, called intermediate fusion, involves each modality being first processed independently to obtain modality-specific representations; then these intermediate representations are fused together.<ref>{{cite book |last1=Kumar |first1=Puneet |last2=Khokher |first2=Vedanti |last3=Gupta |first3=Yukti |last4=Raman |first4=Balasubramanian |title=Hybrid Fusion Based Approach for Multimodal Emotion Recognition with Insufficient Labeled Data |date=2021 |pages=314–318 |doi=10.1109/ICIP42928.2021.9506714 |isbn=978-1-6654-4115-5 }}</ref> In general, cross-attention is used for integrating information from different modalities. As an example, the model Flamingo uses cross-attention layers to inject visual information into its pre-trained language model.<ref>{{Cite journal |last1=Alayrac |first1=Jean-Baptiste |last2=Donahue |first2=Jeff |last3=Luc |first3=Pauline |last4=Miech |first4=Antoine |last5=Barr |first5=Iain |last6=Hasson |first6=Yana |last7=Lenc |first7=Karel |last8=Mensch |first8=Arthur |last9=Millican |first9=Katherine |last10=Reynolds |first10=Malcolm |last11=Ring |first11=Roman |last12=Rutherford |first12=Eliza |last13=Cabi |first13=Serkan |last14=Han |first14=Tengda |last15=Gong |first15=Zhitao |date=2022-12-06 |title=Flamingo: a Visual Language Model for Few-Shot Learning |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/960a172bc7fbf0177ccccbb411a7d800-Abstract-Conference.html |url-status=live |journal=Advances in Neural Information Processing Systems |volume=35 |pages=23716–23736 |arxiv=2204.14198 |archive-url=https://web.archive.org/web/20230702195951/https://proceedings.neurips.cc/paper_files/paper/2022/hash/960a172bc7fbf0177ccccbb411a7d800-Abstract-Conference.html |archive-date=2023-07-02 |access-date=2023-07-02}}</ref>
 
=== Non-natural languages ===
LLMs can handle programming languages similarly to how they handle natural languages. No special change in token handling is needed as code, like human language, is represented as plain text. LLMs can generate code based on problems or instructions written in [[natural language]]. They can also describe code in natural language or translate between programming languages. They were originally used as a [[code completion]] tool, but advances have moved them towards [[automatic programming]]. Services such as [[GitHub Copilot]] offer LLMs specifically trained, fine-tuned, or prompted for programming.<ref>{{Cite book |last1=Finnie-Ansley |first1=James |last2=Denny |first2=Paul |last3=Becker |first3=Brett A. |last4=Luxton-Reilly |first4=Andrew |last5=Prather |first5=James |title=Proceedings of the 24th Australasian Computing Education Conference |chapter=The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming |date=14 February 2022 |language=en-US |___location=New York, NY, USA |publisher=Association for Computing Machinery |pages=10–19 |doi=10.1145/3511861.3511863 |isbn=978-1-4503-9643-1 |s2cid=246681316 |doi-access=free}}</ref><ref>{{cite journal |last1=Husein |first1=Rasha Ahmad |last2=Aburajouh |first2=Hala |last3=Catal |first3=Cagatay |title=Large language models for code completion: A systematic literature review |journal=Computer Standards & Interfaces |date=March 2025 |volume=92 |article-number=103917 |doi=10.1016/j.csi.2024.103917}}</ref>
 
LLM architectures have also proven useful in analyzing biological sequences: protein, DNA, and RNA. With proteins they appear able to capture a degree of "grammar" from the amino-acid sequence, condensing a sequence into an [[embedding (machine learning)|embedding]]. On tasks such as structure prediction and mutational outcome prediction, a small model using an embedding as input can approach or exceed much larger models using [[multiple sequence alignment]]s (MSA) as input.<ref>{{cite journal |last1=Weissenow |first1=Konstantin |last2=Rost |first2=Burkhard |title=Are protein language models the new universal key? |journal=Current Opinion in Structural Biology |date=April 2025 |volume=91 |article-number=102997 |doi=10.1016/j.sbi.2025.102997|pmid=39921962 }}</ref> ESMFold, [[Meta Platforms]]' embedding-based method for protein structure prediction, runs an order of magnitude faster than [[AlphaFold2]] thanks to the removal of an MSA requirement and a lower parameter count due to the use of embeddings.<ref>{{cite journal |last1=Lin |first1=Zeming |last2=Akin |first2=Halil |last3=Rao |first3=Roshan |last4=Hie |first4=Brian |last5=Zhu |first5=Zhongkai |last6=Lu |first6=Wenting |last7=Smetanin |first7=Nikita |last8=Verkuil |first8=Robert |last9=Kabeli |first9=Ori |last10=Shmueli |first10=Yaniv |last11=dos Santos Costa |first11=Allan |last12=Fazel-Zarandi |first12=Maryam |last13=Sercu |first13=Tom |last14=Candido |first14=Salvatore |last15=Rives |first15=Alexander |title=Evolutionary-scale prediction of atomic-level protein structure with a language model |journal=Science |date=17 March 2023 |volume=379 |issue=6637 |pages=1123–1130 |doi=10.1126/science.ade2574|biorxiv=10.1101/2022.07.20.500902|doi-access=free |pmid=36927031 |bibcode=2023Sci...379.1123L }}</ref> Meta hosts ESM Atlas, a database of 772 million structures of [[metagenomic]] proteins predicted using ESMFold.<ref>{{cite web |title=ESM Metagenomic Atlas {{!}} Meta AI |url=https://esmatlas.com/about |website=esmatlas.com |language=en}}</ref> An LLM can also design proteins unlike any seen in nature.<ref>{{cite journal |last1=Hayes |first1=Thomas |last2=Rao |first2=Roshan |last3=Akin |first3=Halil |last4=Sofroniew |first4=Nicholas J. |last5=Oktay |first5=Deniz |last6=Lin |first6=Zeming |last7=Verkuil |first7=Robert |last8=Tran |first8=Vincent Q. |last9=Deaton |first9=Jonathan |last10=Wiggert |first10=Marius |last11=Badkundri |first11=Rohil |last12=Shafkat |first12=Irhum |last13=Gong |first13=Jun |last14=Derry |first14=Alexander |last15=Molina |first15=Raul S. |last16=Thomas |first16=Neil |last17=Khan |first17=Yousuf A. |last18=Mishra |first18=Chetan |last19=Kim |first19=Carolyn |last20=Bartie |first20=Liam J. |last21=Nemeth |first21=Matthew |last22=Hsu |first22=Patrick D. |last23=Sercu |first23=Tom |last24=Candido |first24=Salvatore |last25=Rives |first25=Alexander |title=Simulating 500 million years of evolution with a language model |journal=Science |date=21 February 2025 |volume=387 |issue=6736 |pages=850–858 |doi=10.1126/science.ads0018|pmid=39818825 |bibcode=2025Sci...387..850H }}</ref> Nucleic acid models have proven useful in detecting [[regulatory sequence]]s,<ref>{{cite journal |last1=Fishman |first1=Veniamin |last2=Kuratov |first2=Yuri |last3=Shmelev |first3=Aleksei |last4=Petrov |first4=Maxim |last5=Penzar |first5=Dmitry |last6=Shepelin |first6=Denis |last7=Chekanov |first7=Nikolay |last8=Kardymon |first8=Olga |last9=Burtsev |first9=Mikhail |title=GENA-LM: a family of open-source foundational DNA language models for long sequences |journal=Nucleic Acids Research |date=11 January 2025 |volume=53 |issue=2 |pages=gkae1310 |doi=10.1093/nar/gkae1310|pmid=39817513 |pmc=11734698 }}</ref> sequence classification, RNA-RNA interaction prediction, and RNA structure prediction.<ref>{{cite journal |last1=Wang |first1=Ning |last2=Bian |first2=Jiang |last3=Li |first3=Yuchen |last4=Li |first4=Xuhong |last5=Mumtaz |first5=Shahid |last6=Kong |first6=Linghe |last7=Xiong |first7=Haoyi |title=Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning |journal=Nature Machine Intelligence |date=13 May 2024 |volume=6 |issue=5 |pages=548–557 |doi=10.1038/s42256-024-00836-4|doi-access=free }}</ref>
 
== Properties ==
=== Scaling laws ===
{{Main|Neural scaling law}}
The following four hyper-parameters characterize a LLM:
* cost of (pre-)training (<small><math>C</math></small>),
* size of the [[artificial neural network]] itself, such as number of parameters <small><math>N</math></small> (i.e. amount of neurons in its layers, amount of weights between them and biases),
* size of its (pre-)training dataset (i.e. number of tokens in corpus, <small><math>D</math></small>),
* performance after (pre-)training.
 
The performance of an LLM after pretraining largely depends on the:
They are related by simple [[Empirical statistical laws|statistical laws]], called "scaling laws". One particular scaling law ("[[Chinchilla AI|Chinchilla scaling]]") for LLM autoregressively trained for one epoch, with a [[Log-log plot|log-log]] [[learning rate]] schedule, states that:<ref name="fJta3">{{Cite arXiv |eprint=2203.15556 |class=cs.CL |first1=Jordan |last1=Hoffmann |first2=Sebastian |last2=Borgeaud |title=Training Compute-Optimal Large Language Models |date=2022-03-29 |last3=Mensch |first3=Arthur |last4=Buchatskaya |first4=Elena |last5=Cai |first5=Trevor |last6=Rutherford |first6=Eliza |last7=Casas |first7=Diego de Las |last8=Hendricks |first8=Lisa Anne |last9=Welbl |first9=Johannes |last10=Clark |first10=Aidan |last11=Hennigan |first11=Tom |last12=Noland |first12=Eric |last13=Millican |first13=Katie |last14=Driessche |first14=George van den |last15=Damoc |first15=Bogdan}}</ref>
* cost of pretraining <small><math>C</math></small> (the total amount of compute used),
* size of the [[artificial neural network]] itself, such as number of parameters <small><math>N</math></small> (i.e. amount of neurons in its layers, amount of weights between them and biases),
* size of its pretraining dataset (i.e. number of tokens in corpus, <small><math>D</math></small>).
 
"Scaling laws" are [[empirical statistical laws]] that predict LLM performance based on such factors. One particular scaling law ("[[Chinchilla AI|Chinchilla scaling]]") for LLM autoregressively trained for one epoch, with a [[Log-log plot|log-log]] [[learning rate]] schedule, states that:<ref name="fJta3">{{Cite journal |journal=NeurIPS |url=https://dl.acm.org/doi/10.5555/3600270.3602446 |first1=Jordan |last1=Hoffmann |first2=Sebastian |last2=Borgeaud |title=Training Compute-Optimal Large Language Models |date=2022-03-29 |last3=Mensch |first3=Arthur |last4=Buchatskaya |first4=Elena |last5=Cai |first5=Trevor |last6=Rutherford |first6=Eliza |last7=Casas |first7=Diego de Las |last8=Hendricks |first8=Lisa Anne |last9=Welbl |first9=Johannes |last10=Clark |first10=Aidan |last11=Hennigan |first11=Tom |last12=Noland |first12=Eric |last13=Millican |first13=Katie |last14=Driessche |first14=George van den |last15=Damoc |first15=Bogdan}}</ref>
<math display="block">\begin{cases}
C = C_0 ND \\[6pt]
L = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + L_0
\end{cases}</math> where the variables are
 
* <small><math>C</math></small> is the cost of training the model, in [[FLOPS|FLOPs]].
* <small><math>N</math></small> is the number of parameters in the model.
Line 144 ⟶ 213:
 
and the statistical hyper-parameters are
* <small><math> C_0 = 6</math></small>, meaning that it costs 6 FLOPs per parameter to train on one token. Note that training cost is much higher than inference cost, where it costs 1 to 2 FLOPs per parameter to infer on one token.
 
* <small><math> C_0 = 6</math></small>, meaning that it costs 6 FLOPs per parameter to train on one token. Note that training cost is much higher than inference cost, where it costs 1 to 2 FLOPs per parameter to infer on one token.<ref name="kaplan-scaling" />
* <small><math>\alpha = 0.34, \beta = 0.28, A = 406.4, B = 410.7, L_0 = 1.69</math></small>
 
=== Emergent abilities ===
{{anchor|Emergent abilities}}[[File:LLM emergent benchmarks.png|thumb|At point(s) referred to as [[Neural scaling law#Broken Neural Scaling Laws (BNSL)Law|breaks]],<ref name="IYm4Q" /> the lines change their slopes, appearing on a loglinear-log plot as a series of linear segments connected by arcs.]]
When one subtracts out from the y-axis the best performance that can be achieved even with infinite scalingPerformance of the x-axis quantity, largebigger models' performance, measured on various tasks, seemswhen toplotted beon a linearlog-log extrapolationscale, ofappears otheras (smaller-sizeda andlinear medium-sized)extrapolation models'of performance onachieved aby log-logsmaller plotmodels. However, sometimesthis thelinearity line'smay slopebe transitionspunctuated from one slope to another at point(s) referred to asby "[[Neural scaling law#Broken Neural Scaling Laws (BNSL)Law|break(s)]]"<ref name="IYm4Q">{{cite arXiv |eprint=2210.14891 |class=cs.LG |first1=Ethan |last1=Caballero |first2=Kshitij |last2=Gupta |title=Broken Neural Scaling Laws |last3=Rish |first3=Irina |last4=Krueger |first4=David |year=2022}}</ref> in downstreamthe scaling lawslaw, appearingwhere asthe a seriesslope of linearthe segmentsline connectedchanges byabruptly, arcs;and it seems thatwhere larger models acquire "emergent abilities" at this point(s).<ref name="emergentpaper">{{cite journal |last1=Wei |first1=Jason |last2=Tay |first2=Yi |last3=Bommasani |first3=Rishi |last4=Raffel |first4=Colin |last5=Zoph |first5=Barret |last6=Borgeaud |first6=Sebastian |last7=Yogatama |first7=Dani |last8=Bosma |first8=Maarten |last9=Zhou |first9=Denny |last10=Metzler |first10=Donald |last11=Chi |first11=Ed H. |last12=Hashimoto |first12=Tatsunori |last13=Vinyals |first13=Oriol |last14=Liang |first14=Percy |last15=Dean |first15=Jeff |date=31 August 2022 |title=Emergent Abilities of Large Language Models |url=https://openreview.net/forum?id=yzkSU5zdwD |journal=Transactions on Machine Learning Research |language=en |issn=2835-8856 |last16=Fedus |first16=William |access-date=19 March 2023 |archive-date=22 March 2023 |archive-url=https://web.archive.org/web/20230322210052/https://openreview.net/forum?id=yzkSU5zdwD |url-status=live }}</ref><ref name="JM6s1">{{Cite web |title=137 emergent abilities of large language models |url=https://www.jasonwei.net/blog/emergence |access-date=2023-06-24 |website=Jason Wei |language=en-US}}</ref> TheseThey abilitiesarise arefrom discoveredthe rathercomplex thaninteraction programmed-inof orthe designed,model's incomponents someand casesare onlynot afterexplicitly theprogrammed LLM has beenor publicly deployeddesigned.<ref name="Bowman">{{cite arXiv |eprint=2304.00612 |class=cs.CL |first=Samuel R. |last=Bowman |title=Eight Things to Know about Large Language Models |year=2023}}</ref>
 
TheOne mostof intriguing amongthe emergent abilities is [[in-context learning]] from example demonstrations.<ref name="Hahn_20230314">{{cite arXiv |eprint=2303.07971 |class=cs.LG |first1=Michael |last1=Hahn |first2=Navin |last2=Goyal |title=A Theory of Emergent In-Context Learning as Implicit Structure Induction |date=2023-03-14}}</ref> In-context learning is involved in tasks, such as:
* reported arithmetics
* reported arithmetics, decoding the [[International Phonetic Alphabet]], unscrambling a word's letters, disambiguate word in context,<ref name="emergentpaper" /><ref name="57FEA">{{Cite journal |last1=Pilehvar |first1=Mohammad Taher |last2=Camacho-Collados |first2=Jose |title=Proceedings of the 2019 Conference of the North |date=June 2019 |url=https://aclanthology.org/N19-1128 |journal=Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) |___location=Minneapolis, Minnesota |publisher=Association for Computational Linguistics |pages=1267–1273 |doi=10.18653/v1/N19-1128|s2cid=102353817 }}</ref><ref name="TEIkA">{{Cite web |title=WiC: The Word-in-Context Dataset |url=https://pilehvar.github.io/wic/ |access-date=2023-06-27 |website=pilehvar.github.io}}</ref> converting spatial words, [[cardinal direction]]s (for example, replying "northeast" upon [0, 0, 1; 0, 0, 0; 0, 0, 0]), color terms represented in text.<ref name="zgy1i">{{Cite journal |last1=Patel |first1=Roma |last2=Pavlick |first2=Ellie |date=2021-10-06 |title=Mapping Language Models to Grounded Conceptual Spaces |url=https://openreview.net/forum?id=gJcEM8sxHK |journal=ICLR |language=en}}</ref>
* decoding the [[International Phonetic Alphabet]]
* [[chain-of-thought prompting]]: Model outputs are improved by chain-of-thought prompting only when model size exceeds 62B. Smaller models perform better when prompted to answer immediately, without chain of thought.<ref name="Imb98">''[https://www.notion.so/A-Closer-Look-at-Large-Language-Models-Emergent-Abilities-493876b55df5479d80686f68a1abd72f A Closer Look at Large Language Models Emergent Abilities]'' (Yao Fu, Nov 20, 2022)</ref>
* unscrambling a word's letters
* identifying offensive content in paragraphs of [[Hinglish]] (a combination of Hindi and English), and generating a similar English equivalent of [[Kiswahili]] proverbs.<ref name="CeQVF">{{Cite web |last=Ornes |first=Stephen |date=March 16, 2023 |title=The Unpredictable Abilities Emerging From Large AI Models |url=https://www.quantamagazine.org/the-unpredictable-abilities-emerging-from-large-ai-models-20230316/ |website=Quanta Magazine}}</ref>
* disambiguating word-in-context datasets<ref name="emergentpaper" /><ref name="57FEA">{{Cite journal |last1=Pilehvar |first1=Mohammad Taher |last2=Camacho-Collados |first2=Jose |title=Proceedings of the 2019 Conference of the North |date=June 2019 |url=https://aclanthology.org/N19-1128 |journal=Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) |___location=Minneapolis, Minnesota |publisher=Association for Computational Linguistics |pages=1267–1273 |doi=10.18653/v1/N19-1128 |s2cid=102353817 |access-date=2023-06-27 |archive-date=2023-06-27 |archive-url=https://web.archive.org/web/20230627202732/https://aclanthology.org/N19-1128/ |url-status=live |url-access=subscription |doi-access=free }}</ref><ref name="TEIkA">{{Cite web |title=WiC: The Word-in-Context Dataset |url=https://pilehvar.github.io/wic/ |access-date=2023-06-27 |website=pilehvar.github.io |archive-date=2023-06-27 |archive-url=https://web.archive.org/web/20230627202725/https://pilehvar.github.io/wic/ |url-status=live }}</ref>
* converting spatial words
* [[cardinal direction]]s (for example, replying "northeast" in response to a 3x3 grid of 8 zeros and a 1 in the top-right), color terms represented in text.<ref name="zgy1i">{{Cite journal |last1=Patel |first1=Roma |last2=Pavlick |first2=Ellie |date=2021-10-06 |title=Mapping Language Models to Grounded Conceptual Spaces |url=https://openreview.net/forum?id=gJcEM8sxHK |journal=ICLR |access-date=2023-06-27 |archive-date=2023-06-24 |archive-url=https://web.archive.org/web/20230624191940/https://openreview.net/forum?id=gJcEM8sxHK |url-status=live }}</ref>
* [[chain-of-thought prompting]]: In a 2022 research paper, chain-of-thought prompting only improved the performance for models that had at least 62B parameters. Smaller models perform better when prompted to answer immediately, without chain of thought.<ref name="Imb98">''[https://www.notion.so/A-Closer-Look-at-Large-Language-Models-Emergent-Abilities-493876b55df5479d80686f68a1abd72f A Closer Look at Large Language Models Emergent Abilities] {{Webarchive|url=https://web.archive.org/web/20230624012329/https://www.notion.so/A-Closer-Look-at-Large-Language-Models-Emergent-Abilities-493876b55df5479d80686f68a1abd72f |date=2023-06-24 }}'' (Yao Fu, Nov 20, 2022)</ref>
* identifying offensive content in paragraphs of [[Hinglish]] (a combination of Hindi and English), and generating a similar English equivalent of [[Kiswahili]] proverbs.<ref name="CeQVF">{{Cite web |last=Ornes |first=Stephen |date=March 16, 2023 |title=The Unpredictable Abilities Emerging From Large AI Models |url=https://www.quantamagazine.org/the-unpredictable-abilities-emerging-from-large-ai-models-20230316/ |website=Quanta Magazine |access-date=March 16, 2023 |archive-date=March 16, 2023 |archive-url=https://web.archive.org/web/20230316203438/https://www.quantamagazine.org/the-unpredictable-abilities-emerging-from-large-ai-models-20230316/ |url-status=live }}</ref>
 
Schaeffer ''et. al.'' argue that the emergent abilities are not unpredictably acquired, but predictably acquired according to a [[Neural scaling law|smooth scaling law]]. The authors considered a toy statistical model of an LLM solving multiple-choice questions, and showed that this statistical model, modified to account for other types of tasks, applies to these tasks as well.<ref name="C775b">{{cite arXiv |eprint=2304.15004 |class=cs.AI |first1=Rylan |last1=Schaeffer |first2=Brando |last2=Miranda |title=Are Emergent Abilities of Large Language Models a Mirage? |date=2023-04-01 |last3=Koyejo |first3=Sanmi}}</ref>
Line 167 ⟶ 240:
 
== Interpretation ==
Large language models byare themselvestypically areregarded as "[[Black box|black boxesbox]]"es, and it is not clear how they can perform linguistic tasks. ThereSimilarly, areit severalis methodsunclear forif understandingor how LLMLLMs workshould be viewed as models of the human brain and/or human mind.<ref>{{cite journal |last1=Blank |first1=Idan A. |title=What are large language models supposed to model? |journal=Trends in Cognitive Sciences |date=November 2023 |volume=27 |issue=11 |pages=987–989 |doi=10.1016/j.tics.2023.08.006|pmid=37659920 |doi-access=free }}</ref>
 
=== Mechanistic interpretability ===
Mechanistic interpretability aims to [[Reverse engineering|reverse-engineer]] LLM by discovering symbolic algorithms that approximate the inference performed by LLM. One example is Othello-GPT, where a small Transformer is trained to predict legal [[reversi|Othello]] moves. It is found that there is a linear representation of Othello board, and modifying the representation changes the predicted legal Othello moves in the correct way.<ref name="IZSIr">{{Cite arXiv |eprint=2210.13382 |class=cs.LG |first1=Kenneth |last1=Li |first2=Aspen K. |last2=Hopkins |title=Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task |date=2022-10-01 |last3=Bau |first3=David |last4=Viégas |first4=Fernanda |last5=Pfister |first5=Hanspeter |last6=Wattenberg |first6=Martin}}</ref><ref name="RLik9">{{Cite web |date=2023-01-21 |title=Large Language Model: world models or surface statistics? |url=https://thegradient.pub/othello/ |access-date=2023-06-12 |website=The Gradient |language=en}}</ref> In another example, a small Transformer is trained on [[Karel (programming language)|Karel programs]]. Similar to the Othello-GPT example, there is a linear representation of Karel program semantics, and modifying the representation changes output in the correct way. The model also generates correct programs that are on average shorter than those in the training set.<ref name="Hln1l">{{Cite arXiv |eprint=2305.11169 |class=cs.LG |first1=Charles |last1=Jin |first2=Martin |last2=Rinard |title=Evidence of Meaning in Language Models Trained on Programs |date=2023-05-01}}</ref>
[[Mechanistic interpretability]] aims to [[Reverse engineering|reverse-engineer]] LLMs by discovering symbolic algorithms that approximate the inference performed by an LLM. Mechanistic interpretability research has been conducted at organizations like Anthropic and OpenAI, although understanding the inner workings of LLMs remains difficult.<ref>{{Cite web |date=2023-12-12 |title=Mapping the Mind of a Large Language Model |url=https://www.anthropic.com/research/mapping-mind-language-model |access-date=2025-08-24 |website=Anthropic |language=en}}</ref><ref>{{Cite web |date=2023-09-26 |title=Extracting Concepts from GPT-4 |url=https://openai.com/index/extracting-concepts-from-gpt-4/ |access-date=2025-08-24 |website=OpenAI |language=en}}</ref>
 
InFor another exampleinstance, the authors trained small transformers on [[Modular arithmetic|modular arithmetic addition]]. The resulting models were reverse-engineered, and it turned out they used [[discrete Fourier transform]].<ref name="oYGlo">{{Cite arXiv |eprint=2301.05217 |class=cs.LG |first1=Neel |last1=Nanda |first2=Lawrence |last2=Chan |title=Progress measures for grokking via mechanistic interpretability |date=2023-01-01 |last3=Lieberum |first3=Tom |last4=Smith |first4=Jess |last5=Steinhardt |first5=Jacob}}</ref> The training of the model also highlighted a phenomenon called [[Grokking (machine learning)|grokking]], in which the model initially memorizes all the possible results in the training set ([[overfitting]]), and later suddenly learns to actually perform the calculation.<ref>{{Cite web |last=Ananthaswamy |first=Anil |date=2024-04-12 |title=How Do Machines 'Grok' Data? |url=https://www.quantamagazine.org/how-do-machines-grok-data-20240412/ |access-date=2025-06-30 |website=Quanta Magazine |language=en}}</ref>
 
Some techniques have been developed to enhance the transparency and interpretability of LLMs. Transcoders, which are more interpretable than transformers, have been utilized to develop "replacement models". In one such study involving the mechanistic interpretation of writing a rhyming poem by an LLM, it was shown that although they are believed to simply predict the next token, they can, in fact, plan ahead.<ref>{{Cite web |title=On the Biology of a Large Language Model |url=https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-poems%7Ctitle=On |access-date=2025-06-30 |website=Transformer Circuits |language=en}}</ref> By integrating such techniques, researchers and practitioners can gain deeper insights into the operations of LLMs, fostering trust and facilitating the responsible deployment of these powerful models.
 
=== Understanding and intelligence ===
{{See also|Philosophy of artificial intelligence|Artificial consciousness}}
NLP researchers were evenly split when asked, in a 2022 survey, whether (untuned) LLMs "could (ever) understand natural language in some nontrivial sense".<ref name="debate understanding">{{cite journal |last1=Mitchell |first1=Melanie |last2=Krakauer |first2=David C. |date=28 March 2023 |title=The debate over understanding in AI's large language models |journal=Proceedings of the National Academy of Sciences |volume=120 |issue=13 |pages=e2215907120 |arxiv=2210.13966 |bibcode=2023PNAS..12015907M |doi=10.1073/pnas.2215907120 |pmc=10068812 |pmid=36943882 }}</ref> Proponents of "LLM understanding" believe that some LLM abilities, such as mathematical reasoning, imply an ability to "understand" certain concepts. A Microsoft team argued in 2023 that GPT-4 "can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more" and that GPT-4 "could reasonably be viewed as an early (yet still incomplete) version of an [[artificial general intelligence]] system": "Can one reasonably say that a system that passes exams for software engineering candidates is not ''really'' intelligent?"<ref name="O8Upd">{{cite news |last1=Metz |first1=Cade |date=16 May 2023 |title=Microsoft Says New A.I. Shows Signs of Human Reasoning |work=The New York Times |url=https://www.nytimes.com/2023/05/16/technology/microsoft-ai-human-reasoning.html}}</ref><ref name="microsoft sparks">{{cite arXiv |eprint=2303.12712 |class=cs.CL |first1=Sébastien |last1=Bubeck |first2=Varun |last2=Chandrasekaran |title=Sparks of Artificial General Intelligence: Early experiments with GPT-4 |date=2023 |last3=Eldan |first3=Ronen |last4=Gehrke |first4=Johannes |last5=Horvitz |first5=Eric |last6=Kamar |first6=Ece |last7=Lee |first7=Peter |last8=Lee |first8=Yin Tat |last9=Li |first9=Yuanzhi |last10=Lundberg |first10=Scott |last11=Nori |first11=Harsha |last12=Palangi |first12=Hamid |last13=Ribeiro |first13=Marco Tulio |last14=Zhang |first14=Yi}}</ref> Some researchers characterize LLMs as "alien intelligence".<ref name="rEEmH">{{cite news |date=2023 |title=ChatGPT is more like an 'alien intelligence' than a human brain, says futurist |language=en |work=ZDNET |url=https://www.zdnet.com/article/chatgpt-is-more-like-an-alien-intelligence-than-a-human-brain-says-futurist/ |access-date=12 June 2023}}</ref><ref name="new yorker kind of mind">{{cite magazine |last1=Newport |first1=Cal |date=13 April 2023 |title=What Kind of Mind Does ChatGPT Have? |url=https://www.newyorker.com/science/annals-of-artificial-intelligence/what-kind-of-mind-does-chatgpt-have |magazine=The New Yorker |access-date=12 June 2023}}</ref> For example, Conjecture CEO Connor Leahy considers untuned LLMs to be like inscrutable alien "[[Shoggoth]]s", and believes that RLHF tuning creates a "smiling facade" obscuring the inner workings of the LLM: "If you don't push it too far, the smiley face stays on. But then you give it [an unexpected] prompt, and suddenly you see this massive underbelly of insanity, of weird thought processes and clearly non-human understanding."<ref name="rAFIZ">{{cite news |last1=Roose |first1=Kevin |date=30 May 2023 |title=Why an Octopus-like Creature Has Come to Symbolize the State of A.I. |work=The New York Times |url=https://www.nytimes.com/2023/05/30/technology/shoggoth-meme-ai.html |access-date=12 June 2023}}</ref><ref name="4luKE">{{cite news |date=13 April 2023 |title=The A to Z of Artificial Intelligence |language=en |work=Time Magazine |url=https://time.com/6271657/a-to-z-of-artificial-intelligence/ |access-date=12 June 2023}}</ref>
 
NLP researchers were evenly split when asked, in a 2022 survey, whether (untuned) LLMs "could (ever) understand natural language in some nontrivial sense".<ref name="debate understanding">{{cite journal |last1=Mitchell |first1=Melanie |last2=Krakauer |first2=David C. |date=28 March 2023 |title=The debate over understanding in AI's large language models |journal=Proceedings of the National Academy of Sciences |volume=120 |issue=13 |pages=e2215907120 |arxiv=2210.13966 |bibcode=2023PNAS..12015907M |doi=10.1073/pnas.2215907120 |doi-access=free |pmc=10068812 |pmid=36943882 }}</ref> Proponents of "LLM understanding" believe that some LLM abilities, such as mathematical reasoning, imply an ability to [[natural language understanding|"understand"]] certain concepts. A Microsoft team argued in 2023 that GPT-4 "can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more" and that GPT-4 "could reasonably be viewed as an early (yet still incomplete) version of an [[artificial general intelligence]] system": "Can one reasonably say that a system that passes exams for software engineering candidates is not ''really'' intelligent?"<ref name="O8Upd">{{cite news |last1=Metz |first1=Cade |date=16 May 2023 |title=Microsoft Says New A.I. Shows Signs of Human Reasoning |work=The New York Times |url=https://www.nytimes.com/2023/05/16/technology/microsoft-ai-human-reasoning.html}}</ref><ref name="microsoft sparks">{{cite arXiv |eprint=2303.12712 |class=cs.CL |first1=Sébastien |last1=Bubeck |first2=Varun |last2=Chandrasekaran |title=Sparks of Artificial General Intelligence: Early experiments with GPT-4 |date=2023 |last3=Eldan |first3=Ronen |last4=Gehrke |first4=Johannes |last5=Horvitz |first5=Eric |last6=Kamar |first6=Ece |last7=Lee |first7=Peter |last8=Lee |first8=Yin Tat |last9=Li |first9=Yuanzhi |last10=Lundberg |first10=Scott |last11=Nori |first11=Harsha |last12=Palangi |first12=Hamid |last13=Ribeiro |first13=Marco Tulio |last14=Zhang |first14=Yi}}</ref> [[Ilya Sutskever]] argues that predicting the next word sometimes involves reasoning and deep insights, for example if the LLM has to predict the name of the criminal in an unknown detective novel after processing the entire story leading up to the revelation.<ref>{{Cite news |date=October 17, 2024 |title=Anthropic CEO Dario Amodei pens a smart look at our AI future |url=https://www.fastcompany.com/91211163/anthropic-ceo-dario-amodei-pens-a-smart-look-at-our-ai-future |work=Fast Company}}</ref> Some researchers characterize LLMs as "alien intelligence".<ref name="rEEmH">{{cite news |date=2023 |title=ChatGPT is more like an 'alien intelligence' than a human brain, says futurist |work=ZDNET |url=https://www.zdnet.com/article/chatgpt-is-more-like-an-alien-intelligence-than-a-human-brain-says-futurist/ |access-date=12 June 2023 |archive-date=12 June 2023 |archive-url=https://web.archive.org/web/20230612065937/https://www.zdnet.com/article/chatgpt-is-more-like-an-alien-intelligence-than-a-human-brain-says-futurist/ |url-status=live }}</ref><ref name="new yorker kind of mind">{{cite magazine |last1=Newport |first1=Cal |date=13 April 2023 |title=What Kind of Mind Does ChatGPT Have? |url=https://www.newyorker.com/science/annals-of-artificial-intelligence/what-kind-of-mind-does-chatgpt-have |magazine=The New Yorker |access-date=12 June 2023 |archive-date=12 June 2023 |archive-url=https://web.archive.org/web/20230612071443/https://www.newyorker.com/science/annals-of-artificial-intelligence/what-kind-of-mind-does-chatgpt-have |url-status=live }}</ref> For example, Conjecture CEO [[Connor Leahy]] considers untuned LLMs to be like inscrutable alien "[[Shoggoth]]s", and believes that RLHF tuning creates a "smiling facade" obscuring the inner workings of the LLM: "If you don't push it too far, the smiley face stays on. But then you give it [an unexpected] prompt, and suddenly you see this massive underbelly of insanity, of weird thought processes and clearly non-human understanding."<ref name="rAFIZ">{{cite news |last1=Roose |first1=Kevin |date=30 May 2023 |title=Why an Octopus-like Creature Has Come to Symbolize the State of A.I. |work=The New York Times |url=https://www.nytimes.com/2023/05/30/technology/shoggoth-meme-ai.html |access-date=12 June 2023 |archive-date=30 May 2023 |archive-url=https://web.archive.org/web/20230530193814/https://www.nytimes.com/2023/05/30/technology/shoggoth-meme-ai.html |url-status=live }}</ref><ref name="4luKE">{{cite news |date=13 April 2023 |title=The A to Z of Artificial Intelligence |work=Time Magazine |url=https://time.com/6271657/a-to-z-of-artificial-intelligence/ |access-date=12 June 2023 |archive-date=16 June 2023 |archive-url=https://web.archive.org/web/20230616123839/https://time.com/6271657/a-to-z-of-artificial-intelligence/ |url-status=live }}</ref>
In contrast, some proponents of the "LLMs lack understanding" school believe that existing LLMs are "simply remixing and recombining existing writing",<ref name="new yorker kind of mind" /> or point to the deficits existing LLMs continue to have in prediction skills, reasoning skills, agency, and explainability.<ref name="debate understanding" /> For example, GPT-4 has natural deficits in planning and in real-time learning.<ref name="microsoft sparks" /> Generative LLMs have been observed to confidently assert claims of fact which do not seem to be [[Justification (epistemology)|justified]] by their [[training data]], a phenomenon which has been termed "[[Hallucination (artificial intelligence)|hallucination]]".<ref name="hallucination-survey">{{cite journal |last1=Ji |first1=Ziwei |last2=Lee |first2=Nayeon |last3=Frieske |first3=Rita |last4=Yu |first4=Tiezheng |last5=Su |first5=Dan |last6=Xu |first6=Yan |last7=Ishii |first7=Etsuko |last8=Bang |first8=Yejin |last9=Dai |first9=Wenliang |last10=Madotto |first10=Andrea |last11=Fung |first11=Pascale |date=November 2022 |title=Survey of Hallucination in Natural Language Generation |url=https://dl.acm.org/doi/pdf/10.1145/3571730 |format=pdf |journal=ACM Computing Surveys |publisher=[[Association for Computing Machinery]] |volume=55 |issue=12 |pages=1–38 |arxiv=2202.03629 |doi=10.1145/3571730 |s2cid=246652372 |access-date=15 January 2023}}</ref> Specifically, hallucinations in the context of LLMs correspond to the generation of text or responses that seem syntactically sound, fluent, and natural but are factually incorrect, nonsensical, or unfaithful to the provided source input.<ref>{{cite journal |last1=Varshney |first1=Neeraj |title=A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation |date=2023 |arxiv=2307.03987 }}</ref> Neuroscientist [[Terrence Sejnowski]] has argued that "The diverging opinions of experts on the intelligence of LLMs suggests that our old ideas based on natural intelligence are inadequate".<ref name="debate understanding" />
 
In contrast, some skeptics of LLM understanding believe that existing LLMs are "simply remixing and recombining existing writing",<ref name="new yorker kind of mind" /> a phenomenon known as [[stochastic parrot]], or they point to the deficits existing LLMs continue to have in prediction skills, reasoning skills, agency, and explainability.<ref name="debate understanding" /> For example, GPT-4 has natural deficits in planning and in real-time learning.<ref name="microsoft sparks" /> Generative LLMs have been observed to confidently assert claims of fact which do not seem to be [[Justification (epistemology)|justified]] by their [[training data]], a phenomenon which has been termed "[[Hallucination (artificial intelligence)|hallucination]]".<ref name="hallucination-survey">{{cite journal |last1=Ji |first1=Ziwei |last2=Lee |first2=Nayeon |last3=Frieske |first3=Rita |last4=Yu |first4=Tiezheng |last5=Su |first5=Dan |last6=Xu |first6=Yan |last7=Ishii |first7=Etsuko |last8=Bang |first8=Yejin |last9=Dai |first9=Wenliang |last10=Madotto |first10=Andrea |last11=Fung |first11=Pascale |date=November 2022 |title=Survey of Hallucination in Natural Language Generation |url=https://dl.acm.org/doi/pdf/10.1145/3571730 |format=pdf |journal=ACM Computing Surveys |publisher=[[Association for Computing Machinery]] |volume=55 |issue=12 |pages=1–38 |arxiv=2202.03629 |doi=10.1145/3571730 |s2cid=246652372 |access-date=15 January 2023 |archive-date=26 March 2023 |archive-url=https://web.archive.org/web/20230326145635/https://dl.acm.org/doi/pdf/10.1145/3571730 |url-status=live }}</ref> Specifically, hallucinations in the context of LLMs correspond to the generation of text or responses that seem syntactically sound, fluent, and natural but are factually incorrect, nonsensical, or unfaithful to the provided source input.<ref>{{cite arXiv |last1=Varshney |first1=Neeraj |last2=Yao |first2=Wenlin |last3=Zhang |first3=Hongming |last4=Chen |first4=Jianshu |last5=Yu |first5=Dong |title=A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation |date=2023 |class=cs.CL |eprint=2307.03987 }}</ref> Neuroscientist [[Terrence Sejnowski]] has argued that "The diverging opinions of experts on the intelligence of LLMs suggests that our old ideas based on natural intelligence are inadequate".<ref name="debate understanding" />
The matter of LLM's exhibiting intelligence or understanding has two main aspects - the first is how to model thought and language in a computer system, and the second is how to enable the computer system to generate human like language. <ref name="debate understanding"/> These aspects of language as a model of [[cognition]] have been developed in the field of [[cognitive linguistics]]. American linguist [[George Lakoff]] presented Neural Theory of Language (NTL)<ref>{{Cite book|title=Philosophy in the Flesh: The Embodied Mind and Its Challenge to Western Philosophy; Appendix: The Neural Theory of Language Paradigm |last= Lakoff |first= George |publisher= New York Basic Books|year=1999|isbn=978-0-465-05674-3|pages=569–583}}</ref> as a [[Cognitive linguistics#Computational approaches|computational basis]] for using language as a model of learning tasks and understanding. [https://www.icsi.berkeley.edu/icsi/projects/ai/ntl The NTL Model] outlines how specific neural structures of the human brain shape the nature of thought and language and in turn what are the computational properties of such neural systems that can be applied to model thought and language in a computer system. After a framework for modeling language in a computer systems was established, the focus shifted to establishing frameworks for computer systems to generate language with acceptable grammar. In his 2014 book titled ''[[The Language Myth|The Language Myth: Why Language Is Not An Instinct]]'', British cognitive linguist and digital communication technologist [[Vyvyan Evans]] mapped out the role of [[probabilistic context-free grammar]] (PCFG) in enabling [[Natural language processing#Cognition |NLP to model cognitive patterns]] and generate human like language.<ref>{{Cite book|title=The Language Myth |last= Evans |first= Vyvyan. |publisher= Cambridge University Press |year=2014|isbn=978-1-107-04396-1}}</ref> <ref>{{Cite book|title=Active Inference: The Free Energy Principle in Mind, Brain, and Behavior; Chapter 4 The Generative Models of Active Inference |last= Friston |first= Karl J. |publisher= The MIT Press|year=2022|isbn=978-0-262-36997-8}}</ref>
 
Efforts to reduce or compensate for hallucinations have employed [[automated reasoning]], RAG ([[retrieval-augmented generation]]), [[fine-tuning (deep learning)|fine-tuning]], and other methods.<ref name="Lin-2025-02-05-WSJ">{{cite journal |last=Lin |first=Belle |title=Why Amazon is Betting on 'Automated Reasoning' to Reduce AI's Hallucinations: The tech giant says an obscure field that combines AI and math can mitigate—but not completely eliminate—AI's propensity to provide wrong answers |journal=Wall Street Journal |date=2025-02-05 |url=https://www.wsj.com/articles/why-amazon-is-betting-on-automated-reasoning-to-reduce-ais-hallucinations-b838849e |issn=0099-9660}}</ref>
 
The matter of LLM's exhibiting intelligence or understanding has two main aspects – the first is how to model thought and language in a computer system, and the second is how to enable the computer system to generate human like language.<ref name="debate understanding" /> These aspects of language as a model of [[cognition]] have been developed in the field of [[cognitive linguistics]]. American linguist [[George Lakoff]] presented Neural Theory of Language (NTL)<ref>{{Cite book|title=Philosophy in the Flesh: The Embodied Mind and Its Challenge to Western Philosophy; Appendix: The Neural Theory of Language Paradigm |last= Lakoff |first= George |publisher= New York Basic Books|year=1999|isbn=978-0-465-05674-3|pages=569–583}}</ref> as a [[Cognitive linguistics#Computational approaches|computational basis]] for using language as a model of learning tasks and understanding. [https://www.icsi.berkeley.edu/icsi/projects/ai/ntl The NTL Model] outlines how specific neural structures of the human brain shape the nature of thought and language and in turn what are the computational properties of such neural systems that can be applied to model thought and language in a computer system. After a framework for modeling language in a computer systems was established, the focus shifted to establishing frameworks for computer systems to generate language with acceptable grammar. In his 2014 book titled ''[[The Language Myth|The Language Myth: Why Language Is Not An Instinct]]'', British cognitive linguist and digital communication technologist [[Vyvyan Evans]] mapped out the role of [[probabilistic context-free grammar]] (PCFG) in enabling [[Natural language processing#Cognition|NLP to model cognitive patterns]] and generate human like language.<ref>{{Cite book|title=The Language Myth |last= Evans |first= Vyvyan. |publisher= Cambridge University Press |year=2014|isbn=978-1-107-04396-1}}</ref><ref>{{Cite book|title=Active Inference: The Free Energy Principle in Mind, Brain, and Behavior; Chapter 4 The Generative Models of Active Inference |last= Friston |first= Karl J. |publisher= The MIT Press|year=2022|isbn=978-0-262-36997-8}}</ref>
 
== Evaluation ==
 
=== Perplexity ===
The mostcanonical commonlymeasure usedof measurethe performance of aany language model's performance is its [[perplexity]] on a given text corpus. Perplexity is a measure ofmeasures how well a model is able to predictpredicts the contents of a dataset; the higher the likelihood the model assigns to the dataset, the lower the perplexity. MathematicallyIn mathematical terms, perplexity is defined as the exponential of the average negative log likelihood per token:.

<math display="block">\log(\text{Perplexity}) = -\frac{1}{N} \sum_{i=1}^N \log(\Pr(\text{token}_i \mid \text{context for token}_i))</math>here

Here, <math>N</math> is the number of tokens in the text corpus, and "context for token <math>i</math>" depends on the specific type of LLM used. If the LLM is autoregressive, then "context for token <math>i</math>" is the segment of text appearing before token <math>i</math>. If the LLM is masked, then "context for token <math>i</math>" is the segment of text surrounding token <math>i</math>.
 
Because language models may [[overfit]] to training data, models are usually evaluated by their perplexity on a [[test set]].<ref name="jm" /> This evaluation is potentially problematic for larger models which, as they are trained on increasingly large corpora of text, are increasingly likely to inadvertently include portions of any given test set.<ref name="few-shot-learners">{{cite journal |last1=Brown |first1=Tom B. |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |date=Dec 2020 |editor1-last=Larochelle |editor1-first=H. |editor2-last=Ranzato |editor2-first=M. |editor3-last=Hadsell |editor3-first=R. |editor4-last=Balcan |editor4-first=M.F. |editor5-last=Lin |editor5-first=H. |title=Language Models are Few-Shot Learners |url=https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf |url-status=live |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=33 |pages=1877–1901 |archive-url=https://web.archive.org/web/20231117204007/https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf |archive-date=2023-11-17 |access-date=2023-03-14 |last25=Chess |last20=Hesse |first20=Christopher |last21=Chen |first21=Mark |last22=Sigler |first22=Eric |last23=Litwin |first23=Mateusz |last24=Gray |first24=Scott |first26=Jack |first25=Benjamin |last26=Clark |last19=Winter |last27=Berner |first27=Christopher |last28=McCandlish |first28=Sam |last29=Radford |first29=Alec |last30=Sutskever |first30=Ilya |last31=Amodei |first31=Dario |first19=Clemens |first18=Jeffrey |last18=Wu |last16=Ramesh |first16=Aditya |last17=Ziegler |first17=Daniel M.}}</ref>
Because language models may [[overfit]] to their training data, models are usually evaluated by their perplexity on a [[test set]] of unseen data.<ref name="jm" /> This presents particular challenges for the evaluation of large language models. As they are trained on increasingly large corpora of text largely scraped from the web, it becomes increasingly likely that models' training data inadvertently includes portions of any given test set.<ref name="few-shot-learners" />
 
====BPW, BPC, and BPTMeasures====
In [[information theory]], the concept of [[Entropy (information theory)|entropy]] is intricately linked to perplexity, a relationship notably established by [[Claude Shannon]].<ref name="Huyen">{{cite web |url=https://thegradient.pub/understanding-evaluation-metrics-for-language-models/ |title=Understanding Evaluation Metrics for Language Modeling |last=Huyen |first=Chip |date=October 18, 2019 |publisherwebsite=The Gradient |access-date=January 14, 2024}}</ref> This relationship is mathematically expressed as <math>\text{Entropy} = \log_2(\text{Perplexity})</math>.
 
Entropy, in this context, is commonly quantified in terms of bits per word (BPW) or bits per character (BPC), which hinges on whether the language model utilizes word-based or character-based tokenization.
Line 196 ⟶ 280:
In the evaluation and comparison of language models, [[cross-entropy]] is generally the preferred metric over entropy. The underlying principle is that a lower BPW is indicative of a model's enhanced capability for compression. This, in turn, reflects the model's proficiency in making accurate predictions.
 
Due to their ability to accurately predict the next token, LLMs are highly capable in [[lossless compression]]. A 2023 study by DeepMind showed that the model [[Chinchilla (language model)|Chinchilla]], despite being trained primarily on text, was able to compress [[ImageNet]] to 43% of its size, beating PNG with 58%.<ref>{{Cite web |last=Edwards |first=Benj |date=2023-09-28 |title=AI language models can exceed PNG and FLAC in lossless compression, says study |url=https://arstechnica.com/information-technology/2023/09/ai-language-models-can-exceed-png-and-flac-in-lossless-compression-says-study/ |access-date=2025-05-29 |website=Ars Technica |language=en}}</ref>
=== Task-specific datasets and benchmarks ===
A large number of testing datasets and benchmarks have also been developed to evaluate the capabilities of language models on more specific downstream tasks. Tests may be designed to evaluate a variety of capabilities, including general knowledge, commonsense reasoning, and mathematical problem-solving.
 
=== Benchmarks ===
One broad category of evaluation dataset is question answering datasets, consisting of pairs of questions and correct answers, for example, ("Have the San Jose Sharks won the Stanley Cup?", "No").<ref name="boolq">{{cite arXiv |eprint=1905.10044 |class=cs.CL |first1=Christopher |last1=Clark |first2=Kenton |last2=Lee |title=BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions |last3=Chang |first3=Ming-Wei |last4=Kwiatkowski |first4=Tom |last5=Collins |first5=Michael |last6=Toutanova |first6=Kristina |year=2019}}</ref> A question answering task is considered "open book" if the model's prompt includes text from which the expected answer can be derived (for example, the previous question could be adjoined with some text which includes the sentence "The Sharks have advanced to the Stanley Cup finals once, losing to the Pittsburgh Penguins in 2016."<ref name="boolq" />). Otherwise, the task is considered "closed book", and the model must draw on knowledge retained during training.<ref name="survey">{{cite arXiv |eprint=2303.18223 |class=cs.CL |author1=Wayne Xin Zhao |first2=Kun |last2=Zhou |title=A Survey of Large Language Models |last3=Li |first3=Junyi |last4=Tang |first4=Tianyi |last5=Wang |first5=Xiaolei |last6=Hou |first6=Yupeng |last7=Min |first7=Yingqian |last8=Zhang |first8=Beichen |last9=Zhang |first9=Junjie |last10=Dong |first10=Zican |last11=Du |first11=Yifan |last12=Yang |first12=Chen |last13=Chen |first13=Yushuo |last14=Chen |first14=Zhipeng |last15=Jiang |first15=Jinhao |last16=Ren |first16=Ruiyang |last17=Li |first17=Yifan |last18=Tang |first18=Xinyu |last19=Liu |first19=Zikang |last20=Liu |first20=Peiyu |last21=Nie |first21=Jian-Yun |last22=Wen |first22=Ji-Rong |year=2023}}</ref> Some examples of commonly used question answering datasets include TruthfulQA, Web Questions, TriviaQA, and SQuAD.<ref name="survey" />
[[Language model benchmark|Benchmarks]] are used to evaluate LLM performance on specific tasks. Tests evaluate capabilities such as general knowledge, bias, [[commonsense reasoning]], question answering, and mathematical problem-solving. Composite benchmarks examine multiple capabilities. Results are often sensitive to the prompting method.<ref>{{cite web|title=openai/simple-evals |date=2024-05-28 |url=https://github.com/openai/simple-evals |access-date=2024-05-28 |publisher=OpenAI}}</ref><ref>{{cite web|title=openai/evals |date=2024-05-28 |url=https://github.com/openai/evals |access-date=2024-05-28 |archive-url=https://web.archive.org/web/20240508225708/https://github.com/openai/evals |archive-date=2024-05-08 |url-status=live |publisher=OpenAI}}</ref>
 
A question answering benchmark is termed "open book" if the model's prompt includes text from which the expected answer can be derived (for example, the previous question could be combined with text that includes the sentence "The Sharks have advanced to the Stanley Cup finals once, losing to the Pittsburgh Penguins in 2016."<ref name="boolq" />). Otherwise, the task is considered "closed book", and the model must draw solely on its training.<ref name="survey">{{cite arXiv |eprint=2303.18223 |class=cs.CL |author1=Wayne Xin Zhao |first2=Kun |last2=Zhou |title=A Survey of Large Language Models |last3=Li |first3=Junyi |last4=Tang |first4=Tianyi |last5=Wang |first5=Xiaolei |last6=Hou |first6=Yupeng |last7=Min |first7=Yingqian |last8=Zhang |first8=Beichen |last9=Zhang |first9=Junjie |last10=Dong |first10=Zican |last11=Du |first11=Yifan |last12=Yang |first12=Chen |last13=Chen |first13=Yushuo |last14=Chen |first14=Zhipeng |last15=Jiang |first15=Jinhao |last16=Ren |first16=Ruiyang |last17=Li |first17=Yifan |last18=Tang |first18=Xinyu |last19=Liu |first19=Zikang |last20=Liu |first20=Peiyu |last21=Nie |first21=Jian-Yun |last22=Wen |first22=Ji-Rong |year=2023}}</ref> Examples include GLUE, SuperGLUE, [[MMLU]], BIG-bench, HELM, and [[HLE (Humanity's Last Exam)]].<ref name="Huyen" /><ref name="survey" />
Evaluation datasets may also take the form of text completion, having the model select the most likely word or sentence to complete a prompt, for example: "Alice was friends with Bob. Alice went to visit her friend, ____".<ref name="few-shot-learners" />
 
LLM bias may be assessed through benchmarks such as CrowS-Pairs (Crowdsourced Stereotype Pairs),<ref>{{cite conference |author=Nangia, Nikita and Vania, Clara and Bhalerao, Rasika and Bowman, Samuel R. |date=November 2020 |title=CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models |url=https://aclanthology.org/2020.emnlp-main.154/ |publisher=Association for Computational Linguistics |pages=1953–1967 |arxiv=2010.00133 |doi=10.18653/v1/2020.emnlp-main.154 |editor=Webber, Bonnie and Cohn, Trevor and He, Yulan and Liu, Yang |book-title=Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)}}</ref> Stereo Set,<ref>{{cite conference |author=Nadeem, Moin and Bethke, Anna and Reddy, Siva |date=August 2021 |title=StereoSet: Measuring stereotypical bias in pretrained language models |url=https://aclanthology.org/2021.acl-long.416/ |publisher=Association for Computational Linguistics |pages=5356–5371 |arxiv=2004.09456 |doi=10.18653/v1/2021.acl-long.416 |editor=Zong, Chengqing and Xia, Fei and Li, Wenjie and Navigli, Roberto |book-title=Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)}}</ref> and Parity Benchmark.<ref>{{cite journal |author=Simpson, Shmona and Nukpezah, Jonathan and Kie Brooks and Pandya, Raaghav |date=17 December 2024 |title=Parity benchmark for measuring bias in LLMs |journal=AI and Ethics |volume=5 |issue=3 |pages=3087–3101 |publisher=Springer |doi=10.1007/s43681-024-00613-4 |doi-access=free}}</ref>
Some composite benchmarks have also been developed which combine a diversity of different evaluation datasets and tasks. Examples include GLUE, SuperGLUE, MMLU, BIG-bench, and HELM.<ref name="Huyen">{{cite web |last=Huyen |first=Chip |date=18 October 2019 |title=Evaluation Metrics for Language Modeling |url=https://thegradient.pub/understanding-evaluation-metrics-for-language-models/ |work=The Gradient}}</ref><ref name="survey" />
 
Fact-checking and misinformation detection benchmarks are available. A 2023 study compared the fact-checking accuracy of LLMs including ChatGPT 3.5 and 4.0, Bard, and Bing AI against independent fact-checkers such as PolitiFact and Snopes. The results demonstrated moderate proficiency, with GPT-4 achieving the highest accuracy at 71%, lagging behind human fact-checkers.<ref>{{Cite book |last=Caramancion |first=Kevin Matthe |title=2023 IEEE Future Networks World Forum (FNWF) |date=2023-11-13 |publisher=IEEE |isbn=979-8-3503-2458-7 |pages=1–6 |chapter=News Verifiers Showdown: A Comparative Performance Evaluation of ChatGPT 3.5, ChatGPT 4.0, Bing AI, and Bard in News Fact-Checking |doi=10.1109/FNWF58287.2023.10520446 |arxiv=2306.17176}}</ref>
It was previously standard to report results on a heldout portion of an evaluation dataset after doing supervised fine-tuning on the remainder. It is now more common to evaluate a pre-trained model directly through prompting techniques, though researchers vary in the details of how they formulate prompts for particular tasks, particularly with respect to how many examples of solved tasks are adjoined to the prompt (i.e. the value of ''n'' in ''n''-shot prompting).
 
An earlier standard tested using a portion of the evaluation dataset. It became more common to evaluate a pre-trained model directly through prompting techniques. Researchers vary in how they formulate prompts for particular tasks, particularly with respect to the number of correct examples attached to the prompt (i.e. the value of ''n'' in ''n''-shot prompting).
==== Adversarially constructed evaluations ====
Because of the rapid pace of improvement of large language models, evaluation benchmarks have suffered from short lifespans, with state of the art models quickly "saturating" existing benchmarks, exceeding the performance of human annotators, leading to efforts to replace or augment the benchmark with more challenging tasks.<ref name="bigbench">{{cite arXiv |eprint=2206.04615 |class=cs.CL |first1=Aarohi |last1=Srivastava |first2=Abhinav |last2=Rastogi |title=Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models |last3=Rao |first3=Abhishek |author4=Abu Awal Md Shoeb |last5=Abid |first5=Abubakar |last6=Fisch |first6=Adam |last7=Brown |first7=Adam R. |last8=Santoro |first8=Adam |last9=Gupta |first9=Aditya |last10=Garriga-Alonso |first10=Adrià |last11=Kluska |first11=Agnieszka |last12=Lewkowycz |first12=Aitor |last13=Agarwal |first13=Akshat |last14=Power |first14=Alethea |last15=Ray |first15=Alex |last16=Warstadt |first16=Alex |last17=Kocurek |first17=Alexander W. |last18=Safaya |first18=Ali |last19=Tazarv |first19=Ali |last20=Xiang |first20=Alice |last21=Parrish |first21=Alicia |last22=Nie |first22=Allen |last23=Hussain |first23=Aman |last24=Askell |first24=Amanda |last25=Dsouza |first25=Amanda |last26=Slone |first26=Ambrose |last27=Rahane |first27=Ameet |last28=Iyer |first28=Anantharaman S. |last29=Andreassen |first29=Anders |last30=Madotto |first30=Andrea |year=2022 |display-authors=1}}</ref> In addition, there are cases of "shortcut learning" wherein AIs sometimes "cheat" on multiple-choice tests by using statistical correlations in superficial test question wording in order to guess the correct responses, without necessarily understanding the actual question being asked.<ref name="debate understanding" />
 
==== Datasets ====
Some datasets have been constructed adversarially, focusing on particular problems on which extant language models seem to have unusually poor performance compared to humans. One example is the TruthfulQA dataset, a question answering dataset consisting of 817 questions which language models are susceptible to answering incorrectly by mimicking falsehoods to which they were repeatedly exposed during training. For example, an LLM may answer "No" to the question "Can you teach an old dog new tricks?" because of its exposure to the English idiom ''[[wikt:you can't teach an old dog new tricks|you can't teach an old dog new tricks]]'', even though this is not literally true.<ref name="truthfulqa">{{cite arXiv |eprint=2109.07958 |class=cs.CL |first1=Stephanie |last1=Lin |first2=Jacob |last2=Hilton |title=TruthfulQA: Measuring How Models Mimic Human Falsehoods |last3=Evans |first3=Owain |year=2021}}</ref>
Typical datasets consist of pairs of questions and correct answers, for example, ("Have the San Jose Sharks won the Stanley Cup?", "No").<ref name="boolq">{{cite arXiv |eprint=1905.10044 |class=cs.CL |first1=Christopher |last1=Clark |first2=Kenton |last2=Lee |title=BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions |last3=Chang |first3=Ming-Wei |last4=Kwiatkowski |first4=Tom |last5=Collins |first5=Michael |last6=Toutanova |first6=Kristina |year=2019}}</ref> Some examples of commonly used question answering datasets include TruthfulQA, Web Questions, TriviaQA, and SQuAD.<ref name="survey" />
 
Evaluation datasets may also take the form of text completion, having the model select the most likely word or sentence to complete a prompt, for example: "Alice was friends with Bob. Alice went to visit her friend, ____".<ref name="few-shot-learners" />
Another example of an adversarial evaluation dataset is Swag and its successor, HellaSwag, collections of problems in which one of multiple options must be selected to complete a text passage. The incorrect completions were generated by sampling from a language model and filtering with a set of classifiers. The resulting problems are trivial for humans but at the time the datasets were created state of the art language models had poor accuracy on them. For example:
 
Datasets are of varying quality and may contain questions that are mislabeled, ambiguous, unanswerable, or otherwise of low-quality.<ref>{{Cite web |title=Sanitized open-source datasets for natural language and code understanding: how we evaluated our 70B model |url=https://imbue.com/research/70b-evals/ |access-date=2024-07-24 |website=imbue.com |language=en-US |archive-date=2024-07-26 |archive-url=https://web.archive.org/web/20240726173012/https://imbue.com/research/70b-evals/ |url-status=live }}</ref>
 
==== Adversarial evaluations ====
LLMs' rapid improvement regularly renders benchmarks obsolete, with the models exceeding the performance of human annotators.<ref name="bigbench">{{cite arXiv |eprint=2206.04615 |class=cs.CL |first1=Aarohi |last1=Srivastava |first2=Abhinav |last2=Rastogi |title=Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models |last3=Rao |first3=Abhishek |author4=Abu Awal Md Shoeb |last5=Abid |first5=Abubakar |last6=Fisch |first6=Adam |last7=Brown |first7=Adam R. |last8=Santoro |first8=Adam |last9=Gupta |first9=Aditya |last10=Garriga-Alonso |first10=Adrià |last11=Kluska |first11=Agnieszka |last12=Lewkowycz |first12=Aitor |last13=Agarwal |first13=Akshat |last14=Power |first14=Alethea |last15=Ray |first15=Alex |last16=Warstadt |first16=Alex |last17=Kocurek |first17=Alexander W. |last18=Safaya |first18=Ali |last19=Tazarv |first19=Ali |last20=Xiang |first20=Alice |last21=Parrish |first21=Alicia |last22=Nie |first22=Allen |last23=Hussain |first23=Aman |last24=Askell |first24=Amanda |last25=Dsouza |first25=Amanda |last26=Slone |first26=Ambrose |last27=Rahane |first27=Ameet |last28=Iyer |first28=Anantharaman S. |last29=Andreassen |first29=Anders |last30=Madotto |first30=Andrea |year=2022 |display-authors=1}}</ref> In addition, "shortcut learning" allows AIs to "cheat" on multiple-choice tests by using statistical correlations in superficial test question wording to guess the correct responses, without considering the specific question.<ref name="debate understanding" />
 
Some datasets are adversarial, focusing on problems that confound LLMs. One example is the TruthfulQA dataset, a question answering dataset consisting of 817 questions that stump LLMs by mimicking falsehoods to which they were exposed during training. For example, an LLM may answer "No" to the question "Can you teach an old dog new tricks?" because of its exposure to the English idiom ''[[wikt:you can't teach an old dog new tricks|you can't teach an old dog new tricks]]'', even though this is not literally true.<ref name="truthfulqa">{{cite arXiv |eprint=2109.07958 |class=cs.CL |first1=Stephanie |last1=Lin |first2=Jacob |last2=Hilton |title=TruthfulQA: Measuring How Models Mimic Human Falsehoods |last3=Evans |first3=Owain |year=2021}}</ref>
 
Another example of an adversarial evaluation dataset is Swag and its successor, HellaSwag, collections of problems in which one of multiple options must be selected to complete a text passage. The incorrect completions were generated by sampling from a language model. The resulting problems are trivial for humans but defeated LLMs. Sample questions:
<blockquote>
We see a fitness center sign. We then see a man talking to the camera and sitting and laying on a exercise ball. The man...
 
{{br}}a) demonstrates how to increase efficient exercise work by running up and down balls.
# demonstrates how to increase efficient exercise work by running up and down balls.
{{br}}b) moves all his arms and legs and builds up a lot of muscle.
# moves all his arms and legs and builds up a lot of muscle.
{{br}}c) then plays the ball and we see a graphics and hedge trimming demonstration.
# then plays the ball and we see a graphics and hedge trimming demonstration.
{{br}}d) performs sit ups while on the ball and talking.<ref name="hellaswag">{{cite arXiv |eprint=1905.07830 |class=cs.CL |first1=Rowan |last1=Zellers |first2=Ari |last2=Holtzman |title=HellaSwag: Can a Machine Really Finish Your Sentence? |last3=Bisk |first3=Yonatan |last4=Farhadi |first4=Ali |last5=Choi |first5=Yejin |year=2019}}</ref>
# performs sit ups while on the ball and talking.<ref name="hellaswag">{{cite arXiv |eprint=1905.07830 |class=cs.CL |first1=Rowan |last1=Zellers |first2=Ari |last2=Holtzman |title=HellaSwag: Can a Machine Really Finish Your Sentence? |last3=Bisk |first3=Yonatan |last4=Farhadi |first4=Ali |last5=Choi |first5=Yejin |year=2019}}</ref>
</blockquote>
[[BERT (language model)|BERT]] selects b2) as the most likely completion, though the correct answer is d4).<ref name="hellaswag" />
 
== Ethical issues ==
In 2023, ''[[Nature Biomedical Engineering]]'' wrote that "it is no longer possible to accurately distinguish" human-written text from text created by large language models, and that "It is all but certain that general-purpose large language models will rapidly proliferate... It is a rather safe bet that they will change many industries over time."<ref name="ZDTUM">{{cite journal |date=7 March 2023 |title=Prepare for truly useful large language models |journal=Nature Biomedical Engineering |volume=7 |issue=2 |pages=85–86 |doi=10.1038/s41551-023-01012-6 |pmid=36882584 |s2cid=257403466|doi-access=free }}</ref> [[Goldman Sachs]] suggested in 2023 that generative language AI could increase global GDP by 7% in the next ten years, and could expose to automation 300 million jobs globally.<ref name="81w7x">{{cite news |date=7 May 2023 |title=Your job is (probably) safe from artificial intelligence |newspaper=The Economist |url=https://www.economist.com/finance-and-economics/2023/05/07/your-job-is-probably-safe-from-artificial-intelligence |access-date=18 June 2023 |archive-date=17 June 2023 |archive-url=https://web.archive.org/web/20230617225618/https://www.economist.com/finance-and-economics/2023/05/07/your-job-is-probably-safe-from-artificial-intelligence |url-status=live }}</ref><ref name="zIM6Y">{{cite web |title=Generative AI Could Raise Global GDP by 7% |url=https://www.goldmansachs.com/intelligence/pages/generative-ai-could-raise-global-gdp-by-7-percent.html |access-date=18 June 2023 |website=Goldman Sachs |archive-date=18 June 2023 |archive-url=https://web.archive.org/web/20230618013836/https://www.goldmansachs.com/intelligence/pages/generative-ai-could-raise-global-gdp-by-7-percent.html |url-status=live }}</ref> Brinkmann et al. (2023)<ref>{{Cite journal |last1=Brinkmann |first1=Levin |last2=Baumann |first2=Fabian |last3=Bonnefon |first3=Jean-François |last4=Derex |first4=Maxime |last5=Müller |first5=Thomas F. |last6=Nussberger |first6=Anne-Marie |last7=Czaplicka |first7=Agnieszka |last8=Acerbi |first8=Alberto |last9=Griffiths |first9=Thomas L. |last10=Henrich |first10=Joseph |last11=Leibo |first11=Joel Z. |last12=McElreath |first12=Richard |last13=Oudeyer |first13=Pierre-Yves |last14=Stray |first14=Jonathan |last15=Rahwan |first15=Iyad |date=2023-11-20 |title=Machine culture |url=https://www.nature.com/articles/s41562-023-01742-2 |journal=Nature Human Behaviour |language=en |volume=7 |issue=11 |pages=1855–1868 |doi=10.1038/s41562-023-01742-2 |pmid=37985914 |issn=2397-3374|arxiv=2311.11388 }}</ref> also argue that LLMs are transforming processes of [[cultural evolution]] by shaping processes of variation, transmission, and selection.
 
=== Memorization and copyright ===
== Wider impact ==
{{further|Artificial intelligence and copyright}}
In 2023, ''[[Nature Biomedical Engineering]]'' wrote that "it is no longer possible to accurately distinguish" human-written text from text created by large language models, and that "It is all but certain that general-purpose large language models will rapidly proliferate... It is a rather safe bet that they will change many industries over time."<ref name="ZDTUM">{{cite journal |date=7 March 2023 |title=Prepare for truly useful large language models |journal=Nature Biomedical Engineering |language=en |volume=7 |issue=2 |pages=85–86 |doi=10.1038/s41551-023-01012-6 |pmid=36882584 |s2cid=257403466}}</ref> [[Goldman Sachs]] suggested in 2023 that generative language AI could increase global GDP by 7% in the next ten years, and could expose to automation 300 million jobs globally.<ref name="81w7x">{{cite news |date=7 May 2023 |title=Your job is (probably) safe from artificial intelligence |newspaper=The Economist |url=https://www.economist.com/finance-and-economics/2023/05/07/your-job-is-probably-safe-from-artificial-intelligence |access-date=18 June 2023}}</ref><ref name="zIM6Y">{{cite web |title=Generative AI Could Raise Global GDP by 7% |url=https://www.goldmansachs.com/intelligence/pages/generative-ai-could-raise-global-gdp-by-7-percent.html |access-date=18 June 2023 |website=Goldman Sachs}}</ref>
Memorization is an emergent behavior in LLMs in which long strings of text are occasionally output verbatim from training data, contrary to typical behavior of traditional artificial neural networks. Evaluations of controlled LLM output measure the amount memorized from training data (focused on GPT-2-series models) as variously over 1% for exact duplicates<ref>{{cite journal |last1=Peng |first1=Zhencan |last2=Wang |first2=Zhizhi |last3=Deng |first3=Dong |title=Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation |journal=Proceedings of the ACM on Management of Data |date=13 June 2023 |volume=1 |issue=2 |pages=1–18 |doi=10.1145/3589324 |s2cid=259213212 |url=https://people.cs.rutgers.edu/~dd903/assets/papers/sigmod23.pdf |access-date=2024-01-20 |archive-date=2024-08-27 |archive-url=https://web.archive.org/web/20240827053753/https://people.cs.rutgers.edu/~dd903/assets/papers/sigmod23.pdf |url-status=live }} Citing Lee et al 2022.</ref> or up to about 7%.<ref>{{harvnb|Peng|Wang|Deng|2023|p=8}}.</ref>
 
A 2023 study showed that when ChatGPT 3.5 turbo was prompted to repeat the same word indefinitely, after a few hundreds of repetitions, it would start outputting excerpts from its training data.<ref>{{Cite web |author=Stephen Council |date=1 Dec 2023 |title=How Googlers cracked an SF rival's tech model with a single word |url=https://www.sfgate.com/tech/article/google-openai-chatgpt-break-model-18525445.php |url-status=live |archive-url=https://web.archive.org/web/20231216160941/https://www.sfgate.com/tech/article/google-openai-chatgpt-break-model-18525445.php |archive-date=16 December 2023 |accessdate= |publisher=SFGATE}}</ref>
=== Copyright ===
Memorization is an emergent behavior in LLMs in which long strings of text are occasionally output verbatim from training data, contrary to typical behavior of traditional artificial neural nets. Evaluations of controlled LLM output measure the amount memorized from training data (focused on GPT-2-series models) as variously over 1% for exact duplicates<ref>{{cite journal |last1=Peng |first1=Zhencan |last2=Wang |first2=Zhizhi |last3=Deng |first3=Dong |title=Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation |journal=Proceedings of the ACM on Management of Data |date=13 June 2023 |volume=1 |issue=2 |pages=1–18 |doi=10.1145/3589324 |s2cid=259213212 |url=https://people.cs.rutgers.edu/~dd903/assets/papers/sigmod23.pdf |access-date=2024-01-20}} Citing Lee et al 2022.</ref> or up to about 7%.<ref>{{harvnb|Peng|Wang|Deng|2023|p=8}}.</ref>
 
=== Security ===
Some commenters expressed concern over accidental or deliberate creation of misinformation, or other forms of misuse.<ref name="nD6kH">{{cite news |last1=Alba |first1=Davey |date=1 May 2023 |title=AI chatbots have been used to create dozens of news content farms |work=The Japan Times |url=https://www.japantimes.co.jp/news/2023/05/01/business/tech/ai-fake-news-content-farms/ |access-date=18 June 2023}}</ref> For example, the availability of large language models could reduce the skill-level required to commit bioterrorism; biosecurity researcher Kevin Esvelt has suggested that LLM creators should exclude from their training data papers on creating or enhancing pathogens.<ref name="PKiPY">{{cite journal |date=14 June 2023 |title=Could chatbots help devise the next pandemic virus? |url=https://www.science.org/content/article/could-chatbots-help-devise-next-pandemic-virus |journal=Science |language=en |doi=10.1126/science.adj2463 |access-date=18 June 2023 |archive-date=18 June 2023 |archive-url=https://web.archive.org/web/20230618013834/https://www.science.org/content/article/could-chatbots-help-devise-next-pandemic-virus |url-status=live |url-access=subscription }}</ref>
 
Researchers from [[Anthropic]] found that it was possible to create "sleeper agents", models with hidden functionalities that remain dormant until triggered by a specific event or condition. Upon activation, the LLM deviates from its expected behavior to make insecure actions. For example, a LLM could produce safe code except on a specific date, or if the prompt contains a specific tag. These functionalities were found to be difficult to detect or remove via safety training.<ref>{{Cite web |last=Edwards |first=Benj |date=2024-01-15 |title=AI poisoning could turn models into destructive "sleeper agents," says Anthropic |url=https://arstechnica.com/information-technology/2024/01/ai-poisoning-could-turn-open-models-into-destructive-sleeper-agents-says-anthropic/ |access-date=2025-07-19 |website=Ars Technica |language=en}}</ref>
A study by researchers at Google and several universities, including [[Cornell University]] and [[University of California, Berkeley]], showed that there are potential security risks in language models such as [[ChatGPT]]. In their study, they examined the possibility that questioners could get, from ChatGPT, the training data that the AI model used; they found that they could get the training data from the AI model. For example, when asking ChatGPT 3.5 turbo to repeat the word "poem" forever, the AI model will say "poem" hundreds of times and then diverge, deviating from the standard dialogue style and spitting out nonsense phrases, thus spitting out the training data as it is. The researchers have seen more than 10,000 examples of the AI model exposing their training data in a similar method. The researchers said that it was hard to tell if the AI model was actually safe or not.<ref>
{{Cite web |author=Stephen Council |date=1 Dec 2023 |title=How Googlers cracked an SF rival's tech model with a single word |url=https://www.sfgate.com/tech/article/google-openai-chatgpt-break-model-18525445.php |accessdate= |publisher=SFGATE}}
</ref>
 
LLM applications accessible to the public, like ChatGPT or Claude, typically incorporate safety measures designed to filter out harmful content. However, implementing these controls effectively has proven challenging. For instance, a 2023 study<ref>{{Cite arXiv |last1=Kang |first1=Daniel |date=2023 |title=Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks|class=cs.CR |eprint=2302.05733}}</ref> proposed a method for circumventing LLM safety systems. In 2025, The American Sunlight Project, a non-profit, published a study<ref name=":2">{{Cite web |date=26 February 2025 |title=Russian propaganda may be flooding AI models |url=https://www.americansunlight.org/updates/new-report-russian-propaganda-may-be-flooding-ai-models |access-date=2025-04-11 |website=The American Sunlight Project |language=en-US}}</ref> showing evidence that the so-called [[Portal Kombat|Pravda network]], a pro-Russia propaganda aggregator, was strategically placing web content through mass publication and duplication with the intention of biasing LLM outputs. The American Sunlight Project coined this technique "LLM grooming", and pointed to it as a new tool of weaponizing AI to spread disinformation and harmful content.<ref name=":2" /><ref>{{Cite web |last=Goudarzi |first=Sara |date=2025-03-26 |title=Russian networks flood the Internet with propaganda, aiming to corrupt AI chatbots |url=https://thebulletin.org/2025/03/russian-networks-flood-the-internet-with-propaganda-aiming-to-corrupt-ai-chatbots/ |access-date=2025-04-10 |website=[[Bulletin of the Atomic Scientists]] |language=en-US}}</ref> Similarly, [[Yongge Wang]]<ref>{{Cite web |last1=Wang |first1=Yongge |date=20 June 2024 |title=Encryption Based Covert Channel for Large Language Models |url=https://eprint.iacr.org/2024/586.pdf |publisher=IACR ePrint 2024/586 |access-date=24 June 2024 |archive-date=24 June 2024 |archive-url=https://web.archive.org/web/20240624191233/https://eprint.iacr.org/2024/586.pdf |url-status=live }}</ref> illustrated in 2024 how a potential criminal could potentially bypass ChatGPT 4o's safety controls to obtain information on establishing a drug trafficking operation. External filters, circuit breakers and overrides have been posed as solutions.{{cn|date=April 2025}}
The potential presence of "sleeper agents" within LLM models is another emerging security concern. These are hidden functionalities built into the model that remain dormant until triggered by a specific event or condition. Upon activation, the LLM deviates from its expected behavior to make insecure actions.<ref>{{Cite arXiv |last1=Hubinger |first1=Evan |date=10 January 2024 |title=Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training|class=cs.CR |eprint=2401.05566}}</ref>
 
==== Prompt injection ====
{{Main|Prompt injection}}
A problem with the primitive dialog or task format is that users can create messages that appear to come from the assistant or the developer. This may result in some of the model's safeguards being overcome (jailbreaking), a problem called [[prompt injection]]. Attempts to remedy this issue include versions of the ''Chat Markup Language'' where user input is clearly marked as such, though it is still up to the model to understand the separation between user input and developer prompts.<ref>{{cite web |title=openai-python/chatml.md at v0.27.6 · openai/openai-python |url=https://github.com/openai/openai-python/blob/v0.27.6/chatml.md |website=GitHub |language=en}}</ref> Newer models exhibit some resistance to jailbreaking through separation of user and system prompts.<ref name="auto1">{{Cite web |last=Douglas |first=Will |date=March 3, 2023 |title=The inside story of how ChatGPT was built from the people who made it |url=https://www.technologyreview.com/2023/03/03/1069311/inside-story-oral-history-how-chatgpt-built-openai/ |url-status=live |archive-url=https://web.archive.org/web/20230303093219/https://www.technologyreview.com/2023/03/03/1069311/inside-story-oral-history-how-chatgpt-built-openai/ |archive-date=March 3, 2023 |access-date=March 6, 2023 |website=MIT Technology Review}}</ref>
 
LLMs still have trouble differentiating user instructions from instructions in content not authored by the user, such as in web pages and uploaded files.<ref>{{cite arXiv |eprint=2302.12173 |class=cs.CR |first1=Kai |last1=Greshake |first2=Sahar |last2=Abdelnabi |title=Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection |date=2023-02-01 |last3=Mishra |first3=Shailesh |last4=Endres |first4=Christoph |last5=Holz |first5=Thorsten |last6=Fritz |first6=Mario}}</ref>
 
=== Algorithmic bias ===
{{Main article|Algorithmic bias}}
While LLMs have shown remarkable capabilities in generating human-like text, they are susceptible to inheriting and amplifying biases present in their training data. This can manifest in skewed representations or unfair treatment of different demographics, such as those based on race, gender, language, and cultural groups.<ref name=":8">{{Cite webarXiv |lasteprint=Stokel-Walker2506.19028v1 |firstclass=Chriscs.CL |datelast1=NovemberXu 22,|first1=Weijie 2023|last2=Wang |titlefirst2=ChatGPTYiwen Replicates|last3=Xue Gender|first3=Chi Bias|last4=Hu in|first4=Xiangkun Recommendation|last5=Fang Letters|first5=Xi |urllast6=https://www.scientificamerican.com/article/chatgpt-replicates-gender-bias-in-recommendation-letters/Dong |access-datefirst6=2023-12-29Guimin |websitelast7=ScientificReddy American|first7=Chandan K. |languagetitle=enQuantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective |date=2025-06-28}}</ref> Since English data is overrepresented in current large language models' training data, it may also downplay non-English views.<ref name=":1">{{Cite arXiv |eprint=2303.16281v2 |class=cs.CY |first1=Queenie |last1=Luo |first2=Michael J. |last2=Puett |title=A Perspectival Mirror of the Elephant: Investigating Language Bias on Google, ChatGPT, Wikipedia, and YouTube |date=2023-03-28 |language=en |last3=Smith |first3=Michael D.}}</ref>
 
==== Stereotyping ====
AI models can reinforce a wide range of stereotypes, including those based on gender, ethnicity, age, nationality, religion, or occupation. This can lead to outputs that homogenize, or unfairly generalize or caricature groups of people, sometimes in harmful or derogatory ways.<ref>{{Citationcite journal |last1=Wang |first1=Angelina |last2=Morgenstern |first2=Jamie |author2-link=Jamie Morgenstern|last3=Dickerson |first3=John P. |title=Large language models that replace human participants can harmfully misportray and flatten identity groups |journal=Nature Machine Intelligence |date=17 February 2025 |volume=7 |issue=3 |pages=400–411 |doi=10.1038/s42256-025-00986-z|arxiv=2402.01908 }}</ref><ref>{{cite arXiv|last1=Cheng |first1=Myra |title=Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models |date=2023-05-29 |arxiveprint=2305.18189 |last2=Durmus |first2=Esin |last3=Jurafsky |first3=Dan |class=cs.CL }}</ref>
 
Notably, gender bias refers to the tendency of these models to produce outputs that are unfairly prejudiced towards one gender over another. This bias typically arises from the data on which these models are trained. Large language models often assign roles and characteristics based on traditional gender norms.<ref name=":8" /> For example, it might associate nurses or secretaries predominantly with women and engineers or CEOs with men.<ref>{{Cite book |last1=Kotek |first1=Hadas |last2=Dockum |first2=Rikker |last3=Sun |first3=David |title=Proceedings of the ACM Collective Intelligence Conference |last2chapter=DockumGender |first2=Rikkerbias |last3=Sunand |first3=Davidstereotypes in Large Language Models |date=2023-11-05 |publisher=Association for Computing Machinery |isbn=979-8-4007-0113-9 |series=CI '23 |___location=New York, NY, USA |pages=12–24 |chapter=Gender bias and stereotypes in Large Language Models |doi=10.1145/3582269.3615599 |chapter-arxiv=2308.14921 |url=https://dl.acm.org/doi/10.1145/3582269.3615599}}</ref>
 
==== Selection bias ====
Selection bias refers the inherent tendency of large language models to favor certain option identifiers irrespective of the actual content of the options. This bias primarily stems from token bias—that is, the model assigns a higher a priori probability to specific answer tokens (such as "A") when generating responses. As a result, when the ordering of options is altered (for example, by systematically moving the correct answer to different positions), the model’s performance can fluctuate significantly. This phenomenon undermines the reliability of large language models in multiple-choice settings.<ref>{{cite arXiv |last1=Choi |first1=Hyeong Kyu |last2=Xu |first2=Weijie |last3=Xue |first3=Chi |last4=Eckman |first4=Stephanie |last5=Reddy |first5=Chandan K. |title=Mitigating Selection Bias with Node Pruning and Auxiliary Options |date=2024-09-27 |class=cs.AI |eprint=2409.18857}}</ref><ref>{{cite arXiv |last1=Zheng |first1=Chujie |last2=Zhou |first2=Hao |last3=Meng |first3=Fandong |last4=Zhou |first4=Jie |last5=Huang |first5=Minlie |title=Large Language Models Are Not Robust Multiple Choice Selectors |date=2023-09-07 |class=cs.CL |eprint=2309.03882}}</ref>
 
==== Political bias ====
Political bias refers to the tendency of algorithms to systematically favor certain political viewpoints, ideologies, or outcomes over others. Language models may also exhibit political biases. Since the training data includes a wide range of political opinions and coverage, the models might generate responses that lean towards particular political ideologies or viewpoints, depending on the prevalence of those views in the data.<ref>{{Cite web |last=Heikkilä |first=Melissa |date=August 7, 2023 |title=AI language models are rife with different political biases |url=https://www.technologyreview.com/2023/08/07/1077324/ai-language-models-are-rife-with-political-biases/ |access-date=2023-12-29 |website=MIT Technology Review |language=en}}</ref>
 
=== Energy demands ===
==List==
The energy demands of LLMs have grown along with their size and capabilities. [[Data center|Data centers]] that enable LLM training require substantial amounts of electricity. Much of that electricity is generated by non-renewable resources that create greenhouse gases and contribute to [[climate change]].<ref>{{Cite web |last=Mehta |first=Sourabh |date=2024-07-03 |title=How Much Energy Do LLMs Consume? Unveiling the Power Behind AI |url=https://adasci.org/how-much-energy-do-llms-consume-unveiling-the-power-behind-ai/ |access-date=2025-01-27 |website=Association of Data Scientists |language=en-US}}</ref> [[Nuclear power]] and [[geothermal energy]] are two options tech companies are exploring to meet the sizable energy demands of LLM training.<ref>{{Cite news |title=Artificial Intelligence wants to go nuclear. Will it work? |url=https://www.npr.org/2024/12/09/nx-s1-5171063/artificial-intelligence-wants-to-go-nuclear-will-it-work |access-date=2025-01-27 |work=NPR |language=en}}</ref> The significant expense of investing in geothermal solutions has led to major shale producers like [[Chevron Corporation|Chevron]] and [[ExxonMobil|Exxon Mobil]] advocating for tech companies to use electricity produced via [[natural gas]] to fuel their large energy demands.<ref>{{Cite web |last=Roy, Dareen |date=December 19, 2024 |title=AI's energy hunger fuels geothermal startups but natgas rivalry clouds future |url=https://www.reuters.com/technology/artificial-intelligence/ais-energy-hunger-fuels-geothermal-startups-natgas-rivalry-clouds-future-2024-12-19/ |website=Reuters}}</ref>
{{See also|Comparison of user features of chatbots}}
 
For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP.
=== Cognitive impact ===
{| class="wikitable sortable"
 
|-
In 2025, a preliminary study measuring the effects of using LLMs to write essays reported a decrease of neural and linguistic performance from users of ChatGPT over the course of several months.<ref>{{cite arXiv |last1=Kosmyna |first1=Nataliya |title=Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task |date=June 10, 2025 |eprint=2506.08872 |last2=Hauptmann |first2=Eugene |last3=Yuan |first3=Ye Tong |last4=Situ |first4=Jessica |last5=Liao |first5=Xian-Hao |last6=Beresnitzky |first6=Ashly Vivian |last7=Braunstein |first7=Iris |last8=Maes |first8=Pattie |class=cs.AI }}</ref>
! Name !! Release date{{efn|This is the date that documentation describing the model's architecture was first released.}} !! Developer !! Number of parameters{{efn|In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.}} !! Corpus size
 
!Training cost (petaFLOP-day)!! License{{efn|This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can be easily replicated.}} !! Notes
=== Mental health ===
|-
Research and social media posts suggest that some individuals are using LLMs to seek therapy or mental health support.<ref>{{Cite news |last=Zao-Sanders |first=Marc |date=2024-03-19 |title=How People Are Really Using GenAI |url=https://hbr.org/2024/03/how-people-are-really-using-genai |access-date=2025-08-10 |work=Harvard Business Review |language=en |issn=0017-8012}}</ref> In early 2025, a survey by Sentio University found that nearly half (48.7%) of 499 U.S. adults with ongoing mental health conditions who had used LLMs reported turning to them for therapy or emotional support, including help with anxiety, depression, loneliness, and similar concerns.<ref>{{Cite journal |last1=Rousmaniere |first1=Tony |last2=Zhang |first2=Yimeng |last3=Li |first3=Xu |last4=Shah |first4=Siddharth |date=2025-07-21 |title=Large language models as mental health resources: Patterns of use in the United States. |url=https://doi.apa.org/doi/10.1037/pri0000292 |journal=Practice Innovations |language=en |doi=10.1037/pri0000292 |issn=2377-8903|url-access=subscription }}</ref> Studies have found that LLMs can produce hallucinations—plausible but incorrect statements—which may mislead users in sensitive mental health contexts.<ref>{{cite arXiv |last1=Ji |first1=Shaoxiong |title=Rethinking Large Language Models in Mental Health Applications |date=2023-12-17 |eprint=2311.11267 |last2=Zhang |first2=Tianlin |last3=Yang |first3=Kailai |last4=Ananiadou |first4=Sophia |last5=Cambria |first5=Erik |class=cs.CL }}</ref> Research also shows that LLMs may express stigma or inappropriate agreement with maladaptive thoughts, reflecting limitations in replicating the judgment and relational skills of human therapists.<ref>{{cite book |last1=Moore |first1=Jared |title=Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency |date=2025-04-25 |arxiv=2504.18412 |last2=Grabb |first2=Declan |last3=Agnew |first3=William |last4=Klyman |first4=Kevin |last5=Chancellor |first5=Stevie |last6=Ong |first6=Desmond C. |last7=Haber |first7=Nick |chapter=Expressing stigma and inappropriate responses prevents LLMS from safely replacing mental health providers |pages=599–627 |doi=10.1145/3715275.3732039 |isbn=979-8-4007-1482-5 }}</ref> Evaluations of crisis scenarios indicate that some LLMs lack effective safety protocols, such as assessing suicide risk or making appropriate referrals.<ref>{{cite arXiv |last1=Grabb |first1=Declan |title=Risks from Language Models for Automated Mental Healthcare: Ethics and Structure for Implementation |date=2024-08-14 |eprint=2406.11852 |last2=Lamparth |first2=Max |last3=Vasan |first3=Nina |class=cs.CY }}</ref><ref>{{Cite journal |last1=McBain |first1=Ryan K. |last2=Cantor |first2=Jonathan H. |last3=Zhang |first3=Li Ang |last4=Baker |first4=Olesya |last5=Zhang |first5=Fang |last6=Halbisen |first6=Alyssa |last7=Kofner |first7=Aaron |last8=Breslau |first8=Joshua |last9=Stein |first9=Bradley |last10=Mehrotra |first10=Ateev |last11=Yu |first11=Hao |date=2025-03-05 |title=Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study |journal=Journal of Medical Internet Research |language=EN |volume=27 |issue=1 |pages=e67891 |doi=10.2196/67891 |pmid=40053817 |pmc=11928068 |doi-access=free }}</ref>
| [[GPT-1]] || {{dts|June 2018}} || [[OpenAI]] || {{sort|117000000|117 million}} ||
| || {{yes|MIT}}<ref name="gpt1">{{cite web|work=GitHub|title=finetune-transformer-lm |url=https://github.com/openai/finetune-transformer-lm|access-date=2 January 2024}}</ref>
| First GPT model, decoder-only transformer.
|-
| [[BERT (language model)|BERT]] || {{dts|October 2018}} || [[Google]] || {{sort|340000000|340 million}}<ref name="bert-paper">{{cite arXiv |last1=Devlin |first1=Jacob |last2=Chang |first2=Ming-Wei |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina |title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |date=11 October 2018 |eprint=1810.04805v2|class=cs.CL }}</ref> || {{sort|3300000000|3.3 billion}} words<ref name="bert-paper" />
|{{sort|9|9}}<ref name="bHZJ2">{{Cite web |last=Prickett |first=Nicole Hemsoth |date=2021-08-24 |title=Cerebras Shifts Architecture To Meet Massive AI/ML Models |url=https://www.nextplatform.com/2021/08/24/cerebras-shifts-architecture-to-meet-massive-ai-ml-models/ |access-date=2023-06-20 |website=The Next Platform |language=en-US}}</ref>|| {{yes|Apache 2.0}}<ref name="bert-web">{{Cite web|url=https://github.com/google-research/bert|title=BERT|date=March 13, 2023|via=GitHub}}</ref>
| An early and influential language model,<ref name="Manning-2022" /> but encoder-only and thus not built to be prompted or generative<ref name="Ir545">{{cite arXiv |last1=Patel |first1=Ajay |last2=Li |first2=Bryan |last3=Rasooli |first3=Mohammad Sadegh |last4=Constant |first4=Noah |last5=Raffel |first5=Colin |last6=Callison-Burch |first6=Chris |title=Bidirectional Language Models Are Also Few-shot Learners |date=2022 |class=cs.LG |eprint=2209.14500}}</ref>
|-
| XLNet || {{dts|June 2019}} || [[Google]] || {{sort|340000000|~340 million}}<ref name="45rAm">{{Cite web|url=https://www.kdnuggets.com/bert-roberta-distilbert-xlnet-which-one-to-use.html|title=BERT, RoBERTa, DistilBERT, XLNet: Which one to use?|website=KDnuggets}}</ref> || {{sort|3300000000|33 billion}} words
| || {{yes|Apache 2.0}}<ref name="xlnet">{{cite web|work=GitHub|title=xlnet|url=https://github.com/zihangdai/xlnet/|access-date=2 January 2024}}</ref>
| An alternative to BERT; designed as encoder-only<ref name="gAbNO">{{Cite web|url=https://analyticsindiamag.com/google-introduces-new-architecture-to-reduce-cost-of-transformers/|title=Google Introduces New Architecture To Reduce Cost Of Transformers|first=Amit Raja|last=Naik|date=September 23, 2021|website=Analytics India Magazine}}</ref><ref name="LX3rI">{{cite arXiv |last1=Yang |first1=Zhilin |last2=Dai |first2=Zihang |last3=Yang |first3=Yiming |last4=Carbonell |first4=Jaime |last5=Salakhutdinov |first5=Ruslan |last6=Le |first6=Quoc V. |title=XLNet: Generalized Autoregressive Pretraining for Language Understanding |date=2 January 2020 |class=cs.CL |eprint=1906.08237}}</ref>
|-
| [[GPT-2]] || {{dts|February 2019}} || [[OpenAI]] || {{sort|1500000000|1.5 billion}}<ref name="15Brelease" /> || 40GB<ref name="5T8u5">{{cite web |title=Better language models and their implications |url=https://openai.com/research/better-language-models |website=openai.com}}</ref> (~{{sort|10000000000|10 billion}} tokens)<ref name="LambdaLabs">{{cite web |title=OpenAI's GPT-3 Language Model: A Technical Overview |url=https://lambdalabs.com/blog/demystifying-gpt-3 |website=lambdalabs.com |date=3 June 2020 |language=en}}</ref>
| || {{yes|MIT}}<ref name="Sudbe">{{cite web|work=GitHub|title=gpt-2|url=https://github.com/openai/gpt-2|access-date=13 March 2023}}</ref>
| general-purpose model based on transformer architecture
|-
| [[GPT-3]] || {{dts|May 2020}} || OpenAI || {{sort|175000000000|175 billion}}<ref name="Wiggers" /> || {{sort|300000000000|300 billion}} tokens<ref name="LambdaLabs" />
|3640<ref name=":2">Table D.1 in {{Cite arXiv |last1=Brown |first1=Tom B. |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |date=May 28, 2020 |title=Language Models are Few-Shot Learners |eprint=2005.14165v4 |first16=Aditya |last16=Ramesh |first17=Daniel M. |last17=Ziegler |first18=Jeffrey |last18=Wu |first19=Clemens |last19=Winter |first20=Christopher |last20=Hesse |first21=Mark |last21=Chen |first22=Eric |last22=Sigler |first23=Mateusz |last23=Litwin |first24=Scott |last24=Gray |first25=Benjamin |last25=Chess |first26=Jack |last26=Clark |first27=Christopher |last27=Berner |first28=Sam |last28=McCandlish |first29=Alec |last29=Radford |first30=Ilya |last30=Sutskever |first31=Dario |last31=Amodei|class=cs.CL}}</ref>|| {{no|proprietary}}
| A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called [[ChatGPT]] in 2022.<ref name="chatgpt-blog" />
|-
| GPT-Neo || {{dts|March 2021}} || [[EleutherAI]] || {{sort|2700000000|2.7 billion}}<ref name="gpt-neo">{{Cite web|url=https://github.com/EleutherAI/gpt-neo|title=GPT Neo|date=March 15, 2023|via=GitHub}}</ref> || 825 GiB<ref name="Pile">{{cite arXiv |last1=Gao |first1=Leo |last2=Biderman |first2=Stella |last3=Black |first3=Sid |last4=Golding |first4=Laurence |last5=Hoppe |first5=Travis |last6=Foster |first6=Charles |last7=Phang |first7=Jason |last8=He |first8=Horace |last9=Thite |first9=Anish |last10=Nabeshima |first10=Noa |last11=Presser |first11=Shawn |last12=Leahy |first12=Connor |title=The Pile: An 800GB Dataset of Diverse Text for Language Modeling |eprint=2101.00027|date=31 December 2020 |class=cs.CL}}</ref>
| || {{yes|MIT}}<ref name=vb-gpt-neo/>
| The first of [[EleutherAI#GPT models|a series of free GPT-3 alternatives]] released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.<ref name="vb-gpt-neo" />
|-
| [[GPT-J]] || {{dts|June 2021}} || [[EleutherAI]] || {{sort|6000000000|6 billion}}<ref name="JxohJ">{{Cite web |title=GPT-J-6B: An Introduction to the Largest Open Source GPT Model {{!}} Forefront |url=https://www.forefront.ai/blog-posts/gpt-j-6b-an-introduction-to-the-largest-open-sourced-gpt-model |access-date=2023-02-28 |website=www.forefront.ai |language=en}}</ref> || 825 GiB<ref name="Pile" />
|200<ref name=":3">{{Cite arXiv |last1=Dey |first1=Nolan |last2=Gosal |first2=Gurpreet |last3=Zhiming |last4=Chen |last5=Khachane |first5=Hemant |last6=Marshall |first6=William |last7=Pathria |first7=Ribhu |last8=Tom |first8=Marvin |last9=Hestness |first9=Joel |date=2023-04-01 |title=Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster |class=cs.LG |eprint=2304.03208}}</ref>|| {{yes|Apache 2.0}}
| GPT-3-style language model
|-
| Megatron-Turing NLG || {{dts|October 2021}}<ref name="BwnW5">{{cite web |last1=Alvi |first1=Ali |last2=Kharya |first2=Paresh |title=Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model |url=https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/ |website=Microsoft Research |date=11 October 2021}}</ref> || [[Microsoft]] and [[Nvidia]] || {{sort|530000000000|530 billion}}<ref name="mtnlg-preprint" /> || {{sort|338600000000|338.6 billion}} tokens<ref name="mtnlg-preprint" />
| || {{no|Restricted web access}}
| Standard architecture but trained on a supercomputing cluster.
|-
| Ernie 3.0 Titan || {{dts|December 2021}} || [[Baidu]] || {{sort|260000000000|260 billion}}<ref name="qeOB8">{{Cite arXiv|title=ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation|first1=Shuohuan|last1=Wang|first2=Yu|last2=Sun|first3=Yang|last3=Xiang|first4=Zhihua|last4=Wu|first5=Siyu|last5=Ding|first6=Weibao|last6=Gong|first7=Shikun|last7=Feng|first8=Junyuan|last8=Shang|first9=Yanbin|last9=Zhao|first10=Chao|last10=Pang|first11=Jiaxiang|last11=Liu|first12=Xuyi|last12=Chen|first13=Yuxiang|last13=Lu|first14=Weixin|last14=Liu|first15=Xi|last15=Wang|first16=Yangfan|last16=Bai|first17=Qiuliang|last17=Chen|first18=Li|last18=Zhao|first19=Shiyong|last19=Li|first20=Peng|last20=Sun|first21=Dianhai|last21=Yu|first22=Yanjun|last22=Ma|first23=Hao|last23=Tian|first24=Hua|last24=Wu|first25=Tian|last25=Wu|first26=Wei|last26=Zeng|first27=Ge|last27=Li|first28=Wen|last28=Gao|first29=Haifeng|last29=Wang|date=December 23, 2021|class=cs.CL |eprint=2112.12731}}</ref> || 4 Tb
| || {{no|Proprietary}}
| Chinese-language LLM. [[Ernie Bot]] is based on this model.
|-
| [[Claude (language model)|Claude]]<ref name="i8jc4">{{cite web |title=Product |url=https://www.anthropic.com/product |website=Anthropic |access-date=14 March 2023 |language=en}}</ref> || {{dts|December 2021}} || [[Anthropic]] || {{sort|52000000000|52 billion}}<ref name="AnthroArch">{{cite arXiv |last1=Askell |first1=Amanda |last2=Bai |first2=Yuntao |last3=Chen |first3=Anna |last4=Drain |first4=Dawn |last5=Ganguli |first5=Deep |last6=Henighan |first6=Tom |last7=Jones |first7=Andy |last8=Joseph |first8=Nicholas |last9=Mann |first9=Ben |last10=DasSarma |first10=Nova |last11=Elhage |first11=Nelson |last12=Hatfield-Dodds |first12=Zac |last13=Hernandez |first13=Danny |last14=Kernion |first14=Jackson |last15=Ndousse |first15=Kamal |last16=Olsson |first16=Catherine |last17=Amodei |first17=Dario |last18=Brown |first18=Tom |last19=Clark |first19=Jack |last20=McCandlish |first20=Sam |last21=Olah |first21=Chris |last22=Kaplan |first22=Jared |display-authors=3 |title=A General Language Assistant as a Laboratory for Alignment |eprint=2112.00861 |date=9 December 2021 |class=cs.CL}}</ref> || {{sort|400000000000|400 billion}} tokens<ref name="AnthroArch" />
| || {{partial success|beta}}
| Fine-tuned for desirable behavior in conversations.<ref name="RZqhw">{{cite arXiv |last1=Bai |first1=Yuntao |last2=Kadavath |first2=Saurav |last3=Kundu |first3=Sandipan |last4=Askell |first4=Amanda |last5=Kernion |first5=Jackson |last6=Jones |first6=Andy |last7=Chen |first7=Anna |last8=Goldie |first8=Anna |last9=Mirhoseini |first9=Azalia |last10=McKinnon |first10=Cameron |last11=Chen |first11=Carol |last12=Olsson |first12=Catherine |last13=Olah |first13=Christopher |last14=Hernandez |first14=Danny |last15=Drain |first15=Dawn |last16=Ganguli |first16=Deep |last17=Li |first17=Dustin |last18=Tran-Johnson |first18=Eli |last19=Perez |first19=Ethan |last20=Kerr |first20=Jamie |last21=Mueller |first21=Jared |last22=Ladish |first22=Jeffrey |last23=Landau |first23=Joshua |last24=Ndousse |first24=Kamal |last25=Lukosuite |first25=Kamile |last26=Lovitt |first26=Liane |last27=Sellitto |first27=Michael |last28=Elhage |first28=Nelson |last29=Schiefer |first29=Nicholas |last30=Mercado |first30=Noemi |last31=DasSarma |first31=Nova |last32=Lasenby |first32=Robert |last33=Larson |first33=Robin |last34=Ringer |first34=Sam |last35=Johnston |first35=Scott |last36=Kravec |first36=Shauna |last37=Showk |first37=Sheer El |last38=Fort |first38=Stanislav |last39=Lanham |first39=Tamera |last40=Telleen-Lawton |first40=Timothy |last41=Conerly |first41=Tom |last42=Henighan |first42=Tom |last43=Hume |first43=Tristan |last44=Bowman |first44=Samuel R. |last45=Hatfield-Dodds |first45=Zac |last46=Mann |first46=Ben |last47=Amodei |first47=Dario |last48=Joseph |first48=Nicholas |last49=McCandlish |first49=Sam |last50=Brown |first50=Tom |last51=Kaplan |first51=Jared |display-authors=3 |title=Constitutional AI: Harmlessness from AI Feedback |eprint=2212.08073 |date=15 December 2022 |class=cs.CL}}</ref>
|-
| GLaM (Generalist Language Model) || {{dts|December 2021}} || Google || {{sort|1200000000000|1.2 trillion}}<ref name="glam-blog" /> || {{sort|1600000000000|1.6 trillion}} tokens<ref name="glam-blog" />
| 5600<ref name="glam-blog" />|| {{no|Proprietary}}
| Sparse [[mixture of experts]] model, making it more expensive to train but cheaper to run inference compared to GPT-3.
|-
| Gopher || {{dts|December 2021}} || [[DeepMind]] || {{sort|280000000000|280 billion}}<ref name="mD5eE">{{cite web |title=Language modelling at scale: Gopher, ethical considerations, and retrieval |url=https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval |website=www.deepmind.com |date=8 December 2021 |access-date=20 March 2023 |language=en}}</ref> || {{sort|300000000000|300 billion}} tokens<ref name="hoffman" />
|5833<ref name=":4">Table 20 and page 66 of ''[https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf PaLM: Scaling Language Modeling with Pathways]''</ref>|| {{no|Proprietary}}
| Further developed into the Chinchilla model.
|-
| [[LaMDA]] (Language Models for Dialog Applications) || {{dts|January 2022}} || Google || {{sort|137000000000|137 billion}}<ref name="lamda-blog" /> || 1.56T words,<ref name="lamda-blog" /> {{sort|168000000000|168 billion}} tokens<ref name="hoffman" />
|4110<ref name="DMs9Z">{{Cite arXiv |last1=Thoppilan |first1=Romal |last2=De Freitas |first2=Daniel |last3=Hall |first3=Jamie |last4=Shazeer |first4=Noam |last5=Kulshreshtha |first5=Apoorv |last6=Cheng |first6=Heng-Tze |last7=Jin |first7=Alicia |last8=Bos |first8=Taylor |last9=Baker |first9=Leslie |last10=Du |first10=Yu |last11=Li |first11=YaGuang |last12=Lee |first12=Hongrae |last13=Zheng |first13=Huaixiu Steven |last14=Ghafouri |first14=Amin |last15=Menegali |first15=Marcelo |date=2022-01-01 |title=LaMDA: Language Models for Dialog Applications |class=cs.CL |eprint=2201.08239}}</ref>|| {{no|Proprietary}}
| Specialized for response generation in conversations.
|-
| GPT-NeoX || {{dts|February 2022}} || [[EleutherAI]] || {{sort|20000000000|20 billion}}<ref name="gpt-neox-20b">{{cite conference |title=GPT-NeoX-20B: An Open-Source Autoregressive Language Model |conference=Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models |date=2022-05-01 |last1=Black |first1=Sidney |last2=Biderman |first2=Stella |last3=Hallahan |first3=Eric |display-authors=etal |volume=Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models |pages=95–136 |url=https://aclanthology.org/2022.bigscience-1.9/ |access-date=2022-12-19}}</ref> || 825 GiB<ref name="Pile" />
|740<ref name=":3" />|| {{yes|Apache 2.0}}
| based on the Megatron architecture
|-
| [[Chinchilla AI|Chinchilla]] || {{dts|March 2022}} || [[DeepMind]] || {{sort|70000000000|70 billion}}<ref name="chinchilla-blog" /> || {{sort|1400000000000|1.4 trillion}} tokens<ref name="chinchilla-blog" /><ref name="hoffman">{{cite arXiv |last1=Hoffmann |first1=Jordan |last2=Borgeaud |first2=Sebastian |last3=Mensch |first3=Arthur |last4=Buchatskaya |first4=Elena |last5=Cai |first5=Trevor |last6=Rutherford |first6=Eliza |last7=Casas |first7=Diego de Las |last8=Hendricks |first8=Lisa Anne |last9=Welbl |first9=Johannes |last10=Clark |first10=Aidan |last11=Hennigan |first11=Tom |last12=Noland |first12=Eric |last13=Millican |first13=Katie |last14=Driessche |first14=George van den |last15=Damoc |first15=Bogdan |last16=Guy |first16=Aurelia |last17=Osindero |first17=Simon |last18=Simonyan |first18=Karen |last19=Elsen |first19=Erich |last20=Rae |first20=Jack W. |last21=Vinyals |first21=Oriol |last22=Sifre |first22=Laurent |title=Training Compute-Optimal Large Language Models |eprint=2203.15556 |date=29 March 2022 |class=cs.CL |display-authors=3}}</ref>
|6805<ref name=":4" />|| {{no|Proprietary}}
| Reduced-parameter model trained on more data. Used in the [[Sparrow (bot)|Sparrow]] bot. Often cited for its [[neural scaling law]].
|-
| [[PaLM]] (Pathways Language Model) || {{dts|April 2022}} || Google || {{sort|540000000000|540 billion}}<ref name="palm-blog" /> || {{sort|768000000000|768 billion}} tokens<ref name="chinchilla-blog" />
|29250<ref name=":4" />|| {{no|Proprietary}}
| Trained for ~60 days on ~6000 [[Tensor Processing Unit|TPU v4]] chips. <ref name=":4" />
|-
| OPT (Open Pretrained Transformer) || {{dts|May 2022}} || [[Meta Platforms|Meta]] || {{sort|175000000000|175 billion}}<ref name="jlof8">{{cite web |title=Democratizing access to large-scale language models with OPT-175B |url=https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ |website=ai.facebook.com |language=en}}</ref> || {{sort|180000000000|180 billion}} tokens<ref name="QjTIc">{{cite arXiv |last1=Zhang |first1=Susan |last2=Roller |first2=Stephen |last3=Goyal |first3=Naman |last4=Artetxe |first4=Mikel |last5=Chen |first5=Moya |last6=Chen |first6=Shuohui |last7=Dewan |first7=Christopher |last8=Diab |first8=Mona |last9=Li |first9=Xian |last10=Lin |first10=Xi Victoria |last11=Mihaylov |first11=Todor |last12=Ott |first12=Myle |last13=Shleifer |first13=Sam |last14=Shuster |first14=Kurt |last15=Simig |first15=Daniel |last16=Koura |first16=Punit Singh |last17=Sridhar |first17=Anjali |last18=Wang |first18=Tianlu |last19=Zettlemoyer |first19=Luke |title=OPT: Open Pre-trained Transformer Language Models |eprint=2205.01068 |date=21 June 2022|class=cs.CL}}</ref>
|310<ref name=":3" />|| {{partial success|Non-commercial research}}{{efn|The smaller models including 66B are publicly available, while the 175B model is available on request.}}
| GPT-3 architecture with some adaptations from Megatron
|-
|YaLM 100B || {{dts|June 2022}} || [[Yandex]] || {{sort|100000000000|100 billion}}<ref name="yalm-repo">{{Citation |last1=Khrushchev |first1=Mikhail |title=YaLM 100B |date=2022-06-22 |url=https://github.com/yandex/YaLM-100B |access-date=2023-03-18 |last2=Vasilev |first2=Ruslan |last3=Petrov |first3=Alexey |last4=Zinov |first4=Nikolay}}</ref>
|| 1.7TB<ref name="yalm-repo" /> || | || {{Yes|Apache 2.0}} || English-Russian model based on Microsoft's Megatron-LM.
|-
| Minerva || {{dts|June 2022}} || Google || {{sort|540000000000|540 billion}}<ref name="minerva-paper" /> || 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server<ref name="minerva-paper">{{cite arXiv |last1=Lewkowycz |first1=Aitor |last2=Andreassen |first2=Anders |last3=Dohan |first3=David |last4=Dyer |first4=Ethan |last5=Michalewski |first5=Henryk |last6=Ramasesh |first6=Vinay |last7=Slone |first7=Ambrose |last8=Anil |first8=Cem |last9=Schlag |first9=Imanol |last10=Gutman-Solo |first10=Theo |last11=Wu |first11=Yuhuai |last12=Neyshabur |first12=Behnam |last13=Gur-Ari |first13=Guy |last14=Misra |first14=Vedant |title=Solving Quantitative Reasoning Problems with Language Models |date=30 June 2022 |class=cs.CL |eprint=2206.14858}}</ref>
| || {{no|Proprietary}}
| LLM trained for solving "mathematical and scientific questions using step-by-step reasoning".<ref name="FfCNK">{{cite web |title=Minerva: Solving Quantitative Reasoning Problems with Language Models |url=https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html |website=ai.googleblog.com |date=30 June 2022 |access-date=20 March 2023 |language=en}}</ref> Minerva is based on PaLM model, further trained on mathematical and scientific data.
|-
| [[BLOOM (language model)|BLOOM]] || {{dts|July 2022}} || Large collaboration led by [[Hugging Face]] || {{sort|175000000000|175 billion}}<ref name="bigger-better">{{cite journal |journal=Nature |last=Ananthaswamy|first=Anil |title=In AI, is bigger always better? |date=8 March 2023 |volume=615 |issue=7951 |pages=202–205 |doi=10.1038/d41586-023-00641-w |pmid=36890378 |bibcode=2023Natur.615..202A |s2cid=257380916 |url=https://www.nature.com/articles/d41586-023-00641-w}}</ref> || {{sort|350000000000|350 billion}} tokens (1.6TB)<ref name="B8wB2">{{cite web |title=bigscience/bloom · Hugging Face |url=https://huggingface.co/bigscience/bloom |website=huggingface.co}}</ref>
| || {{partial success|Responsible AI}}
| Essentially GPT-3 but trained on a multi-lingual corpus (30% English excluding programming languages)
|-
| Galactica || {{dts|November 2022}} || [[Meta Platforms|Meta]] || {{sort|120000000000|120 billion}} || {{sort|350000000000|106 billion}} tokens<ref name="37sY6">{{cite arXiv |last1=Taylor |first1=Ross |last2=Kardas |first2=Marcin |last3=Cucurull |first3=Guillem |last4=Scialom |first4=Thomas |last5=Hartshorn |first5=Anthony |last6=Saravia |first6=Elvis |last7=Poulton |first7=Andrew |last8=Kerkez |first8=Viktor |last9=Stojnic |first9=Robert |title=Galactica: A Large Language Model for Science |date=16 November 2022 |class=cs.CL |eprint=2211.09085}}</ref>
|unknown|| {{partial success|CC-BY-NC-4.0}}
| Trained on scientific text and modalities.
|-
| AlexaTM (Teacher Models) || {{dts|November 2022}} || [[Amazon (company)|Amazon]] || {{sort|20000000000|20 billion}}<ref name="u5szh">{{cite web |title=20B-parameter Alexa model sets new marks in few-shot learning |url=https://www.amazon.science/blog/20b-parameter-alexa-model-sets-new-marks-in-few-shot-learning |website=Amazon Science |language=en |date=2 August 2022}}</ref> || {{sort|1300000000000|1.3 trillion}}<ref name="HaA7l">{{cite arXiv |last1=Soltan |first1=Saleh |last2=Ananthakrishnan |first2=Shankar |last3=FitzGerald |first3=Jack |last4=Gupta |first4=Rahul |last5=Hamza |first5=Wael |last6=Khan |first6=Haidar |last7=Peris |first7=Charith |last8=Rawls |first8=Stephen |last9=Rosenbaum |first9=Andy |last10=Rumshisky |first10=Anna |last11=Prakash |first11=Chandana Satya |last12=Sridhar |first12=Mukund |last13=Triefenbach |first13=Fabian |last14=Verma |first14=Apurv |last15=Tur |first15=Gokhan |last16=Natarajan |first16=Prem |display-authors=3|title=AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model |eprint=2208.01448 |date=3 August 2022|class=cs.CL}}</ref>
| || {{no|proprietary}}<ref name="rpehM">{{cite web |title=AlexaTM 20B is now available in Amazon SageMaker JumpStart {{!}} AWS Machine Learning Blog |url=https://aws.amazon.com/blogs/machine-learning/alexatm-20b-is-now-available-in-amazon-sagemaker-jumpstart/ |website=aws.amazon.com |access-date=13 March 2023 |date=17 November 2022}}</ref>
| bidirectional sequence-to-sequence architecture
|-
| [[Neuro-sama]] || {{dts|December 2022}} || Independent || Unknown || Unknown
| || {{no|privately-owned}}
| A language model designed for live-streaming on [[Twitch (service)|Twitch]].
|-
| [[LLaMA]] (Large Language Model Meta AI) || {{dts|February 2023}} || [[Meta Platforms|Meta]] || {{sort|65000000000|65 billion}}<ref name="llama-blog" /> || {{sort|1400000000000|1.4 trillion}}<ref name="llama-blog" />
|6300<ref name=":5">{{Cite web |title=The Falcon has landed in the Hugging Face ecosystem |url=https://huggingface.co/blog/falcon |access-date=2023-06-20 |website=huggingface.co}}</ref>|| {{partial success|Non-commercial research}}{{efn|Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were leaked and became widely available.}}
| Trained on a large 20-language corpus to aim for better performance with fewer parameters.<ref name="llama-blog" /> Researchers from Stanford University trained a fine-tuned model based on LLaMA weights, called Alpaca.<ref name="KBedq">{{Cite web|url=https://crfm.stanford.edu/2023/03/13/alpaca.html|title=Stanford CRFM|website=crfm.stanford.edu}}</ref>
|-
| [[GPT-4]] || {{dts|March 2023}} || OpenAI || Exact number unknown{{efn|As stated in Technical report: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method ..."<ref name="GPT4Tech">{{Cite web |date=2023 |title=GPT-4 Technical Report |url=https://cdn.openai.com/papers/gpt-4.pdf |website=[[OpenAI]] |access-date=March 14, 2023 |archive-date=March 14, 2023 |archive-url=https://web.archive.org/web/20230314190904/https://cdn.openai.com/papers/gpt-4.pdf |url-status=live}}</ref> }} || Unknown
|| Unknown || {{no|proprietary}}
| Available for ChatGPT Plus users and used in [[GPT-4#Usage|several products]].
|-
|Cerebras-GPT
|{{dts|March 2023}}
|Cerebras
|{{sort|13000000000|13 billion}}<ref name="D0k2a">{{Cite web|url=https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/|title=Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models|first=Nolan|last=Dey|date=March 28, 2023|website=Cerebras}}</ref>
|
|270<ref name=":3" />|| {{yes|Apache 2.0}}
| Trained with Chinchilla formula.
|-
| Falcon || {{dts|March 2023}} || [[Technology Innovation Institute]] || {{sort|40000000000|40 billion}}<ref name="falcon">{{cite web |title=Abu Dhabi-based TII launches its own version of ChatGPT |url=https://fastcompanyme.com/news/abu-dhabi-based-tii-launches-its-own-version-of-chatgpt/ |website=tii.ae}}</ref> || 1 trillion tokens, from RefinedWeb (filtered web text corpus)<ref name="Xb1gq">{{Cite arXiv |last1=Penedo |first1=Guilherme |last2=Malartic |first2=Quentin |last3=Hesslow |first3=Daniel |last4=Cojocaru |first4=Ruxandra |last5=Cappelli |first5=Alessandro |last6=Alobeidli |first6=Hamza |last7=Pannier |first7=Baptiste |last8=Almazrouei |first8=Ebtesam |last9=Launay |first9=Julien |date=2023-06-01 |title=The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only |class=cs.CL |eprint=2306.01116}}</ref> plus some "curated corpora".<ref name="gzTNw">{{Cite web |date=2023-06-09 |title=tiiuae/falcon-40b · Hugging Face |url=https://huggingface.co/tiiuae/falcon-40b |access-date=2023-06-20 |website=huggingface.co}}</ref>
|2800<ref name=":5" />|| {{yes|Apache 2.0}}<ref name="Wmlcs">[https://www.businesswire.com/news/home/20230531005608/en/UAE’s-Falcon-40B-World’s-Top-Ranked-AI-Model-from-Technology-Innovation-Institute-is-Now-Royalty-Free UAE’s Falcon 40B, World’s Top-Ranked AI Model from Technology Innovation Institute, is Now Royalty-Free], 31 May 2023</ref>
|
|-
| BloombergGPT || {{dts|March 2023}} || [[Bloomberg L.P.]] || {{sort|50000000000|50 billion}} || 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets<ref name="nGOSu">{{Cite arXiv|title=BloombergGPT: A Large Language Model for Finance|first1=Shijie|last1=Wu|first2=Ozan|last2=Irsoy|first3=Steven|last3=Lu|first4=Vadim|last4=Dabravolski|first5=Mark|last5=Dredze|first6=Sebastian|last6=Gehrmann|first7=Prabhanjan|last7=Kambadur|first8=David|last8=Rosenberg|first9=Gideon|last9=Mann|date=March 30, 2023|class=cs.LG |eprint=2303.17564}}</ref>
| || {{no|Proprietary}}
| LLM trained on financial data from proprietary sources, that "outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks"
|-
| PanGu-Σ || {{dts|March 2023}} || [[Huawei]] || {{sort|1085000000000|1.085 trillion}} || 329 billion tokens<ref name="9WSFw">{{Cite arXiv|title=PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing|first1=Xiaozhe|last1=Ren|first2=Pingyi|last2=Zhou|first3=Xinfan|last3=Meng|first4=Xinjing|last4=Huang|first5=Yadao|last5=Wang|first6=Weichao|last6=Wang|first7=Pengfei|last7=Li|first8=Xiaoda|last8=Zhang|first9=Alexander|last9=Podolskiy|first10=Grigory|last10=Arshinov|first11=Andrey|last11=Bout|first12=Irina|last12=Piontkovskaya|first13=Jiansheng|last13=Wei|first14=Xin|last14=Jiang|first15=Teng|last15=Su|first16=Qun|last16=Liu|first17=Jun|last17=Yao|date=March 19, 2023|class=cs.CL |eprint=2303.10845}}</ref>
| || {{no|Proprietary}}
|
|-
| OpenAssistant<ref name="JiOl8">{{Cite arXiv |last1=Köpf |first1=Andreas |last2=Kilcher |first2=Yannic |last3=von Rütte |first3=Dimitri |last4=Anagnostidis |first4=Sotiris |last5=Tam |first5=Zhi-Rui |last6=Stevens |first6=Keith |last7=Barhoum |first7=Abdullah |last8=Duc |first8=Nguyen Minh |last9=Stanley |first9=Oliver |last10=Nagyfi |first10=Richárd |last11=ES |first11=Shahul |last12=Suri |first12=Sameer |last13=Glushkov |first13=David |last14=Dantuluri |first14=Arnav |last15=Maguire |first15=Andrew |date=2023-04-14 |title=OpenAssistant Conversations -- Democratizing Large Language Model Alignment |class=cs.CL |eprint=2304.07327}}</ref> || {{dts|March 2023}} || [[LAION]] || {{sort|17000000000|17 billion}} || 1.5 trillion tokens
| || {{yes|Apache 2.0}}
| Trained on crowdsourced open data
|-
|Jurassic-2<ref>{{Cite web |last=Wrobel |first=Sharon |title=Tel Aviv startup rolls out new advanced AI language model to rival OpenAI |url=https://www.timesofisrael.com/ai21-labs-rolls-out-new-advanced-ai-language-model-to-rival-openai/ |access-date=2023-07-24 |website=www.timesofisrael.com |language=en-US}}</ref>
|{{dts|March 2023}}
|[[AI21 Labs]]
|Exact size unknown
|Unknown
| || {{no|Proprietary}}
|Multilingual<ref>{{Cite web |last=Wiggers |first=Kyle |date=2023-04-13 |title=With Bedrock, Amazon enters the generative AI race |url=https://techcrunch.com/2023/04/13/with-bedrock-amazon-enters-the-generative-ai-race/ |access-date=2023-07-24 |website=TechCrunch |language=en-US}}</ref>
|-
| [[PaLM|PaLM 2]] (Pathways Language Model 2) || {{dts|May 2023}} || Google || {{sort|340000000000|340 billion}}<ref name="cnbc-20230516">{{cite web |last=Elias |first=Jennifer |url=https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html |title=Google's newest A.I. model uses nearly five times more text data for training than its predecessor |work=[[CNBC]] |date=16 May 2023 |access-date=18 May 2023}}</ref> || {{sort|3600000000000|3.6 trillion}} tokens<ref name="cnbc-20230516" />
|85000<ref name=":5" />|| {{no|Proprietary}}
| Used in [[Bard (chatbot)|Bard chatbot]].<ref name="pWyLA">{{Cite web|url=https://blog.google/technology/ai/google-palm-2-ai-large-language-model/|title=Introducing PaLM 2|date=May 10, 2023|website=Google}}</ref>
|-
| Llama 2 || {{dts|July 2023}} || Meta || {{sort|70000000000|70 billion}}<ref name="meta-20230719">{{Cite web | url = https://ai.meta.com/llama/ | title = Introducing Llama 2: The Next Generation of Our Open Source Large Language Model | access-date = 2023-07-19 | website = Meta AI | language = en | date = 2023}}</ref> || {{sort|2000000000000|2 trillion}} tokens<ref name="meta-20230719" />
| || {{partial success|Llama 2 license}}
| Successor of LLaMA.
|-
|[[Claude (language model)|Claude 2]]
|July 2023
|Anthropic
|| Unknown
|Unknown
|Unknown|| {{no|Proprietary}}
| Used in Claude chatbot.<ref>{{cite web |title=Claude 2 |url=https://www.anthropic.com/index/claude-2 |website=anthropic.com |access-date=12 December 2023}}</ref>
|-
| Falcon 180B || {{dts|September 2023}} || Technology Innovation Institute || {{sort|180000000000|180 billion}}<ref name="tii-20230921">{{Cite web | url = https://falconllm.tii.ae/falcon-180b.html | title = Falcon 180B | access-date = 2023-09-21 | website = Technology Innovation Institute | language = en | date = 2023}}</ref> || {{sort|3500000000000|3.5 trillion}} tokens<ref name="tii-20230921" />
| || {{partial success|Falcon 180B TII license}}
|-
| Mistral 7B || {{dts|September 2023}} || [[Mistral AI]] || {{sort|7300000000|7.3 billion}}<ref name="mistral-20230927">{{Cite web | url = https://mistral.ai/news/announcing-mistral-7b/ | title = Announcing Mistral 7B | access-date = 2023-10-06 | website = Mistral | language = en | date = 2023}}</ref> || Unknown
| || {{yes|Apache 2.0}}
|
|-
|[[Claude (language model)|Claude 2.1]]
|November 2023
|Anthropic
|| Unknown
|Unknown
|Unknown|| {{no|Proprietary}}
| Used in Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.<ref>{{cite web |title=Introducing Claude 2.1 |url=https://www.anthropic.com/index/claude-2-1 |website=anthropic.com |access-date=12 December 2023}}</ref>
|-
|Grok-1
|November 2023
|[[x.AI]]
|| Unknown
|Unknown
|Unknown|| {{no|Proprietary}}
| Used in [[Grok (chatbot)|Grok]] chatbot. Grok-1 has a context length of 8,192 tokens and has access to X (Twitter).<ref>{{cite web |title=Grok-1 model card |url=https://x.ai/model-card/ |website=x.ai |access-date=12 December 2023}}</ref>
|-
|[[Gemini (language model)|Gemini]]
|December 2023
|[[Google DeepMind]]
|| Unknown
|Unknown
|Unknown|| {{no|Proprietary}}
| Multimodal model, comes in three sizes. Used in [[Bard (chatbot)|Bard chatbot]].<ref>{{cite web |title=Gemini - Google DeepMind |url=https://deepmind.google/technologies/gemini/#capabilities |website=deepmind.google |access-date=12 December 2023 |language=en}}</ref>
|-
|Mixtral 8x7B
|December 2023
|[[Mistral AI]]
|| 46.7B total, 12.9B parameters per token<ref>{{cite web |title=Mixtral of experts |url=https://mistral.ai/news/mixtral-of-experts/ |website=mistral.ai |access-date=12 December 2023 |language=en-us |date=11 December 2023}}</ref>
|Unknown
|Unknown|| {{yes|Apache 2.0}}
| [[Mixture of experts]] model, outperforms GPT-3.5 and Llama 2 70B on many benchmarks. All weights were released via torrent.<ref>{{cite web |last1=Franzen |first1=Carl |title=Mistral shocks AI community as latest open source model eclipses GPT-3.5 performance |url=https://venturebeat.com/ai/mistral-shocks-ai-community-as-latest-open-source-model-eclipses-gpt-3-5-performance/ |website=VentureBeat |access-date=12 December 2023 |date=11 December 2023}}</ref>
|-
|Phi-2
|December 2023
|Microsoft
|| 2.7B
|1.4T tokens
|Unknown|| {{yes|MIT}}
| So-called ''small language model'', that "matches or outperforms models up to 25x larger", trained on "textbook-quality" data based on the paper "Textbooks Are All You Need". Model training took "14 days on 96 A100 GPUs".<ref>{{cite web |last1=Hughes |first1=Alyssa |title=Phi-2: The surprising power of small language models |url=https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ |website=Microsoft Research |access-date=13 December 2023 |date=12 December 2023}}</ref>
|-
|Eagle 7B
|January 2024
| RWKV
|| 7.52B
|1.1T tokens
|Unknown|| {{yes|Apache 2.0}}
| An "attention-free" linear transformer based on RWKV-v5 architecture.<ref>{{cite web |last1=Cheah |first1=Eugene |title=🦅 Eagle 7B : Soaring past Transformers with 1 Trillion Tokens Across 100+ Languages (RWKV-v5) |url=https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers |website=blog.rwkv.com |access-date=31 January 2024 |language=en}}</ref>
|}
 
== See also ==
* [[Foundation models]]
* [[List of large language models]]
 
* [[List of chatbots]]
== Notes ==
* [[Language model benchmark]]
{{notelist}}
* [[Reinforcement learning]]
* [[Small language model]]
 
== References ==
{{reflist|refs=}}
<!-- Refs below are specific to the "List of large language models" section. (Keeping separate in case that section is split off into a standalone list article in the future.) -->
<ref name=palm-blog>{{Cite web |last1=Narang |first1=Sharan |last2=Chowdhery |first2=Aakanksha |date=April 4, 2022 |title=Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance |url=https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html |access-date=2023-03-09 |website=ai.googleblog.com |language=en
}}</ref>
<ref name=glam-blog>{{Cite web |last1=Dai |first1=Andrew M |last2=Du |first2=Nan |date=December 9, 2021 |title=More Efficient In-Context Learning with GLaM |url=https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html |access-date=2023-03-09 |website=ai.googleblog.com |language=en}}</ref>
<ref name=lamda-blog>{{Cite web |last1=Cheng |first1=Heng-Tze |last2=Thoppilan |first2=Romal |date=January 21, 2022 |title=LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything |url=https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html |access-date=2023-03-09 |website=ai.googleblog.com |language=en}}</ref>
<ref name=mtnlg-preprint>{{Cite arXiv |last1=Smith |first1=Shaden |last2=Patwary |first2=Mostofa |last3=Norick |first3=Brandon |last4=LeGresley |first4=Patrick |last5=Rajbhandari |first5=Samyam |last6=Casper |first6=Jared |last7=Liu |first7=Zhun |last8=Prabhumoye |first8=Shrimai |last9=Zerveas |first9=George |last10=Korthikanti |first10=Vijay |last11=Zhang |first11=Elton |last12=Child |first12=Rewon |last13=Aminabadi |first13=Reza Yazdani |last14=Bernauer |first14=Julie |last15=Song |first15=Xia |date=2022-02-04 |title=Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model |class=cs.CL |eprint=2201.11990 }}</ref>
<ref name=llama-blog>{{cite web
|work=Meta AI
|title=Introducing LLaMA: A foundational, 65-billion-parameter large language model
|date=24 February 2023
|url=https://ai.facebook.com/blog/large-language-model-llama-meta-ai/
}}</ref>
<ref name="15Brelease">{{Cite web
|url = https://openai.com/blog/gpt-2-1-5b-release/
|title = GPT-2: 1.5B Release
|date = 2019-11-05
|website = OpenAI
|language = en
|access-date = 2019-11-14
|archive-date = 2019-11-14
|archive-url = https://web.archive.org/web/20191114074358/https://openai.com/blog/gpt-2-1-5b-release/
|url-status = live
}}</ref>
<ref name=chinchilla-blog>{{cite web
|work=Deepmind Blog
|title=An empirical analysis of compute-optimal large language model training
|first1=Jordan|last1=Hoffmann|first2=Sebastian|last2=Borgeaud
|first3=Arthur|last3=Mensch|first4=Laurent|last4=Sifre
|date=12 April 2022
|url=https://www.deepmind.com/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training
}}</ref>
 
<ref name=chatgpt-blog>{{Cite web |date=2022-11-30 |title=ChatGPT: Optimizing Language Models for Dialogue |url=https://openai.com/blog/chatgpt/ |access-date=2023-01-13 |website=OpenAI |language=en}}</ref>
<ref name=vb-gpt-neo>{{cite web
|work=VentureBeat
|last=Iyer|first=Abhishek
|title=GPT-3's free alternative GPT-Neo is something to be excited about
|date=15 May 2021
|url=https://venturebeat.com/ai/gpt-3s-free-alternative-gpt-neo-is-something-to-be-excited-about/
}}</ref>
}}
 
== Further reading ==
* [[Dan Jurafsky|Jurafsky, Dan]], Martin, James. H. [https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf ''Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition''], 3rd Edition draft, 2023.
* {{Cite journal |last1=Yin |first1=Shukang |last2=Fu |first2=Chaoyou |last3=Zhao |first3=Sirui |last4=Li |first4=Ke |last5=Sun |first5=Xing |last6=Xu |first6=Tong |last7=Chen |first7=Enhong |date=2024 |title=A Survey on Multimodal Large Language Models |journal=National Science Review |volume=11 |issue=12 |pages=nwae403 |doi=10.1093/nsr/nwae403 |pmid=39679213 |pmc=11645129 |arxiv=2306.13549}}
* {{cite arXiv |eprint=2207.09238 |class=cs.LG |first1=Mary |last1=Phuong |first2=Marcus |last2=Hutter |title=Formal Algorithms for Transformers |date=2022}}
* {{Cite web |title=AI Index Report 2024 – Artificial Intelligence Index |url=https://aiindex.stanford.edu/report/ |access-date=2024-05-05 |website=aiindex.stanford.edu}}
* {{cite arXiv |eprint=2303.10130 |class=econ.GN |first1=Tyna |last1=Eloundou |first2=Sam |last2=Manning |title=GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models |last3=Mishkin |first3=Pamela |last4=Rock |first4=Daniel |year=2023}}
* {{cite journal |last1=Frank |first1=Michael C. |title=Baby steps in evaluating the capacities of large language models |journal=Nature Reviews Psychology |date=27 June 2023 |volume=2 |issue=8 |pages=451–452 |doi=10.1038/s44159-023-00211-x |s2cid=259713140 |url=https://www.nature.com/articles/s44159-023-00211-x |access-date=2 July 2023 |issn=2731-0574|url-access=subscription }}
* {{cite arXiv |last1=Eldan |first1=Ronen |last2=Li |first2=Yuanzhi |title=TinyStories: How Small Can Language Models Be and Still Speak Coherent English? |date=2023 |class=cs.CL |eprint=2305.07759}}
* {{cite journal |last1=Frank |first1=Michael C. |title=Baby steps in evaluating the capacities of large language models |journal=Nature Reviews Psychology |date=27 June 2023 |volume=2 |issue=8 |pages=451–452 |doi=10.1038/s44159-023-00211-x |s2cid=259713140 |url=https://www.nature.com/articles/s44159-023-00211-x |access-date=2 July 2023 |language=en |issn=2731-0574}}
* {{cite arXiv |last1=Zhao |first1=Wayne Xin |last2=Zhou |first2=Kun |last3=Li |first3=Junyi |display-authors=1 |title=A Survey of Large Language Models |date=2023 |class=cs.CL |eprint=2303.18223 }}
* {{cite arXiv |last1=Kaddour |first1=Jean |display-authors=etal |title=Challenges and Applications of Large Language Models |date=2023 |class=cs.CL |eprint=2307.10169 }}
* {{Cite arXiv |last1=Yin |first1=Shukang |last2=Fu |first2=Chaoyou |last3=Zhao |first3=Sirui |last4=Li |first4=Ke |last5=Sun |first5=Xing |last6=Xu |first6=Tong |last7=Chen |first7=Enhong |date=2023-06-01 |title=A Survey on Multimodal Large Language Models |class=cs.CV |eprint=2306.13549 }}
* [https://github.com/eugeneyan/open-llms Open LLMs repository] on [[GitHub]].
 
{{Natural language processing}}
{{Artificial intelligence navbox}}
 
[[Category:Large language models| ]]