'''Reasoning language models''' are [[artificial intelligence]] systems that combine [[natural language processing]] with structured reasoning capabilities. These models are usually constructed from [[Pretrained language model|pretrained language models]] by [[Prompt engineering|prompting]], [[Fine-tuning (deep learning)|supervised finetuning]] (SFT), and [[reinforcement learning]] (RL).
== Prompting ==
{{Main|Prompt engineering}}
A language model is a generative model of a training dataset of texts. Prompting means constructing a text prompt such that, conditioned on it, the language model generates a solution to the task. Prompting can be applied to a pretrained model ("base model") or to a base model that has undergone SFT, RL, or both.<ref>{{Citation |last1=Qiao |first1=Shuofei |title=Reasoning with Language Model Prompting: A Survey |date=2023-09-18 |arxiv=2212.09597 |last2=Ou |first2=Yixin |last3=Zhang |first3=Ningyu |last4=Chen |first4=Xiang |last5=Yao |first5=Yunzhi |last6=Deng |first6=Shumin |last7=Tan |first7=Chuanqi |last8=Huang |first8=Fei |last9=Chen |first9=Huajun}}</ref>
=== Chain of thought ===
Chain of Thought prompting (CoT) prompts the model to answer a question by first generating a "chain of thought", i.e. steps of reasoning that mimic a [[train of thought]].<ref name="weipaper2">{{cite conference |last1=Wei |first1=Jason |last2=Wang |first2=Xuezhi |last3=Schuurmans |first3=Dale |last4=Bosma |first4=Maarten |last5=Ichter |first5=Brian |last6=Xia |first6=Fei |last7=Chi |first7=Ed H. |last8=Le |first8=Quoc V. |last9=Zhou |first9=Denny |date=31 October 2022 |title=Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |conference=Advances in Neural Information Processing Systems (NeurIPS 2022) |language=en |volume=35 |arxiv=2201.11903}}</ref> It was introduced in 2022 by the [[Google Brain|Brain team]] of Google and demonstrated on the [[PaLM|PaLM-540B]] model.<ref>{{Cite web |author=Sharan Narang and Aakanksha Chowdhery |date=2022-04-04 |title=Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance |url=https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html}}</ref> In CoT prompting, the prompt is of the form "<Input> Let's think step by step", and the model responds with a chain of reasoning steps, ending with an answer:<math display="block">
\text{Input} \rightarrow \underbrace{\text{Step}_1 \rightarrow \text{Step}_2 \rightarrow \cdots \rightarrow \text{Step}_n}_{\text{Reasoning chain}} \rightarrow \text{Answer}
</math>Similarly, Tree of Thought prompting generalizes CoT by prompting the model to generate one or more "possible next steps", and then running the model on each of them using [[Breadth-first search|breadth-first]], [[Beam search|beam]], or some other method of tree search.<ref>{{Cite arXiv |eprint=2305.10601 |class=cs.CL |first1=Shunyu |last1=Yao |first2=Dian |last2=Yu |title=Tree of Thoughts: Deliberate Problem Solving with Large Language Models |date=2023-05-17 |last3=Zhao |first3=Jeffrey |last4=Shafran |first4=Izhak |last5=Griffiths |first5=Thomas L. |last6=Cao |first6=Yuan |last7=Narasimhan |first7=Karthik}}</ref> Graph of Thought further generalizes CoT by allowing the reasoning steps to form a [[directed acyclic graph]].<ref>{{Cite journal |last1=Besta |first1=Maciej |last2=Blach |first2=Nils |last3=Kubicek |first3=Ales |last4=Gerstenberger |first4=Robert |last5=Podstawski |first5=Michal |last6=Gianinazzi |first6=Lukas |last7=Gajda |first7=Joanna |last8=Lehmann |first8=Tomasz |last9=Niewiadomski |first9=Hubert |last10=Nyczyk |first10=Piotr |last11=Hoefler |first11=Torsten |date=2024-03-24 |title=Graph of Thoughts: Solving Elaborate Problems with Large Language Models |url=https://ojs.aaai.org/index.php/AAAI/article/view/29720 |journal=Proceedings of the AAAI Conference on Artificial Intelligence |language=en |volume=38 |issue=16 |pages=17682–17690 |doi=10.1609/aaai.v38i16.29720 |issn=2374-3468|arxiv=2308.09687 }}</ref>
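The following is a minimal sketch of the basic CoT prompt format described above, written in Python. The <code>generate</code> function is a hypothetical stand-in for any text-completion interface to a language model; it is not part of the cited work.
<syntaxhighlight lang="python">
def chain_of_thought(question: str, generate) -> str:
    """Prompt the model to reason step by step before answering.

    `generate` is assumed to map a prompt string to a sampled completion string.
    """
    prompt = f"Q: {question}\nA: Let's think step by step."
    # The completion is expected to contain intermediate steps followed by a final answer,
    # e.g. "Step 1: ... Step 2: ... The answer is 42."
    return generate(prompt)
</syntaxhighlight>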
Self-consistency decoding performs several chain-of-thought rollouts, then selects the most commonly reached conclusion out of all the rollouts.<ref>{{cite arXiv |eprint=2203.11171 |class=cs.CL |first1=Xuezhi |last1=Wang |first2=Jason |last2=Wei |title=Self-Consistency Improves Chain of Thought Reasoning in Language Models |date=2022-03-01 |last3=Schuurmans |first3=Dale |last4=Le |first4=Quoc |last5=Chi |first5=Ed |last6=Narang |first6=Sharan |last7=Chowdhery |first7=Aakanksha |last8=Zhou |first8=Denny}}</ref> If the rollouts disagree substantially, a human can be queried for the correct chain of thought.<ref>{{cite arXiv |eprint=2302.12246 |class=cs.CL |first1=Shizhe |last1=Diao |first2=Pengcheng |last2=Wang |title=Active Prompting with Chain-of-Thought for Large Language Models |date=2023-02-01 |last3=Lin |first3=Yong |last4=Zhang |first4=Tong}}</ref>
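A minimal sketch of self-consistency decoding, reusing the hypothetical <code>generate</code> function above together with an <code>extract_answer</code> helper (also hypothetical) that parses the final answer out of a rollout:
<syntaxhighlight lang="python">
from collections import Counter

def self_consistency(question, generate, extract_answer, k=20):
    """Sample k chain-of-thought rollouts and return the most common final answer."""
    answers = []
    for _ in range(k):
        rollout = generate(f"Q: {question}\nA: Let's think step by step.")
        answers.append(extract_answer(rollout))  # e.g. parse the text after "The answer is"
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
</syntaxhighlight>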
=== Retrieval-augmented generation ===
{{Main article|Retrieval-augmented generation}}
A language model may answer a query by first retrieving relevant documents from a database. The document retrieval can be via a [[vector database]], summary index, tree index, or keyword table index.<ref>{{Cite web |title=How Each Index Works - LlamaIndex 🦙 v0.10.17 |url=https://docs.llamaindex.ai/en/v0.10.17/module_guides/indexing/index_guide.html |access-date=2024-04-08 |website=docs.llamaindex.ai}}</ref> Following document retrieval, the LLM generates an output that incorporates information from both the query and the retrieved documents.<ref>{{Cite journal |last1=Lewis |first1=Patrick |last2=Perez |first2=Ethan |last3=Piktus |first3=Aleksandra |last4=Petroni |first4=Fabio |last5=Karpukhin |first5=Vladimir |last6=Goyal |first6=Naman |last7=Küttler |first7=Heinrich |last8=Lewis |first8=Mike |last9=Yih |first9=Wen-tau |last10=Rocktäschel |first10=Tim |last11=Riedel |first11=Sebastian |last12=Kiela |first12=Douwe |date=2020 |title=Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks |url=https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=33 |pages=9459–9474 |arxiv=2005.11401}}</ref>
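A sketch of this retrieve-then-generate pipeline over a vector database follows. The <code>embed</code>, <code>vector_db.search</code>, and <code>generate</code> names are hypothetical placeholders for an embedding model, a vector index, and a language model, not a specific library's API.
<syntaxhighlight lang="python">
def rag_answer(query, embed, vector_db, generate, k=3):
    """Retrieve the k most relevant documents, then answer conditioned on them."""
    query_vector = embed(query)                           # embed the query
    documents = vector_db.search(query_vector, top_k=k)   # nearest-neighbour document retrieval
    context = "\n\n".join(doc.text for doc in documents)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)                               # answer grounded in the retrieved documents
</syntaxhighlight>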
=== Tool use ===
Language models can carry out long reasoning procedures by calling external tools, such as numerical routines, program interpreters, or APIs. Tool use can be prompt-engineered by describing the available tools in-context (an example of in-context learning) or finetuned into the model.<ref>{{Cite journal |last1=Schick |first1=Timo |last2=Dwivedi-Yu |first2=Jane |last3=Dessi |first3=Roberto |last4=Raileanu |first4=Roberta |last5=Lomeli |first5=Maria |last6=Hambro |first6=Eric |last7=Zettlemoyer |first7=Luke |last8=Cancedda |first8=Nicola |last9=Scialom |first9=Thomas |date=2023-12-15 |title=Toolformer: Language Models Can Teach Themselves to Use Tools |url=https://proceedings.neurips.cc/paper_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=36 |pages=68539–68551|arxiv=2302.04761 }}</ref>
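A sketch of prompt-engineered tool use with a single calculator tool. The <code>CALC(...)</code> call syntax and the <code>generate</code> function are illustrative assumptions, not the format used by Toolformer or any particular system.
<syntaxhighlight lang="python">
import re

def answer_with_calculator(question, generate):
    """Let the model emit CALC(<expression>) calls, which the controller evaluates in place."""
    prompt = (
        "You may use a calculator by writing CALC(<expression>).\n"
        f"Q: {question}\nA:"
    )
    draft = generate(prompt)

    def run_tool(match):
        # Evaluate the arithmetic expression; a restricted eval is only adequate for a sketch.
        return str(eval(match.group(1), {"__builtins__": {}}))

    # Replace each tool call in the draft with the tool's result.
    return re.sub(r"CALC\(([^)]*)\)", run_tool, draft)
</syntaxhighlight>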
== Supervised finetuning ==
A base [[large language model]] (LLM) can be finetuned on a dataset of reasoning tasks with example solutions and reasoning traces. The finetuned model can then generate reasoning traces for a given problem.<ref name=":0">{{Citation |last1=Uesato |first1=Jonathan |title=Solving math word problems with process- and outcome-based feedback |date=2022-11-25 |arxiv=2211.14275 |last2=Kushman |first2=Nate |last3=Kumar |first3=Ramana |last4=Song |first4=Francis |last5=Siegel |first5=Noah |last6=Wang |first6=Lisa |last7=Creswell |first7=Antonia |last8=Irving |first8=Geoffrey |last9=Higgins |first9=Irina}}</ref><ref name=":2" />
As it is expensive to get humans to write reasoning traces for an SFT dataset, researchers have proposed ways to automatically construct SFT datasets. In rejection sampling finetuning (RFT), new reasoning traces are collected via a loop:<ref>{{Citation |last1=Yuan |first1=Zheng |title=Scaling Relationship on Learning Mathematical Reasoning with Large Language Models |date=2023-09-13 |arxiv=2308.01825 |last2=Yuan |first2=Hongyi |last3=Li |first3=Chengpeng |last4=Dong |first4=Guanting |last5=Lu |first5=Keming |last6=Tan |first6=Chuanqi |last7=Zhou |first7=Chang |last8=Zhou |first8=Jingren}}</ref>
# Sample a task prompt from the training set.
# Generate many reasoning traces for the prompt.
# Reject the traces whose final answer is incorrect, and deduplicate the remaining traces (for example, by the set of equations they contain).
# Add the retained traces to the SFT dataset and finetune the model on it.
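A sketch of this rejection-sampling loop for building an SFT dataset. The <code>generate</code> and <code>extract_answer</code> helpers are hypothetical, and the deduplication here is by exact string match rather than the equation-based deduplication of the original paper.
<syntaxhighlight lang="python">
def build_rft_dataset(problems, generate, extract_answer, k=16):
    """Collect reasoning traces that end in the known correct answer (rejection sampling)."""
    dataset = []
    for problem in problems:  # each problem is assumed to have .question and .answer fields
        kept = set()
        for _ in range(k):
            trace = generate(f"Q: {problem.question}\nA: Let's think step by step.")
            if extract_answer(trace) == problem.answer:  # reject traces with a wrong final answer
                kept.add(trace)                          # set membership deduplicates exact repeats
        dataset.extend((problem.question, trace) for trace in kept)
    return dataset  # (question, reasoning trace) pairs to finetune on
</syntaxhighlight>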
Self-consistency can be combined with an ORM (outcome reward model). The model is used to generate multiple answers, and the answers are clustered so that each cluster contains the same answer. The ORM computes the reward for each answer, the rewards within each cluster are summed, and the answer corresponding to the cluster with the highest summed reward is output.<ref name=":3" />
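A sketch of this reward-weighted voting, reusing the hypothetical <code>generate</code> and <code>extract_answer</code> helpers above; <code>orm_score</code> stands in for the outcome reward model.
<syntaxhighlight lang="python">
from collections import defaultdict

def orm_weighted_vote(question, generate, extract_answer, orm_score, k=20):
    """Cluster rollouts by final answer and return the answer with the highest summed reward."""
    cluster_reward = defaultdict(float)
    for _ in range(k):
        rollout = generate(f"Q: {question}\nA: Let's think step by step.")
        answer = extract_answer(rollout)
        cluster_reward[answer] += orm_score(question, rollout)  # sum ORM rewards within each cluster
    return max(cluster_reward, key=cluster_reward.get)
</syntaxhighlight>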
== Applications ==
Prompt engineering was discovered in [[GPT-3]] as "few-shot learning",<ref>{{Cite journal |last1=Brown |first1=Tom B. |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |date=2020-12-06 |title=Language models are few-shot learners |url=https://dl.acm.org/doi/abs/10.5555/3495724.3495883 |journal=Proceedings of the 34th International Conference on Neural Information Processing Systems |series=NIPS '20 |___location=Red Hook, NY, USA |publisher=Curran Associates Inc. |pages=1877–1901 |isbn=978-1-7138-2954-6}}</ref> which began a period of research into "eliciting" capacities of pretrained language models. It was then found that a model could be prompted to perform CoT reasoning, which improves its performance on reasoning tasks.
== Benchmarks ==
{{Main|Benchmark (computing)|List of language model benchmarks}}
The reasoning ability of language models is usually tested on problems with unambiguous solutions that can be cheaply checked and that require reasoning when solved by a human. Such problems are usually in mathematics and [[competitive programming]]. The answer is usually an array of integers, a multiple-choice letter, or a program that passes [[Unit testing|unit tests]] within a limited runtime. Some common benchmarks include:
* GSM8K (Grade School Math): 8.5K linguistically diverse [[Primary school|elementary school]] [[Word problem (mathematics education)|math word problems]] that require 2 to 8 basic arithmetic operations to solve.<ref name=":2" />
* [[MMLU]] (Measuring Massive Multitask Language Understanding): 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law, and medicine.<ref>{{Citation |last1=Hendrycks |first1=Dan |title=Measuring Massive Multitask Language Understanding |date=2021-01-12 |arxiv=2009.03300 |last2=Burns |first2=Collin |last3=Basart |first3=Steven |last4=Zou |first4=Andy |last5=Mazeika |first5=Mantas |last6=Song |first6=Dawn |last7=Steinhardt |first7=Jacob}}</ref>
* GPQA (Google-Proof Q&A): 448 multiple-choice questions written by ___domain experts in biology, physics, and chemistry, and requires PhD-level experts to solve.<ref>{{Citation |last1=Rein |first1=David |title=GPQA: A Graduate-Level Google-Proof Q&A Benchmark |date=2023-11-20 |arxiv=2311.12022 |last2=Hou |first2=Betty Li |last3=Stickland |first3=Asa Cooper |last4=Petty |first4=Jackson |last5=Pang |first5=Richard Yuanzhe |last6=Dirani |first6=Julien |last7=Michael |first7=Julian |last8=Bowman |first8=Samuel R.}}</ref>
* HumanEval: Programming problems where the solution is always a Python function, often just a few lines long.<ref name=":4">{{Citation |last1=Chen |first1=Mark |title=Evaluating Large Language Models Trained on Code |date=2021-07-14 |arxiv=2107.03374 |last2=Tworek |first2=Jerry |last3=Jun |first3=Heewoo |last4=Yuan |first4=Qiming |last5=Pinto |first5=Henrique Ponde de Oliveira |last6=Kaplan |first6=Jared |last7=Edwards |first7=Harri |last8=Burda |first8=Yuri |last9=Joseph |first9=Nicholas}}</ref>
The benchmark scores are of the following kinds:
* pass@n: The model is given <math>n</math> attempts to solve each problem. If any attempt is correct, the model earns a point. The pass@n score is the model's average score over all problems.
* cons@n: The model is given <math>n</math> attempts to solve each problem. If the most common answer is correct, the model earns a point. The cons@n score is the model's average score over all problems. Here "cons" stands for "consensus" or "majority voting".<ref>{{Citation |last1=DeepSeek-AI |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |arxiv=2501.12948 |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong}}</ref>
The pass@n score can be estimated more accurately by making <math>N > n</math> attempts and using the unbiased estimator <math>1- \frac{\binom{N-c}{n}}{\binom{N}{n}}</math>, where <math>c</math> is the number of correct attempts.<ref name=":4" />
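A numerically stable way to compute this estimator, in the spirit of the implementation described by Chen et al., is to expand the ratio of binomial coefficients into a product. The function below uses the article's notation (<math>N</math> attempts, <math>c</math> correct, pass@<math>n</math>).
<syntaxhighlight lang="python">
import numpy as np

def pass_at_n(N: int, c: int, n: int) -> float:
    """Unbiased estimate of pass@n from N attempts, of which c were correct."""
    if N - c < n:
        return 1.0  # every subset of n attempts must contain at least one correct attempt
    # 1 - C(N-c, n) / C(N, n), computed as a running product to avoid huge binomials
    return 1.0 - float(np.prod(1.0 - n / np.arange(N - c + 1, N + 1)))
</syntaxhighlight>
For example, with <math>N = 10</math> attempts of which <math>c = 3</math> are correct, pass@1 is estimated as <math>1 - \tfrac{7}{10} = 0.3</math>.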
== See also ==