{{Short description|Language models designed for reasoning tasks}}
{{Copy edit|for=jargon|date=May 2025}}
{{Distinguish|Large reasoning model}}
{{unreliable sources|date=January 2025}}
 
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that are trained further to solve tasks that take several steps of [[reasoning]].<ref>{{cite arXiv |last1=Besta |first1=Maciej |last2=Barth |first2=Julia |last3=Schreiber |first3=Eric |last4=Kubicek |first4=Ales |last5=Catarino |first5=Afonso |last6=Gerstenberger |first6=Robert |last7=Nyczyk |first7=Piotr |last8=Iff |first8=Patrick |last9=Li |first9=Yueling |title=Reasoning Language Models: A Blueprint |date=2025-01-23 |eprint=2501.11223 |class=cs.CL}}</ref> They tend to do better on logic, math, and programming tasks than standard LLMs, can [[Backtracking|revisit and revise]] earlier steps, and make use of extra computation while answering as another way to [[Neural scaling law|scale performance]], alongside the number of training examples, parameters, and training compute.<ref name=":8">{{cite web |title=Learning to reason with LLMs |url=https://openai.com/index/learning-to-reason-with-llms/ |website=OpenAI |date=2024-09-12 |access-date=2025-07-26}}</ref>
These models are typically constructed from [[Pretrained language model|pretrained language models]] by [[Prompt engineering|prompting]], [[Fine-tuning (deep learning)|supervised finetuning]] (SFT), and [[reinforcement learning]] (RL).
 
== Prompting ==
{{Main|Prompt engineering}}
A language model is a generative model of a training dataset of texts. Prompting means constructing a text prompt, such that, conditional on the text prompt, the language model generates a solution to the task. Prompting can be applied to a pretrained model ("base model"), a base model that has undergone SFT, or RL, or both.<ref>{{Citation |last1=Qiao |first1=Shuofei |title=Reasoning with Language Model Prompting: A Survey |date=2023-09-18 |arxiv=2212.09597 |last2=Ou |first2=Yixin |last3=Zhang |first3=Ningyu |last4=Chen |first4=Xiang |last5=Yao |first5=Yunzhi |last6=Deng |first6=Shumin |last7=Tan |first7=Chuanqi |last8=Huang |first8=Fei |last9=Chen |first9=Huajun}}</ref>
 
Prompt engineering was discovered in [[GPT-3]] as "few-shot learning",<ref>{{Cite journal |last1=Brown |first1=Tom B. |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |date=2020-12-06 |title=Language models are few-shot learners |url=https://dl.acm.org/doi/abs/10.5555/3495724.3495883 |journal=Proceedings of the 34th International Conference on Neural Information Processing Systems |series=NIPS '20 |___location=Red Hook, NY, USA |publisher=Curran Associates Inc. |pages=1877–1901 |isbn=978-1-7138-2954-6}}</ref> which began a period of research into "eliciting" capacities of pretrained language models. It was then found that a model could be prompted to perform CoT reasoning, which improves its performance on reasoning tasks.
=== Chain of thought ===
Chain of Thought prompting (CoT) prompts the model to answer a question by first generating a "chain of thought", i.e. steps of reasoning that mimic a [[train of thought]].<ref name="weipaper2">{{cite conference |last1=Wei |first1=Jason |last2=Wang |first2=Xuezhi |last3=Schuurmans |first3=Dale |last4=Bosma |first4=Maarten |last5=Ichter |first5=Brian |last6=Xia |first6=Fei |last7=Chi |first7=Ed H. |last8=Le |first8=Quoc V. |last9=Zhou |first9=Denny |date=31 October 2022 |title=Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |conference=Advances in Neural Information Processing Systems (NeurIPS 2022) |language=en |volume=35 |arxiv=2201.11903}}</ref> It was published in 2022 by the [[Google Brain|Brain team]] of Google on the [[PaLM|PaLM-540B]] model.<ref>{{Cite web |author=Sharan Narang and Aakanksha Chowdhery |date=2022-04-04 |title=Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance |url=https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html}}</ref> In CoT prompting, the prompt is of the form "<Input> Let's think step by step", and the model would respond with a chain of reasoning steps, ended with an answer:<math display="block">
\text{Input} \rightarrow \underbrace{\text{Step}_1 \rightarrow \text{Step}_2 \rightarrow \cdots \rightarrow \text{Step}_n}_{\text{Reasoning chain}} \rightarrow \text{Answer}
</math>Similarly, Tree of Thought prompting generalizes CoT by prompting the model to generate one or more "possible next steps", and then running the model on each of the possible next steps by [[Breadth-first search|breadth-first]], [[Beam search|beam]], or some other method of tree search.<ref>{{Cite arXiv |eprint=2305.10601 |class=cs.CL |first1=Shunyu |last1=Yao |first2=Dian |last2=Yu |title=Tree of Thoughts: Deliberate Problem Solving with Large Language Models |date=2023-05-17 |last3=Zhao |first3=Jeffrey |last4=Shafran |first4=Izhak |last5=Griffiths |first5=Thomas L. |last6=Cao |first6=Yuan |last7=Narasimhan |first7=Karthik}}</ref> Similarly, Graph of Thought generalizes CoT so that the reasoning steps form a [[directed acyclic graph]].<ref>{{Cite journal |last1=Besta |first1=Maciej |last2=Blach |first2=Nils |last3=Kubicek |first3=Ales |last4=Gerstenberger |first4=Robert |last5=Podstawski |first5=Michal |last6=Gianinazzi |first6=Lukas |last7=Gajda |first7=Joanna |last8=Lehmann |first8=Tomasz |last9=Niewiadomski |first9=Hubert |last10=Nyczyk |first10=Piotr |last11=Hoefler |first11=Torsten |date=2024-03-24 |title=Graph of Thoughts: Solving Elaborate Problems with Large Language Models |url=https://ojs.aaai.org/index.php/AAAI/article/view/29720 |journal=Proceedings of the AAAI Conference on Artificial Intelligence |language=en |volume=38 |issue=16 |pages=17682–17690 |doi=10.1609/aaai.v38i16.29720 |issn=2374-3468|arxiv=2308.09687 }}</ref>
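The following sketch illustrates the idea in code, assuming a hypothetical <code>generate</code> callable that maps a prompt string to a model completion and a convention that the model ends with an "Answer:" line; neither is tied to any particular model or API.

<syntaxhighlight lang="python">
from typing import Callable

def chain_of_thought(generate: Callable[[str], str], question: str) -> tuple[str, str]:
    """Build a zero-shot chain-of-thought prompt and split the completion into
    the reasoning chain and the final answer line."""
    prompt = f"{question}\nLet's think step by step."
    completion = generate(prompt)
    # Assumed convention: the model ends its completion with a line "Answer: ...".
    reasoning, _, answer = completion.rpartition("Answer:")
    return reasoning.strip(), answer.strip()

# Toy stand-in for a language model, for demonstration only.
def fake_model(prompt: str) -> str:
    return "Step 1: 3 apples + 2 apples = 5 apples.\nAnswer: 5"

print(chain_of_thought(fake_model, "Ann has 3 apples and buys 2 more. How many does she have?"))
</syntaxhighlight>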
 
Self-consistency decoding performs several chain-of-thought rollouts, then selects the most commonly reached conclusion out of all the rollouts.<ref>{{cite arXiv |eprint=2203.11171 |class=cs.CL |first1=Xuezhi |last1=Wang |first2=Jason |last2=Wei |title=Self-Consistency Improves Chain of Thought Reasoning in Language Models |date=2022-03-01 |last3=Schuurmans |first3=Dale |last4=Le |first4=Quoc |last5=Chi |first5=Ed |last6=Narang |first6=Sharan |last7=Chowdhery |first7=Aakanksha |last8=Zhou |first8=Denny}}</ref> If the rollouts disagree by a lot, a human can be queried for the correct chain of thought.<ref>{{cite arXiv |eprint=2302.12246 |class=cs.CL |first1=Shizhe |last1=Diao |first2=Pengcheng |last2=Wang |title=Active Prompting with Chain-of-Thought for Large Language Models |date=2023-02-01 |last3=Lin |first3=Yong |last4=Zhang |first4=Tong}}</ref>
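A minimal sketch of self-consistency decoding, assuming a hypothetical stochastic <code>generate</code> callable (sampling with nonzero temperature) and an <code>extract_answer</code> helper; both names are illustrative placeholders rather than part of any published implementation.

<syntaxhighlight lang="python">
from collections import Counter
from typing import Callable

def self_consistency(generate: Callable[[str], str],
                     extract_answer: Callable[[str], str],
                     prompt: str, num_rollouts: int = 20) -> str:
    """Sample several chains of thought and return the most common final answer."""
    answers = [extract_answer(generate(prompt)) for _ in range(num_rollouts)]
    return Counter(answers).most_common(1)[0][0]
</syntaxhighlight>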
 
=== Retrieval-augmented generation ===
{{Main article|Retrieval-augmented generation}}
A language model may answer a query by first querying a database of documents using the query. The document retrieval can be via a [[vector database]], summary index, tree index, or keyword table index.<ref>{{Cite web |title=How Each Index Works - LlamaIndex 🦙 v0.10.17 |url=https://docs.llamaindex.ai/en/v0.10.17/module_guides/indexing/index_guide.html |access-date=2024-04-08 |website=docs.llamaindex.ai}}</ref> Following document retrieval, the LLM generates an output that incorporates information from both the query and the retrieved documents.<ref>{{Cite journal |last1=Lewis |first1=Patrick |last2=Perez |first2=Ethan |last3=Piktus |first3=Aleksandra |last4=Petroni |first4=Fabio |last5=Karpukhin |first5=Vladimir |last6=Goyal |first6=Naman |last7=Küttler |first7=Heinrich |last8=Lewis |first8=Mike |last9=Yih |first9=Wen-tau |last10=Rocktäschel |first10=Tim |last11=Riedel |first11=Sebastian |last12=Kiela |first12=Douwe |date=2020 |title=Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks |url=https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=33 |pages=9459–9474 |arxiv=2005.11401}}</ref>
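A schematic sketch of the retrieve-then-generate pattern, using a toy word-overlap score in place of a real vector or keyword index; the function names and prompt format are assumptions for illustration.

<syntaxhighlight lang="python">
from typing import Callable

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Rank documents by naive word overlap with the query (a stand-in for a
    vector-database or keyword-index lookup) and return the top k."""
    query_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(query_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def answer_with_retrieval(generate: Callable[[str], str], query: str,
                          documents: list[str]) -> str:
    """Retrieve supporting documents, then condition the model on them."""
    context = "\n".join(retrieve(query, documents))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer using the context."
    return generate(prompt)
</syntaxhighlight>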
 
=== Tool use ===
Language models can perform long reasoning steps by calling external methods, such as numerical recipes, program interpreters, API calls, and so on. This can be prompt-engineered by describing the external methods in-context (an example of in-context learning) or finetuned into the model.<ref>{{Cite journal |last1=Schick |first1=Timo |last2=Dwivedi-Yu |first2=Jane |last3=Dessi |first3=Roberto |last4=Raileanu |first4=Roberta |last5=Lomeli |first5=Maria |last6=Hambro |first6=Eric |last7=Zettlemoyer |first7=Luke |last8=Cancedda |first8=Nicola |last9=Scialom |first9=Thomas |date=2023-12-15 |title=Toolformer: Language Models Can Teach Themselves to Use Tools |url=https://proceedings.neurips.cc/paper_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=36 |pages=68539–68551|arxiv=2302.04761 }}</ref>
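A sketch of the dispatch loop behind tool use, with a single restricted calculator as the only registered tool; the "CALL tool: argument" syntax is an assumption made for illustration, not a standard, and the toy evaluator is not safe for untrusted input.

<syntaxhighlight lang="python">
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    # Toy arithmetic evaluator with builtins disabled; illustration only.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}}, {})),
}

def run_with_tools(generate: Callable[[str], str], prompt: str, max_turns: int = 5) -> str:
    """Let the model interleave text with tool calls of the assumed form 'CALL tool: arg'."""
    transcript = prompt
    for _ in range(max_turns):
        reply = generate(transcript)
        transcript += "\n" + reply
        if reply.startswith("CALL "):
            name, _, arg = reply[len("CALL "):].partition(":")
            result = TOOLS.get(name.strip(), lambda _: "unknown tool")(arg.strip())
            transcript += f"\nRESULT: {result}"   # feed the tool output back to the model
        else:
            return reply  # no tool call: treat the reply as the final answer
    return transcript
</syntaxhighlight>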
 
== History ==

=== 2024 ===
In September 2024, [[OpenAI]] released [[OpenAI o1#release|o1-preview]], an LLM with enhanced reasoning.<ref>{{cite news |last1=Edwards |first1=Benj |date=2024-09-12 |title=OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini |url=https://arstechnica.com/information-technology/2024/09/openais-new-reasoning-ai-models-are-here-o1-preview-and-o1-mini/ |access-date=2025-02-06 |work=Ars Technica |language=en-US}}</ref> The full version, [[OpenAI o1|o1]], followed in December 2024. OpenAI also began sharing results on its successor, [[OpenAI o3|o3]].<ref>{{cite web |title=OpenAI o1 System Card |url=https://cdn.openai.com/o1-system-card.pdf |website=OpenAI |date=2024-12-05 |access-date=2025-07-26}}</ref><ref>{{cite news |last=Robison |first=Kylie |date=2024-12-05 |title=OpenAI launches ChatGPT Pro, a $200/month plan with unlimited access to o1, GPT-4o, and more |url=https://www.theverge.com/2024/12/5/24314147/openai-reasoning-model-o1-strawberry-chatgpt-pro-new-tier |access-date=2025-07-26 |work=The Verge}}</ref><ref>{{cite news |last=Singh |first=Jaspreet |date=2024-12-20 |title=OpenAI unveils 'o3' model, touting advances in reasoning |url=https://www.reuters.com/technology/artificial-intelligence/openai-unveils-o3-model-touting-advances-reasoning-2024-12-20/ |access-date=2025-07-26 |work=Reuters}}</ref>

The development of reasoning LLMs has illustrated what [[Richard S. Sutton|Rich Sutton]] called the "bitter lesson": that scaling compute often outperforms methods that rely on specific human insights.<ref>{{cite web |last1=Sutton |first1=Richard S. |title=The Bitter Lesson |url=http://www.incompleteideas.net/IncIdeas/BitterLesson.html |access-date=2025-02-27 |website=Incomplete Ideas}}</ref> For example, the Generative AI Research Lab (GAIR) explored complex methods such as tree search and reinforcement learning to replicate o1's capabilities. In their "o1 Replication Journey" papers they reported that [[knowledge distillation]] (training a smaller model to imitate o1's outputs) worked surprisingly well. This highlighted the effectiveness of distillation in this context.<ref>{{cite arXiv |last1=Huang |first1=Zhen |last2=Zou |first2=Haoyang |last3=Li |first3=Xuefeng |last4=Liu |first4=Yixiu |last5=Zheng |first5=Yuxiang |last6=Chern |first6=Ethan |last7=Xia |first7=Shijie |last8=Qin |first8=Yiwei |last9=Yuan |first9=Weizhe |title=O1 Replication Journey — Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? |date=2024-11-25 |eprint=2411.16489 |class=cs.CL}}</ref><ref name=":6">{{cite news |last=Zeff |first=Maxwell |date=2025-02-05 |title=Researchers created an open rival to OpenAI's o1 'reasoning' model for under $50 |url=https://techcrunch.com/2025/02/05/researchers-created-an-open-rival-to-openais-o1-reasoning-model-for-under-50/ |access-date=2025-07-26 |work=TechCrunch}}</ref>

[[Alibaba Group|Alibaba]] released reasoning versions of its [[Qwen]] LLMs in November 2024.<ref>{{cite web |title=QwQ-32B-Preview: Reflect Deeply on the Boundaries of the Unknown |url=https://qwenlm.github.io/blog/qwq-32b-preview/ |website=Qwen (Alibaba Cloud) |date=2024-11-28 |access-date=2025-07-26}}</ref> In December 2024, the team introduced QvQ-72B-Preview, an experimental visual reasoning model.<ref>{{cite web |title=QVQ: To See the World with Wisdom |url=https://qwenlm.github.io/blog/qvq-72b-preview/ |website=Qwen |publisher=Alibaba Cloud |date=2024-12-25 |access-date=2025-07-26}}</ref>

In December 2024, Google introduced [[Gemini Deep Research|Deep Research]] in [[Gemini (chatbot)|Gemini]],<ref>{{cite web |date=2024-12-11 |title=Try Deep Research and our new experimental model in Gemini, your AI assistant |url=https://blog.google/products/gemini/google-gemini-deep-research/ |access-date=2025-02-05 |website=Google |language=en-US}}</ref> a feature that runs multi-step research tasks.<ref>{{cite news |last=Roth |first=Emma |date=2024-12-11 |title=Google built an AI tool that can do research for you |url=https://www.theverge.com/2024/12/11/24318217/google-gemini-advanced-deep-research-launch |access-date=2025-07-26 |work=The Verge}}</ref>

On December 16, 2024, an experiment with a [[Llama (language model)|Llama]] 3B model showed that by scaling test-time compute, a relatively small model could outperform a much larger Llama 70B model on challenging reasoning tasks. This suggested that better inference strategies can unlock useful reasoning capabilities even in small models.<ref>{{cite web |title=Scaling test-time compute |url=https://huggingface.co/blog/h4-scaling-test-time-compute |website=Hugging Face |date=2024-12-16 |access-date=2025-07-26}}</ref><ref name=":7">{{cite journal |last1=Snell |first1=Charlie |last2=Lee |first2=Jaehoon |last3=Xu |first3=Kelvin |last4=Kumar |first4=Aviral |date=2025 |title=Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters |url=https://openreview.net/forum?id=t4s3hJY9dH |journal=International Conference on Learning Representations (ICLR 2025) |access-date=2025-07-26 |arxiv=2408.03314}}</ref>

=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model with comparable performance to o1 at lower cost. The release demonstrated the effectiveness of [[Group Relative Policy Optimization]] (GRPO).<ref>{{cite news |last1=Orland |first1=Kyle |date=2025-01-28 |title=How does DeepSeek R1 really fare against OpenAI's best reasoning models? |url=https://arstechnica.com/ai/2025/01/how-does-deepseek-r1-really-fare-against-openais-best-reasoning-models/ |access-date=2025-02-06 |work=Ars Technica}}</ref><ref name=":9">{{cite arXiv |last1=DeepSeek-AI |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |eprint=2501.12948 |class=cs.CL}}</ref> On January 25, 2025, [[DeepSeek]] added a feature to DeepSeek R1 that lets the model search the web while it reasons, making it easier to combine retrieval with reasoning.<ref>{{cite news |script-title=zh:DeepSeek 支持"深度思考+联网检索"能力 |trans-title=DeepSeek adds a search feature supporting simultaneous deep thinking and web search |work=People's Daily Online |date=2025-01-29 |url=http://tech.people.com.cn/n1/2025/0129/c1007-40386565.html |language=zh |access-date=2025-07-26}}</ref> The effectiveness of distillation for reasoning models was shown in works such as s1-32B, which achieved strong performance through budget forcing and scaling methods.<ref name=":10">{{cite arXiv |last1=Muennighoff |first1=Niklas |last2=Yang |first2=Zitong |last3=Shi |first3=Weijia |last4=Li |first4=Xiang Lisa |last5=Fei-Fei |first5=Li |last6=Hajishirzi |first6=Hannaneh |last7=Zettlemoyer |first7=Luke |last8=Liang |first8=Percy |last9=Candès |first9=Emmanuel |title=s1: Simple test-time scaling |date=2025-02-03 |eprint=2501.19393 |class=cs.CL}}</ref><ref name=":6"/>
 
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]] based on their [[OpenAI o3|o3]] model,<ref name=":5">{{cite web |date=2025-02-02 |title=Introducing deep research |url=https://openai.com/index/introducing-deep-research/ |access-date=2025-02-05 |website=OpenAI |language=en-US}}</ref> allowing users to initiate complex research tasks and generate comprehensive reports which incorporate various sources from the web.<ref name=":5" />
 
== Supervised finetuning ==
A [[large language model]] (LLM) can be fine-tuned on a dataset of reasoning tasks paired with example solutions and step-by-step (reasoning) traces. The fine-tuned model can then produce its own reasoning traces for new problems.<ref name=":0">{{cite arXiv |last1=Uesato |first1=Jonathan |last2=Kushman |first2=Nate |last3=Kumar |first3=Ramana |last4=Song |first4=Francis |last5=Siegel |first5=Noah |last6=Wang |first6=Lisa |last7=Creswell |first7=Antonia |last8=Irving |first8=Geoffrey |last9=Higgins |first9=Irina |title=Solving math word problems with process- and outcome-based feedback |date=2022-11-25 |eprint=2211.14275 |class=cs.LG}}</ref><ref name=":2" />
 
Because human-written traces are costly to collect, researchers have proposed ways to build such datasets automatically. In ''rejection sampling finetuning'' (RFT), new reasoning traces are gathered in a loop:<ref>{{cite arXiv |last1=Yuan |first1=Zheng |last2=Yuan |first2=Hongyi |last3=Li |first3=Chengpeng |last4=Dong |first4=Guanting |last5=Lu |first5=Keming |last6=Tan |first6=Chuanqi |last7=Zhou |first7=Chang |last8=Zhou |first8=Jingren |title=Scaling Relationship on Learning Mathematical Reasoning with Large Language Models |date=2023-09-13 |eprint=2308.01825 |class=cs.CL}}</ref>
# Sample a task prompt.
# Generate many reasoning traces for the prompt.
# Use a verifier to remove reasoning traces with a wrong final answer.
# For each remaining trace, extract the set of equations appearing in it. Deduplicate the traces so that each one has a different set of equations. Add those to the dataset.
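A compressed sketch of this loop, assuming hypothetical <code>generate</code>, <code>final_answer</code>, and <code>equations_in</code> helpers for sampling traces, reading off the final answer, and extracting the equations a trace uses; the data format is illustrative only.

<syntaxhighlight lang="python">
def rejection_sampling_finetuning(prompts, references, generate, final_answer,
                                  equations_in, samples_per_prompt=16):
    """Collect verified, equation-deduplicated reasoning traces for SFT."""
    dataset = []
    for prompt, reference in zip(prompts, references):
        seen_equation_sets = set()
        for _ in range(samples_per_prompt):
            trace = generate(prompt)                    # steps 1-2: sample traces
            if final_answer(trace) != reference:        # step 3: verify final answer
                continue
            equations = frozenset(equations_in(trace))  # step 4: dedupe by equation set
            if equations in seen_equation_sets:
                continue
            seen_equation_sets.add(equations)
            dataset.append({"prompt": prompt, "trace": trace})
    return dataset
</syntaxhighlight>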
 
== Reinforcement learning ==
A pretrained language model can be further trained with RL. In the RL formalism, a generative language model is a '''policy''' <math>\pi</math>. A prompt specifying a task to solve is an environmental '''state''' <math>x</math>, and the language model's response is an '''action''' <math>y</math>. The probability that the language model responds to <math>x</math> with <math>y</math> is <math>\pi(y|x)</math>.
 
Training a reasoning language model with RL means constructing a '''reward model''' <math>r(x, y)</math> to guide the RL process. Intuitively, the reward model says how good a response is for a prompt. For a reasoning language model, the prompt describes a reasoning task, and the reward is high if the response solves the task and low if it does not.
 
A response <math>y</math> may be broken down into multiple steps, in which case it is written as <math>y_1, y_2, \dots, y_n</math>.
 
Most recent systems use policy-gradient methods such as [[Proximal Policy Optimization]] (PPO) because PPO constrains each policy update with a clipped objective, which stabilises training for very large policies.<ref name="OpenAIAlign2022">{{cite web |title=Aligning language models to follow instructions |website=OpenAI Blog |url=https://openai.com/blog/instruction-following/ |date=2022-01-27 |access-date=2025-05-04}}</ref>
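In the notation above, a common way of writing the clipped surrogate objective is
<math display="block">
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],
\qquad \rho_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_\text{old}}(y_t \mid x, y_{<t})},
</math>
where <math>\hat{A}_t</math> is an estimate of the advantage of generating token <math>y_t</math> and <math>\epsilon</math> is a small constant; clipping the probability ratio keeps each update close to the previous policy.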
 
=== Outcome reward model ===
{{Anchor|Outcome Reward Model|ORM}}
 
An outcome reward model, or outcome-supervised RM (ORM),<ref name=":0" /> is a reward model that gives the reward for a step <math>r(x, y_1, \dots, y_i)</math> based on the final answer: <math>r(x, y_1, \dots, y_i) = r(x, y_n)</math>. Such models are often called "verifiers".
 
For tasks with answers that are easy to verify, such as [[Word problem (mathematics education)|math word problems]], the outcome reward can simply be binary: 1 if the final answer is correct, and 0 otherwise.<ref name=":0" /> If automatic verification is hard, humans can label answers as correct or not, and those labels can be used to finetune a base model that predicts the human label.<ref name=":2">{{cite arXiv |last1=Cobbe |first1=Karl |last2=Kosaraju |first2=Vineet |last3=Bavarian |first3=Mohammad |last4=Chen |first4=Mark |last5=Jun |first5=Heewoo |last6=Kaiser |first6=Lukasz |last7=Plappert |first7=Matthias |last8=Tworek |first8=Jerry |last9=Hilton |first9=Jacob |title=Training Verifiers to Solve Math Word Problems |date=2021-11-18 |eprint=2110.14168 |class=cs.LG}}</ref> For other kinds of tasks, like creative writing, where quality is not simply true or false, one can train a reward model by finetuning a base model on human [[Ranking (statistics)|ranked preference]] data, as used in [[reinforcement learning from human feedback]].<ref name=":1">{{cite journal |last1=Lightman |first1=Hunter |last2=Kosaraju |first2=Vineet |last3=Burda |first3=Yura |last4=Edwards |first4=Harri |last5=Baker |first5=Bowen |last6=Lee |first6=Teddy |last7=Leike |first7=Jan |last8=Schulman |first8=John |last9=Sutskever |first9=Ilya |date=2024 |title=Let's Verify Step by Step |url=https://openreview.net/forum?id=dKDGgN0eTg |journal=International Conference on Learning Representations (ICLR 2024) |access-date=2025-07-26 |arxiv=2305.20050}}</ref> A base model can also be fine-tuned to predict, from a partial thinking trace <math>x, y_1, \dots, y_m</math>, whether the final answer will be correct, and this prediction can serve as a binary reward signal.<ref name=":0" />
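A minimal sketch of a binary outcome reward for verifiable answers, assuming a hypothetical <code>extract_final_answer</code> helper; it illustrates the idea rather than any particular system.

<syntaxhighlight lang="python">
def outcome_reward(response: str, reference_answer: str, extract_final_answer) -> float:
    """Return 1.0 if the response's final answer matches the reference, else 0.0.

    The reward depends only on the final answer, not on the intermediate steps,
    which is what makes this an outcome (rather than process) reward.
    """
    predicted = extract_final_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if predicted.strip() == reference_answer.strip() else 0.0
</syntaxhighlight>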
 
The ORM is usually trained with [[logistic regression]], i.e. by minimizing [[Cross-entropy|cross-entropy loss]].<ref name=":3" />
 
Given a PRM, an ORM can be constructed by multiplying the total process reward during the reasoning trace,<ref name=":1" /> by taking the minimum,<ref name=":3" /> or by other ways of aggregating process rewards. DeepSeek used a simple ORM to train the [[DeepSeek (chatbot)|R1 model]].<ref name=":9"/>
 
=== Process reward model ===
{{Anchor|Process Reward Model|PRM}}
 
A process reward model, or process-supervised RM (PRM),<ref name=":0" /> is a reward model that gives the reward for a step <math>r(x, y_1, \dots, y_i)</math> based only on the steps so far: <math>(x, y_1, \dots, y_i)</math>.
 
Given a partial thinking trace <math>x, y_1, \dots, y_m</math>, a human can judge whether the steps ''so far'' are correct, without looking at whether the final answer would be correct. This yields a binary reward signal. Because human labels are costly, a base model can be fine-tuned to predict them.<ref name=":0" /> The PRM is usually trained with [[logistic regression]] on the human labels, i.e. by minimizing the [[Cross-entropy|cross-entropy loss]] between the true labels and the predicted labels.<ref name=":3" />
 
As an example, in a 2023 OpenAI paper, 800K process labels were collected for 75K solution traces. A labeler saw a solution trace and marked each step as "positive" if it moved towards a solution, "neutral" if it was not wrong but did not help, and "negative" if it was a mistake. After the first "negative" label, the labeler stopped labeling that trace and moved to another one. The authors argued that labeling up to the first error was enough to train a capable PRM, even though labeling later steps could give richer supervision signals.<ref name=":1" /><ref>{{cite web |title=openai/prm800k |url=https://github.com/openai/prm800k |website=GitHub |publisher=OpenAI |date=2025-01-27 |access-date=2025-01-27}}</ref>
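A sketch of how per-step labels of this kind can be turned into training examples for a PRM, keeping only the steps up to and including the first error; the exact data format is an assumption made for illustration.

<syntaxhighlight lang="python">
def prm_examples_from_labels(problem: str, steps: list[str], labels: list[str]):
    """Convert step labels ('positive', 'neutral', 'negative') into
    (prefix, target) pairs, truncating after the first 'negative' step."""
    examples = []
    prefix = problem
    for step, label in zip(steps, labels):
        prefix = prefix + "\n" + step
        target = 0.0 if label == "negative" else 1.0  # treat neutral as acceptable
        examples.append((prefix, target))
        if label == "negative":
            break  # labeling stops at the first mistake
    return examples
</syntaxhighlight>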
 
As human labels are expensive, researchers have proposed methods to create PRMs without human labels on the process steps. Inspired by [[Monte Carlo tree search]] (MCTS), the Math-Shepherd method samples multiple continuations until the end, starting at each reasoning step <math>y_i</math>, and sets the reward at that step to be either <math>\frac{\#\text{(correct answers)}}{\#\text{(total answers)}}</math> in the case of "soft estimation", or
<math>\begin{cases}
1 & \text{if one of the answers is correct}\\
0 & \text{else}
\end{cases}</math>
in the case of "hard estimation". This creates process rewards from an ORM, which is often easier or cheaper to construct. A PRM can then be trained on these labels.<ref name=":3">{{cite journal |last1=Wang |first1=Peiyi |last2=Li |first2=Lei |last3=Shao |first3=Zhihong |last4=Xu |first4=Runxin |last5=Dai |first5=Damai |last6=Li |first6=Yifei |last7=Chen |first7=Deli |last8=Wu |first8=Yu |last9=Sui |first9=Zhifang |editor-last=Ku |editor-first=Lun-Wei |editor2-last=Martins |editor2-first=Andre |editor3-last=Srikumar |editor3-first=Vivek |title=Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations |journal=Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |___location=Bangkok, Thailand |publisher=Association for Computational Linguistics |date=August 2024 |pages=9426–9439 |doi=10.18653/v1/2024.acl-long.510 |arxiv=2312.08935}}</ref> Some work has tried a fully MCTS approach.<ref>{{cite arXiv |last1=Chen |first1=Guoxin |last2=Liao |first2=Minpeng |last3=Li |first3=Chengxi |last4=Fan |first4=Kai |title=AlphaMath Almost Zero: Process Supervision without Process |date=2024-09-27 |eprint=2405.03553 |class=cs.LG}}</ref>
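A sketch of the Monte Carlo estimation described above, assuming hypothetical <code>complete</code> and <code>is_correct</code> helpers that roll a partial trace out to a final answer and check it.

<syntaxhighlight lang="python">
def process_reward(prefix: str, complete, is_correct,
                   num_rollouts: int = 8, hard: bool = False) -> float:
    """Estimate the reward of a partial reasoning trace by rolling it out to the end."""
    hits = sum(is_correct(complete(prefix)) for _ in range(num_rollouts))
    if hard:
        return 1.0 if hits > 0 else 0.0   # "hard estimation"
    return hits / num_rollouts            # "soft estimation"
</syntaxhighlight>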
 
One can also use an ORM to implicitly construct a PRM, similar to [[direct preference optimization]].<ref>{{cite arXiv |last1=Yuan |first1=Lifan |last2=Li |first2=Wendi |last3=Chen |first3=Huayu |last4=Cui |first4=Ganqu |last5=Ding |first5=Ning |last6=Zhang |first6=Kaiyan |last7=Zhou |first7=Bowen |last8=Liu |first8=Zhiyuan |last9=Peng |first9=Hao |title=Free Process Rewards without Process Labels |date=2024-12-02 |eprint=2412.01981 |class=cs.CL}}</ref>
 
 
=== Guided sampling ===
A trained ORM can be used to pick the best response. The policy generates several responses, and the ORM selects the best one. This implements a simple form of [[Neural scaling law|test-time compute scaling]] ("best-of-N").<ref name=":2" /><ref>{{cite arXiv |last1=Zhang |first1=Di |last2=Wu |first2=Jianbo |last3=Lei |first3=Jingdi |last4=Che |first4=Tong |last5=Li |first5=Jiatong |last6=Xie |first6=Tong |last7=Huang |first7=Xiaoshui |last8=Zhang |first8=Shufei |last9=Pavone |first9=Marco |title=LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning |date=2024-11-21 |eprint=2410.02884 |class=cs.CL}}</ref>
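A sketch of best-of-N sampling, where <code>generate</code> and <code>orm_score</code> stand in for the policy and the trained reward model.

<syntaxhighlight lang="python">
def best_of_n(prompt: str, generate, orm_score, n: int = 16) -> str:
    """Sample n responses and return the one the outcome reward model scores highest."""
    responses = [generate(prompt) for _ in range(n)]
    return max(responses, key=lambda response: orm_score(prompt, response))
</syntaxhighlight>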
 
A trained PRM can guide reasoning by a greedy [[Tree traversal|tree search]]: the policy proposes several next steps, the PRM picks one, and the process repeats. This mirrors using an ORM to pick a whole response.<ref>{{cite arXiv |last1=Ma |first1=Qianli |last2=Zhou |first2=Haotian |last3=Liu |first3=Tingkai |last4=Yuan |first4=Jianbo |last5=Liu |first5=Pengfei |last6=You |first6=Yang |last7=Yang |first7=Hongxia |title=Let's reward step by step: Step-Level reward model as the Navigators for Reasoning |date=2023-10-16 |eprint=2310.10080 |class=cs.CL}}</ref> [[Beam search]] performs better than greedy search.
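A sketch of greedy step-by-step search guided by a PRM, with <code>propose_steps</code>, <code>prm_score</code>, and <code>is_finished</code> as illustrative placeholders; beam search would keep the top few prefixes at each depth instead of a single one.

<syntaxhighlight lang="python">
def greedy_prm_search(prompt: str, propose_steps, prm_score, is_finished,
                      max_steps: int = 20) -> str:
    """Repeatedly let the policy propose candidate next steps and keep the one
    the process reward model prefers."""
    trace = prompt
    for _ in range(max_steps):
        candidates = propose_steps(trace)   # e.g. sample several possible next steps
        best = max(candidates, key=lambda step: prm_score(trace, step))
        trace = trace + "\n" + best
        if is_finished(trace):
            break
    return trace
</syntaxhighlight>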
 
''Lookahead search'' is another tree search method. The policy proposes several next steps, then makes a short rollout for each. If a solution is found during rollout, the search stops early. Otherwise, the PRM scores each rollout, and the step with the highest score is chosen.<ref name=":7"/>
 
''Self-consistency'' can be combined with an ORM. The model generates multiple answers, and the answers are clustered so that each cluster has the same final answer. The ORM scores each answer, scores in each cluster are summed, and the answer from the highest-scoring cluster is returned.<ref name=":3" />
 
== Benchmarks ==
Reasoning models generally score higher than non-reasoning models on many benchmarks, especially on tasks requiring multi-step reasoning.
 
Some benchmarks exclude reasoning models because their responses take longer and cost more.<ref>{{cite book |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |chapter=Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study |title=Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things |publisher=ACM |date=2025 |pages=1–6 |doi=10.1145/3722565.3727198 |arxiv=2503.12282 |isbn=979-8-4007-1608-9 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? A Research Note |date=2025-02-13 |eprint=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |eprint=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |eprint=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>
 
=== Humanity's Last Exam ===
The [[Humanity's Last Exam|HLE]] benchmark tests expert-level reasoning across mathematics, humanities, and the natural sciences, and shows large performance gaps between models. State-of-the-art reasoning models score low on HLE, leaving room to improve. For example, the full reasoning model [[OpenAI o3|o3]] reached 26.6%,<ref name=":5"/> while the lighter o3-mini-high (on text-only questions) reached 13%.<ref>{{cite web |title=Humanity's Last Exam leaderboard |url=https://agi.safe.ai/benchmarks/hle |website=Safe.ai |publisher=Center for AI Safety |access-date=2025-07-26}}</ref>
 
=== AIME ===
On the [[American Invitational Mathematics Examination]] (AIME), a difficult math competition, non-reasoning models usually solve under 30% of problems. Models that use reasoning methods score between 50% and 80%.<ref name=":8"/><ref name=":9"/><ref name=":10"/> While [[OpenAI o1|OpenAI's o1]] maintained or slightly improved its accuracy from reported 2024 results to 2025 AIME results, o3-mini (high) reached a higher accuracy (80%) at a much lower cost (about 12 times cheaper).<ref name=":4">{{cite web |date=2025-01-31 |title=OpenAI o3-mini |url=https://openai.com/index/openai-o3-mini/ |access-date=2025-02-09 |website=OpenAI |language=en-US}}</ref>
 
=== o3-mini performance ===
According to OpenAI's January 2025 report on o3-mini, adjusting "reasoning effort" significantly affects performance, especially for [[STEM]] tasks. Moving from low to high reasoning effort raises accuracy on AIME 2024, GPQA Diamond, and [[Codeforces]], typically by 10–30%. With high effort, o3-mini (high) achieved 87.3% on AIME (different from the MathArena AIME benchmark), 79.7% on GPQA Diamond, 2130 Elo on Codeforces, and 49.3 on SWE-bench Verified.<ref name=":4"/>

=== Common benchmark tasks ===
{{Main|Benchmark (computing)|List of language model benchmarks}}
The reasoning ability of language models is usually tested on problems with unambiguous solutions that can be cheaply checked, and that require reasoning when solved by a human. Such problems are usually in mathematics and [[competitive programming]]. The answer is usually an array of integers, a multiple-choice letter, or a program that passes [[Unit testing|unit tests]] within a limited runtime. Some common ones include:
* GSM8K (Grade School Math): 8.5K linguistically diverse [[Primary school|elementary school]] [[Word problem (mathematics education)|math word problems]] that require 2 to 8 basic arithmetic operations to solve.<ref name=":2" />
* [[MMLU]] (Measuring Massive Multitask Language Understanding): 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law, and medicine.<ref>{{Citation |last1=Hendrycks |first1=Dan |title=Measuring Massive Multitask Language Understanding |date=2021-01-12 |arxiv=2009.03300 |last2=Burns |first2=Collin |last3=Basart |first3=Steven |last4=Zou |first4=Andy |last5=Mazeika |first5=Mantas |last6=Song |first6=Dawn |last7=Steinhardt |first7=Jacob}}</ref>
* GPQA (Google-Proof Q&A): 448 multiple-choice questions written by ___domain experts in biology, physics, and chemistry, which require PhD-level experts to solve.<ref>{{Citation |last1=Rein |first1=David |title=GPQA: A Graduate-Level Google-Proof Q&A Benchmark |date=2023-11-20 |arxiv=2311.12022 |last2=Hou |first2=Betty Li |last3=Stickland |first3=Asa Cooper |last4=Petty |first4=Jackson |last5=Pang |first5=Richard Yuanzhe |last6=Dirani |first6=Julien |last7=Michael |first7=Julian |last8=Bowman |first8=Samuel R.}}</ref>
* HumanEval: Programming problems where the solution is always a Python function, often just a few lines long.<ref name="humaneval">{{Citation |last1=Chen |first1=Mark |title=Evaluating Large Language Models Trained on Code |date=2021-07-14 |arxiv=2107.03374 |last2=Tworek |first2=Jerry |last3=Jun |first3=Heewoo |last4=Yuan |first4=Qiming |last5=Pinto |first5=Henrique Ponde de Oliveira |last6=Kaplan |first6=Jared |last7=Edwards |first7=Harri |last8=Burda |first8=Yuri |last9=Joseph |first9=Nicholas}}</ref>

=== Scoring metrics ===
Benchmark scores are reported in several ways:
* pass@n: The model is given <math>n</math> attempts to solve each problem. If any attempt is correct, the model earns a point. The pass@n score is the model's average score over all problems.
* cons@n: The model is given <math>n</math> attempts to solve each problem. If the most common answer is correct, the model earns a point. The cons@n score is the model's average score over all problems. Here "cons" stands for "consensus" or "majority voting".<ref name=":9" />
The pass@n score can be estimated more accurately by making <math>N > n</math> attempts and using the unbiased estimator <math>1- \frac{\binom{N-c}{n}}{\binom{N}{n}}</math>, where <math>c</math> is the number of correct attempts.<ref name="humaneval" />
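A short sketch of this estimator using Python's standard-library <code>math.comb</code>:

<syntaxhighlight lang="python">
from math import comb

def pass_at_n(num_samples: int, num_correct: int, n: int) -> float:
    """Unbiased estimate of pass@n from N sampled attempts, c of them correct."""
    if num_samples - num_correct < n:
        return 1.0  # every size-n subset contains at least one correct attempt
    return 1.0 - comb(num_samples - num_correct, n) / comb(num_samples, n)

# Example: 200 attempts, 37 of them correct, estimate pass@10.
print(round(pass_at_n(200, 37, 10), 3))
</syntaxhighlight>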
 
== Drawbacks ==
 
=== Computational cost ===
Reasoning models often need far more compute while answering than non-reasoning models. On AIME, they were 10 to 74 times more expensive than their non-reasoning counterparts.<ref name=":1" />
 
=== Generation time ===
Because reasoning language models tend to produce verbose outputs, the time needed to generate a response is much greater than for a standard [[large language model]].
 
== Models ==
 
=== [[OpenAI]] ===
* [[GPT-5]]
* [[OpenAI o4-mini|o4-mini]]
* [[OpenAI o3|o3 and o3-mini]]
* [[OpenAI o1|o1 and o1-preview]]
 
=== [[Gemini (chatbot)|Gemini]] ===
* [[Gemini (language model)|2.5 Pro and Flash]]
* [[Gemini (language model)|2.0 Flash Thinking]]
 
=== [[DeepSeek]] ===
* R1 (based on V3)
* R1-Lite-Preview (test version based on V2.5)
 
=== [[Qwen]] ===
* QvQ-72B-Preview — an experimental visual reasoning model launched on December 24, 2024, which integrates image understanding with verbal chain-of-thought reasoning.
* QwQ-32B-Preview — an experimental text-based reasoning model released in late November 2024 that emphasizes complex, step-by-step analysis.
 
=== [[Anthropic]] ===
* [[Claude (language model)#Claude 3.7|Claude Sonnet 3.7]] has an adjustable amount of 'thinking' tokens.

=== [[Mistral AI]] ===
* Magistral (medium & small)
 
=== [[XAI (company)|xAI]] ===
* [[Grok_(chatbot)#Grok_3|Grok 3]]
* [[Grok_(chatbot)#Grok_4|Grok 4]]

=== [[Hugging Face]] ===
* OlympicCoder-7B & 32B, as part of reproducing the R1 training openly (Open R1 project).<ref>{{cite web |title=Open-R1: a fully open reproduction of DeepSeek-R1 |url=https://huggingface.co/blog/open-r1 |website=Hugging Face |date=2025-02-24 |access-date=2025-07-26}}</ref><ref>{{cite web |title=OlympicCoder-7B |url=https://huggingface.co/open-r1/OlympicCoder-7B |website=Hugging Face |date=2025-03-11 |access-date=2025-07-26}}</ref>
 
== See also ==
* [[Generative pre-trained transformer]]
* [[Neuro-symbolic AI]]
* [[Automated theorem proving]]
* [[Automated reasoning]]
* [[Reflection (artificial intelligence)]]