{{Short description|Language models designed for reasoning tasks}}{{Multiple issues|
{{unreliable sources|date=January 2025}}
{{Copy edit|for=jargon|date=May 2025}}
 
}}
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that are trained further to solve tasks that take several steps of [[reasoning]].<ref>{{cite arXiv |last1=Besta |first1=Maciej |last2=Barth |first2=Julia |last3=Schreiber |first3=Eric |last4=Kubicek |first4=Ales |last5=Catarino |first5=Afonso |last6=Gerstenberger |first6=Robert |last7=Nyczyk |first7=Piotr |last8=Iff |first8=Patrick |last9=Li |first9=Yueling |title=Reasoning Language Models: A Blueprint |date=2025-01-23 |eprint=2501.11223 |class=cs.CL}}</ref> They tend to do better on logic, math, and programming tasks than standard LLMs, can [[Backtracking|revisit and revise]] earlier steps, and make use of extra computation while answering as another way to [[Neural scaling law|scale performance]], alongside the number of training examples, parameters, and training compute.<ref name=":8">{{cite web |title=Learning to reason with LLMs |url=https://openai.com/index/learning-to-reason-with-llms/ |website=OpenAI |date=2024-09-12 |access-date=2025-07-26}}</ref>
 
== History ==
=== 2024 ===
In September 2024, [[OpenAI]] released [[OpenAI o1#release|o1-preview]], an LLM with enhanced reasoning.<ref>{{cite news |last1=Edwards |first1=Benj |date=2024-09-12 |title=OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini |url=https://arstechnica.com/information-technology/2024/09/openais-new-reasoning-ai-models-are-here-o1-preview-and-o1-mini/ |access-date=2025-02-06 |work=Ars Technica |language=en-US}}</ref> The full version, [[OpenAI o1|o1]], followed in December 2024. OpenAI also began sharing results on its successor, [[OpenAI o3|o3]].<ref>{{cite web |title=OpenAI o1 System Card |url=https://cdn.openai.com/o1-system-card.pdf |website=OpenAI |date=2024-12-05 |access-date=2025-07-26}}</ref><ref>{{cite news |last=Robison |first=Kylie |date=2024-12-05 |title=OpenAI launches ChatGPT Pro, a $200/month plan with unlimited access to o1, GPT-4o, and more |url=https://www.theverge.com/2024/12/5/24314147/openai-reasoning-model-o1-strawberry-chatgpt-pro-new-tier |access-date=2025-07-26 |work=The Verge}}</ref><ref>{{cite news |last=Singh |first=Jaspreet |date=2024-12-20 |title=OpenAI unveils 'o3' model, touting advances in reasoning |url=https://www.reuters.com/technology/artificial-intelligence/openai-unveils-o3-model-touting-advances-reasoning-2024-12-20/ |access-date=2025-07-26 |work=Reuters}}</ref>
 
The development of reasoning LLMs has illustrated what [[Richard S. Sutton|Rich Sutton]] called the "bitter lesson": that scaling compute often outperforms methods that rely on specific human insights.<ref>{{cite web |last1=Sutton |first1=Richard S. |title=The Bitter Lesson |url=http://www.incompleteideas.net/IncIdeas/BitterLesson.html |access-date=2025-02-27 |website=Incomplete Ideas}}</ref> For example, some research groups, such as the Generative AI Research Lab (GAIR), initially explored complex methods such as tree search and reinforcement learning in attempts to replicate o1's capabilities. In their "o1 Replication Journey" papers, they reported that [[knowledge distillation]] (training a smaller model to imitate o1's outputs) worked surprisingly well, highlighting the effectiveness of distillation in this context.<ref>{{cite arXiv |last1=Huang |first1=Zhen |last2=Zou |first2=Haoyang |last3=Li |first3=Xuefeng |last4=Liu |first4=Yixiu |last5=Zheng |first5=Yuxiang |last6=Chern |first6=Ethan |last7=Xia |first7=Shijie |last8=Qin |first8=Yiwei |last9=Yuan |first9=Weizhe |title=O1 Replication Journey — Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? |date=2024-11-25 |eprint=2411.16489 |class=cs.CL}}</ref><ref name=":6">{{cite news |last=Zeff |first=Maxwell |date=2025-02-05 |title=Researchers created an open rival to OpenAI's o1 'reasoning' model for under $50 |url=https://techcrunch.com/2025/02/05/researchers-created-an-open-rival-to-openais-o1-reasoning-model-for-under-50/ |access-date=2025-07-26 |work=TechCrunch}}</ref>
 
[[Alibaba Group|Alibaba]] also released reasoning versions of its [[Qwen]] LLMs in November 2024.<ref>{{cite web |title=QwQ-32B-Preview: Reflect Deeply on the Boundaries of the Unknown |url=https://qwenlm.github.io/blog/qwq-32b-preview/ |website=Qwen (Alibaba Cloud) |date=2024-11-28 |access-date=2025-07-26}}</ref>
In December 2024, the team introduced QvQ-72B-Preview, an experimental visual reasoning model.<ref>{{cite web |title=QVQ: To See the World with Wisdom |url=https://qwenlm.github.io/blog/qvq-72b-preview/ |website=Qwen |publisher=Alibaba Cloud |date=2024-12-25 |access-date=2025-07-26}}</ref>
 
In December 2024, Google introduced [[Gemini Deep Research|Deep Research]] in [[Gemini (chatbot)|Gemini]],<ref>{{cite web |date=2024-12-11 |title=Try Deep Research and our new experimental model in Gemini, your AI assistant |url=https://blog.google/products/gemini/google-gemini-deep-research/ |access-date=2025-02-05 |website=Google |language=en-US}}</ref> a feature that runs multi-step research tasks.<ref>{{cite news |last=Roth |first=Emma |date=2024-12-11 |title=Google built an AI tool that can do research for you |url=https://www.theverge.com/2024/12/11/24318217/google-gemini-advanced-deep-research-launch |access-date=2025-07-26 |work=The Verge}}</ref>
 
On December 16, 2024, an experiment with a [[Llama (language model)|Llama]] 3B model showed that by scaling test-time compute, a relatively small model could outperform a much larger Llama 70B model on challenging reasoning tasks. This result suggested that better inference strategies can unlock useful reasoning capabilities even in small models.<ref>{{cite web |title=Scaling test-time compute - a Hugging Face Space by HuggingFaceH4 |url=https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute |website=Hugging Face |date=2024-12-16 |access-date=2025-07-26}}</ref><ref name=":7">{{cite journal |last1=Snell |first1=Charlie |last2=Lee |first2=Jaehoon |last3=Xu |first3=Kelvin |last4=Kumar |first4=Aviral |date=2025 |title=Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters |url=https://openreview.net/forum?id=t4s3hJY9dH |journal=International Conference on Learning Representations (ICLR 2025) |access-date=2025-07-26 |arxiv=2408.03314}}</ref>
 
=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model with comparable performance to o1 at lower cost. The release demonstrated the effectiveness of [[Group Relative Policy Optimization]] (GRPO).<ref>{{cite news |last1=Orland |first1=Kyle |date=2025-01-28 |title=How does DeepSeek R1 really fare against OpenAI's best reasoning models? |url=https://arstechnica.com/ai/2025/01/how-does-deepseek-r1-really-fare-against-openais-best-reasoning-models/ |access-date=2025-02-06 |work=Ars Technica}}</ref><ref name=":9">{{cite arXiv |last1=DeepSeek-AI |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |eprint=2501.12948 |class=cs.CL}}</ref> On January 25, 2025, DeepSeek added a feature to DeepSeek R1 that lets the model search the web while it reasons, making it easier to combine retrieval with reflective reasoning.<ref>{{cite news |script-title=zh:DeepSeek支持"深度思考+联网检索"能力 |trans-title=DeepSeek adds a search feature supporting simultaneous deep thinking and web search |work=People's Daily Online |date=2025-01-29 |url=http://tech.people.com.cn/n1/2025/0129/c1007-40386565.html |language=zh |access-date=2025-07-26}}</ref> OpenAI subsequently released o3-mini, followed by [[ChatGPT Deep Research|Deep Research]] based on [[OpenAI o3|o3]]. The effectiveness of distillation for reasoning models was shown in works such as s1-32B, which achieved strong performance through budget forcing and scaling methods.<ref name=":10">{{cite arXiv |last1=Muennighoff |first1=Niklas |last2=Yang |first2=Zitong |last3=Shi |first3=Weijia |last4=Li |first4=Xiang Lisa |last5=Fei-Fei |first5=Li |last6=Hajishirzi |first6=Hannaneh |last7=Zettlemoyer |first7=Luke |last8=Liang |first8=Percy |last9=Candès |first9=Emmanuel |title=s1: Simple test-time scaling |date=2025-02-03 |eprint=2501.19393 |class=cs.CL}}</ref><ref name=":6"/>
 
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]] based on their [[OpenAI o3|o3]] model,<ref name=":5">{{cite web |date=2025-02-02 |title=Introducing deep research |url=https://openai.com/index/introducing-deep-research/ |access-date=2025-02-05 |website=OpenAI |language=en-US}}</ref> a tool that integrates reasoning and web search in a unified workflow, allowing users to initiate complex research tasks and generate comprehensive reports that incorporate various sources from the web.<ref>{{cite web |last=Ha |first=Anthony |date=2025-02-03 |title=OpenAI unveils a new ChatGPT agent for 'deep research' |url=https://techcrunch.com/2025/02/02/openai-unveils-a-new-chatgpt-agent-for-deep-research/ |access-date=2025-02-06 |website=TechCrunch |language=en-US}}</ref>
 
== Supervised finetuning ==
A [[large language model]] (LLM) can be fine-tuned on a dataset of reasoning tasks paired with example solutions and step-by-step (reasoning) traces. The fine-tuned model can then produce its own reasoning traces for new problems.<ref name=":0">{{cite arXiv |last1=Uesato |first1=Jonathan |last2=Kushman |first2=Nate |last3=Kumar |first3=Ramana |last4=Song |first4=Francis |last5=Siegel |first5=Noah |last6=Wang |first6=Lisa |last7=Creswell |first7=Antonia |last8=Irving |first8=Geoffrey |last9=Higgins |first9=Irina |title=Solving math word problems with process- and outcome-based feedback |date=2022-11-25 |eprint=2211.14275 |class=cs.LG}}</ref><ref name=":2" />
 
Because human-written reasoning traces are costly to collect, researchers have proposed ways to build such SFT datasets automatically. In ''rejection sampling finetuning'' (RFT), new reasoning traces are gathered in a loop (a code sketch follows the list):<ref>{{cite arXiv |last1=Yuan |first1=Zheng |last2=Yuan |first2=Hongyi |last3=Li |first3=Chengpeng |last4=Dong |first4=Guanting |last5=Lu |first5=Keming |last6=Tan |first6=Chuanqi |last7=Zhou |first7=Chang |last8=Zhou |first8=Jingren |title=Scaling Relationship on Learning Mathematical Reasoning with Large Language Models |date=2023-09-13 |eprint=2308.01825 |class=cs.CL}}</ref>
# Sample a task prompt.
# Generate many reasoning traces for the prompt.
# Use a verifier to remove reasoning traces with a wrong final answer.
# For each remaining trace, extract the set of equations appearing in it, deduplicate the traces so that each one has a different set of equations, and add them to the dataset.
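
A minimal sketch of this loop in Python, where <code>generate</code> (samples one reasoning trace from the policy model) and <code>check_answer</code> (a task-specific verifier) are hypothetical placeholders rather than any particular library's API:

<syntaxhighlight lang="python">
import re

def build_rft_dataset(prompts, generate, check_answer, samples_per_prompt=16):
    """Sketch of rejection sampling finetuning (RFT) dataset construction."""
    dataset = []
    for prompt in prompts:
        seen_equation_sets = set()
        for _ in range(samples_per_prompt):
            trace = generate(prompt)              # sample one reasoning trace
            if not check_answer(prompt, trace):   # keep only traces with a correct final answer
                continue
            # Crude equation extraction, used to deduplicate traces by their set of equations.
            equations = frozenset(re.findall(r"\S+ ?= ?\S+", trace))
            if equations in seen_equation_sets:
                continue
            seen_equation_sets.add(equations)
            dataset.append({"prompt": prompt, "trace": trace})
    return dataset
</syntaxhighlight>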
 
== Reinforcement learning ==
A pretrained language model can be further trained with RL. In the RL formalism, a generative language model is a '''policy''' <math>\pi</math>. A prompt is an environmental '''state''' <math>x</math>, and the language model's response is an '''action''' <math>y</math>. The probability that the language model responds to <math>x</math> with <math>y</math> is <math>\pi(y|x)</math>.
 
Training a reasoning language model with RL then means constructing a '''reward model''' <math>r(x, y)</math> to guide the RL process. Intuitively, the reward model says how good a response is for a given prompt. For a reasoning language model, the prompt describes a reasoning task, and the reward is high if the response solves the task and low if it does not.
 
A response <math>y</math> may be broken down into multiple steps, in which case it is written as <math>y_1, y_2, \dots, y_n</math>.
 
Most recent systems use policy-gradient methods such as [[Proximal Policy Optimization]] (PPO) because PPO constrains each policy update with a clipped objective, which stabilises training for very large policies.<ref name="OpenAIAlign2022">{{cite web |title=Aligning language models to follow instructions |website=OpenAI Blog |url=https://openai.com/blog/instruction-following/ |date=2022-01-27 |access-date=2025-05-04}}</ref>
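
In this notation, PPO's clipped surrogate objective takes its standard form, written here per generated token <math>y_t</math>:
<math display="block">L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_\text{old}}(y_t \mid x, y_{<t})},</math>
where <math>\hat{A}_t</math> is an advantage estimate derived from the reward model and <math>\epsilon</math> is the clipping parameter.
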
=== Outcome reward model ===
{{Anchor|Outcome Reward Model|ORM}}
 
An outcome reward model, or outcome-supervised RM (ORM),<ref name=":0" /> is a reward model that gives the reward for a step <math>r(x, y_1, \dots, y_i)</math> based on the final answer: <math>r(x, y_1, \dots, y_i) = r(x, y_n)</math>. Such models are often called "verifiers".
 
For tasks with answers that are easy to verify, such as [[Word problem (mathematics education)|math word problems]], the outcome reward can simply be binary: 1 if the final answer is correct, and 0 otherwise.<ref name=":0" /> If automatic verification is hard, humans can manually label answers as correct or not, and those labels can be used to fine-tune a base model that predicts the human label.<ref name=":2">{{cite arXiv |last1=Cobbe |first1=Karl |last2=Kosaraju |first2=Vineet |last3=Bavarian |first3=Mohammad |last4=Chen |first4=Mark |last5=Jun |first5=Heewoo |last6=Kaiser |first6=Lukasz |last7=Plappert |first7=Matthias |last8=Tworek |first8=Jerry |last9=Hilton |first9=Jacob |title=Training Verifiers to Solve Math Word Problems |date=2021-11-18 |eprint=2110.14168 |class=cs.LG}}</ref> For other kinds of tasks, like creative writing, where quality is not simply true or false, one can train a reward model by fine-tuning a base model on human [[Ranking (statistics)|ranked preference]] data, as is done in [[reinforcement learning from human feedback]].<ref name=":1">{{cite journal |last1=Lightman |first1=Hunter |last2=Kosaraju |first2=Vineet |last3=Burda |first3=Yura |last4=Edwards |first4=Harri |last5=Baker |first5=Bowen |last6=Lee |first6=Teddy |last7=Leike |first7=Jan |last8=Schulman |first8=John |last9=Sutskever |first9=Ilya |date=2024 |title=Let's Verify Step by Step |url=https://openreview.net/forum?id=dKDGgN0eTg |journal=International Conference on Learning Representations (ICLR 2024) |access-date=2025-07-26 |arxiv=2305.20050}}</ref> A base model can also be fine-tuned to predict, from a partial thinking trace <math>x, y_1, \dots, y_m</math>, whether the final answer will be correct, and this prediction can serve as a binary reward signal.<ref name=":0" />
 
The ORM is usually trained with [[logistic regression]], i.e. by minimizing [[Cross-entropy|cross-entropy loss]].<ref name=":3" />
 
Given a PRM, an ORM can be constructed by multiplying the process rewards along the reasoning trace,<ref name=":1" /> by taking their minimum,<ref name=":3" /> or by other ways of aggregating the process rewards. DeepSeek used a simple ORM to train the [[DeepSeek (chatbot)|R1 model]].<ref name=":9"/>
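
A minimal sketch of these aggregation choices in Python, assuming the per-step rewards have already been computed by a PRM (illustrative only, not a specific library API):

<syntaxhighlight lang="python">
import math

def orm_from_prm(step_rewards, mode="product"):
    """Aggregate per-step process rewards into a single outcome-style reward."""
    if mode == "product":
        return math.prod(step_rewards)  # product of the step rewards along the trace
    if mode == "min":
        return min(step_rewards)        # the trace is only as good as its weakest step
    raise ValueError(f"unknown aggregation mode: {mode}")
</syntaxhighlight>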
 
=== Process reward model ===
{{Anchor|Process Reward Model|PRM}}
 
A process reward model, or process-supervised RM (PRM),<ref name=":0" /> is a reward model that gives the reward for a step <math>r(x, y_1, \dots, y_i)</math> based only on the steps so far: <math>(x, y_1, \dots, y_i)</math>.
 
Given a partial thinking trace <math>x, y_1, \dots, y_m</math>, a human can judge whether the steps ''so far'' are correct, without looking at the final answer. This yields a binary reward signal. Because human labels are costly, a base model can be fine-tuned to predict them.<ref name=":0" /> The PRM is usually trained with [[logistic regression]] on the human labels, i.e. by minimizing the [[Cross-entropy|cross-entropy loss]] between the true and predicted labels.<ref name=":3" />
 
As an example, in a 2023 OpenAI paper, 800K process labels were collected for 75K thinking traces. A labeler saw a solution trace and marked each step as "positive" if it progressed toward a solution, "neutral" if it was not wrong but did not help, and "negative" if it was a mistake. After entering the first "negative" label, the labeler stopped labeling that trace and moved to another one. The authors argued that labeling up to the first error was sufficient to train a capable PRM, even though labeling later steps could provide richer supervision signals.<ref name=":1" /><ref>{{cite web |title=openai/prm800k |url=https://github.com/openai/prm800k |website=GitHub |publisher=OpenAI |date=2025-01-27 |access-date=2025-01-27}}</ref>
 
To avoid expensive human labels, researchers have proposed methods to construct PRMs without human labels on the processes. Inspired by [[Monte Carlo tree search]] (MCTS), the Math-Shepherd method samples multiple continuations until the end, starting at each reasoning step <math>y_i</math>, and sets the reward at that step to be either <math>\frac{\#\text{(correct answers)}}{\#\text{(total answers)}}</math> in the case of "soft estimation", or
<math>\begin{cases}
1 & \text{if one of the answers is correct}\\
0 & \text{else}
\end{cases}</math> in the case of "hard estimation". This creates process rewards from an ORM, which is often easier or cheaper to construct. A PRM can then be trained on these labels.<ref name=":3">{{cite journal |last1=Wang |first1=Peiyi |last2=Li |first2=Lei |last3=Shao |first3=Zhihong |last4=Xu |first4=Runxin |last5=Dai |first5=Damai |last6=Li |first6=Yifei |last7=Chen |first7=Deli |last8=Wu |first8=Yu |last9=Sui |first9=Zhifang |date=August 2024 |editor-last=Ku |editor-first=Lun-Wei |editor2-last=Martins |editor2-first=Andre |editor3-last=Srikumar |editor3-first=Vivek |title=Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations |url=https://aclanthology.org/2024.acl-long.510/ |journal=Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |___location=Bangkok, Thailand |publisher=Association for Computational Linguistics |pages=9426–9439 |doi=10.18653/v1/2024.acl-long.510 |arxiv=2312.08935}}</ref> Some work has tried a fully MCTS approach.<ref>{{cite arXiv |last1=Chen |first1=Guoxin |last2=Liao |first2=Minpeng |last3=Li |first3=Chengxi |last4=Fan |first4=Kai |title=AlphaMath Almost Zero: Process Supervision without Process |date=2024-09-27 |eprint=2405.03553 |class=cs.LG}}</ref>
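
A sketch of the two estimators, where <code>rollout</code> (completes a partial trace with the policy model) and <code>is_correct</code> (checks the final answer) are hypothetical placeholders:

<syntaxhighlight lang="python">
def math_shepherd_step_reward(prompt, partial_trace, rollout, is_correct, n=8, mode="soft"):
    """Monte Carlo estimate of the process reward for the last step of a partial trace."""
    outcomes = [is_correct(prompt, rollout(prompt, partial_trace)) for _ in range(n)]
    if mode == "soft":
        return sum(outcomes) / n      # fraction of completions reaching a correct answer
    return 1 if any(outcomes) else 0  # "hard" estimation: 1 if any completion is correct
</syntaxhighlight>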
 
One can also use an ORM to implicitly construct a PRM, similar to [[direct preference optimization]].<ref>{{cite arXiv |last1=Yuan |first1=Lifan |last2=Li |first2=Wendi |last3=Chen |first3=Huayu |last4=Cui |first4=Ganqu |last5=Ding |first5=Ning |last6=Zhang |first6=Kaiyan |last7=Zhou |first7=Bowen |last8=Liu |first8=Zhiyuan |last9=Peng |first9=Hao |title=Free Process Rewards without Process Labels |date=2024-12-02 |eprint=2412.01981 |class=cs.CL}}</ref>
 
=== Guided sampling ===
A trained ORM can be used to pick the best response. The policy generates several responses, and the ORM selects the best one. This implements a simple form of [[Neural scaling law|test-time compute scaling]] ("best-of-N").<ref name=":2" /><ref>{{cite arXiv |last1=Zhang |first1=Di |last2=Wu |first2=Jianbo |last3=Lei |first3=Jingdi |last4=Che |first4=Tong |last5=Li |first5=Jiatong |last6=Xie |first6=Tong |last7=Huang |first7=Xiaoshui |last8=Zhang |first8=Shufei |last9=Pavone |first9=Marco |title=LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning |date=2024-11-21 |eprint=2410.02884 |class=cs.CL}}</ref>
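
A minimal best-of-N sketch, with <code>generate</code> and <code>orm</code> standing in for the policy model's sampler and the trained outcome reward model:

<syntaxhighlight lang="python">
def best_of_n(prompt, generate, orm, n=16):
    """Sample n candidate responses and return the one the ORM scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: orm(prompt, response))
</syntaxhighlight>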
 
A trained PRM can also be used to guide reasoning by a greedy [[Tree traversal|tree search]]: the policy model proposes several possible next reasoning steps, the PRM picks one, and the process repeats. This mirrors how a trained ORM is used to pick a whole response.<ref>{{cite arXiv |last1=Ma |first1=Qianli |last2=Zhou |first2=Haotian |last3=Liu |first3=Tingkai |last4=Yuan |first4=Jianbo |last5=Liu |first5=Pengfei |last6=You |first6=Yang |last7=Yang |first7=Hongxia |title=Let's reward step by step: Step-Level reward model as the Navigators for Reasoning |date=2023-10-16 |eprint=2310.10080 |class=cs.CL}}</ref> [[Beam search]] performs better than greedy search.
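
A sketch of the greedy variant, with <code>propose_steps</code>, <code>prm</code>, and <code>is_finished</code> as hypothetical placeholders for the policy model's step proposer, the process reward model, and a termination check:

<syntaxhighlight lang="python">
def greedy_prm_search(prompt, propose_steps, prm, is_finished, max_steps=32, k=8):
    """Grow a reasoning trace one step at a time, always keeping the PRM's top-scored step."""
    trace = []
    for _ in range(max_steps):
        candidates = propose_steps(prompt, trace, k)  # k possible next reasoning steps
        if not candidates:
            break
        best_step = max(candidates, key=lambda step: prm(prompt, trace + [step]))
        trace.append(best_step)
        if is_finished(trace):
            break
    return trace
</syntaxhighlight>

Beam search generalizes this by keeping the top few partial traces at each depth instead of a single one.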
 
''Lookahead search'' is another tree search method. The policy model proposes several possible next reasoning steps, then makes a short rollout for each. If a solution endpoint is found during a rollout, the search stops early. Otherwise, the PRM scores each rollout, and the step with the highest score is chosen.<ref name=":7"/>
 
''Self-consistency'' can be combined with an ORM. The model generates multiple answers, and the answers are clustered so that each cluster has the same final answer. The ORM scores each answer, the rewards within each cluster are summed, and the answer from the highest-scoring cluster is returned.<ref name=":3" />
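
A sketch of ORM-weighted self-consistency, with <code>generate</code>, <code>final_answer</code>, and <code>orm</code> as placeholders for the policy sampler, an answer extractor, and the outcome reward model:

<syntaxhighlight lang="python">
from collections import defaultdict

def orm_weighted_self_consistency(prompt, generate, final_answer, orm, n=32):
    """Cluster sampled responses by final answer and return the answer of the
    cluster with the highest summed ORM reward."""
    cluster_scores = defaultdict(float)
    for _ in range(n):
        response = generate(prompt)
        cluster_scores[final_answer(response)] += orm(prompt, response)
    return max(cluster_scores, key=cluster_scores.get)
</syntaxhighlight>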
 
== Benchmarks ==
Reasoning models generally score higher than non-reasoning models on many benchmarks, especially on tasks requiring multi-step reasoning.
 
Some benchmarks exclude reasoning models because their responses take longer and cost more.<ref>{{cite book |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |chapter=Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study |title=Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things |publisher=ACM |date=2025 |pages=1–6 |doi=10.1145/3722565.3727198 |arxiv=2503.12282 |isbn=979-8-4007-1608-9 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? A Research Note |date=2025-02-13 |eprint=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |eprint=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |eprint=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>
 
=== Humanity's Last Exam ===
The [[Humanity's Last Exam|HLE]] benchmark tests expert-level reasoning across mathematics, humanities, and the natural sciences, and shows large performance gaps between models. State-of-the-art reasoning models score low on HLE, leaving substantial room for improvement. For example, the full reasoning model [[OpenAI o3|o3]] reached 26.6%,<ref>{{cite web |last=McKenna |first=Greg |title=OpenAI's deep research can complete 26% of Humanity's Last Exam |url=https://fortune.com/2025/02/12/openai-deepresearch-humanity-last-exam/ |access-date=2025-03-16 |website=Fortune |language=en}}</ref> while the lighter o3-mini-high (evaluated on text-only questions) reached 13%.<ref>{{cite web |title=Humanity's Last Exam leaderboard |url=https://agi.safe.ai/benchmarks/hle |website=Safe.ai |publisher=Center for AI Safety |access-date=2025-07-26}}</ref>
 
=== AIME ===
On the [[American Invitational Mathematics Examination]] (AIME), a difficult mathematics competition, non-reasoning models usually solve under 30% of problems. Models that use reasoning methods score between 50% and 80%.<ref name=":8"/><ref name=":9"/><ref name=":10"/> While [[OpenAI o1|OpenAI's o1]] maintained or slightly improved its accuracy from its reported 2024 results to its 2025 AIME results, o3-mini (high) reached a higher accuracy (80%) at a much lower cost (about 12 times cheaper).<ref name=":4">{{cite web |date=2025-01-31 |title=OpenAI o3-mini |url=https://openai.com/index/openai-o3-mini/ |access-date=2025-02-09 |website=OpenAI |language=en-US}}</ref>
 
=== o3-mini performance ===
According to OpenAI's January 2025 report on o3-mini, adjusting "reasoning effort" significantly affects performance, especially for [[STEM]] tasks. Moving from low to high reasoning effort raises accuracy on benchmarks like AIME 2024, GPQA Diamond, and [[Codeforces]], typically by 10–30%. With high reasoning effort, o3-mini (high) achieved 87.3% on AIME (different from the MathArena AIME benchmark results), 79.7% on GPQA Diamond, 2130 Elo on Codeforces, and 49.3 on SWE-bench Verified.<ref name=":4"/>
 
== Drawbacks ==
 
=== Computational cost ===
Reasoning models often need far more compute while answering than non-reasoning models. On the AIME benchmark, they were 10 to 74 times more expensive than their non-reasoning counterparts.<ref name=":1" />
 
=== Generation time ===
Because reasoning language models tend to produce verbose outputs, they take considerably longer than standard [[large language model]]s to generate a response.
 
== Models ==
 
=== [[OpenAI]] ===
* [[GPT-5]]
* [[OpenAI o4-mini|o4-mini]]
* [[OpenAI o3|o3 and o3-mini]]
 
=== [[Mistral AI]] ===
 
* Magistral (medium & small)
 
=== [[XAI (company)|xAI]] ===
* [[Grok (chatbot)#Grok 3|Grok 3]]
* [[Grok (chatbot)#Grok 4|Grok 4]]
 
=== [[Hugging Face]] ===
* OlympicCoder-7B and 32B, released as part of Open R1, a fully open reproduction of DeepSeek-R1.<ref>{{cite web |title=Open-R1: a fully open reproduction of DeepSeek-R1 |url=https://huggingface.co/blog/open-r1 |website=Hugging Face |date=2025-02-24 |access-date=2025-07-26}}</ref><ref>{{cite web |title=OlympicCoder-7B |url=https://huggingface.co/open-r1/OlympicCoder-7B |website=Hugging Face |date=2025-03-11 |access-date=2025-07-26}}</ref>
 
 
== See also ==