{{Short description|Language models designed for reasoning tasks}}{{Multiple issues|
{{unreliable sources|date=January 2025}}
{{Copy edit|for=jargon|date=May 2025}}
 
}}
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that are trained further to solve tasks that take several steps of [[reasoning]].<ref>{{cite arXiv |last1=Besta |first1=Maciej |last2=Barth |first2=Julia |last3=Schreiber |first3=Eric |last4=Kubicek |first4=Ales |last5=Catarino |first5=Afonso |last6=Gerstenberger |first6=Robert |last7=Nyczyk |first7=Piotr |last8=Iff |first8=Patrick |last9=Li |first9=Yueling |title=Reasoning Language Models: A Blueprint |date=2025-01-23 |eprint=2501.11223 |class=cs.CL}}</ref> They tend to do better on logic, math, and programming tasks than standard LLMs, can [[Backtracking|revisit and revise]] earlier steps, and make use of extra computation while answering as another way to [[Neural scaling law|scale performance]], alongside the number of training examples, parameters, and training compute.<ref name=":8">{{cite web |title=Learning to reason with LLMs |url=https://openai.com/index/learning-to-reason-with-llms/ |website=OpenAI |date=2024-09-12 |access-date=2025-07-26}}</ref>
 
== History ==
=== 2024 ===
In September 2024, [[OpenAI]] released [[OpenAI o1#release|o1-preview]], an LLM with enhanced reasoning.<ref>{{cite news |last1=Edwards |first1=Benj |date=2024-09-12 |title=OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini |url=https://arstechnica.com/information-technology/2024/09/openais-new-reasoning-ai-models-are-here-o1-preview-and-o1-mini/ |access-date=2025-02-06 |work=Ars Technica |language=en-US}}</ref> The full version, [[OpenAI o1|o1]], followed in December 2024. OpenAI also began sharing results on its successor, [[OpenAI o3|o3]].<ref>{{cite web |title=OpenAI o1 System Card |url=https://cdn.openai.com/o1-system-card.pdf |website=OpenAI |date=2024-12-05 |access-date=2025-07-26}}</ref><ref>{{cite news |last=Robison |first=Kylie |date=2024-12-05 |title=OpenAI launches ChatGPT Pro, a $200/month plan with unlimited access to o1, GPT-4o, and more |url=https://www.theverge.com/2024/12/5/24314147/openai-reasoning-model-o1-strawberry-chatgpt-pro-new-tier |access-date=2025-07-26 |work=The Verge}}</ref><ref>{{cite news |last=Singh |first=Jaspreet |date=2024-12-20 |title=OpenAI unveils 'o3' model, touting advances in reasoning |url=https://www.reuters.com/technology/artificial-intelligence/openai-unveils-o3-model-touting-advances-reasoning-2024-12-20/ |access-date=2025-07-26 |work=Reuters}}</ref>
 
The development of reasoning LLMs has illustrated what [[Richard S. Sutton|Rich Sutton]] called the "bitter lesson": that scaling compute often outperforms methods that rely on specific human insights.<ref>{{cite web |last1=Sutton |first1=Richard S. |title=The Bitter Lesson |url=http://www.incompleteideas.net/IncIdeas/BitterLesson.html |access-date=2025-02-27 |website=Incomplete Ideas}}</ref> For example, some research groups, such as the Generative AI Research Lab (GAIR), initially explored complex methods such as tree search and reinforcement learning in attempts to replicate o1's capabilities. In their "o1 Replication Journey" papers, they reported that [[knowledge distillation]] (training a smaller model to imitate o1's outputs) worked surprisingly well, highlighting the effectiveness of distillation in this context.<ref>{{cite arXiv |last1=Huang |first1=Zhen |last2=Zou |first2=Haoyang |last3=Li |first3=Xuefeng |last4=Liu |first4=Yixiu |last5=Zheng |first5=Yuxiang |last6=Chern |first6=Ethan |last7=Xia |first7=Shijie |last8=Qin |first8=Yiwei |last9=Yuan |first9=Weizhe |title=O1 Replication Journey — Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? |date=2024-11-25 |eprint=2411.16489 |class=cs.CL}}</ref><ref name=":6">{{cite news |last=Zeff |first=Maxwell |date=2025-02-05 |title=Researchers created an open rival to OpenAI's o1 'reasoning' model for under $50 |url=https://techcrunch.com/2025/02/05/researchers-created-an-open-rival-to-openais-o1-reasoning-model-for-under-50/ |access-date=2025-07-26 |work=TechCrunch}}</ref>
 
[[Alibaba Group|Alibaba]] also released reasoning versions of its [[Qwen]] LLMs in November 2024.<ref>{{cite web |title=QwQ-32B-Preview: Reflect Deeply on the Boundaries of the Unknown |url=https://qwenlm.github.io/blog/qwq-32b-preview/ |website=Qwen (Alibaba Cloud) |date=2024-11-28 |access-date=2025-07-26}}</ref>
In December 2024, the team introduced QvQ-72B-Preview, an experimental visual reasoning model.<ref>{{cite web |title=QVQ: To See the World with Wisdom |url=https://qwenlm.github.io/blog/qvq-72b-preview/ |website=Qwen |publisher=Alibaba Cloud |date=2024-12-25 |access-date=2025-07-26}}</ref>
 
In December 2024, Google introduced [[Gemini Deep Research|Deep Research]] in [[Gemini (chatbot)|Gemini]],<ref>{{cite web |date=2024-12-11 |title=Try Deep Research and our new experimental model in Gemini, your AI assistant |url=https://blog.google/products/gemini/google-gemini-deep-research/ |access-date=2025-02-05 |website=Google |language=en-US}}</ref> a feature that runs multi-step research tasks.<ref>{{cite news |last=Roth |first=Emma |date=2024-12-11 |title=Google built an AI tool that can do research for you |url=https://www.theverge.com/2024/12/11/24318217/google-gemini-advanced-deep-research-launch |access-date=2025-07-26 |work=The Verge}}</ref>
 
On December 16, 2024, an experiment with a [[Llama (language model)|Llama]] 3B model showed that by scaling test-time compute, a relatively small model could outperform a much larger Llama 70B model on challenging reasoning tasks. This result suggested that better inference strategies can unlock useful reasoning capabilities even in small models.<ref>{{cite web |title=Scaling test-time compute - a Hugging Face Space by HuggingFaceH4 |url=https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute |website=Hugging Face |date=2024-12-16 |access-date=2025-07-26}}</ref><ref name=":7">{{cite journal |last1=Snell |first1=Charlie |last2=Lee |first2=Jaehoon |last3=Xu |first3=Kelvin |last4=Kumar |first4=Aviral |date=2025 |title=Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters |url=https://openreview.net/forum?id=t4s3hJY9dH |journal=International Conference on Learning Representations (ICLR 2025) |access-date=2025-07-26 |arxiv=2408.03314}}</ref>
 
=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model with comparable performance to o1 at lower cost. The release demonstrated the effectiveness of [[Group Relative Policy Optimization]] (GRPO).<ref>{{cite news |last1=Orland |first1=Kyle |date=2025-01-28 |title=How does DeepSeek R1 really fare against OpenAI's best reasoning models? |url=https://arstechnica.com/ai/2025/01/how-does-deepseek-r1-really-fare-against-openais-best-reasoning-models/ |access-date=2025-02-06 |work=Ars Technica}}</ref><ref name=":9">{{cite arXiv |last1=DeepSeek-AI |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |eprint=2501.12948 |class=cs.CL}}</ref> On January 25, 2025, DeepSeek added a feature to DeepSeek R1 that lets the model search the web while it reasons, making it easier to combine retrieval with reflective reasoning.<ref>{{cite news |script-title=zh:DeepSeek支持"深度思考+联网检索"能力 |trans-title=DeepSeek adds a search feature supporting simultaneous deep thinking and web search |work=People's Daily Online |date=2025-01-29 |url=http://tech.people.com.cn/n1/2025/0129/c1007-40386565.html |language=zh |access-date=2025-07-26}}</ref> OpenAI subsequently released o3-mini, followed by [[ChatGPT Deep Research|Deep Research]] based on [[OpenAI o3|o3]]. The effectiveness of distillation for reasoning models was shown in works such as s1-32B, which achieved strong performance through budget forcing and scaling methods.<ref name=":10">{{cite arXiv |last1=Muennighoff |first1=Niklas |last2=Yang |first2=Zitong |last3=Shi |first3=Weijia |last4=Li |first4=Xiang Lisa |last5=Fei-Fei |first5=Li |last6=Hajishirzi |first6=Hannaneh |last7=Zettlemoyer |first7=Luke |last8=Liang |first8=Percy |last9=Candès |first9=Emmanuel |title=s1: Simple test-time scaling |date=2025-02-03 |eprint=2501.19393 |class=cs.CL}}</ref><ref name=":6"/>
 
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]] based on their [[OpenAI o3|o3]] model,<ref name=":5">{{cite web |date=2025-02-02 |title=Introducing deep research |url=https://openai.com/index/introducing-deep-research/ |access-date=2025-02-05 |website=OpenAI |language=en-US}}</ref> a tool that integrates reasoning and web search in a unified workflow, allowing users to initiate complex research tasks and generate comprehensive reports that incorporate various sources from the web.<ref>{{cite web |last=Ha |first=Anthony |date=2025-02-03 |title=OpenAI unveils a new ChatGPT agent for 'deep research' |url=https://techcrunch.com/2025/02/02/openai-unveils-a-new-chatgpt-agent-for-deep-research/ |access-date=2025-02-06 |website=TechCrunch |language=en-US}}</ref>
 
== Supervised finetuning ==
A [[large language model]] (LLM) can be fine-tuned on a dataset of reasoning tasks paired with example solutions and step-by-step (reasoning) traces. The fine-tuned model can then produce its own reasoning traces for new problems.<ref name=":0">{{cite arXiv |last1=Uesato |first1=Jonathan |last2=Kushman |first2=Nate |last3=Kumar |first3=Ramana |last4=Song |first4=Francis |last5=Siegel |first5=Noah |last6=Wang |first6=Lisa |last7=Creswell |first7=Antonia |last8=Irving |first8=Geoffrey |last9=Higgins |first9=Irina |title=Solving math word problems with process- and outcome-based feedback |date=2022-11-25 |eprint=2211.14275 |class=cs.LG}}</ref><ref name=":2" />
 
Because human-written reasoning traces are costly to collect, researchers have proposed ways to build such SFT datasets automatically. In ''rejection sampling finetuning'' (RFT), new reasoning traces are gathered in a loop (a code sketch follows the list):<ref>{{cite arXiv |last1=Yuan |first1=Zheng |last2=Yuan |first2=Hongyi |last3=Li |first3=Chengpeng |last4=Dong |first4=Guanting |last5=Lu |first5=Keming |last6=Tan |first6=Chuanqi |last7=Zhou |first7=Chang |last8=Zhou |first8=Jingren |title=Scaling Relationship on Learning Mathematical Reasoning with Large Language Models |date=2023-09-13 |eprint=2308.01825 |class=cs.CL}}</ref>
# Sample a task prompt.
# Generate many reasoning traces for the prompt.
# Use a verifier to remove reasoning traces with a wrong final answer.
# For each remaining trace, extract the set of equations appearing in it, deduplicate the traces so that each one has a different set of equations, and add them to the dataset.
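
A minimal sketch of this loop in Python, where <code>generate</code> (samples one reasoning trace from the policy model) and <code>check_answer</code> (a task-specific verifier) are hypothetical placeholders rather than any particular library's API:

<syntaxhighlight lang="python">
import re

def build_rft_dataset(prompts, generate, check_answer, samples_per_prompt=16):
    """Sketch of rejection sampling finetuning (RFT) dataset construction."""
    dataset = []
    for prompt in prompts:
        seen_equation_sets = set()
        for _ in range(samples_per_prompt):
            trace = generate(prompt)              # sample one reasoning trace
            if not check_answer(prompt, trace):   # keep only traces with a correct final answer
                continue
            # Crude equation extraction, used to deduplicate traces by their set of equations.
            equations = frozenset(re.findall(r"\S+ ?= ?\S+", trace))
            if equations in seen_equation_sets:
                continue
            seen_equation_sets.add(equations)
            dataset.append({"prompt": prompt, "trace": trace})
    return dataset
</syntaxhighlight>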
 
== Reinforcement learning ==
A pretrained language model can be further trained with RL. In the RL formalism, a generative language model is a '''policy''' <math>\pi</math>. A prompt is an environmental '''state''' <math>x</math>, and the language model's response is an '''action''' <math>y</math>. The probability that the language model responds to <math>x</math> with <math>y</math> is <math>\pi(y|x)</math>.
 
Training a reasoning language model with RL then means constructing a '''reward model''' <math>r(x, y)</math> to guide the RL process. Intuitively, the reward model says how good a response is for a given prompt. For a reasoning language model, the prompt describes a reasoning task, and the reward is high if the response solves the task and low if it does not.
 
A response <math>y</math> may be broken down into multiple steps, in which case it is written as <math>y_1, y_2, \dots, y_n</math>.
 
Most recent systems use policy-gradient methods such as [[Proximal Policy Optimization]] (PPO) because PPO constrains each policy update with a clipped objective, which stabilises training for very large policies.<ref name="OpenAIAlign2022">{{cite web |title=Aligning language models to follow instructions |website=OpenAI Blog |url=https://openai.com/blog/instruction-following/ |date=2022-01-27 |access-date=2025-05-04}}</ref>
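
In this notation, PPO's clipped surrogate objective takes its standard form, written here per generated token <math>y_t</math>:
<math display="block">L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_\text{old}}(y_t \mid x, y_{<t})},</math>
where <math>\hat{A}_t</math> is an advantage estimate derived from the reward model and <math>\epsilon</math> is the clipping parameter.
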
=== Outcome reward model ===
{{Anchor|Outcome Reward Model|ORM}}
 
An outcome reward model, or outcome-supervised RM (ORM),<ref name=":0" /> is a reward model that gives the reward for a step <math>r(x, y_1, \dots, y_i)</math> based on the final answer: <math>r(x, y_1, \dots, y_i) = r(x, y_n)</math>. Such models are often called "verifiers".
 
For tasks with answers that are easy to verify, such as [[Word problem (mathematics education)|math word problems]], the outcome reward can simply be binary: 1 if the final answer is correct, and 0 otherwise.<ref name=":0" /> If automatic verification is hard, humans can manually label answers as correct or not, and those labels can be used to fine-tune a base model that predicts the human label.<ref name=":2">{{cite arXiv |last1=Cobbe |first1=Karl |last2=Kosaraju |first2=Vineet |last3=Bavarian |first3=Mohammad |last4=Chen |first4=Mark |last5=Jun |first5=Heewoo |last6=Kaiser |first6=Lukasz |last7=Plappert |first7=Matthias |last8=Tworek |first8=Jerry |last9=Hilton |first9=Jacob |title=Training Verifiers to Solve Math Word Problems |date=2021-11-18 |eprint=2110.14168 |class=cs.LG}}</ref> For other kinds of tasks, like creative writing, where quality is not simply true or false, one can train a reward model by fine-tuning a base model on human [[Ranking (statistics)|ranked preference]] data, as is done in [[reinforcement learning from human feedback]].<ref name=":1">{{cite journal |last1=Lightman |first1=Hunter |last2=Kosaraju |first2=Vineet |last3=Burda |first3=Yura |last4=Edwards |first4=Harri |last5=Baker |first5=Bowen |last6=Lee |first6=Teddy |last7=Leike |first7=Jan |last8=Schulman |first8=John |last9=Sutskever |first9=Ilya |date=2024 |title=Let's Verify Step by Step |url=https://openreview.net/forum?id=dKDGgN0eTg |journal=International Conference on Learning Representations (ICLR 2024) |access-date=2025-07-26 |arxiv=2305.20050}}</ref> A base model can also be fine-tuned to predict, from a partial thinking trace <math>x, y_1, \dots, y_m</math>, whether the final answer will be correct, and this prediction can serve as a binary reward signal.<ref name=":0" />
 
The ORM is usually trained with [[logistic regression]], i.e. by minimizing [[Cross-entropy|cross-entropy loss]].<ref name=":3" />
 
Given a PRM, an ORM can be constructed by multiplying the process rewards along the reasoning trace,<ref name=":1" /> by taking their minimum,<ref name=":3" /> or by other ways of aggregating the process rewards. DeepSeek used a simple ORM to train the [[DeepSeek (chatbot)|R1 model]].<ref name=":9"/>
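
A minimal sketch of these aggregation choices in Python, assuming the per-step rewards have already been computed by a PRM (illustrative only, not a specific library API):

<syntaxhighlight lang="python">
import math

def orm_from_prm(step_rewards, mode="product"):
    """Aggregate per-step process rewards into a single outcome-style reward."""
    if mode == "product":
        return math.prod(step_rewards)  # product of the step rewards along the trace
    if mode == "min":
        return min(step_rewards)        # the trace is only as good as its weakest step
    raise ValueError(f"unknown aggregation mode: {mode}")
</syntaxhighlight>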
 
=== Process reward model ===
{{Anchor|Process Reward Model|PRM}}
 
A process reward model, or process-supervised RM (PRM),<ref name=":0" /> is a reward model that gives the reward for a step <math>r(x, y_1, \dots, y_i)</math> based only on the steps so far: <math>(x, y_1, \dots, y_i)</math>.
 
Given a partial thinking trace <math>x, y_1, \dots, y_m</math>, a human can judge whether the steps ''so far'' are correct, without looking at the final answer. This yields a binary reward signal. Because human labels are costly, a base model can be fine-tuned to predict them.<ref name=":0" /> The PRM is usually trained with [[logistic regression]] on the human labels, i.e. by minimizing the [[Cross-entropy|cross-entropy loss]] between the true and predicted labels.<ref name=":3" />
 
As an example, in a 2023 OpenAI paper, 800K process labels were collected for 75K thinking traces. A labeler saw a solution trace and marked each step as "positive" if it progressed toward a solution, "neutral" if it was not wrong but did not help, and "negative" if it was a mistake. After entering the first "negative" label, the labeler stopped labeling that trace and moved to another one. The authors argued that labeling up to the first error was sufficient to train a capable PRM, even though labeling later steps could provide richer supervision signals.<ref name=":1" /><ref>{{cite web |title=openai/prm800k |url=https://github.com/openai/prm800k |website=GitHub |publisher=OpenAI |date=2025-01-27 |access-date=2025-01-27}}</ref>
 
To avoid expensive human labels, researchers have proposed methods to construct PRMs without human labels on the processes. Inspired by [[Monte Carlo tree search]] (MCTS), the Math-Shepherd method samples multiple continuations until the end, starting at each reasoning step <math>y_i</math>, and sets the reward at that step to be either <math>\frac{\#\text{(correct answers)}}{\#\text{(total answers)}}</math> in the case of "soft estimation", or
<math>\begin{cases}
1 & \text{if one of the answers is correct}\\
0 & \text{else}
\end{cases}</math> in the case of "hard estimation". This creates process rewards from an ORM, which is often easier or cheaper to construct. A PRM can then be trained on these labels.<ref name=":3">{{cite journal |last1=Wang |first1=Peiyi |last2=Li |first2=Lei |last3=Shao |first3=Zhihong |last4=Xu |first4=Runxin |last5=Dai |first5=Damai |last6=Li |first6=Yifei |last7=Chen |first7=Deli |last8=Wu |first8=Yu |last9=Sui |first9=Zhifang |date=August 2024 |editor-last=Ku |editor-first=Lun-Wei |editor2-last=Martins |editor2-first=Andre |editor3-last=Srikumar |editor3-first=Vivek |title=Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations |url=https://aclanthology.org/2024.acl-long.510/ |journal=Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |___location=Bangkok, Thailand |publisher=Association for Computational Linguistics |pages=9426–9439 |doi=10.18653/v1/2024.acl-long.510 |arxiv=2312.08935}}</ref> Some work has tried a fully MCTS approach.<ref>{{cite arXiv |last1=Chen |first1=Guoxin |last2=Liao |first2=Minpeng |last3=Li |first3=Chengxi |last4=Fan |first4=Kai |title=AlphaMath Almost Zero: Process Supervision without Process |date=2024-09-27 |eprint=2405.03553 |class=cs.LG}}</ref>
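
A sketch of the two estimators, where <code>rollout</code> (completes a partial trace with the policy model) and <code>is_correct</code> (checks the final answer) are hypothetical placeholders:

<syntaxhighlight lang="python">
def math_shepherd_step_reward(prompt, partial_trace, rollout, is_correct, n=8, mode="soft"):
    """Monte Carlo estimate of the process reward for the last step of a partial trace."""
    outcomes = [is_correct(prompt, rollout(prompt, partial_trace)) for _ in range(n)]
    if mode == "soft":
        return sum(outcomes) / n      # fraction of completions reaching a correct answer
    return 1 if any(outcomes) else 0  # "hard" estimation: 1 if any completion is correct
</syntaxhighlight>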
 
One can also use an ORM to implicitly construct a PRM, similar to [[direct preference optimization]].<ref>{{cite arXiv |last1=Yuan |first1=Lifan |last2=Li |first2=Wendi |last3=Chen |first3=Huayu |last4=Cui |first4=Ganqu |last5=Ding |first5=Ning |last6=Zhang |first6=Kaiyan |last7=Zhou |first7=Bowen |last8=Liu |first8=Zhiyuan |last9=Peng |first9=Hao |title=Free Process Rewards without Process Labels |date=2024-12-02 |eprint=2412.01981 |class=cs.CL}}</ref>
 
=== Guided sampling ===
A trained ORM can be used to pick the best response. The policy generates several responses, and the ORM selects the best one. This implements a simple form of [[Neural scaling law|test-time compute scaling]] ("best-of-N").<ref name=":2" /><ref>{{cite arXiv |last1=Zhang |first1=Di |last2=Wu |first2=Jianbo |last3=Lei |first3=Jingdi |last4=Che |first4=Tong |last5=Li |first5=Jiatong |last6=Xie |first6=Tong |last7=Huang |first7=Xiaoshui |last8=Zhang |first8=Shufei |last9=Pavone |first9=Marco |title=LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning |date=2024-11-21 |eprint=2410.02884 |class=cs.CL}}</ref>
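
A minimal best-of-N sketch, with <code>generate</code> and <code>orm</code> standing in for the policy model's sampler and the trained outcome reward model:

<syntaxhighlight lang="python">
def best_of_n(prompt, generate, orm, n=16):
    """Sample n candidate responses and return the one the ORM scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: orm(prompt, response))
</syntaxhighlight>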
 
A trained PRM can also be used to guide reasoning by a greedy [[Tree traversal|tree search]]: the policy model proposes several possible next reasoning steps, the PRM picks one, and the process repeats. This mirrors how a trained ORM is used to pick a whole response.<ref>{{cite arXiv |last1=Ma |first1=Qianli |last2=Zhou |first2=Haotian |last3=Liu |first3=Tingkai |last4=Yuan |first4=Jianbo |last5=Liu |first5=Pengfei |last6=You |first6=Yang |last7=Yang |first7=Hongxia |title=Let's reward step by step: Step-Level reward model as the Navigators for Reasoning |date=2023-10-16 |eprint=2310.10080 |class=cs.CL}}</ref> [[Beam search]] performs better than greedy search.
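
A sketch of the greedy variant, with <code>propose_steps</code>, <code>prm</code>, and <code>is_finished</code> as hypothetical placeholders for the policy model's step proposer, the process reward model, and a termination check:

<syntaxhighlight lang="python">
def greedy_prm_search(prompt, propose_steps, prm, is_finished, max_steps=32, k=8):
    """Grow a reasoning trace one step at a time, always keeping the PRM's top-scored step."""
    trace = []
    for _ in range(max_steps):
        candidates = propose_steps(prompt, trace, k)  # k possible next reasoning steps
        if not candidates:
            break
        best_step = max(candidates, key=lambda step: prm(prompt, trace + [step]))
        trace.append(best_step)
        if is_finished(trace):
            break
    return trace
</syntaxhighlight>

Beam search generalizes this by keeping the top few partial traces at each depth instead of a single one.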
 
''Lookahead search'' is another tree search method. The policy model proposes several possible next reasoning steps, then makes a short rollout for each. If a solution endpoint is found during a rollout, the search stops early. Otherwise, the PRM scores each rollout, and the step with the highest score is chosen.<ref name=":7"/>
 
''Self-consistency'' can be combined with an ORM. The model generates multiple answers, and the answers are clustered so that each cluster has the same final answer. The ORM scores each answer, the rewards within each cluster are summed, and the answer from the highest-scoring cluster is returned.<ref name=":3" />
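
A sketch of ORM-weighted self-consistency, with <code>generate</code>, <code>final_answer</code>, and <code>orm</code> as placeholders for the policy sampler, an answer extractor, and the outcome reward model:

<syntaxhighlight lang="python">
from collections import defaultdict

def orm_weighted_self_consistency(prompt, generate, final_answer, orm, n=32):
    """Cluster sampled responses by final answer and return the answer of the
    cluster with the highest summed ORM reward."""
    cluster_scores = defaultdict(float)
    for _ in range(n):
        response = generate(prompt)
        cluster_scores[final_answer(response)] += orm(prompt, response)
    return max(cluster_scores, key=cluster_scores.get)
</syntaxhighlight>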
 
== Benchmarks ==
Reasoning models generally score higher than non-reasoning models on many benchmarks, especially on tasks requiring multi-step reasoning.
 
Some benchmarks exclude reasoning models because their responses take longer and cost more.<ref>{{cite book |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |chapter=Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study |title=Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things |publisher=ACM |date=2025 |pages=1–6 |doi=10.1145/3722565.3727198 |arxiv=2503.12282 |isbn=979-8-4007-1608-9 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? A Research Note |date=2025-02-13 |eprint=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |eprint=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |eprint=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>
 
=== Humanity's Last Exam ===
The [[Humanity's Last Exam|HLE]] benchmark tests expert-level reasoning across mathematics, humanities, and the natural sciences, and shows large performance gaps between models. State-of-the-art reasoning models score low on HLE, leaving substantial room for improvement. For example, the full reasoning model [[OpenAI o3|o3]] reached 26.6%,<ref>{{cite web |last=McKenna |first=Greg |title=OpenAI's deep research can complete 26% of Humanity's Last Exam |url=https://fortune.com/2025/02/12/openai-deepresearch-humanity-last-exam/ |access-date=2025-03-16 |website=Fortune |language=en}}</ref> while the lighter o3-mini-high (evaluated on text-only questions) reached 13%.<ref>{{cite web |title=Humanity's Last Exam leaderboard |url=https://agi.safe.ai/benchmarks/hle |website=Safe.ai |publisher=Center for AI Safety |access-date=2025-07-26}}</ref>
 
=== AIME ===
On the [[American Invitational Mathematics Examination]] (AIME), a difficult mathematics competition, non-reasoning models usually solve under 30% of problems. Models that use reasoning methods score between 50% and 80%.<ref name=":8"/><ref name=":9"/><ref name=":10"/> While [[OpenAI o1|OpenAI's o1]] maintained or slightly improved its accuracy from its reported 2024 results to its 2025 AIME results, o3-mini (high) reached a higher accuracy (80%) at a much lower cost (about 12 times cheaper).<ref name=":4">{{cite web |date=2025-01-31 |title=OpenAI o3-mini |url=https://openai.com/index/openai-o3-mini/ |access-date=2025-02-09 |website=OpenAI |language=en-US}}</ref>
 
=== o3-mini performance ===
According to OpenAI's January 2025 report on o3-mini, adjusting "reasoning effort" significantly affects performance, especially for [[STEM]] tasks. Moving from low to high reasoning effort raises accuracy on benchmarks like AIME 2024, GPQA Diamond, and [[Codeforces]], typically by 10–30%. With high reasoning effort, o3-mini (high) achieved 87.3% on AIME (different from the MathArena AIME benchmark results), 79.7% on GPQA Diamond, 2130 Elo on Codeforces, and 49.3 on SWE-bench Verified.<ref name=":4"/>
 
== Drawbacks ==
 
=== Computational cost ===
Reasoning models often need far more compute while answering than non-reasoning models. On the AIME benchmark, they were 10 to 74 times more expensive than their non-reasoning counterparts.<ref name=":1" />
 
=== Generation time ===
Because reasoning language models tend to produce verbose outputs, they take considerably longer than standard [[large language model]]s to generate a response.
 
== Models ==
 
=== [[OpenAI]] ===
* [[GPT-5]]
* [[OpenAI o4-mini|o4-mini]]
* [[OpenAI o3|o3 and o3-mini]]
 
=== [[Mistral AI]] ===
 
* Magistral (medium & small)
 
=== [[XAI (company)|xAI]] ===
* [[Grok (chatbot)#Grok 3|Grok 3]]
* [[Grok (chatbot)#Grok 4|Grok 4]]
 
=== [[Hugging Face]] ===
* OlympicCoder-7B and 32B, released as part of Open R1, a fully open reproduction of DeepSeek-R1.<ref>{{cite web |title=Open-R1: a fully open reproduction of DeepSeek-R1 |url=https://huggingface.co/blog/open-r1 |website=Hugging Face |date=2025-02-24 |access-date=2025-07-26}}</ref><ref>{{cite web |title=OlympicCoder-7B |url=https://huggingface.co/open-r1/OlympicCoder-7B |website=Hugging Face |date=2025-03-11 |access-date=2025-07-26}}</ref>
 
 
== See also ==