{{Short description|Language models designed for reasoning tasks}}
{{Copy edit|for=jargon|date=May 2025}}
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that are trained to solve tasks requiring several steps of reasoning, typically by generating an explicit chain of intermediate steps before giving a final answer. They tend to perform better than standard LLMs on logical, mathematical, and programming tasks, and they use additional computation at inference time as a further scaling axis.
== History ==
=== 2024 ===
In September 2024, [[OpenAI]] released [[OpenAI o1#release|o1-preview]], an LLM with enhanced reasoning.<ref>{{cite news |last1=Edwards |first1=Benj |date=2024-09-12 |title=OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini |url=https://arstechnica.com/information-technology/2024/09/openais-new-reasoning-ai-models-are-here-o1-preview-and-o1-mini/ |access-date=2025-02-06 |work=Ars Technica |language=en-US}}</ref> The full version, [[OpenAI o1|o1]], followed in December 2024. OpenAI also began sharing results on its successor, [[OpenAI o3|o3]].<ref>{{cite web |title=OpenAI o1 System Card |url=https://cdn.openai.com/o1-system-card.pdf |website=OpenAI |date=2024-12-05 |access-date=2025-07-26}}</ref><ref>{{cite news |last=Robison |first=Kylie |date=2024-12-05 |title=OpenAI launches ChatGPT Pro, a $200/month plan with unlimited access to o1 |work=The Verge}}</ref>
The development of reasoning LLMs has illustrated what [[Richard S. Sutton|Rich Sutton]] called the "[[bitter lesson]]": that general methods which scale with computation tend to outperform methods that rely on hand-built human domain knowledge.

In November 2024, [[Alibaba Group|Alibaba]] released QwQ-32B-Preview, an experimental reasoning model in its [[Qwen]] series.
In December 2024, the team introduced QVQ-72B-Preview, an experimental visual reasoning model.<ref>{{cite web |title=QVQ: To See the World with Wisdom |url=https://qwenlm.github.io/blog/qvq-72b-preview/ |website=Qwen |publisher=Alibaba Cloud |date=2024-12-25 |access-date=2025-07-26}}</ref>
In December 2024, Google introduced [[Gemini Deep Research|Deep Research]] in [[Gemini (chatbot)|Gemini]],<ref>{{cite web |date=2024-12-11 |title=Try Deep Research and our new experimental model in Gemini, your AI assistant |url=https://blog.google/products/gemini/google-gemini-deep-research/ |access-date=2025-02-05 |website=Google |language=en-US}}</ref> a feature that autonomously browses the web on a user-specified topic and compiles its findings into a multi-step research report.
On December 16, 2024, an experiment with a [[Llama (language model)|Llama]] 3B model showed that, by scaling test-time compute, a relatively small model could outperform a much larger 70B model on challenging mathematics problems.
=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model with reasoning performance comparable to o1 at a much lower cost, accompanied by a technical report and openly released model weights.
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]] based on their [[OpenAI o3|o3]] model,<ref name=":5">{{cite web |date=2025-02-02 |title=Introducing deep research |url=https://openai.com/index/introducing-deep-research/ |access-date=2025-02-05 |website=OpenAI |language=en-US}}</ref> an agent that combines the model's reasoning with web browsing to produce cited, report-style answers to complex research questions.
== Supervised finetuning ==
A [[large language model]] (LLM) can be fine-tuned on a dataset of reasoning tasks paired with step-by-step solution traces, after which it generates similar reasoning traces for new problems. Because writing such traces by hand is expensive, they can also be collected automatically by a rejection-sampling loop:
# Sample a task prompt.
# Generate many reasoning traces for the prompt.
# Use a verifier to remove reasoning traces whose final answer is incorrect.
# Fine-tune the model on the remaining, verified traces (a minimal sketch of this loop is given below).
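In the sketch below, the callables <code>sample_traces</code>, <code>extract_answer</code>, and <code>finetune</code> are hypothetical placeholders for a model's sampling, answer-extraction, and training routines rather than the API of any particular library.

<syntaxhighlight lang="python">
from typing import Callable, Dict, List, Tuple

def rejection_sampling_finetuning(
    sample_traces: Callable[[str, int], List[str]],     # model sampling: prompt -> n reasoning traces
    extract_answer: Callable[[str], str],               # pulls the final answer out of a trace
    finetune: Callable[[List[Tuple[str, str]]], None],  # trains on (prompt, trace) pairs
    tasks: Dict[str, str],                               # task prompt -> reference answer
    n_samples: int = 16,
) -> None:
    """Sketch of rejection-sampling fine-tuning: keep only traces whose
    final answer matches the reference answer, then fine-tune on them."""
    kept: List[Tuple[str, str]] = []
    for prompt, reference in tasks.items():
        for trace in sample_traces(prompt, n_samples):   # steps 1-2: sample many traces
            if extract_answer(trace) == reference:       # step 3: verify the final answer
                kept.append((prompt, trace))
    finetune(kept)                                       # step 4: train on verified traces
</syntaxhighlight>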
== Reinforcement learning ==
A pretrained language model can be further trained with [[reinforcement learning]] (RL). In the RL formalism, a generative language model is a policy: a prompt is a state, the model's response is an action, and a reward model assigns a scalar reward to the response. The policy is updated to increase the expected reward.

Training a reasoning language model with RL therefore requires a reward model for reasoning traces, usually either an outcome reward model (ORM) or a process reward model (PRM).
Most recent systems use policy-gradient methods such as [[Proximal Policy Optimization]] (PPO) because PPO constrains each policy update with a clipped objective, which stabilises training for very large policies.<ref name="OpenAIAlign2022">{{cite web |title=Aligning language models to follow instructions |website=OpenAI Blog |url=https://openai.com/blog/instruction-following/ |date=2022-01-27 |access-date=2025-05-04}}</ref>
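In its standard form, the clipped surrogate objective maximized by PPO is

:<math>L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)},</math>

where <math>\hat{A}_t</math> is an advantage estimate and <math>\epsilon</math> is the clipping parameter. Production training objectives typically add further terms, such as a [[Kullback–Leibler divergence|KL]] penalty against a reference model.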
=== Outcome reward model ===
{{Anchor|Outcome Reward Model|ORM}}
For tasks whose answers are easy to verify, such as math [[Word problem (mathematics education)|word problems]], the outcome reward can simply be binary: 1 if the final answer is correct and 0 otherwise. More generally, an ''outcome reward model'' (ORM) assigns a score to a complete reasoning trace based only on its final answer.

The ORM is usually trained by logistic regression, i.e. by minimizing [[cross-entropy]] loss, on traces labeled according to whether their final answer is correct.

Given a PRM, an ORM can be constructed by multiplying the process rewards along the reasoning trace,<ref name=":1" /> by taking their minimum, or by aggregating the process rewards in some other way.
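As a minimal illustration of this aggregation (the function name and interface are illustrative, not taken from any particular implementation), the following sketch turns a list of per-step process rewards into a single trace-level score by taking either their product or their minimum:

<syntaxhighlight lang="python">
import math
from typing import List

def outcome_from_process_rewards(step_rewards: List[float], how: str = "product") -> float:
    """Aggregate per-step process rewards (e.g. estimated step-correctness
    probabilities in [0, 1]) into one score for the whole trace."""
    if how == "product":
        return math.prod(step_rewards)  # the trace is good only if every step is good
    if how == "min":
        return min(step_rewards)        # the trace is only as good as its weakest step
    raise ValueError(f"unknown aggregation: {how}")

# A trace with one doubtful step receives a low overall score.
print(outcome_from_process_rewards([0.9, 0.95, 0.4]))         # ~0.342
print(outcome_from_process_rewards([0.9, 0.95, 0.4], "min"))  # 0.4
</syntaxhighlight>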
=== Process reward model ===
{{Anchor|Process Reward Model|PRM}}
Given a partial thinking trace <math>x, y_1, \dots, y_m</math>, a human can judge whether the steps so far are correct, regardless of whether the final answer would be correct, and this judgment provides a process reward for the partial trace. Because step-level human annotation is expensive, the process reward is often estimated automatically instead.

As an example, the process reward of a partial trace can be estimated by sampling several completions of the trace and checking their final answers:

<math>r(x, y_1, \dots, y_m) = \begin{cases} 1 & \text{if one of the answers is correct}\\
0 & \text{else}
\end{cases}</math>
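A minimal sketch of this sampling-based estimate is shown below; <code>sample_completions</code> and <code>is_correct</code> are assumed placeholders for the model's rollout routine and an answer checker, not a specific library API.

<syntaxhighlight lang="python">
from typing import Callable, List

def hard_process_reward(
    partial_trace: str,
    sample_completions: Callable[[str, int], List[str]],  # rolls out n completions of the partial trace
    is_correct: Callable[[str], bool],                     # checks a completed trace's final answer
    n_rollouts: int = 8,
) -> int:
    """Return 1 if any sampled completion of the partial trace reaches a
    correct final answer, and 0 otherwise (the indicator in the equation above)."""
    completions = sample_completions(partial_trace, n_rollouts)
    return int(any(is_correct(c) for c in completions))
</syntaxhighlight>

A softer variant uses the fraction of correct completions instead of the 0/1 indicator.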
One can also use an ORM to implicitly construct a PRM, similar to [[direct preference optimization]].<ref>{{cite arXiv |last1=Yuan |first1=Lifan |last2=Li |first2=Wendi |last3=Chen |first3=Huayu |last4=Cui |first4=Ganqu |last5=Ding |first5=Ning |last6=Zhang |first6=Kaiyan |last7=Zhou |first7=Bowen |last8=Liu |first8=Zhiyuan |last9=Peng |first9=Hao |title=Free Process Rewards without Process Labels |date=2024-12-02}}</ref>
=== Guided sampling ===
A trained ORM can be used to select the best of several candidate responses: the policy generates multiple responses to the same prompt, the ORM scores each one, and the highest-scoring response is returned ("best-of-N" sampling).

A trained PRM can guide reasoning step by step through tree search: at each step the policy proposes several candidate next steps, the PRM scores them, and only the most promising candidates are expanded further, as in greedy decoding or [[beam search]].

''Lookahead search'' is another tree search method, in which each candidate next step is evaluated by rolling out a short continuation from it; the step whose rollout earns the highest process reward, or reaches a correct answer, is kept.

''Self-consistency'' can be combined with an ORM. The model generates multiple responses, the responses are grouped by their final answer, the ORM scores within each group are summed, and the answer of the highest-scoring group is returned.
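The sketch below illustrates ORM-guided best-of-N selection and ORM-weighted self-consistency over a set of sampled responses; <code>orm_score</code> and <code>answer_of</code> are placeholders for a trained ORM and an answer-extraction routine and are assumptions of the example.

<syntaxhighlight lang="python">
from collections import defaultdict
from typing import Callable, Dict, List

def best_of_n(responses: List[str], orm_score: Callable[[str], float]) -> str:
    """Return the single response that the ORM scores highest."""
    return max(responses, key=orm_score)

def weighted_self_consistency(
    responses: List[str],
    orm_score: Callable[[str], float],
    answer_of: Callable[[str], str],
) -> str:
    """Group responses by their final answer, sum ORM scores within each group,
    and return the answer of the highest-scoring group."""
    totals: Dict[str, float] = defaultdict(float)
    for response in responses:
        totals[answer_of(response)] += orm_score(response)
    return max(totals, key=totals.get)
</syntaxhighlight>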
== Benchmarks ==
Reasoning models generally score higher than non-reasoning models on benchmarks that require multi-step reasoning, such as competition mathematics, science question answering, and programming, although this comes at the cost of longer and more expensive inference.
=== Humanity's Last Exam ===
The [[Humanity's Last Exam|HLE]] benchmark tests expert-level reasoning across mathematics, the humanities, and the natural sciences. Even state-of-the-art reasoning models answer only a small fraction of its questions correctly, leaving substantial room for improvement.
=== AIME ===
The [[American Invitational Mathematics Examination]] (AIME) is a difficult mathematics competition frequently used to evaluate models; reasoning models generally score far higher on it than non-reasoning LLMs.
=== o3-mini performance ===
According to OpenAI's January 2025 report on o3-mini, the model's selectable "reasoning effort" level significantly affects its performance: higher reasoning effort improves accuracy on mathematics, science, and coding benchmarks, at the cost of generating more reasoning tokens and taking longer to respond.
== Drawbacks ==
=== Computational cost ===
Reasoning models typically require far more compute per query than standard LLMs, because they generate long chains of thought before producing a final answer, which increases both cost and energy use.
=== Generation time ===
Because reasoning language models tend to produce long, verbose outputs, generating a response takes considerably more time than with a standard [[large language model]].
== Models ==
=== [[OpenAI]] ===
* [[GPT-5]]
* [[OpenAI o4-mini|o4-mini]]
* [[OpenAI o3|o3 and o3-mini]]
=== [[Mistral AI]] ===
* Magistral (medium & small)
=== [[Hugging Face]] ===
* OlympicCoder-7B & 32B, as part of reproducing the R1 training openly (Open R1 project).<ref>{{cite web |title=Open-R1: a fully open reproduction of DeepSeek-R1 |url=https://huggingface.co/blog/open-r1 |website=Hugging Face |date=2025-02-24 |access-date=2025-07-26}}</ref><ref>{{cite web |title=OlympicCoder-7B |url=https://huggingface.co/open-r1/OlympicCoder-7B |website=Hugging Face |date=2025-03-11 |access-date=2025-07-26}}</ref>
== See also ==