Reasoning language model

{{Copy edit|for=jargon|date=May 2025}}
 
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that are trained further to solve tasks that take several steps of [[reasoning]].<ref>{{cite arXiv |last1=Besta |first1=Maciej |last2=Barth |first2=Julia |last3=Schreiber |first3=Eric |last4=Kubicek |first4=Ales |last5=Catarino |first5=Afonso |last6=Gerstenberger |first6=Robert |last7=Nyczyk |first7=Piotr |last8=Iff |first8=Patrick |last9=Li |first9=Yueling |title=Reasoning Language Models: A Blueprint |date=2025-01-23 |arxiveprint=2501.11223 |class=cs.CL}}</ref> They tend to do better on logic, math, and programming tasks than standard LLMs, can [[Backtracking|revisit and revise]] earlier steps, and make use of extra computation while answering as another way to [[Neural scaling law|scale performance]], alongside the number of training examples, parameters, and training compute.<ref name=":8">{{cite web |title=Learning to reason with LLMs |url=https://openai.com/index/learning-to-reason-with-llms/ |website=OpenAI |date=2024-09-12 |access-date=2025-07-26}}</ref>
 
== History ==
In September 2024, [[OpenAI]] released [[OpenAI o1#release|o1-preview]], an LLM with enhanced reasoning.<ref>{{cite news |last1=Edwards |first1=Benj |date=2024-09-12 |title=OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini |url=https://arstechnica.com/information-technology/2024/09/openais-new-reasoning-ai-models-are-here-o1-preview-and-o1-mini/ |access-date=2025-02-06 |work=Ars Technica |language=en-US}}</ref> The full version, [[OpenAI o1|o1]], followed in December 2024. OpenAI also began sharing results on its successor, [[OpenAI o3|o3]].<ref>{{cite web |title=OpenAI o1 System Card |url=https://cdn.openai.com/o1-system-card.pdf |website=OpenAI |date=2024-12-05 |access-date=2025-07-26}}</ref><ref>{{cite news |last=Robison |first=Kylie |date=2024-12-05 |title=OpenAI launches ChatGPT Pro, a $200/month plan with unlimited access to o1, GPT-4o, and more |url=https://www.theverge.com/2024/12/5/24314147/openai-reasoning-model-o1-strawberry-chatgpt-pro-new-tier |access-date=2025-07-26 |work=The Verge}}</ref><ref>{{cite news |last=Singh |first=Jaspreet |date=2024-12-20 |title=OpenAI unveils 'o3' model, touting advances in reasoning |url=https://www.reuters.com/technology/artificial-intelligence/openai-unveils-o3-model-touting-advances-reasoning-2024-12-20/ |access-date=2025-07-26 |work=Reuters}}</ref>
 
The development of reasoning LLMs has illustrated what [[Richard S. Sutton|Rich Sutton]] called the "bitter lesson": that scaling compute often outperforms methods that rely on specific human insights.<ref>{{cite web |last1=Sutton |first1=Richard S. |title=The Bitter Lesson |url=http://www.incompleteideas.net/IncIdeas/BitterLesson.html |access-date=2025-02-27 |website=Incomplete Ideas}}</ref> For example, the Generative AI Research Lab (GAIR) explored complex methods such as tree search and reinforcement learning to replicate o1's capabilities, but reported in its "o1 Replication Journey" papers that [[knowledge distillation]] (training a smaller model to imitate o1's outputs) worked surprisingly well, highlighting the effectiveness of distillation in this context.<ref>{{cite arXiv |last1=Huang |first1=Zhen |last2=Zou |first2=Haoyang |last3=Li |first3=Xuefeng |last4=Liu |first4=Yixiu |last5=Zheng |first5=Yuxiang |last6=Chern |first6=Ethan |last7=Xia |first7=Shijie |last8=Qin |first8=Yiwei |last9=Yuan |first9=Weizhe |title=O1 Replication Journey — Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? |date=2024-11-25 |arxiveprint=2411.16489 |class=cs.CL}}</ref><ref name=":6">{{cite news |last=Zeff |first=Maxwell |date=2025-02-05 |title=Researchers created an open rival to OpenAI's o1 'reasoning' model for under $50 |url=https://techcrunch.com/2025/02/05/researchers-created-an-open-rival-to-openais-o1-reasoning-model-for-under-50/ |access-date=2025-07-26 |work=TechCrunch}}</ref>
 
[[Alibaba Group|Alibaba]] released reasoning versions of its [[Qwen]] LLMs in November 2024.<ref>{{cite web |title=QwQ-32B-Preview: Reflect Deeply on the Boundaries of the Unknown |url=https://qwenlm.github.io/blog/qwq-32b-preview/ |website=Qwen (Alibaba Cloud) |date=2024-11-28 |access-date=2025-07-26}}</ref>
 
=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model with comparable performance to o1 at lower cost. The release demonstrated the effectiveness of [[Group Relative Policy Optimization]] (GRPO).<ref>{{cite news |last1=Orland |first1=Kyle |date=2025-01-28 |title=How does DeepSeek R1 really fare against OpenAI's best reasoning models? |url=https://arstechnica.com/ai/2025/01/how-does-deepseek-r1-really-fare-against-openais-best-reasoning-models/ |access-date=2025-02-06 |work=Ars Technica}}</ref><ref name=":9">{{cite arXiv |last1=DeepSeek-AI |first1= |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |arxiveprint=2501.12948 |class=cs.CL}}</ref> On January 25, 2025, DeepSeek added a feature to DeepSeek R1 that lets the model search the web while it reasons, making it easier to combine retrieval with reasoning.<ref>{{cite news |script-title=zh:DeepSeek 支持"深度思考+联网检索"能力 |trans-title=DeepSeek adds a search feature supporting simultaneous deep thinking and web search |work=People's Daily Online |date=2025-01-29 |url=http://tech.people.com.cn/n1/2025/0129/c1007-40386565.html |language=zh |access-date=2025-07-26}}</ref> OpenAI subsequently released o3-mini, followed by [[ChatGPT Deep Research|Deep Research]] based on [[OpenAI o3|o3]].<ref>{{cite news |last1=Milmo |first1=Dan |date=2025-02-03 |title=OpenAI launches 'deep research' tool that it says can match research analyst |url=https://www.theguardian.com/technology/2025/feb/03/openai-deep-research-agent-chatgpt-deepseek |access-date=2025-03-16 |work=The Guardian |language=en-GB |issn=0261-3077}}</ref> The effectiveness of distillation for reasoning models was shown in works such as s1-32B, which achieved strong performance through budget forcing and scaling methods.<ref name=":10">{{cite arXiv |last1=Muennighoff |first1=Niklas |last2=Yang |first2=Zitong |last3=Shi |first3=Weijia |last4=Li |first4=Xiang Lisa |last5=Fei-Fei |first5=Li |last6=Hajishirzi |first6=Hannaneh |last7=Zettlemoyer |first7=Luke |last8=Liang |first8=Percy |last9=Candès |first9=Emmanuel |title=s1: Simple test-time scaling |date=2025-02-03 |arxiveprint=2501.19393 |class=cs.CL}}</ref><ref name=":6"/>
 
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]] based on their [[OpenAI o3|o3]] model,<ref name=":5">{{cite web |date=2025-02-02 |title=Introducing deep research |url=https://openai.com/index/introducing-deep-research/ |access-date=2025-02-05 |website=OpenAI |language=en-US}}</ref> allowing users to initiate complex research tasks and generate comprehensive reports that incorporate various sources from the web.<ref name=":5" />
 
== Supervised finetuning ==
A [[large language model]] (LLM) can be fine-tuned on a dataset of reasoning tasks paired with example solutions and step-by-step (reasoning) traces. The fine-tuned model can then produce its own reasoning traces for new problems.<ref name=":0">{{cite arXiv |last1=Uesato |first1=Jonathan |last2=Kushman |first2=Nate |last3=Kumar |first3=Ramana |last4=Song |first4=Francis |last5=Siegel |first5=Noah |last6=Wang |first6=Lisa |last7=Creswell |first7=Antonia |last8=Irving |first8=Geoffrey |last9=Higgins |first9=Irina |title=Solving math word problems with process- and outcome-based feedback |date=2022-11-25 |arxiveprint=2211.14275 |class=cs.LG}}</ref><ref name=":2" />
 
Because human-written traces are costly to collect, researchers have proposed ways to build such datasets automatically. In ''rejection sampling finetuning'' (RFT), new reasoning traces are gathered in a loop:<ref>{{cite arXiv |last1=Yuan |first1=Zheng |last2=Yuan |first2=Hongyi |last3=Li |first3=Chengpeng |last4=Dong |first4=Guanting |last5=Lu |first5=Keming |last6=Tan |first6=Chuanqi |last7=Zhou |first7=Chang |last8=Zhou |first8=Jingren |title=Scaling Relationship on Learning Mathematical Reasoning with Large Language Models |date=2023-09-13 |arxiveprint=2308.01825 |class=cs.CL}}</ref>
# Sample a task prompt.
# Generate many reasoning traces for the prompt.
# Keep only the traces whose final answer is verified to be correct, discarding the rest.
The model is then fine-tuned on the retained traces.
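A minimal sketch of this data-collection loop is shown below; the helper names (<code>generate_trace</code>, <code>final_answer</code>, <code>reference_answer</code>) are hypothetical stand-ins for the policy model, an answer extractor, and the dataset's ground truth, not part of any published implementation.

<syntaxhighlight lang="python">
# Sketch of rejection sampling finetuning (RFT) data collection: keep only
# sampled reasoning traces that end in a correct answer.
# `generate_trace`, `final_answer`, and `reference_answer` are hypothetical
# stand-ins for the policy model, an answer extractor, and the ground truth.

def collect_rft_dataset(prompts, generate_trace, final_answer, reference_answer,
                        samples_per_prompt=16):
    dataset = []
    for prompt in prompts:
        seen = set()
        for _ in range(samples_per_prompt):
            trace = generate_trace(prompt)                    # one sampled reasoning trace (string)
            if final_answer(trace) != reference_answer(prompt):
                continue                                      # reject traces with a wrong final answer
            if trace in seen:
                continue                                      # drop exact duplicate traces
            seen.add(trace)
            dataset.append({"prompt": prompt, "trace": trace})
    return dataset                                            # used for supervised finetuning
</syntaxhighlight>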
An outcome reward model, or outcome-supervised RM (ORM),<ref name=":0" /> gives the reward for a step <math>r(x, y_1, \dots, y_i)</math> based on the final answer: <math>r(x, y_1, \dots, y_i) = r(x, y_n)</math>. Such models are often called "verifiers".
 
For tasks with answers that are easy to verify, such as [[Word problem (mathematics education)|math word problems]], the outcome reward can be binary: 1 if the final answer is correct, 0 otherwise.<ref name=":0" /> If automatic verification is hard, humans can label answers as correct or not, and those labels can be used to finetune a base model that predicts the human label.<ref name=":2">{{cite arXiv |last1=Cobbe |first1=Karl |last2=Kosaraju |first2=Vineet |last3=Bavarian |first3=Mohammad |last4=Chen |first4=Mark |last5=Jun |first5=Heewoo |last6=Kaiser |first6=Lukasz |last7=Plappert |first7=Matthias |last8=Tworek |first8=Jerry |last9=Hilton |first9=Jacob |title=Training Verifiers to Solve Math Word Problems |date=2021-11-18 |arxiveprint=2110.14168 |class=cs.LG}}</ref> For tasks like creative writing, where quality is not simply true or false, one can train a reward model on human [[Ranking (statistics)|ranked preference]] data, as in [[reinforcement learning from human feedback]].<ref name=":1">{{cite journal |last1=Lightman |first1=Hunter |last2=Kosaraju |first2=Vineet |last3=Burda |first3=Yura |last4=Edwards |first4=Harri |last5=Baker |first5=Bowen |last6=Lee |first6=Teddy |last7=Leike |first7=Jan |last8=Schulman |first8=John |last9=Sutskever |first9=Ilya |date=2024 |title=Let's Verify Step by Step |url=https://openreview.net/forum?id=dKDGgN0eTg |journal=International Conference on Learning Representations (ICLR 2024) |access-date=2025-07-26 |arxiv=2305.20050}}</ref> A base model can also be fine-tuned to predict, from a partial thinking trace <math>x, y_1, \dots, y_m</math>, whether the final answer will be correct, and this prediction can serve as a binary reward.<ref name=":0" />
 
The ORM is usually trained with [[logistic regression]], i.e. by minimizing [[Cross-entropy|cross-entropy loss]].<ref name=":3" />
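As an illustration, the verifier can be implemented as a small scoring head trained with binary cross-entropy. The sketch below assumes a generic [[PyTorch]] setup in which (prompt, trace) pairs have already been encoded into feature vectors; the names <code>OutcomeRewardModel</code> and <code>orm_training_step</code> are illustrative rather than taken from any cited implementation.

<syntaxhighlight lang="python">
# Minimal sketch of ORM training as binary classification (cross-entropy),
# assuming some encoder has turned each (prompt, trace) pair into a feature vector.
import torch
import torch.nn as nn

class OutcomeRewardModel(nn.Module):
    """Maps an encoded (prompt, reasoning trace) pair to a single logit."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)  # shape: (batch,)

def orm_training_step(model, optimizer, features, labels):
    """One gradient step; labels are 1.0 if the final answer was correct, else 0.0."""
    loss_fn = nn.BCEWithLogitsLoss()            # logistic-regression / cross-entropy loss
    logits = model(features)
    loss = loss_fn(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random features standing in for encoded traces:
model = OutcomeRewardModel(hidden_dim=768)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
features = torch.randn(8, 768)                  # batch of 8 encoded (prompt, trace) pairs
labels = torch.randint(0, 2, (8,))              # 1 = correct final answer, 0 = incorrect
orm_training_step(model, opt, features, labels)
</syntaxhighlight>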
For a partial reasoning trace <math>x, y_1, \dots, y_i</math>, rollouts can be sampled to completion and the step label set to
<math>r(x, y_1, \dots, y_i) = \begin{cases}
1 & \text{if some rollout from } x, y_1, \dots, y_i \text{ reaches a correct final answer} \\
0 & \text{else}
\end{cases}</math>
in the case of "hard estimation". This creates process rewards from an ORM, which is often easier or cheaper to construct. A PRM can then be trained on these labels.<ref name=":3">{{cite journal |last1=Wang |first1=Peiyi |last2=Li |first2=Lei |last3=Shao |first3=Zhihong |last4=Xu |first4=Runxin |last5=Dai |first5=Damai |last6=Li |first6=Yifei |last7=Chen |first7=Deli |last8=Wu |first8=Yu |last9=Sui |first9=Zhifang |editor-last=Ku |editor-first=Lun-Wei |editor2-last=Martins |editor2-first=Andre |editor3-last=Srikumar |editor3-first=Vivek |title=Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations |journal=Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |___location=Bangkok, Thailand |publisher=Association for Computational Linguistics |date=August 2024 |pages=9426–9439 |doi=10.18653/v1/2024.acl-long.510 |arxiv=2312.08935}}</ref> Some work has tried a fully MCTS approach.<ref>{{cite arXiv |last1=Chen |first1=Guoxin |last2=Liao |first2=Minpeng |last3=Li |first3=Chengxi |last4=Fan |first4=Kai |title=AlphaMath Almost Zero: Process Supervision without Process |date=2024-09-27 |arxiveprint=2405.03553 |class=cs.LG}}</ref>
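A minimal sketch of hard estimation is shown below; <code>generate_completions</code> and <code>is_correct</code> are hypothetical stand-ins for the policy model and an automatic answer checker.

<syntaxhighlight lang="python">
# Sketch of "hard estimation": label each reasoning step 1 if at least one
# sampled continuation from that step reaches a correct final answer, else 0.
# `generate_completions` and `is_correct` are hypothetical stand-ins for the
# policy model and an automatic answer verifier.

def hard_estimate_step_labels(prompt, steps, generate_completions, is_correct,
                              num_rollouts=8):
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]                      # partial reasoning trace y_1..y_i
        rollouts = generate_completions(prompt, prefix, n=num_rollouts)
        # The step gets reward 1 if any rollout ends in a correct answer.
        labels.append(1 if any(is_correct(prompt, r) for r in rollouts) else 0)
    return labels
</syntaxhighlight>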
 
One can also use an ORM to implicitly construct a PRM, similar to [[direct preference optimization]].<ref>{{cite arXiv |last1=Yuan |first1=Lifan |last2=Li |first2=Wendi |last3=Chen |first3=Huayu |last4=Cui |first4=Ganqu |last5=Ding |first5=Ning |last6=Zhang |first6=Kaiyan |last7=Zhou |first7=Bowen |last8=Liu |first8=Zhiyuan |last9=Peng |first9=Hao |title=Free Process Rewards without Process Labels |date=2024-12-02 |arxiveprint=2412.01981 |class=cs.CL}}</ref>
 
=== Guided sampling ===
A trained ORM can be used to pick the best response. The policy generates several responses, and the ORM selects the best one. This implements a simple form of [[Neural scaling law|test-time compute scaling]] ("best-of-N").<ref name=":2" /> <ref>{{cite arXiv |last1=Zhang |first1=Di |last2=Wu |first2=Jianbo |last3=Lei |first3=Jingdi |last4=Che |first4=Tong |last5=Li |first5=Jiatong |last6=Xie |first6=Tong |last7=Huang |first7=Xiaoshui |last8=Zhang |first8=Shufei |last9=Pavone |first9=Marco |title=LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning |date=2024-11-21 |arxiveprint=2410.02884 |class=cs.CL}}</ref>
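A best-of-N selection step can be sketched as follows, with <code>policy_generate</code> and <code>orm_score</code> as hypothetical stand-ins for the policy model and the trained verifier.

<syntaxhighlight lang="python">
# Sketch of best-of-N sampling: generate N candidate responses and return the
# one the outcome reward model (verifier) scores highest.
# `policy_generate` and `orm_score` are hypothetical stand-ins.

def best_of_n(prompt, policy_generate, orm_score, n=16):
    candidates = [policy_generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: orm_score(prompt, response))
</syntaxhighlight>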
 
A trained PRM can guide reasoning by a greedy [[Tree traversal|tree search]]: the policy proposes several next steps, the PRM picks one, and the process repeats. This mirrors using an ORM to pick a whole response.<ref>{{cite arXiv |last1=Ma |first1=Qianli |last2=Zhou |first2=Haotian |last3=Liu |first3=Tingkai |last4=Yuan |first4=Jianbo |last5=Liu |first5=Pengfei |last6=You |first6=Yang |last7=Yang |first7=Hongxia |title=Let's reward step by step: Step-Level reward model as the Navigators for Reasoning |date=2023-10-16 |arxiveprint=2310.10080 |class=cs.CL}}</ref> [[Beam search]], which keeps several candidate partial traces instead of a single one, generally performs better than greedy search.
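The greedy and beam variants can be sketched as follows; <code>propose_steps</code>, <code>prm_score</code>, and <code>is_finished</code> are hypothetical stand-ins for the policy model, the process reward model, and a completion check.

<syntaxhighlight lang="python">
# Sketch of PRM-guided search over reasoning steps. `propose_steps` asks the
# policy for k candidate next steps; `prm_score` scores a partial trace;
# `is_finished` detects a completed solution. All three are hypothetical.

def greedy_prm_search(prompt, propose_steps, prm_score, is_finished,
                      k=4, max_steps=20):
    trace = []
    for _ in range(max_steps):
        candidates = propose_steps(prompt, trace, k=k)
        # Append the single next step the PRM rates highest.
        trace.append(max(candidates, key=lambda s: prm_score(prompt, trace + [s])))
        if is_finished(trace):
            break
    return trace

def beam_prm_search(prompt, propose_steps, prm_score, is_finished,
                    k=4, beam_width=4, max_steps=20):
    beams = [[]]                                 # each beam is a partial trace
    for _ in range(max_steps):
        finished = [beam for beam in beams if is_finished(beam)]
        expansions = [beam + [step]
                      for beam in beams if not is_finished(beam)
                      for step in propose_steps(prompt, beam, k=k)]
        if not expansions:
            break
        # Keep finished traces plus the beam_width highest-scoring expansions.
        beams = finished + sorted(expansions,
                                  key=lambda t: prm_score(prompt, t),
                                  reverse=True)[:beam_width]
    return max(beams, key=lambda t: prm_score(prompt, t))
</syntaxhighlight>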
 
''Lookahead search'' is another tree search method. The policy proposes several next steps, then makes a short rollout for each. If a solution is found during rollout, the search stops early. Otherwise, the PRM scores each rollout, and the step with the highest score is chosen.<ref name=":7"/>
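A single lookahead step can be sketched as follows, under the same assumptions as above (<code>propose_steps</code>, <code>rollout</code>, <code>prm_score</code>, and <code>is_solved</code> are all hypothetical helpers).

<syntaxhighlight lang="python">
# Sketch of one lookahead-search step: for each candidate next step, roll out a
# short continuation; stop early if a rollout solves the task, otherwise keep
# the step whose rollout the PRM scores highest. Helper functions are hypothetical.

def lookahead_step(prompt, trace, propose_steps, rollout, prm_score, is_solved,
                   k=4, rollout_depth=3):
    best_step, best_score = None, float("-inf")
    for step in propose_steps(prompt, trace, k=k):
        continuation = rollout(prompt, trace + [step], depth=rollout_depth)
        if is_solved(prompt, continuation):
            return step, continuation            # early exit: a solution was found
        score = prm_score(prompt, continuation)
        if score > best_score:
            best_step, best_score = step, score
    return best_step, None
</syntaxhighlight>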
Reasoning models generally score higher than non-reasoning models on many benchmarks, especially on tasks requiring multi-step reasoning.
 
Some benchmarks exclude reasoning models because their responses take longer and cost more.<ref>{{cite book |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |chapter=Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study |title=Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things |publisher=ACM |date=2025 |pages=1–6 |doi=10.1145/3722565.3727198 |arxiv=2503.12282 |isbn=979-8-4007-1608-9 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? A Research Note |date=2025-02-13 |arxiveprint=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |arxiveprint=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |arxiveprint=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>
 
=== Humanity's Last Exam ===
The [[Humanity's Last Exam|HLE]] benchmark tests expert-level reasoning across mathematics, humanities, and the natural sciences, and shows large performance gaps between models. Even state-of-the-art reasoning models score relatively low on HLE, leaving substantial room for improvement. For example, the full reasoning model [[OpenAI o3|o3]] reached 26.6%,<ref name=":5"/> while the lighter o3-mini-high (on text-only questions) reached 13%.<ref>{{cite web |title=Humanity's Last Exam leaderboard |url=https://agi.safe.ai/benchmarks/hle |website=Safe.ai |publisher=Center for AI Safety |access-date=2025-07-26}}</ref>
 
=== AIME ===
 
=== Generation time ===
Because reasoning language models tend to produce verbose outputs, generating a response takes considerably longer than with a standard [[large language model]]; current models can take anywhere from a few seconds to several minutes to answer.
 
== Models ==
 
=== [[OpenAI]] ===
* [[GPT-5]]
* [[OpenAI o4-mini|o4-mini]]
* [[OpenAI o3|o3 and o3-mini]]
 
=== [[Mistral AI]] ===
 
* Magistral (medium & small)
 
 
=== [[Hugging Face]] ===
 
* OlympicCoder-7B and OlympicCoder-32B, released as part of Open R1, an effort to openly reproduce the DeepSeek-R1 training pipeline.<ref>{{cite web |title=Open-R1: a fully open reproduction of DeepSeek-R1 |url=https://huggingface.co/blog/open-r1 |website=Hugging Face |date=2025-02-24 |access-date=2025-07-26}}</ref><ref>{{cite web |title=OlympicCoder-7B |url=https://huggingface.co/open-r1/OlympicCoder-7B |website=Hugging Face |date=2025-03-11 |access-date=2025-07-26}}</ref>