{{Short description|Language models designed for reasoning tasks}}
{{Copy edit|for=jargon|date=May 2025}}
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that are trained further to solve tasks that take several steps of [[reasoning]].<ref>{{cite arXiv |last1=Besta |first1=Maciej |last2=Barth |first2=Julia |last3=Schreiber |first3=Eric |last4=Kubicek |first4=Ales |last5=Catarino |first5=Afonso |last6=Gerstenberger |first6=Robert |last7=Nyczyk |first7=Piotr |last8=Iff |first8=Patrick |last9=Li |first9=Yueling |title=Reasoning Language Models: A Blueprint |date=2025-01-23 |eprint=2501.11223 |class=cs.CL}}</ref> They tend to do better on logic, math, and programming tasks than standard LLMs, can [[Backtracking|revisit and revise]] earlier steps, and make use of extra computation while answering as another way to [[Neural scaling law|scale performance]], alongside the number of training examples, parameters, and training compute.<ref name=":8">{{cite web |title=Learning to reason with LLMs |url=https://openai.com/index/learning-to-reason-with-llms/ |website=OpenAI |date=2024-09-12 |access-date=2025-07-26}}</ref>
== History ==
=== 2024 ===
In September 2024, [[OpenAI]] released [[OpenAI o1#release|o1-preview]], an LLM with enhanced reasoning.<ref>{{cite news |last1=Edwards |first1=Benj |date=2024-09-12 |title=OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini |url=https://arstechnica.com/information-technology/2024/09/openais-new-reasoning-ai-models-are-here-o1-preview-and-o1-mini/ |access-date=2025-02-06 |work=Ars Technica |language=en-US}}</ref> The full version, [[OpenAI o1|o1]], followed in December 2024. OpenAI also began sharing results on its successor, [[OpenAI o3|o3]].<ref>{{cite web |title=OpenAI o1 System Card |url=https://cdn.openai.com/o1-system-card.pdf |website=OpenAI |date=2024-12-05 |access-date=2025-07-26}}</ref><ref>{{cite news |last=Robison |first=Kylie |date=2024-12-05 |title=OpenAI launches ChatGPT Pro, a $200/month plan with unlimited access to o1, GPT-4o, and more |url=https://www.theverge.com/2024/12/5/24314147/openai-reasoning-model-o1-strawberry-chatgpt-pro-new-tier |access-date=2025-07-26 |work=The Verge}}</ref><ref>{{cite news |last=Singh |first=Jaspreet |date=2024-12-20 |title=OpenAI unveils 'o3' model, touting advances in reasoning |url=https://www.reuters.com/technology/artificial-intelligence/openai-unveils-o3-model-touting-advances-reasoning-2024-12-20/ |access-date=2025-07-26 |work=Reuters}}</ref>
The development of reasoning LLMs has illustrated what [[Richard S. Sutton|Rich Sutton]] called the "bitter lesson": that scaling compute often outperforms methods that rely on specific human insights.<ref>{{cite web |last1=Sutton |first1=Richard S. |title=The Bitter Lesson |url=http://www.incompleteideas.net/IncIdeas/BitterLesson.html |access-date=2025-02-27 |website=Incomplete Ideas}}</ref> For example, the Generative AI Research Lab (GAIR) explored complex methods such as tree search and reinforcement learning to replicate o1's capabilities. In their "o1 Replication Journey" papers they reported that [[knowledge distillation]] (training a smaller model to imitate o1's outputs) worked surprisingly well. This highlighted the effectiveness of distillation in this context.<ref>{{cite arXiv |last1=Huang |first1=Zhen |last2=Zou |first2=Haoyang |last3=Li |first3=Xuefeng |last4=Liu |first4=Yixiu |last5=Zheng |first5=Yuxiang |last6=Chern |first6=Ethan |last7=Xia |first7=Shijie |last8=Qin |first8=Yiwei |last9=Yuan |first9=Weizhe |title=O1 Replication Journey — Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? |date=2024-11-25 |eprint=2411.16489 |class=cs.CL}}</ref><ref name=":6">{{cite news |last=Zeff |first=Maxwell |date=2025-02-05 |title=Researchers created an open rival to OpenAI's o1 'reasoning' model for under $50 |url=https://techcrunch.com/2025/02/05/researchers-created-an-open-rival-to-openais-o1-reasoning-model-for-under-50/ |access-date=2025-07-26 |work=TechCrunch}}</ref>
[[Alibaba Group|Alibaba]] released reasoning versions of its [[Qwen]] LLMs in November 2024.<ref>{{cite web |title=QwQ-32B-Preview: Reflect Deeply on the Boundaries of the Unknown |url=https://qwenlm.github.io/blog/qwq-32b-preview/ |website=Qwen (Alibaba Cloud) |date=2024-11-28 |access-date=2025-07-26}}</ref>
In December 2024, the team introduced QvQ-72B-Preview, an experimental visual reasoning model.<ref>{{cite web |title=QVQ: To See the World with Wisdom |url=https://qwenlm.github.io/blog/qvq-72b-preview/ |website=Qwen |publisher=Alibaba Cloud |date=2024-12-25 |access-date=2025-07-26}}</ref>
In December 2024, Google introduced [[Gemini Deep Research|Deep Research]] in [[Gemini (chatbot)|Gemini]],<ref>{{cite web |date=2024-12-11 |title=Try Deep Research and our new experimental model in Gemini, your AI assistant |url=https://blog.google/products/gemini/google-gemini-deep-research/ |access-date=2025-02-05 |website=Google |language=en-US}}</ref> a feature that runs multi-step research tasks.<ref>{{cite news |last=Roth |first=Emma |date=2024-12-11 |title=Google built an AI tool that can do research for you |url=https://www.theverge.com/2024/12/11/24318217/google-gemini-advanced-deep-research-launch |access-date=2025-07-26 |work=The Verge}}</ref>
On December 16, 2024, an experiment with a [[Llama (language model)|Llama]] 3B model showed that by scaling test-time compute, a relatively small model could outperform a much larger Llama 70B model on challenging reasoning tasks. This suggested that better inference strategies can unlock useful reasoning capabilities even in small models.<ref>{{cite web |title=Scaling test-time compute |url=https://huggingface.co/blog/h4-scaling-test-time-compute |website=Hugging Face |date=2024-12-16 |access-date=2025-07-26}}</ref><ref name=":7">{{cite journal |last1=Snell |first1=Charlie |last2=Lee |first2=Jaehoon |last3=Xu |first3=Kelvin |last4=Kumar |first4=Aviral |date=2025 |title=Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters |url=https://openreview.net/forum?id=t4s3hJY9dH |journal=International Conference on Learning Representations (ICLR 2025) |access-date=2025-07-26 |arxiv=2408.03314}}</ref>
=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model with comparable performance to o1 at lower cost. The release demonstrated the effectiveness of [[Group Relative Policy Optimization]] (GRPO).<ref>{{cite news |last1=Orland |first1=Kyle |date=2025-01-28 |title=How does DeepSeek R1 really fare against OpenAI's best reasoning models? |url=https://arstechnica.com/ai/2025/01/how-does-deepseek-r1-really-fare-against-openais-best-reasoning-models/ |access-date=2025-02-06 |work=Ars Technica}}</ref><ref name=":9">{{cite arXiv |last1=DeepSeek-AI |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |eprint=2501.12948 |class=cs.CL}}</ref> On January 25, 2025, [[DeepSeek]] added a feature to DeepSeek R1 that lets the model search the web while it reasons, making it easier to combine retrieval with reasoning.<ref>{{cite news |script-title=zh:DeepSeek 支持"深度思考+联网检索"能力 |trans-title=DeepSeek adds a search feature supporting simultaneous deep thinking and web search |work=People's Daily Online |date=2025-01-29 |url=http://tech.people.com.cn/n1/2025/0129/c1007-40386565.html |language=zh |access-date=2025-07-26}}</ref> The effectiveness of distillation for reasoning models was shown in works such as s1-32B, which achieved strong performance through budget forcing and scaling methods.<ref name=":10">{{cite arXiv |last1=Muennighoff |first1=Niklas |last2=Yang |first2=Zitong |last3=Shi |first3=Weijia |last4=Li |first4=Xiang Lisa |last5=Fei-Fei |first5=Li |last6=Hajishirzi |first6=Hannaneh |last7=Zettlemoyer |first7=Luke |last8=Liang |first8=Percy |last9=Candès |first9=Emmanuel |title=s1: Simple test-time scaling |date=2025-02-03 |eprint=2501.19393 |class=cs.CL}}</ref><ref name=":6"/>
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]] based on their [[OpenAI o3|o3]] model,<ref name=":5">{{cite web |date=2025-02-02 |title=Introducing deep research |url=https://openai.com/index/introducing-deep-research/ |access-date=2025-02-05 |website=OpenAI |language=en-US}}</ref> allowing users to initiate complex research tasks and generate comprehensive reports which incorporate various sources from the web.<ref name=":5" />
== Supervised finetuning ==
A [[large language model]] (LLM) can be fine-tuned on a dataset of reasoning tasks paired with example solutions and step-by-step (reasoning) traces. The fine-tuned model can then produce its own reasoning traces for new problems.<ref name=":0">{{cite arXiv |last1=Uesato |first1=Jonathan |last2=Kushman |first2=Nate |last3=Kumar |first3=Ramana |last4=Song |first4=Francis |last5=Siegel |first5=Noah |last6=Wang |first6=Lisa |last7=Creswell |first7=Antonia |last8=Irving |first8=Geoffrey |last9=Higgins |first9=Irina |title=Solving math word problems with process- and outcome-based feedback |date=2022-11-25 |eprint=2211.14275 |class=cs.LG}}</ref><ref name=":2" />
Because human-written traces are costly to collect, researchers have proposed ways to build such datasets automatically. In ''rejection sampling finetuning'' (RFT), new reasoning traces are gathered in a loop:<ref>{{cite arXiv |last1=Yuan |first1=Zheng |last2=Yuan |first2=Hongyi |last3=Li |first3=Chengpeng |last4=Dong |first4=Guanting |last5=Lu |first5=Keming |last6=Tan |first6=Chuanqi |last7=Zhou |first7=Chang |last8=Zhou |first8=Jingren |title=Scaling Relationship on Learning Mathematical Reasoning with Large Language Models |date=2023-09-13 |eprint=2308.01825 |class=cs.CL}}</ref>
# Sample a task prompt.
# Generate many reasoning traces for the prompt.
# Use a verifier to remove reasoning traces with an incorrect final answer, and add the remaining traces to the dataset.
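The loop above can be illustrated with a short Python sketch. The helpers <code>sample_prompt</code>, <code>generate_traces</code>, and <code>verify</code> are hypothetical placeholders for the task sampler, the language model, and the answer verifier; they are not functions from any particular library.

<syntaxhighlight lang="python">
# Illustrative sketch of rejection sampling finetuning (RFT) data collection.
# sample_prompt(), generate_traces(), and verify() are hypothetical placeholders.

def collect_rft_dataset(num_prompts: int, traces_per_prompt: int) -> list:
    dataset = []
    for _ in range(num_prompts):
        prompt = sample_prompt()                                # 1. sample a task prompt
        traces = generate_traces(prompt, n=traces_per_prompt)   # 2. generate many reasoning traces
        for trace in traces:
            if verify(prompt, trace.final_answer):              # 3. keep only traces whose final answer verifies
                dataset.append((prompt, trace))
    return dataset   # used as supervised finetuning data
</syntaxhighlight>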
== Reinforcement learning ==
A pretrained language model can be further trained with [[reinforcement learning]] (RL). In this setting, the model is treated as a policy, a generated response as an action, and a reward signal scores how good the response is for the given problem.
Training a reasoning language model with RL therefore requires a reward model to provide the training signal. Two kinds are commonly used: outcome reward models, which score only the final result of a reasoning trace, and process reward models, which score each intermediate step.
Most recent systems use policy-gradient methods such as [[Proximal Policy Optimization]] (PPO) because PPO constrains each policy update with a clipped objective, which stabilises training for very large policies.<ref name="OpenAIAlign2022">{{cite web |title=Aligning language models to follow instructions |website=OpenAI Blog |url=https://openai.com/blog/instruction-following/ |date=2022-01-27 |access-date=2025-05-04}}</ref>
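In its standard formulation, PPO maximizes the clipped surrogate objective
<math>L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],</math>
where <math>r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_\text{old}}(a_t \mid s_t)</math> is the probability ratio between the updated and the previous policy, <math>\hat{A}_t</math> is an estimate of the advantage, and <math>\epsilon</math> is the clipping parameter that bounds how far a single update can move the policy.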
=== Outcome reward model ===
{{Anchor|Outcome Reward Model|ORM}}
For tasks with a final answer that can be checked automatically, such as math word problems, an outcome reward model (ORM) assigns a score to a complete reasoning trace based only on whether its final answer is correct.
The ORM is usually trained as a binary classifier: given a problem and a complete response, it predicts the probability that the final answer is correct, minimizing a cross-entropy loss against verified labels.
Given a PRM, an ORM can be constructed by aggregating the process rewards over the reasoning trace, for example by taking their product or their minimum.<ref name=":1" />
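As an illustration, such an aggregation can be written as a short function. Here <code>prm_score</code> is a hypothetical placeholder for a trained PRM that returns the estimated probability that a given step is correct.

<syntaxhighlight lang="python">
import math

# Illustrative aggregation of step-level (PRM) scores into a single outcome score.
# prm_score(prompt, steps, i) is a hypothetical placeholder returning the PRM's
# estimate that step i of the trace is correct.

def outcome_score(prompt, steps, aggregate="product"):
    step_scores = [prm_score(prompt, steps, i) for i in range(len(steps))]
    if aggregate == "product":
        return math.prod(step_scores)   # multiply process rewards along the trace
    return min(step_scores)             # alternative: the weakest step bounds the trace
</syntaxhighlight>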
=== Process reward model ===
{{Anchor|Process Reward Model|PRM}}
Given a partial thinking trace <math>x, y_1, \dots, y_m</math>, a human annotator can judge whether the steps so far are correct, and such step-level labels can be used to train a PRM. Because human annotation is expensive, researchers have also proposed ways to produce process labels automatically.
As an example, the process reward of a step can be estimated by sampling several completions of the partial trace from the policy and setting the reward to
<math>\begin{cases} 1 & \text{if one of the answers is correct}\\
0 & \text{else}
\end{cases}</math>
in the case of "hard estimation". This creates process rewards from an ORM, which is often easier or cheaper to construct. A PRM can then be trained on these labels.<ref name=":3">{{cite journal |last1=Wang |first1=Peiyi |last2=Li |first2=Lei |last3=Shao |first3=Zhihong |last4=Xu |first4=Runxin |last5=Dai |first5=Damai |last6=Li |first6=Yifei |last7=Chen |first7=Deli |last8=Wu |first8=Yu |last9=Sui |first9=Zhifang |editor-last=Ku |editor-first=Lun-Wei |editor2-last=Martins |editor2-first=Andre |editor3-last=Srikumar |editor3-first=Vivek |title=Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations |journal=Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |___location=Bangkok, Thailand |publisher=Association for Computational Linguistics |date=August 2024 |pages=9426–9439 |doi=10.18653/v1/2024.acl-long.510 |arxiv=2312.08935}}</ref> Some work has tried a fully MCTS approach.<ref>{{cite arXiv |last1=Chen |first1=Guoxin |last2=Liao |first2=Minpeng |last3=Li |first3=Chengxi |last4=Fan |first4=Kai |title=AlphaMath Almost Zero: Process Supervision without Process |date=2024-09-27 |eprint=2405.03553 |class=cs.LG}}</ref>
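A minimal sketch of this "hard estimation" labeling, assuming hypothetical helpers <code>complete_trace</code> (sampling a completion of the partial trace from the policy) and <code>verify</code> (checking a final answer):

<syntaxhighlight lang="python">
# Illustrative "hard estimation" of a process reward for one partial trace.
# complete_trace() and verify() are hypothetical placeholders.

def hard_process_label(prompt, partial_steps, num_completions=8):
    for _ in range(num_completions):
        completion = complete_trace(prompt, partial_steps)   # roll out to a final answer
        if verify(prompt, completion.final_answer):
            return 1   # at least one completion reaches a correct answer
    return 0           # no sampled completion succeeded
</syntaxhighlight>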
One can also use an ORM to implicitly construct a PRM, similar to [[direct preference optimization]].
=== Guided sampling ===
A trained ORM can be used to select the best of several candidate responses ("best-of-N" sampling): the policy generates multiple complete responses, the ORM scores each, and the highest-scoring one is returned.
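A minimal sketch of best-of-N selection with an ORM; <code>generate_response</code> and <code>orm_score</code> are hypothetical placeholders for the policy model and the trained ORM.

<syntaxhighlight lang="python">
# Illustrative best-of-N sampling with an outcome reward model (ORM).
# generate_response() and orm_score() are hypothetical placeholders.

def best_of_n(prompt, n=16):
    candidates = [generate_response(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: orm_score(prompt, response))
</syntaxhighlight>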
A trained PRM can guide reasoning by a greedy [[Tree traversal|tree search]]: the policy proposes several next steps, the PRM picks one, and the process repeats. This mirrors using an ORM to pick a whole response.<ref>{{cite arXiv |last1=Ma |first1=Qianli |last2=Zhou |first2=Haotian |last3=Liu |first3=Tingkai |last4=Yuan |first4=Jianbo |last5=Liu |first5=Pengfei |last6=You |first6=Yang |last7=Yang |first7=Hongxia |title=Let's reward step by step: Step-Level reward model as the Navigators for Reasoning |date=2023-10-16 |eprint=2310.10080 |class=cs.CL}}</ref> [[Beam search]] performs better than greedy search.
''Lookahead search'' is another tree search method. The policy proposes several next steps, then makes a short rollout for each. If a solution is found during rollout, the search stops early. Otherwise, the PRM scores each rollout, and the step with the highest score is chosen.<ref name=":7"/>
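The greedy and lookahead variants can be sketched together; <code>propose_steps</code>, <code>rollout</code>, <code>prm_score</code>, and <code>is_final</code> are hypothetical placeholders for the policy's step proposals, short continuations, the trained PRM, and a check for a final answer.

<syntaxhighlight lang="python">
# Illustrative PRM-guided search over reasoning steps, with an optional short
# lookahead rollout per candidate step. All helper functions are hypothetical.

def guided_search(prompt, max_steps=32, lookahead=0):
    steps = []
    for _ in range(max_steps):
        candidates = propose_steps(prompt, steps)        # policy proposes possible next steps

        def score(step):
            trace = steps + [step]
            if lookahead > 0:                            # lookahead: extend each candidate briefly
                trace = trace + rollout(prompt, trace, depth=lookahead)
            return prm_score(prompt, trace)              # PRM scores the (extended) trace

        best = max(candidates, key=score)
        steps.append(best)
        if is_final(best):                               # stop once a final answer is produced
            break
    return steps
</syntaxhighlight>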
''Self-consistency'' can be combined with an ORM. The model generates multiple answers, and the answers are clustered so that each cluster has the same final answer. The ORM scores each answer, scores in each cluster are summed, and the answer from the highest-scoring cluster is returned.<ref name=":3" />
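This combination can likewise be sketched briefly; <code>generate_response</code> and <code>orm_score</code> are the same hypothetical placeholders as above, and responses are assumed to expose a <code>final_answer</code> field.

<syntaxhighlight lang="python">
from collections import defaultdict

# Illustrative self-consistency combined with an ORM: candidate answers are
# clustered by final answer and ORM scores are summed within each cluster.
# generate_response() and orm_score() are hypothetical placeholders.

def self_consistency_with_orm(prompt, n=16):
    cluster_scores = defaultdict(float)
    cluster_example = {}
    for _ in range(n):
        response = generate_response(prompt)
        answer = response.final_answer
        cluster_scores[answer] += orm_score(prompt, response)   # sum scores per answer cluster
        cluster_example.setdefault(answer, response)
    best_answer = max(cluster_scores, key=cluster_scores.get)   # highest-scoring cluster wins
    return cluster_example[best_answer]
</syntaxhighlight>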
== Benchmarks ==
Reasoning models generally score higher than non-reasoning models on many benchmarks, especially on tasks requiring multi-step reasoning.
Some benchmarks exclude reasoning models because their responses take longer and cost more.<ref>{{cite book |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |chapter=Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study |title=Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things |publisher=ACM |date=2025 |pages=1–6 |doi=10.1145/3722565.3727198 |arxiv=2503.12282 |isbn=979-8-4007-1608-9 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? A Research Note |date=2025-02-13 |eprint=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |eprint=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |eprint=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>
=== Humanity's Last Exam ===
The [[Humanity's Last Exam|HLE]] benchmark tests expert-level reasoning across mathematics, humanities, and the natural sciences, and shows large performance gaps between models. State-of-the-art reasoning models score low on HLE, leaving room to improve. For example, the full reasoning model [[OpenAI o3|o3]] reached 26.6%,<ref name=":5"/> while the lighter o3-mini-high (on text-only questions) reached 13%.<ref>{{cite web |title=Humanity's Last Exam leaderboard |url=https://agi.safe.ai/benchmarks/hle |website=Safe.ai |publisher=Center for AI Safety |access-date=2025-07-26}}</ref>
=== AIME ===
On the [[American Invitational Mathematics Examination]] (AIME), a difficult math competition, non-reasoning models usually solve under 30% of problems. Models that use reasoning methods score between 50% and 80%.<ref name=":8"/><ref name=":9"/><ref name=":10"/> While [[OpenAI o1|OpenAI's o1]] maintained or slightly improved its accuracy from reported 2024 results to 2025 AIME results, o3-mini (high) reached a higher accuracy (80%) at a much lower cost (about 12 times cheaper).<ref name=":4">{{cite web |date=2025-01-31 |title=OpenAI o3-mini |url=https://openai.com/index/openai-o3-mini/ |access-date=2025-02-09 |website=OpenAI |language=en-US}}</ref>
=== o3-mini performance ===
According to OpenAI's January 2025 report on o3-mini, adjusting "reasoning effort" significantly affects performance, especially for [[STEM]] tasks. Moving from low to high reasoning effort raises accuracy on AIME 2024, GPQA Diamond, and [[Codeforces]], typically by 10–30%. With high effort, o3-mini (high) achieved 87.3% on AIME (different from the MathArena AIME benchmark), 79.7% on GPQA Diamond, 2130 Elo on Codeforces, and 49.3 on SWE-bench Verified.<ref name=":4"/>
== Drawbacks ==
=== Computational cost ===
Reasoning models often need far more compute while answering than non-reasoning models. On AIME, they were 10 to 74 times more expensive<ref name=":1" /> than non-reasoning counterparts.
=== Generation time ===
Because reasoning language models produce long chains of intermediate tokens before the final answer, they take considerably longer to generate an output than a standard [[large language model]].
== Models ==
=== [[OpenAI]] ===
* [[GPT-5]]
* [[OpenAI o4-mini|o4-mini]]
* [[OpenAI o3|o3 and o3-mini]]
* [[OpenAI o1|o1 and o1-preview]]
=== [[Gemini (chatbot)|Gemini]] ===
* [[Gemini (language model)|2.5 Pro and Flash]]
* [[Gemini (language model)|2.0 Flash Thinking]]
=== [[DeepSeek]] ===
* R1 (based on V3)
* R1-Lite-Preview (test version based on V2.5)
=== [[Qwen]] ===
* QvQ-72B-Preview — an experimental visual reasoning model launched on December 24, 2024, which integrates image understanding with verbal chain-of-thought reasoning.
* QwQ-32B-Preview — an experimental text-based reasoning model released in late November 2024 that emphasizes complex, step-by-step analysis.
=== [[Anthropic]] ===
* [[Claude (language model)#Claude 3.7|Claude 3.7 Sonnet]] has an adjustable number of 'thinking' tokens.
=== [[Mistral AI]] ===
* Magistral (medium & small)
=== [[XAI (company)|xAI]] ===
* [[Grok_(chatbot)#Grok_3|Grok 3]]
* [[Grok_(chatbot)#Grok_4|Grok 4]]
=== [[Hugging Face]] ===
* OlympicCoder-7B & 32B, as part of reproducing the R1 training openly (Open R1 project).<ref>{{cite web |title=Open-R1: a fully open reproduction of DeepSeek-R1 |url=https://huggingface.co/blog/open-r1 |website=Hugging Face |date=2025-02-24 |access-date=2025-07-26}}</ref><ref>{{cite web |title=OlympicCoder-7B |url=https://huggingface.co/open-r1/OlympicCoder-7B |website=Hugging Face |date=2025-03-11 |access-date=2025-07-26}}</ref>
== See also ==
* [[Automated reasoning]]
* [[Reflection (artificial intelligence)]]