{{Short description|Language models designed for reasoning tasks}}
{{Copy edit|for=jargon|date=May 2025}}
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that are trained further to solve tasks that take several steps of [[reasoning]].<ref>{{cite arXiv |last1=Besta |first1=Maciej |last2=Barth |first2=Julia |last3=Schreiber |first3=Eric |last4=Kubicek |first4=Ales |last5=Catarino |first5=Afonso |last6=Gerstenberger |first6=Robert |last7=Nyczyk |first7=Piotr |last8=Iff |first8=Patrick |last9=Li |first9=Yueling |title=Reasoning Language Models: A Blueprint |date=2025-01-23 |eprint=2501.11223 |class=cs.CL}}</ref> They tend to do better on logic, math, and programming tasks than standard LLMs, can [[Backtracking|revisit and revise]] earlier steps, and make use of extra computation while answering as another way to [[Neural scaling law|scale performance]], alongside the number of training examples, parameters, and training compute.<ref name=":8">{{cite web |title=Learning to reason with LLMs |url=https://openai.com/index/learning-to-reason-with-llms/ |website=OpenAI |date=2024-09-12 |access-date=2025-07-26}}</ref>
== History ==
=== 2024 ===
In September 2024, [[OpenAI]] released [[OpenAI o1#release|o1-preview]], an LLM with enhanced reasoning.<ref name=":8" />
The development of reasoning LLMs has illustrated what [[Richard S. Sutton|Rich Sutton]] calls the "[[The Bitter Lesson|bitter lesson]]": that general methods which leverage computation, such as search and learning, tend to outperform methods that rely on hand-built human domain knowledge.
In November 2024, [[Alibaba Group|Alibaba]] released QwQ-32B-Preview, an experimental reasoning model in its [[Qwen]] family.
In December 2024, the team introduced QvQ-72B-Preview, an experimental visual reasoning model.<ref>{{cite web |title=QVQ: To See the World with Wisdom |url=https://qwenlm.github.io/blog/qvq-72b-preview/ |website=Qwen |publisher=Alibaba Cloud |date=2024-12-25 |access-date=2025-07-26}}</ref>
In December 2024, Google introduced [[Gemini Deep Research|Deep Research]] in [[Gemini (chatbot)|Gemini]], a feature that carries out multi-step research tasks on a user's behalf.
On December 16, 2024, an experiment with a [[Llama (language model)|Llama]] 3B model showed that, by scaling test-time compute, a relatively small model could outperform a much larger 70B model on math problems.
=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model competitive with [[OpenAI o1|o1]] at much lower cost, with its weights released under the [[MIT License]].
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]] based on their [[OpenAI o3|o3]] model, an agent that browses the web and produces cited reports on complex research tasks.
== Supervised finetuning ==
A [[large language model]] (LLM) can be finetuned on a dataset of reasoning tasks paired with example step-by-step solutions ("reasoning traces"). When such traces are not available, they can be generated and filtered automatically, a form of [[rejection sampling]]:
# Sample a task prompt.
# Generate many reasoning traces for the prompt.
# Use a verifier to remove reasoning traces with an incorrect final answer.
# Finetune the model on the remaining traces.
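The filtering steps above can be sketched as follows. This is a minimal illustration, not the procedure of any particular system; `toy_model` and `toy_verifier` are hypothetical stand-ins for a trace-generating LLM and an answer checker.

```python
import random

random.seed(0)

def sample_traces(prompt, model, n):
    """Draw n (reasoning_trace, final_answer) pairs from the model."""
    return [model(prompt) for _ in range(n)]

def rejection_sample(prompts, model, verifier, n=8):
    """Keep only traces whose final answer passes the verifier;
    the surviving (prompt, trace) pairs form the finetuning set."""
    dataset = []
    for prompt in prompts:
        for trace, answer in sample_traces(prompt, model, n):
            if verifier(prompt, answer):
                dataset.append((prompt, trace))
    return dataset

# Toy stand-ins: the "model" sometimes reasons correctly, the verifier checks.
toy_model = lambda p: ("2 + 2 = 4", 4) if random.random() < 0.5 else ("2 + 2 = 5", 5)
toy_verifier = lambda p, a: a == 4

data = rejection_sample(["What is 2 + 2?"], toy_model, toy_verifier, n=16)
```

Every trace that survives the filter ends in a verified answer, so supervised finetuning on `data` only reinforces correct reasoning.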
== Reinforcement learning ==
A pretrained language model can be further trained with [[reinforcement learning]] (RL): the model generates a reasoning trace, a reward model scores it, and the model's weights are updated to make high-reward traces more likely.
Training a reasoning language model with RL requires a reward model to guide the updates. Two kinds are commonly used: outcome reward models and process reward models.
Most recent systems use policy-gradient methods such as [[Proximal Policy Optimization]] (PPO) because PPO constrains each policy update with a clipped objective, which stabilises training for very large policies.<ref name="OpenAIAlign2022">{{cite web |title=Aligning language models to follow instructions |website=OpenAI Blog |url=https://openai.com/blog/instruction-following/ |date=2022-01-27 |access-date=2025-05-04}}</ref>
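The clipped objective mentioned above can be written out for a single action. This is a per-sample sketch of PPO's surrogate loss, not a full training loop; in practice the ratio and advantage come from the policy network and a value estimator.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective for one sample.

    ratio:     pi_new(action) / pi_old(action)
    advantage: estimated advantage of the action
    Taking the minimum of the unclipped and clipped terms means updates
    that move the policy too far (|ratio - 1| > eps) get no extra reward.
    """
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return min(unclipped, clipped)
```

For a positive advantage the objective stops growing once the ratio exceeds 1 + eps, which is what keeps each policy update small and training stable.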
{{Anchor|Outcome Reward Model|ORM}}
For tasks with answers that are easy to verify, such as math word problems, the outcome reward can simply be binary: 1 if the final answer is correct, and 0 otherwise. For other tasks, an outcome reward model (ORM) can be trained to estimate, given the question and the full reasoning trace, the probability that the final answer is correct.
The ORM is usually trained by [[logistic regression]], i.e. by minimizing [[cross-entropy]] loss against verified or human-labelled correctness.
Given a PRM, an ORM can be constructed by multiplying the per-step process rewards along the reasoning trace,<ref name=":1" /> by taking their minimum, or by some other way of aggregating the process rewards.
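The aggregation choices just described can be sketched in a few lines. This is an illustrative helper, assuming per-step rewards in [0, 1]; the function name is hypothetical.

```python
import math

def orm_from_prm(step_rewards, how="product"):
    """Aggregate per-step process rewards into one outcome score.

    With rewards in [0, 1], the product penalises any weak step,
    while the minimum scores a trace by its single worst step.
    """
    if how == "product":
        return math.prod(step_rewards)
    if how == "min":
        return min(step_rewards)
    raise ValueError(f"unknown aggregation: {how}")
```

Both choices share the property that one clearly wrong step (reward near 0) drags the whole trace's outcome score toward 0.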
=== Process reward model ===
{{Anchor|Process Reward Model|PRM}}
Given a partial thinking trace <math>x, y_1, \dots, y_m</math>, a human can judge whether the steps so far are correct, irrespective of the final answer, and a process reward model (PRM) can be trained to predict this per-step judgment.
As an example, the per-step reward can be estimated by [[Monte Carlo method|Monte Carlo]] rollouts: sample several completions starting from the partial trace and set the step's reward to
<math>\begin{cases} 1 & \text{if one of the answers is correct}\\
0 & \text{else}
\end{cases}</math>
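The Monte Carlo estimate above translates directly into code. This is a minimal sketch with hypothetical callables: `rollout` completes a partial trace and returns a final answer, and `is_correct` checks that answer.

```python
def mc_process_reward(prefix, rollout, is_correct, n=8):
    """Monte Carlo estimate of a step's process reward: sample n
    completions from the partial trace; the reward is 1 if at least
    one completion reaches a correct final answer, else 0."""
    answers = [rollout(prefix) for _ in range(n)]
    return 1 if any(is_correct(a) for a in answers) else 0
```

A step is thus rewarded if the correct answer is still reachable from it, which lets process labels be generated automatically without human step-by-step annotation.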
One can also use an ORM to implicitly construct a PRM, similar to [[direct preference optimization]].
=== Guided sampling ===
A trained ORM can be used to select the best of several candidate responses: the policy generates multiple responses, and the ORM's scores pick the one to return ("best-of-N" sampling).
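Best-of-N sampling can be sketched as follows, with `policy` and `orm` as hypothetical callables standing in for the generator and the trained reward model.

```python
def best_of_n(prompt, policy, orm, n=8):
    """Best-of-N sampling: draw n candidate responses from the policy
    and return the one the outcome reward model scores highest."""
    candidates = [policy(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: orm(prompt, resp))

# Toy demonstration with canned responses and a keyword-matching "ORM".
answers = iter(["long trace, answer 5", "short trace, answer 4", "guess, answer 5"])
toy_policy = lambda prompt: next(answers)
toy_orm = lambda prompt, resp: 1.0 if "answer 4" in resp else 0.0

best = best_of_n("What is 2 + 2?", toy_policy, toy_orm, n=3)
```

Increasing `n` spends more test-time compute for a better chance that at least one candidate is correct, which is the scaling axis this section describes.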
A trained PRM can instead guide reasoning by greedy tree search: the policy proposes several possible next reasoning steps, the PRM selects the highest-scoring one, and the process repeats.
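The greedy PRM-guided search can be sketched like this; `propose`, `prm`, and `is_done` are hypothetical callables for the step generator, the process reward model, and a termination check.

```python
def greedy_prm_search(prompt, propose, prm, max_steps=10, is_done=None):
    """Greedy tree search guided by a PRM: at each step the policy
    proposes several candidate next steps and the PRM score picks one."""
    trace = []
    for _ in range(max_steps):
        candidates = propose(prompt, trace)
        if not candidates:
            break
        step = max(candidates, key=lambda s: prm(prompt, trace + [s]))
        trace.append(step)
        if is_done and is_done(trace):
            break
    return trace

# Toy demonstration: two rounds of proposals, PRM prefers "correct" steps.
def propose(prompt, trace):
    return ["step: correct", "step: wrong"] if len(trace) < 2 else []

toy_prm = lambda prompt, trace: 1.0 if trace[-1] == "step: correct" else 0.0
trace = greedy_prm_search("prove it", propose, toy_prm)
```

Unlike best-of-N, the PRM intervenes at every step, so a bad partial step can be discarded before the whole trace is committed to.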
''Lookahead search'' is another tree search method, in which the policy proposes several next steps and performs a short rollout from each; if a rollout reaches a solution the search halts early, and otherwise the PRM scores each rollout and the step with the highest score is chosen.
''Self-consistency'' can be combined with an ORM. The model generates multiple answers, which are clustered so that each cluster shares the same final answer. The ORM scores each answer, scores within each cluster are summed, and the answer from the highest-scoring cluster is returned.
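The cluster-and-sum procedure above can be sketched as follows; `policy` and `orm` are hypothetical callables, with the policy returning (trace, final_answer) pairs.

```python
from collections import defaultdict

def self_consistency_with_orm(prompt, policy, orm, n=16):
    """Generate n responses, cluster them by final answer, sum ORM
    scores within each cluster, and return the answer of the
    highest-scoring cluster."""
    cluster_scores = defaultdict(float)
    for _ in range(n):
        trace, answer = policy(prompt)
        cluster_scores[answer] += orm(prompt, trace)
    return max(cluster_scores, key=cluster_scores.get)

# Toy demonstration: the answer "4" appears twice, "5" once.
rollouts = iter([("trace a", "4"), ("trace b", "5"), ("trace c", "4")])
toy_policy = lambda prompt: next(rollouts)
toy_orm = lambda prompt, trace: 1.0

answer = self_consistency_with_orm("What is 2 + 2?", toy_policy, toy_orm, n=3)
```

Summing scores (rather than only counting votes) lets a cluster of mediocre traces be outvoted by fewer but higher-confidence ones.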
== Benchmarks ==
Reasoning models generally score higher than non-reasoning models on many benchmarks, especially on tasks requiring multi-step reasoning.<ref>{{Citation |last=Wei |first=Jason |title=Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |date=2023-01-10 |url=http://arxiv.org/abs/2201.11903 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2201.11903 |id=arXiv:2201.11903 |last2=Wang |first2=Xuezhi |last3=Schuurmans |first3=Dale |last4=Bosma |first4=Maarten |last5=Ichter |first5=Brian |last6=Xia |first6=Fei |last7=Chi |first7=Ed |last8=Le |first8=Quoc |last9=Zhou |first9=Denny}}</ref><ref>{{Citation |last=Wang |first=Xuezhi |title=Self-Consistency Improves Chain of Thought Reasoning in Language Models |date=2023-03-07 |url=http://arxiv.org/abs/2203.11171 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2203.11171 |id=arXiv:2203.11171 |last2=Wei |first2=Jason |last3=Schuurmans |first3=Dale |last4=Le |first4=Quoc |last5=Chi |first5=Ed |last6=Narang |first6=Sharan |last7=Chowdhery |first7=Aakanksha |last8=Zhou |first8=Denny}}</ref><ref>{{Citation |last=Yao |first=Shunyu |title=Tree of Thoughts: Deliberate Problem Solving with Large Language Models |date=2023-12-03 |url=http://arxiv.org/abs/2305.10601 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2305.10601 |id=arXiv:2305.10601 |last2=Yu |first2=Dian |last3=Zhao |first3=Jeffrey |last4=Shafran |first4=Izhak |last5=Griffiths |first5=Thomas L. 
|last6=Cao |first6=Yuan |last7=Narasimhan |first7=Karthik}}</ref><ref>{{Cite journal |last=Cui |first=Dong-Xu |last2=Long |first2=Shi-Yu |last3=Tang |first3=Yi-Xuan |last4=Zhao |first4=Yue |last5=Li |first5=Qiao |date=2025-08-25 |title=Can Reasoning Power Significantly Improve the Knowledge of Large Language Models for Chemistry?─Based on Conversations with LLMs |url=https://doi.org/10.1021/acs.jcim.5c01265 |journal=Journal of Chemical Information and Modeling |doi=10.1021/acs.jcim.5c01265 |issn=1549-9596}}</ref><ref>{{Citation |last=Qwen |title=Qwen2.5 Technical Report |date=2024 |url=https://arxiv.org/abs/2412.15115 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/ARXIV.2412.15115 |last2=Yang |first2=An |last3=Yang |first3=Baosong |last4=Zhang |first4=Beichen |last5=Hui |first5=Binyuan |last6=Zheng |first6=Bo |last7=Yu |first7=Bowen |last8=Li |first8=Chengyuan |last9=Liu |first9=Dayiheng}}</ref><ref>{{Citation |last=Comanici |first=Gheorghe |title=Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities |date=2025-07-22 |url=http://arxiv.org/abs/2507.06261 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2507.06261 |id=arXiv:2507.06261 |last2=Bieber |first2=Eric |last3=Schaekermann |first3=Mike |last4=Pasupat |first4=Ice |last5=Sachdeva |first5=Noveen |last6=Dhillon |first6=Inderjit |last7=Blistein |first7=Marcel |last8=Ram |first8=Ori |last9=Zhang |first9=Dan}}</ref><ref>{{Cite journal |last=Mirza |first=Adrian |last2=Alampara |first2=Nawaf |last3=Kunchapu |first3=Sreekanth |last4=Ríos-García |first4=Martiño |last5=Emoekabu |first5=Benedict |last6=Krishnan |first6=Aswanth |last7=Gupta |first7=Tanya |last8=Schilling-Wilhelmi |first8=Mara |last9=Okereke |first9=Macjonathan |last10=Aneesh |first10=Anagha |last11=Asgari |first11=Mehrdad |last12=Eberhardt |first12=Juliane |last13=Elahi |first13=Amir Mohammad |last14=Elbeheiry |first14=Hani M. 
|last15=Gil |first15=María Victoria |date=2025-07 |title=A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists |url=https://www.nature.com/articles/s41557-025-01815-x |journal=Nature Chemistry |language=en |volume=17 |issue=7 |pages=1027–1034 |doi=10.1038/s41557-025-01815-x |issn=1755-4349}}</ref>
Some benchmarks exclude reasoning models because their responses take longer and cost more.<ref>{{cite book |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |chapter=Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study |title=Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things |publisher=ACM |date=2025 |pages=1–6 |doi=10.1145/3722565.3727198 |arxiv=2503.12282 |isbn=979-8-4007-1608-9 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? 
A Research Note |date=2025-02-13 |eprint=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |eprint=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |eprint=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>
=== Humanity's Last Exam ===
The [[Humanity's Last Exam|HLE]] benchmark tests expert-level reasoning across mathematics, the humanities, and the natural sciences. Even state-of-the-art reasoning models score low on it, leaving substantial room for improvement.
=== AIME ===
The [[American Invitational Mathematics Examination]] (AIME) is a difficult mathematics competition frequently used to compare models; reasoning models score markedly higher on it than standard LLMs.
=== o3-mini performance ===
According to OpenAI's January 2025 report on o3-mini, increasing the model's "reasoning effort" setting improves accuracy on benchmarks such as AIME and GPQA, at the cost of generating more reasoning tokens.
== Drawbacks ==
=== Computational cost ===
Reasoning models generate far more tokens per query than standard LLMs, so inference requires substantially more computation and expense.
=== Generation time ===
Because reasoning language models produce long chains of intermediate tokens before the final answer, their response latency is much higher than that of a standard [[large language model]].
== Models ==
=== [[OpenAI]] ===
* [[GPT-5]]
* [[OpenAI o4-mini|o4-mini]]
* [[OpenAI o3|o3 and o3-mini]]
=== [[Mistral AI]] ===
* Magistral (medium & small)
=== [[XAI (company)|xAI]] ===
* [[Grok_(chatbot)#Grok_3|Grok 3]]
* [[Grok_(chatbot)#Grok_4|Grok 4]]
=== [[Hugging Face]] ===
* OlympicCoder-7B & 32B, as part of reproducing the R1 training openly (Open R1 project).<ref>{{cite web |title=Open-R1: a fully open reproduction of DeepSeek-R1 |url=https://huggingface.co/blog/open-r1 |website=Hugging Face |date=2025-02-24 |access-date=2025-07-26}}</ref><ref>{{cite web |title=OlympicCoder-7B |url=https://huggingface.co/open-r1/OlympicCoder-7B |website=Hugging Face |date=2025-03-11 |access-date=2025-07-26}}</ref>
== See also ==