{{Short description|Language models designed for reasoning tasks}}
{{Copy edit|for=jargon|date=May 2025}}
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that are trained to solve tasks requiring several steps of reasoning, typically by generating an explicit chain of intermediate steps before giving a final answer. They tend to perform better than standard LLMs on logical, mathematical, and programming tasks, and they use additional computation at inference time as a further scaling axis.
== History ==
=== 2024 ===
In September 2024, [[OpenAI]] released [[OpenAI o1#release|o1-preview]], an LLM with enhanced reasoning.<ref>{{cite news |last1=Edwards |first1=Benj |date=2024-09-12 |title=OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini |url=https://arstechnica.com/information-technology/2024/09/openais-new-reasoning-ai-models-are-here-o1-preview-and-o1-mini/ |access-date=2025-02-06 |work=Ars Technica |language=en-US}}</ref> The full version, [[OpenAI o1|o1]], followed in December 2024. OpenAI also began sharing results on its successor, [[OpenAI o3|o3]].<ref>{{cite web |title=OpenAI o1 System Card |url=https://cdn.openai.com/o1-system-card.pdf |website=OpenAI |date=2024-12-05 |access-date=2025-07-26}}</ref><ref>{{cite news |last=Robison |first=Kylie |date=2024-12-05 |title=OpenAI launches ChatGPT Pro, a $200/month plan with unlimited access to o1 |work=The Verge}}</ref>
The development of reasoning LLMs has illustrated what [[Richard S. Sutton|Rich Sutton]] called the "[[bitter lesson]]": that general methods which scale with computation tend to outperform methods that rely on hand-built human domain knowledge.

In November 2024, [[Alibaba Group|Alibaba]] released QwQ-32B-Preview, an experimental reasoning model in its [[Qwen]] series.
In December 2024, the team introduced QVQ-72B-Preview, an experimental visual reasoning model.<ref>{{cite web |title=QVQ: To See the World with Wisdom |url=https://qwenlm.github.io/blog/qvq-72b-preview/ |website=Qwen |publisher=Alibaba Cloud |date=2024-12-25 |access-date=2025-07-26}}</ref>
In December 2024, Google introduced [[Gemini Deep Research|Deep Research]] in [[Gemini (chatbot)|Gemini]],<ref>{{cite web |date=2024-12-11 |title=Try Deep Research and our new experimental model in Gemini, your AI assistant |url=https://blog.google/products/gemini/google-gemini-deep-research/ |access-date=2025-02-05 |website=Google |language=en-US}}</ref> a feature that autonomously browses the web on a user-specified topic and compiles its findings into a multi-step research report.
On December 16, 2024, an experiment with a [[Llama (language model)|Llama]] 3B model showed that, by scaling test-time compute, a relatively small model could outperform a much larger 70B model on challenging mathematics problems.
=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model with reasoning performance comparable to o1 at a much lower cost, accompanied by a technical report and openly released model weights.
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]] based on their [[OpenAI o3|o3]] model,<ref name=":5">{{cite web |date=2025-02-02 |title=Introducing deep research |url=https://openai.com/index/introducing-deep-research/ |access-date=2025-02-05 |website=OpenAI |language=en-US}}</ref> an agent that combines the model's reasoning with web browsing to produce cited, report-style answers to complex research questions.
== Supervised finetuning ==
A [[large language model]] (LLM) can be fine-tuned on a dataset of reasoning tasks paired with step-by-step solution traces, after which it generates similar reasoning traces for new problems. Because writing such traces by hand is expensive, they can also be collected automatically by a rejection-sampling loop:
# Sample a task prompt.
# Generate many reasoning traces for the prompt.
# Use a verifier to remove reasoning traces whose final answer is incorrect.
# Fine-tune the model on the remaining, verified traces (a minimal sketch of this loop is given below).
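In the sketch below, the callables <code>sample_traces</code>, <code>extract_answer</code>, and <code>finetune</code> are hypothetical placeholders for a model's sampling, answer-extraction, and training routines rather than the API of any particular library.

<syntaxhighlight lang="python">
from typing import Callable, Dict, List, Tuple

def rejection_sampling_finetuning(
    sample_traces: Callable[[str, int], List[str]],     # model sampling: prompt -> n reasoning traces
    extract_answer: Callable[[str], str],               # pulls the final answer out of a trace
    finetune: Callable[[List[Tuple[str, str]]], None],  # trains on (prompt, trace) pairs
    tasks: Dict[str, str],                               # task prompt -> reference answer
    n_samples: int = 16,
) -> None:
    """Sketch of rejection-sampling fine-tuning: keep only traces whose
    final answer matches the reference answer, then fine-tune on them."""
    kept: List[Tuple[str, str]] = []
    for prompt, reference in tasks.items():
        for trace in sample_traces(prompt, n_samples):   # steps 1-2: sample many traces
            if extract_answer(trace) == reference:       # step 3: verify the final answer
                kept.append((prompt, trace))
    finetune(kept)                                       # step 4: train on verified traces
</syntaxhighlight>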
== Reinforcement learning ==
A pretrained language model can be further trained with [[reinforcement learning]] (RL). In the RL formalism, a generative language model is a policy: a prompt is a state, the model's response is an action, and a reward model assigns a scalar reward to the response. The policy is updated to increase the expected reward.

Training a reasoning language model with RL therefore requires a reward model for reasoning traces, usually either an outcome reward model (ORM) or a process reward model (PRM).
Most recent systems use policy-gradient methods such as [[Proximal Policy Optimization]] (PPO) because PPO constrains each policy update with a clipped objective, which stabilises training for very large policies.<ref name="OpenAIAlign2022">{{cite web |title=Aligning language models to follow instructions |website=OpenAI Blog |url=https://openai.com/blog/instruction-following/ |date=2022-01-27 |access-date=2025-05-04}}</ref>
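In its standard form, the clipped surrogate objective maximized by PPO is

:<math>L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)},</math>

where <math>\hat{A}_t</math> is an advantage estimate and <math>\epsilon</math> is the clipping parameter. Production training objectives typically add further terms, such as a [[Kullback–Leibler divergence|KL]] penalty against a reference model.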
=== Outcome reward model ===
{{Anchor|Outcome Reward Model|ORM}}
For tasks whose answers are easy to verify, such as math [[Word problem (mathematics education)|word problems]], the outcome reward can simply be binary: 1 if the final answer is correct and 0 otherwise. More generally, an ''outcome reward model'' (ORM) assigns a score to a complete reasoning trace based only on its final answer.

The ORM is usually trained by logistic regression, i.e. by minimizing [[cross-entropy]] loss, on traces labeled according to whether their final answer is correct.

Given a PRM, an ORM can be constructed by multiplying the process rewards along the reasoning trace,<ref name=":1" /> by taking their minimum, or by aggregating the process rewards in some other way.
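As a minimal illustration of this aggregation (the function name and interface are illustrative, not taken from any particular implementation), the following sketch turns a list of per-step process rewards into a single trace-level score by taking either their product or their minimum:

<syntaxhighlight lang="python">
import math
from typing import List

def outcome_from_process_rewards(step_rewards: List[float], how: str = "product") -> float:
    """Aggregate per-step process rewards (e.g. estimated step-correctness
    probabilities in [0, 1]) into one score for the whole trace."""
    if how == "product":
        return math.prod(step_rewards)  # the trace is good only if every step is good
    if how == "min":
        return min(step_rewards)        # the trace is only as good as its weakest step
    raise ValueError(f"unknown aggregation: {how}")

# A trace with one doubtful step receives a low overall score.
print(outcome_from_process_rewards([0.9, 0.95, 0.4]))         # ~0.342
print(outcome_from_process_rewards([0.9, 0.95, 0.4], "min"))  # 0.4
</syntaxhighlight>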
=== Process reward model ===
{{Anchor|Process Reward Model|PRM}}
Given a partial thinking trace <math>x, y_1, \dots, y_m</math>, a human can judge whether the steps so far are correct, regardless of whether the final answer would be correct, and this judgment provides a process reward for the partial trace. Because step-level human annotation is expensive, the process reward is often estimated automatically instead.

As an example, the process reward of a partial trace can be estimated by sampling several completions of the trace and checking their final answers:

<math>r(x, y_1, \dots, y_m) = \begin{cases} 1 & \text{if one of the answers is correct}\\
0 & \text{else}
\end{cases}</math>
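A minimal sketch of this sampling-based estimate is shown below; <code>sample_completions</code> and <code>is_correct</code> are assumed placeholders for the model's rollout routine and an answer checker, not a specific library API.

<syntaxhighlight lang="python">
from typing import Callable, List

def hard_process_reward(
    partial_trace: str,
    sample_completions: Callable[[str, int], List[str]],  # rolls out n completions of the partial trace
    is_correct: Callable[[str], bool],                     # checks a completed trace's final answer
    n_rollouts: int = 8,
) -> int:
    """Return 1 if any sampled completion of the partial trace reaches a
    correct final answer, and 0 otherwise (the indicator in the equation above)."""
    completions = sample_completions(partial_trace, n_rollouts)
    return int(any(is_correct(c) for c in completions))
</syntaxhighlight>

A softer variant uses the fraction of correct completions instead of the 0/1 indicator.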
One can also use an ORM to implicitly construct a PRM, similar to [[direct preference optimization]].<ref>{{cite arXiv |last1=Yuan |first1=Lifan |last2=Li |first2=Wendi |last3=Chen |first3=Huayu |last4=Cui |first4=Ganqu |last5=Ding |first5=Ning |last6=Zhang |first6=Kaiyan |last7=Zhou |first7=Bowen |last8=Liu |first8=Zhiyuan |last9=Peng |first9=Hao |title=Free Process Rewards without Process Labels |date=2024-12-02}}</ref>
=== Guided sampling ===
A trained ORM can be used to select the best of several candidate responses: the policy generates multiple responses to the same prompt, the ORM scores each one, and the highest-scoring response is returned ("best-of-N" sampling).

A trained PRM can guide reasoning step by step through tree search: at each step the policy proposes several candidate next steps, the PRM scores them, and only the most promising candidates are expanded further, as in greedy decoding or [[beam search]].

''Lookahead search'' is another tree search method, in which each candidate next step is evaluated by rolling out a short continuation from it; the step whose rollout earns the highest process reward, or reaches a correct answer, is kept.

''Self-consistency'' can be combined with an ORM. The model generates multiple responses, the responses are grouped by their final answer, the ORM scores within each group are summed, and the answer of the highest-scoring group is returned.
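The sketch below illustrates ORM-guided best-of-N selection and ORM-weighted self-consistency over a set of sampled responses; <code>orm_score</code> and <code>answer_of</code> are placeholders for a trained ORM and an answer-extraction routine and are assumptions of the example.

<syntaxhighlight lang="python">
from collections import defaultdict
from typing import Callable, Dict, List

def best_of_n(responses: List[str], orm_score: Callable[[str], float]) -> str:
    """Return the single response that the ORM scores highest."""
    return max(responses, key=orm_score)

def weighted_self_consistency(
    responses: List[str],
    orm_score: Callable[[str], float],
    answer_of: Callable[[str], str],
) -> str:
    """Group responses by their final answer, sum ORM scores within each group,
    and return the answer of the highest-scoring group."""
    totals: Dict[str, float] = defaultdict(float)
    for response in responses:
        totals[answer_of(response)] += orm_score(response)
    return max(totals, key=totals.get)
</syntaxhighlight>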
== Benchmarks ==
Reasoning models generally score higher than non-reasoning models on benchmarks that require multi-step reasoning, such as competition mathematics, science question answering, and programming, although this comes at the cost of longer and more expensive inference.
=== Humanity's Last Exam ===
The [[Humanity's Last Exam|HLE]] benchmark tests expert-level reasoning across mathematics, the humanities, and the natural sciences. Even state-of-the-art reasoning models answer only a small fraction of its questions correctly, leaving substantial room for improvement.
=== AIME ===
The [[American Invitational Mathematics Examination]] (AIME) is a difficult mathematics competition frequently used to evaluate models; reasoning models generally score far higher on it than non-reasoning LLMs.
=== o3-mini performance ===
According to OpenAI's January 2025 report on o3-mini, the model's selectable "reasoning effort" level significantly affects its performance: higher reasoning effort improves accuracy on mathematics, science, and coding benchmarks, at the cost of generating more reasoning tokens and taking longer to respond.
== Drawbacks ==
=== Computational cost ===
Reasoning models typically require far more compute per query than standard LLMs, because they generate long chains of thought before producing a final answer, which increases both cost and energy use.
=== Generation time ===
Because reasoning language models tend to produce long, verbose outputs, generating a response takes considerably more time than with a standard [[large language model]].
== Models ==
=== [[OpenAI]] ===
* [[GPT-5]]
* [[OpenAI o4-mini|o4-mini]]
* [[OpenAI o3|o3 and o3-mini]]
=== [[Mistral AI]] ===
* Magistral (medium & small)
=== [[Hugging Face]] ===
* OlympicCoder-7B & 32B, as part of reproducing the R1 training openly (Open R1 project).<ref>{{cite web |title=Open-R1: a fully open reproduction of DeepSeek-R1 |url=https://huggingface.co/blog/open-r1 |website=Hugging Face |date=2025-02-24 |access-date=2025-07-26}}</ref><ref>{{cite web |title=OlympicCoder-7B |url=https://huggingface.co/open-r1/OlympicCoder-7B |website=Hugging Face |date=2025-03-11 |access-date=2025-07-26}}</ref>
== See also ==