{{Copy edit|for=jargon|date=May 2025}}
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that are trained further to solve tasks that take several steps of [[reasoning]].<ref>{{cite arXiv |last1=Besta |first1=Maciej |last2=Barth |first2=Julia |last3=Schreiber |first3=Eric |last4=Kubicek |first4=Ales |last5=Catarino |first5=Afonso |last6=Gerstenberger |first6=Robert |last7=Nyczyk |first7=Piotr |last8=Iff |first8=Patrick |last9=Li |first9=Yueling |title=Reasoning Language Models: A Blueprint |date=2025-01-23 |arxiv=2501.11223 |class=cs.CL}}</ref> They tend to do better on logic, math, and programming tasks than standard LLMs, can [[Backtracking|revisit and revise]] earlier steps, and make use of extra computation while answering as another way to [[Neural scaling law|scale performance]], alongside the number of training examples, parameters, and training compute.<ref name=":8">{{cite web |title=Learning to reason with LLMs |url=https://openai.com/index/learning-to-reason-with-llms/ |website=OpenAI |date=2024-09-12 |access-date=2025-07-26}}</ref>
== History ==
In September 2024, [[OpenAI]] released [[OpenAI o1#release|o1-preview]], an LLM with enhanced reasoning.<ref>{{cite news |last1=Edwards |first1=Benj |date=2024-09-12 |title=OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini |url=https://arstechnica.com/information-technology/2024/09/openais-new-reasoning-ai-models-are-here-o1-preview-and-o1-mini/ |access-date=2025-02-06 |work=Ars Technica |language=en-US}}</ref> The full version, [[OpenAI o1|o1]], followed in December 2024. OpenAI also began sharing results on its successor, [[OpenAI o3|o3]].<ref>{{cite web |title=OpenAI o1 System Card |url=https://cdn.openai.com/o1-system-card.pdf |website=OpenAI |date=2024-12-05 |access-date=2025-07-26}}</ref><ref>{{cite news |last=Robison |first=Kylie |date=2024-12-05 |title=OpenAI launches ChatGPT Pro, a $200/month plan with unlimited access to o1, GPT-4o, and more |url=https://www.theverge.com/2024/12/5/24314147/openai-reasoning-model-o1-strawberry-chatgpt-pro-new-tier |access-date=2025-07-26 |work=The Verge}}</ref><ref>{{cite news |last=Singh |first=Jaspreet |date=2024-12-20 |title=OpenAI unveils 'o3' model, touting advances in reasoning |url=https://www.reuters.com/technology/artificial-intelligence/openai-unveils-o3-model-touting-advances-reasoning-2024-12-20/ |access-date=2025-07-26 |work=Reuters}}</ref>
The development of reasoning LLMs has illustrated what [[Richard S. Sutton|Rich Sutton]] called the "bitter lesson": that scaling compute often outperforms methods that rely on specific human insights.<ref>{{cite web |last1=Sutton |first1=Richard S. |title=The Bitter Lesson |url=http://www.incompleteideas.net/IncIdeas/BitterLesson.html |access-date=2025-02-27 |website=Incomplete Ideas}}</ref> For example, the Generative AI Research Lab (GAIR) explored complex methods such as tree search and reinforcement learning to replicate o1's capabilities, but reported in its "o1 Replication Journey" papers that simple [[knowledge distillation]] (training a smaller model to imitate o1's outputs) worked surprisingly well, highlighting the effectiveness of distillation in this setting.<ref>{{cite arXiv |last1=Huang |first1=Zhen |last2=Zou |first2=Haoyang |last3=Li |first3=Xuefeng |last4=Liu |first4=Yixiu |last5=Zheng |first5=Yuxiang |last6=Chern |first6=Ethan |last7=Xia |first7=Shijie |last8=Qin |first8=Yiwei |last9=Yuan |first9=Weizhe |title=O1 Replication Journey — Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? |date=2024-11-25}}</ref>
[[Alibaba Group|Alibaba]] released reasoning versions of its [[Qwen]] LLMs in November 2024.<ref>{{cite web |title=QwQ-32B-Preview: Reflect Deeply on the Boundaries of the Unknown |url=https://qwenlm.github.io/blog/qwq-32b-preview/ |website=Qwen (Alibaba Cloud) |date=2024-11-28 |access-date=2025-07-26}}</ref>
=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model with performance comparable to o1 at lower cost. The release demonstrated the effectiveness of [[Group Relative Policy Optimization]] (GRPO).<ref>{{cite news |last1=Orland |first1=Kyle |date=2025-01-28 |title=How does DeepSeek R1 really fare against OpenAI's best reasoning models? |url=https://arstechnica.com/ai/2025/01/how-does-deepseek-r1-really-fare-against-openais-best-reasoning-models/ |access-date=2025-02-06 |work=Ars Technica}}</ref><ref name=":9">{{cite arXiv |last1=DeepSeek-AI |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025 |arxiv=2501.12948}}</ref>
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]], based on its [[OpenAI o3|o3]] model.<ref name=":5">{{cite web |date=2025-02-02 |title=Introducing deep research |url=https://openai.com/index/introducing-deep-research/ |access-date=2025-02-05 |website=OpenAI |language=en-US}}</ref>
== Supervised finetuning ==
A [[large language model]] (LLM) can be fine-tuned on a dataset of reasoning tasks paired with example solutions and step-by-step (reasoning) traces. The fine-tuned model can then produce its own reasoning traces for new problems.<ref name=":0">{{cite arXiv |last1=Uesato |first1=Jonathan |last2=Kushman |first2=Nate |last3=Kumar |first3=Ramana |last4=Song |first4=Francis |last5=Siegel |first5=Noah |last6=Wang |first6=Lisa |last7=Creswell |first7=Antonia |last8=Irving |first8=Geoffrey |last9=Higgins |first9=Irina |title=Solving math word problems with process- and outcome-based feedback |date=2022-11-25}}</ref>
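The idea can be illustrated with the following Python sketch, in which a problem and its reasoning trace are concatenated into a single training sequence and optimized with the standard next-token prediction loss; the model, tokenizer, and data interfaces are hypothetical placeholders rather than any particular library's API.

<syntaxhighlight lang="python">
import torch.nn.functional as F

def reasoning_sft_loss(model, tokenizer, problem, trace):
    """Next-token (causal LM) loss on a problem concatenated with its reasoning trace."""
    ids = tokenizer.encode(problem + "\n" + trace, return_tensors="pt")
    logits = model(ids).logits[:, :-1, :]   # predictions for every position but the last
    targets = ids[:, 1:]                    # targets are the inputs shifted by one token
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
</syntaxhighlight>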
Because human-written traces are costly to collect, researchers have proposed ways to build such datasets automatically. In ''rejection sampling finetuning'' (RFT), new reasoning traces are gathered in a loop:<ref>{{cite arXiv |last1=Yuan |first1=Zheng |last2=Yuan |first2=Hongyi |last3=Li |first3=Chengpeng |last4=Dong |first4=Guanting |last5=Lu |first5=Keming |last6=Tan |first6=Chuanqi |last7=Zhou |first7=Chang |last8=Zhou |first8=Jingren |title=Scaling Relationship on Learning Mathematical Reasoning with Large Language Models |date=2023-09-13}}</ref>
# Sample a task prompt.
# Generate many reasoning traces for the prompt.
An outcome reward model, or outcome-supervised RM (ORM),<ref name=":0" /> assigns the reward for a partial trace <math>(x, y_1, \dots, y_i)</math> based only on the final answer: <math>r(x, y_1, \dots, y_i) = r(x, y_n)</math>. Such models are often called "verifiers".
For tasks with answers that are easy to verify, such as [[Word problem (mathematics education)|math word problems]], the outcome reward can be binary: 1 if the final answer is correct, 0 otherwise.<ref name=":0" /> If automatic verification is hard, humans can label answers as correct or not, and those labels can be used to finetune a base model that predicts the human label.<ref name=":2">{{cite arXiv |last1=Cobbe |first1=Karl |last2=Kosaraju |first2=Vineet |last3=Bavarian |first3=Mohammad |last4=Chen |first4=Mark |last5=Jun |first5=Heewoo |last6=Kaiser |first6=Lukasz |last7=Plappert |first7=Matthias |last8=Tworek |first8=Jerry |last9=Hilton |first9=Jacob |title=Training Verifiers to Solve Math Word Problems |date=2021-11-18}}</ref>
The ORM is usually trained with [[logistic regression]], i.e. by minimizing [[Cross-entropy|cross-entropy loss]].<ref name=":3" />
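As an illustration, a single ORM training step might look like the following sketch, which fits a scalar scoring model with binary cross-entropy against correctness labels (equivalent to logistic regression); the scoring network and batch format are hypothetical placeholders.

<syntaxhighlight lang="python">
import torch.nn.functional as F

def orm_training_step(orm_model, optimizer, batch):
    # batch["inputs"]: encoded problem plus candidate solution
    # batch["labels"]: 1.0 if the final answer is correct, 0.0 otherwise
    scores = orm_model(batch["inputs"]).squeeze(-1)   # one scalar logit per example
    loss = F.binary_cross_entropy_with_logits(scores, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</syntaxhighlight>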
For a partial reasoning trace <math>(x, y_1, \dots, y_i)</math>, several continuations ("rollouts") are sampled, and the process reward is set to
<math>r(x, y_1, \dots, y_i) = \begin{cases}
1 & \text{if some rollout from } (x, y_1, \dots, y_i) \text{ reaches a correct final answer} \\
0 & \text{else}
\end{cases}</math>
in the case of "hard estimation". This derives process rewards from an ORM, which is often easier or cheaper to construct than a PRM; a PRM can then be trained on these labels.<ref name=":3">{{cite journal |last1=Wang |first1=Peiyi |last2=Li |first2=Lei |last3=Shao |first3=Zhihong |last4=Xu |first4=Runxin |last5=Dai |first5=Damai |last6=Li |first6=Yifei |last7=Chen |first7=Deli |last8=Wu |first8=Yu |last9=Sui |first9=Zhifang |editor-last=Ku |editor-first=Lun-Wei |editor2-last=Martins |editor2-first=Andre |editor3-last=Srikumar |editor3-first=Vivek |title=Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations |journal=Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |___location=Bangkok, Thailand |publisher=Association for Computational Linguistics |date=August 2024 |pages=9426–9439 |doi=10.18653/v1/2024.acl-long.510 |arxiv=2312.08935}}</ref> Some work has instead used a fully [[Monte Carlo tree search|MCTS]]-based approach.<ref>{{cite arXiv |last1=Chen |first1=Guoxin |last2=Liao |first2=Minpeng |last3=Li |first3=Chengxi |last4=Fan |first4=Kai |title=AlphaMath Almost Zero: Process Supervision without Process |date=2024-09-27}}</ref>
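A minimal sketch of hard estimation, assuming hypothetical helpers for sampling continuations from the policy and for verifying final answers:

<syntaxhighlight lang="python">
def hard_estimate_step_label(problem, partial_steps, sample_continuation, is_correct, num_rollouts=8):
    """Label a partial reasoning trace 1 if any rollout from it reaches a correct final answer."""
    for _ in range(num_rollouts):
        completion = sample_continuation(problem, partial_steps)  # roll out to a final answer
        if is_correct(problem, completion):
            return 1
    return 0
</syntaxhighlight>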
One can also use an ORM to implicitly construct a PRM, similar to [[direct preference optimization]].<ref>{{cite arXiv |last1=Yuan |first1=Lifan |last2=Li |first2=Wendi |last3=Chen |first3=Huayu |last4=Cui |first4=Ganqu |last5=Ding |first5=Ning |last6=Zhang |first6=Kaiyan |last7=Zhou |first7=Bowen |last8=Liu |first8=Zhiyuan |last9=Peng |first9=Hao |title=Free Process Rewards without Process Labels |date=2024-12-02}}</ref>
=== Guided sampling ===
A trained ORM can be used to pick the best response. The policy generates several responses, and the ORM selects the best one. This implements a simple form of [[Neural scaling law|test-time compute scaling]] ("best-of-N").<ref name=":2" /><ref>{{cite arXiv |last1=Zhang |first1=Di |last2=Wu |first2=Jianbo |last3=Lei |first3=Jingdi |last4=Che |first4=Tong |last5=Li |first5=Jiatong |last6=Xie |first6=Tong |last7=Huang |first7=Xiaoshui |last8=Zhang |first8=Shufei |last9=Pavone |first9=Marco |title=LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning |date=2024-11-21}}</ref>
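Best-of-N selection can be sketched as follows, where <code>generate</code> and <code>orm_score</code> are hypothetical stand-ins for the policy model and the trained ORM:

<syntaxhighlight lang="python">
def best_of_n(problem, generate, orm_score, n=16):
    """Generate n candidate responses and return the one the ORM scores highest."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda response: orm_score(problem, response))
</syntaxhighlight>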
A trained PRM can guide reasoning by a greedy [[Tree traversal|tree search]]: the policy proposes several next steps, the PRM picks one, and the process repeats. This mirrors using an ORM to pick a whole response.<ref>{{cite arXiv |last1=Ma |first1=Qianli |last2=Zhou |first2=Haotian |last3=Liu |first3=Tingkai |last4=Yuan |first4=Jianbo |last5=Liu |first5=Pengfei |last6=You |first6=Yang |last7=Yang |first7=Hongxia |title=Let's reward step by step: Step-Level reward model as the Navigators for Reasoning |date=2023-10-16}}</ref>
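The greedy PRM-guided search can be sketched as follows; the step proposer, PRM scorer, and termination check are hypothetical placeholders:

<syntaxhighlight lang="python">
def greedy_prm_search(problem, propose_steps, prm_score, is_final, max_steps=32, k=8):
    """At each step, keep the candidate next step whose partial trace the PRM scores highest."""
    trace = []
    for _ in range(max_steps):
        candidates = propose_steps(problem, trace, k)  # k candidate next steps from the policy
        best = max(candidates, key=lambda step: prm_score(problem, trace + [step]))
        trace.append(best)
        if is_final(best):  # stop once a final answer has been produced
            break
    return trace
</syntaxhighlight>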
''Lookahead search'' is another tree search method. The policy proposes several next steps, then makes a short rollout for each. If a solution is found during rollout, the search stops early. Otherwise, the PRM scores each rollout, and the step with the highest score is chosen.<ref name=":7"/>
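One round of lookahead search might look like the following sketch, again with hypothetical helper functions for proposing steps, rolling out, scoring, and detecting a completed solution:

<syntaxhighlight lang="python">
def lookahead_search_step(problem, trace, propose_steps, rollout, prm_score, found_solution, k=8, depth=4):
    """Score each candidate step by a short rollout; stop early if a rollout finds a solution."""
    scored = []
    for step in propose_steps(problem, trace, k):
        continuation = rollout(problem, trace + [step], depth)  # short rollout after the candidate step
        if found_solution(continuation):
            return trace + [step] + continuation, True          # solved during the rollout
        scored.append((prm_score(problem, trace + [step] + continuation), step))
    best_step = max(scored, key=lambda pair: pair[0])[1]        # keep the highest-scoring step
    return trace + [best_step], False
</syntaxhighlight>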
Reasoning models generally score higher than non-reasoning models on many benchmarks, especially on tasks requiring multi-step reasoning.
Some benchmarks exclude reasoning models because their responses take longer and cost more.
=== Humanity's Last Exam ===
The [[Humanity's Last Exam|HLE]] benchmark tests expert-level reasoning across mathematics, the humanities, and the natural sciences, and shows large performance gaps between models. Even state-of-the-art reasoning models score low on HLE, leaving substantial room for improvement: the full reasoning model [[OpenAI o3|o3]] reached 26.6%,<ref name=":5"/> while the lighter o3-mini-high (on text-only questions) reached 13%.
=== AIME ===
=== Generation time ===
Because reasoning language models tend to produce long, verbose outputs, generating a response takes considerably longer than with a standard [[large language model]].
== Models ==
=== [[OpenAI]] ===
* [[GPT-5]]
* [[OpenAI o4-mini|o4-mini]]
* [[OpenAI o3|o3 and o3-mini]]
=== [[Mistral AI]] ===
* Magistral (medium & small)
=== [[Hugging Face]] ===
* OlympicCoder-7B and OlympicCoder-32B, released as part of the Open R1 project, an open reproduction of DeepSeek-R1's training pipeline.<ref>{{cite web |title=Open-R1: a fully open reproduction of DeepSeek-R1 |url=https://huggingface.co/blog/open-r1 |website=Hugging Face |date=2025-02-24 |access-date=2025-07-26}}</ref><ref>{{cite web |title=OlympicCoder-7B |url=https://huggingface.co/open-r1/OlympicCoder-7B |website=Hugging Face |date=2025-03-11 |access-date=2025-07-26}}</ref>