{{Short description|Language models designed for reasoning tasks}}{{Multiple issues|
{{unreliable sources|date=January 2025}}
{{Copy edit|for=jargon|date=May 2025}}
}}
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that have been further trained to solve multi-step [[reasoning]] tasks.<ref>{{cite arXiv |last1=Besta |first1=Maciej |last2=Barth |first2=Julia |last3=Schreiber |first3=Eric |last4=Kubicek |first4=Ales |last5=Catarino |first5=Afonso |last6=Gerstenberger |first6=Robert |last7=Nyczyk |first7=Piotr |last8=Iff |first8=Patrick |last9=Li |first9=Yueling |title=Reasoning Language Models: A Blueprint |date=2025-01-23 |eprint=2501.11223 |class=cs.CL}}</ref> These models perform better on logical, mathematical or programmatic tasks than traditional autoregressive LLMs, have the ability to [[Backtracking|backtrack]], and employ test-time compute as an additional [[Neural scaling law|scaling axis]] beyond [[Training, validation, and test data sets|training examples]], parameter count, and train-time compute.<ref name=":8">{{cite web |title=Learning to reason with LLMs |url=https://openai.com/index/learning-to-reason-with-llms/ |website=OpenAI |date=2024-09-12 |access-date=2025-07-26}}</ref>
 
== History ==
=== 2024 ===
In September 2024, [[OpenAI]] released [[OpenAI o1#release|o1-preview]], an LLM with enhanced reasoning.<ref>{{cite news |last1=Edwards |first1=Benj |date=2024-09-12 |title=OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini |url=https://arstechnica.com/information-technology/2024/09/openais-new-reasoning-ai-models-are-here-o1-preview-and-o1-mini/ |access-date=2025-02-06 |work=Ars Technica |language=en-US}}</ref> The full version, [[OpenAI o1|o1]], followed in December 2024. OpenAI also began sharing results on its successor, [[OpenAI o3|o3]].<ref>{{cite web |title=OpenAI o1 System Card |url=https://cdn.openai.com/o1-system-card.pdf |website=OpenAI |date=2024-12-05 |access-date=2025-07-26}}</ref><ref>{{cite news |last=Robison |first=Kylie |date=2024-12-05 |title=OpenAI launches ChatGPT Pro, a $200/month plan with unlimited access to o1, GPT-4o, and more |url=https://www.theverge.com/2024/12/5/24314147/openai-reasoning-model-o1-strawberry-chatgpt-pro-new-tier |access-date=2025-07-26 |work=The Verge}}</ref><ref>{{cite news |last=Singh |first=Jaspreet |date=2024-12-20 |title=OpenAI unveils 'o3' model, touting advances in reasoning |url=https://www.reuters.com/technology/artificial-intelligence/openai-unveils-o3-model-touting-advances-reasoning-2024-12-20/ |access-date=2025-07-26 |work=Reuters}}</ref>
 
The development of reasoning LLMs has illustrated what [[Richard S. Sutton|Rich Sutton]] termed the "bitter lesson": that general methods leveraging computation often outperform those relying on specific human insights.<ref>{{cite web |last1=Sutton |first1=Richard S. |title=The Bitter Lesson |url=http://www.incompleteideas.net/IncIdeas/BitterLesson.html |access-date=2025-02-27 |website=Incomplete Ideas}}</ref> For instance, some research groups, such as the Generative AI Research Lab (GAIR), initially explored complex techniques like tree search and reinforcement learning in attempts to replicate o1's capabilities. However, as documented in their "o1 Replication Journey" papers, they found that [[knowledge distillation]], i.e. training a smaller model to mimic o1's outputs, was surprisingly effective, highlighting the power of distillation in this context.<ref>{{cite arXiv |last1=Huang |first1=Zhen |last2=Zou |first2=Haoyang |last3=Li |first3=Xuefeng |last4=Liu |first4=Yixiu |last5=Zheng |first5=Yuxiang |last6=Chern |first6=Ethan |last7=Xia |first7=Shijie |last8=Qin |first8=Yiwei |last9=Yuan |first9=Weizhe |title=O1 Replication Journey — Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? |date=2024-11-25 |arxiv=2411.16489 |class=cs.CL}}</ref><ref name=":6">{{cite news |last=Zeff |first=Maxwell |date=2025-02-05 |title=Researchers created an open rival to OpenAI’s o1 ‘reasoning’ model for under $50 |url=https://techcrunch.com/2025/02/05/researchers-created-an-open-rival-to-openais-o1-reasoning-model-for-under-50/ |access-date=2025-07-26 |work=TechCrunch}}</ref>
 
[[Alibaba Group|Alibaba]] also released reasoning versions of its [[Qwen]] LLMs in November 2024.<ref>{{cite web |title=QwQ-32B-Preview: Reflect Deeply on the Boundaries of the Unknown |url=https://qwenlm.github.io/blog/qwq-32b-preview/ |website=Qwen (Alibaba Cloud) |date=2024-11-28 |access-date=2025-07-26}}</ref>
In December 2024, the team introduced QVQ-72B-Preview, an experimental visual reasoning model.<ref>{{cite web |title=QVQ: To See the World with Wisdom |url=https://qwenlm.github.io/blog/qvq-72b-preview/ |website=Qwen |publisher=Alibaba Cloud |date=2024-12-25 |access-date=2025-07-26}}</ref>
 
In December 2024, Google introduced [[Gemini Deep Research|Deep Research]] in [[Gemini (chatbot)|Gemini]], a feature that conducts multi-step research tasks.<ref>{{cite web |date=2024-12-11 |title=Try Deep Research and our new experimental model in Gemini, your AI assistant |url=https://blog.google/products/gemini/google-gemini-deep-research/ |access-date=2025-02-05 |website=Google |language=en-US}}</ref><ref>{{cite news |last=Roth |first=Emma |date=2024-12-11 |title=Google built an AI tool that can do research for you |url=https://www.theverge.com/2024/12/11/24318217/google-gemini-advanced-deep-research-launch |access-date=2025-07-26 |work=The Verge}}</ref>
 
On December 16, 2024, an experiment with a [[Llama (language model)|Llama]] 3B model demonstrated that, by scaling test-time compute, a relatively small model could outperform a much larger Llama 70B model on challenging reasoning tasks. This result suggested that improved inference strategies can unlock latent reasoning capabilities even in compact models.<ref>{{cite web |title=Scaling test-time compute |url=https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute |website=Hugging Face |date=2024-12-16 |access-date=2025-07-26}}</ref><ref name=":7">{{cite journal |last1=Snell |first1=Charlie |last2=Lee |first2=Jaehoon |last3=Xu |first3=Kelvin |last4=Kumar |first4=Aviral |date=2025 |title=Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters |url=https://openreview.net/forum?id=t4s3hJY9dH |journal=International Conference on Learning Representations (ICLR 2025) |access-date=2025-07-26 |arxiv=2408.03314}}</ref>
 
=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model competitive with o1 at a lower cost, highlighting the effectiveness of [[Group Relative Policy Optimization]] (GRPO).<ref>{{cite news |last1=Orland |first1=Kyle |date=2025-01-28 |title=How does DeepSeek R1 really fare against OpenAI's best reasoning models? |url=https://arstechnica.com/ai/2025/01/how-does-deepseek-r1-really-fare-against-openais-best-reasoning-models/ |access-date=2025-02-06 |work=Ars Technica}}</ref><ref name=":9">{{cite arXiv |last1=DeepSeek-AI |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |arxiv=2501.12948 |class=cs.CL}}</ref> On January 25, 2025, DeepSeek added a feature to R1 that enables the simultaneous use of web search and reasoning, allowing data retrieval to be integrated more efficiently with reflective reasoning.<ref>{{cite news |script-title=zh:DeepSeek 支持“深度思考+联网检索”能力 |trans-title=DeepSeek adds a search feature supporting simultaneous deep thinking and web search |work=People’s Daily Online |date=2025-01-29 |url=http://tech.people.com.cn/n1/2025/0129/c1007-40386565.html |language=zh |access-date=2025-07-26}}</ref> OpenAI subsequently released o3-mini, followed by [[ChatGPT Deep Research|Deep Research]], which is based on [[OpenAI o3|o3]].<ref>{{cite news |last1=Milmo |first1=Dan |date=2025-02-03 |title=OpenAI launches 'deep research' tool that it says can match research analyst |url=https://www.theguardian.com/technology/2025/feb/03/openai-deep-research-agent-chatgpt-deepseek |access-date=2025-03-16 |work=The Guardian |language=en-GB |issn=0261-3077}}</ref> The s1-32B model further demonstrated the power of distillation, achieving strong performance with budget forcing and test-time scaling techniques.<ref name=":10">{{cite arXiv |last1=Muennighoff |first1=Niklas |last2=Yang |first2=Zitong |last3=Shi |first3=Weijia |last4=Li |first4=Xiang Lisa |last5=Fei-Fei |first5=Li |last6=Hajishirzi |first6=Hannaneh |last7=Zettlemoyer |first7=Luke |last8=Liang |first8=Percy |last9=Candès |first9=Emmanuel |title=s1: Simple test-time scaling |date=2025-02-03 |arxiv=2501.19393 |class=cs.CL}}</ref><ref name=":6"/>
 
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]],<ref name=":5">{{cite web |date=2025-02-02 |title=Introducing deep research |url=https://openai.com/index/introducing-deep-research/ |access-date=2025-02-05 |website=OpenAI |language=en-US}}</ref> a tool that integrates reasoning and web search in a unified workflow, allowing users to perform complex research tasks that require multi-step reasoning and data synthesis from multiple sources. It is based on [[OpenAI o3|o3]] and can take from 5 to 30 minutes to generate comprehensive reports.<ref name=":5"/>
 
== Supervised finetuning ==
A [[large language model]] (LLM) can be finetuned on a dataset of reasoning tasks with example solutions and reasoning traces. The finetuned model can then produce its own reasoning traces for new problems.<ref name=":0">{{cite arXiv |last1=Uesato |first1=Jonathan |last2=Kushman |first2=Nate |last3=Kumar |first3=Ramana |last4=Song |first4=Francis |last5=Siegel |first5=Noah |last6=Wang |first6=Lisa |last7=Creswell |first7=Antonia |last8=Irving |first8=Geoffrey |last9=Higgins |first9=Irina |title=Solving math word problems with process- and outcome-based feedback |date=2022-11-25 |arxiv=2211.14275 |class=cs.LG}}</ref><ref name=":2" />
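
A minimal sketch of what such a finetuning example can look like is given below; the record fields and the "Final answer:" convention are illustrative choices, not taken from any particular paper.
<syntaxhighlight lang="python">
# Illustrative only: building (prompt, completion) pairs for supervised
# finetuning, where the completion contains the reasoning trace and answer.
records = [
    {
        "problem": "A store sold 48 apples on Monday and half as many on Tuesday. How many apples in total?",
        "trace": "Monday: 48. Tuesday: 48 / 2 = 24. Total: 48 + 24 = 72.",
        "answer": "72",
    },
]

def to_sft_example(record: dict) -> dict:
    """Format one record so the model learns to emit the trace before the answer."""
    prompt = f"Problem: {record['problem']}\nSolution:"
    completion = f" {record['trace']}\nFinal answer: {record['answer']}"
    return {"prompt": prompt, "completion": completion}

sft_dataset = [to_sft_example(r) for r in records]
print(sft_dataset[0]["prompt"] + sft_dataset[0]["completion"])
</syntaxhighlight>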
 
As it is expensive to get humans to write reasoning traces for an SFT dataset, researchers have proposed ways to automatically construct SFT datasets. In rejection sampling finetuning (RFT), new reasoning traces are collected via a loop:<ref>{{cite arXiv |last1=Yuan |first1=Zheng |last2=Yuan |first2=Hongyi |last3=Li |first3=Chengpeng |last4=Dong |first4=Guanting |last5=Lu |first5=Keming |last6=Tan |first6=Chuanqi |last7=Zhou |first7=Chang |last8=Zhou |first8=Jingren |title=Scaling Relationship on Learning Mathematical Reasoning with Large Language Models |date=2023-09-13 |arxiv=2308.01825 |class=cs.CL}}</ref>
# Sample a task prompt.
# Generate many reasoning traces for the prompt.
=== Outcome reward model ===
An outcome reward model, or outcome-supervised RM (ORM),<ref name=":0" /> is a reward model in which the reward of a step <math>r(x, y_1, \dots, y_i)</math> is determined by the final answer: <math>r(x, y_1, \dots, y_i) = r(x, y_n)</math>. Such models are also called "verifiers".
 
For tasks with an answer that is easy to verify, such as [[Word problem (mathematics education)|word problems in math]], the outcome reward can simply be binary: 1 if the final answer is correct, and 0 otherwise.<ref name=":0" /> If the answer is not easy to verify programmatically, humans can manually label the answers as correct or not, and the labels can then be used to finetune a base model that predicts the human label.<ref name=":2">{{cite arXiv |last1=Cobbe |first1=Karl |last2=Kosaraju |first2=Vineet |last3=Bavarian |first3=Mohammad |last4=Chen |first4=Mark |last5=Jun |first5=Heewoo |last6=Kaiser |first6=Lukasz |last7=Plappert |first7=Matthias |last8=Tworek |first8=Jerry |last9=Hilton |first9=Jacob |title=Training Verifiers to Solve Math Word Problems |date=2021-11-18 |arxiv=2110.14168 |class=cs.LG}}</ref> For other kinds of tasks, such as creative writing, where task performance is not binary true/false, one can train a reward model by finetuning a base model on human [[Ranking (statistics)|ranked preference]] data, as in [[reinforcement learning from human feedback]].<ref name=":1">{{cite journal |last1=Lightman |first1=Hunter |last2=Kosaraju |first2=Vineet |last3=Burda |first3=Yura |last4=Edwards |first4=Harri |last5=Baker |first5=Bowen |last6=Lee |first6=Teddy |last7=Leike |first7=Jan |last8=Schulman |first8=John |last9=Sutskever |first9=Ilya |date=2024 |title=Let's Verify Step by Step |url=https://openreview.net/forum?id=dKDGgN0eTg |journal=International Conference on Learning Representations (ICLR 2024) |access-date=2025-07-26 |arxiv=2305.20050}}</ref> A base model can also be finetuned to predict, given a partial thinking trace <math>x, y_1, \dots, y_m</math>, whether the final answer will be correct, and this prediction can serve as a binary reward signal.<ref name=":0" />
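
For the easily verifiable case, a binary outcome reward can be sketched as an exact-match check on the extracted final answer. The "Final answer:" marker below is an illustrative convention rather than a standard.
<syntaxhighlight lang="python">
# Minimal sketch of a binary outcome reward for easily verifiable answers.
def outcome_reward(trace: str, ground_truth: str) -> float:
    """Return 1.0 if the trace's final answer matches the ground truth, else 0.0."""
    marker = "Final answer:"
    if marker not in trace:
        return 0.0
    predicted = trace.rsplit(marker, 1)[-1].strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0

print(outcome_reward("48 / 2 = 24; 48 + 24 = 72. Final answer: 72", "72"))  # 1.0
print(outcome_reward("48 / 2 = 24; 48 + 24 = 70. Final answer: 70", "72"))  # 0.0
</syntaxhighlight>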
 
The ORM is usually trained via [[logistic regression]], i.e. minimizing [[Cross-entropy|cross-entropy loss]].<ref name=":3" />
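
As an illustration of that objective, the sketch below trains a scalar scoring head with PyTorch's binary cross-entropy loss on dummy data; the pooled embeddings stand in for whatever representation of the question and solution the verifier consumes, so this does not mirror any specific published implementation.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Illustrative outcome reward model: a linear scoring head over a pooled
# embedding of (question, solution). Real systems use the LLM itself here.
class OutcomeRewardModel(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.score_head = nn.Linear(hidden_dim, 1)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.score_head(pooled).squeeze(-1)  # raw correctness logit

model = OutcomeRewardModel()
loss_fn = nn.BCEWithLogitsLoss()  # cross-entropy loss, i.e. logistic regression
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Dummy batch: 4 solution embeddings with binary "final answer correct" labels.
embeddings = torch.randn(4, 768)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

optimizer.zero_grad()
loss = loss_fn(model(embeddings), labels)
loss.backward()
optimizer.step()
print(float(loss))
</syntaxhighlight>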
 
Given a PRM, an ORM can be constructed by multiplying the process rewards along the reasoning trace,<ref name=":1" /> taking their minimum,<ref name=":3" /> or aggregating them in some other way. DeepSeek used a simple ORM to train its [[DeepSeek (chatbot)|R1 model]].<ref name=":9"/>
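
A sketch of those two aggregation rules, assuming per-step process rewards in the range [0, 1] (the function and variable names are illustrative):
<syntaxhighlight lang="python">
import math

# Illustrative ways to collapse step-level process rewards into one
# trace-level (outcome-style) score.
def aggregate_process_rewards(step_rewards: list[float], method: str = "product") -> float:
    if method == "product":
        return math.prod(step_rewards)   # multiply rewards along the trace
    if method == "min":
        return min(step_rewards)         # the weakest step bounds the trace
    raise ValueError(f"unknown aggregation method: {method}")

rewards = [0.9, 0.8, 0.95]
print(aggregate_process_rewards(rewards, "product"))  # ~0.684
print(aggregate_process_rewards(rewards, "min"))      # 0.8
</syntaxhighlight>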
 
=== Process reward model ===
Given a partial thinking trace <math>x, y_1, \dots, y_m</math>, a human can be queried as to whether the steps ''so far'' are correct, regardless of whether the ultimate answer would be correct. This can then be used as a binary reward signal. As human labels are expensive, a base model can then be finetuned to predict the human labels.<ref name=":0" /> The PRM is usually trained by [[logistic regression]] on the human labels, i.e. by minimizing the [[Cross-entropy|cross-entropy loss]] between the true labels and the predicted labels.<ref name=":3" />
 
As an example, in a 2023 OpenAI paper, 800K process labels were collected for 75K solution traces. A labeler would be presented with a solution trace and label each step "positive" if it progressed toward the solution, "neutral" if it was not wrong but did not make progress, and "negative" if it was a mistake. As soon as a "negative" label was entered, the labeler would stop labeling that trace and move on to another one. The idea was that, while labeling subsequent reasoning steps could provide even richer supervision signals, labeling only up to the first error was sufficient for training a competent PRM.<ref name=":1" /><ref>{{cite web |title=openai/prm800k |url=https://github.com/openai/prm800k |website=GitHub |publisher=OpenAI |date=2025-01-27 |access-date=2025-01-27}}</ref>
 
As human labels are expensive, researchers have proposed methods to create PRMs without human labels on the processes. Inspired by [[Monte Carlo tree search]] (MCTS), the Math-Shepherd method samples multiple continuations until the end, starting at each reasoning step <math>y_i</math>, and sets the reward at that step to be either <math>\frac{\#\text{(correct answers)}}{\#\text{(total answers)}}</math> in the case of "soft estimation", or <math>\begin{cases}
1 & \text{if one of the answers is correct}\\
0 & \text{else}
\end{cases}</math> in the case of "hard estimation". This creates a process reward using only an ORM, which is usually easier or cheaper to construct. After creating these process reward labels, a PRM can be trained on them.<ref name=":3">{{cite journal |last1=Wang |first1=Peiyi |last2=Li |first2=Lei |last3=Shao |first3=Zhihong |last4=Xu |first4=Runxin |last5=Dai |first5=Damai |last6=Li |first6=Yifei |last7=Chen |first7=Deli |last8=Wu |first8=Yu |last9=Sui |first9=Zhifang |editor-last=Ku |editor-first=Lun-Wei |editor2-last=Martins |editor2-first=Andre |editor3-last=Srikumar |editor3-first=Vivek |date=August 2024 |title=Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations |url=https://aclanthology.org/2024.acl-long.510/ |journal=Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |___location=Bangkok, Thailand |publisher=Association for Computational Linguistics |pages=9426–9439 |doi=10.18653/v1/2024.acl-long.510 |arxiv=2312.08935}}</ref> Some have tried a fully MCTS approach.<ref>{{cite arXiv |last1=Chen |first1=Guoxin |last2=Liao |first2=Minpeng |last3=Li |first3=Chengxi |last4=Fan |first4=Kai |title=AlphaMath Almost Zero: Process Supervision without Process |date=2024-09-27 |arxiv=2405.03553 |class=cs.LG}}</ref>
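
The soft and hard estimates can be written compactly as a Monte Carlo routine. In the sketch below, <code>complete_trace</code> and <code>is_correct</code> are placeholders for sampling a continuation from the policy model and checking the final answer, so this only illustrates the shape of the computation.
<syntaxhighlight lang="python">
import random

# Illustrative Monte Carlo process reward in the style of Math-Shepherd.
# `complete_trace` and `is_correct` stand in for the policy LLM and the
# outcome verifier.
def complete_trace(partial_trace: str) -> str:
    # Placeholder: a real implementation samples a continuation from the model.
    return partial_trace + " ... Final answer: " + random.choice(["72", "70"])

def is_correct(full_trace: str, ground_truth: str = "72") -> bool:
    return full_trace.strip().endswith(ground_truth)

def process_reward(partial_trace: str, num_rollouts: int = 8, mode: str = "soft") -> float:
    outcomes = [is_correct(complete_trace(partial_trace)) for _ in range(num_rollouts)]
    if mode == "soft":
        return sum(outcomes) / num_rollouts   # fraction of rollouts that end correctly
    return 1.0 if any(outcomes) else 0.0      # hard estimation

print(process_reward("48 / 2 = 24.", mode="soft"))
print(process_reward("48 / 2 = 24.", mode="hard"))
</syntaxhighlight>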
 
One can also use an ORM to implicitly construct a PRM, similar to [[direct preference optimization]].<ref>{{cite arXiv |last1=Yuan |first1=Lifan |last2=Li |first2=Wendi |last3=Chen |first3=Huayu |last4=Cui |first4=Ganqu |last5=Ding |first5=Ning |last6=Zhang |first6=Kaiyan |last7=Zhou |first7=Bowen |last8=Liu |first8=Zhiyuan |last9=Peng |first9=Hao |title=Free Process Rewards without Process Labels |date=2024-12-02 |arxiv=2412.01981 |class=cs.CL}}</ref>
 
=== Guided sampling ===
A trained ORM can be used to select the best response: the policy model rolls out multiple responses, and the ORM selects the best one. This enables a simple form of [[Neural scaling law|test-time compute scaling]] ("best-of-N").<ref name=":2" /><ref>{{cite arXiv |last1=Zhang |first1=Di |last2=Wu |first2=Jianbo |last3=Lei |first3=Jingdi |last4=Che |first4=Tong |last5=Li |first5=Jiatong |last6=Xie |first6=Tong |last7=Huang |first7=Xiaoshui |last8=Zhang |first8=Shufei |last9=Pavone |first9=Marco |title=LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning |date=2024-11-21 |arxiv=2410.02884 |class=cs.CL}}</ref>
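
A sketch of best-of-N selection follows; <code>generate</code> and <code>orm_score</code> are placeholders for the policy model and the trained ORM, and the dummy stand-ins exist only so the example runs.
<syntaxhighlight lang="python">
import random
from typing import Callable

# Illustrative best-of-N sampling: draw N candidate responses and return the
# one the outcome reward model scores highest.
def best_of_n(prompt: str,
              generate: Callable[[str], str],
              orm_score: Callable[[str, str], float],
              n: int = 16) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: orm_score(prompt, response))

# Dummy stand-ins so the sketch runs end to end.
dummy_generate = lambda prompt: f"The answer is {random.randint(1, 5)}"
dummy_orm = lambda prompt, response: float(response.endswith("3"))
print(best_of_n("What is 1 + 2?", dummy_generate, dummy_orm, n=8))
</syntaxhighlight>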
 
A trained PRM can also guide reasoning via greedy [[Tree traversal|tree search]]: the policy model generates several candidate next reasoning steps, the PRM selects the best one, and the process repeats. This mirrors how a trained ORM is used to select the best complete response.<ref>{{cite arXiv |last1=Ma |first1=Qianli |last2=Zhou |first2=Haotian |last3=Liu |first3=Tingkai |last4=Yuan |first4=Jianbo |last5=Liu |first5=Pengfei |last6=You |first6=Yang |last7=Yang |first7=Hongxia |title=Let's reward step by step: Step-Level reward model as the Navigators for Reasoning |date=2023-10-16 |arxiv=2310.10080 |class=cs.CL}}</ref> [[Beam search]] performs better than greedy search.
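
The greedy variant of this search can be sketched as follows, with <code>propose_steps</code>, <code>prm_score</code>, and <code>is_finished</code> as placeholders for the policy model, the PRM, and a stopping check.
<syntaxhighlight lang="python">
from typing import Callable

# Illustrative PRM-guided greedy tree search: at each step, propose several
# candidate next reasoning steps and keep the one the PRM scores highest.
def greedy_prm_search(prompt: str,
                      propose_steps: Callable[[str, list[str]], list[str]],
                      prm_score: Callable[[str, list[str]], float],
                      is_finished: Callable[[list[str]], bool],
                      max_steps: int = 32) -> list[str]:
    """Greedily extend a reasoning trace, one PRM-preferred step at a time."""
    trace: list[str] = []
    for _ in range(max_steps):
        candidates = propose_steps(prompt, trace)
        best_step = max(candidates, key=lambda step: prm_score(prompt, trace + [step]))
        trace.append(best_step)
        if is_finished(trace):
            break
    return trace
</syntaxhighlight>
Beam search generalizes this greedy procedure by keeping several of the highest-scoring partial traces at each step instead of a single one.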
 
Lookahead search is another tree search method: the policy model generates several candidate next reasoning steps, then a (partial) rollout is made for each. If a solution endpoint is reached during the forward simulation, the process halts early. Otherwise, the PRM is used to calculate the total reward for each rollout, and the step whose rollout scores highest is selected.<ref name=":7"/>
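
A sketch of a single lookahead decision under the same placeholder interfaces, where each candidate step is judged by the reward of its rollout rather than by scoring the step directly:
<syntaxhighlight lang="python">
from typing import Callable

# Illustrative lookahead search: score each candidate next step by rolling it
# out (partially) and evaluating the rollout, then commit to the best step.
# `propose_steps`, `rollout`, `rollout_reward`, and `reached_solution` are
# placeholders for the policy model, the simulator, and the reward model.
def lookahead_select(prompt: str,
                     trace: list[str],
                     propose_steps: Callable[[str, list[str]], list[str]],
                     rollout: Callable[[str, list[str]], list[str]],
                     rollout_reward: Callable[[str, list[str]], float],
                     reached_solution: Callable[[list[str]], bool]) -> tuple[list[str], bool]:
    """Pick the next step whose (partial) rollout earns the highest reward."""
    best_step, best_reward = None, float("-inf")
    for step in propose_steps(prompt, trace):
        simulated = trace + [step] + rollout(prompt, trace + [step])
        if reached_solution(simulated):
            return simulated, True   # a rollout already reached a solution; halt early
        reward = rollout_reward(prompt, simulated)
        if reward > best_reward:
            best_step, best_reward = step, reward
    return trace + [best_step], False
</syntaxhighlight>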
 
Self-consistency can be combined with an ORM. The model is used to generate multiple answers, and the answers are clustered so that each cluster contains responses with the same final answer. The ORM computes the reward for each answer, the rewards within each cluster are summed, and the answer corresponding to the cluster with the highest summed reward is output.<ref name=":3" />
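
A sketch of this reward-weighted self-consistency, with <code>final_answer</code> and <code>orm_score</code> as placeholders for answer extraction and the trained ORM:
<syntaxhighlight lang="python">
from collections import defaultdict

# Illustrative reward-weighted self-consistency: cluster sampled responses by
# final answer and return the answer whose cluster has the largest summed
# ORM reward.
def weighted_self_consistency(prompt, responses, final_answer, orm_score):
    cluster_reward = defaultdict(float)
    for response in responses:
        cluster_reward[final_answer(response)] += orm_score(prompt, response)
    return max(cluster_reward, key=cluster_reward.get)

responses = ["... Final answer: 72", "... Final answer: 70", "... Final answer: 72"]
print(weighted_self_consistency(
    "apples problem", responses,
    final_answer=lambda r: r.rsplit(":", 1)[-1].strip(),
    orm_score=lambda p, r: 0.5,   # dummy constant reward
))  # 72
</syntaxhighlight>
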
== Benchmarks ==
Reasoning models generally outperform non-reasoning models in most benchmarks, especially on tasks requiring multi-step reasoning.
 
However, some benchmarks exclude reflective models due to longer response times.<ref>{{cite journal |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |title=Toward Foundation Models for Online Complex Event Detection in CPS‑IoT: A Case Study |journal=Proceedings of the 26th International Conference on Information Processing in Sensor Networks (IPSN ’25) |publisher=ACM |date=2025 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low‑latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? A Research Note |date=2025-02-13 |arxiv=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |arxiv=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |arxiv=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>
 
=== Humanity's Last Exam ===
The [[Humanity's Last Exam|HLE]], a rigorous benchmark designed to assess expert-level reasoning across mathematics, humanities, and the natural sciences, reveals substantial performance gaps among models. State-of-the-art reasoning models have demonstrated low accuracy on HLE, highlighting significant room for improvement. In particular, the full reasoning model [[OpenAI o3|o3]] achieved an accuracy of 26.6%,<ref name=":5"/> while its lighter counterpart, o3-mini-high (evaluated on text-only questions), reached 13%.<ref>{{cite web |title=Humanity’s Last Exam leaderboard |url=https://agi.safe.ai/benchmarks/hle |website=Safe.ai |publisher=Center for AI Safety |access-date=2025-07-26}}</ref>
 
=== AIME ===
The [[American Invitational Mathematics Examination]] (AIME) benchmark, a challenging mathematics competition, demonstrates significant performance differences between model types. Non-reasoning models typically solve less than 30% of AIME problems. In contrast, models employing reasoning techniques score between 50% and 80%.<ref name=":8"/><ref name=":9"/><ref name=":10"/> While [[OpenAI o1|OpenAI's o1]] maintained or slightly improved its accuracy from reported 2024{{Source?|date=July 2025}} metrics to 2025 AIME results, o3-mini (high) achieved a higher accuracy (80%) at a significantly lower cost (approximately 12 times cheaper).<ref name=":4">{{cite web |date=2025-01-31 |title=OpenAI o3-mini |url=https://openai.com/index/openai-o3-mini/ |access-date=2025-02-09 |website=OpenAI |language=en-US}}</ref>
 
=== o3-mini performance ===
According to OpenAI's January 2025 report on o3-mini, adjustable "reasoning effort" significantly affects performance, particularly on [[STEM]] tasks. Increasing reasoning effort from low to high boosts accuracy on benchmarks like AIME 2024, GPQA Diamond, and [[Codeforces]], typically yielding gains of 10–30%. With high reasoning effort, o3-mini (high) achieved 87.3% on AIME (differing from the MathArena AIME benchmark results), 79.7% on GPQA Diamond, 2130 Elo on Codeforces, and 49.3 on SWE-bench Verified.<ref name=":4"/>
 
== Drawbacks ==
=== [[Hugging Face]] ===
 
* OlympicCoder-7B and OlympicCoder-32B, released as part of Open R1, an open reproduction of the R1 training pipeline.<ref>{{cite web |title=Open-R1: a fully open reproduction of DeepSeek-R1 |url=https://huggingface.co/blog/open-r1 |website=Hugging Face |date=2025-02-24 |access-date=2025-07-26}}</ref><ref>{{cite web |title=OlympicCoder-7B |url=https://huggingface.co/open-r1/OlympicCoder-7B |website=Hugging Face |date=2025-03-11 |access-date=2025-07-26}}</ref>
 
== See also ==