{{Short description|Language models designed for reasoning tasks}}
{{Copy edit|for=jargon|date=May 2025}}
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that have been further trained to solve multi-step [[reasoning]] tasks.<ref>{{cite arXiv |last1=Besta |first1=Maciej |last2=Barth |first2=Julia |last3=Schreiber |first3=Eric |last4=Kubicek |first4=Ales |last5=Catarino |first5=Afonso |last6=Gerstenberger |first6=Robert |last7=Nyczyk |first7=Piotr |last8=Iff |first8=Patrick |last9=Li |first9=Yueling |title=Reasoning Language Models: A Blueprint |date=2025-01-23 |eprint=2501.11223 |class=cs.CL}}</ref> These models perform better on logical, mathematical or programmatic tasks than traditional autoregressive LLMs, have the ability to [[Backtracking|backtrack]], and employ test-time compute as an additional [[Neural scaling law|scaling axis]] beyond [[Training, validation, and test data sets|training examples]], parameter count, and train-time compute.
== History ==
=== 2024 ===
In September 2024, [[OpenAI]] released [[OpenAI o1#release|o1-preview]], an LLM with enhanced reasoning.<ref>{{
The development of reasoning LLMs has illustrated what [[Richard S. Sutton|Rich Sutton]] termed the "bitter lesson": that general methods leveraging computation often outperform those relying on specific human insights.<ref>{{
[[Alibaba Group|Alibaba]] also released reasoning versions of its [[Qwen]] LLMs in November 2024.<ref>{{cite web |title=QwQ-32B-Preview: Reflect Deeply on the Boundaries of the Unknown |url=https://qwenlm.github.io/blog/qwq-32b-preview/ |website=Qwen (Alibaba Cloud) |date=2024-11-28 |access-date=2025-07-26}}</ref>
In December 2024, the Qwen team introduced QvQ-72B-Preview, an experimental visual reasoning model.<ref>{{cite web |title=QVQ: To See the World with Wisdom |url=https://qwenlm.github.io/blog/qvq-72b-preview/ |website=Qwen |publisher=Alibaba Cloud |date=2024-12-25 |access-date=2025-07-26}}</ref>
In December 2024, Google introduced [[Gemini Deep Research|Deep Research]] in [[Gemini (chatbot)|Gemini]],<ref>{{
On December 16, 2024, an experiment using a [[Llama (language model)|Llama]] 3B model demonstrated that by scaling test-time compute, a relatively small model could outperform a much larger Llama 70B model on challenging reasoning tasks. This result highlighted that improved inference strategies can unlock latent reasoning capabilities even in compact models.<ref>{{
=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model competitive with o1 at lower cost, highlighting the effectiveness of [[Group Relative Policy Optimization]] (GRPO).<ref>{{
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]],<ref name=":5">{{
== Supervised finetuning ==
A [[large language model]] (LLM) can be finetuned on a dataset of reasoning tasks with example solutions and reasoning traces. The finetuned model can then produce its own reasoning traces for new problems.<ref name=":0">{{
As it is expensive to get humans to write reasoning traces for an SFT dataset, researchers have proposed ways to automatically construct SFT datasets. In rejection sampling finetuning (RFT), new reasoning traces are collected via the following loop (a code sketch follows the list):<ref>{{Citation |last1=Yuan |first1=Zheng |title=Scaling Relationship on Learning Mathematical Reasoning with Large Language Models |date=2023-09-13 |arxiv=2308.01825 |last2=Yuan |first2=Hongyi |last3=Li |first3=Chengpeng |last4=Dong |first4=Guanting |last5=Lu |first5=Keming |last6=Tan |first6=Chuanqi |last7=Zhou |first7=Chang |last8=Zhou |first8=Jingren}}</ref>
# Sample a task prompt.
# Generate many reasoning traces for the prompt.
# Filter the generated traces, keeping only those whose final answer is correct (optionally removing duplicates), and add them to the SFT dataset.
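A minimal Python sketch of this loop is shown below. The helper callables <code>generate_trace</code> and <code>is_correct</code> are placeholders (assumptions for illustration, not from the cited paper) standing in for the language model sampler and the answer verifier.

<syntaxhighlight lang="python">
def rejection_sampling_finetuning_data(tasks, generate_trace, is_correct,
                                       samples_per_prompt=16):
    """Collect an SFT dataset of reasoning traces by rejection sampling.

    `tasks` is an iterable of (prompt, ground_truth) pairs; `generate_trace`
    samples one reasoning trace ending in a final answer from the base model;
    `is_correct` compares a trace's final answer with the ground truth.
    """
    dataset = []
    for prompt, ground_truth in tasks:
        traces = [generate_trace(prompt) for _ in range(samples_per_prompt)]
        # Rejection step: keep only traces with a correct final answer,
        # removing exact duplicates.
        kept = {t for t in traces if is_correct(t, ground_truth)}
        dataset.extend((prompt, t) for t in kept)
    return dataset
</syntaxhighlight>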
An outcome reward model, or outcome-supervised RM (ORM),<ref name=":0" /> is a reward model that computes the reward of a step <math>r(x, y_1, \dots, y_i)</math> based only on the final answer: <math>r(x, y_1, \dots, y_i) = r(x, y_n)</math>. Such models are also called "verifiers".
For tasks with an answer that is easy to verify, such as [[Word problem (mathematics education)|word problems in math]], the outcome reward can simply be binary: 1 if the final answer is correct, and 0 otherwise.<ref name=":0" /> If the answer is not easy to verify programmatically, humans can manually label the answers as correct or not, then the labels can be used to finetune a base model that predicts the human label.<ref name=":2">{{
The ORM is usually trained via [[logistic regression]], i.e. minimizing [[Cross-entropy|cross-entropy loss]].<ref name=":3" />
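As an illustration, one training step of such an ORM under the binary-label setup might look like the following sketch; the <code>reward_model</code> interface is an assumption. Minimizing the binary cross-entropy between the predicted logit and the 0/1 correctness label is exactly logistic regression.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def orm_training_step(reward_model, optimizer, batch_inputs, labels):
    """One gradient step for an outcome reward model (ORM).

    `reward_model` is assumed to map a batch of (prompt, trace) pairs to one
    logit per example; `labels` holds 1.0 for a correct final answer and 0.0
    otherwise.  Minimizing the cross-entropy loss is logistic regression.
    """
    logits = reward_model(batch_inputs)                 # shape: (batch,)
    loss = F.binary_cross_entropy_with_logits(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</syntaxhighlight>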
Given a PRM, an ORM can be constructed by multiplying together the process rewards along the reasoning trace,<ref name=":1" /> by taking the minimum,<ref name=":3" /> or by some other method of aggregating the process rewards. DeepSeek used a simple ORM to train the [[DeepSeek (chatbot)|R1 model]].<ref
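For example, if a PRM assigns each step a probability of being correct, the per-step rewards can be aggregated into a single outcome-level score roughly as in the sketch below; the choice of aggregation method is the only substantive decision.

<syntaxhighlight lang="python">
import math

def outcome_reward_from_process_rewards(step_rewards, method="product"):
    """Aggregate per-step process rewards into a single outcome reward.

    `step_rewards` is a list with one reward per reasoning step, for example
    the per-step correctness probabilities predicted by a PRM.
    """
    if method == "product":
        return math.prod(step_rewards)   # multiply rewards along the trace
    if method == "min":
        return min(step_rewards)         # reward of the weakest step
    raise ValueError(f"unknown aggregation method: {method}")
</syntaxhighlight>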
=== Process reward model ===
Given a partial thinking trace <math>x, y_1, \dots, y_m</math>, a human can be queried as to whether the steps ''so far'' are correct, regardless of whether the ultimate answer would be correct. This can then be used as a binary reward signal. As human labels are expensive, a base model can then be finetuned to predict the human labels.<ref name=":0" /> The PRM is usually trained by [[logistic regression]] on the human labels, i.e. by minimizing the [[Cross-entropy|cross-entropy loss]] between the true labels and the predicted labels.<ref name=":3" />
As an example, in a 2023 OpenAI paper, 800K process labels were collected for 75K solution traces. A labeler was presented with a solution trace and labeled each step "positive" if it progressed towards the solution, "neutral" if it was not wrong but did not progress towards the solution, and "negative" if it was a mistake. As soon as a "negative" label was entered, the labeler stopped labeling that trace and began labeling another one. The idea was that, while labeling subsequent reasoning steps could provide even richer supervision signals, labeling up to the first error was sufficient for training a competent PRM.<ref name=":1" /><ref>{{
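A sketch of how such step-level labels might be converted into binary training examples for a PRM is given below; the data layout is an assumption for illustration, not the format used in the cited paper.

<syntaxhighlight lang="python">
def prm_examples_from_labeled_trace(problem, steps, labels):
    """Turn a solution trace with per-step human labels into PRM examples.

    `steps` lists the reasoning steps and `labels` the corresponding labels
    ("positive", "neutral" or "negative"); labeling is assumed to stop at the
    first "negative".  Each example pairs the prefix up to step i with a
    binary target: 1 if the step is not a mistake, 0 if it is.
    """
    examples = []
    for i, label in enumerate(labels):
        prefix = steps[:i + 1]
        examples.append(((problem, prefix), 0 if label == "negative" else 1))
        if label == "negative":
            break                        # later steps were left unlabeled
    return examples
</syntaxhighlight>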
As human labels are expensive, researchers have proposed methods to create PRMs without human labels on the processes. Inspired by [[Monte Carlo tree search]] (MCTS), the Math-Shepherd method samples multiple continuations until the end, starting at each reasoning step <math>y_i</math>, and sets the reward at that step to be either <math>\frac{\#\text{(correct answers)}}{\#\text{(total answers)}}</math> in the case of "soft estimation", or <math>\begin{cases}
1 & \text{if one of the answers is correct}\\
0 & \text{else}
\end{cases}</math> in the case of "hard estimation". This creates process reward using only an ORM, which is usually easier or cheaper to construct. After creating these process reward labels, a PRM can be trained on them.<ref name=":3">{{
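The following sketch illustrates the soft and hard estimation rules; <code>rollout</code> and <code>is_correct</code> are placeholders for the policy model and the outcome verifier, and the rollout count is an arbitrary choice.

<syntaxhighlight lang="python">
def estimate_process_reward(prefix, rollout, is_correct, n_rollouts=8,
                            mode="soft"):
    """Estimate the process reward of a partial reasoning trace `prefix`.

    Samples `n_rollouts` completions of the prefix with the policy
    (`rollout`) and checks each final answer with an outcome verifier
    (`is_correct`), as in Math-Shepherd-style estimation.
    """
    completions = [rollout(prefix) for _ in range(n_rollouts)]
    n_correct = sum(1 for c in completions if is_correct(c))
    if mode == "soft":
        return n_correct / n_rollouts          # fraction of correct answers
    return 1.0 if n_correct > 0 else 0.0       # hard estimation
</syntaxhighlight>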
One can also use an ORM to implicitly construct a PRM, similar to [[direct preference optimization]].<ref>{{
=== Guided sampling ===
A trained ORM can be used to select the best response. The policy rolls out multiple candidate responses, and the ORM selects the one with the highest reward. This allows a simple form of [[Neural scaling law|test-time compute scaling]] ("best-of-N").<ref name=":2" /><ref>{{
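A best-of-N selection loop can be sketched as follows, assuming placeholder <code>generate</code> and <code>orm</code> callables for the policy and the outcome reward model.

<syntaxhighlight lang="python">
def best_of_n(prompt, generate, orm, n=16):
    """Best-of-N sampling: draw `n` candidate responses from the policy
    (`generate`) and return the one the outcome reward model (`orm`)
    scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: orm(prompt, response))
</syntaxhighlight>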
A trained PRM can also be used to guide reasoning by greedy [[Tree traversal|tree search]]: the policy model generates several candidate next reasoning steps, the PRM selects the best one, and the process repeats. This is analogous to using a trained ORM to select the best complete response.<ref>{{
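A sketch of this greedy, PRM-guided step selection is shown below; <code>propose_steps</code>, <code>prm</code>, and <code>is_finished</code> are placeholder callables for the policy, the process reward model, and a stopping test.

<syntaxhighlight lang="python">
def greedy_prm_search(prompt, propose_steps, prm, is_finished, max_steps=32):
    """Greedy step-level search guided by a process reward model.

    At each iteration the policy proposes candidate next steps
    (`propose_steps`), the PRM scores each extended partial trace, and the
    highest-scoring step is appended; the loop stops when `is_finished`
    reports a complete solution or `max_steps` is reached.
    """
    trace = []
    for _ in range(max_steps):
        candidates = propose_steps(prompt, trace)
        best_step = max(candidates, key=lambda s: prm(prompt, trace + [s]))
        trace.append(best_step)
        if is_finished(trace):
            break
    return trace
</syntaxhighlight>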
Lookahead search is another tree search method, in which the policy model generates several possible next reasoning steps and then makes a (partial) rollout for each. If a solution endpoint is reached during the forward simulation, the process halts early. Otherwise, the PRM is used to calculate the total reward for each rollout, and the step whose rollout receives the highest reward is selected.<ref
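The selection of a single step by lookahead search can be sketched as follows, under the same placeholder interfaces; the rollout depth and early-halt behaviour follow the description above.

<syntaxhighlight lang="python">
def lookahead_search_step(prompt, trace, propose_steps, rollout, prm,
                          is_solution, depth=4):
    """Pick the next reasoning step by lookahead search.

    Each candidate step is extended with a partial rollout of `depth`
    further steps.  If a rollout reaches a solution the search halts early
    and that step is returned; otherwise the PRM scores each rollout and
    the step with the best-scoring rollout is chosen.
    """
    best_step, best_score = None, float("-inf")
    for step in propose_steps(prompt, trace):
        simulated = rollout(prompt, trace + [step], depth)
        if is_solution(simulated):
            return step                          # early halt on a solution
        score = prm(prompt, simulated)
        if score > best_score:
            best_step, best_score = step, score
    return best_step
</syntaxhighlight>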
Self-consistency can be combined with an ORM. The model generates multiple answers, and the answers are clustered so that each cluster contains responses with the same final answer. The ORM computes the reward for each response, the rewards within each cluster are summed, and the answer corresponding to the cluster with the highest summed reward is output.<ref name=":3" />
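This ORM-weighted variant of self-consistency can be sketched as follows; <code>generate</code>, <code>orm</code>, and <code>extract_answer</code> are placeholder callables for the policy, the outcome reward model, and an answer parser.

<syntaxhighlight lang="python">
from collections import defaultdict

def self_consistency_with_orm(prompt, generate, orm, extract_answer, n=16):
    """ORM-weighted self-consistency.

    Sampled responses are clustered by their final answer
    (`extract_answer`), each cluster is weighted by the sum of the ORM
    rewards of its members, and the answer of the heaviest cluster wins.
    """
    cluster_scores = defaultdict(float)
    for _ in range(n):
        response = generate(prompt)
        cluster_scores[extract_answer(response)] += orm(prompt, response)
    return max(cluster_scores, key=cluster_scores.get)
</syntaxhighlight>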
Reasoning models generally outperform non-reasoning models in most benchmarks, especially on tasks requiring multi-step reasoning.
However, some benchmarks exclude reflective models due to longer response times.<ref>{{cite journal |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |title=Toward Foundation Models for Online Complex Event Detection in CPS‑IoT: A Case Study |journal=Proceedings of the 26th International Conference on Information Processing in Sensor Networks (IPSN ’25) |publisher=ACM |date=2025 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low‑latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? A Research Note |date=2025-02-13 |arxiv=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |arxiv=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |arxiv=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>
=== Humanity's Last Exam ===
The [[Humanity's Last Exam|HLE]], a rigorous benchmark designed to assess expert-level reasoning across mathematics, humanities, and the natural sciences, reveals substantial performance gaps among models. State-of-the-art reasoning models have demonstrated low accuracy on HLE, highlighting significant room for improvement. In particular, the full reasoning model [[OpenAI o3|o3]] achieved an accuracy of 26.6%,<ref
=== AIME ===
The [[American Invitational Mathematics Examination]] (AIME) benchmark, a challenging mathematics competition, demonstrates significant performance differences between model types. Non-reasoning models typically solve less than 30% of AIME problems, whereas models employing reasoning techniques score between 50% and 80%.<ref
=== o3-mini performance ===
According to OpenAI's January 2025 report on o3-mini, the adjustable "reasoning effort" setting significantly affects performance, particularly on [[STEM]] tasks. Increasing reasoning effort from low to high boosts accuracy on benchmarks such as AIME 2024, GPQA Diamond, and [[Codeforces]], typically yielding gains of 10–30%. With high reasoning effort, o3-mini (high) achieved 87.3% on AIME (different from the MathArena AIME benchmark results), 79.7% on GPQA Diamond, 2130 Elo on Codeforces, and 49.3 on SWE-bench Verified.<ref
== Drawbacks ==
=== [[Hugging Face]] ===
* OlympicCoder-7B and OlympicCoder-32B, released as part of the Open R1 project, an open reproduction of the R1 training pipeline.<ref>{{
== See also ==