Reasoning language model

{{Short description|Language models designed for reasoning tasks}}{{Merge to|Reflection (artificial intelligence)|date=April 2025}}{{unreliable sources|date=January 2025}}
 
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that have been further trained to solve multi-step reasoning tasks.<ref>{{cite arXiv |title=Reasoning Language Models: A Blueprint |last=Besta |first=Maciej |date=2025-01-23 |eprint=2501.11223 |class=cs.CL}}</ref> These models perform better on logical, mathematical or programmatic tasks than traditional autoregressive LLMs, have the ability to [[Backtracking|backtrack]], and employ test-time compute as an additional [[Neural scaling law|scaling axis]] beyond [[Training, validation, and test data sets|training examples]], parameter count, and train-time compute.
 
 
== Supervised finetuning ==
A [[large language model]] (LLM) can be finetuned on a dataset of reasoning tasks with example solutions and reasoning traces. The fine-tuned model can then produce its own reasoning traces for new problems.<ref name=":0">{{Citation |last1=Uesato |first1=Jonathan |title=Solving math word problems with process- and outcome-based feedback |date=2022-11-25 |arxiv=2211.14275 |last2=Kushman |first2=Nate |last3=Kumar |first3=Ramana |last4=Song |first4=Francis |last5=Siegel |first5=Noah |last6=Wang |first6=Lisa |last7=Creswell |first7=Antonia |last8=Irving |first8=Geoffrey |last9=Higgins |first9=Irina}}</ref><ref name=":2" />
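
The following is a minimal illustration of such finetuning, using the Hugging Face <code>transformers</code> library; the model name and the toy dataset of (problem, reasoning trace, answer) triples are placeholders rather than details taken from the cited works.

<syntaxhighlight lang="python">
# Sketch: supervised finetuning of a causal language model on reasoning traces.
# The model choice and dataset contents are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Each example pairs a problem with a worked reasoning trace and a final answer.
sft_dataset = [
    {"problem": "What is 17 * 24?",
     "trace": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
     "answer": "408"},
]

model.train()
for example in sft_dataset:
    text = f"{example['problem']}\n{example['trace']}\nAnswer: {example['answer']}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # Standard causal-LM loss: the labels are the input tokens themselves, so the
    # model learns to reproduce the reasoning trace token by token.
    outputs = model(**inputs, labels=inputs["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
</syntaxhighlight>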
 
As it is expensive to have humans write reasoning traces for a supervised finetuning (SFT) dataset, researchers have proposed ways to construct SFT datasets automatically. In rejection sampling finetuning (RFT), new reasoning traces are collected via a loop:<ref>{{Citation |last1=Yuan |first1=Zheng |title=Scaling Relationship on Learning Mathematical Reasoning with Large Language Models |date=2023-09-13 |arxiv=2308.01825 |last2=Yuan |first2=Hongyi |last3=Li |first3=Chengpeng |last4=Dong |first4=Guanting |last5=Lu |first5=Keming |last6=Tan |first6=Chuanqi |last7=Zhou |first7=Chang |last8=Zhou |first8=Jingren}}</ref>
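roughly, each round samples several candidate reasoning traces per problem from the current model, keeps only the traces whose final answer is correct (discarding duplicates), and then finetunes the model on the retained traces before the next round begins. The sketch below illustrates one such round; the toy problem, the substring check standing in for an answer verifier, and the sampling settings are simplifying assumptions rather than details from the cited paper.

<syntaxhighlight lang="python">
# Sketch of one round of rejection sampling finetuning (RFT): sample candidate
# traces, reject those with a wrong final answer, and collect the rest for SFT.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Toy placeholder problems with known reference answers.
problems = [{"question": "Q: What is 3 + 5? Think step by step.", "answer": "8"}]

new_traces = []
for problem in problems:
    inputs = tokenizer(problem["question"] + "\n", return_tensors="pt")
    # Sample several candidate reasoning traces with temperature sampling.
    samples = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        max_new_tokens=64,
        num_return_sequences=8,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    for sample in samples:
        trace = tokenizer.decode(sample[prompt_len:], skip_special_tokens=True)
        # Rejection step: keep a trace only if the reference answer appears in it
        # (a crude verifier) and it is not a duplicate of an already-kept trace.
        if problem["answer"] in trace and all(t["trace"] != trace for t in new_traces):
            new_traces.append({"question": problem["question"], "trace": trace})

# new_traces can now be appended to the SFT dataset and the model finetuned on
# them (as in the sketch above), after which the loop repeats.
</syntaxhighlight>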

== Reinforcement learning ==
 
For reasoning language models, the model's response <math>y</math> may be broken down into multiple steps, in which case it is written as <math>y_1, y_2, \dots, y_n</math>.
 
Most recent systems use policy-gradient methods such as [[Proximal Policy Optimization]] (PPO) because PPO constrains each policy update with a clipped objective, which stabilises training for very large policies.<ref name="OpenAIAlign2022">{{cite web |title=Aligning language models to follow instructions |website=OpenAI Blog |url=https://openai.com/blog/instruction-following/ |date=2022-01-27 |access-date=2025-05-04}}</ref>
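
Concretely, in the standard formulation of PPO, writing <math>r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)</math> for the probability ratio between the new and old policies and <math>\hat{A}_t</math> for an estimate of the advantage at step <math>t</math>, the policy is updated to maximise the clipped surrogate objective

<math display="block">L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t\right)\right],</math>

so that updates which would push the ratio outside <math>[1-\varepsilon, 1+\varepsilon]</math> receive no additional gradient signal. In the language-model setting, the state is typically the prompt together with the tokens generated so far, and the action is the next token or reasoning step.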
 
=== Outcome reward model ===