{{Short description|Language models designed for reasoning tasks}}{{Merge to|Reflection (artificial intelligence)|date=April 2025}}{{unreliable sources|date=January 2025}}
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that have been further trained to solve tasks that require multiple steps of reasoning.
== Supervised finetuning ==
A [[large language model]] (LLM) can be finetuned on a dataset of reasoning tasks with example solutions and reasoning traces. The finetuned model can then generate reasoning traces of its own when given new problems.
As it is expensive to have humans write reasoning traces for an SFT dataset, researchers have proposed ways to construct such datasets automatically. In rejection sampling finetuning (RFT), new reasoning traces are collected via a loop:<ref>{{Citation |last1=Yuan |first1=Zheng |title=Scaling Relationship on Learning Mathematical Reasoning with Large Language Models |date=2023-09-13 |arxiv=2308.01825 |last2=Yuan |first2=Hongyi |last3=Li |first3=Chengpeng |last4=Dong |first4=Guanting |last5=Lu |first5=Keming |last6=Tan |first6=Chuanqi |last7=Zhou |first7=Chang |last8=Zhou |first8=Jingren}}</ref>
# Sample a task prompt.
# Generate many candidate reasoning traces for the prompt.
# Use a verifier to discard traces whose final answer is wrong, and remove duplicate traces; the remaining traces are added to the SFT dataset.
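A minimal sketch of this loop in Python is shown below; the callables <code>sample_prompt</code>, <code>generate_traces</code>, and <code>extract_answer</code> stand in for a task sampler, the model being finetuned, and an answer verifier, and are assumptions for illustration rather than part of any particular library.

<syntaxhighlight lang="python">
from typing import Callable

def collect_rft_dataset(
    sample_prompt: Callable[[], tuple[str, str]],      # returns (prompt, reference answer)
    generate_traces: Callable[[str, int], list[str]],  # samples n reasoning traces from the model
    extract_answer: Callable[[str], str],              # parses the final answer out of a trace
    num_prompts: int = 1000,
    samples_per_prompt: int = 16,
) -> list[dict]:
    """Collect SFT data by rejection sampling: keep only traces whose
    final answer matches the reference, with duplicates removed."""
    dataset = []
    for _ in range(num_prompts):
        prompt, reference = sample_prompt()                   # 1. sample a task prompt
        traces = generate_traces(prompt, samples_per_prompt)  # 2. sample many traces
        kept = set()
        for trace in traces:
            # 3. verifier step: reject wrong final answers and duplicate traces
            if extract_answer(trace) == reference and trace not in kept:
                kept.add(trace)
                dataset.append({"prompt": prompt, "completion": trace})
    return dataset
</syntaxhighlight>

The resulting prompt–completion pairs are then used for a further round of supervised finetuning.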
== Reinforcement learning ==

In the [[reinforcement learning]] formalism, the language model is treated as a policy <math>\pi</math>: a task prompt <math>x</math> is a state, and the model's response <math>y</math> is an action generated with probability <math>\pi(y \mid x)</math>. For reasoning language models, the response <math>y</math> may be broken down into multiple steps, in which case it is written as <math>y_1, y_2, \dots, y_n</math>.
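Since the response is generated step by step, the probability of the full response factorises over these steps by the chain rule (writing <math>y_{<i}</math> for the steps preceding <math>y_i</math>):

<math display="block">\pi(y \mid x) = \prod_{i=1}^{n} \pi\left(y_i \mid x,\, y_{<i}\right).</math>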
Most recent systems use policy-gradient methods such as [[Proximal Policy Optimization]] (PPO) because PPO constrains each policy update with a clipped objective, which stabilises training for very large policies.<ref name="OpenAIAlign2022">{{cite web |title=Aligning language models to follow instructions |website=OpenAI Blog |url=https://openai.com/blog/instruction-following/ |date=2022-01-27 |access-date=2025-05-04}}</ref>
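Written in the notation above, with each step <math>y_i</math> treated as an action and <math>\hat{A}_i</math> denoting an estimate of its advantage, the core clipped surrogate objective of PPO takes the form (a sketch of the main term only; practical implementations typically add further terms, such as a KL penalty toward the initial model):

<math display="block">L^{\mathrm{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\left(r_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_i\right)\right], \qquad r_i(\theta) = \frac{\pi_\theta(y_i \mid x,\, y_{<i})}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x,\, y_{<i})},</math>

where <math>\pi_{\theta_{\mathrm{old}}}</math> is the policy before the update and <math>\varepsilon</math> is a small clipping constant (0.2 in the original PPO paper). Updates that would push the ratio <math>r_i(\theta)</math> beyond <math>[1-\varepsilon,\, 1+\varepsilon]</math> in the direction favoured by the advantage receive no additional gradient, so each update stays close to the previous policy, which is the stabilising effect described above.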
=== Outcome reward model ===