Reasoning language model

{{Short description|Language models designed for reasoning tasks}}{{Merge to|Reflection (artificial intelligence)|date=April 2025}}{{unreliable sources|date=January 2025}}
 
'''Reasoning language models''' ('''RLMs''') are [[large language model]]s that have been further trained to solve multi-step reasoning tasks.<ref>{{cite arXiv |title=Reasoning Language Models: A Blueprint |last=Besta |first=Maciej |date=2025-01-23 |eprint=2501.11223 |class=cs.CL}}</ref> These models perform better on logical, mathematical or programmatic tasks than traditional autoregressive LLMs, have the ability to [[Backtracking|backtrack]], and employ test-time compute as an additional [[Neural scaling law|scaling axis]] beyond [[Training, validation, and test data sets|training examples]], parameter count, and train-time compute.
 
 
== Supervised finetuning ==
A [[large language model]] (LLM) can be finetuned on a dataset of reasoning tasks with example solutions and reasoning traces. The fine-tuned model can then produce its own reasoning traces for new problems.<ref name=":0">{{Citation |last1=Uesato |first1=Jonathan |title=Solving math word problems with process- and outcome-based feedback |date=2022-11-25 |arxiv=2211.14275 |last2=Kushman |first2=Nate |last3=Kumar |first3=Ramana |last4=Song |first4=Francis |last5=Siegel |first5=Noah |last6=Wang |first6=Lisa |last7=Creswell |first7=Antonia |last8=Irving |first8=Geoffrey |last9=Higgins |first9=Irina}}</ref><ref name=":2" />
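
The following is a minimal illustration of such finetuning, using the Hugging Face <code>transformers</code> library; the model name and the toy dataset of (problem, reasoning trace, answer) triples are placeholders rather than details taken from the cited works.

<syntaxhighlight lang="python">
# Sketch: supervised finetuning of a causal language model on reasoning traces.
# The model choice and dataset contents are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Each example pairs a problem with a worked reasoning trace and a final answer.
sft_dataset = [
    {"problem": "What is 17 * 24?",
     "trace": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
     "answer": "408"},
]

model.train()
for example in sft_dataset:
    text = f"{example['problem']}\n{example['trace']}\nAnswer: {example['answer']}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # Standard causal-LM loss: the labels are the input tokens themselves, so the
    # model learns to reproduce the reasoning trace token by token.
    outputs = model(**inputs, labels=inputs["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
</syntaxhighlight>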
 
As it is expensive to have humans write reasoning traces for a supervised finetuning (SFT) dataset, researchers have proposed ways to construct SFT datasets automatically. In rejection sampling finetuning (RFT), new reasoning traces are collected via a loop:<ref>{{Citation |last1=Yuan |first1=Zheng |title=Scaling Relationship on Learning Mathematical Reasoning with Large Language Models |date=2023-09-13 |arxiv=2308.01825 |last2=Yuan |first2=Hongyi |last3=Li |first3=Chengpeng |last4=Dong |first4=Guanting |last5=Lu |first5=Keming |last6=Tan |first6=Chuanqi |last7=Zhou |first7=Chang |last8=Zhou |first8=Jingren}}</ref>
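roughly, each round samples several candidate reasoning traces per problem from the current model, keeps only the traces whose final answer is correct (discarding duplicates), and then finetunes the model on the retained traces before the next round begins. The sketch below illustrates one such round; the toy problem, the substring check standing in for an answer verifier, and the sampling settings are simplifying assumptions rather than details from the cited paper.

<syntaxhighlight lang="python">
# Sketch of one round of rejection sampling finetuning (RFT): sample candidate
# traces, reject those with a wrong final answer, and collect the rest for SFT.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Toy placeholder problems with known reference answers.
problems = [{"question": "Q: What is 3 + 5? Think step by step.", "answer": "8"}]

new_traces = []
for problem in problems:
    inputs = tokenizer(problem["question"] + "\n", return_tensors="pt")
    # Sample several candidate reasoning traces with temperature sampling.
    samples = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        max_new_tokens=64,
        num_return_sequences=8,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    for sample in samples:
        trace = tokenizer.decode(sample[prompt_len:], skip_special_tokens=True)
        # Rejection step: keep a trace only if the reference answer appears in it
        # (a crude verifier) and it is not a duplicate of an already-kept trace.
        if problem["answer"] in trace and all(t["trace"] != trace for t in new_traces):
            new_traces.append({"question": problem["question"], "trace": trace})

# new_traces can now be appended to the SFT dataset and the model finetuned on
# them (as in the sketch above), after which the loop repeats.
</syntaxhighlight>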

== Reinforcement learning ==
 
For reasoning language models, the model's response <math>y</math> may be broken down into multiple steps, in which case it is written as <math>y_1, y_2, \dots, y_n</math>.
 
Most recent systems use policy-gradient methods such as [[Proximal Policy Optimization]] (PPO) because PPO constrains each policy update with a clipped objective, which stabilises training for very large policies.<ref name="OpenAIAlign2022">{{cite web |title=Aligning language models to follow instructions |website=OpenAI Blog |url=https://openai.com/blog/instruction-following/ |date=2022-01-27 |access-date=2025-05-04}}</ref>
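
Concretely, in the standard formulation of PPO, writing <math>r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)</math> for the probability ratio between the new and old policies and <math>\hat{A}_t</math> for an estimate of the advantage at step <math>t</math>, the policy is updated to maximise the clipped surrogate objective

<math display="block">L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t\right)\right],</math>

so that updates which would push the ratio outside <math>[1-\varepsilon, 1+\varepsilon]</math> receive no additional gradient signal. In the language-model setting, the state is typically the prompt together with the tokens generated so far, and the action is the next token or reasoning step.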
 
=== Outcome reward model ===