Most recent systems use policy-gradient methods such as [[Proximal Policy Optimization]] (PPO) because PPO constrains each policy update with a clipped objective, which stabilises training for very large policies.<ref name="OpenAIAlign2022">{{cite web |title=Aligning language models to follow instructions |website=OpenAI Blog |url=https://openai.com/blog/instruction-following/ |date=2022-01-27 |access-date=2025-05-04}}</ref>
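In one standard formulation (notation as in the original PPO paper: <math>r_t(\theta)</math> is the probability ratio between the new and old policies, <math>\hat{A}_t</math> the estimated advantage, and <math>\epsilon</math> the clipping range), the clipped surrogate objective is
<math display="block">L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],</math>
so that updates moving the policy ratio outside <math>[1-\epsilon,\, 1+\epsilon]</math> receive no additional gradient signal.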
=== Outcome reward model ===
{{Anchor|Outcome Reward Model|ORM}}
Given a PRM, an ORM can be constructed by taking the product of the process rewards along the reasoning trace,<ref name=":1" /> by taking their minimum,<ref name=":3" /> or by some other aggregation of the process rewards. DeepSeek used a simple ORM for training the [[DeepSeek (chatbot)|R1 model]].<ref>{{Citation |last1=DeepSeek-AI |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |arxiv=2501.12948 |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong}}</ref>
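As an illustrative sketch only (the function and argument names below are hypothetical and do not reflect any particular system's implementation), the product and minimum aggregations can be written as:

<syntaxhighlight lang="python">
def orm_from_prm(process_rewards, method="product"):
    """Aggregate per-step process rewards into a single outcome-style score.

    process_rewards: list of floats, one score per reasoning step,
    as produced by a process reward model (PRM).
    """
    if not process_rewards:
        raise ValueError("need at least one process reward")
    if method == "product":
        score = 1.0
        for r in process_rewards:
            score *= r  # product over the whole reasoning trace
        return score
    if method == "min":
        return min(process_rewards)  # weakest step determines the score
    raise ValueError(f"unknown aggregation method: {method}")
</syntaxhighlight>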
=== Process reward model ===
{{Anchor|Process Reward Model|PRM}}