Most recent systems use policy-gradient methods such as [[Proximal Policy Optimization]] (PPO) because PPO constrains each policy update with a clipped objective, which stabilises training for very large policies.<ref name="OpenAIAlign2022">{{cite web |title=Aligning language models to follow instructions |website=OpenAI Blog |url=https://openai.com/blog/instruction-following/ |date=2022-01-27 |access-date=2025-05-04}}</ref>
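In one standard formulation (notation as in the original PPO paper: <math>r_t(\theta)</math> is the probability ratio between the new and old policies, <math>\hat{A}_t</math> the estimated advantage, and <math>\epsilon</math> the clipping range), the clipped surrogate objective is
<math display="block">L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],</math>
so that updates moving the policy ratio outside <math>[1-\epsilon,\, 1+\epsilon]</math> receive no additional gradient signal.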
=== Outcome reward model ===
{{Anchor|Outcome Reward Model|ORM}}
Given a PRM, an ORM can be constructed by taking the product of the process rewards along the reasoning trace,<ref name=":1" /> by taking their minimum,<ref name=":3" /> or by some other aggregation of the process rewards. DeepSeek used a simple ORM for training the [[DeepSeek (chatbot)|R1 model]].<ref>{{Citation |last1=DeepSeek-AI |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |arxiv=2501.12948 |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong}}</ref>
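As an illustrative sketch only (the function and argument names below are hypothetical and do not reflect any particular system's implementation), the product and minimum aggregations can be written as:

<syntaxhighlight lang="python">
def orm_from_prm(process_rewards, method="product"):
    """Aggregate per-step process rewards into a single outcome-style score.

    process_rewards: list of floats, one score per reasoning step,
    as produced by a process reward model (PRM).
    """
    if not process_rewards:
        raise ValueError("need at least one process reward")
    if method == "product":
        score = 1.0
        for r in process_rewards:
            score *= r  # product over the whole reasoning trace
        return score
    if method == "min":
        return min(process_rewards)  # weakest step determines the score
    raise ValueError(f"unknown aggregation method: {method}")
</syntaxhighlight>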
=== Process reward model ===
{{Anchor|Process Reward Model|PRM}}