The ORM is usually trained via [[logistic regression]], i.e. minimizing [[Cross-entropy|cross-entropy loss]].<ref name=":3" />
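For example, if the ORM assigns a scalar score <math>r_\theta(q, y)</math> to a question <math>q</math> and candidate response <math>y</math>, and <math>z \in \{0, 1\}</math> indicates whether the final answer is correct, the training objective takes the standard logistic-regression form
<math display="block">\mathcal{L}(\theta) = -\Big[\, z \log \sigma\big(r_\theta(q, y)\big) + (1 - z) \log\big(1 - \sigma(r_\theta(q, y))\big) \Big],</math>
where <math>\sigma</math> is the logistic (sigmoid) function.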
Given a PRM, an ORM can be constructed by multiplying together the process rewards along the reasoning trace,<ref name=":1" /> by taking their minimum,<ref name=":3" /> or by aggregating the process rewards in some other way. DeepSeek used a simple ORM to train the [[DeepSeek (chatbot)|R1 model]].<ref>{{Citation |last1=DeepSeek-AI |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |arxiv=2501.12948 |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong}}</ref>
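Written out, if a PRM assigns step rewards <math>r_1, \dots, r_T</math> to the <math>T</math> steps of a reasoning trace, the product and minimum aggregations correspond to
<math display="block">R_{\text{out}} = \prod_{t=1}^{T} r_t \qquad \text{and} \qquad R_{\text{out}} = \min_{1 \le t \le T} r_t,</math>
respectively.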
=== Process reward model ===
One can also use an ORM to implicitly construct a PRM, similar to [[direct preference optimization]].<ref>{{Citation |last1=Yuan |first1=Lifan |title=Free Process Rewards without Process Labels |date=2024-12-02 |arxiv=2412.01981 |last2=Li |first2=Wendi |last3=Chen |first3=Huayu |last4=Cui |first4=Ganqu |last5=Ding |first5=Ning |last6=Zhang |first6=Kaiyan |last7=Zhou |first7=Bowen |last8=Liu |first8=Zhiyuan |last9=Peng |first9=Hao}}</ref>
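In this approach, the ORM's reward is parameterized as a log-likelihood ratio between the trained policy <math>\pi_\theta</math> and a reference model <math>\pi_{\text{ref}}</math>, as in direct preference optimization, so that a per-step reward can be recovered as a difference of cumulative scores; schematically,
<math display="block">r_t = \beta \log \frac{\pi_\theta(y_{\le t} \mid q)}{\pi_{\text{ref}}(y_{\le t} \mid q)} - \beta \log \frac{\pi_\theta(y_{< t} \mid q)}{\pi_{\text{ref}}(y_{< t} \mid q)},</math>
where <math>y_{\le t}</math> denotes the first <math>t</math> steps of the response to question <math>q</math> and <math>\beta</math> is a scaling hyperparameter.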
=== Guided sampling ===