Reasoning language model: Difference between revisions

Guided sampling: R1 uses ORM
Line 49:
The ORM is usually trained via [[logistic regression]], i.e. minimizing [[Cross-entropy|cross-entropy loss]].<ref name=":3" />
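As a minimal sketch of the objective described above, the following computes the mean binary cross-entropy between an ORM's raw scores and 0/1 correctness labels. The function names and the use of plain Python lists are illustrative assumptions, not taken from any cited implementation; in practice the scores would come from a neural network and the loss would be minimized by gradient descent.

```python
import math

def sigmoid(z: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def orm_cross_entropy(logits, labels):
    """Mean binary cross-entropy loss.

    logits: hypothetical raw ORM scores, one per (prompt, response) pair.
    labels: 1 if the final answer was judged correct, 0 otherwise.
    """
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)  # predicted probability that the outcome is correct
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)
```

An uninformative score of 0 on a correct response gives a loss of log 2, and the loss decreases as the scores agree more strongly with the labels.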
 
Given a PRM, an ORM can be constructed by multiplying the process rewards across the steps of the reasoning trace,<ref name=":1" /> by taking their minimum,<ref name=":3" /> or by some other method of aggregating the process rewards. DeepSeek used a simple ORM for training the [[DeepSeek (chatbot)|R1 model]].<ref>{{Citation |last1=DeepSeek-AI |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |arxiv=2501.12948 |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong}}</ref>
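The two aggregation rules named above, product and minimum, can be sketched as follows. This assumes each process reward is a per-step score in [0, 1]; the function name `orm_from_prm` and its `how` parameter are hypothetical and chosen only for illustration.

```python
import math

def orm_from_prm(step_rewards, how="product"):
    """Collapse per-step process rewards into one outcome-level score.

    step_rewards: one PRM score per reasoning step (assumed to lie in [0, 1]).
    how: "product" multiplies the step scores; "min" keeps the weakest step.
    """
    if not step_rewards:
        raise ValueError("need at least one step reward")
    if how == "product":
        return math.prod(step_rewards)
    if how == "min":
        return min(step_rewards)
    raise ValueError(f"unknown aggregation: {how!r}")
```

For step rewards 0.9, 0.5 and 0.8, the product rule gives 0.36 and the minimum rule gives 0.5. With scores in [0, 1] the product is never larger than the minimum, so it penalizes long traces containing several uncertain steps more heavily.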
 
=== Process Reward Model ===
Line 66:
 
One can also use an ORM to implicitly construct a PRM, similar to [[direct preference optimization]].<ref>{{Citation |last1=Yuan |first1=Lifan |title=Free Process Rewards without Process Labels |date=2024-12-02 |arxiv=2412.01981 |last2=Li |first2=Wendi |last3=Chen |first3=Huayu |last4=Cui |first4=Ganqu |last5=Ding |first5=Ning |last6=Zhang |first6=Kaiyan |last7=Zhou |first7=Bowen |last8=Liu |first8=Zhiyuan |last9=Peng |first9=Hao}}</ref>
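A rough sketch of the idea, under the assumption that the outcome reward is parameterized in the style of direct preference optimization as a scaled log-likelihood ratio between the trained model and a reference model: the outcome reward then splits into per-token terms, and summing those terms within each reasoning step yields a step-level reward without any step-level labels. The function name, the `step_ends` convention, and the `beta` scale are illustrative assumptions rather than the cited authors' code.

```python
def implicit_process_rewards(logp_model, logp_ref, step_ends, beta=1.0):
    """Per-step rewards from token log-probabilities.

    logp_model: log-probability of each response token under the trained model.
    logp_ref:   log-probability of the same tokens under the reference model.
    step_ends:  exclusive end index of each reasoning step, in increasing order.
    beta:       scale factor on the log-likelihood ratio (assumed hyperparameter).
    """
    rewards = []
    start = 0
    for end in step_ends:
        # Sum the token-level log-ratios that fall inside this step.
        log_ratio = sum(
            m - r for m, r in zip(logp_model[start:end], logp_ref[start:end])
        )
        rewards.append(beta * log_ratio)
        start = end
    return rewards
```

When the steps cover the whole response, the step rewards add up to the outcome-level reward, which is what allows a model trained only on outcome labels to be read as a PRM.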
 
<ref>{{Citation |last1=DeepSeek-AI |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |arxiv=2501.12948 |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong}}</ref><ref>{{Citation |last1=Shao |first1=Zhihong |title=DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |date=2024-04-27 |arxiv=2402.03300 |last2=Wang |first2=Peiyi |last3=Zhu |first3=Qihao |last4=Xu |first4=Runxin |last5=Song |first5=Junxiao |last6=Bi |first6=Xiao |last7=Zhang |first7=Haowei |last8=Zhang |first8=Mingchuan |last9=Li |first9=Y. K.}}</ref>
 
=== Guided sampling ===