Reasoning language model

=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model with comparable performance to o1 at lower cost. The release demonstrated the effectiveness of [[Group Relative Policy Optimization]] (GRPO).<ref>{{cite news |last1=Orland |first1=Kyle |date=2025-01-28 |title=How does DeepSeek R1 really fare against OpenAI's best reasoning models? |url=https://arstechnica.com/ai/2025/01/how-does-deepseek-r1-really-fare-against-openais-best-reasoning-models/ |access-date=2025-02-06 |work=Ars Technica}}</ref><ref name=":9">{{cite arXiv |last1=DeepSeek-AI |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |eprint=2501.12948 |class=cs.CL}}</ref> On January 25, 2025, [[DeepSeek]] added a feature to DeepSeek R1 that lets the model search the web while it reasons, making it easier to combine retrieval with reasoning.<ref>{{cite news |script-title=zh:DeepSeek 支持"深度思考+联网检索"能力 |trans-title=DeepSeek adds a search feature supporting simultaneous deep thinking and web search |work=People's Daily Online |date=2025-01-29 |url=http://tech.people.com.cn/n1/2025/0129/c1007-40386565.html |language=zh |access-date=2025-07-26}}</ref> OpenAI subsequently released o3-mini, followed by [[ChatGPT Deep Research|Deep Research]] based on [[OpenAI o3|o3]].<ref>{{cite news |last1=Milmo |first1=Dan |date=2025-02-03 |title=OpenAI launches 'deep research' tool that it says can match research analyst |url=https://www.theguardian.com/technology/2025/feb/03/openai-deep-research-agent-chatgpt-deepseek |access-date=2025-03-16 |work=The Guardian |language=en-GB |issn=0261-3077}}</ref> The effectiveness of distillation for reasoning models was shown in works such as s1-32B, which achieved strong performance through budget forcing and scaling methods.<ref name=":10">{{cite arXiv |last1=Muennighoff |first1=Niklas |last2=Yang |first2=Zitong |last3=Shi |first3=Weijia |last4=Li |first4=Xiang Lisa |last5=Fei-Fei |first5=Li |last6=Hajishirzi |first6=Hannaneh |last7=Zettlemoyer |first7=Luke |last8=Liang |first8=Percy |last9=Candès |first9=Emmanuel |title=s1: Simple test-time scaling |date=2025-02-03 |eprint=2501.19393 |class=cs.CL}}</ref><ref name=":6"/>
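The core idea of GRPO can be sketched as a group-normalized advantage: for each prompt, a group of completions is sampled and scored, and each completion's advantage is its reward normalized by the group's mean and standard deviation. The following is an illustrative sketch only, not DeepSeek's implementation; the function name and example rewards are assumptions.

```python
# Illustrative sketch of the group-relative advantage used in GRPO
# (assumed interface; not DeepSeek's code).
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: (r_i - mean) / std over one sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Example: four completions of one prompt scored by a rule-based verifier (1 = correct).
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Correct completions get positive advantages and incorrect ones negative, without training a separate value model.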
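Budget forcing, as described for s1, controls test-time compute by suppressing the model's end-of-thinking delimiter and appending a token such as "Wait" until a thinking budget is spent, then forcing the delimiter. A minimal sketch under assumed interfaces (`step_fn`, the delimiter string, and the word-level token count are illustrative assumptions, not the s1 code):

```python
# Illustrative sketch of s1-style budget forcing (assumed interfaces, not the s1 code).
def budget_forced_generate(step_fn, budget, end_token="</think>", wait_token="Wait"):
    """step_fn() yields successive chunks of the model's reasoning (stubbed for illustration)."""
    trace = []
    used = 0
    for chunk in step_fn():
        if chunk == end_token and used < budget:
            trace.append(wait_token)   # suppress early stop; force more thinking
            used += 1
            continue
        trace.append(chunk)
        used += len(chunk.split())     # crude word-level "token" count for the sketch
        if used >= budget:
            if trace[-1] != end_token:
                trace.append(end_token)  # budget spent: force end of thinking
            break
    return trace

# Stub "model" that tries to stop early after one short reasoning step.
def stub_model():
    yield "step one"
    yield "</think>"
    yield "step two tokens here"
    yield "</think>"

print(budget_forced_generate(stub_model, 5))
# → ['step one', 'Wait', 'step two tokens here', '</think>']
```

With a larger budget the early stop is overridden and reasoning continues; with a small budget the delimiter is inserted as soon as the budget is exhausted.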
 
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]] based on their [[OpenAI o3|o3]] model,<ref name=":5">{{cite web |date=2025-02-02 |title=Introducing deep research |url=https://openai.com/index/introducing-deep-research/ |access-date=2025-02-05 |website=OpenAI |language=en-US}}</ref> a tool that integrates reasoning and web search in one workflow, allowing users to initiate complex research tasks and generate comprehensive reports which incorporate various sources from the web.<ref name=":5" />
 
== Supervised finetuning ==
Reasoning models generally score higher than non-reasoning models on many benchmarks, especially on tasks requiring multi-step reasoning.
 
Some benchmarks exclude reasoning models because their responses take longer and cost more.<ref>{{cite book |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |chapter=Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study |title=Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things |publisher=ACM |date=2025 |pages=1–6 |doi=10.1145/3722565.3727198 |arxiv=2503.12282 |isbn=979-8-4007-1608-9 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? A Research Note |date=2025-02-13 |eprint=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |eprint=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |eprint=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>
 
=== Humanity's Last Exam ===
 
=== Generation time ===
Reasoning increases response time, with current models taking from a few seconds to several minutes to answer. As depth of reasoning grows, future models may need even longer.
 
== Models ==