Reasoning language model: Difference between revisions

 
=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model with comparable performance to o1 at lower cost. The release demonstrated the effectiveness of [[Group Relative Policy Optimization]] (GRPO).<ref>{{cite news |last1=Orland |first1=Kyle |date=2025-01-28 |title=How does DeepSeek R1 really fare against OpenAI's best reasoning models? |url=https://arstechnica.com/ai/2025/01/how-does-deepseek-r1-really-fare-against-openais-best-reasoning-models/ |access-date=2025-02-06 |work=Ars Technica}}</ref><ref name=":9">{{cite arXiv |last1=DeepSeek-AI |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |eprint=2501.12948 |class=cs.CL}}</ref> On January 25, 2025, [[DeepSeek]] added a feature to DeepSeek R1 that lets the model search the web while it reasons, making it easier to combine retrieval with reasoning.<ref>{{cite news |script-title=zh:DeepSeek 支持"深度思考+联网检索"能力 |trans-title=DeepSeek adds a search feature supporting simultaneous deep thinking and web search |work=People's Daily Online |date=2025-01-29 |url=http://tech.people.com.cn/n1/2025/0129/c1007-40386565.html |language=zh |access-date=2025-07-26}}</ref> OpenAI subsequently released o3-mini, followed by [[ChatGPT Deep Research|Deep Research]] based on [[OpenAI o3|o3]].<ref>{{cite news |last1=Milmo |first1=Dan |date=2025-02-03 |title=OpenAI launches 'deep research' tool that it says can match research analyst |url=https://www.theguardian.com/technology/2025/feb/03/openai-deep-research-agent-chatgpt-deepseek |access-date=2025-03-16 |work=The Guardian |language=en-GB |issn=0261-3077}}</ref> The effectiveness of distillation for reasoning models was shown in works such as s1-32B, which achieved strong performance through budget forcing and scaling methods.<ref name=":10">{{cite arXiv |last1=Muennighoff |first1=Niklas |last2=Yang |first2=Zitong |last3=Shi |first3=Weijia |last4=Li |first4=Xiang Lisa |last5=Fei-Fei |first5=Li |last6=Hajishirzi |first6=Hannaneh |last7=Zettlemoyer |first7=Luke |last8=Liang |first8=Percy |last9=Candès |first9=Emmanuel |title=s1: Simple test-time scaling |date=2025-02-03 |eprint=2501.19393 |class=cs.CL}}</ref><ref name=":6"/>
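A distinguishing feature of GRPO is that it scores each sampled answer relative to the other answers drawn for the same prompt, rather than relying on a separately trained value model. The following is a minimal illustrative sketch of that group-relative advantage computation only (the function name and reward values are hypothetical, and this is not DeepSeek's implementation):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sampled answer's reward
    by the mean and standard deviation of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for one prompt, rewarded 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers receive positive advantages, incorrect ones negative;
# the advantages of a group sum to (approximately) zero.
```

Because the baseline is the group's own mean reward, no learned critic is needed; the policy is simply pushed toward completions that scored above their group's average.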
 
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]] based on their [[OpenAI o3|o3]] model,<ref name=":5">{{cite web |date=2025-02-02 |title=Introducing deep research |url=https://openai.com/index/introducing-deep-research/ |access-date=2025-02-05 |website=OpenAI |language=en-US}}</ref> a tool allowing users to initiate complex research tasks and generate comprehensive reports that incorporate various sources from the web.<ref name=":5" />
 
== Supervised finetuning ==
 
== Benchmarks ==
Reasoning models generally score higher than non-reasoning models on many benchmarks, especially on tasks requiring multi-step reasoning.<ref>{{Citation |last=Wei |first=Jason |title=Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |date=2023-01-10 |url=http://arxiv.org/abs/2201.11903 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2201.11903 |id=arXiv:2201.11903 |last2=Wang |first2=Xuezhi |last3=Schuurmans |first3=Dale |last4=Bosma |first4=Maarten |last5=Ichter |first5=Brian |last6=Xia |first6=Fei |last7=Chi |first7=Ed |last8=Le |first8=Quoc |last9=Zhou |first9=Denny}}</ref><ref>{{Citation |last=Wang |first=Xuezhi |title=Self-Consistency Improves Chain of Thought Reasoning in Language Models |date=2023-03-07 |url=http://arxiv.org/abs/2203.11171 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2203.11171 |id=arXiv:2203.11171 |last2=Wei |first2=Jason |last3=Schuurmans |first3=Dale |last4=Le |first4=Quoc |last5=Chi |first5=Ed |last6=Narang |first6=Sharan |last7=Chowdhery |first7=Aakanksha |last8=Zhou |first8=Denny}}</ref><ref>{{Citation |last=Yao |first=Shunyu |title=Tree of Thoughts: Deliberate Problem Solving with Large Language Models |date=2023-12-03 |url=http://arxiv.org/abs/2305.10601 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2305.10601 |id=arXiv:2305.10601 |last2=Yu |first2=Dian |last3=Zhao |first3=Jeffrey |last4=Shafran |first4=Izhak |last5=Griffiths |first5=Thomas L. 
|last6=Cao |first6=Yuan |last7=Narasimhan |first7=Karthik}}</ref><ref>{{Cite journal |last=Cui |first=Dong-Xu |last2=Long |first2=Shi-Yu |last3=Tang |first3=Yi-Xuan |last4=Zhao |first4=Yue |last5=Li |first5=Qiao |date=2025-08-25 |title=Can Reasoning Power Significantly Improve the Knowledge of Large Language Models for Chemistry?─Based on Conversations with LLMs |url=https://doi.org/10.1021/acs.jcim.5c01265 |journal=Journal of Chemical Information and Modeling |doi=10.1021/acs.jcim.5c01265 |issn=1549-9596}}</ref><ref>{{Citation |last=Qwen |title=Qwen2.5 Technical Report |date=2024 |url=https://arxiv.org/abs/2412.15115 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/ARXIV.2412.15115 |last2=Yang |first2=An |last3=Yang |first3=Baosong |last4=Zhang |first4=Beichen |last5=Hui |first5=Binyuan |last6=Zheng |first6=Bo |last7=Yu |first7=Bowen |last8=Li |first8=Chengyuan |last9=Liu |first9=Dayiheng}}</ref><ref>{{Citation |last=Comanici |first=Gheorghe |title=Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities |date=2025-07-22 |url=http://arxiv.org/abs/2507.06261 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2507.06261 |id=arXiv:2507.06261 |last2=Bieber |first2=Eric |last3=Schaekermann |first3=Mike |last4=Pasupat |first4=Ice |last5=Sachdeva |first5=Noveen |last6=Dhillon |first6=Inderjit |last7=Blistein |first7=Marcel |last8=Ram |first8=Ori |last9=Zhang |first9=Dan}}</ref><ref>{{Cite journal |last=Mirza |first=Adrian |last2=Alampara |first2=Nawaf |last3=Kunchapu |first3=Sreekanth |last4=Ríos-García |first4=Martiño |last5=Emoekabu |first5=Benedict |last6=Krishnan |first6=Aswanth |last7=Gupta |first7=Tanya |last8=Schilling-Wilhelmi |first8=Mara |last9=Okereke |first9=Macjonathan |last10=Aneesh |first10=Anagha |last11=Asgari |first11=Mehrdad |last12=Eberhardt |first12=Juliane |last13=Elahi |first13=Amir Mohammad |last14=Elbeheiry |first14=Hani M. 
|last15=Gil |first15=María Victoria |date=July 2025 |title=A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists |url=https://www.nature.com/articles/s41557-025-01815-x |journal=Nature Chemistry |language=en |volume=17 |issue=7 |pages=1027–1034 |doi=10.1038/s41557-025-01815-x |issn=1755-4349}}</ref>
 
Some benchmarks exclude reasoning models because their responses take longer and cost more.<ref>{{cite book |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |chapter=Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study |title=Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things |publisher=ACM |date=2025 |pages=1–6 |doi=10.1145/3722565.3727198 |arxiv=2503.12282 |isbn=979-8-4007-1608-9 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? A Research Note |date=2025-02-13 |eprint=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |eprint=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |eprint=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>
 
=== Humanity's Last Exam ===
 
=== Generation time ===
Because reasoning language models tend to produce verbose outputs, generating a response takes considerably longer than for a standard [[large language model]]: current models take from a few seconds to several minutes to answer, and response times may grow further as the depth of reasoning increases.
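The scale of the slowdown follows from a back-of-envelope calculation: at a fixed decoding speed, response time grows roughly linearly with the number of tokens generated, so a long reasoning trace multiplies latency. The token counts and decoding rate below are illustrative assumptions, not measurements of any particular model:

```python
def response_time_seconds(num_tokens, tokens_per_second):
    """Approximate latency when decoding at a constant token rate."""
    return num_tokens / tokens_per_second

# Illustrative comparison: a 200-token direct answer vs. a 12,000-token
# reasoning trace, both decoded at an assumed 50 tokens per second.
direct = response_time_seconds(200, 50)             # 4.0 seconds
with_reasoning = response_time_seconds(12_000, 50)  # 240.0 seconds (4 minutes)
```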
 
== Models ==