Reasoning language model: Difference between revisions

 
=== 2025 ===
In January 2025, [[DeepSeek]] released [[DeepSeek (chatbot)|R1]], a model with comparable performance to o1 at lower cost. The release demonstrated the effectiveness of [[Group Relative Policy Optimization]] (GRPO).<ref>{{cite news |last1=Orland |first1=Kyle |date=2025-01-28 |title=How does DeepSeek R1 really fare against OpenAI's best reasoning models? |url=https://arstechnica.com/ai/2025/01/how-does-deepseek-r1-really-fare-against-openais-best-reasoning-models/ |access-date=2025-02-06 |work=Ars Technica}}</ref><ref name=":9">{{cite arXiv |last1=DeepSeek-AI |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |eprint=2501.12948 |class=cs.CL}}</ref> On January 25, 2025, [[DeepSeek]] added a feature to DeepSeek R1 that lets the model search the web while it reasons, making it easier to combine retrieval with reasoning.<ref>{{cite news |script-title=zh:DeepSeek 支持"深度思考+联网检索"能力 |trans-title=DeepSeek adds a search feature supporting simultaneous deep thinking and web search |work=People's Daily Online |date=2025-01-29 |url=http://tech.people.com.cn/n1/2025/0129/c1007-40386565.html |language=zh |access-date=2025-07-26}}</ref> OpenAI subsequently released o3-mini, followed by [[ChatGPT Deep Research|Deep Research]] based on [[OpenAI o3|o3]].<ref>{{cite news |last1=Milmo |first1=Dan |date=2025-02-03 |title=OpenAI launches 'deep research' tool that it says can match research analyst |url=https://www.theguardian.com/technology/2025/feb/03/openai-deep-research-agent-chatgpt-deepseek |access-date=2025-03-16 |work=The Guardian |language=en-GB |issn=0261-3077}}</ref> The effectiveness of distillation for reasoning models was shown in works such as s1-32B, which achieved strong performance through budget forcing and scaling methods.<ref name=":10">{{cite arXiv |last1=Muennighoff |first1=Niklas |last2=Yang |first2=Zitong |last3=Shi |first3=Weijia |last4=Li |first4=Xiang Lisa |last5=Fei-Fei |first5=Li |last6=Hajishirzi |first6=Hannaneh |last7=Zettlemoyer |first7=Luke |last8=Liang |first8=Percy |last9=Candès |first9=Emmanuel |title=s1: Simple test-time scaling |date=2025-02-03 |eprint=2501.19393 |class=cs.CL}}</ref><ref name=":6"/>
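A distinguishing feature of GRPO is that it scores each sampled answer relative to the other answers drawn for the same prompt, rather than relying on a separately trained value model. The following is a minimal illustrative sketch of that group-relative advantage computation only (the function name and reward values are hypothetical, and this is not DeepSeek's implementation):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sampled answer's reward
    by the mean and standard deviation of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for one prompt, rewarded 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers receive positive advantages, incorrect ones negative;
# the advantages of a group sum to (approximately) zero.
```

Because the baseline is the group's own mean reward, no learned critic is needed; the policy is simply pushed toward completions that scored above their group's average.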
 
On February 2, 2025, OpenAI released [[ChatGPT Deep Research|Deep Research]] based on their [[OpenAI o3|o3]] model,<ref name=":5">{{cite web |date=2025-02-02 |title=Introducing deep research |url=https://openai.com/index/introducing-deep-research/ |access-date=2025-02-05 |website=OpenAI |language=en-US}}</ref> a tool allowing users to initiate complex research tasks and generate comprehensive reports that incorporate various sources from the web.<ref name=":5" />
 
== Supervised finetuning ==
 
== Benchmarks ==
Reasoning models generally score higher than non-reasoning models on many benchmarks, especially on tasks requiring multi-step reasoning.<ref>{{Citation |last=Wei |first=Jason |title=Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |date=2023-01-10 |url=http://arxiv.org/abs/2201.11903 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2201.11903 |id=arXiv:2201.11903 |last2=Wang |first2=Xuezhi |last3=Schuurmans |first3=Dale |last4=Bosma |first4=Maarten |last5=Ichter |first5=Brian |last6=Xia |first6=Fei |last7=Chi |first7=Ed |last8=Le |first8=Quoc |last9=Zhou |first9=Denny}}</ref><ref>{{Citation |last=Wang |first=Xuezhi |title=Self-Consistency Improves Chain of Thought Reasoning in Language Models |date=2023-03-07 |url=http://arxiv.org/abs/2203.11171 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2203.11171 |id=arXiv:2203.11171 |last2=Wei |first2=Jason |last3=Schuurmans |first3=Dale |last4=Le |first4=Quoc |last5=Chi |first5=Ed |last6=Narang |first6=Sharan |last7=Chowdhery |first7=Aakanksha |last8=Zhou |first8=Denny}}</ref><ref>{{Citation |last=Yao |first=Shunyu |title=Tree of Thoughts: Deliberate Problem Solving with Large Language Models |date=2023-12-03 |url=http://arxiv.org/abs/2305.10601 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2305.10601 |id=arXiv:2305.10601 |last2=Yu |first2=Dian |last3=Zhao |first3=Jeffrey |last4=Shafran |first4=Izhak |last5=Griffiths |first5=Thomas L. 
|last6=Cao |first6=Yuan |last7=Narasimhan |first7=Karthik}}</ref><ref>{{Cite journal |last=Cui |first=Dong-Xu |last2=Long |first2=Shi-Yu |last3=Tang |first3=Yi-Xuan |last4=Zhao |first4=Yue |last5=Li |first5=Qiao |date=2025-08-25 |title=Can Reasoning Power Significantly Improve the Knowledge of Large Language Models for Chemistry?─Based on Conversations with LLMs |url=https://doi.org/10.1021/acs.jcim.5c01265 |journal=Journal of Chemical Information and Modeling |doi=10.1021/acs.jcim.5c01265 |issn=1549-9596}}</ref><ref>{{Citation |last=Qwen |title=Qwen2.5 Technical Report |date=2024 |url=https://arxiv.org/abs/2412.15115 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/ARXIV.2412.15115 |last2=Yang |first2=An |last3=Yang |first3=Baosong |last4=Zhang |first4=Beichen |last5=Hui |first5=Binyuan |last6=Zheng |first6=Bo |last7=Yu |first7=Bowen |last8=Li |first8=Chengyuan |last9=Liu |first9=Dayiheng}}</ref><ref>{{Citation |last=Comanici |first=Gheorghe |title=Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities |date=2025-07-22 |url=http://arxiv.org/abs/2507.06261 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2507.06261 |id=arXiv:2507.06261 |last2=Bieber |first2=Eric |last3=Schaekermann |first3=Mike |last4=Pasupat |first4=Ice |last5=Sachdeva |first5=Noveen |last6=Dhillon |first6=Inderjit |last7=Blistein |first7=Marcel |last8=Ram |first8=Ori |last9=Zhang |first9=Dan}}</ref><ref>{{Cite journal |last=Mirza |first=Adrian |last2=Alampara |first2=Nawaf |last3=Kunchapu |first3=Sreekanth |last4=Ríos-García |first4=Martiño |last5=Emoekabu |first5=Benedict |last6=Krishnan |first6=Aswanth |last7=Gupta |first7=Tanya |last8=Schilling-Wilhelmi |first8=Mara |last9=Okereke |first9=Macjonathan |last10=Aneesh |first10=Anagha |last11=Asgari |first11=Mehrdad |last12=Eberhardt |first12=Juliane |last13=Elahi |first13=Amir Mohammad |last14=Elbeheiry |first14=Hani M. 
|last15=Gil |first15=María Victoria |date=July 2025 |title=A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists |url=https://www.nature.com/articles/s41557-025-01815-x |journal=Nature Chemistry |language=en |volume=17 |issue=7 |pages=1027–1034 |doi=10.1038/s41557-025-01815-x |issn=1755-4349}}</ref>
 
Some benchmarks exclude reasoning models because their responses take longer and cost more.<ref>{{cite book |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |chapter=Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study |title=Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things |publisher=ACM |date=2025 |pages=1–6 |doi=10.1145/3722565.3727198 |arxiv=2503.12282 |isbn=979-8-4007-1608-9 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? A Research Note |date=2025-02-13 |eprint=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |eprint=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |eprint=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>
 
=== Humanity's Last Exam ===
 
=== Generation time ===
Because reasoning language models tend to produce verbose outputs, generating a response takes considerably longer than for a standard [[large language model]]: current models take from a few seconds to several minutes to answer, and response times may grow further as the depth of reasoning increases.
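The scale of the slowdown follows from a back-of-envelope calculation: at a fixed decoding speed, response time grows roughly linearly with the number of tokens generated, so a long reasoning trace multiplies latency. The token counts and decoding rate below are illustrative assumptions, not measurements of any particular model:

```python
def response_time_seconds(num_tokens, tokens_per_second):
    """Approximate latency when decoding at a constant token rate."""
    return num_tokens / tokens_per_second

# Illustrative comparison: a 200-token direct answer vs. a 12,000-token
# reasoning trace, both decoded at an assumed 50 tokens per second.
direct = response_time_seconds(200, 50)             # 4.0 seconds
with_reasoning = response_time_seconds(12_000, 50)  # 240.0 seconds (4 minutes)
```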
 
== Models ==