== Benchmarks ==
Reasoning models generally score higher than non-reasoning models on many benchmarks, especially on tasks requiring multi-step reasoning.<ref>{{Citation |last=Wei |first=Jason |title=Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |date=2023-01-10 |url=http://arxiv.org/abs/2201.11903 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2201.11903 |id=arXiv:2201.11903 |last2=Wang |first2=Xuezhi |last3=Schuurmans |first3=Dale |last4=Bosma |first4=Maarten |last5=Ichter |first5=Brian |last6=Xia |first6=Fei |last7=Chi |first7=Ed |last8=Le |first8=Quoc |last9=Zhou |first9=Denny}}</ref><ref>{{Citation |last=Wang |first=Xuezhi |title=Self-Consistency Improves Chain of Thought Reasoning in Language Models |date=2023-03-07 |url=http://arxiv.org/abs/2203.11171 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2203.11171 |id=arXiv:2203.11171 |last2=Wei |first2=Jason |last3=Schuurmans |first3=Dale |last4=Le |first4=Quoc |last5=Chi |first5=Ed |last6=Narang |first6=Sharan |last7=Chowdhery |first7=Aakanksha |last8=Zhou |first8=Denny}}</ref><ref>{{Citation |last=Yao |first=Shunyu |title=Tree of Thoughts: Deliberate Problem Solving with Large Language Models |date=2023-12-03 |url=http://arxiv.org/abs/2305.10601 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2305.10601 |id=arXiv:2305.10601 |last2=Yu |first2=Dian |last3=Zhao |first3=Jeffrey |last4=Shafran |first4=Izhak |last5=Griffiths |first5=Thomas L. 
|last6=Cao |first6=Yuan |last7=Narasimhan |first7=Karthik}}</ref><ref>{{Cite journal |last=Cui |first=Dong-Xu |last2=Long |first2=Shi-Yu |last3=Tang |first3=Yi-Xuan |last4=Zhao |first4=Yue |last5=Li |first5=Qiao |date=2025-08-25 |title=Can Reasoning Power Significantly Improve the Knowledge of Large Language Models for Chemistry?─Based on Conversations with LLMs |url=https://doi.org/10.1021/acs.jcim.5c01265 |journal=Journal of Chemical Information and Modeling |doi=10.1021/acs.jcim.5c01265 |issn=1549-9596}}</ref><ref>{{Citation |last=Qwen |title=Qwen2.5 Technical Report |date=2024 |url=https://arxiv.org/abs/2412.15115 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/ARXIV.2412.15115 |last2=Yang |first2=An |last3=Yang |first3=Baosong |last4=Zhang |first4=Beichen |last5=Hui |first5=Binyuan |last6=Zheng |first6=Bo |last7=Yu |first7=Bowen |last8=Li |first8=Chengyuan |last9=Liu |first9=Dayiheng}}</ref><ref>{{Citation |last=Comanici |first=Gheorghe |title=Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities |date=2025-07-22 |url=http://arxiv.org/abs/2507.06261 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2507.06261 |id=arXiv:2507.06261 |last2=Bieber |first2=Eric |last3=Schaekermann |first3=Mike |last4=Pasupat |first4=Ice |last5=Sachdeva |first5=Noveen |last6=Dhillon |first6=Inderjit |last7=Blistein |first7=Marcel |last8=Ram |first8=Ori |last9=Zhang |first9=Dan}}</ref><ref>{{Cite journal |last=Mirza |first=Adrian |last2=Alampara |first2=Nawaf |last3=Kunchapu |first3=Sreekanth |last4=Ríos-García |first4=Martiño |last5=Emoekabu |first5=Benedict |last6=Krishnan |first6=Aswanth |last7=Gupta |first7=Tanya |last8=Schilling-Wilhelmi |first8=Mara |last9=Okereke |first9=Macjonathan |last10=Aneesh |first10=Anagha |last11=Asgari |first11=Mehrdad |last12=Eberhardt |first12=Juliane |last13=Elahi |first13=Amir Mohammad |last14=Elbeheiry |first14=Hani M. 
|last15=Gil |first15=María Victoria |date=2025-07 |title=A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists |url=https://www.nature.com/articles/s41557-025-01815-x |journal=Nature Chemistry |language=en |volume=17 |issue=7 |pages=1027–1034 |doi=10.1038/s41557-025-01815-x |issn=1755-4349}}</ref>
Some benchmarks exclude reasoning models because their responses take longer and cost more.<ref>{{cite book |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |chapter=Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study |title=Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things |publisher=ACM |date=2025 |pages=1–6 |doi=10.1145/3722565.3727198 |arxiv=2503.12282 |isbn=979-8-4007-1608-9 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? 
A Research Note |date=2025-02-13 |eprint=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |eprint=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |eprint=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>
=== Generation time ===
Because reasoning language models tend to produce verbose outputs, generating a response takes substantially longer than with a standard [[large language model]].
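The relationship between output length and generation time can be illustrated with a minimal sketch. Because decoding is autoregressive (one token at a time), wall-clock time grows roughly linearly with the number of output tokens; the per-token latency and token counts below are hypothetical values chosen for illustration, not measurements from any particular model.

```python
# Hypothetical illustration: autoregressive decoding adds roughly constant
# latency per output token, so a reasoning model emitting a long chain of
# thought takes proportionally longer than one answering directly.

PER_TOKEN_LATENCY_S = 0.02  # assumed per-token decode time (illustrative)

def generation_time(output_tokens: int) -> float:
    """Estimate wall-clock generation time for a given output length."""
    return output_tokens * PER_TOKEN_LATENCY_S

direct_answer = generation_time(50)     # short, direct response
with_reasoning = generation_time(1500)  # verbose chain of thought + answer

print(f"direct: {direct_answer:.1f}s, reasoning: {with_reasoning:.1f}s")
```

Under these assumptions, the verbose response takes thirty times as long as the direct one, which is why benchmark harnesses sensitive to latency or API cost sometimes omit reasoning models.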
== Models ==