== Benchmarks ==
Reasoning models generally score higher than non-reasoning models on many benchmarks, especially on tasks requiring multi-step reasoning.<ref>{{Citation |last=Wei |first=Jason |title=Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |date=2023-01-10 |url=http://arxiv.org/abs/2201.11903 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2201.11903 |id=arXiv:2201.11903 |last2=Wang |first2=Xuezhi |last3=Schuurmans |first3=Dale |last4=Bosma |first4=Maarten |last5=Ichter |first5=Brian |last6=Xia |first6=Fei |last7=Chi |first7=Ed |last8=Le |first8=Quoc |last9=Zhou |first9=Denny}}</ref><ref>{{Citation |last=Wang |first=Xuezhi |title=Self-Consistency Improves Chain of Thought Reasoning in Language Models |date=2023-03-07 |url=http://arxiv.org/abs/2203.11171 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2203.11171 |id=arXiv:2203.11171 |last2=Wei |first2=Jason |last3=Schuurmans |first3=Dale |last4=Le |first4=Quoc |last5=Chi |first5=Ed |last6=Narang |first6=Sharan |last7=Chowdhery |first7=Aakanksha |last8=Zhou |first8=Denny}}</ref><ref>{{Citation |last=Yao |first=Shunyu |title=Tree of Thoughts: Deliberate Problem Solving with Large Language Models |date=2023-12-03 |url=http://arxiv.org/abs/2305.10601 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2305.10601 |id=arXiv:2305.10601 |last2=Yu |first2=Dian |last3=Zhao |first3=Jeffrey |last4=Shafran |first4=Izhak |last5=Griffiths |first5=Thomas L. 
|last6=Cao |first6=Yuan |last7=Narasimhan |first7=Karthik}}</ref><ref>{{Cite journal |last=Cui |first=Dong-Xu |last2=Long |first2=Shi-Yu |last3=Tang |first3=Yi-Xuan |last4=Zhao |first4=Yue |last5=Li |first5=Qiao |date=2025-08-25 |title=Can Reasoning Power Significantly Improve the Knowledge of Large Language Models for Chemistry?─Based on Conversations with LLMs |url=https://doi.org/10.1021/acs.jcim.5c01265 |journal=Journal of Chemical Information and Modeling |doi=10.1021/acs.jcim.5c01265 |issn=1549-9596}}</ref><ref>{{Citation |last=Qwen |title=Qwen2.5 Technical Report |date=2024 |url=https://arxiv.org/abs/2412.15115 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/ARXIV.2412.15115 |last2=Yang |first2=An |last3=Yang |first3=Baosong |last4=Zhang |first4=Beichen |last5=Hui |first5=Binyuan |last6=Zheng |first6=Bo |last7=Yu |first7=Bowen |last8=Li |first8=Chengyuan |last9=Liu |first9=Dayiheng}}</ref><ref>{{Citation |last=Comanici |first=Gheorghe |title=Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities |date=2025-07-22 |url=http://arxiv.org/abs/2507.06261 |access-date=2025-08-30 |publisher=arXiv |doi=10.48550/arXiv.2507.06261 |id=arXiv:2507.06261 |last2=Bieber |first2=Eric |last3=Schaekermann |first3=Mike |last4=Pasupat |first4=Ice |last5=Sachdeva |first5=Noveen |last6=Dhillon |first6=Inderjit |last7=Blistein |first7=Marcel |last8=Ram |first8=Ori |last9=Zhang |first9=Dan}}</ref><ref>{{Cite journal |last=Mirza |first=Adrian |last2=Alampara |first2=Nawaf |last3=Kunchapu |first3=Sreekanth |last4=Ríos-García |first4=Martiño |last5=Emoekabu |first5=Benedict |last6=Krishnan |first6=Aswanth |last7=Gupta |first7=Tanya |last8=Schilling-Wilhelmi |first8=Mara |last9=Okereke |first9=Macjonathan |last10=Aneesh |first10=Anagha |last11=Asgari |first11=Mehrdad |last12=Eberhardt |first12=Juliane |last13=Elahi |first13=Amir Mohammad |last14=Elbeheiry |first14=Hani M. 
|last15=Gil |first15=María Victoria |date=2025-07 |title=A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists |url=https://www.nature.com/articles/s41557-025-01815-x |journal=Nature Chemistry |language=en |volume=17 |issue=7 |pages=1027–1034 |doi=10.1038/s41557-025-01815-x |issn=1755-4349}}</ref>
Some benchmarks exclude reasoning models because their responses take longer and cost more.<ref>{{cite book |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |chapter=Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study |title=Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things |publisher=ACM |date=2025 |pages=1–6 |doi=10.1145/3722565.3727198 |arxiv=2503.12282 |isbn=979-8-4007-1608-9 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? 
A Research Note |date=2025-02-13 |eprint=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |eprint=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |eprint=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>
=== Generation time ===
Because reasoning language models tend to produce verbose outputs, generating a response takes substantially longer than with a standard [[large language model]].
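The relationship between output length and generation time can be illustrated with a minimal sketch. Because decoding is autoregressive (one token at a time), wall-clock time grows roughly linearly with the number of output tokens; the per-token latency and token counts below are hypothetical values chosen for illustration, not measurements from any particular model.

```python
# Hypothetical illustration: autoregressive decoding adds roughly constant
# latency per output token, so a reasoning model emitting a long chain of
# thought takes proportionally longer than one answering directly.

PER_TOKEN_LATENCY_S = 0.02  # assumed per-token decode time (illustrative)

def generation_time(output_tokens: int) -> float:
    """Estimate wall-clock generation time for a given output length."""
    return output_tokens * PER_TOKEN_LATENCY_S

direct_answer = generation_time(50)     # short, direct response
with_reasoning = generation_time(1500)  # verbose chain of thought + answer

print(f"direct: {direct_answer:.1f}s, reasoning: {with_reasoning:.1f}s")
```

Under these assumptions, the verbose response takes thirty times as long as the direct one, which is why benchmark harnesses sensitive to latency or API cost sometimes omit reasoning models.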
== Models ==