Reasoning models score higher than non-reasoning models on many benchmarks, especially on tasks that require multi-step reasoning.
Some benchmarks exclude reasoning models because their responses take longer and cost more.<ref>{{cite book |last1=Huang |first1=Yuting |last2=Zois |first2=Christos |last3=Wang |first3=Yue |last4=Zhang |first4=Yue |last5=Mavromatis |first5=Christos |last6=Zeng |first6=Jiachen |last7=Yin |first7=Shihao |last8=Voulkidis |first8=Antonios |last9=Shepard |first9=Daniel |chapter=Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study |title=Proceedings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things |publisher=ACM |date=2025 |pages=1–6 |doi=10.1145/3722565.3727198 |arxiv=2503.12282 |isbn=979-8-4007-1608-9 |quote=Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.}}</ref><ref>{{cite arXiv |last1=Hu |first1=Zihao |last2=Wang |first2=Yuqing |last3=Sun |first3=Rui |last4=Lu |first4=Haoran |last5=Gong |first5=Qian |last6=Wang |first6=Jinshuai |last7=Gong |first7=Yunlong |last8=Huang |first8=Yiming |last9=He |first9=Peng |title=Inference-Time Compute: More Faithful? A Research Note |date=2025-02-13 |eprint=2502.09673 |class=cs.CL |quote=we were unable to evaluate O1 and R1 …}}</ref><ref>{{cite arXiv |last1=Chen |first1=Guoliang |last2=Zhu |first2=Zhiyao |last3=Meng |first3=Qinxiang |last4=Liang |first4=Weilin |last5=Ji |first5=Zijie |last6=Liu |first6=Jiangning |last7=Zeng |first7=Jie |title=RealBench: Evaluating LLMs as Verilog Engineers |date=2025-03-07 |eprint=2503.04914 |class=cs.AI |quote=For O1-preview, we sample only once due to high cost.}}</ref><ref>{{cite arXiv |last1=Gupta |first1=Arpit |last2=Schapira |first2=Michael |last3=Gill |first3=Phillipa |last4=Seetharaman |first4=Srinivasan |title=On the Feasibility of Using LLMs to Execute Multistage Network Attacks |date=2025-01-30 |eprint=2501.16466 |class=cs.CR |quote=We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.}}</ref>