Self-consistency can be combined with an ORM. The model is sampled multiple times to generate candidate answers, and the answers are clustered so that each cluster contains the same final answer. The ORM computes a reward for each answer, the rewards within each cluster are summed, and the answer corresponding to the cluster with the highest summed reward is output.<ref name=":3" />
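The following is a minimal sketch of this reward-weighted voting procedure. The <code>samples</code> list of (completion, final answer) pairs and the <code>orm_score</code> function standing in for the ORM are hypothetical placeholders, not part of any particular implementation.

<syntaxhighlight lang="python">
from collections import defaultdict

def select_answer(samples, orm_score):
    """Self-consistency combined with an ORM (reward-weighted voting).

    samples: list of (completion, final_answer) pairs sampled from the model.
    orm_score: function mapping a completion to a scalar reward from the ORM.
    """
    cluster_reward = defaultdict(float)
    for completion, answer in samples:
        # Completions sharing the same final answer form one cluster;
        # their ORM rewards are summed.
        cluster_reward[answer] += orm_score(completion)
    # Return the answer of the cluster with the highest summed reward.
    return max(cluster_reward, key=cluster_reward.get)
</syntaxhighlight>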
== Benchmarks ==
Reasoning models generally outperform non-reasoning models in most benchmarks, especially on tasks requiring multi-step reasoning.
However, some benchmarks exclude reflective models due to longer response times.
=== Humanity's Last Exam ===
The [[Humanity's Last Exam|HLE]], a rigorous benchmark designed to assess expert-level reasoning across mathematics, humanities, and the natural sciences, reveals substantial performance gaps among models. State-of-the-art reasoning models have demonstrated low accuracy on HLE, highlighting significant room for improvement. In particular, the full reasoning model [[OpenAI o3|o3]] achieved an accuracy of 26.6%,<ref>{{Cite web |last=McKenna |first=Greg |title=OpenAI's deep research can complete 26% of Humanity's Last Exam |url=https://fortune.com/2025/02/12/openai-deepresearch-humanity-last-exam/ |access-date=2025-03-16 |website=Fortune |language=en}}</ref> while its lighter counterpart, o3‑mini-high (evaluated on text‑only questions), reached 13%.<ref>{{Cite web |author1=John-Anthony Disotto |date=2025-02-04 |title=OpenAI's Deep Research smashes records for the world's hardest AI exam, with ChatGPT o3-mini and DeepSeek left in its wake |url=https://www.techradar.com/computing/artificial-intelligence/openais-deep-research-smashes-records-for-the-worlds-hardest-ai-exam-with-chatgpt-o3-mini-and-deepseek-left-in-its-wake |access-date=2025-03-16 |website=TechRadar |language=en}}</ref>
=== AIME ===
The [[American Invitational Mathematics Examination]] (AIME) benchmark, a challenging mathematics competition, demonstrates significant performance differences between model types. Non-reasoning models typically solve fewer than 30% of AIME problems. In contrast, models employing reasoning techniques score between 50% and 80%.<ref name=":1">{{Cite web |date=2025-02-10 |title=MathArena |url=https://matharena.ai/ |access-date=2025-02-10 |archive-url=https://web.archive.org/web/20250210032556/https://matharena.ai/ |archive-date=10 February 2025 }}</ref> While [[OpenAI o1|OpenAI's o1]] maintained or slightly improved its accuracy from its reported 2024{{Source?|date=July 2022}} metrics to its 2025 AIME results, o3-mini (high) achieved higher accuracy (80%) at roughly one-twelfth the cost.
=== o3-mini performance ===
According to OpenAI's January 2025 report on o3-mini, adjustable "reasoning effort" significantly affects performance, particularly on [[STEM]] tasks. Increasing reasoning effort from low to high boosts accuracy on benchmarks such as AIME 2024, GPQA Diamond, and [[Codeforces]], typically yielding gains in the range of 10–30%. With high reasoning effort, o3-mini (high) achieved 87.3% on AIME (different from the MathArena AIME benchmark results), 79.7% on GPQA Diamond, an Elo rating of 2130 on Codeforces, and 49.3% on SWE-bench Verified.<ref>{{Cite web |date=2025-01-31 |title=OpenAI o3-mini |url=https://openai.com/index/openai-o3-mini/ |access-date=2025-02-09 |website=OpenAI |language=en-US}}</ref>
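As an illustration of how such a setting is exposed, the sketch below uses the OpenAI Python SDK's <code>reasoning_effort</code> parameter to request the same prompt at two effort levels. The model identifier, prompt, and parameter availability are assumptions for illustration and depend on SDK version and account access.

<syntaxhighlight lang="python">
from openai import OpenAI  # official OpenAI Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask the same question at two reasoning-effort settings and compare outputs.
for effort in ("low", "high"):
    response = client.chat.completions.create(
        model="o3-mini",           # assumed model identifier
        reasoning_effort=effort,   # "low", "medium", or "high"
        messages=[{"role": "user", "content": "How many primes are less than 100?"}],
    )
    print(effort, response.choices[0].message.content)
</syntaxhighlight>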
== Drawbacks ==
=== Computational cost ===
Reasoning models require significantly more test-time compute than non-reasoning models. On the AIME benchmark, reasoning models were 10 to 74 times more expensive<ref name=":1" /> than non-reasoning counterparts.
=== Generation time ===
Reflective reasoning increases response times, with current models taking anywhere from three seconds to several minutes to generate an answer. As reasoning depth improves, future models may require even longer processing times.
== Models ==