== Benchmark ==
{{Main|Benchmark (computing)|List of language model benchmarks}}
 
The reasoning ability of language models is usually tested on problems that have unambiguous solutions which can be cheaply checked, and that require reasoning when solved by a human. Such problems usually come from mathematics and [[competitive programming]]. The answer is usually an array of integers, a multiple-choice letter, or a program that passes [[Unit testing|unit tests]] within a limited runtime, so that answers can be checked automatically (illustrated after the list below). Some common benchmarks include:
 
* GSM8K (Grade School Math): 8.5K linguistically diverse [[Primary school|elementary school]] [[Word problem (mathematics education)|math word problems]] that require 2 to 8 basic arithmetic operations to solve.<ref name=":2" />
* [[MMLU]] (Measuring Massive Multitask Language Understanding): 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law, and medicine.<ref>{{Citation |last1=Hendrycks |first1=Dan |title=Measuring Massive Multitask Language Understanding |date=2021-01-12 |arxiv=2009.03300 |last2=Burns |first2=Collin |last3=Basart |first3=Steven |last4=Zou |first4=Andy |last5=Mazeika |first5=Mantas |last6=Song |first6=Dawn |last7=Steinhardt |first7=Jacob}}</ref>
* MATH: 12,500 competition-level math problems.<ref>{{Citation |last1=Hendrycks |first1=Dan |title=Measuring Mathematical Problem Solving With the MATH Dataset |date=2021-11-08 |arxiv=2103.03874 |last2=Burns |first2=Collin |last3=Kadavath |first3=Saurav |last4=Arora |first4=Akul |last5=Basart |first5=Steven |last6=Tang |first6=Eric |last7=Song |first7=Dawn |last8=Steinhardt |first8=Jacob}}</ref>
* MathEval: An omnibus benchmark combining 20 other benchmarks, including GSM8K, MATH, and the mathematics portion of MMLU, for a total of over 20,000 math problems whose difficulty ranges from elementary school to high-school competition level.<ref>{{Citation |last=math-eval |title=math-eval/MathEval |date=2025-01-26 |url=https://github.com/math-eval/MathEval |access-date=2025-01-27}}</ref>
* GPQA (Google-Proof Q&A): 448 multiple-choice questions written by ___domain experts in biology, physics, and chemistry that require PhD-level expertise to solve.<ref>{{Citation |last1=Rein |first1=David |title=GPQA: A Graduate-Level Google-Proof Q&A Benchmark |date=2023-11-20 |arxiv=2311.12022 |last2=Hou |first2=Betty Li |last3=Stickland |first3=Asa Cooper |last4=Petty |first4=Jackson |last5=Pang |first5=Richard Yuanzhe |last6=Dirani |first6=Julien |last7=Michael |first7=Julian |last8=Bowman |first8=Samuel R.}}</ref>
* HumanEval: Programming problems where the solution is always a Python function, often just a few lines long.<ref name=":4">{{Citation |last1=Chen |first1=Mark |title=Evaluating Large Language Models Trained on Code |date=2021-07-14 |arxiv=2107.03374 |last2=Tworek |first2=Jerry |last3=Jun |first3=Heewoo |last4=Yuan |first4=Qiming |last5=Pinto |first5=Henrique Ponde de Oliveira |last6=Kaplan |first6=Jared |last7=Edwards |first7=Harri |last8=Burda |first8=Yuri |last9=Joseph |first9=Nicholas}}</ref>
* SWE-Bench: 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase and an issue, the task is to edit the codebase to solve the issue.<ref>{{Citation |last1=Jimenez |first1=Carlos E. |title=SWE-bench: Can Language Models Resolve Real-World GitHub Issues? |date=2024-11-11 |arxiv=2310.06770 |last2=Yang |first2=John |last3=Wettig |first3=Alexander |last4=Yao |first4=Shunyu |last5=Pei |first5=Kexin |last6=Press |first6=Ofir |last7=Narasimhan |first7=Karthik}}</ref>
* ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence): Abstract visual puzzles over small colored grids, similar in spirit to a [[Raven's Progressive Matrices]] test.<ref>{{Cite web |title=ARC Prize |url=https://arcprize.org/ |access-date=2025-01-27 |website=ARC Prize |language=en}}</ref>
* LiveBench: A series of benchmarks released monthly, including high school math competition questions, competitive coding questions, logic puzzles, and other tasks.<ref>{{Cite web |title=LiveBench |url=https://livebench.ai/ |access-date=2025-01-27 |website=livebench.ai}}</ref>
* FrontierMath: Questions from areas of modern math that are difficult for professional mathematicians to solve. Each question has an integer solution.<ref>{{Citation |last1=Glazer |first1=Elliot |title=FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI |date=2024-12-20 |arxiv=2411.04872 |last2=Erdil |first2=Ege |last3=Besiroglu |first3=Tamay |last4=Chicharro |first4=Diego |last5=Chen |first5=Evan |last6=Gunning |first6=Alex |last7=Olsson |first7=Caroline Falkman |last8=Denain |first8=Jean-Stanislas |last9=Ho |first9=Anson}}</ref>
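
The following Python sketch illustrates this style of cheap automatic checking: exact matching of a final integer answer, as in GSM8K, and running a candidate program against unit tests under a runtime limit, as in HumanEval. The function names and the tiny example problems are illustrative assumptions and are not taken from any benchmark's official evaluation harness.

<syntaxhighlight lang="python">
import os
import re
import subprocess
import sys
import tempfile


def check_numeric_answer(model_output: str, reference: int) -> bool:
    """GSM8K-style grading: take the last integer in the output and compare it exactly."""
    matches = re.findall(r"-?\d+", model_output.replace(",", ""))
    return bool(matches) and int(matches[-1]) == reference


def check_program(candidate_source: str, test_source: str, timeout_s: float = 5.0) -> bool:
    """HumanEval-style grading: run the candidate function together with its unit tests
    in a subprocess, failing on any error or on exceeding the runtime limit."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_source + "\n\n" + test_source + "\n")
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)


# Hypothetical examples of each answer format:
print(check_numeric_answer("... so the total is 42.", 42))   # True
print(check_program("def add(a, b):\n    return a + b",
                    "assert add(2, 3) == 5"))                # True
</syntaxhighlight>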
 
The benchmark scores are of the following kinds: