== Benchmark ==
{{Main|Benchmark (computing)|List of language model benchmarks}}
The reasoning ability of language models is usually tested on problems that have unambiguous solutions which can be cheaply checked, and that require reasoning when solved by a human. These are usually drawn from mathematics and [[competitive programming]]. The answer is typically an array of integers, a multiple-choice letter, or a program that passes [[Unit testing|unit tests]] within a limited runtime, as sketched after the list below. Some common benchmarks include:
* GSM8K (Grade School Math): 8.5K linguistically diverse [[Primary school|elementary school]] [[Word problem (mathematics education)|math word problems]] that require 2 to 8 basic arithmetic operations to solve.<ref name=":2" />
* [[MMLU]] (Measuring Massive Multitask Language Understanding): 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law, and medicine.<ref>{{Citation |last1=Hendrycks |first1=Dan |title=Measuring Massive Multitask Language Understanding |date=2021-01-12 |arxiv=2009.03300 |last2=Burns |first2=Collin |last3=Basart |first3=Steven |last4=Zou |first4=Andy |last5=Mazeika |first5=Mantas |last6=Song |first6=Dawn |last7=Steinhardt |first7=Jacob}}</ref>
* GPQA (Google-Proof Q&A): 448 multiple-choice questions written by ___domain experts in biology, physics, and chemistry that require PhD-level expertise to solve.<ref>{{Citation |last1=Rein |first1=David |title=GPQA: A Graduate-Level Google-Proof Q&A Benchmark |date=2023-11-20 |arxiv=2311.12022 |last2=Hou |first2=Betty Li |last3=Stickland |first3=Asa Cooper |last4=Petty |first4=Jackson |last5=Pang |first5=Richard Yuanzhe |last6=Dirani |first6=Julien |last7=Michael |first7=Julian |last8=Bowman |first8=Samuel R.}}</ref>
* HumanEval: Programming problems where the solution is always a [[Python (programming language)|Python]] function, often only a few lines long.<ref name=":4">{{Citation |last1=Chen |first1=Mark |title=Evaluating Large Language Models Trained on Code |date=2021-07-14 |arxiv=2107.03374 |last2=Tworek |first2=Jerry |last3=Jun |first3=Heewoo |last4=Yuan |first4=Qiming |last5=Pinto |first5=Henrique Ponde de Oliveira |last6=Kaplan |first6=Jared |last7=Edwards |first7=Harri |last8=Burda |first8=Yuri |last9=Joseph |first9=Nicholas}}</ref>
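Scoring on such benchmarks is typically automated: short-form answers (an integer or a multiple-choice letter) are graded by exact match against a reference, while program solutions are graded by running unit tests under a time limit. The following is a minimal illustrative sketch of this kind of checking, not the official harness of any benchmark; the helper names (<code>grade_exact_match</code>, <code>grade_program</code>) and the answer-extraction heuristic are assumptions made for the example.

<syntaxhighlight lang="python">
import multiprocessing
import re


def grade_exact_match(model_output: str, reference: str) -> bool:
    """Grade a short-form answer (e.g. a GSM8K integer or an MMLU letter)
    by comparing the last extracted answer token against the reference."""
    # Heuristic: take the last number or choice letter appearing in the output.
    tokens = re.findall(r"-?\d+(?:\.\d+)?|[A-D]", model_output)
    return bool(tokens) and tokens[-1].strip() == reference.strip()


def _run(candidate_source: str, test_source: str, queue) -> None:
    """Execute the candidate program, then its unit tests, in one namespace."""
    try:
        namespace = {}
        exec(candidate_source, namespace)   # define the candidate function
        exec(test_source, namespace)        # tests raise AssertionError on failure
        queue.put(True)
    except Exception:
        queue.put(False)


def grade_program(candidate_source: str, test_source: str,
                  timeout_s: float = 5.0) -> bool:
    """Grade a HumanEval-style completion: it passes if its unit tests
    finish without error within the runtime limit."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run,
                                   args=(candidate_source, test_source, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():          # exceeded the runtime limit
        proc.terminate()
        return False
    return not queue.empty() and queue.get()
</syntaxhighlight>

Running candidate code in a separate process lets the grader enforce the runtime limit and survive crashes; real harnesses additionally sandbox the execution, which the sketch above only gestures at.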
The benchmark scores are of the following kinds: