== Benchmark ==
{{Main|Benchmark (computing)|List of language model benchmarks}}
The reasoning ability of language models is usually tested on problems that have unambiguous solutions which can be cheaply checked, and that require reasoning when solved by a human. These are usually drawn from mathematics and [[competitive programming]]. The answer is typically an array of integers, a multiple-choice letter, or a program that passes [[Unit testing|unit tests]] within a limited runtime, as sketched after the list below. Some common benchmarks include:
* GSM8K (Grade School Math): 8.5K linguistically diverse [[Primary school|elementary school]] [[Word problem (mathematics education)|math word problems]] that require 2 to 8 basic arithmetic operations to solve.<ref name=":2" />
* [[MMLU]] (Measuring Massive Multitask Language Understanding): 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law, and medicine.<ref>{{Citation |last1=Hendrycks |first1=Dan |title=Measuring Massive Multitask Language Understanding |date=2021-01-12 |arxiv=2009.03300 |last2=Burns |first2=Collin |last3=Basart |first3=Steven |last4=Zou |first4=Andy |last5=Mazeika |first5=Mantas |last6=Song |first6=Dawn |last7=Steinhardt |first7=Jacob}}</ref>
* GPQA (Google-Proof Q&A): 448 multiple-choice questions written by ___domain experts in biology, physics, and chemistry that require PhD-level expertise to solve.<ref>{{Citation |last1=Rein |first1=David |title=GPQA: A Graduate-Level Google-Proof Q&A Benchmark |date=2023-11-20 |arxiv=2311.12022 |last2=Hou |first2=Betty Li |last3=Stickland |first3=Asa Cooper |last4=Petty |first4=Jackson |last5=Pang |first5=Richard Yuanzhe |last6=Dirani |first6=Julien |last7=Michael |first7=Julian |last8=Bowman |first8=Samuel R.}}</ref>
* HumanEval: Programming problems where the solution is always a [[Python (programming language)|Python]] function, often only a few lines long.<ref name=":4">{{Citation |last1=Chen |first1=Mark |title=Evaluating Large Language Models Trained on Code |date=2021-07-14 |arxiv=2107.03374 |last2=Tworek |first2=Jerry |last3=Jun |first3=Heewoo |last4=Yuan |first4=Qiming |last5=Pinto |first5=Henrique Ponde de Oliveira |last6=Kaplan |first6=Jared |last7=Edwards |first7=Harri |last8=Burda |first8=Yuri |last9=Joseph |first9=Nicholas}}</ref>
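Scoring on such benchmarks is typically automated: short-form answers (an integer or a multiple-choice letter) are graded by exact match against a reference, while program solutions are graded by running unit tests under a time limit. The following is a minimal illustrative sketch of this kind of checking, not the official harness of any benchmark; the helper names (<code>grade_exact_match</code>, <code>grade_program</code>) and the answer-extraction heuristic are assumptions made for the example.

<syntaxhighlight lang="python">
import multiprocessing
import re


def grade_exact_match(model_output: str, reference: str) -> bool:
    """Grade a short-form answer (e.g. a GSM8K integer or an MMLU letter)
    by comparing the last extracted answer token against the reference."""
    # Heuristic: take the last number or choice letter appearing in the output.
    tokens = re.findall(r"-?\d+(?:\.\d+)?|[A-D]", model_output)
    return bool(tokens) and tokens[-1].strip() == reference.strip()


def _run(candidate_source: str, test_source: str, queue) -> None:
    """Execute the candidate program, then its unit tests, in one namespace."""
    try:
        namespace = {}
        exec(candidate_source, namespace)   # define the candidate function
        exec(test_source, namespace)        # tests raise AssertionError on failure
        queue.put(True)
    except Exception:
        queue.put(False)


def grade_program(candidate_source: str, test_source: str,
                  timeout_s: float = 5.0) -> bool:
    """Grade a HumanEval-style completion: it passes if its unit tests
    finish without error within the runtime limit."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run,
                                   args=(candidate_source, test_source, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():          # exceeded the runtime limit
        proc.terminate()
        return False
    return not queue.empty() and queue.get()
</syntaxhighlight>

Running candidate code in a separate process lets the grader enforce the runtime limit and survive crashes; real harnesses additionally sandbox the execution, which the sketch above only gestures at.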
The benchmark scores are of the following kinds: