Revision as of 10:36, 18 April 2024 by AndyFielding (minor edit; no edit summary; Tag: Visual edit)
Revision as of 19:04, 21 April 2024 by Colin M (→Evaluation and benchmarks: wikilink MMLU)
Line 52:
* Stanford Sentiment [[Treebank]]<ref>{{Cite web|url=https://nlp.stanford.edu/sentiment/treebank.html|title=Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank|website=nlp.stanford.edu|access-date=2019-02-25|archive-date=27 October 2020|archive-url=https://web.archive.org/web/20201027125825/https://nlp.stanford.edu/sentiment/treebank.html|url-status=live}}</ref>
* Winograd NLI
* BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, [[MMLU|MMLU (Massive Multitask Language Understanding)]], BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs.<ref>{{Citation| last = Hendrycks| first = Dan| title = Measuring Massive Multitask Language Understanding| accessdate = 2023-03-15| date = 2023-03-14| url = https://github.com/hendrycks/test| archive-date = 15 March 2023| archive-url = https://web.archive.org/web/20230315011614/https://github.com/hendrycks/test| url-status = live}}</ref> ([https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md LLaMa Benchmark])

== See also ==

Language model: Difference between revisions