Revision as of 22:14, 19 February 2025, Cosmia Nebula (talk | contribs), Extended confirmed users, 11,296 edits, →See also, Tag: Visual edit
Revision as of 22:17, 19 February 2025, Cosmia Nebula (talk | contribs), Extended confirmed users, 11,296 edits, →Benchmark, Tag: Visual edit
Line 83:
{{Main|Benchmark (computing)|List of language model benchmarks}}

The reasoning ability of language models is usually tested on problems ~~of which there are~~ with unambiguous solutions that can be cheaply checked and that require reasoning when solved by a human. ~~These~~ Such problems are usually in mathematics and [[competitive programming]]. The answer is usually an array of integers, a multiple-choice letter, or a program that passes [[Unit testing|unit tests]] within a limited runtime. Some common benchmarks include:

* GSM8K (Grade School Math): 8.5K linguistically diverse [[Primary school|elementary school]] [[Word problem (mathematics education)|math word problems]] that require 2 to 8 basic arithmetic operations to solve.<ref name=":2" />
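
To illustrate what "cheaply checked" can mean for an integer-valued answer, here is a minimal sketch of an exact-match checker. It assumes the model's final answer is the last integer appearing in its output; that convention and the function names `extract_final_integer` and `is_correct` are hypothetical, chosen for illustration, and are not taken from any benchmark's official evaluation script.

```python
import re


def extract_final_integer(text):
    """Return the last integer found in text, or None if there is none.

    Assumption: the final answer is the last integer in the output.
    Thousands separators such as "1,234" are accepted and stripped.
    """
    matches = re.findall(r"-?\d[\d,]*", text)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def is_correct(model_output, reference):
    """Exact-match check of the extracted integer against the reference."""
    predicted = extract_final_integer(model_output)
    return predicted is not None and predicted == reference
```

A check of this kind needs no human judgment, which is why problems with a single unambiguous answer are convenient for automated evaluation. Checking a program against unit tests within a limited runtime would instead require running it in a sandbox with a timeout, which this sketch does not cover.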

Reasoning language model: Difference between revisions