The most widely used benchmarks for evaluating LLMs

1 point · kavaivaleri · 2y ago · 1 comment

Commonsense Reasoning: HellaSwag, WinoGrande, PIQA, SIQA, OpenBookQA, ARC, CommonsenseQA

Logical Reasoning: MMLU, BIG-Bench Hard (BBH)

Mathematical Reasoning: GSM8K, MATH, MGSM, DROP

Code Generation: HumanEval, MBPP

World Knowledge & QA: NaturalQuestions, TriviaQA, MMMU, TruthfulQA

I collected their descriptions and links to their original papers here: https://www.turingpost.com/p/llm-benchmarks


I've never been able to open a Turing Post link; they all give an SSL error...
