Logical Reasoning: MMLU, BIG-Bench Hard (BBH)
Mathematical Reasoning: GSM8K, MATH, MGSM, DROP
Code Generation: HumanEval, MBPP
World Knowledge & QA: NaturalQuestions, TriviaQA, MMMU, TruthfulQA
I collected their descriptions and links to their original papers here: https://www.turingpost.com/p/llm-benchmarks