It turns out creating "good tests" isn't so easy. Generally tests suffer from one or more of the following: subjective grading, game-able question formats, or superficial coverage of the material.
Most tests administered by school teachers have very subjective grading. Basically every essay question has subjective grading, as do "show your work" math test questions. Multiple choice questions remove much of that subjectivity, but at the cost of a format that can often be gamed and that often covers the subject matter in a superficial way.
If incentives are correctly aligned, tests with subjective grading can work reasonably well. But when tests taken by students are used to evaluate the performance of teachers, the incentives of the teachers align against you. For wide-scale experimentation, you want a test that measures meaningful understanding of the subject matter, can be graded objectively, and defies the attempts of students, teachers, and school administrators to subvert the results. That's basically the holy grail of tests.