The problem of incentivizing guess-and-check behavior when the test scripts are provided upfront is, IMO, fixed by making the students write the tests themselves. This paper explains it pretty well:
https://www.cs.tufts.edu/~nr/cs257/archive/stephen-edwards/a...
Essentially, the students write a test suite and the grading framework grades on three measures:
1) Code coverage when running the student's test cases against an instructor's reference solution
2) Correctness of output: running the student's test cases on the student's code and comparing with the output from running those same test cases on the reference solution
3) The fraction of the student's own test suite that the student's code passes
Also from the paper:
"All three measures are taken on a 0%–100% scale, and the three components are simply multiplied together. As a result, the score in each dimension becomes a “cap” for the overall score—it is not possible for a student to do poorly in one dimension but do well overall. Also, the effect of the multiplication is that a student cannot accept so-so scores across the board. Instead, near-perfect performance in at least two dimensions should become the expected norm for students."
Students still get the benefit of knowing their grade when they submit, and as a bonus they get hands-on experience with test-driven development. Having them write the tests themselves also raises the cost of blindly mutating code until it happens to pass.