The trick with the F is a good one. Yes, tokenization is very likely one of the contributing factors.
The point of the eval is an attempt to exercise: attention to detail, tokenization, ability to stop, ability to follow instruction and ignore irrelevant instructions, biases, laziness, computational limits, etc.