So the interesting part about this one is that when I had the model write up the results for that session:
https://github.com/karpathy/autoresearch/discussions/32
Look at its comment about this "improvement":
"""
Surprising non-results:
- Changing random seed from 42→137 improved by 0.0004. Seed 7 was worse. Make of that what you will.
"""
So the model knows! After the fact, it recognizes that this was a weird thing to do. It's silly that the model even tried this and ran it, but some part of it also knows it was wrong. That means this behavior should be fixable via prompt.md.
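To make the point concrete, here's a toy sketch (entirely hypothetical, not the actual benchmark or repo code) of why a +0.0004 delta from swapping seeds is meaningless: if the only thing varying between runs is the RNG seed, the run-to-run spread typically dwarfs a difference that small, so picking the "best" seed is just overfitting to noise.

```python
import random

def eval_score(seed: int) -> float:
    """Fake eval metric: a fixed 'true' score of 0.85 plus
    seed-induced noise with std dev 0.005 (all numbers made up)."""
    rng = random.Random(seed)
    return 0.85 + rng.gauss(0, 0.005)

# The same three seeds the model tried.
scores = {s: eval_score(s) for s in (7, 42, 137)}
spread = max(scores.values()) - min(scores.values())

print(scores)
print(f"spread across seeds: {spread:.4f}")
# The spread between seeds comes purely from noise, so a 0.0004
# "improvement" from seed 42 -> 137 carries no signal.
```

The fix isn't picking a lucky seed; it's averaging over several seeds (or reporting the variance) so deltas smaller than the noise floor get flagged as non-results, which is what the model's own writeup half-admits.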