>It's pretty easy to get determinism with a simple harness for a well-defined set of tasks with the recent models that are post-trained for tool use.
Do you have a source? Claude Code is the only genetic system that seems to really work well enough to be useful, and it’s equipped with an absolutely absurd amount of testing and redundancy to make it useful.
Should I read that as 'generic system'? Most hard data is with company internal evals, but for the well defined tasks externally it's been pretty easy to spin up a basic tool loop and validate. Did you have something in mind? [I don't necessarily count 'coding' as well-defined in the generic sense, so I suspect we're coming at this from different scopes re: the definition of 'LLMs somewhat deterministic and useful as tools']