Observations:
4.6 had previously failed to the point where I had to wipe context. It must have written memories because it was referring to the previous conversation.
As the article points out, 4.6 went out of its way to be lazy and came up with an unusable plan. It did extra planning to avoid renaming files (the top-level task description involves reorganizing directories of files).
4.6 took twice as long to respond as 4.5.
I’m treating this as a model regression. 4.6 is borderline unusable. I’ve hit all the issues the article describes.
Also, there needs to be an obvious way to disable memory, or at least to clear it. The current UX is terrible: once an error or incorrect refusal propagates, there is no obvious recovery path.
Anyway, with think set to high, I see drastically different behavior: much slower and much worse output from 4.6.