(I'm from OpenAI.)
The following are true:
- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release)
- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.
- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but to be honest not every small change gets logged here. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change though, as can the product / prompt / harness.
ChatGPT release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...
Codex changelog: https://developers.openai.com/codex/changelog/
Codex CLI commit history: https://github.com/openai/codex/commits/main/
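To make the "non-associative math" caveat concrete, here is a minimal sketch in plain Python. It is not how any inference stack actually reduces values; `naive_sum` is a hypothetical helper for illustration. It only shows that adding the same floating-point numbers in a different order can give a different result, which is the kind of effect that can vary by batch or hardware.

```python
def naive_sum(values):
    """Add floats strictly left to right, with no compensation."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three numbers, two different orders of addition.
order_a = naive_sum([1e16, 1.0, -1e16])   # the 1.0 is lost when added to 1e16
order_b = naive_sum([1e16, -1e16, 1.0])   # the large terms cancel first

print(order_a)  # 0.0
print(order_b)  # 1.0

# Grouping alone is enough to change the result, even for small numbers.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

In a real model the differences are tiny per operation, but they can occasionally flip a close call between two candidate tokens, so outputs are not guaranteed to be bit-identical across runs.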
I've had this perceived experience so many times, and while of course it's almost impossible to be objective about this, it just seems so in your face.
I don't rule out that it's the novelty wearing off as I get used to it, plus psychological factors. Do you have any take on this?
https://www.reddit.com/r/OpenAI/comments/1qv77lq/chatgpt_low...
Maybe a dumb question but does this mean model quality may vary based on which hardware your request gets routed to?
I feel like you need to be making a bigger statement about this. If you go onto various parts of the Net (Reddit, the bird site etc) half the posts about AI are seemingly conspiracy theories that AI companies are watering down their products after release week.
It's worth checking different versions of Claude Code, and updating your tools if you don't do it automatically. Also run the same prompts through VS Code, Cursor, Claude Code in terminal, etc. You can get very different model responses based on the system prompt, what context is passed via the harness, how the rules are loaded and all sorts of minor tweaks.
If you make raw API calls and see behavioural changes over time, that would be another concern.
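One rough way to check raw API calls for changes over time is to keep a fixed prompt set and compare fingerprints of the responses between runs. The sketch below assumes a hypothetical `call_model(prompt)` function that you would replace with your own pinned-model API call; the two `fake_model_*` functions are stand-ins so the example runs offline.

```python
import hashlib

def fingerprint(text):
    """Return a stable hash of a response string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def snapshot(call_model, prompts):
    """Run each prompt once and record a hash of the response."""
    return {p: fingerprint(call_model(p)) for p in prompts}

def changed_prompts(baseline, current):
    """List the prompts whose response hash differs from the baseline."""
    return sorted(p for p in baseline if current.get(p) != baseline[p])

# Hypothetical stand-ins for a real API call (assumption, not a real client).
def fake_model_v1(prompt):
    return prompt.upper()

def fake_model_v2(prompt):
    return prompt.upper() + "!" if "b" in prompt else prompt.upper()

prompts = ["alpha", "beta"]
baseline = snapshot(fake_model_v1, prompts)
later = snapshot(fake_model_v2, prompts)
print(changed_prompts(baseline, later))  # ['beta']
```

Exact-match hashing is deliberately strict: because of the small non-determinism mentioned upthread, a single mismatch is not evidence of a change. Sampling each prompt several times, or scoring answers on a task instead of hashing them, is probably a fairer comparison.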
Accuracy can decrease at large context sizes. OpenAI's compaction handles this better than anyone else, but it's still an issue.
If you are seeing this kind of thing, start a new chat and re-run the same query. You'll usually see an improvement.
Regardless, I tend to use new chats often.
PS - I appreciate you coming here and commenting!
(I work at OpenAI)