My work has included:
- Custom AI agents, plus backend- and infra-heavy systems
- Scraping and automation pipelines
- AI-based outreach and enrichment pipelines
- Shipping and operating production systems end to end
I’m looking for an early-stage startup (seed–Series A), ideally working closely with founders and owning systems from zero to one.
Portfolio: https://zerobitflip.com
Email: sm@zerobitflip.com
Over the past week or so, I’ve noticed:
- Shallower reasoning
- Ignoring parts of the context
- More confident-but-wrong answers
- A slight regression in structured refactors
This is mostly in real-world coding tasks (mid-size projects, not toy prompts).
It could just be my workload getting more complex, but it feels different.
Has anyone else noticed a shift in quality recently? Or is this just variance + perception bias on my end?
Would love to hear if others are seeing similar patterns (or not).
I tested:
- Gemini Pro 3
- Opus 4.6
- GLM-5
- Kimi 2.5
My rough criteria:
- Code correctness (first-pass compile success)
- Quality of architectural suggestions
- Refactor clarity
- Handling of existing code context
- Cost per useful output
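For anyone curious how I think about "cost per useful output": roughly, total spend divided by outputs that were actually usable, tracked alongside how many correction round-trips each attempt needed. A minimal Go sketch of that bookkeeping (all names and numbers here are hypothetical placeholders, not real benchmark data):

```go
package main

import "fmt"

// runStats tracks spend and correction loops for one model
// over a batch of coding tasks. Figures are illustrative only.
type runStats struct {
	model           string
	totalCostUSD    float64
	attempts        int // tasks given to the model
	correctionLoops int // extra round-trips before the output was usable
}

// costPerUsefulOutput: total spend spread over the attempts
// that ultimately produced something usable.
func (r runStats) costPerUsefulOutput() float64 {
	if r.attempts == 0 {
		return 0
	}
	return r.totalCostUSD / float64(r.attempts)
}

// loopsPerAttempt: average number of correction round-trips per task.
func (r runStats) loopsPerAttempt() float64 {
	if r.attempts == 0 {
		return 0
	}
	return float64(r.correctionLoops) / float64(r.attempts)
}

func main() {
	// Placeholder numbers, not measurements of any real model.
	runs := []runStats{
		{"model-a", 12.40, 30, 22},
		{"model-b", 4.10, 30, 9},
	}
	for _, r := range runs {
		fmt.Printf("%s: $%.3f per useful output, %.2f correction loops/attempt\n",
			r.model, r.costPerUsefulOutput(), r.loopsPerAttempt())
	}
}
```

The point of tracking loops separately from raw cost is that a cheap model that needs three fixes per task can end up pricier (in money and time) than an expensive one that lands first try.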
Surprisingly (at least to me), Kimi 2.5 gave the best cost/performance ratio for this particular workload. It wasn’t always the most “verbose” or polished, but it required the fewest correction loops per dollar spent.
Opus 4.6 felt strong on reasoning-heavy changes, but cost scaled quickly. Gemini Pro 3 was decent but inconsistent in multi-file refactors. GLM-5 was interesting but sometimes hallucinated internal project structures.
This is obviously anecdotal and project-specific.
Curious:
What models are people here using for real-world codebases?
Has anyone benchmarked cost vs correction loops?
Are people optimizing for raw quality or iteration speed per dollar?
Would love to hear other dev experiences, especially from people working in Go or other statically typed backends.