I exclusively use Sonnet, and the advisor is basically “hey Opus, chime in on my approach”. It's been working great as far as I can tell.
The one major difference I noticed is that the GPT models are more analytical (e.g. better at mathematical analysis and code review), while Claude models tend to write more straightforward code. Besides that, I don't really see any significant differences.
There are a few gotchas with swapping, like being careful with AGENTS.md/CLAUDE.md naming (Claude Code only recognizes CLAUDE.md, and I think Codex only works with AGENTS.md), and updating skill files to match the tool.
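One way to sidestep the naming gotcha is to keep a single instructions file and symlink the other name to it. This is only a sketch: `mirror_agent_doc` is a made-up helper, and it assumes both tools will happily read through a symlink, which I haven't verified.

```python
from pathlib import Path
from typing import Optional

# The two filenames mentioned above.
NAMES = ("AGENTS.md", "CLAUDE.md")


def mirror_agent_doc(repo: Path) -> Optional[str]:
    """If exactly one of the two files exists, symlink the other name to it.

    Returns the name of the link that was created, or None if nothing was done
    (neither file exists, or both already do).
    """
    present = [name for name in NAMES if (repo / name).exists()]
    if len(present) != 1:
        return None
    existing = present[0]
    missing = NAMES[1] if existing == NAMES[0] else NAMES[0]
    # Relative target, so the link survives moving or cloning the repo.
    (repo / missing).symlink_to(existing)
    return missing
```

Run it once from the repo root, e.g. `mirror_agent_doc(Path("."))`. Skill files would still need updating by hand.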
I was using gpt-5.5 high. Writing terraform code for GCP, debugging app launch and Dockerfile issues, that sort of thing. It was going in loops hallucinating features of GCP, looking things up in strange ways, running terraform apply after being explicitly told in the last interaction not to, and overall not solving problems. These were very straightforward tasks and it couldn't be trusted for five minutes. It's the difference in what I would trust an early senior engineer to do vs what I would trust an unreliable high school intern to do.
> I am fairly convinced this is the shape serious agent work keeps converging toward.
"this" being "plan with expensive model, implement with cheap model".
Anyone who follows HN would be hard-pressed to disagree; this architecture is re-invented twice monthly.
https://www.facebook.com/groups/vibecodinglife/posts/1946207... https://github.com/openai/codex/discussions/10628 https://build5nines.com/stop-burning-premium-requests-how-to...
> Not because it is aesthetically pleasing. Because every other shape eventually runs into the same boring failures: context rot, self-grading, goalpost drift, and merge chaos.
Actual failure isn't boring. But struggling through a generated software project that celebrates its own genius and doesn't have a single self-critical or genuinely reflective thing to say...at least watching paint dry I might get giddy off the fumes.
I'm not interested in critiquing the project itself, either; you'll just run that through a model, too.
wow linking a facebook groups post might actually be worse than x, is there an xcancel alternative for facebook?
FWIW, re: best practices, your install script potentially runs `rm -rf` on the user's global skills whose names shadow your project's.
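A safer pattern is to delete only directories the installer itself created earlier. Here is a rough sketch of that idea in Python; it is not the project's install script, and `install_skill` and the marker filename are hypothetical names invented for illustration.

```python
import shutil
from pathlib import Path

# Hypothetical ownership marker; any unlikely-to-collide filename would do.
MARKER = ".installed-by-this-project"


def install_skill(src: Path, skills_root: Path) -> bool:
    """Copy `src` into `skills_root/<src.name>`.

    Returns False (and touches nothing) when a same-named skill exists
    that this installer did not put there.
    """
    dest = skills_root / src.name
    if dest.exists():
        if not (dest / MARKER).is_file():
            return False  # the user's own skill: leave it alone
        shutil.rmtree(dest)  # ours from a previous install, safe to replace
    shutil.copytree(src, dest)
    (dest / MARKER).write_text("", encoding="utf-8")
    return True
```

The caller can then warn about skipped skills instead of silently clobbering them.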
> Each rule below is enforced mechanically by the skill, not left to vibes.
> R1. Repo docs are the memory; not in HANDOFF.md = didn't happen
SKILL.md:
> Not in docs/HANDOFF.md = didn't happen. Refuse to judge results that exist only in conversation or builder chat output.
"Mechanical enforcement" just means "prompting the LLM a bit extra" these days? It (still) amazes me how much effort and how many tokens we expend on what could and should be a two-line script...
This project is meant to be the latter, but there’s not a clean way to integrate that into Claude Code or Codex because they expect to do both.
Pi can do it, but then your users can’t use their Claude subscriptions, so you have to kludgily try to do the same thing via LLM prompts.
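For comparison, the plain-code version of the "not in docs/HANDOFF.md = didn't happen" rule discussed upthread could be roughly this small. It's a sketch under assumptions: `is_recorded` and the idea of searching for a marker string are invented here; only the `docs/HANDOFF.md` path comes from the quoted SKILL.md.

```python
from pathlib import Path


def is_recorded(marker: str, handoff: Path = Path("docs/HANDOFF.md")) -> bool:
    """Return True only if `marker` appears in the handoff file on disk."""
    # A missing file means nothing was recorded.
    if not handoff.is_file():
        return False
    return marker in handoff.read_text(encoding="utf-8")
```

The hard part is not the check itself but getting the agent harness to call it before the model gets to judge anything.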
LLM-written readmes love to use inscrutable jargon that means nothing outside of the context window that birthed it.
Even if one discovered something that millions (billions?) of dollars of AI compute and the best statisticians in the world were not able to find via exhaustive research, domain search and training... what do you think are the chances this won't be folded into the next update of every model, making the rigmarole moot?
Extraordinary claims require extraordinary evidence, and technology-shattering innovations in AI are not known to come from a markdown file.
I wanted to see what would happen if Claude delegated work to pi with a model like DeepSeek, so I forked your repo and tried it out. It's working really well so far. https://github.com/pcomans/architect-loop-pi
You can use any agent and/or model for each step and share context between them.