These days I use Codex, with GPT-5-Codex + $200 Pro subscription. I code all day every day and haven't yet seen a single rate limiting issue.
We've come a long way. Just 3-4 months ago, LLMs would make a huge mess when faced with a large codebase. They would have massive problems with files of 1k+ LoC (I know, files should never grow this big).
Until recently, I had to religiously provide the right context to the model to get good results. Codex does not need it anymore.
Heck, even UI seems to be a solved problem now with shadcn/ui + MCP.
My personal workflow when building bigger new features:
1. Describe problem with lots of details (often recording 20-60 mins of voice, transcribe)
2. Prompt the model to create a PRD
3. CHECK the PRD, improve and enrich it - this can take hours
4. Actually have the AI agent generate the code and lots of tests
5. Use AI code review tools like CodeRabbit, or recently the /review function of Codex, iterate a few times
6. Check and verify manually - oftentimes, there are still a few minor bugs in the implementation, but they can be fixed quickly - sometimes I just create a list of what I found and pass it back to the agent to fix
With this workflow, I am getting extraordinary results.
AMA.
In case it wasn't obvious, I have gone from rabidly bullish on AI to very bearish over the last 18 months. Because I haven't found one instance where AI is running the show and things aren't falling apart in not-always-obvious ways.
Or at least this was true until recently. GPT-5 is consistently delivering more coherent and better working UIs, provided I use it with shadcn or alternative component libraries.
So while you can generate a lot of code very fast, testing UX and UI is still manual work - at least for me.
I am pretty sure AI should not run the show. It is a sophisticated tool, but it is not a show runner - not yet.
Note: using it for my B2B e-commerce
When I started leaning heavily into LLMs I was using really detailed documentation. Not '20 minutes of voice recordings', but my specification documents would easily hit hundreds of lines even for simple features.
The result was decent, but extremely frustrating. Because it would often deliver 80% to 90% but the final 10% to 20% it could never get right.
So, what I naturally started doing was to care less about the details of the implementation and focus on the behavior I want. And this led me to simpler prompts, to the point that I don't feel the need to create a specification document anymore. I just use the plan mode in Claude Code and it is good enough for me.
One way that I started to think about this was that really specific documentation was almost as if I was 'over-fitting' my solution over other technically viable solutions the model could come up with. One example would be if I want to sort an array, I could either ask for "sort the array" or "merge sort the array". And by forcing a merge sort I may end up with a worse solution. Admittedly sort is a pretty simple and unlikely example, but this could happen with any topic. You may ask the model to use a hash-set when a better solution would be to use a bloom filter.
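To make the hash-set vs. bloom-filter trade-off concrete, here is a minimal sketch of a Bloom filter in Python. The class name, bit count, and hashing scheme are illustrative assumptions of mine, not taken from any particular library. The point is only that membership can be tracked in a fixed amount of memory if you accept occasional false positives, which is something a prompt demanding "use a hash-set" would rule out.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter: no false negatives, possible false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting a SHA-256 hash with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


seen = BloomFilter()
for url in ["a.example/1", "a.example/2"]:
    seen.add(url)

print("a.example/1" in seen)  # True: items that were added are always reported
print(len(seen.bits))         # 128 bytes, no matter how many items are added
```

A plain `set` would give exact answers but grow with every item, so which one is "better" depends on requirements the prompt may never have mentioned.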
Given all that, do you think investing so much time into your prompts provides a good ROI compared with the alternative of not really min-maxing every single prompt?
I tend to provide detailed PRDs, because even if the first couple of iterations of the coding agent are not perfect, it tends to be easier to get there (as opposed to having a vague prompt and move on from there).
What I do sometimes is an experimental run - especially when I am stuck. I express my high-level vision, and just have the LLM code it to see what happens. I do not do it often, but it has sometimes helped me get out of being mentally stuck with some part of the application.
Funnily, I am facing this problem right now, and your post might just have reminded me that sometimes a quick experiment can be better than 2 days of overthinking the problem...
I kind of assumed that Claude Code is doing most of the things described in this document under the hood (but I really have no idea).
And yes Step 3 is what no one does. And that's not limited to AI. I built a 20+ year career mostly around step 3 (after being biomed UNIX/Network tech support, sysadmin and programmer for 6 years).
Now I am building something new but in a very familiar domain. I agree my workflow would not work for your average "vibe coder".
I'm interested in hearing more about this - any resource you can point me at or do you mind elaborating a bit? TIA!
If you use Codex, convert the config to toml:
[mcp_servers.shadcn]
command = "npx"
args = ["shadcn@latest", "mcp"]
Now with the MCP server, you can instruct the coding agent to use shadcn. I often do "If you need to add new UI elements, make sure to use shadcn and the shadcn component registry to find the best fitting component"
The genius move is that the shadcn components are all based on Tailwind and get COPIED to your project. 95% of the time, the created UI views are just pixel-perfect, spacing is right, everything looks good enough. You can take it from here to personalize it more using the coding agent.
I just ask it to give me instructions for a coding agent and give it a small description of what I want to do. It looks at my code and details what I described as best as it can, and usually I have enough to let Junie (JetBrains AI) run on.
I can't personally justify $200 a month, I would need to see seriously strong results for that much. I use AI piecemeal because it has always been the best way to use it. I still want to understand the codebase. When things break it's mostly on you to figure out what broke.
Btw one thing that helps conserve context/tokens is to use GPT 5 Pro to read entire files (it will read more than Codex will, though Codex is good at digging) and generate plans for Codex to execute. Tools like RepoPrompt help with this (though it also looks pretty complicated)
The models tend to be very good about syntax, but this sort of linting will often catch dead code like unused variables or arguments.
You do need to rule-prompt that the agent may need to run pre-commit multiple times to verify the changes worked, or to re-add to the commit. Also, frustratingly, you also need to be explicit that pre-commit might fail and it should fix the errors (otherwise sometimes it'll run and say "I ran pre-commit!") For commits there are some other guardrails, like blanket denying git add <wildcard>.
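As a rough illustration of the "blanket deny `git add <wildcard>`" guardrail, here is a small Python sketch of the kind of check a command-filtering hook could run before letting the agent execute a shell command. The function name and the list of blocked arguments are my own assumptions, and it is not tied to any specific agent's hook API. It also only handles the plain `git add ...` form (not `git -C dir add ...`).

```python
import shlex

# Arguments that stage "everything" rather than named files (illustrative list).
BLANKET_ARGS = {".", "-A", "--all", "-u", "--update"}


def is_blanket_git_add(command: str) -> bool:
    """Return True if `command` is a `git add` that stages files wholesale."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable command: be conservative and block it
    if len(tokens) < 2 or tokens[0] != "git" or tokens[1] != "add":
        return False
    # Block known "stage everything" flags and any argument containing a glob star.
    return any(arg in BLANKET_ARGS or "*" in arg for arg in tokens[2:])


for cmd in ["git add .", "git add src/*.py", "git add src/app.py", "git status"]:
    print(cmd, "->", "DENY" if is_blanket_git_add(cmd) else "allow")
```

Forcing the agent to name each file it stages makes it much harder for stray build artifacts or unrelated edits to slip into a commit.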
Claude will sometimes complain via its internal monologue when it fails a ton of linter checks and is forced to write complete docstrings for everything. Sometimes you need to nudge it to not give up, and then it will act excited when the number of errors goes down.
Generally I have observed that using a statically typed language like TypeScript helps catch issues early on. Had much worse results with Ruby.
I’d love to see the codebase if you can share. My experience with LLM code generation (I’ve tried all of the popular models and tools, though I generally favor Claude Code with Opus and Sonnet) leads me to suspect that your ~200k LoC project could be solved in only about 10k LoC. Their solutions are unnecessarily complex (I’m guessing because they don’t “know” the problem in the way a human does) and that compounds over time. At this point, I would guess my most common instruction to these tools is to simplify the solution. Even when that’s part of the plan.
But one thing that MUST get better soon is having the AI agent verify its own code. There are a few solutions in place, e.g. using an MCP server to give access to the browser, but these tend to be brittle and slow. And for some reason, the AI agents do not like calling these tools too much, so you kinda have to force them every time.
PRD review can be done, but AI cannot fill the missing gaps the same way a human can. Usually, when I create a new PRD, it is because I have a certain vision in my head. For that reason, the process of reviewing the PRD can be optimized by maybe 20%. OR maybe I struggle to see how tools could make me faster at reading and commenting / editing the PRD.
Did you start with Cursor and move to Codex or only ever Codex?
Drop us an email at navan.chauhan[at]strongdm.com
> Sean proposes that in the AI future, the specs will become the real code. That in two years, you'll be opening python files in your IDE with about the same frequency that, today, you might open up a hex editor to read assembly.
> It was uncomfortable at first. I had to learn to let go of reading every line of PR code. I still read the tests pretty carefully, but the specs became our source of truth for what was being built and why.
This doesn't make sense as long as LLMs are non-deterministic. The prompt could be perfect, but there's no way to guarantee that the LLM will turn it into a reasonable implementation.
With compilers, I don't need to crack open a hex editor on every build to check the assembly. The compiler is deterministic and well-understood, not to mention well-tested. Even if there's a bug in it, the bug will be deterministic and debuggable. LLMs are neither.
If you spend the time to write out requirements in English in a way that cannot be misinterpreted, you end up with a programming language.
The real problem is just that they don't have brains, and can't think. They generate text that is optimized to look the most right, but not to be the most right. That means they're deceptive right off the bat. When a human is wrong, it usually looks wrong. When an LLM is wrong, it's generating the most correct looking thing it possibly could while still being wrong, with no consideration for actual correctness. It has no idea what "correctness" even means, or any ideas at all, because it's a computer doing matmul.
They are text summarization/regurgitation, pattern matching machines. They regurgitate summaries of things seen in their training data, and that training data was written by humans who can think. We just let ourselves get duped into believing the machine is where the thinking is coming from and not the (likely uncompensated) author(s) whose work was regurgitated for you.
The same entity interpreting the spec in exactly the same way will resolve the ambiguities the same way each time.
Human and current AI interpretation of specs is a non-deterministic process. But if we wanted to build a deterministic AI, we could.
That's about 20 words. Show me the programming language that can express that entire feature in 20 words. Even very English-like languages like Python or Kotlin might just about do it, if you're working in something else like C++ then no.
In practice, this spec will expand to changes to your dependency lists (and therefore you must know what library is used for CSV parsing in your language, the AI knows this stuff better than you), then there's some file handling, error handling if the file doesn't exist, maybe some UI like flags or other configuration, working out what the column names are, writing the loop, saving it back out, writing unit tests. Any reasonable programmer will produce a very similar PR given this spec but the diff will be much larger than the spec.
I think it is worse than that. The prompt, written in natural language, is by its very nature vague and incomplete, which is great if you are aiming for creative artistry. I am also really happy that we are able to search for dates using phrases like "get me something close to a weekend, but not on Tuesdays" on a booking website instead of picking dates from a dropdown box.
However, if natural language was the right tool for software requirements, software engineering would have been a solved problem long ago. We got rightfully excited with LLMs, but now we are trying to solve every problem with it. IMO, for requirements specification, the situation is similar to earlier efforts using formal systems and full verification, but at the exact opposite end. Similar to formal software verification, I expect this phase to end up as a partially failed experiment that will teach us new ways to think about software development. It will create real value in some domains and it will be totally abandoned in others. Interesting times...
I think this is a logical error. Non-determinism is orthogonal to probability of being correct. LLMs can remain non-deterministic while being made more and more reliable. I think “guarantee” is not a meaningful standard because a) I don’t think there can be such a thing as a perfect prompt, and b) humans do not meet that standard today.
The tooling is better than just cracking open the assembly but in some areas people do effectively do this, usually to check for vectorization of hot loops, since various things can mean a compiler fails to do it. I used to use Intel VTune to do this in the HPC scientific world.
This seems like a typical engineer forgets people aren't machines line of thinking.
I think we will find ways around this. Because humans are also non-deterministic. So what do we do? We review our code, test it, etc. LLMs could do a lot more of that. Eg, they could maintain and run extensive testing, among other ways to validate that behavior matches the spec.
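A toy way to see why non-determinism and reliability are separate axes: put a deterministic check behind a non-deterministic generator and retry on failure. The sketch below is a simulation under loud assumptions (the generator is right with a fixed probability `p_correct` on each independent attempt, and the test suite is a perfect judge); all names are hypothetical. Under those assumptions the chance of ending up with verified code after `k` attempts is `1 - (1 - p_correct) ** k`.

```python
import random


def correct_sort(xs):
    return sorted(xs)


def broken_sort(xs):
    return list(xs)  # "forgets" to sort


def generate(rng: random.Random, p_correct: float):
    """Stand-in for a non-deterministic code generator."""
    return correct_sort if rng.random() < p_correct else broken_sort


def passes_tests(candidate) -> bool:
    """Deterministic check playing the role of the test suite."""
    return candidate([3, 1, 2]) == [1, 2, 3] and candidate([]) == []


def generate_until_verified(rng: random.Random, p_correct: float, max_attempts: int):
    """Return the first candidate that passes the tests, or None if all attempts fail."""
    for _ in range(max_attempts):
        candidate = generate(rng, p_correct)
        if passes_tests(candidate):
            return candidate
    return None


def analytic_success(p_correct: float, max_attempts: int) -> float:
    return 1.0 - (1.0 - p_correct) ** max_attempts


print(round(analytic_success(0.7, 3), 3))  # 0.973
```

Real test suites are not perfect judges and real attempts are not independent, so treat the number as an upper bound on what a retry loop buys you, not a prediction.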
Even when we give a spec to a human and tell them to implement it, we scrutinize and test the code they produce. We don't just hand over a spec and blindly accept the result. And that's despite the fact that humans have a lot more common sense, and the ability to ask questions when a requirement is ambiguous.
There's also no way to guarantee that you're not going to get hit by a meteor strike tomorrow. It doesn't have to be provably deterministic at a computer science PhD level for people without PhDs to say eh, it's fine. Okay, it's not deterministic. What does that mean in practice? Given the same spec.md file, at the layer of abstraction where we're no longer writing code by hand, who cares, because of a lack of determinism, if the variable for the filename object is called filename or fname or file or name as long as the code is doing something reasonable? If it works, if it passes tests, if we presume that the stochastic parrot is going to parrot out its training data sufficiently close each time, why is it important?
As far as compilers being deterministic, there's a fascinating detail we ran into with Ksplice. They're not. They're only deterministic enough that we trust them to be fine. There was this bug we kept tripping, back in roughly 2006, where GCC would swap registers used for a variable, resulting in the Ksplice patch being larger than it had to be, to include handling the register swap as well. The bug has since been fixed, exposing the details of why it was choosing different registers, but unfortunately I don't remember enough details about it. So don't believe me if you don't want to, but the point is, we trust the C compiler, given a function that takes in variables a, b, c, d, to map a, b, c, and d to r0, r1, r2, or r3. We don't actually care what order that mapping goes in, so long as it works.
So the leap, that some have made, and others have not, is that LLMs aren't going to randomly flip out and delete all your data. Which is funny, because that's actually happened on Replit. Despite that, despite the fact that LLMs still hallucinate total bullshit and go off the rails, some people trust LLMs enough to convert a spec to working code. Personally, I think we're not there yet and won't be while GPU time isn't free. (Arguably it is already because anybody can just start typing into chat.com, but that's propped up by VC funding. That isn't infinite, so we'll have to see where we're at in a couple of years.)
That addresses the determinism part. The other part that was raised is debuggable. Again, I don't think we're at a place where we can get rid of generated code any time soon, and as long as code is being generated, then we can debug it using traditional techniques. As far as debugging LLMs themselves, it's not zero. They're not mainstream yet, but it's an active area of research. We can abliterate models and fine tune them (or whatever) to answer "how do you make cocaine", counter to their training. So they're not total black boxes.
Thus, even if traditional software development dies off, the new field is LLM creation and editing. As with new technologies, porn picks it up first. Llama and other downloadable models (they're not open source https://www.downloadableisnotopensource.org/ ) have been fine tuned or whatever to generate adult content, despite being trained not to. So that's new jobs being created in a new field.
The interesting thing is, for something to be deterministic that thing doesn't need to be defined first. I'd guess we can get an understanding of day/night-cycles without understanding anything about the solar system. In that same vein your Ksplice GCC bug doesn't sound nondeterministic. What did you choose to do in the case of the observed Ksplice behavior? Did you debug and help with the patch, or did you just pick another compiler? It seems that somebody did the investigation to bring GCC closer to the "same result 100% of the time", and I truly have to thank that person.
But here we are and LLMs and the "90% of the time"-approach are praised as the next abstraction in programming, and I just don't get it. The feedback loop is hailed as the new runtime, whereas it should be build time only. LLMs take advantage of the solid foundations we built and provide an NLP-interface on top - to produce code, and do that fast. That's not abstraction in the sense of programming, like Assembly/C++/Blender, but rather abstraction in the sense of distance, like PC/Network/Cloud. We use these "abstractions in distance" to widen reach, design impact and shift responsibilities.
It would be an absolute clown show if AWS could take the same infrastructure code and perform the deployment of the services somehow differently each time... so non-deterministically. There's already all kinds of external variables other than the infra code which can affect the deployment, such as existing deployed services which sometimes need to be (manually) destroyed for the new deployment to succeed.
It starts with /feature, and takes a description. Then it analyzes the codebase and asks questions.
Once I’ve answered questions, it writes a plan in markdown. There will be 8-10 markdown files with descriptions of what it wants to do and full code samples.
Then it does a “code critic” step where it looks for errors. Importantly, this code critic is wrong about 60% of the time. I review its critique and erase a bunch of dumb issues it’s invented.
By that point, I have a concise folder of changes along with my original description, and it’s been checked over. Then all I do is say “go” to Claude Code and it’s off to the races doing each specific task.
This helps it keep from going off the rails, and I’m usually confident that the changes it made were the changes I wanted.
I use this workflow a few times per day for all the bigger tasks and then use regular Claude Code when I can be pretty specific about what I want done. It’s proven to be a pretty efficient workflow.
[0] GitHub.com/iambateman/speedrun
At first I thought that was pretty compelling, since it includes more edge cases and examples that you otherwise miss.
In the end all that planning still results in a lot of pretty mediocre code that I ended up throwing away most of the time.
Maybe there is a learning curve and I need to tweak the requirements more tho.
For me personally, the most successful approach has been a fast iteration loop with small and focused problems. Being able to generate prototypes based on your actual code and exploring different solutions has been very productive. Interestingly, I kind of have a similar workflow where I use Copilot in ask mode for exploration, before switching to agent mode for implementation, sounds similar to Kiro, but somehow it’s more successful.
Anyways, trying to generate lots of code at once has almost always been a disaster and even the most detailed prompt doesn’t really help much. I’d love to see what the code and projects of people claiming to run more than 5 LLMs concurrently look like, because with the tools I’m using, that would be a mess pretty fast.
You spend a few minutes generating a spec, then agents go off and do their coding, often lasting 10-30 minutes, including running and fixing lints, adding and running tests, ...
Then you come back and review.
But you had 10 of these running at the same time!
You become a manager of AI agents.
For many, this will be a shitty way to spend their time.... But it is very likely the future of this profession.
The AI coding tools are going to be looking at other files in the project to help with context. Ambiguity is the death of AI effectiveness. You have to keep things clear and so that may require addressing smaller sections at a time. Unless you can really configure the tools in ways to isolate things.
This is why I like tools that have a lot of control and are transparent. If you ask a tool what the full system and user prompt is and it doesn't tell you? Run away from that tool as fast as you can.
You need to have introspections here. You have to be able to see what causes a behavior you don't want and be able to correct it. Any tool that takes that away from you is one that won't work.
I start my sessions with something like `!cat ./docs/*` and I can start asking questions. Make sure you regularly ask it to point out any inconsistencies or ambiguity in the docs.
I see it has a pseudo code step, was it helpful at all to try to define a workflow, process or procedure beforehand?
I've also heard that keeping each file down to 100 lines is critical before connecting them. Noticed the same but haven't tried it in depth.
I've been experimenting with Github agents recently, they use GPT-5 to write loads of code, and even make sure it compiles and "runs" before ending the task.
Then you go and run it and it's just garbage, yeah it's technically building and running "something", but often it's not anything like what you asked for, and it's splurged out so much code you can't even fix it.
Then I go and write it myself like the old days.
It's context all the way down. That just means you need to find and give it the context to enable it to figure out how to do the thing. Docs, manuals, whatever. Same stuff that you would use to enable a human that doesn't know how to do it to figure out how.
I imagine this has to do with concurrency requiring conceptual and logical reasoning, which LLMs are known to struggle with about as badly as they do with math and arithmetic. Now, it's possible that the right language to work with the LLM in these domains is not program code, but a spec language like TLA+. However, at that point, I'd probably just spend less effort to write the potentially tricky concurrent code myself.
So those people should either stop using it or learn to use it productively. We're not doomed to live in a world where programmers start using AI, lose productivity because of it and then stay in that less productive state.
They can be forced to write in their performance evaluation how much (not if, because they would be fired) "AI" has improved their productivity.
The article at the link is about how to use AI effectively in complex codebases. It emphasizes that the techniques described are "not magic", and makes very reasonable claims.
This is exactly right. Our role is shifting from writing implementation details to defining and verifying behavior.
I recently needed to add recursive uploads to a complex S3-to-SFTP Python operator that had a dozen path manipulation flags. My process was:
* Extract the existing behavior into a clear spec (i.e., get the unit tests passing).
* Expand that spec to cover the new recursive functionality.
* Hand the problem and the tests to a coding agent.
I quickly realized I didn't need to understand the old code at all. My entire focus was on whether the new code was faithful to the spec. This is the future: our value will be in demonstrating correctness through verification, while the code itself becomes an implementation detail handled by an agent.
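For anyone who hasn't done this, "extract the existing behavior into a spec" can be as simple as a table of pinned input/output cases that any replacement implementation has to satisfy. Below is a minimal Python sketch. The `remote_path` function, its flags, and the example paths are hypothetical stand-ins I made up for illustration, not the actual operator's code.

```python
import posixpath


def remote_path(s3_key: str, sftp_root: str, strip_prefix: str = "", flatten: bool = False) -> str:
    """Hypothetical stand-in for the operator's path logic (not the real code)."""
    if strip_prefix and s3_key.startswith(strip_prefix):
        rel = s3_key[len(strip_prefix):]
    else:
        rel = s3_key
    rel = rel.lstrip("/")
    if flatten:
        rel = posixpath.basename(rel)  # drop all directory components
    return posixpath.join(sftp_root, rel)


# Characterization cases: (args, kwargs, expected) pinned from current behavior.
EXISTING_BEHAVIOR = [
    (("data/2024/a.csv", "/upload"), {}, "/upload/data/2024/a.csv"),
    (("data/2024/a.csv", "/upload"), {"strip_prefix": "data/"}, "/upload/2024/a.csv"),
    (("data/2024/a.csv", "/upload"), {"flatten": True}, "/upload/a.csv"),
]


def check_spec(fn) -> None:
    """Fail loudly if `fn` deviates from any pinned case."""
    for args, kwargs, expected in EXISTING_BEHAVIOR:
        actual = fn(*args, **kwargs)
        assert actual == expected, (args, kwargs, actual, expected)


check_spec(remote_path)  # the old code passes by construction
```

New cases for the recursive behavior get appended to the same table, and the agent's rewrite is accepted only if `check_spec` passes against it.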
I could argue that our main job was always that - defining and verifying behavior. As in, it was a large part of the job. Time spent on writing implementation details has always been on a downward trend via higher level languages, compilers and other abstractions.
This may be true, but see Hyrum's Law, which says that the observed behavior of a heavily-used system becomes its public interface and specification, with all its quirks and implementation errors. It may be important to keep testing that the clients using the code are also faithful to the spec, and detect and handle discrepancies.
But.. I hate this. I hate the idea of learning to manage the machine's context to do work. This reads like a lecture in an MBA class about managing certain types of engineers, not like an engineering doc.
Never have I wanted to manage people. And never have I even considered my job would be to find the optimum path to the machine writing my code.
Maybe firmware is special (I write firmware)... I doubt it. We have a cursor subscription and are expected to use it on production codebases. Business leaders are pushing it HARD. To be a leader in my job, I don't need to know algorithms, design patterns, C, make, how to debug, how to work with memory mapped io, what wear leveling is, etc.. I need to know 'compaction' and 'context engineering'
I feel like a ship caulker inspecting a riveted hull
For them is the world, for us it means nothing.
It also helps to start small, get something useful done and iterate by adding more features over time (or keeping it small).
It's not at senior engineer level until it asks relevant questions about lacking context instead of blindly trying to solve problems IMO.
the key part was really just explicitly thinking about different levels of abstraction at different levels of vibecoding. I was doing it before, but not explicitly in discrete steps and that was where i got into messes. The prior approach made check pointing / reverting very difficult.
When i think of everything in phases, i do similar stuff w/ my git commits at "phase" levels, which makes design decision easier to make.
I also do spend ~4-5 hours cleaning up the code at the very very end once everything works. But its still way faster than writing hard features myself.
Like yes vibecoding in the lovable-esque "give me an app that does XYZ" manner is obviously ridiculous and wrong, and will result in slop. Building any serious app based on "vibes" is stupid.
But if you're doing this right, you are not "coding" in any traditional sense of the word, and you are *definitely* not relying on vibes
Maybe we need a new word
If you're properly reviewing the code, you're programming.
The challenge is finding a good term for code that's responsibly written with AI assistance. I've been calling it "AI-assisted programming" but that's WAY too long.
i've also heard "aura coding", "spec-driven development" and a bunch of others I don't love.
but we def need a new word cause vibe coding aint it
I've said this repeatedly, I mostly use it for boilerplate code, or when I'm having a brain fart of sorts. I still love to solve things for myself, but AI can take me from "I know I want x, y, z" to "oh look I got to x, y, z in under 30 minutes", which could have taken hours. For side projects this is fine.
I think if you do it piecemeal it should almost always be fine. When you try to tell it to do too much, you and the model both don't consider edge cases (ask it for those too!) and are more prone to a rude awakening eventually.
If AI is so groundbreaking, why do we have to have guides and jump through 3000 hoops just so we can make it work?
This is the new world we live in. Anyone who actually likes coding should seriously look for other venues because this industry is for other types of people now.
I use AI in my job. I went from tolerable (not doing anything fancy) to unbearable.
I'm actually looking to become a council employee with a boring job and code my own stuff, because if this is what I have to do moving forward, I'd rather go back to non-coding jobs.
Also quite funny that one of the latest commits is "ignore some tests" :D
> While the cancelation PR required a little more love to take things over the line, we got incredible progress in just a day.
Sorry, had they effectively estimated that an engineer should produce 4-6KLOC per day (that's before genAI)?
> We've gotten claude code to handle 300k LOC Rust codebases, ship a week's worth of work in a day, and maintain code quality that passes expert review.
This seems more like delegation just like if one delegated a coding task to another engineer and reviewed it.
> That in two years, you'll be opening python files in your IDE with about the same frequency that, today, you might open up a hex editor to read assembly (which, for most of us, is never).
This seems more like abstraction just like if one considers Python a sort of higher level layer above C and C a higher level layer above Assembly, except now the language is English.
Can it really be both?
You'll also note that while I talk about "spec driven development", most of the tactical stuff we've proven out is downstream of having a good spec.
But in the end a good spec is probably "the right abstraction" and most of these techniques fall out as implementation details. But to paraphrase Sandi Metz - better to stay in the details than to accidentally build against the wrong abstraction (https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction)
I don't think delegation is right - when me and vaibhav shipped a week's worth of work in a day, we were DEEPLY engaged with the work, we didn't step away from the desk, we were constantly resteering and probably sent 50+ user messages that day, in addition to some point-edits to markdown files along the way.
To write and review a good spec, you also need to understand your codebase. How are you going to do that without reading the code? We are not getting abstracted away from our codebases.
For it to be an abstraction, we would need our coding agents to not only write all of our code, they would also need to explain it all to us. I am very skeptical that this is how developers will work in the near future. Software development would become increasingly unreliable as we won't even understand what our codebases actually do. We would just interact with a squishy lossy English layer.
It'd be nice if the article included the cost for each project. A 35k LOC change in a 350k codebase with a bunch of back and forth and context rewriting over 7 hours, would that be a regular subscription, max subscription, or would that not even cover it?
> oh, and yeah, our team of three is averaging about $12k on opus per month
I'll have to admit, I was intrigued with the workflow at first. But emm, okay, yeah, I'll keep handwriting my open source contributions for a while.
but yes we switched off per-token this week because we ran out of anthropic credits, we're on max plan now
It's super effective with the right guardrails and docs. It also works better on languages like Go instead of Python.
1. Go's spec and standard practices are more stable, in my experience. This means the training data is tighter and more likely to work.
2. Go's types give the llm more information on how to use something, versus the python model.
3. Python has been an entry-level accessible language for a long time. This means a lot of the code in the training set is by amateurs. Go, ime, is never someone's first language. So you effectively only get code from someone who already has other programming experience.
4. Go doesn't do much 'weird' stuff. It's not hard to wrap your head around.
> I had to learn to let go of reading every line of PR code
Ah. And I’m over here struggling to get my teammates to read lines that aren’t in the PR.
Ah well, if this stuff works out it’ll be commoditized like the author said and I’ll catch up later. Hard to evaluate the article given the author’s financial interest in this succeeding and my lack of domain expertise.
Would you trust a colleague who is overconfident, lies all the time, and then pushes a huge PR? I wouldn't.
All I know is that it works. On the greenfield project the code is simple enough to mostly just run `/create_plan` and skip research altogether. You still get the benefit of the agents and everything.
The key is really truly reviewing the documents that the AI spits out. Ask yourself if it covered the edge cases that you're worried about or if it truly picked the right tech for the job. For instance, did it break out of your sqlite pattern and suggest using postgres or something like that. These are very simple checks that you can spot in an instant. Usually chatting with the agent after the plan is created is enough to REPL-edit the plan directly with claude code while it's got it all in context.
At my day job I've got to use github copilot, so I had to tweak the prompts a bit, but the intentional compaction between steps still happens, just not quite as efficiently because copilot doesn't support sub-agents in the same way as claude code. However, I am still able to keep productivity up.
-------
A personal aside.
Immediately before AI assisted coding really took off, I started to feel really depressed that my job was turning into a really boring thing for me. Everything just felt like such a chore. The death by a million paper cuts is real in a large codebase with the interplay and idiosyncrasies of multiple repos, teams, personalities, etc. The main benefit of AI assisted coding for me personally seems to be smoothing over those paper cuts.
I derive pleasure from building things that work. Every little thing that held up that ultimate goal was sucking the pleasure out of the activity that I spent most of my day trying to do. I am much happier now having impressed myself with what I can build if I stick to it.
I made specs for every part of the code in a separate folder and that had in it logs on every feature I worked on. It was an API server in python with many services like accounts, notifications, subscriptions etc.
It got to the point where managing context became extremely challenging. Claude would not be able to determine business logic properly and it can get complex. e.g. if you want to do a simple RBAC system with an account and profile with a junction table for roles joining an account with profile. In the end what kind of worked was I had to give it UML diagrams of the relationship with examples to make it understand and behave better.
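For concreteness, a minimal sketch of the kind of relationship described above, using an in-memory SQLite database. The table names, column names, and role values are hypothetical, not taken from the actual project; it only shows an account, a profile, and a junction table that assigns a role to a profile within an account.

```python
import sqlite3

SCHEMA = """
CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE profile (id INTEGER PRIMARY KEY, display_name TEXT NOT NULL);
-- Junction table: which role a profile holds within a given account.
CREATE TABLE account_profile_role (
    account_id INTEGER NOT NULL REFERENCES account(id),
    profile_id INTEGER NOT NULL REFERENCES profile(id),
    role TEXT NOT NULL CHECK (role IN ('owner', 'admin', 'member')),
    PRIMARY KEY (account_id, profile_id, role)
);
"""


def has_role(conn: sqlite3.Connection, account_id: int, profile_id: int, role: str) -> bool:
    """Return True if the profile holds the given role in the account."""
    row = conn.execute(
        "SELECT 1 FROM account_profile_role "
        "WHERE account_id = ? AND profile_id = ? AND role = ?",
        (account_id, profile_id, role),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO account (id, name) VALUES (1, 'Acme')")
conn.execute("INSERT INTO profile (id, display_name) VALUES (10, 'Dana')")
conn.execute("INSERT INTO account_profile_role VALUES (1, 10, 'admin')")

print(has_role(conn, 1, 10, "admin"))
print(has_role(conn, 1, 10, "owner"))
```

A concrete schema plus a couple of example rows is roughly the same information the UML diagrams carried, just in a form the model can also execute.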
"what happens if we end up owning this codebase but don't know how it works / don't know how to steer a model on how to make progress"
There are two common problems w/ primarily-AI-written code
1. Unfamiliar codebase -> research lets you get up to speed quickly on flows and functionality
2. Giant PR Reviews Suck -> plans give you ordered context on what's changing and why
Mitchell has praised ampcode for the thread sharing, another good solution to #2 - https://x.com/mitchellh/status/1963277478795026484
> Research lets you get up to speed quickly on flows and functionality
This is the _je ne sais quoi_ that people who are comfortable with AI have made peace with and those who are not have not. If you don't know what the code base does or how to make progress you are effectively trusting the system that built the thing you don't understand to understand the thing and teach you. And then from that understanding you're going to direct the teacher to make changes to the system it taught you to understand. Which suggests a certain _je ne sais quoi_ about human intelligence that isn't present in the system, but which would be necessary to create an understanding of the thing under consideration. Which leads to your understanding being questionable because it was sourced from something that _lacks_ that _je ne sais quoi_. But the order time of failure here is "lifetimes". Of features, of codebases, of persons.
1. Break down the feature or bug report into a technical implementation spec. Add in COT for the splits.
2. Verify the implementation spec. Feed reviews back to your original agent that has created the spec. Edit, merge, integrate feedback.
3. Transform implementation spec into an implementation plan - logically split into modules, look at dependency chain.
4. Build, test and integrate continuously with coding agents.
5. Squash the commits if needed into a single one for the whole feature.
Generally has worked well as a process when working on a complex feature. You can add in HITL at each stage if you need more verification.
For larger codebases always maintain an ARCHITECTURE.md and for larger modules a DESIGN.md
i'll then prompt it for more based on if my interpretation of the file is missing anything or has confusing instructions or details.
usually in-between larger prompts I'll do a full /reset rather than /compact, have it reference the doc, and then iterate some more.
once it's time to try implementing I do one more /reset, then go phase by phase of the plan in increments /reset-ing between each and having it update the doc with its progress.
generally works well enough but not sure i'd trust it at work.
You want control over and visibility into what’s being compacted, and /compact doesn’t do great on either
Please kill me now
They saw the first screen assembled by Replit and figured everything they could see would work with some "small tweaks" which is where I was allegedly to come into the picture.
They continued to lecture me about how the app would need Web Workers for maximum client side performance (explanations full of em-dashes so I knew they were pasting in AI slop at me) and it must all be browser based with no servers because "my prototype doesn't need a server"
Meanwhile their "prototype" had a broken Node.js backend running alongside the frontend listening on a TCP port.
When I asked about this backend they knew nothing about it but assured me their prototype was all browser based with no "servers".
Needless to say I'm never taking on any work from that client again, one of the small joys of being a contractor.
- Mr. Snarky
An abstraction for this that seems promising to me for its completeness and size is a User Story paired with a research plan(?).
This works well for many kinds of applications and emphasizes shipping concrete business value for every unit of work.
I wrote about some of it here: https://blog.nilenso.com/blog/2025/09/15/ai-unit-of-work/
I also think a lot of coding benchmarks and perhaps even RL environments are not accounting for the messy back and forth of real world software development, which is why there's always a gap between the promise and reality.
I hope to back up this hypothesis with actual data and experiments!
Edit: typo.
not exactly valuable as guidance since programming languages are very easy to verify, but the https://ghuntley.com/ralph post is an example of what's possible on the very extreme end of the spectrum
An hour for 14 lines of code. Not sure how this shows any productivity gain from AI. It's clear that it's not the code writing that is the bottleneck in a task like this.
Looking at the "30K lines" features, the majority of the 30K lines are either auto-generated code (not by AI), or documentation. One of them is also a PoC and not merged...
2. Write down the principles and assumptions behind the design and keep them current
In other words, the same thing successful human teams on complex projects do! Have we become so addicted to “attention-deficit agile” that this seems like a new technique?
Imagine, detailed specs, design documents, and RFC reviews are becoming the new hotness. Who would have thought??
All because they have been forced to master technical communication at scale.
but the reason I wrote this (and maybe a side effect of the SF bubble) is MOST of the people I have talked to, from 3-person startups to 1000+ employee public companies, are in a state where this feels novel and valuable, not a foregone conclusion or something happening automatically
The hierarchy of leverage concept is great! Love it. (Can't say I like the 1 bad line of CLAUDE.md is 100K lines of bad code; I've had some bad lines in my CLAUDE.md from time to time - I almost always let Claude write its own CLAUDE.md.)
<system-reminder> IMPORTANT: this context may or may not be relevant to your tasks. You should not respond to this context or otherwise consider it in your response unless it is highly relevant to your task. Most of the time, it is not relevant. </system-reminder>
lots of others have written about this so i won't go deep, but it's a clear product decision. if you don't know what's in your context window, you can't respond/architect your balance between claude.md and /commands well.
Not surprisingly, building really fast is not the silver bullet you'd think it is. It's all about what to build and how to distribute it. Otherwise bigcos/billionaires would have armies of engineers growing their net worth to epic scales.
This is not my experience at all.
I also don’t get the line obsession.
Good code has fewer lines, not more
That’s usd 150k per year. Probably low for SF, but may be a lot in other areas.
But let's assume you're much better than average at understanding code by reviewing it -- you have another frustrating experience to get through with AI. Pre-AI, let's say 4 days of the week are spent writing new code, while 1 day is spent fixing unforeseen issues (perhaps incorrect assumptions) that came up after production integration or showing things to real users. Post-AI, someone might be able to write those 4 days worth of code in 1 day, but making decisions about unexpected issues after integration doesn't get compressed -- that still takes 1 day.
So post-AI, your time switches almost entirely from the fun, creative act of writing code to the more frustrating experience of figuring out what's wrong with a lot of code that is almost correct. But you're way ahead -- you've tested your assumptions much faster, but unfortunately that means nearly all of your time will now be spent in a state of feeling dumb and trying to figure out why your assumptions are wrong. If your assumptions were right, you'd just move forward without noticing.
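To put rough numbers on this, a minimal sketch using the 4-days-writing, 1-day-fixing split from the paragraph above. The function names and the `speedup` parameter are made up for illustration, and it assumes the fixing time stays constant while only the writing time shrinks.

```python
def fixing_share(writing_days: float, fixing_days: float, speedup: float = 1.0) -> float:
    """Fraction of a work cycle spent fixing, if only the writing part gets faster."""
    compressed_writing = writing_days / speedup
    return fixing_days / (compressed_writing + fixing_days)


def cycle_speedup(writing_days: float, fixing_days: float, speedup: float) -> float:
    """End-to-end speedup of the whole cycle when only writing is compressed."""
    before = writing_days + fixing_days
    after = writing_days / speedup + fixing_days
    return before / after


pre_ai = fixing_share(4, 1)               # the pre-AI week from the comment
post_ai = fixing_share(4, 1, speedup=4)   # 4 days of code written in 1 day
overall = cycle_speedup(4, 1, speedup=4)
print(f"fixing share: {pre_ai:.0%} -> {post_ai:.0%}, overall speedup: {overall:.1f}x")
```

With exactly these numbers the fixing share goes from 20% to 50% and the whole cycle gets 2.5x faster rather than 4x. The share only approaches "almost entirely" if writing compresses much further than 4x.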
But the write up troubles me. If I'm reading correctly, he did 1 bugfix (approved and merged) and then 2 larger PRs (1 merged, 1 still in draft over a month later). That's an insanely small sample size to draw conclusions from.
How can you talk like you've just proven the workflow works "for brownfield codebases"? You proved it worked for 2/3 tasks in 2 codebases, one failure (we can't say it works until the code is shipped IMO).
I've been working on something I call Micromanaged Driven Development https://mmdd.dev and wrote about it at https://builder.aws.com/content/2y6nQgj1FVuaJIn9rFLThIslwaJ/...
I'm in a similar search and I'm stoked to see that many people riding the wave of coding with AI are moving in this direction.
Lots of learning ahead.
So this approach of spending 3 hours on planning without verifying anything in code is too hard for me.
I agree the context compaction sounds good. But I'm not sure if an md file is good enough to carry the info from research to plan and implementation. Personally I often find the context is too complex or the problem is too big. I just open a new session to resolve a smaller, more specific problem in source code, then test and review the source code.
Only if AI code generation is correct 99.9% of the time and almost never hallucinates. We trust compilers and don't read assembly code because we know it's deterministic and the output can never be wrong (barring bugs and certain optimization issues, which are rare/one-time fixes). As long as generated code is not doing what the original "code" (in this case, specs) says, humans need to go back to fix things themselves.
With this I find that most of the shenanigans of manual context window managing with putting things in markdown files is kind of unnecessary.
You still need to make it plan things, as well as guide the research it does to make sure it gets enough useful info into the context window, but in general it now seems to me like it does a really good job with preserving the information. This is with Sonnet 4
YMMV
Research helps with complex implementations and for brownfield. But it isn't always needed - simple bugfixes can be one-shot!
So all AI workflows could be expressed with some number "N" of "document expansion phases":
N(0): vibe coding.
N(1): "write a spec then implement it while I watch".
N(2): "research then specify". At this point you start to get serious steerability.
What's N(3) and beyond? Strategy docs, industry research, monetization planning? Can AI do these too, all of it ending up in git? Interesting to muse on.
It's kind of like if you could chat with Repomix or Gitingest so they only pull the most relevant parts of your codebase into a prompt for planning, etc
I'm a paying RepoPrompt user but not associated in any other way.
I've used it in conjunction with Codex, Claude Code, and any other code gen tool I have tried so far. It saves a lot of tokens and time (and headaches)
Out of many tries, I only succeeded twice. Could you give some hints on how to use it?
Is this a rule of thumb? Will the cheaper (fewer params) models dumb down at 25%?
Great links, BAML is a crazy rabbithole and just found myself nodding along to frequent /compact. These tips are hard-earned and very generously given. Anyone here can take it or leave it. I have theft on my mind, personally. (ʃƪ¬‿¬)
People like languages that are expressive and concise. That means they do things like omit types, use type inference, macros, syntactic sugar, allow for ambiguities and all the other stuff that gives us shorter, easier to type code that requires more effort to figure out. A good intuition here might be that the harder the compiler/interpreter has to work to convert it into running/executable code, the harder an LLM will have to work to figure out what that code does.
LLMs don't mind verbosity and spelling things out. Things that are long winded and boring to us are helpful for an LLM. The optimal language for an LLM is going to be different than one that is optimal for a human. And we're not good at actually producing detailed specifications. Programming actually is the job of coming up with detailed specifications. Easy to forget when you are doing that but that's literally what programming is. You write some kind of specification that is then "compiled" into something that actually works as specified.
The solution to agentic coding isn't writing specifications for our specifications. That just moves the problem.
We've had a few decades of practice where we just happen to stuff code into files and use very primitive tools to manipulate those files. Agentic coding uses a few party tricks involving command line tools to manipulate those files and reading them one by one into the precious context window. We're probably shoveling too much data around. But since that's the way we store code, there are no better tools to do that.
From having used things like Codex, 99% of what it does is interrogating what's there via tediously slow prodding and poking around the code base using simple command line commands and build tool invocations. It's like watching paint dry. I usually just go off doing something else while it boils the oceans and does god knows what before finally doing the (usually) relatively straightforward thing that I asked it to do. It's easy to see that this doesn't scale that well.
The whole point of a large code base is that it probably won't all fit in the context window. We can try to brute force the problem; or we can try to be more selective. The name of the game here is being able to quickly select just the right stuff to put in there and discard all the rest.
We can either do that manually (tedious and a lot of work, sort of as the article proposes), or make it easier for the LLM to use tools that do that. Possibly a bunch of poorly structured files in some nested directory hierarchy isn't the optimal thing here. Most non AI based automated refactorings require something that more closely resembles the internal data structures of what a compiler would use (e.g. symbol tables, definitions, etc.).
A lot of what an agentic coding system has to do is reconstruct something similar enough to that just so it can build a context in which it can do constructive things. The less ambiguous and more structured that is, the easier the job. The easier we make it to do that, the more it can focus on solving interesting problems rather than getting ready to do that.
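As a small illustration of the "symbol table instead of raw files" idea, here is a minimal sketch using Python's standard `ast` module. The sample source and the helper names (`index_symbols`, `extract_definition`) are invented for this example; a real tool would index a whole repository and handle many more cases.

```python
import ast
from typing import Optional

SOURCE = '''
import os

class Account:
    def deactivate(self):
        pass

def send_invoice(account, amount):
    return os.getpid()
'''


def index_symbols(source: str) -> dict[str, tuple[str, int]]:
    """Map each class and function name to (kind, line number)."""
    symbols = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            symbols[node.name] = ("class", node.lineno)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols[node.name] = ("function", node.lineno)
    return symbols


def extract_definition(source: str, name: str) -> Optional[str]:
    """Return only the source text of the named definition, or None if absent."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.name == name:
                return ast.get_source_segment(source, node)
    return None


symbols = index_symbols(SOURCE)
snippet = extract_definition(SOURCE, "send_invoice")
print(symbols)
print(snippet)
```

The index is cheap to keep in context, and `extract_definition` pulls in one definition on demand instead of the whole file, which is the selective approach argued for above.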
I don't have all the answers here but if agentic coding is going to be most of the coding, it makes sense to optimize the tools, languages, etc. for that rather than for us.
This is how: I work for a company called NonBioS.ai - we already implement most of what is mentioned in this article. Actually we implemented this about 6 months back and what we have now is an advanced version of the same flow. Every user in NonBioS gets a full linux VM with root access. You can ask nonbios to pull in your source code and ask it to implement any feature. The context is all managed automatically through a process we call "Strategic Forgetting" which is in some ways an advanced version of the logic in this article.
Strategic Forgetting handles the context automatically - think of it like automatic compaction. It evaluates information retention based on several key factors:
1. Relevance Scoring: We assess how directly information contributes to the current objective vs. being tangential noise
2. Temporal Decay: Information gets weighted by recency and frequency of use - rarely accessed context naturally fades
3. Retrievability: If data can be easily reconstructed from system state or documentation, it's a candidate for pruning
4. Source Priority: User-provided context gets higher retention weight than inferred or generated content
The algorithm runs continuously during coding sessions, creating a dynamic "working memory" that stays lean and focused. Think of it like how you naturally filter out background conversations to focus on what matters.
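For readers who want something concrete, here is a toy sketch of how the four factors listed above could be combined into a single retention score. This is not the NonBioS implementation, which isn't described in detail here; the field names, the weights, and the half-life are all assumptions chosen only to make the idea runnable.

```python
import math
from dataclasses import dataclass


@dataclass
class ContextItem:
    text: str
    relevance: float       # 0..1, how directly it serves the current objective
    age_turns: int         # turns since the item was last used
    use_count: int         # how often it has been referenced so far
    reconstructible: bool  # can be re-read from system state or docs
    user_provided: bool    # supplied by the user rather than inferred


def retention_score(item: ContextItem, half_life_turns: float = 10.0) -> float:
    """Combine the four factors into one number; higher means keep."""
    decay = 0.5 ** (item.age_turns / half_life_turns)   # temporal decay (recency)
    frequency = math.log1p(item.use_count)              # temporal decay (frequency)
    score = item.relevance * decay * (1.0 + frequency)  # relevance scoring
    if item.reconstructible:
        score *= 0.5                                    # retrievability: cheap to refetch
    if item.user_provided:
        score *= 2.0                                    # source priority
    return score


def prune(items: list[ContextItem], keep: int) -> list[ContextItem]:
    """Keep the highest-scoring items and drop the rest."""
    return sorted(items, key=retention_score, reverse=True)[:keep]


items = [
    ContextItem("user spec for billing", 0.9, 1, 3, False, True),
    ContextItem("old directory listing", 0.9, 40, 1, True, False),
    ContextItem("inferred API shape", 0.5, 2, 1, False, False),
]
kept = [item.text for item in prune(items, keep=2)]
print(kept)
```

In this toy version the stale, easily reconstructed directory listing is the first thing dropped, which matches the intuition behind factors 2 and 3.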
And we have tried it out in very complex code bases and it works pretty well. Once you know how well it works, you will not have a hard time believing that the days of using IDEs to edit code are probably numbered.
Also - you can try it out for yourself very quickly at NonBioS.ai. We have a very generous free tier that will be enough for the biggest code base you can throw at nonbios. However, big feature implementations or larger refactorings might take longer than what is afforded in the free tier.
We're taking a profession that attracts people who enjoy a particular type of mental stimulation, and transforming it into something that most members of the profession just fundamentally do not enjoy.
If you're a business leader wondering why AI hasn't super charged your company's productivity, it's at least partly because you're asking people to change the way they work so drastically, that they no longer derive intrinsic motivation from it.
Doesn't apply to every developer. But it's a lot.
https://github.com/ricardoborges/cpython
What web programming task can't GPT-5 handle?