Features are vertical slices through the software cake, but the cake is actually made out of horizontal layers. Creating a bunch of servings of cake and then trying to stick them together just results in a fragile mess that's difficult to work with and easy to break.
Internally, the team splits Epics into "Spikes" (figure out what to do) and "Tasks" (executing on the things we need to do).
- Spikes are scoped to up to 3 days and their outcome is usually a doc and either a follow-up Spike or Tasks to execute.
- Tasks must be as small and unambiguous as possible (within reason).
The point I'm making is that there are large cross-cutting concerns that shouldn't be sliced up by feature, but rather that the features should arise out of the composition of the cross-cutting concerns.
A single user story commonly requires the holy trinity of UI, 'business logic' and data storage, and my contention is that it's more efficient and robust to build those three layers out holistically rather than try to assemble them from the fragments required for all the user stories.
I kind of like this analogy because it does help us reason about the situation. The one-room shack is basically an MVP; a hacky result that just does one thing and probably poorly, but it is useful enough to justify its own existence. The giant mansion built from detailed architectural plans seems like a waterfall process for an enterprise application, doesn't it?
There are many advantages to building a house one room at a time. You get something to house you quickly and cheaply. When you build each extension, you have a very good idea of how it will be most useful because you know your needs well. You are more capable of taking advantage of sales (my neighbor collects construction overstock for free/cheap and starts building something once he has enough quantity to do so). It's more "agile". The resulting houses are beautiful in their own bespoke ways. They last a long time, too.
The downsides are that the services and structure are a hodgepodge of eras and necessity. If you're competent, you can avoid problems in your own work, but you may have to build on shoddy "legacy" work. You spend more of your time in a state of construction, and it may be infeasible to undertake a whole-house project like running ethernet to every room.
It's all tradeoffs. I think it does in many cases make sense to build a house in this way, and it likewise makes sense to build software this way. It depends on the situation.
> There are many advantages to building a house one room at a time.... It's more "agile"... The downsides are that the services and structure are a hodgepodge of eras and necessity... it may be infeasible to undertake a whole-house project like running ethernet to every room.
The thing is that that end result is actually the opposite of agile, being as it is more difficult to change, and this speaks more broadly to a perennial problem in software development - requirements change regularly, even deep into project development. Planning a design up front does not mean just fixing a specific set of requirements in stone, but also anticipating the things that may change, even without knowing the specifics of what those changes will be, and designing in a flexible way that can accommodate a broad spectrum of possible futures. A car manufacturer might conceivably branch into making other types of vehicles, plant equipment, and similar things like that, whereas they are unlikely to ever get into catering (and if they did, that would likely be a separate business and a new piece of software). Responding only to the requirements in front of you right now tends to make the design more rigid rather than less, and almost inevitably leads to big balls of mud and big-bang rewrite projects that fail as often as they succeed. Keep in mind also that most software spends most of its life in maintenance mode, so optimising for the delivery stage is short-sighted at best.
Designing software in the way I'm describing is not easy, but it's definitely possible, and in my opinion offers a lot more value than it might first appear.
Call user stories a grouping of work, sure, but I guess I don't see why the distinction matters. Most possible "units of work" will have many cases worth breaking down further regardless of choice of unit.
OK, the kitchen and the bathroom are special cases due to the plumbing and so on, so my analogy breaks down a bit there, but the rest of the rooms? They don't crystallise their function until the occupants move in. Maybe I as a builder might assume a certain room will be the living room, and designate another as the master bedroom, but until the owner puts in furniture, they're more or less just empty boxes with power and windows. Most of the 'features' or user stories of the house arise at the end out of the combination of built elements and final decoration. Software is actually a lot like this - take a trivial example user story of creating an invoice. What do you need for that? UI, data storage, comms maybe, some domain logic. Each of those things can (and in my opinion should) be expressed independently, but if you're developing that user story as a single deliverable, then you need to create bits of all of them. And that's what I'm saying - we're building things that naturally decompose into 'horizontal' layers (units of infrastructure), but doing it in 'vertical' slices (user stories), which, to torture my analogy even further, results in uneven flooring, mismatched walls, and untidy structures that get more and more difficult to change over time as requirements change and more builders try to add other slices of building that were not anticipated.
If you want to sleep in the lounge from now on instead of the bedroom, and entertain your guests in the back bedroom, you can just move the furniture. That's a lot more agile in my opinion than the software we commonly build.
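To make the invoice example above a bit more concrete, here is a minimal sketch of what "features arising out of the composition of layers" could look like. Everything in it is hypothetical and invented for illustration (`Invoice`, `InMemoryStore`, `validate_invoice`, `render_confirmation`, `create_invoice`); it is not anyone's real codebase, just one way the idea might be expressed.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Invoice:
    customer: str
    amount_cents: int


class InvoiceStore(Protocol):
    """Storage layer contract: persists invoices, knows nothing about UI."""

    def save(self, invoice: Invoice) -> int: ...


class InMemoryStore:
    """Hypothetical storage implementation, used only for this sketch."""

    def __init__(self) -> None:
        self.rows: list[Invoice] = []

    def save(self, invoice: Invoice) -> int:
        self.rows.append(invoice)
        return len(self.rows)  # 1-based id of the stored invoice


def validate_invoice(customer: str, amount_cents: int) -> Invoice:
    """Domain layer: business rules only, no UI or storage concerns."""
    if not customer.strip():
        raise ValueError("customer is required")
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    return Invoice(customer.strip(), amount_cents)


def render_confirmation(invoice_id: int, invoice: Invoice) -> str:
    """Presentation layer: formatting only."""
    dollars = invoice.amount_cents / 100
    return f"Invoice #{invoice_id} for {invoice.customer}: ${dollars:.2f}"


def create_invoice(store: InvoiceStore, customer: str, amount_cents: int) -> str:
    """The 'user story' is a thin composition of the three layers."""
    invoice = validate_invoice(customer, amount_cents)
    invoice_id = store.save(invoice)
    return render_confirmation(invoice_id, invoice)
```

The point of the sketch is that `create_invoice` contains almost no logic of its own. Swapping the store or the rendering is the software equivalent of moving the furniture.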
they're very well adapted to legacy enterprise work
* Evaluate what, if any, structural implications removing the wall has
* Tear down the existing wall
* Redo any plumbing, ductwork, wiring, etc. that was hiding in the wall
* Remediate structural concerns from removing the wall
* Redo the flooring
* Repair and repaint any damage done to remaining drywall
If this is part of a larger renovation, you will likely schedule work so the above tasks happen at the same time as other similar tasks.
E.g., a meaningful unit of work might be "electrical roughing", which would include both moving wires that were previously in the wall, and running a new circuit to the garage for a car charger. No user story covers those two tasks, but the nature of renovating a house means that it makes sense to do them together.
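As a rough sketch of that scheduling idea: tasks keep a note of the story that motivated them, but get grouped by the kind of work they are. The `Task` fields, the story names and `schedule_by_trade` are all made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    story: str  # the user story that motivated the task (hypothetical names)
    trade: str  # the kind of work it actually is
    name: str


def schedule_by_trade(tasks: list[Task]) -> dict[str, list[Task]]:
    """Group tasks by kind of work, regardless of the originating story."""
    units: dict[str, list[Task]] = defaultdict(list)
    for task in tasks:
        units[task.trade].append(task)
    return dict(units)


tasks = [
    Task("remove the wall", "demolition", "tear down the existing wall"),
    Task("remove the wall", "electrical", "move wires that were in the wall"),
    Task("charge the car at home", "electrical", "run a new circuit to the garage"),
    Task("remove the wall", "flooring", "redo the flooring"),
]

units = schedule_by_trade(tasks)
# units["electrical"] now holds work from two different stories,
# i.e. the "electrical roughing" unit described above.
```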
This isn't a good analogy. When building a house, you are physically realising a blueprint that describes everything in great detail. You know exactly where every wire and pipe should go ahead of time. When there are changes, they must be minor.
This isn't how writing code works. Maybe some management level people would like to believe it can work that way, but it doesn't in practice.
> This isn't how writing code works
Maybe that's not how you write code, but there are many different ways to paint that particular fence, no? I've been coding for a long time and for me, this is the approach that I've landed on that's the best I've found so far. To me, the idea of feature-led development is, to put it mildly, nonsensical.
Product requirements are a hypothesis for creating business value, and the only way to test that hypothesis is to actually demonstrate a slice of that value in a way that's legible to all stakeholders involved.
This post is a nice articulation of this: https://blog.nilenso.com/blog/2025/09/17/the-common-sense-un...
There is so much great software in the world that wasn't delivered like that and couldn't be delivered like that: Unix, Microsoft Word, Postgres, AutoCAD, The JVM, Google search, Windows, AWS, Robotics, Calculators ....
The software industry seems to have been captured by contractors who used to deliver CRUD apps and now want to make the whole world in their image and likeness.
It's because SE is a low class, low power field. It's not respected by the people in charge at the overwhelming majority of companies. It has resisted standardizing like lawyers, doctors or even real estate agents. So there is little leverage a person in the field can push back with. It's mostly just seen as an annoyance to gaining/consolidating power for the power brokers on their way up the ladder.
That really is what computers/software are. Huge engines for orchestrating power that kings of old couldn't dream of.
You can't standardize a field that changes so fast. It takes decades to standardize a field, and there has never been a point in the history of software where two decades didn't completely change the job.
And the worse news is: it will never change. There are several things fundamental to SWE, at least the corporate, open source, and/or indie flavors, that ensure it will not be standardized.
This is the difference with FAANGs. Software engineering is king. The inmates are running the asylum.
Google is at least 4x as efficient as other large companies I've worked for. Nearly every internal process that can possibly be automated is.
Depends on the physical process. Are you carving, casting, bolting or welding or using 3d modelling and printing … ?
> trying to stick them together just results in a fragile mess
If it's a physical cake.
If it's software we seamlessly add functionality to each layer as needed.
thanks for the article, it's a good one
yes, just as was said each and every previous time OpenAI/Anthropic shit out a new model
"now it doesn't suck!"
The hedonic treadmill ensures it feels the same way each time.
But that doesn’t mean the models aren’t improving, nor that the scope isn’t expanding. If you compare today’s tools to those a year ago, the difference is stark.
They know that it's a significant but not revolutionary improvement.
If you supervise and manage your agents closely on well scoped (small) tasks they are pretty handy.
If you need a prototype and don't care about code quality or maintenance, they are great.
Anyone claiming 2x, 5x, 10x etc is absolutely kidding themselves for any non-trivial software.
It takes all of five minutes to have it run, and at the end I can review it. If it's small, I ask it to execute; if it actually requires me to work on it myself, well, now I have a reference with line numbers, some comments on how the system appears to work, what the intent is, areas of interest, etc.
I also rely heavily on the sequential thinking MCP server to give it more structure.
Edit:
I will say, because I think it's important: I've been a senior dev for a while now, and a lot of my job _is_ reviewing other people's pull requests. I don't find it hard or tedious at all.
Honestly it's a lot easier to review a few small "PRs" as the agent works than some of the giant PRs I'd get from team members before.
Compared to just doing it yourself though?
Imagine having to micromanage a junior developer like this to get good results
Ridiculous tbh
I'd rather use it the other way, I'm the one in charge, and the AI reviews any logical flaw or things that I would have missed. I don't even have to think about context window since it'll only look at my new code logic.
So yeah, 3 years after the first ChatGPT and Copilot, I don't feel huge changes regarding "automated" AI programming, and I don't have any AI tool in my IDE. I prefer to have a chat using their website, to brainstorm, or occasionally find a solution to something I'm stuck on.
It's good enough that it helps, particularly in areas or languages that I'm unfamiliar with. But I'm constantly fighting with it.
Impressively, it recognized the structure of the code and correctly identified it as a component of an audio codec library, and provided a reasonably complete description of many minute details specific to this codec and the work that the function was doing.
Rather less impressively, it decided to ignore my request and write a function that used C++ features throughout, such as type inference and lambdas, or should I say "lambdas" because it was actually just a function-defined-within-a-function that tried to access and mutate variables outside of its own function scope, like we were writing JavaScript or something. Even apart from that, the code was rife with the sorts of warnings that even a default invocation of gcc would flag.
I can see why people would be wowed by this on its face. I wouldn't expect any average developer to have such a depth of knowledge and breadth of pattern-matching ability to be able to identify the specific task that this specific function in this specific audio codec was performing.
At the same time, this is clearly not a tool that's suitable for letting loose on a codebase without EXTREME supervision. This was a fresh session (no prior context to confuse it) using a tightly crafted prompt (a small, self-contained C program doing one thing) with a clear goal, and it still required constant handholding.
At the end of the day, I got the code working by editing it manually, but in an honest retrospective I would have to admit that the overall process actually didn't save me any time at all.
Ironically, despite how they're sold, these tools are infinitely better at going from code to English than going the other way around.
Brainstorming, ideation and small, well-defined tasks where I can quickly vet the solution: these feel like the sweet spot for current frontier model capabilities.
(Unless you are pumping out some sloppy React SPA where you don't care about anything except getting it working as fast as possible - fine, get Claude Code to one-shot it)
Just two questions, if you don’t mind satisfying my curiosity.
- Did you tell it to write C? Or better yet, what was the prompt? You can use Claude --resume to easily find that.
- Which model (Sonnet or Opus)? Though I’d have expected either one to work.
Yes. Decently useful (and reasonably safe) to red team yourself with. But extremely easy to red queen yourself otherwise.
There's a big difference between their benchmarks and real-world coding.
But... reviewing code is harder than writing code. Expressing how I want something to be done in natural language is incredibly hard.
So over time I'm spending a lot of energy in those things, and only getting it 80% right.
Not to mention I'm constantly in this highly suspicious mode, trying to pierce through the veil of my own prompt and the code generated, because it's the edge cases that make work hard.
The end result is exhaustion. There is no recharge. Plans are front-loaded, and then you switch to auditing mode.
Whereas with code you front-load a good amount of design, but you can make changes as you go, and since you know your own code the effort to make those is much lower.
Working with LLM-generated code is mostly the same. The more sophisticated the autocomplete, the more mental overhead spent on understanding its output. There is an advantage: you are spared having to argue with a possibly defensive peer about what you believe is best. There is also a disadvantage: you do not feel like you are helping someone grow, and instead you are an unpaid (you are not paid for that in particular) contributor to a product by Microsoft (or similar) intended generally in longer term to push you and/or your peers out of the job. Additionally, there is no single mind that you can build rapport with and learn to understand the approaches and vibes of after a while.
Surprise, surprise… that is why programming languages were created.
Programming languages were created because of the different problem of “it's very hard to get computers to understand natural language even if you know how to express what you want in it”.
If you know what to write but it's tedious, LLMs are great, they'll just fill all that in for you. Anything more complex or open that needs checking could be quicker to just think through and write yourself. You can still use LLMs at the edges, e.g. what API methods should I use for this?
Another thing it's good at is writing tests - a lot of times I won't bother, but with AI I can do it cheaply. And it's very good at keeping documentation and a codebase consistent, believe it or not. If I change a part that is mentioned somewhere else and it has it in the context window, it will update both parts, whereas I might omit it.
For me it's very clearly the opposite. I wonder if it's a professional background, or personality or neurotype issue or something, but when I'm faced with a problem I often get somewhat paralyzed, spending a long time thinking about a good approach, but when I delegate to someone or ask an AI to tackle it, even if I get back something half-shitty, it removes my paralysis and then reviewing and improving what they did is significantly easier (or at least more motivating) for me than doing it from scratch. And even if they ended up giving me something that's entirely in the wrong direction, and I need to throw out all of it, I still usually feel that it removes that paralysis and gives me a better understanding of the problem space.
I wonder if this difference between people accounts for a significant difference between those who benefit from AI and those who don't.
It’s amazing at reviewing code. It will identify what you fear, the horrors that lie within the codebase, and it’ll bring them out into the sunlight and give you a 7-step plan for fixing them. And the coding model is good, it can write a function. But it can’t follow a plan worth shit. And if I have to be extremely detailed at the function-by-function level, then I should be in the editor coding. Claude Code is an amazing niche tool for code reviews and dialogue and debugging and coping with new technologies and tools, but it is not a productivity enhancement for daily coding.
> most SWE folks still have no idea how big the difference is between the coding agents they tried a year ago and declared as useless and chatgpt 5 paired with Codex or Cursor today
Also liszper: oh, you tried the current approach and don’t agree with me? Well you just don’t know what you are doing.
The difference from an actual junior developer, of course, is that the human junior developer learns from his mistakes and gets better, but Claude seems to be stuck at the level of expertise of its model, and you have to wait for the model to improve before Claude improves.
I am always so skeptical of this style of response. Because if it takes hundreds of hours to learn to use something, how can it really be the silver bullet everyone was claiming earlier? Surely they were all in the midst of the 100 hours. And what else could we do if we spent 100 hours learning something? It's a lot of time, a huge investment, all on faith that things will get better.
I’ve invested hundreds of hours in process and tooling, and can now ship major features with tests in record time with Claude Code.
You have to coach it in TDD - no matter how much you explain in CLAUDE.md. That’s partly because “a test that fails because the code isn’t written yet” is conceptually very similar to “a test that passes without the code we’re about to write” and is also similar to “a test that asserts the code we’re about to write is not there”. You have to watch closely to make sure it produces the first thing.
Why does it keep getting confused? You can’t blame it really. When two things are conceptually similar, models need lots of examples to distinguish between them. If the set of samples is sparse the model is likely to jump the small distance from a concept to similar ones.
So, you have to accept this as how Claude 4 works, keep it on a short leash, keep reminding it that it must watch the test fail, ask it if the test failed for the right reason (not some setup issue), and THEN give it permission to write the code.
The result is two mirror copies of your feature or fix: code and tests.
Reviewing code and tests together is pleasant because they mirror one another. The tests forever ensure your feature works as described, no manual testing needed, no regressions. And the model knows all the tricks to make your tests really beautiful.
TDD is the check and balance missing from most people’s agentic software dev process.
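The "watch the test fail, and for the right reason" step above can be sketched in a few lines. This is only an illustration of the distinction between the three kinds of test, not how Claude Code or any test runner works internally; `red_for_right_reason`, `invoice_total` and the `check_*` functions are hypothetical names.

```python
from typing import Callable


def red_for_right_reason(test: Callable[[], None]) -> bool:
    """Return True only if the test fails on an assertion.

    A pass means the test does not depend on the missing code; any other
    exception usually points at a setup problem rather than the feature.
    """
    try:
        test()
    except AssertionError:
        return True
    except Exception:
        return False
    return False


def invoice_total(line_items: list[int]) -> int:
    """Stub standing in for code that has not been written yet."""
    return 0


def check_fails_until_written() -> None:
    # The test we want: red now, green once invoice_total is implemented.
    assert invoice_total([100, 250]) == 350


def check_passes_without_code() -> None:
    # Passes already, so it proves nothing about the feature.
    assert invoice_total([100, 250]) is not None


def check_asserts_code_missing() -> None:
    # Asserts the stub behaviour, i.e. that the feature is NOT there.
    assert invoice_total([100, 250]) == 0


def check_broken_setup() -> None:
    # Fails, but with a NameError: a setup issue, not a real "red".
    assert undefined_helper([100, 250]) == 350  # noqa: F821
```

Only `check_fails_until_written` comes back `True`; the other three are the look-alikes that need catching before the agent gets permission to write the code.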
But I have repeatedly seen Claude get hung up on TDD itself, and I've tried lots of different prompts/directions. It runs into a problem and inevitably runs ever more complicated shell commands and creates weird temp input files rather than sticking to "cargo test" and addressing the failing test.
Since I need to review the agent's code, I'd much prefer it to use a workflow like a human, with a progression of small commits following TDD--much easier to review the code then. If it's just splatting up big diffs, then it makes review harder, and that offsets any productivity gains.
I much prefer to choose tasks that can be done with 25%+ context left and then just start the next task with fresh context.
If I'm getting low on context I have it summarize the plan and progress in a text file rather than use /compact and then start a fresh context and reference that file, which I can then edit and try again if I'm not getting good results.
1. Assume that any model will start to lose focus beyond 50K-100K tokens (even with a huge context window).
2. Be gluttonous with chats. At the first sign of confusion or mistakes, tell it to generate a new prompt and move to a new chat.
3. Write detailed prompts with clear expectations (from how the code should be written to the specific implementation that's required). Combine these with context like docs to get a fairly consistent hit rate.
4. Use tools like Cline that let you switch between an "Act" and "Plan" mode. This saves a ton of tokens but also avoids the LLM getting stuck in a loop when it's debugging.
I recently wrote this short blog post related to this: https://ryanglover.net/blog/treat-the-ai-like-it-s-yourself
The above approach helped me to implement a full-blown database wrapper around LMDB for Node.js in ~2 weeks of slow back-and-forth (link to code in post for those who are curious).
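Rules 1 and 2 in the list above could be turned into a crude check along these lines. The 4-characters-per-token ratio is a rough assumption (real tokenizers vary by model and content), and `estimate_tokens` and `should_start_new_chat` are invented names, not part of Cline or any other tool.

```python
SOFT_LIMIT_TOKENS = 50_000  # lower end of the 50K-100K range in rule 1
CHARS_PER_TOKEN = 4         # rough assumption; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def should_start_new_chat(messages: list[str], limit: int = SOFT_LIMIT_TOKENS) -> bool:
    """True once the running conversation reaches the soft limit."""
    return sum(estimate_tokens(m) for m in messages) >= limit
```

In practice the first sign of confusion is probably a better trigger than any number, but a counter like this at least makes the budget visible.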
When I have let Claude loose and vibe coded up hundreds of lines at a time that I have no familiarity with, I viscerally feel how I no longer understand or can maintain the app I've built. If I can't get Claude to do the next change I need, I'm screwed.
I'm very satisfied at the moment to be wielding LLMs as a tool at the individual function / microfeature level and getting a very satisfying productivity improvement.
But this has more or less always been the case for LLMs. The challenge becomes context capture, which in my opinion is the real challenge with LLM adoption. Without the right context, some tasks just cannot be reliably completed.
I didn't know about Kiro specs. I've been playing around with my own org-mode based approach with mixed success in keeping dev agent work tracked:
E.g., maybe a dev plan is all your authentication feature requirements or, in the house analogy, all the requirements for the rooms, but with instructions to actually just first build the floor and the walls.
Dev plans then slice the reqs into meaningful units of work. As mentioned in the article, a feature/story is often too large of a checkpoint, or often needs to be implemented in collaboration with other features/stories so that it understands the correct architectural context.
You can then implement Dev plans over MCP, or copy to .md for tools like Lovable or V0.
It feels like part of my journey to being an "AI developer" is being present for those tradeoffs, metabolizing each one into my craft.
AI is a fickle, but powerful horse. I'm finding it a privilege to learn how to be a rider.