Once you’ve got Claude Code set up, you can point it at your codebase, have it learn your conventions, pull in best practices, and refine everything until it’s basically operating like a super-powered teammate. The real unlock is building a solid set of reusable “skills” plus a few agents for the stuff you do all the time.
For example, we have a custom UI library, and Claude Code has a skill that explains exactly how to use it. Same for how we write Storybooks, how we structure APIs, and basically how we want everything done in our repo. So when it generates code, it already matches our patterns and standards out of the box.
We also had Claude Code create a bunch of ESLint automation, including custom ESLint rules and lint checks that catch and auto-handle a lot of stuff before it even hits review.
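The custom-rule piece is the most concrete of these. As a sketch only (the rule name, message, and the `@acme/ui` package path are all invented for illustration), a minimal ESLint rule that steers everyone, agent included, away from raw `<button>` elements and toward an in-house component might look like:

```javascript
// Hypothetical custom ESLint rule: flag raw <button> JSX elements and
// point devs (and the agent) at the in-house UI library's Button instead.
// The rule name, message text, and '@acme/ui' path are placeholders.
const noRawButton = {
  meta: {
    type: "suggestion",
    docs: { description: "Prefer Button from '@acme/ui' over raw <button>" },
    messages: {
      useUiButton: "Use Button from '@acme/ui' instead of a raw <button>.",
    },
  },
  create(context) {
    return {
      // Fires for every JSX opening element in the file being linted.
      JSXOpeningElement(node) {
        if (node.name && node.name.name === "button") {
          context.report({ node, messageId: "useUiButton" });
        }
      },
    };
  },
};

module.exports = { rules: { "no-raw-button": noRawButton } };
```

Wired in as a local plugin, rules like this fail CI before a human ever looks at the diff; pairing each rule with a line in the relevant skill file also tells the agent why the rule exists.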
Then we take it further: we have a deep code review agent Claude Code runs after changes are made. And when a PR goes up, we have another Claude Code agent that does a full PR review, following a detailed markdown checklist we’ve written for it.
On top of that, we’ve got like five other Claude Code GitHub workflow agents that run on a schedule. One of them reads all commits from the last month and makes sure docs are still aligned. Another checks for gaps in end-to-end coverage. Stuff like that. A ton of maintenance and quality work is just… automated. It runs ridiculously smoothly.
We even use Claude Code for ticket triage. It reads the ticket, digs into the codebase, and leaves a comment with what it thinks should be done. So when an engineer picks it up, they’re basically starting halfway through already.
There is so much low-hanging fruit here that it honestly blows my mind people aren’t all over it. 2026 is going to be a wake-up call.
(used voice to text then had claude reword, I am lazy and not gonna hand write it all for yall sorry!)
Edit: made an example repo for ya
Codex/Claude Code are terrible with C++. They also can't do Rust really well once you get to the meat of it. Not sure why that is, but they just spit out nonsense that creates more work than it helps me. They also can't one-shot anything complete, even though I might feed them the entire paper that explains what the algorithm is supposed to do.
Try to do some OpenGL or Vulkan with it, without using WebGPU or three.js. Try it with real code that all of us have to deal with every day: SDL, Vulkan RHI, NVRHI. Very frustrating.
Try it with boost, or cmake, or taskflow. It loses itself constantly, hallucinates which version it is working on and ignores you when you provide actual pointers to documentation on the repo.
I've also recently tried to get Opus 4.5 to move the Job system from Doom 3 BFG to the original codebase. Clean clone of dhewm3, pointed Opus to the BFG Job system codebase, and explained how it works. I have also fed it the Fabien Sanglard code review of the job system: https://fabiensanglard.net/doom3_bfg/threading.php
We are not sleeping on it, we are actually waiting for it to get actually useful. Sure, it can generate a full stack admin control panel in JS for my PostgreSQL tables, but is that really "not normal"? That's basic.
And I get it. Coding with Claude Code really was prompting something, getting errors, and asking it to fix them. Which was still useful, but I could see why a skilled coder adding a feature to a complex codebase would just give up.
Opus 4.5 really is at a new tier, however. It just...works. The errors are far fewer and often very minor: "careless" errors, not fundamental issues (like forgetting to add "use client" to a Next.js client component).
I had an idea for something that I wanted, and in five scattered hours, I got it good enough to use. I'm thinking about it in a few different ways:
1. I estimate I could have done it without AI with 2 weeks full-time effort. (Full-time defined as >> 40 hours / week.)
2. I have too many other things to do that are purportedly more important than programming. I really can't dedicate two weeks full-time to a "nice to have" project. So, without AI, I wouldn't have done it at all.
3. I could hire someone to do it for me. At the university, those are students. From experience with lots of advising, a top-tier undergraduate student could have achieved the same thing, had they worked full tilt for a semester (before LLMs). This of course assumes that I'm meeting them every week.
Nobody is sleeping. I'm using LLMs daily to help me in simple coding tasks.
But really, where is the hurry? At this point hardly a few weeks go by without the next best thing since sliced bread coming out. Why would I bother "learning" (and there's really nothing to learn here) some tool/workflow that is already outdated by the time it comes out?
> 2026 is going to be a wake-up call
Do you honestly think a developer not using AI won't be able to adapt to a LLM workflow in, say, 2028 or 2029? It has to be 2026 or... What exactly?
There is literally no hurry.
You're using the equivalent of the first portable CD-player in the 80s: it was huge, clunky, had hiccups, had a huge battery attached to it. It was shiny though, for those who find new things shiny. Others are waiting for a portable CD player that is slim, that buffers, that works fine. And you're saying that people won't be able to learn how to put a CD in a slim CD player because they didn't use a clunky one first.
claude can call ssh and do system admin tasks. It works amazingly well. I have 3 VMs which depend on each other (proxmox with openwrt, adguard, unbound), and claude can prove to me that my DNS chain works perfectly, my firewalls are correct, etc., since claude can ssh into each. Setting up services, diagnosing issues, auditing configs... you name it. Just awesome.
claude can call other sh scripts on the machine, so over time you can create a bunch of scripts that let claude one-shot certain tasks that would normally eat tokens. It works great. One script per intention - don't have a script do more than one thing.
claude can call the compiler, run the debug executable and read the debug logs... in real time. So claude can read my Android app's debug stream via adb, or my C# debug console, because claude calls the compiler, not me. Just ask it to do it and it will diagnose stuff really quickly.
It can also analyze your db tables (give it readonly sql access), look at the application code and queries, and diagnose performance issues.
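Giving the agent genuinely read-only access is the key safety step in that setup. In Postgres, for example (the role, password, and database names here are placeholders), that's roughly:

```sql
-- Hypothetical read-only role for the agent; all names are placeholders.
CREATE ROLE claude_ro WITH LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE appdb TO claude_ro;
GRANT USAGE ON SCHEMA public TO claude_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO claude_ro;
-- Cover tables created after this point, too.
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO claude_ro;
```

With a role like that, the agent can inspect schemas and run EXPLAIN on slow SELECTs without being able to mutate anything.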
The opportunities are endless here. People need to wake up to this.
> have it learn your conventions, pull in best practices
What do you mean by "have it learn your conventions"? Is there a way to somehow automatically extract your conventions and store it within CLAUDE.md?
> For example, we have a custom UI library, and Claude Code has a skill that explains exactly how to use it. Same for how we write Storybooks, how we structure APIs, and basically how we want everything done in our repo. So when it generates code, it already matches our patterns and standards out of the box.
Did you have to develop these skills yourself? How much work was that? Do you have public examples somewhere?
Especially if you're in a place where a lot of time was previously spent revising PRs for best practices, etc., even for human-submitted code, then having the LLM do that for you saves a bunch of time. Most humans are bad at following those super-well.
There's a lot of stuff where I'm pretty sure I'm up to at least 2x speed now. And for things like making CLI tools or bash scripts, 10x-20x. But in terms of "the overall output of my day job in total", probably more like 1.5x.
But I think we will need a couple major leaps in tooling - probably deterministic tooling, not LLM tooling - before anyone could responsibly ship code nobody has ever read in situations with millions of dollars on the line. That's different from vibe-coding something that ends up making millions: that's a low-risk-high-reward situation, where big bets on doing things fast make sense. If you're already making millions, dramatic changes like that can become high-risk-low-reward very quickly. In those companies, "obvious" intuition like "I know that only touching these files is 99.99% likely to be completely safe for security-critical functionality" makes up for the lack of ability to exhaustively test software in a practical way (even with fuzzers and such), and "I didn't even look at the code" is conceding responsibility to a dangerous degree there.
Reword? But why not just voice to text alone...
Oh but we all read the partially synthetic ad by this point. Psyche.
I agree with this, but I haven't needed to use any advanced features to get good results. I think the simple approach gets you most of the benefits. Broadly, I just have markdown files in the repo written for a human dev audience that the agent can also use.
Basically:
- README.md with a quick start section for devs, descriptions of all build targets and tests, etc. Normal stuff.
- AGENTS.md (only file that's not written for people specifically) that just describes the overall directory structure and has a short set of instructions for the agent: (1) Always read the readme before you start. (2) Always read the relevant design docs before you start. (3) Always run the linter, a build, and tests whenever you make code changes.
- docs/*.md that contain design docs, architecture docs, and user stories, just text. It's important to have these resources anyway, agent or no.
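For concreteness, a minimal AGENTS.md in that spirit (the paths and npm commands below are placeholders, not a recommendation) might be just:

```markdown
# Agent notes

Layout: `src/` app code, `tests/` unit tests, `docs/` design docs.

Before starting any task:
1. Read `README.md`.
2. Read the design docs in `docs/` relevant to the task.

After any code change, always run:
- the linter (`npm run lint`)
- a build (`npm run build`)
- the tests (`npm test`)
```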
As with human devs, the better the docs/requirements the better the results.
There are times those extra layers are worth it but it seems LLMs have a bias to add them prematurely and overcomplicate things. You then end up with extra complexity you didn't need.
When needed, it can be picked up in a day. Otherwise, they are not paid based on tickets solved, etc. If the incentives were properly aligned, everyone would already use it.
I'm not saying there aren't use cases for agents, just that it's normal that most software engineers are sleeping on it.
codex cli even has a skill to create skills; it's super easy to get up to speed with them
https://github.com/openai/skills/blob/main/skills/.system/sk...
> we have another Claude Code agent that does a full PR review, following a detailed markdown checklist we’ve written for it.
(if you know) how is that compared to coderabbit? i'm seriously looking for something better rn...take my downvote as hard as you can. this sort of thing is awfully off-putting.
The tech industry just went through an insane hiring craze and is now thinning out. This will help to separate the chaff from the wheat.
I don't know why any company would want to hire "tech" people who are terrified of tech and completely obstinate when it comes to utilizing it. All the people I see downplaying it take a half-assed approach at using it then disparage it when it's not completely perfect.
I started tinkering with LLMs in 2022. First use case: speak in natural English to the LLM, give it a JSON structure, have it decipher the natural language and fill in that JSON structure (vacation planning app, so you talk to it about where/how you want to vacation and it creates the structured data in the app). Sometimes I'd use it for minor coding fixes (copy and paste a block into chatgpt, fix errors, or maybe just ideation). This was all personal project stuff.
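The pattern there is just: hand the model a fixed JSON skeleton in the prompt, then validate whatever comes back before trusting it. A sketch of the validation side (the field names are invented for the vacation example, and the actual LLM call is out of scope here):

```javascript
// Target structure the LLM is asked to fill in from free-form text.
// Field names are invented for illustration; include this skeleton
// verbatim in the prompt ("fill in this JSON, reply with JSON only").
const TEMPLATE = {
  destination: "string",
  startDate: "string",   // ISO date, e.g. "2025-06-01"
  nights: "number",
  budgetUsd: "number",
  interests: "object",   // an array of strings (typeof [] === "object")
};

// Check the model's reply against the skeleton before using it;
// on failure you'd typically re-prompt with the error message.
function validatePlan(reply) {
  const plan = typeof reply === "string" ? JSON.parse(reply) : reply;
  for (const [key, kind] of Object.entries(TEMPLATE)) {
    if (!(key in plan) || typeof plan[key] !== kind) {
      throw new Error(`Bad or missing field: ${key}`);
    }
  }
  return plan;
}
```

The validate-and-re-prompt loop is what makes the free-form input safe to feed into the rest of the app.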
At my job we got LLM access in mid/late 2023. Not crazy useful, but still helpful. We got claude code in 2024. These days I only have an IDE open so I can make quick changes (like bumping up a config parameter, changing a config bool, etc.). I write almost ZERO code now. I usually have 3+ claude code sessions open.
On my personal projects I'm using Gemini + codex primarily (since I have a google account and chatgpt $20/month account). When I get throttled on those I go to claude and pay per token. I'll often rip through new features, projects, ideas with one agent, then I have another agent come through and clean things up, look for code smells, etc. I don't allow the agents to have full unfettered control, but I'd say 70%+ of the time I just blindly accept their changes. If there are problems I can catch them on the MR/PR.
I agree about the low hanging fruit and I'm constantly shocked at the sheer amount of FUD around LLMs. I want to generalize, like I feel like it's just the mid/jr level devs that speak poorly about it, but there's definitely senior/staff level people I see (rarely, mind you) that also don't like LLMs.
I do feel like the online sentiment is slowly starting to change though. One thing I've noticed a lot of is that when it's an anonymous post it's more likely to downplay LLMs. But if I go on linkedin and look at actual good engineers I see them praising LLMs. Someone speaking about how powerful the LLMs are - working on sophisticated projects at startups or FAANG. Someone with FUD when it comes to LLM - web dev out of Alabama.
I could go on and on but I'm just ranting/venting a little. I guess I can end this by saying that in my professional/personal life 9/10 of the top level best engineers I know are jumping on LLMs any chance they get. Only 1/10 talks about AI slop or bullshit like that.
I have a Django project with 50 kLOC and it is pretty capable of understanding the architecture, style of coding, naming of variables, functions, etc. Sometimes it excels at tasks like "replicate this non-trivial functionality for this other model and update the UI appropriately" and leaves me stunned. Sometimes it solves tedious and laborious tasks for me like "replace this markdown editor with something modern, allowing fullscreen edits of content" and makes an annoying mistake that only visual inspection reveals, and is not capable of fixing it after 5 prompts. I feel as if I am becoming a tester more than a developer, and I do not like the shift. Especially since I do not like telling someone they made an obvious mistake and should fix it - it seems I do not care whether it is human or AI, I just do not like incompetence, I guess.
Yesterday I had to add some parameters to a very simple Falcon project and found out it had not been updated for several months and wouldn't build due to some pip issues with pymssql. OK, this is a really marginal sub-project, so I said: let's migrate it to uv, let's not get our hands dirty, and let Claude do it. He did splendidly, but in the Dockerfile he missed the "COPY server.py /data/" while I asked him to change the path... Build failed, I updated the path myself and moved on.
And then you listen to very smart guys like Karpathy who rave about Tab, Tab, Tab, while not understanding the language or anything about the code they write. Am I getting this wrong?
I am really far far away from letting agents touch my infrastructure via SSH, access managed databases with full access privileges etc. and dread the day one of my silly customers asks me to give their agent permission to managed services. One might say the liability should then be shifted, but at the end of the day, humans will have to deal with the damage done.
My customer, who uses all the codebase I am mentioning here, asked me if there is a way to provide "some AI" with item GTINs and let it generate photos, descriptions, etc., including the metadata they handcrafted and extracted for years from various sources. While it looks like a nice idea, and for them a possibility of decreasing staff count, I caught the feeling they do not care about data quality anymore, or do not understand the problems they are bringing upon themselves due to errors nobody will catch until it is too late.
TL;DR: I am using Opus 4.5, it helps a lot, I have to keep being (very) cautious. Wake up call 2026? Rather like waking up from hallucination.
I shortened it for anyone else that might need it
----
Software engineers are sleeping on Claude Code agents. By teaching it your conventions, you can automate your entire workflow:
Custom Skills: Generates code matching your UI library and API patterns.
Quality Ops: Automates ESLint, doc syncing, and E2E coverage audits.
Agentic Reviews: Performs deep PR checks against custom checklists.
Smart Triage: Pre-analyzes tickets to give devs a head start.
Check out the showcase repo to see these patterns in action.
Disclaimer: I don't have access to Claude Code. My employer has only granted me Claude Teams. Supposedly, they don't use my poopy code to train their models if I use my work email Claude so I am supposed to use that. If I'm not pasting code (asking general questions) into Claude, I believe I'm allowed to use whatever.
And my conclusion is: it's still not as smart as a good human programmer. It frequently got stuck, went down wrong paths, ignored what I told it to do and did something wrong, or even repeated a previous mistake I had to correct.
Yet in other ways, it's unbelievably good. I can give it a directory full of code to analyze, and it can tell me it's an implementation of Kozo Sugiyama's dagre graph layout algorithm, and immediately identify the file with the error. That's unbelievably impressive. Unfortunately it can't fix the error. The error was one of the many errors it made during previous sessions.
So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself.
Yesterday and today I was upgrading a bunch of unit tests because of a dependency upgrade, and while it was occasionally very helpful, it also regularly got stuck. I got a lot more done than usual in the same time, but I do wonder if it wasn't too much. Wasn't there an easier way to do this? I didn't look for it, because every step of the way, Opus's solution seemed obvious and easy, and I had no idea how deep a pit it was getting me into. I should have been more critical of the direction it was pointing to.
You can see some of the context limits here:
If you want the full capability, use the API and use something like opencode. You will find that a single PR can easily rack up 3 digits of consumption costs.
I don't think you've seen the full potential. I'm currently #1 on 5 different very complex computer engineering problems, and I can't even write a "hello world" in rust or cpp. You no longer need to know how to write code, you just need to understand the task at a high level and nudge the agents in the right direction. The game has changed.
- https://highload.fun/tasks/3/leaderboard
- https://highload.fun/tasks/12/leaderboard
- https://highload.fun/tasks/15/leaderboard
Try it again using Claude Code and a subscription to Claude. It can run as a chat window in VS Code and Cursor too.
Sure, Copilot charges 3x tokens for using Opus 4.5, but, how were you still able to use up half the allocated tokens not even one week into January?
I thought using up 50% was mad for me (inline completions + opencode), that's even worse
No doubt I could give Opus 4.5 "build me a XYZ app" and it will do well. But day to day, when I ask it "build me this feature" it uses strange abstractions, and often requires several attempts on my part to do it in the way I consider "right". Any non-technical person might read that and go "if it works it works", but any reasonable engineer will know that that's not enough.
Starting back in 2022/2023:
- (~2022) It can auto-complete one line, but it can't write a full function.
- (~2023) Ok, it can write a full function, but it can't write a full feature.
- (~2024) Ok, it can write a full feature, but it can't write a simple application.
- (~2025) Ok, it can write a simple application, but it can't create a full application that is actually a valuable product.
- (~2025+) Ok, it can write a full application that is actually a valuable product, but it can't create a long-lived complex codebase for a product that is extensible and scalable over the long term.
It's pretty clear to me where this is going. The only question is how long it takes to get there.
I've worked on teams where multiple engineers argued about the "right" way to build something. I remember thinking that they had biases based on past experiences and assumptions about what mattered. It usually took an outsider to proactively remind them what actually mattered to the business case.
I remember cases where a team of engineers built something the "right" way but it turned out to be the wrong thing. (Well engineered thing no one ever used)
Sometimes hacking something together messily to confirm it's the right thing to be building is the right way. Then making sure it's secure, then finally paying down some technical debt to make it more maintainable and extensible.
Where I see real silly problems is when engineers over-engineer from the start before it's clear they are building the right thing, or when management never lets them clean up the code base to make it maintainable or extensible when it's clear it is the right thing.
There's always a balance/tension, but it's when things go too far one way or another that I see avoidable failures.
- How quickly is cost of refactor to a new pattern with functional parity going down?
- How does that change the calculus around tech debt?
If engineering uses 3 different abstractions in inconsistent ways that leak implementation details across components and duplicate functionality in ways that are very hard to reason about, that is, in conventional terms, an existential problem that might kill the entire business, as all dev time will end up consumed by bug fixes and dealing with pointless complexity, velocity will fall to nothing, and the company will stop being able to iterate.
But if claude can reliably reorganize code, fix patterns, and write working migrations for state when prompted to do so, it seems like the entire way to reason about tech debt has changed. And it has changed more if you are willing to bet that models within a year will be much better at such tasks.
And in my experience, claude is imperfect at refactors and still requires review and a lot of steering, but it's one of the things it's better at, because it has clear requirements and testing workflows already built around the existing behavior. Refactoring is definitely a hell of a lot faster than it used to be, at least on the few I've dealt with recently.
In my mind it might be kind of like thinking about financial debt in a world with high inflation, in that the debt seems like it might get cheaper over time rather than more expensive.
You’re talking like in the year 2026 we’re still writing code for future humans to understand and improve.
I fear we are not doing that. Right now, Opus 4.5 is writing code that later Opus 5.0 will refactor and extend. And so on.
Opus is great and definitely speeds up development even in larger code bases, and is reasonably good at matching coding style/standards to those of the existing code base.
In my opinion, the big issue is the relatively small context that quickly overwhelms the models when given a larger task on a large codebase.
For example, I have a largish enterprise grade code base with nice enterprise grade OO patterns and class hierarchies. There was a simple tech debt item that required refactoring about 30-40 classes to adhere to a slightly different class hierarchy. The work is not difficult, just tedious, especially as unit tests need to be fixed up.
I threw Opus at it with very precise instructions as to what I wanted it to do and how I wanted it to do it. It started off well but then disintegrated once it got overwhelmed by the sheer number of files it had to change. At some point it got stuck in some kind of error loop where one change it made contradicted another change, and it just couldn't work itself out. I tried stopping it and helping it out, but by this point the context was so polluted that it just couldn't see a way out. I'd say that once an LLM can handle more 'context' than a senior dev with good knowledge of a large codebase, LLMs will be viable in a whole new realm of development tasks on existing code bases. That 'too hard to refactor this/make this work with that' task will suddenly become viable.
Why do they need to be replaced? Programmers are in the perfect place to use AI coding tools productively. It makes them more valuable.
I'm very lucky that I rarely have to deal with other devs and I'm writing a lot of code from scratch using whatever is the latest version of the frameworks. I understand that gives me a lot of privileges others don't have.
They get those occasionally too, though. Depends on the company. In some software houses it's constant "greenfield projects", one after another. And even in companies with 1-2 pieces of main established software to maintain, there are all kinds of smaller utilities or pipelines needed.
>But day to day, when I ask it "build me this feature" it uses strange abstractions, and often requires several attempts on my part to do it in the way I consider "right".
In some cases that's legit. In other cases it's just "it did it well, but not how I'd have done it", which is often needless attachment to some particular style (often a point of contention between two human programmers too).
Basically, what FloorEgg says in this thread: "There are two types of right/wrong ways to build: the context specific right/wrong way to build something and an overly generalized engineer specific right/wrong way to build things."
And you can always not just tell it "build me this feature", but tell it (high level way) how to do it, and give it a generic context about such preferences too.
When I worked at Google, people rarely got promoted for doing that. They got promoted for delivering features or sometimes from rescuing a failing project because everyone was doing the former until promotion velocity dropped and your good people left to other projects not yet bogged down too far.
This sounds a lot like the old arguments around using compilers vs hand-writing asm. But now you can tell the LLM how you want to implement the changes you want. This will become more and more relevant as we try to maintain the code it generates.
But, for right now, another thing Claude's great at is answering questions about the codebase. It'll do the analysis and bring up reports for you. You can use that information to guide the instructions for changes, or just to help you be more productive.
This thing jumped into a giant JSF (yes, JSF) codebase and started fixing things with nearly zero guidance.
Even if it could build the perfect greenfield app, as it updates the app it needs to consider backwards compatibility and breaking changes. LLMs seem very far from being able to grow apps. I think this is because LLMs are trained on the final outcome of the engineering process, but not on the incremental sub-commit work of first getting a faked-out outline of the code running and then slowly building up that code until you have something that works.
This isn't to say that LLMs or other AI approaches couldn't replace software engineering some day, but they clearly aren't good enough yet, and the training sets they currently have access to are unlikely to provide the needed examples.
Then don't ask it to "build me this feature"; instead, lay out a software development process with a designated human in the loop where you want one and guard rails to keep it on track. Create a code review agent to look for and reject strange abstractions. Tell it what you don't like and it's really good at finding it.
I find Opus 4.5, properly prompted, to be significantly better at reviewing code than writing it, but you can just put it in a loop until the code it writes matches the review.
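That write-then-review loop is easy to sketch. With the model calls left abstract (`generate` and `review` here are stand-ins for real LLM API calls, not any actual SDK), the control flow is just:

```javascript
// Sketch of a write-until-review-passes loop. `generate(task, feedback)`
// returns a code draft; `review(task, code)` returns a list of issues,
// with [] meaning the reviewer model is satisfied. Both are placeholders
// for real LLM calls supplied by the caller.
async function writeUntilClean(task, generate, review, maxRounds = 5) {
  let code = await generate(task, /* feedback */ null);
  for (let round = 0; round < maxRounds; round++) {
    const issues = await review(task, code);
    if (issues.length === 0) return { code, rounds: round + 1 };
    // Feed the reviewer's objections back into the next draft.
    code = await generate(task, issues);
  }
  throw new Error("Review never converged; needs a human.");
}
```

The cap on rounds matters: when the writer and reviewer start contradicting each other, you want the loop to bail out to a human rather than burn tokens.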
Codex will regularly give me 1000+ line diffs where all my comments (I review every single line of what agents write) are basically nitpicks. "Make this shallow w/ early return, use | None instead of Optional", that sort of thing.
I do prompt it in detail though. It feels like I'm the person coming in with the architecture most of the time, AI "draws the rest of the owl."
Your hobby project, though, knock yourself out.
AI coding is a multiplier of writing speed, but it doesn't excuse you from planning out and mapping out features.
You can have reasonably engineered code if you get models to stick to well designed modules but you need to tell them.
The number of production applications that achieve this rounds to zero
I’ve probably managed 300 brownfield web, mobile, edge, datacenter, data processing and ML applications/products across DoD, B2B, consumer and literally zero of them were built in this way
So far, I'm not convinced, but let's take a look at what's fundamentally happening and why humans > agents > LLMs.
At its heart, programming is a constraint satisfaction problem.
The more constraints (requirements, syntax, standards, etc) you have, the harder it is to solve them all simultaneously.
New projects with few contributors have fewer constraints.
The process of “any change” is therefore simpler.
Now, undeniably:
1) Agents have improved the ability to solve constraints by iterating (generate, test, modify, etc.) over raw LLM output.
2) There is an upper bound (context size, model capability) on how many simultaneous constraints can be solved.
3) Most people have a better ability to do this than agents (including claude code using opus 4.5).
So, if you're seeing good results from agents, you probably have a smaller set of constraints than other people.
Similarly, if you're getting bad results, you can probably improve them by relaxing some of the constraints (consistent UI, number of contributors, requirements, standards, security requirements, splitting code into well defined packages).
This will make both agents and humans more productive.
The open question is: will models continue to improve enough to approach or exceed human level ability in this?
Are humans willing to relax the constraints enough for it to be plausible?
I would say the people currently clamoring about the end of human developers are cluelessly deceived by the "appearance of complexity", which does not match the "reality of constraints" in larger applications.
Opus 4.5 cannot do the work of a human on the code bases I've worked on. Hell, talented humans struggle to work on some of them.
…but that doesn't mean it doesn't work.
Just that, right now, the constraint set it can solve is not large enough to be useful in those situations.
…and increasingly we see low quality software where people care only about speed of delivery; again, lowering the bar in terms of requirements.
So… you know. Watch this space. I'm not counting on having a dev job in 10 years. If I do, it might be making a pile of barely working garbage.
…but I have one now, and anyone who thinks that this year people will be largely replaced by AI is probably poorly informed and has misunderstood the capabilities of these models.
There's only so low you can go in terms of quality.
LLMs are pretty good at picking up existing codebases. Even with cleared context they can do "look at this codebase and this spec doc that created it. I want to add feature x".
Copy-paste the bug report and watch it go.
With all due respect: a file converter for Windows is gluing a few Windows APIs together with the relevant codec.
Now, good luck working on a complex warehouse management application where you need extremely complex logic to sort the order of picking, assembling, and packing based on an infinite number of variables: weight, amazon prime priority, distribution centers, number and type of carts available, number and type of assembly stations available, different delivery systems and requirements for different delivery operators (such as GLE, DHL, etc), that has to work with N customers all requiring slightly different capabilities and flows, all having different printers and operations, etc, etc. And I ain't even scratching the surface of the business logic complexity (not even mentioning functional requirements) to avoid boring the reader.
Mind you, AI is still tremendously useful in the analysis phase, and can sort of help in some steps of the implementation one, but the number of times you can avoid looking thoroughly at the code for any minor issue or discrepancy is absolutely close to 0.
I've been building a somewhat-novel, complex, greenfield desktop app for 6 months now, conceived and architected by a human (me), visually designed by a human (me), implementation heavily leaning on mostly Claude Code but with Codex and Gemini thrown in the mix for the grunt work. I have decades of experience, could have built it bespoke in like 1-2 years probably, but I wanted a real project to kick the tires on "the future of our profession".
TL;DR I started with 100% vibe code simply to test the limits of what was being promised. It was a functional toy that had a lot of problems. I started over and tried a CLI version. It needed a therapist. I started over and went back to visual UI. It worked but was too constrained. I started over again. After about 10 complete start-overs in blank folders, I had a better vision of what I wanted to make, and how to achieve it. Since then, I've been working day after day, screen after screen, building, refactoring, going feature by feature, bug after bug, exactly how I would if I was coding manually. Many times I've reached a point where it feels "feature complete", until I throw a bigger dataset at it, which brings it to its knees. Time to re-architect, re-think memory and storage and algorithms and libraries used. Code bloated, and I put it on a diet until it was trim and svelte. I've tried many different approaches to hard problems, some of which LLMs would suggest that truly surprised me in their efficacy, but only after I presented the issues with the previous implementation. There's a lot of conversation and back and forth with the machine, but we always end up getting there in the end. Opus 4.5 has been significantly better than previous Anthropic models. As I hit milestones, I manually audit code, rewrite things, reformat things, generally polish the turd.
I tell this story only because I'm 95% there to a real, legitimate product, with 90% of the way to go still. It's been half a year.
Vibe coding a simple app that you just want to use personally is cool; let the machine do it all, don't worry about under the hood, and I think a lot of people will be doing that kind of stuff more and more because it's so empowering and immediate.
Using these tools is also neat and amazing because they're a force multiplier for a single person or small group who really understand what needs done and what decisions need made.
These tools can build very complex, maintainable software if you can walk with them step by step and articulate the guidelines and guardrails, testing every feature, pushing back when it gets it wrong, growing with the codebase, getting in there manually whenever and wherever needed.
These tools CANNOT one-shot truly new stuff, but they can be slowly cajoled and massaged into eventually getting you to where you want to go; like, hard things are hard, and things that take time don't get done for a while. I have no moral compunctions or philosophical musings about utilizing these tools, but IMO there's still significant effort and coordination needed to make something really great using them (and literally minimal effort and no coordination needed to make something passable)
If you're solo, know what you want, and know what you're doing, I believe you might see 2x, 4x gains in time and efficiency using Claude Code and all of his magical agents, but if your project is more than a toy, I would bet that 2x or 4x is applied to a temporal period of years, not days or months!
I am in a unique situation where I work with a variety of codebases over the week. I have had no problem at all utilizing Claude Code w/ Opus 4.5 and Gemini CLI w/ Gemini 3.0 Pro to make excellent code that is indisputably "the right way", in an extremely clear and understandable way, and that is maximally extensible. None of them are greenfield projects.
I feel like there's a bit of je ne sais quoi here, where people appeal to some indemonstrable essence these tools just can't achieve, and only the "non-technical" people are foolish enough not to realize it. I'm a pretty technical person (about 30 years of software development, up to staff engineer and then VP). I think they have reached a pretty high level of competence. I still audit the code and monitor their creations, but I don't think they're the oft-claimed "junior developer" replacement; instead they do the work I would have gotten from a very experienced, expert-level developer, except that rather than being an expert in one niche, they're experts in almost every niche.
Are they perfect? Far from it. It still requires a practitioner who knows what they're doing. But frequently on here I see people giving takes that sound like they last used some early variant of Copilot or something and think that remains state of the art. The rest of us are just accelerating our lives with these tools, knowing that pretending they suck online won't slow their ascent an iota.
Not in terms of knowledge. That was already phenomenal. But in its ability to act independently: to make decisions, collaborate with me to solve problems, ask follow-up questions, write plans and actually execute them.
You have to experience it yourself on your own real problems and over the course of days or weeks.
Every coding problem I was able to define clearly enough within the limits of the context window, the chatbot could solve, and these weren't easy ones. It wasn't just about writing and testing code. It also involved reverse engineering and cracking encoding-related problems. The most impressive part was how actively it worked on problems in a tight feedback loop.
In the traditional sense, I haven’t really coded privately at all in recent weeks. Instead, I’ve been guiding and directing, having it write specifications, and then refining and improving them.
Curious how this will perform in complex, large production environments.
Yes, feel free to blame me for the fact that these aren’t very business-realistic.
How do you stop it from over-engineering everything?
When I use Claude Code I find that it *can* add a tremendous amount of capability thanks to its ability to see my entire codebase at once, but the issue is that if I'm doing something where seeing my entire codebase would help, it blasts through my quota too fast. And if I'm tightly scoping it, it's just as easy and faster for me to use the website.
Because of this I've shifted back to the website. I find that I get more done faster that way.
This is basically all my side projects.
All the LLM coded projects I've seen shared so far[1] have been tech toys though. I've watched things pop up on my twitter feed (usually games related), then quietly go off air before reaching a gold release (I manually keep up to date with what I've found, so it's not the algorithm).
I find this all very interesting: LLMs don't change the fundamental drives needed to build successful products. I feel like I'm observing the TikTokification of software development. I don't know why people aren't finishing. Maybe they stop when the "real work" kicks in. Or maybe they hit the limits of what LLMs can do (so far). Maybe they jump to the next idea to keep chasing the rush.
Acquiring context requires real work, and I don't see a way forward to automating that away. And to be clear, context is human needs, i.e. the reasons why someone will use your product. In the game-development world, it's very difficult to overstate how much work needs to be done to create a smooth, enjoyable experience for the player.
While anyone may be able to create a suite of apps in a weekend, I think very few of them will have the patience and time to maintain them (just like software development before LLMs! i.e. Linux, open source software, etc.).
[1] yes, selection bias. There are A LOT of AI devs just marketing their LLMs. Also it's DEFINITELY too early to be certain. Take everything I'm saying with a one-pound grain of salt.
real people get fed up with debating the same tired "omg new model 1000x better now" posts/comments from the astroturfers, the shills, and their bots each time OpenAI shits out a new model
(article author is a Microslop employee)
Sharing how you're using these tools is quite a lot of work!
I think it's also rewarding to just be able to build something for yourself, and one benefit of scratching your own itch is that you don't have to go through the full effort of making something "production ready". You can just build something that's tailored specifically to the problem you're trying to solve without worrying about edge cases.
Which is to say, you're absolutely right :).
I interpret it more as spooked silence
Stopping there is just fine if you're doing it as a hobby. I love to do this to test out isolated ideas. I have dozens of RPGs in this state, just to play around with different design concepts from technical to gameplay.
Given the enthusiasm of our ruling class towards automating software development work, it may make sense for a software engineer to publicly signal how on board with it they are as a professional.
But, I've seen stranger stuff throughout my professional life: I still remember people enthusiastically defending EJB 2.1 and xdoclet as perfectly fine ways of writing software.
I hacked together a Swift tool to replace a Python automation I had, merged an ARM JIT engine into a 68k emulator, and even got a very decent start on a synth project I’ve been meaning to do for years.
What has become immensely apparent to me is that even gpt-5-mini can create decent Go CLI apps provided you write down a coherent spec and review the code as if it was a peer’s pull request (the VS Code base prompts and tooling steer even dumb models through a pretty decent workflow).
GPT 5.2 and the codex variants are, to me, every bit as good as Opus but without the groveling and emojis - I can ask it to build an entire CI workflow and it does it in pretty much one shot if I give it the steps I want.
So for me at least this model generation is a huge force multiplier (but I’ve always been the type to plan before coding and reason out most of the details before I start, so it might be a matter of method).
I had to dig through source code to confirm whether those features actually existed. They don't, so the CLI tools GPT recommended aren't actually applicable to my use case.
Yesterday, it hallucinated features of WebDav clients, and then talked up an abandoned and incomplete project on GitHub with a dozen stars as if it was the perfect fit for what I was trying to do, when it wasn't.
I only remember these because they're recent and CLI related, given the topic, but there are experiences like this daily across different subjects and domains.
I think the following things are true now:
- Vibe Coding is, more than ever, "autopilot" in the aviation sense, not the colloquial sense. You have to watch it, you are responsible, and the human has to handle takeoff/landing (the hard parts), but it significantly eases and reduces risk on the bulk of the work.
- The gulf of developer experience between today's frontier tooling and six months ago is huge. I pushed hard to understand and use these tools throughout last year, and spent months discouraged--back to manual coding. Folks need to re-evaluate by trying premium tools, not free ones.
- Tooling makers have figured out a lot of neat hacks to work around the limitations of LLMs to make it seem like they're even better than they are. Junie integrates with your IDE; Antigravity has multiple agents maintaining background intel on your project and priorities across chats. Antigravity also compresses contexts and starts new ones without you realizing it, calls out to sub-agents to avoid context pollution, and uses other tricks to auto-manage context.
- Unix tools (sed, grep, awk, etc.) and the git CLI (ls-tree, show, --stat, etc.) have been a huge force-multiplier, as they keep the context small compared to raw ingestion of an entire file, allowing the LLMs to get more work done in a smaller context window.
- The people who hire programmers are still not capable of Vibe Coding production-quality web apps, even with all these improvements. In fact, I believe today this is less of a risk than I feared 10 months ago. These are advanced tools that need constant steering, and a good eye for architecture, design, developer experience, test quality, etc. is the difference between my vibe coded Ruby [0] (which I heavily stewarded) and my vibe coded Rust [1] (I don't even know what borrow means).
[0]: https://git.sr.ht/~kerrick/ratatui_ruby/tree/stable/item/lib
[1]: https://git.sr.ht/~kerrick/ratatui_ruby/tree/stable/item/ext...
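To make the "small context via Unix tools" point concrete, here is a minimal sketch of the kind of commands an agent can run instead of ingesting whole files. The repo, file names, and contents below are invented purely for illustration (a throwaway temp repo), not taken from the projects linked above:

```shell
# Build a tiny throwaway repo so the commands below have something to inspect.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
printf 'line one\nline two\nline three\n' > app.txt
git add . && git -c user.email=a@example.com -c user.name=agent commit -qm "init"

git ls-tree -r --name-only HEAD   # list tracked files without their contents
git show --stat HEAD              # summarize the last commit, not the full diff
grep -n "two" app.txt             # jump straight to the relevant line number
sed -n '2,3p' app.txt             # read only a slice of the file
```

Each command returns a handful of lines where reading the raw files would return hundreds, which is exactly why they stretch a fixed context window further.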
1) It often overcomplicates things for me. After I refactor its code, it's usually half the size and much more readable. It often adds unnecessary checks or mini-features 'just in case' that I don't need.
2) On the other hand, almost every function it produces has at least one bug or ignores at least one instruction. However, if I ask it to review its own code several times, it eventually finds the bugs.
I still find it very useful, just not as a standalone programming agent. My workflow is that ChatGPT gives me a rough blueprint and I iterate on it myself, I find this faster and less error-prone. It's usually most useful in areas where I'm not an expert, such as when I don't remember exact APIs. In areas where I can immediately picture the entire implementation in my head, it's usually faster and more reliable to write the code myself.
This experience for me is current but I do not normally use Opus so perhaps I should give it a try and figure out if it can reason around problems I myself do not foresee (for example a browser JS API quirk that I had never seen).
Time will tell what happens, but if programming becomes "prompt engineering", I'm planning on quitting my job and pivoting to something else. It's nice to get stuff working fast, but AI just sucks the joy out of building for me.
Trying to not feel the pressure/anxiety from this, but every time a new model drops there is this tiny moment where I think "Is it actually different this time?"
I hear you, but I think many companies will change the role; you'll get the technical ownership plus big chunks of the data/product/devops responsibility. I'm speculating, but I think one person can take that on with the new tools and deliver tremendous value. I don't know what they'll call this new role, though; we'll see.
This reminds me of that example where someone asked an agent to improve a codebase in a loop overnight and they woke up to 100,000 lines of garbage [0]. Similarly you see people doing side-by-side of their implementation and what an AI did, which can also quite effectively show how AI can make quite poor architecture decisions.
This is why I think the "plan modes" and spec-driven development are so effective for agents: they help avoid one of their main weaknesses.
So you gave it a poorly defined task, and it failed?
I wouldn't be surprised to find out that they will find issues infinitely, if looped with fixes.
Have you tried the planning mode? Ask it to review the codebase and identify defects, but don't let it make any changes until you've discussed each one or each category and planned out what to do to correct them. I've had it refactor code perfectly, but only when given examples of exactly what you want it to do, or given clear direction on what to do (or not to do).
I appreciate the spirited debate and I agree with most of it - on both sides. It's a strange place to be where I think both arguments for and against this case make perfect sense. All I have to go on then is my personal experience, which is the only objective thing I've got. This entire profession feels stochastic these days.
A few points of clarification...
1. I don't speak for anyone but myself. I'm wrong at least half the time so you've been warned.
2. I didn't use any fancy workflows to build these things. Just used dictation to talk to GitHub Copilot in VS Code. There is a custom agent prompt toward the end of the post I used, but it's mostly to coerce Opus 4.5 into using subagents and context7 - the only MCP I used. There is no plan, implement - nothing like that. On occasion I would have it generate a plan or summary, but no fancy prompt needed to do that - just ask for it. The agent harness in VS Code for Opus 4.5 is remarkably good.
3. When I say AI is going to replace developers, I mean that in the sense that it will do what we are doing now. It already is for me. That said, I think there's a strong case that we will have more devs - not less. Think about it - if anyone with solid systems knowledge can build anything, the only way you can ship more differentiating features than me is to build more of them. That is going to take more people, not more agents. Agents can only scale as far as the humans who manage them.
New account because now you know who I am :)
Once Opus "finished", how did you validate and give it feedback it might not have access to (like iPhone simulator testing)?
A JavaScript interpreter written in Python? How about a WebAssembly runtime in Python? How about porting BurntSushi's absurdly great Rust optimized string search routines to C and making them faster?
And these are mostly just casual experiments, often run from my phone!
I'm assuming this refers to the python port of Bellard's MQJS [1]? It's impressive and very useful, but leaving out the "based on mqjs" part is misleading.
How did it do? :-)
It produced tests, then wrote the interpreter, then ran the tests and worked until all of them passed. I was genuinely surprised that it worked.
Meanwhile half a year to a year ago I could already point whatever model was du jour at the time at pychromecast and tell it repeatedly "just convert the rest of functionality to Swift" and it did it. No idea about the quality of code, but it worked alongside with implementations for mDNS, and SwiftUI, see gif/video here: https://mastodon.nu/@dmitriid/114753811880082271 (doesn't include chromecast info in the video).
I think agents have become better, but models likely almost entirely plateaued.
I'm fine with contributed AI-generated code if someone whose skills I respect is willing to stake their reputation on that code being good.
I just cutpasted a technical spec I wrote 22 years ago I spent months on for a language I never got around to building out, Opus zero-shotted a parser, complete with tests and examples in 3 minutes. I cutpasted the parser into a new session and asked it to write concept documentation and a language reference, and it did. The best part is after asking it to produce uses of the language, it's clear the aesthetics are total garbage in practice.
Told friends for years long in advance that we were coal miners, and I'll tell you the same thing. Embrace it and adapt
It is a well known fact that people advance their tech careers by building something new and leaving maintenance to others. Google is usually mentioned.
By which I mean, our industry does a piss poor job of rewarding responsibility and care.
I'll write the code, it can help me explore options, find potential problems and suggest tests, but I'll write the code.
This weekend I explained to Claude what I wanted the app to do, and then gave it the crappy code I wrote 10 years ago as a starting point.
It made the app exactly as I described it the first time. From there, now that I had a working app that I liked, I iterated a few times to add new features. Only once did it not get it correct, and I had to tell it what I thought the problem was (that it made the viewport too small). And after that it was working again.
I did in 30 minutes with Claude what I had tried to do in a few hours previously.
Where it got stuck however was when I asked it to convert it to a screensaver for the Mac. It just had no idea what to do. But that was Claude on the web, not Claude Code. I'm going to try it with CC and see if I can get it.
I also did the same thing with a Chrome plugin for Gmail. Something I've wanted for nearly 20 years, and could never figure out how to do (basically sort by sender). I got Opus 4.5 to make me a plugin to do it and it only took a few iterations.
I look forward to finally getting all those small apps and plugins I've wanted forever.
Yes, Opus 4.5 seems great, but most of the time it tries to vastly overcomplicate a solution. Its answer will be 10x harder to maintain and debug than the simpler solution a human would have created by thinking about the constraints of keeping code working.
And it turns out the quality of output you get from both the humans and the models is highly correlated with the quality of the specification you write before you start coding.
Letting a model run amok within the constraints of your spec is actually great for specification development! You get instant feedback of what you wrongly specified or underspecified. On top of this, you learn how to write specifications where critical information that needs to be used together isn't spread across thousands of pages - thinking about context windows when writing documentation is useful for both human and AI consumers.
Stuff that seems basic, but that I haven't always been able to count on in my teams' "production" code.
Over time, I imagine even cloud providers, app stores etc can start doing automated security scanning for these types of failure modes, or give a more restricted version of the experience to ensure safety too.
I predict in 2026 we're going to see agents get better at running their own QA, and also get better at not just disabling failing tests. We'll continue to see advancements that will improve quality.
Maintained and debugged by whom? It's just going to be Opus 4.5 (and 4.6... and 5... etc.) maintaining and debugging it. And I don't think it minds, and I also think it will be quite good at it.
something like code-simplifier is surprisingly useful (as is /review)
Regardless of how much you value Claude Code technically, there is no denying that it has/will have huge impact. If technology knowledge and development are commoditised and distributed via subscription, huge societal changes are going to happen. Imagine what will happen to Ireland if Accenture dissolves, or what will happen to the millions of Indians when IT outsourcing becomes economically irrelevant. Will Seattle become the new Detroit after Microsoft automates Windows maintenance? What about the hairdressers, cooks, lawyers, etc. who provide services for IT labourers/companies in California?
Lots of people here (especially Anthropic-adjacent) like to extrapolate the trends and draw conclusions up to the point where they say that white-collar labourers will not be needed anymore. I would like these people to have the courage to take this one step further and connect that conclusion with the housing crisis, the loneliness epidemic, college debt, and the job-market crisis for people under 30.
It feels like we are diving head first into societal crisis of unparalleled scale and the people behind the steering wheel are excited to push the accelerator pedal even more.
Connect the world with reliable internet, then build a high tech remote control facility in Bangladesh and outsource plumbing, electrical work, housekeeping, dog watching, truck driving, etc etc
No AGI necessary. There’s billions of perfectly capable brains halfway around the world.
In other words: only one side is even fighting the war. The other one is either cheering the tsunami on or fretting about how their beachside house will get wrecked without making any effort to save themselves.
This is the sort of collective agency that even hundreds of thousands of dollars in annual wages/other compensation in American tech hubs gets us. Pathetic.
It will have impact on me in the long run, sure, it will transform my job, sure, but I'm confident my skills are engineering-related, not coding-related.
I mean, even if it forces me out of the job entirely, so be it, I can't really do anything if the status quo changes, only adapt.
If anything this example shows that these cli tools give regular devs much higher leverage.
There's a lot of software labor that is like, go to the lowest cost country, hire some mediocre people there and then hire some US guy to manage them.
That's the biggest target of this stuff, because now that US guy can just get code of equal or higher quality and output without the coordination cost.
But unless we get to the point where you can do what I call "hypercode" I don't think we'll see SWEs as a whole category die.
Just like we don't understand assembly but still need technical skills when things go wrong, there's always value in low level technical skills.
This is also my take. When the printing press came out, I bet there were scribes who thought, "holy shit, there goes my job!" But I bet there were other scribes who thought, "holy shit, I don't have to do this by hand any more?!"
It's one thing when something like weaving or farming gets automated. We have a finite need for clothes and food. Our desire for software is essentially infinite, or at least, it's not clear we have anywhere close to enough of it. The constraint has always been time and budget. Those constraints are loosening now. And you can't tell me that when I am able to wield a tool that makes me 10X more productive that that somehow diminishes my value.
I think for a while people have been talking about the fact that as all development tools have gotten better - the idea that a developer is a person who turns requirements into code is dead. You have to be able to operate at a higher level, be able to do some level of work to also develop requirements, work to figure out how to make two pieces of software work together, etc.
But the point is: obviously, at the extreme end, 1 CTO can't run Google, and probably neither can, say, 1 PM or engineer per product. But what is the mental load people can now take on? Google may start hiring fewer engineers (or maybe it becomes more cutthroat: hire the same number of engineers but keep them around much more briefly, brutal up-or-out).
But essentially we're talking about complexity and mental load. And so maybe it's essentially the same number of teams, because teams exist because they're the right size, but each team is a lot smaller.
This project would have taken me years of specialization and research to do right. Opus's strength has been the ability to both speak broadly and also drill down into low-level implementations.
I can express an intent, and have some discussion back and forth around various possible designs and implementations to achieve my goals, and then I can be preparing for other tasks while Opus works in the background. I ask Opus to loop me in any time there are decisions to be made, and I ask it to clearly explain things to me.
Contrary to losing skills, I feel that I have rapidly gained a lot of knowledge about low-level systems programming. It feels like pair programming with an agentic model has finally become viable.
I will be clear though, it takes the steady hand of an experience and attentive senior developer + product designer to understand how to maintain constraints on the system that allow the codebase to grow in a way that is maintainable on the long-term. This is especially important, because the larger the codebase is, the harder it becomes for agentic models to reason holistically about large-scale changes or how new features should properly integrate into the system.
If left to its own devices, Opus 4.5 will delete things, change specification, shirk responsibilities in lieu of hacky band-aids, etc. You need to know the stack well so that you can assist with debugging and reasoning about code quality and organization. It is not a panacea. But it's ground-breaking. This is going to be my most productive year in my life.
On the flip side though, things are going to change extremely fast once large-scale, profitable infrastructure becomes easily replicable, and spinning up a targeted phishing campaign takes five seconds and a walk around the park. And our workforce will probably start shrinking permanently over the next few years if progress does not hit a wall.
Among other things, I do predict we will see a resurgence of smol web communities now that independent web development is becoming much more accessible again, closer to how it was when I first got into it back in the early 2000s.
However, Opus 4.5 is incredible when you give it everything it needs: a direction, what you have versus what you want. It will make it work, really, it will work. The code might be ugly, undesirable, and only work for that one condition, but with further prompting you can evolve it and produce something you can be proud of.
Opus is only as good as the user and the tools the user gives to it. Hmm, that's starting to sound kind-of... human...
I now write very long specifications and this helps. I haven't figured out a bulletproof workflow, I think that will take years. But I often get just amazing code out of it.
I read If Anyone Builds It Everyone Dies over the break. The basic premise was that we can't "align" AI so when we turn it loose in an agent loop what it produces isn't necessarily what we want. It may be on the surface, to appease us and pass a cursory inspection, but it could embed other stuff according to other goals.
On the whole, I found it a little silly and implausible, but I'm second guessing parts of that response now that I'm seeing more people (this post, the Gas Town thing on the front page earlier) go all-in on vibe coding. There is likely to be a large body of running software out there that will be created by agents and never inspected by humans.
I think a more plausible failure mode in the near future (next year or two) is something more like a "worm". Someone building an agent with the explicit instructions to try to replicate itself. Opus 4.5 and GPT 5.2 are good enough that in an agent loop they could pretty thoroughly investigate any system they land on, and try to use a few ways to propagate their agent wrapper.
Is it possible to create an obfuscated quine that exhibits stable detection-avoiding behavior on every frontier model simultaneously, as well as on an old-school classifier and/or GPT-3 era LLM fine-tuned just for worm detection? One incapable of even thinking about what it's seeing, and being persuaded to follow its subtle propagation logic? I'm not sure that the answer is yes.
The larger issue to me is less that an LLM can propagate in generated code undetected, but rather that an attacker's generated code may soon be able to execute a level of hyper-customized spear-phishing-assisted attack at scale, targeting sites without large security teams - and that it will be hitting unintentional security flaws introduced by those smaller companies' vibe code. Who needs a worm when you have the resources of a state-level attacker at your fingertips, and numerous ways to monetize? The balance of power is shifting tremendously towards black hats, IMO.
And everything worked really well until they switched chip set.
At which point the same model failed entirely. Upon inspection it turned out the AI model had learned that overloading particular registers would cause such an electrical charge buildup that transistors on other pathways would be flipped.
And it was doing this in a coordinated manner in order to get the results it wanted lol.
I can't find any references in my very cursory searches, but your comment reminded me of the story.
Most RCEs, 0-days, and whatnot are not due to the NSA hiding behind the "Jia Tan" pseudonym to try to backdoor all the SSH servers on all the systemd [1] Linuxes in the world: they're just programmer errors.
I think accidental security holes with LLMs are way, way, way more likely than actual malicious attempts.
And with the amount of code spouted by LLMs, it is indeed an issue, as is the lack of auditing.
[1] I know, I know: it's totally unrelated to systemd. Yet only systems using systemd would have been pwned. If you're pro-systemd you've got your point of view on this but I've got mine and you won't change my mind so don't bother.
Can't wait for when the competition catches up with Claude Code, especially the open source/weights Chinese alternatives :)
It means that it is going to be as easy to create software as it is to create a post on TikTok, and making your software commercially successful will be basically the same task (with the same uncontrollable dynamics) as whether or not your TikTok post goes viral.
My biggest issue was limitations around how Claude Code could change Xcode settings and verify design elements in the simulator.
i.e. well-known paths based on training data.
What's never posted is someone building something that solves a real problem in the real world, something that deals with messy data and interfaces.
I like AI to do the common routine tasks that I don't like doing, like applying Tailwind styles, but being a renter and faking productivity? That's not it.
https://thermal-bridge.streamlit.app/
Disclaimer: I'm not a programmer or software engineer. I have a background in physics and understand some scripting in python and basic git. The code is messy at the moment because I explored/am still exploring to port it to another framework/language
I would love to hear from some freelance programmers how LLMs have changed their work in the last two years.
"I decided to make up for my dereliction of duties by building her another app for her sign business that would make her life just a bit more delightful - and eliminate two other apps she is currently paying for"
OP used Opus to re-write existing applications that his wife was paying for. So now any time you make a commercial app and try to sell it, you're up against everyone with access to Opus or similar tooling who can replicate your application, exactly to their own specifications.
She doesn’t need to hire anyone
It never duplicates code, implements something again and leaves the old code around, breaks my conventions, hallucinates, or tells me it's done when the code doesn't even compile, which Sonnet 4.5 and Opus 4.1 did all the time.
I’m wondering if this had changed with Opus 4.5 since so many people are raving about it now. What’s your experience?
Claude - fast, to the point but maybe only 85% - 90% there and needs closer observation while it works
GPT-x-high (or xhigh) - you tell it what to do, it will work slowly but precisely, and the solution is exactly what you want. 98% there, needs no supervision
In that it's using a non-deterministic machine to build a deterministic one.
Which gives all the benefits of determinism in production, with all the benefits of non-deterministic creativity in development.
Imho, Anthropic is pretty smart in picking it as a core focus.
Yeah, GRAMMAR
For all the wonderment of the article, tripping up on a penultimate word that was supposedly checked by AI suddenly calls into question everything that went before...
I'm not so sure... I mean, it's true that regardless of whether you are a beginner, junior, or 'senior', if you say "build me an Instagram clone" Opus 4.5 will probably do a decent job. I think the skill still lies in understanding architecture, and just knowing where pitfalls and problems can arise, or even making some important 'abstraction cut' prompts. I think applications can still grow to a point where you need to prompt at specific domains only, or the model will fail to do everything you want it to, especially if you give it massive 'fix this, this, this, and that too' prompts.
There is no fixed truth regarding what an "app" is, does, or looks like. Let alone the device it runs on or the technology it uses.
But to an LLM, there are only fixed truths (and in my experience, only three or four possible families of design for an application).
Opus 4.5 produces correct code more often, but when the human at the keyboard is trying to avoid making any engineering decisions, the code will continue to be boring.
Why would you not want your code to be boring?
It is best in its class, but trips up frequently with complicated engineering tasks involving dynamic variables. Think: Browser page loading, designing for a system where it will "forget" to account for race conditions, etc.
Still, this gets me very excited for the next generation of models from Anthropic for heavy tasks.
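To make the race-condition point concrete, here is a minimal sketch (not from the comment; all names and the asyncio setting are my own illustration) of the kind of lost-update race models tend to "forget" about: two tasks read shared state, await something like a page load, then write back a stale value.

```python
import asyncio

# A toy version of the race: each task reads a shared counter, yields
# control (as if awaiting a fetch or page load), then writes back.
async def unsafe_increment(state, key):
    current = state[key]
    await asyncio.sleep(0)       # control yields here, like awaiting I/O
    state[key] = current + 1     # many tasks may write back the same stale value

async def safe_increment(state, key, lock):
    async with lock:             # serialize the whole read-modify-write
        current = state[key]
        await asyncio.sleep(0)
        state[key] = current + 1

async def demo():
    unsafe = {"n": 0}
    await asyncio.gather(*(unsafe_increment(unsafe, "n") for _ in range(10)))
    safe = {"n": 0}
    lock = asyncio.Lock()
    await asyncio.gather(*(safe_increment(safe, "n", lock) for _ in range(10)))
    return unsafe["n"], safe["n"]
```

Running `asyncio.run(demo())` shows the unsafe counter losing updates while the locked version reaches 10; the bug only appears once control can yield between the read and the write, which is exactly the dynamic-timing territory described above.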
I'm not shaming anyone, but I personally need to know whether my sentiment is correct or whether I just don't know how to use LLMs.
Can vibe-coder gurus create an operating system from scratch that competes with Linux, and make the LLM generate code that basically isn't Linux, given that LLMs are trained on said source code?
Also all this on $20 plan. Free and self host solution will be best
Yeah, they can't do that.
Yet you expect $20 of computing to do it.
No.
Vibe-coding, in the original sense where you don't bother with code reviews, the code quality and speed are both insufficient for that.
I experimented with them just before Christmas. I do not think my experiments were fully representative of the entire range of tasks needed for replacing Linux: Having them create some web apps, python scripts, a video game, a toy programming language, all beat my expectations given the METR study. While one flaw with the METR study is the small number of data points at current 50% successful task length, I take my success as evidence I've been throwing easy tasks at the LLM, not that the LLM is as good as it looks like to me.
However, assume for the moment that they were representative tasks:
For quality, what I saw just before Christmas was the equivalent of someone with a few years' experience under their belt, the kind of person who is just about to stop being a junior and get a pay rise. For speed, $20 of Claude Code will get you around 10 sprints' equivalent to that level of human's output.
"Junior about to get a pay rise" isn't high enough quality to let loose unchecked on a project that could compete with Linux, and even if it was, 10 sprints/month is too slow. Even if you increase the spend on LLMs to match the cost of a typical US junior developer, you're getting an army of 1500 full-time (40h/week) juniors, and Linux is, what, some 50-100 million developer-hours, so it would still take something like 16-32 years of calendar time (or, equivalently, order-of 1.2-2.5 million dollars) even if you could perfectly manage all those agents.
If you just vibe code, you get some millions of dollars worth of junior grade technical debt. There's cases where this is fine, an entire operating system isn't one of them.
> Also all this on $20 plan. Free and self host solution will be best
IMO unlikely, but not impossible.
A box with 10x the resources of your personal computer may be c. 10x the price, give or take.
Assuming electricity is negligible (which, today, hah!): if any given person is using that server only during a normal 40-hour work week, that's a 25% utilisation rate; therefore, if it can be rented out to people in other timezones or where the weekend is different, the effective cost for that 10x server is only 2.5x.
When electricity price is a major part of the cost, and electricity prices vary a lot from one place to another, then it can be worth remote-hosting even when you're the only user.
That said, energy efficiency of compute is still improving, albeit not as rapidly as Moore's Law used to, and if this trend continues then it's plausible that we get performance equivalent to current SOTA hosted models running on high-end smartphones by 2032. Assuming WW3 doesn't break out between the US and China after the latter tries to take Taiwan and all their chip factories, or whatever
A hammer is a really bad screwdriver. My car is really bad at refrigerating food. If you ask an LLM for something outside its training data, it doesn't do a very good job. So don't do that! All of the code on the Internet is a pretty big dataset, though, so maybe Claude could do an operating system that isn't Linux and competes with it by laundering the FreeBSD kernel source through the training process.
And you're barely even willing to invest any money into this? The first Apple computer cost $4,000 or so. You want the bleeding edge of technology delivered to the smartphone in your hand, for $20, or else it's a complete failure? Buddy, your sentiment isn't the issue, it's your attitude.
I'm not here spouting ridiculous claims like AI is going to cure all the different kinds of cancer by the end of 2027; I just want to say that endlessly contrarian naysayers are just as boorish as the sycophantic hype AIs they're opposing.
https://chronick.github.io/typing-arena/
With another more substantial personal project (Eurorack module firmware, almost ready to release), I set up Claude Code to act as a design assistant, where I'd give it feedback on current implementation, and it would go through several rounds of design/review/design/review until I honed it down. It had several good ideas that I wouldn't have thought of otherwise (or at least would have taken me much longer to do).
Really excited to do some other projects after this one is done.
I will note that my experience varies slightly by language though. I’ve found it’s not as good at typescript.
If I had done the same thing in the pre-LLM era, it would have taken me months.
Like that first one where he writes a right-click handler, off the top of my head I have no idea how I would do that, I could see it taking a few hours to just set up a dev environment, and I would probably overthink the research. I was working on something where Junie suggested I write a browser extension for Firefox and I was initially intimidated at the thought but it banged out something in just a few minutes that basically worked after the second prompt.
Similarly the Facebook autoposter is completely straightforward to code but it can be so emotionally exhausting to fight with authentication APIs, a big part of the coding agent story isn't just that it saves you time but that they can be strong when you are emotionally weak.
The one which seems the hardest is the one that does the routing and travel time estimation which I'd imagine is calling out to some API or library. I used to work at a place that did sales territory optimization and we had one product that would help work out routes for sales and service people who travel from customer to customer and we had a specialist code that stuff in C++ and he had a very different viewpoint than me, he was good at what he did and could get that kind of code to run fast but I wouldn't have trusted him to even look at applications code.
I think the gap between Sonnet 4.5 and Opus is pretty small, compared to the absolute chasm between like gpt-4.1, grok, etc. vs Sonnet.
(it was for single html/js PWA to measure and track heart rate)
Opus seems to go less deep, does its own thing, and doesn't follow instructions exactly, EVEN IF I WRITE IN ALL CAPS. With Sonnet 4.5 I can understand everything the author is saying. Maybe Opus is optimised for Claude Code and Sonnet works best on the web.
As many other commentators have said, individual results vary extremely widely. I'd love to be able to look at the footage of either someone who claims a 10x productivity increase, or someone who claims no productivity increase, to see what's happening.
> Disclaimer: This post was written by a human and edited for spelling, grammer by Haiku 4.5
I have used AskUserQuestionTool to complete my initial spec. And then Opus 4.5 created the tool according to that extensive and detailed spec.
It appeared to work out of the box.
Boy, was the code horrific: unnecessary recursions, unused variables, data structures built with no usage, deep branch nesting, and weird code that is hard to understand because of how illogical it is.
And yes, it was broken on many levels and did not and could not do the job properly.
I then had to rewrite the tool from scratch, and overall I have definitely spent more time spec'ing and understanding Claude's code than if I had just written the tool from scratch in the first place.
Then I tried again for a small tool I needed to run codesign in parallel: https://github.com/egorFiNE/codesign-parallel
Same thing. Same outcome, had to rewrite.
IMO, our jobs are safe. It's our ways of working that are changing. Rapidly.
The thing about being an engineer in a commercial capacity is maintaining and enhancing an existing program or software system that has been developed over years by multiple people (including those who have already left), and doing so in a way that does not cause any outages, bugs, or breakage of existing functionality.
While the blog post mentions the ability to use AI to generate new applications, it does not talk about maintaining one over a longer period of time. For that, you would need real users, real constraints, and real feature requests which preferably pay you so you can prioritize them.
I would love to see blog posts where, for example, a PM is able to add features for a period of one month without breaking production, but it would be a very costly experiment.
If you tell it to use linters and other kinds of code analysis tools it takes it to the next level. Ruff for Python or Clippy for Rust for example. The LLM makes so much code so fast and then passes it through these tools and actually understands what the tools say and it goes and makes the changes. I have created a whole tool chain that I put in a pre commit text file in my repos and tell the LLM something like "Look in this text file and use every tool you see listed to improve code quality".
That being said, I doubt it can turn a non-dev into a dev yet; it just makes competent devs way better.
I still need to be able to understand what it is doing and what the tools are for to even have a chance to give it the guardrails it should follow.
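The "text file of tools" workflow described above might look something like this sketch. The specific tool commands (`ruff`, `mypy`) are my assumptions, not the commenter's actual list, and the runner is a generic illustration rather than a real pre-commit hook:

```python
import shutil
import subprocess

# Hypothetical contents of the per-repo tool list the commenter describes:
# one quality-check command per line.
TOOLS_FILE_CONTENT = """\
ruff check .
ruff format --check .
mypy .
"""

def run_quality_tools(tools_text: str, dry_run: bool = True) -> list[str]:
    """Parse the tool list and run (or just report) each command in order."""
    commands = [line.strip() for line in tools_text.splitlines() if line.strip()]
    results = []
    for cmd in commands:
        exe = cmd.split()[0]
        if shutil.which(exe) is None:
            # Skip tools that aren't installed instead of crashing mid-run.
            results.append(f"skipped (not installed): {cmd}")
        elif dry_run:
            results.append(f"would run: {cmd}")
        else:
            proc = subprocess.run(cmd.split(), capture_output=True)
            results.append(f"{cmd} -> exit {proc.returncode}")
    return results
```

The point of keeping the list in a plain text file, as the comment suggests, is that both humans and the LLM can read and extend the same source of truth for what "code quality" means in that repo.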
This could pretty much be the beginning of the end of everything, if misaligned models wanted to they could install killswitches everywhere. And you can't trust security updates either so you are even more vulnerable to external exploits.
It's really scary, I fear the future, it's going to be so bad. It's best to not touch AI at all and stay hidden from it as long as possible to survive the catastrophe or not be a helping part of it. Don't turn your devices into a node of a clandestine bot net that is only waiting to conspire against us.
> because models like to write code WAY more than they like to delete it
Yeah, this is the big one. I haven't figured it out either. New or changing requirements are almost always implemented as a flurry of if/else branches all over the place, rather than taking the time to step back and reimagine a cohesive integration of old and new. I've had occasional luck asking for this explicitly, but far more frequently they'll respond with recommendations that are more mechanical (e.g. "you could extract a function for these two lines of code that you repeat twice") than architectural in nature. (I still find pasting a bunch of files into the chat interface and iterating on refinements conversationally to be faster and produce better results.)
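The "flurry of if/else branches" failure mode can be made concrete with a toy pricing example (entirely hypothetical, not from the thread): the first version bolts each new requirement onto one function as another nested branch, while a restructured version turns each rule into a small function so new requirements become new entries rather than deeper nesting.

```python
# Sprawl: each new requirement lands as another branch in one function.
def price_v1(base, user):
    if user.get("member"):
        if user.get("years", 0) > 5:
            base = base * 0.8
        else:
            base = base * 0.9
    if user.get("coupon"):
        if user.get("member"):
            base = base - 5
        else:
            base = base - 2
    return base

# Restructured: each rule is a self-contained function applied in order.
def member_discount(price, user):
    if not user.get("member"):
        return price
    return price * (0.8 if user.get("years", 0) > 5 else 0.9)

def coupon_discount(price, user):
    if not user.get("coupon"):
        return price
    return price - (5 if user.get("member") else 2)

RULES = [member_discount, coupon_discount]

def price_v2(base, user):
    for rule in RULES:
        base = rule(base, user)
    return base
```

Both versions compute the same prices; the difference is that adding a third discount to `price_v1` means threading another branch through existing conditionals, while in `price_v2` it means appending one function to `RULES`, which is the kind of step-back restructuring the comment says models rarely volunteer.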
That said, I'm convinced now that it'll get there sooner or later. At that point, I really don't know what purpose SWEs will serve. For a while we might serve as go-betweens between the coding agent and PMs, but LLMs are already way better at translating from tech jargon to human, so I can't imagine it would be long before product starts bypassing us and talking directly to the agents, who (err, which) can respond with various design alternatives, pros and cons of each, identify all the dependencies, possible compatibility concerns, alignment with future direction, migration time, compute cost, user education and adoption tracking, etc, all in real time in fluent PM-ese. IDK what value I add to that equation.
For the last year or so I figured we'd probably hit a wall before AI got to that point, but over the last month or so, I'm convinced it's only a matter of time.
I use copilot and change models often, and haven't really noticed any major differences between them, except some of the newer ones are very slow.
I generally feel the smaller and faster ones are more useful since they will let me discover problems with my prompt or context faster.
Maybe I'm simply not using LLM's in a way that lets the superiority of newer models reveal itself properly, but there is a huge financial incentive for LLM makers to pretend that their model has game-changing "special sauce" even if it doesn't.
> What you don’t need: variable names, formatting, comments meant for humans, or patterns designed to spare your brain.
> What you do need: simple entry points, explicit code with fewer abstractions, minimal coupling, and linear control flow.
Example: One of my customers (which I got by Reddit posts, cold calls, having a website, and eventually word of mouth) wanted to do something novel with a vendor in my niche. AI doesn't know how to build it because there's no documentation for the interfaces we needed to use.
Claude’s response:
The article’s central tension is real - Burke went from skeptic to believer by building four increasingly complex apps in rapid succession using Opus 4.5. But his evidence also reveals the limits of that belief.
Notice what he actually built: Windows utilities, a screen recorder, and two Firebase-backed CRUD apps for his wife’s business. These are real applications solving real problems, but they’re also the kinds of projects where you can throw away the code if something goes wrong. When he says “I don’t know how the code works” and “I’m maybe 80% confident these applications are bulletproof,” he’s admitting the core problem with the “AI replaces developers” narrative.
That 80% confidence matters. In your Splink work, you’re the sole frontend developer - you can’t deploy code you’re 80% confident about. You need to understand the implications of your architectural decisions, know where the edge cases are, and maintain the system when requirements change. Burke’s building throwaway prototypes for his wife’s yard sign business. You’re building production software that other people depend on.
His “LLM-first code” philosophy is interesting but backwards. He’s optimizing for AI regeneration rather than human maintenance because he assumes the AI will always be there to fix problems. But AI can’t tell you why a decision was made six months ago when business requirements shift. It can’t explain the constraints that led to a particular architecture. And it definitely can’t navigate political and organizational context when stakeholders disagree about priorities.
The Firebase examples are telling - he keeps emphasizing how well Opus knows the Firebase CLI, as if that proves general capability. But Firebase is extremely well-documented, widely-discussed training data. Try that same experiment with your company’s internal API or a niche library with poor documentation. The model won’t be nearly as capable.
What Burke actually demonstrated is that Opus 4.5 is an excellent pair programmer for prototyping with well-known tools. That’s legitimately valuable. But “pair programmer for prototyping” isn’t the same as “replacing developers.” It’s augmenting someone who already knows how to build software and can evaluate whether the generated code is good.
The most revealing line is at the end: “Just make sure you know where your API keys are.” He’s nervous about security because he doesn’t understand the code. That nervousness is appropriate - it’s the signal that tells you when you’ve crossed from useful tool into dangerous territory.
Just today:
Opus 4.5 Extended Thinking designed a psql schema for “stream updates after snapshot” and it had bugs.
Grok Heavy gave a correct solution without explanations.
ChatGPT 5.2 Pro gave a correct solution and also explained why the simpler way wouldn’t work.
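For readers unfamiliar with the problem class: "stream updates after snapshot" usually means taking a versioned snapshot, then applying only the streamed changes that are newer than it. This sketch is a generic illustration of that pattern in Python, not the schema any of the models produced, and all names in it are invented:

```python
# A common shape for "snapshot + stream updates": record the snapshot's
# version, then apply only updates strictly newer than it. Updates that
# predate the snapshot are already reflected in it and must be skipped,
# otherwise replaying the stream double-applies them.
def apply_stream_after_snapshot(snapshot_rows, snapshot_version, updates):
    """snapshot_rows: {id: value}; updates: iterable of (version, id, value)."""
    state = dict(snapshot_rows)
    for version, row_id, value in sorted(updates):
        if version <= snapshot_version:
            continue  # already in the snapshot; skipping keeps replay idempotent
        state[row_id] = value
    return state
```

The subtle bug a schema can hide is exactly this boundary: without a version column (or with `<` vs `<=` confused), updates that raced with the snapshot get applied twice or lost, which is plausibly the kind of flaw the commenter hit.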
Either it wasn’t that good, or the author failed in the one phrase they didn’t proofread.
(No judgement meant, it’s just funny).
Oh, it was also quite funny that it used the exact same color as Hacker News and a similar layout.
The above is for vibe coding; for taking the wheel, I can only use Opus because I suck at prompting codex (it needs very specific instructions), and codex is also way too slow for pair programming.
It's hard to say if Opus 4.5 itself will change everything given the cost/latency issues, but now that all the labs will have very good synthetic agentic data thanks to Opus 4.5, I will be very interested to see what the LLMs released this year will be able to do. A Sonnet 4.7 that can do agentic coding as well as Opus 4.5 but at Sonnet's speed/price would be the real gamechanger: with Claude Code on the $20/mo plan, you can barely do more than one or two prompts with Opus 4.5 per session.
When you need to build out specific feature or logic, it can fail hard. And the best is when you have something working, and it fixes something else and deletes the old code that was working, just in a different spot.
Disclaimer: This post was written by a human and edited for spelling, grammer by Haiku 4.5
I recently am finishing the reading of Mistborn series, so please do not read further unless you want a spoiler. SPOILER
There is a suspicion that mists can change written text. END OF SPOILER
So how can we be sure that Haiku didn't change the text in favour of AI, then?

Opus 4.5 is incredible; it is the GPT-4 moment for coding because of how honest and noticeable the capacity increase is. But still, it has blind spots, just like humans.
I haven't tried it for coding. I'm just talking about regular chatting.
It's doing something different from prior models. It seems like it can maintain structural coherence even for very long chats.
Whereas prior models felt like System 1 thinking, ChatGPT 5.2 appears to exhibit System 2 thinking.
(note that if you look at individual slices, Opus is getting often outperformed by Sonnet).
Unfortunately, it's still surprisingly easy for these models to fall into really stupid maintainability traps.
For instance today, Opus adds a feature to the code that needs access to a db. It fails because the db (sqlite) is not local to the executable at runtime. Its solution is to create this 100 line function to resolve a relative path and deal with errors and variations.
I hit ESC and say "... just accept a flag for --localdb <file>". It responds with "oh, that's a much cleaner implementation. Good idea!". It then implements my approach and deletes all the hacks it had scattered about.
This... is why LLMs are still not Senior engineers. They do plainly stupid things. They're still absurdly powerful and helpful, but if you want maintainable code you really have to pay attention.
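The `--localdb` fix described above can be sketched briefly. The flag name comes from the comment; everything else (parser setup, defaults, error handling) is my own illustrative assumption of what replaces the speculative path-resolution function:

```python
import argparse
from pathlib import Path

# Instead of ~100 lines guessing where the sqlite file lives relative to
# the executable, accept an explicit flag and fail loudly if it's wrong.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="tool with a local sqlite db")
    parser.add_argument(
        "--localdb",
        type=Path,
        default=Path("app.db"),
        help="path to the sqlite database file",
    )
    return parser.parse_args(argv)

def open_db_path(args) -> Path:
    # One explicit decision replaces speculative path resolution.
    if not args.localdb.exists():
        raise FileNotFoundError(f"database not found: {args.localdb}")
    return args.localdb
```

The design point is the one the comment makes: pushing the decision to the caller via a flag is both shorter and more maintainable than trying to infer the right path at runtime.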
Another common failure is when context is polluted.
I asked Opus to implement a feature by looking up the spec. It looked up the wrong spec (a v2 api instead of a v3) -- I had only indicated "latest spec". It then did the classic LLM circular troubleshooting as we went in 4 loops trying to figure out why calculations were failing.
I killed the session, asked a fresh instance to "figure out why the calculation was failing" and it found it straight away. The previous instance would have gone in circles for eternity because its worldview had been polluted by assumptions made -- that could not be shaken.
This is a second way in which LLMs are rigid and robotic in their thinking and approach -- taking the wrong way even when directed not to. Further reading on 'debugging decay': https://arxiv.org/abs/2506.18403
All this said, the number of failure scenarios gets ever smaller. We've gone from "problem and hallucination every other code block" to "problem every 200-1000 code blocks".
They're now in the sweet spot of acting as a massive accelerator. If you're not using them, you'll simply deliver slower.
it is quite strange, you have to make it write the code in a way it can reason about it without it reading it, you also have to feel the code without reading all of it. like a blind man feeling the shape of an object; Shape from Darkness
You can ask Opus to make a car and it will give you a car; then you ask it for navigation, no problem, it uses Google Maps and it works perfectly.
Then you ask it to improve the brakes, and it will give internet to the tires and the brake pedal, and the pedal will send a signal via IPv6 to the tires, which will enable a very well-designed local braking system. Why not, we already have internet for Google Maps.
I think the new software engineering is 10 times harder than the old one :)
Ofc the reality is a mix of both, but really curious on what contributes more.
Probably just using cursor with old models (eww) can yield a quick response.
https://burkeholland.github.io/posts/opus-4-5-change-everyth...
Read this article and ultrathink critically about it. Provide your perspective.
The article makes a strong experiential case for improved velocity in AI-assisted development but contains several argumentation weaknesses and conflations worth examining.
The projects described are legitimately non-trivial: Firebase backend integration, Facebook OAuth, iOS apps in Swift (a language the author doesn't know), GitHub Actions pipelines, scheduled cloud functions. Getting these working in hours rather than weeks represents a real capability shift. The author is honest about his limitations and uncertainties, particularly the security concerns.
Where the argument breaks down:
1. "Replace developers" vs "dramatically augment developers" The author's own workflow contradicts the headline claim. He's still:
- Making architectural decisions (choosing Firebase)
- Handling errors Opus couldn't see (XAML via Visual Studio)
- Writing custom prompts to shape output quality
- Manually auditing security
- Making product and UX decisions
This is developer work. The tool changed; the role didn't disappear.
2. The 80% security confidence undermines his thesis
He admits he's shipping apps with "80% confidence" in security and calls it "too damn low." This is the crux: the AI accelerated production but didn't replace the judgment required to responsibly ship production software. The velocity gain exposed a competence gap rather than closing it.
3. Sample bias in project selection
All examples are:
- Greenfield (no existing codebase)
- Single developer
- Personal/family use
- Standard patterns with excellent documentation (Firebase, SwiftUI, React Native)
- No regulatory, compliance, or scale requirements
- No team collaboration or handoff considerations
These constraints define a specific problem space where AI excels. Extrapolating to "developers are replaceable" ignores the majority of professional software work.
4. "Code doesn't need human readability" is underbaked His argument is circular: "Why optimize for human readability when the AI is doing all the work?" But:
- His 80% security confidence exists because he can't read the code
- He had to use external tools (VS) when Opus couldn't diagnose errors
- What happens when context windows are exceeded and the LLM loses track?
- Model behavior changes between versions; human-readable code is version-agnostic
The custom prompt he shares actually encodes many good engineering practices (minimal coupling, explicit state, linear control flow) that benefit LLMs and humans. The "no comments needed" claim conflates what's optimal for LLM regeneration with what's optimal for debugging production issues at 3am.
What's actually being demonstrated:
The honest version of this article would be: Opus 4.5 dramatically compresses the gap between "can write code" and "can ship a personal app" for a specific class of greenfield projects. That's genuinely transformative for hobbyists, indie developers, and people solving their own problems. But that's different from "replacing developers." The article demonstrates a power tool; power tools don't eliminate tradespeople.
Leopold Aschenbrenner was talking about "unhobbling" as an ongoing process. That's what we are seeing here. Not unexpected
I only use VS Code with Copilot subscription ($10) and already get quite a lot out of it.
My experience is that Claude Code really drains your pocket extremely fast.
The fact of the matter, in my experience, is that most of the day to day software tasks done by an individual developer are not greenfield, complex tasks. They're boring data-slinging or protocol wrangling. This sort of thing has been done a thousand times by developers everywhere, and frankly there's really no need to do the vast majority of this work again when the AIs have all been trained on this very data.
I have had great success using AIs as vast collections of lego blocks. I don't "vibe code", I "lego code", telling the AI the general shape and letting it assemble the pieces. Does it build garbage sometimes? Sure, but who doesn't from time to time? I'm experienced enough to notice the garbage smell and take corrective action or toss it and try again. Could there be strange crevices in a lego-coded application that the AI doesn't quite have a piece for? Absolutely! Write that bit yourself and then get on with your day.
If the only thing you use these tools for is doing simple grunt-work tasks, they're still useful, and dismissing them is, in my opinion, a mistake.
It's always just the "Fibonacci" equivalent
Antigravity with Gemini 3 pro from Google has the same capability.
Things are changing. Now everyone can build bespoke apps. Are these apps pushing the limits of technology? No! But they work for the very narrow and specific domain they were designed for. And yes, they do not scale and have as many bugs as your personal shell scripts. But they work.
But let's not compare these with something more advanced, at least not yet. Maybe by the end of this year?
We switched from Sonnet 4.5 to Opus 4.5 as our default coding agent recently, and we pay the price for the switch (3x the cost), but as the OP said, it is quite frankly amazing. It does a pretty good job, especially when your code and project are structured in such a way that helps the agent perform well. Anthropic released an entire video on the subject recently which aligns with my own observations.
Where it fails hard is in the more subtle areas of the code: good design, best practices, good taste, DRY, etc. We often need to prompt it to refactor things, as the quick solution it decided on is not in our best interest for the long run. It often ends up in deep investigations of things which are trivially obvious. It is overfitted to using unix tools in their pure form: it fails to remember (even with prompting) that it should run `pnpm test:unit` instead of `npx jest`; it gets it wrong every time.
But when it works - it is wonderful.
I think we are at the point where we are close to self-improving software and I don't mean this lightly.
It turns out the unix philosophy runs deep. We are right now working on ways to give our agents more shells, and we are frankly a few iterations away. I am not sure what to expect after this, but I think whatever it is, it will be interesting to see.
The article is fine opinion but at what point are we going to either:
a) establish benchmarks that make sense and are reliable, or
b) stop with the hypecycle stuff?
What LLMs will NOT do, however, is write or invent SOMETHING NEW.
And parts of our industry still are about that: Writing Software that has NOT been written before.
If you hire junior developers to re-invent the wheels: Sure, you do not need them anymore.
But sooner or later you will run out of people who know how to invent NEW things.
So: This is one more of those posts that completely miss the point. "Oh wow, if I look up on Wikipedia how to make pancakes I suddenly can make and have pancakes!!!1". That always was possible. Yes, you now can even get an LLM to create you a pancake-machine. Great.
Most of the artists and designers I am friends with have lost their jobs by now. In a couple of years you will notice the LLMs no longer have new styles to copy from.
I am all for the "remix culture". But don't claim to be an original artist, if you are just doing a remix. And LLM source code output are remixes, not original art.
As a side project / experiment, I designed a language spec and am using (mostly) Opus 4.5 to write a transpiler for it (the language transpiles to C). The parser was no problem (I used s-expressions for a reason). The type checker and the transpiler itself have been a slog - I think I'm finding the limits of Opus :D. It particularly struggles with multi-module support. Though some of this is probably down to mistakes I made while playing architect and iterating with Claude - I haven't written a compiler since my senior-year compiler design course 20+ years ago. Someone who does this for a living would probably have an easier time of it.
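For readers wondering why s-expressions make the parser the easy part: the whole grammar collapses into one tokenizer pass plus one recursive function. This is a hypothetical minimal sketch (the `SExpr`, `tokenize`, and `parse` names are illustrative, not from the commenter's actual language):

```typescript
// An s-expression is either an atom (string) or a list of s-expressions.
type SExpr = string | SExpr[];

// Pad parens with spaces, then split on whitespace - that's the whole lexer.
function tokenize(src: string): string[] {
  return src
    .replace(/\(/g, " ( ")
    .replace(/\)/g, " ) ")
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

// One recursive descent pass: "(" opens a list, ")" closes it, anything else is an atom.
function parse(tokens: string[]): SExpr {
  const t = tokens.shift();
  if (t === undefined) throw new Error("unexpected end of input");
  if (t === "(") {
    const list: SExpr[] = [];
    while (tokens[0] !== ")") {
      if (tokens.length === 0) throw new Error("missing ')'");
      list.push(parse(tokens));
    }
    tokens.shift(); // consume ')'
    return list;
  }
  if (t === ")") throw new Error("unexpected ')'");
  return t; // atom
}
```

With the surface syntax this regular, all of the real difficulty moves into the type checker and code generator - which matches where the commenter says Opus starts to struggle.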
But for the CRUD stuff my day job has me doing? Pffttt... it's great.
I’ve always liked the quote that sufficiently advanced tech looks like magic, but it’s a mistake to assume that things that look like magic also share other properties of magic. They don’t.
Software engineering spans several distinct skills: forming logical plans, encoding them in machine-executable form (coding), making them readable and expandable by other humans (to scale engineering), and constantly navigating tradeoffs like performance, maintainability, and org constraints as requirements evolve.
LLMs are very good at some of these, especially instruction following within well-known methodologies. That’s real progress, and it will be productized sooner rather than later, with concrete use cases, ROI, and a clearly defined end user.
Yet I’d love to see less discussion driven by anecdotes and more discussion about productizing these tools: where they work, usage methodologies, missing tooling, KPIs for specific use cases. And don’t get me started on current evaluation frameworks; they become increasingly irrelevant once models are good enough at instruction following.
As soon as you try to sell it to me, you have a duty of care. You are not meeting that duty of care with this ignorant and reckless way of working.
And it writes with more clarity too.
The only people who are complaining about "AI slop" are those whose jobs depend on AI to go away (which it won't).
I just tried again and asked Opus to add custom video controls around ReactPlayer. I started in Plan mode, and the plan looked good overall (it used our styling libs, existing components, icons, and so on).
I let it execute the plan and, behold, I have controls on the video; so far so good. Then I look at the code and see multiple issues: overuse of useEffect for trivial things, storing values in useState that should be computed at render time, failing to correctly display the time / duration of the video, and so on...
I ask a follow-up like "hide the controls after 2 seconds" and it starts introducing more useEffects and more state, none of which is needed (granted, you need one).
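For what it's worth, the 2-second auto-hide really does need only one timer. A minimal framework-agnostic sketch (the `AutoHide` class is hypothetical, not the commenter's code; a React wrapper would subscribe to it in a single cleanup-aware effect):

```typescript
type Listener = (visible: boolean) => void;

// Owns the one timer; visibility is the only piece of state.
class AutoHide {
  private visible = true;
  private timer?: ReturnType<typeof setTimeout>;

  constructor(private delayMs: number, private onChange: Listener) {}

  get isVisible(): boolean {
    return this.visible;
  }

  // Call on any pointer activity: show the controls and restart the countdown.
  poke(): void {
    this.set(true);
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.set(false), this.delayMs);
  }

  // Call on unmount so the timer never outlives the component.
  dispose(): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
  }

  private set(v: boolean): void {
    if (v !== this.visible) {
      this.visible = v;
      this.onChange(v); // notify the UI only on actual changes
    }
  }
}
```

Everything derived from this (which icon to show, CSS opacity, etc.) can then be computed at render time from `isVisible` instead of living in extra useState hooks.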
Cherry on the cake: I asked it to place the slider at the bottom and the other controls above it, and it placed the slider on top...
So I suck at prompting and will start looking for a gardening job I guess...
I had been saying since around summer of this year that coding agents were getting extremely good. The base model improvements were ok, but the agentic coding wrappers were basically game changers if you were using them right. Until recently they still felt very context limited, but the context problem increasingly feels like a solved problem.
I had some arguments on here in the summer about how it was stupid to hire junior devs at this point and how in a few years you probably wouldn't need senior devs for 90% of development tasks either. This was an aggressive prediction 6 months ago, but I think it's way too conservative now.
Today we have people at our company who have never written code building and shipping bespoke products. We've also started hiring people who can simply prove they can build products for us using AI in a single day. These are not software engineers, because we are paying them wages no SWE would accept, but it's still a decent wage for a 20-something without any real coding skills who is interested in building stuff.
This is something I would never have expected to be possible 6 months ago. In 6 months we've gone from senior developers writing ~50% of their code with AI to just a handful of senior developers who now write close to 90% of their code with AI while supporting a bunch of non-developers pumping out a steady stream of shippable products and features.
Traditional software engineering is genuinely running on borrowed time right now. It's not that there will be no jobs for knowledgeable software engineers in the coming years, but companies simply won't need many hotshot SWEs anymore. The companies hiring significant numbers of software engineers today simply cannot have realised how much things have changed over just the last few months. Apart from the top 1-2% of talent, I see no good reason to hire a SWE for anything anymore. And honestly, outside of niche areas, anyone hand-crafting code today is a dinosaur... A good SWE today should see their job as simply reviewing code and prompting.
If you think the quality of code LLMs produce today isn't up to scratch, you've either not used the latest models and tools or you're using them wrong. That's not to say it's the best code – they still have a tendency to overcomplicate things, in my opinion – but it's probably better than the average senior software engineer's. And that's really all that matters.
I'm writing this because if you're reading this thinking we're basically still in 2024 with slightly better models and tooling you're just wrong and you're probably not prepared for what's coming.
- or biome 1.x and biome 2.x ?
- nah! it never will, and that's why it'll never replace mid-level engineers, FTFY
And then the cheap and good-enough option will eventually get better, because it's the one that gets used more.
It's how Japanese manufacturing beat Western manufacturing. And how Chinese manufacturing then beat Japanese manufacturing in turn.
It's why it's much more likely you are using the Linux kernel and not GNU hurd.
It's how digital cameras left traditional film based cameras in the dust.
Bet on the cheaper and good enough outcome.
Bet against it at your peril.