Are LLM merge rates not getting better?

(entropicthoughts.com)

174 points · 4diii · 3mo ago · 156 comments

Related: Many SWE-bench-Passing PRs would not be merged - https://news.ycombinator.com/item?id=47341645 - March 2026 (149 comments)

aerhardt3mo ago· 9 in thread

I feel that two things are true at the same time:

1) Something happened during 2025 that made the models (or crucially, the wrapping terminal-based apps like Claude Code or Codex) much better. I only type in the terminal anymore.

2) The quality of the code is still quite often terrible. Quadruple-nested control flow abounds. Software architecture in rather small scopes is unsound. People say AI is “good at front end” but I see the worst kind of atrocities there (a few days ago Codex 5.3 tried to inject a massive HTML element with a CSS ::before hack, rather than properly refactoring the markup).

Two forces feel true simultaneously but in permanent tension. I still cannot make up my mind and see the synthesis in the dialectic: where this is truly going, whether we’re meaningfully moving forward or mostly moving in circles.

zx80803mo ago

> People say AI is “good at front end” but I see the worst kind of atrocities there

It's common to say "AI is great at X" when one is not a professional in X. That's because of how AI is designed: to output tokens according to stats. Not logic, not semantics, not meaning: stats.

contextfree3mo ago

Reading discussions online and comparing them to my own experience makes me feel crazy, because I've found today's LLMs and agents to be seemingly good at everything except writing code. Including everything else in software engineering around code (debugging, reviewing, reading code, brainstorming architecture, etc.) as well as discussing various questions in the humanities and sciences where I'm a dilettante. But whenever I've asked them to generate any substantial amount of code, beyond a few lines to demonstrate usage of some API I'm unfamiliar with, the results have always been terrible and I end up either throwing it out or rewriting almost all of it myself and spending more time than if I'd just written it myself from the start.

It's occurred to me that maybe this just shows that I'm better at writing code and/or worse at everything else than I'd realized.

pornel3mo ago

Gell-Mann Amnesia for code quality.

leoedin3mo ago

This matches my experience too. The models write code that would never pass a review normally. Mega functions, "copy and pasted" code with small changes, deep nested conditionals and loops. All the stuff we've spent a lot of time trying to minimise!

You could argue it's OK because a model can always fix it later. But the problem comes when there are subtle logic bugs and it's basically impossible to understand. Or fixing the bug in one place doesn't fix it in the 10 other places where almost the same code exists.

I strongly suspect that LLMs, like all technologies, are going to follow an S curve of capability. The question is where in that S curve we are right now.

jygg43mo ago

The models lose the ability to inject subtle and nuanced stuff as they scale up, is what I’ve observed.

orwin3mo ago

> People say AI is “good at front end”

I only say that because I'm a shit frontend dev. Honestly, I'm not that bad anymore, but I'm still shit, and the AI will probably generate better code than I will.

jygg43mo ago

As long as humans are needed to review code, it sounds like your role evolves toward prompting and reviewing.

Which is akin to driving a car - the motor vehicle itself doesn’t know where to go. It requires you to prompt via steering and braking etc, and then to review what is happening in response.

That’s not necessarily a bad thing - reviewing code ultimately matters most. As long as what is produced is more often than not correct and legible... now that is a different issue, on which there isn’t a consensus among software engineers.

naruhodo3mo ago

> 1) Something happened during 2025 that made the models (or crucially, the wrapping terminal-based apps like Claude Code or Codex) much better. I only type in the terminal anymore.

I have heard say that the change was better context management and compression.

bbatha3mo ago

A lot of enhancements came on the model side which in many ways enabled context engineering.

200k and now 1M contexts. Better context management was enabled by improvements in structured outputs/tool calling at the model level. Also, reasoning models really upped the game: “plan” mode wouldn’t work well without them.

sunaurus3mo ago· 8 in thread

I am pretty convinced that for most types of day to day work, any perceived improvements from the latest Claude models for example were total placebo. In blind tests and with normal tasks, people would probably have no idea if they're using Opus 4.5 or 4.6.

sumeno3mo ago

This has basically been my experience since Sonnet 3.5. I've been working on a personal project on and off with various models and things since then and the biggest difference between then and now is that it will do larger chunks of work than it did before, but the quality of the code is not particularly better, I still have to do a lot of cleanup and it still goes off the rails pretty frequently. I have to do fewer individual prompts, but the time spent reviewing the code takes longer because I also have to mentally process and fix larger chunks of code too

Is it a better user experience now? Yes. Has it boosted my productivity on this project? Absolutely.

But it still needs a ton of hand holding for anything complicated and I still deal with tons of "OK, this bug is fixed now!" followed by manually confirming a bug still exists.

SkyPuncher3mo ago

4.6 has been a very, very slight regression for me, but the tradeoff is they've added better compaction - and now larger context windows. That's a reasonable tradeoff for me.

BoumTAC3mo ago

It's because they are getting so good it's impossible to recognize them.

Haiku 4.5 is already so good it's ok for 80% (95%?) of dev tasks.

FuckButtons3mo ago

I must be writing very different software than you, I keep opus on a tight leash and it still comes to the strangest conclusions.

Bolwin3mo ago

I've found Haiku to be truly mediocre for working with. If you want a cheap model, the open source ones are much better.

AussieWog933mo ago

I'd agree with you on 4.5 to 4.6, but going from gpt-5 or 4.0 to 4.5 was night and day.

butILoveLife3mo ago

GPT5 added the router, which was def a downgrade. 4.5 was probably the best non-COT model humanity has made. But too expensive to run.

NewLogic3mo ago

Because post 4.0 dropped the sycophancy?

BoppreH3mo ago· 5 in thread

Controversial opinion from a casual user, but state-of-the-art LLMs now feel to me more intelligent than the average person on the steet. That would also explain why training on more average-quality data (if there's any left) is not making improvements.

But LLMs are hamstrung by their harnesses. They are doing the equivalent of providing technical support via phone call: little to no context, and limited to a bidirectional stream of words (tokens). The best agent harnesses have the equivalent of vision-impairment accessibility interfaces, and even those are still subpar.

Heck, giving LLMs time to think was once a groundbreaking idea. Yesterday I saw Claude Code editing a file using shell redirects! It's barbaric.

I expect future improvements to come from harness improvements, especially around sub agents/context rollbacks (to work around the non-linear cost of context) and LLM-aligned "accessibility tools". That, or more synthetic training data.

xyzsparetimexyz3mo ago

Steet? Do you mean street? They're smarter in the same way a search engine is smarter.

BoppreH3mo ago

Yes, "street". Typing from my phone, sorry.

And search engines are narrow tools that can only output copies of their dataset. An LLM is capable of surprisingly novel output, even if the exact level of creativity is heavily debated.

8note3mo ago

> But LLMs are hamstrung by their harnesses

entirely so. i think anthropic updated something about the compact algorithm recently, and it's gone from working well over long times to basically garbage whenever a compact happens

globular-toast3mo ago

It's so disrespectful to say an LLM is more intelligent than a person on the street. The LLM has nothing at stake, cares not a sausage about the consequences of what it spits out. People have all kinds of pressures, dependants, and personal issues like health. Our thoughts and actions have real consequences. It's so easy to be intelligent when you're the pretend human that gets switched on for five minutes then switched off again.

BoppreH3mo ago

It's not a value judgement, I'm no misanthrope. But it's a fact of life that we humans must specialize, while LLMs can afford to have "studied" a staggering variety of topics. It's no different than being slower than a car, or weaker than a hydraulic press.

On a different note, LLMs are still not very wise, as displayed by all the prompt attacks and occasional inane responses like walking to the car wash.

mike_hearn3mo ago· 5 in thread

That's an interesting claim, but I don't see it in my own work. They have got better but it's very hard to quantify. I just find myself editing their work much less these days (currently using GPT 5.4).

dwedge3mo ago

Without meaning to sound dismissive, because I'm really not intending to, there's also the possibility that you've gotten worse after enough time using them. You're treating yourself as a constant in this, but man cannot walk in the same river twice.

Mond_3mo ago

This is such a silly response when "You've gotten better at using them and know how to work around their flaws now." is right there and seems a lot more plausible.

mike_hearn3mo ago

That's a possibility, but I doubt it. I've been programming for 35 years and know what I like in code. I've also previously maintained a long review prompt in which I tell the models all the ways in which they get things wrong and to go look for/fix those problems. But those review passes now don't take as long because there are fewer such problems to begin with.

In particular GPT 5.4 is much better at not duplicating code unnecessarily. It'll take the time to refactor, to search for pre-existing utility functions, etc.

nkozyra3mo ago

The problem with evals is the underlying rubric will always be either subjective, or a quantitative score based on something that is likely now baked into the training set directly.

You kind of have to go on "feels" for a lot of this.

mountainriver3mo ago

Yeah same, and all my coworkers feel the same.

Most of us have been coding for ages. I actually find it really odd people keep trying to disprove things that are relatively obvious with LLMs

wongarsu3mo ago· 4 in thread

I don't find this very compelling. If you look at the actual graph they are referencing but never showing [1] there is a clear improvement from Sonnet 3.7 -> Opus 4.0 -> Sonnet 4.5. This is just hidden in their graph because they are only looking at the number of PRs that are mergeable with no human feedback whatsoever (a high standard even for humans).

And even if we were to agree that that's a reasonable standard, GPT 5 shouldn't be included. There is only one data point for all OpenAI models. That data point is more indicative of the performance of OpenAI models (and the harness used) than of any progression. Once you exclude it, it matches what you would expect from a logistic model. Improvements have slowed down, but not stopped.

1: https://metr.org/assets/images/many-swe-bench-passing-prs-wo...

yorwba3mo ago

Yes, I think this is basically an instance of the "emergent abilities mirage." https://arxiv.org/abs/2304.15004

If you measure completion rate on a task where a single mistake can cause a failure, you won't see noticeable improvements on that metric until all potential sources of error are close to being eliminated, and then if they do get eliminated it causes a sudden large jump in performance.

That's fine if you just want to know whether the current state is good enough on your task of choice, but if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.
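
A rough way to see the arithmetic, as a minimal sketch: assume a task is a chain of independent steps that each succeed with the same probability, and that any single failure fails the whole task. Both assumptions and all the numbers (50 steps, the per-step rates) are illustrative, not taken from the METR data.

```python
def task_success_rate(step_success: float, steps: int) -> float:
    """Chance of finishing every step when steps fail independently.

    Assumes each step succeeds with the same probability and that
    one failed step fails the whole task (illustrative model only).
    """
    return step_success ** steps


# Hypothetical per-step success rates for a 50-step task.
for p in (0.90, 0.95, 0.99, 0.999):
    print(f"per-step {p:.3f} -> whole task {task_success_rate(p, 50):.3f}")
```

Under these assumptions the per-step rate has to get very close to 1 before the whole-task rate moves much, which is why tracking the components separately says more about the trend than the all-or-nothing metric does.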

thesz3mo ago

  > until all potential sources of error are close to being eliminated

This is what PSP/TSP did - one has to (continually) review one's own work to identify the most frequent sources of (user-facing) defects.

  >  if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.

This is also one of the tenets of PSP/TSP. If you have a task with an estimate longer than a day (8 hours), break it down.

This is fascinating. The LLM community is rediscovering PSP/TSP rules that were laid out more than twenty years ago.

What the LLM community misses is that in PSP/TSP it is the individual software developer who is responsible for figuring out what they need to look after.

What I see is that it is LLM users who try to harness LLMs against what they perceive as errors. It's not that LLMs are learning, it is that users of LLMs are trying to strong-arm these LLMs with prompts.

Bombthecat3mo ago

That's how the public perceives it though.

It's useless and never gets better until it suddenly, unexpectedly gets good enough.

roxolotl3mo ago

I don't know, that graph to me shows Sonnet 4.5 as worse than 3.7. Maybe the automated grader is finding code breakages in 3.7 and not breaking that out? I'd much prefer to add code that is a different style to my codebase than code that breaks other code. But even ignoring that, the pass rate is almost identical between the two models.

curiouscube3mo ago· 4 in thread

There is a decent case for this thesis to hold true especially if we look at the shift in training regimes and benchmarking over the last 1-2 years. Frontier labs don't seem to really push pure size/capability anymore, it's an all in focus on agentic AI which is mainly complex post-training regimes.

There are good reasons why they don't or can't do simple param upscaling anymore, but still, it makes me bearish on AGI since it's a slow, but massive shift in goal setting.

In practice this still doesn't mean 50 % of white collar can't be automated though.

lich_king3mo ago

> In practice this still doesn't mean 50 % of white collar can't be automated though.

Let me ask you this, though: if we wanted to, what percentage of white collar jobs could have been automated or eliminated prior to LLMs?

Meta has nearly 80k employees to basically run two websites and three mobile apps. There were 18k people working at LinkedIn! Many big tech companies are massive job programs with some product on the side. Administrative business partners, program managers, tech writers, "stewards", "champions", "advocates", 10-layer-deep reporting chains... engineers writing cafe menu apps and pet programming languages... a team working on in-house typefaces... the list goes on.

I can see AI producing shifts in the industry by reducing demand for meaningful work, but I doubt the outcome here is mass unemployment. There's an endless supply of bs jobs as long as the money is flowing.

jmalicki3mo ago

Meta has 80k employees to run the world's most massive engine of commerce through advertising and matching consumers to products.

They build generative AI tools so people can make ads more easily.

They have some of the most sophisticated tracking out there. They have shadow profiles on nearly everyone. Have you visited a website? You have a shadow profile even if you don't have a Facebook account. They know who your friends are based on who you are near. They know what stores you visit when.

Large fractions of their staff are making imperceptible changes to ads tracking and feed ranking that are making billions of dollars of marginal revenue.

What draws you in as a consumer is a tiny tip of the iceberg of what they actually do.

ehnto3mo ago

There are many reasons why we are seeing cuts economically, but the fact that it is possible to make such large cuts is because there were way too many people working at these companies. They had so much cheap money that they over-hired, now money isn't so cheap and they need to reduce headcount. AI need not enter the conversation to get to that point.

suttontom3mo ago

This is unfair and dismissive of many roles. Coordination in a massive, technically complex company that has to adhere to laws and regulations is a critical role. I don't get why people shit on certain roles (I'm a SWE). Our PgMs reduce friction and help us be more productive and focused. Technical writers produce customer-facing content and code, and have nothing to do with supporting internal bureaucracy. There are arguments against this in Bullshit Jobs but do you think companies pay PgMs or HR employees hundreds of thousands of dollars a year out of the goodness of their own hearts? Or maybe they actually help the business?

ryanackley3mo ago· 4 in thread

I agree completely. I haven't noticed much improvement in coding ability in the last year. I'm using frontier models.

What's been the game changer are tools like Claude Code. Automatic agentic tool loops purpose built for coding. This is what I have seen as the impetus for mainstream adoption rather than noticeable improvements in ability.

sho_hn3mo ago

My anecdotal experience is rather different.

I write a lot of C++ and QML code. Codex 5.3, only released in Feb, is the first model I've used that would regularly generate code that passes my 25-years-expert smell test and has turned generative coding from a time sap/nuisance into a tool I can somewhat rely on not to set me back.

Claude still wasn't quite there at the time, but I haven't tried 4.6 yet.

QML is a declarative-first markup language that is a superset of the JavaScript syntax. It's niche and doesn't have a giant amount of training data in the corpus. Codex 5.3 is the first model that doesn't super botch it or prefers to write reams of procedural JS embeds (yes, after steering). Much reduced is also the tendency to go overboard on spamming everything with clouds of helper functions/methods in both C++ and QML. It knows when to stop, so to speak, and is either trained or able to reason toward a more idiomatic ideal, with far less explicit instruction / AGENTS.md wrangling.

It's a huge difference. It might be the result of very specific optimization, or perhaps simultaneous advancements in the harness play a bigger role, but in my book my neck of the woods (or place on the long tail) only really came online in 2026 as far as LLMs are concerned.

rubymamis3mo ago

As a Qt C++ and QML developer myself[1], Opus 4.6 thinking is much better than any other model I've tested (Codex 5.3/GPT 5.4/Gemini 3.1 Pro).

[1] https://rubymamistvalove.com/block-editor

mavamaarten3mo ago

Maybe n=1, but I disagree? I notice that Sonnet 4.6 follows instructions much better than 4.5 and it generates code much closer to our already in-place production code.

It's just a point release and it isn't a significant upgrade in terms of features or capabilities, but it works... better for me.

ryanackley3mo ago

Are you using a tool like Claude Code or Codex or windsurf? I ask because I've found their ability to pull in relevant context improves tasks in exactly the way you're describing.

My own experience is that some things get better and some things get worse in perceived quality at the micro-level on each point release. i.e. 4.5->4.6

thomascgalvin3mo ago· 4 in thread

Anecdotally, I haven't seen any real improvement from the AI tools I leverage. They're all good-ish at what they do, but all still lie occasionally, and all need babysitting.

I also wonder how much of the jump in early 2025 comes from cultural acceptance by devs, rather than an improvement in the tools themselves.

rustyhancock3mo ago

I think I'm coming to the same conclusion: GPT-3 to 5.3 have had real, tangible but incremental improvements with quite diminishing returns.

Perhaps we won't see a phase-change-like improvement as we did from GPT-2 through to 3 until there are several more orders of magnitude of parameters and/or training. Perhaps we will never see it again!

What is getting rapidly better is scaffolding but this seems to be more about understanding and building tools around LLMs than the LLMs themselves improving.

I'm still excited about AI but not constantly hyped to the rafters as some.

egwor3mo ago

I think it depends on what you're using it for. If it is a simple kubernetes config then the model doesn't matter too much. Contrast that with writing the scenario for a backtest for an algo that trades on a venue: it is not the same complexity and the basic models are terrible. I've had it tell me that it has added tests, only to find that they're just stubs! Opus seems to be getting there, but on more complex tasks the others are a complete waste of time.

utopiah3mo ago

> If it is a simple kubernetes config then the model doesn't matter too much

I guess at least this person https://www.tomshardware.com/tech-industry/artificial-intell... might disagree. I think even knowing what Kubernetes is requires quite a bit of knowledge. Using a tool that manipulates its configuration files IN PRODUCTION without risking data loss is another ball game entirely.

jwpapi3mo ago

It’s better pre and post training + better harnessing

reedf13mo ago· 3 in thread

Given that it is the general consensus that a step function occurred with Opus 4.5/4.6 only 3 months ago - it seems like an insane omission.

jeremyjh3mo ago

This has been the general consensus for about three years now. "Drastic increases in capability have happened the last 3-6 months" have been a constant refrain.

Without any data from the study past September I think it's not unreasonable, if you want to make an argument based on evidence.

For me personally, I agree with you, I'm really seeing it as well.

postflopclarity3mo ago

> "Drastic increases in capability have happened the last 3-6 months" have been a constant refrain.

well, yeah. because that's been the experience for many people.

3 years ago, trying to use ChatGPT 3.5 for coding tasks was more of a gimmick than anything else, and was basically useless for helping me with my job.

today, agentic Opus 4.6 provides more value to me than probably 2 more human engineers on my team would

Toutouxc3mo ago

There's a consensus that SOMETHING changed with Opus 4.5. It might have been the "merge rates" metric, it might not have.

I'm certainly getting faster and cleaner-looking solutions for certain issues on Opus 4.6 than I was 5 months ago, but I'm not sure about the ability to solve (or even weigh in) the actual hard stuff, i.e. the stuff I'm paid for.

And I'm definitely not sure about the supposed big step between 4.5 and 4.6. I'm literally not seeing any.

jeffnv3mo ago· 2 in thread

I don't think it's true, but am I alone in wishing it was? My world is disrupted somewhat but so far I don't think we have a thing that upends our way of life completely yet. If it stayed exactly this good I'd be pretty content.

cj3mo ago

I agree with your sentiment, but I think we've yet to see the full application of the current technology. (Even if LLMs themselves don't improve, there's significant opportunity for people to use it in ways not currently being done)

jygg43mo ago

The issue with LLMs is trust.

I don’t see that ever going away. Humans have learned to trust other humans over a large time scale with rules in place to control behaviour.

orwin3mo ago· 2 in thread

I think what happened with static image generation is happening with LLMs. Basically the tools around them are becoming better, but the AI improvements themselves stall: the error rate stays the same (but external tools curate the results, so it won't be noticeable if you don't run your own model), and the accuracy is still slightly improving, but slower and slower, and never reaches the 'perfect' point. Basically Stable Diffusion in early 2025.

GaggiX3mo ago

Image quality has improved a lot in recent months thanks to better models. The ability of people to notice these improvements is plateauing because they are not trained to spot artifacts, which are becoming more obscure.

orwin3mo ago

Yes, slight increase in that kind of accuracy. And newer models still generate absurd stuff. Ask for an historical picture, like 'a London market in the 18th century', and it is still as historically wrong as it was 2 years ago. It is useful for fantasy/sci-fi though, I use them a lot. But I don't see the point of newer models since late 2024.

Flavius3mo ago· 2 in thread

> This means llms have not improved in their programming abilities for over a year. Isn’t that wild? Why is nobody talking about this?

Because it's not true. They have improved tremendously in the last year, but it looks like they've hit a wall in the last 3 months. Still seeing some improvements but mostly in skills and token use optimization.

postflopclarity3mo ago

> but mostly in skills and token use optimization.

I have heard rumors that token use optimization has been a recent focus to try to tidy up the financials of these companies before they IPO. take that with a grain of salt though

saulpw3mo ago

After only 3 months (!) you can claim a plateau, but not a wall.

roxolotl3mo ago· 2 in thread

These studies are always really hard to judge the efficacy of. I would say though the most surprising thing to me about LLMs in the past year is how many people got hyped about the Opus 4.5 release. Having used Claude Code at work since it was released I haven't really noticed any step changes in improvement. Maybe that's because I've never tried to use it to one shot things?

Regardless I'm more inclined to believe that 4.5 was the point that people started using it after having given up on copy/pasting output in 2024. If you're going from chat to agentic level of interaction it's going to feel like a leap.

eterm3mo ago

I used it with Sonnet 4.0 a lot, and there was vastly more back-and-forth and correction of "dumb" things, such as forgetting to add "using" statements in C# files.

I don't know if it's model or harness improvements, or inbuilt memory, or all of the above, but it now often has a step where it checks itself before trying to build and hitting an inevitable failure.

Those small things add up to a much smoother and richer experience today compared to 6 months ago.

tossandthrow3mo ago

Nah, pre 4.5 it was not comfortable to use agentic coding.

BloondAndDoom3mo ago· 2 in thread

I feel like anyone who used AI coding tools before 11/25 and after 1/26 (with frontier models) will say there has been a massive jump. There is a difference between whether an LLM can do a specific task or pass some arguably arbitrary checks by maintainers vs. what they are capable of.

We still have tons of gaps in how to build and maintain code with AI, but LLMs themselves are getting better at an unbelievable pace. Even with this kind of data analysis I’m surprised anyone can even question it.

_zagj3mo ago

> I feel like anyone used AI coding tools before 11/25 and after 1/26 (with frontier models) will say there has been a massive jump in, there is a difference between whether LLM can do a specific task or pass some arguably arbitrary checks by maintainers vs. what the are capable of.

How much of that is the model and how much of that is the tooling built around it? Also why is the tooling, specifically Claude Code, so buggy?

BloondAndDoom3mo ago

90% model if not more. Look at the terminal benchmark's Terminus tool; that mostly proves it.

fluidcruft3mo ago· 2 in thread

Yeah I'm not buying the last bit about lower MSE with one term in the model vs two (Brier with one outcome category is MSE of the probabilities). That's the sort of thing that would make me go dig to find where I fucked up the calculation.
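
For reference, a minimal sketch of the Brier score for a binary outcome, which is just the mean squared error between the forecast probabilities and the 0/1 outcomes. The forecasts and outcomes below are made up for illustration and have nothing to do with the article's data.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("need one outcome per forecast")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical forecasts of "this PR gets merged" and what actually happened.
forecasts = [0.4, 0.4, 0.4, 0.4]
merged = [1, 0, 0, 1]
print(brier_score(forecasts, merged))
```

Lower is better; a forecaster who always says 0.5 scores 0.25 whatever happens.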

kqr3mo ago

With one term it gets more robust in the face of excluding endpoints when constructing the jackknife train/test split, I think. But you're right, it does sound fishy.

fluidcruft3mo ago

What the post is describing is just ANOVA. If removing a category improves the overall fit then fitting the two terms independently has the same optimal solution (with the two independent terms found to be identical). MSE never increases when adding a category.

This is why you have to reach to things that penalize adding parameters to models when running model comparisons.
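
A small sketch of that point, assuming the simplest ANOVA-style setup: one pooled mean versus a separate mean per category, both fitted and scored on the same data. The numbers are invented, and the AIC-style score is just one example of a penalty (Gaussian AIC up to an additive constant), not necessarily what the post should have used.

```python
from math import log
from statistics import mean


def mse_pooled(groups):
    """In-sample MSE when every group shares one fitted mean."""
    values = [v for group in groups for v in group]
    center = mean(values)
    return mean((v - center) ** 2 for v in values)


def mse_per_group(groups):
    """In-sample MSE when each group gets its own fitted mean."""
    total = 0.0
    count = 0
    for group in groups:
        center = mean(group)
        total += sum((v - center) ** 2 for v in group)
        count += len(group)
    return total / count


def aic_like(mse, n, k):
    """n * ln(MSE) + 2k: lower is better, each extra parameter costs 2."""
    return n * log(mse) + 2 * k


# Hypothetical merge rates for two categories of models.
groups = [[0.30, 0.42, 0.35], [0.38, 0.44, 0.33]]
n = sum(len(g) for g in groups)
print("pooled   ", mse_pooled(groups), aic_like(mse_pooled(groups), n, 1))
print("per-group", mse_per_group(groups), aic_like(mse_per_group(groups), n, 2))
```

The per-group fit can never have a higher in-sample MSE than the pooled fit, so a one-term model only "wins" once the score is penalized or computed on held-out data.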

Havoc3mo ago· 1 in thread

As they become more capable, people's commits will also become more ambitious.

So I’d say fairly flat commit acceptance numbers make sense even in the context of improving LLMs

jygg43mo ago

Indeed. Why is this post downvoted? There are always trade-offs taking place, it’s good to call them out.

pu_pe3mo ago· 1 in thread

Benchmaxxing aside, if you are using those tools for programming on a regular basis it should be self-evident that they are improving. I find it very hard to believe that someone using LLMs today vs what was available one year ago (Claude Code released Feb 2025) would have any difficulty answering this question.

Zababa3mo ago

I think it is important to try to find more rigorous things to test than the general sentiment of the people using the tools. If only because the more benchmarks we have the more we can improve models without regressions. METR is asking a really interesting question here, "are models improving at making one shot PRs?". The answer seems to be, yes, but slower than benchmarks suggest, if you look at the pass rate of different versions of Claude Sonnet. A reasonable answer is "you're not supposed to use them by making one shot PRs", but then ideally we would need to have some kind of standardized test for the ability of models to incorporate feedback and evolve PRs.

davecoffin3mo ago· 1 in thread

I've been able to supercharge a hobby project of mine over the last couple months using Opus 4.6 in Claude Code. I still had to collaborate and write code, but Claude did like 75% of the work to add meaningful new features to an iOS/Android native mobile app, including Live Activities, which is so overly complicated I would not have been able to figure it out. I have it running in a folder that contains both my back end API (Express) and my mobile app (NativeScript), so it does back end and front end work simultaneously to support new features. This wasn't possible 8 months ago.

polyterative3mo ago

I have a similar experience. My hobby project was put on hold after a burnout and lack of motivation. I got a big burst of energy back when I started implementing some long desired features quickly with these new models. I was able to get the project to the point of what I consider is maturity. I did in a month during free time the kind of work that would have burned me up in a good six months fulltime.

juancn3mo ago· 1 in thread

Well, on one hand they lack new data. Lots of new code came out of an LLM, so it feeds back.

On the other hand, LLMs tend to go for an average by their nature (if you squint enough). What's more common in their training data is more common in the output, so making them better without fundamental changes requires improving the training data on average too, which is hard.

What did improve a lot is the tooling around them. That's gotten way better.

_zagj3mo ago

> Well, on one hand they lack new data. Lot's of new code came out of an LLM, so it feeds back.

Supposedly model curation is a Big Deal at Big AI, and they're especially concerned about Ouroboros effects and poisoned data. Also people are still contributing to open source and open sourcing new projects, something that should have slowed to a trickle by 2023, once it became clear that from now on, you're just providing the fuel for the machines that will ultimately render you unemployable (or less employable), and that these machines will completely disregard your license terms, including those of the most permissive licenses that seek only attribution, and that you're doing all of this for free.

pnathan3mo ago· 1 in thread

Data is missing on this chart.

It's my experience that Opus 4, and then, particularly, 4.5, in Claude Code, are head and shoulders above the competition.

I wrote an agentic coder years ago and it yielded trash. (I tried to make it do then what Kiro does today.)

The models are better. Now, a caveat: I don't use anything but Opus for coding. Sonnet doesn't do the trick. My experience with Codex and Gemini is that their top models are as good as Sonnet for coding...

BloondAndDoom3mo ago

I was trying to do something yesterday and Claude kept messing it up. After like an hour I realized the model had somehow switched to Sonnet. Opus 4.6 is crazy good. It’s very obvious in practice.

Although I feel like for chasing bugs and big systems Codex is even better.

GaggiX3mo ago· 1 in thread

How does the "constant function" result fit the data points better than a slope, which has two parameters instead of one?

kqr3mo ago

Cross-validation. The slope overfits when the test set is excluded from the data the model is fitted on.
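For anyone who wants to see how a one-parameter model can beat a two-parameter one, here is a rough sketch of leave-one-out cross-validation in plain Python. The five data points are made up for illustration and are not the article's numbers, the helper names (`fit_constant`, `fit_line`, `loo_score`, `in_sample_score`) are hypothetical, and the score is a Brier-style mean squared error on rates rather than the article's exact procedure.

```python
from statistics import mean

def fit_constant(xs, ys):
    """One parameter: always predict the mean of the training outcomes."""
    c = mean(ys)
    return lambda x: c

def fit_line(xs, ys):
    """Two parameters: ordinary least-squares line through the training points."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def in_sample_score(fit, xs, ys):
    """Mean squared error when the model is scored on the data it was fitted on."""
    predict = fit(xs, ys)
    return mean((predict(x) - y) ** 2 for x, y in zip(xs, ys))

def loo_score(fit, xs, ys):
    """Leave-one-out: fit on all points but one, score on the held-out point."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predict = fit(train_x, train_y)
        # Clamp to [0, 1] because the prediction is meant to be a rate.
        p = min(1.0, max(0.0, predict(xs[i])))
        errors.append((p - ys[i]) ** 2)
    return mean(errors)

# Made-up merge rates for five hypothetical releases (x = release order).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.50, 0.30, 0.55, 0.35, 0.50]

print("in-sample  constant:", in_sample_score(fit_constant, xs, ys))
print("in-sample  line:    ", in_sample_score(fit_line, xs, ys))
print("held-out   constant:", loo_score(fit_constant, xs, ys))
print("held-out   line:    ", loo_score(fit_line, xs, ys))
```

On the data it was fitted on, the line can never do worse than the constant, since the constant is just a line with zero slope. On these noisy made-up points the held-out score comes out the other way round, because with so few points the fitted slope chases noise, especially when an endpoint is the one left out.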

WithinReason3mo ago· 1 in thread

If you look at a separate trend for the smaller Sonnet models, you can see a rapid upward trend.

suddenlybananas3mo ago

3.7 to 4.5 looks pretty flat here.

boonzeet3mo ago

Interesting article, although with so few data points and such a specific time slice it is difficult to draw serious conclusions about the "improvement" of LLM models.

It's notably lacking newer models (4.5 Opus, 4.6 Sonnet) and models from Gemini.

LLMs appear to naturally progress in short leaps followed by longer plateaus, as breakthroughs are developed such as chain-of-thought, mixture-of-experts, sub-agents, etc.

Incipient3mo ago

I feel even if the models are stagnating, the tooling around them, and the integrations and harnesses they have, are getting significantly more capable (if not always 'better' - the recent VS Code update really handicapped them for some reason). Things like the new agent from booking.com or whatever, if it could integrate with all hotels, activities, mapping tools, flight systems, etc., could be hugely powerful.

Assuming we get no better than Opus 4.6, they're very capable. Even if they make up nonsense 5% of the time!

idorozin3mo ago

My experience has been that raw “one-shot intelligence” hasn’t improved as dramatically in the last year, but the workflow around the models has improved massively.

When you combine models with:

tool use

planning loops

agents that break tasks into smaller pieces

persistent context / repos

the practical capability jump is huge.

utopiah3mo ago

I gave up on trying months ago; you can see the timeline at the top of https://fabien.benetou.fr/Content/SelfHostingArtificialIntel...

Truth is I'm probably wrong. I should keep on testing ... but at the same time I gave up precisely because I didn't think the trend was fast enough to keep investing in checking it so frequently. Now I just read this kind of post and ask around (mainly arguing in comments, asking for genuine examples that should be "surprising", and continuing to be disappointed), and that seems to be enough of a proxy.

I should though, as I mentioned in another comment, keep track of failed attempts.

PS: I check solely on self-hosted models (even if not on my machine, at least on machines I could set up) because I do NOT trust the scaffolding around proprietary closed-source models. I can't verify that nobody is in the loop.

globular-toast3mo ago

I reckon LLM merge rates will go up, but not necessarily due to quality improvements. Instead I think maintainers will just become fatigued. The amount of code I'm expected to review now is way higher than before. And while I'm reviewing you know more is being generated. I'm sure I've let through more crap due to this fatigue attack on me.

sd93mo ago

You really can't model these 5 data points with a linear regression or a step function. The models are of different sizes / use cases, and from two different labs. I feel like what we've observed generally is that similarly sized models released by different labs at similar times are pretty similar.

I think the only reasonable thing to read into is Sonnet 3.5 -> 3.7 -> 4.5. But yeah, you just can't draw a line through this thing.

I will die on the hill that LLMs are getting better, particularly Anthropic's releases since December. But I can't point at a graph to prove that, I'm just drawing on my personal experience. I do use Claude Code though, so I think a large part of the improvement comes from the harness.

antisthenes3mo ago

They are getting better, but they are also hitting diminishing returns.

There's only so much data to train on, and we are unlikely to see giant leaps in performance as we did in 2023/2024.

2026-27 will be the years of primarily ecosystem/agentic improvements and reducing costs.

jwpapi3mo ago

I've had this suspicion for a while: I think we just got way better at harnessing, not at the models' actual reasoning.

So we got better at giving it the right context and tools to do the stuff we need to do, but there were no actual improvements in thinking.

camdenreslink3mo ago

From my personal experience, they have gotten better, but they haven’t unlocked any new capabilities. They’ve just improved at what I was already using them for.

At the end of the day they still produce code that I need to manually review and fully understand before merging. Usually with a session of back-and-forth prompting or manual edits by me.

That was true 2 years ago, and it’s true now (except 2 years ago I was copy/pasting from the browser chat window and we have some nicer IDE integration now).

Slav_fixflex3mo ago

As someone who builds with LLMs daily without being a developer, I notice quality differences more in practical output than benchmarks. Claude handles complex multi-step tasks better in my experience, but consistency is still the biggest challenge – same prompt can give very different results day to day.

casey23mo ago

>fischer warned us against eyeballing plots proceeds to eyeball it with an arbitrary function

There was a long flat line before the step. Models improve, but PR pass rate without human intervention is inherently a staircase function.

delichon3mo ago

Yesterday I asked a frontier model to help generate a report. It said great, it can do that, and output a table. I asked it to evaluate its prompt compliance in the result. It concluded that it had failed on every requirement. I asked why it had expressed such confidence: was it analogous to narcissism or psychopathy? It said no, and then said that if I just had to anthropomorphize it, I should think of it as a brilliant friend with severe frontal lobe brain damage.

That actually helps.

techcam3mo ago

I’ve been noticing the same — a lot of failures aren’t obvious “jailbreaks,” they’re just subtle prompt structure issues that only show up in production.

dmos623mo ago

Tangential: I've found that having an LLM recreate the full file, with changes applied, is less mistake-prone than producing a patch. I wonder if anyone else has come to this conclusion too.

varispeed3mo ago

In my niche, Opus 4.6 has been a game changer. In comparison, all other LLMs look stupid. I am considering cancelling all other subscriptions.

boxedemp3mo ago

How do they know? Not everybody includes the "coauthored by Claude" line. I certainly don't.

ordersofmag3mo ago

Even if one-shot LLM performance has plateaued (which I'm not convinced this data shows, given the omission of recent models that are widely claimed to be better), that misses the point that I see in my own work. The improved tooling and agent-based approaches that I'm using now make the LLM one-shot performance only a small part of the puzzle in terms of how AI tools have accelerated the time from idea to decent code. For instance, the planning dialogs I now have with Claude are an important part of what's speeding things up for me. Also, the iterative use of AI to identify, track, and take care of small coding tasks (none of which are particularly challenging in terms of benchmarks) is simply more effective. Could this all have been done with the LLM engines of late 2024? Perhaps, but I think the fine-tuning (and conceivably the system prompts) that make the current LLMs more effective at agent-centered workflows (including tool use) are a big part of it. One-shot performance at challenging tasks is an interesting, certainly foundational, metric. But I don't think it captures the important advances I see in how LLMs have gotten better over the last year in ways that actually matter to me. I rarely have a well-defined programming challenge and the obligation to solve it in a single shot.

sigmar3mo ago

>This means the step function has more predictive power (“fits better”) than the linear slope. For fun, we can also fit a function that is completely constant across the entire timespan. That happens to get the best Brier score.

I mean, sure. But it's obvious in that graph that the single OpenAI model is dragging down the right side. Wouldn't it be better to just stick to analyzing models from only one lab, so that this was showing change over time rather than differences between models?

Zababa3mo ago

From the METR study (https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs...):

>To study how agent success on benchmark tasks relates to real-world usefulness, we had 4 active maintainers from 3 SWE-bench Verified repositories review 296 AI-generated pull requests (PRs). We had maintainers (hypothetically) accept or request changes for patches as well as provide the core reason they were requesting changes: core functionality failure, patch breaks other code or code quality issues.

I would also advise taking a look at the rejection reasons for the PRs. For example, Figure 5 shows two rejections for "code quality" because of (and I quote) "looks like a useless AI slop comment." This is something models still do, but that is also very easily fixable. I think in that case the issue is that the level of comment wanted hasn't been properly formalized in the repo and the model hasn't been able to deduce it from the context it had.

As for the article, I think mixing all models together doesn't make sense. For example, maybe a slope describes the improvement across Claude Sonnet releases better than a step function does.

raincole3mo ago

No Gemini. No Opus 4.5. No GPT codex.

As they said, ragebait used to be believable.

sigbottle3mo ago

LLMs have 100% gotten better, but it's hard to say if they're "intrinsically better", if that makes sense.

> OpenAI’s leading researchers have not completed a successful full-scale pre-training run that was broadly deployed for a new frontier model since GPT-4o in May 2024 [1]

That's evidence against "intrinsically better". They've also trained on the entire internet - we only have 1 internet, so.

However, late 2024 was the introduction of o1, and early 2025 was DeepSeek R1 and o3. These were definitely significant reasoning models: the introduction of test-time compute and significant RL pipelines happened here.

Mid 2025 was when they really started getting integrated with tool calling.

Late 2025 is when they really started to become agentic and integrate with the CLI pretty well (at least for me). For example, Codex would at least try to run some smoke tests for itself to test its code.

In early 2026, the trend now appears to be harness engineering. As opposed to "context engineering" in 2025, where we had to preciously babysit 1 model's context, we make it both easier to rebuild context (classic CS trick btw: rebooting is easier than restoring stale state [2]) and really lean into raw CLI tool calling, subagents, etc.

[1] https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...

[2] https://en.wikipedia.org/wiki/Kernel_panic

FWIW, AI programming has still been as frustrating as it was when it was just TTC in 2025. Maybe it's because I don't have the "full harness", but it still has programming styles embedded, such as silent fallback values, overly defensive programming, etc., which are obviously gleaned from the desire to just pass all tests rather than from truly good programming design. I've been able to do more, but I have to review more slop... Also, the agents are really unpleasant to work with if you're trying to have any reasonable conversation with them and not just delegate to them. It's as if they think the entire world revolves around them and all information from the operator is BS if you try to open a proper 2-way channel.
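To make the "silent fallback values" complaint concrete, here is a small made-up illustration in Python. Both functions and the `timeout` config key are hypothetical, not taken from any real project or from actual model output; it is only a sketch of the style difference.

```python
def read_timeout_silent(config: dict) -> float:
    """The pattern being complained about: any problem quietly becomes a default."""
    try:
        # A missing key falls back to 30, and so does an unparseable value.
        return float(config.get("timeout", 30))
    except (TypeError, ValueError):
        return 30.0

def read_timeout_strict(config: dict) -> float:
    """The alternative: fail loudly so a bad config is noticed immediately."""
    if "timeout" not in config:
        raise KeyError("config is missing required key 'timeout'")
    value = float(config["timeout"])  # raises ValueError on input like "abc"
    if value <= 0:
        raise ValueError(f"timeout must be positive, got {value}")
    return value

# A typo in the key name ("timout") goes unnoticed by the silent version.
typo_config = {"timout": "5"}
print(read_timeout_silent(typo_config))  # quietly uses the default
try:
    read_timeout_strict(typo_config)
except KeyError as err:
    print("strict version refused:", err)
```

The silent version keeps every test green while hiding the typo; the strict one surfaces it at the first call.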

It seems like 2026 will go full zoom with AI tooling because the goal is to replace devs, but hopefully AI agents become actually nice to work with. Not sycophantic, but not passive-aggressively arrogant either.

codeulike3mo ago

> This means llms have not improved in their programming abilities for over a year. Isn’t that wild? Why is nobody talking about this?

Because hype makes money.


