Many SWE-bench-Passing PRs would not be merged

(metr.org)

278 points · mustaphah · 3mo ago · 153 comments

90 comments · 22 top-level

cornstalks3mo ago· 41 in thread

Anecdote time! I had Codex GPT 5.4 xhigh generate a Rust proc macro. It's pretty straightforward: use sqlparser to parse a SQL statement and extract the column names of any row-producing queries.

It generated an implementation that worked well, but I hated the ~480 lines of code. The structure and flow were just... weird. It was hard to follow and I was seriously bugged by it.

So I asked it to reimplement it with some simplifications I gave it. It dutifully executed, producing a result >600 lines long. The flow was simpler and easier to follow, but still seemed excessive for the task at hand.

So I rolled up my sleeves and started deleting code and making changes manually. A little bit later, I had it down to <230 lines with a flow that was extremely easy to read and understand.

So yeah, I can totally see many SWE-bench-passing PRs being functionally correct but still terrible code that I would not accept.

SerCe3mo ago

If you've got some time, I highly recommend going through the exercise of trying to change the prompt in a way that would produce code similar to what you've achieved manually. Doing a similar exercise really helps to improve agent prompting skills, as it shows how changing parts of the prompt influences the result.

foltik3mo ago

I haven’t had any luck prompting LLMs to “have taste.” They seem to over-fixate on instructions (e.g. golfing when asked for concise code) or require specifying so many details and qualifications that the results no longer generalize well to other problems.

Do you have any examples or resources that worked well for you?

zarzavat3mo ago

Yeah prompting doesn't work for this problem because the entire point of an LLM is you give it the what and it outputs the how. The more how that you have to condition it with in the prompt, the less profitable the interaction will be. A few hints is OK, but doing all the work for the LLM tends to lead to negative productivity.

Writing prompts and writing code takes about the same amount of time, for the same amount of text, plus there's the extra time that the LLM takes to accomplish the task, and review time afterwards. So you might as well just write the code yourself if you have to specify every tiny implementation detail in the prompt.

SerCe3mo ago

> Do you have any examples or resources that worked well for you?

Using this particular example, if you simply paste the exact code into the prompt, the model should be able to reproduce it. Now, you can start removing the bits and see how much you can remove from the prompt, e.g. simplify it to pseudocode, etc. Then you can push it further and try to switch from the pseudocode to the architecture, etc.

That way, you'll start from something that's working and work backwards rather than trying to get there in the absence of a clear path.

johndough3mo ago

What worked for me was Gemini 3 Pro (I guess 3.1 should work even better now) with the prompt "This code is unnecessarily complicated. Simplify it, but no code golf". This decreased code size by about 60 %. It still did a bit of code-golfing, but it was manageable.

It is important to start a new chat so the model is not stuck in its previous mindset, and it is beneficial to have tests to verify that the simplified code still works as it did before.

Telling the model to generate concise code did not work for me, because LLMs do not know beforehand what they are going to write, so they are rarely able to refactor existing code to break out common functionality into reusable functions. We might get there eventually. Thinking models are a bit better at it. But we are not quite there yet.

newswasboring3mo ago

I have a stupid solution for this which is working wonders. It does not help to tell the LLM "don't do this pattern". I literally make it write a regex based test which looks for that pattern and fails the test.

For example, I am developing a game using GDScript. LLMs (including codex and claude) keep making scripts with no classnames and then loading them with @preload. Hate this, and it's explicitly mentioned in my godot-development skill. What agents can't stand is a failing test. Feels a bit like enforcing rules automatically.

This is a stupid idea but it works wonders on giving taste to my LLM. I wonder if I should open source that test suite for other agentic developers.
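
A minimal sketch of the regex-based test idea described above, written in Python rather than GDScript. The banned pattern (a `preload()` call that loads another `.gd` script by path), the function names, and the idea of passing sources in as a dict are all assumptions made for illustration; they are not taken from the commenter's actual test suite.

```python
import re

# Hypothetical rule: flag preload() calls that load another .gd script by path.
PRELOAD_SCRIPT = re.compile(r"""preload\(\s*["'][^"']+\.gd["']\s*\)""")


def find_violations(source: str) -> list[int]:
    """Return 1-based line numbers whose code part matches the banned pattern."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Crude comment strip: ignores the case of '#' inside a string literal.
        code = line.split("#", 1)[0]
        if PRELOAD_SCRIPT.search(code):
            hits.append(lineno)
    return hits


def check_no_script_preloads(sources: dict[str, str]) -> None:
    """Raise AssertionError naming every offending file and line."""
    offenders = {}
    for path, text in sources.items():
        lines = find_violations(text)
        if lines:
            offenders[path] = lines
    assert not offenders, f"banned preload pattern found: {offenders}"
```

In a real project the dict would presumably be built by reading every script under the project directory, and the check would run as part of the normal test suite so the agent sees it fail.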

XenophileJKO3mo ago

I really should spend some time analyzing what I do to get the good output I get..

One thing that is fairly low effort that you could try is find code you really like and ask the model to list the adjectives and attributes that that code exhibits. Then try them in a prompt.

With LLMs generally you want to adjust the behavior at the macro level by setting things like beliefs and values, vs at the micro level by making "rules".

By understanding how the model maps the aspects that you like about the code to language, that should give you some shorthand phrases that give you a lot of behavioral leverage.

Edit: Better yet.. give a fresh context window the "before" and "after" and have it provide you with contrasting values, adjectives, etc.

ndriscoll3mo ago

Concise isn't specific enough: I've primed mine on basic architecture I want: imperative shell/functional core, don't mix abstraction levels in one function, each function should be simple to read top-to-bottom with higher level code doing only orchestration with no control flow. Names should express business intent. Prefer functions over methods where possible. Use types to make illegal states unrepresentable. RAII. etc.

You need to think about what "good taste" is to you (or find others who have already written about software architecture and take their ideas that you like). People disagree on what that even means (e.g. some people love Rails. To me a lot of it seems like the exact opposite of "good taste").

stared3mo ago

I spend much more time refactoring than creating features (though it is getting better with each model). My go-to approach is to use Claude Code Opus 4.6 for writing and Gemini 3.1 Pro for cleaning up. I feel that doing it in just one stage is rarely enough.

A lot of prompts about finding the right level of abstraction, DRY, etc.

An earlier example (Opus 4.5 + Gemini 3 Pro) is here: https://github.com/stared/sc2-balance-timeline

I also tried just using Gemini 3 Pro (maybe the model, maybe the harness): it was not nearly as good at writing, but way better at refining.

brap3mo ago

I actually don’t think golfing is such a bad thing, granted it will first handle the low hanging fruits like variable names etc, but if you push it hard enough it will be forced to think of a simpler approach. Then you can take a step back and tell it to fix the variable names, formatting etc. With the caveat that a smaller AST doesn’t necessarily mean simpler code, but it’s a decent heuristic.

irthomasthomas3mo ago

Have you tried meta-prompts e.g. "Rewrite the prompt to improve the perceived taste and expertise of the author"

globnomulous3mo ago

I appreciate that your message is a good-natured, friendly tip. I don't mean for the following to crap on that. I just need to shout into the void:

If I have some time, the last thing I want to do with it is sharpen prompting skills. I can't imagine a worse or more boring use of my time on a computer or a skill I want less.

Every time I visit Hacker News I become more certain that I want nothing to do with either the future the enthusiasts think awaits us or the present that they think is building towards it.

vasco3mo ago

You don't need to learn anything, it needs to learn from you. When it fails, don't correct it out of band, correct it in the same UI. At the end say "look at what I did and create a proposed memory with what you learned" and if it looks good have it add it to memories.

laserlight3mo ago

> change the prompt in a way that would produce code similar to what you've achieved manually.

The problem is that I don't know what I'll achieve manually before attempting the task.

Bridged77563mo ago

This better reflects what I thought about the other day. You either let clankers do their thing and then bake in your implementation on top, or you think it through and make them do it. But at the end of the day you've still gotta THINK of the optimal solution and state of the code, at which point, do clankers do anything aside from saving you a bunch of keypresses and maybe catching a couple of bugs?

avereveard3mo ago

Also useful to encode into the steering of your platform. The incremental aspect of many little updates really helps pick up speed by reducing review time.

A big-bang approach could be a start, but a lot of one-line guidance about specific things you don't want to see stacks up real fast.

aix13mo ago

My mildly amusing anecdote is that, whenever Claude Code produces something particularly egregious, I often find it sufficient to reply with just "wtf?" for it to present a much more sensible version of the code (which often needs further refinement, but that's another story...)

ernst_klim3mo ago

Indeed. I have a few colleagues and they constantly try to push these long convoluted functions which look like

    is_done = False
    while not is_done:
      if pattern1:
        ...
        if pattern2:
          ...
          if matched == "SUCCESS":
            is_done = True
            break
        if pattern3:
          ...

It's usually correct but extremely hard to follow, and it reminds me of good old asm code with convoluted gotos.

And the colleagues tend to do reviews with the help of the agents so they don't even care to read this mess.

laserlight3mo ago

I reported a similar case of mine several days ago [0]. I was able to achieve better quality than Claude Code's 624 lines of spaghetti code in 334 lines of well-designed code. In a previous case, I rewrote ~400-line LLM generated code in 130 lines.

[0] https://news.ycombinator.com/item?id=47272913

scuff3d3mo ago

Had the same problem with a Python project. Just for the hell of it I tried to have it implement a simple version of a proxy I've made in the past. What was finally produced "technically" worked, but it was a mess. It suppressed exceptions all over the place, it did weird shit with imports it couldn't get to work, and the way it managed connection state was bizarre.

It has a third-year college student's approach to "make it work". It can't take a step back and reevaluate a situation, or determine a new path forward, it just hammers away endlessly with whatever it's trying until it can technically be called "correct".

kqr3mo ago

When I benchmark LLMs on text adventures, they reason like four-year-olds but have the world's largest vocabulary and infinite patience. I'm not surprised this is how they approach programming too.

duskdozer3mo ago

>It has a third-year college student's approach to "make it work". It can't take a step back and reevaluate a situation, or determine a new path forward, it just hammers away endlessly with whatever it's trying until it can technically be called "correct".

OH! Yeah I think this is the exact bad feeling I've gotten whenever I've tried testing these things before, except without clear and useful feedback like compiler error messages or something. I remember when I used to code/learn like that early on and...it's not fun now. I also don't think it's really solvable

iamflimflam13mo ago

We’re heading for a world of terrible code that can only be maintained by extremely good coding agents and is pretty much impossible for a human to really understand.

The days of the deep expert, who knew the codebase inside out and had it contained in their head, are coming to an end.

thesz3mo ago

> We’re heading for a world of terrible code that can only be maintained by extremely good coding agents and is pretty much impossible for a human to really understand.

I once figured out the algorithm of a program written in a one-instruction ISA. I think the instruction was three-address subtraction.

In my opinion, you overestimate the ability of coding agents to, well, code and underestimate the ability of humans to really understand code.

The chart in the article we discuss appears to plateau if one excludes the sample from 2024-07. So we are not quite heading there; we are plateauing, if I may.

pas3mo ago

that was the exception not the rule

hinkley3mo ago

Then this is an era of snake oil because customers aren’t going to put up with that shit for long.

Gud3mo ago

They’ve been putting up with crappy software for two decades (at least).

mplanchard3mo ago

I had a similar experience yesterday. Was working on some async stream extensions. Wrote a couple proofs of concept to benchmark, and picked one based on the results. I almost never use LLMs to write code, but out of curiosity, asked whatever the newest claude is to make it with all the real prod requirements, and it spit out over 400 lines of code, lots of spaghetti, with strange flow and a lot of weird decisions. Wrote it myself with all the same functionality in right around 170 lines.

Also had a similar experience in the past weeks reviewing PRs written with LLMs by other engineers in languages they don't know well, one in rust and one in bash. Both required a lot of rounds of revision and a couple of pairing sessions to get to a point where we got rid of the extraneous bits and made it read normally. I'm glad the tool gave these engineers the confidence to work in areas they wouldn't normally have felt comfortable contributing to, but man do I hate the code that it writes.

lmeyerov3mo ago

Once my code exists and passes tests, I generally move on to having it iteratively hunt for bugs, security issues, and DRY code reduction opportunities until it stops finding worthwhile ones.

This doesn't always work as well as I'd like, but largely does enough. Conversely, doing as I go has been a waste of time.

yodsanklai3mo ago

Happens all the time. I usually propose a detailed structure myself (e.g. do it in three phases, add 3 functions + an orchestrator, make sure structure is valid before writing the function bodies), or iterate on a detailed plan before implementing code.

Now some people argue that terrible code is fine nowadays, because humans won't read it anymore...

tobr3mo ago

I wonder why they fail this specific way. If you just let them do stuff everything quickly turns spaghetti. They seem to overlook obvious opportunities to simplify things or see a pattern and follow through. The default seems to be to add more, rather than rework or adjust what’s already in place.

samdjstephens3mo ago

I suspect it has something to do with a) the average quality of code in open source repos and b) the way the reward signal is applied in RL post-training - does the model face consequences of a brittle implementation for a task?

I wonder if these RL runs can extend over multiple sequential evaluations, where poor design in an early task hampers performance later on, as measured by amount of tokens required to add new functionality without breaking existing functionality.

foo423mo ago

Yeah I've been wondering if the increasing coding RL is going to draw models towards very short term goals relative to just learning from open source code in the wild

catlifeonmars3mo ago

To me this seems like a natural consequence of the next-token prediction model. In one particular prompt you can’t “backtrack” once you’ve emitted a token. You can only move forwards. You can iteratively refine (e.g. the agent can one shot itself repeatedly), but the underlying mechanism is still present.

I can’t speak for all humans, but I tend to code “nonlinearly”, jumping back and forth and typically going from high level (signatures, type definitions) to low level (fill in function bodies). I also do a lot of deletion as I decide that actually one function isn’t needed or if I find a simpler way to phrase a particular section.

Edit: in fact thinking on this more, code is _much_ closer to a tree than a sequence of tokens. Not sure what to do with that, except maybe to try a tree based generator which iteratively adds child nodes.

tobr3mo ago

This would make sense to me as an explanation when it only outputs code. (And I think it explains why code often ends up subtly mangled when moved in a refactoring: where a human would copy-paste, the agent instead has to “retype” it and often ends up slightly changing formatting, comments, identifiers, etc.)

But for the most part, it’s spending more tokens on analysis and planning than pure code output, and that’s where these problems need to be caught.

OtomotO3mo ago

All it does is generate soup. Some of which may taste good.

There is no thinking, no matter what marketing tells you.

Antibabelic3mo ago

LLMs are next token predictors. Their core functionality boils down to simply adding more stuff.

logicchains3mo ago

They do what you tell them to. If you regularly tell them to look for opportunities to clean up/refactor the code, they will.

mvanzoest3mo ago

Yeah I had a similar experience on a smaller scale, reducing a function from 125 lines to 25.

cbg03mo ago

xhigh effort is actually pretty terrible for 5.2/5.3/5.4 models. Stick to medium/high as it overthinks less.

jlandersen3mo ago

Very familiar experience

bisonbear3mo ago· 5 in thread

I've been working on building out "evals for your repo" based on the theory that commonly used benchmarks like SWE-bench are broken as they are not testing the right / valuable things, and are baked into the training data (see OpenAI's research on this here https://openai.com/index/why-we-no-longer-evaluate-swe-bench...)

Interestingly, I had a similar finding where, on the 3 open-source repos I ran evals on, the models (5.1-codex-mini, 5.3-codex, 5.4) all had relatively similar test scores, but when looking at other metrics, such as code quality, or equivalence to the original PR the task was based on, they had massive differences. posted results here if anyone is curious https://www.stet.sh/leaderboard

dirtbag__dad3mo ago

This sounds amazing. In particular, I like comps to existing PRs. But I’m also not sure that I want existing PRs to be a template for most things reasonable or best practice.

I’ve been building out internal linters that enforce design patterns I want and raise common code smells (also note tools like eslint allow custom rules which are easy to write with something like opus 4.6). The use case is a total refactor of react and fastapi apps. We are suffering from everything’s a snowflake syndrome and just want the same pattern employed across features.

This works pretty well when the linter has a companion agents.md file which explains the architecture and way about the world.

But beyond getting the agent (Claude code opus 4.6 currently) to nail the directory structure and design primitives, and limiting some doofus behavior, I still haven’t cracked how to make literally each line of code simple and sensible. And I haven’t figured out how to prevent agents from going out of bounds and doing weird things unless I catch it in review and add another rule.

This is a relatively new endeavor, but my gut is that it’s not much more time (linter rules and perhaps “evals” or a beefy agent review cycle) before I have bespoke linters in place that force what I want from our architecture.

Note that a huge bottleneck to all of this is that the codebase our current team inherited has no tests. It’s too easy to accidentally nuke a screen’s subtle details. It’s also really hard to write good tests without knowing what all of the functionality is. It feels like a blocker to a lot of large-swath agentic changes is a test strategy or solution first then a rigid push for rearchitecture or new design.

bisonbear3mo ago

yikes, using AI without tests is not fun. with testing at least you have some confidence that the AI isn't going completely off track, without them you're pretty much flying blind

having linters is super important IMO - I never try to make the AI do a linter's job. let the AI focus on the hard stuff - architecture, maintainability, cleanliness, and the linter can handle the boring pieces.

I also definitely see the AI making changes that are way larger than necessary. I try to capture that in the eval by comparing a "footprint risk" which is essentially how many unnecessary changes did the AI make vs the original PR.

I would certainly like to move beyond using PRs as a sole source of truth, since humans don't always write great code either. Maybe having LLM-as-a-judge looking for scope creep/bloat would be a decent band-aid?

ebhn3mo ago

Nice, I really like your idea. First I've heard of something like that

floodfx3mo ago

Working on that too. Lmk if you’re up for a chat?

bisonbear3mo ago

yea I'm down - feel free to send me an email ben@benr.build

coderenegade3mo ago· 5 in thread

There needs to be a measure (or measures) of the entropy of a codebase that provides a signal of complexity. When you're paying for every token, you want code patterns that convey a lot of immediate information to the agent so that it can either repeat the pattern, or extend it in a way that makes sense. This is probably the next wave of assisted coding (imo), because we're at the stage where writing code works, the quality is mostly decent, but it can be needlessly complex given the context of the existing repo.

js83mo ago

There's a way to measure "entropy" of a codebase. Take something like the binary lambda calculus or the triage calculus, convert your program (including libraries, programming language constructs, operating system) into it, and measure the size of the program in it in bits.

You can also measure the cross-entropy, which is essentially the whole program entropy above minus the entropy of the programming language and functions from standard libraries (i.e. abstractions that you assume are generally known). This is useful to evaluate the conformance to "standard" abstractions.

There is also a way to measure a "maximum entropy" using types, by counting the number of states a data type can represent. The maximum entropy of a function is a cross-entropy between inputs and outputs (treating the function like a communication channel).

The "difference" (I am not sure how to make them convertible) between "maximum entropy" and "function entropy" (size in bits) then shows how good your understanding (compared to specification expressed in type signature) of the function is.

I have been advocating for some time that we use entropy measures (and information theory) in SW engineering to do estimation of complexity (and thus time required for a change).

malfist3mo ago

Maybe cyclomatic complexity would be a good proxy. It can obviously be gamed but it's obvious when it is

johncomposed3mo ago

There was a measure used during the Toyota Unintended Acceleration case called McCabe Cyclomatic Complexity, I wonder if anyone is using it alongside AI assisted code.
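
As a rough illustration of the cyclomatic complexity proxy discussed above, here is a minimal Python sketch built on the standard `ast` module. It counts decision points in Python source only, uses a simplified set of branching constructs chosen for this example, and is not a substitute for an established tool; the function name `cyclomatic_complexity` is hypothetical.

```python
import ast

# Node types counted as one extra decision point each.
# This is a simplification of McCabe's metric, chosen for illustration.
_BRANCHES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity of a snippet: 1 + number of decision points."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCHES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two short-circuit branches.
            score += len(node.values) - 1
    return score
```

A gate along the lines of "fail the check if any changed function scores above some threshold" would be one way to feed this back to an agent, though as noted above the number can be gamed.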

kqr3mo ago

It is roughly equivalent to diff size: https://entropicthoughts.com/lines-of-code

bandrami3mo ago

I mean, it's ultimately a string, and the measurement of the entropy of a string is well-studied. The LLM might start gaming that with variable names so you'd need to do the AST instead. I may actually try something like that; cool idea.
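
One way the AST-based variant might look, as a hedged Python sketch: it computes the Shannon entropy of the distribution of AST node types, so renaming variables has no effect on the score. Treating node-type frequencies as the "alphabet" is an assumption made for this example, and the function name `ast_node_entropy` is hypothetical; this is only one of many possible definitions.

```python
import ast
import math
from collections import Counter


def ast_node_entropy(source: str) -> float:
    """Shannon entropy, in bits per node, of the AST node-type distribution."""
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Because only node types are counted, `total = price * quantity` and `x = y * z` score identically, which addresses the variable-name gaming concern but also means the measure says nothing about naming quality.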

antirez3mo ago· 4 in thread

Of what is happening with AI, the most bizarre thing, for me, is how these tools are $20 away from being tested. Yet, to form an idea about actual real-world usefulness, many folks seek some kind of indirect proxy.

This is combined with the incredible general feeling that automatic programming can be evaluated as producing the same results regardless of the user using it. Something true only with benchmarks, basically. Benchmarks are useful metrics because even if weak we need some guidance, but the current real world dynamic is that AI will completely change what it is capable of doing based on the programmer using it.

Maybe never in the history of programming was there a time where diverse programming skills were as important as today (but this may change as AI evolves).

croemer3mo ago

Benchmarks do a few things: 1. Help choose a model from the hundreds out there, or at least help create a shortlist to try. 2. Quantify progress/improvements (or lack thereof) over time. 3. Inform about relative strengths and weaknesses.

utopiah3mo ago

Assuming the benchmark can't be gamed.

utopiah3mo ago

> automatic programming can be evaluated as producing the same results regardless of the user using it.

That's something I've argued here several times, and it's actually rarely done. Namely, it's totally different when a non-developer uses such a tool for programming vs when a (senior) SWE does. That's a fundamental point, which IMHO is about the potential for (non-risk-free) augmentation versus replacement. Replacement makes for an excellent narrative (if not scapegoat), yet if the tool is "productive" (with KPIs to agree on) only with skilled staff, then it's not the reality, just a "wish".

Archit3ch3mo ago

I'm about to put up the 20 to see what everyone is raving about. But the real cost is time: if this doesn't work, I'm worse off than never trying.

varispeed3mo ago· 3 in thread

Do these benchmarks make any sense? I tried a few local models that seem to be scoring well in SWE but the results were pure rubbish. (For instance MiniMax-M2.5 at 128GB from unslothed - completely unusable).

segmondy3mo ago

Which quant? I find folks running lower quants complaining, yet they should be running a higher quant. Qwen3CoderNext is great, even at Q6. I mistakenly had it loaded for an agentic workflow and was surprised at how well it did.

code_biologist3mo ago

What is "lower quant"? What is "higher quant"? I mean, I know what they are, but the very people you intend to reach don't know the difference between Q4_K_M and Q6_K and blog posts like [1] have nuggets like "For tests of the type ran here, there appear to be major diminishing returns past Q4".

[1] https://big-stupid-jellyfish.github.io/GFMath/pages/llm-quan...

zozbot2343mo ago

> "For tests of the type ran here, there appear to be major diminishing returns past Q4"

These statements are silly, because the only interesting comparison is among models with highly comparable on-disk sizes, or sizes for their active parameters. Obviously, a Q4 model is not going to be the same effectiveness as a Q6: no one sensibly expects that, you need to compare the Q4 with a smaller model. (The GP has the same problem of course.) I believe that once you do that kind of comparison, higher quantizations tend to do better up to Q2 or so for casual chat, maybe slightly more bits-per-param for agentic use cases where avoiding erratic behavior is important.

stevefan19993mo ago· 3 in thread

I think a far greater problem is the human psychological and prejudice factor itself. When we hear of AI assistance on a PR, we usually jump straight to thinking "oh my god, is it another LLM slop" (for example: https://github.com/jneem/imbl/pull/149#pullrequestreview-370...). I do use AI but I review the code before I push it, yet most people don't. Once there is a trend, it is easy to form a prejudice and it is hard to go back, unless there is a substantial improvement both in quality and quantity.

Also, some people will state outright that they reject any AI code, but most maintainers employ silent-treatment tactics. And then when you demand that they review, they either close it or say "I'm too busy". I would call this one of the biggest dick moves, because it hurts the most, yet you can't find anything wrong with them until they reveal their motives.

catlifeonmars3mo ago

> I would call this one of the biggest dick moves

I don’t think that’s a fair characterization. You don’t know if the maintainer/reviewer is overloaded. No one is obligated to accept/review PRs and there is no question that the amount of noise has gone up. You are not the main character in that story, so to speak.

duskdozer3mo ago

>And still, I really hate writing those PR descriptions. Yet you can't just leave it empty.

If you can't write a description in your own words explaining why you're doing it, why should they take the time reviewing it (which they did on the same day you posted it, btw, even if one of them wasn't pleased)? It makes it seem much less likely that you read the code yourself.

JoshTriplett3mo ago

> And then when you demand them to review

You might want to think carefully about why you chose to use the word "demand" there.

(Personally, if I'm rejecting AI slop, I'm not going to do it silently. But there are any number of valid reasons to not jump on someone's PR to review it.)

nubg3mo ago· 2 in thread

> mid-2024 agents

Is this a post about AI archeology?

Lerc3mo ago

It's more about the test than the AI.

For the most part, I think the tests AI have been given have been appropriately designed. At release, many AIs do poorly at them, the models rapidly catch up until the point where a new test is needed.

They should be measuring close to the limits of ability like that.

There will be some that try and steal headlines by targeting the specific nature of the test, but that is not a long term winning solution, the tests keep getting harder. If they make a model good at every test it has seen without regression, then with enough tests, that too ceases to be a problem.

Perhaps there should be an aggregate AI test score that evaluates all of the tests released in a given year. If a model passes the latest test really well but does worse at TestSet2024 than the models before, it would perhaps indicate the model being trained to pass the latest cool test.

There is a problem with people interpreting an AI that passes a test of X,Y or Z as indicating that the AI has the abilities of a human who passes X,Y, or Z. You should tell people who say that, Kasparov makes a nice coffee.

nine_k3mo ago

LLM-written code passed SWE Bench even back then. This may just say that SWE Bench is an inadequate test, and should not be used for serious evaluation.

croemer3mo ago· 2 in thread

Figure 1 should not fit a straight line as a trend. Scores are 0 to 100%, the straight line will go outside those bounds at sufficiently large times.

The simplest reasonable model would be logistic regression. It's also got 2 parameters and the range is correct.
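
To make the bounded-range point concrete, here is a small Python sketch under stated assumptions. The data points are made up for illustration and are NOT the article's figures. The "logistic" fit here is a straight line fitted on the logit scale, which is a simplification of a proper logistic regression, but it has the same two parameters and the same (0, 1) range.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b


def logit(p):
    return math.log(p / (1 - p))


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


# Made-up pass rates (fractions) at made-up time points; not real benchmark data.
months = [0, 6, 12, 18, 24]
rates = [0.10, 0.20, 0.35, 0.50, 0.65]

a_lin, b_lin = fit_line(months, rates)
a_log, b_log = fit_line(months, [logit(r) for r in rates])  # line on the logit scale


def linear_trend(t):
    """Straight-line extrapolation; unbounded."""
    return a_lin + b_lin * t


def logistic_trend(t):
    """Sigmoid of a straight line; always strictly between 0 and 1."""
    return sigmoid(a_log + b_log * t)
```

With these illustrative numbers the two curves stay close inside the observed range, while far outside it the straight line passes 100% and the logistic curve does not.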

kqr3mo ago

Although you are technically correct, if you look at the data you'll recognise that for this narrow span of values, the logistic fit will be practically equivalent to the linear. Indeed, they perform the same in cross-validation. Here's what the logistic fit looks like: https://i.xkqr.org/logisfit.png

croemer3mo ago

Author gave me the same reply. I just don't want to have to think about whether it's equivalent or not, why use a 2 param model that's strictly less appropriate even if the difference is small.

woeirua3mo ago· 1 in thread

This paper doesn’t really tell us much. The cutoff was September of 2025. The models have improved so much that I just don’t know what you can take away from this experiment.

croemer3mo ago

SWE-bench scores are inflated when compared against actual maintainer merge decisions, as opposed to an LLM grader.

XenophileJKO3mo ago· 1 in thread

I was totally aligned until I saw the refusal for a comment in the code. When the refusals are pedantic like that, it just weakens the overall findings significantly.

finnthehuman3mo ago

Yeah, why be such a tryhard? Keeping PR friction down is what matters. Just let the codebase slowly deteriorate. It'll be fine.

AndrewHampton3mo ago· 1 in thread

This seems like an important caveat to the SWE-bench, but the trend is still clearly AI becoming more and more capable.

utopiah3mo ago

> the trend is still clearly AI becoming more and more capable.

Isn't it precisely what this article is questioning?

50lo3mo ago

Would be interesting to see alternative scoring besides “tests pass”, e.g. diff size, abstraction depth added/removed, or whether the solution introduces new modules/dependencies. That would let us see whether “unmergeable” PRs correlate with simple structural signals.
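
A rough sketch of two such structural signals, assuming the patch is available as before/after Python source strings. `diff_stats` and `new_imports` are hypothetical helper names, and counting top-level imports is only a crude proxy for "new dependencies".

```python
import ast
import difflib

def diff_stats(before: str, after: str) -> dict:
    """Count added and removed lines between two versions of a file."""
    a, b = before.splitlines(), after.splitlines()
    added = removed = 0
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    return {"added": added, "removed": removed, "total": added + removed}

def imported_modules(source: str) -> set:
    """Top-level module names imported anywhere in the source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            modules.add(node.module.split(".")[0])
    return modules

def new_imports(before: str, after: str) -> set:
    """Modules imported in the new version but not in the old one."""
    return imported_modules(after) - imported_modules(before)

# Hypothetical before/after versions of a small file.
before = "import os\n\ndef f(x):\n    return x + 1\n"
after = "import os\nimport json\n\ndef f(x):\n    y = x + 1\n    return y\n"

stats = diff_stats(before, after)
extra = new_imports(before, after)
print(stats, extra)
```

Signals like these could then be tabulated against merge decisions to check for a correlation.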

languid-photic3mo ago

makes sense! we wrote something yesterday about the weaknesses of test-based evals like swe-bench [1]

they are definitely useful but they miss the things that are hard to encode in tests, like spec/intent alignment, scope creep, adherence to codebase patterns, team preferences (risk tolerance, etc)

and those factors are really important. which means that test-evals should be relied upon more as weak/directional priors than as definitive measures of real-world usefulness

[1] https://voratiq.com/blog/test-evals-are-not-enough/

thesz3mo ago

Time to completion (in appendix A9) should be treated as log-normally distributed, or by some other one-sided distribution because one cannot complete the task faster than 0 seconds.

This transformation will rule out confidence ranges with negative time.

BTW, a log-normal distribution tends to produce events x > E(X)+d more frequently than events x < E(X)-d. If one needs reasons why software projects are often late, this is one of them.
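
A small sketch of the negative-range point, using made-up completion times (not the paper's data). It compares a rough 95% range for a single task time under a normal assumption with the same range computed on log(time) and transformed back. It ignores estimation uncertainty in the mean and standard deviation.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical completion times in minutes (made up for illustration).
times = [2.0, 3.0, 5.0, 8.0, 40.0]
z = NormalDist().inv_cdf(0.975)  # two-sided 95% quantile, about 1.96

# Treating the times as normally distributed: the lower bound can go negative.
normal_low = mean(times) - z * stdev(times)
normal_high = mean(times) + z * stdev(times)

# Treating log(time) as normally distributed, then transforming back:
# exp() of any real number is positive, so the range cannot cross zero.
logs = [math.log(t) for t in times]
lognormal_low = math.exp(mean(logs) - z * stdev(logs))
lognormal_high = math.exp(mean(logs) + z * stdev(logs))

print(f"normal:     [{normal_low:.1f}, {normal_high:.1f}]")
print(f"log-normal: [{lognormal_low:.1f}, {lognormal_high:.1f}]")
```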

tonipotato3mo ago

I feel the same! They are raising the bar higher and higher. I wrote a bot that passes SWE-bench Lite at 67% and can't get a chance to show it. I also tried to submit to SWE-bench full, but they limit it to organizations only. Where can we independent developers post our stuff? Can we have an open benchmark for everyone and just rank on merit?

shanjai_raj73mo ago

I see this with claude code all the time. it writes code that works but tries to cover every edge case and becomes hard to read. I usually just tell it to make it shorter and simpler and it does a better job on the second pass. passing a benchmark and writing good code are two different things.

slopinthebag3mo ago

This makes sense to me based on personal experience. LLMs will do anything to pass tests and get a working result, and they will do very weird things in order to get there. For fun I've tried to get one to do stuff while being purposely ambiguous about the implementation details, and sometimes the stuff it does makes me literally laugh out loud. It can write some very strange code.

But hey, the tests pass!

If I force it to use plan mode for everything and babysit it, it can work really well, but it's really just acting as a faster typer for me, which is great. But it requires an experienced dev steering it.

blockpilot_ai3mo ago

Interesting discussion.

I've been thinking about tools for organizing long AI conversations.

Scrolling through hundreds of messages quickly becomes painful. I'm curious how people here manage long AI chats.

blockpilot_ai3mo ago

Interesting project.

I've been thinking a lot about tools for organizing long AI conversations. Curious how people here currently manage them.

jurschreuder3mo ago

The test is supposed to be a proxy.

blockpilot_ai3mo ago

not rule but thanks discussion

xthunk3mo ago

Really interesting note. That echoes thoughts I’ve had about how much automated benchmark scores really reflect production‑ready code.

For me the big takeaway is that passing doesn't automatically mean the code is maintainable, follows established patterns and conventions, or is free of unexpected side effects that real reviewers care about.

cornstalks3mo ago· 41 in thread

Anecdote time! I had Codex GPT 5.4 xhigh generate a Rust proc macro. It's pretty straightforward: use sqlparser to parse a SQL statement and extract the column names of any row-producing queries.

It generated an implementation that worked well, but I hated the ~480 lines of code. The structure and flow was just... weird. It was hard to follow and I was seriously bugged by it.

So I rolled up my sleeves and started deleting code and making changes manually. A little bit later, I had it down to <230 lines with a flow that was extremely easy to read and understand.

So yeah, I can totally see many SWE-bench-passing PRs being functionally correct but still terrible code that I would not accept.

SerCe3mo ago

foltik3mo ago

Do you have any examples or resources that worked well for you?

zarzavat3mo ago

SerCe3mo ago

> Do you have any examples or resources that worked well for you?

That way, you'll start from something that's working and work backwards rather than trying to get there in the absence of a clear path.

johndough3mo ago

It is important to start a new chat so the model is not stuck in its previous mindset, and it is beneficial to have tests to verify that the simplified code still works as it did before.

newswasboring3mo ago

This is a stupid idea but it works wonders on giving taste to my LLM. I wonder if I should open source that test suite for other agentic developers.

XenophileJKO3mo ago

I really should spend some time analyzing what I do to get the good output I get..

One thing that is fairly low effort that you could try is find code you really like and ask the model to list the adjectives and attributes that that code exhibits. Then try them in a prompt.

With LLMs generally you want to adjust the behavior at the macro level by setting things like beliefs and values, vs at the micro level by making "rules".

By understanding how the model maps the aspects that you like about the code to language, that should give you some shorthand phrases that give you a lot of behavioral leverage.

Edit: Better yet.. give a fresh context window the "before" and "after" and have it provide you with contrasting values, adjectives, etc.

ndriscoll3mo ago

stared3mo ago

A lot of prompts about finding the right level of abstraction, DRY, etc.

An earlier example (Opus 4.5 + Gemini 3 Pro) is here: https://github.com/stared/sc2-balance-timeline

I also tried using just Gemini 3 Pro (maybe it's the model, maybe the harness): it was not nearly as good at writing, but way better at refining.

brap3mo ago

irthomasthomas3mo ago

Have you tried meta-prompts, e.g. "Rewrite the prompt to improve the perceived taste and expertise of the author"?

globnomulous3mo ago

I appreciate that your message is a good-natured, friendly tip. I don't mean for the following to crap on that. I just need to shout into the void:

If I have some time, the last thing I want to do with it is sharpen prompting skills. I can't imagine a worse or more boring use of my time on a computer or a skill I want less.

Every time I visit Hacker News I become more certain that I want nothing to do with either the future the enthusiasts think awaits us or the present that they think is building towards it.

vasco3mo ago

laserlight3mo ago

> change the prompt in a way that would produce code similar to what you've achieved manually.

The problem is that I don't know what I'll achieve manually before attempting the task.

Bridged77563mo ago

avereveard3mo ago

Also useful to encode into the steering of your platform. The incremental aspect of many little updates really helps pick up speed by reducing review time.

A big-bang approach could be a start, but a lot of one-line guidance about specific things you don't want to see stacks up real fast.

aix13mo ago

ernst_klim3mo ago

Indeed. I have a few colleagues and they constantly try to push these long convoluted functions which look like

    is_done = False
    while not is_done:
      if pattern1:
        ...
        if pattern2:
          ...
          if matched == "SUCCESS":
             is_done = True
             break
        if pattern3:
          ...

It's usually correct but extremely hard to follow and reminds me of the good old asm code with convoluted goto's.

And the colleagues tend to do reviews with the help of the agents so they don't even care to read this mess.

laserlight3mo ago

[0] https://news.ycombinator.com/item?id=47272913

scuff3d3mo ago

kqr3mo ago

When I benchmark LLMs on text adventures, they reason like four-year-olds but have the world's largest vocabulary and infinite patience. I'm not surprised this is how they approach programming too.

duskdozer3mo ago

iamflimflam13mo ago

We’re heading for a world of terrible code that can only be maintained by extremely good coding agents and is pretty much impossible for a human to really understand.

The days of the deep expert, who knew the codebase inside out and had it contained in their head, are coming to an end.

thesz3mo ago

> We’re heading for a world of terrible code that can only be maintained by extremely good coding agents and is pretty much impossible for a human to really understand.

I once figured out the algorithm of a program written in a one-instruction ISA. I think the instruction was three-address subtraction.

In my opinion, you overestimate the ability of coding agents to, well, code and underestimate the ability of humans to really understand code.

The chart in the article we discuss appears to plateau if one excludes the sample from 2024-07. So, we are not quite heading, we are plateauing, if I may.

pas3mo ago

that was the exception not the rule

hinkley3mo ago

Then this is an era of snake oil because customers aren’t going to put up with that shit for long.

Gud3mo ago

They’ve been putting up with crappy software for two decades (at least).

mplanchard3mo ago

lmeyerov3mo ago

Once my code exists and passes tests, I generally move on to having it iteratively hunt for bugs, security issues, and DRY code reduction opportunities until it stops finding worthwhile ones.

This doesn't always work as well as I'd like, but largely does enough. Conversely, doing as I go has been a waste of time.

yodsanklai3mo ago

Now some people argue that terrible code is fine nowadays, because humans won't read it anymore...

tobr3mo ago

samdjstephens3mo ago

foo423mo ago

Yeah I've been wondering if the increasing coding RL is going to draw models towards very short term goals relative to just learning from open source code in the wild

catlifeonmars3mo ago

tobr3mo ago

But for the most part, it’s spending more tokens on analysis and planning than pure code output, and that’s where these problems need to be caught.

OtomotO3mo ago

All it does is generate soup. Some of which may taste good.

There is no thinking, no matter what marketing tells you.

Antibabelic3mo ago

LLMs are next token predictors. Their core functionality boils down to simply adding more stuff.

logicchains3mo ago

They do what you tell them to. If you regularly tell them to look for opportunities to clean up/refactor the code, they will.

mvanzoest3mo ago

Yeah I had a similar experience on a smaller scale, reducing a function from 125 lines to 25.

cbg03mo ago

xhigh effort is actually pretty terrible for 5.2/5.3/5.4 models. Stick to medium/high as it overthinks less.

jlandersen3mo ago

Very familiar experience

bisonbear3mo ago· 5 in thread

dirtbag__dad3mo ago

This sounds amazing. In particular, I like comps to existing PRs. But I’m also not sure that I want existing PRs to be a template for most things reasonable or best practice.

This works pretty well when the linter has a companion agents.md file which explains the architecture and way about the world.

bisonbear3mo ago

yikes, using AI without tests is not fun. with testing at least you have some confidence that the AI isn't going completely off track, without them you're pretty much flying blind

ebhn3mo ago

Nice, I really like your idea. First I've heard of something like that

floodfx3mo ago

Working on that too. Lmk if you’re up for a chat?

bisonbear3mo ago

yea I'm down - feel free to send me an email ben@benr.build

coderenegade3mo ago· 5 in thread

js83mo ago

I have been advocating for some time that we use entropy measures (and information theory) in SW engineering to do estimation of complexity (and thus time required for a change).

malfist3mo ago

Maybe cyclomatic complexity would be a good proxy. It can obviously be gamed but it's obvious when it is

johncomposed3mo ago

There was a measure used during the Toyota Unintended Acceleration case called McCabe Cyclomatic Complexity, I wonder if anyone is using it alongside AI assisted code.
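
For readers who haven't met the metric: a rough sketch of a McCabe-style count for Python source, using the standard `ast` module. It approximates the metric as one plus the number of decision points; real tools handle more constructs, so treat the numbers as indicative only. The function name and the sample snippet are made up for illustration.

```python
import ast

# Node types treated as one decision point each (an approximation).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            score += len(node.values) - 1
    return score

sample = '''
def first_positive_even(items):
    for x in items:
        if x > 0 and x % 2 == 0:
            return x
    return None
'''

print(cyclomatic_complexity(sample))
```

Running it on the before and after versions of a patch gives a delta that could sit alongside diff size as a review signal.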

kqr3mo ago

It is roughly equivalent to diff size: https://entropicthoughts.com/lines-of-code

bandrami3mo ago

antirez3mo ago· 4 in thread

Perhaps never in the history of programming has there been a time when diverse programming skills were as important as today (but this may change as AI evolves).

croemer3mo ago

utopiah3mo ago

Assuming the benchmark can't be gamed.

utopiah3mo ago

> automatic programming can be evaluated as producing the same results regardless of the user using it.

Archit3ch3mo ago

I'm about to put up the 20 to see what everyone is raving about. But the real cost is time: if this doesn't work, I'm worse off than never trying.

varispeed3mo ago· 3 in thread

segmondy3mo ago

code_biologist3mo ago

[1] https://big-stupid-jellyfish.github.io/GFMath/pages/llm-quan...

zozbot2343mo ago

> "For tests of the type ran here, there appear to be major diminishing returns past Q4"

stevefan19993mo ago· 3 in thread

catlifeonmars3mo ago

> I would call this one of the biggest dick move

duskdozer3mo ago

>And still, I really hate writing those PR descriptions. Yet you can't just leave it empty.

JoshTriplett3mo ago

> And then when you demand them to review

You might want to think carefully about why you chose to use the word "demand" there.

(Personally, if I'm rejecting AI slop, I'm not going to do it silently. But there are any number of valid reasons to not jump on someone's PR to review it.)

nubg3mo ago· 2 in thread

> mid-2024 agents

Is this a post about AI archeology?

Lerc3mo ago

It's more about the test than the AI.

They should be measuring close to the limits of ability like that.
