When I reject AI code even if it works

(vinibrasil.com)

229 points · vnbrs · 8d ago · 166 comments

18 comments · 9 top-level

whilenot-dev · 8d ago · 5 in thread

Titles like these always make me point out the obvious: a working state is the absolute minimum requirement for any code to be merged, isn't it? Imagine merging something even though you know it's not working.

Besides, this post has nothing specific to code produced by an LLM, and placing AI in the stated reasons feels completely arbitrary, or is rather a fallacy of our times:

- I reject [AI] code when I can’t explain the approach in my own words.

- I reject [AI] code when the diff is bigger than the problem.

- I reject [AI] code when it introduces abstractions before proving they’re needed.

- I reject [AI] code when it works locally but makes the system harder to reason about.

- I reject [AI] code when I’m trusting the output more than my understanding.

QuiEgo · 7d ago

I’ve had multiple people say “you don’t work on code anymore, that’s for the AI. You work a level of abstraction above that. As long as you prove it works through testing, the code doesn’t matter anymore. It’s like looking at the assembly the compiler spits out now - who cares?”

These are the people who spit out an incredible volume of code with AI, to the point reviews simply can’t keep up.

The last person who said this to me works in embedded, where we look at the assembly all the time. Scary.

theshrike79 · 7d ago

But if the output matches the duck typing test, does it actually matter what's inside the black box of code?

If you're given two embedded devices and both pass the same testing, how would you tell which one was 100% AI code and which was beautifully handcrafted line by line?

QuiEgo · 7d ago

Most embedded code is security / safety critical, so it gets looked at by auditors. So, then.

Also, when something inevitably doesn’t work (maybe I told Claude “delay 1 sec after each swing of the axe the robot makes if the proximity sensor trips to avoid the puppy that walks across the axe’s path once every month”, and meant to type “2 sec”), I still have to go down to the level of the code sometimes. I’m sure the counterargument is “well then that just means your testing wasn’t good enough”. Sure, but I’ve never seen any project with hardware in the loop where the testing was good enough 100% of the time. Sometimes it’s hard to test once-a-month events in a regression test suite.

FWIW I hover around 80-90% AI-written code these days. I still look at every line of code it makes.

simondotau · 8d ago

Well said. Replace [AI] with "junior dev" or "consultancy contractor" and these assertions have always been thus.

utopiah · 8d ago

Fallacy or scapegoat: management asks for revised KPIs where PRs must be 10x, and AI is the "excuse" for this (unrealistic) new demand.

krupan · 8d ago · 3 in thread

And again this makes me wonder, is AI really helping if this much review and rework is needed for all the code it writes?

teaearlgraycold · 8d ago

Depends on what it’s writing. There are times an LLM saves me a lot of time researching library functionality. Especially with testing frameworks. So many strange and arcane features out there beyond the basics, but not hard to understand what they do once you see the code. On that topic I should say I am careful when reviewing the actual test cases.

However if you’re highly familiar with a domain then LLMs are much less useful.

mkozlows · 8d ago

Most code they write is obviously fine. Much of the rest isn't obviously fine, but is in fact fine once you've gone through understanding it. But yes, there's some that still benefits from a human eye.

(For as long as that's true, "software developer" is still a job. It's not clear for how long it will be true.)

unknownfuture · 8d ago

I mean, the reality is a ton of folks in the industry, myself included, are writing glorified CRUD apps in their day jobs. We're building into an existing codebase with established infrastructure and ways of working. What we're building isn't inherently complex or very interesting.

Meanwhile, those codebases often require a ton of boilerplate and drudgery to get anything done.

In these spaces it's very easy to read and comprehend AI generated output and review it fairly quickly. So the time savings from dealing with all that boilerplate and conforming with all that existing infrastructure are potentially substantial.

julianlam · 8d ago · 1 in thread

I think a particular failing with developers embracing AI is fighting the sunk cost fallacy. While you might not have spent as much time putting together a non-working solution, you still did spend time working with the agent to slap together a non-working solution.

Being able to step back and say "this was a failure and we need to discard the day's work and start over" is still hard with LLMs.

theshrike79 · 7d ago

If I spent half a day asking an agent to do something and it's a "non-working solution", I can just throw it away. I have sunk close to zero cost in it. I have no emotional attachment to the code.

It's like if I 3D-printed something I haven't modelled myself and the print goes wonky. I don't spend days trying to glue and file it back together. I chuck it in the bin and start a new one.

But if I had handcrafted the same item over multiple days, of course I'd try to salvage it - because there was a sunk cost of me spending time doing it.

Aurornis · 8d ago

Even using Fable (while it was briefly available), having it refine a plan, and directing it to make only small incremental changes, I still found reasons to reject its first pass at a lot of work. There were a lot of “You’re right to push back” responses. A lot of incidents where it would create some giant complex set of abstractions to accomplish something that I could find ways to do much more elegantly and in a more maintainable manner.

It’s really eye opening to work with these tools on a codebase you know deeply because these problems are everywhere.

However if I opened an unfamiliar project in another language and I wanted to add a little feature with no intention of maintaining it, I’d happily accept the changes and loop until it worked well enough for my temporary needs.

The scary middle is when you’re dealing with coworkers who don’t care about anything other than closing tickets and collecting credit. With enough of a token budget you can now wrap loops around an LLM and have it try things until the program appears to work. Ask it to do a code review and then submit the PR without having understood what it was doing. There are a lot of workplaces where there isn’t a good mechanism to push back on this and the tech debt just keeps growing.

jdw64 · 8d ago

Coding with AI eventually comes down to two paths, I've realized. One is using AI exclusively for everything. The other is not using it at all. There is almost no middle ground. The reason is that as the complexity and depth of the problem increase, the code AI generates increasingly follows enterprise level patterns. The deeper the meaning of what I input, the more AI tends to produce code that goes beyond my own area of expertise. For example, a human expert's code is very powerful and deep within their own domain, but when you look at the entire codebase, it's often shallow and uneven outside that domain. But the moment you write code with AI, once you go deep in one part, AI tries to standardize the rest accordingly. This means the entire codebase converges toward enterprise level standard code, which essentially reflects the average patterns of senior programmers who built large scale systems.

The problem is this. Human cognitive resources are finite, so we inevitably become shallow outside our own expertise. There is no programmer who can do everything well. And as systems grow in scale, they become more modularized and fragmented, making it impossible to understand the whole system. So what should we do about this? That's always the question.

In the end, do I choose not to use AI, finish the project with uneven code outside my domain, and deliver it? Or do I use AI and deliver a program that is uniform and consistent, but not in my own style? I still don't know. I haven't found the answer yet.

jameslaneyno9 · 6d ago

The sycophancy problem disappears when you stop pushing back on the agent and start having a human review the plan before the agent writes anything. Push back on a human who wrote the plan and their agreement or disagreement will actually be meaningful. Push back on the agent and you get "you're absolutely right...".

The plan also solves "I can't explain this code" because you wrote the plan before the build, so you can explain it.

After tracking some internal metrics recently, we found plan review costs 0.7 hours on average, compared to 16 hours for PR review. We rejected 13 out of 165 plans, meaning no code was written for them.

The one gap this doesn't close is that the agent drifts from the plan. We run a separate adversarial check that compares the diff against the approved plan and flags anything the plan didn't specify. That catches scope drift without reading every line.
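
The comment does not say how that check is implemented. One way to picture it, as a minimal sketch: assume the approved plan lists the file globs the change is allowed to touch, then flag any file in the diff that matches none of them. The names `touched_files`, `scope_drift`, `PLAN`, and `DIFF` are hypothetical, and the parser only understands git-style `+++ b/<path>` headers.

```python
from fnmatch import fnmatch


def touched_files(unified_diff: str) -> set[str]:
    """Collect paths from the '+++ b/<path>' headers of a git-style diff."""
    paths = set()
    for line in unified_diff.splitlines():
        if line.startswith("+++ b/"):
            paths.add(line[len("+++ b/"):].strip())
    return paths


def scope_drift(unified_diff: str, planned_globs: list[str]) -> list[str]:
    """Return the touched files that match none of the plan's globs."""
    return sorted(
        path
        for path in touched_files(unified_diff)
        if not any(fnmatch(path, pattern) for pattern in planned_globs)
    )


# Hypothetical plan: the change should stay inside the billing module.
PLAN = ["src/billing/*.py", "tests/billing/*.py"]

# Hypothetical diff that also edits an unrelated auth file.
DIFF = """\
diff --git a/src/billing/invoice.py b/src/billing/invoice.py
--- a/src/billing/invoice.py
+++ b/src/billing/invoice.py
@@ -1 +1 @@
-old
+new
diff --git a/src/auth/session.py b/src/auth/session.py
--- a/src/auth/session.py
+++ b/src/auth/session.py
@@ -1 +1 @@
-old
+new
"""

print(scope_drift(DIFF, PLAN))  # ['src/auth/session.py']
```

This only catches drift at file granularity. Deleted files (whose header is `+++ /dev/null`) and out-of-plan changes inside an allowed file would need a richer comparison, which is presumably where an adversarial reviewer earns its keep.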

CraigJPerry · 8d ago

The bottleneck when using a "faster keyboard" is understanding. We have a tool for this in compsci. Not having to fully understand something in order to successfully exploit it is a staple of computer science; we use abstractions to help us reason at a higher level. You don't necessarily always have to understand the nuance involved in selecting a hash function just to put and get some items in a hash map. Specifically, when are these cases where you don't need to go that deep? Are there similar scenarios for ai written code?

I'm more interested right now in what does that abstraction look like for AI generated code. Is there some reasonable solution wherein a sandboxed component in the enterprise architecture has various attributes (e.g. the bytes i stuff into this file store component are always the exact bytes i get back from it) confirmed by methods other than a human reading its code? Those methods, are they cheaper, faster, safer than just having a human do it?

If your enterprise architects have to read every line of code in your system today then i'd claim your architecture practices have room to mature. What can be derived from that, and in which scenarios, for the purposes of safely leveraging immutable write-only code? I'm not interested in evolving the code (lines of code spent to solve a business problem were never an asset, they were always a cost). If it wasn't hand crafted by a human, i still have the requirements, so i can just regenerate the entire thing with the revised requirement.
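
The file store attribute mentioned above (the bytes that go in are exactly the bytes that come out) is the kind of property that can be checked without reading the implementation. A minimal sketch, assuming a hypothetical `put`/`get` interface; `FileStore`, `InMemoryStore`, `StrippingStore`, and `round_trips` are illustrative names, not part of any real component:

```python
import random
from typing import Protocol


class FileStore(Protocol):
    """Hypothetical interface for the sandboxed component."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for the component under test."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class StrippingStore(InMemoryStore):
    """Deliberately buggy variant: silently trims trailing whitespace."""

    def put(self, key: str, data: bytes) -> None:
        super().put(key, data.rstrip())


def round_trips(store: FileStore, trials: int = 200, seed: int = 0) -> bool:
    """Check that every sampled payload is returned byte-for-byte."""
    rng = random.Random(seed)
    # A few awkward edge cases first, then seeded random payloads.
    payloads = [b"", b"\x00", b"\r\n", bytes(range(256))]
    payloads += [rng.randbytes(rng.randint(0, 4096)) for _ in range(trials)]
    for i, payload in enumerate(payloads):
        key = f"blob-{i}"
        store.put(key, payload)
        if store.get(key) != payload:
            return False
    return True


print(round_trips(InMemoryStore()))   # True
print(round_trips(StrippingStore()))  # False
```

A check like this only samples the input space (and needs Python 3.9+ for `randbytes`), so it answers the "cheaper, faster" part of the question more readily than the "safer" part.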

swordsith · 7d ago

Why I reject things that make other people's lives easier even if they make people's lives easier.

panchtatvam · 8d ago

You must accept AI code only if you deem yourself dumber than AI.
