undefined | Better HN

0 pointsmapontosevenths2mo ago0 comments

> An engineer should be code reviewing every line written by an LLM,

I disagree.

Instead, a human should be reviewing the LLM generated unit tests to ensure that they test for the right thing. Beyond that, YOLO.

If your architecture makes testing hard build a better one. If your tests arent good enough make the AI write better ones.

0 comments

jddj2mo ago

The venn diagram for "bad things an LLM could decide are a good idea" and "things you'll think to check that it tests for" has very little overlap. The first circle includes, roughly, every possible action. And the second is tiny.

Just read the code.

sarchertech2mo ago

There’s no way you or the AI wrote tests to cover everything you care about.

If you did, the tests would be at least as complicated as the code (almost certainly much more so), so looking at the tests isn’t meaningfully easier than looking at the code.

If you didn’t, any functionality you didn’t test is subject to change every time the AI does any work at all.

As long as AIs are either non-deterministic or chaotic (suffer from prompt instability, the code is the spec. Non determinism is probably solvable, but prompt instability is a much harder problem.

mapontoseventhsOP2mo ago

> As long as AIs are either non-deterministic or chaotic

You just hit the nail on the head.

LLM's are stochastic. We want deterministic code. The way you do that is with is by bolting on deterministic linting, unit tests, AST pattern checks, etc. You can transform it into a deterministic system by validating and constraining output.

One day we will look back on the days before we validated output the same way we now look at ancient code that didn't validate input.

sarchertech2mo ago

None of those things make it deterministic though. And they certainly don’t make it non-chaotic.

You can have all the validation, linters, and unit tests you want and a one word change to your prompt will produce a program that is 90%+ different.

You could theoretically test every single possible thing that an outside observer could observe, and the code being different wouldn’t matter, but then your tests would be 100x longer than the code.

mapontoseventhsOP2mo ago

> None of those things make it deterministic though.

In the information theoretical sense you're correct, of course. I mean it's a variation on the halting problem so there will never be any guarantee of bug free code. Heck, the same is true of human code and it's foibles. However, in the "does it work or not" sense I'm not sure why we care?

If the gate only passes the digits 0-9 sent within 'x' seconds, and the code's job is to send a digit between 0 and 9, how is it non-deterministic?

Let's say the linter says it's good, it passes the regression tests, you've validated that it only outputs what it's supposed to and does it in a reasonable amount of time, and maybe you're even super paranoid so you ran it through some mutation tests just to be sure that invalid inputs didn't lead to unacceptable outputs. How can it really be non-deterministic after all that? I get that it could still be doing some 'other stuff' in the background, or doing it inefficiently, but if we care about that we just add more tests for that.

I suppose there's the impossible problem edge case. IE - You might never get an answer that works, and satisfies all constraints. It's happened to me with vibe-coding several times and once resulted in the agent tearing up my codebase, so I learned to include an escape hatch for when it's stuck between constraints ("email user123@corpo.com if stuck for 'x' turns then halt"). Now it just emails me and waits for further instruction.

To me, perfect is the enemy of good and good is mostly good enough.

1 more reply

kavok2mo ago

It’s amazing how often an LLM mocks or stubs some code and then writes a test that only checks the mock, which ends up testing nothing.

Dig1t2mo ago

I have seen junior engineers do this on multiple occasions. This is why all code should be reviewed by experienced engineers, whether written by a human or an LLM.

mapontoseventhsOP2mo ago

You really do have to verify and validate the tests. Worse you have to constantly battle the thing trying to cheat at the tests or bypass them completely.

But once you figure that out, it's pretty effective.

j / k navigate · click thread line to collapse

0 comments

jddj2mo ago

Just read the code.

sarchertech2mo ago

There’s no way you or the AI wrote tests to cover everything you care about.

If you did, the tests would be at least as complicated as the code (almost certainly much more so), so looking at the tests isn’t meaningfully easier than looking at the code.

If you didn’t, any functionality you didn’t test is subject to change every time the AI does any work at all.

As long as AIs are either non-deterministic or chaotic (suffer from prompt instability, the code is the spec. Non determinism is probably solvable, but prompt instability is a much harder problem.

mapontoseventhsOP2mo ago

> As long as AIs are either non-deterministic or chaotic

You just hit the nail on the head.

One day we will look back on the days before we validated output the same way we now look at ancient code that didn't validate input.

sarchertech2mo ago

None of those things make it deterministic though. And they certainly don’t make it non-chaotic.

You can have all the validation, linters, and unit tests you want and a one word change to your prompt will produce a program that is 90%+ different.

mapontoseventhsOP2mo ago

> None of those things make it deterministic though.

If the gate only passes the digits 0-9 sent within 'x' seconds, and the code's job is to send a digit between 0 and 9, how is it non-deterministic?

To me, perfect is the enemy of good and good is mostly good enough.

1 more reply

kavok2mo ago

It’s amazing how often an LLM mocks or stubs some code and then writes a test that only checks the mock, which ends up testing nothing.

Dig1t2mo ago

I have seen junior engineers do this on multiple occasions. This is why all code should be reviewed by experienced engineers, whether written by a human or an LLM.

mapontoseventhsOP2mo ago

You really do have to verify and validate the tests. Worse you have to constantly battle the thing trying to cheat at the tests or bypass them completely.

But once you figure that out, it's pretty effective.

j / k navigate · click thread line to collapse