A new era for software testing

(antirez.com)

141 points · Chrisszz · 18d ago · 58 comments


39 comments · 12 top-level

simianwords · 18d ago · 10 in thread

Scenario testing is the new word for it and I think this is a game changer.

Two of the reasons I never liked writing tests are:

- they didn’t seem to usually assert much internal logic

- they would have to be maintained along with the original code

I think scenario testing is much better instead because the actual way a person uses a feature hardly changes but the internals might change a lot.

So imagine I’m making an e-commerce website. There are lots of internal mechanisms. I’ll have an agent testing all the functionality as if it were a customer. This gives me much, much more confidence while writing code because the tests are largely uncorrelated with the code.

Tomorrow I can change a lot of internals but the testing agent stays the same.

There’s something to note though: not all code can be scenario tested. Data engineering, for example, and other things where the feedback time is huge.

anthonypasq · 14d ago

are we just re-inventing playwright tests except 10x slower and infinity times more expensive?

i feel like im going insane

hugs · 14d ago

since the rise of agentic coding tools, it feels like we're in a new "eternal september" of people discovering ui end-to-end test automation.


righthand · 14d ago

Well playwright tests used to be called puppeteer tests which used to be called selenium tests, so you tell me.


avensec · 14d ago

So, throw out the traditional test pyramid, shift right, and rely more on persona testing than fine-grained atomic tests? I would hope teams don't need to re-learn that lesson for themselves, but...

konart · 14d ago

>Scenario testing is the new word

How is a scenario different from a behavior (as in Behavior-Driven Development)?

Gherkin and things like Cucumber are not something new, are they?

rahoulb · 14d ago

My clankers are instructed to use "Outside In development" with "red/green TDD" at all times.

They write really good Gherkin features and then work inwards writing unit tests as they go - checking that they fail before implementation so it's actually testing something worthwhile.

And the code they ship is decent quality (not as good as me most of the time - but a LOT better than me when I'm tired or I'm pissed off about something or the work is really boring).

pbalau · 13d ago

Well, scenario contains the letters s, c and n, while behavior doesn't.

righthand · 14d ago

This already exists. You mean capturing user flows, which should already be supplied by product to the developer. A decent system is Behavior Driven Development (though honestly a poor acronym for its use).

hulitu · 17d ago

> Two of the reasons I never liked writing tests

Are you an engineer? You must test your "creation". Or would you expect the microwave oven you just bought to be tested by your child while getting burned?

robotresearcher · 14d ago

'I never liked writing tests' is not the same as 'I don't write tests'.

mlmonkey · 14d ago · 7 in thread

Writing unit tests used to be the bane of my existence. I used to hate them. Oftentimes, the LoC for unit tests was 3X the LoC of the actual code.

But not any more! Now I point the LLM to the code and order it to write unit tests, covering all edge cases, etc. I'd rather spend 3 hours arguing with the LLM than writing unit tests! :-D

dkn · 14d ago

I am curious, in your experience, how often the LLM must also update the tests. I find that if LLMs write tests after the implementation exists, they are either extremely brittle because they are coupled to the implementation, or they cover little of value because they mock everything to the point of testing nothing.

mplanchard · 14d ago

A decent trick I have found is to write a parameterized test with, e.g., a `cases` array that tests a function the way you want it tested. Then ask the LLM to fill out more cases. It’s not perfect, but it results in much less brittleness since you’ve already defined the specifics of what gets tested and what doesn’t.
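A minimal sketch of that `cases` pattern, using only the standard library. The `slugify` function and the rows in `CASES` are hypothetical stand-ins, not anything from the comment above; the point is only that the test body is fixed by hand and new coverage arrives as extra rows.

```python
import re
import unittest


def slugify(title: str) -> str:
    """Hypothetical function under test: lowercase, hyphen-separated slug."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


# Hand-written rows define what is tested and how.
# An LLM would only be asked to append more (description, input, expected) rows.
CASES = [
    ("plain words", "Hello World", "hello-world"),
    ("punctuation collapses", "Hello, World!", "hello-world"),
    ("leading/trailing spaces", "  padded  ", "padded"),
    ("empty input", "", ""),
]


class SlugifyTest(unittest.TestCase):
    def test_cases(self):
        for description, raw, expected in CASES:
            # subTest reports each failing row separately instead of stopping at the first.
            with self.subTest(description):
                self.assertEqual(slugify(raw), expected)
```

Because the assertion logic never changes, a reviewer only has to read the added rows, which is where the reduced brittleness would come from.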

dcastm · 14d ago

Same for me. I actively ask the LLM to write as few tests as possible. Otherwise you end up with redundant, low-value tests.


zerr · 14d ago

Some companies (e.g. Microsoft) used to have "Software Engineers in Test" whose job was writing such tests all day long, so that the developers building features wouldn't waste time on it.

kovek · 14d ago

I heard people say this before. I'm wondering, how do you instruct the LLM to generate the tests? Do you tell it the scenarios that would be covered, or do you just tell it to write tests for the code?

what · 13d ago

>cover all edge cases

That’s probably the extent of the prompt.


pydry · 12d ago

If your tests align with the spec and the tooling is good, it isn't tedious.

If you find writing tests tedious enough that using an LLM to write them seems like a good idea, you're probably churning out repetitive tests, unnecessary tests, or tests which aren't great at catching bugs.

rglover · 14d ago · 3 in thread

> I have the feeling that the introduction of automatic QA may raise the bar of quality for new releases of software, and maybe partially compensate for the lower quality of the code produced at high speed with the use of automatic programming.

In theory. The only difference between today and "the aughts" is that we have machines that can spit out a ton of code very quickly.

Nothing has changed about the discipline or honesty around testing (you can skip automated tests even faster now if you wish). You can and should work with AI to write tests, but you have to know the difference between a good test and a "looks good on paper" test in order for it to truly be effective and raise the quality of what you're building.

onlyrealcuzzo · 14d ago

I've been building a compiler with LLMs for a memory safe language like Rust with near zero cost abstractions (no GC), but with WAY less cognitive overhead.

I can tell you right now:

1) It's 100x more than I could have achieved with zero compiler design experience.

2) I'm HIGHLY skeptical that LLMs can build something of this complexity (in some ways it's more difficult than implementing a Rust compiler) - so the testing is quite robust: 3 different systems (unit, integration, fuzz tests), each with mutation testing, each with between ~65-90% line coverage and ~50-80% branch coverage, with a combined ~99% line coverage and ~86% branch coverage.

There is ZERO chance I could get something even close to this level of "working" by myself ever - let alone with minimal effort.

The test is kind of simple - if LLMs can do this... They should be able to do just about anything... It is notoriously difficult to verify that compilers actually work, rather than just kind of work sometimes...

People can say I'm wasting my time all they want.

But, one, it's been enlightening. I'm literally in awe of what they can do and have done.

Two, I've developed a bunch of tooling / metrics necessary to get them to be able to do something at this level of complexity without falling over themselves. And I think it can work at scale pretty easily.

Nearly all of the research comes from the 80s or farther back for the complexity metrics.

achierius · 14d ago

Hate to be a pedant, but that's really not what "zero cost abstractions" means. The idea behind those is that you get a cleaner interface to some gross machine functionality/OS API/etc. layer, but don't pay a performance cost vs. using the gross lower-level layer. E.g. Rust's Option, unlike C++'s std::optional.

What you're thinking of is "no runtime" or "lightweight runtime", which does often mean "no garbage collector".


wavemode · 13d ago

You're not wasting your time; LLMs have written plenty of compilers. Compilers are easy for LLMs to work on, because their level of verifiability is very high. That is, an LLM can easily determine whether what a compiler is doing is correct or incorrect.

Automated verifiability goes down once a software project incorporates things like:

- Concurrency

- Networking / distributed systems

- Visuals / animations

- Domain knowledge (e.g. banking, finance)


marshalhq · 14d ago · 3 in thread

I ran mutation testing on a side project recently and found a test that passed even if the production method returned an empty string. AI-generated tests at scale will have exactly this problem. High coverage, confident test names, zero actual verification.
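A hedged illustration of that failure mode, with hypothetical names throughout (`greeting`, `weak_test`, and so on are not from the project mentioned above). The mutant is written by hand here; a real mutation testing tool would generate such variants automatically.

```python
from typing import Callable

Fn = Callable[[str], str]


def greeting(name: str) -> str:
    """Hypothetical production function."""
    return f"Hello, {name}!"


def greeting_mutant(name: str) -> str:
    """Hand-written mutant: the body is replaced by an empty-string return."""
    return ""


def weak_test(fn: Fn) -> bool:
    # Only checks the return type, so it cannot tell the original from the mutant.
    return isinstance(fn("Ada"), str)


def strong_test(fn: Fn) -> bool:
    # Checks the actual value, so the mutant fails.
    return fn("Ada") == "Hello, Ada!"


def mutant_survives(test: Callable[[Fn], bool]) -> bool:
    """The mutant survives if the test passes on both the original and the mutant."""
    return test(greeting) and test(greeting_mutant)
```

Here `mutant_survives(weak_test)` is `True` and `mutant_survives(strong_test)` is `False`: both tests exercise the same line, so coverage is identical, but only one of them verifies anything.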

onemoresoop · 14d ago

Don't worry, AI maximalists have a solution: create tests for the tests.

pfdietz · 14d ago

That's what mutation testing is.

ahartmetz · 12d ago

IME there are these levels of tests:

- If you call the setter, the getter returns the same value - these are kinda bullshit and would be caught by the next level anyway

- Testing basic normal use

- Testing known difficulties of the implementation

- Exhaustive or randomized (if necessary) testing of the state space, ~= property-based testing

I expect AI to have very different levels of ability for these, not necessarily in strictly descending order as listed.
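For the last level in that list, a rough sketch of a randomized round-trip check using only the standard library. The run-length encoder is a hypothetical example subject, and this is a hand-rolled loop rather than a property-based testing library, so it has no shrinking of failing inputs.

```python
import random


def rle_encode(s: str) -> list[tuple[str, int]]:
    """Hypothetical code under test: run-length encode a string."""
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Inverse of rle_encode."""
    return "".join(ch * n for ch, n in pairs)


def check_round_trip(trials: int = 1000, seed: int = 0) -> None:
    """Property: decoding an encoded string returns the original string."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(trials):
        # A two-letter alphabet makes long runs likely, which is the interesting case.
        s = "".join(rng.choice("ab") for _ in range(rng.randint(0, 20)))
        assert rle_decode(rle_encode(s)) == s, s
```

The test states a property of the whole input space instead of listing individual examples, which is what separates this level from "basic normal use".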

wrxd · 18d ago · 3 in thread

I believe this can work if done on top of traditional testing. I would feel very uneasy replacing deterministic (ok, not always but mostly) test suites with something that is not deterministic at all.

simianwords · 18d ago

I think this is just TDD or unit test dogma and I’m personally not a fan.

Unit tests and deterministic tests are hard to get right and need to be done at the correct boundary.

I have seen many people dogmatically pushing unit tests, but this often leads to very hard-to-maintain tests that mostly exist just to change along with the main code itself.

A good way to understand if your unit tests are good: are you changing them along with changing your actual code? Then it’s a bad test. I think the argument for “it’s just documentation” is weak.

fcarraldo · 18d ago

I don’t disagree with your point, but there is still value in having unit tests that change along with the code. It’s less than a “proper” test, but when these tests break _unexpectedly_, it’s still more signal than you’d have without them. Like, always changing `file.go` alongside `file_test.go` may be acceptable if you catch errors that impact `serve_test.go` unexpectedly.

Of course, if you’re just watching Claude changing both and saying “LGTM” then it’s not very valuable.

skydhash · 14d ago

> A good way to understand if your unit tests are good: are you changing them along with changing your actual code? Then it’s a bad test. I think the argument for “it’s just documentation” is weak.

Unit tests are great for pure algorithms, like file formats, data encoding, crypto, etc. Everything with a spec that rarely changes. You write your tests once and basically never have to update them.

But for requirements that change often, as in enterprise settings or applications, maintaining a suite of unit tests is expensive. Integration tests are better because contracts between modules don’t change that much. Even if the suites are not exhaustive, they’re useful enough to catch some failures.


wesselbindt · 14d ago · 1 in thread

The idea of injecting more indeterminacy in pipelines is beyond me.

devin · 14d ago

Well you see, you just run the same test 10,000 times, and then...

bob1029 · 13d ago

If you are working with a web application, playwright + frontier LLM is incredibly capable. They added some recent features to make this sort of use case go a lot smoother:

https://playwright.dev/docs/release-notes#version-159

If you set this up correctly, you can have a main agent issue natural language testing instructions to this playwright agent which returns a natural language summary of what it experiences. This is the sort of thing where I begin to get interested in the idea of agents working while I sleep.

avensec · 14d ago

> The idea is to create a markdown file where an AI agent is asked to work as a QA engineer

Given your code-base is mature enough, please don't have a single Skill/Steering/Persona/Ruleset (or whatever) for your "QA Engineer." This is just the same "my behavioral file can one-shot the entire system build" kind of thinking that will give you expensive, marginal results as the system grows.

If you want to have success in this space, get really fine-grained. Every single test scope needs its own behavioral files.

Have your core behavioral file define some simple specifics around Test Pyramid, Test Purposes, checks for tautological tests, etc. Then get _really_ specific:

<test-type>-architect (plan)

<test-type>-engineer (execute)

<test-type>-resolver (problem solver, maintenance, how to manage a failure, etc.)

e.g., playwright-architect, etc.

Then create additional ones for Unit tests, API tests, contract tests, or any other required test layer for the SUT.

Overengineered? Maybe, given the size of your codebase. But for anything significant, you are codifying what humans and their skillsets do.


kulahan · 14d ago

Isn’t this explicitly the one place you’d never want to use AI? Like, the only actual problem with AI is that it sometimes ignores errors in output like it has a PhD in Blindness To Problems. I always figured the path forward was strictly enforced and managed tests written by hand, because who gives a shit about the code behind it as long as you can prove that the output is real?

Ten million black boxes with ten billion tests or whatever. Otherwise it’s literally the blind leading the blind.

ptx · 12d ago

Is the use of the term "automatic programming" a deliberate parallel to the development of COBOL [1]?

If so, and if this is meant to imply that LLMs are just another step towards higher-level abstractions, the analogy doesn't quite work. Unlike a COBOL compiler's output, the LLM's output can't be predicted or reasoned about, so you can't really fix bugs in your program (i.e. your prompt) but only try to permute it haphazardly and hope for the best.

[1] https://ethw.org/Milestones:A-0_Compiler_and_Initial_Develop...

npodbielski · 13d ago

What is the point of asking an LLM to do manual testing? IMHO it would be much better to make it write automated tests, so you can just rerun them.

jason_s · 12d ago

Please use a more readable variable-width font
