My Agent Skill for Test-Driven Development

(saturnci.com)

251 points · laxmena · 22d ago · 109 comments

89 comments · 20 top-level

zuzululu21d ago· 14 in thread

TDD sounds great on paper for agentic development, but you quickly realize it balloons the token cost. Often I write some feature and then it's repurposed or removed, and code is refactored and moved around as time goes on. With TDD I would be taxed heavily and velocity would slow to a crawl.

After trying out TDD, I find the waterfall approach is better, especially when you have a multi-agent setup. I also found that in some cases the tests were just superficial hallucinations that never actually tested the components written, or there was some context corruption that ultimately triggered a false positive and kicked off a completely unintentional refactoring.

__mharrison__21d ago

My experience is the opposite. TDD keeps the guardrails on and lets me refactor with confidence.

Crazy times here in the development world. I'm always curious to watch others' best practices.

dools21d ago

Yeah, I specifically tell it not to pre-emptively fix tests that it knows will break as a result of changes it's making, and instead limit itself to creating new tests for new changes. I want to see the tests break, then we go through and review each set of breakages versus the mission and assess if they’re regressions or stale assertions. This is a) how I know it’s actually writing meaningful tests, b) a very functional and useful form of “code review” versus just trying to catch problems by reading diffs, and c) something that has helped me find real problems and regressions.

Almost all the breakages after a big refactor are stale assertions but every time I catch a couple of critical problems that make the entire exercise very worth it.

The whole dev process is so fast compared to writing software manually that I find it absurd that I wouldn’t invest heavily in automated tests.

rsalus21d ago

I was a big proponent of encoding TDD red-green-refactor methodology into my agent workflows until recently when I made the same realization after reading this study: https://arxiv.org/pdf/2602.07900

TLDR; it found test-writing volume only weakly correlates with success and that encoding test-writing principles did not move resolution rates but _did_ materially change cost. Encouraging tests cost +19.8% output tokens for 0% gain; discouraging them saved 33–49% input tokens for ≤2.6pp accuracy loss. Separately, imposing the TDD procedure specifically seems like it can backfire: it actually _increased_ regressions from 6.08% to 9.94%.

IMO, where tests clearly help is primarily as an "oracle" applied after generation. It gives the models a signal that enables them to verify and self-correct if necessary.

zuzululu20d ago

Very interesting paper and it lines up exactly with my observations. The ROI just isn't there writing tests up front and the conclusion in that paper lays it out clearly

    Overall, these findings suggest that agent-written
    tests often behave more like a habitual software-development routine
    than a dependable source of validation in this setting. More
    agent-written tests do not mean more solves; what they more reliably
    change is the process footprint—API calls, token usage, and
    interaction patterns. Improving the value of testing for code agents
    may therefore require better oracles and more actionable validation
    signals, rather than simply inducing agents to write more tests.

> IMO, where tests clearly help is primarily as an "oracle" applied after generation

Bingo. I'm not against writing tests; it's that the returns are better when they're used as verification feedback and as an "oracle", exactly as you put it.

esperent20d ago

From that paper:

> This raises a central question: do such tests meaningfully improve issue resolution, or do they mainly mimic a familiar software-development practice while consuming interaction budget?

This is an important question but it's not the one I'm most interested in when requiring agents to follow TDD. My goal is to lock in behavior because it was happening way too frequently that an agent would successfully fix the issue at hand, but break something else that it wasn't supposed to touch.

The tests add another layer and it's why I always separate out red and green worker subagents. The green worker might get trigger happy and go beyond scope/break something but it's not allowed to fudge the tests so I'll know and can clean up and revert.

It's also why I'm not too bothered about perfect red green TDD. I can add the tests later if needed.

necovek20d ago

The paper focuses on two things: default behavior and behavior with a prompt to write at least one new test.

In general — just like with humans — I find "just add more tests" to be counter-productive.

Tests make sense in a testable architecture: TDD can encourage one to be implicitly used, but it is a design, architectural choice that should be made explicit (lean to functional code; use direct, explicit dependency injection; ensure test stubs are just variants of the real implementation and fully tested using the same test as the real one...). LLMs should be prompted with this guidance instead for proper value estimation.

dnautics20d ago

no. red green tdd is great because you'll have tests when your llm breaks something later, or you're doing a massive refactor. i imagine studies are not done on codebases where the complexity gets that high.

tdd has been invaluable for this project (almost entirely llm written, but i review it) https://github.com/ityonemo/clr

pramodbiligiri20d ago

My approach (with LLMs especially) aligns more with what's outlined in "Growing OO Software Guided by Tests" (https://growing-object-oriented-software.com/toc.html). Chapter 4 there says "First, Test a Walking Skeleton", and Chapter 5 has "Start Each Feature with an Acceptance Test". I think it comes down to: get something working end-to-end first in a verifiable way, and then keep refining both the feature and its tests (preferably with TDD).

I've noticed that LLMs tend to generate multiple testcases in one shot (which is not how humans usually go about TDD), and also they don't start with Integration Tests, unless instructed to do so.

dnautics20d ago

> it balloons the token cost

how!!??

you write a test, which is one extra function. and maybe a paragraph or so per feature ("i made a RED test"... "i made it GREEN"), everything else is the same between normal development and TDD. this is chump change compared to the rest of development, including thinking tokens

manmal21d ago

> With TDD I would be taxed heavily and velocity slow to a crawl.

And the code will be good.

rsalus20d ago

not necessarily, TDD has little bearing on output quality

jzig21d ago

Pattern-based testing can theoretically reduce the token cost?

reg_dunlop21d ago

But that repurposing/removal is exactly what's avoided if you follow through with the SEF framework he outlines.

I have to push back on the idea that token costs balloon when using TDD within the context of a strong framework such as Jason has laid out here.

If the feature is repurposed/removed/refactored....I'd argue the specification wasn't well thought out prior to burning into tokens.

We're so eager to do a lot of the wrong things quickly, when it may serve us better to do a more precise thing slowly.

zuzululu21d ago

You can't spec out what you don't know; scope and requirements change from real-world feedback.

behnamoh21d ago· 12 in thread

Snake oil. Just ask the model, all these custom agents/skills haven't proven that useful in practice.

jw122421d ago

Skills already are "just asking the model". Unless you'd prefer to type out the same instructions every single time?

Skills are literally just Markdown documents that get loaded into context when the /skill-name is invoked.

dominotw21d ago

i believe gp means llms produce what they see in training data/rl; there isn't too much customization you can do with skills.

they are being sold as more powerful than they are. Like llms are intelligent blank slates that can be customized with mere markdown files.

Zetaphor21d ago

I think they're maybe confusing Skills and MCP servers

coffeeaddict121d ago

I disagree. Not all skills are useless. For example, I sometimes use Qt for GUI projects and I have found their skills [0] very useful to improve the quality and performance of my projects. In their absence, I would each time have to direct the agents to find the docs or specific tools, wasting tokens and thus decreasing the quality of the output.

[0] https://github.com/TheQtCompanyRnD/agent-skills

pramodbiligiri21d ago

I don't think the idea of skills is quite snake oil. It seems you can change what LLM outputs next by what's called few-shot prompting or in-context learning: https://www.promptingguide.ai/techniques/fewshot

john_strinlai21d ago

not that i know much about the effectiveness of these skill files, i find it odd to call something given for free "snake oil", which i thought referred to the sale of fraudulent products (to the benefit of the snake oil salesperson), typically around healthcare-related stuff.

krupan20d ago

If they consume more tokens then they are not free. If they consume those tokens without really improving anything then they are snake oil.

dominotw21d ago

i think gp is calling skills snake oil in general

internet10101021d ago

Lol wut. One of the first things people do at a company when they get enterprise LLM tools is share a skill with company-specific color palettes or standards for creating visualizations (I prefer Tufte's principles).

beezlewax21d ago

I've found them useful for in house stuff where you are using a specific design system or architecture. But custom everything works best. Are that Claude works well on its own though at this point.

wyre21d ago

Ya, if I'm constantly asking a model to do TDD, you know what would make it a lot easier? A skill.

theptip21d ago

Nah. Skills are great. But you should write your own.

simonw21d ago· 10 in thread

This article would benefit from a date. It looks like it's recent (Internet Archive first grabbed it on May 29th) but it's the kind of information that can quickly become stale as models and agents improve.

(I've been getting solid results recently from simply telling Claude Code and Codex "Test with uv run pytest, use red/green TDD".)

__mharrison__21d ago

Here's a portion of my AGENTS.md from this week (playing FDE, implementing a custom workflow for a client that 20x their productivity).

    # Python Tooling
    
    - Use `uv` to manage Python environments and dependencies.
    - Use `uv run` to execute Python scripts and commands.
    - Use `pytest` for testing your code.
    - Use the `hypothesis` library for property-based testing when you have complex input spaces or need to test edge cases.
    - Don't edit `pyproject.toml` directly. Instead, use `uv add` and `uv add --dev` to manage dependencies.
    - Use ruff, ty, prek, wily for code quality and linting.
    - Don't use excessive casting. If you find yourself needing to cast types frequently, consider refactoring your code to use more appropriate types. Casting should only be done in boundary layers where you are interfacing with external systems.
    - Run appropriate tooling after making changes to your code to ensure it meets quality standards.
    - When you come across a bug or regression, think hard about writing a test and also how to create code that will prevent this from happening again in the future.
    - When creating a command line interface, add `--verbose` flag that provides logging output useful for debugging issues.
    - Before creating code, brainstorm 5 different approaches to solve the problem and sort them by their probable effectiveness. Then, choose the best approach and implement it.
    - Use Test Driven Development (TDD) for all code you write. Write tests before writing the implementation code. 
    - Collect pytest fixtures in a `conftest.py` file to avoid duplication 
    - Prefer testing real code where possible. Use doubles and `monkeypatch` when absolutely necessary. Try to avoid mocking as much as possible.
    - Favor pytest monkeypatch to mock.
    - When a test fails, run the last failed test first using `uv run pytest --last-failed` 
    - Use numpy-style docstrings for all functions and classes you create.
    - Include doctests in the docstrings of your functions to provide examples
    - Use type hints for all function parameters and return types.
    - Use logging to provide insight into failures. Don't use print for debugging. Don't use logging to hide stack traces.
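
As a rough illustration of what the docstring, doctest, type-hint and logging rules above might produce together, here is a small sketch. The function `clamp` is a made-up example chosen for brevity; it is not taken from the client workflow described in the comment.

```python
import doctest
import logging

logger = logging.getLogger(__name__)


def clamp(value: float, low: float, high: float) -> float:
    """Limit a value to a closed interval.

    Parameters
    ----------
    value : float
        The number to clamp.
    low : float
        Lower bound of the interval.
    high : float
        Upper bound of the interval.

    Returns
    -------
    float
        ``value`` if it lies in ``[low, high]``, otherwise the nearest bound.

    Raises
    ------
    ValueError
        If ``low`` is greater than ``high``.

    Examples
    --------
    >>> clamp(5.0, 0.0, 10.0)
    5.0
    >>> clamp(-3.0, 0.0, 10.0)
    0.0
    >>> clamp(42.0, 0.0, 10.0)
    10.0
    """
    if low > high:
        # Log for context, then raise so the stack trace is not hidden.
        logger.error("Invalid bounds: low=%s is greater than high=%s", low, high)
        raise ValueError(f"low ({low}) must not exceed high ({high})")
    return max(low, min(value, high))


if __name__ == "__main__":
    # Runs the examples embedded in the docstrings of this module.
    doctest.testmod()
```

With pytest, the same docstring examples can be collected by passing `--doctest-modules`.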

porphyra21d ago

A lot of prompt engineering goes out of date quickly. Nobody nowadays goes "you are an expert software engineer. make no mistakes" lol.

As a personal anecdote, I find that a lot of big prompts and skills use up context window budget and in many cases agents will eagerly try to use a skill even if it isn't super relevant or necessary for the current task. So when I have too many skills I have to spend a bunch of time toggling the checkboxes to figure out which ones are needed for the task at hand before starting...

Royce-CMR20d ago

I can't find the link now, but Anthropic has a post about using either a light model call or other logic (regex etc) to dynamically decide what tools to expose per incoming request.

I've run into the same issue and I still end up manually curtailing what's exposed to the model, limiting to the task at hand, but I like the idea of another (smaller I hope) model doing 70% of the clipping instead, automagically.
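
The "other logic (regex etc)" variant of this idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the skill names, the patterns, and the `select_skills` function are hypothetical and do not come from the Anthropic post or from any agent tool.

```python
import re

# Hypothetical registry: skill name -> pattern suggesting the skill is relevant.
SKILL_PATTERNS: dict[str, re.Pattern[str]] = {
    "tdd": re.compile(r"\b(test|tests|tdd|pytest|regression)\b", re.IGNORECASE),
    "qt-gui": re.compile(r"\b(qt|qml|widget)\b", re.IGNORECASE),
    "dataviz": re.compile(r"\b(chart|plot|visuali[sz]ation)\b", re.IGNORECASE),
}


def select_skills(request: str, max_skills: int = 2) -> list[str]:
    """Return the skills whose pattern matches the request, capped at max_skills."""
    matches = [
        name for name, pattern in SKILL_PATTERNS.items() if pattern.search(request)
    ]
    return matches[:max_skills]
```

Keyword matching like this is crude, so it will miss paraphrased requests; a small model call in its place trades latency and cost for better recall.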

oefrha20d ago

> Nobody nowadays goes "you are an expert software engineer. make no mistakes"

You know what, I checked Opus 4.8's instructions to a review subagent the other day and it literally opened with

> You are a senior infrastructure/security engineer doing a thorough, adversarial code review...

I didn't say anything like that myself.

jasonswett20d ago

Good point! Will add a date.

disgruntledphd221d ago

Me too, although I dislike the fact that it over-focuses on mocks (which I accept is over-represented in the training data).

galsapir21d ago

sometimes I also feel it tries to optimise for "per line coverage" over more "real, complex use cases" type tests

nextaccountic20d ago

https://github.com/jasonswett/llm-skills/blob/main/tdd/SKILL... has a timestamp (mar 14, 2026 as of today)

chrisweekly20d ago

Every article should include a date!

0123456789ABCDE21d ago

fwiw, response headers include: Last-Modified: Fri, 22 May 2026 19:08:09 GMT

krupan20d ago· 6 in thread

I find it hard to believe that these LLM systems with their enormous training sets and built-in system prompts have their output meaningfully modified by a few paragraphs of extra prompting in the form of these skill files, BUT, it is cool to see people writing out concise, focused documents like this. These would have been great to have as a young developer, and great for several of the teams I've worked in in the past. I dabble with python for automating things here and there and I just learned some new things reading __mharrison__'s skill in the comments here.

This kind of wisdom used to be found in blog posts, or in the heads of more senior developers, but it was never written out as concisely as these skill files. It's kinda funny that billions of dollars had to be spent creating a machine that's a rough human analog needing guidance to get us to produce these documents.

jasonswett20d ago

The reason it works is that there's a difference between the model knowing something and the agent doing something. Claude will happily write giant untested functions even though it "knows" that short functions are easier to understand and that testing enables safe refactoring etc. The model also "knows" many conflicting "facts", such as the fact that testing is smart and that testing is a waste of time. It can't act on both beliefs at the same time. That's why nudging it toward your own preferred behaviors works.

turlockmike20d ago

It lacks a critical self, but the weights are there for in-context learning and nudging behaviour. Its goal is to complete whatever task is given. You need to make sure the outcome you want is clearly defined.

You don't need elaborate prompts, just a few lines.

"All code must have corresponding tests written ahead of time to prove the code meets the specification" is sufficient for most use cases. Prose can help nudge it more if it isn't adhering consistently.

gruez20d ago

Isn't all of what you described what post-training/RLHF is supposed to do? The internet is full of racism, so if you're just predicting the next token based on training data, you'll get racism (eg. Microsoft Tay), but that's more or less solved by AI companies now.

vikramkr20d ago

Why is that hard to believe? It's literally the prompt telling it what to do - if you want a poem about watermelons you tell it to write a poem about watermelons, if you want tests you tell it to write tests. It's not like TDD is some universal pattern that every llm will naturally optimize towards

Nizoss20d ago

I use a different approach, I enforce TDD using hooks. Think of it this way: You interact with your agent and ask it to implement a feature. Now every change it wants to make will have to be approved by a separate agent. This second agent is spawned using the SDK and can see the pending change, recent session history for context, instructions on how to interpret the information in relation to TDD, and any project custom instructions.

This setup works great especially when you work with multiple agents or sessions in parallel and don’t want to be babysitting TDD. You just know that no TDD shortcuts or violations will be made and can focus on the solution instead. Agents are good at internally justifying shortcuts and lowering what’s good enough as the session goes. You can notice this when you ask them to review their own work compared to when asking a new session to review the changes. The difference is stark.

What’s interesting about the TDD instructions I dogfooded for this is that there is a lot that is implicit about how to interpret operations in terms of TDD violations. For example, earlier versions of the instructions had the validation agent block multi-step refactor changes because it had no guarantee that further changes would follow. It would also block changes when a definition was removed while it was still being called. The reasoning was that the code would no longer build and would thereby not fulfill the “refactoring is allowed under green” rule. Improving the wording and clarifying the process helped prevent these unwanted false blocks.

If you want to give this approach a try, you’ll find it here. I’m the author and I’m happy to answer any further questions: https://github.com/nizos/probity
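
To make the gating idea concrete, here is a deliberately crude, rule-based stand-in for the validation step. It is not how the linked project works (that uses a second agent with session context); `PendingChange`, `is_test_file` and `review` are hypothetical names, and the file-naming convention is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingChange:
    """A proposed edit, as a hypothetical pre-edit hook might receive it."""

    path: str            # file the agent wants to modify
    tests_failing: bool  # did the most recent test run contain a failure?


def is_test_file(path: str) -> bool:
    """Assumed convention: test files are named test_*.py or *_test.py."""
    name = path.rsplit("/", 1)[-1]
    return name.startswith("test_") or name.endswith("_test.py")


def review(change: PendingChange) -> str:
    """Return 'approve' or 'escalate' under a simple red/green rule."""
    if is_test_file(change.path):
        # Editing tests while red might be fudging the test to get to green.
        return "escalate" if change.tests_failing else "approve"
    # Editing implementation while green is either a refactor (allowed)
    # or untested new behavior; a fixed rule cannot tell which.
    return "approve" if change.tests_failing else "escalate"
```

The two `escalate` branches are exactly the cases described above where interpretation is needed, which is a reasonable argument for handing them to a reviewing agent instead of a fixed rule.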

ArtRichards19d ago

It's all about retaining the context and spawning sub agents which can bootstrap quickly and accurately.

I'm interested in others doing something similar :) I included a docs cli tool in pypi to manage this context:

https://artrichards.github.io/agent-playbook-suite/blog/

SubiculumCode20d ago· 5 in thread

One issue that I've run into with codex has been excessive use of fallback routines. Perhaps this is good practice in professional programming in many situations, but for mine (in this case, computing geodesic distances and analysis), a silent bad fallback means the processed data is not what I thought it was, e.g. it used an inaccurate geodesic method in place of the accurate one.
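
A sketch of the alternative, where the fallback has to be requested explicitly and the result records which method was used. The names here (`distance`, `straight_line`, `FallbackRefused`) are hypothetical, and `straight_line` is just a stand-in for whatever inaccurate method an agent might quietly substitute.

```python
import math
from typing import Callable, Optional, Sequence

Point = Sequence[float]
DistanceFn = Callable[[Point, Point], float]


class FallbackRefused(RuntimeError):
    """Raised when the accurate method is unavailable and fallback is not allowed."""


def straight_line(a: Point, b: Point) -> float:
    """Stand-in for an inaccurate method: ignores the surface entirely."""
    return math.dist(a, b)


def distance(
    a: Point,
    b: Point,
    accurate: Optional[DistanceFn],
    approximate: DistanceFn = straight_line,
    allow_fallback: bool = False,
) -> tuple[float, str]:
    """Return (distance, method_used); never switch methods silently."""
    if accurate is not None:
        return accurate(a, b), "accurate"
    if not allow_fallback:
        raise FallbackRefused("accurate method unavailable; refusing to approximate")
    return approximate(a, b), "approximate"
```

Returning the method label alongside the value means downstream analysis can filter or flag approximate results instead of trusting them blindly.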

jasonswett20d ago

I HATE this. I call it speculative coding. Claude often calls it "defensive" programming. It's easily my #1 LLM pet peeve. I have yet to figure out a reliable way to make this stop happening.

homieg3320d ago

I’m going to second this. Probably a side effect of its training to always produce an output, even if it's some naive handling of issues it really should have root caused and fixed.

tarrant30020d ago

I hate it as well. I have all sorts of skills and CLAUDE.md-based protections against it. I call it "a form of lying" to trigger ethics-related neurons, and I've also used linter rules and git pre-commit hooks to protect against this. I also don't ask for unit tests anymore, and instead ask for integration tests (with red/green TDD). I probably prevent 98% of the fallbacks/mocks with these methods, but some still slip through.

bmitc19d ago

> excessive use of fallbacks routines

What are "fallbacks routines"?

victorbjorklund19d ago

Yea, I have seen that too.

fowlie21d ago· 3 in thread

Haven't tried this, but I've recently become a big fan of Matt Pocock's skills. Workflow: /grill-with-docs -> /to-prd -> /to-issue -> /tdd. That will interview relentlessly until there is a "shared understanding" using "ubiquitous language", then it will spec all requirements with user stories, create issues and implement them using tdd.

dchuk20d ago

Been using his skills a lot lately, they are wonderful. I’ve added an issue-to-specs skill that grounds the issues with a technical plan against the current codebase, and a research skill that spawns a bunch of agents to look up best practices on the internet for those issues with specs. It really dials things in. I need to issue a PR to his project for those two…

Rohunyyy20d ago

It seems like he got the skills from BMAD method https://docs.bmad-method.org/tutorials/getting-started/

kirtivr20d ago

TIL. Thanks for that!

steno13221d ago· 3 in thread

Test driven development is one of the worst ideas nowadays in the LLM age. We have models that can consistently write expert level, usually bug free code for you and rapidly fix even complex bugs in your codebase.

The token cost and tech debt introduced by tests is just not worth it. There's usually no bugs and if there are, you can fix them quickly if and when it's needed.

Ginop21d ago

I disagree

Testing was and is still very important, as LLMs can still miss important points in business logic or other edge cases. I would argue that tests became as important as code, if not more.

esafak21d ago

If your code has no bugs it's either trivial or you haven't noticed the bugs.

buster19d ago

No it's probably the most important idea.

enraged_camel21d ago· 2 in thread

Spawning separate agents to review the original agent's implementation results in a very noticeable increase in code quality and decrease in bugs. This is why I encode two or three rounds of sub-agent review during the planning process, where I tell the agent authoring the plan to include those review rounds at the end. If the code is particularly load-bearing, I then ask a fourth agent, usually from the other frontier lab.

All of this burns more tokens of course, but probably way less than coming back to the code later to fix bugs. It is also slower, but in the long run saves time.

yaodub20d ago

Have you found integrating outputs from different frontier labs consistently improves final results, or is it just kind of voodoo?

enraged_camel20d ago

It's useful, but increases review time and mental energy requirement. Often times Codex and Opus will find the same issues when given a review task, but will disagree on issue severity. Codex might claim that something is a blocker, while Opus will say it's just a medium/low. Or vice versa.

dluxem21d ago· 2 in thread

I believe using a skill here is the wrong approach. LLMs already know what TDD is and how to do it, just like object oriented programming.

If this is encoded in a skill, that skill essentially has to be loaded for everything your LLM is doing. This is probably one of the few areas where direct instructions via AGENTS.md are best, and I don't believe it requires much direction here to force the issue.

But I think the OP is just trying to have their agent work in a very specific way -- that is fine too.

> 5. Show me the test and ask for approval before continuing

jasonswett20d ago

My experience has been that yes, LLMs already know about TDD, OOP, etc., but they won't necessarily BEHAVE according to what they know unless you tell them. And of course, they "know" a lot of things that conflict with each other.

zuzululu21d ago

People forget a skill is just a markdown file, and I don't think TDD makes sense as one. It's more for specific niches like working on your custom codebase or some less beaten paths you take and save the lessons going forward.

But everybody is free to choose how they work and it may be required in ways that we can't know about.

realty_geek21d ago· 2 in thread

As an aside, check out Jason's podcast (codewithjason.com) - it's pretty good.

The latest one is with "Uncle Bob Martin" who has some interesting takes on coding with AI from .... can I say an oldie?

ElijahLynn20d ago

Just looked it up! Gonna give it a listen on my drive this morning.

https://open.spotify.com/episode/2UooZQNEpjXurZYBasds73?si=1...

jasonswett20d ago

Thanks, I'm glad you like it!

__mharrison__21d ago· 2 in thread

Testing is so important for development.

Even more so when coding with agents. I think it is probably the biggest lever to keep AI in guardrails.

(It's also why I wrote my latest book, Effective Testing, because I routinely find that my clients are very poor at testing.)

necovek20d ago

Having thought heavily and even presenting on exactly the same topics, looking at the ToC, your book seems to cover the basics well.

However, since we are talking about effectiveness, applying a lot of these principles might lead to a non-maintainable codebase — for humans and LLMs alike.

When any change causes 500 tests to break, or it causes nothing to break (see monkey-patching and/or mocking), you've gotten to a point where your testing approach is ineffective.

Most start applying principles of just enough tests and testable architectures too late, yet I believe they are fundamental.

Do you cover these in your book?

__mharrison__20d ago

My experience with refactoring is that if a change causes a large number of test failures, I run the last failed test and fix that (see my AGENTS.md posted elsewhere). Generally, if you fix the one issue everything else falls in line.

Wrt mocking. I'm not a huge fan. Again, look at my AGENTS.md. I prefer monkeypatch as a last resort option. Luckily, if you use TDD, you rarely have to use mocking. If you don't use TDD...

Ampersander20d ago· 2 in thread

Testing is obsolete in the AI age. I just one shot every problem with claude, it never makes a mistake.

dev_hugepages20d ago

Then, you are obsolete in the AI age.

Ampersander20d ago

Yes, I might need to bet the farm on the IPO to make it

jvuygbbkuurx21d ago· 1 in thread

All of these posts are missing actual comparisons on results. I read the exact opposite 'you should do x' every day. If TDD actually was better it would simply be in the system prompts already.

bisonbear21d ago

Agree - all of this is based on vibes (I also use TDD based on vibes FWIW). The only way to settle "does TDD / caveman / [insert random skill here] help" is to replay real PRs from your repo and measure quality

servercobra21d ago· 1 in thread

This overall is pretty close to how I've set up my implementation skill. One thing I'm curious about is how well the analogies like "We don't make dinner in a dirty kitchen." work vs something a lot more straightforward. Any input OP?

jasonswett20d ago

OP here. I don't know, in my experience Claude took "clean the kitchen before we make dinner" to heart in an astonishingly productive way. I haven't tried many other analogies though.

revlsas20d ago· 1 in thread

TDD is unnecessary bloat at this point

Just work with Codex to fill the gaps, and then get it to one shot the implementation

Do review afterwards if needed

All these md files will be increasingly useless as models improve

mercutio220d ago

There are many projects where one shot is the right answer!

But surely you aren’t suggesting literally every software project is composed of one-shot-able building blocks, or that the building blocks never require modifications to previous one-shots?

whateveracct20d ago· 1 in thread

/test-me

whateveracct20d ago

sorry, /test-with-docs

yieldcrv20d ago· 1 in thread

Tests are vanity in agentic engineering

They do nothing to keep an AI on track in comparison to the aspects that simulate a product manager

And the AI will just correct the test when it fails, as opposed to correcting the code, because the code didn't miss anything; the specification changed

My protip: just write tickets or have the AI write those too. that and the commits and the PRs will function as the AI’s memory better than any client side markdown file masquerading as a soul

kgdiem20d ago

My agent / skill files always tell it to trust neither the code or the test and to reason about the test failure which seems to work pretty well.

In another project without my rules I’ve noticed I have to tell it to set up data for playwright tests instead of skipping if none exists.

bob102920d ago· 1 in thread

TDD is fundamentally problematic in every practical implementation I've ever seen. I don't think the same thing, but much faster, is going to help at all. TDD tends to cause adverse, higher order effects.

I am currently observing AI authored tests creating a massive sense of complacency because a human no longer owns responsibility for the test suite. It's too easy to reject ownership by way of the various agent prompting schemes. I find myself enjoying the idea of it too, primarily because adding tests to even the most trivial functionality is mandatory due to the TDD policy.

Developing good tests is like an artform. Total coverage is a terrible objective. Correctness does not compose upward. It's a game of chasing ghosts if you think you can build a perfectly clean system bottom up and then magically meet the customer at the top. They're gonna kick your jenga tower over on day one.

cbcjcyv520d ago

Tests AI writes aren't for me. They're for the AI.

I mostly agree though, I've seen a lot of vapid assertions in my day job recently.

I should note I'm specifically not doing tdd with AI.

csbartus20d ago

This specify-encode-fulfill loop/method is effective to make agents create bug-free code.

In my version of this workflow I do specify myself, then let the LLM do the rest.

This way 1.) I'm 100% sure the understanding/spec is good 2.) It's translated into an executable format so the implementation can be verified 3.) The implementation has maximum code coverage tests which steers the AI to produce code which follows standards, fits into the existing codebase, and it's very easy to refactor.

So far, this is the one and only advantage of using LLMs in my SWE practice. They glue together (human written) specs with code, with confidence, in no time.

nullc21d ago

If you don't follow up with a pass of injecting bugs and validating that the tests fail in the presence of bugs... then you've only confirmed that the tests can pass and they may be substantially useless.
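
The check described here is the idea behind mutation testing. A toy sketch, with every name invented for illustration: the function under test is built from an injectable operator so a bug can be planted on purpose, and a test only counts as meaningful if it passes on the real code and fails on the mutant.

```python
import operator
from typing import Callable

BinaryOp = Callable[[int, int], int]


def make_add(op: BinaryOp = operator.add) -> BinaryOp:
    """Build the function under test; `op` lets us inject a bug deliberately."""

    def add(a: int, b: int) -> int:
        return op(a, b)

    return add


def weak_test(add: BinaryOp) -> bool:
    """Passes for the real code, but also for the mutant: 0 + 0 == 0 - 0."""
    return add(0, 0) == 0


def strong_test(add: BinaryOp) -> bool:
    """Uses inputs for which addition and subtraction differ."""
    return add(2, 3) == 5


def kills_mutant(test: Callable[[BinaryOp], bool], mutant_op: BinaryOp = operator.sub) -> bool:
    """True only if the test passes on real code and fails on the mutated code."""
    return test(make_add()) and not test(make_add(mutant_op))
```

Here `weak_test` goes green on both versions, so it confirms nothing, while `strong_test` catches the injected bug.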

j / k navigate · click thread line to collapse

109 comments

89 comments · 20 top-level

zuzululu21d ago· 14 in thread


rsalus21d ago

I was a big proponent of encoding TDD red-green-refactor methodology into my agent workflows until recently when I made the same realization after reading this study: https://arxiv.org/pdf/2602.07900

IMO, where tests clearly help is primarily as an "oracle" applied after generation. It gives the models a signal that enables them to verify and self-correct if necessary.
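One way to picture tests used as an "oracle" after generation, as a rough Python sketch: candidates are produced first, then a fixed check either accepts one or returns failure messages that could be fed back for a retry. The two `attempt_*` functions stand in for model outputs and, like the slug example itself, are hypothetical.

```python
from typing import Callable, Optional

Candidate = Callable[[str], str]

# Fixed expectations the oracle checks; written independently of any candidate.
CASES = {"Hello World": "hello-world", "  TDD  ": "tdd", "a  b": "a-b"}


def oracle(candidate: Candidate) -> list[str]:
    """Return failure messages for a candidate; an empty list means accepted."""
    failures = []
    for raw, expected in CASES.items():
        try:
            got = candidate(raw)
        except Exception as exc:  # a crash is also a verification signal
            failures.append(f"{raw!r}: raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"{raw!r}: expected {expected!r}, got {got!r}")
    return failures


def attempt_1(text: str) -> str:
    # Stand-in for a first generation: mishandles repeated/outer spaces.
    return text.lower().replace(" ", "-")


def attempt_2(text: str) -> str:
    # Stand-in for a corrected generation.
    return "-".join(text.lower().split())


def first_accepted(candidates: list[Candidate]) -> Optional[Candidate]:
    """Return the first candidate the oracle accepts, or None."""
    for candidate in candidates:
        failures = oracle(candidate)
        if not failures:
            return candidate
        # In an agent loop, `failures` is what would be sent back to the model.
    return None
```

The design point is that the oracle exists independently of the generation step, so its verdict is a signal the model did not write for itself.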

zuzululu20d ago

Very interesting paper, and it lines up exactly with my observations. The ROI just isn't there for writing tests up front, and the conclusion in that paper lays it out clearly:

    Overall, these findings suggest that agent-written tests often
    behave more like a habitual software-development routine than a
    dependable source of validation in this setting. More agent-written
    tests do not mean more solves; what they more reliably change is
    the process footprint—API calls, token usage, and interaction
    patterns. Improving the value of testing for code agents may
    therefore require better oracles and more actionable validation
    signals, rather than simply inducing agents to write more tests.

> IMO, where tests clearly help is primarily as an "oracle" applied after generation

Bingo. I'm not against writing tests; it's that the returns are better when they're used as verification feedback and as an "oracle", exactly as you put it.

1 more reply

esperent20d ago

From that paper:

> This raises a central question: do such tests meaningfully improve issue resolution, or do they mainly mimic a familiar software-development practice while consuming interaction budget?

It's also why I'm not too bothered about perfect red green TDD. I can add the tests later if needed.

1 more reply

necovek20d ago

The paper focuses on two things: default behavior and behavior with a prompt to write at least one new test.

In general — just like with humans — I find "just add more tests" to be counter-productive.

dnautics20d ago

tdd has been invaluable for this project (almost entirely llm written, but i review it) https://github.com/ityonemo/clr

1 more reply

pramodbiligiri20d ago

I've noticed that LLMs tend to generate multiple test cases in one shot (which is not how humans usually go about TDD), and they also don't start with integration tests unless instructed to do so.

dnautics20d ago

> it balloons the token cost

how!!??

1 more reply

manmal21d ago

> With TDD I would be taxed heavily and velocity slow to a crawl.

And the code will be good.

rsalus20d ago

not necessarily, TDD has little bearing on output quality

2 more replies

jzig21d ago

Pattern-based testing can theoretically reduce the token cost?

reg_dunlop21d ago

But that repurposing/removal is exactly what's avoided if you follow through with the SEF framework he outlines.

I have to push back on the idea that token costs balloon when using TDD within the context of a strong framework such as Jason has laid out here.

If the feature is repurposed/removed/refactored....I'd argue the specification wasn't well thought out prior to burning into tokens.

We're so eager to do a lot of the wrong things quickly, when it may serve us better to do a more precise thing slowly.

zuzululu21d ago

You can't spec out what you don't know; scope and requirements change from real-world feedback.

1 more reply

behnamoh21d ago· 12 in thread

Snake oil. Just ask the model, all these custom agents/skills haven't proven that useful in practice.

jw122421d ago

Skills already are "just asking the model". Unless you'd prefer to type out the same instructions every single time?

Skills are literally just Markdown documents that get loaded into context when the /skill-name is invoked.

dominotw21d ago

i believe gp means llms produce what they see in training data/rl; there isn't much customization you can do with skills.

they are being sold as more powerful than they are, as if llms were intelligent blank slates that can be customized with mere markdown files.

1 more reply

Zetaphor21d ago

I think they're maybe confusing Skills and MCP servers

coffeeaddict121d ago

[0] https://github.com/TheQtCompanyRnD/agent-skills

pramodbiligiri21d ago

john_strinlai21d ago

krupan20d ago

If they consume more tokens then they are not free. If they consume those tokens without really improving anything then they are snake oil.

dominotw21d ago

i think gp is calling skills snake oil in general

internet10101021d ago

beezlewax21d ago

I've found them useful for in house stuff where you are using a specific design system or architecture. But custom everything works best. Are that Claude works well on its own though at this point.

wyre21d ago

Ya, if I'm constantly asking a model to do TDD, you know what would make it a lot easier? A skill.

theptip21d ago

Nah. Skills are great. But you should write your own.

simonw21d ago· 10 in thread

(I've been getting solid results recently from simply telling Claude Code and Codex "Test with uv run pytest, use red/green TDD".)

__mharrison__21d ago

Here's a portion of my AGENTS.md from this week (playing FDE, implementing a custom workflow for a client that 20x their productivity).

    # Python Tooling
    
    - Use `uv` to manage Python environments and dependencies.
    - Use `uv run` to execute Python scripts and commands.
    - Use `pytest` for testing your code.
    - Use the `hypothesis` library for property-based testing when you have complex input spaces or need to test edge cases.
    - Don't edit `pyproject.toml` directly. Instead, use `uv add` and `uv add --dev` to manage dependencies.
    - Use ruff, ty, prek, wily for code quality and linting.
    - Don't use excessive casting. If you find yourself needing to cast types frequently, consider refactoring your code to use more appropriate types. Casting should only be done in boundary layers where you are interfacing with external systems.
    - Run appropriate tooling after making changes to your code to ensure it meets quality standards.
    - When you come across a bug or regression, think hard about writing a test and also how to create code that will prevent this from happening again in the future.
    - When creating a command line interface, add `--verbose` flag that provides logging output useful for debugging issues.
    - Before creating code, brainstorm 5 different approaches to solve the problem and sort them by their probable effectiveness. Then, choose the best approach and implement it.
    - Use Test Driven Development (TDD) for all code you write. Write tests before writing the implementation code.
    - Collect pytest fixtures in a `conftest.py` file to avoid duplication.
    - Prefer testing real code where possible. Use doubles and `monkeypatch` only when absolutely necessary. Try to avoid mocking as much as possible.
    - Favor pytest `monkeypatch` over mock.
    - When a test fails, run the last failed test first using `uv run pytest --last-failed`.
    - Use numpy-style docstrings for all functions and classes you create.
    - Include doctests in the docstrings of your functions to provide examples
    - Use type hints for all function parameters and return types.
    - Use logging to provide insight into failures. Don't use print for debugging. Don't use logging to hide stack traces.
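As an illustration of three of the rules above taken together (type hints, numpy-style docstrings, and doctests in the docstring), here is a small Python sketch. The `chunk` function and the `run_doctests` helper are hypothetical examples, not part of the quoted AGENTS.md; the helper uses only the standard-library `doctest` module.

```python
import doctest


def chunk(items: list[int], size: int) -> list[list[int]]:
    """Split a list into consecutive chunks of at most ``size`` items.

    Parameters
    ----------
    items : list of int
        Values to split.
    size : int
        Maximum chunk length; must be at least 1.

    Returns
    -------
    list of list of int
        The chunks, in their original order.

    Examples
    --------
    >>> chunk([1, 2, 3, 4, 5], 2)
    [[1, 2], [3, 4], [5]]
    >>> chunk([], 3)
    []
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_doctests(func) -> doctest.TestResults:
    """Run the examples embedded in ``func``'s docstring and report counts."""
    # Parse the docstring directly so the helper works without source files.
    test = doctest.DocTestParser().get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    return doctest.DocTestRunner(verbose=False).run(test)
```

In a project that follows the list above, the same docstring examples could instead be collected by pytest's `--doctest-modules` option; whether that option is enabled in this particular setup is not stated in the comment.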

1 more reply

porphyra21d ago

A lot of prompt engineering goes out of date quickly. Nobody nowadays goes "you are an expert software engineer. make no mistakes" lol.

Royce-CMR20d ago

I can't find the link now, but Anthropic has a post about using either a light model call or other logic (regex etc) to dynamically decide what tools to expose per incoming request.
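The post isn't linked, so the following is only a guess at the shape of the "other logic (regex etc)" variant: a Python sketch in which patterns decide which tools to expose for a given request. Every tool name and pattern here is hypothetical.

```python
import re

# Hypothetical tool names mapped to patterns that hint the tool is relevant.
TOOL_RULES = {
    "run_tests": re.compile(r"\b(test|pytest|failing|regression)\b", re.I),
    "query_db": re.compile(r"\b(sql|database|table|migration)\b", re.I),
    "web_search": re.compile(r"\b(docs?|documentation|look up|search)\b", re.I),
}

# Tools exposed on every request regardless of its content.
ALWAYS_ON = ["read_file"]


def tools_for(request: str) -> list[str]:
    """Return the tool names to expose for one incoming request."""
    matched = [name for name, pattern in TOOL_RULES.items() if pattern.search(request)]
    return ALWAYS_ON + matched
```

For example, `tools_for("Why is this pytest regression failing?")` exposes only `read_file` and `run_tests`, which keeps unrelated tool descriptions out of the context for that request.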

1 more reply

oefrha20d ago

> Nobody nowadays goes "you are an expert software engineer. make no mistakes"

You know what, I checked Opus 4.8's instructions to a review subagent the other day and it literally opened with

> You are a senior infrastructure/security engineer doing a thorough, adversarial code review...

I didn't say anything like that myself.

1 more reply

jasonswett20d ago

Good point! Will add a date.

disgruntledphd221d ago

Me too, although I dislike the fact that it over-focuses on mocks (which I accept is over-represented in the training data).

galsapir21d ago

sometimes I also feel it tries to optimise for "per line coverage" over more "real, complex use cases" type tests

nextaccountic20d ago

https://github.com/jasonswett/llm-skills/blob/main/tdd/SKILL... has a timestamp (mar 14, 2026 as of today)

chrisweekly20d ago

Every article should include a date!

0123456789ABCDE21d ago

fwiw, response headers include: Last-Modified: Fri, 22 May 2026 19:08:09 GMT

krupan20d ago· 6 in thread

jasonswett20d ago

turlockmike20d ago

You don't need elaborate prompts, just a few lines

gruez20d ago

1 more reply

vikramkr20d ago

Nizoss20d ago

If you want to give this approach a try, you’ll find it here. I’m the author and I’m happy to answer any further questions: https://github.com/nizos/probity

ArtRichards19d ago

It's all about retaining the context and spawning subagents which can bootstrap quickly and accurately.

I'm interested in others doing something similar :) I included a docs CLI tool on PyPI to manage this context:

https://artrichards.github.io/agent-playbook-suite/blog/

SubiculumCode20d ago· 5 in thread

jasonswett20d ago

I HATE this. I call it speculative coding. Claude often calls it "defensive" programming. It's easily my #1 LLM pet peeve. I have yet to figure out a reliable way to make this stop happening.

homieg3320d ago

I’m going to second this. Probably a side effect of its training to always produce an output, even if it’s some naive handling of issues it really should have root-caused and fixed.

tarrant30020d ago

bmitc19d ago

> excessive use of fallbacks routines

What are "fallbacks routines"?

victorbjorklund19d ago

Yea, I have seen that too.

fowlie21d ago· 3 in thread

dchuk20d ago

Rohunyyy20d ago

It seems like he got the skills from BMAD method https://docs.bmad-method.org/tutorials/getting-started/

kirtivr20d ago

TIL. Thanks for that!

steno13221d ago· 3 in thread

The token cost and tech debt introduced by tests are just not worth it. There are usually no bugs, and if there are, you can fix them quickly if and when it's needed.

Ginop21d ago

I disagree.

Testing was and still is very important, as LLMs can still miss important points in business logic or other edge cases. I would argue that tests have become as important as code, if not more.

esafak21d ago

If your code has no bugs it's either trivial or you haven't noticed the bugs.

buster19d ago

No it's probably the most important idea.

enraged_camel21d ago· 2 in thread

All of this burns more tokens of course, but probably way less than coming back to the code later to fix bugs. It is also slower, but in the long run saves time.

yaodub20d ago

Have you found integrating outputs from different frontier labs consistently improves final results, or is it just kind of voodoo?

enraged_camel20d ago

1 more reply

dluxem21d ago· 2 in thread

I believe using a skill here is the wrong approach. LLMs already know what TDD is and how to do it, just like object oriented programming.

But I think the OP is just trying to have their agent work in a very specific way -- that is fine too.

> 5. Show me the test and ask for approval before continuing

jasonswett20d ago

zuzululu21d ago

But everybody is free to choose how they work and it may be required in ways that we can't know about.

realty_geek21d ago· 2 in thread

As an aside, check out Jason's podcast (codewithjason.com) - it's pretty good.

The latest one is with "Uncle Bob Martin" who has some interesting takes on coding with AI from .... can I say an oldie?

ElijahLynn20d ago

Just looked it up! Gonna give it a listen on my drive this morning.

https://open.spotify.com/episode/2UooZQNEpjXurZYBasds73?si=1...

jasonswett20d ago

Thanks, I'm glad you like it!

__mharrison__21d ago· 2 in thread

Testing is so important for development.

Even more so when coding with agents. I think it is probably the biggest lever for keeping AI within guardrails.

(It's also why I wrote my latest book, Effective Testing, because I routinely find that my clients are very poor at testing.)

necovek20d ago

Having thought heavily about and even presented on exactly the same topics, and looking at the ToC, I think your book seems to cover the basics well.

However, since we are talking about effectiveness, applying a lot of these principles might lead to a non-maintainable codebase — for humans and LLMs alike.

When any change causes 500 tests to break, or it causes nothing to break (see monkey-patching and/or mocking), you've gotten to a point where your testing approach is ineffective.
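The "nothing breaks" half of that can be shown with a small Python sketch: a test that patches out the very method containing the bug stays green no matter what the method does. The `Invoice` class and its deliberate bug are hypothetical; `patch.object` is from the standard-library `unittest.mock`.

```python
from unittest.mock import patch


class Invoice:
    """Toy example with a deliberate bug in the tax calculation."""

    def tax(self, net_cents: int) -> int:
        # BUG (intentional): the rate should be 20%, i.e. net_cents * 20 // 100.
        return net_cents * 2 // 100

    def total(self, net_cents: int) -> int:
        return net_cents + self.tax(net_cents)


def mocked_test_passes() -> bool:
    """Replaces tax() with a canned value, so the bug is never executed."""
    with patch.object(Invoice, "tax", return_value=20):
        return Invoice().total(100) == 120


def real_test_passes() -> bool:
    """Exercises the real tax() and therefore notices the wrong rate."""
    return Invoice().total(100) == 120
```

Here `mocked_test_passes()` stays green while `real_test_passes()` goes red, so only the unmocked test would catch a regression in `tax`.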

Most start applying principles of just enough tests and testable architectures too late, yet I believe they are fundamental.

Do you cover these in your book?

__mharrison__20d ago

Wrt mocking. I'm not a huge fan. Again, look at my AGENTS.md. I prefer monkeypatch as a last resort option. Luckily, if you use TDD, you rarely have to use mocking. If you don't use TDD...

Ampersander20d ago· 2 in thread

Testing is obsolete in the AI age. I just one shot every problem with claude, it never makes a mistake.

dev_hugepages20d ago

Then, you are obsolete in the AI age.

Ampersander20d ago

Yes, I might need to bet the farm on the IPO to make it

jvuygbbkuurx21d ago· 1 in thread

All of these posts are missing actual comparisons of results. I read exactly the opposite "you should do x" every day. If TDD actually were better it would simply be in the system prompts already.

bisonbear21d ago

servercobra21d ago· 1 in thread

jasonswett20d ago

OP here. I don't know, in my experience Claude took "clean the kitchen before we make dinner" to heart in an astonishingly productive way. I haven't tried many other analogies though.

revlsas20d ago· 1 in thread

TDD is unnecessary bloat at this point

Just work with Codex to fill the gaps, and then get it to one shot the implementation

Do review afterwards if needed

All these md files will be increasingly useless as models improve

mercutio220d ago

There are many projects where one shot is the right answer!

But surely you aren’t suggesting literally every software project is composed of one-shot-able building blocks, or that the building blocks never require modifications to previous one-shots?

whateveracct20d ago· 1 in thread

/test-me

whateveracct20d ago

sorry, /test-with-docs

yieldcrv20d ago· 1 in thread

Tests are vanity in agentic engineering

They do nothing to keep an AI on track in comparison to the aspects that simulate a product manager

And the AI will just correct the test when it fails, as opposed to correcting the code, because the code didn't miss anything; the specification changed.

My protip: just write tickets or have the AI write those too. That and the commits and the PRs will function as the AI’s memory better than any client-side markdown file masquerading as a soul.

kgdiem20d ago

My agent / skill files always tell it to trust neither the code nor the test and to reason about the test failure, which seems to work pretty well.

In another project without my rules I’ve noticed I have to tell it to set up data for playwright tests instead of skipping if none exists.

bob102920d ago· 1 in thread

