Ask HN: How do you know if a tweak to your AI skill made it better?

9 pointsyo103jg2mo ago4 comments

Curious how people here evaluate changes when they tweak an AI skill / prompt / workflow.

A lot of the time, a tweak might feel better in one or two cases, but it’s hard to tell if it actually improved the skill overall or just changed its behavior in a way that looks better for a bit.

Do you mostly go by intuition, or do you have some lightweight way to check if a tweak really helped?

9 pointsyo103jg2mo ago4 comments

Curious how people here evaluate changes when they tweak an AI skill / prompt / workflow.

A lot of the time, a tweak might feel better in one or two cases, but it’s hard to tell if it actually improved the skill overall or just changed its behavior in a way that looks better for a bit.

Do you mostly go by intuition, or do you have some lightweight way to check if a tweak really helped?

4 comments

4 comments · 3 top-level

Areena_282mo ago· 1 in thread

The way we handle it is keeping a small set of fixed test cases that we never change. Like same inputs, same expected outputs. so when we tweak a prompt we run it against those first. if it passes the fixed cases and feels better on the new ones, we keep it.

gtirloni2mo ago

How you get deterministic output though? t=0? Pydantic AI outputs?

1 more reply

sdevonoes2mo ago

In general, you don’t know. Sure thing if you have a specific code base in which you already had a bunch of tests (non ai generated ) and the code you are regenerating is always touching the logic behind those tests, sure you can assess to some extent your skills/prompt changes. But in general you just don’t know. You havr a bunch of skills md files that who knows how they work if changed a little bit here a little bit there. People who claim they know are selling snake oil

bisonbear2mo ago

a bit heavier weight, but seems worthwhile if working in an org where many people consume the skill:

- find N tasks from your repo that serve as good representation of what you want the agent to do with the task - run agent with old skill/new skill against those tasks - measure test pass rate / other quality metrics that you care about with skill - token usage, speed, alignment, ... - tests aren't a great measure alone - I've found them to be almost bimodal (most models either pass/fail) and not a good differentiator - use this to make decisions about what to do with the skill - keep skill A, promote skill B, or keep tweaking

I've also had success with an "autoresearch" variant of this, where I have my agent run these tests in a loop and optimize for the scores I'm grading o

j / k navigate · click thread line to collapse