> Bash(DATABASE_URL=$(grep -E '^DATABASE_URL=' .env 2>/dev/null | head -1) echo "ok")
even though I have in CLAUDE.md:
> For database queries, use tidewave first.
I then prompted:
> use tidewave as per CLAUDE.md. also diagnose why you failed to heed that
> ● Diagnosis first: I defaulted to shell habits (env grep → psql) instead of pausing to recall the CLAUDE.md rule that tidewave is the first-line DB tool. The trigger was "look at this record" — I should have read that as "run a SQL query" and reached for tidewave immediately.
If Opus 4.7 doesn't follow simple CLAUDE.md instructions, I'm not sure what benefit other markdown files could bring. I don't trust Opus's own explanation, but it may point to the fact that the system-prompt section for Bash is much longer than the CLAUDE.md line about tidewave.
While LLM judging could be helpful, I think the tool-call assertions (https://github.com/darkrishabh/agent-skills-eval#what-you-ge...) may be the most useful thing in agent-skills-eval, given that they're the only objective measure of compliance.
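The idea behind a tool-call assertion can be sketched as a simple check over the agent's transcript: did it reach for the expected tool before falling back to shell? A minimal sketch, assuming a hypothetical transcript format (a list of dicts with a `tool` name) — this is not agent-skills-eval's actual schema:

```python
# Sketch of a tool-call assertion: pass if the expected tool appears
# in the transcript before any call to a forbidden fallback tool.
# Transcript format and tool names are hypothetical illustrations.

def assert_tool_order(transcript, expected_first, forbidden_before):
    """Return True if `expected_first` is called before any
    `forbidden_before` call; False otherwise."""
    for call in transcript:
        if call["tool"] == expected_first:
            return True
        if call["tool"] == forbidden_before:
            return False
    return False

# Transcript mirroring the failure above: the agent ran Bash first.
transcript = [
    {"tool": "Bash", "input": "grep -E '^DATABASE_URL=' .env"},
    {"tool": "tidewave", "input": "SELECT * FROM records WHERE id = 1"},
]

print(assert_tool_order(transcript, "tidewave", "Bash"))  # → False
```

Unlike an LLM judge, this check is deterministic: the same transcript always yields the same pass/fail verdict.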
As an aside, 4o-mini came out months before agent skills were released… I'm curious how it performs at choosing to load skills in the first place.
For Gemini it seems to always pick 2.5 despite 3.1 being the latest; for Claude, the 3.5-era models. I'm not sure what's preventing AI labs from ensuring this stuff is refreshed during training.
[0] https://developers.googleblog.com/closing-the-knowledge-gap-...
This tool has me thinking there's some merit to setting that up. My only real qualm is that I'm not yet convinced skills are that great. I'm trying to get better at building them into my workflow, but I still get a lot of results where they're ignored, even after spending time tightening them up.
Is the same approach useful for everything: model, params, prompt, sub-agents, skills, RAG, etc.?