> Bash(DATABASE_URL=$(grep -E '^DATABASE_URL=' .env 2>/dev/null | head -1) echo "ok")
even though I have in CLAUDE.md:
> For database queries, use tidewave first.
I then prompted:
> use tidewave as per CLAUDE.md. also diagnose why you failed to heed that
> ● Diagnosis first: I defaulted to shell habits (env grep → psql) instead of pausing to recall the CLAUDE.md rule that tidewave is the first-line DB tool. The trigger was "look at this record" — I should have read that as "run a SQL query" and reached for tidewave immediately.
If Opus 4.7 doesn't follow simple CLAUDE.md instructions, I'm not sure what benefit other markdown files could bring. I don't trust Opus's own explanation, but it may point to the fact that the system-prompt section for Bash is much longer than the CLAUDE.md line about tidewave.
While LLM judging could be helpful, I think the tool-call assertions (https://github.com/darkrishabh/agent-skills-eval#what-you-ge...) may be the most useful thing in agent-skills-eval, given that they're the only objective measure of compliance.
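The idea behind a tool-call assertion can be sketched as a simple check over the agent's transcript: did it reach for the expected tool before falling back to shell? A minimal sketch, assuming a hypothetical transcript format (a list of dicts with a `tool` name) — this is not agent-skills-eval's actual schema:

```python
# Sketch of a tool-call assertion: pass if the expected tool appears
# in the transcript before any call to a forbidden fallback tool.
# Transcript format and tool names are hypothetical illustrations.

def assert_tool_order(transcript, expected_first, forbidden_before):
    """Return True if `expected_first` is called before any
    `forbidden_before` call; False otherwise."""
    for call in transcript:
        if call["tool"] == expected_first:
            return True
        if call["tool"] == forbidden_before:
            return False
    return False

# Transcript mirroring the failure above: the agent ran Bash first.
transcript = [
    {"tool": "Bash", "input": "grep -E '^DATABASE_URL=' .env"},
    {"tool": "tidewave", "input": "SELECT * FROM records WHERE id = 1"},
]

print(assert_tool_order(transcript, "tidewave", "Bash"))  # → False
```

Unlike an LLM judge, this check is deterministic: the same transcript always yields the same pass/fail verdict.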
As an aside, 4o-mini came out months before agent skills were released… I'm curious how it performs at choosing to load skills in the first place.
For Gemini it seems to always pick 2.5 despite 3.1 being the latest; for Claude, the 3.5-era models. I'm not sure what's preventing AI labs from ensuring this stuff is refreshed during training.
[0] https://developers.googleblog.com/closing-the-knowledge-gap-...
This tool has me thinking there's some merit to setting that up. My only real qualm is that I'm not yet convinced skills are that great. I'm trying to get better at building them into my workflow, but I still get a lot of results where they're ignored, even after spending time tightening them up.
Is the same approach useful for everything: model, params, prompt, sub-agents, skills, RAG, etc.?