Agents.md file isn't the problem. Your lack of Evals is

(tessl.io)

42 points · sjmaplesec · 4mo ago · 17 comments

14 comments · 6 top-level

skybrian4mo ago· 4 in thread

Okay, but how would I write evals for my project's agents file? Any good examples out there?

alexhans4mo ago

I wrote https://ai-evals.io (community site) to make the concept approachable no matter what tools you choose to use.

You can learn about them by evaluating that site itself: https://github.com/Alexhans/eval-ception. After that, the pattern should be easy to apply to your own thing.

skybrian4mo ago

Doing an eval on itself is clever but confusing for the reader. How about a tutorial explaining how to do evals on something more normal?

popey4mo ago

The agents are smart enough to write the evals too.

It's agents all the way down!

Submit a GitHub repo containing skills to Tessl, and it will generate the evals, run them, and present the results. https://tessl.io/registry/skills/submit

The evals and results are all shown, no login necessary, so you can assess them yourself. e.g. https://tessl.io/registry/skills/github/coreyhaines31/market... (click details to see the eval texts).

skybrian4mo ago

At first glance this looks like an entire ecosystem full of slop and by running that eval you generate more? I'm looking for something a bit more curated.

hamuraijack4mo ago· 1 in thread

So how would you eval your own claude.md? Each context is unique to the project, team, and personal root claude.md. Do you just take a given task and ask it to redo the same one over and over again against a known solution? Do you just keep using it and "feel" whether or not it's working? How is that different from what everyone is already doing?

sjmaplesecOP4mo ago

The review eval tests the language, activation, etc. of skills. I guess you could quickly move it all into a skill and then run an eval on that if you're using Tessl. This checks whether the instructions, as you've written them, are well understood by the agent.

pavel_lishin4mo ago· 1 in thread

I don't even know what an eval is.

sjmaplesecOP4mo ago

An eval is to an LLM as a test is to code.
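
To make the analogy concrete, here is a minimal sketch of an eval written like a test, with one difference: because model output varies, the same prompt is run several times and the result is a pass rate rather than a single pass/fail. Everything here is an assumption for illustration: `fake_agent` is a hypothetical stand-in that cycles through canned answers, and `check_add` is a made-up grader. A real setup would call an actual agent and run its output in a sandbox.

```python
from itertools import cycle
from typing import Callable


def run_eval(agent: Callable[[str], str], prompt: str,
             check: Callable[[str], bool], trials: int = 10) -> float:
    """Run the same prompt `trials` times and return the fraction that pass."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if check(agent(prompt)))
    return passes / trials


# Hypothetical stand-in for a real agent: alternates a correct and a wrong answer.
_canned = cycle([
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",
])


def fake_agent(prompt: str) -> str:
    return next(_canned)


def check_add(source: str) -> bool:
    """Grader: does the generated code define a working add()?

    exec() is used only because the input is canned; real agent output
    should be executed in a sandbox instead.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


score = run_eval(fake_agent, "Write add(a, b)", check_add, trials=4)
print(f"pass rate: {score:.2f}")  # pass rate: 0.50
```

The design choice worth noting is the pass rate: a unit test is deterministic, while an eval usually needs repeated trials before the number means anything.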

furyofantares4mo ago· 1 in thread

If it was easy to write evals, I would come at it from that direction.

But since it's not, what I do to avoid working on AGENTS.md blind is I test it on whatever causes me to write it.

I have some prompt, the AI messes it up in some way that I think it shouldn't, maybe it's something I've seen it do before and I'm sick of it. So I update AGENTS.md, revert the changes, /undo in the chat context and re-submit the same prompt.

sjmaplesecOP4mo ago

Tessl can generate the evals, both to test against Anthropic best practices and to run scenarios with and without the skill to check whether it's helping.
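
The "with and without" comparison discussed in this thread can be sketched independently of any tool. The following is a rough illustration, not how Tessl works: the rule ("use pytest, not unittest"), the grader, and both lists of outputs are invented placeholders, not real results. In practice each list would hold agent outputs collected for the same prompt, once without and once with the AGENTS.md change.

```python
from typing import Callable


def pass_rate(outputs: list[str], check: Callable[[str], bool]) -> float:
    """Fraction of outputs that satisfy the grader."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(1 for o in outputs if check(o)) / len(outputs)


def compare(without: list[str], with_rule: list[str],
            check: Callable[[str], bool]) -> dict[str, float]:
    """Pass rates for both conditions plus the difference between them."""
    base = pass_rate(without, check)
    treated = pass_rate(with_rule, check)
    return {"without": base, "with": treated, "delta": treated - base}


def follows_rule(output: str) -> bool:
    # Hypothetical project rule: tests must not use unittest.
    return "import unittest" not in output


# Placeholder outputs for the same prompt under each condition (made up).
outputs_without = ["import unittest", "import unittest",
                   "import pytest", "import unittest"]
outputs_with = ["import pytest", "import pytest",
                "import unittest", "import pytest"]

result = compare(outputs_without, outputs_with, follows_rule)
print(result)  # {'without': 0.25, 'with': 0.75, 'delta': 0.5}
```

With only a handful of runs per condition the delta is noisy, so a small positive number is weak evidence on its own.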

stuaxo4mo ago· 1 in thread

I mean.. Claude kept putting in deprecated APIs for code I was getting it to write, so I adjusted the prompt to say not to + it seemed to help.

sjmaplesecOP4mo ago

You can add this as a skill or as part of a skill, so you don't need to keep prompting the same things.
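
The deprecated-API complaint above is one of the easier things to grade automatically, because the check is mechanical. A minimal sketch, assuming the generated code is Python and that the banned list is maintained by hand: `DEPRECATED_ATTRS` and `find_deprecated_calls` are hypothetical names, and the two entries are the `datetime` methods deprecated since Python 3.12. Matching on attribute name alone is deliberately crude and would also flag unrelated methods with the same name.

```python
import ast

# Assumed, hand-maintained list of method names to flag.
# datetime.utcnow() and datetime.utcfromtimestamp() are deprecated since Python 3.12.
DEPRECATED_ATTRS = {"utcnow", "utcfromtimestamp"}


def find_deprecated_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, method name) for each call to a flagged method."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in DEPRECATED_ATTRS):
            hits.append((node.lineno, node.func.attr))
    return sorted(hits)


# Stand-in for code an agent produced.
generated = "from datetime import datetime\nstamp = datetime.utcnow()\n"
print(find_deprecated_calls(generated))  # [(2, 'utcnow')]
```

A grader like this returns an empty list when the instruction was followed, so it can serve directly as the pass/fail check in an eval.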

theodorewiles4mo ago

Ai;dr
