undefined | Better HN

0 pointsapwheele6h ago0 comments

There needs to be a new name for people creating these with no obvious validation.

Skill spam?

0 comments

Spam. It takes a minute in 2026 to create any app, any skill, any anything without any education that looks plausible that took five years ago a highly educated and skilled person at least months. Now it takes the highly skilled individual ten times the time to evaluate the vibeslopped spam it took the author to publish.

Daviey1h ago

In job I am seeing it as an amplification DoS attack, the amount of content being produced is crippling the processes to protect the org.

AndyNemmity6h ago

Define obviously validation? What is the signal that tells you one is reasonable vs another?

I find the only way to do that is to look at it, if it passes some visual tests, try it, and then a/b test if it's any better than without it.

theptip5h ago

Some sort of eval. Eg TermBench, implemented in Harbor.

It’s an insane amount of effort to build shareable, reusable, comprehensive evals, hence why so almost all skills are stuck in the “vibes” phase.

That said I think it’s quite easy to skim/intuit these sort of skills and do horizontal gene transfer into your own vibes-based system. If you use the skills regularly you can construct a cheap personal eval that is a lot easier to maintain and use it to compare a new skill/plugin. Just things like “please write a paper on <my personal unpublished thesis>” is a good starting point here. You get a good feel for whether a skill is better than vanilla by running it a couple times and watching the failure modes.

AndyNemmity5h ago

Yeah, I think we're in a phase honestly where you shouldn't use anyone elses skills, and you should instead point your stuff at a repo with skills, have it really read it, and then ask what of value there is to potentially rewrite in your style based on your preferences.

I have a complex setup with a lot of things based around what I do. I don't know how anyone could reasonably get their head around any of it. It's a research project in itself.

So I tell people, please don't use it. Just point your claude code at it, and see if there's anything useful for you.

apwheeleOP5h ago

So yes a/b broadly speaking is what I was saying (test cases and can show it is actually better).

Even this repo just the "b" showcase, showing the outputs as is (with no clear documentation how those were generated, is it headless in a CI pipeline somewhere?), is not good, https://github.com/Imbad0202/academic-research-skills/tree/m....

AndyNemmity5h ago

I run a lot of a/b testing. But I'm not sure showing it actually communicates all that much. Since these are non deterministic systems, even showing you an a/b test from when i made the decision a month ago, doesn't really mean a whole lot.

I agree we need more clear indications of value, I don't quite understand how to legitimately do that in a fair, and honest way.

whattheheckheck4h ago

Its the same as whipping out some random python package why diminish it? Your comment could be called skeptic reply guy spam

elashri6h ago

Skill-slop.

mmooss5h ago

The OP evaluates what it has developed with great rigor and describes the evaluation in detail. What do you feel is missing?

apwheeleOP2h ago

It actually does not -- and that is part of the issue. Consumers just see "oh gosh this looks very detailed" and superficially think someone must of spent quite a bit of time on this and it works well.

Skills are just prompts -- and most of what I am seeing are people using AI to write the (quite verbose) prompts. There should be a test, somewhere, that shows "my prompt does better than XYZ other prompt" for some model and some specific inputs. This is what is called a benchmark.

It may work well, I don't know. Just asking Claude "hey help me iterate on a paper" works pretty well out of the box too. Call me skeptical this actually works in any substantive way without seeing any evidence it works.

I agree writing a good benchmark takes time. How do people know if all these prompts they are writing are any good though? You could make an edit and it causes a regression overall. Or add too much info and it is just wasted space in the context window, or causes the model to go in loops between the different skills, or plenty of other errors.

AndyNemmity2h ago

I really do run a/b tests. I really do test, and validate.

I do not believe me giving you that information is honest. If I do, I am pretending that you will get the same experience.

Maybe you're using a different model. Maybe you have stuff in your CLAUDE.md that will break it.

It is not honest to me to give you confidence in it, when no one can be confident in it.

sumeno2h ago

I seriously doubt any human has ever read the full readme for the project.

adityamwagh6h ago

SkillBros?

j / k navigate · click thread line to collapse

0 comments

siva73h ago

Daviey1h ago

In job I am seeing it as an amplification DoS attack, the amount of content being produced is crippling the processes to protect the org.

AndyNemmity6h ago

Define obviously validation? What is the signal that tells you one is reasonable vs another?

I find the only way to do that is to look at it, if it passes some visual tests, try it, and then a/b test if it's any better than without it.

theptip5h ago

Some sort of eval. Eg TermBench, implemented in Harbor.

It’s an insane amount of effort to build shareable, reusable, comprehensive evals, hence why so almost all skills are stuck in the “vibes” phase.

AndyNemmity5h ago

I have a complex setup with a lot of things based around what I do. I don't know how anyone could reasonably get their head around any of it. It's a research project in itself.

So I tell people, please don't use it. Just point your claude code at it, and see if there's anything useful for you.

apwheeleOP5h ago

So yes a/b broadly speaking is what I was saying (test cases and can show it is actually better).

AndyNemmity5h ago

I agree we need more clear indications of value, I don't quite understand how to legitimately do that in a fair, and honest way.

whattheheckheck4h ago

Its the same as whipping out some random python package why diminish it? Your comment could be called skeptic reply guy spam

elashri6h ago

Skill-slop.

mmooss5h ago

The OP evaluates what it has developed with great rigor and describes the evaluation in detail. What do you feel is missing?

apwheeleOP2h ago

AndyNemmity2h ago

I really do run a/b tests. I really do test, and validate.

I do not believe me giving you that information is honest. If I do, I am pretending that you will get the same experience.

Maybe you're using a different model. Maybe you have stuff in your CLAUDE.md that will break it.

It is not honest to me to give you confidence in it, when no one can be confident in it.

sumeno2h ago

I seriously doubt any human has ever read the full readme for the project.

adityamwagh6h ago

SkillBros?

j / k navigate · click thread line to collapse