Ask HN: Is there a metric for AI code quality?

6 points · fractalf · 9d ago · 9 comments

I've tried many different models and without doubt the code coming out of them differs a lot when it comes to "quality". Some of that is subjective for sure, but there are objective sides to "good" code.

I wish this was a metric for the AI benchmarks so I could choose a model based on this, because honestly it's one of the things I care most about.

Problem: how can you measure such things, and what are the metrics?

...maybe there just isn't a way to do it, since that metric isn't in the charts..

spgorbatiuk · 9d ago · 2 in thread

Not sure if I got the question right, but there are benchmarks like SWE pro and stuff. There's a whole other debate about whether you can trust them, and whether the labs are training on those benchmarks, but that's one way to measure it.

Other than benchmarks, I'd say the best measure is your own test suite.

sama004 · 9d ago

I would never trust benchmarks, tbh. Most of the new model releases do benchmaxxing.

spgorbatiuk · 8d ago

Sad, but fair!

mattsadowsky · 6d ago · 2 in thread

Internally, at our agency, we developed several core skills for Claude/Codex that let us not worry about generated code quality.

jryan49 · 5d ago

What about AI changes things though? Why didn't you just ignore code quality from humans writing code too?

Lionga · 5d ago

You just need a single skill, and it produces 100% perfect code: https://github.com/thesysdev/make-no-mistakes

iodosite · 8d ago

I don't think there's a good objective metric here, at least not like cyclomatic complexity or SonarQube-style checks, because it's difficult to tell whether the code is overcomplicated by AI, or whether the domain itself is just complicated.
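For context, here is a minimal sketch of what a cyclomatic-complexity-style check actually counts, assuming an approximation of McCabe's metric (1 plus the number of branch points) built on Python's standard `ast` module. The function name `cyclomatic_complexity` and the `SAMPLE` snippet are hypothetical, and real tools count more node types than this.

```python
import ast

# Node types that each add one decision point. This is an approximation,
# not the exact rule set of any particular tool.
_BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While,
                 ast.ExceptHandler, ast.IfExp, ast.Assert)


def cyclomatic_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in the source."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths, one per extra operand.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical snippet standing in for model-generated code.
SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in (2, 3):
        if n % d == 0 and n > d:
            return "composite"
    return "other"
'''

if __name__ == "__main__":
    print(cyclomatic_complexity(SAMPLE))
```

The number only says how many branches the code has; it cannot say whether those branches were required by the domain.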

Code is derivative - it's modeling real behavior. So its quality depends closely on how well it captures what should actually happen.

That's why measuring the actual outcome is more important than raw "code quality" metrics: do the important user flows work, and how does the system behave in edge cases? I'd rather use something like Journey SDK to fuzz edge cases and measure how well the system behaves than measure some arbitrary properties of the code.
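A rough sketch of that outcome-first idea in plain Python (this is not the Journey SDK API, which isn't shown here): run the generated code against a trusted reference on hand-picked edge cases plus seeded random inputs, and report the fraction that agree. All names below (`pass_rate`, `reference_mean`, `generated_mean`, `make_inputs`) are hypothetical, and `generated_mean` just stands in for model output with a missed edge case.

```python
import random


def pass_rate(candidate, reference, inputs):
    """Fraction of inputs where candidate matches reference.

    An exception raised by the candidate counts as a failure.
    """
    passed = 0
    for x in inputs:
        try:
            if candidate(x) == reference(x):
                passed += 1
        except Exception:
            pass
    return passed / len(inputs)


def reference_mean(xs):
    # Trusted behavior: the mean of an empty list is defined as 0.0.
    return sum(xs) / len(xs) if xs else 0.0


def generated_mean(xs):
    # Hypothetical model output that forgets the empty-list case.
    return sum(xs) / len(xs)


def make_inputs(n, seed=0):
    """Hand-picked edge cases plus n seeded random lists of 1 to 5 integers."""
    rng = random.Random(seed)
    edge_cases = [[], [0], [-1, 1]]
    fuzzed = [[rng.randint(-100, 100) for _ in range(rng.randint(1, 5))]
              for _ in range(n)]
    return edge_cases + fuzzed


if __name__ == "__main__":
    print(pass_rate(generated_mean, reference_mean, make_inputs(7)))
```

With 3 edge cases and 7 fuzzed inputs, only the empty list fails, so the score reflects behavior rather than how the code looks.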

inthepond · 6d ago

I am not sure, mate. I occasionally use Claude Code's /ultrareview or /code-review to have the code thoroughly checked, and I also use my OSS project git-aftermerge to check whether the LLM keeps making the same mistake (or you could just use CLAUDE.md to memorise that). I am looking into LangSmith and Braintrust to see if code output can be made more stable and of higher quality.

verdverm · 9d ago

Why would a metric for code quality be different depending on how the code got into a file? In other words, if there was a good measure, would it not exist already for us? How do we measure the quality of our own code?
