undefined | Better HN

0 pointsosigurdson25d ago0 comments

I can see how you could avoid regressions this way, but what do you add to your harness to prove that a new feature is working?

0 comments

6 comments · 2 top-level

kami2325d ago· 2 in thread

I have it record a series of gifs or videos that I look over. If something looks off I'll dig into it, but I break down work into very very small chunks that are usually easily verifiable or don't require multiple steps.

Another thing I have in the general sdlc process is having it add enough logging to verify features are turned on, configured as we expected, and that becomes enough feedback for most of my features.

I've been mostly focusing on being able to replicate this across stacks greater than 3 projects so far (with the eventual goal of having an agent be able to orchestrate our complete infra stack, and this being a large component of a DR plan to rebuild).

None of this is really new for us, I'm just the most knowledgeable in my group in how the different products across teams glue together so I've been creating these rube goldbergs as a prototype, and then having it iterate on codifying the parts that don't need a constant LLM. We were blessed to have an engineer a decade ago build out tooling for local container automation that matches 95% of the deployed infra stack. That last 5% sucks when you fall into it, but that's always been a truth. I've added and expanded the tool over the years with making it act more like the deployed environment networking wise, but a lot of things don't end up working well in docker containers on M series macs when most of our complicated virtualization in our private cloud can't run on them yet...

pstorm25d ago

I’ve been building this out too, and your comment made me realize the missing piece for me. I’ve given the agents tools to validate its own work, but I haven’t improved the experience of humans verifying the agents’ work.

kami2325d ago

For video/image stuff I found the ability for the LLMs to use ffmpeg and imagemagick to be quite fun.

_zoltan_25d ago· 2 in thread

for us it's (usually) very easy as I work on performance optimization. a non-negligible part of this is correctness and verifiability, so we already have some of that.

to give you an example just recently I've coded a feature that for our shuffle operation can report which channel did the bytes flow through (as the PR giving us the plumbing underneath has landed upstream recently). what this basically means is that you run the shuffle, you know you've shuffled X bytes (because you have stats on both ends) and then you need to attribute them to different layers. on the first iteration, the count was off. the agent went, debugged, fixed, iterated, and then it was 1.5% off. again, it went, iterated, ... and now we're fine.

part of the task description was that the breakdown must match the known amount of bytes we're shuffling, so the agent took this upon as a self-verification point. so besides running our normal, boring unit tests, integration tests and end-to-end verification harnesses (which it not only has programmatic/cli/API access, but are documented in .md files for projects), it could use this criteria on top to verify.

looking at /usage, my API duration was 2h 43m, and on top of that:

      claude-haiku-4-5:  2.7k input, 115.3k output, 16.3m cache read, 867.9k cache write ($3.30)
       claude-opus-4-8:  46.9k input, 555.0k output, 166.6m cache read, 2.9m cache write ($115.77)

osigurdsonOP25d ago

Definitely agree that performance optimization is a good use case for LLMs. Here you have both a measurable goal / objective function and guardrails against functional regressions. It kind of closes the loop in that regard.

One thing however is a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complimentary in nature. Therefore you could still possibly get code degradation, potentially.

jaggederest25d ago

> One thing however is a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complimentary in nature.

Not in the world of AI - if your tests don't catch any known issues, the problem is the tests aren't comprehensive enough. There's no excuse at this point not to have an incredibly comprehensive test suite, to go with your other agent feedback loop constraints

1 more reply

j / k navigate · click thread line to collapse