undefined | Better HN

0 pointsaizk2mo ago0 comments

How do you guys manage regressions as a whole with every new model update? A massive test set of e2e problem solving seeing how the models compare?

0 comments

3 comments · 3 top-level

bcherny2mo ago

A mix of evals and vibes.

3 more replies

cududa2mo ago

Remember when they shipped that version that didn't actually start/ run? At work we were goofing on them a bit, until I said "Wait how did their tests even run on that?" And we realized whatever their CI/CD process is, it wasn't at the time running on the actual release binary... I can imagine their variation on how most engineers think about CI/CD probably is indicative of some other patterns (or lack of traditional patterns)

As someone that used to work on Windows, I kind of had a vision of a similar in scope e2e testing harness, similar to Windows Vista/ 7 (knowing about bugs/ issues doesn't mean you can necessarily fix them ... hence Vista then 7) - and that Anthropic must provide some Enterprise guarantee backed by this testing matrix I imagined must exist - long way of saying, I think they might just YOLO regressions by constantly updating their testing/ acceptance criteria.

Why not provide pinable versions or something? This episode and wasted 2 months of suboptimal productivity hits on the absurdity of constantly changing the user/ system prompt and doing so much of the R&D and feature development at two brittle prompts with unclear interplay. And so until there’s like a compostable system/user prompt framework they reliably develop tests against, I personally would prefer pegged selectable versions. But each version probably has like known critical bugs they’re dancing around so there is no version they’d feel comfortable making a pegged stable release..

2 more replies

try-working2mo ago

I use a self-documenting recursive workflow: https://github.com/doubleuuser/rlm-workflow

j / k navigate · click thread line to collapse