Test Driven Development (TDD) for your LLMs? Yes please, more of that please

(blog.helix.ml)

86 points · lewq · 1y ago · 30 comments

23 comments · 8 top-level

heavyarms · 1y ago · 9 in thread

Whenever I see one of these posts, I click just to see if the proposed solution to testing the output of an LLM is to use the output of an LLM... and in almost all cases it is. It doesn't matter how many buzzwords and acronyms you use to describe what you're doing, at the end of the day it's turtles all the way down.

The issue is not the technology. When it comes to natural language (LLM responses that are sentences, prose, etc.) there is no actual standard by which you can even judge the output. There is no gold standard for natural language. Otherwise language would be boring. There is also no simple method for determining truth... philosophers have been discussing this for thousands of years and after all that effort we now know that... ¯\_(ツ)_/¯... and also, Earth is Flat and Birds Are Not Real.

Take, for example, the first sentence of my comment: "Whenever I see one of these posts, I click just to see if the proposed solution to testing the output of an LLM is to use the output of an LLM... and in almost all cases it is." This is absolutely true, in my own head, as my selective memory is choosing to remember that one time I clicked on a similar post on HN. But beyond the simple question of whether it is true or not, even an army of human fact checkers and literature majors could probably not come up with a definitive and logical analysis regarding the quality and veracity of my prose. Is it even a grammatically correct sentence structure... with the run-on ellipsis and whatnot... ??? Is it meant to be funny? Or snarky? Who knows ¯\_(ツ)_/¯ WTF is that random pile of punctuation marks in the middle of that sentence... does the LLM even have a token for that?

senko · 1y ago

The output of an LLM is often qualitative, not quantitative, and to test that, you need something that can judge the quality.

You're not debating philosophy with the LLM, you're just asking it whether the answer semantically matches the expected one.

I usually test LLM output quality with the following prompt (simplified):

"An AI assistant was tasked with {task}. The relevant information for their task was {context}. Their answer is {answer}. The correct answer should be something like {ground truth}. Is their answer correct?"

Then you can spice it up with chain of thought, asking it to judge alongside preferred criteria/dimensions and output a score, etc... you can go as wild as you'd like. But even this simple approach tends to work really well.
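For anyone who wants to try the simple version, here is a minimal sketch of how that prompt could be wired up. The names `build_judge_prompt`, `judge` and `ask_llm` are made up for illustration, the YES/NO reply convention is an added assumption rather than part of the prompt above, and `ask_llm` stands in for whatever client you actually use (prompt in, completion text out).

```python
from typing import Callable

# The simplified judge prompt from above, plus an assumed YES/NO reply format
# so the verdict can be parsed mechanically.
JUDGE_TEMPLATE = (
    "An AI assistant was tasked with {task}. "
    "The relevant information for their task was {context}. "
    "Their answer is {answer}. "
    "The correct answer should be something like {ground_truth}. "
    "Is their answer correct? Start your reply with YES or NO."
)


def build_judge_prompt(task: str, context: str, answer: str, ground_truth: str) -> str:
    """Fill the judge template with one test case."""
    return JUDGE_TEMPLATE.format(
        task=task, context=context, answer=answer, ground_truth=ground_truth
    )


def judge(ask_llm: Callable[[str], str], **case: str) -> bool:
    """Return True if the judge model's reply starts with YES.

    `ask_llm` is a hypothetical callable: prompt text in, completion text out.
    """
    reply = ask_llm(build_judge_prompt(**case))
    return reply.strip().upper().startswith("YES")


if __name__ == "__main__":
    # Stand-in judge so the sketch runs without any model behind it.
    def fake_llm(prompt: str) -> str:
        return "YES\nThe answer names the same city as the expected one."

    verdict = judge(
        fake_llm,
        task="answering a geography question",
        context="the question was: what is the capital of France?",
        answer="It's Paris.",
        ground_truth="Paris",
    )
    print("judge says correct:", verdict)
```

Parsing only the first word of the reply is deliberately crude; a score or a structured output is the obvious next step.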

> turtles all the way down.

Saying "LLM testing LLM" is bad is like saying "computer testing computer" is bad. Yet, automated tests have value. And just as unit tests will not prove your program is bug-free, LLM evals won't guarantee 100% correctness. But they're an incredibly useful tool.

In my experience working on pretty complex multi-agent multi-step systems, trying to get those to work without an eval framework in place is like playing whack-a-mole, only way less fun.

senko · 1y ago

Too late to edit, but here's a great, really in-depth post about using LLMs as judges to evaluate LLM outputs (when you don't have the ground truth for everything): https://cameronrwolfe.substack.com/p/finetuned-judge This is about finetuning LLMs to do it, but the first part is a good intro to why and how.

maeil · 1y ago

> "An AI assistant was tasked with {task}. The relevant information for their task was {context}. Their answer is {answer}. The correct answer should be something like {ground truth}. Is their answer correct?"

If you have a ground truth, what was the purpose of asking the AI assistant for an answer in the first place?

hansonkd · 1y ago

See von Neumann's work on reliable machines from unreliable processes.

I wouldn't be surprised if, with each layer of an LLM adding some additional check for an unreliable process, you could eventually make something reliable from the unreliable.

drpossum · 1y ago

I don't see how von Neumann's work helps eliminate the problem here; it is arguably not much different from "just use more LLMs". His key result was to come up with a sufficient number of redundant computations to get the error below a threshold, which is still unreliable. This problem is worse because the fundamental issue is even trying to quantify what "correct" means.
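To put rough numbers on the redundancy argument, here is a small sketch of plain majority voting. It is a simplification, not von Neumann's actual construction: it assumes n independent copies that are each wrong with the same probability p, and independence is an assumption that repeated calls to the same LLM may well not satisfy. The function name `majority_error` is made up for illustration.

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that a majority of n independent voters is wrong.

    Each voter is wrong with probability p; n must be odd so there are no ties.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    k_min = n // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )


if __name__ == "__main__":
    # With p = 0.1 the error falls quickly as copies are added, but it is
    # still greater than zero for every finite n.
    for n in (1, 3, 5, 9):
        print(n, majority_error(0.1, n))
```

With p = 0.1, three voters bring the error to about 0.028 and five to about 0.0086: below a threshold, but never zero. The calculation also presupposes that each vote can be marked right or wrong, which is exactly the part that is hard for free-form text.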

Your suggestion of evaluating accuracy at the layer level necessarily implies there's some method of quantifiably detecting hallucinations. This is not necessarily possible given the particular attention models, or even mathematically possible given an "infer this from finite text with no ability for independent verification" setup.

sitkack · 1y ago

https://news.ycombinator.com/item?id=10177396

https://www.cs.unm.edu/~saia/talks/reliable-from-unreliable....

chrismorgan · 1y ago

> Who knows ¯\_(ツ)_/¯ WTF is that random pile of punctuation marks in the middle of that sentence... does the LLM even have a token for that?

  $ ollama run llama3.2
  >>> What does ¯\_(ツ)_/¯ mean?
  A popular internet meme!

  ¯\_(ツ)_/¯ is a Unicode character that represents a face with a
  surprised or confused expression. The three parts of the character are
  often interpreted as:

  1. `¯` - a horizontal bar, indicating skepticism or "not buying it"
  2. `(ツ)` - a smiling face with a tongue-out gesture, showing amusement
  or frustration
  3. `/¯` - an arrow pointing downwards, suggesting "that's just not
  right" or "good riddance"

  Together, the character is often used to express exasperation,
  disappointment, or sarcasm in response to a situation that seems
  ridiculous, unfair, or annoying. It's commonly used in text messages,
  social media posts, and online forums.

  Example: "You think you can eat an entire pizza by yourself?¯\_(ツ)_/¯"

  >>> /clear
  Cleared session context
  >>> ¯\_(ツ)_/¯
  ¯\_(ツ)_/¯

  >>> ¯\_(ツ)_/¯
  I see you're trying to express a mix of sass and Japanese flair! Keep
  'em coming, I'll keep responding in kind!

  >>> ¯\_(ツ)_/¯
  The eternal "shrug emoji" of the internet. You win, you've conveyed an
  entire emotional spectrum with just three characters. Well played,
  friend!

  >>> ¯\_(ツ)_/¯
  I think we've reached a new level of emoji sophistication. You're not
  even trying to convey anything anymore; it's just a masterclass in
  subtlety and understatement. Bravo!

Sometimes it’s nonsense, sometimes it’s approximately right (“shrug emoji”), sometimes it barks up the wrong tree (“Japanese flair”), sometimes it’s just vapid drivel.

arccy · 1y ago

it's an "eastern" style emoticon https://en.wikipedia.org/wiki/List_of_emoticons#Eastern though they're more common in the west nowadays (but not enough for your virtual keyboards to have them built in...)

jusssi · 1y ago

> barks up the wrong tree (“Japanese flair”)

It just knows more than you. Google says:

katakana letter tu (U+30C4) - ツ

justanotheratom · 1y ago · 3 in thread

There is a real need for this. I have to admit most of my testing right now is vibes-based. The problem is, these LLM evaluation platforms get in between me and my LLM.

Onawa · 1y ago

I use promptfoo for our internal evaluations. There are a ton of different "assertions" (tests) that you can write, including model-graded evaluations using rubrics.

This is far from a solved problem, but there are options out there for systematic testing of LLMs.

senko · 1y ago

Probably non-optimal long term, but you can easily whip up rudimentary tests with your LLM/lib of choice (see my other comment at https://news.ycombinator.com/item?id=42322794) without using any of the eval platforms.

vinnymac · 1y ago

Offline evaluators such as Laminar and Evalite (WIP) stay out of the way, fortunately. Evals can be very useful when they do.

jmathai · 1y ago · 2 in thread

I've been working on a prompt-to-application product[1] and one of the approaches we tried was test-driven development. We would have the LLM write tests based on a detailed description of the application. Then we would give the LLM the tests and the requirements and ask it to write the application.

The thinking is we could run the tests to verify that the requirements are functional (assuming it wrote the tests correctly in the first place - in many cases it did, fyi).

The problem was that it was too fickle. Sometimes the failing tests caught application bugs. But too often the LLM just couldn't get the tests to pass even though sometimes the application was working fine.

It resulted in a terrible user experience (they only see latency of getting the application correctly written or a failure if it gives up).

That being said, I think a lot of the issues folks like us find with LLMs are because we haven't figured out how and what to ask.

Ultimately, we found an alternative approach which gets at least 95% of the application working 100% of the time. And this is actually a MUCH better user experience than waiting forever to sometimes just get "Sorry, we couldn't create your application.".

[1] https://withlattice.com

barapa · 1y ago

Ok you've got us at the edge of our seats. What's the alternative approach?

jmathai · 1y ago

What is working better for us is to spin the application up, poke at it with HTTP requests, and look for non-200 response codes.

The tradeoff is it's not as comprehensive. But giving users something that is 95% working and making it easy for them to fix those issues appears to be the best user experience we've found so far.
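A rough sketch of what that kind of smoke test could look like, using only the Python standard library. This is an illustration of the idea, not the product's actual code: `smoke_test`, the base URL and the list of paths are all hypothetical, and it only issues GET requests.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def smoke_test(base_url: str, paths: list[str], timeout: float = 5.0) -> dict[str, str]:
    """GET every path and return {path: problem} for each non-200 result."""
    failures: dict[str, str] = {}
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            with urlopen(url, timeout=timeout) as resp:
                # urlopen follows redirects, so this is the final status.
                if resp.status != 200:
                    failures[path] = f"HTTP {resp.status}"
        except HTTPError as err:
            # 4xx and 5xx responses are raised as HTTPError.
            failures[path] = f"HTTP {err.code}"
        except URLError as err:
            # Connection refused, DNS failure and similar.
            failures[path] = f"unreachable: {err.reason}"
    return failures
```

Calling something like `smoke_test("http://localhost:8000", ["/", "/api/items"])` (URL and paths made up) returns an empty dict when every route answers 200. As the tradeoff above says, a 200 only shows that the route did not blow up, not that the response is right.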

bdangubic · 1y ago · 1 in thread

I just mock the answers and assert on the mock of the answer - never fails!
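For anyone who hasn't met this pattern in the wild, here is a minimal sketch of it with `unittest.mock`. The `llm` object and its `complete` method are hypothetical stand-ins, not any real client's API.

```python
from unittest.mock import Mock


def test_llm_answer_is_correct():
    # The "model" is a mock that returns whatever we tell it to...
    llm = Mock()
    llm.complete.return_value = "Paris"

    answer = llm.complete("What is the capital of France?")

    # ...so this assertion only checks the mock's own configuration.
    assert answer == "Paris"
    llm.complete.assert_called_once()


if __name__ == "__main__":
    test_llm_answer_is_correct()
    print("passed, as always")
```

The test cannot fail no matter what a real model would say, because no real model is ever called.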

lukeramsden · 1y ago

You joke but I have seen this far too many times... in very highly valued startups...

100% line coverage though!

satisfice · 1y ago

Posts like this make me sick at heart as a tester. This is a video of someone with no respect or understanding of the mission, complexities and subtleties of testing who tramples all over any concept of engineering ethics with his sleek tools and detached, impassive English accent.

How do you test things? Easy, he implies: Tell an LLM to test them and then assume everything will be okay! Also, STOP ASKING QUESTIONS!

There is zero critical thinking in this video beyond a speedo-level of coverage given by the first test idea that drifts into this guy's head. He's not testing, he's not engineering, he's just developing excuses to release a product.

benatkin · 1y ago

I read this blog post on my iPhone and when I went to the top to try and find out more about Helix, it had a giant link to install the Substack iOS app, which detracted from the experience. It might be a good idea to use a real CMS.

Here’s the website: https://tryhelix.ai/

throwawaymaths · 1y ago

It's inherently difficult because LLMs are necessarily probabilistic and, even worse, for any practical use the key step irreversibly discards most of the probabilities.
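One way to read "the key step" is token sampling; that reading is an assumption about what is meant here. Under it, a toy sketch looks like this: the model produces a full distribution over the vocabulary, one token is drawn, and the rest of the distribution is thrown away before the next step. The vocabulary and logits are invented for illustration.

```python
import math
import random


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(vocab: list[str], logits: list[float], rng: random.Random) -> str:
    """Draw one token; the caller never sees the probabilities."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]


if __name__ == "__main__":
    vocab = ["yes", "no", "maybe"]   # toy vocabulary (made up)
    logits = [2.0, 1.5, 0.5]         # toy scores (made up)
    rng = random.Random()
    print("full distribution:", softmax(logits))
    # Downstream code only ever gets this single string.
    print("what you get to test:", sample_token(vocab, logits, rng))
```

A test that looks only at the sampled string has no way of telling whether the model put 99% or 51% of its probability on that token.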

jasfi · 1y ago

I'm working on an AI agents platform that intends to reduce the amount of code you need to write to get high-performing prompts working correctly.

The wait-list is at https://aiconstrux.com
