Using LLMs to Generate Fuzzers

(verse.systems)

156 points · moyix · 2y ago · 28 comments

24 comments · 6 top-level

ttul · 2y ago · 6 in thread

I read a lot of niggling comments here about whether Claude was really being smart in writing this GIF fuzzer. Of course it was trained on fuzzer source code. Of course it has read every blog post about esoteric boundary conditions in GIF parsers.

But to bring all of those things together and translate the concepts into working Python code is astonishing. We have just forgotten that a year ago, this achievement would have blown our minds.

I recently had to write an email to my kid’s school so that he could get some more support for a learning disability. I fed Claude 3 Opus a copy of his 35 page psychometric testing report along with a couple of his recent report cards and asked it to draft the email for me, making reference to things in the three documents provided. I also suggested it pay special attention to one of the testing results.

The first email draft was ready to send. Sure, I tweaked a thing or two, but this saved me half an hour of digging through dense material written by a psychologist. After verifying that there were no factual errors, I hit “Send.” To me, it’s still magic.

asymmetric · 2y ago

Were you not concerned about the privacy implications of uploading your child's sensitive health data to a private LLM?

michaelbuckbee · 2y ago

Not OP, but parent of multiple school-age kids and both:

1. You're 100% right, there are privacy concerns.

2. I don't know if they could possibly be worse than the majority of school districts (including my kids') running directly off of Google's Education system (Chromebooks, Google Docs, Gmail, etc.).

flemhans · 2y ago

Can you opt out? Are there privacy-friendly schools?

Could you enroll your child under a fake name? How messed up would they think that is :D

kelseyfrog · 2y ago

It's important to differentiate concern (a feeling) from the choice to upload or not. In the calculus of benefits and risks, the feeling of concern (about potentially leaking PII or health information) may be outweighed by the benefit in education. Even if someone is concerned, they may still see the positives outweigh the risks. It's a subjective decision at the end of the day.

ttul · 2y ago

I should have clarified that I used Adobe Acrobat to redact his personal identifiers from the report before uploading it to Claude. I generally also prompt using fake names. It's not perfect, but it's better than nothing.

ttul · 2y ago

And, on another note, this may be foolish, but I generally trust well-funded organizations like Anthropic and OpenAI on the assumption that they have everything to lose if they leak private information from their paid users. Anthropic has a comprehensive and thoughtful privacy policy (https://www.anthropic.com/legal/privacy), which specifies that they do not use your data to train their models, other than to refine models used for trust and safety:

"We will not use your Inputs or Outputs to train our models, unless: (1) your conversations are flagged for Trust & Safety review (in which case we may use or analyze them to improve our ability to detect and enforce our Acceptable Use Policy, including training models for use by our Trust and Safety team, consistent with Anthropic’s safety mission), or (2) you’ve explicitly reported the materials to us (for example via our feedback mechanisms), or (3) by otherwise explicitly opting in to training."

As for defending against a data breach, Anthropic hired a former Google engineer, Jason Clinton, as CISO. I couldn't find much information about the relevant experience at Google that may have made him a good candidate for this role, but people with a key role in security at large organizations often don't advertise this fact on their LinkedIn profiles as it makes them a target. Once you're the CISO, the target appears, but that's what the big money is for.

smusamashah · 2y ago · 5 in thread

I have kind of a pet peeve with people testing LLMs like this these days.

They take whatever it spits out on the first attempt and then go on to extrapolate from it to draw all kinds of conclusions. They forget that the output it generated is based on a random seed. A new attempt (with a new seed) is going to give a totally different answer.

If the author had retried that prompt, the new attempt might have generated better code or a lot worse code. You cannot draw conclusions from just one answer.

Vetch · 2y ago

That doesn't seem to be the case here. Reading through the article and Twitter thread, the impression I get is that between moyix and the author, a decent amount of time was spent on this. A valid criticism could have been the use of Claude Sonnet, but based on the Twitter thread, it looks like Opus was what @moyix leveraged.

moyix (OP) · 2y ago

Yes – it's a bit hard to follow the various branches of the thread on Twitter (I wasn't really intending this to be more than a 30 minute "hey that's neat" kind of experiment, but people kept suggesting new and interesting things to try :)), but I gave Claude Opus three independent tries at creating a fuzzer for Toby's packet parser, and it consistently missed the fact that it needed to include the sequence number in the CRC calculation.

Once that oversight was pointed out, it did write a decent fuzzer that found the memory safety bugs in do_read and do_write. I also got it to fix those two bugs automatically (by providing it the ASAN output).
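To make the CRC oversight concrete, here is a minimal sketch of a packet generator. The packet layout (a 4-byte sequence number, a 2-byte length, a 4-byte CRC32, then the payload) and every function name are assumptions for illustration only; the thread does not give the real format of Toby's parser. The only detail taken from the discussion is that the checksum has to cover the sequence number as well as the payload.

```python
import random
import struct
import zlib


def build_packet(seq: int, payload: bytes, include_seq_in_crc: bool = True) -> bytes:
    """Build one packet in a made-up layout: seq (u32) | length (u16) | crc32 (u32) | payload."""
    seq_bytes = struct.pack("<I", seq)
    # The mistake described above corresponds to include_seq_in_crc=False:
    # the checksum then covers only the payload.
    covered = seq_bytes + payload if include_seq_in_crc else payload
    crc = zlib.crc32(covered) & 0xFFFFFFFF
    return seq_bytes + struct.pack("<HI", len(payload), crc) + payload


def check_packet(packet: bytes) -> bool:
    """Hypothetical stand-in for the parser's integrity check (CRC over seq + payload)."""
    seq_bytes = packet[:4]
    length, crc = struct.unpack("<HI", packet[4:10])
    payload = packet[10:10 + length]
    return zlib.crc32(seq_bytes + payload) & 0xFFFFFFFF == crc


def random_packet(rng: random.Random) -> bytes:
    """Generate one well-formed packet with a random sequence number and payload."""
    payload = bytes(rng.getrandbits(8) for _ in range(rng.randint(0, 64)))
    return build_packet(rng.getrandbits(32), payload)
```

A generator that leaves the sequence number out of the checksum produces packets that a check like this rejects before any interesting parsing code runs, which would explain why the oversight mattered so much.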

refulgentis · 2y ago

> A new attempt (with a new seed) is going to give a totally different answer

Totally different...I'd posit 5% different, and mostly in trivialities.

It's worth doing an experiment and prompting an LLM with a coding question twice, then seeing how different it is.

For, say, a K-Means clustering algorithm, you're absolutely correct. The initial state is _completely_ dependent on the choice of seed.

With LLMs, the initial state is your prompt + a seed. The prompt massively overwhelms the seed. Beyond that, the nature of the model (predicting probabilities) and the nature of sampling (attempting to minimize surprise) create a powerful forcing function towards answers that share much in common. This holds in theory and, I think you'll see, in practice.
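A toy sketch of that argument, under loud assumptions: the per-step logits below are fixed, made-up numbers standing in for a strongly prompt-conditioned distribution, and no real model is involved. A real LLM conditions each step on the tokens already sampled, so this does not capture how one early divergence can compound. It only shows that when the distribution is peaked, changing the seed changes few of the sampled tokens.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, with the usual max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(logits_per_step, seed, temperature=1.0):
    """Sample one token index per step; the seed is the only source of randomness."""
    rng = random.Random(seed)
    tokens = []
    for logits in logits_per_step:
        probs = softmax(logits, temperature)
        tokens.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return tokens


def agreement(a, b):
    """Fraction of positions at which two equal-length token sequences match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical peaked distribution: one token strongly favoured at every step.
peaked_steps = [[5.0, 0.0, 0.0, 0.0]] * 200
run_a = sample_tokens(peaked_steps, seed=1)
run_b = sample_tokens(peaked_steps, seed=2)
print(f"agreement between seeds: {agreement(run_a, run_b):.2f}")
```

Flattening the logits or raising the temperature lowers the agreement, which is the regime the parent comment is worried about.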

smusamashah · 2y ago

Depends on the question. If you asked for a small fact, you are going to get almost the same answer every time. But if it's not a factual question, and the answer is supposed to be a long tangled one, then the answer is going to depend on what the LLM said in the first lines, because it is going to stick with that.

E.g., the LLM might have said for some reason that writing a fuzzer like this isn't possible and then gone on to present some alternatives for the given task.

I only have experience with GPT-4 via API, but I believe at core all these LLMs work the same way.

refulgentis · 2y ago

You're absolutely correct, in that it's never guaranteed what the next token is.

My pushback is limited to this: the theoretical maximally degenerate behavior described in either of your comments is highly improbable in practice, given reasonable parameters and a reasonable model.

I.e. it will not

- give totally different answers due to seed changing.

- say it is impossible X% of the time, where X > 5, and provide some solution the other (100 - X)% of the time.

I have integrated with GPT3.0/GPT3.5/GPT4 and revisions thereof via API, as well as Claude 2 and, this week, Claude 3. I wrote a native inference solution that runs, among others, StableLM Zephyr 3B, Mistral 7B, and Mixtral 8x7B, and I wrote code that does inference, step by excruciating step, in a loop, on web via WASM, and via C++, with tailored solutions for Android, iOS, macOS, and Windows.

aaron695 · 2y ago · 3 in thread

I don't understand why we are getting LLMs to generate code to create fuzzing data as a 'thing'.

Logically LLMs should be quite good at creating the fuzzing data.

To state the obvious reason: it's too expensive to use LLMs directly, and this way works, since they found "4 memory safety bugs and one hang".

But the future we are heading to should be one where LLMs directly pentest/test the code. This is where it's interesting and new.

moyix (OP) · 2y ago

I don't think using a language model to generate inputs directly is ever going to be as efficient as writing a little bit of code to do the generation; it's really hard to beat an input generator that can craft thousands of inputs/second.

yencabulator · 2y ago

For one, it'd be really hard for an LLM to get the CRC32 right, especially when it's in a header before the data it covers.

Then again, this whole approach to fuzzing comes across as kinda naive. At the very least you'd want to use the API of a coverage-guided fuzzer for generating the randomness (and then almost always fix up the CRC32 on top of that, like a human-written wrapper function would).

dmazzoni · 2y ago

Exactly. If I actually wanted to fuzz this I'd use libfuzzer and manually fix the crc32. An LLM would be useful in helping me write the libfuzzer glue code.
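The fix-up wrapper mentioned in the two comments above might look roughly like the sketch below. It assumes, purely for illustration, a 4-byte little-endian CRC32 at the very start of the input that covers every byte after it; the real packet format is not given in this thread, and `fix_crc` is a hypothetical name. The idea is that the coverage-guided fuzzer mutates bytes freely and the wrapper repairs the checksum afterwards, so mutated inputs still get past the integrity check.

```python
import struct
import zlib

CRC_SIZE = 4  # assumption: 4-byte little-endian CRC32 field at offset 0


def fix_crc(data: bytes) -> bytes:
    """Overwrite the leading CRC32 field so it matches the bytes that follow it."""
    if len(data) < CRC_SIZE:
        # Pad very short inputs so there is always room for the checksum field.
        data = data.ljust(CRC_SIZE, b"\x00")
    body = data[CRC_SIZE:]
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return struct.pack("<I", crc) + body
```

In a harness, something like this would typically run on each mutated input just before it is handed to the parser.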

dmazzoni · 2y ago · 2 in thread

Why wouldn't you have an LLM write some code that uses something like libfuzzer instead?

That way you get an efficient, robust coverage-driven fuzzing engine, rather than having an LLM try to reinvent the wheel on that part of the code poorly. Let the LLM help write the boilerplate code for you.

moyix (OP) · 2y ago

They're actually orthogonal approaches – from what I've seen so far the LLM fuzzer generates much higher quality seeds than you'd get even after fuzzing for a while (in the case of the VRML target, even if you start with some valid test files found online), but it's not as good at generating broken inputs. So the obvious thing to do is have the LLM's fuzzer generate initial seeds that get pretty good coverage and then a traditional coverage-guided fuzzer to further mutate those.

These are still pretty small scale experiments on essentially toy programs, so it remains to be seen if LLMs remain useful on real world programs, but so far it looks pretty promising – and it's a lot less work than writing a new libfuzzer target, especially when the program is one that's not set up with nice in-memory APIs (e.g., that GIF decoder program just uses read() calls distributed all over the program; it would be fairly painful to refactor it to play nicely with libfuzzer).
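The hand-off described in this comment (generator output used as the starting corpus for a coverage-guided fuzzer) can be sketched as below. `generate_input` is a hypothetical placeholder for whatever generator the LLM wrote, and the GIF-like header it emits is only there so the example does something visible. Files are named by content hash so that duplicate inputs collapse into one file.

```python
import hashlib
import random
from pathlib import Path


def generate_input(rng: random.Random) -> bytes:
    """Placeholder for the LLM-written generator: a GIF-like header plus random bytes."""
    body = bytes(rng.getrandbits(8) for _ in range(rng.randint(0, 32)))
    return b"GIF89a" + body


def write_seed_corpus(corpus_dir: str, count: int, seed: int = 0) -> int:
    """Write `count` generated inputs into corpus_dir; return the number of distinct files."""
    rng = random.Random(seed)
    out = Path(corpus_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = set()
    for _ in range(count):
        data = generate_input(rng)
        name = hashlib.sha1(data).hexdigest()  # content hash doubles as a dedup key
        (out / name).write_bytes(data)
        written.add(name)
    return len(written)
```

The resulting directory can then be given to a coverage-guided fuzzer as its initial corpus, for example as a positional corpus directory for libFuzzer or as the `-i` input directory for AFL++.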

fooker · 2y ago

Because you want to fuzz more effectively, not write a fuzzer more effectively.

popinman322 · 2y ago · 1 in thread

You could likely also combine the LLM with a coverage tool to provide additional guidance when regenerating the fuzzer: "Your fuzzer missed lines XX-YY in the code. Explain why you think the fuzzer missed those lines, describe inputs that might reach those lines in the code, and then update the fuzzer code to match your observations."

This approach could likely also be combined with RL; the code coverage provides a decent reward signal.
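A minimal sketch of the feedback step proposed above: take the line numbers a coverage tool reports as unexecuted, collapse them into ranges, and fill in the quoted prompt. How the missed lines are obtained (gcov, llvm-cov, or similar) is left out, and both function names are made up for illustration.

```python
def to_ranges(lines):
    """Collapse line numbers into inclusive (start, end) ranges of consecutive lines."""
    ranges = []
    for n in sorted(set(lines)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1][1] = n  # extend the current run
        else:
            ranges.append([n, n])  # start a new run
    return [(start, end) for start, end in ranges]


def feedback_prompt(missed_lines):
    """Build the regeneration prompt from the uncovered line numbers."""
    spans = ", ".join(
        f"{start}-{end}" if start != end else str(start)
        for start, end in to_ranges(missed_lines)
    )
    return (
        f"Your fuzzer missed lines {spans} in the code. "
        "Explain why you think the fuzzer missed those lines, "
        "describe inputs that might reach those lines in the code, "
        "and then update the fuzzer code to match your observations."
    )


print(feedback_prompt([120, 121, 122, 140, 201, 202]))
```

The same missed-line count could also serve as the scalar reward signal for the RL variant, though that part is not sketched here.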

wrsh07 · 2y ago

To me, if it detects bugs (and fixing those makes the others reachable), that seems like a pretty acceptable iterative step.

It's less academically pure, but as an engineer who wants to fix bugs, it seems OK.

planetis · 2y ago · 1 in thread

It seems to overlook that the language model was developed using a large corpus of code, which probably includes structured fuzzers for file formats such as GIF. Plus, the scope of the "unknown" format introduced is limited.

moyix (OP) · 2y ago

The original test of the GIF parser does, but the VRML parser less so and the completely novel packet parser even less so. I'm not quite sure what you mean by the scope of the "unknown" format being limited – it's not the most complex format in the world, but neither is GIF.

Another test to check how much seeing the actual parser code helps is to have it generate a GIF fuzzer without giving it the code:

https://twitter.com/moyix/status/1766135426476064774

And finally, for fun, we can see how it does when we give it the RFC for GIF89a:

https://twitter.com/moyix/status/1766207786751279298
