0 points · px43 · 3y ago

My understanding is that weights are normally stored as pickled Python blobs, which means arbitrary code can execute when they are unpickled.
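For anyone unfamiliar with why unpickling implies code execution, here is a minimal sketch. The `Payload` class is made up for illustration, and the expression passed to `eval` is deliberately harmless; a hostile file could name any importable callable in its place.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle form is 'call eval on this string'."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and invokes
        # callable(*args) when the blob is loaded.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload; it runs eval("6 * 7") and
# returns whatever that call returns.
result = pickle.loads(blob)
print(result)
```

The loader never needs the `Payload` class to exist, so the victim only has to call `pickle.loads` (or anything built on it) on the file.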

14 comments · 4 top-level

Yoric 3y ago · 5 in thread

If it's PyTorch, it can definitely contain and execute arbitrary code.

One of the reasons I'm not a huge fan of PyTorch.

londons_explore 3y ago

They could contain arbitrary code... But typically do not. That means that with the right viewer application it will be trivial to know for sure.

It isn't like a multi-gigabyte game, for example, where knowing if there is any malicious code could easily be a multi-month reverse engineering project to get to the answer of 'probably not, but we don't have time to check every byte with a fine-tooth comb'.
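A rough sketch of what such a viewer could do, using only the standard library's `pickletools` to walk the opcode stream without executing anything. The `SUSPICIOUS` set and the function name are my own choices for illustration. Legitimate checkpoints also import a few helpers (tensor-rebuild functions, `OrderedDict`), so a real tool would need to compare the imported names against an allowlist instead of flagging every hit.

```python
import pickle
import pickletools

# Opcodes that can import a name or call an object while a pickle loads.
SUSPICIOUS = {
    "GLOBAL", "STACK_GLOBAL", "INST", "OBJ",
    "NEWOBJ", "NEWOBJ_EX", "REDUCE", "BUILD",
}


def list_suspicious_opcodes(blob: bytes):
    """Return (opcode name, argument) pairs; the pickle is parsed, never run."""
    return [
        (op.name, arg)
        for op, arg, _pos in pickletools.genops(blob)
        if op.name in SUSPICIOUS
    ]


# Plain containers of numbers and strings need none of these opcodes.
plain = pickle.dumps({"w": [1.0, 2.0]})
print(list_suspicious_opcodes(plain))
```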

Yoric 3y ago

In theory, this could be done, sure.

In practice, who's going to bother checking the language model? All the code that runs Stable Diffusion or other Hugging Face models that I've seen just downloads the model dynamically, then uses it without asking questions. That's a pretty low-hanging supply chain attack waiting to happen, I believe.

trentdotexe 3y ago

I mentioned this above, but Trail of Bits has a tool just for this purpose called Fickling. https://github.com/trailofbits/fickling

zb3 3y ago

I only found this picklescan[0] serving this purpose, but it doesn't seem to be a finished project.

[0] - https://github.com/mmaitre314/picklescan

winddude 3y ago

Any pickle loaded from a source you're unsure of can contain executable code. There were a few samples a couple of months ago showing distribution on Hugging Face.

Some solutions for checking: https://huggingface.co/docs/hub/security-pickle

or run them in an isolated env.

moffkalast 3y ago · 3 in thread

"They turned the model into a pickle? Funniest shit I've ever seen."

But seriously, why not something more human readable and text-based if it's just weights?

oefrha 3y ago

Because human-readable text-based formats are really inefficient to both download and load, especially when in the hundreds of GB range. And no human cares to read billions of weights.

Yoric 3y ago

Agreed. However, there are much better formats than Python pickles for exchanging binary data. As it is, using PyTorch means that you force your users to also use PyTorch, which is a shame, as libtorch (which is what makes PyTorch work) offers a much more portable format (which I suspect might also be more efficient at least in terms of raw size, but I haven't checked).

pas 3y ago

... why not CBOR or other efficient binary format?

kir-gadjello 3y ago · 1 in thread

A few months ago I made a small library to sanitize pytorch checkpoints, here it is: https://github.com/kir-gadjello/safer_unpickle

The usage boils down to

from safer_unpickle import safer_unpickle

safer_unpickle.patch_torch_load()

This overrides the default torch unpickler with a relatively safe one. Hope this helps.
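For context, the general technique behind a "relatively safe" unpickler is the allowlist pattern from the Python `pickle` docs: override `find_class` so only approved globals can be resolved. This is a sketch of that pattern, not a description of how safer_unpickle is implemented (I have not checked its internals). The `ALLOWED` set and the names `AllowlistUnpickler` and `restricted_loads` are hypothetical.

```python
import collections
import io
import pickle

# Hypothetical allowlist. A real one for torch checkpoints would also
# need whatever tensor-rebuild helpers torch.save writes into the file.
ALLOWED = {("collections", "OrderedDict")}


class AllowlistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global the pickle asks to import.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def restricted_loads(blob: bytes):
    """Like pickle.loads, but refuses any global outside ALLOWED."""
    return AllowlistUnpickler(io.BytesIO(blob)).load()


# An OrderedDict of plain values passes through the allowlist.
state = restricted_loads(pickle.dumps(collections.OrderedDict(a=1)))
print(state)
```

Anything that tries to reach `eval`, `os.system`, or any other name outside the set fails with `pickle.UnpicklingError` before the call happens.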

eigenvalue 3y ago

Sounds like this should be the default. Maybe you can submit a PR to the official Torch repo? There is no reason why a static model checkpoint should be potentially dangerous to run.

trentdotexe 3y ago · 1 in thread

You're right! You should probably use Trail of Bits' Fickling tool to investigate. https://github.com/trailofbits/fickling

TaylorAlexander 3y ago

Thanks for the tip. I tried this on the 7B parameter model and got an error.

$ fickling --check-safety consolidated.00.pth

  File "/usr/lib/python3.10/pickletools.py", line 359, in read_stringnl
    data = codecs.escape_decode(data)[0].decode("ascii")

UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 63: ordinal not in range(128)
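A possible explanation, offered as a guess and not a diagnosis: newer versions of `torch.save` write a zip archive with the pickle stored inside as an entry named `data.pkl`, so a tool that feeds the whole `.pth` file to `pickletools` would be parsing zip headers as pickle opcodes. Assuming this checkpoint is in that zip format (not confirmed for this file), a sketch for pulling the inner pickle out with the standard library looks like this. The function name `extract_pickle_bytes` is made up.

```python
import io
import zipfile


def extract_pickle_bytes(checkpoint):
    """Return the raw bytes of the first data.pkl entry in a zip checkpoint.

    `checkpoint` is a path or a seekable binary file object.
    """
    if not zipfile.is_zipfile(checkpoint):
        raise ValueError("not a zip archive; may be a legacy raw-pickle file")
    with zipfile.ZipFile(checkpoint) as archive:
        for name in archive.namelist():
            # Assumed layout: <archive name>/data.pkl
            if name.endswith("data.pkl"):
                return archive.read(name)
    raise ValueError("no data.pkl entry found")
```

The returned bytes can then be handed to a pickle inspector. Nothing is unpickled by this function itself.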
