RLHF a LLM in <50 lines of Python (opens in new tab)

(datadreamer.dev)

223 pointspatelajay2852y ago66 comments

66 comments

49 comments · 18 top-level

jerpint2y ago· 12 in thread

I don’t understand the obsession of LOC for wrappers - it’s the whole point of a wrapper. It makes it much easier for the user at the expense of making it less hackable

Title should be instead “Library for low-code RLHF in python”

vvrm2y ago

Another problem with the title: the article is about DPO, which doesn’t do reinforcement learning. So not RLHF. I guess RLHF has more of a name recognition than DPO.

janalsncm2y ago

Honestly a much bigger problem than LOC. It’s a completely different algorithm.

patelajay285OP2y ago

This was discussed in another comment, DPO is pretty much strictly better than RLHF + PPO, and far more stable when training. Yes, DPO is not technically "RL", but it's semantics for the most part. DataDreamer does support PPO training if you want, but it's so unstable, it's a less popular choice now.

2 more replies

behnamoh2y ago

This. If I'm the type of person who wants to do RLHF, then I'm the type of person who wants control and doesn't like delegating it to imported libraries.

patelajay285OP2y ago

This is built for ML researchers out of an academic lab. There's a ton of functionality in the library (beyond RLHF and alignment) that ML researchers do every day to write papers and run experiments that the library helps abstract and make repeatable and usable.

Unless your research hypothesis is specifically around improving or changing RLHF, it's unlikely you should be implementing it from scratch. Abstractions are useful for a reason. The library is quite configurable to let you tune any knobs you would want.

patelajay285OP2y ago

This is developed for researchers, so I assure it’s very hackable and configurable. ;-) but appreciate the feedback on the title!

tgsovlerkhgsel2y ago

Honestly the amount of complicated boilerplate that you're supposed to write from scratch every time you do something with ML (in some of the major frameworks) deterred me from touching anything ML-related for a long time.

As far as I understand, what the training loop is supposed to be doing is pretty static and you don't need to understand most of it in order to "do ML", but at the same time it's full of complicated things to get right (which would be much easier to understand when controlled through well defined parameters instead of mixing boilerplate and config).

brigadier1322y ago

I always appreciate these projects because I just dive into the code itself and copy out what I need once the wrapper becomes too much of a burden.

patelajay285OP2y ago

That’s totally valid and something we would even encourage! This project is for researchers so if there is a point where the abstraction is no longer useful, by all means configure, or subclass, or copy code.

rovr1382y ago

Of course. And they're not saying they don't have a place.

They're saying why does it matter if it's 50 vs 60 or even 100. It's a wrapper, which should be less lines. That's the whole point. Abstracting things even further and making assumptions.

Of course you can use them. Of course you can remove them after and use the underlying code. But the LOC shouldn't be the important part of it

verticalscaler2y ago

Yes you do. Most casuals are downright afraid of code. This messaging is meant to make the project more approachable.

Kind of like everybody knows the pop-science around e = mc^2 but most are completely oblivious that it takes a bunch of whiteboards to derive it and what all that actually means.

No pithy formula no way for the actual ideas to spread to the mainstream for you to somehow hear about it.

antonvs2y ago

This reminds me of the advice Stephen Hawking's publisher gave him, which was that every equation he included in his book, A Brief History of Time, would cut the sales of the book in half. As a result the only equation that ended up in the book was E=mc^2.

lopkeny12ko2y ago· 5 in thread

It's not 50 lines of code if all the real work is done by importing a library...

That's like saying, I can solve any problem in 2 lines of code. I'll publish a library for it first, then:

import foo; foo.do_the_thing()

Magic!

antonvs2y ago

Software developers hate this one simple trick!

peab2y ago

did people say the same thing when assembly code got abstracted away?

orblivion2y ago

There's levels of abstraction. "lines of Python" to me roughly means the standard library.

1 more reply

skelpmargyar2y ago

Importing a library is not abstraction any more than closing your eyes is abstracting the world to black.

1 more reply

therealdrag02y ago

[relevant xkcd]

imjonse2y ago· 5 in thread

The first paragraphs says RLHF can be used to align models, and the seconds say here's how to do it by using DPO. These two methods are not the same, and the latter is not an instance of the former.

patelajay285OP2y ago

Fair, DPO is considered a fairly well established technique now that is far more stable in training than PPO, but also helps align LLMs from human feedback. The package also helps do PPO, so you can do traditional RLHF, but figured more people would be interested in seeing a DPO example, given how unstable PPO is.

Der_Einzige2y ago

The latter is strictly superior to the former though. RlHF has been abandoned in the open source world.

numeri2y ago

I don't know about strictly superior. It's certainly strictly easier for people with a budget, who just need "good enough" results the first try. I don't have any evidence whatsoever, but I'd expect that enough tuning and retries can get squeeze a bit more performance out of RLHF than you can get out of DPO.

patelajay285OP2y ago

Yep, DPO is not technically “RL” and implicitly uses the LLM itself as a reward model, but training with DPO is far more stable for that reason.

2 more replies

imjonse2y ago

I am just saying the intro paragraphs are confusing.

1 more reply

mk_stjames2y ago· 3 in thread

I feel the preparation and loading of the dataset has been abstracted too far away. I have no idea what type of data format I need or how it is loaded for this (it is using a pre-prepared huggingface dataset?). If I have local data how should it be loaded? What does that even look like? Is it expecting some sort of JSON?

When you get so far as to abstracting every step to loading a one-liner from huggingface, including the downloading of a prepared dataset with no example of doing the same on custom local dataset, you've abstracted too far to be useful for anyone other than the first user.

patelajay285OP2y ago

Thanks for the question. This is built for ML researchers, so in examples we use the defacto source for datasets researchers often use, HF Hub.

However, there is a lot of documentation on the site to help guide users. This documentation page shows you can load in data via local datasets as well. For example, JSON, CSV, text files, a local HF Dataset folder, or even from a Python `dict` or `list`:

https://datadreamer.dev/docs/latest/datadreamer.steps.html#t...

We'll definitely keep improving documentation, guides, and examples. We have a lot of it already, and more to come! This has only recently become a public project :)

If anyone has any questions on using it, feel free to email me directly (email on the site and HN bio) for help in the meantime.

mk_stjames2y ago

I did glance at the docs first before commenting but I was looking in 'datasets' to try and understand importing a potential CSV/JOSN etc and all I saw was verbage on accessing the output.

I would not have guessed that the base input data processing would have been filed under 'steps'. But now I kinda see how you are working, but I admit I'm not the target audience.

If you want this to really take off for people outside of a very, very specific class of researchers... setup an example on your landing page that calls to a local JSON of user prompts/answers/rejects finetuning a llama model with your datadreamer.steps.JSONDataSource into the loader. Or, a txt file with the system/user/assistant prompts tagged and examples given. Yes, your 'lines of code' for your frontpage example may grow a bit!

Maybe there are a lot of 'ML researchers' that are used to the type of super-abstract OOP API, load-it-from-huggingface-scheme-people you are targeting but also know that there are a ton that aren't.

1 more reply

ganeshkrishnan2y ago

The dataset is here I presume: https://huggingface.co/datasets/Intel/orca_dpo_pairs

you can look at the samples. Mostly its questions and accepted/rejected answers.

patelajay285OP2y ago· 2 in thread

Hi everyone, there are no easy tools for synthetic data generation or training and aligning LLMs simply in Python. Most of the stuff out there are messy adhoc scripts.

DataDreamer is an open source Python package with a nice API from the University of Pennsylvania that does all this that we’re actively developing. Will be here to answer questions.

https://github.com/datadreamer-dev/DataDreamer

baggiponte2y ago

The API looks nice, congratulations. Will experiment with it. One small silly question: why did you choose to specify the dependencies inside the src dir with the requirements format - rather than inside the pyproject?

patelajay285OP2y ago

Thanks! It makes it easier to run with the existing run scripts I have on our large university GPU cluster. :) no other reason

aethelyon2y ago· 2 in thread

This is cool, but the data collection is the hard part, right?

patelajay285OP2y ago

Yes it is :), but the library is also a synthetic data generation library, so for example you can create the data for DPO fully synthetically, check out the self-rewarding LLMs example:

https://datadreamer.dev/docs/latest/pages/get_started/quick_...

sillysaurusx2y ago

I’m extremely skeptical of this approach. Until proven otherwise, with a model that users actually find useful, I don’t think this can work.

It would be nice. But I’ve seen too many nice ideas completely fall apart in practice to accept this without some justification. Even if there are papers on the topic, and those papers show that the models rank highly according to some eval metrics, the only metric that truly matters is "the user likes the model and it solves their problems."

By the way, on a separate topic, the 90/10 dataset split that you do in all of your examples turns out to be fraught with peril in practice. The issue is that the validation dataset quality turns out to be crucial, and randomly yeeting 10% of your data into the validation dataset without manual review is a recipe for problems.

1 more reply

g4zj2y ago· 1 in thread

Very cool, but I can't help but feel like titles that reference low-LOC are a bit clickbait-y when nearly all the heavy lifting is done by imported libraries.

patelajay285OP2y ago

Appreciate the feedback on the title, this is developed for ML researchers, so I assure there is a lot it’s doing under the hood to make this process easier (for example introducing automatic caching and resumability).

However, we also tried to simplify the API and have sensible defaults to make it usable for anyone / make ML research code cleaner :)

ilaksh2y ago· 1 in thread

How do you normally do DPO? Is that built in to PyTorch or something?

Theoretically the hard part is collecting the examples with rejections etc.

patelajay285OP2y ago

Collecting data is hard, but the library is also a synthetic data generation library, so for example you can create the data for DPO fully synthetically, check out the self-rewarding LLMs example: https://datadreamer.dev/docs/latest/pages/get_started/quick_...

proto-n2y ago

Yeah well in bash I can do it in one line: `python train.py`. I hate examples like this, the 50loc statement is totally useless (and so is the code example, as I can't learning anything from it).

MrYellowP2y ago

I don't prefer aligned models and I'm a human. It's not okay to claim that that's what humans prefer. There might be a subset of humans who can't handle words, but they're not even remotely in the majority.

Algined models are dumber, treat everyone like they're stupid immature idiots who can't handle words and they're a wannabe moral authority.

theptip2y ago

Interested if local RLHF is actually viable; can you get meaningful steering from 1k feedback points on a narrow task? I feel that annotation count is achievable with a single dedicated annotator making a few comments per minute (though tedious), 10k would be a week of work so achievable for a very dedicated hobbyist, and 100k seems out of reach for a hobby project.

Say for simple conversation usecases (eg customer support for a specific product, interactive fiction, things like that without deep technical knowledge).

I was also wondering if it’s possible to do such RLHF for SD running locally.

bbstats2y ago

I can abstract this to 2 lines

v4dok2y ago

I feel like the current meta on finetuning LLMs is random accounts at X/Twitter. Google results are littered with SEO garbage or some kind of guides that fail to work the moment you need something slightly different.

rldjbpin2y ago

it is very conflicting to see "do x in y LOC" in this field, especially when most of the workflow for different models are fragmented across non-overlapping frameworks/tooling.

to actually do something from scratch or using the author's code requires adopting something esoteric just for this purpose. for these scenarios it is nice to appreciate hf and their abstraction. but the reinventing the wheel situation is very frustrating to work with.

if you want to go beyond the demo, you have to deal with this painful reality. i hope there is more progress on this rather than making stacks of api.

spdustin2y ago

It occurs to me that there must be a model that's been "aligned" opposite to the usual RLHF. Or has nobody done that?

potatoman222y ago

This seems useful, thanks!

rrr_oh_man2y ago

RLHF = Reinforcement Learning from Human Feedback

cztomsik2y ago

DPO is not RLHF.

j / k navigate · click thread line to collapse

66 comments

49 comments · 18 top-level

jerpint2y ago· 12 in thread

I don’t understand the obsession of LOC for wrappers - it’s the whole point of a wrapper. It makes it much easier for the user at the expense of making it less hackable

Title should be instead “Library for low-code RLHF in python”

vvrm2y ago

Another problem with the title: the article is about DPO, which doesn’t do reinforcement learning. So not RLHF. I guess RLHF has more of a name recognition than DPO.

janalsncm2y ago

Honestly a much bigger problem than LOC. It’s a completely different algorithm.

patelajay285OP2y ago

2 more replies

behnamoh2y ago

This. If I'm the type of person who wants to do RLHF, then I'm the type of person who wants control and doesn't like delegating it to imported libraries.

patelajay285OP2y ago

This is developed for researchers, so I assure it’s very hackable and configurable. ;-) but appreciate the feedback on the title!

tgsovlerkhgsel2y ago

brigadier1322y ago

I always appreciate these projects because I just dive into the code itself and copy out what I need once the wrapper becomes too much of a burden.

patelajay285OP2y ago

rovr1382y ago

Of course. And they're not saying they don't have a place.

They're saying why does it matter if it's 50 vs 60 or even 100. It's a wrapper, which should be less lines. That's the whole point. Abstracting things even further and making assumptions.

Of course you can use them. Of course you can remove them after and use the underlying code. But the LOC shouldn't be the important part of it

verticalscaler2y ago

Yes you do. Most casuals are downright afraid of code. This messaging is meant to make the project more approachable.

Kind of like everybody knows the pop-science around e = mc^2 but most are completely oblivious that it takes a bunch of whiteboards to derive it and what all that actually means.

No pithy formula no way for the actual ideas to spread to the mainstream for you to somehow hear about it.

antonvs2y ago

lopkeny12ko2y ago· 5 in thread

It's not 50 lines of code if all the real work is done by importing a library...

That's like saying, I can solve any problem in 2 lines of code. I'll publish a library for it first, then:

import foo; foo.do_the_thing()

Magic!

antonvs2y ago

Software developers hate this one simple trick!

peab2y ago

did people say the same thing when assembly code got abstracted away?

orblivion2y ago

There's levels of abstraction. "lines of Python" to me roughly means the standard library.

1 more reply

skelpmargyar2y ago

Importing a library is not abstraction any more than closing your eyes is abstracting the world to black.

1 more reply

therealdrag02y ago

[relevant xkcd]

imjonse2y ago· 5 in thread

The first paragraphs says RLHF can be used to align models, and the seconds say here's how to do it by using DPO. These two methods are not the same, and the latter is not an instance of the former.

patelajay285OP2y ago

Der_Einzige2y ago

The latter is strictly superior to the former though. RlHF has been abandoned in the open source world.

numeri2y ago

patelajay285OP2y ago

Yep, DPO is not technically “RL” and implicitly uses the LLM itself as a reward model, but training with DPO is far more stable for that reason.

2 more replies

imjonse2y ago

I am just saying the intro paragraphs are confusing.

1 more reply

mk_stjames2y ago· 3 in thread

patelajay285OP2y ago

Thanks for the question. This is built for ML researchers, so in examples we use the defacto source for datasets researchers often use, HF Hub.

https://datadreamer.dev/docs/latest/datadreamer.steps.html#t...

We'll definitely keep improving documentation, guides, and examples. We have a lot of it already, and more to come! This has only recently become a public project :)

If anyone has any questions on using it, feel free to email me directly (email on the site and HN bio) for help in the meantime.

mk_stjames2y ago

I did glance at the docs first before commenting but I was looking in 'datasets' to try and understand importing a potential CSV/JOSN etc and all I saw was verbage on accessing the output.

I would not have guessed that the base input data processing would have been filed under 'steps'. But now I kinda see how you are working, but I admit I'm not the target audience.

Maybe there are a lot of 'ML researchers' that are used to the type of super-abstract OOP API, load-it-from-huggingface-scheme-people you are targeting but also know that there are a ton that aren't.

1 more reply

ganeshkrishnan2y ago

The dataset is here I presume: https://huggingface.co/datasets/Intel/orca_dpo_pairs

you can look at the samples. Mostly its questions and accepted/rejected answers.

patelajay285OP2y ago· 2 in thread

Hi everyone, there are no easy tools for synthetic data generation or training and aligning LLMs simply in Python. Most of the stuff out there are messy adhoc scripts.

DataDreamer is an open source Python package with a nice API from the University of Pennsylvania that does all this that we’re actively developing. Will be here to answer questions.

https://github.com/datadreamer-dev/DataDreamer

baggiponte2y ago

patelajay285OP2y ago

Thanks! It makes it easier to run with the existing run scripts I have on our large university GPU cluster. :) no other reason

aethelyon2y ago· 2 in thread

This is cool, but the data collection is the hard part, right?

patelajay285OP2y ago

Yes it is :), but the library is also a synthetic data generation library, so for example you can create the data for DPO fully synthetically, check out the self-rewarding LLMs example:

https://datadreamer.dev/docs/latest/pages/get_started/quick_...

sillysaurusx2y ago

I’m extremely skeptical of this approach. Until proven otherwise, with a model that users actually find useful, I don’t think this can work.

1 more reply

g4zj2y ago· 1 in thread

Very cool, but I can't help but feel like titles that reference low-LOC are a bit clickbait-y when nearly all the heavy lifting is done by imported libraries.

patelajay285OP2y ago

However, we also tried to simplify the API and have sensible defaults to make it usable for anyone / make ML research code cleaner :)

ilaksh2y ago· 1 in thread

How do you normally do DPO? Is that built in to PyTorch or something?

Theoretically the hard part is collecting the examples with rejections etc.

patelajay285OP2y ago

proto-n2y ago

Yeah well in bash I can do it in one line: `python train.py`. I hate examples like this, the 50loc statement is totally useless (and so is the code example, as I can't learning anything from it).

MrYellowP2y ago

Algined models are dumber, treat everyone like they're stupid immature idiots who can't handle words and they're a wannabe moral authority.

theptip2y ago

Say for simple conversation usecases (eg customer support for a specific product, interactive fiction, things like that without deep technical knowledge).

I was also wondering if it’s possible to do such RLHF for SD running locally.

bbstats2y ago

I can abstract this to 2 lines

v4dok2y ago

rldjbpin2y ago

it is very conflicting to see "do x in y LOC" in this field, especially when most of the workflow for different models are fragmented across non-overlapping frameworks/tooling.

if you want to go beyond the demo, you have to deal with this painful reality. i hope there is more progress on this rather than making stacks of api.

spdustin2y ago

It occurs to me that there must be a model that's been "aligned" opposite to the usual RLHF. Or has nobody done that?

potatoman222y ago

This seems useful, thanks!

rrr_oh_man2y ago

RLHF = Reinforcement Learning from Human Feedback

cztomsik2y ago

DPO is not RLHF.

j / k navigate · click thread line to collapse