Show HN: LlamaGym – fine-tune LLM agents with online reinforcement learning

(github.com)

239 points · KhoomeiK · 2y ago · 28 comments

28 comments · 13 top-level

zeroq · 2y ago · 7 in thread

When 150 lines of boilerplate can land you the first page on HN, maybe it is, in fact, the end of programming?

KhoomeiK (OP) · 2y ago

Karpathy’s micrograd [1] is literally 154 lines. Guess programming ended 4 years ago.

[1] https://github.com/karpathy/micrograd

DSingularity · 2y ago

You think autograd is boilerplate?

yowlingcat · 2y ago

Carmack's infamous fast inverse square root was only 13 lines. Measuring code by line metrics rather than its contents reflects shallow, questionable comprehension.

sevagh · 2y ago

If the first line of Carmack's infamous code was "import fast_inverse_square_root" from Pypi.org, it wouldn't be as impressive.

yen223 · 2y ago

Let's not be one of those people who measure developer productivity by number of lines

klysm · 2y ago

I’m not really sure what your point is. Is it not remarkable that valuable things can be done in 150 lines?

xpe · 2y ago

I agree with you.

Above, I wouldn't assume a single or clearly intended "point". Reading it, I got more of an impression of concern, even fear. I'm guessing one underlying driver may be a concern that AI is creeping into more and more programming. Which is true.

katzenversteher · 2y ago · 3 in thread

From the title I misunderstood what it does. However, now I'm wondering if what I thought it was (don't ask me why I thought it) is possible:

I have a PC that is able to run e.g. Mistral Instruct 7B Q4 inference with around 30 token/s.

How (computation and memory) expensive would it be to also run backpropagation in addition to inference?

I'm aware that the models are typically fed with much more and better data than what is typically provided during normal conversations, but on the other hand, if I could fine-tune my local model a teeny tiny bit during or after each conversation I have with it anyway, it would after a while be perfectly customized for me.

I'm also aware that this could be problematic for models that are used by multiple users but my intended use case would be personal use by a single user.

tveita · 2y ago

For an idea of what's possible you might be interested in this story that was just on HN where they fine tune a quantized 70b model in 48GB VRAM: https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html

dartos · 2y ago

Very expensive.

AFAIK the model can’t be quantized during backprop, so right there you’d need a ton of RAM.

Backprop is faster bc it can be parallelized, but IIRC you need to hold an entire copy of the model for each backprop process.

scribu · 2y ago

Actually, there have been attempts to do quantized backprop, but not sure how successfully.
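A back-of-the-envelope estimate may help put numbers on the question above. The sketch below rests on several assumptions that are not from the thread: a 7B-parameter model, roughly 4.5 bits per weight for a Q4-style quantization, the common mixed-precision Adam accounting of 16 bytes per trainable parameter (fp16 weights and gradients, fp32 master weights, two fp32 Adam moments), and a hypothetical adapter that trains 0.5% of the parameters on top of frozen quantized weights, in the spirit of the QLoRA post linked above. It ignores activations and the KV cache entirely, so the numbers are lower bounds, not measurements.

```python
GIB = 1024 ** 3

def inference_bytes(n_params, bits_per_weight):
    """Memory for the weights alone at a given quantization level."""
    return n_params * bits_per_weight / 8

def full_finetune_bytes(n_params):
    """Mixed-precision Adam: 2 (fp16 weights) + 2 (fp16 grads)
    + 4 (fp32 master weights) + 4 + 4 (Adam moments) = 16 bytes per parameter."""
    return n_params * 16

def adapter_finetune_bytes(n_params, bits_per_weight, trainable_fraction):
    """Frozen quantized base weights plus full training state for a small adapter.
    trainable_fraction is an assumed value, not a property of any specific method."""
    frozen = inference_bytes(n_params, bits_per_weight)
    trainable = n_params * trainable_fraction * 16
    return frozen + trainable

n = 7_000_000_000  # assumed parameter count for a "7B" model
print(f"Q4 inference weights:   {inference_bytes(n, 4.5) / GIB:.1f} GiB")
print(f"Full fine-tune state:   {full_finetune_bytes(n) / GIB:.1f} GiB")
print(f"Adapter on frozen base: {adapter_finetune_bytes(n, 4.5, 0.005) / GIB:.1f} GiB")
```

Under these assumptions the gap between full fine-tuning and adapter training on a frozen quantized base is more than an order of magnitude, which is why the second approach is the one usually discussed for consumer hardware.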

internet101010 · 2y ago · 2 in thread

Thank you for making this. Simplifying any aspect of RL is always welcome.

KhoomeiK (OP) · 2y ago

Thanks! Yeah RL for LLMs is pretty underexplored I think beyond the RLHF stuff. Pretty tough to get working tho.

dartos · 2y ago

Didn’t DPO supplant RLHF?

kayson · 2y ago · 1 in thread

I want to make a Discord bot that impersonates all my friends and continues to refine the model as the conversations continue. Basically this [1] post, but with a more modern model and, ideally, reinforcement learning. Seems like this would fit the bill.... Is there anything else that would make this easier?

[1] https://www.izzy.co/blogs/robo-boys.html

mzl · 2y ago

You could perhaps adapt the Doppel Bot slack bot from Modal Labs: https://github.com/modal-labs/doppel-bot

dennisy · 2y ago · 1 in thread

Can this be used outside of OpenAI environments? If yes I think an example would be great!

KhoomeiK (OP) · 2y ago

Gymnasium is now maintained by the Farama Foundation, an open-source consortium, not OpenAI. But most RL environment work for the past 5+ years has been Gym-compliant. The TextWorld example in the repo, for example, instantiates a Gym-style environment but it doesn’t import from Gymnasium (uses textworld.gym instead).
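To illustrate what "Gym-style" can mean in practice, here is a minimal sketch of an environment that follows the Gymnasium calling convention (`reset` returns an observation and an info dict; `step` returns observation, reward, terminated, truncated, info) without importing Gymnasium at all. The `GuessNumberEnv` class, its text observations, and the `binary_search_episode` helper are hypothetical and made up for illustration; they are not part of LlamaGym or TextWorld, and older Gym-style environments may return a four-element tuple from `step` instead.

```python
import random

class GuessNumberEnv:
    """Hypothetical text environment using the Gymnasium-style reset/step convention."""

    def __init__(self, low=1, high=10, max_turns=5):
        self.low, self.high, self.max_turns = low, high, max_turns
        self._rng = random.Random()
        self._target = None
        self._turn = 0

    def reset(self, seed=None, options=None):
        # Seeding makes the hidden target reproducible across runs.
        if seed is not None:
            self._rng.seed(seed)
        self._target = self._rng.randint(self.low, self.high)
        self._turn = 0
        return f"Guess a number between {self.low} and {self.high}.", {}

    def step(self, action):
        self._turn += 1
        if action == self._target:
            return "Correct!", 1.0, True, False, {}
        hint = "higher" if action < self._target else "lower"
        truncated = self._turn >= self.max_turns  # out of turns, episode cut short
        return f"Wrong, go {hint}.", 0.0, False, truncated, {}

def binary_search_episode(env, seed=0):
    """Play one episode with a scripted binary-search policy; return the total reward."""
    obs, info = env.reset(seed=seed)
    lo, hi = env.low, env.high
    total = 0.0
    while True:
        guess = (lo + hi) // 2
        obs, reward, terminated, truncated, info = env.step(guess)
        total += reward
        if terminated or truncated:
            return total
        if "higher" in obs:
            lo = guess + 1
        else:
            hi = guess - 1

print(binary_search_episode(GuessNumberEnv(), seed=0))
```

Any agent loop written against this five-value `step` contract should work with such a class regardless of which package, if any, it inherits from.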

3abiton · 2y ago · 1 in thread

Interesting project, basically a wrapper tool around OpenAI Gym-like functionality that can handle open LLMs.

KhoomeiK (OP) · 2y ago

Yup, it does simplify LLM agent inference on Gym environments but the main technical contribution is reducing your would-be code overhead for online RL
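As a rough illustration of the bookkeeping that online RL involves, here is a self-contained sketch of the act / reward / update cycle on a Gym-style environment. Everything in it is hypothetical: `TwoArmEnv` is a toy one-step environment, `SoftmaxAgent` is a two-logit softmax policy standing in for an LLM, the update is plain REINFORCE, and the method names (`act`, `assign_reward`, `end_episode`) are chosen for this example and should not be read as LlamaGym's API or training algorithm.

```python
import math
import random

class TwoArmEnv:
    """Hypothetical one-step environment: action 1 pays 1.0, action 0 pays 0.0."""
    def reset(self, seed=None, options=None):
        return "choose 0 or 1", {}

    def step(self, action):
        reward = 1.0 if action == 1 else 0.0
        return "done", reward, True, False, {}

class SoftmaxAgent:
    """Toy policy: one logit per action, updated online with REINFORCE."""
    def __init__(self, n_actions=2, lr=0.5, seed=0):
        self.logits = [0.0] * n_actions
        self.lr = lr
        self.rng = random.Random(seed)
        self.trajectory = []  # [action, reward] pairs for the current episode

    def probs(self):
        m = max(self.logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in self.logits]
        z = sum(exps)
        return [e / z for e in exps]

    def act(self, observation):
        action = self.rng.choices(range(len(self.logits)), weights=self.probs())[0]
        self.trajectory.append([action, 0.0])
        return action

    def assign_reward(self, reward):
        self.trajectory[-1][1] = reward

    def end_episode(self):
        # REINFORCE: d log pi(a) / d logit_k = 1[k == a] - pi_k, scaled by the return.
        episode_return = sum(r for _, r in self.trajectory)
        for action, _ in self.trajectory:
            p = self.probs()
            for k in range(len(self.logits)):
                grad = (1.0 if k == action else 0.0) - p[k]
                self.logits[k] += self.lr * episode_return * grad
        self.trajectory = []

def train(agent, env, episodes=200):
    for _ in range(episodes):
        obs, info = env.reset()
        done = False
        while not done:
            action = agent.act(obs)
            obs, reward, terminated, truncated, info = env.step(action)
            agent.assign_reward(reward)
            done = terminated or truncated
        agent.end_episode()  # the policy is updated between episodes, i.e. online
    return agent

agent = train(SoftmaxAgent(), TwoArmEnv())
print(f"P(action 1) after training: {agent.probs()[1]:.3f}")
```

With an LLM in place of the two logits, the same loop additionally has to track prompts, generated tokens, and batching for the optimizer, which is the kind of overhead the comment above refers to.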

potatoman22 · 2y ago

Could someone help me understand the kinds of things you can build with this? Is this like RLHF?

KhoomeiK (OP) · 2y ago

Twitter thread: https://x.com/khoomeik/status/1766805213644800011?s=46

adawg4 · 2y ago

Thanks for making this! Helps simplify it nicely

raidicy · 2y ago

Thanks for creating this!

ponderchan · 2y ago

llamagym.com for sale

neodypsis · 2y ago

Very interesting!

SuhanaJabin · 2y ago

Simplified the concept. Nicely done!
