Learnings from fine-tuning LLM on my Telegram messages

(asmirnov.xyz)

216 points · furiousteabag · 2y ago · 69 comments

48 comments · 12 top-level

NoraCodes2y ago· 15 in thread

A meta-comment, but, what is the difference between "learnings" and "lessons"? Why use the former when we have the latter?

furyofantares2y ago

Learnings implies a report of your own experience; lessons implies something prepared as teaching material for the audience. (In the context of the title sentence anyway.)

xanderlewis2y ago

‘Lessons’ to me also seems to carry a sense of regret, as in ‘things (we) got wrong’. ‘Learnings’ is a more obscure word that I would take to mean something more neutral: literally ‘things (I’ve) learnt’.

kagol2y ago

Perhaps "findings" over "learnings", based on your description?

swatcoder2y ago

https://en.m.wiktionary.org/wiki/learnings

Beyond what's noted there (contemporary business jargon), English is diffused across the globe and has many regional variations that are different than class-signalling/formal American and British usage. As we all encounter each other online, it's not always worth over-analyzing word choice when you can understand the intent.

amccollum2y ago

This usage of "learnings", while certainly more common in "business jargon" today, was used by Shakespeare:

https://www.opensourceshakespeare.org/views/plays/play_view....

nescioquid2y ago

Some words in Shakespeare have different meanings today or have simply left standard usage. I don't think the presence of a word in Shakespeare means it is de facto good style to use today.

From a correctness standpoint, I think a descriptivist would be satisfied with an attested usage, especially from such a source. From a style point of view, I still find myself feeling embarrassed for the author when I encounter this usage (which is my own problem).

bee_rider2y ago

I think when you ask what the difference between two phrases is, people will really dig down to try and find a difference.

IMO in this context it is basically shorthand for “things I learned/lessons learned while tuning LLM…,” and either would be fine. It is sort of an informal list of stuff the author learned.

In my experience (nothing special, just another native speaker) “lessons from <event>” is the more typical American (at least) English phrase. But it is sort of close to “Lessons on.” “Lessons on” would imply more refined material that is more narrowly focused on teaching. So I wonder if the author decided they just didn’t want to worry about any confusion, or the possibility that they might misuse a phrase.

klooney2y ago

I've always associated it with Indian English, possibly it's a dialect thing that's spread from that community.

xanderlewis2y ago

Maybe it’s Kazakh. https://en.m.wikipedia.org/wiki/Borat

;-)

c0pium2y ago

Gotta earn those fat management consultant fees somehow. I’m sure there’s a whole team at McKinsey doing nothing but inventing new ways to say the same things.

fl73052y ago

In Swedish, there's a commonly used word "lärdomar" which is a direct match for "learnings".

But where the Swedish word sounds natural in that language, "learnings" just sounds wrong in English, even though it apparently is technically correct.

Jolter2y ago

Lessons may be given, but are not necessarily learned.

AlexCoventry2y ago

I think it's new. I've only heard it in the last few years.

bigdict2y ago

learnings = lessons learned

catlover762y ago

I assumed the author was a non-native English speaker

thefourthchime2y ago· 10 in thread

This part caught my eye:

"Using a half-precision FSDP full shard with a 1024 sequence length and a micro batch size of 2 required 63GB of VRAM on each of the eight A100 80 GB GPUs. The training, lasting three epochs, took just 20 minutes. The total cost for the VM was $8.88 per hour, resulting in $3, not including the time for experiments and bug fixes."

I wondered where you could rent cycles on a machine like that. A quick Google search found that p4d.24xlarge is available on AWS: the on-demand cost is $20.1755 per hour, while Spot is only $8.99 (I guess it's gone up?).

Cool to know I could fine-tune for only ~$3.
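
The arithmetic behind the ~$3 figure can be checked with a few lines. This is only a sketch: it assumes the VM is billed pro rata for the 20 minutes of training (the quoted passage does not say how billing is rounded), and the function name `run_cost` is made up for illustration.

```python
def run_cost(hourly_rate_usd: float, minutes: float) -> float:
    """Cost of a run, assuming pro-rata billing (an assumption, not from the post)."""
    return hourly_rate_usd * minutes / 60.0


# Rates and the 20-minute duration are the figures quoted in this comment.
quoted = run_cost(8.88, 20)        # the VM used in the article, about $2.96
on_demand = run_cost(20.1755, 20)  # AWS p4d.24xlarge on-demand, about $6.73
spot = run_cost(8.99, 20)          # AWS p4d.24xlarge Spot, about $3.00

print(f"quoted: ${quoted:.2f}, on-demand: ${on_demand:.2f}, spot: ${spot:.2f}")
```

As the quote itself notes, this excludes the time spent on experiments and bug fixes, which would likely dominate the real bill.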

furiousteabagOP2y ago

I've been using vast.ai for a very long time. It is like a GPU marketplace, where people rent and lease GPUs. There are a lot of VMs with 4090, and beasts like 8xA100 80GB are also available from time to time.

skerit2y ago

I've used vast.ai to do some fine-tuning just a few days ago. It is indeed pretty great, though some servers fail to start up properly, or have some weird performance issues. I also wish they had more templates to try.

mk_stjames2y ago

There are all these 8x 4090 machines on Vast.ai running in ASRock EPYC servers, and I just want to know where the hell they are all coming from. I want to see pictures of these setups, since there are no off-the-shelf 4090s with blower coolers, and watercooling that many cards together takes a lot of custom hardware. And I want the backstories: given that they are 4090s and not datacenter cards, are these hobbyists just building octo-GPU $18k EPYC rigs for fun? (I even saw one with 9x 4090s! Gotta use up those OCuLink PCIe lanes.) It's not ex-mining hardware, since the 4090 landed after the Eth proof-of-stake changeover.

I've been looking for an answer to this every time I check out the current vast.ai console.

_ea1k2y ago

I think TensorDock and vast.ai are cheaper than AWS. Lambda Labs can be as well, but they seem to only have reserved instances now.

cheptsov2y ago

We are building dstack.ai, an open-source tool that helps run anything on vast.ai and TensorDock. Happy to hear your feedback.

cosmojg2y ago

runpod.io is another good-and-cheap option

ojosilva2y ago

It caught mine too. I'm weighing several alternatives for "fine-tuning model fine-tuning", meaning the back-and-forth, trial-and-error phase that precedes a massive run over the full training set.

My goal is to fine-tune a model on our codebase. I find RAG to be too orthopedic. I'd really like to train the model on what each part of the code is and how we do things, and see how it responds with a more complete perspective that goes beyond context.

The options I've considered for pre-fine-tuning:

- using a service like vast.ai, runpod, gradient or similar

- using Google Colab

- getting a more powerful MacBook, an M3 Max with plenty of RAM

siquick2y ago

Excuse the ignorance but are you using these instances to fine tune a “fresh install” of a model, and then when you’ve finished fine tuning it do you download the whole model from the instance for use somewhere else?

furiousteabagOP2y ago

First I download the weights of the base pre-trained model to the VM instance. Then I upload my data there. Afterward, I fine-tune with either LoRA or a full fine-tune. When training finishes, I download the result from the VM instance (the adapters in the case of LoRA, the full weights in the case of a full fine-tune) and run inference on a much less expensive instance (usually a 3090).

bllchmbrs2y ago

Check out other prices on https://gpumonger.com/

Disclosure: I collected the data and built the site, but it has a ton of comparison data for GPU clouds.

andai2y ago· 4 in thread

Off-topic but I think it's important: OP, in the article you say you don't want a company to have your private messages, but you are using Telegram? I also use Telegram, but I am under no illusion of privacy!

Except for encrypted chats (which have bad UI and only work on one device) your messages are stored unencrypted on their servers (handed over to authorities, etc.).

furiousteabagOP2y ago

In IM, there's a balance between total privacy and widespread use. Apps like Signal offer high privacy but have fewer users, while popular ones like WhatsApp are less secure. Telegram lies somewhere in between, offering a level of privacy that most users find comfortable. It's widely used and there haven't been significant incidents of legal issues arising from its messages. Ultimately, it boils down to whom you trust and which app has more of your contacts.

greiskul2y ago

> like WhatsApp are less secure

WhatsApp uses end to end encryption by default. In fact, it uses the library that Signal developed. It is much more secure than Telegram, unless proved otherwise (which would need some backdoor in the application code to change its behavior).

konart2y ago

>you don't want a company to have your private messages

But that's not what he says.

>I don’t want to use any third-party fine-tuning services

He might be okay with a third-party storing his messages but not using them in their models etc.

furiousteabagOP2y ago

This may sound stupid, but from my perspective renting random VMs on vast.ai is safe in general and might be safer than using traditional cloud providers in particular. Consider this: on your VM a new image starts several times a day, each time with a new volume. It downloads tens of GBs of data and weights for training. Once training is done, everything gets cleaned up and the process starts again for a new tenant. This constant cycle makes it kind of difficult to track and extract any meaningful data from it.

gwern2y ago· 2 in thread

> My data collator ensures that the loss is only calculated based on someone’s response. Predicting who will speak next is relatively straightforward, and we don’t want the model to focus on learning that. Therefore, parts of the conversation where the loss is calculated are highlighted in bold.

If it's so easy, then you don't need to remove it. The model will solve it easily and focus on everything else. At best, you save some parameters and compute, at worst, you are damaging its ability to learn important things like conversational skills or modeling people. When it comes to LLMs, more is more, and trying to hand-engineer the dataset or think for the LLM can backfire in very subtle and difficult to diagnose ways.

> Ok, it is capable of forming coherent sentences. The most noticeable problem is its lack of awareness regarding the context of the conversations which leads to bland and generic replies. The messages lacked any distinct style, feeling quite basic...

> Conversations have become more interesting and engaging, although there’s still a risk of losing context. Russian language performance has improved, but errors still occur. I believe that before fine-tuning for a specific task with limited data, like mine, it would be beneficial to first fine-tune the model unsupervised on a large corpus of Russian texts. Additionally, incorporating common conversation partners’ names as separate tokens might enhance the quality. I wouldn’t say it has turned out to be significantly better than LoRA. It might be more effective to focus solely on a single person and calculate the loss based only on my responses (or someone else’s), instead of trying to learn about each and every conversational partner.

furiousteabagOP2y ago

I agree that usually 'more is more' for training LLMs. However, for fine-tuning with limited data, it seems crucial to focus the task as much as possible. Since the model still encounters these masked sentences in the data, it effectively learns to respond based on the speaker's name. So, complicating the task might not be necessary. Also, I'm concerned about interpreting the loss value. If the model quickly reduces loss by picking up predictable phrases, it's hard to tell if it's genuinely learning or just echoing these predictable elements.

gwern2y ago

> However, for fine-tuning with limited data, it seems crucial to focus the task as much as possible.

That doesn't make any sense when you're dealing with a model which is so hugely over-parameterized. The model will learn the easy data that you are removing just fine. There's no 'limited data' there.

> If the model quickly reduces loss by picking up predictable phrases, it's hard to tell if it's genuinely learning or just echoing these predictable elements.

You can't interpret the loss qualitatively anyway. It's totally dependent on the details of tokenization, formatting, corpus size, etc. You still have to look at the samples or a downstream task to see if it's working well. Even quantitatively, the loss is only meaningful if you're comparing to a heldout sample or something, and then it doesn't matter if you were screwing with it like OP.
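
For readers unfamiliar with the masking being debated in this thread, here is a minimal sketch of the general idea, not the article's actual collator. It assumes each turn is already tokenized into a speaker prefix and a response, and it uses -100 as the "ignore" label, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The function names and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_labels(token_ids, loss_mask):
    """Copy token ids into labels, replacing masked-out positions with IGNORE_INDEX."""
    if len(token_ids) != len(loss_mask):
        raise ValueError("token_ids and loss_mask must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(token_ids, loss_mask)]


def build_example(turns):
    """Build (input_ids, labels) from a list of (speaker_ids, response_ids) turns.

    Speaker-prefix tokens stay in the input, so the model still sees who is
    speaking, but they are excluded from the loss. Response tokens are trained on.
    """
    input_ids, loss_mask = [], []
    for speaker_ids, response_ids in turns:
        input_ids += speaker_ids
        loss_mask += [False] * len(speaker_ids)   # no loss on "who speaks next"
        input_ids += response_ids
        loss_mask += [True] * len(response_ids)   # loss on what they say
    return input_ids, mask_labels(input_ids, loss_mask)


# Hypothetical token ids: [1, 2] and [3, 4] stand in for two speaker prefixes.
ids, labels = build_example([([1, 2], [10, 11, 12]), ([3, 4], [20])])
print(ids)
print(labels)
```

Dropping the masking, as gwern suggests, would amount to using `input_ids` directly as the labels.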

jstrieb2y ago· 1 in thread

This post and its findings are really interesting!

Back when the "I forced a bot to watch 1000 hours..." memes were popular (https://knowyourmeme.com/memes/i-forced-a-bot), ages ago in AI/ML time, I tried to do something similar by fine-tuning GPT-2 on messages from a group chat of my friends. Since there were years of chat data, it seemed like a really good opportunity to test whether the language model would capture everyone's personality and generate funny, uncanny-valley versions of our banter.

Turns out that the group chat was used nearly exclusively for sending funny pictures and videos (that the language model obviously couldn't see), and for making plans to meet up. The generated conversations almost exclusively consisted of a random group chat member starting with "there is a party tonight, who wants to go?" and others saying "I'm down" or "when?" or "where?" It was 0% banter, and 100% logistics.

It was pretty hilarious in its own way, but not for the reasons anyone expected! I didn't learn very much about language models with that experiment, but I did learn that my friends' group chat is actually pretty boring.

I guess the best banter happens in real life. Glad to see it worked out somewhat more interestingly for this person, even if they did allude to some similar results in their closing thoughts section.

razodactyl2y ago

This is an interesting insight into model training because it shows how hidden bias arises - think of the "left-leaning" stance of GPT-3 etc. being a side-effect of the cohort that trained the model, not any deliberate action on their part.

Additionally - it's why it's important to think ahead about what you're training your model on, because the model will always regress to the training data itself even if that means going backwards in ability.

u3856392y ago· 1 in thread

Great post. I wonder how much this can improve if you RAG-ify a diverse set of contextual data, for example calendar, meals, recent conversations from the real world, etc.

It's also interesting that бля was translated to 'damn'. :)

furiousteabagOP2y ago

I think incorporating knowledge from other apps is a good next step because the model definitely lacks the context of what is going on right now. The nature of instant messaging is that most of the messages are about what is happening right now or what will happen in the near future, so past communication history does not help much.

andai2y ago· 1 in thread

Fascinating. A few years ago a friend and I finetuned GPT-2 on our WhatsApp chat. So it was just a long text file of:

Mark: wassup

Andy: just chilling

It simulated our conversational style and topics quite well, though GPT-2 reads like a glorified Markov chain. Sometimes the outputs were absolutely hilarious and inappropriate. GPT-2 was peak comedy.

My friend described GPT-2 as "like watching a toddler learning how to walk. When it stumbles it's cute and funny." GPT-3, not so much...

Also, it was oddly (painfully) accurate as far as personality goes... like looking into a mirror. For one thing, I talk way more, and this was reflected in the model's output. For another, I am constantly trying to turn my life around and failing, but ever optimistic... and talking about creative plans endlessly without much execution. (So GPT-2-andai ended up the same way...)
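
The "long text file" format described in this comment is easy to produce from a chat export. The sketch below assumes each message is a dict with hypothetical `sender` and `text` keys; real WhatsApp or Telegram exports have their own layouts and would need their own parsing first.

```python
def to_training_text(messages):
    """Render messages as one 'Name: text' line each, skipping empty ones."""
    lines = []
    for msg in messages:
        # Collapse internal newlines and repeated spaces so that one message
        # always occupies exactly one line of the training file.
        text = " ".join(msg["text"].split())
        if text:
            lines.append(f'{msg["sender"]}: {text}')
    return "\n".join(lines)


# Hypothetical export using the two lines from the comment, plus a media-only
# message with no text, which is dropped.
chat = [
    {"sender": "Mark", "text": "wassup"},
    {"sender": "Andy", "text": "just\nchilling"},
    {"sender": "Mark", "text": ""},
]
print(to_training_text(chat))
```

Skipping empty messages matters for chats that are mostly pictures and videos, since the model cannot see the media anyway.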

codelion2y ago

GPT-2 is surprisingly good at fine-tuning such conversations even now. I gave a talk recently on "Sparks of Digital Immortality" that covers a bit about how we did it - https://www.youtube.com/watch?v=F9-Qk86QyMM

spdif8992y ago· 1 in thread

Really interesting - as another person who's used telegram for several long-standing group chats, I imagine a tool to simplify this would be well-received.

I've wondered since fine-tuning started being a thing how long it'd be before somebody makes a utility where you can dump a giant chat export into it and an API key and then it fine tunes a Telegram bot that can imitate any of your friends - would be fun to play with and even create a group chat with multiple friend-bots talking to each other to see how long until it goes off the rails.

furiousteabagOP2y ago

It's true that fine-tuning models on personal messages could be simplified, but many, like myself, can't use third-party services due to sensitive data in our messages. I'm curious if others face this trust issue and how it might be resolved.

haltist2y ago· 1 in thread

Great example of immortal digital avatars. This is just a simple personal avatar but it is possible to make technological gods with the same techniques. All that's needed is scale and $80B.

codelion2y ago

We do this at https://meraGPT.com

goda902y ago

We're probably quite some time off from the bio-mimetic android part, but we're feeling closer and closer to the AI replacement avatar from the Black Mirror episode "Be Right Back"[0]

[0]https://en.wikipedia.org/wiki/Be_Right_Back

vineetsinha2y ago

This was a fun read. Wondering if it will work for Hindi. TIL that Mistral > Llama 2

jxdxbx2y ago

“Learnings.” Horrible word
