Also, there are better models than the one suggested: Mistral at the 7B-parameter scale, or Yi if you want to go larger and happen to have 32 GB of memory. Mixtral (MoE) is the best, but it currently requires too much memory for most users.
> TinyChatEngine provides an off-line open-source large language model (LLM) that has been reduced in size.
But then they download the models from Hugging Face. I don't understand how these are smaller. Or do they modify them locally?
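The usual answer is weight quantization: store the weights in fewer bits (int8/int4) plus a scale factor, instead of fp32/fp16. I don't know TinyChatEngine's exact scheme, but a minimal sketch of symmetric int8 quantization shows where the ~4x shrink versus fp32 comes from:

```python
import numpy as np

# Illustrative only: generic symmetric int8 quantization,
# not TinyChatEngine's actual format.
rng = np.random.default_rng(0)
weights = rng.standard_normal(1024).astype(np.float32)  # one fp32 weight row

# Store int8 values plus a single fp32 scale per row.
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)

# Dequantize on the fly at inference time.
deq = q.astype(np.float32) * scale

print(weights.nbytes, q.nbytes)  # 4096 vs 1024 bytes: ~4x smaller
print(float(np.abs(weights - deq).max()))  # small reconstruction error
```

So the files on disk can genuinely be smaller than the originals on Hugging Face, whether the conversion happens upstream or locally.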
Turns out the original source is actually somewhat informative, including telling you how much hardware you need. This blog post looks like the typical note you leave for yourself to annotate a bit of your shell history.
Of course the main problem is that I don't know enough about the subject to reason on it on my own.
Their optimized models are not downloaded from HF but from Dropbox. I have no idea why.
Performance on my relatively old i5-8600 (6 cores at 3.10 GHz) with 32 GB of memory is about 150-250 ms per token on the default model, which is perfectly usable.
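For anyone who thinks in throughput rather than latency, converting those reported figures is just arithmetic:

```python
# Convert the reported latency range to tokens per second.
for ms_per_token in (150, 250):
    print(f"{ms_per_token} ms/token = {1000 / ms_per_token:.1f} tokens/s")
# 150 ms/token = 6.7 tokens/s
# 250 ms/token = 4.0 tokens/s
```

Roughly 4-7 tokens/s, which is around a comfortable reading speed, hence "perfectly usable".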