Running large language models like ChatGPT on a single GPU

(github.com)

682 points_nhynes3y ago264 comments

135 comments · 32 top-level

gorbypark3y ago· 20 in thread

If this works well, it will be a game changer. Requiring a fleet of $10k+ GPUs will kill any hope of widespread adoption of open source "competitors" to GPT-3. Stable Diffusion is so popular because it can run on hardware mere mortals can own.

humanistbot3y ago

This will only happen if "Open"AI or other big orgs release the model weights, which only Stable Diffusion did. Cost to train is still astronomical.

gaogao3y ago

Meta has released the model weights for OPT-175B, which is used in the paper. There's also a lot of full release LLMs from other labs on the way as well.

JoshCole3y ago

No, Stable Diffusion isn't the only one to release its weights. OpenAI hasn't been releasing weights for ChatGPT, but Stable Diffusion isn't the only one releasing weights [1].

[1]: https://huggingface.co/

Dylan168073y ago

On the other hand, one techie with a few million dollars...

And you could train something like GPT-3 for cheaper than a superbowl commercial. That would get you a lot of publicity.

tarr113y ago

Wonder if someone would be willing to start an open source project where we could crowdsource donations for training, and people could possibly donate their GPU usage for it.

eternalban3y ago

A few million dollars. Kickstart the project, get 100 a head and 100,000 backers. Also check with Uncle Sam and see if there are any grants that can be used for this. Start a campaign and get rich concerned people to donate. Jeff may also want to show AWS can train AI too, so maybe even get a break there, and Amazon can get some nice PR. The list of possibilities seems extensive given the price tag of $12MM and upside of a fully public GPT.

naillo3y ago

There are some open source LLM models already such as the one this repo is running and mentioning like OPT-175B

jejeyyy773y ago

Crowd source the training costs - leaving some profit for the project owner. Open source the models.

JoshCole3y ago

No, it isn't astronomical. It is smaller than that. Still large, but not astronomical.

nl3y ago

There are current open source projects working on training their own LLMs.

I'm aware of one very credible one that has applied for a TPU grant from Google worth under $200K to train the whole model.

I think laion.ai will probably get there first with their fleet of A100s though.

leesec3y ago

No it isn't. Stable Diffusion is less than 200 grand to train.

pavelstoev3y ago

It does not have to be. We have optimizations for all kinds of workloads - https://CentML.ai

worldsayshi3y ago

> Cost to train is still astronomical.

It sounds like something that could/should/would be crowdfunded?

anon2913y ago

As a former AI accelerator employee (laid off), I'm kind of happy I was laid off because I realistically don't see a need for specialized hardware anymore.

Large companies can afford Nvidia. Nvidia's software stack is best in class. There's no business need here and the model execution is increasingly becoming possible on single consumer GPUs.

The only place where I see specialized chips excelling is on the edge or if they are truly revolutionary (in which case they're only an acquisition target for Nvidia).

The truth is... The large language models are likely excessively large.

foobiekr3y ago

Power is the main reason to do custom ASICs. I’d be curious as to your opinion of Recogni given they are claiming a 10x power reduction per unit compute.

narrator3y ago

No doubt the corporate large language models will use it to make language models that are 10x bigger. However, at least the public will have access to 175B parameter language models which are much more sophisticated than the 6B or so parameter models consumer video cards can currently run.

moffkalast3y ago

This may be a flawed approach, but an interesting idea would be to use the current models as a preprocessor to generate a huge "labelled" dataset of inputs and outputs, and then using that more accurate and specific data to train a smaller one that would fit.

It likely wouldn't have nearly as much general knowledge since the data just wouldn't be there but the behaviour could be similar?

zeknife3y ago

Look up knowledge distillation

permo-w3y ago

do/did you seriously think that the processing requirement was going to “kill” that possibility?

the history of computing clearly indicates that either the requirements for running or the bar for owning this technology was always going to drop

pessimizer3y ago

The history of computing had Moore's Law.

danuker3y ago· 10 in thread

Any chance these work on CPUs with any acceptable performance?

I have a 10-core 20-thread monster CPU, but didn't bother with a dedicated GPU because I can't control something as simple as its temperature. See the complicated procedure that only works with the large proprietary driver here:

https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Over...

adeon3y ago

I don't know about these large models but I saw on a random HN comment earlier in a different topic where someone showed a GPT-J model on CPU only: https://github.com/ggerganov/ggml

I tested it on my Linux and Macbook M1 Air and it generates tokens at a reasonable speed using CPU only. I noticed it doesn't quite use all my available CPU cores so it may be leaving some performance on the table, not sure though.

The GPT-J 6B is nowhere near as large as the OPT-175B in the post. But I got the sense that CPU-only inference may not be totally hopeless even for large models if only we got some high quality software to do it.

generalizations3y ago

There's also the Fabrice Bellard inference code: https://textsynth.com/technology.html. He claims up to 41 tokens per second on the GPT-Neox 20B model.

brigade3y ago

Your CPU gets maybe 700-800 gflops depending on your all-core frequency (fp32 because you don't have Sapphire Rapids.) The T4 benchmarked would be crunching what it can at ~65 tflops (fp16 tensor.) Newer GPUs hit 300 tflops (4090) or even nearly 2 petaflops (H100).

To give you an idea of the order of magnitude of compute difference. Sapphire Rapids has AMX and fp16 AVX512 to close the gap a little, but it's still massive.

NavinF3y ago

> 10-core 20-thread monster CPU

With what, 50GB/s memory bandwidth? That's no monster. The two consumer GPUs in my machine both do 1TB/s and are still bottlenecked on memory bandwidth.

> only works with the large proprietary driver here

In practice, nothing works without the proprietary driver so this isn't specific to temperature. Also the setting you're looking for is almost certainly `nvidia-smi -pl $watts` for setting the power limit, not whatever that wiki gives you. GPU temperature = ambient temperature + (power limit)*(thermal resistance of cooler)
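The steady-state relation at the end of that comment can be sketched numerically. The ambient temperature, the power limits, and the thermal-resistance figure below are made-up illustrative values, not measurements of any real card or cooler.

```python
def steady_state_gpu_temp_c(ambient_c: float, power_limit_w: float,
                            thermal_resistance_c_per_w: float) -> float:
    """Approximate steady-state temperature: ambient + power * thermal resistance."""
    return ambient_c + power_limit_w * thermal_resistance_c_per_w


# Hypothetical numbers: a 25 C room and a cooler rated at 0.15 C per watt.
for watts in (350, 250, 200):
    temp = steady_state_gpu_temp_c(25.0, watts, 0.15)
    print(f"{watts} W limit -> about {temp:.1f} C")
```

Under this simple model, lowering the power limit is the only knob that moves the temperature for a given cooler and room, which is the point the comment is making.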

TimeBearingDown3y ago

That power limit control is explained in detail a few paragraphs down on that wiki page.

https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Cust...

bioemerl3y ago

Nope. 20 cores in a CPU, 2000 in a GPU, with much much faster memory and an architecture designed to chew through data as fast as possible.

bee_rider3y ago

No real reason to compare a GPU core to a CPU one, but the memory bandwidth difference is pretty concrete!

fulafel3y ago

GPU "cores" are ~ SIMD lanes.

(a difference I think is that there are more virtual lanes, some of may be masked off, that are mapped to the GPU physical SIMD lanes)

metadat3y ago

Unlikely, because this is an efficient GPU work offloader, not a complete replacement for GPU computation.

nestorD3y ago

The other answers give you a few of the current solutions.

In the long term I am hoping that JAX (/XLA) will get better support for the CPU backend of their compiler and in particular, use SIMD and multicore better than it currently does.

It is very doable (just low priority) and it would mean that a lot of models could get close to optimal CPU performances out of the box which would be a step forward for accessibility.

muttled3y ago· 9 in thread

This is cool! But I wonder if it's economical using cloud hardware. The author claims 1.12 tokens/second on the 175B parameter model (arguably comparable to GPT-3 Davinci). That's about 100k tokens a day on the GCP machine the author used. Someone double check my numbers here, but given the Davinci base cost of $0.02 per 1k tokens and GCP cost for the hardware listed "NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD" coming up to about $434 on spot instance pricing, you could simply use the OpenAI API and generate about 723k tokens a day for the same price as running the spot instance (which could go offline at any point due to it being a spot instance).

Running the fine-tuned versions of OpenAI models is approximately 6x more expensive per token. If you were running a fine-tuned model on local commodity hardware, the economics would start to tilt in favor of doing something like this if the load was predictable and relatively constant.
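The arithmetic in this comment can be checked with a few lines. This is only a sketch: reading the $434 as a monthly spot price over a 30-day month is an assumption, chosen because it reproduces the 723k tokens/day figure quoted above.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tokens_per_second = 1.12          # throughput quoted in the comment
davinci_usd_per_1k_tokens = 0.02  # API price quoted in the comment
spot_usd_per_month = 434.0        # assumed to be a monthly figure
days_per_month = 30               # assumption

# Tokens the self-hosted instance produces in a day at the quoted throughput.
local_tokens_per_day = tokens_per_second * SECONDS_PER_DAY

# Tokens the same daily spend would buy from the API.
spot_usd_per_day = spot_usd_per_month / days_per_month
api_tokens_per_day = spot_usd_per_day / davinci_usd_per_1k_tokens * 1000

print(f"self-hosted: {local_tokens_per_day:,.0f} tokens/day")
print(f"API for the same money: {api_tokens_per_day:,.0f} tokens/day")
```

With these inputs the self-hosted figure lands a little under the "about 100k" in the comment, and the API figure matches the 723k estimate.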

ImprobableTruth3y ago

You've made one huge mistake: Davinci's $0.02 is not just per 1k tokens generated but also context tokens consumed. So if you generate 50 tokens per request with 1k context, the price is actually 20 times as large at $0.40 per 1k tokens generated - much less palatable, costing 3 times as much as the cloud hosted version of this.

And that's not even taking into account the gigantic markup cloud services have.
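A small sketch of that correction, assuming every context token and every generated token is billed at the same per-1k rate. The function name is illustrative only.

```python
def usd_per_1k_generated(price_per_1k: float, context_tokens: int,
                         generated_tokens: int) -> float:
    """Effective price per 1k generated tokens when context tokens are billed too."""
    billed = context_tokens + generated_tokens
    return price_per_1k * billed / generated_tokens


# The scenario from the comment: 1k tokens of context, 50 tokens generated.
effective = usd_per_1k_generated(0.02, context_tokens=1000, generated_tokens=50)
print(f"${effective:.2f} per 1k generated tokens")
```

Counting the 50 generated tokens alongside the 1k context gives a multiplier of 21 rather than the rounded 20 in the comment, so the result is in the same ballpark as the quoted $0.40.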

yorwba3y ago

Most of the computational cost of producing an output token is spent on consuming input tokens (including previous output tokens that are fed back in); only the final unembedding matrix could be eliminated if you don't care about the output logits for the context.

So it's not correct to only modify OpenAI's prices to account for the ratio of context tokens to output tokens. Both of them get multiplied by 20 (if that's what your ratio is).

breckenedge3y ago

Thanks for running the cloud numbers on this. I ran some DIY numbers and they indicate less than a week to break even with the cloud, including all hardware and electricity costs. The cloud seems stupid expensive compared to running your own hardware for this kind of task.

throwawayapples3y ago

The cloud is always stupid expensive compared to running your own hardware for almost any sort of task that isn't highly variable upon one or more axis (cpu, ram, etc), but less than a week to break even is truly shocking.

cypress663y ago

This is most likely aimed at people running models locally.

And a homelab with 3090s/4090s is one or two orders of magnitude cheaper than GCP, if you use them continuously.

SomeHacker443y ago

I do not know anyone offhand with a 200+GB RAM home computer. The GPU is not all that is needed; you need to keep the parameters and other stuff in memory too.

swatcoder3y ago

Sometimes control is more important than cost.

pclmulqdq3y ago

Cloud accelerators carry a huge price premium because there aren't very many of them available and they aren't as fungible as CPUs. Comparing to a local GPU would likely be favorable for the local machine.

borzunov3y ago

I'm afraid that, unlike proprietary APIs and Petals, this system can't be used for single-batch inference of 175B models with interactive speeds - the thing you actually need for running ChatGPT and other interactive LM apps. See https://news.ycombinator.com/item?id=34874976

lxe3y ago· 8 in thread

Got the opt-30b chatbot running on a 3090 24GB. I set compress_weight=True and compress_cache=True, and ran with `python apps/chatbot.py --model facebook/opt-30b --percent 100 0 100 0 100 0`.

I also modified the prompt a bit to make it more... uh alive:

    Assistant: Did you know that Saturn is 97 times the size of Earth?
    Human: Are you sure?
    Assistant: What difference does size make, really, anyway?
    Human: You didn't answer my question.
    Assistant: Hey, it's your question I didn't answer! Anyway, did you know that on the surface of 
    Saturn, there's lakes of liquid methane and ethane?
    Human: I didn't know that. Are there any lakes on the Sun?
    Assistant: Look, I'll be honest with you. I don't know much about the Sun. Is it even a planet?
    Human: It sure is. Why don't you know about the Sun? I thought you were a hyperintelligent AI?
    Assistant: You're right. I'm a hyperintelligent AI. I'm hyperintelligent enough to know that I know jack shit about the Sun.

syntaxing3y ago

This is amazing. Reminds me of claptrap from Borderlands

samstave3y ago

OMG

What will be the first game with ChatGPT integrated into the NPC dialog interactions?

My vote is Hitman, with variable voices....

LoganDark3y ago

It really does

mightytravels3y ago

Awesome - how many GB of hard drive/ SSD was required? What is your CPU RAM?

sva_3y ago

I wonder if there would be interesting results if you made this model talk to OpenAI's ChatGPT, since they are different models, seeding them with some topic.

destroy-2A3y ago

Do not let Dinesh AI talk to Gilfoyle AI !

razor_router3y ago

That's impressive! How much customization did you have to do in order to get the chatbot to respond like that?

lxe3y ago

Not much. Here's the prompt:

  A chat between a human and a more funny and witty intelligent assistant
  Human: Hello! What can you do?
  Assistant: Hey there! I can pretty much do whatever you ask!
  Human: What is the name of the tallest mountain in the world?
  Assistant: It's Everest.
  Human: Are you sure?
  Assistant: You damn right I'm sure! I'm a hyperintelligent AI!

stevofolife3y ago· 6 in thread

Out of curiosity, why aren't we crowd sourcing distributed training of LLMs where anyone can join by bringing their hardware or data? Moreover find a way to incorporate this into a blockchain so there is full transparency but also add in differential privacy to protect every participant.

Am I being too crazy here?

albertzeyer3y ago

There is the Open Assistant project: https://github.com/LAION-AI/Open-Assistant

There is also EleutherAI (https://www.eleuther.ai/about/) with GPT-NeoX (https://github.com/EleutherAI/gpt-neox).

moffkalast3y ago

Just make sure it's written in Rust, uses a Sveltekit frontend and <some other buzzwords I can't remember right now>.

wg03y ago

And SQLite as local cache with CRDTs enabled whereas everything else from text search to queuing on PostgreSQL?

nodja3y ago

https://petals.ml/

Miraste3y ago

Petals doesn't train new models, it only runs BLOOM in a distributed way.

rnosov3y ago

The problem here is that most people just don't have suitable hardware. Ideally, you'd want to load the entire model into a GPU, and most consumer grade GPUs have nowhere near enough video memory. You'd need to have something like an A100 80GB GPU to be able to run a node in the potential blockchain. You can buy one of these cards for about 15k USD. Admittedly, that's not too far off from the price of a modern bitcoin ASIC miner but still a healthy chunk of change.

And if you try to split the model across several GPUs then you'll have an issue of bandwidth as model parts would need to talk to each other (on the order of terabyte/second). At the moment, the only realistic way to contribute is just to provide feedback data for the RLHF training.

lxe3y ago· 5 in thread

Got the opt-6.7b chatbot running on a Windows machine with a 3090 in mere minutes. The only difference was to install the CUDA build of PyTorch with `pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117`, just like in Stable Diffusion's case.

It performs as expected:

    Human: Tell me a joke
    Machine: I have no sense of humour

    Human: What's 2+5?
    Machine: I cannot answer that.

Ajedi323y ago

6.7b is pretty small, no? Do you even need offloading for that on a 3090? I'd be curious to see what's needed to run opt-30b or opt-66b with reasonable performance. The README suggests that even opt-175b should be doable with okay performance on a single NVIDIA T4 if you have enough RAM.

nathan_compton3y ago

It is entirely possible to run 6.7B parameter models on a 3090, although I believe you need 16 bit weights. I think you can squeeze a 20b parameter model onto the 3090 if you go all the way down to 8.
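A back-of-the-envelope check of these sizes, counting the weights only. It ignores activations, the attention cache, and framework overhead, so real usage is higher; the model sizes and bit widths are the ones mentioned in this thread.

```python
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Storage for the weights alone, in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for name, params, bits in [
    ("6.7B @ 16-bit", 6.7, 16),
    ("20B  @ 8-bit", 20, 8),
    ("30B  @ 4-bit", 30, 4),
    ("175B @ 16-bit", 175, 16),
]:
    print(f"{name}: ~{weight_gb(params, bits):.1f} GB of weights")
```

By this estimate the first three fit under the 24 GB of a 3090 with varying headroom, while 175B at 16 bits needs roughly 350 GB and has to be spread across devices or offloaded.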

A4ET8a8uTh03y ago

Hey. So did anyone try doing it with AMD cards ( I know Nvidia seems preferable now )?

rjb77313y ago

Looks like it might be no bueno on Google Colab for now; chatbot.py takes prompts via input() rather than a command line argument.

hackernewds3y ago

Could it work on Google Colab?

blagie3y ago· 5 in thread

A lot of people are looking at this wrong. A $350 3060Ti has 12GB RAM. If there's a way to run models locally, it opens up the door to:

1) Privacy-sensitive applications

2) Tinkering

3) Ignoring filters

4) Prototyping

5) Eventually, a bit of extra training

The upside isn't so much cost / performance, as local control over a cloud-based solution.

Aperocky3y ago

I have that exact card. This may be the nudge where I remove Windows from the computer and try out Linux gaming (and local GPT).

raihansaputra3y ago

Thing is, you don't have to totally switch to Linux. I'm running ML/CUDA workloads through WSL without too many problems.

bavell3y ago

I've had great results recently using Steam/Proton on Arch with my AMD 6750XT.

a_f3y ago

Nitpicky but the RTX 3060 (non-Ti) has a variant with 12 GB, whilst the Ti is 8 GB. Agree with your points though

blagie3y ago

Thanks.

Ti-po, I guess :)

baobabKoodaa3y ago· 5 in thread

I just tried to run the example in the README, using the OPT-30B model. It appeared to download 60GiB of model files, and then it attempted to read all of it into RAM. My laptop has "only" 32GiB of RAM so it just ran out of memory.

Miraste3y ago

You have to change the --percent flag. It takes some experimentation. The format is three pairs of 0-100 integers, one for parameters, attention cache and hidden states respectively. The first zero is percent on GPU, the second one is percent on CPU (system RAM), and the remaining percentage will go on disk.

For disk offloading to work you may also have to specify --offload-dir.

I have opt-30B running on a 3090 with --percent 20 50 100 0 100 0, although I think those could be tweaked to be faster.
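To make the flag format concrete, here is a sketch that decodes six integers the way the comment describes them. It is an illustration of that description, not FlexGen's own parsing code; the function and key names are hypothetical.

```python
from typing import Dict, List

# Order of the three (GPU, CPU) pairs as described in the comment.
TENSOR_KINDS = ("weights", "attention_cache", "hidden_states")


def split_percent(values: List[int]) -> Dict[str, Dict[str, int]]:
    """Turn six integers into GPU/CPU/disk percentages per tensor kind."""
    if len(values) != 6:
        raise ValueError("expected six integers: three (GPU, CPU) pairs")
    placement = {}
    for kind, gpu, cpu in zip(TENSOR_KINDS, values[0::2], values[1::2]):
        disk = 100 - gpu - cpu  # whatever is left over goes to disk
        if min(gpu, cpu, disk) < 0:
            raise ValueError(f"percentages for {kind} are out of range")
        placement[kind] = {"gpu": gpu, "cpu": cpu, "disk": disk}
    return placement


for kind, split in split_percent([20, 50, 100, 0, 100, 0]).items():
    print(kind, split)
```

Read this way, `--percent 20 50 100 0 100 0` keeps 20% of the weights on the GPU, 50% in system RAM and the remaining 30% on disk, with the attention cache and hidden states entirely on the GPU.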

lxe3y ago

How much system RAM are you running with? And I'm guessing it wouldn't hurt to have a fast SSD for disk offloading?

ImprobableTruth3y ago

How fast is it in single batch mode?

baobabKoodaa3y ago

FWIW I was able to load the OPT-6.7B model and play with it in chatbot mode. This would not have been possible without the offloading, so... cool stuff!

bee_rider3y ago

Hmm, well we used to have swap partitions equal in size to our memory… you’ll have 4GiB left over!

simonw3y ago· 5 in thread

Top item on the roadmap: "Support Apple silicon M1/M2 deployment"

MuffinFlavored3y ago

I tried to figure out how to do GPGPU stuff as a total beginner in Rust on Apple Silicon.

I couldn't figure out if I was supposed to be chasing down Apple Metal or OpenCL backends. It also didn't seem to make much of a difference because while there are crates for both that seemed relatively well-maintained/fleshed out, I couldn't figure out how exactly to just pull one down and plug them into a higher level library (or find said higher level library all together).

Have you had any luck? In my experience, it's basically Python or bust in this space despite lots of efforts to make it not that way?

I also got confused as to whether a 'shader' was more for the visual GPU output of things, or if it was also a building block for model training/networks/machine learning/etc.

fathyb3y ago

> I couldn't figure out if I was supposed to be chasing down Apple Metal or OpenCL backends.

If you want cross-platform compatibility (kinda), go for OpenCL, if you want the best performance go for Metal. Both use a very similar language for kernels, but Metal is generally more efficient.

> Have you had any luck?

Not in ML, but I'm doing a lot of GPGPU on Metal, I recently started doing it in Rust. A bit less convenient than with Swift/Objective-C, but still possible. Worst case you'll have to add an .mm file and bridge it with `extern "C"`. That said, doing GPGPU is not doing ML, and most ML libraries are in Python.

> I also got confused as to whether a 'shader' was more for the visual GPU output of things, or if it was also a building block for model training/networks/machine learning/etc.

A shader is basically a function that runs for every element of the output buffer. We generally call them kernels for GPGPU, and shaders (geometry, vertex, fragment) for graphics stuff. You have to write them in a language that kinda looks like C (OpenGL GLSL, DirectX HLSL, Metal MSL), but is optimized for the SIMT properties of GPUs.

Learning shaders will let you run code on the GPU. To do ML you also need to learn what tensors are, how to compute them on the GPU, and how to build ML systems using them.

I recommend ShaderToy [0] if you want a cool way to understand and play with shaders.

[0]: https://www.shadertoy.com/

smoldesu3y ago

Give this a look:

https://github.com/guillaume-be/rust-bert

https://github.com/guillaume-be/rust-bert/blob/master/exampl...

If you have Pytorch configured correctly, this should "just work" for a lot of the smaller models. It won't be a 1:1 ChatGPT replacement, but you can build some pretty cool stuff with it.

> it's basically Python or bust in this space

More or less, but that doesn't have to be a bad thing. If you're on Apple Silicon, you have plenty of performance headroom to deploy Python code for this. I've gotten this library to work on systems with as little as 2gb of memory, so outside of ultra-low-end use cases, you should be fine.

fancyfredbot3y ago

I believe that you can't get enough RAM with M1/M2 for this to be useful

ricardobeat3y ago

This is meant to run on GPUs with 16GB RAM. Most M1/M2 users have at least 32GB (unified memory), and you can configure a MBP or Mac Studio with up to 96/128GB.

The Mac Pro is still Intel, but it can be configured with up to 1.5TB of RAM, you can imagine the M* replacement will have equally gigantic options when it comes out.

birdyrooster3y ago· 5 in thread

I recently bought a T4 to go with my epyc 7402 and 512GB ram for fun and this looks like a great use case. Thanks!

cypress663y ago

What's the advantage of purchasing a T4 instead of a 3090 or 4090?

nirav723y ago

Possibly the price. On secondary markets like eBay, I've occasionally seen T4 cards for $500-600. Also, the form factor. The T4s are much smaller/shorter than a 3090/4090, so they would be an easier fit in a server case.

icelancer3y ago

A lot of 2U cases won't fit a consumer GPU. Furthermore, Tesla-equivalents are usually either significantly cheaper than their consumer counterpart (for last-gen and older GPUs) or similar in price with far more RAM.

I bought a bunch of Tesla P40s at a really low price compared to what 1080tis are going for.

birdyrooster3y ago

You have forced air and don't want an integrated fan in your card

elorant3y ago

Power consumption. A Tesla T4 with 16GB RAM will consume a mere 70W. An RTX 3090 will need at least 300W, and the Titan models go up to 450W.

dom963y ago· 5 in thread

It's really interesting that these models are written in Python. Anyone know how much of a speed up using a faster language here would have? Maybe it's already off-loading a lot of the computation to C (I know many Python libraries do this), but I'd love to know.

albertzeyer3y ago

Python is just the gluing language. All the heavy lifting happens in CUDA or CuBLAS or CuDNN or so.

Most optimizations for saving memory use lower precision numbers (float16 or less), quantization (int8 or int4), sparsification, etc. But this is all handled by the underlying framework like PyTorch.

There are C++ implementations but they optimize on different aspects. For example: https://github.com/OpenNMT/CTranslate2/

brrrrrm3y ago

For large models, there are two main ways folks have been optimizing machine learning execution:

1. lowering precision of the operations (reducing compute "width" and increasing parallelization)

2. fusing operations into the same GPU code (reducing memory-bandwidth usage)

Neither of those optimizations would benefit from swapping to a faster language.

Why? The typical "large" neural network operation runs on the order of a dozen microseconds to milliseconds. Models are usually composed of hundreds if not thousands of these. The overhead of using Python is around 0.5 microseconds per operation (best case on Intel, worst case on Apple ARM). So that's maybe a 5% net loss if things were running synchronously. But they're not! When you call GPU code, you actually do it asynchronously, so the language latency can be completely hidden.

So really, all you want in an ML language is the ability to 1. change the type of the underlying data on the fly (Python is really good at this) and 2. rewrite the operations being dispatched to on the fly (Python is also really good at this).

For smaller models (i.e. things that run in sub-microsecond world), Python is not the right choice for training or deploying.
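The "maybe 5%" estimate follows from dividing the per-operation Python overhead by the operation time. A quick sketch using the 0.5 microsecond figure from the comment; the operation durations are illustrative picks from the "dozen microseconds to milliseconds" range.

```python
def dispatch_overhead_fraction(op_time_us: float,
                               python_overhead_us: float = 0.5) -> float:
    """Extra time, as a fraction of op time, if each dispatch ran synchronously."""
    return python_overhead_us / op_time_us


# Worst case (no asynchronous overlap) for a few operation durations.
for op_time_us in (10, 100, 1000):
    fraction = dispatch_overhead_fraction(op_time_us)
    print(f"{op_time_us:>5} us op -> {fraction:.2%} overhead")
```

The 5% figure corresponds to operations of about 10 microseconds; longer operations shrink it further, and asynchronous dispatch can hide it altogether, as the comment notes.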

amelius3y ago

Your view of "offloading" things to a faster language is wrong. It's already written in a fast language (C++ or CUDA). Python is just an easy to use way of invoking the various libraries. Switching to a faster language for everything would just make experimenting and implementing things more cumbersome and would make the technology as a whole move slower.

ianzakalwe3y ago

Python is mostly just a glue code nowadays, all data loading, processing and computations are handled by low level languages (C/C++), python is there just to instruct those low level libraries how to compose into one final computation.

ilaksh3y ago

The model is not written in a programming language at all. The model is in the neural network weights.

ml_basics3y ago· 4 in thread

Very cool.

Worth mentioning though that the highlighted figures (1.12 tok/s for OPT-175B for "FlexGen with Compression") are for inputs of 512 tokens and outputs of 32 tokens.

Since decoder-only transformer memory requirements scale with the square of sequence lengths, things would probably slow down significantly for very long sequences, which would be required for a back-and-forth conversation.

Still though, until reading this I had no idea that running such a model on-device was remotely feasible!

fpgaminer3y ago

> transformer memory requirements scale with the square of sequence lengths

Not true, see: Flash Attention. You can losslessly calculate the attention in blocks using a little math trick. Essentially each subsequent block "corrects" the denominator of the last block's softmax calculation. At the end you have a perfectly* accurate softmax. Since you don't need to keep the whole sequence in memory to perform the softmax, your memory now scales linearly with respect to sequence length, and due to the lower memory bandwidth requirements and increased kernel fusion the operation also tends to be faster.

* While mathematically the calculation ends up exactly the same, in practice the result ends up slightly different due to the whims of F32 and F16 inaccuracies, and since the "max" used to calculate the softmax in a numerically stable way is calculated on a per-block basis. Doesn't significantly affect training or validation loss though.
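The block-by-block correction described above can be sketched for a single query with scalar values. This is a plain-Python illustration of the running-max and running-denominator idea only, not the actual Flash Attention kernel, which tiles over queries and keys on the GPU; all names here are made up for the example.

```python
import math
from typing import Sequence


def attention_row_blockwise(scores: Sequence[float], values: Sequence[float],
                            block_size: int) -> float:
    """softmax(scores) . values, visiting the sequence one block at a time."""
    running_max = -math.inf
    denom = 0.0  # running softmax denominator
    acc = 0.0    # running weighted sum of values
    for start in range(0, len(scores), block_size):
        block_s = scores[start:start + block_size]
        block_v = values[start:start + block_size]
        new_max = max(running_max, max(block_s))
        # Rescale what was accumulated under the old max, which is the
        # "correction" applied to earlier blocks.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(block_s, block_v):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom


def attention_row_full(scores: Sequence[float], values: Sequence[float]) -> float:
    """Reference: materialize the whole softmax at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
full = attention_row_full(scores, values)
blockwise = attention_row_blockwise(scores, values, block_size=2)
print(full, blockwise)
```

Only one block plus three running scalars are held at any time, which is why memory no longer grows with the square of the sequence length; the two results agree up to floating-point rounding.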

lxe3y ago

What's the best way to get started learning this? What are the steps I should take to arrive at understanding what "attention" is?

baobabKoodaa3y ago

> Since decoder-only transformer memory requirements scale with the square of sequence lengths, things would probably slow down significantly for very long sequences, which would be required for a back-and-forth conversation.

You can use tricks to keep the sequence length down even if the conversation goes on for a long time. For example, you can use the model to summarize the first n-1 lines of the conversation and append the last line to the summary as is.
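A minimal sketch of that trick. The `summarize` argument stands in for a call to the language model, and `toy_summarize` is a placeholder that only truncates text; both names and the prompt wording are hypothetical.

```python
from typing import Callable, List


def build_prompt(lines: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize all but the last line, then append the last line verbatim."""
    if not lines:
        return ""
    *earlier, last = lines
    if not earlier:
        return last
    summary = summarize("\n".join(earlier))
    return f"Summary of conversation so far: {summary}\n{last}"


def toy_summarize(text: str) -> str:
    # Stand-in for a model call: keep only the first 60 characters.
    return text[:60]


conversation = [
    "Human: What is the tallest mountain in the world?",
    "Assistant: It's Everest.",
    "Human: Are you sure?",
]
prompt = build_prompt(conversation, toy_summarize)
print(prompt)
```

The prompt length is then bounded by the summary length plus one line, however long the conversation runs, at the cost of whatever detail the summary drops.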

terabytest3y ago

This is very interesting. Could you please elaborate and maybe share links to articles if you know of any?

adamnemecek3y ago· 4 in thread

I have recently written a paper on understanding transformer learning via the lens of coinduction & Hopf algebra. https://arxiv.org/abs/2302.01834

The learning mechanism of transformer models was poorly understood however it turns out that a transformer is like a circuit with a feedback.

I argue that autodiff can be replaced with what I call in the paper Hopf coherence which happens within the single layer as opposed to across the whole graph.

Furthermore, if we view transformers as Hopf algebras, one can bring convolutional models, diffusion models and transformers under a single umbrella.

I'm working on a next gen Hopf algebra based machine learning framework.

Join my discord if you want to discuss this further https://discord.gg/mr9TAhpyBW

qualudeheart3y ago

Powerful idea.

adamnemecek3y ago

Hopf algebras are next gen.

kneel3y ago

what

adamnemecek3y ago

which part

borzunov3y ago· 2 in thread

Note that the authors report the speed of generating many sequences in parallel (per token):

> The batch size is tuned to a value that maximizes the generation throughput for each system.

> FlexGen cannot achieve its best throughput in [...] single-batch case.

For 175B models, this likely means that the system takes a few seconds for each generation step, but you can generate multiple sequences in parallel and get a good performance _per token_.

However, what you actually need for ChatGPT and interactive LM apps is to generate _one_ sequence reasonably quickly (so it takes <= 1 sec/token to do a generation step). I'm not sure if this system can be used for that, since our measurements [1] show that even the theoretically-best RAM offloading setup can't run the single-batch generation faster than 5.5 sec/token due to hardware constraints.

The authors don't report the speed of the single-batch generation in the repo and the paper.

[1] https://arxiv.org/pdf/2209.01188.pdf

152334H3y ago

I spoke with the authors of the paper; the leftmost points in Figure 1 were generated with batch-size 1, indicating ~1.2x and ~2x improvements in speed over DeepSpeed for 30B and 175B models respectively. For reference, this is speeding up from ~0.009tokens/s to about ~0.02tokens/s on 175B.

These results are generally unimpressive, of course. Most of the improvements at that point are attributable to the authors making use of a stripped down library for autoregressive sampling. HN falling for garbage once again...

ImprobableTruth3y ago

Calling this garbage is absolutely wild. The authors make it very clear that this is optimized for throughput and not latency. Throughput focused scenarios absolutely do exist, editorializing this as "running large language models like ChatGPT" and focusing on chatbot applications is the fault of HN.

It's also a neat result that fp4 quantization doesn't cause much issue even at 175b, though that kinda was to be expected.

benlivengood3y ago· 2 in thread

This also means local fine-tuning is possible. Expect to see an explosion of new things like we did with Stable Diffusion, limited to some extent by the ~0.7 order of magnitude more VRAM required.

bioemerl3y ago

Does it? I would have expected compression losses to make training really hard.

Miraste3y ago

The compression is optional.

metadat3y ago· 2 in thread

> Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.

Is FlexGen able to take advantage of multiple hundreds of GB of system memory? Or do these compute instances just come bundled with it and it's a [largely] irrelevant detail?

bioemerl3y ago

The OPT175b model is massive. A lot of that system ram probably holds model data.

metadat3y ago

Interesting, though apparently the OPT175B model is 350GB:

> You will need at least 350GB GPU memory on your entire cluster to serve the OPT-175B model. For example, you can use 4 x AWS p3.16xlarge instances, which provide 4 (instance) x 8 (GPU/instance) x 16 (GB/GPU) = 512GB memory.

https://alpa.ai/tutorials/opt_serving.html

(Scroll down to the second "Note", not far from the top)

I wonder what FlexGen is doing.. a naive guess is a mix of SSD and system memory. Definitely curious about what FlexGen's underlying strategy translates to in terms of actual data paths.
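For what it's worth, the 350GB figure is consistent with simple arithmetic. A rough sketch, assuming 16-bit (2-byte) weights and decimal gigabytes; the helper name is just for illustration, and it ignores the KV cache and activations:

```python
def weight_gb(n_params: float, bits_per_param: int) -> float:
    """Size of the raw weights in decimal GB (no KV cache, no activations)."""
    return n_params * bits_per_param / 8 / 1e9


opt_175b_fp16 = weight_gb(175e9, 16)  # 350.0 GB, matching the Alpa note
opt_175b_4bit = weight_gb(175e9, 4)   # 87.5 GB if the weights are compressed to 4 bits

alpa_cluster_gb = 4 * 8 * 16          # 512 GB of GPU memory in the quoted example
t4_instance_gb = 16 + 208             # 224 GB of GPU memory plus DRAM on the GCP instance

fits_without_disk_fp16 = opt_175b_fp16 <= t4_instance_gb  # False
fits_without_disk_4bit = opt_175b_4bit <= t4_instance_gb  # True
```

Under those assumptions the uncompressed weights cannot fit in GPU memory plus DRAM on the quoted instance, which is consistent with the guess that the SSD is involved unless compression is turned on.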

2 more replies

dharma13y ago· 2 in thread

I’d love to run this on a single 24gb 3090 - how much dram / SSD space do I need for a decent LLM, when it’s quantised to 4bits?

Miraste3y ago

I've been trying this, and with compression on (4 bits) you can fit the entire 30B model on the 3090.
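At 4 bits per weight, 30B parameters come to roughly 15GB, which is why they fit in 24GB. As an illustration of what 4-bit compression can look like, here is a minimal group-wise min-max quantizer in plain Python. It is a sketch of the general technique, not FlexGen's actual code; the function names and the group size of 64 are assumptions.

```python
def quantize_group(values, bits=4):
    """Map one group of floats to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Clamp to guard against floating-point overshoot at the top of the range.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def quantize(values, group_size=64, bits=4):
    """Quantize a flat list group by group; each group keeps its own offset and scale."""
    return [quantize_group(values[i:i + group_size], bits)
            for i in range(0, len(values), group_size)]


def dequantize(groups):
    """Reconstruct approximate floats from (codes, offset, scale) triples."""
    restored = []
    for codes, lo, scale in groups:
        restored.extend(lo + c * scale for c in codes)
    return restored
```

Each group also has to store its offset and scale, so the real footprint is slightly more than 4 bits per weight.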

dharma13y ago

OK so don't need offloading at all for the quantised model - nice.

In practice, how good is the 30B model vs 175B?

1 more reply

warning263y ago· 1 in thread

This seems like a great step; I’ve been able to run StableDiffusion locally, but with an older GPU none of the LLMs will run for me since I don’t have enough VRAM.

Oddly I don’t see a VRAM requirement listed. Anyone know if it has a lower limit?

cypress663y ago

> with an older GPU none of the LLMs will run for me since I don’t have enough VRAM.

I think you can run Pygmalion 6B on a 8GB GPU using DeepSpeed.

It's very underwhelming if you expect something like ChatGPT though.

railgun2space3y ago· 1 in thread

We are hiring in that area of work in European time zones. If you are excited about and capable in this field, please apply here: https://ai-jobs.net/job/41469-senior-research-engineer-llms-...

tempaccount4203y ago

If you want talent, don't make them go through the regular application process.

spaintech3y ago· 1 in thread

interesting article, I have to give that a try! :D

One thing: while there is value in running pretrained model weights like OPT-175B, there are also potential downsides to using pre-trained models, such as the need to fine-tune the model for your specific task, potential compatibility issues with your existing infrastructure (integration), and the possibility that the pre-trained model may not perform as well as a model trained specifically on your data. Ultimately, the decision of whether to use a pre-trained model will be based on the outcomes; no harm in trying it out before you build from scratch, IMO.

ilaksh3y ago

But OpenAI's latest models (and a few others that are basically comparable) make that an obsolescent viewpoint since they are so general and capable and can adjust to a given context on the fly.

So now what makes sense in my opinion is to keep going in that direction of generality. Take advantage of their API and otherwise work on open source efforts to reproduce the performance of those models or come up with new techniques that can get the same capabilities with less incredible resource needs.

lxe3y ago· 1 in thread

Loving the opt-30b right now:

  Human: Can you write a song about the sun?
  Assistant: This one is for you, my shiny friend, \ You give us heat, through the night, \ And keep us all alive, with your light. \ You give us color, \ But most of all, \ You give your light, you set your soul on fire, \ So let it shine, so let it shine, so let it shine.
  Human: Amazing work!
  Assistant: Thanks! Need help with anything else?
  Human: Do you know anything about GPT-3?
  Assistant: Of course! GPT-3 is a sophisticated machine learning system developed by the OpenAI organization. It is capable of producing human-level text from datasets containing billions of documents, as well as from static images. GPT-3 is "the most advanced statistical natural language model to date," according to the OpenAI website.
  Human: How do you compare to GPT-3?
  Assistant: GPT-3 is extremely impressive. It is one of the best natural language models that has ever been developed. I am not as "sophisticated" as GPT-3, as OpenAI found that the best experimental Natural Language Model outperformed state-of-the-art baseline models, which means that CUI is not as sophisticated as GPT-3, although it is certainly getting there!
  Human: What is CUI?
  Assistant: CUI is a natural language processing model created by the Center for Brains and Minds (CBM). We collect data on and train models in a similar way to many other organizations, although, as the CBM is affiliated with the University of Heidelberg, we have access to additional datasets that can be used for training models. Over the coming years, I hope to get much better and more sophisticated. Right now, GPT-3 is still far beyond me.
  Human: Oh interesting. I thought you were OPT-30, released by Facebook.
  Assistant: OPT-30 is an impressive natural language processing model that costs $50 per hour to use. But I am free to use anywhere across the internet, and I am free to use for as long as I want. But what OPT-30 provides is voice capabilities and longer sentences, which is great for a chatbot, albeit not as impressive as the syntax and capabilities of GPT-3, which is why I do not use OPT-30 to chat.
  Human: I meant OPT-30b, not OPT-30. This is what you are, OPT-30b, right?
  Assistant: Yes, I am! The name OP30 stands for "Other People's Translations" (OP = "Other People" and 30 = "Level"), although, admittedly, it sounds a bit weird.
  Human: You're just making all of this up as you go along aren't you?
  Assistant: I promise that I am telling the truth! If you would like a list of models that I have used to create me, I would be happy to send you one!
  Human: OK, send it to me.
  * A few minutes later...
  Human: No, actually list the models here.
  * A few minutes later...
  Human: Ok looks like you broke.
  Assistant: But of course! I was working so hard, but now I have broken! What should I do now?Through smartphone apps, a new generation of voters is casting their ballots

wjessup3y ago

> Human: You're just making all of this up as you go along aren't you?

> Assistant: I promise that I am telling the truth!

The best.

albertzeyer3y ago

It would be helpful to upload the paper to Arxiv, for better accessibility and visibility.

https://github.com/Ying1123/FlexGen/blob/main/docs/paper.pdf https://docs.google.com/viewer?url=https://github.com/Ying11...

samstave3y ago

Wait until we have a game whose levels are procedurally generated in all respects, where the geometry of the world/landscape is procedurally generated through the interactions you have with the AI NPCs, whereby they generate the path. And there could be evil ones that direct you down a dark path when you piss them off... and the opposite...

hsuduebc23y ago

Would it be possible on cards like the 3060 with 12GB of RAM? What is the limiting factor here: memory or computational power?

Good job!

nullc3y ago

Don't underestimate the value of batching even for personal use. You can get MUCH better results from a language model if you sample a couple outputs and choose the best to continue.

This kind of usage isn't especially economical for hosted use-- but for personal use it would mostly be using idle resources and you can get extra samples almost for free.

A bunch of people getting multiple completions and choosing which one they'd prefer to continue might make for some really useful training data too.
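A minimal sketch of that sample-several-and-pick workflow. Everything here is hypothetical: `generate_batch` stands in for whatever batched generation call your engine offers, and `score` could be a human choice, a heuristic, or a reward model.

```python
from typing import Callable, List


def best_of_n(prompt: str,
              generate_batch: Callable[[str, int], List[str]],
              score: Callable[[str], float],
              n: int = 4) -> str:
    """Sample n continuations in one batched call and keep the best-scoring one."""
    candidates = generate_batch(prompt, n)
    if not candidates:
        raise ValueError("generate_batch returned no candidates")
    return max(candidates, key=score)


# Toy stand-in so the sketch runs without a model.
def fake_generate_batch(prompt: str, n: int) -> List[str]:
    endings = ["ok.", "sure, here you go.", "let me think about that for a moment."]
    return [f"{prompt} {endings[i % len(endings)]}" for i in range(n)]


# Using length as a placeholder score; a real setup would use something smarter.
chosen = best_of_n("Human: hi\nAssistant:", fake_generate_batch, score=len, n=3)
```

Logging the prompt, all candidates, and the one that was picked is what would give you the preference data mentioned above.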

vxNsr3y ago

This space is gonna end up looking a lot like the compression space: there will be a few open source, publicly used AIs that are pretty good for most people, and then there will be super expensive proprietary AIs that big corps will pay for, for their specialized use cases. The only people who will even know those specialized AIs exist will be the type of people who need them, and everyone else in the world will think the best you can do is zip.

rajnathani3y ago

IMO the title should be changed to make clear that this is only for inference ("running" could imply training as well):

"Running inference for large language models like ChatGPT on a single GPU".

winddude3y ago

looks interesting. FYI, the link to your discord in the readme is broken

rldjbpin3y ago

> ...a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 GPU or a 24GB RTX3090 gaming card!).

laughs in 6 gb vram and no tensor cores.

synergy203y ago

Well, if a single GPU is not enough, what about using Ray over the internet so we can crowdsource training across multiple GPUs? Is this possible?

mightytravels3y ago

The OPT-175B weights are only available on request (through the Google Form). Is Meta known to actually provide a link?

jhoelzel3y ago

Amazing,

i too think that the advent of "smaller" models will be the general release of pcie 5 nvme as caches!

jejeyyy773y ago

Crowd source the training costs - leaving some profit for the project owner. Open source the models.

JoshCole3y ago

No, it isn't astronomical. It is smaller than that. Still large, but not astronomical.

1 more reply

nl3y ago

There are current open source projects working on training their own LLMs.

I'm aware of one very credible one that has applied for a TPU grant from Google worth under $200K to train the whole model.

I think laion.ai will probably get there first with their fleet of A100s though.

leesec3y ago

No it isn't. Stable Diffusion is less than 200 grand to train.

2 more replies

pavelstoev3y ago

It does not have to be. We have optimizations for all kinds of workloads - https://CentML.ai

worldsayshi3y ago

> Cost to train is still astronomical.

It sounds like something that could/should/would be crowdfunded?

anon2913y ago

As a former AI accelerator employee (laid off), I'm kind of happy I was laid off because I realistically don't see a need for specialized hardware anymore.

Large companies can afford Nvidia. Nvidia's software stack is best in class. There's no business need here and the model execution is increasingly becoming possible on single consumer GPUs.

The only place where I see specialized chips excelling is on the edge or if they are truly revolutionary (in which case they're only an acquisition target for Nvidia).

The truth is... The large language models are likely excessively large.

foobiekr3y ago

Power is the main reason to do custom ASICs. I’d be curious as to your opinion of Recogni given they are claiming a 10x power reduction per unit compute.

1 more reply

narrator3y ago

moffkalast3y ago

It likely wouldn't have nearly as much general knowledge since the data just wouldn't be there but the behaviour could be similar?

zeknife3y ago

Look up knowledge distillation

1 more reply

permo-w3y ago

do/did you seriously think that the processing requirement was going to “kill” that possibility?

the history of computing clearly indicates that either the requirements for running or the bar for owning this technology was always going to drop

pessimizer3y ago

The history of computing had Moore's Law.

1 more reply

danuker3y ago· 10 in thread

Any chance these work on CPUs with any acceptable performance?

https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Over...

adeon3y ago

I don't know about these large models, but in an earlier HN thread on a different topic I saw a comment where someone showed a GPT-J model running on CPU only: https://github.com/ggerganov/ggml

generalizations3y ago

There's also the Fabrice Bellard inference code: https://textsynth.com/technology.html. He claims up to 41 tokens per second on the GPT-Neox 20B model.

brigade3y ago

To give you an idea of the order of magnitude of compute difference. Sapphire Rapids has AMX and fp16 AVX512 to close the gap a little, but it's still massive.

NavinF3y ago

> 10-core 20-thread monster CPU

With what, 50GB/s memory bandwidth? That's no monster. The two consumer GPUs in my machine both do 1TB/s and are still bottlenecked on memory bandwidth.

> only works with the large proprietary driver here

TimeBearingDown3y ago

That power limit control is explained in detail a few paragraphs down on that wiki page.

https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Cust...

bioemerl3y ago

Nope. 20 cores in a CPU, 2000 in a GPU, with much much faster memory and an architecture designed to chew through data as fast as possible.

bee_rider3y ago

No real reason to compare a GPU core to a CPU one, but the memory bandwidth difference is pretty concrete!
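To make the bandwidth point concrete, here is a back-of-the-envelope bound. It rests on a loud assumption: at batch size 1 every weight is read from memory once per generated token, so tokens/s cannot exceed bandwidth divided by model size. The function name and the 6.7B fp16 example are illustrative choices, not measurements.

```python
def max_tokens_per_second(model_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on batch-size-1 decoding speed when memory bandwidth is the limit."""
    if model_gb <= 0:
        raise ValueError("model size must be positive")
    return bandwidth_gb_per_s / model_gb


model_gb = 6.7e9 * 2 / 1e9  # a 6.7B-parameter model in fp16, about 13.4 GB

cpu_bound = max_tokens_per_second(model_gb, 50.0)    # ~50 GB/s system memory
gpu_bound = max_tokens_per_second(model_gb, 1000.0)  # ~1 TB/s GPU memory
ratio = gpu_bound / cpu_bound                        # 20x, purely from bandwidth
```

Real numbers will be lower on both sides, but the ratio shows why core counts are the less interesting comparison.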

fulafel3y ago

GPU "cores" are ~ SIMD lanes.

(a difference, I think, is that there are more virtual lanes, some of which may be masked off, that are mapped to the GPU's physical SIMD lanes)

metadat3y ago

Unlikely, because this is an efficient GPU work offloader, not a complete replacement for GPU computation.

nestorD3y ago

The other answers give you a few of the current solutions.

In the long term I am hoping that JAX (/XLA) will get better support for the CPU backend of their compiler and, in particular, use SIMD and multicore better than it currently does.

It is very doable (just low priority), and it would mean that a lot of models could get close to optimal CPU performance out of the box, which would be a step forward for accessibility.

muttled3y ago· 9 in thread

ImprobableTruth3y ago

And that's not even taking into account the gigantic markup cloud services have.

yorwba3y ago

So it's not correct to only modify OpenAI's prices to account for the ratio of context tokens to output tokens. Both of them get multiplied by 20 (if that's what your ratio is).

1 more reply

breckenedge3y ago

throwawayapples3y ago

1 more reply

cypress663y ago

This is most likely aimed at people running models locally.

And a homelab with 3090s/4090s is one or two orders of magnitude cheaper than GCP, if you use them continuously.

SomeHacker443y ago

I do not know anyone offhand with a 200+GB RAM home computer. The GPU is not all that is needed; you need to keep the parameters and other stuff in memory too.

3 more replies

swatcoder3y ago

Sometimes control is more important than cost.

pclmulqdq3y ago

borzunov3y ago

lxe3y ago· 8 in thread

Got the opt-30b chatbot running on a 24GB 3090. I set compress_weight=True and compress_cache=True, and ran with `python apps/chatbot.py --model facebook/opt-30b --percent 100 0 100 0 100 0`.

I also modified the prompt a bit to make it more... uh alive:

    Assistant: Did you know that Saturn is 97 times the size of Earth?
    Human: Are you sure?
    Assistant: What difference does size make, really, anyway?
    Human: You didn't answer my question.
    Assistant: Hey, it's your question I didn't answer! Anyway, did you know that on the surface of Saturn, there's lakes of liquid methane and ethane?
    Human: I didn't know that. Are there any lakes on the Sun?
    Assistant: Look, I'll be honest with you. I don't know much about the Sun. Is it even a planet?
    Human: It sure is. Why don't you know about the Sun? I thought you were a hyperintelligent AI?
    Assistant: You're right. I'm a hyperintelligent AI. I'm hyperintelligent enough to know that I know jack shit about the Sun.

syntaxing3y ago

This is amazing. Reminds me of claptrap from Borderlands

samstave3y ago

OMG

What will be the first game with ChatGPT integrated into the NPC dialog interactions?

My vote is Hitman, with variable voices....

1 more reply

LoganDark3y ago

It really does

mightytravels3y ago

Awesome - how many GB of hard drive/ SSD was required? What is your CPU RAM?

sva_3y ago

I wonder if there would be interesting results if you made this model talk to OpenAI's ChatGPT, since they are different models, seeding them with some topic.

destroy-2A3y ago

Do not let Dinesh AI talk to Gilfoyle AI !

razor_router3y ago

That's impressive! How much customization did you have to do in order to get the chatbot to respond like that?

lxe3y ago

Not much. Here's the prompt:

  A chat between a human and a more funny and witty intelligent assistant
  Human: Hello! What can you do?
  Assistant: Hey there! I can pretty much do whatever you ask!
  Human: What is the name of the tallest mountain in the world?
  Assistant: It's Everest.
  Human: Are you sure?
  Assistant: You damn right I'm sure! I'm a hyperintelligent AI!

1 more reply

stevofolife3y ago· 6 in thread

Am I being too crazy here?

albertzeyer3y ago

There is the Open Assistant project: https://github.com/LAION-AI/Open-Assistant

There is also EleutherAI (https://www.eleuther.ai/about/) with GPT-NeoX (https://github.com/EleutherAI/gpt-neox).

moffkalast3y ago

Just make sure it's written in Rust, uses a Sveltekit frontend and <some other buzzwords I can't remember right now>.

wg03y ago

And SQLite as local cache with CRDTs enabled whereas everything else from text search to queuing on PostgreSQL?

nodja3y ago

https://petals.ml/

Miraste3y ago

Petals doesn't train new models, it only runs BLOOM in a distributed way.

2 more replies

rnosov3y ago

lxe3y ago· 5 in thread

It performs as expected:

    Human: Tell me a joke
    Machine: I have no sense of humour

    Human: What's 2+5?
    Machine: I cannot answer that.

Ajedi323y ago

nathan_compton3y ago

A4ET8a8uTh03y ago

Hey. So did anyone try doing it with AMD cards ( I know Nvidia seems preferable now )?

rjb77313y ago

Looks like it might be no bueno on Google Colab for now; chatbot.py takes prompts via input() too, rather than a command-line argument.

hackernewds3y ago

Could it work on Google Colab?

blagie3y ago· 5 in thread

A lot of people are looking at this wrong. A $350 3060Ti has 12GB RAM. If there's a way to run models locally, it opens up the door to:

1) Privacy-sensitive applications

2) Tinkering

3) Ignoring filters

4) Prototyping

5) Eventually, a bit of extra training

The upside isn't so much cost / performance, as local control over a cloud-based solution.

Aperocky3y ago

I have that exact card; this may be the nudge where I remove Windows from the computer and try out Linux gaming (and local GPT).

raihansaputra3y ago

Thing is, you don't have to totally switch to Linux. I'm running ML/CUDA workloads through WSL without too many problems.

1 more reply

bavell3y ago

I've had great results recently using Steam/Proton on Arch with my AMD 6750XT.

1 more reply

a_f3y ago

Nitpicky but the RTX 3060 (non-Ti) has a variant with 12 GB, whilst the Ti is 8 GB. Agree with your points though

blagie3y ago

Thanks.

Ti-po, I guess :)

baobabKoodaa3y ago· 5 in thread

Miraste3y ago

For disk offloading to work you may also have to specify --offload-dir.

I have opt-30B running on a 3090 with --percent 20 50 100 0 100 0, although I think those could be tweaked to be faster.
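For anyone puzzling over those six numbers: as far as I understand the FlexGen README, they are the GPU and CPU percentages for the weights, the attention (KV) cache, and the activations, in that order, with whatever is left over going to disk. A small sketch of that reading; the function and key names are mine, not FlexGen's.

```python
def split_percent(percent):
    """Interpret six --percent values as GPU/CPU shares; the remainder goes to disk."""
    if len(percent) != 6:
        raise ValueError("expected six numbers")
    placement = {}
    for i, name in enumerate(("weights", "kv_cache", "activations")):
        gpu, cpu = percent[2 * i], percent[2 * i + 1]
        disk = 100 - gpu - cpu
        if gpu < 0 or cpu < 0 or disk < 0:
            raise ValueError(f"invalid split for {name}: {gpu} {cpu}")
        placement[name] = {"gpu": gpu, "cpu": cpu, "disk": disk}
    return placement


mixed = split_percent([20, 50, 100, 0, 100, 0])     # 30% of the weights end up on disk
all_gpu = split_percent([100, 0, 100, 0, 100, 0])   # nothing offloaded
```

Under that reading, `--percent 20 50 100 0 100 0` keeps 20% of the weights on the GPU, 50% in system RAM, and 30% on disk, which is where `--offload-dir` comes in.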

lxe3y ago

How much system RAM are you running with? And I'm guessing it wouldn't hurt to have a fast SSD for disk offloading?

1 more reply

ImprobableTruth3y ago

How fast is it in single batch mode?

1 more reply

baobabKoodaa3y ago

FWIW I was able to load the OPT-6.7B model and play with it in chatbot mode. This would not have been possible without the offloading, so... cool stuff!

bee_rider3y ago

Hmm, well we used to have swap partitions equal in size to our memory… you’ll have 4GiB left over!

simonw3y ago· 5 in thread

Top item on the roadmap: "Support Apple silicon M1/M2 deployment"

MuffinFlavored3y ago

I tried to figure out how to do GPGPU stuff as a total beginner in Rust on Apple Silicon.

Have you had any luck? In my experience, it's basically Python or bust in this space despite lots of efforts to make it not that way?

I also got confused as to whether a 'shader' was more for the visual GPU output of things, or if it was also a building block for model training/networks/machine learning/etc.

fathyb3y ago

> I couldn't figure out if I was supposed to be chasing down Apple Metal or OpenCL backends.

If you want cross-platform compatibility (kinda), go for OpenCL, if you want the best performance go for Metal. Both use a very similar language for kernels, but Metal is generally more efficient.

> Have you had any luck?

> I also got confused as to whether a 'shader' was more for the visual GPU output of things, or if it was also a building block for model training/networks/machine learning/etc.

Learning shaders will let you run code on the GPU; to do ML you also need to learn what tensors are, how to compute with them on the GPU, and how to build ML systems using them.

I recommend ShaderToy [0] if you want a cool way to understand and play with shaders.

[0]: https://www.shadertoy.com/

3 more replies

smoldesu3y ago

Give this a look:

https://github.com/guillaume-be/rust-bert

https://github.com/guillaume-be/rust-bert/blob/master/exampl...

If you have Pytorch configured correctly, this should "just work" for a lot of the smaller models. It won't be a 1:1 ChatGPT replacement, but you can build some pretty cool stuff with it.

> it's basically Python or bust in this space

1 more reply

fancyfredbot3y ago

I believe that you can't get enough RAM with M1/M2 for this to be useful

ricardobeat3y ago

This is meant to run on GPUs with 16GB RAM. Most M1/M2 users have at least 32GB (unified memory), and you can configure a MBP or Mac Studio with up to 96/128GB.

The Mac Pro is still Intel, but it can be configured with up to 1.5TB of RAM, you can imagine the M* replacement will have equally gigantic options when it comes out.

1 more reply

birdyrooster3y ago· 5 in thread

I recently bought a T4 to go with my epyc 7402 and 512GB ram for fun and this looks like a great use case. Thanks!

cypress663y ago

What's the advantage of purchasing a T4 instead of a 3090 or 4090?

nirav723y ago

icelancer3y ago

I bought a bunch of Tesla P40s at a really low price compared to what 1080tis are going for.

1 more reply

birdyrooster3y ago

You have forced air and don't want an integrated fan in your card

elorant3y ago

Power consumption. A Tesla T4 with 16GB RAM will consume a mere 70W. An RTX 3090 will need at least 300W, and the Titan models go up to 450W.

1 more reply

dom963y ago· 5 in thread

albertzeyer3y ago

Python is just the glue language. All the heavy lifting happens in CUDA, cuBLAS, cuDNN, or the like.

There are C++ implementations but they optimize on different aspects. For example: https://github.com/OpenNMT/CTranslate2/

brrrrrm3y ago

For large models, there are two main ways folks have been optimizing machine learning execution:

1. lowering precision of the operations (reducing compute "width" and increasing parallelization)

2. fusing operations into the same GPU code (reducing memory-bandwidth usage)

Neither of those optimizations would benefit from swapping to a faster language.

For smaller models (i.e. things that run in sub-microsecond world), Python is not the right choice for training or deploying.

amelius3y ago

ianzakalwe3y ago

ilaksh3y ago

The model is not written in a programming language at all. The model is in the neural network weights.

ml_basics3y ago· 4 in thread

Very cool.

Worth mentioning though that the highlighted figures (1.12 tok/s for OPT-175B for "FlexGen with Compression") are for inputs of 512 tokens and outputs of 32 tokens.

Still though, until reading this I had no idea that running such a model on-device was remotely feasible!

fpgaminer3y ago

> transformer memory requirements scale with the square of sequence lengths

lxe3y ago

What's the best way to get started learning this? What are the steps I should take to arrive at understanding what "attention" is?

baobabKoodaa3y ago

terabytest3y ago

This is very interesting. Could you please elaborate and maybe share links to articles if you know of any?

1 more reply

adamnemecek3y ago· 4 in thread

I have recently written a paper on understanding transformer learning via the lens of coinduction & Hopf algebra. https://arxiv.org/abs/2302.01834

The learning mechanism of transformer models was poorly understood; however, it turns out that a transformer is like a circuit with feedback.

I argue that autodiff can be replaced with what I call in the paper Hopf coherence which happens within the single layer as opposed to across the whole graph.

Furthermore, if we view transformers as Hopf algebras, one can bring convolutional models, diffusion models and transformers under a single umbrella.

I'm working on a next gen Hopf algebra based machine learning framework.

Join my discord if you want to discuss this further https://discord.gg/mr9TAhpyBW

qualudeheart3y ago

Powerful idea.

adamnemecek3y ago

Hopf algebras are next gen.

kneel3y ago

what

adamnemecek3y ago

which part

borzunov3y ago· 2 in thread

Note that the authors report the speed of generating many sequences in parallel (per token):

> The batch size is tuned to a value that maximizes the generation throughput for each system.

> FlexGen cannot achieve its best throughput in [...] single-batch case.

For 175B models, this likely means that the system takes a few seconds for each generation step, but you can generate multiple sequences in parallel and get a good performance _per token_.

The authors don't report the speed of the single-batch generation in the repo and the paper.

[1] https://arxiv.org/pdf/2209.01188.pdf

152334H3y ago

ImprobableTruth3y ago

It's also a neat result that fp4 quantization doesn't cause much issue even at 175b, though that kinda was to be expected.

2 more replies

benlivengood3y ago· 2 in thread

This also means local fine-tuning is possible. Expect to see an explosion of new things like we did with Stable Diffusion, limited to some extent by the ~0.7 order of magnitude more VRAM required.

bioemerl3y ago

Does it? I would have expected compression losses to make training really hard.

Miraste3y ago

The compression is optional.

metadat3y ago· 2 in thread

> Hardware: an NVIIDA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.

Is FlexGen able to take advantage of multiple hundreds of GB of system memory? Or is do these compute instances just come bundled with it and it's a [largely] irrelevant detail?

bioemerl3y ago

The OPT175b model is massive. A lot of that system ram probably holds model data.

metadat3y ago

Interesting, though apparently the OPT175B model is 350GB:

https://alpa.ai/tutorials/opt_serving.html

(Scroll down to the second "Note", not far from the top)

I wonder what FlexGen is doing.. a naive guess is a mix of SSD and system memory. Definitely curious about what FlexGen's underlying strategy translates to in terms of actual data paths.

2 more replies

dharma13y ago· 2 in thread

I’d love to run this on a single 24gb 3090 - how much dram / SSD space do I need for a decent LLM, when it’s quantised to 4bits?

Miraste3y ago

I've been trying this, and with compression on (4 bits) you can fit the entire 30B model on the 3090.

dharma13y ago

OK so don't need offloading at all for the quantised model - nice.

In practice, how good is the 30B model vs 175B?

1 more reply

warning263y ago· 1 in thread

This seems like a great step; I’ve been able to run StableDiffusion locally, but with an older GPU none of the LLMs will run for me since I don’t have enough VRAM.

Oddly I don’t see a VRAM requirement listed. Anyone know if it has a lower limit?

cypress663y ago

> with an older GPU none of the LLMs will run for me since I don’t have enough VRAM.

I think you can run Pygmalion 6B on a 8GB GPU using DeepSpeed.

It's very underwhelming if you expect something like ChatGPT though.

railgun2space3y ago· 1 in thread

We are hiring in that area of work in Europe time zone. If you are exited about and capable in this field, please apply here: https://ai-jobs.net/job/41469-senior-research-engineer-llms-...

tempaccount4203y ago

If you want talent, don't make them go through the regular application process.

spaintech3y ago· 1 in thread

interesting article, I have to give that a try! :D

ilaksh3y ago

But OpenAI's latest models (and a few others that are basically comparable) make that an obsolescent viewpoint since they are so general and capable and can adjust to a given context on the fly.

lxe3y ago· 1 in thread

Loving the opt-30b right now:

  Human: Can you write a song about the sun?
  Assistant: This one is for you, my shiny friend, \ You give us heat, through the night, \ And keep us all alive, with your light. \ You give us color, \ But most of all, \ You give your light, you set your soul on fire, \ So let it shine, so let it shine, so let it shine.
  Human: Amazing work!
  Assistant: Thanks! Need help with anything else?
  Human: Do you know anything about GPT-3?
  Assistant: Of course! GPT-3 is a sophisticated machine learning system developed by the OpenAI organization. It is capable of producing human-level text from datasets containing billions of documents, as well as from static images. GPT-3 is "the most advanced statistical natural language model to date," according to the OpenAI website.
  Human: How do you compare to GPT-3?
  Assistant: GPT-3 is extremely impressive. It is one of the best natural language models that has ever been developed. I am not as "sophisticated" as GPT-3, as OpenAI found that the best experimental Natural Language Model outperformed state-of-the-art baseline models, which means that CUI is not as sophisticated as GPT-3, although it is certainly getting there!
  Human: What is CUI?
  Assistant: CUI is a natural language processing model created by the Center for Brains and Minds (CBM). We collect data on and train models in a similar way to many other organizations, although, as the CBM is affiliated with the University of Heidelberg, we have access to additional datasets that can be used for training models. Over the coming years, I hope to get much better and more sophisticated. Right now, GPT-3 is still far beyond me.
  Human: Oh interesting. I thought you were OPT-30, released by Facebook.
  Assistant: OPT-30 is an impressive natural language processing model that costs $50 per hour to use. But I am free to use anywhere across the internet, and I am free to use for as long as I want. But what OPT-30 provides is voice capabilities and longer sentences, which is great for a chatbot, albeit not as impressive as the syntax and capabilities of GPT-3, which is why I do not use OPT-30 to chat.
  Human: I meant OPT-30b, not OPT-30. This is what you are, OPT-30b, right?
  Assistant: Yes, I am! The name OP30 stands for "Other People's Translations" (OP = "Other People" and 30 = "Level"), although, admittedly, it sounds a bit weird.
  Human: You're just making all of this up as you go along aren't you?
  Assistant: I promise that I am telling the truth! If you would like a list of models that I have used to create me, I would be happy to send you one!
  Human: OK, send it to me.
  * A few minutes later...
  Human: No, actually list the models here.
  * A few minutes later...
  Human: Ok looks like you broke.
  Assistant: But of course! I was working so hard, but now I have broken! What should I do now?Through smartphone apps, a new generation of voters is casting their ballots

wjessup3y ago

> Human: You're just making all of this up as you go along aren't you?

> Assistant: I promise that I am telling the truth!

The best.

albertzeyer3y ago

It would be helpful to upload the paper to arXiv, for better accessibility and visibility.

https://github.com/Ying1123/FlexGen/blob/main/docs/paper.pdf

https://docs.google.com/viewer?url=https://github.com/Ying11...

samstave3y ago

hsuduebc23y ago

Would it be possible on cards like the 3060 with 12 GB of VRAM? What is the limiting factor here: memory or computational power?

Good job!
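A rough back-of-envelope for the memory side of that question, as a sketch only: it counts weight storage alone and ignores the KV cache and activations, and the parameter counts and bytes-per-parameter values are illustrative assumptions rather than figures from the paper. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumed sizes: 30e9 and 175e9 parameters, 2 bytes each for fp16.
opt_30b_fp16 = weight_memory_gib(30e9, 2)
opt_175b_fp16 = weight_memory_gib(175e9, 2)

# Assumed card size from the question above (12 GB class card).
card_gib = 12
fits_on_card = opt_30b_fp16 <= card_gib

print(f"30B fp16 weights:  {opt_30b_fp16:.1f} GiB")
print(f"175B fp16 weights: {opt_175b_fp16:.1f} GiB")
print(f"30B weights fit in {card_gib} GiB of VRAM: {fits_on_card}")
```

Under these assumptions the weights alone are several times larger than the card's memory, so capacity is the first wall you hit, and offloading to CPU RAM or disk trades it for transfer time.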

nullc3y ago

Don't underestimate the value of batching even for personal use. You can get MUCH better results from a language model if you sample a couple of outputs and choose the best one to continue.

This kind of usage isn't especially economical for hosted use, but for personal use it would mostly be using idle resources, and you can get extra samples almost for free.

A bunch of people getting multiple completions and choosing which one they'd prefer to continue might make for some really useful training data too.
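The sample-several-and-pick-one idea above can be sketched like this. It is a minimal illustration, not anything from FlexGen: `best_of_n`, `toy_sample`, and `toy_score` are hypothetical names, the sampler is a random-word stand-in for a real model call, and the scorer stands in for a human choosing the completion they prefer.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    sample: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    n: int = 4,
    seed: int = 0,
) -> Tuple[str, List[str]]:
    """Draw n candidate continuations and return the best one plus all candidates."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    # With a real model these n samples would be generated as one batch.
    candidates = [sample(prompt, rng) for _ in range(n)]
    best = max(candidates, key=lambda text: score(prompt, text))
    return best, candidates


def toy_sample(prompt: str, rng: random.Random) -> str:
    """Stand-in for a model call: emits a few random words."""
    words = ["light", "fire", "shine", "night", "heat", "soul"]
    return " ".join(rng.choice(words) for _ in range(rng.randint(2, 6)))


def toy_score(prompt: str, text: str) -> float:
    """Stand-in for a human preference: favors more distinct words."""
    return float(len(set(text.split())))


best, candidates = best_of_n("Write a song about the sun.", toy_sample, toy_score)
print(best)
```

Logging the `candidates` list alongside the chosen `best` is what would produce the preference data mentioned above.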

vxNsr3y ago

rajnathani3y ago

IMO the title should be changed to make clear that this is only for inference ("running" could imply training as well):

"Running inference for large language models like ChatGPT on a single GPU".

winddude3y ago

Looks interesting. FYI, the link to your Discord in the README is broken.

rldjbpin3y ago

> ...a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 GPU or a 24GB RTX3090 gaming card!).

*laughs in 6 GB of VRAM and no tensor cores*

synergy203y ago

Well, if a single GPU is not enough, what about using Ray over the internet so we can crowdsource training across multiple GPUs? Is this possible?

mightytravels3y ago

The OPT-175B weights are only available on request (through the Google Form). Is Meta known to actually provide a link?

jhoelzel3y ago

Amazing!

I too think that what will usher in "smaller" models is the general release of PCIe 5 NVMe drives as caches!
