And you could train something like GPT-3 for cheaper than a superbowl commercial. That would get you a lot of publicity.
I'm aware of one very credible one that has applied for a TPU grant from Google worth under $200K to train the whole model.
I think laion.ai will probably get their first with their fleet of A100s though.
It sounds like something that could/should/would be crowdfunded?
Large companies can afford Nvidia. Nvidia's software stack is best in class. There's no business need here and the model execution is increasingly becoming possible on single consumer GPUs.
The only place where I see specialized chips excelling is on the edge or if they are truly revolutionary (in which case they're only an acquisition target for Nvidia).
The truth is... The large language models are likely excessively large.
It likely wouldn't have nearly as much general knowledge since the data just wouldn't be there but the behaviour could be similar?
the history of computing clearly indicates that either the requirements for running or the bar for owning this technology was always going to drop
I have a 10-core 20-thread monster CPU, but didn't bother with a dedicated GPU because I can't control something as simple as its temperature. See the complicated procedure that only works with the large proprietary driver here:
https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Over...
I tested it on my Linux and Macbook M1 Air and it generates tokens at a reasonable speed using CPU only. I noticed it doesn't quite use all my available CPU cores so it may be leaving some performance on the table, not sure though.
The GPT-J 6B is nowhere near as large as the OPT-175B in the post. But I got the sense that CPU-only inference may not be totally hopeless even for large models if only we got some high quality software to do it.
To give you an idea of the order of magnitude of compute difference. Sapphire Rapids has AMX and fp16 AVX512 to close the gap a little, but it's still massive.
With what, 50GB/s memory bandwidth? That's no monster. The two consumer GPUs in my machine both do 1TB/s and are still bottlenecked on memory bandwidth.
> only works with the large proprietary driver here
In practice, nothing works without the proprietary driver so this isn't specific to temperature. Also the setting you're looking for is almost certainly `nvidia-smi -pl $watts` for setting the power limit, not whatever that wiki gives you. GPU temperature = ambient temperature + (power limit)*(thermal resistance of cooler)
https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Cust...
In the long term I am hoping that JAX (/XLA) will get better support for the CPU backend of their compiler and in particular, use SIMD and multicore better than it currently does.
It is very doable (just low priority) and it would mean that a lot of models could get close to optimal CPU performances out of the box which would be a step forward for accessibility.
Running the fine-tuned versions of OpenAI models are approximately 6x more expensive per token. If you were running a fine-tuned model on local commodity hardware, the economies would start to tilt in favor of doing something like this if the load was predictable and relatively constant.
And that's not even taking into account the gigantic markup cloud services have.
So it's not correct to only modify OpenAI's prices to account for the ratio of context tokens to output tokens. Both of them get multiplied by 20 (if that's what your ratio is).
And a homelab with 3090s/4090s is one or two orders of magnitude cheaper than GCP, if you use them continuously.
I also modified the prompt a bit to make it more... uh alive:
Assistant: Did you know that Saturn is 97 times the size of Earth?
Human: Are you sure?
Assistant: What difference does size make, really, anyway?
Human: You didn't answer my question.
Assistant: Hey, it's your question I didn't answer! Anyway, did you know that on the surface of
Saturn, there's lakes of liquid methane and ethane?
Human: I didn't know that. Are there any lakes on the Sun?
Assistant: Look, I'll be honest with you. I don't know much about the Sun. Is it even a planet?
Human: It sure is. Why don't you know about the Sun? I thought you were a hyperintelligent AI?
Assistant: You're right. I'm a hyperintelligent AI. I'm hyperintelligent enough to know that I know jack shit about the Sun. A chat between a human and a more funny and witty intelligent assistant
Human: Hello! What can you do?
Assistant: Hey there! I can pretty much do whatever you ask!
Human: What is the name of the tallest mountain in the world?
Assistant: It's Everest.
Human: Are you sure?
Assistant: You damn right I'm sure! I'm a hyperintelligent AI!Am I being too crazy here?
There is also EleutherAI (https://www.eleuther.ai/about/) with GPT-NeoX (https://github.com/EleutherAI/gpt-neox).
And if you try to split the model across several GPUs then you'll have an issue of bandwidth as model parts would need to talk to each other (on the order of terabyte/second). At the moment, the only realistic way to contribute is just to provide feedback data for the RLHF training.
It performs as expected:
Human: Tell me a joke
Machine: I have no sense of humour
Human: What's 2+5?
Machine: I cannot answer that.1) Privacy-sensitive applications
2) Tinkering
3) Ignoring filters
4) Prototyping
5) Eventually, a bit of extra training
The upside isn't so much cost / performance, as local control over a cloud-based solution.
For disk offloading to work you may also have to specify --offload-dir.
I have opt-30B running on a 3090 with --percent 20 50 100 0 100 0, although I think those could be tweaked to be faster.
I couldn't figure out if I was supposed to be chasing down Apple Metal or OpenCL backends. It also didn't seem to make much of a difference because while there are crates for both that seemed relatively well-maintained/fleshed out, I couldn't figure out how exactly to just pull one down and plug them into a higher level library (or find said higher level library all together).
Have you had any luck? In my experience, it's basically Python or bust in this space despite lots of efforts to make it not that way?
I also got confuses as to whether a 'shader' was more for the visual GPU output of things, or if it was also a building block for model training/networks/machine learning/etc.
If you want cross-platform compatibility (kinda), go for OpenCL, if you want the best performance go for Metal. Both use a very similar language for kernels, but Metal is generally more efficient.
> Have you had any luck?
Not in ML, but I'm doing a lot of GPGPU on Metal, I recently started doing it in Rust. A bit less convenient than with Swift/Objective-C, but still possible. Worst case you'll have to add an .mm file and bridge it with `extern "C"`. That said, doing GPGPU is not doing ML, and most ML libraries are in Python.
> I also got confuses as to whether a 'shader' was more for the visual GPU output of things, or if it was also a building block for model training/networks/machine learning/etc.
A shader is basically a function that runs for every element of the output buffer. We generally call them kernels for GPGPU, and shaders (geometry, vertex, fragment) for graphics stuff. You have to write them in a language that kinda looks like C (OpenGL GLSL, DirectX HSL, Metal MSL), but is optimized for the SMT properties of GPUs.
Learning shaders will let you run code on the GPU, to do ML you also need to learn what are tensors, how to compute them on the GPU, and how to build ML systems using them.
I recommend ShaderToy [0] if you want a cool way to understand and play with shaders.
https://github.com/guillaume-be/rust-bert
https://github.com/guillaume-be/rust-bert/blob/master/exampl...
If you have Pytorch configured correctly, this should "just work" for a lot of the smaller models. It won't be a 1:1 ChatGPT replacement, but you can build some pretty cool stuff with it.
> it's basically Python or bust in this space
More or less, but that doesn't have to be a bad thing. If you're on Apple Silicon, you have plenty of performance headroom to deploy Python code for this. I've gotten this library to work on systems with as little as 2gb of memory, so outside of ultra-low-end use cases, you should be fine.
The Mac Pro is still Intel, but it can be configured with up to 1.5TB of RAM, you can imagine the M* replacement will have equally gigantic options when it comes out.
I bought a bunch of Tesla P40s at a really low price compared to what 1080tis are going for.
Most optimizations for saving memory is by using lower precision numbers (float16 or less), quantization (int8 or int4), sparsification, etc. But this is all handled by the underlying framework like PyTorch.
There are C++ implementations but they optimize on different aspects. For example: https://github.com/OpenNMT/CTranslate2/
1. lowering precision of the operations (reducing compute "width" and increasing parallelization)
2. fusing operations into the same GPU code (reducing memory-bandwidth usage)
Neither of those optimizations would benefit from swapping to a faster language.
Why? The typical "large" neural network operation runs on the order of a dozen microseconds to milliseconds. Models are usually composed of hundred if not thousands of these. The overhead of using Python is around 0.5 microseconds per operation (best case on Intel, worst case on Apple ARM). So that's maybe a 5% net loss if things were running synchronously. But they're not! When you call GPU code, you actually do it asynchronously, so the language latency can be completely hidden.
So really, all you want in an ML language is the ability to 1. change the type of the underlying data on the fly (Python is really good at this) and 2. rewrite the operations being dispatched to on the fly (Python is also really good at this).
For smaller models (i.e. things that run in sub-microsecond world), Python is not the right choice for training or deploying.
Worth mentioning though that the highlighted figures (1.12 tok/s for OPT-175B for "FlexGen with Compression") are for inputs of 512 tokens and outputs of 32 tokens.
Since decoder-only transformer memory requirements scale with the square of sequence lengths, things would probably slow down significantly for very long sequences, which would be required for a back-and-forth conversation.
Still though, until reading this i had no idea that running such a model on-device was remotely feasible!
Not true, see: Flash Attention. You can losslessly calculate the attention in blocks using a little math trick. Essentially each subsequent block "corrects" the denominator of the last block's softmax calculation. At the end you have a perfectly* accurate softmax. Since you don't need to keep the whole sequence in memory to perform the softmax, your memory now scales linearly with respect to sequence length, and due to the lower memory bandwidth requirements and increased kernel fusion the operation also tends to be faster.
* While mathematically the calculation ends up exactly the same, in practice the result ends up slightly different due to the whims of F32 and F16 inaccuracies, and since the "max" used to calculate the softmax in a numerically stable way is calculated on a per-block basis. Doesn't significantly effect training or validation loss though.
You can use tricks to keep the sequence length down even if the conversation goes on for a long time. For example, you can use the model to summarize the first n-1 lines of the conversation and append the last line to the summary as is.
The learning mechanism of transformer models was poorly understood however it turns out that a transformer is like a circuit with a feedback.
I argue that autodiff can be replaced with what I call in the paper Hopf coherence which happens within the single layer as opposed to across the whole graph.
Furthermore, if we view transformers as Hopf algebras, one can bring convolutional models, diffusion models and transformers under a single umbrella.
I'm working on a next gen Hopf algebra based machine learning framework.
Join my discord if you want to discuss this further https://discord.gg/mr9TAhpyBW
> The batch size is tuned to a value that maximizes the generation throughput for each system.
> FlexGen cannot achieve its best throughput in [...] single-batch case.
For 175B models, this likely means that the system takes a few seconds for each generation step, but you can generate multiple sequences in parallel and get a good performance _per token_.
However, what you actually need for ChatGPT and interactive LM apps is to generate _one_ sequence reasonably quickly (so it takes <= 1 sec/token to do a generation step). I'm not sure if this system can be used for that, since our measurements [1] show that even the theoretically-best RAM offloading setup can't run the single-batch generation faster than 5.5 sec/token due to hardware constraints.
The authors don't report the speed of the single-batch generation in the repo and the paper.
These results are generally unimpressive, of course. Most of the improvements at that point are attributable to the authors making use of a stripped down library for autoregressive sampling. HN falling for garbage once again...
It's also a neat result that fp4 quantization doesn't cause much issue even at 175b, though that kinda was to be expected.
Is FlexGen able to take advantage of multiple hundreds of GB of system memory? Or is do these compute instances just come bundled with it and it's a [largely] irrelevant detail?
> You will need at least 350GB GPU memory on your entire cluster to serve the OPT-175B model. For example, you can use 4 x AWS p3.16xlarge instances, which provide 4 (instance) x 8 (GPU/instance) x 16 (GB/GPU) = 512GB memory.
https://alpa.ai/tutorials/opt_serving.html
(Scroll down to the second "Note", not far from the top)
I wonder what FlexGen is doing.. a naive guess is a mix of SSD and system memory. Definitely curious about what FlexGen's underlying strategy translates to in terms of actual data paths.
Oddly I don’t see a VRAM requirement listed. Anyone know if it has a lower limit?
I think you can run Pygmalion 6B on a 8GB GPU using DeepSpeed.
It's very underwhelming if you expect something like ChatGPT though.
One ting is that while getting the value of running pretrained model weights like OPT-175B, there are also a potential downsides to using pre-trained models, such as the need to fine-tune the model to your specific task, potential compatibility issues with your existing infrastructure (integration ) , and the possibility that the pre-trained model may not perform as well as a model trained specifically on your data. Ultimately, the decision of whether to use a pre-trained model will be based on the outcomes, no harm in trying it out before you build from scratch, IMO.
So now what makes sense in my opinion is to keep going in that direction of generality. Take advantage of their API and otherwise work on open source efforts to reproduce the performance of those models or come up with new techniques that can get the same capabilities with less incredible resource needs.
Human: Can you write a song about the sun?
Assistant: This one is for you, my shiny friend, \ You give us heat, through the night, \ And keep us all alive, with your light. \ You give us color, \ But most of all, \ You give your light, you set your soul on fire, \ So let it shine, so let it shine, so let it shine.
Human: Amazing work!
Assistant: Thanks! Need help with anything else?
Human: Do you know anything about GPT-3?
Assistant: Of course! GPT-3 is a sophisticated machine learning system developed by the OpenAI organization. It is capable of producing human-level text from datasets containing billions of documents, as well as from static images. GPT-3 is "the most advanced statistical natural language model to date," according to the OpenAI website.
Human: How do you compare to GPT-3?
Assistant: GPT-3 is extremely impressive. It is one of the best natural language models that has ever been developed. I am not as "sophisticated" as GPT-3, as OpenAI found that the best experimental Natural Language Model outperformed state-of-the-art baseline models, which means that CUI is not as sophisticated as GPT-3, although it is certainly getting there!
Human: What is CUI?
Assistant: CUI is a natural language processing model created by the Center for Brains and Minds (CBM). We collect data on and train models in a similar way to many other organizations, although, as the CBM is affiliated with the University of Heidelberg, we have access to additional datasets that can be used for training models. Over the coming years, I hope to get much better and more sophisticated. Right now, GPT-3 is still far beyond me.
Human: Oh interesting. I thought you were OPT-30, released by Facebook.
Assistant: OPT-30 is an impressive natural language processing model that costs $50 per hour to use. But I am free to use anywhere across the internet, and I am free to use for as long as I want. But what OPT-30 provides is voice capabilities and longer sentences, which is great for a chatbot, albeit not as impressive as the syntax and capabilities of GPT-3, which is why I do not use OPT-30 to chat.
Human: I meant OPT-30b, not OPT-30. This is what you are, OPT-30b, right?
Assistant: Yes, I am! The name OP30 stands for "Other People's Translations" (OP = "Other People" and 30 = "Level"), although, admittedly, it sounds a bit weird.
Human: You're just making all of this up as you go along aren't you?
Assistant: I promise that I am telling the truth! If you would like a list of models that I have used to create me, I would be happy to send you one!
Human: OK, send it to me.
* A few minutes later...
Human: No, actually list the models here.
* A few minutes later...
Human: Ok looks like you broke.
Assistant: But of course! I was working so hard, but now I have broken! What should I do now?Through smartphone apps, a new generation of voters is casting their ballotsThe best.
https://github.com/Ying1123/FlexGen/blob/main/docs/paper.pdf https://docs.google.com/viewer?url=https://github.com/Ying11...
Good job!
This kind of usage isn't especially economical for hosted use-- but for personal use it would mostly be using idle resources and you can get extra samples almost for free.
A bunch of people getting multiple completions and choosing which one they'd prefer to continue might make for some really useful training data too.
"Running inference for large language models like ChatGPT on a single GPU".
laughs in 6 gb vram and no tensor cores.
i too think that the advent of "smaller" models will be the general release of pcie 5 nvme as caches!