Meta released their OPT model, which they claim is comparable to GPT-3. The guidance for running it [1] suggests a LOT of memory - at least 350GB of GPU memory, which is roughly 4-5 A100 80GB cards, and those are pricey.
Running this on AWS at that spec would cost about $25/hr for a single model instance - roughly $0.40 a minute. If the model takes a few seconds per request, you easily hit $0.05 per request once you factor in the rest of the infra (storage, CDN, etc.), the engineering cost, the research cost, and the fact that they probably scale to hundreds of instances for heavy traffic, which may mean less efficiently purchased servers.
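The back-of-envelope arithmetic behind that per-request figure can be sketched like this. All of the inputs (hourly rate, seconds per request, overhead multiplier) are assumptions, not measured numbers:

```python
# Back-of-envelope cost per request. Every input here is an assumption.
HOURLY_RATE = 25.0          # $/hr for ~4x A100 80GB on AWS (assumed)
SECONDS_PER_REQUEST = 3.0   # assumed generation time for one request
OVERHEAD_MULTIPLIER = 2.0   # infra, storage, CDN, engineering (assumed)

cost_per_second = HOURLY_RATE / 3600
gpu_cost = cost_per_second * SECONDS_PER_REQUEST
total = gpu_cost * OVERHEAD_MULTIPLIER
print(f"GPU cost/request: ${gpu_cost:.3f}, loaded cost: ${total:.3f}")
```

With those assumptions the raw GPU time is about 2 cents per request, and a modest overhead multiplier lands you in the ~$0.05 range the comment describes.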
OpenAI has a sweetheart deal with Azure, but this is roughly the cost structure for serving requests. And this doesn’t include the upfront cost of training.
Yet people these days believe something like the brain was bruteforced by nature into an accidental existence.
Stable Diffusion can run on a home PC, while it seems you need a supercomputer for GPT-3. I'm not sure that would have been my intuition.
For encoder-style models, the usual serving recipe is:
- run tokenization of inputs on the CPU
- sort inputs by length
- batch inputs of similar length and pad them to a uniform length
- pass the batches through so a single model can process many inputs in parallel.
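The steps above can be sketched in a few lines. This is a minimal illustration of length-sorted batching, not any particular serving framework's API:

```python
from typing import List

def make_batches(token_ids: List[List[int]], batch_size: int, pad_id: int = 0):
    """Sort sequences by length, group neighbours, pad each group to a uniform width."""
    order = sorted(range(len(token_ids)), key=lambda i: len(token_ids[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        group = [token_ids[i] for i in order[start:start + batch_size]]
        width = max(len(seq) for seq in group)
        batches.append([seq + [pad_id] * (width - len(seq)) for seq in group])
    return batches

# Similar-length inputs land in the same batch, so little compute is wasted on padding.
batches = make_batches([[1], [2, 3, 4], [5, 6], [7, 8, 9]], batch_size=2)
```

Each batch is a rectangular array, so a single forward pass processes all sequences in it at once.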
For GPT-style decoder models, however, this becomes much more challenging because inference requires a forward pass for every token generated. (Stopping criteria also may differ, but that's another tangent.)
Every generated token attends to every previous token - both the context (or “prompt”) and the previously generated tokens (important for self-consistency). This is a quadratic operation in the vanilla case.
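A quick way to see the quadratic blow-up: the token at step t attends over the prompt plus the t tokens generated so far, so total attention work is a sum of a growing series. A tiny sketch of that counting (positions attended to, ignoring per-position constants):

```python
def attention_ops(prompt_len: int, gen_len: int) -> int:
    """Total attended-to positions across a whole generation, vanilla case."""
    return sum(prompt_len + t for t in range(gen_len))

# Once generation dominates the prompt, doubling the output length
# roughly quadruples the total attention work.
short = attention_ops(prompt_len=0, gen_len=512)
long = attention_ops(prompt_len=0, gen_len=1024)
```

With prompt_len = 0 this is just the triangular number gen_len * (gen_len - 1) / 2, i.e. O(n²) in the number of generated tokens.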
Model sizes are large, often spanning multiple machines, and later layers depend on the outputs of earlier ones, so inference has to be pipelined.
The naive approach would be to have a single request processed exclusively by a single instance of the model. This is expensive! Even if each model can be crammed into a single A100, running something like Codex or ChatGPT for millions of users with low-latency inference would require thousands of GPUs preloaded with models, and each request would take a highly variable amount of time.
If a model spans n machines, you'd achieve at most 1/n utilization, because each shard has to stay loaded while the others process. And if you want to do pipeline parallelism as in PipeDream, you'd have to deal with attention caches, since you don't want to recompute every previous state on each step.
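The attention cache mentioned above (usually called a KV cache) is easiest to see in a toy single-head sketch. The projection matrices and dimensions here are made up for illustration; the point is that each decode step projects only the newest token and appends to the cache, rather than recomputing keys/values for the whole history:

```python
import numpy as np

def cached_decode_step(x_new, Wq, Wk, Wv, k_cache, v_cache):
    """One decode step with a KV cache: project only the newest token,
    append its key/value to the cache, and attend over everything cached."""
    q = x_new @ Wq                        # query for the new token only
    k_cache.append(x_new @ Wk)            # cache grows by one key...
    v_cache.append(x_new @ Wv)            # ...and one value per step
    K = np.stack(k_cache)                 # (t, d): all keys so far
    V = np.stack(v_cache)                 # (t, d): all values so far
    scores = K @ q / np.sqrt(len(q))      # scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over cached positions
    return weights @ V                    # attention output for the new token
```

Without the cache, every step would re-project every previous token through Wk and Wv - exactly the recomputation a pipelined serving setup wants to avoid, at the price of keeping the cache resident on each shard.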
The answer is almost certainly "no." A service like ChatGPT is expensive because it requires heavy-duty GPU computation.
I don't think these numbers sound very out of line. It would be easier to gauge feasibility if we knew how fast those cards can execute the model: if one run takes a second, a few cents seems about right; if it takes a few milliseconds, it's a lot less than a few cents, unless Microsoft charges a huge premium for the servers.
Now there are different ways to achieve this, but in essence the model has to attend to the whole context at once, plus the instructions on how to handle it.
You can actually ask it to explain how you could create a natural language processing algorithm yourself, and it will even give you a starter framework in the language of your choice. Fair warning: for me it was about a six-hour rabbit hole :D
Read the papers.
Is the H100 deployed at Azure? I wonder how much more efficient that would be over A100s.
Memory shipped with computers has been stagnant for a decade.
We live in a really exciting age :). Local AI models will also finally give Microsoft reasons again to require hardware for coming Windows versions. Now they have to require obscure security chips and stuff but in the future they might have some local cortana thingy or something that requires a certain amount of computational power.
It will write 200 lines of code for me which would maybe take me a few hours. I have to spend 15 minutes cleaning it up, but still it saved me 80% of the time. It's a massive win.
Also great for writing articles or emails. I write what I want to say into ChatGPT and tell it to rewrite it to be pleasant and less harsh, and it does a great job of that.
For example, it gives you code. You run that code to see if the outputs are as expected.
This hubris over the top-level language in the system is so passé, so 2000s.