That's not hyperbole. Why is OpenAI able to charge so little for their APIs? I've heard CEOs of rival mega-LLM companies complain that OpenAI's prices would be a loss for them. But I think the margin is still positive: OpenAI can price low partly because they've invested more in managing the infra, sure, but most importantly because they get the best utilization out of their existing hardware.
If it costs everyone $X/GPU/hr to serve models, the company with the most throughput wins on price. In a world without finetunes, the most capable model, the one that can zero- or few-shot the most tasks, will get the most usage. Finetuned open models can reach parity with GPT on narrow tasks, but until now, having public providers serve them was expensive: your private finetune is only queried by you, not everyone, so it's very expensive to serve on a per-token basis. With hot-swappable LoRA adapters, that calculus changes, and the cost per token can go way down. Super, super exciting!
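To make the utilization argument concrete, here's a back-of-envelope sketch. All the numbers are made up for illustration (hypothetical GPU cost and throughputs, not real provider figures); the point is just that cost per token is GPU cost divided by sustained throughput, so a lone finetune user pays far more per token than users sharing a hot base model:

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
GPU_COST_PER_HR = 2.00          # assumed $/GPU/hr
SHARED_TOKENS_PER_SEC = 2000    # assumed throughput with many users batched together
PRIVATE_TOKENS_PER_SEC = 50     # assumed throughput when one user queries a private finetune

def cost_per_million_tokens(gpu_cost_per_hr, tokens_per_sec):
    """Dollars per 1M tokens at a given sustained throughput."""
    tokens_per_hr = tokens_per_sec * 3600
    return gpu_cost_per_hr / tokens_per_hr * 1_000_000

shared = cost_per_million_tokens(GPU_COST_PER_HR, SHARED_TOKENS_PER_SEC)
private = cost_per_million_tokens(GPU_COST_PER_HR, PRIVATE_TOKENS_PER_SEC)
print(f"shared base model:  ${shared:.2f}/M tokens")   # $0.28/M
print(f"dedicated finetune: ${private:.2f}/M tokens")  # $11.11/M
```

Hot-swappable LoRA adapters move finetune traffic from the second row toward the first: every request runs through the same shared base model, so utilization stays high.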
Underprice to avoid or drive out competition and encourage lock-in, then increase prices when you no longer have competitors, or when your user base is large enough and reliant enough that your attrition is manageable. Then you sell to a bigger company that grinds it up and integrates it into their own products. Same as always. Bonus points if you claim to be open source for the free marketing and/or free development/testing in the form of user contributions before switching to a proprietary model.
Shouldn’t we have a standardized corporate strategy bingo card by now?
There is a difference between pricing aggressively and pricing at a loss. Their pricing for gpt-3.5-turbo now matches leading public providers of Llama-70B ($1/million tokens). Rumor has it that 3.5-turbo is actually a 20B model, but even if we assume it's larger than 70B, OpenAI can still price more aggressively than Llama-70B providers because they get better throughput and utilization from the same hardware.
Really looking forward to these innovations becoming more widespread -- I expect we're very close to a world where training a LoRA on a one-off task like "review every HN post from the last 3 years and flag any of them that contain informed speculation about the architecture of GPT-4" will be easy, cheap and routine.
We'll keep doing more research on finetuning. And hopefully, we'll see the results soon.
[1] https://le.qun.ch/en/blog/2023/09/11/multi-lora-potentials/
How hard would it be to adapt your kernels to work with the new-gen quants like AWQ or EXL2?
We are polishing the 4-bit code. It will be added to the Punica code base soon. Please stay tuned :)
So Atom base models would be compatible with Punica?
I also wonder: many people already train LoRAs with the base model in 8-bit or even 4-bit, so would it make sense to match the quantization algorithm between training and inference?
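To make the question concrete, here's a toy sketch of why matching might matter (assumed symmetric int4 quantization, nothing Punica-specific): a LoRA trained against a quantized base implicitly "sees" the quantized weights, rounding error included, so serving it against a differently-quantized (or full-precision) base reintroduces a mismatch the adapter never trained against.

```python
import numpy as np

def quantize_4bit(w):
    """Toy symmetric int4 quantize + dequantize (levels -7..7)."""
    scale = np.abs(w).max() / 7
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))   # stand-in for a base model weight
W_q = quantize_4bit(W)

# A LoRA finetuned on W_q compensates for W_q, not W; this gap is what the
# adapter would silently absorb during training.
print("max base-vs-quantized error:", np.abs(W - W_q).max())
```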
From what I can tell, this hosts multiple fine-tuned deltas and hot-swaps them as needed. Incredible optimization. It's like going from AMIs to ECS or Kubernetes.
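The core idea can be sketched in a few lines of NumPy (a minimal illustration of the math, not Punica's actual kernels): one shared base weight is loaded once, each finetune is just a pair of skinny low-rank matrices, and "hot swapping" means picking which pair to apply per request.

```python
import numpy as np

d, r = 8, 2                        # hidden dim, LoRA rank (tiny for the demo)
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))    # shared base weight, loaded once for everyone

def make_lora():
    # Each adapter is only d*r + r*d params, vs d*d for a full delta.
    return rng.standard_normal((d, r)), rng.standard_normal((r, d))

adapters = {"user_a": make_lora(), "user_b": make_lora()}  # hypothetical tenants

def forward(x, adapter_id):
    A, B = adapters[adapter_id]    # "hot swap": select the delta per request
    return x @ W + x @ A @ B       # base output + low-rank correction

x = rng.standard_normal((1, d))
print(forward(x, "user_a").shape)  # (1, 8)
```

Because `x @ W` dominates the cost and is shared, requests for different adapters can be batched together on the same GPU, which is where the utilization win comes from.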
I'm curious if there is a quality argument to be made: imagine needing to finetune k different classifiers...
Before this work, we could train a single multi-label classifier by pooling the training sets, and deploy it as 1 LoRA
Now, we can have k distinct classifiers, and not risk them interfering with one another
Any sense of, in realistic scenarios, when the quality of k distinct LoRAs would be better?
Any thoughts as to how this would come together with serving frameworks like vLLM, lmdeploy, Triton Inference Server, etc?
We call on the open source community to help us integrate Punica with all of these frameworks, so the whole community can benefit from the efficiency improvements!
Looking forward to collaborating with TVM and MLC to reach more users :)