Historically, average RAM has grown far faster than linearly, and there really hasn't been anything pressing manufacturers to push the envelope here in the past few years... until now.
It could be that LLM model sizes keep increasing such that we continue to require cloud consumption, but I suspect the sizes will not increase as quickly as hardware for inference.
Given how useful GPT-4 is already, maybe one more iteration would unlock the vast majority of practical use cases.
I think people will be surprised that consumers ultimately end up benefitting far more from LLMs than the providers. There's not going to be much moat or differentiation to defend margins... more of a race to the bottom on pricing.
What I think will happen is that more companies will come to the realization it's in their best interest to open their giant models. The cost of training all those giant models is already a sunk cost. If there's no profit to be made by keeping a model proprietary, why not open it to gain or avoid losing mind-share, and to mess with competitors' plans?
First, it was LLaMA, with up to 65B params, opened against Meta's wishes. Then, it was LLaMA 2, with up to 70B params, opened by Meta on purpose, to mess with Google's and Microsoft/OpenAI's plans. Now, it's Falcon 180B. Like you, I'm wondering, what comes next?
I think you guys are missing a massive technical consideration which is cost. Training cost, offering cost. As with everything else in tech, outside of the bubble created by ZIRP over the last decade and a half (and the entire two generations of tech workers who never learned this important lesson thus far in their careers), costs matter and are a primary driver of technology success.
If you attached dollar costs to these models above, if the data was available, you’d quickly discover who (if anyone) has a sustainable business model and who doesn’t.
A sustainable model is what determines, long term, whether a technology is available and whether that leads to further improvement (and increasing sustainability/financial value).
They sure didn't try very hard to secure it. I wonder if it was their strategy all along.
Unless I'm misunderstanding, doesn't OpenAI have a very vested interest to keep making their products so good/so complex/so large that consumer hobbyists can't just `git clone` an alternative that's 95% as good running locally?
The magic of OpenAI is their training data and architecture.
There is a real risk that a model gets leaked.
Of course battery life would be a concern there, so I think LLM usage on phones will remain in the cloud.
Haven't studied phone RAM capacity growth rates in detail though
or data quality, you get more from small models if you use high quality data
LLMs make a great skill sharing possible: they learn from some people through the web and books, and then assist other people with their particular problems. This level of sharing and customisation is even greater and more accessible than open source.
Should be pointed out that this didn't just happen out of thin air. These open models still cost millions of dollars to create. Meta let the genie out of the bottle, but it won't be free forever.
This particular model was funded by the UAE government. If they could do it, it should be similarly possible for a western government to create and release one as a public good.
Question is which is the last model one might install to satisfy all needs.
And you could do the same thing without even changing the socket by including RAM on the CPU package as an L4 cache. Some of the Intel server CPUs are already doing this.
Privacy is the second case: I don't want to leak all my great ideas or data to OpenAI or anyone else.
(this is not financial advice and i am not a financial advisor.)
I have cut many hours of debugging thanks to it. I could find issues easily while on call, in a short conversation, when previously that was reserved for a post-mortem task.
Even reading documentation is nothing like before: once, I was looking for a single command to upload and presign an object in S3. The SDK has tens of methods, each of which requires careful scanning to see whether it does what I want. Going through the documentation thoroughly would've taken me hours. GPT-4 simply answered immediately: no, there's no single operation for that.
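For anyone wondering what the two-step version looks like, here is a minimal sketch. It assumes the Python SDK (boto3), which the comment above does not specify, and the helper name `upload_and_presign` is made up for illustration. The client is passed in, so the function only relies on the two client methods it calls, `upload_file` and `generate_presigned_url`.

```python
def upload_and_presign(s3_client, local_path, bucket, key, expires_in=3600):
    """Upload a local file to S3, then return a time-limited download URL.

    Hypothetical helper: `s3_client` is assumed to be a boto3 S3 client,
    e.g. the result of boto3.client("s3").
    """
    # Step 1: upload the file (there is no combined upload-and-presign call).
    s3_client.upload_file(local_path, bucket, key)

    # Step 2: create a presigned GET URL for the object just uploaded.
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,  # URL lifetime in seconds
    )
```

With boto3 installed and credentials configured, a call would look like `upload_and_presign(boto3.client("s3"), "report.pdf", "my-bucket", "reports/report.pdf")`, where the file, bucket and key names are placeholders.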
Is that not why OpenAI is ahead right now? For free, you can have access to powerful AI on anything with a web browser. You don't need to wait for your SSD to load the model, page it into memory and swap your preexisting processes like it would on a local machine. You don't need to worry about the local battery drain, heat, memory constraints or hardware limitations. If you can read Hacker News, you can use AI.
Given the current performance of local models, I bet OpenAI is feeling pretty comfortable from where they're standing. Most people don't have mobile devices with enough RAM to load a 13B, 4-bit Llama quantization. Running a 180B model (much less a GPT-4 scale model) on consumer hardware is financially infeasible. Running it at scale in the cloud is pennies on the dollar.
I'm not fond of OpenAI in the slightest, but if you've followed the state of local models recently it's clear why they keep coming out ahead.
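To put rough numbers on the RAM point, here is a back-of-the-envelope sketch. It counts the weights only and assumes exactly 4 (or 16) bits per parameter; real quantized files carry some extra overhead, and the KV cache and runtime buffers come on top, so treat the results as lower bounds.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Parameter counts mentioned in the thread: 13B, 70B, 180B.
for name, n_params in [("13B", 13e9), ("70B", 70e9), ("180B", 180e9)]:
    q4 = weight_memory_gb(n_params, 4)
    fp16 = weight_memory_gb(n_params, 16)
    print(f"{name}: ~{q4:.1f} GB at 4-bit, ~{fp16:.1f} GB at 16-bit")
```

By this estimate a 13B model at 4-bit needs about 6.5 GB for weights, and a 180B model about 90 GB.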
What are some of the key aspects of scenarios where this commodification happens? Where it doesn't?
Speaking descriptively (not normatively), I see a lot of possibilities about how things will unfold hinging on (a) licensing, (b) desire for recent data, (c) desire for private data, (d) regulation.
When does this guy sleep?
I don't think he has since July.
Minimal-overhead or zero-cost abstractions around deep learning libraries implemented in those languages give some hope. People like ggerganov are not afraid of the 'don't roll your own deep learning library' dogma, and now we can see from the results why DL on the edge, and local AI, is the future of efficiency in deep learning.
We'll see, but Python just can't compete on speed at all, hence Modular's Mojo compiler is another one that solves the problem properly with almost 1:1 familiarity with Python.
The problem is CUDA, not Python.
LLMs are uniquely suited to local inference in projects like GGML because they are so RAM bandwidth heavy (and hence relatively compute light), and relatively simple. Your kernel doesn't need to be hyper-optimized by 35 Nvidia engineers in 3 stacks before it's fast enough to start saturating the memory bus generating tokens.
And yet it's still an issue... For instance, llama.cpp is having trouble getting prompt ingestion performance in a native implementation comparable to cuBLAS, even though they theoretically have a performance advantage by using the quantization directly.
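The bandwidth-bound argument above can be sketched as a simple upper bound: if generating each token requires reading every weight from memory once, tokens per second cannot exceed memory bandwidth divided by model size. The numbers below are assumptions for illustration (800 GB/s is the advertised M2 Ultra figure; 90 GB is roughly a 180B model at 4 bits per weight), and real throughput will be lower. Prompt ingestion is a different story because it processes many tokens per weight read and so is compute-bound, which is where the cuBLAS gap shows up.

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound on single-stream generation speed when memory-bound.

    Assumes each generated token streams all weights from memory once
    and ignores compute time, KV-cache reads and other overhead.
    """
    return bandwidth_gb_s / model_size_gb

# Illustrative, assumed figures (not measurements).
print(f"{max_tokens_per_second(800, 90):.1f} tokens/s ceiling for a ~90 GB model")
print(f"{max_tokens_per_second(800, 6.5):.1f} tokens/s ceiling for a ~6.5 GB model")
```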
There is no language war. Use whatever tool is necessary to achieve effective results for accomplishing the mission.
The problem is the bus, CUDA, and the sheer volume of data that needs to be transferred.
PyTorch itself is actually a wrapper around libtorch, which is written in C++.
The compilation step of PyTorch 2.0 provides a sizeable improvement, but not the 2 orders of magnitude you'd expect from Python to C++ migrations. The gain from compilation is due to the backend more so than Python itself. See Triton for example.
The user experience of working with the language is terrible because most tasks it is used for go way beyond the "scripting" scenario that Python was primarily made for (aside from also being an easy language to pick up and use).
system_info: n_threads = 4 / 24
Am I seeing correctly in the video that this ran on only 4 threads?
and https://huggingface.co/TheBloke/Falcon-180B-Chat-GGUF if you want to try
An M2 Ultra, while consumer tech, is affordable to only a fairly small % of the world population.
As more and more models become open and are able to be run locally, the precedent gets stronger (which is good for the end consumer in my opinion).