Running a 180B parameter LLM on a single Apple M2 Ultra

(twitter.com)

255 points · tbruckner · 2y ago · 141 comments

59 comments · 15 top-level

adam_arthur · 2y ago · 22 in thread

Even a linear growth rate of average RAM capacity would obviate the need to run current SOTA LLMs remotely in short order.

Historically average RAM has grown far faster than linear, and there really hasn't been anything pressing manufacturers to push the envelope here in the past few years... until now.

It could be that LLM model sizes keep increasing such that we continue to require cloud consumption, but I suspect the sizes will not increase as quickly as hardware for inference.

Given how useful GPT-4 is already. Maybe one more iteration would unlock the vast majority of practical use cases.

I think people will be surprised that consumers ultimately end up benefitting far more from LLMs than the providers. There's not going to be much moat or differentiation to defend margins... more of a race to the bottom on pricing

cs702 · 2y ago

I agree: No one has any technological advantage when it comes to LLMs anymore. Some companies, like OpenAI, may have other advantages, like an ecosystem of developers. But most of the gobs of money that so many companies have burned to train giant proprietary models is unlikely to see any payback.

What I think will happen is that more companies will come to the realization it's in their best interest to open their giant models. The cost of training all those giant models is already a sunk cost. If there's no profit to be made by keeping a model proprietary, why not open it to gain or avoid losing mind-share, and to mess with competitors' plans?

First, it was LLaMA, with up to 65B params, opened against Meta's wishes. Then, it was LLaMA 2, with up to 70B params, opened by Meta on purpose, to mess with Google's and Microsoft/OpenAI's plans. Now, it's Falcon 180B. Like you, I'm wondering, what comes next?

foobiekr · 2y ago

The cost isn’t sunk cost at all. These models need to be trained and retrained as data sets increase. Putting aside historical cutoff points, there’s a lot of data and kinds of data not currently used, and the costs even to train the current models are incredible.

I think you guys are missing a massive technical consideration which is cost. Training cost, offering cost. As with everything else in tech, outside of the bubble created by ZIRP over the last decade and a half (and the entire two generations of tech workers who never learned this important lesson thus far in their careers), costs matter and are a primary driver of technology success.

If you attached dollar costs to these models above, if the data was available, you’d quickly discover who (if anyone) has a sustainable business model and who doesn’t.

A sustainable model is what determines long term whether a technology is available and whether that leads to further improvement (and increasing sustainability/financial value).

bugglebeetle · 2y ago

I think it’s the opposite. Models will become more commoditized and closed/invisible as the basis of other service offerings. Apple isn’t going to start offering general API access to the model they’re training, but will bake it into a bunch of stuff and maybe give platform developers limited access. Meta will probably continue to drive the commoditization train because they have a killer ML/AI team, but the same thing will likely happen there once it’s the basis for a service that generates money.

lambda_garden · 2y ago

> LLaMA, with up to 65B params, opened against Meta's wishes

They sure didn't try very hard to secure it. I wonder if it was their strategy all along.

mistymountains · 2y ago

Cool it with the italics.

MuffinFlavored · 2y ago

> Given how useful GPT-4 is already. Maybe one more iteration would unlock the vast majority of practical use cases.

Unless I'm misunderstanding, doesn't OpenAI have a very vested interest to keep making their products so good/so complex/so large that consumer hobbyists can't just `git clone` an alternative that's 95% as good running locally?

reckless · 2y ago

Indeed they do, however companies like Meta (altruistically or not) are preventing OpenAI from building 'moats' by releasing models and architecture details in a very public way.

chongli · 2y ago

What is OpenAI's moat? Loads of people outside the company are working on alternative models. They may have a lead right now but will it last a few years? Will it even last 6 months?

Frannyies · 2y ago

They have a huge cost incentive to optimize it for runtime.

The magic of OpenAI is their training data and architecture.

There is a real risk that a model gets leaked.

ls612 · 2y ago

For me the test is: when will a Siri-LLM be able to run locally on my iPhone at GPT-4 level or better? 2030? Farther out? Never, because of governments forbidding it? To what extent will improvements be driven by the last gasps of Moore’s Law vs by improving model architectures to be more efficient?

adam_arthur · 2y ago

Given that phones are a few years behind PCs on RAM, likely whenever the average PC can do it, plus a few years. There are phones out there with 24GB of RAM already, it looks like.

Of course battery life would be a concern there, so I think LLM usage on phones will remain in the cloud.

Haven't studied phone RAM capacity growth rates in detail though

bugglebeetle · 2y ago

Apple is already training their own LLM to rival GPT-4, so I doubt it will take that long.

visarga · 2y ago

> vs by improving model architectures to be more efficient?

or data quality, you get more from small models if you use high quality data

visarga · 2y ago

> I think people will be surprised that consumers ultimately end up benefitting far more from LLMs than the providers.

LLMs make possible a great skill sharing: they learn from some people through the web and books, and then assist other people with their particular problems. This level of sharing and customisation is even greater and more accessible than open source.

passion__desire · 2y ago

All the great points Salman Khan made about Khan Academy in his famous TED talk apply here. The only difference is LLMs can go from ELI5 to ELIPhD in just a few back-and-forths. Then, to put a cherry on top, you can ask it to summarize the conversation in a poem written in the style of Walt Whitman.

gorbypark · 2y ago

I can't wait for my phone to have something like 512GB-1TB of RAM to run some really interesting models locally :D

AnthonyMouse · 2y ago

You can buy 768GB of DDR3 and an Ivy Bridge Xeon E5 to put it in for a total of around $500, most of which is the memory. (The CPUs wouldn't be fast for a model that size though.)

ramesh31 · 2y ago

>I think people will be surprised that consumers ultimately end up benefitting far more from LLMs than the providers. There's not going to be much moat or differentiation to defend margins... more of a race to the bottom on pricing

Should be pointed out that this didn't just happen out of thin air. These open models still cost millions of dollars to create. Meta let the genie out of the bottle, but it won't be free forever.

logicchains · 2y ago

>These open models still cost millions of dollars to create. Meta let the genie out of the bottle, but it won't be free forever.

This particular model was funded by the UAE government. If they could do it, it should be similarly possible for a western government to create and release one as a public good.

noiv · 2y ago

RAM may be growing, but free and acceptable content to train models isn't.

Question is which is the last model one might install to satisfy all needs.

tomohelix · 2y ago

RAM is easy. The hard part is making a unified-memory SoC like Apple's. From what I know, Apple performance is almost magic. And whatever Apple is making, they are at peak capacity already and they can't make more even if they want to. Nobody else has a comparable technology. Apple is in its own league.

AnthonyMouse · 2y ago

Apple is just using a wide memory bus, the same as GPUs and server-class x86 CPUs do. It's not even hard, it's just not something desktop CPUs previously had any use for so the current sockets don't support it.

And you could do the same thing without even changing the socket by including RAM on the CPU package as an L4 cache. Some of the Intel server CPUs are already doing this.

randomopining · 2y ago · 6 in thread

Are there any actual use cases for running this stuff on a local computer? Or are most of these models actually suited to run on remote clusters?

acdha · 2y ago

Here’s a simple one: corporate policy doesn’t allow you to send company data to a cloud service. There are a ton of people with significant budgets in that situation.

zamadatix · 2y ago

I think that use case still matches the remote cluster use case better as a policy like "We can't use cloud" doesn't mean "we have to use our individual local workstations". This approach really makes sense for the "we have 1-3 people that want to really push this on a budget", beyond that big iron makes more sense. And this still helps with that IMO, it's just one step in getting to there from "only the largest can play".

beardedwizard · 2y ago

Absolutely! Local experimentation. I built a transcription and summarization pipeline for $0. If I want it to be faster, I can move it to beefier hardware. If I fail 1000s of times it still costs me nothing.

Privacy is the second case: I don't want to leak all my great ideas or data to OpenAI or anyone else.

logicchains · 2y ago

The use-case is you want to generate pornographic, violence-depicting or politically-incorrect content, and would rather buy a powerful computer than rent a server (or you already own a powerful computer).

beardedwizard · 2y ago

You what? You can run smaller and plenty powerful models on an M1 MacBook. Idk what the porn and violence angle is but maybe keep that one to yourself.

catchnear4321 · 2y ago

it seems infinitely cheaper to jailbreak poorly implemented publicly-facing gimmick LLM “use cases” and “demonstrations” that rely on / thinly veneer commercial apis.

(this is not financial advice and i am not a financial advisor.)

m3kw9 · 2y ago · 5 in thread

OpenAI's moat will soon largely be UX. Anyone can do plugins, code etc but when operating by everyday users the best UX wins after LLMs become commodified. Just look at standalone digital cameras vs mobile phone cams from Apple.

ZoomerCretin · 2y ago

GPT4 is still leagues ahead of the competition. Open source LLMs will be used more widely, but for the most demanding tasks, there is no alternative for GPT4.

eurekin · 2y ago

Anecdata confirmation: I've been toying around with LLMs for simple fun stuff, but when it comes to real work, GPT-4 delivers in spades.

I have cut many hours of debugging thanks to it. I could find issues easily, on-call, in a short conversation, when previously that was reserved as a post-mortem task.

Even reading documentation is nothing like before: once, I was looking for a single command to upload and presign an object in S3. The SDK has tens of methods, which require careful scanning to see if they do what I want. Going through the documentation thoroughly would've taken me hours. GPT-4 immediately found that, no, there's no operation for that.

smoldesu · 2y ago

> but when operating by everyday users the best UX wins

Is that not why OpenAI is ahead right now? For free, you can have access to powerful AI on anything with a web browser. You don't need to wait for your SSD to load the model, page it into memory and swap your preexisting processes like it would on a local machine. You don't need to worry about the local battery drain, heat, memory constraints or hardware limitations. If you can read Hacker News, you can use AI.

Given the current performance of local models, I bet OpenAI is feeling pretty comfortable from where they're standing. Most people don't have mobile devices with enough RAM to load a 13b, 4-bit Llama quantization. Running a 180B model (much less a GPT-4 scale model) on consumer hardware is financially infeasible. Running it at-scale, in the cloud is pennies on the dollar.

I'm not fond of OpenAI in the slightest, but if you've followed the state of local models recently it's clear why they keep coming out ahead.

anurag6892 · 2y ago

this advantage is not specific to OpenAI right? Any big cloud provider like Amazon/Google can host these open LLM models.

xpe · 2y ago

I buy this general argument, at least to the extent that 'good enough' LLMs get commodified.

What are some of the key aspects of scenarios where this commodification happens? Where it doesn't?

Speaking descriptively (not normatively), I see a lot of possibilities about how things will unfold hinging on (a) licensing, (b) desire for recent data, (c) desire for private data, (d) regulation.

regularfry · 2y ago · 4 in thread

4-bit quantised model, to be precise.

When does this guy sleep?

beardedwizard · 2y ago

Whatever he is doing, we must protect this man at all costs.

esafak · 2y ago

With his name recognition he could easily raise over $10m in funding for a seed round and sleep well, if he wanted.

swyx · 2y ago

he already has? lol https://news.ycombinator.com/item?id=36215651

ramesh31 · 2y ago

>When does this guy sleep?

I don't think he has since July.

rvz · 2y ago · 4 in thread

Totally makes sense for C++ or Rust based AI models for inference instead of the over-bloated networks run on Python with sub-optimal inference and fine-tuning costs.

Minimal overhead or zero cost abstractions around deep learning libraries implemented in those languages gives some hope that people like ggerganov are not afraid of the 'don't roll your own deep learning library' dogma and now we can see the results as to why DL on the edge and local AI, is the future of efficiency in deep learning.

We'll see, but Python just can't compete on speed at all; hence Modular's Mojo compiler is another one that solves the problem properly, with almost 1:1 familiarity to Python.

brucethemoose2 · 2y ago

The actual inference is not run in Python in PyTorch, and it's usually not bottlenecked by it.

The problem is CUDA, not Python.

LLMs are uniquely suited to local inference in projects like GGML because they are so RAM-bandwidth heavy (and hence relatively compute light), and relatively simple. Your kernel doesn't need to be hyper-optimized by 35 Nvidia engineers in 3 stacks before it's fast enough to start saturating the memory bus generating tokens.

And yet it's still an issue... For instance, llama.cpp is having trouble getting prompt ingestion performance in a native implementation comparable to cuBLAS, even though they theoretically have a performance advantage by using the quantization directly.
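To put a rough number on the "saturating the memory bus" point: if token generation is purely bandwidth bound, each new token has to read every weight once, so bandwidth divided by model size gives a ceiling on tokens per second. The sketch below assumes Apple's advertised 800 GB/s for the M2 Ultra and a 180B model at roughly 4.5 bits per weight; both figures are assumptions brought in for illustration, not numbers from the demo.

```python
# Rough upper bound on single-stream decode speed when generation is
# memory-bandwidth bound: every new token has to stream all weights once.
# Assumptions: 800 GB/s is Apple's advertised M2 Ultra memory bandwidth;
# ~101 GB is a 180B model at roughly 4.5 bits per weight.

def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Ceiling on tokens/s if each token reads the full model from RAM."""
    return bandwidth_gb_s / model_gb

M2_ULTRA_BANDWIDTH_GB_S = 800.0
MODEL_GB = 180e9 * 4.5 / 8 / 1e9  # parameters * bits per weight -> GB

ceiling = max_tokens_per_second(M2_ULTRA_BANDWIDTH_GB_S, MODEL_GB)
print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/s")
```

Real throughput would sit below this ceiling, since the estimate ignores KV-cache reads, compute time, and any bandwidth the GPU cannot actually reach. Prompt ingestion is a different regime (many tokens per weight read), which is why it leans on compute rather than bandwidth.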

survirtual · 2y ago

Python is generally just the glue language for underlying, highly optimized C++ libs. The improvements aren't just about languages. I would imagine Facebook is less focused on inference, so didn't bother to make a highly optimized LLM inference engine. There also just isn't a business case for CPU-bound LLMs at an enterprise scale, so why code for that? Additionally, llama.cpp can be called by Python and Python could still do all the glue.

There is no language war. Use whatever tool is necessary to achieve effective results for accomplishing the mission.
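A toy illustration of the "glue" argument, using only the standard library: both functions below compute the same dot product, but in the second one the per-element loop runs inside C-implemented builtins rather than as Python bytecode. This is a CPython-specific sketch and only a stand-in for what PyTorch or llama.cpp bindings do; the function names are made up for the example.

```python
import operator
import timeit

def dot_interpreted(a, b):
    """Dot product with the loop executed as Python bytecode."""
    total = 0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_delegated(a, b):
    """Same result, but the per-element loop runs inside C-implemented builtins."""
    return sum(map(operator.mul, a, b))

a = list(range(100_000))
b = list(range(100_000, 0, -1))

# Integer arithmetic, so the two results must match exactly.
assert dot_interpreted(a, b) == dot_delegated(a, b)

for fn in (dot_interpreted, dot_delegated):
    seconds = timeit.timeit(lambda: fn(a, b), number=20)
    print(f"{fn.__name__}: {seconds:.3f} s for 20 runs")
```

The calling code stays in Python either way; what changes is where the inner loop runs, and the same holds when the callee is a GPU kernel or a quantized matmul.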

PartiallyTyped · 2y ago

Python is not really the bottleneck in LLM applications. It is for tabular RL, but certainly not for deep RL (I have had discussions with DM folk over this in r/RL, and the ppl from Stable Diffusion).

The problem is the bus, CUDA, and the sheer volume of data that needs to be transferred.

PyTorch itself is actually a wrapper around libtorch, which is written in C++.

The compilation step of PyTorch 2.0 provides a sizeable improvement, but not 2 orders of magnitude as you’d expect from Python to C++ migrations. The gain comes from the backend more so than from Python itself. See Triton for example.

neonsunset · 2y ago

I'm not sure why this is downvoted but wanted to chime in that ML successes are taking place, first and foremost, despite Python's shortcomings, which are many.

The user experience of working with the language is terrible because most tasks it is used for go way beyond the "scripting" scenario, which Python was primarily made for (aside from also being an easy language to pick up and use).

logicchains · 2y ago · 1 in thread

Pretty amazing that in such a short span of time we went from people being amazed how powerful GPT3.5 was upon its release to people being able to run something equivalently powerful locally.

zagfai · 2y ago

However, GPT3.5 did not surprise me but GPT4 did. 3.5 is just a kid.

pella · 2y ago · 1 in thread

Is this an M2 Ultra with 192 GB of unified memory, or the standard version with 64 GB of unified memory?

zargon · 2y ago

4-bit quantized 180B will not fit in 64GB. You'll need over 100 GB for that.
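The arithmetic behind that estimate can be sketched in a few lines. Pure 4-bit weights for 180B parameters come to 90 GB; block-quantized formats also store per-block scales, which pushes the total past 100 GB. The 4.5 bits-per-weight figure below assumes a layout like llama.cpp's Q4_0 (32 four-bit weights plus one 16-bit scale per block); which quantization the demo actually used is not stated in the thread.

```python
# Back-of-the-envelope weight-memory estimate for a quantized model.
# Assumption: ~4.5 bits per weight, as in a Q4_0-style layout
# (32 four-bit weights plus one 16-bit scale per block).

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to hold the weights alone (no KV cache, no activations)."""
    return n_params * bits_per_weight / 8

N_PARAMS = 180e9  # nominal parameter count for a "180B" model

ideal_gb = weight_bytes(N_PARAMS, 4.0) / 1e9  # pure 4-bit, no scales
q4_0_gb = weight_bytes(N_PARAMS, 4.5) / 1e9   # 4-bit plus per-block scales

print(f"pure 4-bit: {ideal_gb:.1f} GB")
print(f"~4.5 bpw:   {q4_0_gb:.1f} GB")
```

Either way the weights alone exceed 64 GB before any KV cache or runtime overhead is counted.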

tiffanyh · 2y ago · 1 in thread

  system_info: n_threads = 4 / 24

Am I seeing correctly in the video that this ran on only 4 threads?

wmf · 2y ago

It's using the GPU so I guess not that many CPU threads are needed to feed the GPU.

sbierwagen · 2y ago

The screenshot shows a working set size of 147,456 MB, so he's using the Mac Studio with 192 GB of RAM?
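A quick check of that number, assuming the log's "mb" means MiB: 147,456 is exactly 144 × 1024, i.e. 144 GiB, which is three quarters of 192 and more than twice 64. That would be consistent with a GPU working-set cap of 75% of unified memory on a 192 GB machine, though the cap itself is an assumption here and not something the screenshot states.

```python
# Sanity-check the working-set figure quoted from the screenshot.
# Assumption: "mb" in the log means MiB (1024 * 1024 bytes).
working_set_mib = 147_456
working_set_gib = working_set_mib / 1024

print(f"working set: {working_set_gib:.0f} GiB")

for ram_gib in (64, 192):
    share = working_set_gib / ram_gib
    print(f"{ram_gib} GiB machine: working set would be {share:.0%} of RAM")
```

A share above 100% rules out the 64 GB configuration.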

homarp · 2y ago

https://www.reddit.com/r/LocalLLaMA/comments/16bynin/falcon_... has some more data, like sample answers at various levels of quantization

and https://huggingface.co/TheBloke/Falcon-180B-Chat-GGUF if you want to try

doctoboggan · 2y ago

Georgi is doing so much to democratize LLM access, I am very thankful he is doing it all on Apple silicon!

Havoc · 2y ago

Great progress, but I also can't help but feel a sense of apprehension on the access front.

An M2 Ultra, while consumer tech, is affordable to only a fairly small % of the world population.

ViktorBash · 2y ago

It's refreshing to see how fast open LLMs are advancing in terms of the models available. A year ago I thought that, besides the novelty of it, running LLMs locally would be nowhere close to stuff like OpenAI's closed models in terms of utility.

As more and more models become open and are able to be run locally, the precedent gets stronger (which is good for the end consumer in my opinion).

two_in_one · 2y ago

Just wondering: what are local LLMs used for today? So far they look more like... promising.

growt · 2y ago

So how much RAM did the machine have?
