undefined | Better HN

undefined | Better HN

0 comments

seanmcdirmid8mo ago

LocalLLMs would be useful for low latency local language processing/home control, assuming they ever become fast enough where the 500ms to 1s network latency becomes a dominate factor in having a fluid conversation with a voice assistant. Right now the pauses are unbearable for anything but one way commands (Siri, do something! - 3 seconds later it starts doing the thing...that works but it wouldn't work if Siri needed to ask follow up questions). This is even more important if we consider low latency gaming situations.

Mobile applications are also relevant. An LLM in your car could be used for local intelligence. I'm pretty sure self driving cars use some about of local AI already (although obviously not LLM, and I don't really know how much of their processing is local vs done on a server somewhere).

If models stop advancing at a fast clip, hardware will eventually become fast and cheap enough that running models locally isn't something we think about as being a non-sensical luxury, in the same way that we don't think that rendering graphics locally is a luxury even though remote rendering is possible.

jameshart8mo ago

> Servers will always have way more compute power than edge nodes

This doesn't seem right to me.

You take all the memory and CPU cycles of all the clients connected to a typical online service, compared to the memory and CPU in the datacenter serving it? The vast majority of compute involved in delivering that experience is on the client. And there's probably vast amounts of untapped compute available on that client - most websites only peg the client CPU by accident because they triggered an infinite loop in an ad bidding war; imagine what they could do if they actually used that compute power on purpose.

But even doing fairly trivial stuff, a typical browser tab is using hundreds of megs of memory and an appreciable percentage of the CPU of the machine it's loaded on, for the duration of the time it's being interacted with. Meanwhile, serving that content out to the browser took milliseconds, and was done at the same time as the server was handling thousands of other requests.

Edge compute scales with the amount of users who are using your service: each of them brings along their own hardware. Server compute has to scale at your expense.

Now, LLMs bring their special needs - large models that need to be loaded into vast fast memory... there are reasons to bring the compute to the model. But it's definitely not trivially the case that there's more compute in servers than clients.

pdpi8mo ago

As an industry, we've swung from thin clients to fat clients and back countless times. I'm sure LLMs won't be immune to that phenomenon.

Closi8mo ago

IMO the benefit of a local LLM on a smartphone isn't necessarily compute power/speed - it's reliability without a reliance on connectivity, it can offer privacy guarantees, and assuming the silicon cost is marginal, could mean you can offer permanent LLM capabilities without needing to offer some sort of cloud subscription.

hapticmonkey8mo ago

If the future is AI, then a future where every compute has to pass through one of a handful of multinational corporations with GPU farms...is something to be wary of. Local LLMs is a great idea for smaller tasks.

Nevermark8mo ago

Boom! [0]

> Deepseek-r1 was loaded and ran locally on the Mac Studio

> M3 Ultra chip [...] 32-core CPU, an 80-core GPU, and the 32-core Neural Engine. [...] 512GB of unified memory, [...] memory bandwidth of 819GB/s.

> Deepseek-r1 was loaded [...] 671-billion-parameter model requiring [...] a bit less than 450 gigabytes of [unified] RAM to function.

> the Mac Studio was able to churn through queries at approximately 17 to 18 tokens per second

> it was observed as requiring 160 to 180 Watts during use

Considering getting this model. Looking into the future, a Mac Studio M5 Ultra should be something special.

[0] https://appleinsider.com/articles/25/03/18/heavily-upgraded-...

waterTanuki8mo ago

I regularly use local LLMs at work (full stack dev) due to restrictions and occasionally I get some results comparable to gpt-5 or opus 4

rowanG0778mo ago

That's assuming diminishing returns won't hit hard. If a 10x smaller local model is 95%(Whatever that means) as good as the remote model it makes sense to use local models most of the time. It remains to be seen if that will happen but it's certainly not unthinkable imp.

PaulRobinson8mo ago

Apple literally mentioned local LLMs in the event video where they announced this phone and others.

Apple's privacy stance is to do as much as possible on the user's device and as little as possible in cloud. They have iCloud for storage to make inter-device synch easy, but even that is painful for them. They hate cloud. This is the direction they've had for some years now. It always makes me smile that so many commentators just can't understand it and insist that they're "so far behind" on AI.

All the recent academic literature suggests that LLM capability is beginning to plateau, and we don't have ideas on what to do next (and no, we can't ask the LLMs).

As you get more capable SLMs or LLMs, and the hardware gets better and better (who _really_ wants to be long on nVIDIA or Intel right now? Hmm?), people are going to find that they're "good enough" for a range of tasks, and Apple's customer demographic are going to be happy that's all happening on the device in their hand and not on a server [waves hands] "somewhere", in the cloud.

fennecfoxy8mo ago

I think they will be, but more for hand-off. Local will be great for starting timers, adding things to calendar, moving files around. Basic, local tasks. But it also needs to be intelligent enough to know when to hand off to server-side model.

Android crowd has been able to run LLMs on-device since LlamaCPP first came out. But the magic is in the integration with OS. As usual there will be hype around Apple, idk, inventing the very concept of LLMs or something. But the truth is neither Apple nor Android did this; only the wee team that wrote the attention is all you need paper + the many open source/hobbyist contributors inventing creative solutions like LoRA and creating natural ecosystems for them.

That's why I find this memo so cool (and will once again repost the link): https://semianalysis.com/2023/05/04/google-we-have-no-moat-a...

brookst8mo ago

Couldn’t you apply that same thinking to all compute? Servers will always have more, timesharing means lower cost, people will probably only ever own dumb terminals?

MPSimmons8mo ago

The crux is how big the L is in the local LLMs. Depending on what it's used for, you can actually get really good performance on topically trained models when leveraged for their specific purpose.

alwillis8mo ago

> don't see why not, then the era of viable local LLM inferencing is upon us. I don't think local LLMs will ever be a thing except for very specific use cases.

I disagree.

There's a lot of interest in local LLMs in the LLM community. My internet was down for a few days and did I wish I had a local LLM on my laptop!

There's a big push for privacy; people are using LLMs for personal medical issues for example and don't want that going into the cloud.

Is it necessary to talk to a server just to check out a letter I wrote?

Obviously with Apple's release of iOS 26 and macOS 26 and the rest of their operating systems, tens of millions of devices are getting a local LLM with 3rd party apps that can take advantage of them.

unethical_ban8mo ago

It's a thing right now.

I'm running Qwen 30B code on my framework laptop to ask questions about ruby vs. python syntax because I can, and because the internet was flaky.

At some point, more doesn't mean I need it. LLMs will certainly get "good enough" and they'll be lower latency, no subscription, and no internet required.

hotstickyballs8mo ago

If compute power is the deciding factor server vs edge discussion then we’d never have smartphones.

nsonha8mo ago

local LLM may not be good enough for answering questions (which I think won't be true really soon) or generating images, but it today should be good enough to infer deeplinks and app extension calls or agentic walk-through... and ushers a new era of controlling phone by voice command.

j / k navigate · click thread line to collapse