If the M5 generation gets this GPU upgrade (and I don't see why it wouldn't), then the era of viable local LLM inference is upon us.
That's the most exciting thing from this Apple event, in my opinion.
PS. I also like the idea of the ultra-thin iPhone Air, the 2x better noise cancellation and live translation of the AirPods 3, the high blood pressure detection of the new Watch, and the bold, sexy orange color of the iPhone 17 Pro. Overall, this is the best set of incremental updates Apple's ecosystem has seen in a while.
Luckily they added the blood pressure check for when you get too excited about the color orange.
edit: It was only the Pros and up which had titanium bodies. The 17s are all aluminum.
Which is a very powerful feature for anyone who likes security or finding bugs in their code. Or other people's code. Even if you didn't really want to find them.
https://www.apple.com/watch/compare/?modelList=watch-series-...
In the past few weeks, the oximeter feature was enabled by a firmware update on the Series 10. Measurements are done on the watch; results are only reported on a phone.
As of September 9, 2025, hypertension notifications are currently under FDA review and expected to be cleared this month, with availability on Apple Watch Series 9 and later and Apple Watch Ultra 2 and later. The feature is not intended for use by people under 22 years old, those who have been previously diagnosed with hypertension, or pregnant persons.
Sounds a bit ironic but I guess it's for legal reasons.
The color line up reminds me of the au MEDIA SKIN phones (Japanese carrier) circa 2007. Maybe it's because I had one back in the day, but I can't help but think they took some influence.
Wow, thanks for sharing the name, these are really good! I don't know why I was surprised to realize that great designers have made fantastic products even in the past...
Some sites with images, for anyone curious: 1. https://www.dezeen.com/2007/01/17/tokujin-yoshioka-launches-... 2. https://spoon-tamago.com/best-of-2007-part-iv/
With the addition of NPUs to the GPU, this story gets even more confusing...
But there isn’t a trivial way to specifically target the neural engine.
If you use Metal / GPU compute shaders, it's going to run exclusively on the GPU. Some inference libraries, like TensorFlow/LiteRT with backend = .gpu, use this.
This change strictly adds matmul acceleration to each GPU core, which is exactly what LLM inference leans on.
Intrigued to explore the A19/M5 and test energy efficiency.
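For a rough sense of why per-core matmul throughput is the number that matters, the arithmetic is easy to sketch. The model size and token rate below are my own illustrative assumptions, not anything Apple has published:

```python
# Back-of-envelope: matmul dominates LLM decoding. Dense decoding costs
# roughly 2 FLOPs per parameter per token (one multiply + one add per
# weight), and almost all of it is matrix multiplication.

def decode_flops_per_token(n_params: float) -> float:
    """Approximate FLOPs to decode one token with a dense model."""
    return 2.0 * n_params

# Hypothetical 8B-parameter on-device model at 20 tokens/s:
n_params = 8e9
tok_per_s = 20
flops_needed = decode_flops_per_token(n_params) * tok_per_s
print(f"sustained compute needed: {flops_needed / 1e12:.2f} TFLOP/s")
```

A fraction of a TFLOP/s sustained is well within reach of a mobile GPU; the harder part is doing it at a power draw the battery can tolerate, which is where dedicated matmul units earn their keep.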
So is the high blood pressure detection. It's not exclusive to the new watch; it also works on the Series 10 and Series 9 watches.
I don't think local LLMs will ever be a thing except for very specific use cases.
Servers will always have way more compute power than edge nodes. As server power increases, people will expect more and more of the LLMs and edge node compute will stay irrelevant since their relative power will stay the same.
Mobile applications are also relevant. An LLM in your car could be used for local intelligence. I'm pretty sure self-driving cars use some amount of local AI already (although obviously not LLMs, and I don't really know how much of their processing is local vs. done on a server somewhere).
If models stop advancing at a fast clip, hardware will eventually become fast and cheap enough that running models locally isn't something we think about as being a nonsensical luxury, in the same way that we don't think of rendering graphics locally as a luxury even though remote rendering is possible.
Even over LTE you're looking at under 120ms coast to coast.
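To put that 120 ms in context, here's a toy comparison of end-to-end response time; the token rates are assumptions I picked for illustration:

```python
# Rough latency comparison: cloud round trip vs. local decode.
# The 120 ms figure is from the comment above; the rest are assumptions.

lte_rtt_s = 0.120            # coast-to-coast LTE round trip (claimed above)
local_tok_per_s = 18         # assumed local decode speed
cloud_tok_per_s = 80         # assumed server decode speed

def time_for_tokens(n, tok_per_s, extra_latency=0.0):
    """Seconds to receive n tokens, including any fixed network latency."""
    return extra_latency + n / tok_per_s

n = 200  # tokens in a typical short reply
print(f"cloud: {time_for_tokens(n, cloud_tok_per_s, lte_rtt_s):.2f} s")
print(f"local: {time_for_tokens(n, local_tok_per_s):.2f} s")
```

With numbers like these, the network round trip is noise compared to decode time, so latency alone isn't a strong argument for local inference; privacy and offline use are the better ones.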
This doesn't seem right to me.
Take all the memory and CPU cycles of all the clients connected to a typical online service and compare it to the memory and CPU in the datacenter serving them: the vast majority of the compute involved in delivering that experience is on the client. And there's probably a vast amount of untapped compute still available on those clients - most websites only peg the client CPU by accident, because they triggered an infinite loop in an ad bidding war; imagine what they could do if they actually used that compute power on purpose.
But even doing fairly trivial stuff, a typical browser tab is using hundreds of megs of memory and an appreciable percentage of the CPU of the machine it's loaded on, for the duration of the time it's being interacted with. Meanwhile, serving that content out to the browser took milliseconds, and was done at the same time as the server was handling thousands of other requests.
Edge compute scales with the amount of users who are using your service: each of them brings along their own hardware. Server compute has to scale at your expense.
Now, LLMs bring their special needs - large models that need to be loaded into vast fast memory... there are reasons to bring the compute to the model. But it's definitely not trivially the case that there's more compute in servers than clients.
A single datacenter machine with state-of-the-art GPUs serving LLM inference can be drawing tens of kilowatts, and you borrow a sizable portion of that for a moment when you run a prompt on one of the heavier models.
A phone that has to count individual watts, or a laptop that peaks at double-digit sustained draw, isn't remotely comparable, and the gap isn't one or two hardware features wide.
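The more interesting metric might be energy per token rather than raw draw, since the server amortizes its kilowatts over many batched users while the phone runs a much smaller model for one. A quick sketch, with every power and throughput figure being an assumption on my part:

```python
# Energy-per-token sketch: a batched server node vs. a phone SoC.
# All power and throughput numbers are illustrative assumptions.

server_watts, server_tok_per_s = 10_000, 5_000   # big node, heavy batching
phone_watts,  phone_tok_per_s  = 5, 15           # small on-device model

def joules_per_token(watts, tok_per_s):
    """Average energy spent per generated token."""
    return watts / tok_per_s

print(f"server: {joules_per_token(server_watts, server_tok_per_s):.2f} J/token")
print(f"phone:  {joules_per_token(phone_watts, phone_tok_per_s):.2f} J/token")
```

Under these made-up numbers the phone is competitive per token, but only because it's running a far smaller, far less capable model - which is really the parent's point restated.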
The problem is people's expectations: they want the model to be smart. People don't have a problem with whether it's local or not; they want the model to be useful.
> Deepseek-r1 was loaded and ran locally on the Mac Studio
> M3 Ultra chip [...] 32-core CPU, an 80-core GPU, and the 32-core Neural Engine. [...] 512GB of unified memory, [...] memory bandwidth of 819GB/s.
> Deepseek-r1 was loaded [...] 671-billion-parameter model requiring [...] a bit less than 450 gigabytes of [unified] RAM to function.
> the Mac Studio was able to churn through queries at approximately 17 to 18 tokens per second
> it was observed as requiring 160 to 180 Watts during use
Considering getting this model. Looking into the future, a Mac Studio M5 Ultra should be something special.
[0] https://appleinsider.com/articles/25/03/18/heavily-upgraded-...
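Those quoted numbers hang together nicely if you treat decoding as memory-bandwidth-bound. DeepSeek-R1 is a mixture-of-experts model with (per DeepSeek's published figures) about 37B of its 671B parameters active per token, so each token only has to stream a fraction of the resident weights:

```python
# Bandwidth-bound ceiling for the Mac Studio numbers quoted above.

bandwidth_gb_s = 819        # M3 Ultra unified-memory bandwidth (quoted)
total_params_b = 671        # total parameters, billions (quoted)
active_params_b = 37        # active params per token (DeepSeek's figure)
weights_gb = 450            # resident weight size (quoted)

# Each decoded token streams roughly the active fraction of the weights:
gb_per_token = weights_gb * active_params_b / total_params_b
ceiling_tok_s = bandwidth_gb_s / gb_per_token
print(f"{gb_per_token:.1f} GB/token -> ceiling ~{ceiling_tok_s:.0f} tok/s")
```

The observed 17-18 tok/s is roughly half that naive ceiling, which is plausible once you account for KV-cache traffic, attention compute, and scheduling overhead.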
Apple's privacy stance is to do as much as possible on the user's device and as little as possible in the cloud. They have iCloud for storage to make inter-device sync easy, but even that is painful for them. They hate cloud. This is the direction they've had for some years now. It always makes me smile that so many commentators just can't understand it and insist that they're "so far behind" on AI.
All the recent academic literature suggests that LLM capability is beginning to plateau, and we don't have ideas on what to do next (and no, we can't ask the LLMs).
As you get more capable SLMs or LLMs, and the hardware gets better and better (who _really_ wants to be long on Nvidia or Intel right now? Hmm?), people are going to find that they're "good enough" for a range of tasks, and Apple's customer demographic are going to be happy that's all happening on the device in their hand and not on a server [waves hands] "somewhere", in the cloud.
Large issues: tokenizers exist, reasoning models are still next-token-prediction instead of having "internal thoughts", RL post-training destroys model calibration
Small issues: they're all trained to write Python instead of a good language, most of the benchmarks are bad, pretraining doesn't use document metadata (i.e. they have to learn from each document without being told the URL or that the documents are written by different people)
The Android crowd has been able to run LLMs on-device since llama.cpp first came out. But the magic is in the integration with the OS. As usual there will be hype around Apple, idk, inventing the very concept of LLMs or something. But the truth is neither Apple nor Android did this; it was the wee team that wrote the "Attention Is All You Need" paper, plus the many open-source/hobbyist contributors inventing creative solutions like LoRA and creating natural ecosystems for them.
That's why I find this memo so cool (and will once again repost the link): https://semianalysis.com/2023/05/04/google-we-have-no-moat-a...
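Since LoRA came up: the trick is small enough to sketch in a few lines of plain Python. This is a toy illustration with made-up 2x2 matrices, not any real training code:

```python
# Minimal LoRA sketch: instead of updating a frozen weight matrix W
# (d_out x d_in), learn a low-rank update B @ A with rank r << d.

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + alpha * B (A x); only A and B are trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + alpha * d for b, d in zip(base, delta)]

# Frozen 2x2 identity "pretrained" weight, rank-1 adapter:
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]            # r x d_in  (r = 1)
B = [[0.5], [0.5]]          # d_out x r
print(lora_forward(W, A, B, [2.0, 4.0]))  # -> [5.0, 7.0]
```

The point is that W stays frozen and only the tiny A and B matrices (r * (d_in + d_out) extra parameters) get trained, which is exactly what made fine-tuning feasible on hobbyist hardware.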
Is it 'Local'? 'Large'? 'Language'?
I disagree.
There's a lot of interest in local LLMs in the LLM community. My internet was down for a few days, and how I wished I had a local LLM on my laptop!
There's a big push for privacy; people are using LLMs for personal medical issues for example and don't want that going into the cloud.
Is it necessary to talk to a server just to check out a letter I wrote?
Obviously with Apple's release of iOS 26, macOS 26, and the rest of their operating systems, tens of millions of devices are getting a local LLM, with third-party apps able to take advantage of it.
I'm running Qwen 30B code on my Framework laptop to ask questions about Ruby vs. Python syntax, because I can, and because the internet was flaky.
At some point, more doesn't mean I need it. LLMs will certainly get "good enough" and they'll be lower latency, no subscription, and no internet required.
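On "no internet required": the main cost of a local model is memory for the weights, and that's easy to estimate. Rough numbers for a 30B-parameter model at common quantization levels (and note that if the Qwen model above is the MoE 30B-A3B variant, only ~3B parameters are active per token, which is why it's quick on a laptop):

```python
# Approximate weight memory for a 30B-parameter model at common
# quantization levels. Bits-per-weight values are typical, not exact.

def weight_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Weight footprint in GB for n_params_b billion parameters."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("fp16", 16), ("q8", 8), ("q4-ish", 4.5)]:
    print(f"{name}: ~{weight_gb(30, bits):.0f} GB of weights")
```

So a 4-bit quant of a 30B model fits comfortably in 32 GB of RAM with room left for the KV cache and the OS, which is exactly the class of laptop people are running these on.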
Or do you have to copy paste into LM studio?
"iPhone4 vs HTC Evo"
But it's not general purpose. Broken by design.
I'll pass. Not going to support this. We need less of this crap not more.
The whole point of CoreML is that your solution uses whatever hardware is available to you, including enlisting a heterogeneous set of units to conquer a large problem. Software written years ago would use the GPU matmul if deployed to a capable machine.
Though I do wonder, given the logarithmic nature of sound perception, are these numbers deceptive in terms of what the user will perceive?
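For reference, decibels are logarithmic in power, so what "2x better" means depends entirely on what got doubled. A quick sketch (pure math, no Apple numbers):

```python
# Converting power ratios to decibels. Halving noise *power* is only
# ~3 dB; the common rule of thumb is that a ~10 dB drop is what
# listeners describe as "half as loud".
import math

def power_ratio_to_db(ratio: float) -> float:
    """dB corresponding to a given power ratio."""
    return 10 * math.log10(ratio)

print(f"halving noise power: {power_ratio_to_db(2):.1f} dB")
print("halving perceived loudness: ~10 dB (rule of thumb)")
```

So "2x the noise cancellation" could mean anything from a barely noticeable 3 dB to a dramatic 10 dB, depending on whether the marketing refers to power or perceived loudness.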
Based on that, it doesn't sound like it's that much worse. Of course, if you're trying to maximize battery longevity by not exceeding 80% charge, that might make it not very useful for many people.
- mobile mp3 player sales are low unless disk and battery life are greatly improved
- large display touch screen phone market is small unless someone solves the "app problem"
- smart watch market is tiny if exists at all unless someone makes one that is useful and has improved battery life
For those that are not chronically online, a mobile phone from a decade ago has everything they need. If you only have to phone the family, WhatsApp your neighbours, get the map out, use a search engine and do your online banking, then a flagship phone is a bit over the top. If anything, the old phone is preferable since its loss would not be the end of the world.
I have seen a few elderly neighbours rocking Samsung Galaxy S7s with no need to upgrade. Although the S7 isn't quite a decade old, the apps that are actually used (WhatsApp, online banking) will be working with the S7 for many years to come since there is this demographic of active users.
Now, what if we could get these people to upgrade every three years with a feature that the 'elderly neighbour' would want? Eyesight isn't what it used to be in old age, so how about a nice big screen?
You can't deliberately hobble the phone with poor battery life or programme it to go slow in an update because we know that isn't going to win the customer over, but a screen that gets tatty after three years? Sounds good to me.
Probably trying to find better screen materials, and addressing reliability issues.
I used Palm devices with resistive touch screens. It was good, but when you go glass, there's no turning back.
I would never buy a phone with a folding screen protected by plastic. I want a dependable slab, not a gimmicky gadget which can die any moment. I had my fill of dying flex cables with Cassiopeia PDAs. Never again.
Apple cancelled their mini line, which was 3% of sales.
It’s not a big enough slice for them to want to chase.
Typical strat for them is not to be first with an innovation, but to wait and work out the kinks enough that they can convince people that the tradeoffs are well worth making. Apple wouldn't be chasing that existing slice, they'd be trying to entice a larger share of their customers to upgrade faster.
There will never be a folding iPhone, simple as.
Also, flip phones aren't dorky and they have a 2000s vibe, but they don't fit Apple's "you can have any color as long as it's black" approach to design.
In some ways I can't even fault them - fragmenting your device shapes/experiences to chase a niche look is not good business. But this is exactly what's pushing me out of the Apple ecosystem - it's so locked down that if you don't want to fit into their narrow product lines you have no other options. There are no third-party watchmakers using Apple Watch hardware and software. No other phone makers with access to iPhone internals and iOS. Nobody can hack a PC OS onto an iPad or build a 2-in-1 macOS device.
I feel like this is the last generation of Apple tech I'm in on - I just find there are so many devices that are compelling to me personally but don't fit into the walled garden. Plus Google seems light-years ahead on delivering a smart assistant.
Oh and I remember everyone mocking the airpods pro when they came out. Now everyone is wearing them.
For phones what really matters for most people is... the screen size. And a folding phone is basically the best thing you can get right now for that.
The only problem is pricing at the moment.
My use-case is for travel, where I want to read books, and the very occasional time when I want to do some design work outside the office -- draw a diagram that sort of thing. A third rare use case is where a web site is buggy or limited in functionality for mobile browsers. In all these cases the unfolded screen allows me to do the thing I need to do without carrying a second device (tablet, eReader). Another marginal use-case is to show another person a photograph. The fold out screen is much easier to see and I think has better color rendition too.
For these use-cases I find the folding phone very worthwhile.
But...the benefit that trumps all that is that the phone itself is smaller (narrower) than the typical flagship phones these days. It fits in my pocket and my hand reaches across it. I'd never go back to a non-folding phone for this reason alone, even if I never unfolded it. In fact I almost never do unfold it, except when traveling.
fwiw it wasn't until the Fold6 that the "cover screen" typing experience was ok. I understand that the Fold7 is a bit wider and so probably better, but I can't justify the expense to upgrade so will sit out until the Fold8.
I guess if you're the sort that is not clumsy and you live in a mild climate, you might get your money's worth.
for reference these were Samsung Z Flip devices
The one I have used felt like using a real phone through a layer of vinyl, definitely not a pleasant experience.
They're buying another year of very-high margin phones I guess...