A 30B Qwen model walks into a Raspberry Pi and runs in real time (opens in new tab)

(byteshape.com)

363 pointsdataminer4mo ago135 comments

135 comments

There is a huge market segment waiting here. At least I think there is. Well, at least people like me want this. Ok, tens of dollars can be made at least. It is just missing a critical tipping point. Basically, I want an alexa like device for the home backed by local inference and storage with some standardized components identified:

- the interactive devices - all the alexa/google/apple devices out there are this interface, also, probably some TV input that stays local and I can voice control. That kind of thing. It should have a good speaker and voice control. It probably should also do other things like act as a wifi range extender or be the router. That would actually be good. I would buy one for each room so no need for crazy antennas if they are close and can create true mesh network for me. But I digress.

- the home 'cloud' server that is storage and control. This is a cheap CPU, a little ram and potentially a lot of storage. It should hold the 'apps' for my home and be the one place I can back-up everything about my network (including the network config!)

- the inference engines. That is where this kind of repo/device combo comes in. I buy it and it knows how to advertise in a standard way its services and the controlling node connects it to the home devices. It would be great to just plug it in and go.

Of course all of these could be combined but conceptually I want to be able to swap and mix and match at these levels so options here and interoperability is what really matters.

I know a lot of (all of) these pieces exist, but they don't work well together. There isn't a simple standard 'buy this turn it on and pair with your local network' kind of plug and play environment.

My core requirements are really privacy and that it starts taking over the unitaskers/plays well together with other things. There is a reason I am buying all this local stuff. If you phone home/require me to set up an account with you I probably don't want to buy your product. I want to be able to say 'Freddy, set timer for 10 mins' or 'Freddy, what is the number one tourist attraction in South Dakota' (wall drugs if you were wondering)

Normal_gaussian4mo ago

No, there isn't a plug and play one yet, but I've have great success with Home Assistant and the Home Assistant Voice Preview edition and its goal is pretty much to get rid of Alexa.

I'd imagine you'd have a bunch of cheap ones in the house that are all WiFi + Mic + Speakers, streaming back to your actual voice processing box (which would cost a wee bit more, but also have local access to all the data it needs).

You can see quite quickly that this becomes just another program running on a host, so if you use a slightly beefier machine and chuck a WiFi card in as well you've got your WiFi extenders.

joshstrange4mo ago

> but I've have great success with Home Assistant and the Home Assistant Voice Preview edition

As compared to Alexa? I bought their preview hardware (and had a home-rolled ESP32 version before that even) and things are getting closer, I can see the future where this works but we aren't there today IMHO. HA Voice (the current hardware) does not do well enough in the mic or speaker [0] department when compared to the Echos. My Echo can hear me over just about anything and I can hear it back, the HA Voice hardware is too quiet and the mic does not pick my up from the same distances or noise pollution levels as the Echo.

I _love_ my HA setup and run everything through it. I'd like nothing more than to trash all my Echos, I cam close to ordering multiple of the preview devices but convinced myself to get just 1 to test (glad I did).

Bottom line: I think HA Voice is the future (for me) but it's not ready yet, it doesn't compare to the Echos. I wish so much that my Sonos speakers could integrate with HA Voice since I already have those everywhere and I know they sound good.

[0] I use Sonos for all my music/audio listening in my house so I only care about the speaker for hearing it talk back to me, I don't need high-end audiophile speakers.

2 more replies

mcny4mo ago

And if it is plugged in to the wall, I'd be tempted to add a touch screen display and a camera just in case.

But really my use case is as simple as

1. Wake word, what time is it in ____

2. Wake word, how is the weather in ____

3. Wake word, will it rain/snow/?? in _____ today / tomorrow / ??

4. Wake word, what is ______

5. Wake word, when is the next new moon / full moon?

6. Wake word, when is sunrise / sunset?

And something similar like that

1 more reply

ragebol4mo ago

A bit like HomeAssistant Voice? https://www.home-assistant.io/voice-pe/

estimator72924mo ago

Just last week I hacked my Echo Show to install a custom OS and hook it into HomeAssistant.

Even gave it a custom wake word, she's Janet now.

HA is pretty clunky and there's a lot of manual setup. But I have a voice assistant contained entirely within my local infrastructure. I'm even planning to wire it up to my local ollama server for actual AI inference behind it.

So far it's exactly as crappy as Alexa, but only because I haven't waded deep enough into configuration. I'm okay with tools being crap when it's my fault instead of the tool being crap because it doesn't make Amazon enough money.

password43214mo ago

> hacked my Echo Show

Wowsers I did not know this was a thing; TIL, thanks!

sofixa4mo ago

It sounds like you want Home Asisstant.

You have all of the different components:

* you can use a number of things for the interactive devices (any touchscreen device, buttons, voice, etc)

* have it HA do the basic parsing (word for word matching), with optionally plugging into something more complex (cloud service like ChatGPT, or self-hosted Ollama or whatever) for more advanced parsing (logical parsing)

Every part of the ecosystem is interchangeable and very open. You can use a bunch of different devices, a bunch of different LLMs to do the advanced parsing if you want it. HA can control pretty much everything with an API, and can itself be controlled by pretty much anything that can talk an API.

fuzzer3714mo ago

And there never will be. You know why? Because the giant corporations can't suck up all your data and tailor advertisements to you. Why sell a good thing once, when you can sell crappy shovelware ridden with ads and a subscription service every month?

jmward014mo ago

Open source is amazing for this. Honestly, I suspect this is much simpler than the jellyfin ecosystem and other open source projects out there. Really, we are so close to this now it is just missing a few things like a good 'how to' that ties it all together and turns into the opensource repo that bundles things.

protocolture4mo ago

Keen for this also. Been having issues getting a smooth voice experience from HA to ChatGPT. I dont like the whole wakeword concept for the receiver either. I think theres work to be done on the whole stack.

fennecbutt4mo ago

What's wrong with the wakeword stuff?

Great timing as I was looking into it yesterday as was thinking about writing my own set of agents to run house stuff. I don't want to spent loads of time on voice interaction so HA wakeword stuff would've been useful. If not I'll bypass HA for voice and really only use HA via mcp.

I can do fw dev for micros...but omg do I not want to spend the time looking thru a datasheet and getting something to run efficiently myself these days.

1 more reply

nickthegreek4mo ago

you can use a physical button instead of wakeword.

protocolture4mo ago

Doesnt suit my use case sadly.

1 more reply

65104mo ago

It should participate in all conversations, take initiative and experiment.

1 more reply

mkul4mo ago

I've just started using it but I'd recommend https://github.com/steipete/clawdis, you need to set it up a bit but it's really cool to just be able to do things on the go by just texting an assistant. You can see all the different ways people are using it @clawdbot on twitter.

mr_mitm4mo ago

Can you give us some highlights on how this is helpful in your day-to-day life for those of us who aren't on twitter?

BizarroLand4mo ago

Why does it require an online AI service? Why can it not work with ollama or some other locally hosted setup?

colechristensen4mo ago

I've been working on this on and off for a couple of years now, the loop is definitely closing, I think it's possible at this point but not yet easy.

PunchyHamster4mo ago

There is but that market doesn't sell subscriptions and that is what tech giants wants to sell - renewable flow of money that will keep flowing even if product stagnates because effort to move to competition is big.

Haaargio4mo ago

We are in a free market with china still playing the open source game.

The market is not ready for building this due to costs etc. not because the big companies block them or anything. And nvidia is not selling subscriptions at all.

matusp4mo ago

The sota chatbots are getting more and more functionality that is not just LLM inference. They can search the web, process files, integrate with other apps. I think that's why most people will consider local LLMs to be insufficient very soon.

woooooo4mo ago

But that's just software that also runs fine locally. A few tools with a local LLM can do it.

1 more reply

BoxOfRain4mo ago

Nah I disagree, tool calling isn't that difficult. I've got my own Cats Effect based model orchestration project I'm working on, and while it's not 100% yet I can do web browse, web search, memory search (this one is cool), and others on my own hardware.

throwaway77834mo ago

And toys

zwnow4mo ago

> Well, at least people like me want this.

Yeah because dynamic digital price signs in shops based on what data vendors have about you and AI can extract from it are such fun! Total surveillance. More than what's already happening. Such fun!

yjftsjthsd-h4mo ago

In case anyone else clicked in wondering what counts as "real time" for this:

> On a Pi 5 (16GB), Q3_K_S-2.70bpw [KQ-2] hits 8.03 TPS at 2.70 BPW and maintains 94.18% of BF16 quality.

And they talk about other hardware and details. But that's the expanded version of the headline claim.

CSSer4mo ago

Someone should make a version of the Hacker News homepage that is just LLM extracts of key article details like this.

grosswait4mo ago

Not sure if it is still updating https://hackyournews.com/

ukuina4mo ago

Thanks for pointing this out, https://hackyournews.com should be up and running again!

1 more reply

Aurornis4mo ago

If you read a lot of comment sections, there are bot accounts showing up on LLM that try to do this constantly.

Their output is not great so they get downvoted and spotted quickly.

jacquesm4mo ago

If you spot any that live longer than a few comments please pass that info to Dan & Tom.

mschuster914mo ago

Please not. There were some bots (or karma-farming users) doing this and yuck, was it annoying.

1 more reply

Imustaskforhelp4mo ago

https://chatgpt.com/share/695d9ac2-c314-8011-8938-b0d7de7059...

You can paste any article and chatgpt (took the most laymen AI thing) and just writing summarize this article https://byteshape.com/blogs/Qwen3-30B-A3B-Instruct-2507/

can give you insights about it.

Although I am all for freedom, one forgets that this is one of the few places left on internet where discussions feel meaningful and I am not judging you if you want AI but do it at your own discretion using chatbots.

If you want, you can even hack around a simple extension (tampermonkey etc.) where you can have a button which can do this for you if you really so desire.

Ended up being bored and asked chatgpt to do this but chatgpt is having something wrong, it got just blinking mode so I asked claude web (4.5 sonnet) to do it and I ended up building it with tampermonkey script.

Created the code. https://github.com/SerJaimeLannister/tampermonkey-hn-summari...

I was just writing this comment and I just got curious I guess so in the end ended up building it.

Although Edit: Thinking about it, I felt that we should read other people's articles as well. I just created this tool not out of endorsement of idea or anything but just curiosity or boredom but I think that we should probably read the articles themselves instead of asking chatgpt or LLM's about it.

There is this quote which I remembered right now

If something is worth talking/discussing about, its worth writing

If something is worth writing, then its worth reading.

Information that we write is fundamentally subjective (our writing style etc with our biases etc.), passing it through a black box which will try to homogenify all of it just feels like it misses the point.

65104mo ago

<s>I'm not entirely sure but I think</s> if the file name ends with .user.js like HN%20ChatGPT%20Summarize.user.js it will prompt to install when opening the raw file.

haha, like so works too

https://raw.githubusercontent.com/SerJaimeLannister/tampermo...

1 more reply

Alex20374mo ago

>we should read other people's articles

sure, and reading a LLM summary allows one to decide whether the full article is worth reading or not.

3 more replies

kadoban4mo ago

I mean, they didn't bury it far in the article, it's like a two second skim into it and it's labelled with a tl;dr. Not a bad idea in general but you don't even need it for this one.

kristianp4mo ago

Also this is the model name: Qwen3-30B-A3B-Instruct-2507

I tried the q4 quantization when it came out and didn't find it to be great for my coding use case.

make34mo ago

I wonder what "94.18% of quality" means

jareds4mo ago

Is there a good place for easy comparisons of different models? I know gpt-oss-20b and gpt-oss-120b have different numbers of parameters, but don't know what this means in practice. All my experience with AI has been with larger models like Gemini and GPT. I'm interested in running models on my own hardware but don't know how small I can go and still get useful output both for simple things like fixing spelling and grammar, as well as complex things like programming.

ekidd4mo ago

One easy way to test different models is purchase $20 worth of tokens from one of the Open Router-like sites. This will let you asks tons of questions and try out lots of models.

Realistically, the biggest models you can run at a reasonable price right now are quantized versions of things like the Qwen3 30B A3B family. A 4-bit quantized version fits in roughly 15GB of RAM. This will run very nicely on something like an Nvidia 3090. But you can also use your regular RAM (though it will be slower).

These models aren't competitive with GPT 5 or Opus 4.5! But they're mostly all noticeably better than GPT-4o, some by quite a bit. Some of the 30B models will run as basic agentic coders.

There are also some great 4B to 8B models from various organizations that will fit on smaller systems. A 8B model, for example, can be a great translator.

(If you have a bunch of money and patience, you can also run something like GPT OSS 120B or GLM 4.5 Air locally.)

nl4mo ago

I wrote https://tools.nicklothian.com/llm_comparator.html so you can compare different models.

OpenRouter gives you $10 credit when you sign up - stick your API key in and compare as many models as you want. It's all browser local storage.

kouteiheika4mo ago

> (If you have a bunch of money and patience, you can also run something like GPT OSS 120B or GLM 4.5 Air locally.)

Don't need patience for these, just money. A single RTX 6000 Pro runs those great and super fast.

scotty794mo ago

> GPT OSS 120B

This one runs at perfectly servicable pace locally on a laptop 5090 with 64gb system ram with zero effort required. Just download ollama and select this model from the drop-down.

Muromec4mo ago

Oh... 8 thousand of eurobucks for the thing.

1 more reply

sofixa4mo ago

Or a single AMD Strix Halo with lots of RAM, which could be had before the RAM crisis for ~1.5k eur.

Haaargio4mo ago

Or why not just buy a blackwell rack?

Runs everything today with bleeding edge performance.

Overall whats the difference between 8k or 30k?

1 more reply

cmrdporcupine4mo ago

This is the answer. There's a half dozen sites that let you run these models by the token, and actually $20 is excessive. $5 will get you a long long way.

jdright4mo ago

https://swe-rebench.com/

nl4mo ago

I've been super impressed by qwen3:0.6b (yes, 0.6B) running in Ollama.

If you have very specific, constrained tasks it can do quite a lot. It's not perfect though.

https://tools.nicklothian.com/llm_comparator.html?gist=fcae9... is an example conversation where I took OpenAI's "Natural language to SQL" prompt[1], send it to Ollama:qwen3:0.6b and the asked Gemini Flash 3 to compare what qwen3:0.6b did vs what Flash did.

Flash was clearly correct, but the qwen3:0.6b errors are interesting in themselves.

[1] https://platform.openai.com/docs/examples/default-sql-transl...

Aurornis4mo ago

I’ve experimented with several of the really small models. It’s impressive that they can produce anything at all, but in my experience the output is basically useless for anything of value.

nl4mo ago

Yes, I thought that too! But qwen3:0.6b (and to some extent gemma 1b) has made me reevaluate.

They still aren't useful like large LLMs, but for things like summarization, and other tasks where you can give them structure but want the sheen of natural language they are much better than things like the Phi series were.

2 more replies

geerlingguy4mo ago

I've just tried replicating this on my Pi 5 16GB, running the latest llama.cpp... and it segfaults:

    ./build/bin/llama-cli -m "models/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf" -e --no-mmap -t 4
    ...
    Loading model... -ggml_aligned_malloc: insufficient memory (attempted to allocate 24576.00 MB)
    ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 25769803776
    alloc_tensor_range: failed to allocate CPU buffer of size 25769803776
    llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
    Segmentation fault

I'm not sure how they're running it... any kind of guide for replicating their results? It does take up a little over 10 GB of RAM (watching with btop) before it segfaults and quits.

[Edit: had to add -c 4096 to cut down the context size, now it loads]

thcuk4mo ago

Tested same model on Intel N100 miniPC with 16G - the hundred bucks pc

llama-server -m /Qwen3-30B-A3B-Instruct-2507-GGUF:IQ3_S --jinja -c 4096 --host 0.0.0.0 --port 8033 Got <= 10 t/s Which I think is not so bad!

On AMD Ryzen 5 5500U with Radeon Graphics and Compiled for Vulkan Got 15 t/s - could swear this morning was <= 20 t/s

On AMD Ryzen 7 H 255 w/ Radeon 780M Graphics and Compiled for Vulkan Got 40 t/s On the last I did a quick comparison with unsloth version unsloth/Qwen3-30B-A3B-GGUF:Q4_K_M and got 25 t/s Can't really comment on quality of output - seems similar

LargoLasskhyfv4mo ago

Have you tried anything with https://codeberg.org/ikawrakow/illama

https://github.com/ikawrakow/ik_llama.cpp and their 4Bit-quants?

Or maybe even Microsofts Bitnet? https://github.com/microsoft/BitNet

https://github.com/ikawrakow/ik_llama.cpp/pull/337

https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf ?

That would be an interesting comparison for running local LLMs on such low-end/edge-devices. Or common office machines with only iGPU.

westpfelia4mo ago

Would you be able to actually get useful results from it? I'm looking into self hosting LLM's for python/js development. But I dont know if I would get useful results.

graemep4mo ago

I have been thinking the same and have tried a little. I have tried some small models and got some useful results.

I have not figured out what models that fit in the available memory (say 16Gb) that would be best for doing this. A CPU model I can run on a laptop would be nice. The models I have tried are much smaller than 30B.

batch124mo ago

Could they have added some swap?

geerlingguy4mo ago

No, just updated the parent comment, I added -c 4096 to cut down the context size, and now the model loads.

I'm able to get 6-7 tokens/sec generation with 10-11 tokens/sec prompt processing with their model. Seems quite good, actually—much more useful than llama 3.2:3b, which has comparable performance on this Pi.

3 more replies

anonzzzies4mo ago

We need custom inference chips at scale for this imho. Every computer (whatever formfactor/board) should have an inference unit on it so at least inference is efficient and fast and can be offloaded while the cpu is doing something else.

Aurornis4mo ago

The bottleneck in common PC hardware is mostly memory bandwidth. Offloading the computation part to a different chip wouldn’t help if memory access is the bottleneck.

There have been a lot of boards and chips for years with dedicated compute hardware, but they’re only so useful for these LLM models that require huge memory bandwidth.

touristtam4mo ago

It is also to note that the bandwidth bus has seen very little upgrade over the years and even the onboard RAM on GPU card have seen mediocre upgrades. If everyone and their grandma wasn't using NVidia GPUs we would probably have seen a more competitive market and greater changes outside the chip itself.

bigyabai4mo ago

I don't think that's true. AMD, Apple and Intel are all dGPU competitors with roughly the same struggle bringing upgrades to market. They have every incentive to release a disruptive product, but refuse to invest in their ecosystem the way Nvidia did.

chvid4mo ago

Look at the specs of this Orange Pi 6+ board - dedicated 30 TPU NPU.

https://boilingsteam.com/orange-pi-6-plus-review/

sofixa4mo ago

Almost all of them have it already. Microsoft's "Copilot+" branding includes a prerequisite for an NPU with a minimal amount of TOPS.

It's just that practically nothing uses those NPUs.

baq4mo ago

At this point of the timeline compute is cheap, it’s RAM which is basically unavailable.

fouc4mo ago

I can't believe this was downvoted. It makes a lot of sense that it would be highly useful to have mass custom inference chips.

bigyabai4mo ago

It's quite easy to understand. The tech industry has gone through 4-5 generations of obsolete NPU hardware that was dead-on-arrival. Meanwhile, there are still GPUs from 2014-2016 that run CUDA and are more power efficient than the NPUs.

The industry has to copy CUDA, or give up and focus on raster. ASIC solutions are a snipe chase, not to mention small and slow.

syx4mo ago

I can’t wait to get home and try this on my Pi. Past few months, I’ve been building a fully local agent [0] that runs inference entirely on a Raspberry Pi, and I’ve been extensively testing a plethora of small, open models as part of my studies. This is an incredibly exciting field, and I hope it gains more attention as we shift away from massive, centralized AI platforms and toward improving the performance of local models.

For anyone interested in a comparative review of different models that can run on a Pi, here’s a great article [1] I came across while working on my project.

[0] https://github.com/syxanash/maxheadbox

[1] https://www.stratosphereips.org/blog/2025/6/5/how-well-do-ll...

tgtweak4mo ago

So basically the quantization in a byteshape model is per-tensor and can be variable and is an "average" in the final result? The results look good - curious why this isn't more prevalent! Would also love to better understand what factors into "accuracy" since there might be some nuance there depending on the measure.

kouteiheika4mo ago

> Would also love to better understand what factors into "accuracy" since there might be some nuance there depending on the measure.

It's accuracy across GSM8K, MMLU, IFEVAL and LiveCodeBench.

They detail their methodology here: https://byteshape.com/blogs/Qwen3-4B-I-2507/

lostmsu4mo ago

GPT-OSS-20B is only 11.2GB. Should fit in any 16GB machine with descent context without any quality degradation.

nunodonato4mo ago

Had to try this on my laptop (vs my non-byteshape qwen3-30-a3b)

Original: 11tok/s Byteshape: 16tok/s

Quite a nice improvement!

oliwary4mo ago

Does anyone have any use-cases for long-running, interesting tasks that do not require perfect accuracy? Seems like this would be the sweet spot for running local models on low-powered hardware.

baq4mo ago

looking at the dump of your home assistant server and sensor data and saying 'hmm that's interesting' if it notices something interesting

Havoc4mo ago

Is there something different about that accuracy measure? i.e. relative to perplexity term usually used

Going from BF16 to 2.8 and losing only ~5% sounds odd to me.

kouteiheika4mo ago

It's accuracy across GSM8K, MMLU, IFEVAL and LiveCodeBench.

They detail their methodology here: https://byteshape.com/blogs/Qwen3-4B-I-2507/

cwoolfe4mo ago

How can I use ByteShape to run LLMs faster on my 32GB MacBook M1 Max? Or has Ollama already optimized that?

nunodonato4mo ago

don't use ollama. llama.cpp is better because ollama has an outdated llama.cpp

zoe_mnode4mo ago

This is impressive work on quantization. It really validates the hypothesis that software optimization (and proper memory management) matters more than just raw FLOPs.

syntaxing4mo ago

I feel like calling it a “30B” model is slightly disingenuous. It’s a 30B-A3B. So only 3B parameters is active at a given time. While still impressive nevertheless, being able to get 8T/s for a “A3B” compared to a dense 30B is very different.

CamperBob24mo ago

Out of curiosity, I just tried Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf (the version they recommend for the Raspberry Pi) on a Blackwell GPU. It cranked out 200+ tokens per second on some private benchmark queries, and it is surprisingly sharp.

It punches well above the weight class expected from 3B active parameters. You could build the bear in Spielberg's "AI" with this thing, if not the kid.

rcarmo4mo ago

I’m bearish about that kind of future :)

throwaway8943454mo ago

What does it mean that only 3B parameters are active at a time? Also any indication of whether this was purely CPU or if it’s using the Pi’s GPU?

kouteiheika4mo ago

> What does it mean that only 3B parameters are active at a time?

In a nutshell: LLMs generate tokens one at a time. "only 3B parameters active a a time" means that for each of those tokens only 3B parameters need to be fetched from memory, instead of all of them (30B).

tgv4mo ago

Then I don't understand why it would matter. Or does it really mean that for each input token 10% of the total network runs, and then another 10% for the next token, rather than running each 10 batches of 10% for each token? If so, any idea or pointer to how the selection works?

1 more reply

numpad04mo ago

I've asked Gemini about it the other day(I'm dumb and shameless). Apparently it means that the model branches into bunch of 3B sections in the middle and joins at both ends, totaling in parameters at 30B. This means computational footprint reduces to (bottom "router" parts + 3B + top parts) of effectively-5B or whatever specific to that model implied by "3B", rather than the full 30B.

MoE models still operate on token-by-token basis, i.e. "pot/at/o" -> "12345/7654/8472". "Experts" are selected on per-token basis, not per-interation, so "expert" naming might be a bit of a misnomer, or marketing.

TheRealPomax4mo ago

LLMs are, by definition, real time at any speed. 50,000 tokens per second? Real time. Only 0.0002 tokens per minute? Still real time.

Eight tokens per second is "real time" in that sense, but that's also the kind of speeds that we used to mock old video games for, when they would show "computers" but the text would slowly get printed to a screen letter for letter or word for word.

kouteiheika4mo ago

In this context by "real time" people usually mean "as fast as I can read the reply", so, 0.0002 tokens per minute would not be considered "real time".

rurban4mo ago

Real time typically means guaranteed reaction time below 30ms, because slower reactions will make the body through up.

baq4mo ago

Real time is defined as ‘no slower than some critical speed’, in case of conversation with humans this should be around 10 tok/s including speech synthesis.

cess114mo ago

We're approaching the point where we could have sophisticated sound controlled sex toys.

shnpln4mo ago

How would this run on a Jetson Orin Nano dev kit?

runlaszlorun4mo ago

I'm just pulling stuff out of my butt here but is this an area where an fpga might be worth it for a particular type of model?

j / k navigate · click thread line to collapse

135 comments

jmward014mo ago

Of course all of these could be combined but conceptually I want to be able to swap and mix and match at these levels so options here and interoperability is what really matters.

I know a lot of (all of) these pieces exist, but they don't work well together. There isn't a simple standard 'buy this turn it on and pair with your local network' kind of plug and play environment.

Normal_gaussian4mo ago

No, there isn't a plug and play one yet, but I've have great success with Home Assistant and the Home Assistant Voice Preview edition and its goal is pretty much to get rid of Alexa.

You can see quite quickly that this becomes just another program running on a host, so if you use a slightly beefier machine and chuck a WiFi card in as well you've got your WiFi extenders.

joshstrange4mo ago

> but I've have great success with Home Assistant and the Home Assistant Voice Preview edition

[0] I use Sonos for all my music/audio listening in my house so I only care about the speaker for hearing it talk back to me, I don't need high-end audiophile speakers.

2 more replies

mcny4mo ago

And if it is plugged in to the wall, I'd be tempted to add a touch screen display and a camera just in case.

But really my use case is as simple as

1. Wake word, what time is it in ____

2. Wake word, how is the weather in ____

3. Wake word, will it rain/snow/?? in _____ today / tomorrow / ??

4. Wake word, what is ______

5. Wake word, when is the next new moon / full moon?

6. Wake word, when is sunrise / sunset?

And something similar like that

1 more reply

ragebol4mo ago

A bit like HomeAssistant Voice? https://www.home-assistant.io/voice-pe/

estimator72924mo ago

Just last week I hacked my Echo Show to install a custom OS and hook it into HomeAssistant.

Even gave it a custom wake word, she's Janet now.

password43214mo ago

> hacked my Echo Show

Wowsers I did not know this was a thing; TIL, thanks!

sofixa4mo ago

It sounds like you want Home Asisstant.

You have all of the different components:

* you can use a number of things for the interactive devices (any touchscreen device, buttons, voice, etc)

fuzzer3714mo ago

jmward014mo ago

protocolture4mo ago

fennecbutt4mo ago

What's wrong with the wakeword stuff?

I can do fw dev for micros...but omg do I not want to spend the time looking thru a datasheet and getting something to run efficiently myself these days.

1 more reply

nickthegreek4mo ago

you can use a physical button instead of wakeword.

protocolture4mo ago

Doesnt suit my use case sadly.

1 more reply

65104mo ago

It should participate in all conversations, take initiative and experiment.

1 more reply

mkul4mo ago

mr_mitm4mo ago

Can you give us some highlights on how this is helpful in your day-to-day life for those of us who aren't on twitter?

BizarroLand4mo ago

Why does it require an online AI service? Why can it not work with ollama or some other locally hosted setup?

colechristensen4mo ago

I've been working on this on and off for a couple of years now, the loop is definitely closing, I think it's possible at this point but not yet easy.

PunchyHamster4mo ago

Haaargio4mo ago

We are in a free market with china still playing the open source game.

The market is not ready for building this due to costs etc. not because the big companies block them or anything. And nvidia is not selling subscriptions at all.

matusp4mo ago

woooooo4mo ago

But that's just software that also runs fine locally. A few tools with a local LLM can do it.

1 more reply

BoxOfRain4mo ago

throwaway77834mo ago

And toys

zwnow4mo ago

> Well, at least people like me want this.

Yeah because dynamic digital price signs in shops based on what data vendors have about you and AI can extract from it are such fun! Total surveillance. More than what's already happening. Such fun!

yjftsjthsd-h4mo ago

In case anyone else clicked in wondering what counts as "real time" for this:

> On a Pi 5 (16GB), Q3_K_S-2.70bpw [KQ-2] hits 8.03 TPS at 2.70 BPW and maintains 94.18% of BF16 quality.

And they talk about other hardware and details. But that's the expanded version of the headline claim.

CSSer4mo ago

Someone should make a version of the Hacker News homepage that is just LLM extracts of key article details like this.

grosswait4mo ago

Not sure if it is still updating https://hackyournews.com/

ukuina4mo ago

Thanks for pointing this out, https://hackyournews.com should be up and running again!

1 more reply

Aurornis4mo ago

If you read a lot of comment sections, there are bot accounts showing up on LLM that try to do this constantly.

Their output is not great so they get downvoted and spotted quickly.

jacquesm4mo ago

If you spot any that live longer than a few comments please pass that info to Dan & Tom.

mschuster914mo ago

Please not. There were some bots (or karma-farming users) doing this and yuck, was it annoying.

1 more reply

Imustaskforhelp4mo ago

https://chatgpt.com/share/695d9ac2-c314-8011-8938-b0d7de7059...

You can paste any article and chatgpt (took the most laymen AI thing) and just writing summarize this article https://byteshape.com/blogs/Qwen3-30B-A3B-Instruct-2507/

can give you insights about it.

If you want, you can even hack around a simple extension (tampermonkey etc.) where you can have a button which can do this for you if you really so desire.

Created the code. https://github.com/SerJaimeLannister/tampermonkey-hn-summari...

I was just writing this comment and I just got curious I guess so in the end ended up building it.

There is this quote which I remembered right now

If something is worth talking/discussing about, its worth writing

If something is worth writing, then its worth reading.

65104mo ago

<s>I'm not entirely sure but I think</s> if the file name ends with .user.js like HN%20ChatGPT%20Summarize.user.js it will prompt to install when opening the raw file.

haha, like so works too

https://raw.githubusercontent.com/SerJaimeLannister/tampermo...

1 more reply

Alex20374mo ago

>we should read other people's articles

sure, and reading a LLM summary allows one to decide whether the full article is worth reading or not.

3 more replies

kadoban4mo ago

I mean, they didn't bury it far in the article, it's like a two second skim into it and it's labelled with a tl;dr. Not a bad idea in general but you don't even need it for this one.

kristianp4mo ago

Also this is the model name: Qwen3-30B-A3B-Instruct-2507

I tried the q4 quantization when it came out and didn't find it to be great for my coding use case.

make34mo ago

I wonder what "94.18% of quality" means

jareds4mo ago

ekidd4mo ago

One easy way to test different models is purchase $20 worth of tokens from one of the Open Router-like sites. This will let you asks tons of questions and try out lots of models.

These models aren't competitive with GPT 5 or Opus 4.5! But they're mostly all noticeably better than GPT-4o, some by quite a bit. Some of the 30B models will run as basic agentic coders.

There are also some great 4B to 8B models from various organizations that will fit on smaller systems. A 8B model, for example, can be a great translator.

(If you have a bunch of money and patience, you can also run something like GPT OSS 120B or GLM 4.5 Air locally.)

nl4mo ago

I wrote https://tools.nicklothian.com/llm_comparator.html so you can compare different models.

OpenRouter gives you $10 credit when you sign up - stick your API key in and compare as many models as you want. It's all browser local storage.

kouteiheika4mo ago

> (If you have a bunch of money and patience, you can also run something like GPT OSS 120B or GLM 4.5 Air locally.)

Don't need patience for these, just money. A single RTX 6000 Pro runs those great and super fast.

scotty794mo ago

> GPT OSS 120B

This one runs at perfectly servicable pace locally on a laptop 5090 with 64gb system ram with zero effort required. Just download ollama and select this model from the drop-down.

Muromec4mo ago

Oh... 8 thousand of eurobucks for the thing.

1 more reply

sofixa4mo ago

Or a single AMD Strix Halo with lots of RAM, which could be had before the RAM crisis for ~1.5k eur.

Haaargio4mo ago

Or why not just buy a blackwell rack?

Runs everything today with bleeding edge performance.

Overall whats the difference between 8k or 30k?

1 more reply

cmrdporcupine4mo ago

This is the answer. There's a half dozen sites that let you run these models by the token, and actually $20 is excessive. $5 will get you a long long way.

jdright4mo ago

https://swe-rebench.com/

nl4mo ago

I've been super impressed by qwen3:0.6b (yes, 0.6B) running in Ollama.

If you have very specific, constrained tasks it can do quite a lot. It's not perfect though.

Flash was clearly correct, but the qwen3:0.6b errors are interesting in themselves.

[1] https://platform.openai.com/docs/examples/default-sql-transl...

Aurornis4mo ago

I’ve experimented with several of the really small models. It’s impressive that they can produce anything at all, but in my experience the output is basically useless for anything of value.

nl4mo ago

Yes, I thought that too! But qwen3:0.6b (and to some extent gemma 1b) has made me reevaluate.

2 more replies

geerlingguy4mo ago

I've just tried replicating this on my Pi 5 16GB, running the latest llama.cpp... and it segfaults:

    ./build/bin/llama-cli -m "models/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf" -e --no-mmap -t 4
    ...
    Loading model... -ggml_aligned_malloc: insufficient memory (attempted to allocate 24576.00 MB)
    ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 25769803776
    alloc_tensor_range: failed to allocate CPU buffer of size 25769803776
    llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
    Segmentation fault

I'm not sure how they're running it... any kind of guide for replicating their results? It does take up a little over 10 GB of RAM (watching with btop) before it segfaults and quits.

[Edit: had to add -c 4096 to cut down the context size, now it loads]

thcuk4mo ago

Tested same model on Intel N100 miniPC with 16G - the hundred bucks pc

llama-server -m /Qwen3-30B-A3B-Instruct-2507-GGUF:IQ3_S --jinja -c 4096 --host 0.0.0.0 --port 8033 Got <= 10 t/s Which I think is not so bad!

On AMD Ryzen 5 5500U with Radeon Graphics and Compiled for Vulkan Got 15 t/s - could swear this morning was <= 20 t/s

LargoLasskhyfv4mo ago

Have you tried anything with https://codeberg.org/ikawrakow/illama

https://github.com/ikawrakow/ik_llama.cpp and their 4Bit-quants?

Or maybe even Microsofts Bitnet? https://github.com/microsoft/BitNet

https://github.com/ikawrakow/ik_llama.cpp/pull/337

https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf ?

That would be an interesting comparison for running local LLMs on such low-end/edge-devices. Or common office machines with only iGPU.

westpfelia4mo ago

Would you be able to actually get useful results from it? I'm looking into self hosting LLM's for python/js development. But I dont know if I would get useful results.

graemep4mo ago

I have been thinking the same and have tried a little. I have tried some small models and got some useful results.

batch124mo ago

Could they have added some swap?

geerlingguy4mo ago

No, just updated the parent comment, I added -c 4096 to cut down the context size, and now the model loads.

3 more replies

anonzzzies4mo ago

Aurornis4mo ago

The bottleneck in common PC hardware is mostly memory bandwidth. Offloading the computation part to a different chip wouldn’t help if memory access is the bottleneck.

There have been a lot of boards and chips for years with dedicated compute hardware, but they’re only so useful for these LLM models that require huge memory bandwidth.

touristtam4mo ago

bigyabai4mo ago

chvid4mo ago

Look at the specs of this Orange Pi 6+ board - dedicated 30 TPU NPU.

https://boilingsteam.com/orange-pi-6-plus-review/

sofixa4mo ago

Almost all of them have it already. Microsoft's "Copilot+" branding includes a prerequisite for an NPU with a minimal amount of TOPS.

It's just that practically nothing uses those NPUs.

baq4mo ago

At this point of the timeline compute is cheap, it’s RAM which is basically unavailable.

fouc4mo ago

I can't believe this was downvoted. It makes a lot of sense that it would be highly useful to have mass custom inference chips.

bigyabai4mo ago

The industry has to copy CUDA, or give up and focus on raster. ASIC solutions are a snipe chase, not to mention small and slow.

syx4mo ago

For anyone interested in a comparative review of different models that can run on a Pi, here’s a great article [1] I came across while working on my project.

[0] https://github.com/syxanash/maxheadbox

[1] https://www.stratosphereips.org/blog/2025/6/5/how-well-do-ll...

tgtweak4mo ago

kouteiheika4mo ago

> Would also love to better understand what factors into "accuracy" since there might be some nuance there depending on the measure.

It's accuracy across GSM8K, MMLU, IFEVAL and LiveCodeBench.

They detail their methodology here: https://byteshape.com/blogs/Qwen3-4B-I-2507/

lostmsu4mo ago

GPT-OSS-20B is only 11.2GB. Should fit in any 16GB machine with descent context without any quality degradation.

nunodonato4mo ago

Had to try this on my laptop (vs my non-byteshape qwen3-30-a3b)

Original: 11tok/s Byteshape: 16tok/s

Quite a nice improvement!

oliwary4mo ago

Does anyone have any use-cases for long-running, interesting tasks that do not require perfect accuracy? Seems like this would be the sweet spot for running local models on low-powered hardware.

baq4mo ago

looking at the dump of your home assistant server and sensor data and saying 'hmm that's interesting' if it notices something interesting

Havoc4mo ago

Is there something different about that accuracy measure? i.e. relative to perplexity term usually used

Going from BF16 to 2.8 and losing only ~5% sounds odd to me.

kouteiheika4mo ago

It's accuracy across GSM8K, MMLU, IFEVAL and LiveCodeBench.

They detail their methodology here: https://byteshape.com/blogs/Qwen3-4B-I-2507/

cwoolfe4mo ago

How can I use ByteShape to run LLMs faster on my 32GB MacBook M1 Max? Or has Ollama already optimized that?

nunodonato4mo ago

don't use ollama. llama.cpp is better because ollama has an outdated llama.cpp

zoe_mnode4mo ago

This is impressive work on quantization. It really validates the hypothesis that software optimization (and proper memory management) matters more than just raw FLOPs.

syntaxing4mo ago

CamperBob24mo ago

It punches well above the weight class expected from 3B active parameters. You could build the bear in Spielberg's "AI" with this thing, if not the kid.

rcarmo4mo ago

I’m bearish about that kind of future :)

throwaway8943454mo ago

What does it mean that only 3B parameters are active at a time? Also any indication of whether this was purely CPU or if it’s using the Pi’s GPU?

kouteiheika4mo ago

> What does it mean that only 3B parameters are active at a time?

tgv4mo ago

1 more reply

numpad04mo ago

TheRealPomax4mo ago

LLMs are, by definition, real time at any speed. 50,000 tokens per second? Real time. Only 0.0002 tokens per minute? Still real time.

kouteiheika4mo ago

In this context by "real time" people usually mean "as fast as I can read the reply", so, 0.0002 tokens per minute would not be considered "real time".

rurban4mo ago

Real time typically means guaranteed reaction time below 30ms, because slower reactions will make the body through up.

baq4mo ago

Real time is defined as ‘no slower than some critical speed’, in case of conversation with humans this should be around 10 tok/s including speech synthesis.

cess114mo ago

We're approaching the point where we could have sophisticated sound controlled sex toys.

shnpln4mo ago

How would this run on a Jetson Orin Nano dev kit?

runlaszlorun4mo ago

I'm just pulling stuff out of my butt here but is this an area where an fpga might be worth it for a particular type of model?

j / k navigate · click thread line to collapse