The thing that kills this for me (and they even mentioned it) is wake word detection. I have both the HA voice preview and FPH Satellite1 devices, plus have experimented with a few other options like a Raspberry Pi with a conference mic.
Somehow nothing is even 50% as good as my Echo devices at picking up the wake word. The assistant itself is far better, but that doesn't matter if it takes 2-3 tries to get it to listen to you. If someone solves this problem with open hardware I'll immediately buy several.
I'd prefer to physically press a button on an intercom box rather than have something churning away, constantly processing sound.
Also I have all my voice assistant devices mounted to the ceiling
Could be pressed even if your hands were busy.
Or do you mean a button that activates chunked recording, passes it to a speech-to-text model, forwards to an LLM to infer intent, which triggers HA to issue a command, over a wireless network, to the computer with the light attached, to tell the light to turn on.
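Tongue-in-cheek as it is, that really is roughly the push-to-talk flow. A minimal sketch in Python, where every function is a hypothetical stub (none of these names are a real API):

```python
# Hypothetical push-to-talk pipeline; all names are illustrative stubs.

def record_while_pressed() -> bytes:
    """Stub: capture audio chunks while the intercom button is held."""
    return b"fake-pcm-audio"

def speech_to_text(audio: bytes) -> str:
    """Stub: would hand the audio to a local STT model."""
    return "turn on the desk light"

def infer_intent(transcript: str) -> dict:
    """Stub: would ask an LLM to map free text to a structured intent."""
    if "turn on" in transcript and "light" in transcript:
        return {"action": "turn_on", "entity": "light.desk"}
    return {"action": "unknown"}

def send_to_home_assistant(intent: dict) -> str:
    """Stub: would POST the intent to HA's REST API over the network."""
    return f"called {intent['action']} on {intent.get('entity')}"

def handle_button_press() -> str:
    audio = record_while_pressed()
    transcript = speech_to_text(audio)
    intent = infer_intent(transcript)
    return send_to_home_assistant(intent)

print(handle_button_press())
```

The point of the joke survives the sketch: every hop is another place for latency or failure, which is exactly why a plain wall switch is hard to beat.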
Funky chicken for Gemini
Penguin dance for OpenAI
Claude?
The Zoidberg Shuffle?
I haven't tried training my own wake word though, I'm tempted to see if it improves things.
I used it personally, did a lot of research (including asking questions to the creator of microWakeWord), and submitted an upstream PR (I think it's already merged) which improved the resulting model slightly. I imagine the Nvidia version is similar, but I don't have experience with it. I also noticed that the model is so small (~25,000 parameters) that the actual training part doesn't even noticeably improve with the GPU; only the TTS voice generation really uses it.
If you are using this, I strongly recommend creating lots of personal samples with the recorder. I personally used 400: 200 from myself and 200 from my partner, with varying moods and in all the rooms we plan on using the assistant. I am considering re-training with more samples. It takes effort, but the resulting model seems well worth it.
[0] https://www.home-assistant.io/voice_control/worlds-most-priv...
the core issue is prosody: kokoro and piper are trained on read speech, but conversational responses have shorter breath groups and different stress patterns on function words. that's why numbers, addresses, and hedged phrases sound off even when everything else works.
the fix is training data composition. conversational and read speech have different prosody distributions and models don't generalize across them. for self-hosted, coqui xtts-v2 [1] is worth trying if you want more natural english output than kokoro.
btw i'm lily, cofounder of rime [2]. we're solving this for business voice agents at scale, not really the personal home assistant use case, but the underlying problem is the same.
Here are the models I found work well:
- Qwen ASR and TTS are really good. Qwen ASR is faster than OpenAI Whisper on Apple Silicon from my tests. And the TTS model has voice cloning support so you can give it any voice you want. Qwen ASR is my default.
- Chatterbox Turbo also does voice cloning TTS and is more efficient to run than Qwen TTS. Chatterbox Turbo is my default.
- Kitten TTS is good as a small model, better than Kokoro
- Soprano TTS is surprisingly really good for a small model, but it has glitches that prevent it from being my default
But overall the mlx-audio library makes it really easy to try different models and see which ones I like.
After getting it working I was motivated to actually build out the full fine-tuning pipeline. I wrote a little post about it all: https://quickthoughts.ca/posts/listenr-asr-training-data-pro...
I would argue that the hardest part is correctly recognizing that it's being addressed. 98% of my frustration with voice assistants is them not responding when spoken to. The other 2% is realizing I want them to stop talking.
I hear the same phrases 10+ times in a day and they stress things a bit differently each time; it seems like an exceptionally hard problem. My dream of a super reliable [llm output stream -> streaming TTS endpoint -> webRTC audio stream] seems pretty much impossible at this point.
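One mitigation for that pipeline is to buffer LLM tokens into sentence-sized chunks before handing them to the TTS endpoint, so each utterance is a complete breath group and prosody stays consistent within it. A rough sketch (the chunking rule here is my own assumption, not any particular TTS API's behavior):

```python
from typing import Iterable, Iterator

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group a stream of LLM tokens into sentence-sized chunks for TTS.

    Flushes whenever a token ends with sentence-final punctuation, so
    each downstream TTS request gets a complete sentence to work with.
    """
    buffer: list[str] = []
    for tok in tokens:
        buffer.append(tok)
        if tok.rstrip().endswith((".", "!", "?")):
            yield "".join(buffer).strip()
            buffer = []
    if buffer:  # flush any trailing partial sentence
        yield "".join(buffer).strip()

# Example with a fake token stream:
stream = ["The ", "light ", "is ", "on. ", "Anything ", "else?"]
print(list(sentence_chunks(stream)))  # ['The light is on.', 'Anything else?']
```

This trades a little latency (waiting for punctuation) for stability, which is usually the right trade when the alternative is word-by-word stress patterns drifting mid-sentence.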
Is the goal to trick people into thinking it is a human or to create a high trust robot? I am hoping as voice agents get more sophisticated the stigma around "It's making me talk to a robot" lessens so we don't need to worry so much about convincing someone it is a real person.
(Yes, I appreciate that some people may be disabled in such a way that it makes sense to use voice assistants, eg motor problems)
If a light cannot be automatically on when I need it (like a motion sensor) or controlled with a dedicated button within arm's reach (like a remote on my desk), then the third best option is one that lets me control it without interrupting what I'm doing, moving from where I am, using my hands, or possessing anything (a voice assistant).
My point being that it might be a failure to you but not to them; some people don't want it.
This is my struggle: how to get the automation to do what I want without affecting everyone else equally. (And vice versa.)
It’s why I haven’t and won’t enable Gemini, and I’ll likely chuck my nest minis once I’m forced to have an LLM-based experience. Hopefully they’ll be able to at least function as dumb Bluetooth speakers still but I’m not holding out hope on that end
A radiologist friend of mine convinced me to give it a try; apparently radiology reports are dictated in most places nowadays.
The main frustrations are usually speed and precision, but modern dictation software is pretty flawless.
Same for different scenarios when you don't want to use your hands (say you are replanting a flower or something).
I mostly set timers because it’s one of the few things that always works.
> Understands when it is in a particular area and does not ask “which light?” when there is only one light in the area, but does correctly ask when there are multiple of the device type in the given area.
I set 2 timers for the same thing somehow. I then tried to cancel one of them.
>“Siri, cancel the second timer”
“You have 2 timers running, would you like me to cancel one of them?”
>“Yes”
“Yes is an English rock band from the 70s…”
>“Siri, please cancel the timer with 2 minutes and 10 seconds on it”
“Would you like me to cancel the timer with 2 minutes and 8 seconds on it?”
>“Yes”
“Yes is an English rock band from the 70s…”
Eventually they both rang and she listened when I said stop.

Me: "Text Jane Would you mind dropping down the robe and underpants"
Siri: Sends Jane "Would you mind dropping down"
Me: rolls eyes "Text Jane robe and underpants"
Siri: "I don't see a Jane Robe in your contacts."
Me: wishes I could drown Siri in the bathtub
It's wild to me that Apple had the actual speech-to-text part pretty much 100% solved more than half a decade ago, yet struggles in 2026 to turn streams of very simple, correctly transcribed text into intents in ways that even a local model can figure out. Siri is good STT plus a bunch of serviceable APIs that can control lots of stuff, with the digital equivalent of a brain-damaged cat sitting at the center of it, guaranteeing the worst possible experience.
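To make the "even a local model" point concrete: once the transcript is correct, a trivial rule-based matcher already covers the common cases. A toy sketch, nothing like Siri's real pipeline:

```python
import re

# Toy intent patterns; a real assistant uses far richer grammars or an LLM.
PATTERNS = [
    (re.compile(r"cancel the timer", re.I), "timer.cancel"),
    (re.compile(r"set a timer for (\d+) minutes?", re.I), "timer.set"),
    (re.compile(r"text (\w+) (.+)", re.I), "message.send"),
]

def match_intent(transcript: str) -> tuple[str, tuple[str, ...]]:
    """Map a transcript to the first matching (intent, slots) pair."""
    for pattern, intent in PATTERNS:
        m = pattern.search(transcript)
        if m:
            return intent, m.groups()
    return "unknown", ()

print(match_intent("set a timer for 5 minutes"))
print(match_intent("Text Jane robe and underpants"))
```

Note that even this toy keeps the whole message body in one capture group, which is exactly where the "Jane Robe" failure above falls down: the hard part is slot boundaries, not the words themselves.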
For me, Siri on either phone or watch is pretty much perfect - I don’t ask for much, mostly timers or making reminders.
Google’s Nest Minis though? “Lights on” has a 50/50 shot of being a song of the same name, or similar name, or totally unrelated name. Same for “lights off”. If I don’t enunciate “play rain sounds” clearly enough I get an album called “Rain Songs” that is very much NOT calming for bed time. It doesn’t help that none of these understand that if I whisper a command, it should respond quietly. Honestly, it feels like the Siris and Nests and Alexas all got about one iteration and then stopped.
I want more features but less LLM. I want more control, and more predictability. Eg if every night around 1am I say “play rain sounds” my god just learn that I’m not, in all likelihood, asking to hear an album I’ve never listened to!
- Wake word detection isn't as good as the Google Homes (more false positives, more false negatives - so I can't just tune sensitivity).
- Mic and speakers are both of poor quality in comparison to Google Home devices.
- Flow is awkward. On a Google Home device, you can say "Okay Google, turn on the lights" with no pause. On the Voice PE, you have to say "Hey Mycroft [awkward pause while you wait for the acknowledgement noise] turn on the lights" - it seems like the Google Home devices start buffering immediately after the wake word, but the Voice PE doesn't.
- Voice fingerprints don't exist, so the device can't tell that two separate people are talking, or who is talking to it.
- The device has poor identification of background noise, so if you talk to it while there is a TV playing speech in the background, it will continue to listen to the speech from the TV. It will eventually transcribe everything you said + everything from the TV and get confused. (This probably folds into the voice print thing as well.)
On the upside, though:
- Setting it up was really easy.
- All of the entities I want to control with it are already available, without needing to export them or set them up separately in Google Home.
- Despite all of the above complaints, the device is probably 80-90% of what I realistically need to use it day-to-day. If they throw a better speaker and mic array in, I'd likely be comfortable replacing all of my Google Homes.
Google Home devices are always buffering. The wake word just tells it to look back in the buffer and start processing.
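That look-back approach can be sketched with a fixed-size ring buffer (a simplification of whatever Google actually does; frame sizes and durations here are made up):

```python
from collections import deque

class AudioRingBuffer:
    """Keep only the most recent `max_frames` audio frames.

    Because capture never stops, speech uttered right after (or even
    slightly before) the wake word is already in the buffer when
    detection fires, so no awkward pause is needed.
    """

    def __init__(self, max_frames: int):
        self.frames: deque = deque(maxlen=max_frames)

    def push(self, frame: bytes) -> None:
        self.frames.append(frame)  # oldest frame drops off automatically

    def snapshot(self) -> bytes:
        """On wake-word detection: grab everything buffered so far."""
        return b"".join(self.frames)

buf = AudioRingBuffer(max_frames=3)
for frame in (b"a", b"b", b"c", b"d"):  # fake one-byte "frames"
    buf.push(frame)
print(buf.snapshot())  # only the last 3 frames survive: b'bcd'
```

The memory cost is fixed (buffer length times frame size), which is presumably why always-buffering is feasible even on cheap smart-speaker hardware.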
How are you hosting your LLM locally? I tried Ollama on an M4 Mac mini; even with a smaller LLM, the performance was very poor.
The wake word detection isn't great, and the audio quality is abysmal (for voice responses, not music).
Amazon has ruined their Alexa and Echo devices with ads and annoying nag messages.
I'd really like an open alternative, but the basics are lacking right now.
Some of the devices contain browsers, and people have set up hacky ways to turn them into thin clients through that, but it’s not particularly reliable IME.
I heard some Chinese brands which made similar hardware for Chinese consumers don’t lock their devices down, letting you flash an open install of Android on them, but I haven’t seen anyone try that IRL.
They mention the "Qwen3.5 (35B)" model for example which was released around 2 weeks ago.
Also, the entire setup was done through Codex. I asked Codex to figure out how to run models locally given my architecture (Ubuntu, AMD GPU). It told me which steps to apply and I hit zero snags.
I almost fainted