This is very good. Unfortunately, it is still bound by the dominant paradigm of web APIs: request-response. The speech-to-text model doesn't get its first byte until I'm done talking, the LLM doesn't get its first byte until the speech-to-text model is done transcribing, and the text-to-speech model doesn't get its first byte until the LLM call is complete.
When all of these things are very fast, it can be very seamless, but each of these contributes to a floor of latency that makes it hard to get to lifelike conversation. Most of these models should be capable of streaming prefill - if not decode (for the transformer like models) - but inference servers are targeting the lowest common denominator on the web: a synchronous POST.
With only three very fast models involved, that works well enough. But the latency compounds once you combine them with agentic systems and tool calling.
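To make the latency floor concrete, here is a rough back-of-the-envelope sketch. Every number in it is a made-up placeholder, not a measurement of this project, and the `Stage` shape and function names are mine. It compares time to first audio when each stage waits for the full upstream result against an idealized pipeline where each stage starts on the first upstream byte.

```typescript
// All timings are hypothetical placeholders in milliseconds, not measurements.
interface Stage {
  name: string;
  firstOutputMs: number; // first input byte -> first output byte
  totalMs: number;       // first input byte -> last output byte
}

const pipeline: Stage[] = [
  { name: "speech-to-text", firstOutputMs: 50, totalMs: 300 },
  { name: "llm", firstOutputMs: 100, totalMs: 600 },
  { name: "text-to-speech", firstOutputMs: 80, totalMs: 400 },
];

// Request-response: every stage waits for the complete output of the one
// before it, so upstream stages contribute their full duration.
function requestResponseLatency(stages: Stage[]): number {
  if (stages.length === 0) return 0;
  const upstream = stages.slice(0, -1);
  const last = stages[stages.length - 1];
  return upstream.reduce((sum, s) => sum + s.totalMs, 0) + last.firstOutputMs;
}

// Idealized end-to-end streaming: every stage starts as soon as the first
// byte arrives from upstream, so only time-to-first-output accumulates.
function streamedLatency(stages: Stage[]): number {
  return stages.reduce((sum, s) => sum + s.firstOutputMs, 0);
}

console.log(requestResponseLatency(pipeline)); // 980
console.log(streamedLatency(pipeline));        // 230
```

Under these assumptions, each extra tool call adds its whole `totalMs` to the first figure but only its `firstOutputMs` to the second, which is the compounding I mean.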
The sooner we adopt end-to-end, bidirectional streaming for AI, the sooner we'll reach more lifelike, friendly, low latency experiences. After all, inter-speaker gaps in person to person conversations are often in the sub-100ms range and between friends, can even be negative! We won't have real "agents" until models can interrupt one another and talk over each other. Otherwise these latencies compound to a pretty miserable experience.
Relatedly, Guillermo - I've contributed PRs that reduce the latency of tool calling APIs in the AI SDK and that bring WebSockets to Next.js. Let's break free of request-response and remove the floor on latency.
True full-duplex communication, including interruption, will be more challenging but should be possible with current model architectures. The model may need to be able to emit no-op or "pause" tokens, or to serve as the voice activity detector (VAD) itself, and the positional encoding of tokens might need to be replaced or augmented with time and participant information.
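As a purely illustrative sketch of what that could look like as data (none of this comes from an existing model or library; the `TimedToken` shape, the `<pause>` marker, and the fixed tick size are all assumptions of mine), each token carries a timestamp and a speaker instead of a bare sequence index, and silence is represented explicitly:

```typescript
type Participant = "user" | "model";

// Hypothetical token shape: position is wall-clock time plus speaker.
interface TimedToken {
  text: string;
  participant: Participant;
  timeMs: number;
}

const PAUSE = "<pause>"; // assumed no-op token meaning "nothing said this tick"

// Emit one token per tick for a participant, filling silent ticks with PAUSE.
// Assumes the input tokens are already aligned to multiples of tickMs.
function padWithPauses(
  tokens: TimedToken[],
  participant: Participant,
  tickMs: number,
  endMs: number,
): TimedToken[] {
  const byTime = new Map<number, TimedToken>();
  for (const t of tokens) byTime.set(t.timeMs, t);
  const out: TimedToken[] = [];
  for (let timeMs = 0; timeMs < endMs; timeMs += tickMs) {
    out.push(byTime.get(timeMs) ?? { text: PAUSE, participant, timeMs });
  }
  return out;
}

// Merge both speakers into one time-ordered sequence; tokens that share a
// timestamp are overlapping speech, which is where interruption shows up.
function interleave(a: TimedToken[], b: TimedToken[]): TimedToken[] {
  return [...a, ...b].sort((x, y) => x.timeMs - y.timeMs);
}

const model = padWithPauses(
  [
    { text: "Hi", participant: "model", timeMs: 0 },
    { text: "there", participant: "model", timeMs: 200 },
  ],
  "model",
  100,
  300,
);
const user = padWithPauses(
  [{ text: "wait", participant: "user", timeMs: 100 }],
  "user",
  100,
  300,
);
const merged = interleave(model, user);
console.log(merged.map((t) => `${t.timeMs}:${t.participant}:${t.text}`));
```

Here the user's "wait" lands in the same tick as a model pause, so a model trained on sequences like this could in principle learn when to yield and when to keep talking.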
I imagine the first language model which has "awkward pauses" is only a year or so away.
(Not your server, not your code!)
"Step into my cloud," said the spider to the serf
"Your data's safe here, protected from the earth"
But as he uploads, bit by bit he'll see
The silken strands that bind his destiny
"Scaling's easy," it promises with a smile
But switching costs accumulate, all the while
The serf's apps and files, once free to roam
Are now trapped in a rented home
The spider's web, so soft and full of ease
Soon becomes a cage the serf can't leave
Is it really open source then, even though (as far as I can tell) Whisper and Llama have open weights but not open data, and that speech synthesis thing is seemingly fully proprietary?
Loving the new wave of ultrafast voice assistants though, and your execution in particular is very good.
I take it that Show HN is not just about the creation but also about the creator and the journey behind what’s being shown.
I'm 16, and have only been programming for a few years, so it's a good opportunity for me to learn a lot about web development and engineering.
> Show HN is for something you've made that other people can play with.
...
> The project must be something you've worked on personally and which you're around to discuss.
---
OP doesn't spam Show HN or anything, so probably worth giving the benefit of the doubt. If it doesn't comply they'll probably realize and fix it.
I haven't been using LLM-powered voice assistants much since I usually prefer text. One thing I noticed playing around with this demo is that the conversational uncanny valley becomes much more apparent when you're speaking with the LLM.
That's not a knock on this project, but wow it's something I want to think about more.
Thanks for sharing!
[1]: https://github.com/ai-ng/swift/blob/7d1f993b095abc4a51cf9c70...
Why is this still so easy?