0 points Lucasoato 5d ago 0 comments

Wait a minute... I’m genuinely happy that they are sharing this, but keep in mind that the realtime audio models from OpenAI are still stuck with the 4o family in terms of capabilities, sadly. I still find them so useful. It's such a pity that there’s no real competitor in this segment; having the experience of a real conversation has helped me so much in expressing ideas and concepts.

Still, it’s worth keeping in mind that these are no longer frontier models, unlike when they were released.

(Please Sam, if you read this, release the new realtime audio models)

modeless 5d ago

Grok voice is surprisingly good, actually. It's still a dumber model than the thinking modes of frontier models, but it's less dumb than the voice modes of other providers.

artdigital 5d ago

The Grok voice model is also a thinking model. I agree that it’s far better than the other voice models.

Just give me an option to have a slower response but a better model…

dharma1 5d ago

Yes, the voice part of OpenAI realtime/voice mode is great, but it’s pretty dumb compared to newer models and often gets stuck repeating itself.

Google’s Gemini flash live 3.1 is better, especially when used via the API: it can do tool calling (including to other, even smarter LLMs if you set it up yourself), you can set the reasoning level (even high is still close enough to realtime), and it can ground answers in Google Search. I love bidirectional voice, and right now it’s probably the best option. You can try it in AI Studio.

Lucasoato OP 5d ago

Thanks, I’ll try it, even though my recent experience with Google models wasn’t that great (503s).

dharma1 5d ago

Give it a shot: use the 3.1 live one in AI Studio/API and max out reasoning, not the one in the Gemini app, which is an older model.

Another option is to use Pipecat with its VAD, separate STT and TTS, and any (fast) LLM of your choice, but it’s more plumbing and not a true speech-to-speech model.
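To make the "more plumbing" point concrete, here is a rough sketch of what a cascaded VAD, STT, LLM, TTS turn looks like. This is not Pipecat's API: the class name `CascadedVoicePipeline`, its fields, and the stub components are all hypothetical, and a real setup would stream audio instead of handling one finished turn at a time.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CascadedVoicePipeline:
    """Hypothetical cascaded pipeline; every stage is a plain callable."""
    is_speech: Callable[[bytes], bool]   # VAD: does this frame contain speech?
    transcribe: Callable[[bytes], str]   # STT: audio -> text
    generate: Callable[[str], str]       # LLM: user text -> reply text
    synthesize: Callable[[str], bytes]   # TTS: reply text -> audio

    def run_turn(self, frames: List[bytes]) -> bytes:
        # Keep only the frames the VAD flags as speech.
        voiced = b"".join(f for f in frames if self.is_speech(f))
        if not voiced:
            return b""  # nothing was said, so stay silent
        text = self.transcribe(voiced)
        reply = self.generate(text)
        return self.synthesize(reply)


# Stub stages standing in for real VAD/STT/LLM/TTS services (assumptions).
pipeline = CascadedVoicePipeline(
    is_speech=lambda frame: any(frame),              # all-zero bytes count as silence
    transcribe=lambda audio: audio.decode("utf-8"),  # pretend the audio is UTF-8 text
    generate=lambda text: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)

audio_out = pipeline.run_turn([b"\x00\x00", b"hello", b"\x00"])
```

Each hop adds latency and drops tone and prosody when the audio becomes text, which is why this is not a true speech-to-speech model, but any stage (notably the LLM) can be swapped independently.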

TomGarden 5d ago

Claude voice mode has come a long way! I'd say it's smarter than CGPT AVM was the last time I tried it.

But personally I've settled on just speaking to the slower models through a custom TTS app. I found that instant responses were not actually that important, and in the silence I find myself marinating in the discussion more anyway.

radicality 5d ago

Yeah, I was quite surprised that the advanced ChatGPT voice mode can’t itself go and message the frontier model underneath to retrieve data and then speak it. I basically tried asking it for that (something like “can you go and ask gpt5.5 to research this more in depth, and while we wait, tell me about XYZ”), but apparently that’s not a thing.

artdigital 5d ago

This is what makes their voice mode unusable to me. I can’t stand the way 4o replies, and it’s such a big drop in quality compared to text mode.

sails 5d ago

You can feel what is possible using the Gemini speech-to-speech model: it can do tool calls and is very fast. It lacks somewhat in thinking capability, but you can set up a tool call to a smarter model and it acts as a relay. I’ve been very impressed.
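For anyone wondering what the relay setup amounts to, a minimal sketch follows. It does not use the Gemini API: `handle_turn`, the `ask_smart_model` tool name, the dict shape of the model's decision, and both stub models are assumptions made purely for illustration.

```python
from typing import Callable, Dict

Tool = Callable[[str], str]


def handle_turn(user_text: str,
                fast_model: Callable[[str], dict],
                tools: Dict[str, Tool]) -> str:
    """One turn: the fast voice model either answers directly or delegates."""
    decision = fast_model(user_text)
    tool_name = decision.get("tool")
    if tool_name is not None:
        # Relay: forward the question to the smarter model, then speak its answer.
        answer = tools[tool_name](decision["argument"])
        return f"Here is what I found: {answer}"
    return decision["reply"]


def stub_fast_model(text: str) -> dict:
    # Hypothetical stand-in for the voice model's tool-calling decision.
    if "research" in text.lower():
        return {"tool": "ask_smart_model", "argument": text}
    return {"reply": "Sure, happy to chat."}


def stub_smart_model(question: str) -> str:
    # Hypothetical stand-in for a slower, smarter text model.
    return f"(detailed answer to: {question})"


tools = {"ask_smart_model": stub_smart_model}
relayed = handle_turn("Please research voice models", stub_fast_model, tools)
direct = handle_turn("Hello there", stub_fast_model, tools)
```

The voice model stays responsible for the conversation and only hands off the hard questions, so small talk stays fast while the deeper answers cost one extra round trip.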

ddp26 5d ago

Yeah, the question in the title can be answered: "by using gpt-4o, a model 2 years behind the frontier, to serve audio responses"
