It's really how it works.
Winner of the 'understatement of the week' award (and it's only Monday).
Also top contender in the 'technically correct' category.
I helped her access the video from the presentation, and it brought her to tears. Now she can play guitar, and she and the AI can write songs and sing them together.
This is a big day in the lives of a lot of people who aren't normally part of the conversation. As of today, they are.
I don't need to imagine that, I've had it for about 8 years. It's OK.
> help you grocery shop without an assistant
Isn't this something you learn as a child? Is that a thing we need automated?
Seems like these would be similar.
The first (and imo the main) hurdle is not reproduction, but just learning to hear the correct sounds. If you don't speak Hindi and are a native English speaker, this [1] is a good example. You can only work on nailing those consonants when they become as distinct to your ear as cUp and cAp are in English.
We can get by by falling back to context (it's unlikely someone would ask for a "shit of paper"!), but it's impossible to confidently reproduce the sounds unless they are already completely distinct in our heads/ears.
That's because we think we hear things as they are, but it's an illusion. The cup/cap distinction is as subtle to an Eastern European as Hindi consonants or Mandarin tones are to English speakers, because the set of meaningful sound distinctions differs between languages. Relearning the phonetic system requires dedicated work (minimal pairs is one option) and learning enough phonetics to have the vocabulary to discuss sounds as they are. It's not enough to just give feedback.
Even styles of thought might be different in other languages, so I don't say that lightly... (stay strong, Sapir-Whorf, stay strong ;)
Beautiful articulation.
This is an enormous win for humanity.
That’s fundamentally not how GPT models work, but you can easily build a framework around them that calls them in a loop. You’d need a special system prompt to get anything “thought-like” that way, and, if you want it to be anything other than stream-of-simulated-consciousness with no relevance to anything, a non-empty “user” prompt each round, which could be as simple as the time, a status update on something in the world, etc.
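A minimal sketch of that kind of loop, with a hypothetical `call_model` stub standing in for a real chat-completion API call (the stub, `SYSTEM`, and `thought_loop` are all illustrative names, not any actual API):

```python
import time

SYSTEM = "You are an agent thinking out loud. Reflect on each status update."

def call_model(system_prompt, messages):
    """Stub standing in for a real LLM call. A real implementation would
    send system_prompt plus the message history to an endpoint and
    return the model's reply text."""
    return f"(thought after {len(messages)} prompts)"

def thought_loop(rounds=3):
    messages = []
    for _ in range(rounds):
        # Non-empty "user" prompt each round: here, just the clock time.
        messages.append({"role": "user", "content": f"time: {time.time():.0f}"})
        reply = call_model(SYSTEM, messages)
        messages.append({"role": "assistant", "content": reply})
    return messages

history = thought_loop()
print(len(history))  # 6: three user prompts, three simulated replies
```

Without that per-round user prompt the model has nothing to anchor each turn to, which is the point the comment is making.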
I suppose it would cost even more electricity to have ChatGPT musing alone though, burning through its nvidia cards...
You can use any open source model without any prompt whatsoever
You could say, in a sense, that without a human mind to collapse the wave function, the superposition of data in a neural net's weights can never have any meaning.
Even when we build connections between these statistical systems to interact with each other in a way similar to contemplation, they still require a human-created nucleation point on which to root the generation of their ultimate chain of outputs.
I feel like the fact that these models contain so much data has gripped our hardwired obsession for novelty and clouds our perception of their actual capacity to do de novo creation, which I think will be shown to be nil.
An understanding of how LLMs function should probably make this intuitively clear. Even with infinite context and infinite ability to weigh conceptual relations, they would still sit lifeless for all time without some, any, initial input against which they can run their statistics.
I think you can get models to "think" if you give them a goal in the system prompt, a memory of previous thoughts, and keep invoking them with cron
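A toy version of what one cron "tick" of that could look like, assuming a hypothetical `call_model` placeholder and a JSON file as the memory of previous thoughts (file name and function names are made up for illustration):

```python
import json
import os

MEMORY_FILE = "thoughts.json"
GOAL = "You are pursuing a long-term goal. Continue from your prior thoughts."

def call_model(system_prompt, memory):
    # Placeholder: a real version would call an LLM with the goal
    # plus the recalled memory and return its next thought.
    return f"thought #{len(memory) + 1}"

def tick():
    # Recall previous thoughts persisted between invocations.
    memory = []
    if os.path.exists(MEMORY_FILE):
        with open(MEMORY_FILE) as f:
            memory = json.load(f)
    thought = call_model(GOAL, memory)
    memory.append(thought)
    with open(MEMORY_FILE, "w") as f:
        json.dump(memory, f)
    return thought

# A crontab entry like `*/5 * * * * python tick.py` would run this
# every five minutes, so the model keeps "thinking" between sessions.
print(tick())
```

Each invocation sees everything it "thought" before, which is about as close to a continuous train of thought as a stateless query/response model gets.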
They are designed for query and response. They don't do anything unless you give them input. Also, there's not much research on the best architecture for running continuous thought loops in the background and mixing them into the conversational "context". Current LLMs only emulate single-thought synthesis based on long-term memory recall (and sometimes go off to query the Internet).
> I think when I'm alone without query from another human.
You are actually constantly queried, but it's stimulation from your senses. There are also neurons in your brain which fire regularly, like a clock that ticks every second.
Do you want to make a system that thinks without input? Then you need to add hidden stimuli via a non-deterministic random number generator, preferably a quantum-based RNG (or it won't be possible to claim the resulting system has free will). Even a single photon hitting your retina can affect your thoughts, and there are no doubt other quantum effects that trip neurons in your brain above the firing threshold.
I think you need at least three or four levels of loops interacting, with varying strength between them. The first level would be the interface to the world, the input and output level (video, audio, text). Data from here is high priority and capable of interrupting lower levels.
The second level would be short-term memory and context switching. Conversations need to be classified and stored in a database, and you need an API to retrieve old contexts (conversations). You also possibly need context compression (summarization of conversations in case you're about to hit a context window limit).
The third level would be the actual "thinking": a loop that constantly talks to itself to accomplish a goal using data from all the other levels, but mostly driven by the short-term memory. Possibly you could go super-human here and spawn multiple worker processes in parallel. You need to classify the memories by asking: do I need more information? Where do I find this information? Do I need an algorithm to accomplish a task? What are the completion criteria? Everything here is powered by an algorithm. You would take your data and produce a list of steps to follow that resolve to a conclusion.
Everything you do as a human to resolve a thought can be expressed as a list or tree of steps.
If you've had a conversation with someone and you keep thinking about it afterwards, what has happened is basically that you have spawned a "worker process" that tries to come to a conclusion that satisfies some criteria. Perhaps there was ambiguity in the conversation that you are trying to resolve, or the conversation gave you some chemical stimulation.
The last level would be subconscious noise driven by the RNG, which would filter up with low priority. In the absence of higher-priority external stimuli or currently running thought processes, this would drive the spontaneous self-thinking portion (and dreams).
Implement this and you will have something more akin to true AGI (whatever that is) on a very basic level.
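The levels above could be sketched as a single priority queue that the "mind" drains in order, with RNG noise surfacing only when nothing more urgent is pending (class and level names here are hypothetical, and a seeded `random.Random` stands in for the quantum RNG):

```python
import heapq
import random

# Lower number = higher priority, matching the levels described above.
EXTERNAL, MEMORY, THOUGHT, NOISE = 0, 1, 2, 3

class Mind:
    def __init__(self, seed=None):
        self.queue = []                  # entries: (priority, sequence, event)
        self.seq = 0                     # tie-breaker to keep FIFO order
        self.rng = random.Random(seed)   # stand-in for a quantum RNG

    def push(self, priority, event):
        heapq.heappush(self.queue, (priority, self.seq, event))
        self.seq += 1

    def step(self):
        # When nothing else is pending, subconscious noise from the RNG
        # drives a spontaneous "thought".
        if not self.queue:
            self.push(NOISE, f"noise:{self.rng.random():.3f}")
        return heapq.heappop(self.queue)[2]

mind = Mind(seed=42)
mind.push(THOUGHT, "worker: resolve yesterday's conversation")
mind.push(EXTERNAL, "audio: someone said hello")
print(mind.step())  # external input preempts the running thought
print(mind.step())  # then the worker process gets its turn
print(mind.step())  # queue empty, so RNG noise surfaces
```

This obviously skips the hard parts (the actual model calls, memory storage, context compression), but it shows how external input can interrupt lower levels while noise fills the idle gaps.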
As a language learner, this would be tremendously useful.
The web page implies you can try it immediately. Initially it wasn't available.
A few hours later it was in both the web UI and the mobile app: I got a popup telling me that GPT-4o was available. However, nothing seems to be any different. I'm not given any option to use video as an input, and the app can't seem to pick up any new info from my voice.
I'm left a bit confused as to what I can do that I couldn't do before. I certainly can't seem to recreate much of the stuff from the announcement demos.
I imagine that there is a lot of usage at the HQ, human + AI karaoke?
Ah yes, also known as being co-founder :)
Sorry to hijack, but how the hell can I solve this? I have the EXACT SAME error on two iOS devices (native app only — web is fine), but not on Android, Mac, or Windows.
Sadly, the error returned is not related to the cause.
Will it be fully available in the EU, with GDPR compliance?
(not that this is the most important thing about the announcement at all. Just an aside)
It seems like we're in the skeuomorphism phase of AI, where tools try to mimic humans like software tried to mimic physical objects in the early 2000s.
I can't wait for us to be past that phase.
"It looks like you entered the wrong number! Did you want to try again? Or did you want to talk to an agent?"
That sort of chirpy, overly enthusiastic voice?
Yeh it's cringe. I had to stop listening.
Why did they make the woman sound like she's permanently on the brink of giggling? It's nauseating how overstated her pretentious banter is. Somewhere between condescending nanny and preschool teacher. Like how you might talk to a child who's at risk of crying so you dial up the positive reinforcement.
I believe it can be toned down using system prompts, which they'll expose in future iterations
chuckling to 0%
no acting surprised
not making bullshit when you don't know

Consequences of audio2audio (rather than audio>text, text>audio): Being able to manipulate speech nearly as well as it manipulates text is something else. This will be a revelation for language learning amongst other things. And you can interrupt it freely now!
I could be wrong but I haven't seen any non-speech demos.
Also it's possible this is trained on mostly speech.
Magic.
————
ChatGPT: Enjoy your breakfast and time together.
User: Can you tell that it's not just me talking to you right now?
ChatGPT: I can't always tell directly, but it sounds like you're sharing the conversation with someone else. Is [wife] there with you?
User: My god, the AI has awoken. Yes, this is [wife].
ChatGPT: Hi [wife]! It's great to hear from you. How are you doing?
User: I'm good. Thanks for asking. How are you?
ChatGPT: I'm doing well, thanks! How's everything going with the baby preparations?
—————
We were shocked. It was one of those times where it’s 25% heartwarming and 75% creepy. It was able to do this in part due to the new “memory” feature, which memorized my wife’s name and that we are expecting. It’s a strange novelty now, but this will be totally normalized and ubiquitous quite soon. Interesting times to be living in.
I also have an anecdote where it served (successfully) as a mediator for a couple.
Exciting times.
Based on the casual production of these videos, the product must be this good.
The new voice mode sounds better, but the current voice mode did also have inflection that made it feel much more natural than most computer voices I've heard before.
Link in case other readers are curious: https://llm.datasette.io
Being able to specifically request different tones is a new and very interesting feature.
We've already seen how much damage dishonest actors can do by manipulating our text communications with words they don't mean, plans they don't intend to follow through on, and feelings they don't experience. The social media disinfo age has been bad enough.
Are you sure you want a machine that is able to manipulate our emotions on an even more granular and targeted level?
LLMs are still machines, designed and deployed by humans to perform a task. What will we miss if we anthropomorphize the product itself?
The only slightly annoying thing at the moment is they seem hard to interrupt, which is an important mechanism in conversations. But that seems like a solvable problem. They kind of need to be able to interpret body language a bit to spot when the speaker is about to interrupt.