These demos show people talking to artificial intelligence. This is new. Humans are more partial to talking than writing. When people talk to each other (in person or over low-latency audio) there's a rich metadata channel of tone and timing, subtext, inexplicit knowledge. These videos seem to show the AI using this kind of metadata, in both input and output, and the conversation even flows reasonably well at times. I think this changes things a lot.
I'm also incredibly excited about the possibility of this as an always available coding rubber duck. The multimodal demos they showed really drove this home, how collaboration with the model can basically be as seamless as screensharing with someone else. Incredible.
I don't want to chat with computers to do basic things. I only want to chat with computers when the goal is to iterate on something. If the computer is too dumb to understand the request and forces me to iterate just to be understood, I want no part.
(See also 'The Expanse' for how sci-fi imagined this properly.)
Is that because you're not used to it? Honestly asking.
This is probably the first time it feels natural, whereas all our previous experiences with "chat bots", "automated phone systems", and "automated assistants" were absolutely terrible.
Naturally, we dislike it because "it's not human". But this is true of pretty much anything that approaches the "uncanny valley". If the not-human thing solves your problem 100% better/faster than the human counterpart, though, we tend to accept it a lot faster.
This is the first real contender. Siri was the "glimpse" and ChatGPT is probably the reality.
[EDIT]
https://vimeo.com/945587328 the Khan Academy demo is nuts. The inflections are so good. It's pretty much right there in the uncanny valley, because it does still feel like you're talking to a robot, but you're also directly interacting with it. Crazy stuff.
But of course this was the age-old debate with our favorite golden-eyed android; and unsurprisingly, he too received the same sort of animosity:
Bones was deeply skeptical when he first met Data: "I don't see no points on your ears, boy, but you sound like a Vulcan." And we all know how much he loved those green-blooded fools.
Likewise, Dr. Pulaski has since been criticized for her rude and dismissive attitudes towards Data that had flavors of what might even be considered "racism," or so goes the Trekverse discussion on the topic.
And let's of course not forget when he was on trial, essentially for his "humanity," or whether he was indeed just the property of Starfleet, and nothing more.
More recent incarnations, like Star Trek: Picard, illustrated the outright ban on "synthetics" and indeed their effective banishment; non-synthetic life -- from human to Romulan -- simply wasn't OK with them.
Yes, this is all science fiction silliness -- or adoration, depending on your point of view -- but I think it very much reflects the myriad directions our real-life world is going to scatter (shatter?) in the coming years.
We get the upside of conversation, and avoid the downside of falling asleep at the wheel (as Ethan Mollick mentions in "Co-Intelligence".)
I was literally just thinking about this a few days ago... that we need a multi-modal language model with speech training built-in.
As soon as this thing rolls out, we'll be talking to language models like we talk to each other. Previously it was like dictating a letter and waiting for the responding letter to be read to you. Communication is possible, but not really in the way that we do it with humans.
This is MUCH more human-like, with the ability to interrupt each other and glean context clues from the full richness of the audio.
The model's ability to sing is really fascinating. Its ability to change the sound of its voice -- its pacing, its pitch, its tonality. I don't know how they're controlling all of that via GPT-4o tokens, but this is much more interesting stuff than what we had before.
I honestly don't fully understand the implications here.
Amazon, Google, and Apple have sunk literally billions of dollars into this idea only to find out that, no, we aren't.
We are with other humans, yes. When socialization is part of the conversation. When I'm talking to my local barista I'm not just ordering a coffee, I'm also maintaining a relationship with someone in my community.
But when it comes to work, writing >>> talking. Writing is clarity of ideas. Talking is cult of personality.
And when it comes to inputs/outputs, typing is more precise and more efficient.
Don't get me wrong, this is a revolutionary piece of technology. But I don't think the benefits of talking you're describing (timing, subtext, inexplicit knowledge) are achievable here either (for now), since even humans need HOURS of interaction, over days/weeks/months of shared experience, to achieve that with each other.
>>> But when it comes to work, writing >>> talking. Writing is clarity of ideas. Talking is cult of personality.
A lot of people think of their colleagues as part of a professional community as well, though.
Is it so?
Most of the time, speaking is for short exchanges of information (from pleasantries to essential facts).
I prefer writing for long, in-depth exchanges of ideas (whether by email, blog, etc.)
In many cultures - European or Asian, people are not very loquacious in everyday life.
I’m 100% a text-everything, never-call person, but I can’t live without Alexa these days; every time I’m in a hotel or on vacation I nearly ask it a question out loud.
I also hate how much Alexa sucks, so this is a big deal. I spent years weeding out what it could and couldn’t do, so it will be nice to have one that I don’t have to treat like a toddler.
(We mostly use it in car trips -- great for keeping the kids (ages 8, 12) occupied with endless Harry Potter trivia questions, answers to science questions, etc.)
Besides - not sure if I want this level of immersion/fake when talking to a computer...
"Her" comes to mind pretty quickly…
If you don’t complete your thought in one go, you have to insert filler words to keep it listening.
I've long felt that embracing the concept of the 'prompt' was a terrible idea for Siri and all the other crappy voice assistants. They built ecosystems on top of this dumb reduction, which only engineers could have made: that _talking to someone_ is basically taking turns to compose a series of verbal audio snippets in a certain order.
The previous ChatGPT app was getting pretty good once you learned to avoid run-on sentences and to break your speech up enough.
The tonality and inflections in the voice are a little too good.
Most people, put on a spectrum/averaged out, aren't that good at speaking and communicating, and that gap stands out as an uncanny-valley effect. It is mindbogglingly good at it, though.
I don't think that's generally true, other than for socializing with other humans.
Note how people, now having a choice, prefer to text each other most of the time rather than voice call.
I don't think people sitting at work in their cube farm want to be talking to their computer either. The main use for voice would seem to be for occasional use talking to an assistant on a smartphone.
Maybe things will change in the future when we get to full human AGI level, treating the AGI as an equal, more as a person.
More on the IBM Personal Speech Assistant for which I am on a patent (since expired) by Liam Comerford: http://liamcomerford.com/alphamodels3.html "The Personal Speech Assistant was a project aimed at bringing the spoken language user interface into the capabilities of hand held devices. David Nahamoo called a meeting among interested Research professionals, who decided that a PDA was the best existing target. I asked David to give me the Project Leader position, and he did. On this project I designed and wrote the Conversational Interface Manager and the initial set of user interface behaviors. I led the User Interface Design work, set specifications and approved the Industrial Design effort and managed the team of local and offsite hardware and software contractors. With the support of David Frank I interfaced it to a PC based Palm Pilot emulator. David wrote the Palm Pilot applications and the PPOS extensions and tools needed to support input from an external process. Later, I worked with IBM Vimercati (Italy) to build several generations of processor cards for attachment to Palm Pilots. Paul Fernhout translated (and improved) my Python based interface manager into C and ported it to the Vimercati coprocessor cards. Jan Sedivy's group in the Czech Republic ported the IBM speech recognizer to the coprocessor card. Paul, David and I collaborated on tools and refining the device operation. I worked with the IBM Design Center (under Bob Steinbugler) to produce an industrial design. I ran acoustic performance tests on the candidate speakers and microphones using the initial plastic models they produced, and then farmed the design out to Insync Designs to reduce it to a manufacturable form. Insync had never made a functioning prototype so I worked closely with them on Physical UI and assemblability issues. Their work was outstanding. By the end of the project I had assembled and distributed nearly 100 of these devices.
These were given to senior management and to sales personnel."
Thanks for the fun/educational/interesting times, Liam!
As a bonus for that work, I had been offered one of the chessboards that had been used when IBM Deep Blue defeated Garry Kasparov, but I turned it down, as I did not want a symbol around of AI defeating humanity.
Twenty-five years later, how far that aspiration towards conversational speech with computers has come. Some ideas I've put together to help deal with the fallout: https://pdfernhout.net/beyond-a-jobless-recovery-knol.html "This article explores the issue of a "Jobless Recovery" mainly from a heterodox economic perspective. It emphasizes the implications of ideas by Marshall Brain and others that improvements in robotics, automation, design, and voluntary social networks are fundamentally changing the structure of the economic landscape. It outlines towards the end four major alternatives to mainstream economic practice (a basic income, a gift economy, stronger local subsistence economies, and resource-based planning). These alternatives could be used in combination to address what, even as far back as 1964, has been described as a breaking "income-through-jobs link". This link between jobs and income is breaking because of the declining value of most paid human labor relative to capital investments in automation and better design. Or, as is now the case, the value of paid human labor like at some newspapers or universities is also declining relative to the output of voluntary social networks such as for digital content production (like represented by this document). It is suggested that we will need to fundamentally reevaluate our economic theories and practices to adjust to these new realities emerging from exponential trends in technology and society."
Another idea for dealing with the consequences is using AI to facilitate Dialogue Mapping with IBIS for meetings to help small groups of people collaborate better on "wicked problems" like dealing with AI's pros and cons (like in this 2019 talk I gave at IBM's Cognitive Systems Institute Group). https://twitter.com/sumalaika/status/1153279423938007040
Talk outline here: https://cognitive-science.info/wp-content/uploads/2019/07/CS...
A video of the presentation: https://cognitive-science.info/wp-content/uploads/2019/07/zo...
Yeah, and it's only the beginning.