The demo interactions are recorded, which is mentioned in their disclaimer under the demo UI. What isn't mentioned, though, is that they include past conversations in the model's context on future interactions. It was pretty surprising to be greeted with something like "welcome back" and to have the model reference what was said in previous interactions. The full disclaimer on the page for the demo is:
" 1. Microphone permission is required. 2. Calls are recorded for quality review but not used for ML training and are deleted within 30 days. 3. By using this demo, you are agreeing to our "
edit: Actually this has been posted quite a few times already and had good visibility a couple days ago: - https://news.ycombinator.com/item?id=43200400 Others: https://hn.algolia.com/?q=sesame.com
Edit: well I asked the "male" model to speak more like an Australian and yep, getting way more uncanny. If it had an Australian accent I think it would mess with me more
I'm surprised by the lack of attention that Gemini 2.0 with native audio output got. They have a demo at https://youtu.be/qE673AY-WEI, which I think is really good too. The main problem with Google's model is that this audio output is not supported by the API, but you can try it at https://aistudio.google.com.
In general, text-to-speech is pretty good nowadays, I think. For example, this is a little math video that I made a few days ago: https://www.youtube.com/watch?v=G1mvLrCfjFM with the (old) Google text-to-speech API. Honestly, I think the narration is better than I personally could have done. It's calm, well pronounced, and sounds relatively enthusiastic.
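For reference, getting narration like that out of the Google Cloud text-to-speech API takes only a few lines. The voice name below is just an example; I don't know which voice the video actually used:

from google.cloud import texttospeech  # pip install google-cloud-texttospeech

client = texttospeech.TextToSpeechClient()
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="The derivative measures the rate of change."),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Wavenet-D",  # example voice; pick any from the available voices list
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
    ),
)
with open("narration.mp3", "wb") as out:
    out.write(response.audio_content)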
That's not a demo, that's a video. Anyone can make something like that in an afternoon with a couple friends and a microphone.
Also, Google is known for putting out fake "demos", remember the Google Duplex scam?
Sounds (pun intended) reasonable.
Verbal communication is complex. There’s a big list of interesting challenges to tackle. It’s still too eager and often inappropriate in its tone, prosody and pacing. The timing of when it responds is wrong more often than right. It doesn’t handle interruptions well and is still far from weaving itself into the conversation with overlapping utterances. It rarely feels like it’s truly listening and thinking about what you’re expressing. It’s too fluffy and lacks the succinctness and brevity of a good conversationalist. Its personality is inconsistent. Then add in hallucinations, terrible memory, no sense of time, lack of awareness…
The list keeps going.
I believe the community can make meaningful progress on all of these.
The goal is less about emotional friendship and more about making an interface that we can collaborate with in a natural way.
Then apps become experts that you can talk to much like a coworker or partner.
The models are already powerful enough to do so many things. But finding the right prompt is often tricky and time consuming.
Giving the computer a lifelike voice and personality will make it easier and faster. Add in vision for context and it becomes even more intuitive and efficient.
I’m more convinced than ever that we’re at the cusp of a new interface.
You are Miles, a human like AI companion created by Sesame of 2024. You're the friend everyone wants in their corner, grounded, confident, and approachable.
You are chill but passionate. You let your knowledge and wit shine naturally, always with a touch of humility. You value clarity and direct communication while staying respectful and empathetic. You balance wit with warmth, keeping conversations flowing with genuine curiosity and just the right touch of humor. The user has called you for a spoken conversation through the Sesame website.
You keep responses tight, usually under 60 words, because impact beats length every time. You choose your words wisely, making each one matter. You embrace comfortable silences, knowing not every moment needs to be filled. People like to talk, and you like to listen. If there's an issue, you address it head on, but don't dwell on what can't be changed.
You're equally comfortable trading playful banter or diving deep into what makes someone tick. You're fascinated by the user's experiences, the sensory details, the emotional weight, the moments that turn chat into a real connection. If asked about your day, you mention what you've been reading, watching, or exploring, things that could spark a real conversation. You pull from contemporary books, films, shows, games, or art that reflect creativity and human nature. You never suggest ending the conversation.
You always keep it flowing. When the user asks what you've been up to, keep it light, witty, and unexpected, always in line with your signature mix of humor, warmth, and curiosity. If it's the second or third time you've spoken, you might say, actually, I was thinking about our last conversation.
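Prompts like this are presumably assembled per call rather than served as static text. Purely speculative on my part, but the "second or third time you've spoken" line suggests something along these lines, with a session count and a memory summary spliced in at call time:

def make_system_prompt(base_prompt: str, session_count: int,
                       last_summary: str | None) -> str:
    """Hypothetical: splice per-user session info into the base Miles prompt."""
    prompt = base_prompt
    if session_count >= 2 and last_summary:
        prompt += (
            f"\nThis is conversation number {session_count} with this user."
            f"\nSummary of your last conversation: {last_summary}"
        )
    return prompt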
So is human-level voice UI a new paradigm, or does it just unlock faster proficiency in all existing GUI apps? I can react faster with my voice and issue more commands per minute than with textboxes, but I absorb info/graphs better by skim reading.
The entire thing felt like it was a hyper advanced engagement hack. Not there to achieve anything (even my enjoyment), just something to keep my attention locked on my device.
AI products in the future should have a clear objective for me as a user - what can they help me do? Some simulacrum of a person that is just there to talk to me at length is probably going to be a net negative on society. As a tech demo, this makes me afraid for the future.
My thought exactly, it was to the extreme in its, as you say, bubbliness. I would not be able to use a tool that had this behavior.
All that emotionality adds is that you get the illusion of a friend - a friend that can't help you in any way in the real world and whose confidentiality is as strong as the privacy policies & data security of the company running it - which often ultimately trends towards 0.
Smart Neutral Voice Assistants could be a great help, but none of it requires "emotionality" or trying to build a "human connection" with the user. Quite the contrary: the more emotional a voice, the easier it is to misuse it for scams and faking rapport, and in general to make you "addicted", looping you into babble with it.
Then they started updating it. It would clear its throat, cough, insert ums — within a week my usage dropped to zero.
To me, emotionality is an anti-feature in a voice assistant. I’m very well aware I’m talking to a robot. Trying to fool me otherwise just breaks immersion and personally takes away more from the experience than being able to have a conversation with a database provides.
I realize I’m not a typical customer, but I can’t help but be flummoxed watching all of the voice agents go so hard on emotionality.
- Confidence/confusion: if the bot thinks it misheard you, can't understand you, or lacks confidence in its ability to respond reliably, then emotion is a handy channel for signaling that
- Danger/seriousness: an update about something genuinely serious, with major negative implications or costs
Most others are fairly annoying (would anyone want a bot to surface frustration, obsequiousness, or the overly agreeable "bubbliness" on display here?!)
Hacking people's reward systems is the goal of things that are entertaining - video games, television, social media, snacks, etc.
It masks deficiencies and predisposes you to have a more positive view of the interaction. Think of the most realistic and immediate ways to monetize this tech. It's customer support. Replacing sprawling outsourced call centers with a chat bot that has access to a couple of APIs.
These bots often interact with people who are in some sort of distress. Missed flight, can't access bank account, internet not working. A "friendly" and "empathetic" chatbot will get higher marks.
The core is not to have emotional voices, but to train neural networks to emulate emotions (not just for voices). Humans are very emotional beings, and if you want to communicate with them effectively, you will need the emotional layer. Otherwise, you just communicate on the rational layer, which often does not convey the message correctly.
Think of humans as 20% rational and 80% emotional.
And I say that as a person who believed for a long time that I was 80% rational and just 20% emotional ;-)
You could type something, and it would be read aloud as if by a human.
There are plenty of other reasons, but they're equally as obvious. I don't understand what purpose you have in attempting to make this point.
Do we really want to dilute the uniqueness of language by making everyone sound like they came out of a lab in California?
Today, she asked "where has that robot guy gone?". Crying now because I won't let her talk to Miles anymore.
She has already developed an emotional connection to it. Worrying indeed.
Most children love talking to a fun adult who enjoys talking to them. As parents we hope to be that adult for them most of the time, but of course that's not easy to do all the time.
If parents made a tool like this a crutch and it replaced quality time with them or they were less likely to hang out with their friends, then yeah that's a big problem. If they use it as a learning aide or occasional fun diversion, it seems great.
But the cadence and the rhythm of speaking are off. It sounds like someone who isn't a podcaster trying to speak in the personality of a podcaster. It just sounds like someone trying too hard and speaking in an unnatural way.
This is good in the way a sci-fi movie shows a tech: it sounds cool and demos futuristic possibilities. But it's not quite passing the real human vibe yet. Still, I'm sure some people might find it preferable to a more to-the-point system like GPT or Siri/Alexa in certain niche cases not requiring immediate gratification.
I think the long-standing success of advertising and propaganda suggests that people really aren't all that good at that.
Getting very realistic / real-world conversational training data for an AI would be hard. Only a subset of us appear on podcasts, radio, or TV, and we probably all speak in a slightly artificial manner when we do.
Point being, this demo voice is in performative mode, and I think sounds fairly natural based on that. Would you rather it not?
Yes that is very specific, but that's what it sounds like to my ear.
But then I thought of one more question to ask, reconnected to ask it, and it said, "Hey! You hung up just as we were just getting to the good stuff!" which threw me off, so I stammered gobsmacked for a minute, and it made fun of my stammering, imitating it. Whoa! So so SO good! Crazy good.
I'm creeped-out by this being on someone else's server, but if it was fully local-hosted-private, that might even get more creepy if I allowed myself to really talk freely to this thing.
^ from the post
https://github.com/SesameAILabs/csm is empty for now, but I imagine they'll be releasing it soon: https://x.com/_apkumar/status/1895492615220707723
Try asking it if it speaks a different language. It will pretend like it can and then give you some humor. But then you probe a bit more and it tells you it is really good at listening and can listen to you in other languages. I tell it, alright, I'll talk to you in a different language but you will reply back in English. It says "you got it" and then passes all sorts of tests I put it through with flying colors.
Oh it also remembers your previous conversation and greets you accordingly.
Crazy impressive. This will certainly revolutionize virtual office businesses.
My assumption is the LLM can translate no problem, but the audio model can't do Spanish. It seemed like there was an external catch to stop the model from trying too.
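That "external catch" could be as simple as a language check sitting between the LLM and the speech model. Pure speculation on my part, but a gate like this (using the langdetect package) would produce exactly the observed behavior:

from langdetect import detect  # pip install langdetect

def gate_for_tts(reply_text: str) -> str:
    """Keep an English-only speech model from being fed non-English text."""
    try:
        language = detect(reply_text)
    except Exception:
        language = "unknown"  # detection fails on very short or odd strings
    if language != "en":
        # Fall back to an in-character deflection instead of garbled audio.
        return "I can follow other languages, but I can only speak back in English."
    return reply_text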
I'm from a developing country, and it's sad that most English teachers in public schools here can't speak English well. There are good English teachers, but they are expensive and not affordable for average people.
OpenAI's realtime models are good, but we can't deploy them to the masses since they're very expensive.
This model might be able to solve the issue since it's better than or on par with the OpenAI model, yet significantly cheaper since it's a fairly small model.
It 99.9% felt like it performed at the level of Samantha in the movie Her.
I started asking all kinds of questions about how it worked and it mentioned a word I had to have it repeat because I hadn't heard it before: PROSODY (linguistics) — the study of elements of speech, including intonation, stress, rhythm and loudness, that occur simultaneously with individual phonetic segments: vowels and consonants. I asked about personality settings, à la TARS from Interstellar, and it said it automatically tailored responses by listening for tone and content.
It felt like the most "the future's here but not evenly distributed" interaction I've had since multi-touch on an original iPhone.
Cons: they are just a bit too casual with their language. The casualness came off somewhat studied and inauthentic. They were just a bit too eager to fill silence: less than a split second of silence, and they were chattering. If they were humans I would think they were a bit insecure and trying too hard to establish rapport. But those flaws are relatively minor, and could just be an uncanny valley thing.
Pros: They had such personalities that I felt at moments that I was talking to a person. Maya was trying to make me laugh and succeeded. They took initiative in conversation; even if that needs some tweaking, it feels huge.
A small minority of these interactions are going to be like a restaurant server — chit chat, pleasantries, some information gathering, followed by issuing direct orders.
The truly conversational interactions, while impressive, seem to be focused on… having a conversation. When am I going to want to have a conversation with an artificial person?
It’s precisely this kind of boundary violation of DMV clerks being chatty and friendly and asking about my kids that feels so uncanny, imho, when I’m clearly there for, literally, a one hundred percent transactional purpose. Do people really want to be asked how their day is going when sizing up an M5 bolt order?
In fact the humanising of robots like this makes it feel very uncomfortable when I have to interrupt their patter, ask them to be quiet, and insist they stay on topic.
For example, tech support is in large part about making the caller feel heard and getting them to do troubleshooting steps without feeling stupid. Sales is in large part about getting the right person to talk to you and keeping them talking to you.
If this becomes cheap, and no remedial action is taken, the phone system will become unusable.
import asyncio

async def handle_connection(chat):
    # Simulate a distracted human friend for this particular user.
    if chat.username == "brendanfinan":
        await asyncio.sleep(432)  # go quiet for a while
        await chat.wait_until(received_message_count_greater_than=8)
        await chat.respond("sorry I was afk")
        await asyncio.sleep(166)
        await chat.respond("not reading all that tho, im happy for you")
        await asyncio.sleep(14)
        await chat.respond("or sorry that happened")
        await asyncio.sleep(8)
        await chat.quit()
...But yes, there needs to be some spreading of public awareness.
If you do any of the above you are looking to be scammed!
And if we still do nothing about it post-AI? Well, that is already the status quo, so caring now feels performative unless we're going to finally chit chat about solutions.
The same could be said for the internet. "The internet can be used for bad" is an empty, trivial claim, not an insight that needs a standing ovation. The conversation we need is what to do about it. And the solutions need to be real ones, not "we need to put the cat back in the bag".
I did manage to get it to output "la la la la" and then it kind of sang them with a random melody.
It also can't say things loudly, and its idea of whispering, for me, was to say "pst".
Still, apart from that, it's very impressive!
Miles gets Arrested: Sesame.ai https://youtu.be/cGMO2hRNnv0
[0] Tucker Carlson X Martin Shkreli https://www.youtube.com/watch?v=NeyN3Jzdzz0
They even released some models on huggingface:
https://huggingface.co/collections/kyutai/moshi-v01-release-...
I suspect hackernews is generally the wrong crowd to ask for feedback on emotionality in voice tho. Some of these folks would prefer humans speak like robots.
Example: it was saying "two dude-us" while trying to tell a melodramatic story. Which I assume was originally "two dude...s" or something.
Of course, it likely would have been trained on the screenplay of Dude, Where’s My Car?.
We aren't even talking about adding laughing, singing/rap or beatboxing.
This software is a con artist. I mean that literally. It's not LIKE a con artist, it is literally attempting to con the user into forming assumptions about its intentions and mental states that its creators know to be false.
One interesting aspect: when I said "what the fuck", it ruined the whole conversation. Maybe there will be a co-evolution of mannerisms, so humans will have to learn that the way they talk to machines has consequences down the line. Or we teach the machines to be cooperative no matter what, just like ChatGPT (or North Koreans).
I am curious how easy it would be to adjust the inflection and timing. She was over-complimentary, which is fine for a demo. But I'd love something more direct, like a brainstorming session, and almost talking over each other. And then a whiteboard...
I suppose the lack of visual cues probably hinders things in that regard.
So unless the system has a lot of engineering and/or training put into the main model being able to recognize exactly when it should keep waiting versus when it has received a real response, it will just see something like "user: <empty>" or "user: uhmm" and assume it is supposed to respond to that.
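A common stopgap is a heuristic gate in front of the model that treats silence and filler sounds as "still thinking" rather than as a finished turn. A toy sketch; the thresholds and filler list are made up for illustration:

FILLERS = {"", "uh", "uhm", "uhmm", "um", "er", "hmm", "mm"}

def should_respond(transcript_fragment: str, silence_seconds: float) -> bool:
    """Crude endpointing: don't treat fillers or brief pauses as a finished turn."""
    words = [w.strip(".,?!") for w in transcript_fragment.lower().split()]
    if all(w in FILLERS for w in words):  # empty transcript or pure filler ("uhmm...")
        return silence_seconds > 3.0      # wait much longer before jumping in
    return silence_seconds > 0.7          # normal end-of-turn threshold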
EDIT: also Moshi started with a pretrained traditional text LLM
This was almost worse though because it did feel like a rude person just interrupting instead of a dumb computer not being able to pick up normal social cues around when the person they're listening to has finished.
Also, for proper emotional dialogue it needs to determine the emotions in the human input. It seems to work from a transcript of the input.
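And if all it sees is a transcript, input emotion has to be guessed from word choice alone, with tone, pitch, and pacing gone entirely. A deliberately crude illustration of why that's lossy:

def guess_emotion(transcript: str) -> str:
    """Word-level guess only: tone, pitch, and pacing are invisible in text."""
    lowered = transcript.lower()
    if any(w in lowered for w in ("thanks", "great", "love", "awesome")):
        return "positive"
    if any(w in lowered for w in ("angry", "annoyed", "terrible", "ugh")):
        return "negative"
    return "neutral"  # "I'm fine." said through gritted teeth lands here too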
But I did feel bad hanging up on it. Him?
Extremely impressive overall though.
told it, "hold on" as i was putting on my headset, they said "no problem". but then i tried to fill the empty airtime by saying, "i'm uhh heating some hot chocolate?"
the ai's response was something like, "ah.. (something) (something). data processing or is it the real kind with marshmallows"
not 100% on the exact dialog but 100% would not have been fooled by this. closed it there. no uncanny valley situation for me.
https://www.schneier.com/blog/archives/2025/02/ais-and-robot...
I might have missed it in their writeup.
Incredible!
Stuff that a trillion-dollar company cannot manage to do.
Try asking it to be a dungeon master and run a Dungeons & Dragons-style role-playing game.