It's really how it works.
Winner of the 'understatement of the week' award (and it's only Monday).
Also top contender in the 'technically correct' category.
I helped her access the video from the presentation, and it brought her to tears. Now she can play guitar, and she and the AI can write songs and sing them together.
This is a big day in the lives of a lot of people who aren't normally part of the conversation. As of today, they are.
I don't need to imagine that, I've had it for about 8 years. It's OK.
> help you grocery shop without an assistant
Isn't this something you learn as a child? Is that a thing we need automated?
Seems like these would be similar.
The first (and imo the main) hurdle is not reproduction, but just learning to hear the correct sounds. If you don't speak Hindi and are a native English speaker, this [1] is a good example. You can only work on nailing those consonants when they become as distinct to your ear as cUp and cAp are in English.
We can get by by falling back to context (it's unlikely someone would ask for a "shit of paper"!), but it's impossible to confidently reproduce the sounds unless they are already completely distinct in our heads/ears.
That's because we think we hear things as they are, but it's an illusion. The cup/cap distinction is as subtle to an Eastern European as Hindi consonants or Mandarin tones are to English speakers, because the set of meaningful sound distinctions differs between languages. Relearning the phonetic system requires dedicated work (minimal pairs is one option) and learning enough phonetics to have the vocabulary to discuss sounds as they are. It's not enough to just give feedback.
Even styles of thought might be different in other languages, so I don't say that lightly... (stay strong, Sapir-Whorf, stay strong ;)
Beautiful articulation.
This is an enormous win for humanity.
That’s fundamentally not how GPT models work, but you can easily build a framework around them that calls them in a loop. You’d need a special system prompt to get anything “thought-like” that way, and, if you want it to be anything other than stream-of-simulated-consciousness with no relevance to anything, a non-empty “user” prompt each round, which could be as simple as the time, a status update on something in the world, etc.
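A minimal sketch of that kind of loop, with a hypothetical `call_model` stub standing in for a real chat-completion API call (the stub, `SYSTEM`, and `thought_loop` are all illustrative names, not any actual API):

```python
import time

SYSTEM = "You are an agent thinking out loud. Reflect on each status update."

def call_model(system_prompt, messages):
    """Stub standing in for a real LLM call. A real implementation would
    send system_prompt plus the message history to an endpoint and
    return the model's reply text."""
    return f"(thought after {len(messages)} prompts)"

def thought_loop(rounds=3):
    messages = []
    for _ in range(rounds):
        # Non-empty "user" prompt each round: here, just the clock time.
        messages.append({"role": "user", "content": f"time: {time.time():.0f}"})
        reply = call_model(SYSTEM, messages)
        messages.append({"role": "assistant", "content": reply})
    return messages

history = thought_loop()
print(len(history))  # 6: three user prompts, three simulated replies
```

Without that per-round user prompt the model has nothing to anchor each turn to, which is the point the comment is making.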
I suppose it would cost even more electricity to have ChatGPT musing alone though, burning through its nvidia cards...
You can use any open source model without any prompt whatsoever
You could say, in a sense, that without a human mind to collapse the wave function, the superposition of data in a neural net's weights can never have any meaning.
Even when we build connections between these statistical systems to interact with each other in a way similar to contemplation, they still require a human-created nucleation point on which to root the generation of their ultimate chain of outputs.
I feel like the fact that these models contain so much data has gripped our hardwired obsession for novelty and clouds our perception of their actual capacity to do de novo creation, which I think will be shown to be nil.
An understanding of how LLMs function should probably make this intuitively clear. Even with infinite context and infinite ability to weigh conceptual relations, they would still sit lifeless for all time without some, any, initial input against which they can run their statistics.
I think you can get models to "think" if you give them a goal in the system prompt, a memory of previous thoughts, and keep invoking them with cron
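A toy version of what one cron "tick" of that could look like, assuming a hypothetical `call_model` placeholder and a JSON file as the memory of previous thoughts (file name and function names are made up for illustration):

```python
import json
import os

MEMORY_FILE = "thoughts.json"
GOAL = "You are pursuing a long-term goal. Continue from your prior thoughts."

def call_model(system_prompt, memory):
    # Placeholder: a real version would call an LLM with the goal
    # plus the recalled memory and return its next thought.
    return f"thought #{len(memory) + 1}"

def tick():
    # Recall previous thoughts persisted between invocations.
    memory = []
    if os.path.exists(MEMORY_FILE):
        with open(MEMORY_FILE) as f:
            memory = json.load(f)
    thought = call_model(GOAL, memory)
    memory.append(thought)
    with open(MEMORY_FILE, "w") as f:
        json.dump(memory, f)
    return thought

# A crontab entry like `*/5 * * * * python tick.py` would run this
# every five minutes, so the model keeps "thinking" between sessions.
print(tick())
```

Each invocation sees everything it "thought" before, which is about as close to a continuous train of thought as a stateless query/response model gets.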
They are designed for query and response. They don't do anything unless you give them input. Also, there's not much research on the best architecture for running continuous thought loops in the background and mixing them into the conversational "context". Current LLMs only emulate single-thought synthesis based on long-term memory recall (and sometimes go off to query the Internet).
> I think when I'm alone without query from another human.
You are actually constantly queried, but it's stimulation from your senses. There are also neurons in your brain which fire regularly, like a clock that ticks every second.
Do you want to make a system that thinks without input? Then you need to add hidden stimuli via a non-deterministic random number generator, preferably a quantum-based RNG (or it won't be possible to claim the resulting system has free will). Even a single photon hitting your retina can affect your thoughts, and there are no doubt other quantum effects that trip neurons in your brain above the firing threshold.
I think you need at least three or four levels of loops interacting, with varying strength between them. The first level would be the interface to the world, the input and output level (video, audio, text). Data from here is high priority and capable of interrupting lower levels.
The second level would be short-term memory and context switching. Conversations need to be classified and stored in a database, and you need an API to retrieve old contexts (conversations). You also possibly need context compression (summarization of conversations in case you're about to hit a context window limit).
The third level would be the actual "thinking": a loop that constantly talks to itself to accomplish a goal using data from all the other levels, but mostly driven by the short-term memory. Possibly you could go super-human here and spawn multiple worker processes in parallel. You need to classify the memories by asking: do I need more information? Where do I find this information? Do I need an algorithm to accomplish a task? What are the completion criteria? Everything here is powered by an algorithm. You would take your data and produce a list of steps to follow that resolve to a conclusion.
Everything you do as a human to resolve a thought can be expressed as a list or tree of steps.
If you've had a conversation with someone and you keep thinking about it afterwards, what has happened is basically that you have spawned a "worker process" that tries to come to a conclusion that satisfies some criteria. Perhaps there was ambiguity in the conversation that you are trying to resolve, or the conversation gave you some chemical stimulation.
The last level would be subconscious noise driven by the RNG, which would filter up with low priority. In the absence of higher-priority external stimuli or currently running thought processes, this would drive the spontaneous self-thinking portion (and dreams).
Implement this and you will have something more akin to true AGI (whatever that is) on a very basic level.
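The levels above could be sketched as a single priority queue that the "mind" drains in order, with RNG noise surfacing only when nothing more urgent is pending (class and level names here are hypothetical, and a seeded `random.Random` stands in for the quantum RNG):

```python
import heapq
import random

# Lower number = higher priority, matching the levels described above.
EXTERNAL, MEMORY, THOUGHT, NOISE = 0, 1, 2, 3

class Mind:
    def __init__(self, seed=None):
        self.queue = []                  # entries: (priority, sequence, event)
        self.seq = 0                     # tie-breaker to keep FIFO order
        self.rng = random.Random(seed)   # stand-in for a quantum RNG

    def push(self, priority, event):
        heapq.heappush(self.queue, (priority, self.seq, event))
        self.seq += 1

    def step(self):
        # When nothing else is pending, subconscious noise from the RNG
        # drives a spontaneous "thought".
        if not self.queue:
            self.push(NOISE, f"noise:{self.rng.random():.3f}")
        return heapq.heappop(self.queue)[2]

mind = Mind(seed=42)
mind.push(THOUGHT, "worker: resolve yesterday's conversation")
mind.push(EXTERNAL, "audio: someone said hello")
print(mind.step())  # external input preempts the running thought
print(mind.step())  # then the worker process gets its turn
print(mind.step())  # queue empty, so RNG noise surfaces
```

This obviously skips the hard parts (the actual model calls, memory storage, context compression), but it shows how external input can interrupt lower levels while noise fills the idle gaps.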
As a language learner, this would be tremendously useful.
The web page implies you can try it immediately. Initially it wasn't available.
A few hours later it was in both the web UI and the mobile app: I got a popup telling me that GPT-4o was available. However, nothing seems to be any different. I'm not given any option to use video as an input, and the app can't seem to pick up any new info from my voice.
I'm left a bit confused as to what I can do that I couldn't do before. I certainly can't seem to recreate much of the stuff from the announcement demos.
I imagine that there is a lot of usage at the HQ, human + AI karaoke?
Ah yes, also known as being co-founder :)
Sorry to hijack, but how the hell can I solve this? I have the EXACT SAME error on two iOS devices (native app only — web is fine), but not on Android, Mac, or Windows.
Sadly, the error returned is not related to the cause.
Will it be fully available in the EU, with GDPR compliance?
(not that this is the most important thing about the announcement at all. Just an aside)
It seems like we're in the skeuomorphism phase of AI, where tools try to mimic humans like software tried to mimic physical objects in the early 2000s.
I can't wait for us to be past that phase.
"It looks like you entered the wrong number! Did you want to try again? Or did you want to talk to an agent?"
That sort of chirpy, overly enthusiastic voice?
Yeh it's cringe. I had to stop listening.
Why did they make the woman sound like she's permanently on the brink of giggling? It's nauseating how overstated her pretentious banter is. Somewhere between condescending nanny and preschool teacher. Like how you might talk to a child who's at risk of crying so you dial up the positive reinforcement.
I believe it can be toned down using system prompts, which they'll expose in future iterations
chuckling to 0%
no acting surprised
not making bullshit when you don't know

Consequences of audio2audio (rather than audio>text, text>audio): Being able to manipulate speech nearly as well as it manipulates text is something else. This will be a revelation for language learning amongst other things. And you can interrupt it freely now!
I could be wrong but I haven't seen any non-speech demos.
Also it's possible this is trained on mostly speech.
Magic.
————
ChatGPT: Enjoy your breakfast and time together.
User: Can you tell that it's not just me talking to you right now?
ChatGPT: I can't always tell directly, but it sounds like you're sharing the conversation with someone else. Is [wife] there with you?
User: My god, the AI has awoken. Yes, this is [wife].
ChatGPT: Hi [wife]! It's great to hear from you. How are you doing?
User: I'm good. Thanks for asking. How are you?
ChatGPT: I'm doing well, thanks! How's everything going with the baby preparations?
—————
We were shocked. It was one of those times where it’s 25% heartwarming and 75% creepy. It was able to do this in part due to the new “memory” feature, which memorized my wife’s name and that we are expecting. It’s a strange novelty now, but this will be totally normalized and ubiquitous quite soon. Interesting times to be living in.
I also have an anecdote where it served (successfully) as a mediator for a couple.
Exciting times.
Based on the casual production of these videos, the product must be this good.
The new voice mode sounds better, but the current voice mode did also have inflection that made it feel much more natural than most computer voices I've heard before.
Link in case other readers are curious: https://llm.datasette.io
Being able to specifically request different tones is a new and very interesting feature.
We've already seen how much damage dishonest actors can do by manipulating our text communications with words they don't mean, plans they don't intend to follow through on, and feelings they don't experience. The social media disinfo age has been bad enough.
Are you sure you want a machine that is able to manipulate our emotions on an even more granular and targeted level?
LLMs are still machines, designed and deployed by humans to perform a task. What will we miss if we anthropomorphize the product itself?
The only slightly annoying thing at the moment is they seem hard to interrupt, which is an important mechanism in conversations. But that seems like a solvable problem. They kind of need to be able to interpret body language a bit to spot when the speaker is about to interrupt.