Adding audio data as tokens, in and of itself, would dramatically increase training size, cost, and time for very little benefit. Neural networks also generally function less effectively with highly correlated inputs, which I can only assume is still an issue for LLMs, and folding audio into the training mix would introduce rather large-scale correlations in the inputs.
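To put very rough numbers on the size point (every rate below is my own assumption for illustration, not a published figure):

    # Back-of-envelope: sequence length for one minute of raw audio tokens vs its transcript.
    # All rates are assumed round numbers, purely illustrative.
    AUDIO_TOKENS_PER_SEC = 75     # assumed rate for a neural audio codec
    WORDS_PER_MIN = 150           # typical conversational speaking rate
    TEXT_TOKENS_PER_WORD = 1.3    # common rule of thumb for BPE tokenizers

    audio_tokens = AUDIO_TOKENS_PER_SEC * 60                          # 4500
    text_tokens = (WORDS_PER_MIN / 60) * TEXT_TOKENS_PER_WORD * 60    # ~195
    print(f"{audio_tokens} audio tokens vs {text_tokens:.0f} text tokens "
          f"({audio_tokens / text_tokens:.0f}x) for the same minute of speech")

Even with generous assumptions you're looking at well over an order of magnitude more tokens to represent the same content.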
I would wager like 100:1 that this is just introducing some TTS/STT layers. The video processing layer is probably doing something similar: taking an extremely limited number of 'screenshots', running typical image captioning on them with another layer, and feeding that in as text. So the demo, to me, seems most likely to be 3 separate 'plugins' operating in unison - text to speech, speech to text, and image to text.
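A minimal sketch of the kind of glued-together pipeline I'm picturing (every function here is a made-up stub standing in for an off-the-shelf model; none of it is the actual system's API):

    import random

    def speech_to_text(audio: bytes) -> str:           # plugin 1: STT (stub)
        return "what am I looking at right now?"

    def caption_image(frame: bytes) -> str:            # plugin 3: image -> text (stub)
        return "a person at a desk holding a phone up to the camera"

    def llm_generate(prompt: str) -> str:              # ordinary text-only LLM call (stub)
        return "It looks like you're sitting at a desk, holding your phone up."

    def text_to_speech(text: str) -> bytes:            # plugin 2: TTS (stub)
        return text.encode()

    def respond(audio_in: bytes, video_frames: list[bytes]) -> bytes:
        transcript = speech_to_text(audio_in)
        # Only a handful of 'screenshots' ever get captioned, not the whole stream.
        shots = random.sample(video_frames, min(3, len(video_frames)))
        captions = [caption_image(f) for f in shots]
        prompt = "Scene: " + "; ".join(captions) + "\nUser: " + transcript
        return text_to_speech(llm_generate(prompt))

    print(respond(b"...", [b"frame1", b"frame2", b"frame3", b"frame4"]))

Nothing in that loop requires the LLM itself to understand audio or pixels; it only ever sees text.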
The interjections are likely just the software being programmed to aggressively begin output following any lull after an input pattern. Note that in basically all of the videos, the speakers have to repeatedly cut the LLM off as it starts speaking at conversationally inappropriate moments. In the main video, which is just an extremely superficial interaction, the speaker made sure to be speaking constantly, pausing only once to take a breath that I noticed. He also struggled with the timing of his own responses, as the LLM still seems attached to its typical, and frequently inappropriate, rambling verbosity (though perhaps I'm not one to critique that).
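If that's right, the eager interjecting doesn't need anything fancier than a silence timer. Something like the following (just a guess at the logic; the threshold and helpers are invented for illustration, not anything measured from the demo):

    import time

    SILENCE_TRIGGER_S = 0.4      # assumed pause length before the model barges in

    def has_voice(chunk: bytes) -> bool:
        # Stand-in voice-activity check; a real one would look at audio energy or a VAD model.
        return max(chunk, default=0) > 30

    def conversation_loop(mic, speaker, generate_reply):
        buffered = []
        last_voice = time.monotonic()
        while True:
            chunk = mic.read()                                   # small audio chunk
            if has_voice(chunk):
                buffered.append(chunk)
                last_voice = time.monotonic()
            elif buffered and time.monotonic() - last_voice > SILENCE_TRIGGER_S:
                # Any lull, even a breath, immediately triggers a spoken reply.
                speaker.play(generate_reply(b"".join(buffered)))
                buffered.clear()

A fixed, aggressive trigger like that would explain both the snappy-seeming turn-taking and the constant interruptions at the wrong moments.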