This has been my team's experience (and frustration) as well, and it has led us to use LLMs for classifying and structuring, but not to entrust an LLM with making decisions based on things like a database schema or business logic.
I think the technology and tooling will get there, but the enormous amount of effort spent trying to get the system to "do the right thing" and the nondeterministic nature have really put us into a camp of "let's only allow the LLM to do things we know it is rock-solid at."
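To make that concrete, here's a rough sketch of the split: the LLM only turns free text into a fixed shape, and plain code owns the decision. The ticket schema, category list, and routing rule are made-up illustrations (and the OpenAI SDK is just one way to call a model), not our actual stack:

    # Sketch of "LLM structures, deterministic code decides".
    # Schema, categories, and the refund-routing rule are illustrative only.
    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def structure_ticket(text: str) -> dict:
        """LLM step: turn free text into a fixed JSON shape. No decisions here."""
        response = client.chat.completions.create(
            model="gpt-4o",
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": (
                    'Classify this support ticket as JSON with keys "category" '
                    '(one of: billing, bug, feature_request) and "order_id" '
                    "(string or null). Ticket:\n" + text
                ),
            }],
            temperature=0,  # reduce (but not eliminate) run-to-run variance
        )
        return json.loads(response.choices[0].message.content)

    def decide(ticket: dict) -> str:
        """Deterministic step: plain code makes the call, not the model."""
        if ticket["category"] == "billing" and ticket["order_id"] is not None:
            return "route_to_billing"
        return "route_to_human"

    if __name__ == "__main__":
        print(decide(structure_ticket("I was charged twice for order 1234.")))

The point of the shape is that a bad classification degrades into a human-review fallback instead of a wrong automated action.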
Even this is insanely hard in my opinion. The one thing you would assume an LLM would excel at is spelling and grammar checking for English, but even the top model (GPT-4o) can be insanely stupid/unpredictable at times. Take the following example from my tool:
https://app.gitsense.com/?doc=6c9bada92&model=GPT-4o&samples...
Five models were asked whether the sentence is correct, and GPT-4o got it wrong all five times. It keeps complaining that GitHub is spelled "Github" when it isn't. Note that only two weeks ago, Claude 3.5 Sonnet did the same thing.
I do believe LLMs are a game changer, but I'm not convinced they are designed to be public-facing. I see LLMs as a power tool for domain experts: you have to assume whatever they spit out may be wrong, and your process should allow for that.
Edit:
I should add that I'm convinced no single model will rule them all. I believe there will be 4 or 5 models that everybody uses, each challenging the others for accuracy and confidence.
While LLMs do plenty of awful things, people make the most incredibly stupid mistakes too, and that is what LLMs need to be benchmarked against. The problem is that most of the people evaluating LLMs are better educated and often smarter than the average user. When you see any quantity of prompts input by a representative sample of LLM users, you quickly lose all faith in humanity.
I'm not saying LLMs are good enough. They're not. But we will increasingly find that there are large niches where LLMs are horrible and error-prone yet still outperform the people companies are prepared to pay to do the task.
In other words, on one hand you'll have domain experts becoming expert LLM-wranglers. On the other hand you'll have public-facing LLMs eating away at tasks done by low-paid labour, where people can work around their stupid mistakes with process or by just accepting the risk, the same as they currently do with undertrained labour.
This means that on one hand firms are demanding RTO for culture and teamwork improvements, while on the other they will be OK with a tool that makes unpredictable errors like humans but can never be influenced by culture or teamwork.
These two ideas lie in odd juxtaposition to each other.
I am 100% not blaming the LLM, but rather the VCs and the media for believing the VCs. Once we get over the hype and people realize there isn't a golden goose, the better off we will be. Once we accept that LLMs are not perfect and not what we are being sold, I believe we will find a place for them that will make a huge impact. Unfortunately for OpenAI and others, I don't believe they will play as big a role as they would like us to believe.
This gets to the heart of it for me. I think LLMs are an incredible tool, providing advanced augmentation of our already-developed search capabilities. What advanced user doesn't want a colleague they can talk to about their specific domain?
The problem comes from the hyperscaling ambitions of the players who were first in this space. They quickly hyped the technology beyond what it should have been.
What a ringing endorsement.
- A different result is produced every time.
- No reasoning capabilities have been categorically demonstrated.
So this is it. If you want an LLM, brace for different results, and if that's okay for your application (say it's about speech or non-critical commands), then off you go.
Otherwise, simply forget this approach, particularly when you need reproducible, discrete results.
I don't think it gets any better than that, and nothing so far indicates it will (with this particular approach to AGI or whatever the wet dream is).
There's a whole class of tasks where a human can look at a body of work and determine whether it's correct in far less time than it would take them to produce the work directly.
As a random example, having LLMs write unit tests.
My master's was on text-to-SQL, and I can tell you hundreds of papers conclude that seq2seq and transformer derivatives suck at logic, even when you approach logic the symbolic way.
We'd love to see production rules of any sort emerge with transformer scale, but I have yet to read such a paper.
Which Apple engineers? Yours is the only reference to the company in this comment section or in the article.
I have had good luck using an LLM as a "sanity checking" layer for transcription output, though. A simple prompt like "is this paragraph coherent" has proven to be a pretty decent way to check the accuracy of whisper transcriptions.
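For what it's worth, that layer can be just a few lines. Here's a minimal sketch of the idea, assuming the OpenAI Python SDK; the YES/NO prompt wording, the gpt-4o model choice, and the review-queue usage are my own illustrative assumptions:

    # Sketch of an LLM sanity-check layer for Whisper transcription output.
    # Prompt format and model choice are assumptions, not a prescribed recipe.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def is_coherent(paragraph: str) -> bool:
        """Ask an LLM whether a transcript paragraph reads coherently."""
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": (
                    "Is this paragraph coherent? Answer YES or NO only.\n\n"
                    + paragraph
                ),
            }],
            temperature=0,  # reduce (but not eliminate) run-to-run variance
        )
        reply = response.choices[0].message.content.strip().upper()
        return reply.startswith("YES")

    if __name__ == "__main__":
        sample = "The meeting is are scheduled for for next Tuesday morning at."
        print("coherent:", is_coherent(sample))

The failure mode to design for is the checker flagging a paragraph for human review, not auto-correcting it, since the checker itself can be wrong.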
https://app.gitsense.com/?doc=905f4a9af74c25f&model=Claude+3...
Claude 3.5 Sonnet will now misinterpret "GitHub" as "Github".
I think that, too, is a UX problem.
If you present the output as you do, as simple text on a screen, the average user will read it with the voice of an infallible Star Trek computer and be irritated by every mistake.
But if you present the same thing as a bunch of cartoon characters talking to each other, users might not only be fine with "egg in your face moments", as you put it, they will laugh about them.
The key is to move the user away from the idealistic mental model of what a computer is and does.
clippy.gif
Leaving aside that "we're" and "we are" are the same, it is absolutely active voice.
I feel like this is unfair. That's the only thing it got wrong? But we want it to pass all of our evals, even ones that perhaps a dictionary would be better at solving? Or even an LLM augmented with a dictionary.
LLMs have their place and will forever change how we think about UX and other things, but we need to realize you really can't create a public-facing solution without significant safeguards if you don't want egg on your face.
LLM investors will be reviewing their portfolios and will likely begin declining further investments without clear evidence of profits in the very near future. On the other side, LLM companies will likely try to downplay this and again promise the Moon.
And on and on the market goes
As a user I want it to be right, even if that contradicts the normal rules of the language.