For ex this is GPT 4: https://chat.openai.com/share/e24501ad-8f1c-4b5a-a6d0-d933f5d1d209
And this is GPT 3.5: https://chat.openai.com/share/b9372bdc-ffff-4655-bee4-2b3f3c3b8285
In the latter case I didn't even need to ask for the order by clause as it anticipates it and provides an answer for it. GPT 4's first answer was wrong.
In the past two days I've seen at least 2 other cases where GPT 4's answer was plain wrong and GPT 3.5's was not only correct but of very high quality, reminding me of what I first felt when using GPT 4 for the first time.
Like AirBnB. Or Uber. Etc.
Take the Chat completion API, give it a system message of "Choose for yourself" and then ask it "Who are you?"
I guarantee that before the performance drop you'd have had a different answer than what you get now after the drop.
The dumb thing is that I suspect OpenAI isn't lying when they say they are spending just as much compute on this deceased performance as before.
Even their depreciation of their foundational model APIs is incredibly frustrating. We never even got to see direct access to the foundational GPT-4 model, which I'm sure was and is far more amazing than today's fine tuned shell of it.
So we have different population of GPT users.
An average experience might be to get a mixture of spot-on helpful responses and obvious bullshit^H^H^Hallucinations, this population might learn what questions to ask given the limitations of the model. This is really a best case scenario as people can actually get a feel for how to use the technology, strengths and weaknesses etc.
Personally my experience was the first few dozen times I used it I was amazed at the responses, I was on team superintelligence, anyone who is getting lackluster responses is just holding it wrong. But luck changes and over months of use I see now that on average the responses are just OK. But this is the case that leads to disappointment and bitter conspiracy (the superintelligence is being suppressed, give it back!)
Another population had rotten luck to begin with, and got dumb unhelpful response over and over. This population quickly determined that the AI was all hype and stopped exploring (you don't keep going back to the casino if you lose everything your first time...).
This divergence is destructive to the larger discourse, since we have fanboys flummoxed by naysayers and critics bamboozled by hype beasts.
What I've seen on indie hacker type website is that developers are fully on this train and not very critical of the outputs.
This is why you get very basic prompts sent by "wrapper apps", which might have given the developer a good result the only time it was tested before being put in production.
I think it might take a while before tools show up that can generate 100 test cases and test a given prompt with all 100 to report on the results... It seems to be a tough problem to crack.
IMHO front-end chat end-users have many many more "at-bats" and get to see more model results than devs do, which make them more critical of those results.
The search space on the fine tuned GPT-3.5 chat models versus the foundational Davinci text completion model is MUCH more narrow, particularly in starting off.
Even with the same temperature, you'll see any marketing-style prompt for chat begin with "Introducing XYZ..." around 30% of the time as if it's a junior door to door salesman, whereas the foundational model doesn't have any single intro that common across runs and generally employs a much broader vocabulary set.
We saw Google shoot Lambda in the foot after Blake's press tour which set them behind the next round of competition.
Now we're watching OpenAI snatch defeat from the jaws of victory out of anxiety around oversight and articles like 'Sydney' interviewed by the NYT.
For anyone following along in the 100 million+ training space, maybe don't overreact to press overreactions that will blow over in months as users get hands on experience or you'll blow your lead and waste massive amounts of resources and time.
This was a "user education" issue and not a "handicap your product" issue, in both cases.
I think this is less a problem with paranoia about "safety" and avoiding bad PR specifically, and more a fundamental problem with overfitting to human feedback.
The training approach that makes GPT4 more consistent at solving certain types of problem adequately (which is useful for chatbots that can break down coding questions or write in iambic pentameter as well as ones that avoid being 'Sydney') also makes it less "creative" in other domains.
And there's an "alignment problem" in that people evaluating what responses align best with "marketing" prompts aren't experienced copywriters evaluating them for understanding of product and consistency with brand tone and a/b testing conversion rates, they're low paid ESL speakers and people playing with the interface approving the cheesiness because the response with "Introducing XYZ... Buy XYZ today!" sure looks like the requested ad for XYZ. So you get a response conditioned on "summarise in a way that looks maximally like an ad" rather than conditioned on "summarise in a way which clearly articulates benefits of the listed features in a tone appropriate to the target market"
For code, 3.5 is superior. 3.5 allows for about 21k tokens of input, while ChatGPT 4 allows for around 10k. This also makes it a lot better for boilerplate work as at it can take a lot more input, and handles long conversations and iterations better.
Brainstorming, 4 is better. It's capable of some top tier brainstorming and it argues back quite frequently.
Unguided creative writing (describe a potato), they're roughly equal.
Guided creative writing (i.e. write a story around (400 words of requirements)), 4 is much better.
Poems and wordplay, 4 absolutely floors 3.5. Wider vocabulary and it's able to do rhymes and alliterations better, which humans are usually bad at.
For reasoning and riddles, 4 is still benchmark among the LLMs.
I really dislike that they named it GPT-3.5 instead of something like "Glide". It implies that it's inferior to 4, when they're just suited for different things.
I asked for a fairly simple regression in R for some simple data and it gave me quite literally the lowest quality answer and then told me I need to seek out a data scientist for the answer. I couldn’t believe that.
One of the best parts about AI is not hearing pedantic or bullshit reasons why the thing you’re asking is wrong, isn’t possible, isn’t ideal, blah blah blah
I can’t wait for something way better I’m tired of this watered down nonsense