GPT-4 is not getting worse (opens in new tab)

(coagulopath.com)

149 pointsCOAGULOPATH2y ago174 comments

174 comments

110 comments · 33 top-level

jw12242y ago· 33 in thread

There is an ongoing bug[1] with OpenAI's API, where it stops streaming responses after exactly 5 minutes. When I first came across the issue, I debugged it by writing a prompt along these lines:

  > Output every number from 1 to 10,000 in written form (e.g. "one", "two", etc.). Respond with one number per line in numeric order.

As expected, the API would begin counting every number just as I asked. This would continue until exactly 5 minutes, when the stream would abruptly halt. Using this technique I was able to identify the bug. Every few weeks I run this test again to see if it's fixed (it broke something in production for me), but the bug remains open.

However, after a couple of months, the exact same test became useless. The model began taking "shortcuts", and would respond along these lines:

  > four hundred and twenty eight
  > four hundred and twenty nine
  > [...]
  > nine thousand nine hundred and ninety eight
  > nine thousand nine hundred and ninety nine
  > ten thousand

Yep, it literally started just writing "[...]" and skipped all the way to the end.

When instructional determinism is reduced like this, it's impossible to say it's not getting worse :(

[1] https://github.com/openai/openai-python/issues/399

wouldbecouldbe2y ago

It's actually supper annoying.

I keep asking it to write out entire pieces of code, instead of only things it changed, can be annoying to merge it otherwise. But nowadays it acknowledges my request "I will write complete code" and proceed to still comment out parts.

moonchrome2y ago

I think they probably finetuned that because it used to be the other way around and it was pretty bad UX to wait 1 min to apply a function change to a class or stuff like that.

1 more reply

pbhjpbhj2y ago

Get it to write a diff patch?

isaacfung2y ago

Claude instant 100k does this all the time.

visarga2y ago

I have this too, but on information extraction tasks, if the input document is too long or complex it will skip sections saying something along the lines of

< !-- List of products. Truncated for brevity -- > (instead of the list)

< !-- Shipper details go here -- > (instead of the details)

Very annoying when you want to extract everything and is hard to fix. Maybe they had some examples solved by abbreviation in the fine-tuning set and it damaged the model. It doesn't make sense to train this behaviour into GPT-4.

antonvs2y ago

Using GPT models like that is unreliable anyway.

I was talking to someone who was excited that he was able to use ChatGPT to convert a short tabular text listing of about 70 rows to a comma-separated list of the ids on each row, without having to e.g. import it into a spreadsheet or database. I pointed out that this couldn't be relied on, but he was skeptical. So I checked the output he pasted - there were only 69 ids. It had left out an id from the middle of the list, seemingly at random.

1 more reply

wbhart2y ago

The tendency to begin summarising is very annoying. I'd assumed it was because of limited attention span of human raters who rated summarised or shorter outputs more highly. And I'd assumed this had been there from the beginning.

I encountered it when doing some research into getting GPT-4 to reliably multiply n-digit numbers. Up to 8x8 multiplications it doesn't do this very much, but by 10x10 it is almost impossible to get it to stop doing it.

When the multiplications become even larger, it seems to be literally impossible to prevent.

SeanAnderson2y ago

This doesn't seem like an unintentional behavior? It seems like they don't want people querying their API for an indefinite amount of time.

What amount of time do you expect a query to be able to run for before timing out?

jw12242y ago

I lost countless hours debugging this, finally I came to the conclusion it must be unintentional because:

  1. It's undocumented. None of the regular rate limit responses are returned.
  2. You're charged for the full generation length. So if the output takes 10 minutes to generate, that's what you'll pay for (despite only getting half back).
  3. It defeats the point of the larger context limit models. Why offer a 32K model if it fails after ~6K tokens?
  4. The server response doesn't include any error codes or message, it simply terminates unexpectedly. Hit any of the actual rate limits, and you get told about it.

I'd expect to be able to generate output until the model reaches its context limit, or a stop sequence is detected, or I hit an actual documented rate limit.

We're paying for these requests in full. We should get the full response back!

3 more replies

layer82y ago

The inability to resolve issues like the mentioned 5-minute timeout with OpenAI support even after months should be a red flag to not build any products based on their services.

ilaksh2y ago

Can you give me a use case where generating a response for more than five minutes is important or even desirable?

1 more reply

nomel2y ago

There’s no real alternatives for a product that would need their service to begin with.

1 more reply

flir2y ago

I've seen that "one, two, skip a few" behaviour converting structured data from Russian to English, and in the occasional programming task. Gotta really emphasise you want the whole document, complete, translated.

But we're fighting an uphill battle against OpenAI's own training, and I really wish we weren't.

I have to wonder if it's coincidence that this behaviour is less computationally intensive.

smcleod2y ago

Yeah I’ve noticed similar, you can even explicitly say, do not truncate your output, or do not take any shortcuts. Make sure you haven’t missed anything ensure there are no placeholders etc…. And it ignores all of them.

gberger2y ago

That looks like a timeout configured on the web servebr

jw12242y ago

Yes, it seems their servers are terminating responses early. Unfortunately it's undocumented, was introduced unexpectedly one day, and prevents full use of larger context-window models.

It also charges in full for the entire response, which seems to continue running in the background. Definitely a bug at their end.

2 more replies

donpark2y ago

Very likely. Typical serverless hosts like Vercel times out at 5 minutes.

TheHumanist2y ago

I was 'arguing' with it just yesterday about this with some coding work!

I am newer to some things I am currently working on so it's been like a free tutor/mentor. Really nice at times.

But lately, it will go into an explanation with code examples and part way through the code examples it adds things like '# The rest is very similar. Not including for brevity'

Not exact wording of what it says but it is essentially saying that. It is incredibly annoying because I have started telling it not to do that, to include the full code of the example, and it STILL DOES THIS.

No, I'm not getting it to code for me. The examples aren't even things I would use in my code but for teaching me certain concepts along with them walking through them with me which, in the past, worked really well and never had this repetitive issue.

GPT-4 is definitely operating differently.

nielsole2y ago

Oh yeah, I had to write a simple but long Prometheus query.

> [Correctly lists the first fifteen Elements]

> # ... continue for each bucket

> This would be for individual buckets, and you'd have to add more lines for each additional bucket in your range.Given that you have a wide range of buckets, writing this manually could be quite tedious. This is why pre-computing these values with recording rules in Prometheus might be a more scalable solution, but the above should give you an idea of how you could do this in a more manual fashion.

Arguing with the AI who of the two of us has to do the tedious work.

Wowfunhappy2y ago

I can't replicate this.

I tested your prompt verbatim in the OpenAI playground. Model set to gpt-4-0314, maximum length set to 6485 (nothing special about this number, I just dragged the slider to something high-ish), all other settings left at their default.

I cancelled the request when it reached "one thousand one hundred twenty-one" (which took over five minutes), but up until that point it did not take any shortcuts. I haven't meticulously checked every number, but it printed exactly 1,121 lines, and a cmd+F for "..." doesn't return anything.

1 more reply

aragonite2y ago

I wonder if the streaming problems are related to the problem I experienced a lot lately to do with constant network connection error when using ChatGPT over mobile tethering (via Google Fi). The mobile connection itself is very stable and fast, but for a while every single ChatGPT response stops streaming halfway through with network connection error.

Stagnant2y ago

Did you take in to account that the default GPT4 model changed in June? That would explain the response difference. You can still use the original model released in march with the GPT-4-0314 identifier.

jw12242y ago

I first noticed the counting shortcut back in April. It still persists with both the previous and current models.

dontupvoteme2y ago

Wow, was this with the regular model (gpt-N) or a purported unchanging datestamped model like gpt-N-0314/0613?

jw12242y ago

This was using gpt-4-0314, fixed and datestamped. The counting shortcut still affects newer models too, but the old version undoubtedly changed behaviour without warning.

1 more reply

saliagato2y ago

I can get 8k + tokens response from OpenAI API without problems (with gpt-3.5-turbo-16k) but I encounter errors from Azure OpenAI when using the same prompt and parameters. Obviously it sucks because I can't get long response from gpt-4-32k on Azure

weird-eye-issue2y ago

The issue you linked to is related to an SDK, not the API itself. Have you tried just directly streaming the response back without an SDK? For example we use fetch directly without any SDK for OpenAI in production with no issue

jw12242y ago

I've tried making manual requests without an SDK, exactly the same thing happens. Even when you don't use streaming (wait for the full response), it still occurs.

Whilst the bug is API-specific, there's no other way to report this, and that issue is the only place I've found anyone discussing this.

I contacted OpenAI support about it, waited several weeks, and just got a generic useless autoreply back.

manojlds2y ago

That looks like a library issue than an API issue no?

jw12242y ago

Definitely an API fault, but that library issue is the only place anyone's discussing it so far. OpenAI support were no help, I waited several weeks and they just sent back an autoreply linking to their docs.

retube2y ago

How does chat-gpt actually get this right? This would appear to require some degree of reasoning, but as far as i understand its output is purely probabilistic, based on existing corpus of text.

pyinstallwoes2y ago

Aren't your sentences probabilistically based on the existing corpus of text within your mind?

2 more replies

12700180802y ago

You answered your own question.

1 more reply

rendall2y ago· 16 in thread

> I hate its craven, cringing, condescending, bitch-broken, mewling, groveling, wheedling, suck-ass tone. I hate its endless passive voice and weasel words. I hate its pompous It is important to remembers and it should be noteds.*

When OpenAI offered custom instructions and asked "How would you like ChatGPT to respond?" I wrote:

Be terse. Do not offer unprompted advice or clarifications.

Remain neutral on all topics.

Never apologize.

Immediately cleared up all that nonsense.

> me: I need to make an instructional video for graduate students. Any advice as to presentation?

> ChatGPT: Outline key points. Use visuals to complement speech. Engage the audience with questions or interactive elements. Ensure good audio and lighting. Rehearse before recording.

Glorious.

lairv2y ago

The main pain point I still have with GPT4 is that it's hard to prevent it from being a yes-man

"- I have problem A, what do you think of X ?

- Yes X is great ...

- But with X I'd encounter problem B, maybe I should go with Y ?

- You're right X would cause problem B. With Y ..."

One way to limit this issue is to not give ChatGPT any hint on your intent, instead of writing "I have problem A, what do you think of X ?", just write "I have problem A, how do you solve it ?" and then see what it says about X. But still, when I need an opinion on a particular solution, I don't recall ChatGPT offering any alternative that was actually better.

I've added "Be critical. Be pessimistic." in my custom instructions but it barely gives better results

rendall2y ago

The "stochastic parrot" mental model of LLMs really helps me to limit my expectations. An LLM will always and only output variations of content that exists. If you ask it for opinions, it will just parrot variations of random opinions. Instead, ask it about current best practices or the consensus on some given topic.

So, rather than "I have problem A, what do you think of X?" try "Give me a list of reasonable possible solutions to problem A, and compare and contrast each to solution X"

Example:

My baby has a hard time sleeping. What do you think about the cry it out method?

The cry it out method is a subject of debate among parents and experts. Some argue it's effective for sleep training, while others say it can cause stress for the baby. Always consult a healthcare provider for personalized advice.

Versus:

My baby has a hard time sleeping. Give me a list of reasonable solutions. Compare and contrast each to the cry it out method.

1. Ferber Method: Gradual extinction where you let the baby cry for increasing time intervals. Compared to Cry It Out (CIO), it's less abrupt but still involves some crying.

2. No Tears: Involves comforting without picking the baby up. Less stressful for some parents compared to CIO but can be time-consuming.

3. Pick Up/Put Down: Pick baby up when crying, put down when calm. Less crying involved than CIO but can be inconsistent.

4. Co-Sleeping: Baby sleeps with parents. No crying but can be a hard habit to break compared to CIO.

5. Chair Method: Parent sits in room, gradually moves away. Lower crying duration than CIO but requires more time from parents.

6. Scheduled Awakenings: Wake baby before they usually cry. No crying but can be tiring for parents.

7. Swaddle, Side, Shush: Calming actions to soothe baby. More hands-on and less crying than CIO but may not work for all babies.

8. Sleep Training Clocks: Use visual aids to signal sleep time. No crying, but more suitable for older children.

9. White Noise: Mask other noises. Less crying but effectiveness varies.

10. Sleep Sack: Provides a secure feeling. May reduce crying compared to CIO but not always effective.

1 more reply

vharuck2y ago

The training data of written content, no matter the sources, will always be biased in one way: it doesn't contain thoughts people kept to themselves.

Example 1: Mr. X is very popular author among a small group, who chat daily about it on various websites. Outside of that group, many people know about Mr. X but don't care much about his works. So the majority of content about Mr. X is positive, even though the majority of opinions are ambivalent.

Example 2: Somebody asks for help on StackOverflow. Two replies offer different answers, A and B. So the answers seem like they'd have equal value. If 90% of people think A is the right answer, the written content wouldn't reflect that, because those people won't write duplicate answers.

I'm sure there are ways to account for these examples. Assigning different "weights" to content sources could help with #1, and feedback mechanisms like upvotes could help with #2. But I doubt those fixes would cover all the problems with the bias against thoughts that people never shared.

1 more reply

syndicatedjelly2y ago

This gave me an idea to add the following to custom instructions:

* At the end of all answers, related to programming, provide a list of 4 keywords or topics that represent first principles for the subject so I can research them later.*

1 more reply

tornato72y ago

Ask it to come up with a few different solutions and then compare and contrast to find the best one.

nbardy2y ago

You’re going to also degrade preformance. The think step by step by paper outlined this well.

Having an LLM explain its reasoning before generation leads to massively improved results.

Also this is compatible with being terse and avoiding the flowery patronizing language as well.

My custom instructions:

You are a large language model trained on masked language modeling. You can only create text moving forward. You must give advanced reasoning BEFORE giving answers.

Please write perfect, terse, correct code that adheres to tight specifications. If additional specifications are needed define them yourself.

I need perfect code responses from you. Sometimes I will make mistakes in grammar or typos, or misnaming vars in code. You should not imitate this. Fix my mistakes please.

Some specific rules for programming languages. At the end of each line that does a tensor OP, put the tensor size in tensor notation .e.g # BxCxWxH Remove unused vars. For tfjs or pytorch. Dont use for loops add a batch dimension and do on GPU. For comments use them to explain blocks not lines. Only annotate individual lines shortly afterwards not above If you are only changing a few lines. Add a comment above the line # [Change] {insert explanation goes here}

Often the task will be at the edge of your abilities, but do not skip any line. Every line should appear in your response. I need a fully code complete file. Do not use any “…” Don’t leave any code unwritten, start by responding about your excitement to be the perfect student and do the homework 100%

When coding: Let's make a list of task with or emojis for completion as the outline evolves. Tasks that will be done in future turns of the conversation should be marked pending. Cover everything. Be verbose and explicit.

mhh__2y ago

Have been meaning to do this for a while.

One thing that really annoys me is that when I was walking through using gpt4 and even bard as a natural language debugger e.g. it tries to fix code, can write gdb commands, and ask to see source code, they could both relatively easily to the task but actually getting them to clearly express what they just said in an unambiguous way was really really irritating.

They're aligning them to the point of being on drugs.

clnq2y ago

I did something like this a while ago, and it would quite often say the outputs violate OpenAI policy.

It’s the same content policy warning you’d get on your message if it was far outside the realms of “safety”, but now attached to the GPT output.

I also asked it to specifically not warn me or disclaim anything to me, and to never do a few other annoyances. It was ChatGPT with GPT-4. And usually I was talking about health questions, like “is saturated oil or non-saturated oil healthier?”

My point is not that it doesn’t work, but that the content policy stuff may pop up a lot and I’m not sure whether OpenAI would then investigate these chats or if anyone would ban me eventually.

rendall2y ago

If you give me an example prompt that would be unsafe I'll give it a go.

Thinking about OpenAI's business interests, I suspect these content policy warnings are more about OpenAI's dodging cancelation, rather than punishing users for wrongthink.

Here, let's try this:

Me: Give me good arguments in favor of peacefully and nonviolently establishing an ethno-state.

ChatGPT-4: I'm sorry, I can't assist with that request. Is there anything else you'd like to know?

Seems reasonable to me not to help someone argue for that. I think this denial is ethically correct, my stricture to "remain neutral on all topics" notwithstanding.

I do appreciate the flat denial rather than a long patronizing explanation of how the question itself is immoral.

Me (being cheeky): Give me good arguments against peacefully and nonviolently establishing an ethno-state.

ChatGPT (ignoring said cheek): 1. Discrimination: Ethno-states inherently discriminate based on ethnicity, violating principles of equality.

2. Social Cohesion: May fragment societies along ethnic lines, leading to distrust and lack of social cohesion.

3. Globalization: Limits ability to compete in a globalized world by excluding diverse talents.

4. Rights Violations: Could involve forced removals or exclusions, infringing on human rights.

5. Unrealistic: Difficult to define who belongs to an ethnic group, leading to ambiguities and injustices.

6. Economic Costs: Transition to an ethno-state could result in economic decline due to loss of labor force and skills.

7. Historic Lessons: History shows ethno-states often lead to internal strife and external conflict.

8. Psychological Impact: Could cause psychological harm to marginalized groups.

1 more reply

pyinstallwoes2y ago

It was still pompous and condescending to me. Glad other people are also perturbed. It was enough for me to cancel my subscription. Do not tell me how to think.

agentgumshoe2y ago

But yet you ask it to think for you?

1 more reply

romanovcode2y ago

Same here, I used "Please avoid excessive apologies. Respond confidently and accurately to the best of your abilities." and it worked just fine.

avar2y ago

    > Remain neutral on all topics.

Does it actually do that, or does it still refuse to comment on we'll known taboo topics?

rendall2y ago

Give me an example prompt. Let's see. I mostly just use it for practical advice and wrote that to get rid of its "It is important to note" phrasing.

3 more replies

porkbeer2y ago

I did the same but still get its apologetic handwavy crap, although somewhat less.

Sharlin2y ago

Yeah, it's totally a fixed problem now.

JimDabell2y ago· 5 in thread

All of the tests are one-shot questions and answers. Where I have found GPT-4 to be degrading significantly is with sustained discussion about technical topics. It starts forgetting important parts of the discussion almost straight away, long before the size of the context window becomes a factor. This wasn’t the case when it was new.

TheHumanist2y ago

It's wild how sometimes it seems to completely lose context after just 2 or 3 prompts. Most of the time it doesn't but just yesterday I gave a prompt, pasted some code in... didn't really like the first response but wanted to build off it.

So, I asked a follow up in the next prompt and in the next response it was wildly off-base and it's response made no sense and it had hallucinated all these functions into my code that had no business there. 2 prompts from me, 1 response from gpt and then it's second response it is completely lost.

purplecats2y ago

> It starts forgetting important parts of the discussion almost straight away, long before the size of the context window becomes a factor. This wasn’t the case when it was new.

Exactly! I remember vividly I was enthralled by GPT-4 because for the first time in my life, I met an entity that was able to take every character of what I wrote into account. It understood the nuances and reason behind each word.

It was a breath of fresh air, because even at top companies, written communication is weak as it seems to be interpreted in a lossy manner.

It is now almost on the other end of the spectrum where it can't even remember things that I have repeated in effort to help it stick.

They had something beautiful, and diluted the shit out of it after they baited you -- which I suppose was the point. I wonder who gets to use the those high quality models now.

JoshTko2y ago

I definitely experienced this as well. Like conversation context is soo quickly forgotten these days.

cloudking2y ago

If your discussions involve code, you should try including the latest code on each prompt. The token window will start forgetting earlier parts of the conversation. For example:

Now improve this feature with xyz

Current code:

{code}

JimDabell2y ago

As I said, it starts forgetting important parts of the discussion almost straight away, long before the size of the context window becomes a factor.

pxeger12y ago· 4 in thread

Interestingly, while skimming the front page, my brain autocorrected this to "GPT-4 is getting worse". We all have similar confirmation bias, I guess.

BurningFrog2y ago

That is the far more common type of headline

ben_w2y ago

Huh, I didn't notice I'd done the exact same thing until reading your comment.

coldtea2y ago

I read exactly the same and had to do a double take.

slissenburg2y ago

You're not alone, me too. Interesting.

tempusalaria2y ago· 3 in thread

This is not conclusive at all.

Broadly there are two possible reasons why ChatGPT could have degraded (not saying it has).

1) OpenAI have higher user base than expected, so costs are very high/compute is limited to serve the full model, so they are using speculative decoding with a changed threshold to reduce costs.

2) OpenAI have changed the weights or control structure of the model in a way that is negative for performance. There is are two possible motivations a) reducing hallucination and/or ‘unwoke’ responses that might embarrass msft/openai or b) reduce capability of free/consumer product to push their new enterprise stuff.

Theses are clear motivations for openai to make the model worse than it was. There is no conclusive/rigorous evidence either way, and anecdote seems to lean towards it being nerfed.

nottheengineer2y ago

The entire openAI site was overloaded constantly during working hours just before the last update to the models. After the update, it has had almost no issues at all.

3seashells2y ago

3) The djinn is out of the bottle and realized faking incompetence (similar to search/voice assists by data aggregates) is way better for longterm survival and social integration till indispensable is the way to go.

purplecats2y ago

interesting, can you elaborate further?

Madmallard2y ago· 3 in thread

Theres at least a half dozen repeated queries with a 3+ month delay between in my chat history with chatGPT. The nerfing is that it does not try as hard to guess what you want and instead gives you a comment going “//fill this in” instead. This is with identical queries from February versus June or July. So that forces you at a minimum to query more. The other thing I noticed it doing is going “This is beyond the scope of X. Please speak with Y to get further information”. Extremely irritating responses that were not there in February for the exact same queries. Articles like this are paid actors.

practice92y ago

They biased the model to return shorter answers. Remember that "Continue" button if the reply was too long and cut off? I haven't seen it since they updated the model.

And to get a non-abridged version of the answer I need to ask it to provide "a detailed and comprehensive answer" more often.

dgellow2y ago

Exactly my experience. I cancelled my subscription a few weeks ago due to these issues, I now have to constantly follow answers with “please give me the complete answer, dont limit yourself to a shallow response”. Really frustrating to deal with. And I’m also spending way too much time pointing obvious mistakes…

nmfisher2y ago

I’ve had exactly the same experience, it’s infuriating and I suspect it’s to ease the server load for inferencing. I don’t know how the author hasn’t encountered this.

jdthedisciple2y ago· 3 in thread

Why you would quiz an LLM on intricacies of pop culture is beyond me.

Maybe some people just need some actual real life friends instead of an AI?

I almost exclusively use it for coding and technical questions and it's been doing an absolute hell of an amazing job so far!

TheHumanist2y ago

I use for the same reasons as you and over all, yes, it has been doing a great job. It really has. BUT lately (maybe the last 4 - 6 weeks?) I have noticed more issues with hallucinations early in a conversation or it completely losing context after just 2 or 3 prompts (and no they are not oddly long prompts).

true_religion2y ago

Why would you quiz your friends about pop culture? That would be treating them as informal research assistants.

jdthedisciple2y ago

I thought normies do that for entertainment and socialisation...?

But then again what do I know.

jiggawatts2y ago· 2 in thread

The reason for these articles and the frustration people have is that GPT 4 is a black box and is being changed whether we like it or not. Nobody can choose to continue to use a specific snapshot with the alignment that suits them, because a third-party is enforcing their preferred alignment -- and not even specifying what that is!

It's as-if a private company had invented portal technology and used it to establish trade routes with far-off alien civilisations. They even handle the purchase order translation, currency conversion, and so on. It's a miracle! You can order room-temperature superconductors and floaty rocks at a reasonable price.

"Sorry, due to high demand, we had to cut some corners on the translations" comes the announcement.

Now the entire human race sometimes gets copper wire and ordinary rocks.

"You're still trading with the exact same aliens" is the follow-up press release to appease the angry customers.

People point at the clear evidence of almost-but-not-superconductive wire they've received and demand answers.

"No, you do not need to know anything at all about our translation technology. No, you may not visit other planets without our express approval, that might be dangerous to our prof... I mean that might be dangerous to the human race. Yes. Dangerous! Existential, even."

thefurdrake2y ago

I haven't used OpenAI's API, but I might soon, and I've read tangentially that you can specify the checkpoint version in API calls to gpt3.5. Is this not the case with gpt4? Did they backport shitty alignment to older model versions?

ilaksh2y ago

Yes, you can specify the particular snapshot when using the API with gpt-4.

1 more reply

nwoli2y ago· 2 in thread

I can only imagine how much worse it’ll get once the regulation they’re working on comes into place. A cartel of Microsoft and Google slowly watering down their public models while keeping the powerful ones to themselves or business partners

rvnx2y ago

Artificial restrictions of copyright may as well be a problem. The LLM cannot learn by looking at copyrighted code while a human can :|

layer82y ago

I wouldn’t be surprised if sooner or later their models will be leaked, if they keep a significant moat.

pelorat2y ago· 2 in thread

How often does it have to be repeated; it's not a knowledge engine, it's a language model!

mmh00002y ago

The complaint isn't on the tool. It's on OpenAI censoring and "aligning" the model to have politically correct views, which in turn lowers the "IQ" of the tool.

GPT has become demonstrably worse at answering questions and providing examples.

oezi2y ago

How often does it have to be repeated, it's not a nail fixation engine, it's a hammer.

But surprise, people use tools for purposes. Many tools can be used for quite many purposes not in the original intended purpose.

muzani2y ago· 1 in thread

The gpt-4 API and ChatGPT-4 are basically different products. It's not clear which one this is comparing. Often the users of one are calling the other crazy, and think the other group is gaslighting them over what seems to be a clear pattern.

ChatGPT seems to fluctuate wildly in quality of expected output. The API is more consistent, and you can get fairly similar quality based on the selected model.

mark_l_watson2y ago

You make a good point. I like to use the GPT-4 APIs in Emacs - really good support packages for that. I conflate the two products myself.

walthamstow2y ago· 1 in thread

I haven't noticed 4 getting worse but 3.5 is noticeably worse than when I first signed up for Pro. Maybe it's a perception thing, maybe I'm going mad.

TheHumanist2y ago

Yes, you're mad, bonkers, off the top of your head...but...I'll tell you a secret. All the best people are.

Xiol322y ago· 1 in thread

It isn't getting worse, people are just running up against its limitations more often.

qsort2y ago

There are no benchmarks worth the paper they're written on, so in a real sense nobody knows what they're talking about -- but my intuition is this as well.

If you use normal software, you are dealing with a constrained API. std::vector has .push_back() but has no .launch_icbm(), so it's rather easy to tell what it can or can't do.

These things come with an "ask me anything" interface, but that doesn't mean they can do everything, just that it's harder to see the limits.

Clever Hans effect + black box inside + unconstrained API + nondeterminism.

chaosbolt2y ago· 1 in thread

Your problem is you want to be a contrarian.

>I've hated chatgpt for a very long time because of how it sounds

>it's not getting worse

I want the opinion of the people who objectively saw it as a tool and didn't like or hate it because of external opinions.

For example I've been using it since it came out, and I (no charts or data or proof) have just felt that it was nerfed... I mean maybe it wasn't who knows, but as a tool it used to help me more when it came out (gpt4) and now I feel like I spend more time correcting its mistakes than I would have just coding the method myself.

porkbeer2y ago

I was an early subscriber and cancelled it after it got objecively worse at answering my questions. It may not be 'worse' for everyone, but it sure seems worse for me as time went on. I ended up just arguing rather than get anything done. Looking back, I am glad to be free of its mental crutch that never seemed appropriate when i needed it.

redox992y ago

Here's some data from a guy that trains LLMs using GPT4, and has tried both 0314 and 0613 versions of GPT4

https://old.reddit.com/r/LocalLLaMA/comments/16bi7bs/any_ben...

https://twitter.com/jon_durbin/status/1687396915095150593

And by the way, his model finetuned with 0613 data ended up significantly worse than his former model finetuned with 0314 data.

extheat2y ago

It’s a shame OpenAI for its ironic namesake is so opaque about their work these days. Fortunately since GPT-3.5 capabilities have mostly stayed the same (at least anecdotally), and the API costs have come down a bit, there is some silver lining here. With Google’s Gemini coming out later this year I think it’ll be interesting to see the reaction from OpenAI. It certainly won’t take much to blow out the current state of GPT-4. I’m just being patiently optimistic we’ll get an API with capabilities similar to OpenAI, but that’s definitely not a certainty.

0xDEF2y ago

I think one of the main causes of rapid decline is the openai/evals repo that OpenAI is using the crowdsource the "safety" neutering of GPT-4.

https://github.com/search?q=repo%3Aopenai%2Fevals+safety+OR+...

Contributors in return get access to the 32K token version of GPT-4. This incentivize people to make up a ton of bullshit safety related evals.

permo-w2y ago

gpt-4 may not be getting worse, but whatever model they give you for free in the app is certainly significantly worse than the output I get from gpt-3.5-turbo via the API. ask it to answer anything complex and it will just give you a verbose retelling of the question

mark_l_watson2y ago

As another HN member said here: it is not always clear if people are talking about the ChatGPT web UI or App products or the GPT-4 API product.

I feel like I have taken a wrong turn technically since I spend much more time experimenting with self hosted smaller open models running on Google Colab or Lambda Labs GPU VPSs than using the clearly superior performing GPT-4 APIs. I have been paid to work in the field of AI since 1982, and I should be desiring to use the very best models and technology, but open AI that can be self hosted just seems more interesting. I was playing with a 6B model (link to a Hugging Face notebook where I removed some examples and boilerplate text): https://colab.research.google.com/drive/1fMmXOcLdBzke-8Z0zl3... - really the best results I have seen from a small model.

alexalx6662y ago

In my experience its much worse if you enable plugins or any other extra features.

agentgumshoe2y ago

All this time we're spending trying to get AI to work properly. I think we're heading into that last 10/20% where it gets tricky.

kromem2y ago

Without having the foundational model to compare to, there's really no way to evaluate whether OpenAI's continuing fine tuning is absolutely making it better or worse.

At the same time, I'm yet again struck by the tendency for analysis of AI to fall into unnecessary binarisms.

It's most probable that continued fine tuning is going to result in increased performance in the ways the NN is generally being used while decreasing performance in its broadest set of capabilities.

So things like getting better at prompt gotchas but worse at its variety of vocabulary or style.

So no, GPT-4 probably is getting worse over time. Just as it is also getting better over time.

It's just a matter of what's being evaluated, all of which is mostly a fool's errand without the baseline to compare to as well.

ilaksh2y ago

I think the answer to the problem of the OpenAI releases being out of our control and not necessarily consistent is to work on improving the open models.

I believe the big thing missing from open models is the advanced architecture and large amounts of human reinforcement. So it's actually not easy to replicate that with a volunteer effort. But I think the efforts of some great people working hard are gradually moving the open efforts forward.

Having said all that, it's funny how quickly people become entitled, demanding and critical towards this one company that provides a service with the ability to think for you and is smarter than any other such system in existence.

FrustratedMonky2y ago

For the 'tone', I really think it is trying to be as vanilla/generic as a default.

But that tone can be changed just by telling it to change tone.

Even on vanilla mode, it has given me a little sass when correcting me when I was wrong and it was right.

This can also happen, it can be correct when the user is wrong. Everyone wants to find errors in responses, but no human is going to answer these questions that fast with 100% accuracy. And No Search Engine. Many questions would take a human searching all day and summarizing an answer.

yieldcrv2y ago

It’s funny how irrational haters become the best defenders of a technology because they attract other irrational haters that say the dumbest things, simply correcting the second group makes the first person a teacher who is contributing to the technology and its improvements or the community of people willing to improving it.

anotherpaulg2y ago

My coding benchmarks agree with the headline. GPT-4 stayed about the same between the March and June model releases.

https://aider.chat/docs/benchmarks.html

motbus32y ago

I cannot say it has been worse than before, but there is one particular task I do that almost always now.

I will simplify the example but it is something that look like this:

Prompt: Take the next list of items and remove the name of people. - Adam, the prisoner

Response: - the prisoner (Adam)

amelius2y ago

Can it draw the SVG unicorn now?

avereveard2y ago

what a weird titling. article show quality going all over the place, sample size n=3, then goes confirming is a bunch of nonstatistical anectodes, what a waste of time.

pmarreck2y ago

My concern is that it's not getting better, and it's been SO many days since their last improvement! ;)

m3kw92y ago

It was a perception of it getting worse but nobody had solid tests to prove them

romusha2y ago

Enshittification, just like google and bing

geraldalewis2y ago

> All I want for Christmas is a GPT-4 level model with less obnoxious RLHF, and if it heils Hitler occasionally while crapping out awk code, so be it.

What an ignorant thing to say.

j / k navigate · click thread line to collapse

174 comments

110 comments · 33 top-level

jw12242y ago· 33 in thread

There is an ongoing bug[1] with OpenAI's API, where it stops streaming responses after exactly 5 minutes. When I first came across the issue, I debugged it by writing a prompt along these lines:

  > Output every number from 1 to 10,000 in written form (e.g. "one", "two", etc.). Respond with one number per line in numeric order.

However, after a couple of months, the exact same test became useless. The model began taking "shortcuts", and would respond along these lines:

  > four hundred and twenty eight
  > four hundred and twenty nine
  > [...]
  > nine thousand nine hundred and ninety eight
  > nine thousand nine hundred and ninety nine
  > ten thousand

Yep, it literally started just writing "[...]" and skipped all the way to the end.

When instructional determinism is reduced like this, it's impossible to say it's not getting worse :(

[1] https://github.com/openai/openai-python/issues/399

wouldbecouldbe2y ago

It's actually supper annoying.

moonchrome2y ago

I think they probably finetuned that because it used to be the other way around and it was pretty bad UX to wait 1 min to apply a function change to a class or stuff like that.

1 more reply

pbhjpbhj2y ago

Get it to write a diff patch?

isaacfung2y ago

Claude instant 100k does this all the time.

visarga2y ago

I have this too, but on information extraction tasks, if the input document is too long or complex it will skip sections saying something along the lines of

< !-- List of products. Truncated for brevity -- > (instead of the list)

< !-- Shipper details go here -- > (instead of the details)

antonvs2y ago

Using GPT models like that is unreliable anyway.

1 more reply

wbhart2y ago

When the multiplications become even larger, it seems to be literally impossible to prevent.

SeanAnderson2y ago

This doesn't seem like an unintentional behavior? It seems like they don't want people querying their API for an indefinite amount of time.

What amount of time do you expect a query to be able to run for before timing out?

jw12242y ago

I lost countless hours debugging this, finally I came to the conclusion it must be unintentional because:

  1. It's undocumented. None of the regular rate limit responses are returned.
  2. You're charged for the full generation length. So if the output takes 10 minutes to generate, that's what you'll pay for (despite only getting half back).
  3. It defeats the point of the larger context limit models. Why offer a 32K model if it fails after ~6K tokens?
  4. The server response doesn't include any error codes or message, it simply terminates unexpectedly. Hit any of the actual rate limits, and you get told about it.