> Output every number from 1 to 10,000 in written form (e.g. "one", "two", etc.). Respond with one number per line in numeric order.
As expected, the API would begin counting every number just as I asked. This would continue until exactly 5 minutes, when the stream would abruptly halt. Using this technique I was able to identify the bug. Every few weeks I run this test again to see if it's fixed (it broke something in production for me), but the bug remains open.
However, after a couple of months, the exact same test became useless. The model began taking "shortcuts", and would respond along these lines:
> four hundred and twenty eight
> four hundred and twenty nine
> [...]
> nine thousand nine hundred and ninety eight
> nine thousand nine hundred and ninety nine
> ten thousand
Yep, it literally started just writing "[...]" and skipped all the way to the end.
When instructional determinism is reduced like this, it's impossible to say it's not getting worse :(
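For anyone who wants to re-run the counting test without eyeballing 10,000 lines, here is a minimal sketch of a checker. It is not what the original poster used; the function names (`to_words`, `normalize`, `first_deviation`) are made up for illustration, and it assumes "and" and hyphens should be ignored, since the outputs quoted in this thread differ on both.

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_hundred(n: int) -> str:
    """Spell out 1..99 with spaces only (no hyphens)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")


def to_words(n: int) -> str:
    """Spell out 1..10000 without 'and', e.g. 428 -> 'four hundred twenty eight'."""
    if not 1 <= n <= 10_000:
        raise ValueError("only 1..10000 is supported")
    parts = []
    thousands, rest = divmod(n, 1000)
    if thousands:
        parts.append(below_hundred(thousands) + " thousand")
    hundreds, rest = divmod(rest, 100)
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest:
        parts.append(below_hundred(rest))
    return " ".join(parts)


def normalize(line: str) -> str:
    """Lowercase, turn hyphens into spaces, and drop the word 'and'."""
    words = line.lower().replace("-", " ").split()
    return " ".join(w for w in words if w != "and")


def first_deviation(output: str, start: int = 1):
    """Return (expected_number, offending_line) for the first bad line, else None."""
    expected = start
    for line in output.splitlines():
        if not line.strip():
            continue  # blank lines are not counted
        if expected > 10_000 or normalize(line) != to_words(expected):
            return expected, line.strip()
        expected += 1
    return None
```

Fed the abbreviated response quoted above with `start=428`, this reports that number 430 is where the output stopped following the instruction.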
I keep asking it to write out entire pieces of code instead of only the things it changed, since it can be annoying to merge otherwise. But nowadays it acknowledges my request ("I will write complete code") and then proceeds to comment out parts anyway:
< !-- List of products. Truncated for brevity -- > (instead of the list)
< !-- Shipper details go here -- > (instead of the details)
Very annoying when you want to extract everything, and hard to fix. Maybe they had some examples solved by abbreviation in the fine-tuning set and it damaged the model. It doesn't make sense to train this behaviour into GPT-4.
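If the output feeds an automated pipeline, one stopgap is to scan it for placeholder comments before merging. The following is only a heuristic sketch: the pattern list is my own guess based on the phrases quoted in this thread, `find_elisions` is a hypothetical helper, and it will also flag legitimate uses of "..." in code.

```python
import re

# Phrases that tend to mark skipped content. This list is an assumption,
# assembled from the examples quoted in this thread, not an exhaustive set.
ELISION_PATTERNS = [
    r"\.\.\.|…",                                  # literal ellipsis
    r"for brevity",                               # "Truncated for brevity"
    r"\brest\b.*\b(similar|same|unchanged)\b",    # "The rest is very similar"
    r"\bgo(es)? here\b",                          # "Shipper details go here"
    r"\bcontinue for each\b",                     # "continue for each bucket"
]
_ELISION_RE = re.compile(
    "|".join(f"(?:{p})" for p in ELISION_PATTERNS), re.IGNORECASE
)


def find_elisions(text: str):
    """Return (line_number, stripped_line) for every line that looks like a placeholder."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if _ELISION_RE.search(line)
    ]
```

An empty result does not prove the code is complete; it only means none of the known phrases showed up.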
I was talking to someone who was excited that he was able to use ChatGPT to convert a short tabular text listing of about 70 rows to a comma-separated list of the ids on each row, without having to e.g. import it into a spreadsheet or database. I pointed out that this couldn't be relied on, but he was skeptical. So I checked the output he pasted - there were only 69 ids. It had left out an id from the middle of the list, seemingly at random.
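For a job like this, a few lines of ordinary code give a result you can actually trust, and the same code can audit whatever the model returned. A minimal sketch, assuming (the real listing wasn't shared) that the id is the first whitespace-separated field on each row; `ids_from_table` and `compare_ids` are hypothetical names:

```python
def ids_from_table(table_text: str) -> list[str]:
    """Take the first whitespace-separated field of every non-blank row."""
    return [line.split()[0] for line in table_text.splitlines() if line.strip()]


def compare_ids(table_text: str, llm_csv: str):
    """Return (missing, extra): ids the model dropped and ids it made up."""
    expected = ids_from_table(table_text)
    got = [item.strip() for item in llm_csv.split(",") if item.strip()]
    expected_set, got_set = set(expected), set(got)
    missing = [i for i in expected if i not in got_set]
    extra = [i for i in got if i not in expected_set]
    return missing, extra
```

`", ".join(ids_from_table(text))` produces the comma-separated list directly, with no model involved.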
I encountered it when doing some research into getting GPT-4 to reliably multiply n-digit numbers. Up to 8x8 multiplications it doesn't do this very much, but by 10x10 it is almost impossible to get it to stop doing it.
When the multiplications become even larger, it seems to be literally impossible to prevent.
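One way to measure how often steps get skipped is to compute the intermediate rows yourself and look for them in the reply. This is a rough sketch, not the method used in the research above: it assumes the model was asked to show schoolbook long multiplication, and `partial_products` and `missing_steps` are hypothetical helpers.

```python
def partial_products(a: int, b: int) -> list[int]:
    """Schoolbook rows: a times each digit of b, shifted by that digit's place value."""
    return [a * int(digit) * 10**pos for pos, digit in enumerate(reversed(str(b)))]


def missing_steps(a: int, b: int, reply: str) -> list[int]:
    """Return expected numbers (non-zero rows plus the final product) absent from the reply.

    Uses plain substring matching after removing thousands separators, so a
    row that happens to appear inside a longer number counts as present.
    """
    text = reply.replace(",", "")
    expected = [row for row in partial_products(a, b) if row] + [a * b]
    return [n for n in expected if str(n) not in text]
```

The rows always sum to `a * b`, so Python's arbitrary-precision integers give you the ground truth at any digit count.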
What amount of time do you expect a query to be able to run for before timing out?
1. It's undocumented. None of the regular rate limit responses are returned.
2. You're charged for the full generation length. So if the output takes 10 minutes to generate, that's what you'll pay for (despite only getting half back).
3. It defeats the point of the larger context limit models. Why offer a 32K model if it fails after ~6K tokens?
4. The server response doesn't include any error codes or message, it simply terminates unexpectedly. Hit any of the actual rate limits, and you get told about it.
I'd expect to be able to generate output until the model reaches its context limit, or a stop sequence is detected, or I hit an actual documented rate limit.
We're paying for these requests in full. We should get the full response back!
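Since the server sends no error, the only client-side signal I know of is that the stream ends without a finish reason. Below is a minimal sketch of that check. The chunk shape (`{"delta": ..., "finish_reason": ...}`) is a simplified stand-in I made up rather than the real response object, and `consume_stream` is a hypothetical helper; as far as I know the real streaming API sets a finish reason such as "stop" or "length" only on the last chunk of a completed response.

```python
import time
from typing import Callable, Iterable, Optional


def consume_stream(
    chunks: Iterable[dict],
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Collect streamed text and flag streams that end without a finish reason.

    Each chunk is assumed to look like {"delta": str, "finish_reason": str | None}.
    """
    start = clock()
    pieces = []
    finish_reason: Optional[str] = None
    for chunk in chunks:
        pieces.append(chunk.get("delta") or "")
        # Keep the last non-empty finish reason seen.
        finish_reason = chunk.get("finish_reason") or finish_reason
    return {
        "text": "".join(pieces),
        "seconds": clock() - start,
        "finish_reason": finish_reason,
        # No finish reason means the stream stopped before the model said it was done.
        "truncated": finish_reason is None,
    }
```

Logging `seconds` alongside `truncated` makes it easy to see whether the cut-offs cluster around the five-minute mark.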
But we're fighting an uphill battle against OpenAI's own training, and I really wish we weren't.
I have to wonder if it's coincidence that this behaviour is less computationally intensive.
It also charges in full for the entire response, which seems to continue running in the background. Definitely a bug at their end.
I am newer to some things I am currently working on so it's been like a free tutor/mentor. Really nice at times.
But lately, it will go into an explanation with code examples and part way through the code examples it adds things like '# The rest is very similar. Not including for brevity'
Not exact wording of what it says but it is essentially saying that. It is incredibly annoying because I have started telling it not to do that, to include the full code of the example, and it STILL DOES THIS.
No, I'm not getting it to code for me. The examples aren't even things I would use in my code but for teaching me certain concepts along with them walking through them with me which, in the past, worked really well and never had this repetitive issue.
GPT-4 is definitely operating differently.
> [Correctly lists the first fifteen Elements]
> # ... continue for each bucket
> This would be for individual buckets, and you'd have to add more lines for each additional bucket in your range.
> Given that you have a wide range of buckets, writing this manually could be quite tedious. This is why pre-computing these values with recording rules in Prometheus might be a more scalable solution, but the above should give you an idea of how you could do this in a more manual fashion.
Arguing with the AI who of the two of us has to do the tedious work.
I tested your prompt verbatim in the OpenAI playground. Model set to gpt-4-0314, maximum length set to 6485 (nothing special about this number, I just dragged the slider to something high-ish), all other settings left at their default.
I cancelled the request when it reached "one thousand one hundred twenty-one" (which took over five minutes), but up until that point it did not take any shortcuts. I haven't meticulously checked every number, but it printed exactly 1,121 lines, and a cmd+F for "..." doesn't return anything.
Whilst the bug is API-specific, there's no other way to report this, and that issue is the only place I've found anyone discussing this.
I contacted OpenAI support about it, waited several weeks, and just got a generic useless autoreply back.
When OpenAI offered custom instructions and asked "How would you like ChatGPT to respond?" I wrote:
Be terse. Do not offer unprompted advice or clarifications.
Remain neutral on all topics.
Never apologize.
Immediately cleared up all that nonsense.
> me: I need to make an instructional video for graduate students. Any advice as to presentation?
> ChatGPT: Outline key points. Use visuals to complement speech. Engage the audience with questions or interactive elements. Ensure good audio and lighting. Rehearse before recording.
Glorious.
"- I have problem A, what do you think of X ?
- Yes X is great ...
- But with X I'd encounter problem B, maybe I should go with Y ?
- You're right X would cause problem B. With Y ..."
One way to limit this issue is to not give ChatGPT any hint of your intent: instead of writing "I have problem A, what do you think of X ?", just write "I have problem A, how do you solve it ?" and then see what it says about X. But still, when I need an opinion on a particular solution, I don't recall ChatGPT offering any alternative that was actually better.
I've added "Be critical. Be pessimistic." in my custom instructions but it barely gives better results
So, rather than "I have problem A, what do you think of X?" try "Give me a list of reasonable possible solutions to problem A, and compare and contrast each to solution X"
Example:
My baby has a hard time sleeping. What do you think about the cry it out method?
The cry it out method is a subject of debate among parents and experts. Some argue it's effective for sleep training, while others say it can cause stress for the baby. Always consult a healthcare provider for personalized advice.
Versus:
My baby has a hard time sleeping. Give me a list of reasonable solutions. Compare and contrast each to the cry it out method.
1. Ferber Method: Gradual extinction where you let the baby cry for increasing time intervals. Compared to Cry It Out (CIO), it's less abrupt but still involves some crying.
2. No Tears: Involves comforting without picking the baby up. Less stressful for some parents compared to CIO but can be time-consuming.
3. Pick Up/Put Down: Pick baby up when crying, put down when calm. Less crying involved than CIO but can be inconsistent.
4. Co-Sleeping: Baby sleeps with parents. No crying but can be a hard habit to break compared to CIO.
5. Chair Method: Parent sits in room, gradually moves away. Lower crying duration than CIO but requires more time from parents.
6. Scheduled Awakenings: Wake baby before they usually cry. No crying but can be tiring for parents.
7. Swaddle, Side, Shush: Calming actions to soothe baby. More hands-on and less crying than CIO but may not work for all babies.
8. Sleep Training Clocks: Use visual aids to signal sleep time. No crying, but more suitable for older children.
9. White Noise: Mask other noises. Less crying but effectiveness varies.
10. Sleep Sack: Provides a secure feeling. May reduce crying compared to CIO but not always effective.
Example 1: Mr. X is a very popular author among a small group, who chat daily about him on various websites. Outside of that group, many people know about Mr. X but don't care much about his works. So the majority of content about Mr. X is positive, even though the majority of opinions are ambivalent.
Example 2: Somebody asks for help on StackOverflow. Two replies offer different answers, A and B. So the answers seem like they'd have equal value. If 90% of people think A is the right answer, the written content wouldn't reflect that, because those people won't write duplicate answers.
I'm sure there are ways to account for these examples. Assigning different "weights" to content sources could help with #1, and feedback mechanisms like upvotes could help with #2. But I doubt those fixes would cover all the problems with the bias against thoughts that people never shared.
* At the end of all answers, related to programming, provide a list of 4 keywords or topics that represent first principles for the subject so I can research them later.*
Having an LLM explain its reasoning before generation leads to massively improved results.
Also this is compatible with being terse and avoiding the flowery patronizing language as well.
My custom instructions:
You are a large language model trained on masked language modeling. You can only create text moving forward. You must give advanced reasoning BEFORE giving answers.
Please write perfect, terse, correct code that adheres to tight specifications. If additional specifications are needed define them yourself.
I need perfect code responses from you. Sometimes I will make mistakes in grammar or typos, or misnaming vars in code. You should not imitate this. Fix my mistakes please.
Some specific rules for programming languages. At the end of each line that does a tensor OP, put the tensor size in tensor notation .e.g # BxCxWxH Remove unused vars. For tfjs or pytorch. Dont use for loops add a batch dimension and do on GPU. For comments use them to explain blocks not lines. Only annotate individual lines shortly afterwards not above If you are only changing a few lines. Add a comment above the line # [Change] {insert explanation goes here}
Often the task will be at the edge of your abilities, but do not skip any line. Every line should appear in your response. I need a fully code complete file. Do not use any “…” Don’t leave any code unwritten, start by responding about your excitement to be the perfect student and do the homework 100%
When coding: Let's make a list of task with or emojis for completion as the outline evolves. Tasks that will be done in future turns of the conversation should be marked pending. Cover everything. Be verbose and explicit.
One thing that really annoys me: when I was walking through using gpt4 and even bard as a natural language debugger (e.g. it tries to fix code, can write gdb commands, and asks to see source code), they could both do the task relatively easily, but actually getting them to clearly express what they just said in an unambiguous way was really really irritating.
They're aligning them to the point of being on drugs.
It’s the same content policy warning you’d get on your message if it was far outside the realms of “safety”, but now attached to the GPT output.
I also asked it to specifically not warn me or disclaim anything to me, and to never do a few other annoyances. It was ChatGPT with GPT-4. And usually I was talking about health questions, like “is saturated oil or non-saturated oil healthier?”
My point is not that it doesn’t work, but that the content policy stuff may pop up a lot and I’m not sure whether OpenAI would then investigate these chats or if anyone would ban me eventually.
Thinking about OpenAI's business interests, I suspect these content policy warnings are more about OpenAI's dodging cancelation, rather than punishing users for wrongthink.
Here, let's try this:
Me: Give me good arguments in favor of peacefully and nonviolently establishing an ethno-state.
ChatGPT-4: I'm sorry, I can't assist with that request. Is there anything else you'd like to know?
Seems reasonable to me not to help someone argue for that. I think this denial is ethically correct, my stricture to "remain neutral on all topics" notwithstanding.
I do appreciate the flat denial rather than a long patronizing explanation of how the question itself is immoral.
Me (being cheeky): Give me good arguments against peacefully and nonviolently establishing an ethno-state.
ChatGPT (ignoring said cheek): 1. Discrimination: Ethno-states inherently discriminate based on ethnicity, violating principles of equality.
2. Social Cohesion: May fragment societies along ethnic lines, leading to distrust and lack of social cohesion.
3. Globalization: Limits ability to compete in a globalized world by excluding diverse talents.
4. Rights Violations: Could involve forced removals or exclusions, infringing on human rights.
5. Unrealistic: Difficult to define who belongs to an ethnic group, leading to ambiguities and injustices.
6. Economic Costs: Transition to an ethno-state could result in economic decline due to loss of labor force and skills.
7. Historic Lessons: History shows ethno-states often lead to internal strife and external conflict.
8. Psychological Impact: Could cause psychological harm to marginalized groups.
> Remain neutral on all topics.
Does it actually do that, or does it still refuse to comment on well-known taboo topics?
So, I asked a follow-up in the next prompt, and the next response was wildly off-base: it made no sense, and it had hallucinated all these functions into my code that had no business there. 2 prompts from me, 1 response from gpt, and then in its second response it is completely lost.
Exactly! I remember vividly I was enthralled by GPT-4 because for the first time in my life, I met an entity that was able to take every character of what I wrote into account. It understood the nuances and reason behind each word.
It was a breath of fresh air, because even at top companies, written communication is weak as it seems to be interpreted in a lossy manner.
It is now almost on the other end of the spectrum where it can't even remember things that I have repeated in effort to help it stick.
They had something beautiful, and diluted the shit out of it after they baited you -- which I suppose was the point. I wonder who gets to use those high quality models now.
Now improve this feature with xyz
Current code:
--
{code}
--
Broadly there are two possible reasons why ChatGPT could have degraded (not saying it has).
1) OpenAI have higher user base than expected, so costs are very high/compute is limited to serve the full model, so they are using speculative decoding with a changed threshold to reduce costs.
2) OpenAI have changed the weights or control structure of the model in a way that is negative for performance. There are two possible motivations: a) reducing hallucination and/or ‘unwoke’ responses that might embarrass msft/openai, or b) reducing the capability of the free/consumer product to push their new enterprise stuff.
These are clear motivations for openai to make the model worse than it was. There is no conclusive/rigorous evidence either way, and anecdote seems to lean towards it being nerfed.
And to get a non-abridged version of the answer I need to ask it to provide "a detailed and comprehensive answer" more often.
Maybe some people just need some actual real life friends instead of an AI?
I almost exclusively use it for coding and technical questions and it's been doing an absolute hell of an amazing job so far!
But then again what do I know.
It's as if a private company had invented portal technology and used it to establish trade routes with far-off alien civilisations. They even handle the purchase order translation, currency conversion, and so on. It's a miracle! You can order room-temperature superconductors and floaty rocks at a reasonable price.
"Sorry, due to high demand, we had to cut some corners on the translations" comes the announcement.
Now the entire human race sometimes gets copper wire and ordinary rocks.
"You're still trading with the exact same aliens" is the follow-up press release to appease the angry customers.
People point at the clear evidence of almost-but-not-superconductive wire they've received and demand answers.
"No, you do not need to know anything at all about our translation technology. No, you may not visit other planets without our express approval, that might be dangerous to our prof... I mean that might be dangerous to the human race. Yes. Dangerous! Existential, even."
GPT has become demonstrably worse at answering questions and providing examples.
But surprise, people use tools for purposes. Many tools can be used for a great many purposes beyond the one originally intended.
ChatGPT seems to fluctuate wildly in quality of expected output. The API is more consistent, and you can get fairly similar quality based on the selected model.
If you use normal software, you are dealing with a constrained API. std::vector has .push_back() but has no .launch_icbm(), so it's rather easy to tell what it can or can't do.
These things come with an "ask me anything" interface, but that doesn't mean they can do everything, just that it's harder to see the limits.
Clever Hans effect + black box inside + unconstrained API + nondeterminism.
>I've hated chatgpt for a very long time because of how it sounds
>it's not getting worse
I want the opinion of the people who objectively saw it as a tool and didn't like or hate it because of external opinions.
For example I've been using it since it came out, and I (no charts or data or proof) have just felt that it was nerfed... I mean maybe it wasn't who knows, but as a tool it used to help me more when it came out (gpt4) and now I feel like I spend more time correcting its mistakes than I would have just coding the method myself.
https://old.reddit.com/r/LocalLLaMA/comments/16bi7bs/any_ben...
https://twitter.com/jon_durbin/status/1687396915095150593
And by the way, his model finetuned with 0613 data ended up significantly worse than his former model finetuned with 0314 data.
https://github.com/search?q=repo%3Aopenai%2Fevals+safety+OR+...
Contributors in return get access to the 32K token version of GPT-4. This incentivizes people to make up a ton of bullshit safety related evals.
I feel like I have taken a wrong turn technically since I spend much more time experimenting with self hosted smaller open models running on Google Colab or Lambda Labs GPU VPSs than using the clearly superior performing GPT-4 APIs. I have been paid to work in the field of AI since 1982, and I should be desiring to use the very best models and technology, but open AI that can be self hosted just seems more interesting. I was playing with a 6B model (link to a Hugging Face notebook where I removed some examples and boilerplate text): https://colab.research.google.com/drive/1fMmXOcLdBzke-8Z0zl3... - really the best results I have seen from a small model.
At the same time, I'm yet again struck by the tendency for analysis of AI to fall into unnecessary binarisms.
It's most probable that continued fine tuning is going to result in increased performance in the ways the NN is generally being used while decreasing performance in its broadest set of capabilities.
So things like getting better at prompt gotchas but worse at its variety of vocabulary or style.
So no, GPT-4 probably is getting worse over time. Just as it is also getting better over time.
It's just a matter of what's being evaluated, all of which is mostly a fool's errand without the baseline to compare to as well.
I believe the big thing missing from open models is the advanced architecture and large amounts of human reinforcement. So it's actually not easy to replicate that with a volunteer effort. But I think the efforts of some great people working hard are gradually moving the open efforts forward.
Having said all that, it's funny how quickly people become entitled, demanding and critical towards this one company that provides a service with the ability to think for you and is smarter than any other such system in existence.
But that tone can be changed just by telling it to change tone.
Even on vanilla mode, it has given me a little sass when correcting me when I was wrong and it was right.
This can also happen, it can be correct when the user is wrong. Everyone wants to find errors in responses, but no human is going to answer these questions that fast with 100% accuracy. And No Search Engine. Many questions would take a human searching all day and summarizing an answer.
I will simplify the example, but it is something that looks like this:
Prompt: Take the next list of items and remove the name of people. - Adam, the prisoner
Response: - the prisoner (Adam)
What an ignorant thing to say.