interesting
My personal opinion is that while smut won't hurt anyone in and of itself, LLM smut will have weird and generally negative consequences, since it will be crafted specifically for you, on top of the intermittent-reinforcement component of LLM generation.
The sheer amount and variety of smut books (just books) is vastly larger than anyone wants to realize. We passed the mark decades ago where there is smut available for any and every taste. Like, to the point that even LLMs are going to take a long time to put a dent in the smut market. Humans have been making smut for longer than we've had writing.
Again, I don't think you're wrong, but the scale of the problem is way distorted.
> while smut won't hurt anyone in and of itself
"Legacy Smut" is well known to cause many kinds of harm to many kind of people, from the participants to the consumers.Why LLM is supposed to be worse?
I can see why Elon's making the switch from cars. We certainly won't be driving much.
And what if you are over 18, but don't want to be exposed to that "adult" content?
> Viral challenges that could push risky or harmful behavior
And
> Content that promotes extreme beauty standards, unhealthy dieting, or body shaming
Seem dangerous regardless of age.
- OpenAI botches the job. Articles are written about the fact that kids are still able to use it.
- Sam “responds” by making it an option to use Worldcoin orbs to authenticate. You buy one at the “register me” page, but you get an equivalent amount of Worldcoin at the current rate. Afterwards the orb is like a badge that you can put on your shelf to show to your guests.
“We heard you loud and clear. That’s why we worked hard to provide Worldcoin integration, so that users won’t have to verify their age through annoying, insecure and fallible means.” (So an example marketing blurb would say, implicitly referring to their current identity vendor, Persona, which people find annoying.)
- After enough orb hardware is out in the public, and after the API gains traction for third parties to use it, send a notice that x months from now, login without the orb will not be possible. “Here is a link to the shop page to get your orb, available in silver and black.”
I've seen four startups make bank on precisely that.
This is also true for stuff like writing clear but concise docs; they're overly verbose while often not getting the point across.
Ironically, as these models become more “safe” and commercialized, they often lose the spark that made them valuable for creative exploration in the first place. The focus on enterprise and mass adoption seems to be driving them toward mediocrity, at the expense of individuality and depth.
I’m guessing age is needed to serve certain ads and the like, but what’s the value for customers?
The "Easter Bunny" has always seemed creepy to me, so I started writing a silly song in which the bunny is suspected of eating children. I had too many verses written down and wanted to condense the lyrics, but found LLMs telling me "I cannot help promote violence towards children." Production LLM services would not help me revise this literal parody.
Another day I was writing a romantic poem. It was abstract and colorful, far from a filthy limerick. But when I asked LLMs for help encoding a particular idea sequence into a verse, the models refused (except for Grok, which didn't give very good writing advice anyway).
> If [..] you are under 18, ChatGPT turns on extra safety settings. [...] Some topics are handled more carefully to help reduce sensitive content, such as:
- Graphic violence or gore
- Viral challenges that could push risky or harmful behavior
- Sexual, romantic, or violent role play
- Content that promotes extreme beauty standards, unhealthy dieting, or body shaming
ChatGPT is absolute garbage.
Another pattern I’m noticing is strong advocacy for Opus, but that requires at least the 5x plan, which costs about $100 per month. I’m on the ChatGPT $20 plan, and I rarely hit any limits while using 5.2 on high in codex.
When I ask simple programming questions in a new conversation it can generally figure out which project I'm going to apply it to, and write examples tailored to those projects. I feel that it also makes the responses a bit more warm and personal.
Occasionally it will pop up saying "memory updated!" when you tell it some sort of fact. But hardly ever. And you can go through the memories and delete them if you want.
But it seems to know things from previous conversations where it never popped up to say it had updated its memory, things that don't appear in the list of memories.
So... how is it remembering previous conversations? There is obviously a second type of memory that they keep kind of secret.
I thought it was just me. What I found was that they added the extra bonus capacity at the end of December, but I felt like I was consuming quota at the same rate as before, and then afterwards consuming it faster than before.
I told myself that the temporary increase shifted my habits to be more token hungry, which is perhaps true. But I am unsure of that.
For agent/planning mode, that's the only one that has seemed reasonably sane to me so far, not that I have any broad experience with every model.
Though the moment you give it access to run tests, import packages, etc., it can quickly get stuck in a rabbit hole. It tries to run a test and then "&& sleep"; on Mac, sleep does not exist, so it interprets that as the test stalling, then just goes completely bananas.
It really lacks the "ok I'm a bit stuck, can you help me out a bit here?" prompt. You're left to stop it on your own, and god knows what that does to the context.
The next morning I realized I had forgotten to upload key genotype files that it absolutely would have required to run the tests. I asked Opus how it had generated the tables and graphs. Answer: “I confabulated the genotype data I needed.” Ouch, dangerous as a table saw.
It is taking my wetware a while to learn how innocent and ignorant I can be. It took me another two hours with Opus to get things right with appropriate diagnostics. I’ll need to validate results myself in JMP. Lessons to learn AND remember.
> type sleep
> sleep is /bin/sleep
What’s going on on your computer?
Edit: added quote
So it worked, but I didn't happily pay. And I noticed it became more complacent, hallucination-prone, and problematic. I might consider trying out ChatGPT's newer models again. Coding and technical projects didn't feel like its stronghold. Maybe things have changed.
What the hell are people doing that burns through that token limit so fast?
Though granted it comes in ~4 hour blocks and it is quite easy to hit the limit if executing large tasks.
Also worth considering that mileage varies because we all use agents differently, and what counts as a large workload is subjective. I am simply sharing my experience from using both Claude and Codex daily. For all we know, they could be running A/B tests, and we could both be right.
This does verify the idea that OpenAI does not make models sycophantic as attempted subversion by buttering up users so that they use the product more; it's because people actually want AI to talk to them like that. To me, that's insane, but they have to play the market, I guess.
I feel a lot of the "revealed preference" stuff in advertising is similar: advertisers find that if they get past the easy barriers users put in place, it's easier to sell them stuff that, at a higher level, the users do not want.
A lot of our industry is still based on the assumption that we should deliver to people what they demonstrate they want, rather than what they say they want.
If you ask me if I want to eat healthy and clean and I respond in the affirmative, it’s not a “gotcha” if you bait me with a greasy cheeseburger and then say “you failed the A/B test, demonstrating we know what you actually want more than you.”
I can't find the particular article (there are a few blogs and papers pointing out the phenomenon, but I can't find the one I enjoyed), but it was along the lines of how on LLMArena a lot of users tend to pick the "confidently incorrect" model over the "boring sounding but correct" model.
The average user probably prefers the sycophantic echo chamber of confirmation bias offered by a lot of large language models.
I can't help but draw parallels to the "You are not immune to propaganda" memes. Turns out most of us are not immune to confirmation bias, either.
When 5.2 was first launched, o3 did a notably better job at a lot of analytical prompts (e.g. "Based on the attached weight log and data from my calorie tracking app, please calculate my TDEE using at least 3 different methodologies").
o3 frequently used tables to present information, which I liked a lot. 5.2 rarely does this - it prefers to lay out information in paragraphs / blog post style.
I'm not sure if o3 responses were better, or if it was just the format of the reply that I liked more.
If it's just a matter of how people prefer to be presented their information, that should be something LLMs are equipped to adapt to at a user-by-user level based on preferences.
If a user spends more time on it and comes back, the product team winds up prioritizing whichever pattern was supporting that. It's just a continual selective evolution towards things that keep you there longer, based on what kept everyone else there longer.
If anyone is wondering, the setting for this is called Personalisation in user settings.
You’re not imagining it, and honestly? You're not broken for feeling this—it's perfectly natural as a human to have this sentiment.
I've been using Gemini exclusively for the 1 million token context window, but went back to ChatGPT after they raised the limits, and created a Project system for myself that gives me much better organization: Projects + Thinking-only chats (big context) + project-only memory.
Also, it seems like Gemini is really averse to googling (which is ironic in itself), while ChatGPT, at least in the Thinking modes, loves to look up current and correct info. If I ask something a bit more involved in Extended Thinking mode, it will think for several minutes and look up more than 100 sources. It's really good, practically a Deep Research inside a normal chat.
Not sure if others have seen this...
I could attribute it to:
1. It's a known quantity with the pro models (I recall that the pro/thinking models from most providers were not immediately equipped with web search tools when they were originally released)
2. Google wants you to pay more for grounding via their API offerings vs. including it out of the box
Much better feel with the Claude 4.5 series, for both chat and coding.
This is why my heart sank this morning. I have spent over a year training 4.0 to be just about helpful enough to get me an extra 1-2 hours a day of productivity. From experimentation, I can see no hope of reproducing that with 5.x, and even 5.x admits as much, when I discussed it with them today:
> Prolixity is a side effect of optimization goals, not billing strategy. Newer models are trained to maximize helpfulness, coverage, and safety, which biases toward explanation, hedging, and context expansion. GPT-4 was less aggressively optimized in those directions, so it felt terser by default.
Share and enjoy!
Maybe you should consider basing your workflows on open-weight models instead? Unlike proprietary API-only models, no one can take these away from you.
Playing with the system prompts, temperature, and max token output dials absolutely lets you make enough headway (with the 5 series) in this regard to demonstrably render its self-analysis incorrect.
Is there anything as good in the 5 series? Likely, but doing the full QA testing again for no added business value, just because the model disappears, is a hard sell. And the ones we tested were just slower, or tried to have more personality, which is useless for automation projects.
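For anyone curious, a minimal sketch of those dials via the OpenAI Python SDK (the model id, prompts, and values here are my own illustrative assumptions, and parameter support varies by model; some reasoning models expect max_completion_tokens instead of max_tokens):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5",  # placeholder id for a 5-series model
    messages=[
        {"role": "system", "content": "Answer tersely. No preamble, no hedging."},
        {"role": "user", "content": "Summarize the tradeoffs of UUIDv7 primary keys."},
    ],
    temperature=0.2,  # lower temperature trims the rambling variety
    max_tokens=200,   # hard cap on output length
)
print(response.choices[0].message.content)
```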
For instance, something simple like: "If I put 10 kW of solar on my roof, when is the payback, given xyz price / incentive / usage pattern?"
It used to give a kind of short technical report; now it's a long list of bullets and a very paternalistic, "this will never work" kind of negativity. I'm assuming this is the anti-sycophancy effort at work, but when you're working a problem you have to be optimistic until you get your answer.
For me this usage was a few times a day, for ideas or working through small problems. For code I've been using Claude for at least a year; it just works.
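For reference, the arithmetic that solar prompt is asking for is genuinely simple; a sketch where every number is a made-up placeholder for the "xyz" inputs:

```python
# Back-of-envelope solar payback; all inputs below are hypothetical.
system_kw = 10
cost_per_watt = 3.00        # installed cost, $/W
incentive = 0.30            # e.g. a 30% tax credit
kwh_per_kw_year = 1400      # annual production per kW, location-dependent
price_per_kwh = 0.25        # utility rate offset, $/kWh

net_cost = system_kw * 1000 * cost_per_watt * (1 - incentive)
annual_savings = system_kw * kwh_per_kw_year * price_per_kwh
print(f"payback: {net_cost / annual_savings:.1f} years")  # 6.0 years here
```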
Mostly because of how massively varied their releases are. Each one required big changes to how I use and work with it.
Claude is perfect in this sense: all their models feel roughly the same, just smarter, so my workflow is always the same.
Substantial "applied outcomes" regression from 3.7 to 4 but they got right on fixing that.
(I also use Deep Think on Gemini too, and to me, on programming tasks, it's not really worth the money)
ChatGPT 5 ~= Claude > ChatGPT 5.2 > Gemini >> Grok
It's just as good as ever /s
(I'm particularly annoyed by this UI choice because I always have to switch back to 5.1)
Also it's full of bugs, showing JSON all the time while thinking. But still it's my favorite model, so I'm switching back a lot.
You can go to chatgpt.com and ask "what model are you" (it doesn't hallucinate on this).
Spend a day on Reddit and you'll quickly realize many subreddits are just filled with lies.
At least they cannot read this.
If the 800M MAU figure still holds, that's 800k people.
Any suggestions?
Curious where this is going to go.
One of the big arguments for local models is that we can't trust providers to maintain ongoing access to the models you validated and put into production. Even if you run hosted models, running open ones means you can switch providers.
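Concretely, most hosts of open-weight models expose an OpenAI-compatible API, so a provider switch can be little more than a base_url change; a sketch with hypothetical URLs and a stand-in model id:

```python
from openai import OpenAI

# Same open weights served by two different (hypothetical) providers.
provider_a = OpenAI(base_url="https://provider-a.example/v1", api_key="KEY_A")
provider_b = OpenAI(base_url="https://provider-b.example/v1", api_key="KEY_B")

for client in (provider_a, provider_b):
    out = client.chat.completions.create(
        model="llama-3.1-70b-instruct",  # placeholder open-weight model id
        messages=[{"role": "user", "content": "ping"}],
    )
    print(out.choices[0].message.content)
```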
Opus 4.5 is better than GPT at everything except code execution (but with Pro you get a lot of Claude Code usage), and if they nuke all my old convos I'll probably downgrade from Pro to free.
I spent about half an hour trying to coax it in "plan mode" in IntelliJ, and it kept spitting out these generic ideas of what it was going to do, not really planning at all.
And when I asked it to execute the plan.. it just created some generic DTO and said "now all that remains is <the entire plan>".
Absolutely the worst experience with an AI agent so far, not to say that my overall experience has been terrific.
1) Our plan for Claude Opus 4.5 "ran out" or something.
I've heard great things about Mixtral's structured-output capabilities but haven't had a chance to run my evals on them.
If 4.1 is dropped from the API, that's the first course of action.
Also, the 5 series doesn't have fine-tuning capabilities, and it's unclear how that would work if a reasoning step is involved.
So we'll have to wait until "creativity" is solved.
Side note: I've been wondering lately about a way to bring creativity back to these thinking models. For creative writing tasks you could add the original, pretrained model as a tool call. So the thinking model could ask for its completions and/or query it and get back N variations. The pretrained model's completions will be much more creative and wild, though often incoherent (think back to the GPT-3 days). The thinking model can then review these and use them to synthesize a coherent, useful result. Essentially giving us the best of both worlds. All the benefits of a thinking model, while still giving it access to "contained" creativity.
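A rough sketch of that loop with both models stubbed out (every name here is hypothetical; a real version would sample high-temperature completions from a pretrained checkpoint and let the thinking model judge and merge them):

```python
def base_model_complete(prompt: str, n: int = 5) -> list[str]:
    """Stub: sample n wild, high-temperature completions from a base model."""
    return [f"[raw draft {i} riffing on: {prompt}]" for i in range(n)]

def thinking_model_synthesize(task: str, drafts: list[str]) -> str:
    """Stub: the thinking model reviews the drafts and writes a coherent result."""
    return f"coherent piece for {task!r}, drawing on {len(drafts)} raw drafts"

def creative_write(task: str) -> str:
    drafts = base_model_complete(task, n=5)  # the "tool call" asking for N variations
    return thinking_model_synthesize(task, drafts)

print(creative_write("an opening line about tide pools"))
```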
4.1 was the best so far, with straight-to-the-point answers that were correct most of the time, especially for code-related questions. 5.1/5.2, for their part, would much more easily hallucinate stupid responses or a stupid code snippet totally not what was expected.
(I have no idea. LLMs are infinite code monkeys on infinite typewriters for me, with occasional "how do I evolve this Pokémon" utility. But worth a shot.)
> creative ideation
At first I had no idea what this meant! So I asked my friend Miss Chatty [1] and we had an interesting conversation about it:
https://chatgpt.com/share/697bf761-990c-8012-9dd1-6ca1d5cc34...
[1] You may know her as ChatGPT, but I figure all the other AIs have fun human-sounding names, so she deserves one too.
You are absolutely right to ask about it!
(How did I do with channeling Miss Chatty's natural sycophancy?)
Anyway, I do use AI for other things, such as...
• Coding (where I mostly use Claude)
• General research
• Looking up the California Vehicle Code about recording video while driving
• Gift ideas for a young friend who is into astronomy (Team Pluto!)
• Why "Realtor" is pronounced one way in the radio ads, another way by the general public
• Tools and techniques for I18n and L10n
• Identifying AI-generated text and photos (takes one to know one!)
• Why spaghetti softens and is bendable when you first put it into the boiling water
• Burma-Shave sign examples
• Analytics plugins for Rails
• Maritime right-of-way rules
• The Uniform Code of Military Justice and the duty to disobey illegal orders
• Why, in a practical sense, the Earth really once *was* flat
• How de-alcoholized wine gets that way
• California law on recording phone conversations
• Why the toilet runs water every 20 minutes or so (when it shouldn't)
• How guy wires got that name
• Where the "he took too much LDS" scene from Star Trek IV was filmed
• When did Tim Berners-Lee demo the World Wide Web at SLAC
• What "ogr" means in "ogr2ogr"
• Why my Kia EV6 ultrasonic sensors freaked out when I stopped behind a Lucid Air
• The smartest dog breeds (in different ways of "smart")
• The Sputnik 1 sighting in *October Sky*
• Could I possibly be related to John White Geary?
And that's just from the last few weeks.
In other words, pretty much anything someone might interact with an AI - or a fellow human - about.
About the last one (John White Geary), that discussion started with my question about actresses in the "Pick a little, talk a little" song from The Music Man movie, and then went on to how John White Geary bridged the transition from Mexican to US rule, as did others like José Antonio Carrillo:
https://chatgpt.com/share/697c5f28-7c18-8012-96fc-219b7c6961...
If I could sum it all up, this is the kind of freewheeling conversation with ChatGPT and other AIs that I value.
If you disagree with something, you can also train a LoRA.
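If anyone wants to see what that involves, the adapter setup itself is small; a sketch using Hugging Face's peft, where the model id and hyperparameters are illustrative and you'd still need a dataset plus a training loop on top:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # common choice: attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # a tiny fraction of the base model's weights
```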
But I think a lot more people are using LLMs as relationship surrogates than that (pretty bonkers) subreddit would suggest. Character AI (https://en.wikipedia.org/wiki/Character.ai) seems quite popular, as do the weird fake friend things in Meta products, and Grok’s various personality modes and very creepy AI girlfriends.
I find this utterly bizarre. LLMs are peer coders in a box for me. I care about Claude Code, and that’s about it. But I realize I am probably in the vast minority.
Despite 4o being one of the worst models on the market, they loved it. Probably because it was the most insane and delusional. You could get it to talk about really fucked up shit. It would happily tell you that you are the messiah.
It used to get things wrong for sure but it was predictable. Also I liked the tone like everyone else. I stopped using ChatGPT after they removed 4o. Recently, I have started using the newer GPT-5 models (got free one month). Better than before but not quite. Acts way over smart haha
Note: I wouldn't actually; I find it terrible to prey on people.
Should be essential watching for anyone that uses these things.
LOL WHAT?! I'm 0.1% of users? I'm certain part of the issue is that it takes 3 clicks to switch to GPT-4o, and it has to be done each time the page is loaded.
> that they preferred GPT‑4o’s conversational style and warmth.
Uh.. yeah maybe. But more importantly, GPT-4o gave better answers.
Zero acknowledgement of how terrible GPT-5 was when it was first released. It has since improved, but it's not clear to me it's on par with GPT-4o. Thinking mode is just too slow to be useful, and so GPT-4o still seems better and faster.
Oh well, it'll be missed.
(Upgrade for only 1999 per month)
On the other hand, 5.0-nano has been great for fast (and cheap) quick requests, and there doesn't seem to be a viable alternative today if they're sunsetting the 5.0 models.
I really don't know how they're measuring improvements in the model since things seem to have been getting progressively worse with each release since 4o/o4 - Gemini and Opus still show the occasional hallucination or lack of grounding but both readily spend time fact-checking/searching before making an educated guess.
I've had ChatGPT blatantly lie to me and say there are several community posts and Reddit threads about an issue; then, after it failed to find them, I asked where it found those, and it flat out said "oh yeah, it looks like those don't exist".
Even if I submit the documentation or reference links they are completely ignored.
RIP
I'm sure there is some internal/academic reason for them, but to an outside observer they're simply horrible.
"I know! Let's restart the version numbering for no good reason!" becomes DOOM (2016), Mortal Kombat 1 (2025), Battlefield 1 (2016), Xbox One (not to be confused with the original Xbox 1)
As another example, look at how much of a trainwreck USB 3 has become
Or how Nvidia restarted GeForce card numbering