undefined | Better HN

0 pointsziotom781d ago0 comments

I am a physics professor and often use Gemini to check my papers. It is a formidable tool: it was able to find a clerical error (a missing imaginary unit in a complex mathematical expression) I was not able to find for days, and it often underlines connections between concepts and ideas that I overlooked.

However, it often makes conceptual errors that I can spot only because I have good knowledge of the topic I am discussing. For instance, in 3D Clifford algebras it repeatedly confuses exponential of bivectors and of pseudoscalars.

Good to know that ChatGPT 5.5 Pro can produce a publishable paper, but from what I have seen so far with Gemini, it seems to me that it is better to consider LLMs as very efficient students who can read papers and books in no time but still need a lot of mentoring.

0 comments

nopinsight1d ago

I assume you're using the "regular" Pro version of Gemini 3.1 for the above, rather than the Deep Think mode, which is more comparable to GPT-5.5 Pro. To my knowledge, regular 3.1 Pro is a tier below and often makes mistakes.

Moreover, there's no reason to believe the progress of LLMs, which couldn't reliably solve high-school math problems just 3–4 years ago, will stop anytime soon.

You might want to track the progress of these models on the CritPt benchmark, which is built on *unpublished, research-level* physics problems:

https://critpt.com/

Frontier models are still nowhere near solving it, but progress has been rapid.

* o3 (high) <1.5 years ago was at 1.4%

* GPT 5.4 (xhigh), 23.4%

* GPT-5.5 (xhigh), 27.1%

* GPT-5.5 Pro (xhigh) 30.6%.

https://artificialanalysis.ai/evaluations/critpt.

FrojoS1d ago

> there's no reason to believe the progress of LLMs [...] will stop anytime soon

Wrong. Every advancement has followed a s curve. Where we are on that curve is anyones guess. Or maybe "this time its different".

dang14h ago

> Wrong.

Can you please edit out swipes/putdowns, as the guidelines ask (https://news.ycombinator.com/newsguidelines.html)? I'm sure you didn't intend it, but it comes across that way, and your comment would be just fine without that bit.

Edit: on closer look, it would be just fine without that bit and also without the snarky bit at the end. The rest is good.

gdhkgdhkvff21h ago

Great. You see a shape in graphs. And that shape tells you that _at some unknown point in the future_ progress will slow (but likely not stop).

Now back to the point, what reason do you have to believe progress will stop soon? If you have no reason, then it sounds like you agree with OP.

Which makes the patronizing sarcasm all that much more nauseating.

6 more replies

aspenmartin21h ago

It’s more of a guess if you don’t know about things like scaling laws and RL with verification. The onus of “we’re going to saturate” anytime soon is on that claim because every measurement points to that not being true.

2 more replies

aurareturn23h ago

He said "will stop anytime soon". He didn't say forever.

1 more reply

vessenes21h ago

There are advancements that do not follow s curves - consider for instance total data transmitted over all networks, or financial derivatives volumes.

I think a better question for AI is “is it more like a network effect, liquidity effect, or a biological/physical effect”?

4 more replies

gchamonlive21h ago

This could be right for the current architecture of LLMs, but you can come up with specialized large language models that can more efficiently use tokens for a specific subset of problems by encoding the information differently (https://www.nature.com/articles/d41586-024-03214-7).

So if instead of text we come up with a different representation for mathematical or physical problems, that could both improve the quality of the output while reducing the amount of transformers needed for decoding and encoding IO and for internal reasoning.

There are also difference inference methods, like autoregressive and diffusion, and maybe others we haven't discovered yet.

You combine those variables, along with the internal disposition of layers, parameter size and the actual dataset, and you have such a large search space for different models that no one can reliably tell if LLM performance is going to flatline or continue to improve exponentially.

2 more replies

dehrmann15h ago

I read an experiment someone wanted to try where they used pre-1900 content and tried to get relativity. Another version would be train an LLM on school curriculum up until calculus and see if it can invent calculus. Where we are on the curve depends on if it's remixing known things or genuinely inventing things.

From the article,

> ...LLMs have got to the point where if a problem has an easy argument that for one reason or another human mathematicians have missed (that reason sometimes, but not always, being that the problem has not received all that much attention), then there is a good chance that the LLMs will spot it. Conversely, for problems where one’s initial reaction is to be impressed that an LLM has come up with a clever argument, it often turns out on closer inspection that there are precedents for those arguments...

CuriouslyC19h ago

What people miss is that AI isn't one S curve, each capability we try to bake into a model has its own S curve. Model progress might not impact some capabilities at all, but other capabilities might get totally overhauled.

IanCal17h ago

Assuming it’ll stop soon is to wager that we’re at a very specific point on the curve.

If it’s anyone’s guess then we’re much more likely to be left of that, unless you argue we’re already on the flat side.

baq18h ago

you can tell where on the sigmoid we're currently sitting? frontier lab folks can't - chapeau bas good sir

1 more reply

holoduke20h ago

Software and hardware have no limits. Theoretically would could bozons for computations and have the same amount of computation available on one cm3 of the current total computation in the entire world. Same with software. Never there was a stop on new algorithms. With LLMs there are so many parts that will get better and are not very far fetched.

1 more reply

scotty7920h ago

It can be S curve (and it almost surely is), but on every chart you can plot, you don't see even of an inkling of the bend yet.

jeremyjh20h ago

What the fuck does that have to do with “soon”?

Der_Einzige20h ago

This is FUD and extremely wrong. None of the advancements have followed an S curve. This time IS different and it should be obvious to you at this point.

civvv1d ago

There are many indications that model progress is slowing down, so that is not entirely accurate.

aspenmartin21h ago

Please be specific because outside of anecdotal blog posts by people who don’t know what they’re talking about it’s not true. Look at scaling laws, composite benchmarks from the epoch capability index, nothing at all suggests “model progress is slowing down”

StrauXX23h ago

Which indications are that?

4 more replies

CuriouslyC19h ago

Model progress at spitting out unhallucinated facts is slowing down hard. Model progress at solving hard math challenges/programming tasks doesn't seem to be slowing down that I can tell.

Davidzheng21h ago

Deep think still makes many many many more mistakes than gpt 5.5 pro on math

maximamas1d ago

LLMs are at their best when you have an expectation for their output. I generally know the shape of the correct response and that allows me to evaluate it's output on it's "vibes", rather than line by line. If there's no expectation then I have to take everything at face value and now I'm at the mercy of the machine.

jillesvangurp1d ago

Exactly, if I generate a large chunk software, I'm going to have expectations about what it will do, how it will do it, etc. You don't just accept the statement that "it's done" for fact but you start looking for evidence.

A scientific approach here is to look to falsify the statement. You start asking questions, running tests, experiments, etc. to prove the notion that it is done wrong. And at some point you run out of such tests and it's probably done for some useful notion of done-ness.

I've built some larger components and things with AI. It's never a one shot kind of deal. But the good news is that you can use more AI to do a lot of the evaluation work. And if you align your agents right, the process kind of runs itself, almost. Mostly I just nudge it along. "Did you think about X? What about Y? Let's test Z"

noisy_boy23h ago

> Mostly I just nudge it along. "Did you think about X? What about Y? Let's test Z"

Exactly - you need to constantly have your sceptics glasses on and you need to be exacting in terms of the structure you want things to follow. Having and enforcing "taste" is important and you need to be willing to spend time on that phase because the quality of the payoff entirely depends on it.

I recently planned for a major refactor. The discussion with claude went on for almost two days. The actual implementation was done in 10 minutes. It probably has made some mistakes that I will have to check for during the review but given that the level of detail that plan document had, it is certainly 90-95% there. After pouring-in of that much opinion, it is a fairly good representation of what I would have written while still being faster than me doing everything by hand.

1 more reply

kannanvijayan19h ago

I can speak towards building large-scale systems from scratch with these tools. I've been working since late last year on a project that was barely a tech demo, and the progression of development on that project has seen me go from leveraging co-pilot autocomplete at the start, to full-on vibecoding 100% of the new additions.

I have reasonable eng chops I'd like to think - I have been a senior IC for a while on a reasonably diverse set of challenging systems problems and built out some pretty large-scale pieces of software the old "artisinal" way.

This particular project is a productization of some ideas I had for leveraging a virtual machine to execute high-divergence parallel logic on GPUs, in an effort to move complex things like "unit behaviour in games" (the classical symbolic kind, not NN-based unit behaviour) into the GPU. The project is going well but still quite a ways from release. But it's at about 300k lines of code now across 9 or so rust repositories, and a smattering of typescript on the frontend.

I have had stumbles, but overall I feel I have put together some good strategies and principles for pushing large projects along with these tools in an effective way.

The biggest takeaway for me is that the "feel" is different. Software construction by hand felt like building legos where you put the pieces together yourself. A lot of my focus would be on building and solidifying core components so I could rely on them when I stepped up to build higher-level components. Projects would get mired quickly if you didn't solidify your base.

With agentic development, one of the early challenges I ran into was this issue with something I'll call "oversight inception". It's when at some early point in the process a somewhat low-importance decision is made - an implementation decision, a decision to say.. align a test with the implementation rather than an implementation with a test.

Then, as you build more on top of this, that small decision somehow ends up getting reified into a core architectural policy that then cascades up.

You realize that when you're building a big project, the focus on some particular component is backstopped by a general understanding of local development directionality with respect to the larger level project. And the agent has no idea of directionality.

So small chinks in the design end up getting magnified and blown up as the dev process proceeds, and later on review you find major architectural pieces have just been overlooked, all flowing from some small incidental implementation choice a long time before.

This is one among a number of issues, but it's a big one. Once I saw it happening I tried an approach to mitigate it by developing a set of golden "goal" documents that describe directionality at the project level: what you are working towards and what design components need to exist.

This doesn't eliminate the "oversight inception" issue, but it does catch them earlier.

When I started applying the goal documentation aggressively to re-align the project implementation direction, I found velocity dropped a lot.

And as I progress, I'm balancing this out a bit - to allow the system to diverge a bit, but force reconvergence towards the goals at some specific cadence. I haven't found the right candence yet but I'm getting there.

This new style of development feels more like claymoulding pottery than lego assembly. You sort of "get it into shape". It's a very interesting new set of process assumptions.

ziotom78OP1d ago

I agree, but I would add that they can be very useful even if you do not have clear expectations but have some solid ways to verify their claims. Often in doing this verification I came up with new ideas.

tags2k1d ago

I'm no physics professor but this aligns with the way I use the tools in my "senior engineer" space. I bring the fundamentals to sanity-check the trigger-happy agent and try to imbue other humans with those fundamentals so they can move towards doing the same. It feels like the only way this whole thing will work (besides eventually moving to local models that do less but companies can afford).

illiac7861d ago

Using the word “Mentoring” is anthropomorphic and subconsciously makes you think it will learn. It does not, and it is for the human brain a formidable task to remember that something as smart as an LLM does not learn. I keep catching myself making the same mistake.

It’s also because it is so annoying to have to manage the memory of the LLM with custom prompts/instructions manually.

I have not yet played with the long term memory feature, but I fear it will be even less reliable than prompts, simply because in one year or two years so much will have changed again that this “memory” will have to be redone multiple times by then.

timschmidt1d ago

They can form new associations between concepts via their input prompts and thinking text. That is a form of learning. Just not very durable. I liken it to https://en.wikipedia.org/wiki/Anterograde_amnesia

illiac7861d ago

yeah, I should have been more specific: I meant the type of learning that mentoring fosters, the long term learning.

2 more replies

kybernetikos22h ago

Current LLM architecture doesn't learn - and you're right this is a huge piece that normal folks fail to understand, since in many ways, it's the opposite of what years of AI research has been trying to create.

However, I think it's important to remember that LLMs are embedded in larger systems, and those larger systems do learn.

baq17h ago

If I was a frontier lab and I solved continual learning, as of today I would absolutely not release it - the society isn't ready for this; society isn't even ready for widespread diffusion of current publicly available frontier models.

If however I was a frontier lab who solved continual learning and my competitor also solved and released it, I would release mine immediately, obviously.

The point is, continual learning might be solved already, we just don't know and those who might know would rather keep their mouths shut. It isn't my base case (financial situation of frontier labs is such that they'd probably release immediately as long as they have inference compute to serve this revolutionary capability), but it isn't impossible.

1 more reply

lukewarm70721h ago

exactly like you said - the harness might learn.

we do also have training on synthetic data. it might compound.

freedomben23h ago

I mostly agree, though after a mentoring session you can ask it to write skill or a memory and it can be reasonably durable. For Claude at least, the memories work pretty well (though I am still at a small scale with them. As they grow it might start to break somewhat. Doesn't always work, but has often enough that I thought it worth a mention.

stingraycharles22h ago

> Using the word “Mentoring” is anthropomorphic and subconsciously makes you think it will learn.

I think this is a bit pedantic. Obviously the parent you’re replying to is referring to the concept of “in-context learning”, which is the actual industry / academic term for this. So you feed it a paper, and then it can use that info, and it needs steering / “mentoring” to be guided into the right direction.

Heck the whole name of “machine learning” suggests these things can actually learn. “reasoning” suggests that these things can reason, instead of being fancy, directed autocomplete. Etc.

In other news: data hydration doesn’t actually make your data wet. People use / misuse words all the time, and that causes their meaning to evolve.

nozzlegear18h ago

Anthropomorphism is a subtle marketing tool used by these big AI companies, who are financially incentivized to push the myth of AGI and want everyone to believe they're right on the cusp of achieving it. It's good to be pedantic in this case, we shouldn't anthropomorphize these tools.

1 more reply

kasey_junk21h ago

I agree it’s pedantic and personally don’t get bent out of shape with people anthropomorphizing the llms. But I do think you get better results if keep the text prediction machine mental model in your head as you work with them.

And that can be very hard to do given the ui we most interact with them in is a chat session.

1 more reply

thfuran18h ago

But in-context learning is like a student only remembering what they’re being taught for the duration of the discussion. That’s not really how mentoring is meant to work, so pointing out the issues with the metaphor seems pretty reasonable.

In other news: That words can change meaning doesn’t mean that every possible change in meaning would be beneficial to communication and therefore desirable. Would you advocate in support of someone suggesting to use “left” to mean “right” simply on the basis words can change in meaning?

_the_inflator1d ago

I agree and put it this way: LLMs sound so convincing presenting you the work it does rose colored and promising to give you more if you keep going.

There is a 50/50 chance that it turns out to be right or letting you jump of the cliff.

Only the trip stays the same beautiful 5 star plus travel.

Also, spotting an error and telling LLM makes it in most cases worse, because the LLM wants to please you and goes on to apologize and change course.

The moment I find myself in such a situation I save or cancel the session and start from scratch in most cases or pivot with drastic measures.

Gemini to me is the most unpredictable LLM while GPT works best overall for me.

Gemini lately gave me two different answers to the same question. This was an intentional test because I was bored and wanted to see what happens if you simply open a new chat and paste the same prompt everything else being the same.

Reasoning doesn’t help much in the Coding domain for me because it is very high level and formally right what the LLM comes up with as an explanation.

I google more due to LLMs than before, because essentially what I witnessed is someone producing something that I gotta control first before I hit the button that it comes with. However, you only find out shortly afterwards whether the polished button started working or gave you a warm welcome to hell.

MattPalmer108623h ago

Reusing the same prompt several times is something I've started doing too. The contrast is often illuminating.

In one case, it made a thoroughly convincing argument that an approach was justified. The second time it made exactly the opposite argument, which was equally compelling.

I now see LLMs as persuasion machines.

eitally19h ago

One thing I've been doing lately -- and I'm in a business function, not a technical one, although I have an engineering background -- is pitting LLMs against each other. For example, if I'm structuring a proposal or a contract with the assistance of Claude, I'll begin my 360 feedback review first by asking Claude how it would react if it were the counter-party receiving the proposal. After some iterative changes, mostly manual, I will then run the same output document past Gemini and ask it to adopt personas from both sides and provide reactive feedback. The result of this is almost always a stronger proposal that I can also accompany with proactive objection handling and a solid FAQ, as well as clear points of negotiation that will likely be acceptable to both parties.

For this sort of thing, using multiple LLMs is extremely helpful.

scotty7920h ago

Before AI happened I watched youtube. Occasionally I encountered there very convincing arguments. Same person often made very convincing arguments on many subjects.

But noticed that the closer the domain they were talking about was to my area of competence the less convincing their arguments were. There were more holes, errors and wrong conclusions.

I recalibrated my bs meter thanks to that.

Since AI came I successfully used this strategy of being extremely cautious towards convincing arguments to not become mislead by AI.

However this year I'm working with AI more in the domain of software development. Where I can see the competence. And I see the competence. This had opposite effect on me. I tend to trust AI outside my domain of expertise much more after I saw what can it do in software.

One caveat though is that there are a lot of areas of human culture where there's very little actual knowledge, but a lot of opinions, like politics, economy, diet, business, health. I still don't trust AI in those domains. But then again, I don't trust humans there either.

For me basically AI achieved the threshold of useful reliability for any domain that humans are reliable at.

I don't really care about sycophancy. I might have a slight advantage that I don't talk to AI in my native language. So its responses don't have a direct line to my emotions.

taneq21h ago

Ever since they started getting really sycophantic, I’ve been presenting my ideas as “my co-worker says this is a good approach but I disagree, can you help me convince him that it’s wrong?”

pbhjpbhj23h ago

>LLM wants to please you

I was using Copilot and asked it a question about a PDF file (a concept search). It turned out the file was images of text. I was anticipating that and had the text ready to paste in.

Instead, it started writing an OCR program in python.

I stopped it after several minutes.

Often Copilot says it can't do something (sometimes it's even correct), that's preferential to the try-hard behaviour here.

freedomben23h ago

> Gemini to me is the most unpredictable LLM while GPT works best overall for me.

This nails an important thing IMHO. I've absolutely noticed this, for better or worse. Gemini can produce surprisingly excellent things, but it's unpredictability make me go for GPT when I only want to ask it once.

miki12321114h ago

I think that ultimately, the largest change brought on by LLMs will be due not to their intelligence, but to their tenacity.

If you had an infinite number of monkeys, each with a typewriter, one would eventually write Shakespeare. If you had an infinite number of college-educated interns, each with access to all the public records you can possibly get via FOIA, one would eventually get enough evidence to prove that a top politician is cheating on their partner, evidence which you could use to blackmail that politician.

You don't need that much intelligence to do that, you just need somebody who's willing to dedicate their life to knowing everything there is to know about that guy from Louisiana.

With humans, the amount of money you'd need to pay such a person just isn't worth the reward. With LLMs, it may very well be.

mixtureoftakes1d ago

please, sign up for a paid plan of either chatgpt or claude. gemini is while close, still noticeably behind

you deserve opinions shaped by interactions with the best tools that are out there.

wg01d ago

Gemini feels deep and philosophical. Especially for product management. Tell him you're a product manager and we're a team of two.

But regular reminder - All LLMs can be wrong all the time. I only work with LLMs in domains I'm expert in OR I have other sources to verify their output with utmost certainty.

wafflemaker1d ago

Or when you don't care about results being very correct.

When I'm cooking meatballs with sauce and the recipe calls for frying them, I'll have an LLM guestimate how long and which program to use in an air fryer to mimic the frying pan, based on a picture of balls in a Pyrex. So I can just move on with the sauce, instead of spending time browsing websites and stressing about getting it perfect.

I used to hate these non-deterministic instructions, now I treat it as their own game. When I will publish my first recipe, I'll have an LLM randomize the ingredient amounts, round them up to some imprecise units and also randomize the times. Psychologists say we artists need to participate and I WILL participate.

smartmic1d ago

> I only work with LLMs in domains I'm expert in

This. Should become a general rule for any non-trivial use of LLM in a professionel setting.

1 more reply

ainch21h ago

Agreed, Gemini is clearly a capable model, but the tool use is lagging behind the other two. Ironically it regularly gets things wrong (ie. the current version of some software) because of an unwillingness to use web search.

cubefox1d ago

Gemini is certainly not behind Claude in terms of physics.

peyton1d ago

Seriously, it’s not worth reaching for less intelligence. Use Extended Pro 100% of the time for things you’d spend the amount of time GP spent writing their post.

hodgehog111d ago

ChatGPT and Gemini are actually fairly comparable.

Claude has been utterly useless with most math problems in my experience because, much like less capable students, it tends to get overly bogged down in tedious details before it gets to the big picture. That's great for programming, not so much for frontier math. If you're giving it little lemmas, then sure it's great, but otherwise you're just burning tokens.

Quothling1d ago

We've got a rather extensive AI setup through our equity fund and I've setup a group of agents for data architecture at scale. One is the main agent I discuss with and it's setup to know our infrastructure and has access to image generation tools, websearch, hand off agents and other things. I tend to use Opus (4-6 currently) and I find it to be rather great. As you point out it comes with the danger of making mistakes, and again, as you point out, it's not an issue for things I'm an expert on. What I rely on it for, however, is analysing how specific tools would fit into our architecture. In the past you would likely have hired a group of consultants to do this research, but now you can have an AI agent tell you what the advantages and disadvantages of Microsoft Fabric in your setup. Since I don't know the capabilities of Fabric I can't tell if the AI gives me the correct analysis of a Lakehouse and a Warehouse (fabric tools).

What I do to mitigate this is that I have fact checking agents configured to be extremely critical and non-biased on Opus, Gemini and GPT. Which are then handed the entire conversation to review it. Then it's handed off to a Opus agent which is setup to assume everything is wrong. After this, and if I'm convinced something is correct I'll hand the entire thing off to a sonnet agent, which is setup to go through the source material and give me a compiled list of exactly what I'll need to verify.

It's ridicilously effective, but I do wonder how it would work with someone who couldn't challenge to analytic agent on domain knowledge it gets wrong. Because despite knowing our architecture and needs, it'll often make conceptional errors in the "science" (I'm not sure what the English word for this is) of data architecture. Each iteration gets better though, and with the image generation tools, "drawing" the architecture for presentations from c-level to nerds is ridiclously easy.

stasomatic20h ago

Are you using this agent hive for any repeatable tasks? What you described, superficially, seems like a one off. Genuinely curious.

wccrawford22h ago

This doesn't surprise me since the coding agents are similar. I've previously compared them to very fast, ambitious junior programmers. I think they are probably mid-level coders now, but they continue to make mistakes that a senior programmer wouldn't. Or at least shouldn't.

danielparsons15h ago

I find exactly the same for legal analysis. Great at ideation and proofreading but frequently misunderstands concepts and hallucinates conclusions from faulty premises.

northzen22h ago

Hi ziotom! I wonder about you work in 3D Cifford Algebras. May you share some links to the research you do? I also have interest in this topic I research on my own.

Just in case if you don't want to disclose your name my email is northzen@gmail.com

port1123h ago

Gemini’s smug and over-confident “this is the gold standard in 2026” definitely leaves little space for nuance if you don’t know the subject matter. Human students would, hopefully, know they don’t know everything.

quantummagic23h ago

> Gemini’s smug...

Anthropomorphizing these systems is dangerous, whether coming from the bullish or bearish perspective. The output is statistically generated by a machine lacking the capability to be smug.

Jtarii22h ago

>Anthropomorphizing these systems is dangerous

That ship has sailed. Humans will anthropomorphize a rock if you put googly eyes on it.

1 more reply

DiogenesKynikos22h ago

It's only "statistically generated" in the same way that your brain is just "neurons firing." That's the low-level description of what's happening, but on a higher level, it's correct to say that it's being smug.

1 more reply

recursivecaveat1d ago

This is close to my experience with code. LLMs can pick out small mistakes from giant code changes with surprising accuracy, or slowly narrow down a weird. On the other hand I've seen them bravely shoulder on under completely incorrect conceptual models of what they're working with and churn around in circles consequently, spin up giant piles of slop to re-implement something they decided was necessary, but didn't bother to search for, or outright dismiss important error signals as just 'transient failures'. Unlimited stamina, low wisdom.

wslh6h ago

I assume that once LLMs are trained with large [synthetic] information about 3D Clifford algebras it will work better.

tasuki1d ago

> in 3D Clifford algebras it repeatedly confuses exponential of bivectors and of pseudoscalars.

I have no idea what any of those words even mean. I'm sure LLMs make similar obvious-to-professors mistakes in all the domains. Not long ago, we didn't even have chatbots capable of basic conversation...

jiggawatts21h ago

Ironically, it's sort of the other way around! Every frontier chatbot since GPT 4 (at least) has had a pretty good understanding of even very esoteric technical concepts.

Bivectors and pseudoscalars (in a 3D context) are "just" signed areas and volumes. Easy!

Back around the GPT 3, 3.5, and 4.0 era I used to ask the bots to explain "counterfactual determinism", which is one of the most complex topics I personally understand.

Then I would lie to the bot about it, and see if it corrected me or not.

This test is useless now, the frontier models can't be fooled any longer on such "basic" concepts.

Conversely, LLMs are basically useless at anything that doesn't have enough (or no) public information for their training. Think: obscure proprietary product config files and the like, even if the concepts involved are trivial.

Similarly, Clifford Algebra is a relatively niche (even "alternative") area of mathematics and physics, with vastly less written material about it than the competing linear algebra. Hence, the AIs are bad at it.

wood_spirit1d ago

Chiming in to agree but clarify that the latest sota models are no better than Gemini.

I put my stuff through several sota models and round robin them in adversarial collaboration and they are all useful even though, fundamentally, they don’t “understand” anything. But they are super useful delegates as long as deciding on the problem and approach and solution all sits safely in your head so you can challenge them and steer them.

So I know the article is about one particular new model acing something and each vendor wants these stories to position their model as now good enough to replace humans and all other models, but working somewhere where I am lucky enough to be able to use all the sota models all the time, I can say that all keep making obvious mistakes and using all adversarially is way better than trusting just one.

I look forward to the day one a small open model that we can run ourselves outperforms the sum of all today’s models. That’s when enough is enough and we can let things plateau.

energy12323h ago

Basically all Erdos problems that get solved with AI use ChatGPT 5.* Pro, not Gemini/Opus.

5555watch22h ago

I would guess it's because ChatGPT Pro allows for 80min "think". I've never had even remotely similar think times with Gemini Deep Think. It's generally around 10-15min for math problems, and get increasingly shorter for continued interactions.

eth0up20h ago

Any experience with NotebookLM?

Mine has been epically bad.

ed_balls20h ago

intern that never sleeps

cyanydeez1d ago

I've been watching the automation of things like flight control systems for the past decade, and the evolution of the fallback to a real pilot in the event of a emergency is what's most concerning about where LLMs are being embedded.

Right now, we have a lot of smart people who have trained for decades to understand where these things go wrong and how to nudge them back, but the pool of people are going to slowly be replaced by less knowledgeable.

At some point, a rubicon will be crossed where these systems can't fallback to a human operator and will fail spectacularly.

pbhjpbhj23h ago

Watching a teenager approach their homework, instead of struggling to answer questions they don't know, they ask Gemini. Unfortunately, I think the mental struggle to approach an answer is where much of the learning is. They also miss out on the reward for persistence of seeing things fall together.

It is troubling. It suggests a plateauing of human understanding.

regularfry22h ago

It absolutely is where the learning is, that's pretty well established brain science.

regularfry22h ago

What that means practically is that we've got a generation - 25 years or less - to evolve these things not to need the fallback. If such a thing is possible.

leptons1d ago

We're on the road to Idiocracy.

DeathArrow1d ago

I don't think the experience with Gemini will be the same when using GPT.

ieieaaa20h ago

LLM’s are the most powerful tool invented to search across a huge information space in response to human input.

That’s all they are. They don’t ‘know’ anything intrinsically and do know ‘know’ what reasoning even is.

j / k navigate · click thread line to collapse

0 comments

nopinsight1d ago

Moreover, there's no reason to believe the progress of LLMs, which couldn't reliably solve high-school math problems just 3–4 years ago, will stop anytime soon.

You might want to track the progress of these models on the CritPt benchmark, which is built on *unpublished, research-level* physics problems:

https://critpt.com/

Frontier models are still nowhere near solving it, but progress has been rapid.

* o3 (high) <1.5 years ago was at 1.4%

* GPT 5.4 (xhigh), 23.4%

* GPT-5.5 (xhigh), 27.1%

* GPT-5.5 Pro (xhigh) 30.6%.

https://artificialanalysis.ai/evaluations/critpt.

FrojoS1d ago

> there's no reason to believe the progress of LLMs [...] will stop anytime soon

Wrong. Every advancement has followed a s curve. Where we are on that curve is anyones guess. Or maybe "this time its different".

dang14h ago

> Wrong.

Edit: on closer look, it would be just fine without that bit and also without the snarky bit at the end. The rest is good.

gdhkgdhkvff21h ago

Great. You see a shape in graphs. And that shape tells you that _at some unknown point in the future_ progress will slow (but likely not stop).

Now back to the point, what reason do you have to believe progress will stop soon? If you have no reason, then it sounds like you agree with OP.

Which makes the patronizing sarcasm all that much more nauseating.

6 more replies

aspenmartin21h ago

2 more replies

aurareturn23h ago

He said "will stop anytime soon". He didn't say forever.

1 more reply

vessenes21h ago

There are advancements that do not follow s curves - consider for instance total data transmitted over all networks, or financial derivatives volumes.

I think a better question for AI is “is it more like a network effect, liquidity effect, or a biological/physical effect”?

4 more replies

gchamonlive21h ago

There are also difference inference methods, like autoregressive and diffusion, and maybe others we haven't discovered yet.

2 more replies

dehrmann15h ago

From the article,

CuriouslyC19h ago

IanCal17h ago

Assuming it’ll stop soon is to wager that we’re at a very specific point on the curve.

If it’s anyone’s guess then we’re much more likely to be left of that, unless you argue we’re already on the flat side.

baq18h ago

you can tell where on the sigmoid we're currently sitting? frontier lab folks can't - chapeau bas good sir

1 more reply

holoduke20h ago

1 more reply

scotty7920h ago

It can be S curve (and it almost surely is), but on every chart you can plot, you don't see even of an inkling of the bend yet.

jeremyjh20h ago

What the fuck does that have to do with “soon”?

Der_Einzige20h ago

This is FUD and extremely wrong. None of the advancements have followed an S curve. This time IS different and it should be obvious to you at this point.

civvv1d ago

There are many indications that model progress is slowing down, so that is not entirely accurate.

aspenmartin21h ago

StrauXX23h ago

Which indications are that?

4 more replies

CuriouslyC19h ago

Model progress at spitting out unhallucinated facts is slowing down hard. Model progress at solving hard math challenges/programming tasks doesn't seem to be slowing down that I can tell.

Davidzheng21h ago

Deep think still makes many many many more mistakes than gpt 5.5 pro on math

maximamas1d ago

jillesvangurp1d ago

noisy_boy23h ago

> Mostly I just nudge it along. "Did you think about X? What about Y? Let's test Z"

1 more reply

kannanvijayan19h ago

I have had stumbles, but overall I feel I have put together some good strategies and principles for pushing large projects along with these tools in an effective way.

Then, as you build more on top of this, that small decision somehow ends up getting reified into a core architectural policy that then cascades up.

This doesn't eliminate the "oversight inception" issue, but it does catch them earlier.

When I started applying the goal documentation aggressively to re-align the project implementation direction, I found velocity dropped a lot.

This new style of development feels more like claymoulding pottery than lego assembly. You sort of "get it into shape". It's a very interesting new set of process assumptions.

ziotom78OP1d ago

tags2k1d ago

illiac7861d ago

It’s also because it is so annoying to have to manage the memory of the LLM with custom prompts/instructions manually.

timschmidt1d ago

illiac7861d ago

yeah, I should have been more specific: I meant the type of learning that mentoring fosters, the long term learning.

2 more replies

kybernetikos22h ago

However, I think it's important to remember that LLMs are embedded in larger systems, and those larger systems do learn.

baq17h ago

If however I was a frontier lab who solved continual learning and my competitor also solved and released it, I would release mine immediately, obviously.

1 more reply

lukewarm70721h ago

exactly like you said - the harness might learn.

we do also have training on synthetic data. it might compound.

freedomben23h ago

stingraycharles22h ago

> Using the word “Mentoring” is anthropomorphic and subconsciously makes you think it will learn.

Heck the whole name of “machine learning” suggests these things can actually learn. “reasoning” suggests that these things can reason, instead of being fancy, directed autocomplete. Etc.

In other news: data hydration doesn’t actually make your data wet. People use / misuse words all the time, and that causes their meaning to evolve.

nozzlegear18h ago

1 more reply

kasey_junk21h ago

And that can be very hard to do given the ui we most interact with them in is a chat session.

1 more reply

thfuran18h ago

_the_inflator1d ago

I agree and put it this way: LLMs sound so convincing presenting you the work it does rose colored and promising to give you more if you keep going.

There is a 50/50 chance that it turns out to be right or letting you jump of the cliff.

Only the trip stays the same beautiful 5 star plus travel.

Also, spotting an error and telling LLM makes it in most cases worse, because the LLM wants to please you and goes on to apologize and change course.

The moment I find myself in such a situation I save or cancel the session and start from scratch in most cases or pivot with drastic measures.

Gemini to me is the most unpredictable LLM while GPT works best overall for me.

Reasoning doesn’t help much in the Coding domain for me because it is very high level and formally right what the LLM comes up with as an explanation.

MattPalmer108623h ago

Reusing the same prompt several times is something I've started doing too. The contrast is often illuminating.

In one case, it made a thoroughly convincing argument that an approach was justified. The second time it made exactly the opposite argument, which was equally compelling.

I now see LLMs as persuasion machines.

eitally19h ago

For this sort of thing, using multiple LLMs is extremely helpful.

scotty7920h ago

Before AI happened I watched youtube. Occasionally I encountered there very convincing arguments. Same person often made very convincing arguments on many subjects.

But noticed that the closer the domain they were talking about was to my area of competence the less convincing their arguments were. There were more holes, errors and wrong conclusions.

I recalibrated my bs meter thanks to that.

Since AI came I successfully used this strategy of being extremely cautious towards convincing arguments to not become mislead by AI.

For me basically AI achieved the threshold of useful reliability for any domain that humans are reliable at.

I don't really care about sycophancy. I might have a slight advantage that I don't talk to AI in my native language. So its responses don't have a direct line to my emotions.

taneq21h ago

Ever since they started getting really sycophantic, I’ve been presenting my ideas as “my co-worker says this is a good approach but I disagree, can you help me convince him that it’s wrong?”

pbhjpbhj23h ago

>LLM wants to please you

I was using Copilot and asked it a question about a PDF file (a concept search). It turned out the file was images of text. I was anticipating that and had the text ready to paste in.

Instead, it started writing an OCR program in python.

I stopped it after several minutes.

Often Copilot says it can't do something (sometimes it's even correct), that's preferential to the try-hard behaviour here.

freedomben23h ago

> Gemini to me is the most unpredictable LLM while GPT works best overall for me.

miki12321114h ago

I think that ultimately, the largest change brought on by LLMs will be due not to their intelligence, but to their tenacity.

You don't need that much intelligence to do that, you just need somebody who's willing to dedicate their life to knowing everything there is to know about that guy from Louisiana.

With humans, the amount of money you'd need to pay such a person just isn't worth the reward. With LLMs, it may very well be.

mixtureoftakes1d ago

please, sign up for a paid plan of either chatgpt or claude. gemini is while close, still noticeably behind

you deserve opinions shaped by interactions with the best tools that are out there.

wg01d ago

Gemini feels deep and philosophical. Especially for product management. Tell him you're a product manager and we're a team of two.

But regular reminder - All LLMs can be wrong all the time. I only work with LLMs in domains I'm expert in OR I have other sources to verify their output with utmost certainty.

wafflemaker1d ago

Or when you don't care about results being very correct.

smartmic1d ago

> I only work with LLMs in domains I'm expert in

This. Should become a general rule for any non-trivial use of LLM in a professionel setting.

1 more reply

ainch21h ago

cubefox1d ago

Gemini is certainly not behind Claude in terms of physics.

peyton1d ago

Seriously, it’s not worth reaching for less intelligence. Use Extended Pro 100% of the time for things you’d spend the amount of time GP spent writing their post.

hodgehog111d ago

ChatGPT and Gemini are actually fairly comparable.

Quothling1d ago

stasomatic20h ago

Are you using this agent hive for any repeatable tasks? What you described, superficially, seems like a one off. Genuinely curious.

wccrawford22h ago

danielparsons15h ago

I find exactly the same for legal analysis. Great at ideation and proofreading but frequently misunderstands concepts and hallucinates conclusions from faulty premises.

northzen22h ago

Hi ziotom! I wonder about you work in 3D Cifford Algebras. May you share some links to the research you do? I also have interest in this topic I research on my own.

Just in case if you don't want to disclose your name my email is northzen@gmail.com

port1123h ago

quantummagic23h ago

> Gemini’s smug...

Anthropomorphizing these systems is dangerous, whether coming from the bullish or bearish perspective. The output is statistically generated by a machine lacking the capability to be smug.

Jtarii22h ago

>Anthropomorphizing these systems is dangerous

That ship has sailed. Humans will anthropomorphize a rock if you put googly eyes on it.

1 more reply

DiogenesKynikos22h ago

1 more reply

recursivecaveat1d ago

wslh6h ago

I assume that once LLMs are trained with large [synthetic] information about 3D Clifford algebras it will work better.

tasuki1d ago

> in 3D Clifford algebras it repeatedly confuses exponential of bivectors and of pseudoscalars.

jiggawatts21h ago

Ironically, it's sort of the other way around! Every frontier chatbot since GPT 4 (at least) has had a pretty good understanding of even very esoteric technical concepts.

Bivectors and pseudoscalars (in a 3D context) are "just" signed areas and volumes. Easy!

Back around the GPT 3, 3.5, and 4.0 era I used to ask the bots to explain "counterfactual determinism", which is one of the most complex topics I personally understand.

Then I would lie to the bot about it, and see if it corrected me or not.

This test is useless now, the frontier models can't be fooled any longer on such "basic" concepts.

wood_spirit1d ago

Chiming in to agree but clarify that the latest sota models are no better than Gemini.

I look forward to the day one a small open model that we can run ourselves outperforms the sum of all today’s models. That’s when enough is enough and we can let things plateau.

energy12323h ago

Basically all Erdos problems that get solved with AI use ChatGPT 5.* Pro, not Gemini/Opus.

5555watch22h ago

eth0up20h ago

Any experience with NotebookLM?

Mine has been epically bad.

ed_balls20h ago

intern that never sleeps

cyanydeez1d ago

At some point, a rubicon will be crossed where these systems can't fallback to a human operator and will fail spectacularly.

pbhjpbhj23h ago

It is troubling. It suggests a plateauing of human understanding.

regularfry22h ago

It absolutely is where the learning is, that's pretty well established brain science.

regularfry22h ago

What that means practically is that we've got a generation - 25 years or less - to evolve these things not to need the fallback. If such a thing is possible.

leptons1d ago

We're on the road to Idiocracy.

DeathArrow1d ago

I don't think the experience with Gemini will be the same when using GPT.

ieieaaa20h ago

LLM’s are the most powerful tool invented to search across a huge information space in response to human input.

That’s all they are. They don’t ‘know’ anything intrinsically and do know ‘know’ what reasoning even is.

j / k navigate · click thread line to collapse