
0 points · alecco · 13d ago

In spite of their deeper pockets, massive datacenters, colossal amounts of user data, and hundreds of thousands of top developers, even Amazon, Meta, Microsoft, and Google are well behind.

I think Evans is completely wrong. There are only 2 truly frontier models (at least for now). And Anthropic seems to be leaving OpenAI behind, so there might be only 1 in the near future (which is scary/dangerous).


30 comments · 10 top-level

embedding-shape · 13d ago · 8 in thread

> I think Evans is completely wrong. There are only 2 truly frontier models. (at least for now). And Anthropic seems to be leaving OpenAI behind so there might be only 1 in the near future. (which is scary/dangerous)

Truly fascinating ecosystem and community in general, as experiences differ so wildly. Anthropic's models seem far behind OpenAI's to me, especially when you get into "Pro" territory, and there doesn't seem to be any worthy competition to Pro Mode available at all.

And this is coming from someone who uses both platforms and spends a lot of the day interacting with agents and LLMs in various ways. The interesting part is that you probably do too, and what you share probably lines up with what you experience! Yet we come away with basically opposite takeaways :) I don't think either of us is wrong either, somehow.

haellsigh · 13d ago

I agree with what you're saying. I have a Claude plan for work and I prefer using Claude more than any other LLM I've tried. Having recently tried the Codex 100€ plan with GPT-5.5 in high/xhigh, I don't think it's worse than the Opus models, just different.

I've noticed that depending on how you talk to it, you get wildly different outputs. This seems to happen less with Opus: it mostly understands what I want. GPT is often a bit too literal.

Just my two cents.

embedding-shape · 13d ago

> I've noticed that depending on how you talk to it, you get wildly different outputs. This seems to happen less with Opus: it mostly understand what I want. GPT is often a bit too literal.

Yeah, exact prompting matters a lot, seemingly more than people think. There are definitely tradeoffs in how literally the models take the prompts. On one hand it's useful for the model to ignore its own instinct when you know better, so it doesn't go chasing geese randomly, but on the other hand it's sometimes useful when it self-directs, for example when you misworded something and it's obvious from the context that you meant something different. They're basically good at different things.

Really agree that the models aren't all equal, and they aren't as interchangeable as people seem to think unless you adjust how you prompt them.

WarmWash · 13d ago

People use a model as their daily driver, get very familiar with it and its behavior, and then go and use another model and have a hard time. It's very difficult to separate "the model is bad" from "the model works differently".

JumpCrisscross · 13d ago

> It's very difficult to separate "the model is bad" from "the model works differently"

At which point it’s fair to reject the commoditization label.

Also missing from these discussions is e.g. Qwen, which is at least as good as one generation back from OpenAI’s or Anthropic’s frontiers.


computerex · 12d ago

For HPC/AI work Opus blows GPT away; it’s no competition.

embedding-shape · 12d ago

As someone who just spent the last three days (tried using both, ended up using mostly Codex) implementing DiffusionGemma in Rust, I think they're more or less equal when it comes to machine learning and AI. They get stuck at different points, but I wouldn't say one is a clear winner over the other. For HPC I have no idea, so I'll take your word for it :)

alecco (OP) · 13d ago

When you say "Pro" territory, do you include Fable?

embedding-shape · 13d ago

You mean the model that was available for a whole three days? No, I played around with it a tiny bit, but not much more than that. I guess time will tell if it gets close.

ksec · 13d ago · 5 in thread

>I think Evans is completely wrong.

I wish there were a case where I found Evans wrong. As far as my memory serves me, I have failed to record a single one.

I disagree that Amazon, Meta, Microsoft, and Google are "well" behind. If anything the frontier model advantage seems to be at best 6 - 9 months. And the Chinese models are all doing well.

As one of Steve Jobs's lines goes, "It is a feature, not a product." Even if Apple were a generation or 1 year behind the frontier models, the advantage of being the default is enough to hold a lot of its users.

To put it simply, even if OpenAI or Anthropic were better, there is zero chance they would topple Apple in hardware sales, users, or ecosystem. On the other hand, even if Apple's AI were 6 - 9 months or a generation behind, most users would settle for it, and that would damage OpenAI / Anthropic.

ak_111 · 12d ago

Just off the top of my head (and I don't even follow his takes that closely): check his takes on Magic Leap, which he consistently promoted using quite dramatic language (along with the entire AR space), and check how it panned out.

overfeed · 12d ago

> On the other hand, even if Apple's AI were 6 - 9 months or a generation behind,

Do you mean Google's AI with Apple wrappers? Apple's in-house AI is further behind than Google's, and very far from the frontier according to your ranking. IMO, Google is on the frontier - I recall Altman calling for an OpenAI all-hands-on-deck when Gemini was released because of how good it was compared to ChatGPT. I also suspect Google has the lowest operating expenses due to scale, experience and luck/planning (TPUs). There will come a time when AI investments slow down and the cost of revenue becomes more important.

alecco (OP) · 13d ago

Even their own employees get frustrated if they can't use Claude or Codex. 6-9 months is a big difference and I think it's closer to 9 than 6. And never mind that the harness etc. are also many months behind.

geodel · 13d ago

This is just wishful thinking. I am sure someone from gossip media will also find Apple employees who are ready to leave their job if Apple disallows Claude usage.

If anything, Apple should notice that Anthropic has a really good marketing team, and it would be no shame if they picked up a trick or two from them.

throwaway98797 · 12d ago

people use outlook when gmail exists.

employees will always suffer.

tedggh · 13d ago · 4 in thread

I use both Claude and Codex and don’t see any meaningful difference between the two. My use case is modeling semi-complex physical processes (energy and manufacturing) in code for simulations. I also have to do a fair amount of automation via scripting in Python or PowerShell for manipulating data, as well as legacy code analysis (C, Fortran, COBOL). Given I provide the models with the information and documentation they need, both perform very similarly. I recently did a full codebase review (for design patterns and vulnerabilities) and both Codex and Fable agreed 100% about the most critical findings. I do very little front end development, although some of my automation scripts have TUIs, and again there is no problem with either Claude or Codex generating them for me. At this point I go with the less expensive one, which seems to be Codex. With the $100 plan I rarely hit the limits. With Claude I max out my plan in about 4-6 hours of work.

joenot443 · 13d ago

Did you find much of a difference between Fable and Opus?

thrill · 13d ago

Yes. Fable is much more organized and consistent at taking small bites of the (sorry) apple when solving a problem. Specifically I'm talking about a machine learning problem I'd been working on for a while with Opus, and it was (and is, again) constantly stating that all the signal is exploited, everything is now overfit, etc, etc, etc. The first day I pointed Fable at the situation I got a 10% improvement, thanks to it paying attention to the little details where Opus instead took slightly negative results and extrapolated them to "fully exploited". I've had to drop back, again, to forcing Opus to explain what it's looked at and the details it has quietly assumed away.

It's like talking to the two smartest kids in a class, where one really belongs a grade higher and the other hasn't yet learned to ask the questions that encourage it to dig in that little bit more for the additional multi-order effects.


tedggh · 12d ago

I have used Fable only once, to do an in-depth codebase review of a complex system. I asked it to flag deviations from a particular design and also compile a list of vulnerabilities. It took about 15-20 minutes. The result was very similar to Codex's for the most critical findings: different suggestions on how to address them, but it found exactly the same critical issues as Codex. This is still not a good test to evaluate Fable. But my feeling is that the latest models are all pretty good and now it comes down to your personal setup and workflow; that’s where you can get the productivity gains IMO. It’s like picking between macOS and Windows as a development environment. For some Windows sucks and for some it is the opposite, but both groups of people can be equally productive if they know their environments well and know how to work around their respective limitations.

hedora · 13d ago

I constantly hit safety blocks in Fable (I’m trying to write secure software, which is equivalent to finding security holes, so banned).

I didn’t use it on big enough tasks to notice any improvement.

I had been hitting plan limits pretty regularly, but fixed it by changing my workflow. That also increased the success rate of Claude by an order of magnitude.

wolttam · 13d ago · 2 in thread

I think it's highly likely that there will remain one or two companies on the very bleeding edge of AI development for the foreseeable future.

But what I think a lot of people miss is that the market for the truly bleeding edge (developing bio-tech, building the most sophisticated software stacks (probably with a tilt towards simulation, GPU kernel optimization, etc)) is not the whole market.

There's a plethora of use-cases for models that are not on the bleeding edge. If I can solve my relatively simple problems with an off-the-shelf model for a minuscule fraction of the cost of the frontier, I'm going to.

thewebguyd · 13d ago

Anecdotal case in point, but writing mostly enterprise CRUD in C#, I've gotten plenty of mileage out of Sonnet; very rarely do I need to use Opus.

It's somewhat of a myth that you need the most advanced, expensive model for software development.

johsole · 12d ago

There was a time when Opus was the only model really worth using, I think that was maybe 4.4 or 4.5, but I agree Sonnet is pretty good now and can be used quite often.

afavour · 13d ago · 1 in thread

Maybe I’m alone in thinking this but I think the long term victor will be the one that works out pricing best.

Fable might well be a better model but it’s too expensive for everyday AI use. Definitely if we’re talking about the kind of stuff you’re going to want to do on your phone. Even for coding, I’m not going to reach for Fable (well, when I can…) for 95% of the work I do.

I don’t believe a mature AI industry is going to have a one-size-fits-all single winner.

tedggh · 13d ago

Yes, and pricing is one of the features of a commodity: because users can jump back and forth between services, it becomes a pricing race to the bottom. Agree also that you don’t need the best model all the time. You could have the most powerful model draft the design, requirements, guidelines, policies or whatnot, then have the lower-tier models execute it. Then you can again have the most powerful model do the testing and review and give feedback; rinse and repeat. Just like in the real world, you don’t need an entire staff of lead engineers.

hedora · 13d ago

Remember the implicit “Pareto” in “frontier models”.

Anthropic and OpenAI are far behind state of the art for the entire curve except the “extremely expensive for barely measurable improvements” part.

GLM is probably the third most expensive frontier model (benchmarks and reviews will say for sure), and is apparently ~Opus 4.6 for 10% of the inference cost.

The last I checked, Qwen was still owning the 24-32GiB RAM range (it runs reasonably without a GPU!) and somewhere around 3.5-4 generation models.

Also, even Anthropic says Mythos ~= ChatGPT 5.5, so it’s unlikely either one is leaving the other behind. The big problem they both have is they asked for the government to gatekeep model releases and use cases, and their wish was granted.

That’s knocked them back 6 months already. Anthropic’s only frontier offering has been taken down.

jimbokun · 12d ago

Is Google behind? The general opinions I read suggest Gemini is very competitive with Anthropic and OpenAI's top models.

awongh · 12d ago

That's true now, but long-term (maybe just a few years) it doesn't seem feasible for the status quo to continue from a financial point of view.

Spend for compute seems like it needs to increase to get the next iterations of models, and even if they IPO the money might run out before they can solidify their revenue streams.

All while Google just needs to survive long enough with their good-enough models and do it without really putting themselves in any existential financial risk.

And ideally the Chinese models are also still there keeping everyone honest.

The true dystopic worst case is a Google monopoly on cutting edge AI.

bushbaba · 12d ago

I'm perfectly happy with Claude Opus 4.6. All improvements since then have not meaningfully improved my day to day. If I can get 4.6 on my laptop for 5-10k, I'd gladly start shifting my ~1k/month Anthropic spend over.

Some of the harnesses even let you run a local model for most things, and only pay for the latest frontier models when needed, which cuts down cost drastically.

nxobject · 12d ago

> And Anthropic seems to be leaving OpenAI behind so there might be only 1 in the near future.

Well, in domains like SWE where Anthropic's putting in the effort. I don't think they'll make the claims that OpenAI makes about how their models are pushing the life sciences forward, for example.
