I recently refactored our whole app from hard deletes to soft deletes. There are obviously various ways to skin this particular cat, but the way I chose needed all our deletions updated and also needed queries updating to exclude soft deleted rows, except in specific circumstances (e.g., admins restoring accidentally deleted data).
Of course, this is not hard to do manually, but it is a bloody chore and tends to be error-prone. The agent made short work of it, for which I was very grateful.
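For concreteness, the shape of the pattern is roughly this (a minimal sketch in Python/SQLAlchemy rather than our actual Rails code; the model and column names are just placeholders): deletes become updates to a nullable deleted_at column, and the normal read path filters those rows out unless you explicitly opt in.

```python
from datetime import datetime, timezone
from sqlalchemy import Column, DateTime, Integer, String, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Record(Base):
    __tablename__ = "records"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    deleted_at = Column(DateTime, nullable=True)  # NULL means the row is live

def soft_delete(session, record):
    # "Deleting" just stamps the row; nothing is removed from the table
    record.deleted_at = datetime.now(timezone.utc)
    session.commit()

def live_records():
    # Default read path: soft-deleted rows are excluded
    return select(Record).where(Record.deleted_at.is_(None))

def all_records():
    # Admin/restore path: deliberately includes soft-deleted rows
    return select(Record)
```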
You know your system better than me for sure, a random commenter on a website :-D Your comment just shocked me out of my daze enough for my brain to say "but I always move the record to another table rather than soft delete", and I felt compelled to give an unsolicited and likely wrong opinion.
More than that, though: lots of queries for reporting and the like suddenly need to use JOINs. Same for admin use cases where we want them to be able to see archived and live data in a unified view. The conclusion I came to is that it doesn't really eliminate complexity for us: it just moves it elsewhere.
Totally valid approach though. I'd also considered different views for live versus archived (or live+archived) data. Again, it solves some issues, but moves complexity elsewhere.
The other key point: it's a Ruby on Rails system so the moment you start doing funky stuff with separate tables or views, whilst it is doable, you lose a lot of the benefits of Active Record and end up having to do a lot more manual lifting. So, again, this sort of played against the alternatives.
As I say, not to diss other approaches: in a different situation I might have chosen one of them.
My conclusion - not for the first time - is that soft delete obviously adds some level of irreducible complexity to an application or system versus hard delete no matter how you do it. Whether or not that extra complexity is worth it very much depends on the application and your user/customer base.
For some people, just the ability to restore deleted rows from backup would be enough - and in other cases it's been enough for me - but that is always a bit of a faff so not a great fit if you're optimising for minimal support overhead and rapid turnaround of any issues that do arise.
It depends whether you reliably control all the DB client code, of course.
Disclaimer: I'm the founder.
I don't write code by hand any more, neither at work, nor for side projects.
I work mostly in Rust and TypeScript at a developer tools company.
It's amazing! Saves hours of work!
I create the basic Helm config, settings, etc., and when there is a conflict or something not working, I let an agent fix it!
Opus 4.6 is available on the $20 plan too
I use MiniMax daily, mostly for coding tasks, usually via pi-coding-agent.
> The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable.
I don't care about token use, I pay per request in my cheap coding plan. I didn't notice slower outputs, it's even faster than Anthropic. Degradation is there for long sessions with long contexts, but that also happens with Anthropic models.
> Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
Exactly. For my use case, I get 1500 API requests every 5 hours for 10€ monthly. I never hit the limit, even during the intensive coding sessions.
What I notice is that while Opus and Sonnet look better on synthetic benchmarks, it doesn't matter in the real world. I never put the kind of effort into a perfect problem spec that those benchmarks assume. I don't craft my prompts for hours expecting the LLM to one-shot a working program for me. But that's exactly what all those benchmarks measure, and that's where Anthropic tools shine in comparison to cheaper Chinese models.
When it comes to the real world, where I put my half-baked thoughts in broken English in a prompt and execute 20 prompts in half an hour, the difference between Opus, Sonnet, and MiniMax is minimal, if it exists at all. There, I don't want to think about costs and token savings and switching between different Anthropic models. I just use MiniMax, and that's it.
Yes, MiniMax sometimes gets stuck. Then I switch to Opus to unblock it. But the same happens if I use Opus the whole session. It gets stuck eventually, and a model switch is sometimes required to get a fresh perspective on the problem.
The only difference is, using Opus or Sonnet quickly eats up my budget, while with MiniMax I have basically unlimited usage (for my coding use case) for 10€ per month.
If I was starting new projects I'd pay for a better model, but honestly I don't really know any different.
I've never used Claude, but people seem to rave about it. Maybe it's good, but I doubt it's $200/month good.
When I hit issues with these lower-end models, I think hard about creating the right tooling, agnostic to the harness. Maybe it's more work, but I can carry those tools to any setup going forward. That's how it was in the early Linux days, so why change what clearly works?
You can "feel" the llm being limited with Gemini, less so with Claude. Hopefully even less so with chatgpt
Not as good as Opus, but substantially cheaper!
It's also worth noting that the approach given in the link also benefits Sonnet and Opus. Not just as much - they are more forgiving - but put it in a harness that allows for various verification and repair and they too end up producing much better results than the "raw" model. And it's not clear that a harness around MiniMax, Kimi, or Qwen can measure up then.
I use those models a lot, and hope to use them more as my harnesses get better at discriminating which tasks they are cost effective for, but it's not straightforward to cost optimize this.
If I cared about running everything locally, then sure, it's amazing you can get to those kinds of results at all.
It does have some kind of horrible context consistency problem though, if you ask it to rewrite something verbatim it'll inject tiny random changes everywhere and potentially break it. That's something that other SOTA models haven't done for at least two years now and is a real problem. I can't trust it to do a full rewrite, just diffs.
I doubt Kimi would do well with most harnesses; its outputs are pretty chaotic in terms of formatting, but the intelligence is definitely there.
https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...
Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)
I'm not saying it's bad, but it's definitely different than the others.
I badly want to shift more of my work to them, and I'm finding ways of shifting more lower-level loads to them regularly, but they're really not there yet for anything complex.
I think GLM got a bit in front, because on some tests that both got wrong, GLM did sometimes (inconsistently) respond with the correct answer.
That being said, yes, in this case, with more and more tests added, gpt-5.4 would probably edge in front, especially if coding tests were added (there are none yet).
The problem with doing this is cost. Constantly testing a lot of models on a large dataset can get really expensive.
They are the equivalent of frontier models from 8+ months ago.
They're all slop when the complexity goes beyond what an intermediate engineer at a mid-tier tech company could handle, though.
This right here. Value prop quickly goes out the window when you're building anything novel or hard. I feel that I'm still spending the same amount of time working on stuff, except that now I'm also spending money on models.
We've gone from doing the first 90% and then the second 90% to the first 90% and then the second 990%. It's exhausting.
> DeepSeek V3.2 Reasoning | 86.2% | ~$0.002 | API, single-shot
> ATLAS V3 (pass@1-v(k=3)) | 74.6% | ~$0.004 | Local electricity only, best-of-3 + repair pipeline
1) That is comparatively very slow.
2) It can also be done, even more simply, with SoTA models over an API.
Being reliant on a service means you have to share whatever you're working on with the service, and the service provider decides what you can do and can change their terms of service on a whim.
If locally running models can get to the point where they can be used as a daily driver, that solves the problem.
(And ideally you'd probably test first, or at least try to feed compiler errors back etc?)
Overall, I mostly agree.
Can you explain what that means?
Local model enthusiasts often assume that running locally is more energy efficient than running in a data center, but fail to take the economies of scale into account.
It is a well-known 101-level truism on /r/LocalLLaMA that local is rarely cheaper, unless you run batched - then it is indeed massively (around 10x) cheaper.
> I think they mean that the DeepSeek API charges are less than it would cost for the electricity to run a local model.
Because it is hosted in China, where energy is cheap. In the ex-USSR, where I live, it is inexpensive too, and considering that all winter I had to run a small space heater due to the inadequacy of my central heating, using local came out 100% free.
Note that while a local chatbot user will mostly be using batch size = 1, that's not going to be true if they are running an agentic framework, so the gap is going to narrow or even reverse.
Cool work though, really excited for the potential of slimming down models.
But this can be pretty slow since you have to run the code in an isolated environment, check the outputs, wait for it to finish. Doing that for every candidate quickly adds up. So ATLAS has another shortcut for avoiding unnecessary testing. Instead of simply generating solutions and testing all of them, it tries to predict which one is most likely correct before running any tests.
ATLAS also asks the model for an embedding of what it just wrote which acts as a fingerprint. Two similar pieces of code will produce similar fingerprints. A well-written, confident solution will produce a different fingerprint than a confused, buggy one.
These fingerprints get fed into a separate, much smaller neural network called the Cost Field. This little network was trained ahead of time on examples where they already knew which solutions were correct and which were wrong. It learned to assign a score to each fingerprint. Correct solutions get a low score and incorrect ones get a high one.
So the process is to generate multiple solutions, get their fingerprints, score each one, and pick the lowest. Only that one gets tested. The Cost Field picks correctly about 88% of the time according to the repo.
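In rough pseudocode (the function and class names here are mine, not the repo's), the selection loop described above looks something like this:

```python
def solve(problem, model, cost_field, k=3):
    # 1. Sample several candidate solutions from the frozen 14B model
    candidates = [model.generate(problem) for _ in range(k)]

    # 2. Ask the same model for an embedding of each candidate ("fingerprint")
    fingerprints = [model.embed(code) for code in candidates]

    # 3. Score each fingerprint with the small pre-trained Cost Field network
    #    (low score = likely correct, high score = likely buggy)
    scores = [cost_field.score(fp) for fp in fingerprints]

    # 4. Only the best-scoring candidate gets run in the sandbox
    best = candidates[scores.index(min(scores))]
    result = run_tests_in_sandbox(problem, best)

    # 5. If it still fails, fall back to the iterative repair loop
    return best if result.passed else repair(problem, best, result)
```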
If the author actually wanted to explain his project he should have started with something along the lines of "Inference-time learning is the act of updating model parameters while you are generating tokens. Inference time learning is cost prohibitive for LLMs due to the need to update billions of parameters. However, what if updating billions of parameters wasn't necessary to begin with? What if you could instead have a much smaller model that merely scores a bunch of candidate output tokens? That model could be small enough for inference time learning to become viable and that's exactly what ATLAS does to achieve a 74.6% pass rate in LiveCodeBench and thereby outperforms Claude Sonnet with a small 14B open weight model that can be run locally on your $500 GPU."
This would have primed the reader to know what to look for. Instead you got this insurmountable wall of distractions.
Example: "combining constraint-driven generation, energy-based verification, self-verified iterative refinement, and adaptive routing"
That's a very long sequence of unexplained buzzwords that could mean absolutely anything.
Another interesting approach could be to use this setup with a language like Clojure or Common Lisp, which facilitates interactive development. If you could hook the agent up directly to a REPL in a running program, it could run tests with a lot less overhead.
Depends how you define efficiency. The power use of this rig is a lot less than the large data centers that serve trillion parameter models. The page suggests that the final dollar cost per request is an order of magnitude lower than the frontier models charge.
Perhaps these things aren't well represented in the training data for these open models? Every local model I've tried (MiniMax 2.5, GLM-4.7, Qwen3, 3.5, and the -coder variants) spends so much time trying to get something syntactically sensible and accepted by the compiler that, by the time it's finished, it barely seems to have any "momentum" left to actually solve the problem: pretty much anything but the most trivial change ends up in another loop of trying to get it working again, often losing the intent of that change in the process.
My fear is that the solution here, having multiple instances all making the same changes for later comparison, would spend a huge amount of time beating its head against compiler errors, types, and memory allocation (NO DON'T JUST SPRINKLE IN A FEW MORE RAW "new" KEYWORDS DAMMIT) before it even gets to the "logic".
Having plenty of local GPU power, I'd love to be able to actually use it, and I'm already wary about some of the training-data use and its interactions with the license of the code I'm "sending" to the cloud models...
I know from first-hand experience that at least a couple of the SOTA providers use third-party providers for supervised finetuning with instructions that are heavily geared towards a specific set of languages as well. But of course the base dataset from the major providers is likely to be sufficiently better that it matters less, and the big models are good enough at generalizing that extra training on the core languages they care about seems to at least somewhat carry over (you see this with natural language too: they do really well for many minor languages that make up a minuscule proportion of the training data).
(I won't say much more regarding the SFT/RLHF work due to NDAs - plural; I know who one of the providers is; I don't know who the one or more others are as the intermediary I did some work for obscured it well enough that I couldn't really violate the NDA even if I wanted to)
I think the one-size-fits-all notion of a model, treated a bit like a sports car where you just get the biggest/fastest/best one, is overkill; you use bigger models when needed. But they use a lot of resources and cost you a lot. A lot of AI work isn't solving important math or algorithm problems, or leetcode exercises. Most AI work is mundane plumbing: summarizing, a bit of light scripting/programming, tool calling, etc. With skills and guard rails, you actually want agents to follow those rather than get too creative. And you want them to work relatively quickly and not overthink things. Latency is important. You can actually use guard rails to decide when to escalate to bigger models and when not to.
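The escalation logic doesn't have to be clever, either; something like this hypothetical sketch (the model names and the guard-rail check are placeholders, not any particular framework) already captures most of the value of routing:

```python
def run_task(task, small_model, big_model, max_attempts=2):
    # Try the cheap, fast model first; latency matters for plumbing work
    for _ in range(max_attempts):
        result = small_model.run(task)
        if passes_guard_rails(result):  # schema checks, tests, lint, etc.
            return result
    # Escalate to the big, expensive model only when guard rails keep failing
    return big_model.run(task)
```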
I hope you are not going to say, "to avoid a global recession or depression caused by the popping of the AI bubble". That would be unnecessary and harmful (in its second-order effects), and governments do have advisors who are competent enough in economics to advise against such a move.
0: Because the only way to get cache locality out of an LLM is to batch invocations. A centralized system where the server handles thousands of invocations at the same time needs only a tiny fraction of the total memory throughput that running all of those invocations locally on different machines would.
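Back-of-envelope version of that argument (the numbers are illustrative assumptions, not measurements): every decode step has to stream the model weights from memory once, and a batch shares that single read.

```python
# Rough memory-bandwidth arithmetic for token generation (illustrative only)
weights_gb = 70           # e.g. a ~70B-parameter model at ~8 bits per weight
batch_size = 128          # requests decoded together on a shared server

gb_per_token_local  = weights_gb / 1           # batch of one at home
gb_per_token_server = weights_gb / batch_size  # the weight read is amortized

print(gb_per_token_local)   # 70.0 GB of weight reads per generated token
print(gb_per_token_server)  # ~0.55 GB per token per request
```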
This will crush OpenAI.
Note: I am not talking about coding here. It will take a while longer, but once it is optimized to the bone and LLM output has stabilized, you will be running that on local hardware too. Costs will come down for Claude and friends as well, but why pay 5 when you can have it for free?
In this theory, can you explain why Apple has announced it’s paying Google for Gemini too?
Eventually, this may be true. This autumn? Highly unlikely.
I don't get the financial motive for someone to keep funding these open-weight model training programs other than just purposefully trying to kill the big AI providers.
I, too, was interested because I am always eager to use local models in my claw-like. It looks like this could be useful for an async portion of the harness but it wouldn’t work in interactive contexts.
Very cool ensemble of techniques, particularly because they’re so accessible. I think I will use this form for reusable portions of web browsing functionality in my personal agent.
There seems to be at least some detail on that point.
> V3 phases were designed and tuned for LiveCodeBench.
It has only been compared on the above benchmark, though this limitation has been identified and is being improved for the next version.
curious to see how it compares across the board against the base model (Qwen3-14B-Q4_K_M)
Edit: The 8GB seems to hit this price, but the 16GB not so much.
One expensive and hard lesson we will learn over time is that you can't compress generality beyond a point.
> ATLAS scores are from 599 LCB tasks using the full V3 pipeline (best-of-3 + Lens selection + iterative repair) on a frozen 14B quantized model or "pass@k-v(k=3)". Competitor scores are single-shot pass@1 (zero-shot, temperature 0) from Artificial Analysis on 315 LCB problems -- not the same task set, so this is not a controlled head-to-head.
Instead of following the LiveCodeBench methodology, it's a harness that spins up a sandbox and spends a long time testing and refining the solution. If you did the same for Sonnet, GPT5.4, or other models they would also get significantly higher scores and they'd do it faster.
The AI-coded README is also full of signs of vibe-coded slop, like the discovery that some of the complex structures implemented were not actually being used or contributing anything to the output.
It looks like your card has 16GB VRAM? Start with Qwen 3.5 9B Unsloth GGUFs (UD-Q6_K_XL) and branch out from there.
I can't imagine trying to use this model on either GPU for real work. I can use much bigger and faster models on the $3 Chutes subscription or $10 OpenCode Go subscription.
Even so, I am still excited. I don't feel like there was even a model worth using with a tool like OpenCode 6 to 9 months ago. I like the way things are heading, and I am looking forward to seeing how capable coding models of this size are in another 6 to 9 months!
That doesn’t mean the 9070XT can’t do AI stuff, quite the opposite. ROCm gets better all the time. There are many AI workloads you can do on AMD cards.
Is it a card I would choose if I was primarily working on AI? Absolutely not. But it is the card I own and it’s been a great value for gaming.
It’s absurd I have to use open source programs to get INT8 FSR4 support.
I have a 9070 XT, which has 16GB of VRAM. My understanding from reading around a bunch of forums is that the smallest quant you want to go with is Q4. Below that, the compression starts hurting the results quite a lot, especially for agentic coding. The model might eventually start missing brackets, quotes, etc.
I tried various AI + VRAM calculators, but nothing was as on point as Hugging Face's built-in functionality. You simply sign up and configure in the settings [1] which GPU you have, so that when you visit a model page, you immediately see which of the quants fits in your card.
Of the open-source models out there, Qwen3.5 is the best right now. Unsloth produces nice quants for it and even provides guidelines [2] on how to run them locally.
The 6-bit version of Qwen3.5 9B would fit nicely in your 6700 XT, but at 9B parameters, it probably isn't as smart as you would expect it to be.
Which model have you tried locally? Also, out of curiosity, what is your host configuration?
[1]: https://huggingface.co/settings/local-apps [2]: https://unsloth.ai/docs/models/qwen3.5
Either that, or just load up Qwen3.5-35B-A3B-Q4_K_S. I'm serving it at about 40-50 t/s on a 4070 RTX Super 12GB + 64GB of RAM. The weights are 20.7GB + KV cache (which should be lowered soon with the upcoming addition of TurboQuant).
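If you want a quick sanity check before downloading, a rule of thumb is size in GB ≈ parameters (in billions) × effective bits per weight / 8; the effective bpw for the K-quants sits a bit above the nominal figure because some tensors are kept at higher precision (the bpw values below are my rough assumptions):

```python
def quant_size_gb(params_billions, bits_per_weight):
    # Rough GGUF weight-size / VRAM estimate, ignoring KV cache and runtime overhead
    return params_billions * bits_per_weight / 8

print(quant_size_gb(35, 4.7))  # ~20.6 GB, close to the 20.7 GB Q4_K_S figure above
print(quant_size_gb(9, 6.5))   # ~7.3 GB for the 6-bit 9B quant mentioned earlier
```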
The reasoning is great in Opus, unbeatable at the moment.
I understand what you mean; it becomes disappointing on more niche or specific work. It's honestly a good thing to see that these models are not really intelligent yet.
I use it for reviewing existing code, specifically for a components-based framework for Godot/GDScript at [0]. You can view the AGENTS.md and see that it's a relatively simple enough project: Just for 2D games and fairly modular so the AI can look at each file/class individually and have to cross-reference maybe 1-3 dependencies/dependents at most at any time during a single pass.
I've been using Codex, and it's helped me catch a lot of bugs that would have taken a long time on my own to even notice at all. Most of my productivity and the commits from the past couple months are thanks to that.
Claude on the other hand, oh man… It just wastes my time. It's had way more gaffes than Codex, on the exact same code and prompts.
I still get really mad at AI sometimes and I am not sure whether I could use AI for coding full time.
(Codex broke my git a few days ago.)
I use Codex regularly and Claude is shit in comparison, from its constant "Oops you're right!!" backtracking to its crap Electron app (if their AI is so good why can't they make a fucking native app for each OS?)
Hell right freakin now I asked it to implement something and got a weird "Something went wrong" API error
Maybe you're too easily frustrated. Or your existing code reads like your comments.