I recently refactored our whole app from hard deletes to soft deletes. There are obviously various ways to skin this particular cat, but the way I chose needed all our deletions updated and also needed queries updating to exclude soft deleted rows, except in specific circumstances (e.g., admins restoring accidentally deleted data).
Of course, this is not hard to do manually, but it is a bloody chore and tends to be error-prone. The agent made short work of it, for which I was very grateful.
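For concreteness, the shape of the pattern is roughly this (a minimal sketch in Python/SQLAlchemy rather than our actual Rails code; the model and column names are just placeholders): deletes become updates to a nullable deleted_at column, and the normal read path filters those rows out unless you explicitly opt in.

```python
from datetime import datetime, timezone
from sqlalchemy import Column, DateTime, Integer, String, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Record(Base):
    __tablename__ = "records"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    deleted_at = Column(DateTime, nullable=True)  # NULL means the row is live

def soft_delete(session, record):
    # "Deleting" just stamps the row; nothing is removed from the table
    record.deleted_at = datetime.now(timezone.utc)
    session.commit()

def live_records():
    # Default read path: soft-deleted rows are excluded
    return select(Record).where(Record.deleted_at.is_(None))

def all_records():
    # Admin/restore path: deliberately includes soft-deleted rows
    return select(Record)
```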
You know your system better than me for sure, a random commenter on a website :-D Your comment just shocked me out of my daze enough for my brain to say "but I always move the record to another table rather than soft delete", and I felt compelled to give an unsolicited and likely wrong opinion.
More than that, though: lots of queries for reporting and the like suddenly need to use JOINs. Same for admin use cases where we want them to be able to see archived and live data in a unified view. The conclusion I came to is that it doesn't really eliminate complexity for us: it just moves it elsewhere.
Totally valid approach though. I'd also considered different views for live versus archived (or live+archived) data. Again, it solves some issues, but moves complexity elsewhere.
The other key point: it's a Ruby on Rails system so the moment you start doing funky stuff with separate tables or views, whilst it is doable, you lose a lot of the benefits of Active Record and end up having to do a lot more manual lifting. So, again, this sort of played against the alternatives.
As I say, not to diss other approaches: in a different situation I might have chosen one of them.
My conclusion - not for the first time - is that soft delete obviously adds some level of irreducible complexity to an application or system versus hard delete no matter how you do it. Whether or not that extra complexity is worth it very much depends on the application and your user/customer base.
For some people, just the ability to restore deleted rows from backup would be enough - and in other cases it's been enough for me - but that is always a bit of a faff so not a great fit if you're optimising for minimal support overhead and rapid turnaround of any issues that do arise.
It depends whether you reliably control all the DB client code, of course.
Disclaimer: I'm the founder.
I don't write code by hand any more, neither at work, nor for side projects.
I work mostly in Rust and TypeScript at a developer tools company.
It's amazing! Saves hours of work!
I create the basic Helm config, settings, etc., and when there is a conflict or something not working, I let an agent fix it!
Opus 4.6 is available on the $20 plan too
I use MiniMax daily, mostly for coding tasks, usually via pi-coding-agent.
> The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable.
I don't care about token use, I pay per request in my cheap coding plan. I didn't notice slower outputs, it's even faster than Anthropic. Degradation is there for long sessions with long contexts, but that also happens with Anthropic models.
> Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
Exactly. For my use case, I get 1500 API requests every 5 hours for 10€ monthly. I never hit the limit, even during the intensive coding sessions.
What I notice is that while Opus and Sonnet look better on synthetic benchmarks, it doesn't matter in the real world. I never put the kind of effort into a perfect problem spec that those benchmarks assume. I don't craft my prompts for hours expecting the LLM to one-shot a working program for me. But that's exactly what all those benchmarks measure, and that's where Anthropic tools shine in comparison to cheaper Chinese models.
When it comes to the real world, where I put my half-baked thoughts in broken English in a prompt and execute 20 prompts in half an hour, the difference between Opus, Sonnet, and MiniMax is minimal, if it exists at all. There, I don't want to think about costs and token savings and switching between different Anthropic models. I just use MiniMax, and that's it.
Yes, MiniMax sometimes gets stuck. Then I switch to Opus to unblock it. But the same happens if I use Opus the whole session. It gets stuck eventually, and a model switch is sometimes required to get a fresh perspective on the problem.
The only difference is, using Opus or Sonnet quickly eats up my budget, while with MiniMax I have basically unlimited usage (for my coding use case) for 10€ per month.
If I was starting new projects I'd pay for a better model, but honestly I don't really know any different.
I've never used Claude, but people seem to rave about it. Maybe it's good, but I doubt it's $200/month good.
When I hit issues with these lower-end models, I think hard about creating the right tooling, agnostic to the harness. Maybe it's more work, but I can carry those tools to any setup going forward. That's how it was in the early Linux days, so why change what clearly works?
You can "feel" the llm being limited with Gemini, less so with Claude. Hopefully even less so with chatgpt
Not as good as Opus, but substantially cheaper!
It's also worth noting that the approach given in the link also benefits Sonnet and Opus. Not just as much - they are more forgiving - but put it in a harness that allows for various verification and repair and they too end up producing much better results than the "raw" model. And it's not clear that a harness around MiniMax, Kimi, or Qwen can measure up then.
I use those models a lot, and hope to use them more as my harnesses get better at discriminating which tasks they are cost effective for, but it's not straightforward to cost optimize this.
If I cared about running everything locally, then sure, it's amazing you can get to those kinds of results at all.
It does have some kind of horrible context consistency problem though, if you ask it to rewrite something verbatim it'll inject tiny random changes everywhere and potentially break it. That's something that other SOTA models haven't done for at least two years now and is a real problem. I can't trust it to do a full rewrite, just diffs.
I doubt Kimi would do well with most harnesses; its outputs are pretty chaotic in terms of formatting, but the intelligence is definitely there.
https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...
Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)
I'm not saying it's bad, but it's definitely different than the others.
I badly want to shift more of my work to them, and I'm finding ways of shifting more lower-level loads to them regularly, but they're really not there yet for anything complex.
I think GLM got a bit in front, because on some tests that both got wrong, GLM did sometimes (inconsistently) respond with the correct answer.
That being said, yes, in this case, with more and more tests added, gpt-5.4 would probably edge in front, especially if coding tests were added (there are none yet).
The problem with doing this is cost. Constantly testing a lot of models on a large dataset can get really expensive.
They are the equivalent of frontier models from 8+ months ago.
They're all slop when the complexity goes beyond what an intermediate engineer at a mid-tier tech company could handle, though.
This right here. Value prop quickly goes out the window when you're building anything novel or hard. I feel that I'm still spending the same amount of time working on stuff, except that now I'm also spending money on models.
We've gone from doing the first 90% and then the second 90% to the first 90% and then the second 990%. It's exhausting.
> DeepSeek V3.2 Reasoning | 86.2% | ~$0.002 | API, single-shot
> ATLAS V3 (pass@1-v(k=3)) | 74.6% | ~$0.004 | Local electricity only, best-of-3 + repair pipeline
1) That is comparatively very slow.
2) It can also be done, even more simply, with SoTA models over an API.
Being reliant on a service means you have to share whatever you're working on with the service, and the service provider decides what you can do and can change their terms of service on a whim.
If locally running models can get to the point where they can be used as a daily driver, that solves the problem.
(And ideally you'd probably test first, or at least try to feed compiler errors back etc?)
Overall, I mostly agree.
Can you explain what that means?
Local model enthusiasts often assume that running locally is more energy efficient than running in a data center, but fail to take the economies of scale into account.
It is a well-known 101-level truism on /r/LocalLLaMA that local is rarely cheaper, unless you run batched - then it is indeed massively (around 10x) cheaper.
> I think they mean that the DeepSeek API charges are less than it would cost for the electricity to run a local model.
Because it is hosted in China, where energy is cheap. In the ex-USSR, where I live, it is inexpensive too, and considering that all winter I had to run a small space heater due to the inadequacy of my central heating, using local came out 100% free.
Note that while a local chatbot user will mostly be using batch size = 1, that's not going to be true if they are running an agentic framework, so the gap is going to narrow or even reverse.
Cool work though, really excited for the potential of slimming down models.
But this can be pretty slow since you have to run the code in an isolated environment, check the outputs, wait for it to finish. Doing that for every candidate quickly adds up. So ATLAS has another shortcut for avoiding unnecessary testing. Instead of simply generating solutions and testing all of them, it tries to predict which one is most likely correct before running any tests.
ATLAS also asks the model for an embedding of what it just wrote which acts as a fingerprint. Two similar pieces of code will produce similar fingerprints. A well-written, confident solution will produce a different fingerprint than a confused, buggy one.
These fingerprints get fed into a separate, much smaller neural network called the Cost Field. This little network was trained ahead of time on examples where they already knew which solutions were correct and which were wrong. It learned to assign a score to each fingerprint. Correct solutions get a low score and incorrect ones get a high one.
So the process is to generate multiple solutions, get their fingerprints, score each one, and pick the lowest. Only that one gets tested. The Cost Field picks correctly about 88% of the time according to the repo.
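In rough pseudocode (the function and class names here are mine, not the repo's), the selection loop described above looks something like this:

```python
def solve(problem, model, cost_field, k=3):
    # 1. Sample several candidate solutions from the frozen 14B model
    candidates = [model.generate(problem) for _ in range(k)]

    # 2. Ask the same model for an embedding of each candidate ("fingerprint")
    fingerprints = [model.embed(code) for code in candidates]

    # 3. Score each fingerprint with the small pre-trained Cost Field network
    #    (low score = likely correct, high score = likely buggy)
    scores = [cost_field.score(fp) for fp in fingerprints]

    # 4. Only the best-scoring candidate gets run in the sandbox
    best = candidates[scores.index(min(scores))]
    result = run_tests_in_sandbox(problem, best)

    # 5. If it still fails, fall back to the iterative repair loop
    return best if result.passed else repair(problem, best, result)
```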
If the author actually wanted to explain his project he should have started with something along the lines of "Inference-time learning is the act of updating model parameters while you are generating tokens. Inference time learning is cost prohibitive for LLMs due to the need to update billions of parameters. However, what if updating billions of parameters wasn't necessary to begin with? What if you could instead have a much smaller model that merely scores a bunch of candidate output tokens? That model could be small enough for inference time learning to become viable and that's exactly what ATLAS does to achieve a 74.6% pass rate in LiveCodeBench and thereby outperforms Claude Sonnet with a small 14B open weight model that can be run locally on your $500 GPU."
This would have primed the reader to know what to look for. Instead you got this insurmountable wall of distractions.
Example: "combining constraint-driven generation, energy-based verification, self-verified iterative refinement, and adaptive routing"
That's a very long sequence of unexplained buzzwords that could mean absolutely anything.
Another interesting approach could be to use this setup with a language like Clojure or Common Lisp, which facilitates interactive development. If you could hook the agent up directly to a REPL in a running program, it could run tests with a lot less overhead.
Depends how you define efficiency. The power use of this rig is a lot less than the large data centers that serve trillion parameter models. The page suggests that the final dollar cost per request is an order of magnitude lower than the frontier models charge.
Perhaps these things aren't well represented in the training data for these open models? Every local model I've tried (MiniMax 2.5, GLM-4.7, Qwen3, 3.5, and the -coder variants) spends so much time trying to get something syntactically sensible and accepted by the compiler that, by the time it's finished, it barely seems to have any "momentum" left to actually solve the problem: pretty much anything but the most trivial change ends up in another loop of trying to get it working again, often losing the intent of that change in the process.
My fear is that the solution here, having multiple instances all making the same changes for later comparison, would spend a huge amount of time beating its head against compiler errors, types, and memory allocation (NO DON'T JUST SPRINKLE IN A FEW MORE RAW "new" KEYWORDS DAMMIT) before it even gets to the "logic".
Having plenty of local GPU power, I'd love to be able to actually use it, and I'm already wary about some of the training-data use and its interactions with the license of the code I'm "sending" to the cloud models...
I know from first-hand experience that at least a couple of the SOTA providers use third-party providers for supervised finetuning with instructions that are heavily geared towards a specific set of languages as well. But of course the base dataset from the major providers is likely to be sufficiently better that it matters less, and the big models are good enough at generalizing that extra training on the core languages they care about seems to at least somewhat carry over (you see this with natural language too: they do really well for many minor languages that make up a minuscule proportion of the training data).
(I won't say much more regarding the SFT/RLHF work due to NDAs - plural; I know who one of the providers is; I don't know who the one or more others are as the intermediary I did some work for obscured it well enough that I couldn't really violate the NDA even if I wanted to)
I think the one-size-fits-all notion of a model, treated a bit like a sports car where you just get the biggest/fastest/best one, is overkill; you use bigger models when needed. But they use a lot of resources and cost you a lot. A lot of AI work isn't solving important math or algorithm problems, or leetcode exercises. Most AI work is mundane plumbing: summarizing, a bit of light scripting/programming, tool calling, etc. With skills and guard rails, you actually want agents to follow those rather than get too creative. And you want them to work relatively quickly and not overthink things. Latency is important. You can actually use guard rails to decide when to escalate to bigger models and when not to.
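The escalation logic doesn't have to be clever, either; something like this hypothetical sketch (the model names and the guard-rail check are placeholders, not any particular framework) already captures most of the value of routing:

```python
def run_task(task, small_model, big_model, max_attempts=2):
    # Try the cheap, fast model first; latency matters for plumbing work
    for _ in range(max_attempts):
        result = small_model.run(task)
        if passes_guard_rails(result):  # schema checks, tests, lint, etc.
            return result
    # Escalate to the big, expensive model only when guard rails keep failing
    return big_model.run(task)
```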
I hope you are not going to say, "to avoid a global recession or depression caused by the popping of the AI bubble". That would be unnecessary and harmful (in its second-order effects), and governments do have advisors who are competent enough in economics to advise against such a move.
0: Because the only way to get cache locality out of an LLM is to batch invocations. A centralized system where the server handles thousands of invocations at the same time needs only a tiny fraction of the total memory throughput that running all of those invocations locally on different machines would.
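Back-of-envelope version of that argument (the numbers are illustrative assumptions, not measurements): every decode step has to stream the model weights from memory once, and a batch shares that single read.

```python
# Rough memory-bandwidth arithmetic for token generation (illustrative only)
weights_gb = 70           # e.g. a ~70B-parameter model at ~8 bits per weight
batch_size = 128          # requests decoded together on a shared server

gb_per_token_local  = weights_gb / 1           # batch of one at home
gb_per_token_server = weights_gb / batch_size  # the weight read is amortized

print(gb_per_token_local)   # 70.0 GB of weight reads per generated token
print(gb_per_token_server)  # ~0.55 GB per token per request
```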
This will crush OpenAI.
Note: I am not talking about coding here. It will take a while longer, but once it is optimized to the bone and LLM output has stabilized, you will be running that on local hardware too. Costs will come down for Claude and friends as well, but why pay 5 when you can have it for free?
In this theory, can you explain why Apple has announced it’s paying Google for Gemini too?
Eventually, this may be true. This autumn? Highly unlikely.
I don't get the financial motive for someone to keep funding these open-weight model training programs other than just purposefully trying to kill the big AI providers.
I, too, was interested because I am always eager to use local models in my claw-like. It looks like this could be useful for an async portion of the harness but it wouldn’t work in interactive contexts.
Very cool ensemble of techniques, particularly because they’re so accessible. I think I will use this form for reusable portions of web browsing functionality in my personal agent.
There seems to be at least some detail on that point.
> V3 phases were designed and tuned for LiveCodeBench.
It has only been compared on the above benchmark, though this limitation has been identified and is being improved for the next version.
curious to see how it compares across the board against the base model (Qwen3-14B-Q4_K_M)
Edit: The 8GB seems to hit this price, but the 16GB not so much.
One expensive and hard lesson we will learn over time is that you can't compress generality beyond a point.
> ATLAS scores are from 599 LCB tasks using the full V3 pipeline (best-of-3 + Lens selection + iterative repair) on a frozen 14B quantized model or "pass@k-v(k=3)". Competitor scores are single-shot pass@1 (zero-shot, temperature 0) from Artificial Analysis on 315 LCB problems -- not the same task set, so this is not a controlled head-to-head.
Instead of following the LiveCodeBench methodology, it's a harness that spins up a sandbox and spends a long time testing and refining the solution. If you did the same for Sonnet, GPT5.4, or other models they would also get significantly higher scores and they'd do it faster.
The AI-coded README is also full of signs of vibe-coded slop, like the discovery that some of the complex structures implemented were not actually being used or contributing anything to the output.
It looks like your card has 16GB VRAM? Start with Qwen 3.5 9B Unsloth GGUFs (UD-Q6_K_XL) and branch out from there.
I can't imagine trying to use this model on either GPU for real work. I can use much bigger and faster models on the $3 Chutes subscription or $10 OpenCode Go subscription.
Even so, I am still excited. I don't feel like there was even a model worth using with a tool like OpenCode 6 to 9 months ago. I like the way things are heading, and I am looking forward to seeing how capable coding models of this size are in another 6 to 9 months!
That doesn’t mean the 9070XT can’t do AI stuff, quite the opposite. ROCm gets better all the time. There are many AI workloads you can do on AMD cards.
Is it a card I would choose if I was primarily working on AI? Absolutely not. But it is the card I own and it’s been a great value for gaming.
It’s absurd I have to use open source programs to get INT8 FSR4 support.
I have a 9070 XT, which has 16GB of VRAM. My understanding from reading around a bunch of forums is that the smallest quant you want to go with is Q4. Below that, the compression starts hurting the results quite a lot, especially for agentic coding. The model might eventually start missing brackets, quotes, etc.
I tried various AI + VRAM calculators, but nothing was as on point as Hugging Face's built-in functionality. You simply sign up and configure in the settings [1] which GPU you have, so that when you visit a model page, you immediately see which of the quants fits in your card.
Of the open-source models out there, Qwen3.5 is the best right now. Unsloth produces nice quants for it and even provides guidelines [2] on how to run them locally.
The 6-bit version of Qwen3.5 9B would fit nicely in your 6700 XT, but at 9B parameters, it probably isn't as smart as you would expect it to be.
Which model have you tried locally? Also, out of curiosity, what is your host configuration?
[1]: https://huggingface.co/settings/local-apps [2]: https://unsloth.ai/docs/models/qwen3.5
Either that, or just load up Qwen3.5-35B-A3B-Q4_K_S. I'm serving it at about 40-50 t/s on a 4070 RTX Super 12GB + 64GB of RAM. The weights are 20.7GB + KV cache (which should be lowered soon with the upcoming addition of TurboQuant).
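If you want a quick sanity check before downloading, a rule of thumb is size in GB ≈ parameters (in billions) × effective bits per weight / 8; the effective bpw for the K-quants sits a bit above the nominal figure because some tensors are kept at higher precision (the bpw values below are my rough assumptions):

```python
def quant_size_gb(params_billions, bits_per_weight):
    # Rough GGUF weight-size / VRAM estimate, ignoring KV cache and runtime overhead
    return params_billions * bits_per_weight / 8

print(quant_size_gb(35, 4.7))  # ~20.6 GB, close to the 20.7 GB Q4_K_S figure above
print(quant_size_gb(9, 6.5))   # ~7.3 GB for the 6-bit 9B quant mentioned earlier
```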
The reasoning is great in Opus, unbeatable at the moment.
I understand what you mean; it becomes disappointing on more niche or specific work. It's honestly a good thing to see that these models are not really intelligent yet.
I use it for reviewing existing code, specifically for a components-based framework for Godot/GDScript at [0]. You can view the AGENTS.md and see that it's a relatively simple enough project: Just for 2D games and fairly modular so the AI can look at each file/class individually and have to cross-reference maybe 1-3 dependencies/dependents at most at any time during a single pass.
I've been using Codex, and it's helped me catch a lot of bugs that would have taken a long time on my own to even notice at all. Most of my productivity and the commits from the past couple months are thanks to that.
Claude on the other hand, oh man… It just wastes my time. It's had way more gaffes than Codex, on the exact same code and prompts.
I still get really mad at AI sometimes and I am not sure whether I could use AI for coding full time.
(Codex broke my git a few days ago.)
I use Codex regularly and Claude is shit in comparison, from its constant "Oops you're right!!" backtracking to its crap Electron app (if their AI is so good why can't they make a fucking native app for each OS?)
Hell right freakin now I asked it to implement something and got a weird "Something went wrong" API error
Maybe you're too easily frustrated. Or your existing code reads like your comments.