I mean from a financial and sustainability standpoint, assuming they’re equally powerful as their proprietary counterparts.
I guess I’m trying to understand the economics of it.
However, I would highly suggest more people experiment with these smaller models. They are incredibly capable in many ways that many people don't realize.
The perceived capabilities of the larger models are also much less a result of the model having more parameters or training cycles than of their being run through well-made harnesses, something the open-source community is rapidly approaching with near-peer solutions of its own.
In short, much of the gap between open-weight models and the larger proprietary models can be considered more an issue of perception than of capability. There is a fundamental gap economically, but not so much in capability. The open source community is rapidly closing the gap on these larger labs, especially thanks to the amazing research being given away freely by well-funded Chinese labs.
I wonder if open source / open weight models will reach the point where we can run them locally on our mobile devices (for free), even if they're slightly inferior to the proprietary pay-as-you-use online models.
I know very little about this stuff. My inner optimist kinda hopes that the tech will continue to advance and become increasingly commoditised, to the point that open source locally run models are as good as the advanced proprietary models of a year ago. So that even if the open source models lag the proprietary models, they're still pretty great. Perhaps we're already there but I wouldn't know.
Anyhow thank you for the insights :-)
You can drastically reduce the requirements by running models at a lower precision (fewer bits per weight, i.e. quantization), which somewhat reduces accuracy but not that much - think of the difference between an MP3 and uncompressed audio. With this and other tricks, you can get high-end models down to a size where they can be run on a high-spec desktop workstation affordable by an individual or small business.
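To make the MP3 analogy a bit more concrete, here is a toy sketch of what dropping to 4 bits does to a handful of weights. It assumes plain symmetric round-to-nearest quantization with a single scale, which is a simplification chosen for illustration; the formats used in practice are more elaborate, and the function names and sample values here are made up.

```python
def quantize_4bit(weights):
    """Symmetric round-to-nearest quantization to signed 4-bit integers (-7..7)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0  # one shared scale for the whole group
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the 4-bit integers back to approximate floating-point weights."""
    return [v * scale for v in q]


# Hypothetical sample weights, not taken from any real model.
weights = [0.12, -0.53, 0.98, -0.05, 0.31, -0.88]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("4-bit codes:", q)
print("restored:   ", [round(r, 2) for r in restored])
print(f"max error:   {max_error:.3f} (bounded by scale/2 = {scale / 2:.3f})")
```

Each weight now needs 4 bits instead of 16 or 32, and the rounding error per weight is at most half the scale. That is the "lossy but usually good enough" trade-off in miniature.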
Obviously I'm heavily oversimplifying here. I think a useful parallel is to consider situations from the past where you would once have required corporate budgets equivalent to the price of a house to run a large database, but over time it became accessible to anyone with the requisite expertise and relatively affordable hardware.
That's still a lot of money, but most people don't really need a trillion parameter model. If privacy is more valuable than the frontier capabilities then they could almost certainly get by with much less.
Assuming the math works here (though I think there are some caveats depending on the model architecture), 1T parameters at 4 bits is about 465 GiB just for the weights, so you wouldn't be able to fit the KV cache.
It's showing about 8-9 tk/sec, which seems quite slow for something like a web search with result aggregation, although maybe bearable for smaller-context stuff.
The thing I've been running into with z.ai hosted GLM-5.2 is the 2024 knowledge cutoff. Anything recent requires web augmentation, which is more token-intensive, so low tk/sec hurts even more than it would with a "smarter" model.
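As a quick sanity check on that arithmetic, here is a small sketch that computes the size of the raw weights alone. It deliberately ignores the KV cache, activations, and any per-block metadata a real quantized file would add, so treat the result as a lower bound; the helper name is made up.

```python
GIB = 2**30  # bytes per GiB


def weight_size_gib(n_params, bits_per_weight):
    """Raw weight storage only: parameters * bits / 8, expressed in GiB."""
    return n_params * bits_per_weight / 8 / GIB


for bits in (16, 8, 4):
    size = weight_size_gib(1e12, bits)
    print(f"1T params @ {bits:>2}-bit: {size:7.1f} GiB")
```

At 4 bits, a trillion parameters comes to roughly 465.7 GiB, which matches the figure above and leaves very little headroom on a 512 GB machine.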
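To put 8-9 tk/sec in perspective, here is a rough sketch of how long the generation phase alone would take for a few response lengths. The token counts are assumptions picked for illustration, and the estimate ignores prompt processing (prefill), which adds further delay when a web search stuffs a lot of text into the context.

```python
def generation_seconds(output_tokens, tokens_per_sec):
    """Time spent generating output tokens at a steady decode rate."""
    return output_tokens / tokens_per_sec


# Hypothetical response sizes: short answer, long answer, search-and-summarize.
for tokens in (200, 1000, 4000):
    slow = generation_seconds(tokens, 8)
    fast = generation_seconds(tokens, 9)
    print(f"{tokens:>5} tokens: {fast:6.0f}-{slow:.0f} s")
```

A short chat reply lands in under half a minute, while a multi-thousand-token aggregation runs into several minutes, which is about where "bearable" stops.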
It seems (somewhat unsurprisingly) open weight models have older knowledge cutoffs.
You can run fantastic local models if you have either:
- M-series Apple device with ideally >= 24GB of unified memory (which doubles as VRAM)
- RTX [345]090 GPU
I'm fortunate enough to have both and use an M-series laptop as basically a persistent server (I don't use it much and when traveling typically just use my work laptop). My desktop doesn't act as a persistent server but I fire up llama.cpp on it all the time for quick chat sessions.
If you have one of the above devices and can dedicate it as a server there are additional layers of tooling you can use that dramatically improve the experience. In particular Open WebUI allows you to add tons of useful tools (image gen, web search, code eval, etc), and agent harnesses like Hermes can make the current gen small models very capable. I have an agent in chat on my phone that basically handles all the sys-admin for the server it runs on.
If you are experimenting it's worth mentioning that the harness/tooling is very important to getting a solid experience. Hermes agent is great for running helpful agents and Open WebUI can really make the experience feel on par with paid chat interfaces.
A reasonable halfway step is to pay for an open model through the provider or OpenRouter. You'll get many of the benefits (especially around pricing) without needing to shell out on hardware before deciding if you like the way these models work.
Presently they trail SOTA by about 6-12 months, not on par (average across everything they do).
DeepSeek V4 Pro with Max reasoning is very affordable even if you pay per token. This month I pushed about 486 million tokens through it (I will admit that >95% was cache hits, which is pretty typical for agentic development) and it cost me about 8 USD in total. Meanwhile, if I had to pay API prices for Opus or even Sonnet, I would be a much sadder camper. That model does a lot of stupid things though, so it's not ideal.
Meanwhile GLM-5.2 that came out is also quite capable and is near Opus in many tasks, all while their coding plan is more cost effective than Anthropic's: https://z.ai/subscribe
I will still stick with Anthropic but am considering downgrading from Max 5x to Pro, which would change my monthly expenses from around 108 EUR to <20 EUR (they have a discount too if you pay for a year up front), and will probably get the yearly GLM Pro plan, which should decrease my yearly expenses from around 1300 EUR total to about 750 EUR total while still giving me a fairly decent setup.
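For a sense of scale, here is a small sketch that turns those two numbers (about 486 million tokens, about 8 USD) into an effective blended rate. It says nothing about the provider's actual price list; the blended figure is only this low because most tokens were cache hits. The helper names are made up, and `monthly_cost` is there so you can plug in whatever per-million rate you want to compare against.

```python
def effective_usd_per_million(total_tokens, total_usd):
    """Blended price actually paid per million tokens, cache hits included."""
    return total_usd / (total_tokens / 1_000_000)


def monthly_cost(total_tokens, usd_per_million):
    """What the same token volume would cost at a given blended rate."""
    return total_tokens / 1_000_000 * usd_per_million


rate = effective_usd_per_million(486_000_000, 8.0)
print(f"effective rate: {rate:.4f} USD per million tokens")

# Sanity check: feeding the rate back in recovers the original bill.
print(f"round trip:     {monthly_cost(486_000_000, rate):.2f} USD")
```

That works out to under two cents per million tokens for this particular, heavily cached workload.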
For the consumer, that is doable and practical.
For the people actually running these models, who knows - at least DeepSeek and others are trying to make the models more efficient so the numbers are more feasible.
Also have run Qwen3.6 35B A3B on prem and it kinda sucks. Way better than models that size a year ago, but still lags behind Sonnet and also DeepSeek V4 Flash due to the size limits. Plus, to even run it myself I'd need a pretty beefy setup, most likely a pair of Intel Arc Pro B70s with 32 GB of VRAM each, which I could still run off of my PSU, but the actual model output would be kinda bullshit and I'd have to spend an unpleasant amount of time fixing it.
They are not SOTA in various ways but they have better economics.
My teen isn't super interested in AI, but whenever they do feel curious they have their own account they can use on our home network. As far as chatting goes, local models are more than capable of handling standard chat questions, doing research, helping troubleshoot problems etc. In fact it was an agent powered by the same model that set up the Open WebUI server and took care of all the account management features through my phone (using Hermes agent).
If you're building AI powered features and using sophisticated agent setups for coding for work, then it makes sense to use SoTA from these providers. But I've been using local models increasingly for personal use and am starting to find them preferable (I run an uncensored, ephemeral model for my own use and it's an entirely different experience than anything you can pay for).
Still haven't cancelled my personal Anthropic subscription, but considering it soon.
So from the perspective of your teen, they would benefit from using z.ai or ChatGPT or Claude, etc, rather than the local server where you can see all the conversations.
What uncensored model do you recommend using?
>So from the perspective of your teen, they would benefit from using z.ai or ChatGPT or Claude, etc, rather than the local server where you can see all the conversations.
That is bonkers. If I were a parent, I would hope my child would trust me more than systems monitored by FBI/NSA/etc. Like, what sort of sick relationship do you have to have with your own family to trust them less than strangers who would sell you into prison slavery for a buck.
The state isn't going to ground them, shame them at dinner, out them, pull them out of a relationship, or punish them.
Parents reading your browsing history and private conversations when you are 14-18 years old (the age of teenagers) is very very creepy, unless there is a specific danger to avoid. It's like if you read their private journal.
Adolescents need a private inner world to form an identity, and heavy parental intrusion ("psychological control") is the real distrust. Trust them, they are people, not possessions.
You can guide them, but do not store their private messages locally under your control using the excuse of protecting them from NSA.
If they trust you, they will tend to tell you upfront the things they have questions about, there is really no need to spy on their thoughts.
Same with husband/wife btw.
I guess "starting to find them preferable" suggests to me you think they work better, but this is surprising to me so I think I may have misunderstood, so I ask!
Like you're saying they work better than the proprietary models (in what ways?), or you find them mostly good enough and prefer the privacy or cost, or what?
Having full control over how your data is retained, what the system prompt is, which version of the model you're running, etc leads to a much more consistent experience. For example, for chat sessions, I can't stand the new "let me push back" version of Claude. For my home models I never have to worry about that.
There's never a mystery as to whether the model secretly degraded performance, I always know exactly which model I'm using and how well it's utilizing resources etc. Open models also give you full visibility into the reasoning steps, so you never have to guess what the model is thinking.
Then when you start getting into things like uncensored/abliterated models we're talking about something you can't even pay for. In case you're unfamiliar, even open local models have guardrails built in. But people in the community have found ways to remove these. One of the things I've found most concerning about AI, which is under discussed, is the combination of people having personal chats with an agent that both monitors the conversation and refuses to discuss certain topics. This leads to a very deep level of self-censoring I find dystopian.
I also have multiple Hermes agents set up, some with local backends, others with open but non-local backends (e.g. Kimi through the API). For some tasks, I've just started to find the local agent tends to work better for the type of tasks I want (maybe it just overthinks less?). I don't use it for coding so much as research tasks and sysadmin stuff, but I've been really happy with the results.
Oh, and let's not forget, especially running on a Mac, these local models are basically free to run.
I love what I get from Opus or GPT, but mainly I use GLM and it's so starkly apparent how much better it is that it lets me work together with it, that I can nudge it as it works by correcting bad assumptions or clarifying for it, as it works. And... it just doesn't feel icky. It's not a quasi-mystical alien intelligence, which, honestly, gives me strong "this should be destroyed, is unsafe, and feels outright impermissible" vibes. As a coder, seeing thinking saves time and prevents errors. As a civilization, seeing thinking lets people understand what the AI is working with and grounds society in an appreciation for what is happening, keeps us a little moored. Personally, if I were a government, I would not allow it.
Recent submission on this: "The text in Claude Code’s “Extended Thinking” output is not authentic." https://patrickmccanna.net/the-text-in-claude-codes-extended... https://news.ycombinator.com/item?id=48630535
RTX 4090: ~190 token/sec
I don't have the number around but there is a notable latency for pre-fill on the M3, but once it's running the delay is negligible.
The RTX, unsurprisingly, is all-around superior performance-wise, but I use that computer for gaming and image gen work so I can't dedicate it as a server, and, especially when it's warmer, the heat generated under heavy loads is noticeable.
Don't. Goon. To. LLMs
Capabilities can be gated behind certification programs, or by money, or by numerous other corrupt and non-corrupt means. Model capabilities can be segregated by pricing tiers, creating an economic underclass that cannot afford access to frontier intelligence.
For humanity to benefit, the tech needs to be open and equally available to all.
In a world where everyone is a Claude controller (something I honestly enjoy!), that goes away. I use hundreds of dollars of tokens a month. Suddenly, the kid in her basement with an unloved computer can't get in on the ground floor. You have to be rich to even get started. That worries me deeply. It's a big change for our field, and I don't think it's a good one.
One is the potential for skill rot, where new employees grow heavily dependent on AI. Once the real price per token is settled and discoverable (post massive IPOs, and probably a while after, not immediately), we as a society are left with a bunch of people dependent on a deeply inefficient technology to maintain software we now view as vital, which might severely impede our ability to actually deal with climate change (press X to doubt Bezos).
The second is that interacting with models in a social context during your formative years is deeply damaging psychologically, and we've essentially destroyed the ability of a generation or two to actually interact as productive members of society.
Addressing the second issue doesn't necessarily exclude our ability to leverage models for business productivity but it seems unlikely to happen in the current climate without that also happening. I am hesitant to believe in a sudden outbreak of common sense at this point. The first point could really be a systems collapse trigger - we can argue about the likelihood but denying it as a possibility is excessively naive.
If even one of these had pledged that all profit goes to ending world hunger, cancer research, etc, I could possibly see it - but they haven't. They're all after finding a way to be the biggest, richest asshole possible with the ability to crush anyone in their way.
Why on earth would you want to siphon off the proceeds of AI development to (ok my bias is strong here- mostly corrupt) "ideals" like world hunger and cancer research (that probably get more dollars annually than the sum of actual profit any of these companies will ever get). That would just instantly kill the ability to improve AI at all, and the world could possibly be better for a few months?
And how do we prevent Chinese companies from training on our open AI models and offering their models for free?
https://www.ibm.com/policy/contributions-and-expenditures
Their biggest customer is the US federal government. Taken in aggregate across agencies, IBM is one of the largest federal IT contractors, and deep public-sector and financial-services contracts in the US make it IBM's single largest national market. No individual commercial company comes close to the government's aggregate spend.
Now, say another company with an equivalent product wants to sell to the government at half the price. Can they? Nope, IBM will win.
Furthermore, according to the lobbyists, China = evil but they forget that a lot of software contains Chinese code.
The potential for wealth creation with AI is so high, and research, pre-training and inference are so expensive, that any open-AI would eventually become OpenAI.
Losing privacy has ZERO downsides for ordinary people. Nobody cares about your data. Literally, put all your life on a YouTube channel and see how many views that video will get. ZERO.
Irrational fears (especially if it's conspiratorial) => Sub-optimal decision.
Just like Buffett and Bezos, my strategy is simple -- go against firms that are making irrational decisions. It's the same framework to adopt cloud, AI and many frontier technologies and disrupt.
Given they have laughable uptime and I have yet to find a useful project mostly written by claude... I doubt it.
The SCOTUS has made it exceptionally clear mathematics and software are protected by the First Amendment. The Atomic Energy Act of 1954 tries to make a very narrow exception for nuclear weapons, but
1. The law has never been challenged in court for being unconstitutional, and
2. It doesn't apply to model weights
Any attempt by the government to suppress open models will meet legal challenges on the grounds of (1) or (2).
Congress could amend the act to include model weights, but that won't prevent legal challenges on the grounds of it being unconstitutional (which it is).