I haven't found anything that requires running all night. I could tell it to one-shot a big plan but given how often I realize I want an intermediary thing to be slightly different it seems like a waste of effort.
I'm guessing the next thing I should look into is some sort of VM I can tunnel my codex-gui requests to, so I don't have to deal with the sandbox approvals (I don't want to give it "dangerous" access to my entire Mac).
I don't understand what people are doing with their side projects that is leading them to churn through tokens so quickly, to the point of requiring two $200/month subscriptions and a bunch of token charges besides.
I've found only one application where it makes even the slightest sense to have an AI grind away for hours on end. I'm reverse engineering a widget that contains five separate firmware images. I've dumped the binary from the widget and set the AI to decompile and reverse engineer these interrelated firmware projects. It's a complex task, but very well bounded. It's not complicated work, but it's a lot of work, and the end result is a C-shaped pile of text that is only informative; it would never be compilable on its own, even if I did it by hand. The quality of the output is tightly bounded by the input assembly, and the overall output artifact is documentation in the shape of code.
I don't have any qualms about letting an AI go ham on it unattended because the stakes are zero. But if the AI can beat the assembly into a recognizable C project, it's much easier for me to read and reason about. Easy win, I think.
My personal OSS projects don't have the scale to necessarily make this worth it, but at work I run three pipelines using Barnum (https://barnum-circus.github.io/). First, one that ingests files, identifies refactors (from a pre-approved list), and places a precise description of the refactor to be done in a queue; second, one that reads from said queue, implements and creates PRs (there is a lot of "check that the PR is correct" here as well); and a third that babysits PRs until they land. I've landed hundreds of PRs in this way, with very little effort on my part.
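A minimal sketch of that three-stage shape, assuming plain in-process queues. This is not Barnum's API (its interface isn't described here); `identify`, `implement`, `babysit`, `verify`, and the refactor names are hypothetical stand-ins for the agent-driven steps.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical stand-in for the pre-approved refactor list; the real one isn't given.
APPROVED_REFACTORS = {"rename_deprecated_api", "remove_dead_import"}


@dataclass
class RefactorTask:
    path: str
    kind: str
    description: str


@dataclass
class PullRequest:
    task: RefactorTask
    checks_passed: bool
    landed: bool = False


def identify(candidates: dict[str, list[str]], queue: deque) -> None:
    """Stage 1: enqueue a precise description for each pre-approved refactor found."""
    for path, kinds in candidates.items():
        for kind in kinds:
            if kind in APPROVED_REFACTORS:  # off-list suggestions are dropped
                queue.append(RefactorTask(path, kind, f"apply {kind} to {path}"))


def implement(queue: deque, verify) -> list[PullRequest]:
    """Stage 2: drain the queue, 'implement' each task, record whether its PR checks out."""
    prs = []
    while queue:
        task = queue.popleft()
        prs.append(PullRequest(task, checks_passed=verify(task)))
    return prs


def babysit(prs: list[PullRequest]) -> list[PullRequest]:
    """Stage 3: land PRs whose checks pass; anything else stays open for a human."""
    for pr in prs:
        pr.landed = pr.checks_passed
    return [pr for pr in prs if pr.landed]
```

Keeping the stages decoupled by a queue means the "identify" stage can run ahead of implementation, and a human only has to look at what the last stage refuses to land.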
I'm grappling with this at the moment. When I get it to do design or reverse-engineering work, it makes the wall of text bigger during investigation rather than consolidating it. It never pauses to create proper abstractions. This is on Opus, which starts getting wordy and performative on goals it can't easily verify.
Here’s an example https://m.youtube.com/watch?v=xc1296HY8Fw&ra=m
It’s completely different to a professional workflow (what you described). It’s a toy for consumers
As everyone trying to do real work is finding, that's the actual bottleneck. If the system is keeping up with your thinking, you're doing fine. You can't "level up" your thinking by paying for more tokens. The people doing more automatic stuff are probably outpacing their own thinking, and that will bite them eventually.
But like 99% of that task is just Codex waiting for the output. So it’ll run for 12 hours, but mostly it’s just setting lots of sleeps. I haven’t gotten close to running out of tokens. On the $100-a-month Codex plan I hit usage limits almost immediately: about 3 days into working like crazy with 10 agents going at once, mostly coding an asset pipeline, I ran into my weekly limit and upgraded. With the $200-a-month plan at 4x more credits, I haven’t hit any walls at all and can absolutely cook.
Knowing LLMs and their output I would also bet that you're getting nonsense output that sucks.
I'm running Claude/Codex inside native macOS sandbox, configured with a simple script - https://github.com/sheremetyev/sandfence
Always in "bypass permissions" mode: it works until the task is solved, sometimes an hour or more (which includes running tests, etc.).
Go out for a walk. Wherever you live, there will be a destination or an environment that will enrich your life just by visiting it. Go and take a look at it or experience it and then go back to worrying about tokens.
I see people just completely wasting tokens with ridiculous setups, 100% hitting cache misses as well as dumping huge files into context all the time.
Just learn how these things work, or pay the price I guess.
I am an engineer, and when I understand what’s going on, I never hit any limit.
For personal pet projects I can definitely see how you could blow through your token budget very quickly. If I just point my coding agent at iteratively coming up with heuristics for some NP-hard problem, it will read intermediary outputs and constantly make small changes "in the dark" until it either finds a small improvement or gives up. In a similar vein, I found that you can burn many, many tokens if you let the agent reverse engineer something where you don't have the source code. If you just give it a binary or some interface to work with and a vague task, you can easily burn your entire budget with one prompt.
I wouldn't want anyone to use these fully vibe coded toy projects though; it is more of an exploratory curiosity for me where I learn more about some problems I'm interested in as well as gauge how good the agents are at tasks that I seem to have a much better intuition on how to approach.
This is what https://github.com/kstenerud/yoloai does.
Sandboxing using Docker, Podman, containerd (linux only), seatbelt (macos only), tart (macos only), apple container (macos 26+ only).
It takes a copy of your workdir, does its thing inside of the sandbox, and you pull the results back using git semantics:
$ yoloai new mybugfix . -a # launch default sandbox in . and also attach the terminal
# Work with the agent...
$ yoloai diff mybugfix # See what it did
$ yoloai apply mybugfix # Bring out commits and/or uncommitted changes.
$ yoloai destroy mybugfix
Docker sbx is worth looking at here, possibly; essentially a canned VM with a file system mount and layers for installing various agentic coding environments that cannot work outside that mount.
Apple’s new container machine addition to the container CLI does some similar magic.
In my experiments I have been using opencode, running the web interface inside a multipass VM, with the LLM server on the host. I have been using the desktop app, which can now do remote connections so the GUI app on the Mac can connect to the opencode web instance inside the VM. But I might bite the bullet, install Tahoe and switch to the container machine approach.
Having said that, I think there is a question of how far we can push this and not collapse under the weight of tech debt created, e.g. https://openai.com/index/open-source-codex-orchestration-sym...
I think the dream is basically that you go and file a bunch of Linear tickets, and then you come back a day later to evidence of the tickets being resolved and the code merged. I don't think we're super there yet (See: Anthropic's regular bugs in everything), but this is the future that people are trying to get to and to some extent the question is: is there anywhere we can apply this to now sanely? How does this frontier evolve?
In other words, isn't there a way to orchestrate this NOT as a long-running, token-maxxing setup, given that triggers and CI runs can be run deterministically?
disclaimer: I haven't done this, just interested.
I feel like I’d need to not have a job or a life if I wanted to exhaust the OpenAI $100 plan using GPT 5.5 xhigh, and I’ve found it insanely capable.
That said, while I don’t read the code much (if at all), I do discuss each milestone up front to make a plan, and use/dogfood the results to direct any follow-ups and refinements, which puts a natural cap on the ratio of LLM contributions to my input for these side projects. I believe these human parts are still necessary not to eventually end up with a mess.
I'll probably throw it out on the internet someday if it gets far enough, but there are no attention/adoption related goals attached to the project.
It's currently useful for trivial scripting and iterating as and when I have time.
Surprisingly, I have had one much longer run: refactoring our marketing website. We have a lot of blog posts that were written before we had more detailed style and tone guidelines. I wanted to make everything consistent, but it took 15 or 20 minutes per post because each post needed several passes to fully enforce the guidelines, so an overnight run was required. That was quite a surprise, since the posts aren't terribly long...
FWIW, while I have created and run this kind of build a few times... I did not like the results! In the end, I personally like to be in the loop to test and feel how stuff is turning out as it goes.
orchestrator -> parallel subagents with investigation, authoring, verification, benchmarking subagents and integration / final verification handled by parent has improved my productivity too.
I feel like from here it's agent swarms against a whole spec, but I haven't got there yet.
Still getting plenty of bugs in the more complex scenarios, but mostly (in some projects) I never have to look at the code and treat it like a black box.
Why do you need to "level up"? To have it shit out slop faster?
Just use it rationally for what you need to do.
Power is not free.
What I’ve found is that you’re basically paying a premium for privacy, and that’s worth it for me.
So for me, there is no additional hardware cost; it was acquired as a replacement.
I run the AI models at home on this kit because I want to; I'll use openrouter if I need to.
I accept that the economics of this article are right. But I feel so incredibly sad about this outcome, that we are now just to be people caretaking machines that do the job we loved, that I am not sure exercising this nuance is going to matter in the long term.
It turns out it is a mistake I have made in my life — now really unfixable because I am a bit too old — to believe that I will always find enough fulfilment in my work to offset the absence of personal fulfilment elsewhere; I have always enjoyed being able to help people directly by doing a thing I love and I am good at, and that has kept away the sadness of finding it difficult to build a conventional family life to enjoy.
I assumed I would always find some new way to find that enjoyment, but even the slim enjoyment from being able to explore this stuff on my own kit in my own terms will not be enough if the pendulum does not swing back towards human effort.
It is a dismal world we have made for ourselves. Lately I have found myself dreading growing too much older in it.
> dreading
Even avoiding political headlines (OK, at least articles), plenty of cause for dread, so I keep re-focusing to avoid despair. Easier said than done innit!
Can't kill my hope for the future though. One day, all the good stuff shall prevail (morality, intelligence, love & kindness)... maybe not permanently, but a Star Trek future is there somewhere (& they had their troubles but it wouldn't be a dreadful situation overall). Sharing with you in case it's even slightly contagious!
So in my fight back I decided that I needed to re-centre myself; learn how these tools can help me personally return to productivity, try to get that deep self-teaching back, reanimate myself consistent with my principles, learn and make things. Take it head on without losing who I was.
I haven’t been a “big projects” developer since the dot com era (when I worked on some pretty cutting edge things). I have been a small projects developer: building things that matter for small businesses and schools, supporting designers, teaching people stuff along the way. I have been productive, I have very diverse skills and I have been valued.
What I have come back to is an industry that has abandoned craft principles or discussions about developer discipline, code quality, efficiency, robustness, resilience, etc., and fully organised itself into a headlong rush towards a kind of nihilistic Metropolis machine-cranking.
And because I am a freelancer (more of a contractor in practice), my competition is already the machine itself. I am one of those developers who is eliminated in the last sentence of the article. I am not needed on big projects and in many small jobs — the kind a burned out small business developer needs to get back to work — I will never be needed again.
It is very odd, trying to learn how to understand the tools that others are using to make you irrelevant.
And when all your friends are obsessed with AI, either clients desperate to use it or friends (in the creative culture I am surrounded by away from work) angry and resentful of it, I find I have just nobody to talk this through with.
In many ways I would rather not have returned to actively using HN (because articles and despair, and because being by oneself it’s possible to get drawn into online arguments) but in recent months I have noticed in the comments that perhaps this is the only place where these discussions among “craft” developers are happening at all.
I am over fifty and safe financially, and if my last day were for some horrible reason out of my control to be tomorrow, that’s OK; I have enjoyed my life and on good days I do still enjoy it. I have friends who I see when I can get myself out of the house, I have distractions I can enjoy, all that.
I am now much more troubled by what it is going to be like to continue to live it. I struggle every day to see where I have value, especially as burnout has left me with less energy to spend.
Like I say, I am safe and very aware I have been blessed; it’s not a cry for help. But I think a lot of us who found value in our work wonder what the fuck we can do to keep ourselves alive the way we were.
ETA: holy shit that was an essay.
Now every developer is getting promoted to management, because they are expected to manage the AI agents. But neither their status in the organization nor their pay really increases, does it, when every coder is doing that?
> now really unfixable because I am a bit too old
How old is a bit too old? I know 50+ colleagues doing sports and traveling just fine.
People tend to assume the capex is thrown away, but as we’ve seen with RAM, don’t be so sure you won’t be able to flip it if you need to.
There's actually an interesting thought experiment here: if it takes you a full day to build something that AI would otherwise build in a day, do you end up using more power, or less? What is the break-even point, purely from a power consumption perspective?
Brains are thousands or maybe even millions of times more fuel-efficient than computers and you are alive for the whole day either way, right? You probably eat about the same even.
The reason executives think AI is more efficient is that it's more space-efficient than a human and doesn't demand to be paid or to work only a set number of hours. Everything with computing is more efficient if you resent having to give money to other humans. If they could just not have you be alive when they don't need you, it'd possibly be different.
Even so, I think that at a typical British freelance rate and a truly unsubsidised token price, the AI is possibly more expensive than me. And as a freelancer, from their perspective I really am not alive until they need me. (This is what it often feels like.)
The reality is the human and the AI aren't used to build the same things anyway so it's a comparison you can't really make.
Then, assume power costs 20 cents per kilowatt-hour (US average). To match the human 3 cents per hour, you need an average draw of 150 watts. That's in the range of a budget graphics card, but not much past there.
However, if you sleep instead of sitting around, you can probably make AI cost-competitive. Sleeping drops your metabolic rate further, and lying down in bed (as opposed to sitting) also reduces calorie burn. Combined, you can reduce your burn by about 30 calories an hour. At the new 9 cents per hour human cost, you can afford to run a higher-end graphics card drawing ~450 watts. That puts you in RTX 3090 range.
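The break-even draw is just the human's hourly cost divided by the electricity price; a quick check of the two figures above, using only the comment's own numbers (how the 3 and 9 cents per hour were derived isn't shown here):

```python
def break_even_watts(human_cost_per_hour: float, price_per_kwh: float) -> float:
    """Average draw (W) whose electricity costs the same per hour as the human."""
    kw = human_cost_per_hour / price_per_kwh  # ($/h) / ($/kWh) = kW
    return kw * 1000


print(round(break_even_watts(0.03, 0.20)))  # 150
print(round(break_even_watts(0.09, 0.20)))  # 450
```

The result is a power (watts), not "watts per hour"; the "per hour" is already folded into the kilowatt-hour price.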
I've run the napkin math, and assuming LLMs make humans even 5% more efficient, the power and water savings over time are significant, largely because humans are so resource intensive: https://news.ycombinator.com/item?id=46984659
I would agree with you if you said it was vastly cheaper overall (with the initial equipment investment amortized over time) compared to The Power Company.
In many states, even if you are generating electricity and selling it back to the power company, they're still gonna charge you normal usage rates, because greed.
If you go off grid, you have bigger things to worry about than how to power your AI cluster. It’s manageable enough if you have land but that’s in scarce supply.
I ran the numbers and outside of privacy it doesn't make sense. But I did it anyways. [0]
0 - https://www.williamangel.net/blog/2026/05/17/offline-llm-ene...
The reality is that they do not offer configurations that would allow a consumer to run that much VRAM on a single setup to protect datacenter margins. Apple used to, and they stopped, those devices are going for ~$20k+ each on ebay now.
You can get very, very capable models on a 3090/4090/5090/6000 series card. But if you want 'frontier level' you are investing ~22k at a bare minimum if you go new. Used you can probably build your own server for much cheaper up-front cost but it's likely going to be 4-6x+ electricity usage.
Sadly, no. The best comparable thing you can get is about Sonnet 3.7
But - good luck finding them. Apple discontinued the model a few months ago. And more recently, even 256G model was discontinued. Big AI really really does not want people to get off their needle.
The Qwen3.6-35B-A3B planner hums along at 50-55 tokens/s, and the Qwen3-Coder-30B-A3B-Instruct coder does 30-35. With both agents up and ready to work, RAM consumption sits at about 112 of 128GB.
It's pretty okay. I'm faffing around with having it disassemble old MS-DOS games from the 1980s, which is a task that lends itself well to the setup. It's not the fastest thing in the world, but with the planner's context window at 256k tokens and the coding agent at 128k, they chew through pretty long task lists handing things back and forth without complaint. The only real issue is that even with really tightly scoped prompts, the coding agent tends to hallucinate like it's on LSD. But the planning agent appears to be quite good at spotting the hallucinations and re-parceling work back to the coder.
It's neat. I'm going to be sad when I have to return the review unit in a couple of months.
edit - I also have been fiddling with Deepseek v4 Flash via Antirez's setup (https://github.com/antirez/ds4), and it's pretty fantastic (and fantastically easy to get running). It's pretty pokey on the Spark, though, at 14-ish tokens/sec. And unless you have a second Spark, it's going to be the only model you run at one time, as it eats alllll the rams.
Is this with a Ghidra MCP or some other technique? And why two models - did you try using Qwen3.6-35B-A3B for everything? (Or 27B or a bigger model since you have the RAM for it)
I have used a $60 per month Cursor plan on auto, and have never come close to using up my included usage, and I probably have it planning and coding and working for me all through the evenings 4 nights a week.
What on earth are people doing differently that it's costing them so much?
Maybe they're enabling on-demand usage or other paid models, or using higher modes? What are you doing that requires this? The output from Auto is crazy good for the tasks I'm working on, and I have yet to run into an issue where it couldn't perform at a high enough level.
We have been interviewing people at work to join our team and they tell us they use $2K per month in tokens with their current employers.... I can't even fathom what's going on here where that would be happening.
I used to spend $200/mth on the Max plan at a small startup. Now spending single digit thousands on Claude enterprise with the same usage levels.
Anthropic is subsidizing consumer usage, and also charging a nice margin for enterprises for zero data retention (ZDR)
Perhaps they are simply trying to impress you with their mad prompting skills. Like, what self-respecting engineer would be caught dead using less than $2k/month?
Given the context of your interaction with those people, it's probably the simplest answer to your rather baffling question. For the life of me, the idea of using $2k/month doesn't even seem possible unless you're telling it to waste credits.
As an example, I might have an agent with access to a browser, logs, metrics, GitHub & CI logs, etc., and ask it to implement a new feature.
In Slack I have a few bug reports, so I spin up a few more agents. A PM needs a UI tweak, so I spin up an agent. You can imagine that a lot of the work a dev does isn’t necessarily that complicated, and I just need to be there to review the final PR and leave comments as if it were a colleague's (and then my agent goes back, fixes the comments, requests a new review…).
While that’s happening I might be using my actual attention for a meaty feature, design doc, data analysis, etc.
I spend $300/mo for personal use, and a couple thousand at work. Agents can be really transformative and well worth the cost.
Would my company rather pay a few thousand per month, or several hundred thousand per year for an extra fully loaded engineer? At this point it is _at least_ a 2x multiplier for me.
When I do use AI, it's just the pure tool itself, and the context is the exact code I'm working with (because I'm trying to see if it can help me solve a specific problem), and I understand the rest of the codebase well enough to know if it's giving me good answers or bad ones
Besides that, it’s simply reading or writing a lot of code. The only time I spent $1k/mo was when I built a greenfield 20k LoC service with 2 other people (while also doing antagonistic code reviews, etc.), mostly using Opus. So it took $3-5k to build a 20k LoC production-grade service in a critical domain.
But most of the time I’m not doing that. As they say most engineering isn’t coding.
Plenty of low level things can trip agents up, too. I just had one inexplicably refuse to read an error about a function needing a bool return value - trying about 10 variations of the same thing before I interrupted it. Skills probably cause issues too, it loves to for example read the source code of libraries I'm using if I give it permission. That's a rabbit hole.
I did explore self-hosting models but hardware right now is just too expensive.
First month is $5, later $10. Cancel any time. You can keep getting the deal with a new email.
Still, that's interesting. What do you get for that price? Only coding, or also e.g. image generation?
I just use it for my side-project coding and brainstorming tasks. At work I use AWS's Kiro CLI + Opus 4.6. At home I use Opencode + V4 Flash for the majority of "general" usage. I swap to V4 Pro for complex tasks if I feel like V4 Flash is struggling.
One other thing I highly like about the platform.deepseek API usage is it's a metered setup - not subscription based. Which means you only pay for what you use (the money that you put in doesn't expire) and can't spend more than you've deposited. This works well for me for my non-work coding because it generally happens in bursts. I may not code for a whole month (and therefore if I had a subscription it would have been wasted) and then spend a whole weekend coding nonstop.
It's entirely possible that there are middle-man providers that give a discount on Deepseek's own pricing, but I'm quite happy with the amount I'm paying so I haven't really looked into it.
https://openrouter.ai/deepseek/deepseek-v4-flash-20260423#pr...
After complaints, the cached-read price is no longer listed on that page; you have to click through the providers one by one. All providers for DeepSeek V4 Flash charge ~$0.02, while the DeepSeek provider is $0.0028. For coding this is huge, as cache hit rates are often in the 90 to 99% range. But OpenRouter messes up your caching, so don't use it. And it seems to be a VC-backed, closed middle-man company, not open source or open anything.
I have hourly automations for root cause analysis on customer support issues, daily automations for eg log analysis, weekly & monthly automations for KPI tracking & actioning.
I will say, when I was building side projects that were 1) fairly well defined in scope and 2) without users/need for automations it was much easier to stay under $20/mo plan limits. Now I regularly hit weekly limits and need multiple Max plans
I suspect the people that burn through tokens have several subagents and 50 skills loaded and 40 MCP tools. All those load up the context on every single turn.
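A rough way to see why always-loaded skills and tool definitions add up: every turn resends the whole context, so a fixed overhead is paid once per turn on top of the growing history. The session sizes below are hypothetical, picked only to illustrate the shape.

```python
def total_input_tokens(turns: int, fixed_overhead: int, tokens_added_per_turn: int) -> int:
    """Input tokens sent over a session in which every turn resends the full context.

    fixed_overhead: system prompt plus tool/skill definitions, loaded on every turn.
    tokens_added_per_turn: new history (messages, tool output) appended each turn.
    """
    return sum(
        fixed_overhead + turn * tokens_added_per_turn  # context size at this turn
        for turn in range(1, turns + 1)
    )


# Hypothetical 50-turn session adding 1,000 tokens of history per turn.
heavy = total_input_tokens(50, fixed_overhead=20_000, tokens_added_per_turn=1_000)
lean = total_input_tokens(50, fixed_overhead=2_000, tokens_added_per_turn=1_000)
print(heavy, lean)  # 2275000 1375000
```

Prompt caching can discount a repeated prefix, but only on cache hits, which is why the cache misses mentioned upthread are so expensive.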
I suspect that most of these people who are burning through thousands of dollars worth of tokens at home are largely producing big ol' piles of slop.
The short answer is: they are doing slop. Most of the coding can be done quickly with a keyboard, IntelliSense and maybe some code generation templates.
But people became dependent on AI doing everything for them, and now the tech bros have started to squeeze, like drug dealers.
I learned coding nearly 24 years ago and I'm still learning new stuff all the time. At no point did I have to rely on a subscription model to learn and do new stuff.
If LLMs and agents are the default tools for coding and building software, at least for the next few years, it seems like a no-brainer to invest $2,000-3,000 in hardware, like a Strix Halo PC.
I have a GTX 1080 Ti, which I think is circa 2018. It's unused, has more than paid for itself over the years, and owes me nothing at this point, so the hardware is free.
It runs Gemma e4b multimodal, qwen 3.5 8b or the qwen 4b embeddings models well enough (40+ t/s for the LLMs).
The machine consumes 350 watts at the wall under load (3 W when sleeping, 80 W at idle). Electricity costs me £0.035/kWh, which is cheap for the UK (load shifting via a house battery).
That's 144k output tokens for around 1p (and it takes an hour to produce them, in theory).
It's only JUST cheaper to use than the far more capable deepseek v4 flash model despite the free hardware and ~10x cheaper than normal electricity.
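The 1p figure checks out from the numbers above: 40 tokens/s for an hour is 144,000 tokens, and 350 W for an hour is 0.35 kWh. This sketch uses the comment's own figures, takes 40 t/s as the rate, and ignores idle draw and the (here, free) hardware.

```python
def pence_per_hour(watts: float, gbp_per_kwh: float) -> float:
    """Electricity cost in pence for one hour at a constant draw."""
    kwh = watts / 1000  # one hour at `watts` W uses watts/1000 kWh
    return kwh * gbp_per_kwh * 100  # pounds -> pence


def pence_per_million_tokens(watts: float, gbp_per_kwh: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return pence_per_hour(watts, gbp_per_kwh) / tokens_per_hour * 1_000_000


print(40 * 3600)  # 144000 tokens per hour
print(pence_per_hour(350, 0.035))  # about 1.2p per hour
print(pence_per_million_tokens(350, 0.035, 40))  # about 8.5p per million output tokens
```

Expressing it per million tokens makes it directly comparable with hosted API price sheets.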
But that feels like measuring productivity in lines of code. For what I'm doing, I'm not seeing the benefit in any subscription.
Sure, I can't one-prompt a whole new boring CRUD app, but oh well.
edit: I am not dismissing local. I am one such user ( though I have subs too ), but one has to be clear eyed about the trade-offs.
I wonder if part of the solution is building or finding the right libraries, with the right documentation/language/API (one that plays well with LLMs), and maybe creating some synthetic data around them, to make it very easy for the LLM.
And maybe there could be a business model around creating those libraries.
If you can ask the model for a specific function; with a spec design (typed languages help too) then the small models are great! I have had good progress with generating small python modules for example, but you need verification rounds to catch issues.
So test-driven design + a good spec sheet + a very detailed todo.md (or even better, todo.json, because then the LLM doesn't need to manage it; you do, from the harness) is your best bet for small models.
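A minimal sketch of a harness that owns todo.json, assuming a hypothetical schema of `{"id", "spec", "status"}` entries; `ask_model` and `run_tests` are hypothetical stand-ins for whatever model call and test runner you actually use.

```python
import json
from pathlib import Path
from typing import Optional


def next_task(todo: list[dict]) -> Optional[dict]:
    """First task still marked pending, or None when the list is finished."""
    return next((t for t in todo if t["status"] == "pending"), None)


def run_harness(todo_path: Path, ask_model, run_tests, max_attempts: int = 3) -> list[dict]:
    """Drive a small model through todo.json one task at a time.

    The harness, not the model, reads and writes the file, so the model only
    ever sees one narrowly specified task.
    """
    todo = json.loads(todo_path.read_text())
    while (task := next_task(todo)) is not None:
        for _ in range(max_attempts):
            ask_model(task["spec"])  # model edits code for this one task
            if run_tests(task["id"]):  # tests, not the model, decide "done"
                task["status"] = "done"
                break
        else:
            task["status"] = "failed"  # hand off to a human or a bigger model
        todo_path.write_text(json.dumps(todo, indent=2))  # persist after every task
    return todo
```

Capping attempts per task keeps a small model from looping forever on something it can't do.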
Like perhaps you could produce 5 versions of a piece of code, and then compare them to choose the best.
Also if the local LLMs can call tools, maybe you can use static analysis tools to catch errors and try again in a loop or process of some sort.
There also might be certain languages that work better because those languages have better static checks.
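A sketch of the "5 versions, keep the best" idea combined with a static-check retry loop. Here `ast.parse` is only a stand-in for static analysis (it catches syntax errors and nothing else); a real loop would call a linter or type checker. `generate` and `score` are hypothetical hooks for the model call and for whatever "best" means (e.g. tests passed).

```python
import ast
from typing import Callable, Optional


def static_errors(source: str) -> list[str]:
    """Cheapest possible static gate: does the candidate parse as Python at all?"""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return [f"line {err.lineno}: {err.msg}"]
    return []


def best_of_n(
    generate: Callable[[list[str]], str],
    score: Callable[[str], float],
    n: int = 5,
    rounds: int = 3,
) -> Optional[str]:
    """Sample n candidates, drop those that fail the check, keep the best survivor.

    If every candidate fails, feed the error messages back and try another round.
    """
    feedback: list[str] = []
    for _ in range(rounds):
        candidates = [generate(feedback) for _ in range(n)]
        survivors = [c for c in candidates if not static_errors(c)]
        if survivors:
            return max(survivors, key=score)
        feedback = [e for c in candidates for e in static_errors(c)]
    return None
```

A language with stricter static checks simply makes `static_errors` reject more bad candidates before they cost a test run.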
Which is to say, I might use AI to do an outline or organizational pass, but I'm prompting every chunk of code "one by one" (e.g. at about the function level), which still feels light-years ahead of what I used to do.
I realize this text is just slop but it never stops being a "real bargain" at any point.
And it's more like $200/mo for $4000+/mo in tokens. You can also buy additional subscriptions.
There's no sense in running local models or doing anything else as long as VCs (and soon the public markets) are willing to pay your bill.
At the end of the day, AI models are relatively small files that we run little CUDA programs on.
Oh, so this is not a post about AI coding at home. It's about vibe coding at home.
There's a lot I disagree with in this post, but I'm posting this from a home computer with 64 GB of RAM and no GPU. I do lots of AI coding while spending very little money. I run Gemma 4 26b (mixture of experts) and Qwen 3 coder with Ollama. I use GitHub Copilot code completions. I use the Gemini and Mistral API free tiers. I have a Gemini paid API account. It's now prepaid, so you don't have to worry about an accidental $1000 bill. You can do a lot of things with Gemini Flash Lite 3.1.
None of this is burning through tokens to create an expensive blob of spaghetti code, but it does qualify as AI coding.
There's far less need for what the author refers to as frontier models as soon as you move away from vibe coding to filling in the gaps that you don't want to write yourself. The author doesn't even consider Gemini models to be frontier.
> models become smarter and less expensive
That's optimistic. They might become smarter but I don't see any market forces in the next few years that will make them cheaper.
You can't "slop cannon" vibe code with it, but this is personal code I want to not be spaghetti, so I'm not trying to vibe code. I just want to get instant retrieval of all stack overflow and reddit posts in a chat box, and for it to be able to spare me the physical pain of actually having to type out typescript code (I am a BE dev with negative patience for all frontend) and fuck around endlessly debugging obscure docker problems (I like docker, but, no patience for it having annoying problems and endless quirks). And this model does that really well.
Because (1) the Huawei collab and (2) vLLM etc. don't implement half of the inference optimisations DeepSeek proposed in their paper.
For me MiniMax 3 has really hit the sweet spot: very cheap, though more than Flash, but also very capable.
If you still need more tokens, odds are that you're vibecoding unmaintainable throwaway trash.
I don't think that's true at all. I'm doing 8-12 PRs a week at work, all primarily Claude Code, and the usage at API billing has never broken $500/mo.
I'm on the $100/m plan and used $300 at API billing yesterday (according to ccusage)
Seems like one session is >$100 and I can get 10 full sessions per week
The $200/m plan is supposed to be 4x that in usage, so with 2 of those you could use 4*2*100*10=$8000 in just a week
Using Simon's numbers here as a bare minimum https://simonwillison.net/2026/May/27/product-market-fit/#en... you'd get 1200*4*2=$9600 a month
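The same arithmetic with the units spelled out, assuming (as the comment does) ~$100 of API-billed usage per session, 10 sessions a week on the $100 plan, the $200 plan at 4x that, and two subscriptions. Treating 1200 as a monthly API-equivalent figure for one $100 plan is an assumption, since the linked post is truncated here; the last line only annualises the weekly upper bound to show how far apart the two estimates are.

```python
def api_equivalent(value_per_unit: float, units: int, plan_multiplier: float, plans: int) -> float:
    """API-billing value of subscription usage, scaled by plan size and plan count."""
    return value_per_unit * units * plan_multiplier * plans


weekly_upper = api_equivalent(100, 10, 4, 2)  # $/week: 8000
monthly_floor = api_equivalent(1200, 1, 4, 2)  # $/month: 9600
print(weekly_upper, monthly_floor)
print(round(weekly_upper * 52 / 12))  # weekly upper bound as $/month: 34667
```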
About interruptions, one thing AI assisted coding really helps with is coding with constant interruption. I can leave CC for half an hour and return then tell it I had to step away, catch me up, and proceed. This works well for me.
What does this look like after 6-12 months? Like, how much code are you trying to write total?
Maybe it just doesn’t click in my mind, but sometimes I wonder about how much work people are trying to do and how they actually have enough to get done so quickly in such a short amount of time.
I've never worked on a complicated codebase that started out that way until the rest of the business concerns and office politics came into effect. People may not like it, but the bureaucracy is far and away more valuable than the core functionality.
Mature codebases are years of people thinking of all the possible gotchas while solving their acute pain points. This is not fluff, but the living and breathing part of it. Without that code, it's just a machine barely doing stuff in the most obtuse ways possible that nobody wants to pay for.
One could argue that they're putting LLMs to work on that finer-detail stuff, but AI is still far too dumb for it. No, what they're doing is playing with their Skinner box.
This is US-centric, but a $200 Claude Code sub plus a $100 Codex sub is a vast, vast amount of tokens. Enough to pay for itself many times over. It gives you exposure to the very edge of harnesses, and that experience is what's being hired for.
Isn’t there an argument this is possibly the best price to available performance for frontier models? Both due to subsidies and the distance between open and accessible alternatives?
From all the data, it looks like the 200usd we pay for monthly usage is subsidised… at break-even pricing … well, that 200 is starting to look like a few thousand.
No clue what y'all are doing, perhaps because I'm hobbying, and also I'm old and can perhaps do more of this by hand.
But I'm basically just doing what I did before, plus self-hosted Ollama and sometimes Gemini, and I feel like I'm moving at light speed compared to anything I've done before.
And I suppose this is still very fine-grained: I have it make a draft, then just have it fix/change things step by step.
I tried one of the bigger boys that can one-shot apps, which I guess is cool, but I'm finding it's just as hard to modify as if I'd grabbed someone else's repo from GitHub.
In fact all you've done is add a business cost.
-- Brain is an expensive, smart model from a Claude subscription: Fable 5 when it was available, Opus now.
-- Worker is a local model (qwen3.6:46B) deployed on a 36GB GPU, using Opencode + Ollama.
Brain is responsible for analysis/design and task creation. Tasks should be simple and clear so the worker can handle them. Worker does the coding. Brain validates and creates a fix task when required. At the moment the fix-to-task ratio is ~1:20.
If you don't have a GPU available at home, qwen3.6 is quite cheap on cloud providers.
It's a rather experimental setup, built out of curiosity, but it works better than I expected. It has let me run 3 coding agents nonstop; this is the 4th day now. Here I explain how I got there: https://news.ycombinator.com/item?id=48520757
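The brain/worker split described above can be sketched as a simple task-queue loop. This is a hypothetical illustration in Python, not the commenter's actual code: `plan`, `work`, and `validate` are placeholder callables standing in for the frontier model (brain) and the local model behind Opencode + Ollama (worker), and the `max_attempts` cap is an assumption.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    description: str
    attempt: int = 0
    feedback: str = ""  # brain's note on why the previous attempt was rejected


def run_pipeline(
    plan: Callable[[str], list[Task]],          # brain: goal -> small, clear tasks
    work: Callable[[Task], str],                # worker: attempt one task
    validate: Callable[[Task, str], bool],      # brain: accept or reject result
    goal: str,
    max_attempts: int = 3,                      # assumed cap, not from the source
) -> tuple[list[str], int]:
    """Hypothetical brain/worker loop.

    Returns (accepted results, number of fix tasks the brain created).
    """
    queue = deque(plan(goal))
    accepted: list[str] = []
    fix_tasks = 0
    while queue:
        task = queue.popleft()
        result = work(task)
        if validate(task, result):
            accepted.append(result)
        elif task.attempt + 1 < max_attempts:
            # Rejected: the brain queues a fix task instead of coding it itself
            fix_tasks += 1
            queue.append(Task(task.description, task.attempt + 1,
                              feedback=f"previous result rejected: {result}"))
        # Otherwise give up on the task (a real setup might escalate to a human)
    return accepted, fix_tasks


if __name__ == "__main__":
    # Toy stand-ins for the two models
    def toy_plan(goal: str) -> list[Task]:
        return [Task(f"{goal}: step {i}") for i in range(3)]

    def toy_work(task: Task) -> str:
        return f"{task.description} (attempt {task.attempt})"

    def toy_validate(task: Task, result: str) -> bool:
        # Reject step 1 on its first attempt only, to exercise the fix path
        return not (task.description.endswith("step 1") and task.attempt == 0)

    results, fixes = run_pipeline(toy_plan, toy_work, toy_validate, "add login form")
    print(len(results), fixes)  # 3 1
```

Keeping the worker's tasks small is what makes a cheap model viable in this design: the brain spends its expensive tokens only on planning and review, and writes a fix task when review fails.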
Just-in-time or dynamic precomputation of distilled models has already begun reducing the use of these frontier models for task inference.
Lately I've been able to cut down on token usage with context-mode and codebase-memory to wring more out of my subscription, as well as by doing things like making sure all terminal operations run in quiet mode. I've found codebase-memory particularly effective: it creates an index of your codebase that the agent can query for code tracing without reading all of the associated files, and I've also found it more accurate at analysis.
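The comment doesn't describe how codebase-memory builds its index, so the following is only a hypothetical illustration of the general idea: a symbol index an agent could query ("where is X defined, what reaches X") instead of opening files. It uses Python's standard `ast` module on Python sources; `build_index` and `trace_callers` are made-up names and are not that tool's implementation or API.

```python
import ast
from collections import defaultdict


def build_index(sources: dict[str, str]):
    """Map each function name to its definition site and its direct callers.

    `sources` maps a file path to Python source text. Only plain-name calls
    like `f(x)` are recorded; method calls like `obj.f(x)` are ignored, and
    if a name is defined twice the last definition wins.
    """
    definitions: dict[str, tuple[str, int]] = {}
    callers: dict[str, set[str]] = defaultdict(set)
    for path, text in sources.items():
        tree = ast.parse(text, filename=path)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                definitions[node.name] = (path, node.lineno)
                for inner in ast.walk(node):
                    if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                        callers[inner.func.id].add(node.name)
    return definitions, dict(callers)


def trace_callers(callers: dict[str, set[str]], name: str) -> set[str]:
    """Transitively collect every function that can reach `name`."""
    seen: set[str] = set()
    stack = [name]
    while stack:
        for caller in callers.get(stack.pop(), set()):
            if caller not in seen:
                seen.add(caller)
                stack.append(caller)
    return seen


if __name__ == "__main__":
    sources = {
        "db.py": "def lookup(user):\n    return USERS.get(user)\n",
        "auth.py": "def check(user):\n    return lookup(user) is not None\n",
        "api.py": "def login(req):\n    return check(req.user)\n",
    }
    defs, callers = build_index(sources)
    print(defs["lookup"])                          # ('db.py', 1)
    print(sorted(trace_callers(callers, "lookup")))  # ['check', 'login']
```

Answering "who reaches `lookup`" this way costs a dictionary lookup rather than reading three files into context, which is the kind of saving the comment describes.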
In the good ol' days, we bought machines not only to run stuff, but to experiment.
I understand today experiments are limited. Inference is reasonable, fine-tuning is either niche or a stretch, and base training is impossible.
*That is bound to change*, and when it does, there will be an avalanche of hobbyists and amateurs poking at base training. They'll find optimizations no one found before and synthesize data no one ever imagined synthesizing, and when that happens we'll start getting libre models.
So, yeah. Right now, buying the machine doesn't pay off that well, unless you want to pioneer this stuff under severely adverse conditions (inflated hardware prices, etc.). Eventually, it will.
3090s and 7900s are going well so far.
Next year, an Arc Pro B70 won't produce fewer tokens for you than it does today.
They aren't fast, but if you have flows where you can make money with them, they're a bargain in terms of price per GB.
This will probably become the only option as the companies that publish open weights stop doing that. Very very few people have enough hardware to train/fine-tune at home.
Depending on what one builds, comprehensive documentation plus applicable skills and memory tools often allow for a substantial reduction in the tokens the agent previously spent comprehending and remembering what is being built.
The opencode-go sub, at $10/mo, is amazing value. I’ve been using that and the assistant Kagi offers for web chat and research for months. For the smallish projects I work on at home, those have been great.
Sometimes I also get OpenCode Go just to access Chinese models, as an extra for new projects I don't really care about (fun ones).
That's way cheaper than hiring a dev anyways.
I feel like I must be missing something.
Yeah, every now and then you blow out the window limits. So you take a break and think about something else or go out and do something else...
Y'all know that's enough to buy a real human, right? Well, good for you. I'm going back to figuring out which of my 2 streaming subs is getting cut. Maybe my free crap is enough to figure that out for me while I make my own art.
To be frank, the time you spend constructing prompts, tasks, and everything else required to get your AI to do a thing was probably enough to do the thing yourself, research included.
It's good to see the world throw the concepts of art, pride, and general accomplishment in the trash. Why have friends and partners in projects when you could give your savings to Anthropic, OpenAI, or any number of companies already obtaining ungodly amounts of financing? A somewhat helpful bot, at the cost of who you are and your bank account.
What a world we live in.
I don't think it's feasible to have something comparable to these frontier models when they're increasing usage and lowering token costs.
If you hunt in the settings you can restrict your account to only use EU servers for inference... Which means you can't use a lot of the US frontier models, but you can use all the Chinese ones, albeit within EU GDPR, etc.
This to me is a good compromise between privacy and cost.
As usual, an extraordinary claim without extraordinary evidence: https://stephen.bochinski.dev/apps/
My baseline is Sonnet 4.6. Honestly, I think it's good enough for most tasks. So, from what I see, we're already at a point where we don't need frontier models for serious coding and debugging. Give it a couple of years and that level will fit in 120B models.
At the same time, we've seen the rise of unified-memory systems like DGX or Strix Halo that will make it possible to run models of this size for "cheap" in the medium term.
That's what I'm betting on: that in 2 years I can buy a system for about $2,500 that will run a model similar to Sonnet 4.6 locally.
I might be spectacularly wrong though. But I'm willing to wait and use subscriptions/API calls for now.