also, someone rightly predicted this rugpull coming in when they announced 2x usage - https://x.com/Pranit/status/2033043924294439147
The same as charging a different toll price on the road depending on the time of day.
I was trying to get the Alibaba plan but missed the mark. I'm curious to try out the MiniMax coding plan ($10/mo) or Kimi ($20/mo) at some point to see how they stack up.
For pricing: GLM was $180 for a year of their pro tier during a Black Friday sale, and GHCP was $100/year, but they don't have the annual plan anymore, so it's now $120. Alibaba's only coding plan today is $50/mo, too rich for me.
If you want stability, own the means of inference and buy a Mac Studio or Strix Halo computer.
Someone spread FUD on the internet, incorrectly, and now others are spreading it without verifying.
Yes, it was FUD, but it ended up being correct. Given Anthropic's track record (e.g., the months-long denial of dumbed-down models last year, only to later confirm it as a "bug"), this just continues to erode trust, and such predictions are the result of that.
Don't you guys have hard business problems that AI just can't solve, or solves only very slowly, presenting you 17 ideas until it finds the right one? I'm using the most expensive models.
I think the nature of AI might block that progress; some companies have woken up to this, and others will wake up later.
The mistake rate is just too high. And every system you implement to reduce that rate has a mistake rate as well and increases complexity and the necessary exploration time.
I think a big bulk of people are where the early adopters were in December. AI can implement functional features on a well-maintained codebase.
But it can't write maintainable code itself. It actually makes you slower compared to assisted-writing the code, because when assisting you are much more in the loop and can stop a lot of small issues right away. And you iterate on everything fast.
I hadn't opened my IDE for a month, and at one point it became hell. I've now deleted 30k lines, and the number of issues I'm seeing has been an eye-opening experience.
Unscalable performance issues, verbosity, straight-up bugs, escape hatches around my verification layers, quadrupled types.
Now I could monitor the AI output more closely, but then again I'm faster writing it myself, because it's a single task. AI-assisted typing isn't slower than my brain.
Also, thinking about it more: FAANG pays $300 per line in production, so what are we really trying to achieve here? Speed was never the issue. A great coder writes 10 production lines per day.
Accuracy, architecture, etc. are the issue. You get there by building good, solid fundamental blocks that make feature additions easier over time, not slower.
I think it splits into two camps: those betting that AI will keep improving on these issues, and those countering that it won't.
I don't know for sure, but to me it seems the last two years weren't necessarily 'intelligence' improvements so much as post-training improvements and tool connections, plus reduced censorship.
I'm now using less AI than ever, and I was burning $1000/month before Claude Code. I have a couple of really fundamental functions built that help me solve a big chunk of specific problems, and I can build a lot on top of them. Adding functionality became easier, not more complicated.
I would guess that for the business problems I'm facing, AI is right less than 30% of the time. For example: deciding how to set up databases for maximum efficiency, or how to write efficient queries. Everything that, in the end, is a real moat compared to your vibe-coded competitors.
From my personal experience, I've seen a lot of vibe-coded companies get stuck, barely adding necessary functionality or features, and my guess is that they don't trust changes anymore.
So even if AI were as good as a really good coder, one thing would still be missing: a person who knows exactly what is happening.
And okay, it might write a form real quick. But a modern form needs to do a lot of things, and if you have established patterns for all kinds of inputs, the implementation is mundane.
It's like when you learn coding: type it yourself to learn. So if you can't scale the AI-only codebase, at some point you have to learn it, and I'd argue that right now the most efficient way is to write in it yourself.
And I'm also arguing that it's really tough to get software so good that it's actually an asset on the market when it's vibe-coded only. It seems more like a drug for wannapreneurs than actually building an asset.
Like it builds you a Netflix clone, but what you see is barely the code you need to write a Netflix competitor.
- performance is continuing to increase incredibly quickly, even if you rightfully don't trust any particular evaluation. See scaling laws like Chinchilla, and RL scaling laws (both training-time and test-time)
- coding is a verifiable domain
The second one is most important. Agent quality is NOT limited by human code in the training set, this code is simply used for efficiency: it gets you to a good starting point for RL.
The burden of proof is on the claim that things will NOT reach superhuman performance, including on all end-to-end tasks: understanding a vague, poorly articulated business objective, architecting a system, building it out, testing it, maintaining it, fixing bugs, adding features, refactoring, etc. That's because we literally can predict performance (albeit performance has a complicated relationship with benchmarks vs. the real world).
Yes, definitely: error rates are still too high for this to be trusted end to end, but they are improving consistently, and that is what the METR time-horizon benchmark shows.
Either way… we badly need more innovation in inference price per performance, on both the software and hardware side. It would be great if software innovation unlocked inference on commodity hardware. That’s unlikely to happen, but today’s bleeding edge hardware is tomorrow’s commodity hardware so maybe it will happen in some sense.
If Taalas can pull off burning models into hardware with a two-month lead time, that will be huge progress, but still wasteful, because then we've just shifted the problem to a hardware bottleneck. I expect we'll see something akin to Game Boy cartridges that are cheap to produce and can plug into base models to augment specialization.
But I also wonder if anyone is pursuing some more insanely radical ideas, like reverting back to analog computing and leveraging voltage differentials in clever ways. It’s too big brain for me, but intuitively it feels like wasting entropy to reduce a voltage spike to 0 or 1.
If this direction holds, the ROI gets much cheaper.
Instead of employing four people (customer support, PM, engineering, marketing), you will have 3-5 agents, and the whole ticket flow might cost you ~$20.
But I hope we won't go this far, because when things fail every customer will be impacted, because there will be no one who understands the system to fix it
Sadly enough I have not seen this happening in a long time.
Stripe is apparently pushing a gazillion PRs now from Slack, but their feature velocity hasn't changed. So what gives?
How is it that the number of PRs is now the primary metric of productivity, and nobody cares about what is being shipped or whether we're shipping product faster? It's total madness right now. Everyone has lost their collective minds.
I'm not seeing the apps, SaaS, and other tools I use getting better, with either more features or fewer bugs.
Whatever is being shipped, as an end user, I'm just not seeing it.
It's baffling to see these comments on Hacker News, though. I guess you have to prove you're not a luddite by making "AI forward" predictions and showing that you "get it".
All Chinese labs have to do to tank the US economy is to release open-weight models that can run on relatively cheap hardware before AI companies see returns.
Maybe that's why AI companies are looking to IPO so soon, gotta cash out and leave retail investors and retirement funds holding the bag.
Of course it's in the areas where it doesn't matter as much, like experiments, internal tooling, etc, but the CTOs will get greedy.
I still can't get a good mental model for when these things will work well and when they won't. Really does feel like gambling...
It's the "robots will just build/repair themselves" trope but the robots are agents
Oh wait. That's already here and is working fine.
I am already at the point where because it is just the two of us, the limiting factor is his own needs, not my ability to ship features.
So, we will give these 3 or 4 trusted users access to an on-site chat interface to request updates.
Next, a dev environment is spun up, agent makes the changes, creates PR and sends branch preview link back to user.
Sort of an agent driven CMS for non-technical stakeholders.
Let’s see if it works.
But I do think that even now, certain types of CRUD apps can be largely automated. And that's a fairly large part of our profession.
A PR tells me what changed, but not how an AI coding session got there: which prompts changed direction, which files churned repeatedly, where context started bloating, what tools were used, and where the human intervened.
I ended up building a local replay/inspection tool for Claude Code / Cursor sessions mostly because I wanted something more reviewable than screenshots or raw logs.
We don't have product managers or technical ticket writers of any sort.
But we devs are still choosing how to tackle each ticket. We definitely don't have to, since I'm solving the tickets with AI; I could automate my job away if I wanted, but I wouldn't trust the result, as I give a degree of input and steering, and there are bigger-picture considerations it's not good at juggling, for now.
So one user's experience is relevant to another, so they can learn from one another?
(That's basically what A/B testing is about.)
But the entire SWE apparatus can be handled.
Automated A/B testing of the feature. Progressive exposure deployment of changes, you name it.
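Progressive exposure usually comes down to deterministic hash bucketing, so a user stays in (or out of) the rollout as the percentage ramps up. A minimal sketch; the feature name and ramp percentages here are made up:

```python
# Deterministically bucket users so a feature can be ramped from 1% to 100%
# without users flapping between variants between evaluations.
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """True if this user falls inside the first `percent` of 100 buckets."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Ramping only ever grows the percentage, so an enabled user stays enabled.
for pct in (1, 5, 25, 100):
    enabled = [u for u in ("alice", "bob", "carol")
               if in_rollout(u, "new-checkout", pct)]
```

Because the bucket depends only on (feature, user) and not on time, re-evaluating the flag is always consistent; that's what makes automated rollback and A/B comparison tractable.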
At least in my company we are close to that flywheel.
Tickets may well not look like they do now, but some semblance of them will exist. I'm sure someone is building that right now.
No. It's not Jira.
There's a lot of experimentation right now, but one thing that's guaranteed is that the data gatekeepers will slam the door shut[1], or install a toll booth, once there's less money sloshing about and the winners and losers are clear. At some point in the future, Atlassian and GitHub may not grant Anthropic access to your tickets unless you're on the relevant tier with the appropriate "NIH AI" surcharge.
1. AI does not suspend or supplant good old capitalism and the cult of profit maximization.
You need to write a clearer prompt.
Your AI assistant orders an experimental jetpack from a random startup lab. Would you have honestly guessed that the prompt was "ambiguous" before you knew how the AI was going to act on it ?
Yes, you can give the Big Brain Thing a vague task and expect results; sometimes it'll do it right.
But if you want repeatability, give it tools to determine what is a "disruption".
I have this exact system running on a WhateverClaw with a few simple tools: weather, train time tables (I commute via train) and a read-only version of my "will I be at the office or remote today" -calendar via a tool. Oh and a tool to notify me via Telegram.
It gets this information using the given tools and determines if it's worth notifying me.
TBH this doesn't need a "claw", I could just run the tools in cron, construct a prompt with that data and run any LLM on it.
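That cron variant can be sketched like this. The tool functions below are stubs standing in for real weather/timetable/calendar lookups, and the LLM call and Telegram notification are left as placeholders:

```python
# Cron-driven sketch: gather data from a few tools, build one prompt, and let
# an LLM decide whether a notification is warranted. Everything marked
# "placeholder" would be wired to real APIs in practice.

def get_weather() -> str:
    return "Light snow, -3C"            # placeholder: call a weather API

def get_train_status() -> str:
    return "07:42 to Central: on time"  # placeholder: call a timetable API

def get_calendar() -> str:
    return "Office day"                 # placeholder: read-only calendar lookup

def build_prompt() -> str:
    return (
        "You are a commute assistant. Given the data below, reply with a short\n"
        "notification ONLY if there is a disruption worth telling me about;\n"
        "otherwise reply with exactly 'NO_ALERT'.\n\n"
        f"Weather: {get_weather()}\n"
        f"Trains: {get_train_status()}\n"
        f"Calendar: {get_calendar()}\n"
    )

if __name__ == "__main__":
    prompt = build_prompt()
    # placeholder: send `prompt` to any LLM; forward any non-NO_ALERT reply
    # via a Telegram bot API call
    print(prompt)
```

Note the prompt pins down what "disruption" means and forces a sentinel reply (`NO_ALERT`), which is what makes the cron run cheap to post-process.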
You'll define exactly what good looks like.
Yesterday, I spent the entire day trying to set up "Claude on the web" for an Elixir project and eventually had to give up. Their network firewall kept killing Hex/rebar3 dependency resolution, even after I selected "full" network access.
The environment setup for "on the web" is just a bash script. And when something goes wrong, you only see the tail of the log. There is currently no way to view the full log for the setup script. It's really a pain to debug.
The Copilot equivalent to "Claude on the web" is "GitHub Copilot Coding Agents," which leverages GitHub Actions infrastructure and conventions (YAML files with defined steps). Despite some of the known flaws of GitHub Actions, it felt significantly more robust.
"Schedule task on the web" is based on the same infrastructure and conventions as "Claude on the web", so I'm afraid I'm gonna have the same troubles if I want to use this.
"Your plan gets 3 daily cloud scheduled sessions. Disable or delete an existing schedule to continue."
But otherwise, this looks really cool. I've tried using local scheduled tasks in both Claude Code Desktop and the Codex desktop app, and very quickly got annoyed with permissions prompts, so it'll be nice to be able to run scheduled tasks in the cloud sandbox.
Here are the three tasks I'll be trying:
Every Monday morning: Run `pnpm audit` and research any security issues to see if they might affect our project. Run `pnpm outdated` and research into any packages with minor or major upgrades available. Also research if packages have been abandoned or haven't been updated in a long time, and see if there are new alternatives that are recommended instead. Put together a brief report highlighting your findings and recommendations.
Every weekday morning: Take a look at Sentry errors, logs, and metrics for the past few days. See if there are any new issues that have popped up, and investigate them. Also check whether anything in the logs and metrics seems out of the ordinary, and investigate as appropriate. Put together a report summarizing any findings.
Every weekday morning: Please look at the commits on the `develop` branch from the previous day, look carefully at each commit, and see if there are any newly introduced bugs, sloppy code, missed functionality, poor security, missing documentation, etc. If a commit references GitHub issues, look up the issue, and review the issue to see if the commit correctly implements the ticket (fully or partially). Also do a sweep through the codebase, looking for low-hanging fruit that might be good tasks to recommend delegating to an AI agent: obvious bugs, poor or incorrect documentation, TODO comments, messy code, small improvements, etc.
I ran all of these as one-off tasks just now, and they put together useful reports; it'll be nice getting these on a daily/weekly basis. Claude Code has a Sentry connector that works in their cloud/web environment. That's cool; it accurately identified an issue I've been working on this week.
I might eventually try having these tasks open issues or even automatically address issues and open PRs, but we'll start with just reports for now.
Seems trivial.
But you can set up a `claude -p` call via a cronjob without too much hassle, and that can use subscriptions.
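For instance, something like this in your crontab (the schedule, repo path, and prompt are illustrative; `claude -p` is Claude Code's non-interactive print mode and uses your existing login):

```shell
# Crontab entry (edit with `crontab -e`): every weekday at 07:00, run a
# one-shot Claude Code prompt and append the output to a log file.
0 7 * * 1-5  cd /path/to/repo && claude -p "Summarize yesterday's commits in this repo" >> "$HOME/claude-reports.log" 2>&1
```

The main caveat versus the cloud feature is that your machine has to be on at the scheduled time.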
There is also a mindset of avoiding boring polling loops and preferring event-driven solutions for optimal resource usage, so people have a kind of blind spot for this functionality.
- IFTTT was great when it started; at some point, it became... weird, in a "I don't even know what's going on on my screen, is this a poster or an app" kind of way.
- Zapier is an impenetrable mess that evidently targets marketers and other business users; discovery is hard, and even though it seems to have everything, it, like all tools in this space, is always missing the one feature you actually need.
- Yahoo Pipes, I heard they were great, but I only learned about them after they shut down.
- Apple Shortcuts - not sure what you can do with those, but over the years of reading about them in HN comments, I think they may be the exception here, in being both targeting regular users and actually useful.
- Samsung Modes and Routines - only recently becoming remotely useful, so that's nice, even if vendor-restricted.
- Tasker - an Android tool that actually manages to offer useful automation, despite the entire platform/OS and app ecosystem trying its best to prevent it. Which is great, if your main computer is a phone. It sucks in a world of cloud/SaaS, because it creates a silly situation where e.g. I could nicely automate some things involving e-mail and calendars from Tasker + FairEmail, but... well, my mailboxes and calendars live in the cloud, so some of that would conflict with using the vendor's (Fastmail) webapp or any other tool.
Or, in short: we need Tasker but for web (and without some of the legacy baggage around UI and variable handling).
The sorry state of automation is not entirely, or even mostly, the fault of the automation platforms. I may have issues with some UI and business choices some of these platforms made, but really, the main issue is that integrations are business deals and the integrated sides quickly learned to provide only a limited set of features - never enough to allow users to actually automate use of some product. There's always some features missing. You can read data but not write it. You can read files and create new files but not edit or delete them. You can add new tasks but can't get a list of existing ones. Etc.
It's another reason LLMs are such a great thing to happen - they make it easy (for now) to force interoperability between parties that desperately want to prevent it. After all, worst case, I can have the LLM operate the vendor site through a browser, pretending to be a human. Not very reliable, but much better than nothing at all.
Expectations - the functionality of "do X on a timer" needs to be offered to users as a proper end-user feature[0], not treated as a sysadmin feature (Windows, Linux) or not provided at all (Android). Once people start seeing it on their own devices, they'll start using it, then expecting it, and the web will adjust too[1].
UI - somehow this escapes every existing solution, from `cron` through Windows timers to every web "on timer" event trigger on any platform. There already exists a very powerful UI paradigm for managing recurring tasks, one most normies know how to use because they already use it daily, at work and privately: a calendar. Yes, that thing where we can set and manage recurring events and see them at a glance, in the context of everything else going on in our lives.
--
<rant>
I know those are hard problems, but are hard mostly because everybody wants to be the fucking one platform owning users and the universe. This self-inflicted sickness in computing is precisely why people will jump at AI solutions for this. Why I too will jump on this: because it's easier than dealing with all the systems and platforms that don't want to cooperate.
After all, at this point, the easiest solution to the problems I listed above, and several others in this space, would be to get an AI agent that I can:
1) Run on a cron every 30 minutes or so (events are too complicated);
2) Give it read (at minimum) access to my calendar and todo lists (the ones I use, but I'm willing to compromise here);
3) Give it access to other useful tools
Which I guess brings us to the actual root problem here. "Run tasks on a cron" and "run tasks on trigger" are basically just another way of saying unattended/non-interactive usage. That is what is constantly being denied end users.
This is also the key to enabling most value of AI tools, too, and people understand it very well (see the popularity of that Open Claw thing as the most recent example), but the industry also lives in denial, believing that "lethal trifecta" is a thing that can be solved.
</rant>
--
[0] - This extends to event-trigger ("if X happens, then") automation, and to end-user automation in everyday life generally. I mean, it's beyond ridiculous that the only things normal people are allowed to run automatically are a dishwasher and a laundry machine (and, in the previous era, VCRs).
[1] - As a side effect, it would quickly debullshitify "smart home" / "internet of things" spaces a lot. The whole consumer side of the market revolves around selling people basic automation capabilities - except vendor-locked, and without the most useful parts.
Same. Sometimes it is just people overeager to play with new toys, but in our case there is a push from the top & outside too: we are in the process of being subsumed into a larger company (completion due on April the 1st, unless the whole thing is an elaborate joke!) and there is apparently a push from the investors there to use "AI" more in order to not "get left behind the competition".
This company already does some pretty cool stuff with statistics for forecasting but now they are pivoting their roadmap to bake in GenAI into their offering over some other features that would be more valuable to their clients.
I wrote this to help people (not just Devs) reason about agent skills
https://alexhans.github.io/posts/series/evals/building-agent...
And this one to address the drift of non-determinism (though depending on the audience it might not resonate as much):
https://alexhans.github.io/posts/series/evals/error-compound...
Grok has had this feature for some time now. I was wondering why others haven't done it yet.
This feature increases user stickiness. They give 10 concurrent tasks for free.
I have had to extract specific news first thing in the morning across multiple sources.
It doesn't allow egress curl, apart from a few hardcoded domains.
I have created Cronbox in the cloud which has a better utility than above. Did a "Show HN: Cronbox – Schedule AI Agents" a few days back.
and a pelican riding a bicycle job -
https://cronbox.sh/jobs/pelican-rides-a-bicycle?variant=term...
Is this assuming you give it git commit permission and it just does that? Or it acts through MCP tools you enable?
Anthropic wants a world where they own your agent where it can't exist outside of the Claude desktop app or Claude Code.
There could exist a world where your agent isn't confined by the whims of a corporation.
You misspelt ">95% discount relative to API pricing" ;)
I run conferences and I like to have photos of delegates on the page so you can see who else is attending.
I wanted to automate this by having Claude go to the person’s LinkedIn profile and save the image to the website.
But it seems it won’t do that because it’s been instructed not to.
That's not unique to LinkedIn but what is somewhat unique is the strong linkage to real world identities, which raises the cost of Sybil attacks on personal networks with high trust.
It's a game changer.
Edit: my mistake. It's inferior to a cron job. If my repos happen to be self-hosted on Forgejo or Codeberg, it won't even work. If I concede to use GitHub, though, I don't have to set up any env variables. Schedule lock-in, all over the web.
I feel this is rooted in problems that extend beyond computing. Regular people are not allowed to automate things in their life. Consider that for most people, the only devices designed to allow unattended execution off a timer are a washing machine, some ovens and dishwashers, and an alarm clock (also VCRs in the previous era). Anything else requires manual actuation and staying in a synchronous loop.
Of course a provider can offer convenient shortcuts, but at the cost of getting tied into their ecosystem.
Anthropic is clearly battling an existential threat: what happens when our paying users figure out they can get a better and cheaper model elsewhere.
They solved that with subscriptions. For end users (and developers using AI for coding), it makes no sense to go for pay-as-you-go API use, as anything interesting will burn through more than a monthly subscription's worth of $$$ in API costs within a few hours to days.
Such a service will always be destroyed by the bell-ends who want to run spam or worse activities.
(And on Android, AFAIK there's exactly nothing at all; there isn't even common support for any kind of basic automation. The only recent exception is Samsung. Among third-party apps there has always been Tasker, which is very powerful, but the UX almost makes you want to learn to write Android apps instead.)
I think the core problem is not so much that it is not "allowed", but that even the most basic types of automation involves programming. I mean "programming" here in the abstract sense of "methodically breaking up a problem into smaller steps and control flows". Many people are not interested in learning to automate things, or are only interested until they learn that it will involve having to learn new things.
There is no secret conspiracy stopping people from learning to automate things, rather I think it's quite the opposite: many forces in society are trying to push people to automate more and more, but most are simply not interested in learning to do so. See for example the bazillion different "learn to code" programs.
Computing isn't, and has never been, demand-driven. It's all supply-driven. People choose from what's made available by vendors, and nobody bothers listening to user feedback.
https://imgur.com/a/apero-TWHSKmJ
Cron triggers (or specific triggers per connector like new email in Gmail, new linear issue, etc for built in connectors).
Then you can just ask in natural language when (whatever trigger+condition) happens do x,y and z with any configuration of connectors.
It creates an agentic chain to handle the events. Parent orchestrator with limited tools invoking workers who had access to only their specific MCP servers.
Official connectors are just custom MCP servers and you could add your own MCP servers.
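A toy sketch of that parent/worker split. The worker names, tool names, and routing rule below are all made up; in the real system each worker would talk to its own scoped MCP server:

```python
# Parent orchestrator routes an event to exactly one worker, and each worker
# only ever sees its own tool list (standing in for a scoped MCP server).

WORKERS = {
    "email":  {"tools": ["gmail.search", "gmail.read"]},
    "issues": {"tools": ["linear.list", "linear.update"]},
}

def route(event_type: str) -> str:
    # Orchestrator decision: pick one worker; never expose all tools at once.
    return "email" if event_type == "new_email" else "issues"

def handle(event_type: str, payload: str) -> dict:
    worker = route(event_type)
    return {
        "worker": worker,
        # The only tools this worker is allowed to call for this event.
        "allowed_tools": WORKERS[worker]["tools"],
        "payload": payload,
    }
```

The point of the split is least privilege: a prompt-injected email can at worst misuse the email worker's tools, not the issue tracker's.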
I definitely had the most advanced MCP client on the planet at that point, supporting every single feature of the protocol.
I think that's why I wasn't blown away by OpenClaw, I had been doing my own form of it for a while.
I need to release more stuff for people to play around with.
My friends had use cases like "I get too many emails from my kids school I can't stay on top of everything".
So the automation was just asking "when I get an email from my kids school, let me know if there's anything actionable for me in it"
Push events into a running session with channels: https://news.ycombinator.com/item?id=47448524
I use it to:
- perform a review of the latest code changes to update my documentation (security policies, user documentation, etc.)
- perform a review of the latest code changes, triage them, deduplicate, and improve the code; I review the findings, close them with comments for over-engineering, or add a review for auto-fix
- perform a review of open GitHub issues with a given label, select the one with the highest impact, comment with the rationale, implement it, and make a pull request; I wake up to a few pull requests fixing issues that I can approve/finish in an existing Claude Code thread
I also want to use it to: review recent Sentry issues, make GitHub issues for the ones with the highest priority, and make pull requests with proposed fixes, so I can just wake up and see that some crash is ready to be resolved.
The limit of 3 scheduled jobs is pretty impactful, but playing with it gave me some nice ideas for reducing my manual work.