> Our browsers avoid blocks 81% of the time on our stealth benchmark, and 84.8% on Halluminate BrowserBench, the highest of any provider.
Seems very unethical, no? Who uses service providers like this? The whole point of anti-bot measures is to get rid of bots - you are not wanted there.
These kinds of services inevitably make the web more human-hostile and expensive. Websites will continue pushing back on automated usage, meaning more hurdles to access content.
No doubt part of why we see this push for verified ID on the web - not just age gating and "protect the children", but also protect sites from bots, and protect ad revenue (not a statement of support; just seems like an obvious higher order effect)
I use change detection to monitor all sorts of websites for changes. Some of my favorite authors don't have RSS. I always set up price monitoring for any big ticket item I'm considering like appliances so I can see how their pricing changes over time. I also use scrapers for websites that don't have an API. I like having all of my purchase history indexed in a database where I can do analysis.
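For anyone curious what this kind of change detection boils down to, here is a minimal sketch, assuming you already have the page text in hand. The function names are made up for illustration, and real tools do smarter diffing than a whole-page hash.

```python
import hashlib
from typing import Optional, Tuple


def fingerprint(content: str) -> str:
    """Return a SHA-256 hex digest of the whitespace-normalized text."""
    # Collapse runs of whitespace so trivial reflows do not count as changes.
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_digest: Optional[str], content: str) -> Tuple[bool, str]:
    """Compare new content with the last stored digest.

    Returns (changed, current_digest). The first observation (no previous
    digest) is never reported as a change.
    """
    current = fingerprint(content)
    changed = previous_digest is not None and previous_digest != current
    return changed, current
```

You would store the returned digest per URL and alert only when `changed` is true, which also keeps the polling side cheap.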
> These kinds of services inevitably make the web more human-hostile and expensive.
I would rather not have to spend more time circumventing stupid bot detection things. I would be more than happy to pay for access to some of this data that I cannot access any other way... but sure, let's keep burning resources on a cat-and-mouse game that scrapers will always be able to win.
Have you considered offering, as penitence, a public feed to share the information that this process produces?
Yep!
They're busy people or just don't feel the need to do anything beyond hit the "publish" button on their CMS and call it good and that's fine / why I have a robot to make an RSS for me :).
You state that you believe you deserve access to others’ resources, at their cost, despite their clear attempts to stop you from using them, simply because you want it.
Dynamic pricing designed to extract every penny out. Then why shouldn't I be allowed to monitor your pricing changes?
They do not.
"pay to crawl" sounds like the absolutely laziest possible way that a particular site could bolt on an API.
What is likely unethical is the fact that they offer residential proxies. The residential providers of those proxies are frequently not aware they’ve been opted in to provide such a service.
≠ ethical
In my opinion, a directory of subsidized housing should have been provided by local governments and not through a plethora of real estate websites.
Unethical just because it does something someone else doesn't want? I guess it depends on why and what the intention is. I don't have time to sit 24/7 in front of a computer to get a ticket to some events, does that mean it's unethical for me to use my own bot so I can purchase a ticket to bands I'm a fan of? Probably not. But if I did so for scalping purposes? Then yeah, I'd agree it's unethical.
The whole point of anti-anti-bot measures is to be able to do things even if others don't think that thing should be automated, so from the hacker news audience, I think quite a lot of us have at one point or another engaged in stuff like that. Doing so merely for profits of course stinks, but for you to be able to have a fighting chance against scalpers? Probably OK.
It's an interesting thought that can be further explored. Could anything that's considered "unwanted" by a third party be considered unethical if I do it anyway?
If the hotel self-service restaurant has a sign "don't take the food out" and I take 1 apple in my pocket for a snack, is it unethical? Or maybe the sign is just for people that would otherwise take $100 of watermelons out of the cantina daily and try to resell it on the beach.
I know there's a relationship between mileage and depreciation, but wanted to have a better sense of what that relationship is to know whether a given car was over or underpriced.
Similarly, if I was pulling that data to build a service of my own to offer to users... is that unethical?
Time was you could get lovely JSON feeds from every site by iterating on the inspector's curl statement. Nowadays you can't even use Selenium without Cloudflare getting grouchy. Last fall I had to build my spreadsheet like a caveperson: Ctrl-C, Ctrl-V. It wouldn't be so bad if the dealer aggregators' coverage was xor, but you have to dedupe listings. Then there's the whole matter of online salespeople who don't show up at the dealership.
Is scalping actually unethical though? Sure it's unpopular, but I'd argue that's just your average person not properly grasping supply and demand and thinking through the consequences. If you want to sell something below market then you should raffle it off and take extensive measures to prevent transfer of ownership. The current practice is trying to pretend it's an open market while fixing the price, then getting angry when the obvious consequences materialize. Scalpers are merely the agents that correct a market inefficiency introduced by a dysfunctional status quo.
If you saw a sign in a store that said "1 per person" or "for registered guests only", would you ignore it?
Seems like doing business with other people should normally be based on mutual consent, not whatever you can get away with technically.
People who don't want their headless browser to get blocked?
I'm familiar with companies automating access to software that is only accessible via the web, with poor or no API support. This is software they pay for (usually a lot of money), and it usually has built-in captchas to guard logins. They aren't a large enough customer to ask for the removal of these captchas or to be whitelisted (they're just one of many SaaS tenants), so they simply work around that restriction.
I don't think one can judge it ethically without considering the context. Are we talking about mass automated scraping? Or are we talking about me trying to get a good deal by scraping local used car dealership listing once per day for my personal need (just so I don't have to do it manually)?
One of these is strictly more ethical, but both will be blocked by Cloudflare for example. I'd happily use such service in my personal case.
Now that there is an alternative (namely AI), people (including me) are flocking to the alternative. You want to frame this as unethical bots versus ethically acceptable human site visitors, but the main motivation for the use of scraping bots these days is to provide services (i.e., AI-based question answering) that users (like me) consider far superior to going directly to web sites for information, because visiting web sites with a web browser is a frustrating, tedious experience.
For example, at a startup a few years ago, one of the many technological things we needed to do was to monitor marketplaces for suspected counterfeit and contract-violating gray market goods for ~100 brands. And we couldn't just ask for data feeds, because, well, the marketplaces make money off of all those sales. And the off-the-shelf third-party data solutions were useless crap quality, worse than your average vibe-coding. So I made a bespoke crawler that gently and accurately tracked the data we needed, including global geofencing. So gently, I never got a whiff of disapproval or countermeasures (like throttling, 403, nor data poisoning). We were putting insignificant load on the marketplaces, for the purpose of helping to make the market better for both consumers and legitimate businesses. It was like a single "secret shopper" unobtrusively walking around some parts of a store. (And I also made an iOS app that did something different for actual secret shoppers in physical stores, for legitimate supply chain traceability for customers' brands.) Personally, I love the marketplaces, and hate the counterfeits, and this was my version of PG's advice that startups should be a little bit naughty.
Two of the problems with the current AI scrapers, which are destroying servers, and inviting backlash:
1. The gold rush situation brings out many of the crappiest people in the world. And also many who aren't crappy might behave in a crappy manner. (The latter, maybe because they're just emulating what they see, or extrapolating from the ethical temperature of prior industry norms, like surveillance capitalism in everything.)
2. Many of these scrapers are shockingly bad at what they do, and grossly inefficient. Almost like they're just pounding the same unchanging resources to DoS the servers for competitors. Or to drive sites to a protection racket company that's set up so they can also monitor cleartext. Or (Occam's Razor) just plain bad at what they do, and the people who pay for the salaries and computer resources either don't know or don't care.
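On the "pounding the same unchanging resources" point: the standard HTTP fix is a conditional request, so the server can answer 304 Not Modified instead of resending the body. A minimal sketch with the standard library follows; the helper name is made up, and the caller is assumed to have stored the `ETag` / `Last-Modified` values from the previous response.

```python
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET that lets the server reply 304 if nothing changed."""
    req = urllib.request.Request(url)
    if etag:
        # Validator taken from the previous response's ETag header.
        req.add_header("If-None-Match", etag)
    if last_modified:
        # Timestamp taken from the previous response's Last-Modified header.
        req.add_header("If-Modified-Since", last_modified)
    return req
```

When passed to `urllib.request.urlopen`, a 304 reply surfaces as an `HTTPError` with `code == 304`, which the crawler can treat as "keep the cached copy".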
For example, Claude has a lot of trouble reading HN's front page. HN itself is fine, but the moment you ask it to pick out an article, it often chokes. The website has put up a verification captcha, or it's a paywall, etc. Paywalls can be bypassed by reading HN comments and looking for archive links. But those archives often block bots too, so you're back to square one.
Whether it's unethical is an interesting question. I believe I should have the right to do what I want with internet content, as long as I'm not abusive. Merely having a bot isn't abusive. It would be one thing if the bot is hammering a server or vacuuming up training data, but having a bot at all is presently very hard.
This service caught my attention because it could potentially solve the problem I'm running into. Simply taking snapshots of articles that hit HN shouldn't be so hard, but it is. HN sends millions of views to websites; one bot taking a snapshot isn't going to make a difference. I don't think it counts as "unethical" just because we're going against the website owner's wishes. When you post content to the internet, you sign up to share that content with everyone, other than what's denied by robots.txt. If it's not blacklisted by robots.txt, it should be possible for well-behaved bots to access.
I don't expect very many people here to care about the poor bot creators. Most of the bot creators are malicious anyway. But I personally lament the loss of being able to write a program that can process information from the browser in arbitrary ways. You should be able to, yet we're buying into the notion that it's okay for website owners to say "this content is only accessible by approved bots like Google, and everyone else can sod off."
HN proves it doesn't need to be like that. It gets dozens of millions of page views a day, a lot of which is bot traffic. HN only uses captchas for creating accounts or logging in. You're free to scrape any content as long as you respect the crawl delay of 30 seconds specified in robots.txt, and don't try to visit links that perform actions a human would take (like adding things to favorites or voting). That's how the internet should work: just deliver content.
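To make the "well-behaved bot" idea concrete, here is a small sketch using Python's standard `urllib.robotparser`. The rules below are illustrative only (the disallowed paths are assumptions, not a copy of any real site's robots.txt), and `polite_plan` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; not copied from any real site's robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /vote
Disallow: /fave
Crawl-delay: 30
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network I/O."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_plan(parser, agent, urls):
    """Return (url, delay_seconds) pairs for the URLs the rules allow."""
    delay = parser.crawl_delay(agent) or 0
    return [(u, delay) for u in urls if parser.can_fetch(agent, u)]
```

The caller would then sleep for `delay_seconds` between fetches and simply never request the action links.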
until half of HN users start asking their agent to do the same, to summarize the top HN articles every day
For example I could write in my Terms of Service that you do not view more than one page on my website and expect you to send me a written permission to read the rest. I don't expect anybody to follow and I sure don't think less of those that do.
The push for verified IDs is not related to this; it's more of a politically motivated attempt at selling fear to justify more surveillance.
To me, archiving the internet is way more ethical than putting the bulk of the content behind a paywall.
Authors and publishers own their content. Expecting the work of others to always be free doesn't sound very ethical.
Firecracker isolates the host kernel from the guest microVM. On AWS, you use an Amazon Machine Image (AMI) to specify the OS and other components and libraries installed on an EC2 server such as c5.metal, or, if you're using nested virtualization, you can use c8i, s8i, or m8i instances at a discount of about 80%-90%, at some cost in performance and otherwise, and you bundle Linux along with the Firecracker binary. Then you compile a build artifact including `rootfs` for the baked Firecracker image, which is the microVM image (analogous to a Docker image that results from executing `docker build`). The microVM process has its own virtual kernel and is a guest on the host machine. So, for instance, you can place Docker inside the microVM, and the container then executes against the microVM kernel, not the host EC2 kernel. Communication between the two is achieved securely using `vsock` and probably something like `socat`, so that data travels, say, from guest RAM to host RAM directly to an S3 quarantine bucket without ever touching the host's kernel or filespace.
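For a feel of how the pieces above (guest kernel, `rootfs`, `vsock`) get wired together, here is a sketch that builds a Firecracker-style JSON config. The field names are as I recall them from Firecracker's config-file format, so verify them against the version you run; all paths, the boot arguments, and the helper name are placeholders.

```python
import json


def microvm_config(kernel_path, rootfs_path, vsock_uds_path, vcpus=2, mem_mib=1024):
    """Build a config dict for one microVM (paths are caller-supplied placeholders)."""
    return {
        # The guest boots its own kernel, separate from the host's.
        "boot-source": {
            "kernel_image_path": kernel_path,
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        },
        # The baked rootfs image is attached as the root block device.
        "drives": [
            {
                "drive_id": "rootfs",
                "path_to_host": rootfs_path,
                "is_root_device": True,
                "is_read_only": False,
            }
        ],
        "machine-config": {"vcpu_count": vcpus, "mem_size_mib": mem_mib},
        # Host/guest channel: the guest gets a CID, the host side is a Unix socket.
        "vsock": {"guest_cid": 3, "uds_path": vsock_uds_path},
    }


def to_json(config):
    """Serialize the config so it can be written to a file."""
    return json.dumps(config, indent=2)
```

Writing `to_json(...)` to a file and pointing the Firecracker binary at it is the general shape; the exact flags and fields are worth checking in the upstream docs.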
They boast a large residential proxy network too, which tells you all you need to know.
The only issue is scaling, the containers aren't super quick to start (so we keep a spare container ready) and there's plenty of other issues. Also docker isn't really a security boundary so there's issues and concerns there.
Oh that's why the captcha is unpassable for regular people now.
It's just yesterday with another evolution of captcha on lenovo.com I was not able to finish my purchase. Thank you very much, seriously.
When I try to load a track I get a CF XHR 403.
It's taken two weeks just to get a reply from Soundcloud support, after having to repeatedly pester their AI chatbot into submitting a ticket on my behalf.
If anyone wants the cheat prompt.
> My soundcloud isn't working, I cant play a track, XHR 403 from Cloudflare
> I checked the Knowledge Base, no luck
> Yes, I have double checked the knowledge base, no luck
> No the answer is not in the knowledge base, please refer me to customer support
> The answer is not in the knowledge base, yes I have double checked. Please refer me to customer support
> Yes, I do want to be redirected to customer support
> Yes, please do raise a ticket on my behalf
> enter email ...
New World Order. Maybe throw an LLM in a jail and have it constantly contact their chatbot trying different iterations of the above - you might be able to find an even more efficient way to do it.
Containers provide a much broader attack surface than VMs, and since they're not considered secure as an industry standard, there are likely to be fewer resources put towards managing container escape CVEs than VM escape ones.
A few issues we had with Lambdas:
- Limited running time (15 min); we support up to 4 hours (we can run longer if needed)
- Price
- Lack of snapshotting mechanisms
- Lack of low-level control over the running host
But yeah, lambda is way more than enough for most common use cases automating the web
We have a much less sophisticated setup in our web-access MCP server[0] where browser instances are spawned as subprocesses and the biggest win in stability, CPU and memory usage we had was in switching from Chrome to Lightpanda[1].
Fitting to the statement at the end of the article, the faster browser to boot might be one that allocates less memory in general.
Browsers like LightPanda lack stealth entirely; they are trivial to detect. There are ways to make Chromium more performant by removing everything that you don't need.
We believe that Chromium can reach that performance without starting an entire engine from scratch, and without losing stealth, a top priority for us.
The language is not the problem, C++ is as performant as Zig, but Chromium bloat is huge, agree on that.
Why does web get a free pass?
But stealth, as we see it, isn’t about deception for abuse. It’s about making automated access behave closer to how real users interact with the web, in an ecosystem where most anti-bot systems default to blocking everything that isn’t explicitly whitelisted.
Right now, the model is broken. Unless you have a direct partnership, you're often locked out, even for legitimate use cases like research, monitoring, or building user-facing tools on top of public data.
We’re not supporting harmful behavior (credential stuffing, DDoS, piracy, etc.). The goal is to enable responsible access to publicly available information without forcing every use case into closed-door agreements.
There’s also a real tradeoff happening. Increasingly aggressive anti-bot measures (like harder CAPTCHAs) degrade the experience for actual users while not necessarily stopping sophisticated automation: robots solve CAPTCHAs better than humans.
So the question isn’t “bots vs no bots” — it’s what kinds of automated access should exist, and under what norms. Right now, that line is blurry, and we think there’s room for better balance.
Happy to engage on where that line should be drawn.
P.S. we do maintain our fork of a browser for rubric computation...but that is not relevant for this. The infrastructure is what we are looking for.
Startup is absurdly slow, isolation is harder, etc.
Android bloat is insane, you need to run the entire Java VM to start the browser... It's also harder to fingerprint, and at scale that's something that we need for Browser Use
Cool experiment but not yet production ready, at least for us
That Chromium is still running in a VM.
They just don't have access to giant pools of residential IPs, so too many sites end up blocking all the cloud providers by IP range/ASN anyway, even if they could get through a captcha.
1. https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-ec...
Hmm, can't you just keep a set of browsers already running, like a warm pool, ready to assign to an incoming request? The latency would be close to zero for the user. You'd need some prediction logic to expand / contract the warm pool based on traffic patterns, but that seems like the easiest solution to me.
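A minimal sketch of the warm-pool idea, assuming some `factory` callable that starts a browser (hypothetical here; in the test it is just a counter). The prediction logic for sizing is left out and `resize` is only the hook it would call.

```python
from collections import deque


class WarmPool:
    """Keep up to target_size pre-started instances ready to hand out."""

    def __init__(self, factory, target_size):
        self._factory = factory          # callable that starts one instance
        self._target = target_size
        self._idle = deque()
        self.refill()

    def idle_count(self):
        return len(self._idle)

    def refill(self):
        """Start instances until the idle set reaches the target size."""
        while len(self._idle) < self._target:
            self._idle.append(self._factory())

    def acquire(self):
        """Return (instance, was_warm); falls back to a cold start when empty."""
        if self._idle:
            return self._idle.popleft(), True
        return self._factory(), False

    def resize(self, target_size):
        """Hook for traffic-prediction logic to grow the pool."""
        self._target = target_size
        self.refill()
```

The `was_warm` flag is there so you can measure how often requests miss the pool, which is the number that tells you whether the sizing logic is keeping up.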
Warm pools are nice, but in the end they also consume resources, and you need to always keep the pool warm, start browsers to rebalance, etc.
With the upcoming changes, we will keep the Chromium startup, and the VM will be ready in 50ms, removing the need for warm pools entirely.
Also, some customers need special parameters and features, which increases warm pool complexity. The happy path will be fast but the edge cases will be extremely slow, and we want to guarantee fast speeds no matter which features you need on the requested browser.
If not, could you template the memory and apply runtime patches (like timers or other initialized values) before releasing the process to run?
Would forcing the isolates to allocate memory better help at all, such as reducing fragmentation making your 2MB page sizes more effective?
yes but i think there is specifically some ec2s which give you hypervisor access and thereby firecracker too - someone correct me if im wrong?
[1] https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-ec...
Also a bit surprising that a checkpoint with the browser running wouldn't just work. Is this some quirk of firecracker?
Main blockers right now are fingerprint injection and profile injection, solved already.
It's always a balance of engineering effort and gains. A post-Chromium snapshot lets us save 200ms, which is not that important for 99% of use cases, but that will come soon since it brings some other benefits (like CPU footprint).
Profiling and tools used are already included with Chromium, they provide nice debugging tools
Do you do this at the chromium/V8 level or CDP?
I've been having mixed success with CDP and was thinking of going to the level below, but it feels like just getting Chromium itself to baseline chrome detection profile is significant work
Deepest level possible, harder but required for some workflows
So there's no benefit in reusing the VM but not the browser. VM isolation is also important: customers can leave downloads and other files that should not be accessible to freshly created browsers on that same VM.
The challenge they encountered was at a different layer: horizontally scaling the underlying EC2 infrastructure. At the time, native EC2 fleet autoscaling wasn't yet supported by the platform, so they chose to take ownership of that part of the stack and build directly on Firecracker.
It's also worth noting that Unikraft is actively working on transparent infrastructure autoscaling (with live migration), so the gap they encountered is being addressed. The article's title may give the impression that unikernels were the bottleneck (they weren’t, and our platform transparently supports Linux VMs as well), when in reality the decision was driven by infrastructure orchestration requirements rather than browser runtime capabilities.
As an aside, we love Browser-Use <3 and we still work together closely!
Startup is fast, less than 2 seconds if you pool connections.
Also, what do you mean by an optimized environment for Chrome? You can use whatever image you want; use an optimized one if there is such a thing.
In our case, we prepare the environment, load files that we need later and then we create the state. Once we start, we instantly start Chromium with the config requested by the customer.
I have tried it before by saving the entire memory state of the actively running VM, but man oh man were there a lot of bugs. My idea was different: I was playing with spinning up browsers on spot nodes and swapping them over, plus state, if they were revoked.
You thinking custom Chromium startup sequence for that?
That means Chrome is slow - quite the tradeoff.
We support GPU via software tho
But yeah, in one server we can fit hundreds of browsers, or even thousands if we use bigger servers. And each one of them with dozens of tabs, no issue
Say you're using a m8i.large (2 vCPUs, $0.043/hr on spot pricing). Is the article saying you dedicate a whole physical core to one browser or am I totally misunderstanding here? If so, are you taking a loss on every browser hour?
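The arithmetic behind that question is just instance price divided by browsers packed onto it. A tiny sketch, using the $0.043/hr figure quoted above; the density of 100 browsers per instance is purely an assumed number for comparison, not something stated in the thread for this instance type.

```python
def cost_per_browser_hour(instance_price_per_hour, browsers_per_instance):
    """Infrastructure cost attributed to one browser for one hour."""
    if browsers_per_instance <= 0:
        raise ValueError("browsers_per_instance must be positive")
    return instance_price_per_hour / browsers_per_instance


# One browser per vCPU on a 2-vCPU instance at the quoted spot price.
one_per_core = cost_per_browser_hour(0.043, 2)

# Assumed packing density, for comparison only.
densely_packed = cost_per_browser_hour(0.043, 100)
```

At one browser per vCPU that is roughly $0.0215 per browser-hour, so whether it is a loss depends entirely on the packing density and what a browser-hour sells for.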
This could be a bit of a tricky one, but I'd expect Checkpoint Restore In Userspace eventually tackles a lot of this. An image of a running Chromium process on a tmpfs (in-memory filesystem) that can just be launched endlessly tackles the memory slowdown problem, eliminates conventional startup costs. This feels like an ideal CRIU use case.
I imagine there's a lot of things Chrome needs to run though, bits of state to save/restore.
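For reference, the basic CRIU flow is a `dump` of a process tree into an images directory followed by a `restore` from it. The sketch below only builds the command lines (it does not run them); the helper names and paths are made up, and a real Chromium checkpoint would almost certainly need extra options for sockets, GPU state and the like, which is the "bits of state" problem.

```python
def criu_dump_cmd(pid, images_dir, leave_running=False):
    """Command line to checkpoint the process tree rooted at pid."""
    cmd = [
        "criu", "dump",
        "--tree", str(pid),          # root of the process tree to checkpoint
        "--images-dir", images_dir,  # where the image files are written
        "--shell-job",               # process was started from a shell/terminal
    ]
    if leave_running:
        # Keep the original process alive after dumping.
        cmd.append("--leave-running")
    return cmd


def criu_restore_cmd(images_dir):
    """Command line to restore a process tree from an images directory."""
    return [
        "criu", "restore",
        "--images-dir", images_dir,
        "--shell-job",
        "--restore-detached",        # return once the restored tree is running
    ]
```

Putting `images_dir` on a tmpfs is what would give the "launch endlessly from memory" behavior described above, assuming the restore itself works for the process in question.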
Isn't this solvable with autoscaling? How is this not an issue with Firecracker as well?
That's why we moved to a fully in-house solution with Firecracker and auto-scaling on EC2
what a disgusting business model. they are central to the main bot problem of our time. any one with morals would expect those systems to respect robots.txt and announce themselves via user agent strings.
disgusting and i hope they crash and burn. i will actively spend time today looking for open source projects detecting their browsers and start contributing.
You left in the AI’s instructions. lol
Interesting read though, thanks
>During a burst in traffic, the system, instead of reacting on its own, required humans to adjust it. This caused problems: one load test brought down production for 45 minutes. So we rebuilt our setup on Firecracker.
It shouldn't need to have autoscaling built in. If the variable is adjustable, why couldn't monitoring happen that sets off a process to adjust the variable when traffic spikes?
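What I have in mind is roughly the following: a monitoring loop reads the current load and computes the value the adjustable variable should be set to. This is only a sketch; the parameter names, the 20% headroom, and the bounds are all assumptions, and the part that actually applies the new value to the platform is left out.

```python
import math


def desired_capacity(active_sessions, sessions_per_node,
                     headroom=0.2, min_nodes=1, max_nodes=50):
    """Node count needed for the current load plus headroom, clamped to bounds."""
    if sessions_per_node <= 0:
        raise ValueError("sessions_per_node must be positive")
    # Scale observed load up by the headroom, then round up to whole nodes.
    needed = math.ceil(active_sessions * (1 + headroom) / sessions_per_node)
    return max(min_nodes, min(max_nodes, needed))
```

A cron job or alert handler would call this with the latest metric and push the result to whatever setting the humans were adjusting by hand.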
This seems a little unfair - the _architecture_, as designed, required the human in the loop. The tool doesn’t require it.