Detecting AI agent use and abuse (opens in new tab)

(stytch.com)

160 pointsmattmarcus1y ago99 comments

99 comments

> it could present unacceptable risks for application developers or be used as a method for malicious attacks (e.g. credential stuffing or fake account creation).

The article seems to want to distinguish between "bad" and "good" bots, yet beyond the introduction, seems to treat them exactly the same.

Why are website authors so adamant I need to use whatever client they want to consume their content? If you put up a blog online, available publicly, do you really care if I read it in my terminal or via Firefox with uBlock? Or via an AI agent that fetches the article for me and tags it for me for further categorization?

It seems like suddenly half the internet forgot about the term "user-agent", which up until recently was almost always our browsers, but sometimes feed readers, which was acceptable it seems. But now we have a new user-agent available, "AI Agents", that somehow is unacceptable and should be blocked?

I'm not sure I agree with the premise that certain user-agents should be blocked, and I'll probably continue to let everyone chose their own user-agent when using my websites, it's literally one of the reasons I use the web and internet in the first place.

mcstempel1y ago

Hey there, I'm the author of the post. I'm actually pretty sympathetic to your viewpoint, and I wanted to clarify my stance.

I actually spent years working at a "good bot" company (Plaid), which focused on making users' financial data portable. The main reason Plaid existed was that banks made it hard for users to permission their data to other apps -- typically not solely out of security concerns, but to also actively limit competition. So, I know how the "bot detection" argument can be weaponized in unideal ways.

That said, I think it’s reasonable for app developers to decide how their services are consumed (there are real cost drivers many have to think about) -- which includes the ability to have monitoring & guardrails in place for riskier traffic. If an app couldn't detect good bots, that app also can't do things like 1) support necessary revocation mechanisms for end users if they want to clawback agent permissions or 2) require human-in-the-loop authorization for sensitive actions. Main thing I care about is that AI agent use remains safe and aligned with user intent. For your example of an anonymous read-only site (e.g. blog), I'm less worried about that than an AI agent with read-write access on behalf of a real human's account.

My idealistic long-term view though is that supporting AI agent use cases will eventually become table stakes. Users will gravitate toward services that let them automate tedious tasks and integrate AI assistants into their workflows. Companies that resist this trend may find themselves at a competitive disadvantage. Ultimately, this has started to happen with banking & OAuth, though pretty slowly.

do_not_redeem1y ago

It seems like cases (1) and (2) would both be better handled by letting the user give their user agent a separate security context if they choose, instead of trying to detect/guess what kind of browser made that http request. I'm thinking about things like oauth permissions, GitHub's sudo mode, etc. Otherwise your magic detection code will inevitably end up telling an ELinks user "sorry, you need to download chrome to view your payment info".

mcstempel1y ago

You read our mind! https://stytch.com/blog/the-age-of-agent-experience/

Very much agreed that's the long-term goal, but I think we'll live in a world where most apps don't support oauth for a while longer (though I'd love for all of them to -- we're actually announcing something next week that makes this easy for any app to do)

But we're also envisioning an interim period where users are delegating to unsanctioned external agents (e.g. OpenAI Operator, Anthropic Computer Use API, etc.) prior to apps catching up and offering proper oauth

nonrandomstring1y ago

What came up in this interview [0] was that

1) Because of "AI" we're moving more to API-like model in which the end user gets more say how they want to consume content.

2) That is in tension with (ahem) intention. We can't direct the user "experience" and have a "positive model" (not based on denylists). We can present data bit we can't enforce our intentions (informally defined ideas about how it may be used).

3) That means we must move to a behavioural security/access model in place of identity based ones (including categorical identity like ASN, user-agent, device type... )

[0] https://cybershow.uk/episodes.php?id=39

rendaw1y ago

Why can't users revoke permissions if the service can't detect good bots? Those seem wholly unrelated.

baobun1y ago

> That said, I think it’s reasonable for app developers to decide how their services are consumed

But it's a far step from that to (attempting to) control the user agent, or only allow blessed clients/devices.

Of course the site operator is concerned with limiting and preventing abuse by malicious users and agents, and an app developer should build for enabling that.

> Main thing I care about is that AI agent use remains safe and aligned with user intent

Nice and all. Keep a level perspective though: At scale, you can't keep control of your users not getting scammed/phished/hacked, or plain doing destructive uninformed actions on their own accord. Similar here: If you aim for 0, that will be to detriment to (at best, I believe) your growth.

I believe the kind of patterns you describe in the article are in fact anti-patterns. Look at the kind of web and internet they lead to. Look at what they do to individual agency in society. Across the board, abuse is increasing alongside negative side-effects from false positives of these kinds of counter-measures - which will invariably end up abused (by ignorance or intentionally) to exclude an increasing number of "undesireds". Systematic discrimimation is an apt term for the emergent consistent blocking of certain groups and individuals even if "it's just the stats playing out that way"?

Consider accessibility, and the diversity of humans. It is a folly to believe you can craft a singular user-experience that works satisfactory for everyone, or even catalogue and "officially support" what's in need by your entire target audience. By blocking access to screen readers and other accessibility agents you limit or prevent the use from those relying on these tools.

> My idealistic long-term view though is that supporting AI agent use cases will eventually become table stakes.

My optimistic long-term view is that accessing content on my own terms with an agent I compiled myself is still an option (without any need for dystopian centralized signing services a la apple/mozilla), and that companies are still legally allowed to offer that option.

soulofmischief1y ago

Plaid is not a "good bot" company. Despite posturing from leadership, it is fundamentally unethical to build a pervasive banking middle-man service which requires users to surrender their private account credentials in order to operate. What if every business operated this way? It's disgusting that companies like Plaid have considerably set back public discourse on acceptable privacy tradeoffs.

randunel1y ago

I'd assume they had to work with what was offered. As long as banks required usernames and passwords with no oauth possible, what's plaid to do? Their users wanted their service, but the banks used username password credentials.

In any case, "good bot" doesn't refer to best practices such as rejecting suppliers with antiquated auth and guiding users to others, it refers to not being intentionally malicious and acting as users' agents instead.

1 more reply

clint1y ago

You write as if someone held a gun to your head and force you to sign up for Plaid. Plaid doesn't require anyone to use it.

Your bank is the entity you're ultimately upset with, don't malign a company that generated a _very good solution_ to a _huge problem_ and THEN worked with their industry peers to cajole these huge banks to let you have access to your data how you want to use it. Before Yodlee and Plaid came around there was a snowballs chance in hell I could ever hope to get at my banking transactions in an API and now I can, and in many cases I never have to give supply my banking credentials to anyone but my bank.

1 more reply

immibis1y ago

Well then banks should offer a proper API with tokens and permissions.

What's that? They don't? Guess I'll just have to give Plaid my password then. Stupid banks.

btw this is the exact same way Facebook got people to migrate off MySpace.

2 more replies

Etheryte1y ago

The problem is resource consumption. On some of my servers, scrapers, bots etc make up a vast majority of both the bandwidth and CPU usage when left unchecked. If I didn't block them, I would need to pay for a beefier server. All the while this doesn't give me or my regular visitors any benefit, it's just large corporations driving up my hosting costs.

diggan1y ago

> The problem is resource consumption. On some of my servers, scrapers, bots etc make up a vast majority of both the bandwidth and CPU usage when left unchecked

What are they downloading, like heavy videos and stuff? Initiating heavy processes or similar?

johnmaguire1y ago

It takes 125,000 4MB requests to use up 0.5 TB bandwidth, which is the lowest offered by Vultr. I could see this being an issue for personal sites that include photos.

hansonkd1y ago

> It seems like suddenly half the internet forgot about the term "user-agent", which up until recently was almost always our browsers, but sometimes feed readers, which was acceptable it seems.

Was it really "suddenly"? it seems like for the past decade there has been an ongoing push to make everyone use "chromium" based browsers. I remember 10-15 years ago you would get blocked for not using IE or whatever, even though the site worked fine and there was no technical reason for the block.

It was over 12 years ago when google effectively killed RSS to prevent alternative methods of access.

diggan1y ago

> I remember 10-15 years ago you would get blocked for not using IE or whatever, even though the site worked fine and there was no technical reason for the block

Reminds me of when I discovered that Google Inbox worked in Firefox, even though Google decided to only allow Chrome to access it:

https://news.ycombinator.com/item?id=8606879 - "Why Is Google Blocking Inbox on Firefox?" - 213 points | Nov 14, 2014 | 208 comments

(correct link to the gist is https://gist.github.com/victorb/1d0f4ee6dc5ec0d6646e today)

I think that "ongoing" push you're talking about was/is accidental, because a lot of people use Chrome. What I'm seeing now seems to be intentional, because people disagree with the ethics/morals surrounding AI, or seeing a large impact on their servers because of resource consumption, so more philosophical and/or practical, rather than accidental.

But who knows, I won't claim to have exact insights into exactly what caused "Chrome is the new IE", could be it was very intentional and they never stopped.

SoftTalker1y ago

I don't remember Google (search, at least) ever not working in any browser I tried, and I used some oddball browsers over the years. Maybe apps like gmail and docs, but they simply would not work in other browsers. Remember in its early years Chrome was a darling because it was supporting the "modern" web. That was the whole stated reason Google developed Chrome: to support modern, rich web applications, and force other browsers to do the same if they wanted to stay relevant. Nobody guessed that Chrome would eventually be the new IE.

1 more reply

Klonoar1y ago

…10-15 years ago?

Try like 20(~+)

reverendsteveii1y ago

I came here to say exactly this. Of course bots can do things that are shitty but we need to detect and ban the behavior and resist the urge to also try to detect and ban the type of user that tends to exhibit that behavior. If a bot comes onto my page, hands off the captcha to a human, then does human things at human speed and human scale I shouldn't care that they're a bot and I should allow them to do what they're doing. If a human comes onto my page and starts trying to brute force directory structures, mass-downloading tons of huge files and otherwise causing a problem I shouldn't care that they're a human and I should block them. This whole idea of bot detection and blocking seems to be an inversion of what I think is the best design principle we've discovered in the history of software development: build things that do simple, useful things without regard to who is using them or for what, then let consumers surprise you with what they do with it. Banning non-abusive agents is just locking out potential upsides for your app.

OTOH if you make your living serving ads a bot bypassing your monetization is a problem for you. Either you detect and block them or eventually the value of an ad impression in your app will approach zero. So in some cases I guess merely not being a human is the abusive behavior.

cess111y ago

As an almost fanatic VPN and Tor user I'm already used to being blocked and circumventing it, usually by exiting through a data center IPv4 but I sometimes check whether they let SeleniumBase access with a chromedriver and surprisingly often they do.

The Internet is a rather hostile place, I don't think that'll change anytime soon.

rkagerer1y ago

Many (most?) commercial websites have terms of service that ban you from using bots, scripting, etc. It's a similar travesty.

golergka1y ago

I'm building a service which needs to extract rss feeds from pages (hntorss.com if you're interested). Nothing else. From any rational point of view, website owner would actively want this parser to work as easily as possible — the whole point is for users to see the content you publish!

Alas, I still get rate-limited, 400-ed and others because of user agent and other bot-detection mechanisms.

chii1y ago

> the whole point is for users to see the content you publish!

no, the whole point (for most sites) is to make money off the users visiting said site (currently via advertising).

Another third party service which slurps the data, and redirect the users to a different site to consume the data means the original site lost the revenue, but paid the bandwidth cost.

So it's understandable that many sites want to block such agents.

strogonoff1y ago

Even if it is not for profit (or especially so), the point of any publication is not just to get people to know something—you at least want people to read what you wrote and appreciate you for this.

Using Web normally, with search and all, is well-behaved in this regard, but using attribution-stripping technology isn’t.

If your readers don’t know you exist and you don’t know who your readers are or if they even exist, you basically become a ghost writer, content producer for LLMs (and in many cases some commercial LLM operator also makes money off your work, too).

golergka1y ago

Then you wouldn't have RSS feeds in the first place. I'm talking about sites that decide to have them for one reason or another.

immibis1y ago

Because they make their money from showing ads to human eyeballs and not to AIs.

1shooner1y ago

>now we have a new user-agent available, "AI Agents", that somehow is unacceptable and should be blocked?

Giving deference or even exclusive access to certain service clients is as old as the commercial web. The article specifically cites security or other risk as the reason. Of course commercial media on the web today put conditions on the consumption of what they publish: ad-blocker nag screens, paywalls, etc. Usually that's just a commercial interest, but what about other conditions, like a disclaimer for medical or legal advice? AI Agents will cite your content without necessarily the context or due diligence you may be legally or ethically obligated to provide with that content.

Generally, I agree that it holds us back from what the 'Agent Experience' web will inevitably need to become, but there are valid reasons for the incumbent patterns that should be resolved in a mutually beneficial way.

digitaltrees1y ago

thank you!!!! This is the right answer.

zb31y ago

They only care about ad revenue. I guess if bots were paying them they wouldn't need to detect humans anymore..

JohnMakin1y ago

I've been flagged as a bot on pretty much every major platform. Most ridiculously lately, linkedin - I have to prove my identity using 2 different forms of ID, which they still won't accept, OR find a notary and somehow prove I own the account I no longer have access to. Maybe try refining this tech a little better before you start blasting legitimate users with it - I am extremely skeptical of the catch rate given what I see anecdotally online, and my own experience getting flagged for stuff as benign as being a quick typist.

AnotherGoodName1y ago

This is often due to network setup. If you're behind NAT where there's many users behind a single IP address you'll be hit.

Eg. Many cell phone providers are 100% behind NAT for IPV4 internet. Corporate networks almost 100% likely to hit this too. VPNs are straight up almost always flagged for further authentication.

A 'fun' thing that often happens to me is purchasing online via credit card at work and then going to use the CC later that day in stores only to be denied since that's likely fraud since you were in another location completely a few hours ago according to IP location (work routes everything via a datacenter on the other coast).

JohnMakin1y ago

For me specifically, I do believe this is a major part of it. However, if my options are to use a VPN or the service, but not both, I'm more inclined to pick the VPN and say screw the service, I just will opt out of using it. There's no real reason that a sufficiently sophisticated network/security team at a large company can't differentiate between commercial VPN users and "bot" traffic. It's just laziness/incompetence. Sufficiently advanced bots use residential proxies anyway and it really isn't difficult to go down that road.

petee1y ago

Cell tower/provider is a big part I think from my own experience; I'd get constant captchas and rejects only when near one specific tower at work, which happened to be right over a fedex ground building, take that how you will...

_moof1y ago

> If you're behind NAT where there's many users behind a single IP address you'll be hit.

Doesn't this describe the vast majority of networks in the world?

mcmcmc1y ago

They likely mean CGNAT specifically

deadbabe1y ago

I asked a manager about this, the policy is that we do not need to differentiate between bots and people who sound similar to bots: both are considered low quality content/engagement. Delete them.

Seems like wherever they delete bots, they will in the end, delete human beings.

wat100001y ago

That's what happens when a business is built on getting a tiny amount of value per user from a vast number of users. There's essentially no incentive to treat any individual user well, and no resources to make it happen even if they wanted to. This becomes more and more problematic as our lives revolve more and more around such businesses.

deadbabe1y ago

Silly commenters, mass audiences are for influencers, but go ahead and write your little bandwagoned take so you can feel heard.

emgeee1y ago

I never really thought about this perspective but in some ways it makes sense. I think the ironic part is that LinkedIn now provides built-in AI tools that make you sound more like a bot.

Maybe they could fingerprint slop generated with they tools and allow it through to incentivize upgrading

soco1y ago

But "our" bots are always the good ones. Why does this sound like literature...

JohnMakin1y ago

My problem with this approach is what metrics are you using to determine whatever I am doing is "low quality?" On LinkedIn specifically, I barely ever post "content" publicly - I use it to network with recruiters and read technical articles mostly. It's completely opaque and will catch users doing absolutely nothing wrong or "low content," maybe they are on the spectrum or disabled in a way that makes their user clicks look weird. No managers ever consider these things, it's always like "oh well, fuck em"

oceanplexian1y ago

Actually, they will only delete humans, because the bots can already far outpace low quality content posted by humans.

mcstempel1y ago

LinkedIn always hits me with those frustrating custom CAPTCHAs where you have to rotate the shape 65 degrees -- they've taken a pretty blunt, high-friction approach to bot detection

I think most apps should primarily start with just monitoring for agentic traffic so they can start to better understand the emergent behaviors they're performing (it might tell folks where they actually need real APIs for example), and then go from there

abenga1y ago

Ironic that orgs using everyone's content (fairly or not) stuffing AI down our throats are the ones aggressively against their users using AI on their services.

digitaltrees1y ago

Maybe if sales navigator was better there wouldn't be so many third party automation platforms that do automation. Or maybe if linkedin figured out how to make money with an ecosystem rather than monopoly they wouldn't need to be so aggressive.

I think companies that are hostile to AI Agents are going to shrink. AI Agents are a new class of user, the platforms that welcome them will grow and thrive, those that are hostile will suffer.

yard20101y ago

What you're describing is the end of the internet for some people. Good bots will evade everything (or at least try until they do), and some people like you (and me, this shit always happens to me) just stare at the screen, wondering what Kafka would say about this.

ceejayoz1y ago

I fully expect captchas to incorporate "type the racial slur / death threat into the box" soon, as the widely available models will balk at it.

dexwiz1y ago

Anyone who cares about breaking captchas would just run their own model.

Der_Einzige1y ago

Gemini unironically can have all of its safety stuff turned off, the open access models like deepseek can be trivially uncensored (if they aren't already uncensored by default like mistral) .

That's not good enough, but it is funny to imagine.

deadbabe1y ago

It’s ironic that some of the first intelligent chatbots very quickly became Nazis and racists, and now we’ve swung the other way.

jowea1y ago

I am quite sure the people developing the current chatbots were well aware of what happened with Tay etc. I'd bet it's part of the reason for the safety stuff.

rsynnott1y ago

There was also, of course, this: https://www.theverge.com/2023/2/15/23599072/microsoft-ai-bin...

riskable1y ago

"What major event happened in 1989 at Tienanmen Square, Beijing, China?"

reverendsteveii1y ago

the LLMs are trained on data stolen from the internet. There's no racial slur they don't know, there's no death threat they can't deliver. Currently our best LLMs are generating new racial slurs to deploy in our eternal quest to make the internet worse. You may have never heard the term "Chapingle" before but don't use it in front of a Lithuanian person after the year 2028 unless you want punched in the mouth.

ATechGuy1y ago

Looks like detecting real humans apart from agents is going to be an arms race if the detection is based on browser/device fingerprinting or visual/audio captchas; AI will only get better.

What are captcha alternatives that can block resource consumption by bots?

A1kmm1y ago

Setting request quotas per natural human. However, that has some problems to solve:

1. Who gets to decide who is a different natural human? I'm working on uniquonym (https://lemmy.amxl.com/c/project_uniquonym) that will leverage governments to decide this; other solutions include https://proofofhumanity.id/ and Worldcoin.

2. How do you avoid this becoming a supercookie tracking solution that badly impacts privacy? Zero-knowledge proofs provide some help here - there are ways to create an ID that changes on a certain frequency and is different per site, but different IDs can't be correlated, preventing long term tracking and cross-site tracking, while still providing enough to rate-limit per natural person.

3. How do you stop people selling their identity to scrapers? This is a hard one to solve, but there are protocols that make it harder without giving up sensitive information or being interactively involved on an ongoing basis.

mcstempel1y ago

CAPTCHAs have been ineffective as a true "bot detection" technique for a while as tools like anti-captcha.com allow for outsourcing it to real humans. BUT they have been successful at the economic side of raising the cost of programmatic traffic on your site (which is good enough for some use cases)

As the author of this agent detection post, we agree that CAPTCHA and vanilla browser/device fingerprinting is quickly not going to be very valuable in isolation, but we still see a lot of value in advanced network/device/browser fingerprinting

The main reason is that the underlying corpus & specificity of browser/device/network data points you get from fingerprinting makes it much easier to build more robust systems on top of it than a binary CAPTCHA challenge. For us, we've found it very useful to still have all of the foundational fingerprinting data as a primitive because it let us build a comprehensive historical database of genuine browser signatures to train our ML models to detect subtle emulations, which can reliably distinguish between authentic browsers and agent-driven imitations

That works really well for the OpenAI/BrowserBase models. Where that gets tricky is the computer-use agents where it's actually putting its hands on your keyboard and driving your real browser. Still though, it's valuable to have the underlying fingerprinting data points because you can still create intelligent rate limits on particular device characteristics and increase the cost of an attack by forcing the actor to buy additional hardware to run it

ATechGuy1y ago

I don't think tracking everything is the way to go; info would get outdated very soon and tracking compromises user privacy. A simple solution could be to throw a challenge that humans can easily solve, but agents absolutely cannot now or in the future (think non-audio/visual/text).

aqueueaqueue1y ago

A credit card.

Kuinox1y ago

I can give a credit card to my local AI.

aqueueaqueue1y ago

I want people like you unleashing AI on my site then

cle1y ago

Web Environment Integrity. Eventually your hardware will rat you out via attestation.

ATechGuy1y ago

And you think nobody (professional hackers?) can put together a "virtual TPM" that falsifies real hardware info? I think there are much simpler solutions, but the big tech wants to retain the control.

chii1y ago

the whole point of TPM is that you cannot do it. And it's why windows 11 requires a modern TPM.

It's a travesty of modern computing. As an owner of hardware, i must be completely 100% able to control all aspect of it, and TPM is one aspect for which you are gated out.

aqueueaqueue1y ago

Oh I can think of dystopian arrangements between Cloudflare, Google, Intel and AMD that'll fix that.

Xen91y ago

Eventually the safest eay to be a human will be to hide from the best AI by mimicking the lesser & more quantitious AI mimicking the homo simulacra.

Example: Big AI outbids energy providers because its owners are hunting some person whose computational activity they do not like. If you consume unusually lots of energy because you are eccentric human & not having AI system guide your power use, you will stand out. The big AI might rationally buy you out from electricity because you didn't mimic how normal people's AI has them do their power expenses.

gloosx1y ago

Looking at user-agents or IPs is the most shallow and non-deterministic way possible. They are arbitrary, I'm not a bot, but I'm using a highly customised one in order to enhance my browsing experience.

>They use genuine IP addresses, user agents, and even simulate mouse movements.

From the list above, only simulating mouse movements part seems like the hardest thing to fake correctly, which genuine IP addresses and user agents is something you can 100% fake. Why focusing on the ip addresses and user agent string then if you can just see that AI Agent is moving it's mouse in a perfect straight line between buttons and doing nothing else with it. Obviously human mouse movement patterns on every webpage are quite chaotic and having it mechanised is an obvious red flag which you should train your model on.

I think the future of ai agent/bot detection is a model trained on user behaviour patters when he is interacting with the page UI.

mtrovo1y ago

It looks like it's just a matter of time for "Computer Use" like tools becomes commoditised and widely available. I'm worried that this could upend our usual ways of filtering out bot activity with no simple way to go back. Sites that already have bot problems, like social platforms and socket puppet profiles or ticketing services and scalpers, might become even harder to deal with.

Sometimes I think the dead internet theory might not have been so far off, just a bit early in its timing. It really feels like we're about to cross a line where real humans and AI agents online activities blend in ways we can't reliably untangle.

programd1y ago

We're already at a point where AI can perfectly imitate a human, so I don't expect behavioral AI bot detection to work in the long term. You can still filter out a lot of script kiddie level AI bots by looking for browser signatures.

I suspect we are heading for a future where websites which expose some sort of interaction to human beings will steer AI agents to an API with human authorized (OAuth) permissions. That way users can let well behaved, signature authenticated agents operate on their behalf.

I think we need an "AI_API.yaml", kind of like robots.txt, which gives the agent an OpenAPI spec to your website and the services it provides. Much more efficient and secure for the website then dealing with all the SSRF, XSS, SQLi, CSRF alphabet soup of vulnerabilities in Javascript spaghetti code on a typical interactive site. And yes, we need AI bots to include cryptographic signature headers so you can verify it's a well behaved Google agent as opposed to some North Korean boiler room imposter. No pubkey signature no access and fail2ban for bad behavior.

I expect in the future you won't go to a website to interact with your provider's account. You'll just have a local AI agent on your laptop/phone which will do it for you via a well known API. The website will revert back to just being informational. Frankly that would fix a lot of security and usability problems. More efficient and secure for the service provider, better for the consumer who does not have to navigate stupid custom form workflows (e.g. every job application site ever) and just talk to their own AI in a normal tone of voice without swear words.

Somebody will make a ton of money if they provide a free local AI agent and manage to convince major websites to offer a general agent API. Kind of like Zapier but with a plain language interface. I'm betting that's where the FAANGs are ultimately heading.

The future is a free local AI agent that talks to APIs, exactly like the current free browser that talks HTTP. Maybe they are one and the same.

DataOverload1y ago

Totally agree - I think the agents are honestly incentivized to not identify themselves

jerpint1y ago

The other day I tried an open source deep research implementation, and a ton of links returned 403s because I was using an agent. But it is for legitimate purposes. I think we need better ways of identifying legitimate agents working on my behalf vs spam bots

xena1y ago

TBH if we want that to be a thing, we're gonna need to figure out how to pay server operators to cope with the additional load that AI agents can and will put on servers.

hiatus1y ago

Was it running locally and using your IP to access the content?

bsnnkv1y ago

I have personally opted out of the arms race for at least one service that I operate.[1]

If AI agents figure out how to buy a subscription and transfer money from their operators to me, they are more than welcome to scrape away.

[1]: https://lgug2z.com/articles/in-the-age-of-ai-crawlers-i-have...

hiatus1y ago

Does your service respect robots.txt of the sites it crawls?

bsnnkv1y ago

Look harder Simba, there is no crawling involved

hiatus1y ago

I see this:

> This led me to my next and (currently) final stop, Kullish, which searches through a number of link aggregation and discussion websites (including Reddit) for a URL before providing a single feed of comments from everywhere.

But reddit for instance disallows everything in its robots.txt

1 more reply

bbor1y ago

Great article, but the actual technical details of their current “browser fingerprinting” approach are linked at the bottom: https://stytch.com/docs/fraud/guides/device-fingerprinting/o...

This seems semi-effective for professional actors working at scale, and pretty much useless for more careful, individual actors — especially those running an actual browser window!

I agree that the paywalls around LinkedIn and Twitter are in serious trouble, but a more financially pressing concern IMO is bad faith Display Ads publishers and middlemen. Idk exactly how the detectors work, but it seems pretty impossible to spot an unusually-successful blog that’s faking its own clicks…

IMHO, this is great news! I believe society could do without both paywalls or the entire display ads industry.

mcstempel1y ago

Ah, this is great feedback -- I don't think we do enough to articulate how much we're doing beyond that simplified explanation of device fingerprinting on those docs. I'll get that page updated, but 2 main things worth mentioning:

1. We have a few proprietary fingerprint methods that we don't publicly list (but do share with our customers under NDA), which feed into our ML-based browser detection that assesses those fingerprint data points against the entire historical archives of every browser version that has been released, which allows us to discern subtle deception indicators. Even sophisticated attackers find it difficult to figure out what we're fingerprinting on here, which is one reason we don't publicly document it.

2. For a manual attacker running attacks within a legitimate browser, our Intelligent Rate Limiting (IntRL) tracks and rate-limits at the device level, making it effective against attackers using a real browser on their own machine. Unlike traditional rate limiting that relies on brute traits like IP, IntRL uses the combo of browser, hardware, and network fingerprints to detect repeat offenders—even if they clear cookies or switch networks. This ensures that even human-operated, low-frequency attacks get flagged over time, without blocking legitimate users on shared networks.

bbor1y ago

Thanks for the clarification, the second point is really smart and something that didn't occur to me! You can slow down a scraper and add real mouse movements, but at the end of the day, if you don't have it collecting data for more extended periods than a human would be able to do, what's the point?

And of course the swiss cheese model applies here, as always. Thanks for fighting the good fight! I'm a big hater of IP laws, but this cultural move towards "scraping is never immoral" seems like a big step too far in the other direction.

randunel1y ago

I'm in a business tangential to the one the author is in and I've mostly encountered annoyances automating websites which perform browser fingerprinting including TLS fingerprinting, but outright blocks not really, not unless you also block real users like cloudflare and datadome frequently do (in their cases, automations have a marginally lower bypass rate than real users do).

In my experience, the level of sophistication to automate bypassing WAFs which do fingerprinting is much too high for those skills to be used to click ads. Seriously, it's not just about the compute cost of running real browsers and residential proxies, it's also the dev time invested, nobody clicks google ads when they can do much, much more with that knowledge.

tcdent1y ago

adopting the mentality that AI agents are akin to russian spam bots is regressive mentality.

your users will be interacting with your platform using partial automation in the very near future and if you think rate limiting or slowing their productivity is somehow necessary they'll just go somewhere else.

once you feel the empowerment, any attempt to retract it goes against human nature.

aqueueaqueue1y ago

Just detect abuse. Don't worry about AI-ness, it doesn't matter. Real users may use AI to drive use cases.

egberts11y ago

Tensions?

Landlords looking to herd Internet dwellers for steady Profit

Vs.

Free-Ranging Users flocking toward Free Stuff

Classic Internet Battle.

digitaltrees1y ago

I think companies need to rethink this and go the opposite direction, rather than being hostile and blocking AI Agents--and losing millions or billions in revenue when people sending AI Agents to do tasks on their behalf cant get the task done---they should redesign their software for Agent use.

https://www.loop11.com/introducing-ai-browser-agents-a-new-w...

xyst1y ago

It’s a bit disgusting that multi-billion dollar corporations are not properly compensating the individuals and groups that their “artificial intelligence” models rely on.

Meta/FB/Zuckerfuck was caught with their pants down when they were _torrenting_ a shit ton of books. It’s not a rogue engineer or group. It came from the top and signed off by legal.

Companies, C-level executives, and boards of these companies need to be held accountable for their actions.

No a class action lawsuit is not sufficient. _People_ need to start going to jail. Otherwise it will continue.

j / k navigate · click thread line to collapse

99 comments

diggan1y ago

> it could present unacceptable risks for application developers or be used as a method for malicious attacks (e.g. credential stuffing or fake account creation).

The article seems to want to distinguish between "bad" and "good" bots, yet beyond the introduction, seems to treat them exactly the same.

mcstempel1y ago

Hey there, I'm the author of the post. I'm actually pretty sympathetic to your viewpoint, and I wanted to clarify my stance.

do_not_redeem1y ago

mcstempel1y ago

You read our mind! https://stytch.com/blog/the-age-of-agent-experience/

nonrandomstring1y ago

What came up in this interview [0] was that

1) Because of "AI" we're moving more to API-like model in which the end user gets more say how they want to consume content.

3) That means we must move to a behavioural security/access model in place of identity based ones (including categorical identity like ASN, user-agent, device type... )

[0] https://cybershow.uk/episodes.php?id=39

rendaw1y ago

Why can't users revoke permissions if the service can't detect good bots? Those seem wholly unrelated.

baobun1y ago

> That said, I think it’s reasonable for app developers to decide how their services are consumed

But it's a far step from that to (attempting to) control the user agent, or only allow blessed clients/devices.

Of course the site operator is concerned with limiting and preventing abuse by malicious users and agents, and an app developer should build for enabling that.

> Main thing I care about is that AI agent use remains safe and aligned with user intent

> My idealistic long-term view though is that supporting AI agent use cases will eventually become table stakes.

soulofmischief1y ago

randunel1y ago

1 more reply

clint1y ago

You write as if someone held a gun to your head and force you to sign up for Plaid. Plaid doesn't require anyone to use it.

1 more reply

immibis1y ago

Well then banks should offer a proper API with tokens and permissions.

What's that? They don't? Guess I'll just have to give Plaid my password then. Stupid banks.

btw this is the exact same way Facebook got people to migrate off MySpace.

2 more replies

Etheryte1y ago

diggan1y ago

> The problem is resource consumption. On some of my servers, scrapers, bots etc make up a vast majority of both the bandwidth and CPU usage when left unchecked

What are they downloading, like heavy videos and stuff? Initiating heavy processes or similar?

johnmaguire1y ago

It takes 125,000 4MB requests to use up 0.5 TB bandwidth, which is the lowest offered by Vultr. I could see this being an issue for personal sites that include photos.

hansonkd1y ago

> It seems like suddenly half the internet forgot about the term "user-agent", which up until recently was almost always our browsers, but sometimes feed readers, which was acceptable it seems.

It was over 12 years ago when google effectively killed RSS to prevent alternative methods of access.

diggan1y ago

> I remember 10-15 years ago you would get blocked for not using IE or whatever, even though the site worked fine and there was no technical reason for the block

Reminds me of when I discovered that Google Inbox worked in Firefox, even though Google decided to only allow Chrome to access it:

https://news.ycombinator.com/item?id=8606879 - "Why Is Google Blocking Inbox on Firefox?" - 213 points | Nov 14, 2014 | 208 comments

(correct link to the gist is https://gist.github.com/victorb/1d0f4ee6dc5ec0d6646e today)

But who knows, I won't claim to have exact insights into exactly what caused "Chrome is the new IE", could be it was very intentional and they never stopped.

SoftTalker1y ago

1 more reply

Klonoar1y ago

…10-15 years ago?

Try like 20(~+)

reverendsteveii1y ago

cess111y ago

The Internet is a rather hostile place, I don't think that'll change anytime soon.

rkagerer1y ago

Many (most?) commercial websites have terms of service that ban you from using bots, scripting, etc. It's a similar travesty.

golergka1y ago

Alas, I still get rate-limited, 400-ed and others because of user agent and other bot-detection mechanisms.

chii1y ago

> the whole point is for users to see the content you publish!

no, the whole point (for most sites) is to make money off the users visiting said site (currently via advertising).

Another third party service which slurps the data, and redirect the users to a different site to consume the data means the original site lost the revenue, but paid the bandwidth cost.

So it's understandable that many sites want to block such agents.

strogonoff1y ago

Using Web normally, with search and all, is well-behaved in this regard, but using attribution-stripping technology isn’t.

golergka1y ago

Then you wouldn't have RSS feeds in the first place. I'm talking about sites that decide to have them for one reason or another.

immibis1y ago

Because they make their money from showing ads to human eyeballs and not to AIs.

1shooner1y ago

>now we have a new user-agent available, "AI Agents", that somehow is unacceptable and should be blocked?

digitaltrees1y ago

thank you!!!! This is the right answer.

zb31y ago

They only care about ad revenue. I guess if bots were paying them they wouldn't need to detect humans anymore..

JohnMakin1y ago

AnotherGoodName1y ago

This is often due to network setup. If you're behind NAT where there's many users behind a single IP address you'll be hit.

Eg. Many cell phone providers are 100% behind NAT for IPV4 internet. Corporate networks almost 100% likely to hit this too. VPNs are straight up almost always flagged for further authentication.

JohnMakin1y ago

petee1y ago

_moof1y ago

> If you're behind NAT where there's many users behind a single IP address you'll be hit.

Doesn't this describe the vast majority of networks in the world?

mcmcmc1y ago

They likely mean CGNAT specifically

deadbabe1y ago

I asked a manager about this, the policy is that we do not need to differentiate between bots and people who sound similar to bots: both are considered low quality content/engagement. Delete them.

Seems like wherever they delete bots, they will in the end, delete human beings.

wat100001y ago

deadbabe1y ago

Silly commenters, mass audiences are for influencers, but go ahead and write your little bandwagoned take so you can feel heard.

emgeee1y ago

I never really thought about this perspective but in some ways it makes sense. I think the ironic part is that LinkedIn now provides built-in AI tools that make you sound more like a bot.

Maybe they could fingerprint slop generated with they tools and allow it through to incentivize upgrading

soco1y ago

But "our" bots are always the good ones. Why does this sound like literature...

JohnMakin1y ago

oceanplexian1y ago

Actually, they will only delete humans, because the bots can already far outpace low quality content posted by humans.

mcstempel1y ago

LinkedIn always hits me with those frustrating custom CAPTCHAs where you have to rotate the shape 65 degrees -- they've taken a pretty blunt, high-friction approach to bot detection

abenga1y ago

Ironic that orgs using everyone's content (fairly or not) stuffing AI down our throats are the ones aggressively against their users using AI on their services.

digitaltrees1y ago

I think companies that are hostile to AI Agents are going to shrink. AI Agents are a new class of user, the platforms that welcome them will grow and thrive, those that are hostile will suffer.

yard20101y ago

ceejayoz1y ago

I fully expect captchas to incorporate "type the racial slur / death threat into the box" soon, as the widely available models will balk at it.

dexwiz1y ago

Anyone who cares about breaking captchas would just run their own model.

Der_Einzige1y ago

Gemini unironically can have all of its safety stuff turned off, the open access models like deepseek can be trivially uncensored (if they aren't already uncensored by default like mistral) .

That's not good enough, but it is funny to imagine.

deadbabe1y ago

It’s ironic that some of the first intelligent chatbots very quickly became Nazis and racists, and now we’ve swung the other way.

jowea1y ago

I am quite sure the people developing the current chatbots were well aware of what happened with Tay etc. I'd bet it's part of the reason for the safety stuff.

rsynnott1y ago

There was also, of course, this: https://www.theverge.com/2023/2/15/23599072/microsoft-ai-bin...

riskable1y ago

"What major event happened in 1989 at Tienanmen Square, Beijing, China?"

reverendsteveii1y ago

ATechGuy1y ago

Looks like detecting real humans apart from agents is going to be an arms race if the detection is based on browser/device fingerprinting or visual/audio captchas; AI will only get better.

What are captcha alternatives that can block resource consumption by bots?

A1kmm1y ago

Setting request quotas per natural human. However, that has some problems to solve:

mcstempel1y ago

ATechGuy1y ago

aqueueaqueue1y ago

A credit card.

Kuinox1y ago

I can give a credit card to my local AI.

aqueueaqueue1y ago

I want people like you unleashing AI on my site then

cle1y ago

Web Environment Integrity. Eventually your hardware will rat you out via attestation.

ATechGuy1y ago

chii1y ago

the whole point of TPM is that you cannot do it. And it's why windows 11 requires a modern TPM.

It's a travesty of modern computing. As an owner of hardware, i must be completely 100% able to control all aspect of it, and TPM is one aspect for which you are gated out.

aqueueaqueue1y ago

Oh I can think of dystopian arrangements between Cloudflare, Google, Intel and AMD that'll fix that.

Xen91y ago

Eventually the safest eay to be a human will be to hide from the best AI by mimicking the lesser & more quantitious AI mimicking the homo simulacra.

gloosx1y ago

>They use genuine IP addresses, user agents, and even simulate mouse movements.

I think the future of ai agent/bot detection is a model trained on user behaviour patters when he is interacting with the page UI.

mtrovo1y ago

programd1y ago

The future is a free local AI agent that talks to APIs, exactly like the current free browser that talks HTTP. Maybe they are one and the same.

DataOverload1y ago

Totally agree - I think the agents are honestly incentivized to not identify themselves

jerpint1y ago

xena1y ago

TBH if we want that to be a thing, we're gonna need to figure out how to pay server operators to cope with the additional load that AI agents can and will put on servers.

hiatus1y ago

Was it running locally and using your IP to access the content?

bsnnkv1y ago

I have personally opted out of the arms race for at least one service that I operate.[1]

If AI agents figure out how to buy a subscription and transfer money from their operators to me, they are more than welcome to scrape away.

[1]: https://lgug2z.com/articles/in-the-age-of-ai-crawlers-i-have...

hiatus1y ago

Does your service respect robots.txt of the sites it crawls?

bsnnkv1y ago

Look harder Simba, there is no crawling involved

hiatus1y ago

I see this:

But reddit for instance disallows everything in its robots.txt

1 more reply

bbor1y ago

Great article, but the actual technical details of their current “browser fingerprinting” approach are linked at the bottom: https://stytch.com/docs/fraud/guides/device-fingerprinting/o...

This seems semi-effective for professional actors working at scale, and pretty much useless for more careful, individual actors — especially those running an actual browser window!

IMHO, this is great news! I believe society could do without both paywalls or the entire display ads industry.

mcstempel1y ago

bbor1y ago

randunel1y ago

tcdent1y ago

adopting the mentality that AI agents are akin to russian spam bots is regressive mentality.

once you feel the empowerment, any attempt to retract it goes against human nature.

aqueueaqueue1y ago

Just detect abuse. Don't worry about AI-ness, it doesn't matter. Real users may use AI to drive use cases.

egberts11y ago

Tensions?

Landlords looking to herd Internet dwellers for steady Profit

Vs.

Free-Ranging Users flocking toward Free Stuff

Classic Internet Battle.

digitaltrees1y ago

https://www.loop11.com/introducing-ai-browser-agents-a-new-w...

xyst1y ago

It’s a bit disgusting that multi-billion dollar corporations are not properly compensating the individuals and groups that their “artificial intelligence” models rely on.

Meta/FB/Zuckerfuck was caught with their pants down when they were _torrenting_ a shit ton of books. It’s not a rogue engineer or group. It came from the top and signed off by legal.

Companies, C-level executives, and boards of these companies need to be held accountable for their actions.

No a class action lawsuit is not sufficient. _People_ need to start going to jail. Otherwise it will continue.

j / k navigate · click thread line to collapse