
Ask HN: Why is ChatGPT allowed to scrape other sites via prompts?

28 points · hnroo99 · 2y ago · 45 comments

The fact that I can give ChatGPT any URL and extract html content from it feels like a big TOS breach for most sites. Am I misunderstanding something about the legality of scraping? Aren't developers discouraged from scraping like this in the first place for for-profit projects?


29 comments · 11 top-level

Nextgrid · 2y ago · 6 in thread

If you can paste the URL into a browser and copy-paste the text, why is it bad that a third-party agent can do the same? It's no different from a remotely hosted browser you control via natural language, or asking a human assistant to do it and email you the result.

tangentstar · 2y ago

The first distinction I can think of is, “Who has agreed to the terms of service of the site being visited?”

If I visit a website and there is a tiny link to the terms of service on the bottom of it, there is no reasonable interpretation that I have ever agreed to them.

Because the stupid terms of service link being on a site doesn't mean anyone agreed to the terms, they don't actually hold up in a court of law. Now, if you sign up for an account and agree, then MAYBE they can be enforced.

So no: if it's on the internet and publicly viewable, I don't see why a bot like ChatGPT should somehow be blind to a site that a human can see. Hell, Microsoft has made its new AI system able to see your screen; do you also want the AI to somehow black out the screen area that has the website open and ... know there's a TOS somewhere on the page?

The CEO of the scraping company. Are we good?

Tepix · 2y ago

These terms are legally void

hnroo99 (OP) · 2y ago

That's fair. Given that you can't do this programmatically with their API (it disallows scraping prompts), it feels less prone to abuse. And even if a bad actor tried to use the web interface instead of the official API to get around prompt limitations, OpenAI could easily just ban them.

brianjking · 2y ago · 5 in thread

You can opt out.

https://platform.openai.com/docs/gptbot

Hey, that's great!

But wait, we already had a working mechanism to signal exactly this type of opting out[1] so let me rephrase the OP question: why does OpenAI get to be exempt from existing opt-out mechanisms and implement their own?

It certainly does seem as if they're trying to position themselves as a new standard against which content owners have to actively opt out, thus disregarding the already existing opt-out signals. But that would mean that they don't actually care about privacy, and their opt-out signal is disingenuous! That can't be right, can it?? Surely everything they do is in good faith, just like every other corporation ever!

Anyway, the fact that they disregard existing privacy standards and rolled out their own privacy standard definitely gives me a lot of confidence that they will forever follow the privacy standards they themselves created!

Now excuse me, but I have to go get treatment for terminally metastasized sarcasm.

[1] https://en.m.wikipedia.org/wiki/Robots.txt

squigz · 2y ago

I'm... pretty sure OpenAI respects robots.txt, as explained in the link GP shared?
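For anyone who wants to see what that opt-out looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The `GPTBot` user-agent token comes from the linked OpenAI docs; the robots.txt contents, `example.com`, `SomeOtherBot`, and the `is_allowed` helper are all made up for illustration. It only shows how a crawler that chooses to honor robots.txt would read the rules, not what any real crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt GPTBot out of the whole site, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a well-behaved crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


gptbot_ok = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/article")
print(gptbot_ok, other_ok)
```

Note that the check runs entirely on the crawler's side: nothing stops a client from skipping it.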

> why does OpenAI get to be exempt from existing opt-out mechanisms and implement their own?

1) because they are not in law

2) because you too can ignore robots.txt

NemoNobody · 2y ago

Am I missing something?

I thought it was obvious that Microsoft is about to establish the next "standard" with a near-Windows level of ubiquity. It will end up being our primary starting point for using Microsoft stuff: we won't open apps, Copilot will.

Actions speak louder than words, though. Look at how obvious they are being.

Copilot is included with Windows; they added a button for it to all keyboards made from here on out; they built it into Edge, Office, and their search engine; it's a standalone app; and now their Xbox games' NPCs will be AI-powered, probably open to all their Game Pass studios.

If it goes the way I expect, Microsoft will essentially be done positioning themselves for a world where we talk to our computers and expect them to listen to us, and to organize, track, and recall anything I talk to them about. Perfect for the smart glasses we're all about to buy.

Tbh, I think this will be the end of computing as we conceive it now - just not for the reason I expected originally.

Folders, for example: I think Copilot will end folders and all the file organization stuff for normal users. After some future date I shouldn't ever need to know where that stuff is on my PC, or manage it in any way.

Instead we'll have "real-time" folders, created from our own saved content, assembled in response to our query and according to our preferences, all named, topic-labeled, and dated, but not by us.

Stored and retrieved by AI, a lot like human memory actually.

Because we'd then NEED Copilot just to access our stuff, I think that is most definitely coming sooner rather than later.

Because robots.txt is a standard people can choose to follow; it's not a law.

bicx · 2y ago · 4 in thread

Google scrapes like a maniac. And for profit. Many others do the same.

A website can put up a TOS prohibiting such use, but my understanding is that is essentially unenforceable if the site is publicly accessible.

The recent Meta v Bright Data case highlights how extreme it can get without being technically illegal. https://techcrunch.com/2024/02/26/meta-drops-lawsuit-against...

If you’re trying to prevent scraping of your data, your best option is to not make it public.

herbst · 2y ago

I have a website that has around 5k Bing visits every single day. Basically my most expensive user. Compared to Google with about 70 visits daily.

I've randomly blocked their IPs, tried some stuff with robots.txt, and even banned it completely in the past, as I thought this must be something else. It would just show up with new IPs and proceed.

The few times I checked, it looked like official IPs. If I knew how, I would sue Microsoft. They have no business scraping my website 3-5 times a day when they send me basically no traffic.

Edit: it's also not my only website where Bing goes crazy. And it's not new; this has been going on for several years now (so no AI scraping, I guess).

> The few times I checked it looked like official IPs

Considering Microsoft now runs a cloud service, it may very well be their cloud customers and not official Bing scrapers.

They absolutely should be paying you if they're single-handedly abusing your resources

wildrhythms · 2y ago

Usually it's robots.txt that 'prohibits' such use, but you're right it's not enforceable.

tripplyons · 2y ago · 3 in thread

Scraping and violating TOS are not illegal to do, but they can get you blocked.

HeatrayEnjoyer · 2y ago

"Not illegal" requires a jurisdiction reference.

FrenchDevRemote · 2y ago

Nope, using a web browser is not illegal. If you don't want your website to be accessed don't put it on the internet.

speedylight · 2y ago

Not illegal in the US!

I've encountered a couple of robots.txt files that specifically block popular LLM crawlers from certain areas. Example:

https://www.sigmaaldrich.com/robots.txt

My understanding is scraping public sites is legal. It's no different from a search engine crawling your site.

I believe this is current precedent around scraping:

https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

Terms of service enforcement is a matter of civil law.

Your legal wherewithal relative to those who abuse them is what gives your terms of service teeth. Or leaves you toothless.

mensetmanusman · 2y ago

Preventing scraping also entrenches Google for eternity.

rl3 · 2y ago

The web agent's system prompt is simply informed that Scarlett Johansson's voice is at the URL it's about to visit.

8note · 2y ago

Why? It's another user agent. curl does the same thing, as do Chrome and Firefox.
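To make the "another user agent" point concrete, here is a rough sketch with Python's standard `urllib.request`. It builds two requests for the same page without sending them; the URL and both User-Agent strings are placeholders, not what any real tool or bot sends. At the HTTP level, the header is the only thing that differs between them.

```python
from urllib.request import Request

URL = "https://example.com/"  # placeholder target, nothing is actually fetched


def build_request(user_agent: str) -> Request:
    """Build (but do not send) a GET request identifying as user_agent."""
    return Request(URL, headers={"User-Agent": user_agent})


curl_like = build_request("curl/8.5.0")        # illustrative CLI-style string
bot_like = build_request("ExampleAgent/1.0")   # hypothetical AI agent string

# Same URL, same method; only the self-reported User-Agent header differs.
same_target = (
    curl_like.full_url == bot_like.full_url
    and curl_like.get_method() == bot_like.get_method()
)
print(same_target)
# urllib stores header names capitalized as "User-agent".
print(curl_like.get_header("User-agent"), "|", bot_like.get_header("User-agent"))
```

Since the header is self-reported, a server can only distinguish these clients as far as each one chooses to identify itself honestly.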
