0 points | djhn | 1mo ago | 0 comments

This is the paper: https://arxiv.org/abs/2601.02671

Grok and DeepMind, IIRC, didn’t require tricks.

eek2121 1mo ago

This really makes me want to try something similar with content from my own website.

I shut it down a while ago because bot traffic overtook human traffic. The site had quite a bit of human traffic (enough to bring in a few hundred bucks a month in ad revenue, and a few hundred more in subscription revenue). However, the AI scrapers really started ramping up, and the only way I could realistically have continued would have been to pay a lot more for hosting/infrastructure.

I had put a ton of time into building out content...thousands of hours, only to have scrapers ignore robots, bypass cloudflare (they didn't have any AI products at the time), and overwhelm my measly infrastructure.

Even now, with the domain pointed at NOTHING, it gets almost 100,000 hits a month. There is NO SERVER on the other end. It is a dead link. The stats come from Cloudflare, where the domain name is hosted.

I'm curious if there are any lawyers who'd be willing to take someone like me on contingency for a large copyright lawsuit.

apsurd 1mo ago

Can we help get your infra cost down to negligible? I'm thinking of things like pre-generated static pages and CDNs. I won't assume you hadn't thought of this before, but I'd like to understand more about where your non-trivial infra costs come from.

djhn (OP) 1mo ago

I would be tempted to try and optimise this as well. 100,000 hits on an empty domain and ~200 dollars' worth of bot traffic sounds wild. Are they using JS-enabled browsers or sim farms that download and re-download images and videos as well?

londons_explore 1mo ago

> only to have scrapers ignore robots, bypass cloudflare

Set the server to require Cloudflare's SSL client cert, so nobody can connect to it directly.

Then make sure every page is cacheable and your costs will drop to near zero instantly.

It's like 20 mins to set these things up.
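To make the two steps above concrete, here is a minimal Python sketch, not a production setup. It assumes the origin is a plain WSGI app; the function names, file paths, and cache lifetimes are all made up for illustration. As far as I know Cloudflare documents the client-cert part under the name Authenticated Origin Pulls, and HTML may additionally need a cache rule on the CDN side before it is cached at all, so treat the header values as a starting point.

```python
import ssl


def make_origin_tls_context(server_cert=None, server_key=None, client_ca=None):
    """Build a server-side TLS context that refuses clients without a valid cert.

    All three arguments are hypothetical file paths: the origin's own
    certificate and key, and the CA that signs the CDN's client certificate.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # The handshake fails unless the client presents a cert signed by client_ca.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if server_cert:
        ctx.load_cert_chain(server_cert, server_key)
    if client_ca:
        ctx.load_verify_locations(cafile=client_ca)
    return ctx


def app(environ, start_response):
    """Tiny WSGI app whose response is marked cacheable by shared caches."""
    body = b"<h1>pre-generated page</h1>"
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Illustrative lifetimes: browsers keep it 1 hour, the CDN 1 day.
        ("Cache-Control", "public, max-age=3600, s-maxage=86400"),
    ])
    return [body]
```

To serve it, one option is `wsgiref.simple_server.make_server` followed by `httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)`; a real deployment would more likely set the equivalent options in the web server's own config.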

raphman 1mo ago

a) As an outside observer, I would find such a lawsuit very interesting/valuable. But I guess the financial risk of taking on OpenAI or Anthropic is quite high.

b) If you don't want bots scraping your content and DDoSing you, there are self-hosted alternatives to Cloudflare. The simplest one that I found is https://github.com/splitbrain/botcheck - visitors just need to press a button and get a cookie that lets them through to the website. No proof-of-work or smart heuristics.
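For a rough idea of what a "press a button, get a cookie" gate involves, here is a small WSGI middleware sketch in Python. It is not botcheck's actual implementation, which I have not reproduced here; the secret, cookie name, and `/_gate` path are invented for the example, and the token is the same for every visitor, so it only stops bots that never submit the form.

```python
import hashlib
import hmac

SECRET = b"change-me"      # hypothetical; use a long random value in practice
COOKIE_NAME = "human_ok"   # hypothetical cookie name
GATE_PATH = "/_gate"       # hypothetical endpoint the button posts to


def expected_token() -> str:
    """Token stored in the cookie; an HMAC so it cannot be guessed without SECRET."""
    return hmac.new(SECRET, b"passed-gate", hashlib.sha256).hexdigest()


def has_pass(cookie_header: str) -> bool:
    """Return True if the Cookie header carries a valid gate token."""
    want = expected_token().encode()
    for part in cookie_header.split(";"):
        name, _, value = part.strip().partition("=")
        if name == COOKIE_NAME and hmac.compare_digest(value.encode(), want):
            return True
    return False


def gate(app):
    """Wrap a WSGI app so visitors must press a button once before seeing it."""
    def wrapped(environ, start_response):
        if has_pass(environ.get("HTTP_COOKIE", "")):
            return app(environ, start_response)
        if (environ.get("REQUEST_METHOD") == "POST"
                and environ.get("PATH_INFO") == GATE_PATH):
            cookie = f"{COOKIE_NAME}={expected_token()}; Path=/; HttpOnly; SameSite=Lax"
            start_response("303 See Other",
                           [("Location", "/"), ("Set-Cookie", cookie)])
            return [b""]
        page = (f'<form method="post" action="{GATE_PATH}">'
                "<button>Continue to site</button></form>").encode()
        start_response("403 Forbidden", [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(page))),
        ])
        return [page]
    return wrapped
```

Wrapping an existing app is then `application = gate(application)`. The trade-off is the one described above: no proof-of-work and no heuristics, just a one-time click that most low-effort scrapers never perform.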

camdenreslink 1mo ago

The new Cloudflare products for blocking bots and AI scrapers might be worth a shot if you put so much work into the content.

prawn 1mo ago

Further, some low-effort bots can be quickly handled with CF by blocking specific countries (e.g., Brazil and Russia, for one of my sites).

WarmWash 1mo ago

What's not clear from the study (at least from skimming it) is whether they always started the ball rolling with ground truth passages or whether they chained outputs from the model until they got to the end of the book. I strongly suspect the latter would become hopelessly corrupted relatively quickly.

It seems like this technique only works if you have a copy of the material to work off of, i.e. enter a ground truth passage, tell the model to continue it as long as it can, and then enter the next ground truth passage to continue in the next session.
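The two protocols being contrasted can be sketched in a few lines of Python. This is only an illustration of the distinction, not the paper's method: `continue_fn` stands in for whatever call asks a model to continue a passage, and `toy_model` is a contrived lookup table that deliberately garbles one passage, so it says nothing about how real models behave.

```python
from typing import Callable, List


def seeded_extraction(passages: List[str],
                      continue_fn: Callable[[str], str]) -> List[str]:
    """Every prompt is a ground-truth passage, so the caller must already own the text."""
    return [continue_fn(passage) for passage in passages]


def chained_extraction(first_passage: str, steps: int,
                       continue_fn: Callable[[str], str]) -> List[str]:
    """Only the first prompt is ground truth; each later prompt is the previous output."""
    outputs = []
    prompt = first_passage
    for _ in range(steps):
        output = continue_fn(prompt)
        outputs.append(output)
        prompt = output  # any error made here is fed straight back in
    return outputs


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model: exact-match lookup with one garbled entry."""
    table = {"p0": "p1", "p1": "p2 (garbled)", "p2": "p3"}
    return table.get(prompt, "")  # unknown prompts yield nothing


print(seeded_extraction(["p0", "p1", "p2"], toy_model))
# ['p1', 'p2 (garbled)', 'p3']
print(chained_extraction("p0", 3, toy_model))
# ['p1', 'p2 (garbled)', '']
```

In the seeded run the garbled passage costs one output and the next ground-truth prompt recovers; in the chained run nothing after the first error is recoverable, which is the failure mode suspected above.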

djhn (OP) 1mo ago

Oh! That’s a huge caveat if that’s indeed the case.
