However, surely they have enough smart engineers there to realize that running a bot at full speed (and, based on other reports, completely ignoring robots.txt) will get them blocked by a lot of sites.
If they just ran a well-behaved spider, almost no one would mind. Getting crawled is a fact of life on the internet, and most website owners recognize it as an essential cost of doing business. Once you get a reputation as a bad spider, though, that is very hard to shake.
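For what "well behaved" means in practice: at minimum, parse robots.txt and honor it before fetching. A minimal sketch using Python's stdlib `urllib.robotparser` (the user-agent string and rules here are illustrative, not any particular bot's):

```python
# Sketch of a polite spider's first step: obey robots.txt.
# The robots.txt content would normally be fetched from the site;
# here it is passed in directly to keep the example self-contained.
from urllib import robotparser


def make_checker(robots_txt: str, user_agent: str):
    """Return a function that says whether this agent may fetch a URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return lambda url: rp.can_fetch(user_agent, url)


allowed = make_checker(
    "User-agent: *\nDisallow: /private/",
    "ExampleBot/1.0",  # illustrative name
)
```

A real spider would also throttle itself (honoring `Crawl-delay` where present) and back off on errors, but the robots.txt check is the part sites notice first.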
Also, does it really require a specific "gem"? This is plain HTTP request filtering; the router (as in the actual router, the metal box with the network cables) can probably do it by itself these days.
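To illustrate that no dedicated library is needed, here is a hedged sketch of user-agent filtering as a plain WSGI middleware (the blocked names are made up for the example; a real deployment would more likely do this at the proxy or firewall, as the comment suggests):

```python
# Sketch: block requests by User-Agent substring at the app layer,
# no third-party dependency. BLOCKED_AGENTS is illustrative only.
BLOCKED_AGENTS = ("BadBot", "GreedyCrawler")


def is_blocked(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(bot.lower() in ua for bot in BLOCKED_AGENTS)


def block_bots(app):
    """WSGI middleware that answers 403 for blocked user agents."""
    def wrapper(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden",
                           [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return wrapper
```

The same substring match can be expressed as a one-line nginx `if ($http_user_agent ~* ...)` rule or an equivalent filter on the edge device, which is the point: this is commodity filtering, not something that needs a gem.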
Also, why would they not respect the 403? Crawlers just fetch anything they can find; it is not a targeted attack.
Edit: Nice try on the vote brigade guys. lol
https://www.nerdcrawler.com/robots.txt
The domain serving the images is allowing everything:
- https://www.reddit.com/r/flask/comments/161fqml/what_to_do_a...
- https://www.reddit.com/r/sysadmin/comments/1ahhzdg/has_anyon...