Facebook's robots.txt


22 comments

20 comments · 7 top-level

perryh2 12y ago · 3 in thread

http://disqus.com/humans.txt

glomph 12y ago

http://www.last.fm/robots.txt

tux 12y ago

It also has a funny 404 error page when you remove the "s":

Uh oh... Something didn't work. > http://disqus.com/human.txt

usaphp 12y ago

What's "bmw" doing on the top of his head?

viana007 12y ago · 3 in thread

http://www.google.com/robots.txt

joshguthrie 12y ago

Chrome user here. When I open it, the tab is automatically closed.

Tried to curl it, exact content, no 302 towards a "<script>window.close</script>",... Got anything?

easy_rider 12y ago

/* would have sufficed

darkmighty 12y ago

https://www.google.ca/search?q=google+search+domain&oq=googl...

yalogin 12y ago · 2 in thread

So what does it mean for Facebook to whitelist a scraping service? Do they actively block scrapers?

dblacc 12y ago

I could be wrong, but I believe the default is that spiders are blocked and only the listed "User-agents" are allowed to scrape (though not the disallowed pages).
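A minimal sketch of that default-deny reading, using Python's standard `urllib.robotparser`. The rules below are a hypothetical sample written in that shape, not Facebook's actual file; the `Googlebot` entry, the paths, and the `allowed` helper are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules (NOT Facebook's real file): named crawlers get a short
# disallow list, and the catch-all "*" entry blocks everyone else entirely.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /ajax/
Disallow: /photo.php

User-agent: Yeti
Disallow: /ajax/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Report whether the sample rules permit `agent` to fetch `path`."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("Googlebot", "/about"),     # listed agent, path not disallowed
        ("Googlebot", "/ajax/foo"),  # listed agent, disallowed path
        ("MyCrawler", "/about"),     # unlisted agent falls through to "*"
    ]:
        print(agent, path, allowed(agent, path))
```

With rules like these, a listed agent may fetch anything outside its own disallow list, while any unlisted agent matches only the `*` entry and is refused everywhere.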

elbear 12y ago

You are correct.

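decasteve 12y ago · 2 in thread

Even Facebook's robots.txt has a hatred for my pseudo-anonymous browser settings. Facebook gives me this (for any page): "Sorry, something went wrong. We're working on getting this fixed as soon as we can."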

startling 12y ago

robots.txt isn't enforced.

easy_rider 12y ago

Maybe it should be. Gentleman's agreements do not apply to robots.

1 more reply

kr1m 12y ago · 1 in thread

You don't scrape Facebook, Facebook scrapes you!

jgalt212 12y ago

In the US, you catch a cold. In Soviet Russia, cold catches you!

http://en.wikipedia.org/wiki/Russian_reversal

pdfcollect 12y ago · 1 in thread

Is there a way to replace this robots.txt with a null robots.txt? :)

toomuchtodo 12y ago

You just ignore the robots.txt file, crawl slowly, and from distributed virtual machines.

Not that you should do that. robots.txt is a nicety, though: the client doesn't have to respect it, and the server doesn't have to allow your HTTP requests.
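To make the "nicety" point concrete: honoring robots.txt is a check the client chooses to run before sending a request. A rough sketch with Python's standard `urllib.robotparser` follows; the sample rules, the `ExampleBot` agent name, and the `build_parser` and `plan_crawl` helpers are all hypothetical, and no network request is made.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
SAMPLE_RULES = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that the caller has already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_crawl(parser, agent, paths, default_delay=1.0):
    """Return (path, delay_seconds) for each path the rules permit.

    Nothing here is enforced by the server: a client that skips this
    filter can still send the requests, and the server can still refuse.
    """
    delay = parser.crawl_delay(agent)
    if delay is None:
        delay = default_delay  # assumed fallback when no Crawl-delay is set
    return [(path, delay) for path in paths if parser.can_fetch(agent, path)]


if __name__ == "__main__":
    rules = build_parser(SAMPLE_RULES)
    for path, delay in plan_crawl(rules, "ExampleBot", ["/", "/private/data", "/public"]):
        print(f"would fetch {path} after waiting {delay}s")
```

A well-behaved crawler would sleep for the returned delay between requests; one that ignores the file simply never calls `can_fetch`.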

bibstha 12y ago · 1 in thread

What is a User Agent: Yeti?

unfunco 12y ago

It's the crawler for Naver, a South Korean search engine.
