Robots.txt for the NYT has a specific exclusion for a 1996 news article

(twitter.com)

270 points · mikeortman · 5y ago · 100 comments

gostsamo · 5y ago

I've added a special exclusion to a robots.txt file for a specific article. It was some years ago, while I was in college. The article in question was about the presentation of an assistant professor who had some kind of misunderstanding with the campus newspaper, and therefore the article wasn't especially positive in tone. A couple of years later I was the sysadmin of the newspaper website, and a letter arrived in my university email. The professor had found out that I was responsible for the website and had sent me a tearful story about how this article was ruining her life because it was the top Google result for her name, and how she had spent thousands of dollars on scammers who had promised to change that; she was asking me to remove the piece. Long story short, I forwarded the case to the newspaper editor at the time and she agreed to let me add a line to the robots.txt.

Edit: newspaper -> newspaper editor

gostsamo · 5y ago

Out of curiosity, I checked the student newspaper's website. It turns out that they did a redesign after my time and removed the robots.txt file. However, googling the name of the professor in question returns much more recent results, and the article is hard to find. It turns out that Google's algorithm buries some stories over time. One more decade and no one will be able to find this part of her history unless they know what they are looking for.

Scoundreller · 5y ago

I find that Google discriminates against old content just because.

Old, to the point, websites hand-written in notepad.exe are rarely in the top 5-10, even when they have precisely the answer you’re looking for.

notacoward · 5y ago

Similar thing here. I wrote a blog post about an online "contest" that was actually a bit of a scam. For years afterward I'd periodically get email from the person asking to take the blog post down because its prominence in search results was souring job prospects. I eventually relented, not because I cared about him but because he mentioned that he had started a family and it was affecting them as well. I didn't want to hurt innocent people. I didn't want to censor my blog either, but I did add a robots.txt so it wouldn't show up in search results.

Since then it seems he really has gotten it together, and even had a project show up on the first page here. So now I suppose the robots.txt entry doesn't matter much, but it's still there anyway.

lrem · 5y ago

This is now institutionalised as the Right to Be Forgotten. Every search engine doing business in the EU has implemented it. Most probably only offer it to Europeans.

gostsamo · 5y ago

The professor is not an EU citizen and the right to be forgotten does not cover unflattering articles in student newspapers.

taway232345 · 5y ago

That is one lucky professor -- how did she get granted an exception? There are tons of unlucky people who suffer as a result of bad/shoddy/vague reporting, how do they also get a carve-out?

gostsamo · 5y ago

In my case, nobody was actually malicious. It was a stupid situation caused by the stupid actions of both sides, from what I've been able to discover. Nobody meant to cause harm; each side just wanted to tell its own story. I tried to find a compromise that would work for everyone, and it worked this time.

MattGaiser · 5y ago

Like many people who get exceptions. They asked.

zaroth · 5y ago

So much for the right to be forgotten. And this is a dismissed case.

williamscales · 5y ago

There is no right to be forgotten in the US.

adventured · 5y ago

Fortunately. I'm glad to be able to read the NYT story courtesy of the Wayback Machine and attempt to learn more about the context.

nabla9 · 5y ago

https://righttobeforgotten.org/

Most Americans support right to have some personal info removed from online searches https://www.pewresearch.org/fact-tank/2020/01/27/most-americ...

arbitrage · 5y ago

it's more nuanced than that.

they prefer to be able to remove information about themselves.

they strongly prefer you don't have the ability to remove information about someone else.

curt15 · 5y ago

I thought the "right to be forgotten" is simply meant to prevent certain websites from appearing in search engine results. Isn't that precisely what robots.txt accomplishes? Does the right to be forgotten also compel website owners to censor their own content?

OJFord · 5y ago

Accomplishes the same, but it's a different mechanism.

RTBF does not use robots.txt (and then require that it be respected) on the site hosting the thing to be forgotten; it's an exclusion on the search engine side.

aaron695 · 5y ago

So do Wired and other news sites.

https://www.wired.com/robots.txt

It's concerning that someone working for the news doesn't understand why.

This is their power of ruining lives forever.

This is a new thing, it should be taught, it's not hard to understand.

I don't know why journalists think they deserve respect when these things are not fundamentally in their ethos.

There should be a better process than robots.txt and some news sites are doing better. Europe has brought in laws. But if journalists want to be thought of as more than writing blog spam, they need a better answer to this.

koheripbal · 5y ago

At some point they'll probably figure out they can charge money to hide articles.

curt15 · 5y ago

Should Wired simply be required by law to delete the articles?

michaelmrose · 5y ago

If he is innocent it should be sufficient to amend the article so it shows the truth.

DangerousPie · 5y ago

If you Google a job applicant's name and find several articles reporting that they have been accused of rape (with just a little note at the end that the charges have been dismissed), are you really going to give them a fair chance? I don't think anyone would look at them in an unbiased way, even if they tried to.

williamscales · 5y ago

Here is the article: http://web.archive.org/web/20091128124216/https://www.nytime...

nabla9 · 5y ago

... has a certificate of disposition by the Criminal Court of the City of New York, saying that the case against him was dismissed and sealed on June 5, 1997.

solosoyokaze · 5y ago

Why was the case sealed?

nieve · 5y ago

Diplomatic immunity extending to the point of sealing the case and suppressing reporting?

gostsamo · 5y ago

Not necessarily. The crime scene was a diplomat's residence, but the DuckDuckGo results for the name are for some kind of New York businessman who might have been a guest. The dismissal of the case might have been an out-of-court agreement with lots of money involved.

bellyfullofbac · 5y ago

Well, he's about to wonder why everyone's suddenly visiting his LinkedIn profile.

Also his name plus his alleged crime gets an AP News article as the top hit.

rurban · 5y ago

This was one of the very rare cases where even WikiLeaks took down an article in the Global Intelligence Files, after the State Department complained. A very high-profile case. In that specific country everybody knew what the diplomat of the other very specific country did.

bigbillheck · 5y ago

Is it related to this particular incident? I'm not finding anything immediate.

rurban · 5y ago

You won't find anything, as it was taken down a few days after publication. The NY Times article was taken down later, but it did not have much juicy info.

cglong · 5y ago

(2019)

remux · 5y ago

This was changed in 2019: https://www.robots-viewer.com/robots/checksum/64dc55e4da9cc8...

robots.txt version from 2017: https://www.robots-viewer.com/robots/checksum/8029662cfb040c...

makomk · 5y ago

Probably because that's around the time he seemingly really pissed off one or more former employees who went digging for more information on him: https://www.glassdoor.com/Reviews/Employee-Review-Romio-RVW2...

snowwrestler · 5y ago

Just for future reference: adding a URL to robots.txt will not necessarily exclude that web page from Google, especially if it has already been indexed.

To reliably exclude a URL from indexing, you have to serve a “noindex” instruction with that URL, either in a meta tag or an HTTP header. And for this instruction to be read, the robot has to visit that page! So disallowing the URL in robots.txt can actually be counterproductive to de-indexing it.

Google also offers a tool specifically for removing URLs from their index in Search Console.
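
To illustrate the point about Disallow getting in the way of de-indexing, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content and the example.com URL are made up for illustration, and real crawlers may apply rules beyond what this parser models.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content and URL, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /1996/old-article.html
"""
URL = "https://example.com/1996/old-article.html"


def crawler_can_see_noindex(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if a compliant crawler may fetch `url`.

    Only a crawler that is allowed to fetch the page can read a noindex
    meta tag or X-Robots-Tag header served with it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if crawler_can_see_noindex(ROBOTS_TXT, URL):
    print("crawler may fetch the page, so a noindex on it can be read")
else:
    print("crawl is disallowed, so a noindex on the page is never seen")
```

With the Disallow line removed, the same check would report that the page can be fetched, which is the state you would want while waiting for the noindex to take effect.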

rafaelm · 5y ago

That's right, but the URL exclusion tool is only temporary, so the only way to do it correctly is with the noindex tag.

michaelcampbell · 5y ago

I have a specific exclusion in my robots.txt file, and also a cron-scheduled grep of my logs to see if anything actually hits it. I don't care if they do or they don't, but it's a way for me to exclude specific bots that don't honor my robots.txt file.
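
A rough Python equivalent of that kind of log check might look like the sketch below. The trap path, the sample log lines, and the assumption that the server writes the common "combined" access log format are all hypothetical; the sketch only identifies offenders, and blocking them is left to the firewall or web server.

```python
import re
from collections import Counter

# Hypothetical trap path: listed under Disallow in robots.txt and linked nowhere else.
TRAP_PATH = "/private/trap.html"

# Matches the common "combined" access log format; adjust for your server.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def offenders(lines, trap_path=TRAP_PATH):
    """Count (ip, user agent) pairs that requested the disallowed trap path."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        # Ignore any query string when comparing against the trap path.
        if m and m.group("path").split("?", 1)[0] == trap_path:
            hits[(m.group("ip"), m.group("agent"))] += 1
    return hits


# Made-up sample lines standing in for a real access log.
sample = [
    '203.0.113.7 - - [10/Oct/2020:13:55:36 +0000] "GET /private/trap.html HTTP/1.1" 200 512 "-" "BadBot/1.0"',
    '198.51.100.2 - - [10/Oct/2020:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "GoodBot/2.1"',
]
for (ip, agent), count in offenders(sample).items():
    print(ip, agent, count)
```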

pseudalopex · 5y ago

It was about a rape case. The Internet Archive shows the article was updated in 2008 to say the case was dismissed and sealed in 1997.

throwbigdata · 5y ago

https://en.m.wikipedia.org/wiki/Apophasis

brailsafe · 5y ago

Scathing Glassdoor review for his current company entitled "Just Another One of Tarik's Victims" among a sea of obviously fake ones (yes I know GD is bs)

DyslexicAtheist · 5y ago

During 2012/2013 the US Department of State (DoS) excluded around 9577 documents in its robots.txt[1]; these had leaked into archive.org (already pre-Snowden). The robots.txt file is OK now, but I'm not sure whether the content is still on the archive.

[1] https://pastebin.com/raw/RE2tpyR3

  8<-----------8<-----------8<-----------8<-----------8<-----------

  #!/bin/bash
  # Wayback Machine snapshot timestamps to try for each excluded document.
  snapshots="20120713050942 20121013154343 20121010165822 20120921054221 20130413152313 20130113162428"
  # orig source http://state.gov/robots.txt but also on pastebin in case they delete it:
  wget --output-document=robots.txt "http://pastebin.com/raw.php?i=RE2tpyR3"
  for x in $snapshots
  do
    # The second space-separated field of each line is the path; strip CR (\15) and SUB (\32).
    for i in $(cut -d ' ' -f2 ./robots.txt | tr -d '\15\32')
    do
      if [ -e "$(basename "$i")" ]; then
        echo "$i already fetched"
      else
        wget "https://web.archive.org/web/$x/http://www.state.gov/documents/$i"
      fi
    done
  done

jchook · 5y ago

Reminds me of companies posting copyrighted material in comments only to file a DMCA takedown.

If you can’t get NYT to remove it, maybe you “know someone” who can.

alerighi · 5y ago

Using robots.txt to avoid search engines indexing the page is the most stupid thing you can do. Not only is it not mandated by law that search engines follow the rules in the file, but you are also giving the public a well-known file listing all the URLs that you don't want to be public. And the first thing anyone who wants to get some information on a site goes to look at is the robots file.

The correct thing would be to serve pages with the appropriate HTTP header to disable indexing. Of course search engines are still not obliged to follow the header, just as they are not obliged to follow the robots.txt file, but you are not leaking more information than you need.

Really, robots.txt file is only useful to reduce the load on the server by crawlers, it shouldn't be used as a protective measure!
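
As a sketch of the header-based approach, the WSGI app below adds `X-Robots-Tag: noindex` to selected pages without listing them in any public file. The path set and the app itself are hypothetical; whether a given crawler honors the header is up to that crawler.

```python
# Hypothetical set of paths that should stay out of search indexes.
NOINDEX_PATHS = {"/1996/old-article.html"}


def app(environ, start_response):
    """Minimal WSGI app that marks selected pages with X-Robots-Tag: noindex."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if path in NOINDEX_PATHS:
        # The list of hidden pages lives server-side only, unlike robots.txt.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [f"page at {path}\n".encode("utf-8")]
```

To try it locally, something like `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` should work; in practice the header would more likely be set in the web server or CMS configuration.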

brianpan · 5y ago

Here's a relevant Radiolab episode about the right to be forgotten.

https://www.wnycstudios.org/podcasts/radiolab/articles/radio...

doe88 · 5y ago

A follow-up on one of their articles. More of it. Please.
