Edit: newspaper -> newspaper editor
Old, to-the-point websites hand-written in notepad.exe are rarely in the top 5-10 results, even when they have precisely the answer you’re looking for.
Since then it seems he really has gotten it together, and even had a project show up on the first page here. So now I suppose the robots.txt entry doesn't matter much, but it's still there anyway.
Most Americans support right to have some personal info removed from online searches https://www.pewresearch.org/fact-tank/2020/01/27/most-americ...
they prefer to be able to remove information about themselves.
they strongly prefer you don't have the ability to remove information about someone else.
RTBF does not use robots.txt (and then require that it be respected) on the site hosting the thing to be forgotten; it's an exclusion on the search engine side.
https://www.wired.com/robots.txt
It's concerning that someone working for the news doesn't understand why.
This is their power of ruining lives forever.
This is a new thing, it should be taught, it's not hard to understand.
I don't know why journalists think they deserve respect when these things are not fundamentally in their ethos.
There should be a better process than robots.txt, and some news sites are doing better. Europe has brought in laws. But if journalists want to be thought of as more than writers of blog spam, they need a better answer to this.
Also his name plus his alleged crime gets an AP News article as the top hit.
robots.txt version from 2017: https://www.robots-viewer.com/robots/checksum/8029662cfb040c...
To reliably exclude a URL from indexing, you have to serve a “no index” instruction with that URL, either in a meta tag or an HTTP header. And for this instruction to be read, the robot has to visit that page! So disallowing the URL in robots.txt can actually be counterproductive to de-indexing it.
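For anyone who wants to check what a page actually serves, here is a rough sketch of both forms of the "no index" instruction mentioned above. The function name `has_noindex` is made up for illustration, and the meta-tag pattern is a simplification that assumes double-quoted attributes with `name` before `content`; it is not a real HTML parser, and it would also match a header-like line that happens to appear in the body.

```shell
#!/bin/bash
# Hypothetical helper (not part of any tool): reads a raw HTTP response
# (headers, blank line, body) on stdin and succeeds if it carries a
# noindex instruction in either of the two usual forms:
#   HTTP header:  X-Robots-Tag: noindex
#   meta tag:     <meta name="robots" content="noindex">
has_noindex() {
    grep -qiE \
        -e '^x-robots-tag:.*noindex' \
        -e '<meta[^>]*name="robots"[^>]*content="[^"]*noindex'
}
```

Something like `curl -si https://example.com/some-page | has_noindex && echo "noindex served"` would exercise it against a live page (example.com is a placeholder). Note that a crawler can only see either form if robots.txt lets it fetch the page in the first place.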
Google also offers a tool specifically for removing URLs from their index in Search Console.
[1] https://pastebin.com/raw/RE2tpyR3
8<-----------8<-----------8<-----------8<-----------8<-----------
#!/bin/bash
# Wayback Machine snapshot timestamps to try for each file.
snapshots="20120713050942 20121013154343 20121010165822 20120921054221 20130413152313 20130113162428"
# orig source http://state.gov/robots.txt but also on pastebin in case they delete it:
# (URL is quoted so the shell does not treat the "?" as a glob character)
wget --output-document=robots.txt 'http://pastebin.com/raw.php?i=RE2tpyR3'
for x in $snapshots
do
    # Take the second field of each robots.txt line and strip CR and Ctrl-Z characters.
    for i in $(cut -d ' ' -f2 robots.txt | tr -d '\15\32')
    do
        if [ -e "$(basename "$i")" ]; then
            echo "$i already fetched"
        else
            wget "https://web.archive.org/web/$x/http://www.state.gov/documents/$i"
        fi
    done
done
If you can’t get NYT to remove it, maybe you “know someone” who can.
The correct thing would be to serve pages with the appropriate HTTP header to disable indexing. Of course search engines are still not obliged to follow the header, just as they are not obliged to follow the robots.txt file, but at least you are not leaking more information than you need to.
Really, the robots.txt file is only useful for reducing the load crawlers put on the server; it shouldn't be used as a protective measure!
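To make the leak concrete: robots.txt is a public file, so anyone can read off exactly which paths a site would rather keep out of sight. A minimal sketch, assuming a conventionally formatted file; the function name `list_disallowed` is hypothetical, and it ignores which User-agent group each rule belongs to.

```shell
#!/bin/bash
# Hypothetical helper: print every path named in a Disallow rule of a
# robots.txt file read from stdin, regardless of User-agent group.
list_disallowed() {
    tr -d '\r' | awk '
        {
            line = $0
            sub(/#.*/, "", line)                 # drop comments
            if (tolower(line) ~ /^[ \t]*disallow[ \t]*:/) {
                sub(/^[^:]*:[ \t]*/, "", line)   # drop the field name
                sub(/[ \t]+$/, "", line)         # drop trailing whitespace
                if (line != "") print line       # an empty Disallow blocks nothing
            }
        }'
}
```

Running `list_disallowed < robots.txt` on a saved copy prints one path per line, which is the whole list a curious visitor needs.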
https://www.wnycstudios.org/podcasts/radiolab/articles/radio...