Videolan.org robots.txt

(videolan.org)

199 points · pingiun · 4y ago · 69 comments

frereubu · 4y ago

I get some of the feeling behind this. But as for Turnitin: my wife taught at an art college and a close friend taught a Masters in Economics, and the amount of plagiarism was ridiculous. Sure, in theory there should be smaller class sizes, teachers should have more time per student, etc., but Turnitin was an extremely helpful tool that let them offload the cognitive effort of detecting mechanical reproduction and get on with reading the work. Unless there's something about Turnitin that I'm not aware of which tarnishes what they do? (Beyond making money out of already cash-strapped universities, I suppose...)

NobodyNada · 4y ago

I'm a student at a university that uses Turnitin. I understand the need for the tools and I absolutely don't have a problem with them using automated tools to check my work for plagiarism.

What I do have a problem with is the fact that, after uploading an assignment, I am required to click a checkbox that says "I agree to Turnitin's end-user license agreement." I should not have to agree to a license for a piece of software that I'm not even using; it's my professor who's using Turnitin's services. And if it's 11:55 PM and I'm trying to submit my assignment, it feels really scummy to suddenly force me to sign a legal contract that I don't even have time to read.

TheNewsIsHere · 4y ago

My university used to use Blackboard, which had SafeAssign (if I recall correctly). It was proprietary to Blackboard and the only option (at that time) was for students to choose if papers were included in the permanent reference database. Because I was given the choice, I selected that because I thought it was both considerate toward me and valuable to my scholarly work.

Then my university moved to Canvas and TurnItIn. At first there was no license agreement check box, and all the courses were force-enabled to allow TurnItIn to store student submissions forever.

I raised a lot of hell over that, and the next term there was that same checkbox that I assume you also see.

It always felt very coercive. I hated checking that box. I fought tooth and nail. I had conference calls with the Academic Technologies leadership. They absolutely didn’t understand the objection. They compared it to Office 365 and didn’t understand the point that neither the university nor Microsoft was requiring that I give them a perpetual, virtually limitless license to my content in order to use the service.

I pointed to the university policies which explicitly and very clearly categorized non-compensated student output as the property of the student, who was to retain all rights. I pointed out the conflict of interest that iParadigms brings to the table.

All I ever got in response was the talking points I found on the TurnItIn marketing material. I’d have been OK if they disagreed after an actual discussion, but they weren’t interested.

woofcat · 4y ago

Why does Turnitin get to keep a copy of all of my work for free? Do I not own the copyright of my papers?

frereubu · 4y ago

OK, this objection sort of makes sense to me. Do they have something in their Ts and Cs which says "by submitting your work you consent to us storing your work..."? Presumably people who submit their work to them also benefit to some extent though, because then plagiarisers of your work will be caught?

red_trumpet · 4y ago

Well, it might be in your interest to know if someone plagiarizes you. I don't know anything about Turnitin, though I doubt they notify the original copyright holder?

If you provide copies of your work for free on the internet, that is why they get to keep one, just like everyone else? They are probably not allowed to distribute it, though?

ineptech · 4y ago

Presumably because someone with the rights to distribute it, either you or your school, uploaded it to them.

A more interesting question is, if these companies do well and stay in business a long time, won't it become increasingly difficult to write an original paper that isn't flagged for plagiarism? There's only so many ways to describe the effects of the Lend-Lease Act on postwar Europe.

hackmiester · 4y ago

This was my issue and was why I refused to use it.

google234123 · 4y ago

They don’t claim to own your paper though.

bombcar · 4y ago

Sounds like there's a business opportunity here for a turnitinturnitin bot that runs your plagiarism through turnitin until it passes ...

KennyBlanken · 4y ago

The objection is likely that crawling VLC's website costs the project money and is completely useless for Turnitin, but Turnitin doesn't care, and it has the resources of a for-profit company versus an open source software NPO.

frereubu · 4y ago

Could be, but the comment in the robots.txt file in that case is... enigmatic.

evv · 4y ago

These are some cute "fuck off"s, but it's unlikely that these sites actually respect the robots.txt, right?

Correct me if I'm wrong: After the recent web scraping ruling[1] it seems that it's perfectly legal to ignore the robots.txt.

[1] https://news.ycombinator.com/item?id=31075396

dave5104 · 4y ago

Depends on whether the bot owner wants to be respectful. Following the link to the TurnItIn bot...

https://www.turnitin.com/robot/crawlerinfo.html

> Q: How can I completely exclude TurnitinBot from my site?

> To exclude TurnitinBot from all or portions of your site all you have to do is create a file called robots.txt and put it in the top most directory of your web site.
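
For anyone who wants to sanity-check a rule like that before deploying it, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content and the example.org URLs are made up for illustration, not copied from videolan.org, and `is_allowed` is a hypothetical helper name. It only shows what a well-behaved crawler would conclude; it does nothing to stop a bot that ignores the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: TurnitinBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("TurnitinBot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.org/index.html"))
```

A crawler that honors the file would run a check like this before each request; whether any given bot actually does is up to its operator.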

jrochkind1 · 4y ago

So... not necessarily.

1. So that case was about the CFAA (Computer Fraud and Abuse Act). So at most it would say that ignoring the robots.txt does not violate the CFAA -- a law that makes some things felonies as "hacking", basically. I agree that ignoring the robots.txt (say if you are Archive Team? [1]) should not be considered a criminal "hacking" felony.

But there can still be other reasons ignoring the robots.txt is against a law -- or cause for a civil tort action. (Most copyright violation is a civil tort action for instance, the CFAA is, again, a law that establishes some felonies with many years of jail time, intended to punish "hackers"). The decision in that case said nothing about anything except the CFAA.

For instance, taking copyrighted content from the public web and re-selling is probably still going to put you in various kinds of legal trouble -- just not a CFAA violation. It's possible ignoring a robots.txt could put you in other kinds of criminal or civil trouble, depending on the particular circumstances -- just not a CFAA violation. It would be interesting to research what other possible liability there might be. If for instance you caused harm to the site by ignoring the robots.txt (say, an accidental or intentional DOS), I bet there'd at least be cause for civil tort.

2. Even so, even under that case, if that specific case didn't involve a robots.txt (did it?), it's always possible the presence of a robots.txt would result in a different outcome. My sense is probably not, though: that Supreme Court decision referenced by the Ninth Circuit on remand probably does mean ignoring a robots.txt is not a violation of the CFAA. (And again, I say, PHEW, that would have been terrible if it were -- if say someone trying to archive MySpace before it went away could be put in prison for a couple decades for disrespecting the robots.txt).

[1] https://wiki.archiveteam.org/

henryfjordan · 4y ago

That case isn't even decided yet. The court only ruled on a preliminary injunction, so there's still quite a bit of case left to go before any final decisions are made. For now it's only "likely" that HiQ will prevail (though that means it's pretty likely).

In this case LinkedIn sent HiQ a cease and desist letter before they sued and claimed that letter revoked access for the purposes of the CFAA, so it's not quite the same as a robots.txt, but legally it's probably close enough. If anything it's stronger, because HiQ can't claim they didn't see it.

bartread · 4y ago

Well, it's possible to also return a 403 (forbidden) to any request based off the user agent. Of course, this can be relatively easily circumvented, but then it's also possible to block IP ranges and suchlike. You can return a 403 off of any detectable aspect of the client that you don't like if you so wish.

I don't know how well this would work with a CDN, but presumably if you pay for the right tier of Cloudflare (or whatever) you can perform similar operations to prevent content being hoovered up from there by clients you'd prefer not to serve.
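
A minimal sketch of the user-agent approach in Python, written as WSGI middleware. The bot names are ones mentioned in this thread; `block_bots` and `hello_app` are hypothetical names made up for the example, and as noted above a client can get around this simply by changing its User-Agent header.

```python
import re

# Bot names taken from this thread; extend the pattern as needed.
BLOCKED_AGENTS = re.compile(r"TurnitinBot|NPBot|SlySearch", re.IGNORECASE)


def block_bots(app):
    """Wrap a WSGI app and answer 403 to blocked user agents."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if BLOCKED_AGENTS.search(user_agent):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Anything else is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in for the real site (hypothetical)."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


app = block_bots(hello_app)
```

Doing the check in the web server or at the CDN is usually cheaper than doing it in application code, since the request never reaches the app at all.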

superkuh · 4y ago

Yep. I 403 turnitin and similar companies via nginx configuration:

    if ($http_user_agent ~* (TurnitinBot|PaperLiBot|idmarch|FairShare|Lightspeedsystems|ZmEu|BPImageWalker|semrushBot|ias_crawler|360spider|copyrightinfringementportal|PetalBot|Adsbot|SlySearch|NPBot)) { return 403; }

But my favorite robots.txt is,

    User-agent: Zombies
    Disallow: /brains

erk__ · 4y ago

Would it not have to be tried in a French court since that is where VideoLan is located?

spamtarget · 4y ago

Yeah, NPBot and SlySearch can just fuck the fuck off, but what is wrong with fighting plagiarism? (honest question)

bastawhiz · 4y ago

The bot is clearly crawling enough to be noticed, and consider the site: how much are you plagiarizing from videolan.org? If they're wasting even a small amount of resources, they're worth blocking.

spamtarget · 4y ago

reasonable

Kaze404 · 4y ago

There are arguments on whether plagiarism is a bad thing in an academic context. I'm not nearly qualified enough to make them, but they exist if you want to go looking.

Vladimof · 4y ago

My homework isn't written to become public (it probably could because of their business)... I guess the schools are probably more to blame though.

spamtarget · 4y ago

It's not your homework that is public, but you may have sourced text from public material.

ozfive · 4y ago

I don't get it...

woliveirajr · 4y ago

Probably because of the

># --> fuck off.

comment that is added after 3 specific robots.

wormer · 4y ago

fuck off.

ozfive · 4y ago

How do you report people like this on HN?

kmeisthax · 4y ago

So, I can understand the hate towards copyright enforcement bots, but... did TurnItIn hammer the shit out of VLC's website? Or do VLC's developers just hate the idea of automated enforcement in general?

(I doubt they're pro-plagiarism - not even copyright abolitionists go that far.)

KennyBlanken · 4y ago

I imagine the objection is that it is consuming bandwidth, electricity, and computing capacity of an open source project as part of a profit-making service, with an extra fuck-you of a) crawling the site making no sense whatsoever for the service and b) the service being of no possible use to VLC or its users.

jrochkind1 · 4y ago

> consuming bandwidth, electricity, and computing capacity of an open source project as part of a profit-making service,

And yet they don't disallow Googlebot! For obvious reasons.

rickstanley · 4y ago

What is that "# $Id$" at the top? Is it just a comment, or does it serve a purpose?

tedunangst · 4y ago

If the file lived in CVS, it would be replaced with the revision.

JNRowe · 4y ago

If you want a little background on Ted's answer, the keywords and their use are described in the RCS docs¹. TIL, RCS still gets releases² ;)

¹ https://www.gnu.org/software/rcs/manual/html_node/Concepts.h...

² https://lists.gnu.org/archive/html/info-gnu/2022-02/msg00001...

kome · 4y ago

i'm going to copy them. brilliant.

bombcar · 4y ago

    # Plagiarism?
    # --> fuck off.
    User-Agent: kome
    Disallow: /

kome · 4y ago

ahaha

paxys · 4y ago

What content is videolan.org hosting that would be relevant for these bots?

ahmetkun · 4y ago

You don't need to be relevant, just being accessible is enough reason for all sorts of bots, legit or sketchy, to shove hundreds of thousands of requests down your throat.

nixcraft · 4y ago

I used the ultimate Nginx bad bot blocker on a couple of my side projects, and it is a pretty good project https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blo... . Apart from that, Cloudflare offers UA blocking and AI-driven bot management too. Most of these bots are for content scraping and then creating search spam results. I am a one-person show, and it hurts both financially and resource-wise on my tiny servers. So I block them.

1vuio0pswjnm7 · 4y ago

Why crawl when a sitemap is provided? Honest question.

IME, using a sitemap is much more efficient. For example, HTTP/1.1 pipelining can be used to reduce the number of TCP connections needed.

Is resource exhaustion what draws a public website^1 operator's attention to "bots"? If it is not resource exhaustion, then what is it?

1. For this question, assume "public website" means a website serving public information where there are no legitimate intellectual property rights in the information that can be asserted by the site operator.
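
To make the sitemap point concrete, here is a rough Python sketch that pulls the page URLs out of a sitemap instead of discovering them by following links. The sample XML and the example.org URLs are invented for illustration, and `sitemap_urls` is a hypothetical helper name; the namespace is the one defined by the sitemaps.org protocol. It does not fetch anything or demonstrate pipelining; it only shows that the URL list is available up front.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Invented sample sitemap, not taken from any real site.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/download.html</loc></url>
</urlset>
"""


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap <urlset> document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```

With the full list known in advance, a client can request pages in a fixed order over a small number of connections rather than parsing each page for links first.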

jamespwilliams · 4y ago

> iThenticateÂ®

Unexpected place to see latin1 -> utf8 mojibake
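
In case the mechanism isn't obvious: the ® sign is two bytes in UTF-8 (0xC2 0xAE), and reading those bytes as Latin-1 gives Â followed by ®. A minimal Python sketch of the round trip, assuming that is what happened to the comment in the robots.txt (the thread doesn't confirm the actual cause):

```python
# Reproduce the "Â®" mojibake: UTF-8 bytes misread as Latin-1.
original = "iThenticate®"

# The registered sign (U+00AE) is encoded as two bytes in UTF-8: C2 AE.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as Latin-1 maps each byte to one character,
# so C2 becomes "Â" and AE becomes "®".
garbled = utf8_bytes.decode("latin-1")

# Reversing the mistake recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)   # iThenticateÂ®
print(repaired)  # iThenticate®
```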

jokoon · 4y ago

I once started logging user agents and IPs for my small old PHP website, which is a bit hard to find.

I was quite surprised to see all the weird bots that were crawling it.

a3w · 4y ago

What would a webcrawler be called which reads only the disallowed robots.txt routes? Still just an unfriendly webscraper? Shodan? Shodan on steroids?

notorandit · 4y ago

Fuck off: I won't crawl it through!
