What I do have a problem with is the fact that, after uploading an assignment, I am required to click a checkbox that says "I agree to Turnitin's end-user license agreement." I should not have to agree to a license for a piece of software that I'm not even using; it's my professor who's using Turnitin's services. And if it's 11:55 PM and I'm trying to submit my assignment, it feels really scummy to suddenly force me to sign a legal contract that I don't even have time to read.
Then my university moved to Canvas and TurnItIn. At first there was no license agreement check box, and all the courses were force-enabled to allow TurnItIn to store student submissions forever.
I raised a lot of hell over that, and the next term there was that same checkbox that I assume you also see.
It always felt very coercive. I hated checking that box. I fought tooth and nail. I had conference calls with the Academic Technologies leadership. They absolutely didn’t understand the objection. They compared it to Office 365 and didn’t understand the point that neither the university nor Microsoft was requiring that I give them a perpetual, virtually limitless license to my content in order to use the service.
I pointed to the university policies which explicitly and very clearly categorized non-compensated student output as the property of the student, who was to retain all rights. I pointed out the conflict of interest that iParadigms brings to the table.
All I ever got in response was the talking points I found on the TurnItIn marketing material. I’d have been OK if they disagreed after an actual discussion, but they weren’t interested.
If you provide copies of your work for free on the internet, isn't that why they get to keep one, just like everyone else? They are probably not allowed to distribute it, though.
A more interesting question is, if these companies do well and stay in business a long time, won't it become increasingly difficult to write an original paper that isn't flagged for plagiarism? There's only so many ways to describe the effects of the Lend-Lease Act on postwar Europe.
Correct me if I'm wrong: After the recent web scraping ruling[1] it seems that it's perfectly legal to ignore the robots.txt.
https://www.turnitin.com/robot/crawlerinfo.html
> Q: How can I completely exclude TurnitinBot from my site?
> To exclude TurnitinBot from all or portions of your site all you have to do is create a file called robots.txt and put it in the top most directory of your web site.
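For anyone wondering what that exclusion looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt body, the `example.com` URL, and the `allowed` helper are my own illustrative assumptions, not anything taken from Turnitin's page; it only shows how a well-behaved crawler would interpret such a file, and says nothing about whether a given bot actually honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out TurnitinBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: TurnitinBot
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/essay.html"  # placeholder URL
    print("TurnitinBot:", allowed("TurnitinBot", page))
    print("Googlebot:", allowed("Googlebot", page))
```

The file is purely advisory: the check happens on the crawler's side, which is exactly why the legal question of ignoring it comes up.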
1. So that case was about the CFAA (Computer Fraud and Abuse Act). So at most it would say that ignoring the robots.txt does not violate the CFAA -- a law that makes some things felonies as "hacking", basically. I agree that ignoring the robots.txt (say if you are Archive Team? [1]) should not be considered a criminal "hacking" felony.
But there can still be other reasons ignoring the robots.txt is against a law -- or cause for a civil tort action. (Most copyright violation is a civil tort action, for instance; the CFAA is, again, a law that establishes some felonies with many years of jail time, intended to punish "hackers".) The decision in that case said nothing about anything except the CFAA.
For instance, taking copyrighted content from the public web and re-selling is probably still going to put you in various kinds of legal trouble -- just not a CFAA violation. It's possible ignoring a robots.txt could put you in other kinds of criminal or civil trouble, depending on the particular circumstances -- just not a CFAA violation. It would be interesting to research what other possible liability there might be. If for instance you caused harm to the site by ignoring the robots.txt (say, an accidental or intentional DOS), I bet there'd at least be cause for civil tort.
2. Even so, even under that case, if that specific case didn't involve a robots.txt (did it?), it's always possible the presence of a robots.txt would result in a different outcome. My sense is probably not, though: that Supreme Court decision referenced by the Ninth Circuit on remand probably does mean ignoring a robots.txt is not a violation of the CFAA. (And again, I say, PHEW, that would have been terrible if it were -- if, say, someone trying to archive MySpace before it went away could be put in prison for a couple of decades for disrespecting the robots.txt.)
In this case, LinkedIn sent HiQ a cease-and-desist letter before they sued and claimed that the letter revoked access for the purposes of the CFAA, so it's not quite the same as a robots.txt, but legally it's probably close enough. If anything it's stronger, because HiQ can't claim they didn't see it.
I don't know how well this would work with a CDN, but presumably if you pay for the right tier of Cloudflare (or whatever) you can perform similar operations to prevent content being hoovered from there by clients you'd prefer not to serve.
But my favorite robots.txt is,
User-agent: Zombies
Disallow: /brains
> # --> fuck off.
comment that is added after 3 specific robots.
(I doubt they're pro-plagiarism - not even copyright abolitionists go that far.)
And yet they don't disallow Googlebot! For obvious reasons.
¹ https://www.gnu.org/software/rcs/manual/html_node/Concepts.h...
² https://lists.gnu.org/archive/html/info-gnu/2022-02/msg00001...
IME, using a sitemap is much more efficient. For example, HTTP/1.1 pipelining can be used to reduce the number of TCP connections needed.
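To make the sitemap point concrete, here is a rough sketch of pulling the URL list out of a sitemap with Python's standard library, so a crawler knows every page up front instead of discovering links page by page. The sample XML and the `sitemap_urls` helper are made up for illustration; only the namespace comes from the sitemaps.org protocol. It does not show the pipelining part, just the step that produces the list of URLs you would then request over as few connections as possible.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; a real crawler would download /sitemap.xml.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```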
Is resource exhaustion what draws a public website^1 operator's attention to "bots"? If it is not resource exhaustion, then what is it?
1. For this question, assume "public website" means a website serving public information where there are no legitimate intellectual property rights in the information that can be asserted by the site operator.
Unexpected place to see latin1 -> utf8 mojibake
I was quite surprised to see all the weird bots that were crawling it.