> And for us in the AI community, it was something like the hand of God reaching down to bless us with a guardian angel.
I’m disagreeing as a member of the community. I think back to the original books corpus. It wasn’t Stephen King, but it was books from authors who gave away their content more permissively than King. It sucked how unreproducible it was, but that lack of reproducibility came from compliance with the authors wishes. If you compare it with the recent books3, the real difference is that the creator just ignored the authors.
> Ultimately, their actions are legal.
I worked on a project recently that dealt with the question of whether willingness to do a takedown was sufficient, and our legal team warned us that it isn’t. We threw away lots of work, which was a hard choice I’m proud of. The tl;dr was analogous to this: if your site just hosts dvd rips of all of the big studios’ movies, then Disney comes along and sends you a DMCA request which you comply with, you aren’t protected to hold everybody else’s movies just because they haven’t sent a DMCA request yet. That only protects you when it’s mostly stuff that uploader actually has the authority to upload in the first place. (It was much more legalese, of course, and I don’t understand it perfectly, but the gist is that DMCA compliance is necessary but not sufficient.) i.e. the Pirate Bay or old
school Napster can’t just start honoring DMCA requests and be in the clear.
Putting the onus on creators to request a takedown is telling them to play whack-a-mole. As someone with a pipe dream of leaving ML and becoming a novelist, our community’s popular approach to IP embarrasses me. “Reproducibility is hard because the owners of the data don’t want us to pass it around” doesn’t mean to find a way to ignore the owners. Disregarding data owners’ wishes in return for better models is how the world has gotten into this privacy hell.
So many of us act as if, because copying something is essentially free, the thing should be. It’s embarrassing because our own output is in that form, so we of all people should recognize that the cost of copying it is the least important factor. Or we just call out fair use, as if that applies to everything under the sun. The eye does more, but for me their approach still makes me feel embarrassed that my own community applauds them.