IIUC Sci-Hub has scooped up science docs through a good enough UX that it was able to leverage the goodwill of science folks to upload docs (plus whatever other methods it has used to scoop up docs), and it uses a public blitzkrieg-style distribution mechanism. I.e., I guess if one had a big enough hard drive and a fast enough internet connection, one could start downloading the lib right now and see if they win the race against the copyright holders.
On the other hand, the What.cd/Redacted approach seems to use Bittorrent ratios to create a private-tracker economy. New users get a few gigs free download on joining. But apparently because a) there's a 1:1 upload/download ratio, and b) a few first-mover fat cats are sitting on enormous ratios, this means there is a scramble by everyone else to upload new FLACs to build up their ratio so they can continue to be able to download FLACs. It seems that would mean the library-in-its-entirety cannot be easily replicated at will. Yet the tracker was apparently already nuked off the internet as What.cd and reappeared later as Redacted. Was any data lost between the two services?
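To make the ratio squeeze concrete, here is a rough sketch of the arithmetic. It assumes a required ratio of 1.0 and a hypothetical 5 GiB starter allowance; the function name and both parameters are illustrative, and the real trackers' rules are more involved than this.

```python
def remaining_download(uploaded_gib, downloaded_gib,
                       required_ratio=1.0, free_allowance_gib=5.0):
    """GiB a user may still download under a simplified ratio rule.

    Assumption: the free allowance is not counted against the ratio, so the
    total download budget is the allowance plus uploaded / required_ratio.
    """
    budget = free_allowance_gib + uploaded_gib / required_ratio
    return max(0.0, budget - downloaded_gib)


# A new user who has uploaded nothing can only spend the starter allowance.
print(remaining_download(uploaded_gib=0, downloaded_gib=0))
# Someone who has uploaded 10 GiB and downloaded 12 GiB has a little left.
print(remaining_download(uploaded_gib=10, downloaded_gib=12))
```

Under this model the only way to keep downloading is to upload something other people want, which is where the scramble for new FLACs comes from.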
Oh yeah, there's also apparently another approach in rutracker, which seems to be blitzkrieg to add content and publish, at the (apparent?) cost of quality of content.
It's really a shame that the nerdy, completist domain of digital archiving through torrents isn't covered by fair use. Perhaps we could exclude the most recent 10 years of music so that the hopeful young musician streamers can get paid a few hundred dollars for millions of streams and then receive the silver lining of fair use protection against a label refusing to release one of their albums.
Sadly, Sci-Hub took down this “magic proxy” to try and win a court case in India, which its creator thinks might legitimize it elsewhere. It’s a huge shame, because it means that many obscure articles are now inaccessible via Sci-Hub.
Ironically, a pure proxy-based Sci-Hub that didn’t host any articles on its own might actually be legal in certain countries, since it’s not actually hosting any copyrighted content itself. It definitely would be a lot easier to host and a lot harder to shut down; indeed, it could be completely decentralized.
Months later, I'm still waiting on sci-hub or anyone to get access to the studies.
The real WTF is science publishing. Never mind reproducibility of the studies, just getting the study in the first place is a predicament. At least genesis still works for 85% of requests I get.
Everyone else is left with scraps, trickling data out to whoever comes later. Or maybe you get lucky and you're the lone seeder, you get a 1:1 copy. You just better hope that was on a 24-Bit FLAC release, given how big they can get.
There are many, many threads on https://old.reddit.com/r/trackers about RED's economy problems. OPS has a bonus point system and suffers less, but has worse content than RED. My opinion only, though.
rutracker is all about putting content out there. A lot of stuff is mislabelled or lower quality, less retention, identified wrong, seeded slowly. But you can sure find a lot of oddities there.
> was any data lost
Yes, absolutely. I have hundreds of albums that aren't on RED, and I can't be bothered to reupload them. It's a total waste of time when you know they'd be unseeded within a week if not for me hanging on.
Time better spent finding new music and sharing everything on Soulseek instead.
That said, things are probably different nowadays. In Oink's day everyone was on ADSL at home. Was paying €20 a month for a 100/100 OVH box; today I have 1000/100 at home and can instead spend that €20 on Spotify and Bandcamp.
It all serves the purpose of getting users to be good citizens participating in the community, and not just snatch and run.
You imply there is some continuity in the operation between these two trackers, but I don’t believe that’s the case. What.cd shut down. Subsequently, redacted (passtheheadphones) and apollo started, appealing to the same userbase. Neither of those trackers were privy to what.cd’s databases.
I like music a lot. I have a lot of vinyl and weird CDs, too. However, i can't be assed to rip to whatever draconian style-guide some of these private trackers want. So it's a matter of being a member of several servers and finding something that either isn't listed or seeded and "filling" or creating a torrent with some other tracker's set of files.
this is rewarding the wrong behavior. If i remember some song i heard in 1996, i should just be able to get it. It would be nice if all of the people who were involved in the creation and publication of the song got rewarded, somehow, but that's just not how art works in capitalism. I say this as someone who has personally released 11 CDs and a further 6 CDs in collaboration, of music. I haven't been paid a penny or more for anything i've ever produced in "art". I don't consider this a downside. People who know me and know i write music appreciate my music. People who don't know me will miss out. That's all there is to it.
Yes, absolutely, a lot.
Even though some people have automated their setup very well and have been downloading (and uploading on the newer trackers) a lot, it's just a giant amount of content, it's unlikely for any one person to have it all, and coordination was very limited back then.
The birthday release of the torrent db of what.cd includes 2.6m torrents in 1.2m groups (aka individual releases), in total weighing in at 588TB (or 421TB if you discard mp3, but there was content that hadn't been available in FLAC). That's doable on an SWE salary today if you're dedicated, but what.cd was shut down in 2016, and you'd still need to deal with the ratio system during collection.
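For a sense of scale, the figures above can be turned into averages and a drive count. This is a back-of-the-envelope sketch: the 16 TB drive size is an assumption picked for illustration, decimal units are assumed, and redundancy is ignored.

```python
import math

TOTAL_TB = 588          # all torrents, from the figures above
NON_MP3_TB = 421        # the same set with mp3 discarded
TORRENTS = 2_600_000
GROUPS = 1_200_000
DRIVE_TB = 16           # assumption: capacity of one drive, not from the thread


def avg_mb(total_tb, count):
    """Average size in MB (decimal units: 1 TB = 1,000,000 MB)."""
    return total_tb * 1_000_000 / count


def drives_needed(total_tb, drive_tb=DRIVE_TB):
    """Whole drives needed to hold total_tb, with no redundancy."""
    return math.ceil(total_tb / drive_tb)


print(f"avg per torrent: {avg_mb(TOTAL_TB, TORRENTS):.0f} MB")
print(f"avg per group:   {avg_mb(TOTAL_TB, GROUPS):.0f} MB")
print(f"drives (all):    {drives_needed(TOTAL_TB)}")
print(f"drives (no mp3): {drives_needed(NON_MP3_TB)}")
```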
the combined amount of content on those servers was probably around 100TB but most likely more
Undoubtedly.
Isn't that "bittorrent ratio" easy to cheat? I remember the good old times when I had to download some popular files just for ratio. Then, at the beginning of the 2010s, everybody started to cheat and some of the biggest trackers turned to permanent freeleech.
Yes[0] but you have to be careful else you might get caught
Not with most RED bounties you can fill. If you only sort by biggest bounties, you get albums that realistically can't be filled, or that would require serious money and effort to find: rare Asia-specific releases and stuff.
you can do specific requests if you have accounts for streaming platforms but nobody makes bounty requests for those
Basically her servers were set up to emit detailed error messages from PHP, including full path of faulting source file, which was under directory /home/ringo-ring, which could be traced to a username she had online on an unrelated site, attached to her real name. Before this revelation, she was anonymous.
Ha, my home dir is always called "me" or "and". Try googling that.
my last name, without any other information, has one tenth of the bits of information that "genewitch" does.
So, use random usernames on the computers you use for this stuff, in case you misconfigure something.
...or a username that is so common as to be meaningless, like "user", "Administrator", or even "root".
I miss a p2p application with torrent packaging and Kademlia-like per-file advertising and discovery, where I could point it to my hefty NAS directory of random things and they could be wired to released torrents. This way we could make torrents live much longer, even when only partially complete. As a super extra option, the app could even notify me to load a DVD because somebody asked for a file which I indexed and advertised previously.
Over the years my program preferences changed and file locations changed: I have moved files around, made them offline, burned them to DVDs, deleted some parts of torrents just to keep the interesting stuff. Now these torrents are lost, or at least my seeding contribution is gone. But I almost never change the content of these files, their checksum stays the same forever, so they could still be discoverable.
Digital preservation needs a better distribution system.
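The "checksum stays the same" point can be sketched as a content index that survives moves and renames. This is only an illustration using whole-file SHA-256; the function names are made up here, and a real client would need hashes that match whatever the torrent format uses.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's content, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_index(root):
    """Map content digest -> every path under root holding that content."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index[file_digest(path)].append(path)
    return dict(index)
```

Because the key is the content rather than the path, re-running `build_index` after reorganizing the NAS yields the same digests pointing at the new locations.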
All you'd have to do is make a torrent of your whole hard drive and then seed that. You don't need to publish the torrent anywhere.
If anyone else in the world is downloading any other torrent that happens to contain a file you have, they will end up connecting to your machine to download it.
But there's no way for other clients to know that there exists another torrent containing the file they are interested in if they only have the metadata for torrent A. In other words, there's no lookup mechanism for a "pieces root" to know that torrent B exists and contains file C.
If you were to make a v2 torrent of your entire drive, other clients won't know to download from your torrent. They'd need to have the contents of the metadata to know it would contain a file they are interested in, and have no way of knowing which metadata contains the desired "pieces root" entries without downloading all of them.
I'm very interested in this problem space, if you are aware of clients/mechanisms that allow for this I would love to hear them.
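The missing lookup is essentially a reverse index from "pieces root" to the torrents that contain it. A minimal sketch follows, assuming you already hold the metadata for a set of torrents (which is exactly the hard part); the input shape and function names are hypothetical, not part of any client or BEP.

```python
from collections import defaultdict


def build_reverse_index(torrents):
    """Build {pieces_root: [(info_hash, file_path), ...]}.

    `torrents` is assumed to look like {info_hash: {file_path: pieces_root}},
    i.e. metadata that has already been collected and parsed.
    """
    index = defaultdict(list)
    for info_hash, files in torrents.items():
        for path, root in files.items():
            index[root].append((info_hash, path))
    return index


def other_sources(index, info_hash, pieces_root):
    """Torrents other than `info_hash` that contain the same file content."""
    return [(h, p) for h, p in index.get(pieces_root, []) if h != info_hash]


# Toy data: torrent B bundles file C, which also appears in torrent A.
known = {
    "torrent_a": {"album/01.flac": "root_c"},
    "torrent_b": {"drive/music/01.flac": "root_c", "drive/notes.txt": "root_d"},
}
index = build_reverse_index(known)
print(other_sources(index, "torrent_a", "root_c"))
```

The index itself is trivial; what the protocol lacks is a shared, queryable place to keep it.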
Someone could use this to
A) Remotely check the presence of any specific file on your machine.
B) Exfiltrate the contents of any file they know the hash of (or possibly more specifically, the hash of each piece? I don't know the protocol details).
Fine if you have a dedicated "I expect the contents to be public" drive or directory, but not something I'd want to do on my OS drive.
I built it with libtorrent, and after loading in all of the torrents (multiple TBs of data) it would promptly and routinely crash. I couldn't find the cause of the error; it doesn't seem to have been designed to run with thousands of torrents.
One problem that I've yet to build a solution for is finding the metadata to use for the lookup phase. I haven't been able to find a publicly available database of torrent metadata. If you have an info hash then itorrents.org will give you the metadata, if it exists. I started scraping metadata via DHT announcements, but it's not exactly fast, and each client would have to do this unless they can share the database of metadata between them (I have an idea on how to accomplish this via BEP 46).
I've dreamed of a federated meta-client which mixes all available p2p networks in the wild and can download a missing torrent part from eMule or Soulseek.
That was the problem with p2p: no one bothered to actually package their releases, instead just slinging them onto the internet with no care given as to whether they were complete or even correctly labeled.
Sites like The Pirate Bay had everything. Torrents are a super fast and error-resistant way to distribute content; it's the perfect system. The problem was that the content is illegal, so it was nuked by companies with more money and power than anyone can fight against.
The problem has never been technical or about distribution; the issue is the law.
Feels like the anonymous torrent seeder who keeps seeding a file for years just for the sake of keeping it alive. It's not easy, but some people seem to be able to derive full pleasure from accomplishing the task itself, whether recognition happens or not.
One group I remember in particular are mathematicians working for the NSA, etc., who are not permitted to publish their research, then they watch as other mathematicians rediscover their work and get the credit.
* Fellow NSA employees, your coworkers and boss
* Cash money, my personal favourite form of recognition
* Your close friends and trusted loved ones will know the broad outlines of what you're doing.
It doesn't make you world-famous but it wouldn't be as lonely as a job that needed total secrecy.
That some specific instance of a discovery or whatever becomes the mainstream version is, well, it's irrelevant. Who discovered calculus? It doesn't actually matter because calculus works without some belief system and worship. Traffic routing algorithms? yes, if the person is alive and kicking, being able to lay claim to some algorithm or novel solution is a CV bullet point, but, and i say this with the utmost respect: most people are one hit wonders. If they can ride that "fame" to higher pay or respect, cool. But in the grand scheme, it's irrelevant. Ideas should be spread far and wide, so that people who have a greater understanding can explain the ideas to those without an understanding.
Capitalism is the problem.
I hand-ripped and released the netflix wii disc once upon a time, and my seed ratio on that was astronomically high.
It was wildly successful. At its zenith EmuParadise was ranked 700 or so as per Alexa on the entire internet. We're talking millions of visitors per day and thousands of active users every single second. I ran it all by myself with an entire team of moderators, contributors, etc.
It did have ads. Heck, our server bills were in the range of tens of thousands of dollars a month. How could I pay for that without having ads on the site? Then we're in commercial copyright infringement territory. Basically if you get sued, you can go to prison, and you will be bankrupted for sure. At the time there were no torrents, no IPFS, no distributed hosting solutions in any case.
As time went by the stress became enormous. Of course threatening letters and DMCA takedown notices were the norm. And the fact that the site was hugely popular and government agencies such as the FBI could get involved at the behest of Nintendo et al just made it worse. But also keeping it online, through various CDNs, trying to keep it anonymously run at all times (my OpSec was terrible starting out, it started in the year 2000), keeping servers online and uptime to almost 100% and bandwidth flowing and hard drives spinning and RAID arrays working. It was a whole lot of everything all at once and I was just one guy doing it all.
After another website Loveroms got sued by Nintendo in 2018 (for $12MM) I decided I had had enough. Reading stories like the kickasstorrents guy getting arrested while on holiday with his wife and kid, loveroms getting sued, I decided that this was the end of the road for me. I pulled all the games from the site. Eighteen years of work down the drain.
My mental health had suffered tremendously, I was depressed and anxious almost all the time. The sight of a police officer on the street would set me panicking. The cost was too high.
Was it a blast? Oh yes it was. I used to receive thousands of emails from grateful people. Cancer patients who reminisced in their last days playing video games from their childhood, soldiers at war whose only escape was a few rounds of Bomberman (the irony is not lost on me), and so many more beautiful stories of nostalgia and connection.
But current copyright law is going to destroy all this art and culture. There is no real legal way to preserve it. And people like me may do it for a long while, but at what cost to ourselves? I firmly believe that a 7-10 year copyright (extendible even somehow? debatable) would be fair and would let authors get what they need out of their creations. It would help us preserve all this beautiful art and culture that we have enjoyed and share it with future generations.
I would love for a human kid living on a distant exoplanet in the far future to be able to play Chrono Trigger and wonder about the history of the earth and our stories.
Emuparadise is also the site that introduced me to BitTorrent, and my very first torrent was downloaded from there. That would get me interested in file sharing. In some way, it's partly responsible for why I'm interested in archiving and links like the OP. I'm sure I'm not the only one.
So, thank you for creating such a wonderful library and community back then! It was a great part of my childhood and adolescence, and it showed me how important preservation and sharing can be.
Our bittorrent tracker was very short lived. It did well and had some pretty good sets on it! But in 2010 when bittorrent was getting a really bad name right after The Pirate Bay case it was easy to get torrent trackers shut down.
One day I got a downtime alert, I think it was mid-2010. I checked the site, gone. The server, unresponsive. I got in touch with the host. He said he'll check with the data center. After a while he got back to me and said: "German police came in and seized your servers." There had been no notice, no warning, no nothing. Just boom, and gone. I asked him: "How can they do this? What do we do?". He said: "Nothing, they just come and take whatever they want every now and then."
I hired a lawyer in Frankfurt to go and check on the case. He said that they had closed the case with no further process because the person in question was unknown. And he ended that email with: "But Nintendo may try something else".
Until that moment, I had no idea that Nintendo was behind the server seizure. I was relieved that the case was closed. Anyway, I still went ahead and resurrected the site sans bittorrent tracker. YOLO and all that.
For the next 8 years, we never really had much trouble after that except the usual DMCA takedown notice here and there and a threatening legal letter sometimes. But pressure kept piling up. I did consider myself small and unimportant fish to fry (compared to say, The Pirate Bay or even current gen videogame piracy websites) but that didn't stop them from going after LoveROMs.
There was always the chance that one day they would just catch me at an airport or immigration (like the kickasstorrents dude) or something and that would be it. Or the police would just knock at my door. I mean, they would have to know who I was provably but I don't think it would be that hard for a government agency. It was just a matter of time that the powers that be would need to lobby the government to get at me.
I didn't want to live my life like that any more.
I'm glad you didn't suffer consequences from it. Thank you so much for your work!
I never realized that was my first true introduction to piracy. Really enjoyed the write up!
Then, section 5 reads like an advertisement for Scrapy since it is just stellar at following all pagination links and then either emitting the extracted payload as your own data structure and/or by telling Scrapy you want to download some media as-is. It will, by default, put the local content in a directory of your choice and hash the url to make the local filename. A separate json file serves as the "accounting" between the things it downloaded and their hashed on-disk filename
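The "hash the URL to get the local filename, keep a JSON file as the accounting" idea is easy to sketch with the standard library. This is not Scrapy's code, just an illustration of the bookkeeping; the function names, the SHA-1 choice, and the `manifest.json` filename are assumptions.

```python
import hashlib
import json
import os


def local_name(url):
    """Stable on-disk name for a URL: the hex SHA-1 of the URL string."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def store(url, body, out_dir, manifest_name="manifest.json"):
    """Write `body` under a hashed filename and record url -> filename."""
    os.makedirs(out_dir, exist_ok=True)
    name = local_name(url)
    with open(os.path.join(out_dir, name), "wb") as f:
        f.write(body)

    # Load the existing accounting file (if any), add this entry, rewrite it.
    manifest_path = os.path.join(out_dir, manifest_name)
    manifest = {}
    if os.path.exists(manifest_path):
        with open(manifest_path, encoding="utf-8") as f:
            manifest = json.load(f)
    manifest[url] = name
    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
    return name
```

Hashing the URL sidesteps filename escaping and collisions between sites, at the cost of needing the manifest to find anything again.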
Scrapy is also able to glue 3 and 5 together because it has a pluggable (everything, heh) dupe detection hook and also HTTP cache support that can be backed by anything, including the aforementioned hsqldb operating in network mode. Scrapy is also very test friendly, since each method accepts a well known python object and emits either a follow-on request, zero or more extracted objects, or nothing if pagination has ended. Thus, one need not rerun the whole scraping process just to test if a bug has been fixed, or during development
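The testability point comes from the shape of the callback: one object in, then items and at most one follow-on request out. Here is a framework-free sketch of that shape; `Page`, `Item`, and `FollowUp` are hypothetical stand-ins, not Scrapy classes.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Union


@dataclass
class Page:
    """Hypothetical stand-in for a fetched, already-parsed response."""
    url: str
    titles: List[str]
    next_url: Optional[str] = None


@dataclass
class Item:
    title: str
    source: str


@dataclass
class FollowUp:
    """A request for the next page of results."""
    url: str


def parse(page: Page) -> Iterator[Union[Item, FollowUp]]:
    """Yield one Item per title, then a FollowUp if pagination continues."""
    for title in page.titles:
        yield Item(title=title, source=page.url)
    if page.next_url:
        yield FollowUp(url=page.next_url)


# A single page can be exercised directly, with no crawl involved.
print(list(parse(Page("https://example.org/p1", ["a", "b"], "https://example.org/p2"))))
```

Since `parse` touches no network or global state, a fix can be checked against one saved page instead of rerunning the whole scrape.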
I can appreciate there may be other scraping frameworks, but of the ones I've tried Scrapy makes everything that I've asked it to do simple and transparent
there's the web archival projects - that i cannot remember right now - that have some sort of proxy front end, but realistically, it should be possible to record the "content" portion of all web interactions, without relying on such dalliances as OCR and screengrabs or even OBS studio or a screen recorder.
Sometimes i go on a deep dive of some concept, and when i am done i feel i have a decent enough understanding to explain the concept to an adult, and sometimes i do a deep enough dive to explain to a 6 year old. I'd like to archive the entire "session" that got me there. Ideally as plaintext, but never have i wanted video documentation. I only ever use video to prove to someone that their service is acting up, since audio/visual desktop captures can do that, without cheating and provably.
PS. Maybe instead of "pirates", we should call ourselves "keepers".
Being acknowledged is a minor, if not completely unimportant, issue for someone concerned about OpSec. The grind of maintaining OpSec is mind-numbing for most, in my experience, especially over an extended duration. One minor slip ends it all, and the risk of slipping increases with the significance of the related operations, since more eyes increase the odds that someone will notice something, that they'll be forced into unfamiliar situations, etc.
Beyond that, research shows that odds of being discovered grow as more people know:
A very interesting thing nonetheless.
Anecdote: Remember when Microsoft Corp. declared that they love open source software and launched the CodePlex platform, and then lost their business interest in it (when they bought GitHub), so they completely erased the CodePlex archives? I was able to reach several long-forgotten projects I was interested in thanks to the invaluable work of independent volunteer archivists. (It was quite a tough manual job for me: I had to download the database, then locate the desired archive segment, and only then could I transfer the required files via the BitTorrent protocol.)
Finally an alternative to this Orwellian nightmare we call the Internet. Can't wait to have a copy at home, and there will probably be times where I'll be pulling the plug on the router with relief. And it's one more step towards reducing my Internet usage, thus keeping the government and corporations out of my life.