Academic Torrents

(academictorrents.com)

448 points · yinghang · 12y ago · 63 comments

57 comments · 25 top-level

cing12y ago· 8 in thread

The team should learn from the ghost-town that is BioTorrents[1] and offer more than just a tracker. [1] http://www.biotorrents.net/browse.php?incldead=1

_delirium12y ago

That's one reason I'd prefer that academics just put data into some kind of local university archive, where possible. Many universities provide resources to host scientific data (and have done so for decades, since the days of ftp.dept.university.edu servers), and putting it there makes it more likely that it'll still be there in 10 years. Torrents by comparison tend to be: 1) slow, as you rely on random seeders rather than a university that's peered onto Internet2 or the LambdaRail; and 2) unreliably seeded, as people drop off. Plus the workflow of "curl -O URL" is nicer than torrenting.

Universities typically have great bandwidth and good peering, and already host much larger data repositories than this seems to be targeting (e.g. here's a 30-terabyte repository, http://gis.iu.edu/), so they should be able to provide space for your local scientific data. Complain if not!

capnrefsmmat12y ago

Another alternative is something like the Dryad Digital Repository:

http://datadryad.org/

It's meant to include companion datasets for published papers, and gives out DOIs so datasets can be cited in other works. And it's mirrored at various universities to prevent loss.

runarberg12y ago

Perhaps if universities would robustly seed their staff's and students' torrents?

nl12y ago

I agree about university data.

But there is a need for a way to distribute large datasets that come out of nonacademic projects.

For example, the DBPedia data dumps are very slow to download at the moment.

voltagex_12y ago

You can have both and use a web seed with most clients.
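
A rough illustration of the web-seed idea, assuming the BEP 19 convention in which a .torrent file carries a `url-list` key naming ordinary HTTP mirrors. The tracker URL, the mirror URL, the file name and the helper names `bencode` and `build_metainfo` are all hypothetical; this is a sketch, not how any particular site builds its torrents.

```python
import hashlib


def bencode(value):
    """Serialize ints, strings, bytes, lists and dicts with bencoding."""
    if isinstance(value, bool):
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda pair: pair[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def build_metainfo(name, data, piece_length, announce, web_seeds):
    """Build a single-file metainfo dict that also lists HTTP mirrors."""
    # One 20-byte SHA-1 digest per piece, concatenated.
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {
        "name": name,
        "length": len(data),
        "piece length": piece_length,
        "pieces": pieces,
    }
    # "url-list" sits next to "info", so adding a mirror leaves the info dict unchanged.
    return {"announce": announce, "info": info, "url-list": web_seeds}


# Hypothetical dataset, tracker and university mirror.
payload = b"year,temp\n2011,14.2\n" * 1000
meta = build_metainfo(
    "weather.csv",
    payload,
    16384,
    "http://tracker.example.org/announce",
    ["http://data.example.edu/weather.csv"],
)
torrent_bytes = bencode(meta)
info_hash = hashlib.sha1(bencode(meta["info"])).hexdigest()
```

With a file like this, a client that supports web seeds can pull pieces from the university's HTTP server when peers are scarce, while still checking every piece against the SHA-1 hashes in `info`.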

reitzensteinm12y ago

So, it's quite cheap to get a seeding box from LeaseWeb, in ascending levels of sophistication:

* 100mbps unmetered 2x2tb 39 eur/mo

* 1gbps unmetered 24x2tb 349 eur/mo

* 10gbps unmetered 24x2tb 1089 eur/mo

I'm tempted to grab the first, and open a GitTip account in case anyone wants to chip in towards the second (4tb isn't a lot of space as far as this stuff goes). The third is unlikely to be useful; this stuff is long tail by its nature, so storage is probably more important.

Though in a world containing Google Fiber, would it still be a valuable service?

There's a university box seeding the torrent I'm grabbing (2011 weather patterns), but it still seems to be going quite slowly.
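
To put the three plans above side by side, a back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes), raw disk capacity with no redundancy, and a link that can be kept fully saturated, none of which the listing guarantees; the function names are illustrative.

```python
# (label, link speed in bits per second, raw storage in TB, price in EUR per month)
plans = [
    ("100 Mbps", 100e6, 2 * 2, 39),
    ("1 Gbps", 1e9, 24 * 2, 349),
    ("10 Gbps", 10e9, 24 * 2, 1089),
]


def eur_per_tb(storage_tb, eur_per_month):
    """Monthly price divided by raw storage."""
    return eur_per_month / storage_tb


def hours_to_send(terabytes, bits_per_second):
    """Hours needed to upload the given volume at full line rate."""
    return terabytes * 1e12 * 8 / bits_per_second / 3600


for label, speed, storage_tb, price in plans:
    print(
        f"{label}: {eur_per_tb(storage_tb, price):.2f} EUR per TB per month, "
        f"{hours_to_send(1, speed):.1f} h to upload 1 TB"
    )
```

Under those assumptions the middle plan is the cheapest per terabyte stored, and the smallest box needs roughly a day to push out a single terabyte.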

blueblob12y ago

That does not sound cheap to me as a graduate student.

mathgladiator12y ago

like: http://docs.aws.amazon.com/AmazonS3/latest/dev/S3Torrent.htm...

sillysaurus212y ago· 7 in thread

One problem with offering a dataset as a torrent is that it's impossible to edit it after it's released. However, it seems like that doesn't matter at all in this case, because any scenario I can think of which could be solved by editing the dataset (like redacting private info that was accidentally included) wouldn't avoid the original problem: that they accidentally released private info in the first place. Perhaps it'd be useful to edit the original dataset in order to add to it / enhance it with more info, but in that case they could just release a second dataset as an addendum.

So the core idea seems solid. Thank you for this!

jzelinskie12y ago

There are attempts to feel out a process for "updating torrents". However, this is far from becoming a standard practice in the BitTorrent ecosystem. Check this[0] out for more info.

[0] http://www.bittorrent.org/beps/bep_0039.html

dnautics12y ago

> One problem with offering a dataset as a torrent is that it's impossible to edit it after it's released.

for scientific endeavours, this should be considered a feature, not a bug.

dspillett12y ago

> but in that case they could just release a second dataset as an addendum

Or for some data it would make sense to partition the data into smaller chunks instead of one huge archive. That way adding a chunk (the new year's data for a multi-year dataset perhaps) just means releasing a new torrent with the extra archive in it and a name meaningful enough to indicate the difference. Anyone with the last set could then just download the new partition (and any modified ones).
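
The partitioning approach can be sketched as a comparison of per-partition checksums between two releases. The manifest format, the file names and the function names here are made up for illustration; a real release would need to publish such a manifest alongside the torrent.

```python
import hashlib


def manifest(partitions):
    """Map each partition name to the SHA-256 hex digest of its bytes."""
    return {name: hashlib.sha256(blob).hexdigest() for name, blob in partitions.items()}


def partitions_to_fetch(old_manifest, new_manifest):
    """Names that are new, or whose content changed, in the newer release."""
    return sorted(
        name
        for name, digest in new_manifest.items()
        if old_manifest.get(name) != digest
    )


# Hypothetical multi-year dataset: 2012 was corrected and 2013 was added.
release_1 = manifest({"2011.tar": b"aaa", "2012.tar": b"bbb"})
release_2 = manifest({"2011.tar": b"aaa", "2012.tar": b"bbb-fixed", "2013.tar": b"ccc"})

print(partitions_to_fetch(release_1, release_2))
```

Someone holding the first release would only need the two listed archives; the unchanged 2011 partition is never downloaded again.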

pixelcort12y ago

BitTorrent Sync might be useful for that:

http://www.bittorrent.com/sync

rodolphoarruda12y ago

I used BT Sync for a couple of weeks to sync data between my own machines. It works neatly. One question here: when you modify some part of a big file, does the program send out only the difference to the other authorized machines, or the entire file? Let's say a researcher exports her data to a 1GB CSV file that interests me. I download it. The following week the same researcher updates her CSV with more data, and it is now 1.01GB in size. How big will my next download be?
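
The general idea behind "send only the difference" can be shown with fixed-size block hashes. This is a generic sketch and makes no claim about what BitTorrent Sync actually does internally; the block size and function names are arbitrary choices for the example.

```python
import hashlib


def block_hashes(data, block_size):
    """SHA-1 digest of each fixed-size block (the last block may be short)."""
    return [
        hashlib.sha1(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old, new, block_size):
    """Indices of blocks in `new` that differ from, or do not exist in, `old`."""
    old_h = block_hashes(old, block_size)
    new_h = block_hashes(new, block_size)
    return [i for i, h in enumerate(new_h) if i >= len(old_h) or old_h[i] != h]


# Stand-in for the CSV: 1024 bytes, then the same file with rows appended.
original = bytes(range(256)) * 4
appended = original + b"x" * 100

print(changed_blocks(original, appended, 64))
```

With data appended at the end, only the trailing blocks differ, so a tool working this way would transfer roughly the size of the addition. If rows were inserted at the start instead, every fixed-size block would shift and the whole file would look changed.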

whadar12y ago

Hopefully sharefest.me would be another alternative pretty soon

axman612y ago

If it's stored on ZFS, Copy on Write will let you edit a copy that only stores the changed files, and deduplication could give back even more space (if necessary and RAM permits).

TheBiv12y ago· 3 in thread

This is really cool.

I simply wish the messaging were clearer and told a story that I could tell to my friends who ultimately are "too busy" to think about the value of this product.

Unfortunately, "We've designed a distributed system for sharing enormous datasets - for researchers, by researchers. The result is a scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds." just isn't a story that I can tell to my buddies and get them excited.

henryzlo12y ago

Thanks for the comment. We've created a shorter "pitch" style presentation for the non-technical / too-busy, which summarizes the benefits etc. in a description that takes only a few minutes.

https://docs.google.com/presentation/d/1JC2d1g9U6HaenGSn_Xvk...

Schwolop12y ago

Respectfully - that's an EVEN LONGER message. You need a sentence. Ten words, tops.

gog12y ago

What do you use for the tracker?

nvdk12y ago· 3 in thread

This seems to be very focused on US academics; at least, that is the impression I get from the labeling of ".edu" addresses. It gives a feeling that these torrents/datasets are of better quality. I'm also missing a catalog on this tracker; some basic taxonomy would be most welcome...

jsumrall12y ago

I didn't get that impression. Are you referring to the ".edu" address of the creators of the site? Do you mean people with a ".edu" address, and therefore at an American institution, give you a sense of their work being higher quality?

mineo12y ago

I think he's referring to the "[edu]" label on the browsing pages (like [0]) which indicates that the uploader has a .edu email address. I'm not too sure about other countries, but at least in Germany, not many academic institutions actually have those, just normal .de ones.

[0] http://academictorrents.com/browse.php?cat=5

nvdk12y ago

To clarify: torrents are marked "edu" if the user has a .edu address, and this makes those torrents stand out. The majority of non-US universities do not offer *.edu addresses to their staff and students.

thedudemabry12y ago· 2 in thread

Wow! That's a snappy site. Major props to the frontend dev(s).

kirubakaran12y ago

Looks like just stock Bootstrap (not that there is anything wrong with that).

pointernil12y ago

True. I guess ppl of academia are used to "different" quality/snappiness levels ;)

macarthy1212y ago· 2 in thread

The problem is the word "torrent". Too many negative connotations for many in the traditional academic world.

teddyh12y ago

And you’re writing this on “Hacker News”…

ctrl12y ago

Maybe this can help in turning around those connotations.

linux_devil12y ago· 2 in thread

Great! Looking forward to Coursera, edX and OCW videos too.

dombili12y ago

That'd be great actually. Especially for those who're not able to reach Coursera because of the stupid laws.

sitkack12y ago

If you can get a cheap VPS in the US, you can use coursera downloader to grab all the content and then rsync it to your home country.

guspe12y ago· 2 in thread

Aaron Swartz's dream come true?

skrebbel12y ago

maerF0x012y ago

what exactly was his dream?

jackmaney12y ago· 2 in thread

Excellent! It's far too early to tell, but I'd like to be hopeful that this distribution network could be another nail in the coffin of the old, expensive, dead-tree journals.

rfoeorfisdjus12y ago

I guess you mean papers. Go back to reddit, libtard.

jackmaney12y ago

Yes, I do mean papers. I'm not on reddit, you mouth-breathing neanderthal.

hardwaresofton12y ago· 1 in thread

Wow, this is pretty cool -- one of the most direct approaches to open-data that I've seen so far (and the research world is of course in dire need of this kind of open data/connect-the-dots enabling effort)!

I think it would be pretty cool to have trending datasets on the front page (I'm sure you could do a small cron that would find the most-downloaded per-week/per-day/etc)

Also, while not a dire necessity, I think a cooler name would help this project fly farther -- You should be able to make a play on "data torrents", maybe something like datastorm/samplerain/datawave/dataswell/Acadata?

Anyway, trivial stuff aside, nice implementation -- bookmarked for when I get the urge to do a data-analysis project!
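
The "small cron" for trending datasets mentioned above could be as simple as counting recent download events. The event format (dataset name plus timestamp), the dataset names and the function name are assumptions for this sketch; nothing here reflects how the site actually records downloads.

```python
from collections import Counter
from datetime import datetime, timedelta


def trending(downloads, now, days=7, top=3):
    """Return the most-downloaded dataset names within the last `days` days."""
    cutoff = now - timedelta(days=days)
    counts = Counter(name for name, when in downloads if cutoff <= when <= now)
    return counts.most_common(top)


# Hypothetical download log: (dataset name, time of download).
now = datetime(2014, 2, 1)
log = [
    ("weather-2011", datetime(2014, 1, 30)),
    ("weather-2011", datetime(2014, 1, 31)),
    ("weather-2011", datetime(2014, 1, 28)),
    ("ngrams", datetime(2014, 1, 29)),
    ("ngrams", datetime(2014, 1, 27)),
    ("old-survey", datetime(2013, 12, 1)),  # outside the 7-day window
]

print(trending(log, now))
```

Run once a day or once a week from cron, the result could be cached and rendered on the front page.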

yinghangOP12y ago

Thanks! But this is not my project. It is something created by a grad student I met just a couple hours ago at a hack night discussion.

teddyh12y ago

So what do I do if I want to seed them all? Also, are all the data sets (and other things) freely licensed, i.e. no “non-commercial use only” clauses or things of that nature? Can I count on this going forward?

jakeogh12y ago

A few TB of FOIA information related to the September 11th attacks is available via BT.

Direct link: http://911datasets.org/images/911datasets.org_all_torrents_J...

dav-12y ago

Any reason passwords for user accounts are limited to 40 chars?

ses12y ago

Projects like this confirm my suspicion that traditional academic publishing is going to take a nosedive in the next few years. Working in this industry as I do, I don't see commercial publishers moving quickly enough to change. Really love the idea of this and can't help but support the general ethos of it, even if it / its descendants will put a lot of us out of a job.

kartikkumar12y ago

Brilliant idea if I understand it correctly. Just want to check that my use case would fit. I just submitted my first and main paper for my PhD to Icarus. I'm planning on soon uploading it to ArXiv as well. My paper is theoretical in nature and through a suite of Monte Carlo simulations I generated a few hundred MBs of data. Can I make use of this system as a way to deposit that data so that it's available to anyone that wants to verify the conclusions I reach in my paper and possibly extend the research?

csense12y ago

I'm surprised they don't have the Google Books n-gram dataset [1]. Then again, maybe they're more focused on data that doesn't have a good home already than on mirroring.

[1] http://storage.googleapis.com/books/ngrams/books/datasetsv2....

lancemjoseph12y ago

Many of the datasets that I've seen in academia are stored in static SQL databases that tend to be about 10-20 terabytes. Where does this leave individuals with limited resources who would like to query large databases without having to juggle the data management side of research? Is there software that makes database querying accessible over P2P?

shitlord12y ago

I have an idle server with 500 Mb/s upload. Now I can finally put it to good use! :)

incogmind12y ago

I remember the old days of DC++ whenever I hear blazing fast speeds.

alagappanr12y ago

We would need a significant number of seeders in order for this to become a successfully used product. Perhaps universities can seed data?

mathattack12y ago

I am no expert on torrents, but I like this conceptually. Publicly funded academic research should be free.

talles12y ago

What a wonderful idea. This fits so well with the torrent protocol (maybe even philosophically speaking).

huevosabio12y ago

This is awesome! Thanks for sharing!

erikb12y ago

Awesome invention! Could this be connected with Google Scholar to add keyword searching?

CompleteMoron12y ago

thanks for sharing! I shall store in the vault of Hard Drives I keep here in the desert
