Some environments based mostly on prestige (StackOverflow, Wikipedia) have big problems with toxicity, which I didn't see at all in some music trackers.
https://en.wikipedia.org/w/index.php?title=Wikipedia:Systemi...
(using a version of the article from ten years ago, because everything on Wikipedia is unnecessarily verbose now)
But if I were to speculate, I'd guess it always propagates from the top. The point is that the visible community you can speak of is not randomly sampled from the user base, and the user base is people who just want to use the product, not to play corporate politics. If the goals of the general public end up somewhat aligned with those of the internal community of ladder-climbers, it works out fine. Otherwise it doesn't.
(And, by the way, ladder-climbers in most of these communities tend not to be the nicest people by default... Let's just say they are Dwights. So if you let them do things that are undesirable for the general community, they will.)
I think StackOverflow's philosophy is flawed by design: the main source of user frustration has always been that questions users very much need answered get closed as "too broad", "opinion-based" or something of the sort. Dwights love to exercise their power by noticing that something can be closed as "not a good fit for this site", and users who want that stuff discussed obviously hate it. That is something that could have been fixed from the top, but the top specifically wanted it this way.
Wikipedia is similar, but users and Dwights stand even further apart, since the general user doesn't even make an account to edit, doesn't look at who makes the edits, and doesn't know the internal playground. The main source of frustration here is a user who knows his stuff well and wants to share that knowledge, but is shut down by a Dwight because the subject is "of low importance" to him. This infuriates the user even more, considering that there are thousands of articles about some fucking Harry Potter-universe pokemon or whatever, which naturally raises no issue with Dwights, because they are Dwights and they love this stuff. This, too, is something to be solved organizationally from the very top.
Music trackers are way more meritocratic. People who eventually become moderators can be formalistic or not — it varies — but they generally just want a lot of music on the tracker, in a well-organised manner — and this is exactly what the general public wants! How they get motivated by the platform to contribute so much is another question — involvement sometimes seems to be much harder work than on Wikipedia — but the point is that they really do contribute useful stuff.
Also, music trackers tend to be way more liberal (in the sense of allowing freedom, not of being left-wing politically; ironically, quite the opposite is true nowadays). Nobody cares if somebody is rude, racist or whatever; if an off-topic flamewar goes over the top, the whole thread goes down. Otherwise you can post whatever you want, nobody gives a shit, and nobody is pressured by the media to do something about it. After all, unlike Twitter, Reddit or StackOverflow, they aren't traded on the stock market.
2014 HN discussion: https://news.ycombinator.com/item?id=7149006
Using RSS to allow mirrors to host different subjects is really clever, although some of the categories seem quite large (>5TB). It may be worth breaking each category into shards of 100GB or less, so a volunteer can pick a couple and not worry about running out of disk when a category grows.
Then it would be good to track how many seeds each category-shard has so volunteers can help where it's most needed.
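The shard-assignment part of that idea can be sketched as a simple greedy first-fit packing. This is just an illustration of the scheme described above, not anything the project actually implements; the function name and the (name, size) input format are assumptions:

```python
SHARD_CAP = 100 * 10**9  # assumed 100 GB cap per shard, per the suggestion above

def pack_into_shards(files, cap=SHARD_CAP):
    """Greedy first-fit-decreasing packing.

    files: iterable of (name, size_in_bytes) pairs.
    Returns a list of shards, each a list of names whose sizes sum to <= cap.
    """
    shards = []  # each entry: [remaining_capacity, list_of_names]
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        for shard in shards:
            if shard[0] >= size:       # fits in an existing shard
                shard[0] -= size
                shard[1].append(name)
                break
        else:                          # no shard had room: open a new one
            shards.append([cap - size, [name]])
    return [names for _, names in shards]
```

A tracker could then publish one torrent (and one RSS feed) per shard, and count seeds per shard to show volunteers where help is most needed.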
Incidentally, when the torrent file for your anime image collection passes 20MB, something has obviously gone very w̵r̵o̵n̵g̵ right.
There is no metadata - all you have is an awkward imprecise textual search of the abstract that comes with the data. Good luck hosting the world's data that way.
Through the magic of cryptographic hash algorithms, you can just keep your data sets floating around “raw” (like in these torrents), and then, elsewhere, ascribe metadata to the hash of the content it is meant to annotate.
Then, later, you can reassemble them in either order—either by first finding a data set, hashing it, and then looking up metadata in some metadata-hosting service; or by first browsing a catalogue of indexed metadata, finding out about a dataset that meets your needs, and then retrieving the data set by its hash.
Which is to say: with digital data, library science (creating metadata and chains-of-custody and indexing them for search) and archiving (ensuring access to pristine artifacts over time) don’t need to happen at the same time, in the same place. There can be separate “artifact hosting” and “metadata library” services. (Which is especially helpful in contexts where private IP is involved—you can still keep in your metadata library, the metadata for a data-set you don’t have the rights to; and those with the rights can go get the data-set themselves.)
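The split described above can be sketched in a few lines. The names here (`metadata_library`, `describe`) are illustrative, not any real service's API; the point is only that metadata keyed by a content hash can live apart from the bytes it annotates:

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Identify an artifact by the SHA-256 of its raw bytes."""
    return hashlib.sha256(data).hexdigest()

# A "metadata library" is just a mapping keyed by content hash,
# hosted independently of whoever seeds the actual data.
metadata_library = {}

def describe(data: bytes, **metadata):
    metadata_library[content_hash(data)] = metadata

blob = b"raw dataset bytes"
describe(blob, title="Example data set", license="unknown")

# Direction 1: you already have the artifact; hash it and look up its metadata.
info = metadata_library[content_hash(blob)]

# Direction 2: browse the catalogue first, then fetch the artifact by its hash
# from whatever host happens to carry it.
wanted_hash = next(h for h, m in metadata_library.items()
                   if m.get("title") == "Example data set")
```

Nothing about the library requires holding the bytes themselves, which is the point about IP-encumbered data sets: the catalogue entry can exist even when the data can't be redistributed.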
> …that you don’t need to keep digital data’s metadata attached to the data “at the hip.”
You don't have to, but it's still mostly a good idea. And this stuff isn't either-or; we can have both. This is especially true for research-oriented files, where consumers are often unable or unwilling to maintain a functional metadata store and do a lot of manual file handling. Saying "well, somebody could have set up a super-awesome metadata system that tracks this" doesn't magically make those resources exist.
Library scientists might say archiving, structuring and curation are all facets of that science. And you'll also want a hash search engine that finds related hashes, since there can be many revisions and versions, only some of which have any metadata.
import numpy as np  # assumed: the snippet operates on numpy label arrays

def get_labels(rightside):
    met = {}
    # Ratio of non-background to background voxels
    # (assumes the volume contains at least some background).
    met['brain'] = (
        1. * (rightside != 0).sum() / (rightside == 0).sum())
    # Fraction of tumor-labelled voxels (classes > 2) among brain voxels;
    # the epsilon guards against an all-background volume.
    met['tumor'] = (
        1. * (rightside > 2).sum() / ((rightside != 0).sum() + 1e-10))
    met['has_enough_brain'] = met['brain'] > 0.30
    met['has_tumor'] = met['tumor'] > 0.01
    return met
I will say that it is very handy to know exactly how the labels were computed. What I really meant is a way to search and select data based on metadata, for example has_tumor.
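Once per-file label dicts like the one above exist, "search and select by metadata" is just a filter over the label records rather than over the raw blobs. A minimal sketch (the file names and the `select` helper are made up for illustration):

```python
# Hypothetical label records, one per data file, shaped like get_labels() output.
labels_by_file = {
    "scan_001": {"has_enough_brain": True,  "has_tumor": True},
    "scan_002": {"has_enough_brain": True,  "has_tumor": False},
    "scan_003": {"has_enough_brain": False, "has_tumor": True},
}

def select(where, records):
    """Return the names of all records whose metadata satisfies `where`."""
    return sorted(name for name, met in records.items() if where(met))

tumor_scans = select(lambda m: m["has_tumor"], labels_by_file)
# tumor_scans == ["scan_001", "scan_003"]
```

The records are tiny compared to the volumes themselves, so they could be hosted and queried separately, which is exactly what a single opaque torrent blob prevents.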
Also note how everything is still one single blob: to get one line of any of the files, one would need to download everything.