Some environments based mostly on prestige (StackOverflow, Wikipedia) have big problems with toxicity, which I didn't see at all in some music trackers.
https://en.wikipedia.org/w/index.php?title=Wikipedia:Systemi...
(using a version of the article from ten years ago, because everything on Wikipedia is unnecessarily verbose now)
But if I were to speculate, I'd guess it always propagates from the top. The point is that the visible community you can speak of is not randomly sampled from the user base, and the user base is people who just want to use the product, not to play corporate politics. If the goals of the general public end up somewhat aligned with those of the internal community of ladder-climbers, it works out fine. Otherwise it doesn't.
(And, by the way, ladder-climbers in most of these communities tend not to be the nicest people by default... Let's just say they are Dwights. So if you let them do things that are undesirable for the general community, they will.)
I think StackOverflow's philosophy is flawed by design: the main source of user frustration has always been that questions users very much need answered get closed as "too broad", "opinion-based" or something of the sort. Dwights love to exercise their power by noticing that something can be closed as "not a good fit for this site", and users who want that stuff discussed obviously hate it. That is something that could have been fixed from the top, but the top specifically wanted it this way.
Wikipedia is similar, but users and Dwights stand even further apart, since the general user doesn't even make an account to edit, doesn't look at who makes the edits, and doesn't know the internal playground. The main source of frustration here is a user who knows his stuff well and wants to share that knowledge, but is shut down by a Dwight because the subject is "of low importance" to him. This infuriates the user even more, considering that there are thousands of articles about some fucking Harry Potter-universe pokemon or whatever, which naturally raises no issue with Dwights, because they are Dwights and they love this stuff. This, too, is something to be solved organizationally from the very top.
Music trackers are way more meritocratic. People who eventually become moderators can be formalistic or not — it varies — but they generally just want a lot of music on the tracker, in a well-organised manner — and this is exactly what the general public wants! How they get motivated by the platform to contribute so much is another question — involvement sometimes seems to be much harder work than on Wikipedia — but the point is that they really do contribute useful stuff.
Also, music trackers tend to be way more liberal (in the sense of allowing freedom, not of being left-wing politically; ironically, quite the opposite is true nowadays). Nobody cares if somebody is rude, racist or whatever; if an off-topic flamewar goes over the top, the whole thread goes down. Otherwise you can post whatever you want, nobody gives a shit, and nobody is pressured by the media to do something about it. After all, unlike Twitter, Reddit or StackOverflow, they aren't traded on the stock market.
2014 HN discussion: https://news.ycombinator.com/item?id=7149006
Using RSS to allow mirrors to host different subjects is really clever, although some of the categories seem quite large (>5TB). It may be worth breaking each category into shards of 100GB or less, so a volunteer can pick a couple and not worry about running out of disk when a category grows.
Then it would be good to track how many seeds each category-shard has so volunteers can help where it's most needed.
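The shard-assignment part of that idea can be sketched as a simple greedy first-fit packing. This is just an illustration of the scheme described above, not anything the project actually implements; the function name and the (name, size) input format are assumptions:

```python
SHARD_CAP = 100 * 10**9  # assumed 100 GB cap per shard, per the suggestion above

def pack_into_shards(files, cap=SHARD_CAP):
    """Greedy first-fit-decreasing packing.

    files: iterable of (name, size_in_bytes) pairs.
    Returns a list of shards, each a list of names whose sizes sum to <= cap.
    """
    shards = []  # each entry: [remaining_capacity, list_of_names]
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        for shard in shards:
            if shard[0] >= size:       # fits in an existing shard
                shard[0] -= size
                shard[1].append(name)
                break
        else:                          # no shard had room: open a new one
            shards.append([cap - size, [name]])
    return [names for _, names in shards]
```

A tracker could then publish one torrent (and one RSS feed) per shard, and count seeds per shard to show volunteers where help is most needed.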
Incidentally, when the torrent file for your anime image collection passes 20MB, something has obviously gone very w̵r̵o̵n̵g̵ right.
There is no metadata - all you have is an awkward imprecise textual search of the abstract that comes with the data. Good luck hosting the world's data that way.
Through the magic of cryptographic hash algorithms, you can just keep your data sets floating around “raw” (like in these torrents), and then, elsewhere, ascribe metadata to the hash of the content it is meant to annotate.
Then, later, you can reassemble them in either order—either by first finding a data set, hashing it, and then looking up metadata in some metadata-hosting service; or by first browsing a catalogue of indexed metadata, finding out about a dataset that meets your needs, and then retrieving the data set by its hash.
Which is to say: with digital data, library science (creating metadata and chains-of-custody and indexing them for search) and archiving (ensuring access to pristine artifacts over time) don’t need to happen at the same time, in the same place. There can be separate “artifact hosting” and “metadata library” services. (Which is especially helpful in contexts where private IP is involved—you can still keep in your metadata library, the metadata for a data-set you don’t have the rights to; and those with the rights can go get the data-set themselves.)
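The split described above can be sketched in a few lines. The names here (`metadata_library`, `describe`) are illustrative, not any real service's API; the point is only that metadata keyed by a content hash can live apart from the bytes it annotates:

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Identify an artifact by the SHA-256 of its raw bytes."""
    return hashlib.sha256(data).hexdigest()

# A "metadata library" is just a mapping keyed by content hash,
# hosted independently of whoever seeds the actual data.
metadata_library = {}

def describe(data: bytes, **metadata):
    metadata_library[content_hash(data)] = metadata

blob = b"raw dataset bytes"
describe(blob, title="Example data set", license="unknown")

# Direction 1: you already have the artifact; hash it and look up its metadata.
info = metadata_library[content_hash(blob)]

# Direction 2: browse the catalogue first, then fetch the artifact by its hash
# from whatever host happens to carry it.
wanted_hash = next(h for h, m in metadata_library.items()
                   if m.get("title") == "Example data set")
```

Nothing about the library requires holding the bytes themselves, which is the point about IP-encumbered data sets: the catalogue entry can exist even when the data can't be redistributed.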
> …that you don’t need to keep digital data’s metadata attached to the data “at the hip.”
You don't have to, but it's still mostly a good idea. And this stuff isn't either-or; we can have both. This is especially true for research-oriented files, where consumers are often unable or unwilling to maintain a functional metadata store and do a lot of manual file handling. Saying "well, somebody could have set up a super-awesome metadata system that tracks this" doesn't magically make those resources exist.
Library scientists might say archiving, structuring and curation are all facets of that science. And you'll also want a hash search engine that finds related hashes, since there can be many revisions and versions, only some of which have any metadata.
import numpy as np  # assumed: the snippet operates on numpy label arrays

def get_labels(rightside):
    met = {}
    # Ratio of non-background to background voxels
    # (assumes the volume contains at least some background).
    met['brain'] = (
        1. * (rightside != 0).sum() / (rightside == 0).sum())
    # Fraction of tumor-labelled voxels (classes > 2) among brain voxels;
    # the epsilon guards against an all-background volume.
    met['tumor'] = (
        1. * (rightside > 2).sum() / ((rightside != 0).sum() + 1e-10))
    met['has_enough_brain'] = met['brain'] > 0.30
    met['has_tumor'] = met['tumor'] > 0.01
    return met
I will say that it is very handy to know exactly how the labels were computed. What I really meant is a way to search and select data based on metadata, for example has_tumor.
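Once per-file label dicts like the one above exist, "search and select by metadata" is just a filter over the label records rather than over the raw blobs. A minimal sketch (the file names and the `select` helper are made up for illustration):

```python
# Hypothetical label records, one per data file, shaped like get_labels() output.
labels_by_file = {
    "scan_001": {"has_enough_brain": True,  "has_tumor": True},
    "scan_002": {"has_enough_brain": True,  "has_tumor": False},
    "scan_003": {"has_enough_brain": False, "has_tumor": True},
}

def select(where, records):
    """Return the names of all records whose metadata satisfies `where`."""
    return sorted(name for name, met in records.items() if where(met))

tumor_scans = select(lambda m: m["has_tumor"], labels_by_file)
# tumor_scans == ["scan_001", "scan_003"]
```

The records are tiny compared to the volumes themselves, so they could be hosted and queried separately, which is exactly what a single opaque torrent blob prevents.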
Also note how everything is still one single blob: to get one line of any of the files, one would need to download everything.