I don't think I can agree with that. GitHub's success, IMO, is based almost entirely on its openness. It has turned contributing to open source software into a drop-dead easy task; most of these projects would never be found, let alone contributed to, if they weren't open. And they keep making it easier. I've fixed a number of things from machines that don't have Git installed, simply because GitHub has an in-browser editor.
Imagine if GitHub were behind a paywall. Do you think it would still be the success it is today? And, I may be weird, but I very rarely look at the names associated with commit histories. The code should speak for itself.
The rest of it sounds about right: scientific publishing as a whole is massively backwards compared to GitHub, if you're looking at it from an "Open" perspective. But I think much of that is because researchers tend to be insular compared to implementers (businesses guarding their IP aside; they're not really GitHub's target audience anyway). GitHub isn't primarily a place for comp-sci researchers to post their findings with code; it's for people doing things with ideas others have contributed to.
There are experiments on GitHub, absolutely. I have a few myself. But the main thing that GitHub has done is to make final products easy to find, modify, and contribute to. I have significant doubts that it would fit a research workflow smoothly, without becoming something else entirely.
Personally, one motivation (among many) is indeed prestige: now I can show off my nice work.
Sure, openness started it all: Linus shared his VCS, which in turn sparked GitHub, which in turn inspired thousands of developers to share their code. But openness really isn't the sole driver for people to share their code any more; other incentives, of which prestige is an important one, now drive GitHub's popularity.
But people do that with other open source sites as well. Does GitHub provide this feature better than SourceForge or others? You still need to go to the user's page; it isn't advertised anywhere else. Do people go to GitHub to see information about person X, or project Y? And for the posts on social sites, are they more often about the creator or about what they created?
The result is similar to the GitHub situation in many ways. Because there are no barriers to publishing, everyone makes up their own mind about which papers are interesting. If your work is relevant, others will build on it and cite you. They will discuss it in their group meeting, and so on. A scientist's reputation is then directly related to the quality of her work, as judged by the community, with no artificial barriers [1]. This means that a self-respecting scientist would not publish a sub-par paper even though it's technically possible to do so, because that would hurt her reputation.
So it seems to me that the situation in high-energy physics is close to ideal, with respect to ease of publishing and the social aspect of reputation. Having said that, there are certainly aspects of GitHub that I would love to see adopted.
For instance, when several researchers are writing a paper, generally no version control system is employed. Instead, at any point in time the draft is "locked" by one of the collaborators, and only that person can change it. Beyond the obvious inefficiency of this method, note that it is also difficult to track what changes were made in each lock cycle. I use diff for this purpose, but in my experience many scientists in the field aren't aware of such tools. So something that could really help is a simple way to collaborate on papers, just a basic source control system. Also, it must be possible to work on the paper in private within the collaboration, and only publish the end result.
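To make concrete the kind of change tracking I mean: even without a full VCS, comparing two lock-cycle snapshots of a draft is a few lines of scripting. Here's a minimal sketch in Python (the file names are hypothetical placeholders):

    import difflib

    # Two snapshots of the draft, saved at the end of successive
    # "lock" cycles. File names are hypothetical placeholders.
    with open("draft_v1.tex") as f:
        old = f.readlines()
    with open("draft_v2.tex") as f:
        new = f.readlines()

    # unified_diff yields only the changed regions plus a few context
    # lines, so collaborators can see exactly what the last cycle changed.
    for line in difflib.unified_diff(old, new, fromfile="draft_v1.tex",
                                     tofile="draft_v2.tex"):
        print(line, end="")

A real source control system would of course also handle merging and history, but even this much would beat the lock-and-email workflow.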
[1] The few barriers that exist are in place to keep out the crackpots, who reduce the signal-to-noise ratio and in that sense resemble spammers.
Peer-reviewed journals are completely superfluous for disseminating knowledge.
So peer-reviewed journals are only important for grants, but your community doesn't rely on peer-reviewed journals. Do you not rely on grants? Who funds your work? Do you all work for free, in your spare time?
If researchers also need to publish their work in journals, write grant proposals etc., how is it relevant to the idea of applying the GitHub model to science? Of course raising money is part of the job for a professor, but thanks to the arXiv it's decoupled from the actual research work. It's at a point where I, as a Ph.D. student, have no reason to consider publishing in reviewed journals. This is in contrast to my friends in optics or condensed matter, for whom a publication in Nature or Science practically guarantees a good postdoc position.
This post strikes me as charmingly naive. You have to love this guy. And yet any essay that discusses the incentive structure of science but doesn't use the word "grant" until the last sentence is beating around the bush. Follow the money, my friends.
The publications are a side issue. To the extent that your count of top-tier publications matters when trying to get an academic job, it's because it's correlated with your ability to bring in money. (Money comes from peer review too, and what your peers want to read about is also what they want to fund.) What the hiring committees really want is grants. Grant money pays for labs and salaries. It pays for grad students and postdocs. And grant money literally buys prestige: Big projects come from big grants, and big grants require strong track records and a bunch of preliminary data, which in turn comes from smaller grants, or from the shared equipment that your neighbor bought with her grants.
The fact that there aren't that many top-tier peer-reviewed journals is a side effect of the limited number of top scientists, and the number of scientists is limited by available resources, not by lack of knowledge or connections or education. I could literally pick up the phone and reach a dozen Ivy-educated postdocs who would be full-time scientists if they could afford it.
Why can you find so much great software on Github? There are lots of reasons, but a fundamental one is: Moore's Law. Computer hardware has become so dirt cheap that you can be a programmer in your spare time. You can literally be a twelve-year-old kid with a $200 cast-off computer and yet do top-notch software work. If computers cost millions of dollars each, like they did in 1963, we wouldn't have Github. We'd have the drawer of a desk on the ninth floor of Tech Square. (After all, in the old days half the AI researchers in the world lived within a few miles of that drawer, and the others were just a phone call away.) That's how most advanced science works today: There's no need for more publishing infrastructure for scientific technique, because the available methods of getting the word out -- top journals, second-tier journals, email, the phone, bumping into people in the hallway at conferences -- scale well enough to meet the limited demand. Because just having the recipe for your very own scanning multiphoton microscope doesn't do you much good: You need a $150,000 laser, and a $200,000 microscope, and tens of thousands of dollars in lenses and filters and dyes, and a couple of trained optics experts to maintain the thing, and that's before you even have something to photograph.
I wish there were a magical way to turn everyone's suburban basement into a cancer research lab, the way Github has turned everyone's couch into a potential CS research lab, but there's no magic bullet. A few technologies, like DNA sequencing, are sufficiently generic, useful, and automatable to be amenable to Moore's-Law-based solutions, so we probably will soon be able to (e.g.) drop leaves into the hopper of a $1000 box and get a readout of the tree's genetics. But something like cancer research is never going to be cheap. To study cancer you must first have a creature that has cancer. Mice are as cheap as those get, and mice are not cheap, especially if you know what the word mycoplasma means.
After spending some time with the Open Wetware/DIYBio guys, I realized that no one (other than the DIYBio guys) has really spent time reducing the costs of the fundamental toolkit of the molecular biologist.
Now they are starting to do so:
http://openwetware.org/wiki/DIYbio:Notebook/Open_Gel_Box_2.0...
That gel box is going to be about $200-$300 all in. You might not get a multiphoton microscope right off the bat, any more than you'd get access to all of Goog's servers when you were just starting out with your laptop... but reducing the price of entry for the hobbyist biologist/biochemist to, say, $5k or less is going to be really important.
That's because the role of the DIYBio community is set to wax in a big way. With the NIH budget cuts and the coming fiscal collapse of the US Government, the sun is about to set on a ~70 year period (1945-2015) of centralized US government research. The period before 1945 was more innovative by many measures (e.g. http://www.nytimes.com/2011/01/30/business/30view.html?_r=1), and potentially the period afterwards will be as well.
(As an aside, there are fascinating historical examples of people who did invent the web before inventing the personal computer, like this guy:
http://en.wikipedia.org/wiki/Paul_Otlet
Of course, he ended up as many of these slightly-too-early visionaries did: His work lost all its funding and he was kind of sad.)
Build the gel box and then worry about the social network. Or better yet, don't worry about the social network at all: These gel box users will network themselves, no problem. It would be a challenge to stop them from finding each other online.
A few examples to illustrate my point:
1. Open farm tech: a tractor and a brick compressor at 1/3 to 1/10 of commercial prices. They plan a whole set of manufacturing equipment at those reduced prices.
2. Students took pictures of Earth from space using £90 in equipment.
3. A DIY electron microscope a guy built himself [1], which probably costs a lot less than commercial ones.
There's a lot more extremely cheap stuff like this. Maybe the first place to open-source science is in support of open source tool development.
[1] http://blog.makezine.com/archive/2011/03/diy-scanning-electr...
Perhaps I should have talked about something boring, like absolutely clean, absolutely sterile containers. There is nothing sexy about containers and pipettes, and they are individually cheap. But they add up. The most reliable technique is to just buy quantities of disposable ones. You can try to save money by using recyclable containers instead, by washing dishes a lot, but washing dishes by hand is expensive even if you don't do it very patiently and carefully. And you need to be really careful, because if you move your cell culture from dish to dish to dish ten times, and even one of those dishes is contaminated, guess what? Your experiment might now be contaminated. And for every Alexander Fleming (a lucky guy whose contaminated experiment turned out to be a Nobel-Prize-winning medical breakthrough) there are a thousand experiments that have to be discarded because the data isn't reliable.
But even that argument is misleading, because the most expensive ingredient in science is not even a material good. It's time. Science is about patience and consistency. Doing an experiment once is not science. Doing it one thousand times and getting absolutely consistent results -- that is science. The work of being a scientist is about carefully building and debugging a reliable sequence of steps ("grow, filter, sort, lyse, plate, stain, image"): A sequence that can be repeated over and over to obtain thousands of data points that are extremely self-consistent.
The reason why professional scientists use such expensive equipment is that the equipment is actually cheap compared to the cost of spending two years taking data that turn out to be full of errors because your tools weren't reliable. Too much random error and you won't see your data amid the noise; too much systematic error and you might eventually have to throw out 100% of your work and start over. Trust me: If you want to experience soul-crushing misery [1], work sixty hours a week taking data for two years, then set the data on fire because it's unsalvageable. I have seen this happen many times. It has happened to me. It happens all the time in science, but you can't afford to have it happen too often.
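The textbook way to see why repetition fixes one kind of error and not the other (standard statistics, nothing specific to any one field): if each of N measurements is the true value plus a fixed bias plus independent noise, averaging kills the noise but not the bias.

    % Model: x_i = mu + b + eps_i, with systematic bias b and
    % independent random noise eps_i of variance sigma^2.
    \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i,
    \qquad
    \mathbb{E}\left[(\bar{x}-\mu)^2\right] = b^2 + \frac{\sigma^2}{N}.

The sigma^2/N term shrinks as you take more data; the b^2 term sits there forever, which is why unreliable tools can quietly poison two years of work.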
So, yeah, you can take pictures of earth from space for $200, but can you take the same picture one thousand times, under consistent lighting and from consistent altitude and position? Yeah, you can build an electron microscope in your basement, but will it keep working every day for five years while you take your data? How much maintenance will it require? How much time will you waste waiting for it to pump down every time you change samples, or tweaking the knobs for hours every time to get a usable picture? Will it spread thin layers of carbon on your electronics, or go through a six-month phase where it can't focus well enough to see your samples?
---
[1] Fortunately, the misery is temporary. If it isn't, you're just not cut out to be an experimental scientist. Science requires reserves of forward-looking optimism! So what if yesterday was hell? It was a learning experience that will make tomorrow better! ;)
Sure, a "Github of science" wouldn't turn anyone's basement into a cancer research lab, but it would mean that a lot of researchers at less-affluent universities would finally have full access to the literature of science.
We've had the technology to publish science online for decades. We have tinkered with it dozens of times. The web was originally invented for exactly this purpose. Far older things, like TeX, were invented for this purpose. Nowadays we have everything from PLoS to arXiv to Google Scholar to custom in-house blogs to PDFs sent through email.
The continued existence of for-profit journals is an economic, political, and anthropological problem, not a technological one. PLoS and the like are slowly changing things, but I still suspect that the only way to free our journals within less than a generation or two is to lobby (e.g.) the NIH to require that their funded projects be published in free journals. When a grant agency talks, people listen. When postdocs talk, alas, it makes a very subtle sound. ;)
Can't speak for other fields, but in CS any outlet worth submitting to will let you post a preprint on your own website, and from there Google Scholar will pick it up and place the link to your free copy right next to the paywall link. It's not a perfect system (it's an extreme form of price discrimination, like when Microsoft turns a blind eye to piracy in emerging markets), but it does give access to those who can't pay, and it's not as dystopian as you make it out to be.
I agree about funding being an ulterior motive for a hiring committee to look at publication history. In my defense, I brought it up about 2/3 of the way through with:
"We need trusted ways to quantify just how useful that API and associated code are to the scientific community, which can be listed on a scientist's profile and utilized by committees making hiring and funding decisions."
Science is constrained by resources, like everything else, but that doesn't mean scientists shouldn't try to optimize what can be done with the resources available.
Also, you imply that good science requires generating your own experimental data, and bearing the associated costs, but that's only true when experimentalists aren't incentivized to share their data.
And you're correct: The trick to encouraging open source science is not to focus on the social networking tech -- that will be ready when you need it -- but to first attack the problem of doing quality lab work on the cheap. That's where the bottleneck is.
The biggest problem with shared facilities is the tragedy of the commons. In engineering -- or machining or woodworking or cooking, for that matter -- you quickly learn the importance of having your own tools. It only takes seconds to ruin a good tool. It only takes seconds to contaminate your cell culture, or your neighbor's cell culture, or an entire room full of your department's laboratory mice.
And mailing your samples off to a distant "virtual" lab is fine if you're studying disposable samples, or inorganic samples, or samples that have been permanently fixed and preserved on a glass slide. But living cells ship poorly even when you're allowed to ship them at all, and animals ship even more poorly than that. So often you've got to live next door to the equipment you're trying to share, and that's still expensive.
Think how much more expensive coding would be if it cost $50 per compile (or hell, $10 per compile).
I realized after starting that scientific communication is more complex than it first appears, or at least it tries to be, for various reasons. I could use help learning what people want from such a system.
I am keen on feedback or insights to drive my development. Please, if you are interested, you can reach me at sunir at bibdex com.
Someday there might be 1,000,000 well-defined science/math questions, along with great answers.
If we want non-programmers to use git, we need GUIs that instantly visualize the state, the available commands, and other possibilities. No non-programmer is going to learn git from a CLI.
Not that I'm disagreeing with you, but making git point-and-clickable doesn't strike me as being very simple.
I think that GitHub is such a good tool for interacting with git repos that if they made a version that worked locally (the main difference being explicit handling of the index and the working tree), I'd use it to manage git projects in a heartbeat.
The main unanswered questions for this idea are 1) Funding & 2) Maintenance. Knowble was a for-profit venture, but should have been a non-profit organization. So where can you/someone get the funding to build & maintain the site?
If you need a python hacker to help out - my email is emile.petrone (at) gmail.com
These comments are evidence enough of this. Some have already mentioned arXiv.org, and others Science.io, which seems to be specifically targeted at CS. When you add medical research, the needs of these branches are vastly different.
The way this will happen is that a grad student hacker who is avoiding working on her thesis will start coding it, and then create a Kickstarter asking for support to spend the summer working on it. If she's a credible engineer, she'll get the support easily, every subsequent Kickstarter grant will also be fulfilled, and it'll get built.
If you build it (right) they will come.
Science is different. The amateurs are called cranks, and a small community of professionals does the good stuff. (There are exceptions, but few.) The basic issue is who will pay their living expenses, and buy the million dollar machines that they work on.
These days, almost all research money is spent by governments. They spend most of it rewarding people for publishing in prestigious journals. Scientists will keep packaging their research that way until someone starts buying it in a different package.
I do not believe open source solves all problems, but you dismiss the incredible value and quality of so much that I find it difficult to take the rest of the comment seriously.
It sounds more like "People with credentials (whether scientists or professional programmers) are the only people who can produce quality work. I have credentials. I'm part of the elite who can do quality work."
BTW, you're implying that governments are funding over-priced journals that people outside of prestigious institutions cannot afford. Even though that's true, it's not the way it should be.
The service peer review provides is the filtering of crap, so that not everyone has to do that for himself. This makes science possible, as not everyone can be a master of all trades.
Publication without review is called "journalism".
As a side note, I believe that Elsevier has acquired an extreme market dominance in the scientific publishing sector and is abusing it in alarming ways.
In addition to the open prestige inherent in GitHub, there is also the fact that one's work is vetted by a community. It becomes very difficult, if not impossible, to publish crap and claim that it is quality. In science this is not the case. The peer review system is supposed to protect us against that. However, my understanding is that a surprising percentage of research in top-ten journals can't be reproduced, either because key implementation details are missing or because it is actually not reproducible.
A GitHub for science could also meaningfully move the ball forward in making science reproducible, since it should be easy to wrap one's scripts in a specification of an "environment" that can be readily set up, deployed, and run. A lot of work would be required to develop corollaries for non-computational scientific domains, but it would be a hugely valuable effort, as discussed in the general reproducible-research community (http://reproducibleresearch.net/).
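As a rough illustration of what capturing such an "environment" could look like for computational work (this is a hypothetical sketch in Python, not an existing tool; the file name and function are made up):

    import json
    import platform
    import sys
    from importlib import metadata

    def capture_environment(path="environment.json"):
        """Record the interpreter, OS, and installed package versions
        alongside an analysis, so others can rebuild the setup."""
        env = {
            "python": sys.version,
            "platform": platform.platform(),
            "packages": {
                dist.metadata["Name"]: dist.version
                for dist in metadata.distributions()
            },
        }
        with open(path, "w") as f:
            json.dump(env, f, indent=2, sort_keys=True)

    if __name__ == "__main__":
        capture_environment()

Checking a file like this in next to the data and scripts is crude compared to a full container or lockfile, but it captures the spirit: the environment travels with the result.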
Funnily, academia harbours the most brilliant minds of CS, yet barely produces usable software. It's the people who identify problems and provide software/ideas who actually get things moving. GitHub, the blogosphere, etc. allow such solutions to emerge more efficiently by letting a lot of people look at them. In academia, a publication is taken as the end point of problem solving. There are no incentives to build real software or real systems.
If computer science wants to make a difference, it must move away from its publish or perish culture.