No, no it really hasn't. It has relied on the ability to make predictions based on published theories, methods, laws etc. Even for hard-science experiments it's not even clear how you could record all the required knowledge to replicate an experiment. Every configuration, every machine, every particle in the air, every bit of software.
I really wish people engaged more with the actual history of science instead of what they believe it to be.
edit. To give a little more meat to my rant here's a good reading (https://plato.stanford.edu/entries/scientific-reproducibilit...):
"If replication played such an essential or distinguishing role in science, we might expect it to be a prominent theme in the history of science. Steinle (2016) (...) claims that the role and value of replication in experimental replication is 'much more complex than easy textbook accounts make us believe' (2016: 60), particularly since each scientific inquiry is always tied to a variety of contextual considerations that can affect the importance of replication. Such considerations include the relationship between experimental results and the background of accepted theory at the time, the practical and resource constraints on pursuing replication and the perceived credibility of the researchers. These contextual factors, he claims, mean that replication was a key or even overriding determinant of acceptance of research claims in some cases, but not in others."
The history of replications is extremely nuanced. Empirical results and by extension replications are one line of argument in scientific discourse, but by no means the only one.
I personally hold that valid predictions in the context of interesting problems are where it's really at. In "The Structure of Scientific Revolutions", Kuhn argues that at some point paradigms cannot make THESE kinds of predictions anymore. Revolutions do not happen because of failed or missing replications.
Therefore, stating science "has relied on replication" is historically and epistemologically false. It's also misleading because the replication crisis happens due to a lack of theory and misguided incentives, not because some discipline has left the holy path of finding truth.
This might be a semantic argument, but what you describe is replication. Imagine every scientist would say "In my experiment, I perfectly predicted this and that, oh but no one else would ever be able to run that experiment again, so just trust me, ok?"
Replication/reproducibility isn't about logging every configuration, machine or particle. It's about being able to run the same method and get the same result. If that isn't the case, how do we know the predictions are correct?
I don't know much about this topic, but isn't this kind of what's happening with for instance the Large Hadron Collider? We seem to collectively trust the results from LHC experiments, even though no one can replicate them because there is only one LHC and AFAIK no one is building another particle accelerator for similar or greater energy levels. There's no guarantee that we'll see another particle accelerator of that scale in our lifetime.
It seems to me that your point implies that LHC experiments shouldn't be considered to be science, since they cannot practically be replicated any time soon (except at the LHC itself, which somewhat defeats the purpose of replication). But I (not a physicist) find myself quite trusting of their results tbh. I'm not sure I can fully articulate why, and I'm not sure I have a good reason for it.
>>No, no it really hasn't.
What do you mean no it hasn't? Reproducibility of scientific findings is certainly a cornerstone of science, at least according to all definitions of "scientific method" I have ever found.
>Crucially, experimental and theoretical results __must be reproduced__ by others within the scientific community.
https://en.wikipedia.org/wiki/Scientific_method
>Reproducibility, also known as replicability and repeatability, __is a major principle underpinning the scientific method.__
I see reproducibility as an aspirational goal, at least in the subset of "discovery" science where the competition is to be first to identify a new scientific principle. Reproducibility is more important in another area- people building tools for others to use for their own research.
While technical reproducibility isn't always required (in case of observing a rare event, and there is enough expertise to evaluate the fidelity of the experiment / observation), it's also a bit of a strawman to attack this point specifically, because in any case science advancement needs a body of evidence appropriate for the theory being tested, and replicability is crucial for a field like DL where it should have been relatively easy, and where the basic premise axiomatically requires reproducibility (that a reapplication of the same techniques should yield comparable results).
Frequently the PIs (bosses) will not even glance at the repositories written by junior members, probably can't read code anyway, and certainly won't allocate time for their maintenance. Even worse, most academics who do publish code have never been exposed to real world software engineers, their techniques, or tools.
Suppose I told you to develop good software that's novel enough to publish about, but only gave you enough budget to pay your SWEs a maximum of $30K/yr. That's one zero, for those reading quickly. Additional non-benefits:
1. Unlike literally every other job in the country, you don't have budget to pay FICA taxes for your employees, and tax code allows this. This means your employees don't even have the USA's paltry social safety net to fall back on if they are hit by a bus or graduate into a massive recession, and their years working for you do not count toward social security or medicare retirement benefits.
2. Obviously, there is no budget for 401K retirement benefits
3. No CoL raises
4. Healthcare benefits will be paltry.
5. Your SWEs need to serve as a teaching assistant every once in a while. This likely means grading homework and a few late evenings of grading exams. No overtime for those late nights, obviously.
6. All travel, which is mandatory and often international, must be paid by the employee up front and reimbursement can take 1-3 months. We don't trust $30K/yr drones with corporate cards. Good luck making rent after a conference :)
Just to reiterate: You need to hire SWEs. You pay $30K/yr (less than some Amazon warehouses!), benefits package is literally worse than a part-time gig at a supermarket or fast food joint, and your employee is expected to give you $2K-$4K loans a few times a year while living paycheck to paycheck.
I just roll my eyes hard when I see complaints about garbage research code. Almost everyone in my PhD cohort had FAANG or finance offers; we were all taking 5x-10x paycuts to work on interesting problems and do science. If you want productizable research prototypes, hire PhDs to do science for you.
(And I say this, for the record, as a rare PhD who during their phd wrote code that is well-documented, well-maintained, and still used by dozens of companies for business-critical processes many years later.)
By this logic all companies maliciously sell broken software in order to charge for updates.
But obviously not all companies do that, those that do get called out for it, those that produce good products get a good reputation for it, etc. Similar things apply to academia.
"It ran once" papers run the risk of not getting cited as much compared to good papers with robust implementations, so the maligned incentive you describe isn't as clear cut, even in the corner case where the novelty is considered to be in the algorithm rather than the particular implementation. Worse, if the algorithm fails to reproduce, a researcher runs the risk of being retracted or shamed in subsequent publications when their work fails to reproduce. And reproducibility is a key aspect of journal publications in reputable journals, meaning less reproducible work will end up in lower quality outlets which often hurt one's career more than they help.
Your analogy here is not great since the parent's claim is that academics have no incentive to produce good, reproducible research, not that they are maliciously creating bad research.
A more apt analogy would be:
"By this logic all companies would be driven by quarterly metrics and rush out broken software and then charge for updates/support"
... which is pretty much the exact state of the industry right now.
Well crafted academic software is a rarity - the stuff that does exist tends to come out of institutes where the software is necessary to their wider mission - like the Broad or Sanger Institutes.
Frankly, unlike the author, I think there are too many people in the field. They produce a handful of papers worth reading every year along with thousands upon thousands of models that may or may not slightly improve performance on a specific task and then have no general value beyond that. And I don’t believe this will change much; ML is likely the most monetizable PhD path by a safe margin, so there is too much profit incentive to churn out crap at any cost.
I don't know why DL libraries are so afflicted by this, maybe things just move so fast. But it is such a pain in the ass.
Make a new venv for everything and don’t pollute the global environment and it should be fine.
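For anyone who wants the "new venv for everything" advice spelled out, here is a minimal sketch using only the standard library. The helper name `make_project_env` and the `.venv` directory name are my own choices for illustration, not something from the thread; `venv.EnvBuilder` is the stdlib API behind `python -m venv`.

```python
import os
import tempfile
import venv
from pathlib import Path


def make_project_env(project_dir: Path, with_pip: bool = True) -> Path:
    """Create an isolated virtual environment at <project_dir>/.venv.

    The function name and the ".venv" location are illustrative assumptions.
    """
    env_dir = project_dir / ".venv"
    builder = venv.EnvBuilder(
        with_pip=with_pip,             # bootstrap pip inside the environment
        clear=True,                    # start from an empty directory each time
        symlinks=(os.name != "nt"),    # same default the `python -m venv` CLI uses
    )
    builder.create(env_dir)
    return env_dir


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing global is touched.
    with tempfile.TemporaryDirectory() as tmp:
        env = make_project_env(Path(tmp), with_pip=False)
        print((env / "pyvenv.cfg").read_text())
```

This only isolates the Python packages. It does nothing about CUDA, system libraries, or the interpreter version itself, which is where a lot of the DL pain comes from.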
This is definitely the case in DL (and I'm assuming elsewhere too but I wouldn't know).
I've honestly lost count of the times I've tried running 1-2 year old paper GitHub repos with some detail missing (like the Python version!) that makes them non-trivial to run as is. Libraries make undocumented breaking changes, the pickle format is wrong, the authors used a nightly version which never made it to a tagged release, and so on.
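A rough sketch of how little it takes to record the missing details (Python version, platform, installed package versions) next to the results. The function name `snapshot_environment` and the output layout are hypothetical; the calls used (`platform.python_version`, `importlib.metadata.distributions`) are standard library and need Python 3.8 or newer.

```python
import json
import platform
from importlib import metadata


def snapshot_environment() -> dict:
    """Collect interpreter and package versions for the current environment.

    The dict layout is an illustrative assumption, not a standard format.
    """
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that have no name recorded
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }


if __name__ == "__main__":
    # Dump next to the experiment outputs, e.g. redirect to environment.json.
    print(json.dumps(snapshot_environment(), indent=2))
```

It won't capture nightly builds installed from a local wheel any better than the version string allows, and it says nothing about drivers or GPUs, but it would already answer the "which Python was this?" question.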
This perhaps also says something about the CS (versus software eng) background that most people engaging in DL publishing have.
Are those things enjoyable? Or is hacking and playing with ideas enjoyable?
Huge portions of PhD students spent time as software engineers prior to starting their programs. It's not about know-how. It's about not being paid to engineer systems in addition to doing research.
Fewer than 1 in 100 labs have dedicated software engineers, and PhD students are paid $30K/yr. There's no way in hell most of them are going to spend their time doing dependency management or setting up CI/CD pipelines for that salary. If they wanted to spend their time doing software engineering, they can (and would) move to an industry SWE job at 10x the total comp.
1. similar open data exists, great, just publish a sample implementation
2. if not the first task is to generate such an open data set
This is a problem because cherry picking is essentially built into the framework.
If I was building a ranking algorithm and just kept picking a random seed to arbitrarily sort a list of numbers until it was correct, most people would consider that obviously cheating. However, if I did the same thing but stuck 3 dense matrices between the seed and the list to be ranked, it would be considered AI.
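To make the seed-picking "ranking algorithm" concrete, here is a toy sketch of it. All names are made up for illustration, and the four-item list is just an example input. It searches for a seed whose shuffle happens to produce sorted output, then shows how rarely a seed actually does that.

```python
import random


def shuffle_with_seed(items, seed):
    """Return a shuffled copy of items, fully determined by the seed."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


def find_lucky_seed(items, max_seeds=100_000):
    """Try seeds 0, 1, 2, ... until the shuffle happens to sort the list."""
    target = sorted(items)
    for seed in range(max_seeds):
        if shuffle_with_seed(items, seed) == target:
            return seed
    return None  # no seed in the searched range worked


def fraction_of_seeds_that_sort(items, n_seeds=10_000):
    """The honest number: how often a seed 'works' across the whole range."""
    target = sorted(items)
    hits = sum(shuffle_with_seed(items, s) == target for s in range(n_seeds))
    return hits / n_seeds


if __name__ == "__main__":
    data = [3, 1, 4, 2]
    seed = find_lucky_seed(data)
    print("cherry-picked seed:", seed, "->", shuffle_with_seed(data, seed))
    print("fraction of seeds that sort:", fraction_of_seeds_that_sort(data))
```

Reporting only the lucky seed looks like a perfect sorter. Reporting the fraction across seeds (roughly 1 in 24 for four distinct items) shows what is really going on, which is the same reason to report results over several seeds instead of the best one.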
With that being said, this is not an excuse for refusing to share paper code or making sure the experiments are reproducible.
In these situations, I have suggested releasing anonymous implementations after the paper is accepted just to get the code out there. I am not certain this is the right thing to do!
Give me a one liner huggingface or torchhub, or a working google colab. Or I'm probably nexting your work and trying the second and third best model instead.
Imagine isolated code dumps without the shared history leading to nightmare merge conflicts…
I think this makes it really hard for anyone not steeped in experience to parse through the outputs of the spraying firehose and organize their thinking rigorously, thereby fragilizing the field’s intellectual output in a vicious cycle.
I don't think this is what a rejection means. Papers are accepted and rejected from the challenge depending on whether they do a good and thorough job of attempting to replicate the original work, not depending on whether or not they succeed.
Even WITH code supplied by the authors, this was always a struggle. It'd usually take about a week or so just to get their github project out of dependency hell and actually run at all.
And if it needed to be reproduced in another framework, I'd really really want some kind of demo code just to clarify what exactly the authors were trying to describe. Especially if their descriptions had holes or discrepancies that only became clear when trying to fit the pieces together.
I remember trying to reproduce a couple of object tracking papers from the same authors, one with an overly complex and poorly defined feature set, the other with a glaring mistake/omission that forced my team to redesign the model because they described using a certain layer type in a way that made no sense.
There were a few good exceptions that provided nice code, but difficult reproduction seemed to be the norm.
I'm not writing this to defend deep learning. Reproducibility is an incentives problem across ALL of science. We value novelty and prestigious publications over everything. Nobody wants to fund "boring" research that reproduces existing results. To fix academic science, we need to reward reproducible research and fund groups such that they're capable of performing it.