https://www.nature.com/articles/s41586-021-03819-2_reference...
One of the things I say about CASP has to be updated. It used to be "2 years after Baker wins CASP, the other advanced teams have duplicated his methods and accuracy, and 4 years after, everything Baker did is now open source and trivially reproducible"
Now it's Baker catching up to DeepMind, and it took about a year.
More info here and here:
https://www.bakerlab.org/index.php/2021/07/15/accurate-prote...
https://techcrunch.com/2021/07/15/researchers-match-deepmind...
I guess it's a little less accurate, but the quick compute time makes just as much difference. E.g. research students can afford multiple less costly mistakes before achieving what they want with the software.
For example, multi-complex proteins are not well predicted yet and these are really important in many biological processes and drug design:
https://occamstypewriter.org/scurry/2020/12/02/no-deepmind-h...
A disturbing thing is that the architecture is much less novel than I originally thought it would be, so perhaps one of the major difficulties was having the resources to try different things on a massive set of multiple alignments. This is something an industrial lab like DeepMind excels at, whereas universities tend to suck at anything that requires a directed effort of more than a handful of people.
A similar concern has sparked some worries about "AI overhang" https://www.lesswrong.com/posts/75dnjiD8kv2khe9eQ/measuring-...
Most of the compute in ML research seems to be going into architecture search. Once the architecture is found, training and net finetuning/transfer learning is comparatively cheap, and then inference is cheaper still. This implies we could see 10-100x gains in AI algorithms using today's hardware, or sudden surprising appearance of AI dominance in an unexpected field. (Object grasping in unstructured environments? Art synthesis?) A task could go from totally impossible to trivial in a year. In retrospect, the EfficientNet scaling graph should have alarmed more people than it did: https://learnopencv.com/wp-content/uploads/2019/06/Efficient...
Waymo has been puttering along for years, not announcing much of interest. This may have caused some complacency about self-driving cars, which is a mistake. Algorithms only get better, while humans stay the same. Once Waymo can replace some human drivers some of the time, things will start changing very quickly.
But that happened 1y+ ago [1][2] without much changing since?
[1] https://www.theverge.com/2019/12/9/21000085/waymo-fully-driv...
[2] https://blog.waymo.com/2020/10/waymo-is-opening-its-fully-dr...
No it's not. Only Google spends significant time with automatic architecture search, and many people think this is really to try to sell cloud capacity.
> Once the architecture is found, training and net finetuning/transfer learning is comparatively cheap
Training isn't cheap for significant problems.
Getting the data is very expensive, and compute is a significant expense for large datasets.
> This implies we could see 10-100x gains in AI algorithms using today's hardware
Actually, most of the time we see 10-100% (percent! not times) gains from architecture improvements, whether they be manual or automatic.
But that is very significant, because a 10% improvement can suddenly make something useful that wasn't before.
Yeah, the HN commentary on Alphafold has a high heat-to-light ratio. I'm eager to read the paper because the previous description of the method sounded remarkably similar to methods that have been around for ages, plus a few twists.
The devil is going to be in the details on this one.
Sorry for the ignorance but what does this mean?
I even predicted DeepMind's CASP 14 network would be transformer-based back in 2018, but I couldn't have told you the details of that transformer, just that it was a no-brainer to move from fixed width convolutions to arbitrary width attention sums because sequence motifs and long-range interactions are of arbitrary width in the sequence.
All that seems to have changed with AlphaFold 2 because unlike GPT-XXX, this isn't a parlor trick with memorized text. This is actually useful and the FOSSing of the network will spawn all sorts of new applications of the approach.
So now I wonder what will replace Transformers because nothing lasts forever and there are a lot of smart people trying all sorts of new ideas.
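For anyone wondering what the move from fixed-width convolutions to attention actually buys you: here's a minimal sketch (my own toy code, not DeepMind's; the names and shapes are invented for illustration) of single-head scaled dot-product attention, where every sequence position can weight every other position regardless of distance, instead of seeing only a fixed-width window:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention: the (L x L) score matrix means the
    # "receptive field" is the whole sequence, not a fixed window
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
L, d = 8, 4                 # toy sequence length and embedding size
x = rng.normal(size=(L, d))
out = attention(x, x, x)    # self-attention: Q = K = V = x
print(out.shape)            # (8, 4): one updated vector per position
```

That all-pairs score matrix is exactly what lets long-range residue interactions of arbitrary width show up directly, at the cost of O(L^2) compute.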
Something along these lines was speculated by Fabian Fuchs [0] soon after the original CASP competition. Basically, it's a huge win for the geometric deep learning people, and indicates an exciting direction for mainstream academia to move in.
I may be cynical about general expertise, as a support person, but large datasets have long been stock in trade of areas I'm more or less familiar with, whether "large" is TBs or PBs like CERN experiments. (When I were a lad, it was what you could push past the tape interface in a few days -- data big in cubic feet...)
The underlying sequence datasets include PDB structures and sequences, and how those map to large collections of sequences with no known structure (no surprise). Each of those datasets represents decades of work by thousands of scientists, along with programmers and admins who kept the databases running for decades with very little grant money (funding long-term databases is something NIH hated to do until recently).
> This was tested on Google Cloud with a machine using the nvidia-gpu-cloud-image with 12 vCPUs, 85 GB of RAM, a 100 GB boot disk, the databases on an additional 3 TB disk, and an A100 GPU.
This is amazingly detailed for a researcher who wants to follow in their tracks, and it's also Apache licensed, which is one road-bump out of the way for a commercial enterprise, like an actual drug manufacturer who wants to burn some money trying this out.
edit: said the last part too fast; the repo notes that "the AlphaFold parameters are made available for non-commercial use only under the terms of the CC BY-NC 4.0 license"
It's quite unclear what value this will have to pharma; personally I doubt this has any direct applications (and I'm one of the few people in the world that can say that with deep authority).
Personally, I've found over decades that academic papers like that are far less useful to me than a GitHub project and downloadable data that I can inspect, run and modify on my own. Other folks I know could read that paper and write the code in a day; I always wish I could do that.
My experience working with code written by researchers is that it frequently contains a large number of bugs, which brings the whole project into question. I've also found that encouraging them to write tests greatly improves the situation. Additionally, when they get the hang of testing they often come to enjoy it, because it gives them a way to work on the code without running the entire pipeline (which is a very slow feedback loop). It also gives them confidence that a change hasn't led to a subtle bug somewhere.
Again, I'm not criticising. I am aware that there are many ways to produce high quality software and Google/DeepMind have a good reputation for their standards around code review, testing etc. I am, however, interested to understand how the team that wrote this think about and ensure accuracy.
In general, I hope that testing and code review become a central part of the peer review process for this kind of work. Without it, I don't think we can trust results. We wouldn't accept mathematical proofs that contained errors, so why would we accept programs that are full of bugs?
edit: grammar
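To make the testing point concrete, this is the kind of tiny, fast unit test I mean (the function and inputs are invented for illustration, not from AlphaFold's codebase):

```python
def clean_sequence(seq: str) -> str:
    """Uppercase a protein sequence and drop whitespace/gap characters."""
    return "".join(c for c in seq.upper() if c.isalpha())

def test_clean_sequence():
    # runs in microseconds -- no need to execute the whole pipeline
    # just to check this one preprocessing step
    assert clean_sequence("mk tv-\nlq") == "MKTVLQ"
    assert clean_sequence("") == ""

test_clean_sequence()
```

A pile of tests like this is what gives you the fast feedback loop, and it's what code review can actually check against.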
Does CC BY-NC actually do this? As far as I can tell it only really talks about sharing/reproducing, not using.
Or is the only thing prohibiting other commercial use the words "available for non-commercial use only"?
I suspect that the clause is there to prevent a startup launching on the basis of “see this trained model? Yeah, that’s literally our business model” though, which is a mildly amusing thought, wot wot.
So basically, a few tens of thousands, sure. A few million, big G might have a problem.
Still, the smart move would be to launch the business anyway, and gamble that you can work out a licensing deal.
Alternatively, you could manually change the network model, add a few hidden layers, etc., modifying the parameters in step, and end up with a new model and new parameters. Some training to vary the parameters, and it's now a new work.
1) is easy. 2) might not be - there can be a lot of things in a blood sample, and finding only the interesting (bad) things might not be simple. The sequencing part is pretty much solved. 3) would take a bit of work, but I think it's possible now. 4) we're getting there. 5) might have a fair amount in common with 3), but it probably takes some additional work. 6) is... probably non-trivial.
That's just one research agenda. There are others. You may have to move to related work, but I doubt you're going to be out of a job in this lifetime.
Basically, sequence everything that's in your blood and look for what doesn't match your genome === infection. The problem is this is orders of magnitude more compute-intensive than whole genome sequencing. Basically, increased demand for sequencing far outstrips available compute!
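A toy sketch of the "subtract the host genome" idea (a hypothetical k-mer filter, nothing like a production metagenomics pipeline), which also shows why it's compute-hungry: every read has to be checked against an index of the entire genome:

```python
def kmers(seq, k=4):
    # all length-k substrings of a sequence
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def foreign_reads(reads, genome, k=4):
    # index every k-mer of the "host" genome once...
    host = kmers(genome, k)
    # ...then flag reads containing k-mers the host genome lacks
    return [r for r in reads if kmers(r, k) - host]

genome = "ACGTACGTTAGC"
reads = ["ACGTACGT",   # every 4-mer matches the host genome
         "GGGGCCCC"]   # no 4-mer in common -> candidate infection
print(foreign_reads(reads, genome))  # ['GGGGCCCC']
```

Scale that index from a 12-base toy genome to 3 billion bases and billions of reads per sample, and the compute problem is obvious.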
DNA sequencing is still slow and very expensive. On the scales you're talking about it's just not worth it.
I think I agree at a high-level that there is a huge reservoir of demand for this technology. But it's also possible that solving protein folding and similar research will simply cease to be a bottleneck for that demand, and people will be out of jobs.
These correlations do hold for technical fields, but logically there should be a point beyond which productivity gains outpace demand growth, or demand could even stop growing. One should either retool to solve a newer problem before this point is reached, or hope that the point is not reached in the span of their career.
Oil rig builders for example - manufacturing has been increasingly automated, but the demand for oil rig building has grown consistently. But they should probably look into solving other problems given that demand is shifting.
Could you give an overview of how people can leverage this (or how you might?).
From reading around about it, it sounds like there's often a need to find a certain type of molecule to activate/inhibit another based on shape and the ability to programmatically solve for this makes the searching way easier.
Is this too oversimplified/wrong? How will this be used in practice.
[Edit]: Thanks for the answers!
It doesn't look like the models produced by this would immediately simplify the challenging problem of finding, approving, and marketing successful pharmaceuticals (i.e., it doesn't eliminate any real bottleneck).
There was a long-term dream of structure-based drug discovery based on docking, but IMO, it has never really proved itself (most of the examples of success are cherry picked from a much larger pile of massive failures).
I was thinking of going into that field. Can you expand a bit on why you left?
The holy grail, IMO, though, is being able to design de novo protein sequences (to make "biologics", aka engineered protein drugs) that can a) target (bind/block/enhance) or do (chemical reactions) what you want and only that, b) are easily synthesizable by bacteria/yeast (cheap to make), and c) are stable (easy to transport/store).
Short answer: nobody knows. Traditionally, protein folding is a solution in search of a problem, but that's largely because the predictions were...unusably bad. This was always more of a super-difficult validation problem for the force fields and simulation methods, which could then be used for other problems of greater value (such as rational protein design, or simulation of the motion of proteins with known structures).
These predictions are better, but still pretty far from the level of precision that you'd want for any kind of rational drug design, where the exact locations of protein side-chains (for example) matter a lot. You'll note that AlphaFold returns structures that are "relaxed" using one of the oldest simulation systems for proteins: AMBER. So it's not exactly a clean-room solution to the problem, and you can't assume that the details (which matter to drug design) are going to be any better than for the older methods.
But that said, if you have a method that can reliably give you a blurry view of the overall shape of a protein, even that could be useful for things like target discovery or inference of biological networks. But this is still a lot closer to pure research than "revolutionizing drug discovery", as is frequently batted around on reddit, HN and the press.
There are some examples of this issue in the AlphaFold blog: some protein loops that they thought were mispredicted turned out to be part of an energy degeneracy, so the natural state fluctuated pretty wildly. If you can't simulate this properly, it matters less how accurate the incoming structure is (to a certain degree, of course).
Being able to predict what it would look like would be a huge deal because then you can go about intelligently designing drugs for it.
Though I think the major impacts will be two-fold:
(1) The field of structural biology is going to see a change, with much more data available. Some structures of difficult-to-crystallize proteins will be solved, which may lead to much greater biological understanding. We may enter a time where, once you have a primary sequence, you also have a likely 3d structure, which will probably change the daily work of quite a few biologists a bit.
(2) Industrial protein design. A tool such as this can potentially have great utility in optimizing proteins as chemical catalysts for various processes in different industries. This includes expanding the conditions under which a protein is active, and also making its conformation more stable and so the protein more long-lived in solution.
That was a few billion dollars right there and almost all the work was done by hand by lab scientists.
Protein folding is a physical/biological phenomenon. AFAIK we don't currently have a proper exact mathematical formulation of the problem that would let one determine its complexity.
You may be referring to this paper [1]. It only claims that one particular optimization problem, believed to give a solution to protein folding problems, is NP-hard. So, even if a suitable exact formulation exists, it is not yet proven that protein folding is hard, although it for sure seems plausible.
By the way, it is perfectly possible today to solve some very large-scale NP-hard problems (think millions of variables and constraints) in reasonable amounts of time (think minutes or hours). Examples are knapsack problems, SAT problems [2], the Traveling Salesman Problem [3] or more generally Mixed Integer Programming [4].
[1] "Complexity of protein folding", 1993, by Aviezri S. Fraenkel
[2] http://www.satcompetition.org
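To illustrate the point that NP-hard doesn't mean hopeless in practice: 0/1 knapsack is NP-hard, yet the textbook dynamic program (a sketch with toy values; it's NP-hard only in the bit-length of the capacity) solves instances with thousands of items in well under a second:

```python
def knapsack(values, weights, capacity):
    # classic 0/1 knapsack DP: O(n * capacity) time and O(capacity)
    # space; iterate capacities downward so each item is used at most once
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

print(knapsack([60, 100, 120], [10, 20, 30], 50))  # 220
```

Real industrial solvers (SAT, MIP, TSP codes) layer far cleverer pruning and heuristics on top of ideas like this, which is how the million-variable instances mentioned above become tractable.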
> it is a problem for which the correctness of each solution can be verified quickly [0]
[0] https://cs.stackexchange.com/questions/128493/is-protein-fol...
Besides, this is real life - if predictions and real life match, that's great. If they don't, well you know you went wrong somewhere.
Joke, which I think is from Sean Eddy (hammer).
A bioinformatician approaches a computer scientist for help with a hard problem. The CS agrees to help. A year later the CS comes back very excitedly: "Your problem is not hard, it is NP-hard!" The bioinformatician nods, says "I still have to solve it", and carries on finding ever faster and better approximations ;)
Also, the problem space is both bounded (you don't have infinite-length proteins) and f'd up in reality, e.g. protein hijacking and re-conformation in the face of an infectious agent.
^ That sounds like word-salad BS but I think there's some truth to it. I know protein folding has been postulated to be useful in terms of understanding basic biology, understanding disease pathology, and drug prediction. While a wide range of approximations are functionally useless, perhaps the Alphafold approach or some improved version of it surpasses the functionally useful threshold.
At least I hope so
If AlphaFold is substantially more accurate at solving proteins, it can mean that drug discovery is faster, assays are faster, etc. etc.
The "unexpected problems" would be caught in the assay stage.
Proteins consist of chains of amino acids which spontaneously fold up to form a structure. Understanding how the amino acid chain determines the protein structure is highly challenging, and this is called the "protein folding problem".
People use mathematical models to predict how proteins fold in nature. Many such mathematical models are stated in terms such as "proteins fold into a configuration that minimizes a certain energy function". Even the simplest such models [1] give rise to NP-hard decision problems, which are also known (somewhat confusingly) as "protein folding problems". To make this a bit less confusing, I will call the mathematical decision problems PFPs.
Like all mathematical models, our protein folding models don't correspond exactly to reality. Even if you are somehow able to determine the exact mathematical solution to a mathematical PFP, that _still_ doesn't guarantee that the real protein that you were trying to model behaves like the mathematical solution would indicate. E.g. the protein may fold in such a way that it gets stuck in a local optimum of the energy function you were using.
How do we detect this? We make inferences about how the protein should behave, given the mathematical solution to the Protein Folding Problem, and then we perform experiments, and find out (empirically) that the protein behaves in a manner that is inconsistent with the inferences drawn from the mathematical model. Scientists _do_ do this. And they would have to do it even if they had a fast, exact way to solve NP-complete problems, because the NP-complete problems are still just part of a mathematical model, and need not correspond to reality in any way.
The success of AlphaFold is not measured by how well it solves (or approximates) mathematical PFPs. The success of AlphaFold is measured by making successful predictions about how certain proteins will fold. And this is exactly how it was tested [2]: they threw it at a bunch of problems for which scientists have empirically determined how certain amino acid chains fold, but didn't release the results. And then they compared the solutions predicted by AlphaFold, and found that most of the predictions were consistent with what they knew to be the case.*
[1] https://en.wikipedia.org/wiki/Lattice_protein
[2] https://predictioncenter.org/casp14/index.cgi
* That's an understatement. The solutions were really very good, much better than those produced by any other submission to CASP14.
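As a toy illustration of the lattice models in [1]: in the 2D HP model, a fold's quality is the number of contacts between hydrophobic (H) residues that aren't adjacent in the chain, and even brute force over self-avoiding walks (a sketch with my own naming, not any published implementation) blows up exponentially with chain length, which is where the NP-hardness bites:

```python
from itertools import product

MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def best_fold(seq):
    """Brute-force the 2D HP lattice model: maximize H-H contacts
    between residues not adjacent in the chain."""
    n, best = len(seq), 0
    # enumerate all 4^(n-1) direction sequences for the chain
    for dirs in product(MOVES, repeat=n - 1):
        pos, (x, y) = [(0, 0)], (0, 0)
        for dx, dy in dirs:
            x, y = x + dx, y + dy
            if (x, y) in pos:          # keep only self-avoiding walks
                break
            pos.append((x, y))
        else:
            # count lattice-adjacent H-H pairs that are chain-distant
            contacts = sum(
                1
                for i in range(n) for j in range(i + 2, n)
                if seq[i] == seq[j] == "H"
                and abs(pos[i][0] - pos[j][0]) + abs(pos[i][1] - pos[j][1]) == 1
            )
            best = max(best, contacts)
    return best

print(best_fold("HPPH"))  # 1: the chain bends into a square, H meets H
```

Fine for a 4-residue toy; at realistic protein lengths the 4^(n-1) walks are astronomically out of reach, which is exactly why people turned to heuristics and, now, learned predictors.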
If you're a a medical startup, having an off-the-shelf prediction model you can just start using for all your protein folding needs is a very different proposition from having to train one yourself from scratch.
That said, hopefully other researchers and institutions will take Google's research and produce an equivalently powerful model but with a more commercially-friendly open-source license. From some comments in this thread, it sounds like that's already happening, in fact.
> The simplest way to run AlphaFold is using the provided Docker script. This was tested on Google Cloud with a machine using the nvidia-gpu-cloud-image with 12 vCPUs, 85 GB of RAM, a 100 GB boot disk, the databases on an additional 3 TB disk, and an A100 GPU.
On the other hand, outbound network traffic from a university is "free". So the benefit is absolutely minimal from a hosting perspective.
It was tried (https://journals.plos.org/plosone/article?id=10.1371/journal...) but it has gone the way of the dodo for the above reasons.
Source: former folding@home researcher.
For comparative and evolutionary analysis structure is far more conserved than sequence. Especially in things like viruses or anything with a high rate of reproduction like bacteria. Just knowing the general fold or overall structure is enough to do structural alignment and tell if two genes are related on that basis, even if their genomic sequence is completely dissimilar. Large groups of researchers rely on sequence homology built from sequences of known structure.
But AlphaFold works well in new sequence space with far more accuracy than is needed. If we had an AlphaFold prediction for every known sequence, suddenly the evolutionary relationships between all genes and even all species would be far clearer. This on its own unlocks a new foundation to reason about function and molecular interaction with a holistic systems view, without gaps in what we can know with some reasonable assurance.
For an analogy, think of the difference between having books in different languages describing objects. You know what some of the books in English might say, but you don't even know if the book in Spanish is talking about the same things. AlphaFold is like an AI that transforms all the books into picture books, and now we can use image similarity or have one person look at all the pictures.
I think you mean amino acid homology? (due to synonymous mutations)
I looked it up and you're right, protein structure/motifs are much more highly conserved than amino acid sequence https://humgenomics.biomedcentral.com/articles/10.1186/1479-...