AlphaFold Protein Structure Database

(alphafold.ebi.ac.uk)

315 points · matejmecka · 4y ago · 58 comments

44 comments · 15 top-level

nharada · 4y ago · 6 in thread

From the abstract[1]:

> After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally-determined structure. Here we dramatically expand structural coverage by applying the state-of-the-art machine learning method, AlphaFold2, at scale to almost the entire human proteome (98.5% of human proteins).

[1] https://www.nature.com/articles/s41586-021-03828-1

vmception · 4y ago

Basically they are saying that decades of distributed protein folding was useless and everyone would have had more utility mining cryptocurrency if it existed several years earlier

But at least it inspired someone to make and release this

dekhn · 4y ago

You're conflating two different disciplines: distributed protein folding studies the biophysical process of proteins folding over time, while protein structure prediction makes a single static prediction of what is believed to be the final structure adopted by the protein in the folding process.

I think many people believe that given infinite computer time the protein folding simulations would produce the same output as the static prediction (modulo a number of complex details) but use far, far more computer time to get there.

The fundamental observation from the DM AF2 paper that I've been able to glean (which I kind of sort of already believed) is that careful multiple sequence alignments of 30-100 evolutionarily related proteins are enough to produce coarse distance constraints that can be used to guide a structure prediction to a good answer quickly. And that depended on new ML technology that didn't exist before.
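To make the alignment-to-constraints idea concrete, here is a minimal sketch that scores how strongly two alignment columns co-vary using plain mutual information. This is a classic coevolution signal, not AlphaFold2's actual method; the function name and the four-sequence toy alignment are made up for illustration.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        # Compare the observed joint frequency with what independence would give.
        mi += p_ab * log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Hypothetical alignment: columns 0 and 1 always change together,
# column 2 never changes.
toy_msa = ["AKG", "AKG", "DEG", "DEG"]
print(column_mutual_information(toy_msa, 0, 1))
print(column_mutual_information(toy_msa, 0, 2))
```

Column pairs that score high are candidates for residues in contact. Real pipelines use far deeper alignments and correct for phylogenetic bias, which this sketch ignores.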

cing · 4y ago

Just in case you're not joking, it's worth noting that the majority of distributed molecular simulation (past and present) is spent studying "folded proteins" to discover structures of proteins that are often hidden from methods like AlphaFold (currently). For example, https://www.nature.com/articles/s41557-021-00707-0

ramraj07 · 4y ago

I don't know if you know, but doctors spent 1,300 YEARS using the wrong anatomy book. A few years and compute time isn't the end of the world. I'm sure oracle's DB2 test suite has burned more carbon than protein folding labs have.

Jabbles · 4y ago

A third way in which you are wrong is that AlphaFold derives a lot of its power by referring to previously-solved protein structures, or parts of them. It doesn't fold the proteins from scratch in an "alpha-zero" way.

dmitryminkovsky · 4y ago

> experimentally-determined structure

refers to structures determined by physical examination, such as crystallography, not to the predictive computational attempts that preceded AlphaFold, which were less accurate than AlphaFold.

ramraj07 · 4y ago · 5 in thread

As an ex-biomedical researcher I was trying to think what protein I should enter and see, and couldn't come up with a protein that I know of that didn't have a structure already (at least a crude one). That is, we roughly know what most known important proteins look like. This is an amazing tool, and will be indispensable in labs (I'd expect any lab to use this site at least once a year?). But it's not as transformative as some might think.

amelius · 4y ago

https://www.embl.org/news/science/alphafold-potential-impact...

> A discussion of the applications that AlphaFold DB may enable and the possible impact of the resource on science and society

pelorat · 4y ago

Do we really know the structure of every protein that assembles into a human cell?

_RPL5_ · 4y ago

From their abstract:

---

After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally-determined structure. Here we dramatically expand structural coverage by applying the state-of-the-art machine learning method, AlphaFold2, at scale to almost the entire human proteome (98.5% of human proteins). The resulting dataset covers 58% of residues with a confident prediction, of which a subset (36% of all residues) have very high confidence.

https://www.nature.com/articles/s41586-021-03828-1

---

The metric they use (residues) is a bit unusual (I would have used number of proteins instead), but I assume they wanted to account for ambiguity (such as proteins with partial structures).
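A quick sketch of why the choice of metric matters, assuming a made-up three-protein "proteome" (the function name and numbers are hypothetical, not from the paper). The same data gives very different coverage depending on whether you count residues, proteins with any structure, or proteins with a complete structure.

```python
def coverage_summary(proteins):
    """proteins: list of (length, residues_with_structure) tuples."""
    total_residues = sum(length for length, _ in proteins)
    covered_residues = sum(covered for _, covered in proteins)
    any_structure = sum(1 for _, covered in proteins if covered > 0)
    full_structure = sum(1 for length, covered in proteins if covered == length)
    return {
        "residue_fraction": covered_residues / total_residues,
        "proteins_with_any_structure": any_structure / len(proteins),
        "proteins_fully_covered": full_structure / len(proteins),
    }


# Hypothetical data: one fully solved protein, one with a single solved
# domain, and one large protein with no structure at all.
toy_proteome = [(100, 100), (200, 50), (700, 0)]
summary = coverage_summary(toy_proteome)
print(summary)
```

Counting residues sidesteps the question of how much of a protein must be solved before it counts as "covered", which is probably why the abstract reports it that way.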

cing · 4y ago

One of the reasons we don't have them all is that individual genes can encode for multiple protein isoforms through alternative splicing. AlphaFold was only run on one. Otherwise, there's lots of important biochemical/biophysical processes that impact structure, as cells are only about 50% protein by weight.

seventytwo · 4y ago

Definitely not.

ricksunny · 4y ago · 5 in thread

I’m sorry but why don’t they just release the ability for a user to enter a known real-world sequence’s accession number from Genbank / GISAID, and generate the protein structure from that? Why do they have to abstract the user from the process by only exposing a completed database of the protein structures the AlphaFold researchers decided would be worth producing?

tazjin · 4y ago

You can use the open-source code, and we also have a Colab notebook for that: https://bit.ly/alphafoldcolab

More info: https://deepmind.com/blog/article/putting-the-power-of-alpha...

ricksunny · 4y ago

Thanks for that - I can see why my comment was downvoted now, as the posted article's FAQ lists these links for those who would like to study their favorite sequenced-but-unmodeled protein. I'm glad AlphaFold is as open source as it is, and I recognize that it didn't have to be so.

I think I was primed for a knee-jerk reaction because when AlphaFold's results were announced back in Dec. 2020, with expressions of what a boon it would be for researchers around the globe, I anticipated there would be a timeline announced for exposing a tool or for the open-sourcing. (The GitHub repo has only just been released about 6 days ago ...)

With all the work on SARS-CoV-2's 'interactome', as well as human proteins & enzymes involved in the pharmacology of antiviral drugs under development / repurposing, it's easy to imagine that drug developers would have liked to exercise AlphaFold as soon as it was announced. (I myself have wanted a structure for the human enzyme OATP1A2 that wasn't available on the PDB for such a drug pharmacology study - quite glad it is at hand now. :) )

Anyway I'm sure good arguments will be made about the need to really 'get it right' before releasing, or internal deliberations on how much to open up vs charging for it.

But 7 months lead time during a pandemic is a long time...

In all cases thanks again for this innovation's availability now. :)

sveme · 4y ago

I'd guess the ad-hoc simulation of the structure is computationally quite expensive and takes a while, though that's just a guess and I haven't read the original paper yet.

ricksunny · 4y ago

In fact a cost of $1-$4 for the preferred implementation:

https://news.ycombinator.com/item?id=27894060

The Colab provides a slightly-less-accurate version that operates in the cloud. For the real McCoy it seems one must set up one’s own environment and leverage the git repo.

sherjilozair · 4y ago

DeepMind has already released the open source code and model parameters. The database makes it easier to access the predictions.

stephanheijl · 4y ago · 4 in thread

I'm impressed and grateful that DeepMind released this resource, this will save a lot of compute from labs trying to replicate an entire exome for themselves. While some structures look great, there are still some misses here. Important structures like BRCA1 (a well-studied breast cancer associated protein) are just structures for the BRCT and RING domains surrounded by a low-confidence string of amino acids, likely shaped to be globular: https://alphafold.ebi.ac.uk/entry/P38398

Maybe I was wrong for expecting the impossible here, but I was excited to see this specific structure and it appears that there is still work to do. Nevertheless, kudos to Deepmind on their amazing achievement and contributions to the field!

cing · 4y ago

Everything between the BRCT and RING domains of BRCA1 is an intrinsically unstructured region which DeepMind correctly predicts, https://pubmed.ncbi.nlm.nih.gov/15571721/

Another famous one would be R-domain of CFTR, which was not resolved in experimental structure determination, and AlphaFold models correctly show disorder there. Nothing to be done in those cases except perform molecular simulation or other experiments to assess dynamic ensembles, https://alphafold.ebi.ac.uk/entry/P13569

maga · 4y ago

A curious non-biologist here: how valuable are these low confidence predictions for biologists? In other words, is it a hard-to-predict but easy-to-check situation as with, say, prime numbers in mathematics?

toufka · 4y ago

The medium-confidence predictions are great for grounding or sourcing intuition. If you're trying to divide up a protein for an experiment and you have to choose where to divvy it up - you'd like to use even a bad prediction to help weight an otherwise completely random approach. AND there are great methods to help with this, but they're often custom, time-consuming, and out-of-field for most. So being able to very quickly spot-check using a uniform state-of-the-art method, for any arbitrary protein, makes it actually pretty useful for certain kinds of pre-experimental guidance.

devindotcom · 4y ago

Some are valuable for the reasons the other person responding noted, but some of the low confidence predictions may also be high confidence predictions of a disordered class of protein that doesn't have a standard rest state. So it's useful work one way or the other.
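For anyone wanting to look at this in a downloaded model: AlphaFold DB structure files store the per-residue confidence score (pLDDT, 0-100) in the B-factor column, and scores below 50 fall in the "very low" band that often coincides with disorder. A minimal sketch, assuming single-chain PDB-format text; the function names and the fake six-residue input are made up for illustration.

```python
def plddt_per_residue(pdb_text):
    """Map residue number -> pLDDT, read from the B-factor field of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name 13-16, residue number 23-26,
        # B-factor 61-66 (0-based slices below).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence_runs(scores, threshold=50.0):
    """Return (first, last) residue numbers of consecutive runs below threshold."""
    runs = []
    start = prev = None
    for resnum in sorted(scores):
        if scores[resnum] < threshold:
            if start is None or resnum != prev + 1:
                if start is not None:
                    runs.append((start, prev))
                start = resnum
            prev = resnum
        elif start is not None:
            runs.append((start, prev))
            start = None
    if start is not None:
        runs.append((start, prev))
    return runs


def fake_ca_line(resnum, plddt):
    """Build a made-up CA ATOM record with correct column alignment."""
    return (f"ATOM  {resnum:>5}  CA  ALA A{resnum:>4}    "
            f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{plddt:>6.2f}")


# Hypothetical model: confident ends, a low-confidence stretch in the middle.
fake_pdb = "\n".join(
    fake_ca_line(i + 1, s) for i, s in enumerate([95, 92, 40, 35, 45, 88])
)
runs = low_confidence_runs(plddt_per_residue(fake_pdb))
print(runs)
```

A long low-pLDDT run is only a hint: it can mean genuine disorder, as with the BRCA1 and CFTR examples above, or simply a region the model could not predict.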

Ovah · 4y ago · 4 in thread

Interesting that they're porting it to other organisms. Different organisms have variations in ribosomes, post translational modifications and even tRNA repertoire. So it's not a guarantee that two identical DNA sequences will give identical proteins in two different organisms.

pelorat · 4y ago

Shouldn't matter? Protein folding is based on the laws of physics after all. If the same DNA sequence yields a protein that folds differently in different organisms, then an external factor is missing.

Ovah · 4y ago

While the laws of physics remain the same, the folding machinery between species varies to some degree. Protein folding is determined by the unique environment/machinery of a cell. A concrete example is disulphide bonds (S-S, e.g. cysteine-cysteine) that require a certain pH to form. The primary pathways of disulphide-bond formation are localized in the endoplasmic reticulum (ER) of eukaryotic cells and the periplasmic space of prokaryotic cells. So two completely different mechanisms end up with the same bond (protein structure) depending on the organism.

ramraj07 · 4y ago

??? Unless you jump from eukaryotes to archaea these are not real concerns. Most PTM markers are very conserved.

Ovah · 4y ago

I'd say the jump from eukaryotes to prokaryotes is a realistic scenario in recombinant DNA technology.

I have some experience with recombinant yeast and PTMs. The degree of glycosylation actually varies a lot depending on the strain used and has a huge effect on protein activity. And of course these PTMs affect the crystal structure.

pelorat · 4y ago · 2 in thread

There's a lot of news about AlphaFold lately but what about RoseTTAFold? Wasn't it more accurate and much faster?

_2d30 · 4y ago

I believe slightly less accurate but significantly faster is where it stands.

pelorat · 4y ago

Running a sequence against both seems like a good idea. If they agree the certainty will go way up.
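One simple way to check whether two predictions agree, sketched under stated assumptions: compare the pairwise distances between matching atoms (a distance RMSD), which needs no superposition because it is unaffected by rotating or translating either model. The function name and the four-atom toy coordinates are hypothetical, and this is one of several possible agreement measures, not something either tool ships.

```python
from itertools import combinations
from math import dist, sqrt


def distance_rmsd(coords_a, coords_b):
    """RMS difference between the two models' pairwise distance matrices."""
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("models need the same number of atoms (at least 2)")
    squared = [
        (dist(coords_a[i], coords_a[j]) - dist(coords_b[i], coords_b[j])) ** 2
        for i, j in combinations(range(len(coords_a)), 2)
    ]
    return sqrt(sum(squared) / len(squared))


# Hypothetical CA coordinates (in angstroms) for a four-residue fragment.
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 1.0)]
# Same shape, rotated 90 degrees about z and shifted: should agree perfectly.
model_b = [(-y + 5.0, x + 5.0, z + 5.0) for x, y, z in model_a]
# Same as model_a except the last atom has moved: should disagree.
model_c = model_a[:3] + [(0.0, 1.0, 3.0)]

print(distance_rmsd(model_a, model_b))
print(distance_rmsd(model_a, model_c))
```

Values near zero mean the two models have the same internal geometry. One caveat: a mirror image gives the same distances, so this cannot tell a structure from its reflection.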

moyix · 4y ago · 1 in thread

Anyone else getting a 403 Forbidden?

If so it might be better to link to the paper instead: https://www.nature.com/articles/s41586-021-03828-1

jkh1 · 4y ago

Works fine for me. Must have been a temporary glitch.

spacecity1971 · 4y ago · 1 in thread

Quick question, please excuse my ignorance, but is there a way to extrapolate sequence from structure? In other words, can we design proteins and calculate the sequence required to make it?

kmckiern · 4y ago

It's hard but people do it! This is the field of "protein engineering".

visarga · 4y ago · 1 in thread

Citation factory, that's what it is.

abcc8 · 4y ago

Resources as useful as this are bound to be. We do cite our sources after all.

jkh1 · 4y ago

Didn't see this post so posted it also. Also relevant: https://www.embl.org/news/science/alphafold-potential-impact...

sdbrown · 4y ago

This is a fabulous convenience! The reach of this ready-to-go data will be much larger (in some directions) than the model and CASP results themselves.

lumost · 4y ago

I used to do some RNA molecular dynamics simulations in college which were both computationally expensive and difficult to replicate. Having the ability to reasonably predict protein structure is an incredible scientific achievement - however I am curious if anyone here who is better informed has takes on the following.

1. How likely is it that AlphaFold learned to accurately predict protein structure only in the narrow domain of proteins that have been experimentally synthesized and whose structure has been measured? In other words, will AlphaFold's results generalize to proteins which cannot yet be synthesized in the laboratory?

2. If Alphafold's accuracy holds, what type of commercial applications does this open up?

_RPL5_ · 4y ago

This is awesome! When they announced CASP results a few months ago, I was wondering if AlphaFold will be accessible as an API, where you can submit a protein id or a sequence and get back a 3D structure. This database is basically that, except it's free & open to the public. Major props!
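The "submit a protein id" part can be sketched from the entry links posted in this thread (for example https://alphafold.ebi.ac.uk/entry/P38398), which are keyed by UniProt accession. A minimal sketch that validates an accession against UniProt's published accession pattern and builds the entry page URL; the function name is hypothetical, and nothing here calls the site or assumes any download API.

```python
import re

# UniProt's published accession-number pattern.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def alphafold_entry_url(accession):
    """Build the AlphaFold DB entry page URL for a UniProt accession."""
    accession = accession.strip().upper()
    if not UNIPROT_ACCESSION.fullmatch(accession):
        raise ValueError(f"not a UniProt accession: {accession!r}")
    return f"https://alphafold.ebi.ac.uk/entry/{accession}"


# BRCA1 and CFTR, the two entries linked elsewhere in this thread.
print(alphafold_entry_url("P38398"))
print(alphafold_entry_url("p13569"))
```

Note that a Genbank or GISAID accession would need mapping to a UniProt accession first, and a protein outside the precomputed set still needs the open-source code or the Colab.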

culopatin · 4y ago

I happen to be working on a database for folds as well. But RNA folds not protein folds. I’m not a bio guy but my gf is and if I understand correctly this is not the same. I hope they are different because it would suck to be me lol.

This is my first big boy project and I’m driving solo so it takes me a while to make any progress. But at least now I have this db and genbank to model after

dnautics · 4y ago

yikes, this doesn't even do some basic stuff like trim off pre-protein segments for secreted proteins... Without this, you could get some very incorrect structures.
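The trimming step itself is simple once the cleavage position is known. A minimal sketch, assuming the signal-peptide length comes from an existing annotation or a separate predictor; the function name and both sequences are made up for illustration.

```python
def mature_sequence(precursor, signal_peptide_length):
    """Drop the N-terminal signal peptide before structure prediction."""
    if not 0 <= signal_peptide_length < len(precursor):
        raise ValueError("signal peptide length must be shorter than the protein")
    return precursor[signal_peptide_length:]


# Hypothetical secreted protein: a hydrophobic leader followed by the body.
signal_peptide = "MKALLLVVAGCLALASA"
body = "DTPEQKWTRGS"
precursor = signal_peptide + body

trimmed = mature_sequence(precursor, len(signal_peptide))
print(trimmed)
```

The hard part is knowing where to cut, which this sketch takes as given.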
