Some context: been waiting for this to come out for a while! The main innovation is leveraging RoseTTAFold (a protein structure prediction neural net) to generate protein backbones by diffusion in 3D space! From those backbones, we can generate sequences that would fold into the designed structures via sequence design algorithms (check out ProteinMPNN and Rosetta FastDesign).
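To make the two-stage idea concrete, here's a toy sketch of the backbone-generation step: start from pure 3D noise and iteratively denoise toward C-alpha coordinates. The `toy_denoiser` below is a made-up stand-in for the actual learned network (RoseTTAFold in the paper); this is just the shape of a denoising loop, not their method.

```python
# Toy sketch of backbone generation by denoising diffusion.
# NOT the real RFdiffusion code: toy_denoiser is a placeholder
# for the learned RoseTTAFold-based denoiser.
import numpy as np

rng = np.random.default_rng(0)
n_residues, n_steps = 50, 100

def toy_denoiser(coords, t):
    # Placeholder for the network: nudge each residue toward its
    # chain neighbors (a crude "compact backbone" prior).
    smoothed = coords.copy()
    smoothed[1:-1] = (coords[:-2] + coords[2:]) / 2
    return smoothed

coords = rng.normal(size=(n_residues, 3)) * 10  # start from pure noise
for t in range(n_steps, 0, -1):
    pred = toy_denoiser(coords, t)
    noise_scale = t / n_steps                   # anneal the noise to zero
    coords = pred + rng.normal(size=coords.shape) * 0.1 * noise_scale

print(coords.shape)  # (50, 3): one C-alpha position per residue
```

The second stage (ProteinMPNN / FastDesign) would then take coordinates like these and search for a sequence predicted to fold into them.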
In terms of applications: this is super relevant for our ability to create strongly binding protein binders (e.g., timely creation of proteins that bind to virus spike proteins), and for designing enzymes from scratch!
Prior methods suffered from much lower success rates for generating “good” backbone structures. Extremely exciting!! If you want to learn more, check out the Baker group at UW!
Folding takes into account many variables, and a big chunk of current experimental structure determination is concerned with controlling/adjusting these variables.
So this dreaming-up gives you a potentially quicker route to what a folded protein might look like, but it won't guarantee that we actually know how to produce it in the real world.
Disclaimer: someone correct me if I’m wrong. I might be rusty on the latest developments, as I’ve left the field after my PhD.
The very largest plain transformer models trained on protein sequences (analogous to plain text) are about 15B parameters (I'm thinking of Meta AI's ESM-2 [1]). These can do for protein sequences what LLMs do for text: they can "fill in the blank" to design variations, generate new proteins that look like their training data, and tell you how likely it is that a given sequence exists.
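The "how likely is this sequence" part works roughly like pseudo-log-likelihood scoring with a masked language model: mask each position, ask the model for the probability of the true residue, and sum the log-probabilities. A toy illustration, with random logits standing in for a real model like ESM-2 (the function names here are made up):

```python
# Toy pseudo-log-likelihood scoring with a masked language model.
# fake_model_logits is a stand-in for a real forward pass (e.g. ESM-2).
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(1)

def fake_model_logits(masked_seq, pos):
    # Placeholder: a real model would return logits over the 20 residues
    # for the masked position, conditioned on the rest of the sequence.
    return rng.normal(size=len(AMINO_ACIDS))

def pseudo_log_likelihood(seq):
    total = 0.0
    for i, residue in enumerate(seq):
        logits = fake_model_logits(seq[:i] + "<mask>" + seq[i + 1:], i)
        log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
        total += log_probs[AMINO_ACIDS.index(residue)]
    return total

score = pseudo_log_likelihood("MKTAYIAKQR")
print(score)  # more negative = less plausible under the (fake) model
```

With a real model, scores like this correlate with how "natural" a sequence looks, which is what makes these models useful for ranking design candidates.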
Some cool variations of transformers have applications for protein design, like the now-famous SE(3)-equivariant transformer used in the structure prediction module of AlphaFold [2], which also appears in the research paper [3] accompanying TFA, as well as variations such as the message-passing model ProteinMPNN [4], which builds on a neighbor-graph-structured transformer [5].
1. https://github.com/facebookresearch/esm
2. https://github.com/deepmind/alphafold
3. https://www.biorxiv.org/content/10.1101/2022.12.09.519842v2
4. https://github.com/dauparas/ProteinMPNN
5. https://github.com/jingraham/neurips19-graph-protein-design
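A quick sketch of the graph those ProteinMPNN-style models operate on: each residue is a node connected to its k nearest spatial neighbors, and messages are passed along those edges. The coordinates and features below are random; this shows the data structure and one aggregation round, not the actual ProteinMPNN architecture.

```python
# Sketch of a k-nearest-neighbor residue graph plus one round of
# message passing. Toy data; not the real ProteinMPNN layers.
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 5
coords = rng.normal(size=(n, 3))       # stand-in residue coordinates

# Pairwise distances, then each residue's k nearest neighbors.
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)            # exclude self-edges
neighbors = np.argsort(d, axis=1)[:, :k]   # (n, k) neighbor indices

# One round of mean-aggregation message passing over node features.
features = rng.normal(size=(n, 8))
messages = features[neighbors].mean(axis=1)  # aggregate neighbor features
features = features + messages               # residual update

print(neighbors.shape, features.shape)  # (30, 5) (30, 8)
```

Operating on a sparse neighbor graph instead of full attention is part of why these models stay cheap even for large structures.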
Before AlphaFold changed this field, creating your own protein design was considered an insane task (not impossible; the Baker lab and others have done it a few times). But these tools (we now have multiple) allow you to create new proteins from scratch that can do exactly what you want (caveats galore). New enzymes that catalyze reactions never found in nature, for example.
Before this, all we could do was take proteins that already exist in nature and modify them. So you can imagine how new this world is.
This series of talks by Nazim Bouatta is exceptional; it helped me appreciate and make sense of these models. It's incredible how you can engineer neural nets to learn from far less data when you incorporate the right inductive biases: https://youtube.com/playlist?list=PL0NRmB0fnLJQPDZh-6utVnRpF...
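One of those inductive biases can be checked numerically: a function built from pairwise distances is invariant to rotating and translating its input, so a network built that way never has to learn that symmetry from data. This is a toy demonstration of the principle, not an actual SE(3)-equivariant transformer layer.

```python
# Numerically verify rotation/translation invariance of a
# distance-based feature (the symmetry baked into SE(3) models).
import numpy as np

rng = np.random.default_rng(3)
coords = rng.normal(size=(10, 3))

def distance_feature(x):
    # Any symmetric function of pairwise distances would do here.
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d.sum()

# Random orthogonal matrix via QR, plus a random translation.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
moved = coords @ q.T + rng.normal(size=3)

print(np.isclose(distance_feature(coords), distance_feature(moved)))  # True
```

Because the symmetry holds exactly by construction, the model spends its capacity on the actual physics rather than on relearning geometry, which is a big part of why these models get away with far less training data.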