Also interesting that AlphaMissense is omitted from Figure 2B; it substantially outperforms the ESM-based ESM1b in our hands. But I guess the idea is that this is a general-purpose DNA language model, whereas AlphaMissense is domain-specific for variant effect prediction?
For design tasks like in this paper, I think computational models have a big hill to climb in order to compete with physical high-throughput screening. Most of the time the goal is to get a small number of hits (<10) out of a pool of millions of candidates. At those levels, you need to work in the >99.9% precision regime to have any hope of finding significant hits after multiple-hypothesis correction. I don't think they showed anything near that accurate in the paper.
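To put rough numbers on that, here is a back-of-the-envelope sketch. Everything in it is an illustrative assumption rather than a figure from the paper: a pool of 1,000,000 candidates, 10 true hits, a model that never misses a true hit, and a reading of the "99.9%" as specificity (the true-negative rate). The function name `expected_precision` is made up for this example.

```python
def expected_precision(n_candidates, n_true_hits, sensitivity, specificity):
    """Expected fraction of model-flagged candidates that are real hits.

    All inputs are hypothetical; this is plain base-rate arithmetic,
    not a model of any particular screen.
    """
    n_negatives = n_candidates - n_true_hits
    true_pos = sensitivity * n_true_hits          # real hits the model flags
    false_pos = (1.0 - specificity) * n_negatives  # non-hits flagged anyway
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


if __name__ == "__main__":
    # Assumed pool: 1,000,000 candidates containing 10 true hits.
    for spec in (0.99, 0.999, 0.99999, 0.999999):
        p = expected_precision(1_000_000, 10, sensitivity=1.0, specificity=spec)
        print(f"specificity={spec}: expected precision={p:.4f}")
```

Under these assumptions, a 99.9% true-negative rate still flags about 1,000 non-hits alongside the 10 real ones, so only about 1% of what the model hands you is real. The base rate is what makes the bar so high.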
Maybe we'll get there eventually, but the high-throughput techniques in molecular biology are also getting better at the same time.
Your model makes predictions. Prove they’re worth their salt.
As you progress along the chain genomics -> proteomics -> interactomics -> metabolomics, our understanding becomes blurrier and the challenges get harder.
https://www.biorxiv.org/content/10.1101/2024.02.29.582810v1
Tl;dr: DNA is NOT all you need.
In such a system you would take an existing bacterial cell and replace its genome with the newly synthesized version. The proteins and other molecules from the existing cell would remain (before eventually being replaced) and serve to "boot" the new genome.
How about something more useful, more lucrative, and easier to define success for, like engineering a morphine synthesis pathway into E. coli?
Imo, if you are talking about synthetic biology, then their training data is insufficient. Synthetic bio explores a lot of design space that is far outside of anything you would see in nature. There, the secret sauce would not be in the generative pretraining, but in the RL. Unfortunately, bio experiments are noisy, slow, and expensive, so good luck getting enough data before the heat death of the universe.
Among prokaryotes, there is a lot of horizontal gene transfer. What if some of the synthetic sequences get into other organisms and spread?
That's not to say you can't glean a ton from DNA, but there are some external inputs we may simply never know enough about to incorporate into the model. Ultimately DNA IS all you need...if you have perfect environmental information.
While this is potentially interesting work, it is very shortsighted and premature to say this is a "GPT" moment in biology. ML people in bio need to think hard not only about what they are doing, but also about why they are doing it (other than that it is cool and will lead to a nice Nature publication). Their basic premise (learning from DNA is the next grand challenge in biology) is shaky. Imo, the grand challenge in biology is determining what the grand challenge is, and that is a deep scientific/philosophical question.
Also, CRE activity is highly cell-type-specific. This article is a pretty awesome demonstration of model-guided design of cell-type-specific cis-regulatory elements.
https://www.biorxiv.org/content/10.1101/2023.08.08.552077v1
An LLM would not be able to do this because DNA itself contains no contextual information about cell type: every cell has a copy of the full genome. Epigenetic tracks, however, contain a lot of information germane to the cellular context, e.g., which parts of the genome are being transcribed.