Learning From DNA: a grand challenge in biology

(hazyresearch.stanford.edu)

119 points · ninjha01 · 2y ago · 26 comments

20 comments · 5 top-level

pfisherman · 2y ago · 9 in thread

Just gonna leave this here.

https://www.biorxiv.org/content/10.1101/2024.02.29.582810v1

Tl;dr: DNA is NOT all you need.

jhbadger · 2y ago

I think you are missing what the Evo project is trying to do -- create a new prokaryotic genome through a generative model. It would work much like earlier hand-made synthetic genomes such as Synthia (Gibson et al., 2010).

In such a system you would take an existing bacterial cell and replace its genome with the newly synthesized version. The proteins and other molecules from the existing cell would remain (before eventually being replaced) and serve to "boot" the new genome.

pfisherman · 2y ago

Sounds cool, but how do you define success for something like that? I can copy a prokaryotic genome mutated at some non-zero rate and it would probably be viable. Is that synthetic enough to count? Are they going for a minimal genome?

How about something more useful, lucrative, and easier to define success for, like engineering a morphine synthesis pathway into E. coli or something?

Imo, if you are talking about synthetic biology, then their training data is insufficient. Synthetic bio explores a lot of design space that is far outside of anything you would see in nature. There the secret sauce would not be in the generative pretraining, but in the RL. Unfortunately bio experiments are noisy, slow, and expensive so good luck getting enough data before the heat death of the universe.
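To make the "copy a genome mutated at some non-zero rate" baseline concrete, here is a minimal sketch. The function name `mutate`, the toy sequence, and the 1% substitution rate are all illustrative assumptions; it only models point substitutions and says nothing about whether the result would be viable.

```python
import random

BASES = "ACGT"

def mutate(genome: str, rate: float, seed: int = 0) -> str:
    """Return a copy of `genome` with each base substituted with probability `rate`."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    out = []
    for base in genome:
        if rng.random() < rate:
            # Substitute with one of the other bases (never the same base).
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

# Toy stand-in for a genome; a real prokaryotic genome is millions of bases long.
original = "ATGACCGGTTAA" * 10
variant = mutate(original, rate=0.01)

# Count how many positions differ from the original.
n_changes = sum(a != b for a, b in zip(original, variant))
print(f"{n_changes} substitutions out of {len(original)} bases")
```

A trivially "new" genome like this is the kind of baseline a generative model would have to beat for the result to count as interesting.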

nextos · 2y ago

It's an interesting endeavor, but there are some obvious safety concerns.

Within prokaryotes, there is a lot of horizontal gene transfer. What if some of the synthetic sequences get into other organisms and spread?

samuell · 2y ago

I tend to agree (the cell being in control and all the 4D interactions and epigenetic mechanisms etc), but out of curiosity, what would you say we also need?

COGlory · 2y ago

For starters, chemical environment modeling. But also cells differentiate, so in any system you need to understand the differentiation, and how those differentiated cells will change the environment of other cells, based on the environment they encounter.

That's not to say you can't glean a ton from DNA, but there are some external inputs we may simply never know enough about to incorporate into the model. Ultimately DNA IS all you need...if you have perfect environmental information.

pfisherman · 2y ago

The article I posted shows what is working better - the Olga Troyanskaya / David Kelley style models. There was another one (Kundaje group?) recently that used Hi-C data.

t_serpico · 2y ago

https://onlinelibrary.wiley.com/doi/10.1002/bies.201300153 tl;dr: metabolism is all you need.

While this is potentially interesting work, it is very shortsighted and premature to call it a "GPT" moment in biology. ML people in bio need to think hard not only about what they are doing, but why they are doing it (other than that it is cool and will lead to a nice Nature publication). Their basic premise (learning from DNA is the next grand challenge in biology) is shaky. Imo, the grand challenge in biology is determining what the grand challenge is, and that is a deep scientific/philosophical question.

dekhn · 2y ago

most of the examples in that paper (a single paper) show that DNA is nearly all you need, with the rest being RNA.

pfisherman · 2y ago

RNA is an obvious example. The examples and benchmarks they give in the paper are not the straw men the DNA LLMs are beating the stuffing out of.

Also CRE activity is highly cell type specific. This article is a pretty awesome demonstration of model guided design of cell type specific cis regulatory elements.

https://www.biorxiv.org/content/10.1101/2023.08.08.552077v1

An LLM would not be able to do this because DNA itself contains no contextual information about cell type - every cell has a copy of the full genome. Epigenetic tracks, however, contain a lot of information germane to the cellular context - e.g. which parts of the genome are being transcribed.

jashephe · 2y ago · 4 in thread

I'm a little disappointed that their linked preprint doesn't appear to include any molecular biology; i.e. they don't actually try to synthesize any of their predicted sequences and test function. It wouldn't be an outrageous synthesis task to make some of the CRISPR-Cas sequences they generated.

Also interesting that AlphaMissense is omitted from Figure 2B; it substantially outperforms the ESM-based ESM1b in our hands. But I guess the idea is that this is a general-purpose DNA language model whereas AlphaMissense is domain-specific for variant effect prediction?

bnprks · 2y ago

Strong second for wishing they tried physically testing some model output. The importance of "model that makes outputs AlphaFold thinks look like Cas" is very different from "model that makes functional Cas variants".

For design tasks like in this paper, I think computational models have a big hill to climb in order to compete with physical high-throughput screening. Most of the time the goal is to get a small number of hits (<10) out of a pool of millions of candidates. At those levels, you need to work in the >99.9% precision regime to have any hope of finding significant hits after multiple-hypothesis correction. I don't think they showed anything near that accurate in the paper.

Maybe we'll get there eventually, but the high-throughput techniques in molecular biology are also getting better at the same time.
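The base-rate arithmetic behind the precision point above can be sketched as follows. This is one reading of the argument, not anything taken from the paper: the pool size (1,000,000), the number of true hits (10), perfect sensitivity, and a 0.1% false-positive rate are all illustrative assumptions, and `expected_hits` is a hypothetical helper.

```python
def expected_hits(pool_size: int, true_hits: int,
                  sensitivity: float, false_positive_rate: float):
    """Expected true positives, false positives, and precision for a screen."""
    negatives = pool_size - true_hits
    tp = sensitivity * true_hits            # real hits the model flags
    fp = false_positive_rate * negatives    # non-hits the model flags anyway
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return tp, fp, precision

# Assumed scenario: 10 real hits among a million candidates, and a model
# that finds every real hit but wrongly flags 0.1% of the non-hits.
tp, fp, precision = expected_hits(
    pool_size=1_000_000, true_hits=10,
    sensitivity=1.0, false_positive_rate=0.001,
)
print(f"true positives: {tp:.0f}, false positives: {fp:.0f}, "
      f"precision: {precision:.2%}")
```

Under these assumptions a model that is right about 99.9% of the non-hits still flags roughly 1,000 false positives against 10 real hits, so only about 1% of its calls are genuine.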

theGnuMe · 2y ago

You are correct that it is dangerous to rely on one model as an oracle for another model's results, but extremely good models (say F=ma) are used all the time.

ackbar03 · 2y ago

This should really be a requirement for bio-related generative methods rather than a nice-to-have. A very high percentage of compounds generated by genAI-type methods have been shown not to work as intended. Anything without wet-lab validation should really be taken with a large grain of salt.

rdmirza · 2y ago

My immediate thought. Big Claims without backing.

Your model makes predictions. Prove they're worth their salt.

ninjha01 (OP) · 2y ago · 2 in thread

I built the wrapper/playground [0] linked in the article. Feel free to give feedback here or via the email in my bio.

[0] https://evo.nitro.bio/

timy2shoes · 2y ago

Hi Nishant. Great work, as always.

ninjha01 (OP) · 2y ago

Thanks for the kind words :)

d_silin · 2y ago

Would be interesting to see what comes of it.

As you progress along the following chain: genomics -> proteomics -> interactomics -> metabolomics, our understanding becomes blurrier and the challenges harder.

visarga · 2y ago

DNA is all you need? In the future generative AI will generate You!
