Biological Function Emerges from Unsupervised Learning on 250M Protein Sequences

(biorxiv.org)

251 points · smhx · 7y ago · 49 comments

27 comments · 10 top-level

gigantum · 7y ago · 4 in thread

Like some of the other ML/AI posts that made it to the top page today, this research too does not give any clear way to reproduce the results. I looked through the pre-print page as well as the full manuscript itself.

Without reproducibility and transparency in the code and data, the impact of this research is ultimately limited. No one else can recreate, iterate, and refine the results, nor can anyone rigorously evaluate the methodology used (besides giving a guess after reading a manuscript).

The year is 2019, and many are finally realizing it's time to back up your results with code, data, and some kind of specification of the computing environment you're using. Science is about sharing your work for others in the research community to build upon. Leave the manuscript for the pretty formality.

Havoc · 7y ago

>any clear way to reproduce the results.

Given that it's evolved I'd imagine this is a given? Or more accurately you could probably duplicate some kind of emergent behaviour but it would be different given different randomized parameters

threwawasy1228 · 7y ago

I think the point is more that they don't go into any meta-analysis of the big changes that were seen across many of the trials. They don't try to isolate specific mechanisms that formed in a majority of the trials that almost made it to this stage, for example. They just don't really analyze the failure trees in the trial dataset at all.

IMHO this is probably just a case of them trying to stretch this out across a bunch of different papers, and this is just the announce paper. Which is a shitty practice, but the current academic environment encourages taking good findings and puffing them up into multiple incomplete papers rather than one well-done paper.

lysium · 7y ago

Usually you use an RNG for which you can publish the seed. So, although it’s random, you can reproduce the results.
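
A minimal sketch of the seed idea, assuming Python's standard `random` module stands in for whatever RNG the training code actually uses. The function name `sample_init_weights` and the seed values are hypothetical, and fixing a seed is necessary but may not be sufficient: parallel or GPU training can stay nondeterministic for other reasons.

```python
import random

def sample_init_weights(seed, n=5):
    """Draw n pseudo-random 'initial weights' from a generator with a fixed seed."""
    rng = random.Random(seed)  # private generator, independent of global RNG state
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

run_a = sample_init_weights(seed=2019)
run_b = sample_init_weights(seed=2019)
run_c = sample_init_weights(seed=2020)

print(run_a == run_b)  # same seed, same sequence of draws
print(run_a == run_c)  # different seed, different sequence
```

Publishing the seed alongside the code, data, and environment is what lets someone else replay the same "random" run.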

vanattab · 7y ago

Is it not possible to use the same seed and random number generator to reproduce the results accurately?

ArtWomb · 7y ago · 4 in thread

Fergus Lab at NYU. I believe he's across the hall from Yann LeCun as well ;)

Still a long way from a Theory of Biogenesis. But a good next step is using a differentiable model to predict novel proteins which have no analogue in Nature. Much like Materials Genome researchers searching for stable phases of matter!

"Training ever bigger convnets and LSTMs on ever bigger datasets gets us closer to Strong AI -- in the same sense that building taller towers gets us closer to the moon." --François Chollet

visarga · 7y ago

> "Training ever bigger convnets and LSTMs on ever bigger datasets gets us closer to Strong AI -- in the same sense that building taller towers gets us closer to the moon." --François Chollet

The Transformer layer has radically leaped over LSTMs and CNNs. While LSTMs can model sequences and CNNs regular grids, they have no efficient long range interaction mechanism. Transformer does. It's a huge leap similar to the one in computer vision from a few years ago.

What is needed besides spatial translation invariance (CNN) and temporal invariance (LSTM) is permutation invariance. Whenever the problem can be described as a graph, then the ordering of the vertices and edges should not matter. You can't do that with CNNs and LSTMs, but you can do it with Graph neural nets and Transformers.

Apparently Transformers are the best for language modelling (GPT-2), playing games (Dota2 from OpenAI), composing music and possibly now in modelling proteins. I assume they will play a huge role in working with graph structured data, with multiple entities and relations.
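
The permutation point can be checked on a toy example. This is only a sketch under stated assumptions: a single attention head with identity query/key/value projections and no positional encoding, written in plain Python; it is not the architecture from the paper. Strictly speaking, such a layer is permutation *equivariant* (shuffling the inputs shuffles the outputs the same way), and it becomes permutation *invariant* only after an order-independent pooling step such as a mean. Sequence models normally add positional encodings precisely to break this symmetry.

```python
import math
import random

def self_attention(xs):
    """Single-head dot-product self-attention with identity Q/K/V projections
    and no positional encoding. xs is a list of equal-length vectors."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return out

def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def close(u, v, tol=1e-9):
    return all(abs(a - b) <= tol for a, b in zip(u, v))

rng = random.Random(0)
tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
perm = [3, 0, 5, 1, 4, 2]              # position `new` holds original token perm[new]
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Equivariance: each token's output follows the token to its new position.
equivariant = all(close(out_shuffled[new], out[old]) for new, old in enumerate(perm))
# Invariance after pooling: the pooled summary ignores the ordering.
invariant = close(mean_pool(out), mean_pool(out_shuffled))
print(equivariant, invariant)
```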

nl · 7y ago

It's not really as clear cut as that.

Transformers work well on sequence tasks because they both compare well in terms of accuracy and scale better than RNNs like an LSTM or a GRU. That means they can be trained on more data.

This isn't really the same as CNNs, where they model images by running at different scales. I'm not aware of any cases of Transformers being used particularly successfully on images.

They can be used on graphs of course, by translating the problem into a graph walk problem (à la DeepWalk).

All the examples you gave (language modelling, Dota2, music and protein modelling) are set up as sequence prediction problems, so are perfect for Transformers.

eganist · 7y ago

> I believe he's across the hall from Yann LeCun as well ;)

I'm having a hard time processing what the wink might possibly mean in this context.

No sarcasm intended.

mkolodny · 7y ago

I'd guess that wink is hinting that Yann LeCun might've had something to do with this research. Whether that's true or not, I have no idea.

(Yann LeCun is a Turing award winner for his work in deep learning)

lucidrains · 7y ago · 3 in thread

Language, music, and now amino acid sequences. Attention is all you need.

mfatica · 7y ago

I would say you also need a fair bit of data...

bearmcbearsly · 7y ago

Well, yes. But I think lucidrains was referring to:

https://arxiv.org/abs/1706.03762

return1 · 7y ago

and transformers

tepal · 7y ago · 2 in thread

This blog post seems to anticipate this happening: https://moalquraishi.wordpress.com/2019/04/01/the-future-of-...

dnautics · 7y ago

> It does a surprisingly good job of predicting protein function across a diverse set of tasks, including ones structural in nature, like the induction of a single neuron that is able, with some degree of accuracy (ρ = 0.33) to distinguish between α helices and β strands (I suspect the network as a whole is far more performant at this task than the single neuron we’ve identified, but we didn’t push this aspect of the analysis as the problem is well tackled using specialized approaches.)

I hate to be that guy, but distinguishing between alpha helices and beta strands is not really that hard.

It's a good start though. I would propose the following test: Let's see if we can use the activations from the neurons to predict the luminosity of a 'base' GFP molecule (under a fixed set of experimental conditions). Train on a set of 10,000 mutations (this could maybe be done in very high throughput by tethering the XNA to a bead, synthesizing, and then measuring the beads one by one), and see if it can extrapolate the effects of 10k more. Or heck, just do it by brute force; we've got high-throughput robots, right?
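
A rough sketch of what that train-then-extrapolate test could look like in code. Everything here is an assumption for illustration: the "activations" and "brightness" values are synthetic stand-ins generated from a made-up linear rule (`true_w`), the probe is a plain linear model fit by gradient descent, and Spearman rank correlation is just one reasonable held-out metric. A real run would swap in the model's activations and measured fluorescence for each variant.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def fit_linear_probe(xs, ys, lr=0.5, steps=200):
    """Fit weights minimizing mean squared error by full-batch gradient descent."""
    d, n = len(xs[0]), len(xs)
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            err = dot(w, x) - y
            for j in range(d):
                grad[j] += err * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a, b):
    """Spearman rank correlation (assumes no ties, which holds for these floats)."""
    ra, rb = ranks(a), ranks(b)
    mean = (len(a) - 1) / 2
    cov = sum((x - mean) * (y - mean) for x, y in zip(ra, rb))
    var = sum((x - mean) ** 2 for x in ra)
    return cov / var

rng = random.Random(0)
true_w = [1.0, -2.0, 0.5]  # hypothetical ground truth, for the synthetic data only

def make_variant():
    x = [rng.uniform(-1, 1) for _ in true_w]       # stand-in for activations
    return x, dot(true_w, x) + rng.gauss(0, 0.1)   # stand-in for measured brightness

data = [make_variant() for _ in range(400)]
train, held_out = data[:200], data[200:]           # fit on one half, test on the other

w = fit_linear_probe([x for x, _ in train], [y for _, y in train])
pred = [dot(w, x) for x, _ in held_out]
rho = spearman(pred, [y for _, y in held_out])
print(round(rho, 3))
```

The part that matters is the split: the probe never sees the held-out variants, so `rho` measures extrapolation rather than memorization.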

jostmey · 7y ago

And predicting protein function is not that hard either. The ground truth labels are often determined by sequence alignment similarity, not by experiment. So the results are far from profound

shpongled · 7y ago · 2 in thread

This is cool, but would be significantly cooler if they did some kind of biological follow-up. Perhaps getting their model to output an "ideal" sequence for a desired enzymatic function and then swapping that domain into an existing protein lacking the new function.

inciampati · 7y ago

Bingo. That would be really interesting. And useful.

There are probably already enzymes in this data set that have measurements of their behavior. Could this modelling approach be coaxed to find the one with the highest processivity? Or do we need more labeled data?

shpongled · 7y ago

I'm sure they have a bunch of enzymes in their dataset for which kinetic measurements have been published. Another interesting follow-up study would be attempting to improve kinetic behavior. They could, for instance, analyze some of the catalytically perfect enzymes out there (TIM, SOD, catalase, etc.) and see if the model could project improvements onto existing orthogonal protein classes.

cellular · 7y ago · 2 in thread

I find these emergent behaviours fascinating: https://youtu.be/gaFKqOBTj9w

jakeogh · 7y ago

FPGAs do interesting things when allowed to exploit sidechannels/analog effects: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50....

cellular · 7y ago

It would be neat to run this with a few more rules, on a larger world, and for a longer time to see what emerges!

andbberger · 7y ago

I find this paper to be so steeped in hype and dogma as to be nearly incomprehensible.

Which is a shame, because it's a reasonable approach. I wish they had just frickin described what they did instead of spending the whole paper monologuing and showcasing unconvincing experiments. No need to justify what you're doing, just do it.

obviuosly · 7y ago

> The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge.

A couple of questions:

1. What are those representations?

2. Also what is "biological function"?

3. What kind of information does the learned representation extract that is not already in the "biological properties" it is trained to map to?

superfx · 7y ago

a_bonobo · 7y ago

Here's a very cool GitHub repository which uses unsupervised learning (ULMFiT) in the genomics space: https://github.com/kheyer/Genomic-ULMFiT

Very impressive accuracies on hard tasks, and it's open source!
