DeepVariant: Highly Accurate Genomes with Deep Neural Networks (opens in new tab)

(research.googleblog.com)

69 pointstsaprailis8y ago13 comments

13 comments

11 comments · 3 top-level

dcdanko8y ago· 4 in thread

The figures in this paper use pretty deceptive scales. To be clear, DeepVariant is 0.5% better than a tool built in ~2010 (GATK), on DeepVariant's best test.

GATK is still the standard, not because better variant callers don't exist, but because it's more important that everyone uses the same tool for comparisons between studies.

alexlikeits19998y ago

That first paragraph is pretty deceptive. They are not comparing against the results from GATK 1.0.

jghn8y ago

I didn't see them mention which version they were using, presumably GATK3. I'm curious to see what it'd look like against GATK4 which is being released in a month.

Houshalter8y ago

But how much does a difference of 0.5% matter on this metric?

dcdanko8y ago

Probably not much at all. SNPs and small indels tend to be have many neighbors with which they're highly correlated. If a variant caller missed a single SNP it's likely that it still called a bunch of others that nearly always co-occur. In most cases downstream association studies would be unaffected.

It's actually possible that DeepVariant is implicitly learning some of these correlations (1). This would make it really really bad for picking out the rare persons that don't fit a trend (and tend to be very important for identifying disease loci). GATK definitely does not know about correlated SNPs.

(1) The paper implies this is not the case, saying that DeepVariant works for other genomes without retraining, but they don't show the relevant results.

j7ake8y ago· 4 in thread

very nice but do you think neural networks will also be able to interpret the function of these genotype ?

danblick8y ago

I'm skeptical. Machine learning algorithms are only as good as the training data you can provide them. There are lots of tools in genetic analysis you can use to try to understand function of some sequence (e.g. looking at conservation or homology). I don't see where deep learning would provide new value on top of hand-built statistical models that are already in use. (Geneticists have known about machine learning for a long time...)

Obligatory reference: https://xkcd.com/1831

nl8y ago

Yes. LIME works really well for this type of problem.

danblick8y ago

What is LIME?

nl8y ago

LIME: Local Interpretable Model-Agnostic Explanations; https://www.oreilly.com/learning/introduction-to-local-inter...

"“Why Should I Trust You?” Explaining the Predictions of Any Classifier": https://arxiv.org/pdf/1602.04938.pdf

https://homes.cs.washington.edu/~marcotcr/blog/lime/

https://github.com/marcotcr/lime

Anytime anyone makes snide HN comments like "oh you can't understand why neural networks make predictions" the correct response should always be "why doesn't LIME work in your specific case".

LIME is being used within the EU to explain credit decisions and fraud detection flagging on neural network based models, which is quite a high bar to regulatory oversight to pass.

1 more reply

inciampati8y ago

I implemented a similar model based around the amazing out of core linear learner Vowpal Wabbit. It did pretty well in the Precision FDA challanges despite being developed in two person months. I has the benefit of using fantastically less compute to train than something like deepvariant. (https://github.com/ekg/hhga)

The approach is the right one for small genetic variants. But it will be hard to handle more complex kinds of variation without adapting the alignments to training example synthesis.

I think the field should cool it on calling the results of something like deepvariant "genomes". These are genotypes, not fully sequenced and reconstructed genomes. The evaluations are typically on easy regions and we have no reason to believe that those are the only ones that are important. One important tool to dig into this is syndip, which is a simulated synthetic diploid where the full haplotypes are known. It is a mixture of two haploid human genomes that were de novo sequenced with pacbio technology. (https://www.biorxiv.org/content/early/2017/11/22/223297). For the curious these haploid human genomes only exist in molar pregnancies, so even this isn't ideal but it is maybe the best resource we have at present.

j / k navigate · click thread line to collapse

13 comments

11 comments · 3 top-level

dcdanko8y ago· 4 in thread

The figures in this paper use pretty deceptive scales. To be clear, DeepVariant is 0.5% better than a tool built in ~2010 (GATK), on DeepVariant's best test.

GATK is still the standard, not because better variant callers don't exist, but because it's more important that everyone uses the same tool for comparisons between studies.

alexlikeits19998y ago

That first paragraph is pretty deceptive. They are not comparing against the results from GATK 1.0.

jghn8y ago

I didn't see them mention which version they were using, presumably GATK3. I'm curious to see what it'd look like against GATK4 which is being released in a month.

Houshalter8y ago

But how much does a difference of 0.5% matter on this metric?

dcdanko8y ago

(1) The paper implies this is not the case, saying that DeepVariant works for other genomes without retraining, but they don't show the relevant results.

j7ake8y ago· 4 in thread

very nice but do you think neural networks will also be able to interpret the function of these genotype ?

danblick8y ago

Obligatory reference: https://xkcd.com/1831

nl8y ago

Yes. LIME works really well for this type of problem.

danblick8y ago

What is LIME?

nl8y ago

LIME: Local Interpretable Model-Agnostic Explanations; https://www.oreilly.com/learning/introduction-to-local-inter...

"“Why Should I Trust You?” Explaining the Predictions of Any Classifier": https://arxiv.org/pdf/1602.04938.pdf

https://homes.cs.washington.edu/~marcotcr/blog/lime/

https://github.com/marcotcr/lime

Anytime anyone makes snide HN comments like "oh you can't understand why neural networks make predictions" the correct response should always be "why doesn't LIME work in your specific case".

LIME is being used within the EU to explain credit decisions and fraud detection flagging on neural network based models, which is quite a high bar to regulatory oversight to pass.

1 more reply

inciampati8y ago

The approach is the right one for small genetic variants. But it will be hard to handle more complex kinds of variation without adapting the alignments to training example synthesis.

j / k navigate · click thread line to collapse