I expected to have to scroll through pages upon pages of indecipherable text. Instead it's no bigger than a large paragraph of text, and I can easily fit it on my screen.
The technically challenging parts are:
- delivery mechanism: you need to take a very unstable molecule, protect it from the environment - both external, and when inside the patient - and insert it into a human cell. (This is called the "platform", and is usually developed independently from the specific payload.)
- manufacturing: both producing the mRNA itself at a large scale, and inserting it into the delivery mechanism, at a large scale and in low-temperature conditions
- testing: the newly-developed payload and the existing platform were integrated at small scales within weeks, but testing the thing for safety and efficacy took months
EDIT: As schoen pointed out, this was not actually released by Moderna, but reverse engineered by third-party researchers. Original text was: "Hence they feel safe releasing this. Their moat is not the gene sequence, their moat is everything else."
https://www.modernatx.com/sites/default/files/US10702600.pdf
though they do present multiple sequences, so I guess you'd have to go to the FDA application to figure out exactly which one got used.
"Meh, I could do that over a weekend" never sounded so scary, or so impressive, at the same time. That weekend just so happened to stand on the shoulders of prior decades of research though.
i guess this is big pharma's version of `apt-get install`
Of note, the immune system is pretty good at destroying foreign mRNA so you also need to evade it.
This article is pretty good: https://berthub.eu/articles/posts/reverse-engineering-source...
The most amazing thing is that now that the platform has been proven safe in tens of millions of people, it should be very easy and fast to get approval for other payloads. BioNTech, for example, wants to go after cancers: a platform that can deliver payloads targeted to an individual's cancer is nothing short of a game changer in cancer treatment, because the current standard of blasting the patient's body with a lot of highly toxic chemicals is archaic compared to letting the body's immune system do the cleanup.
One or more of the vaccine developers may have released such details, but this particular file is a reverse engineering effort by unaffiliated scientists based on analyzing the dregs of used vaccine vials (!).
Edit: See https://news.ycombinator.com/item?id=26628594 for more substantive discussion about this.
What kind of tweaks were made from "the version they threw together in a weekend" to "the version that is in production now"? What's a typical "mRNA" feedback iteration loop like?
Sounds like a problem you solve once and for all, for any vaccine. And also that this problem had already been solved decades ago (e.g. viral vectors).
> - testing: the newly-developed payload and the existing platform were integrated at small scales within weeks, but testing the thing for safety and efficacy took months
And so many people have been killed by this overly conservative testing; phase ~<2.5 was enough.
https://blogs.sciencemag.org/pipeline/archives/2021/02/02/my...
Why manufacturing of these vaccines is a hard part.
Or the distribution method, or even really invent the thing, since you're mostly just copying someone else's work. Plus it doesn't have to even do anything. In fact, doing anything might be a problem, so best to just sit there and look menacing (and spikey).
Coincidentally, the mRNA sequences for both vaccines are about 4kb (kilobase) long.
Getting it designed and building it is more difficult.
At its core, it’s a piece of mRNA that creates a protein. That code gets translated into a protein (often those are relatively short). That protein then triggers your body's immune response, which trains it to attack covid19.
Inject this mRNA into a cell and it’ll create the protein. Anything can be injected at this point once the mechanism for injection is developed.
This should hopefully provide you with some useful perspective.
Biology is a funny old thing. You can look at that concise description - the orange and so on blocks of a few letters and a few short groupings.
Now ATCG are basic building blocks but they consist of quite a lot of stuff. I think it's a bit more complex than that because this is RNA not DNA so ATCG might not be quite right. Each of those bases is horrifically complicated depending on scale. Search "ATCG" - this is a good start: https://en.wikipedia.org/wiki/Nucleobase
Now dive into one of those bases and decompose it to its constituent atoms. Now look at the maths around this stuff. It gets quite complicated, quite quickly.
That said, the fact that a bloody complicated thingie can be described so concisely is absolutely amazing and as you say it looks so simple.
> So in the BioNTech/Pfizer vaccine, every U has been replaced by 1-methyl-3’-pseudouridylyl, denoted by Ψ. The really clever bit is that although this replacement Ψ placates (calms) our immune system, it is accepted as a normal U by relevant parts of the cell.
Neat.
"Since each of the 20 amino acids is chemically distinct and each can, in principle, occur at any position in a protein chain, there are 20 × 20 × 20 × 20 = 160,000 different possible polypeptide chains four amino acids long, or 20n different possible polypeptide chains n amino acids long. For a typical protein length of about 300 amino acids, more than 10^390 (20^300) different polypeptide chains could theoretically be made. This is such an enormous number that to produce just one molecule of each kind would require many more atoms than exist in the universe."
I mean, I can understand how an eye or a brain can evolve by natural selection, but I’m still stunned by abiogenesis. I guess we’ll never know for sure how it all started.
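The arithmetic in the quoted passage is easy to check. A minimal Python sketch (the name `chain_count` is made up for illustration, it is not from the quoted source):

```python
import math

AMINO_ACIDS = 20  # number of standard amino acids, as in the quote


def chain_count(length: int) -> int:
    """Number of distinct polypeptide chains with the given length."""
    return AMINO_ACIDS ** length


four = chain_count(4)        # 20 * 20 * 20 * 20
typical = chain_count(300)   # exact big integer, Python handles this natively

# 20^300 expressed as a power of ten: 300 * log10(20)
exponent = 300 * math.log10(AMINO_ACIDS)

print(four)                  # 160000
print(len(str(typical)))     # 391 decimal digits, i.e. more than 10^390
print(round(exponent, 1))    # 390.3
```

So the "more than 10^390" figure holds: 20^300 is roughly 10^390.3.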
Honestly, nah. It's pretty verbose. There's a lot of weird ass things in there like "Skip basepairs until you find the matching terminating sequence" (I think it's AG .* GA but it's been a decade since my bioinformatics course), but you still have to include the non-AA-coding basepairs in the middle of that.
Compensating for that is the fact that there are like, multiple independent programs; if a ribosome is offset by a single base pair, the result is entirely different. If it runs the other strand, the result is different. And instead of crashing like any program would, biology just learns to use all of those possible encodings. In part, this works because there are 64 possible codons but only 20 amino acids, and the redundancy allows a substitution to affect only some of the offsets.
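To make the "offset by a single base" point concrete, here is a small sketch that splits one stretch of sequence into codons at each of the three offsets. The helper name `codons` is hypothetical, and the 18 bases are the start of the coding region quoted elsewhere in this thread (ATGTTCGTG...):

```python
def codons(seq: str, offset: int) -> list[str]:
    """Split seq into complete 3-base codons, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


seq = "ATGTTCGTGTTCCTGGTG"

# The same bases read in three different reading frames.
frames = {offset: codons(seq, offset) for offset in range(3)}
for offset, frame in frames.items():
    print(offset, frame)

# Four bases in three positions give 64 possible codons for 20 amino acids.
n_codons = 4 ** 3
print(n_codons)
```

Each offset yields a completely different list of codons from the same bases, which is why a one-base shift produces an entirely different result.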
...with GATACCA right in the middle, but unfortunately with no GATTACA that I could find.
For comparison, the smallest chain that they technically call a protein is 100 amino acids; that's an arbitrary limit to separate proteins from enzymes. So this thing isn't tiny tiny.
But Titin (also called connectin), a giant protein responsible for passive elasticity in muscles, is ~27,000-35,000 amino acids. So this thing isn't even close to the biggest proteins out there.
Do you mean “to separate polypeptides from proteins”? Enzymatic activity has nothing to do with size. For example, one of the smallest enzymes in humans has 62 amino acid residues. And, under certain conditions, even single amino acids can be catalytic.
But yeah, the polypeptide-protein threshold can get fuzzy, especially with the recent advances in miniprotein characterization.
When I saw it, I thought that it could almost fit in a tweet, so I just did it:
https://twitter.com/weinzierl/status/1376807707957719041?s=2...
The sequence takes 16 tweets, 15 if you don't split at line endings and remove spaces (4175 nucleobases / 280 nucleobases/tweet ~ 14.9 tweets).
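The tweet count is simple division, sketched here in Python. The sequence itself is replaced by a placeholder string of the right length (an assumption for illustration; only the 4175 and 280 figures come from the comment above):

```python
import math

SEQUENCE_LENGTH = 4175  # nucleobases, from the comment above
TWEET_LIMIT = 280       # characters per tweet

exact = SEQUENCE_LENGTH / TWEET_LIMIT
tweets_needed = math.ceil(exact)

# Placeholder standing in for the real sequence, just to show the chunking.
placeholder = "N" * SEQUENCE_LENGTH
chunks = [placeholder[i:i + TWEET_LIMIT]
          for i in range(0, len(placeholder), TWEET_LIMIT)]

print(round(exact, 1))    # 14.9
print(tweets_needed)      # 15
print(len(chunks[-1]))    # 255 characters in the last tweet
```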
Reminds me of the joke about the consulting engineer who knows where to mark the X with the chalk. LOL
[1] https://berthub.eu/articles/posts/reverse-engineering-source...
I don't know how long it will be before we get a bit more serious with it, but geneticists have a big obstacle in their understanding: any change might need a thousand-strong, lifelong population study to be understood. That's way crappier than dumping the assembly or only having the documentation in Chinese.
I will add that moreover the developers might have been even more conservative in their code because they knew it was going for large scale deployment, they probably avoided the cutting edge as much as they could.
Lots of these things aren’t complicated. It’s the careful systematic testing and public trust building that’s the hard part.
https://www.nytimes.com/interactive/2020/04/03/science/coron...
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5584442/
There’s also (IIRC, no citation right now) prior work suggesting that coronavirus vaccines against the spike are likely to be effective and that vaccines against the N protein might be counterproductive.
Make your own, open-source. Really cool.
A user on lesswrong made their own (with no prior experience): https://www.lesswrong.com/posts/niQ3heWwF6SydhS7R/making-vac...
Only two companies in the world succeeded; the French company Sanofi, which also tried making an mRNA vaccine, failed.
I presume a whole bunch goes into making a vaccine, and this is just the tip of the iceberg.
https://en.wikipedia.org/wiki/Lipofectamine
https://www.thermofisher.com/us/en/home/brands/product-brand...
Short excerpt for flavor, this is from near the beginning:
A[-G-]AGA{+A+}GAA{+ATATAAGAC+}CCCG{+GCGCCG+}CCACCATGTTCGTGTTCCTGGTGCTGCTGCC[-T-]{+C+}
BioNTech_Pfiz   1 -----------GAGAATAAACTAGTATTCTTCTGGTCCCCACAGACTCAG  39
                             |||||.|.|..||||                |||   ||
Moderna         1 GGGAAATAAGAGAGAAAAGAAGAGTA----------------AGA---AG  31
BioNTech_Pfiz  40 AGAGA----AC-------CCGCCACCATGTTCGTGTTCCTGGTGCTGCTG  78
                  |.|.|    ||       ||||||||||||||||||||||||||||||||
Moderna        32 AAATATAAGACCCCGGCGCCGCCACCATGTTCGTGTTCCTGGTGCTGCTG  81
BioNTech_Pfiz  79 CCTCTGGTGTCCAGCCAGTGTGTGAACCTGACCACCAGAACACAGCTGCC 128
                  ||.||||||..|||||||||.|||||||||||||||.|.||.||||||||
Moderna        82 CCCCTGGTGAGCAGCCAGTGCGTGAACCTGACCACCCGGACCCAGCTGCC 131
Edit: I guess what I'm asking is: presumably these vaccines both target the spike protein. Do both of these sequences express the same protein? Or is there a "close enough!" thing in the immune system, where it can be a little different and still be targeted by the immune system?
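One way to probe that question on the last row of the alignment above is to translate both 50-base excerpts with the standard genetic code and compare. This is only a sketch on that excerpt, not a claim about the full sequences. It assumes both excerpts start on a codon boundary, which fits the alignment: in each sequence the ATG start codon sits 24 bases before the excerpt. The names `CODON_TABLE` and `translate` are made up for this example:

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Last row of the alignment above (written with T, as in the posted sequences).
PFIZER = "CCTCTGGTGTCCAGCCAGTGTGTGAACCTGACCACCAGAACACAGCTGCC"
MODERNA = "CCCCTGGTGAGCAGCCAGTGCGTGAACCTGACCACCCGGACCCAGCTGCC"


def translate(seq: str) -> str:
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


# Fraction of positions where the two excerpts have the same base.
identity = sum(a == b for a, b in zip(PFIZER, MODERNA)) / len(PFIZER)

print(f"nucleotide identity: {identity:.0%}")  # 86%
print(translate(PFIZER))                       # PLVSSQCVNLTTRTQL
print(translate(MODERNA))                      # PLVSSQCVNLTTRTQL
```

On this excerpt the seven differing bases are all synonymous codon choices: the nucleotides differ, the amino acids do not.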
Unfortunately, the core algorithm dates back to 1990, so it can be real slow in some cases. Biotech takes a while to improve :(
The only real annoyance I have with it is that the editor window is modal, like it blocks all the spreadsheets you have open on your machine, and it's primitive even compared to VBA, especially for debugging.
It's not just that it's given me the experience of "this is the way a spreadsheet or BI tool should work" but also "this is the way SQL should work". It's a little cumbersome to do the standard SQL-type operations, but the clean integration of functions means you can implement anything that's missing. Like say, Oracle has grouping sets - you can, and I did, just write a function to do that. I always felt that having a separate procedural language in your database was wrong, but I'd never seen the alternative until now. And I've been falling in love with higher order functions.
For those not in the know:
https://genomebiology.biomedcentral.com/articles/10.1186/s13...
[1] https://www.nature.com/articles/sdata201618
[2] https://github.com/NAalytics/Assemblies-of-putative-SARS-CoV...
Somewhere, Margaret O. Dayhoff is weeping.
Can somebody explain to me why?
https://bioinformatics.stackexchange.com/questions/11353/why...
[1] https://berthub.eu/articles/posts/reverse-engineering-source...
Normally RNA in vivo is complexed with proteins that prevent RNA from folding and annealing into a structure that is not compatible with translation to protein. In the vaccine this isn't happening; this is why RNA is hard to work with and the vaccine must be kept so cold.
This is not to say that DNA is simple to work with, but it solves problems if you don't need direct access to RNA.
[1]https://www.wired.co.uk/article/mrna-coronavirus-vaccine-pfi...
"The ribosome is composed of one large and one small sub unit that assemble around the messenger RNA, which then passes through the ribosome like a computer tape. The amino acid building blocks, that's the small glowing red molecules, are carried into the ribosome attached to specific transfer RNAs; that's the larger green molecules also referred to as tRNA. The small sub unit of the ribosome positions the mRNA so that it can be read in groups of three letters known as a codon."
Very analogous indeed.
[0] https://xerocrypt.wordpress.com/2014/07/22/how-to-read-almos...
01J3 E. coli has a DNA polymerase that contains 3’-5’ proofreading capability and 5’-3’ error correcting, with a polymerisation rate of 50 bps
I’ve made the above up because I have never been able to find a Wikipedia page since that as succinctly pointed out to me that biology was a machine, and I was hooked.
Why are we still doing genetics at the machine code level? Shouldn't we have some compilers, assemblers and linkers by now?
Take that piece of RNA. An intuitive mental model is that it's some form of "instruction" or a bunch of instructions, isn't it? It's also wrong, because it just encodes a protein that acts the way it does only because of its shape (that is, one of its potential energy local minimums) and the shape of other proteins around it. That shape is only weakly local; it can be affected by far-away sections of the peptide sequence. So it's almost impossible to systematically break it down: you have to consider and model things as a whole, which is insanely complex both computationally and cognitively.
If you want a good mental model of how it works, imagine you assemble a thing from metal balls and springs. You take a few thousand balls and connect most of them with springs of different strengths. You then take this thing and throw it on the floor; it will assume a shape that is implicitly encoded in spring strengths, its environment, and the way you've assembled it. You can even make it change shape if you poke it the right way. That's how biology works in a nutshell; it's a nightmare to design anything for systems like that. Again, you can't simplify and break down and encapsulate and abstract like you do in programming.
It’s also interesting the way it’s worded: that the sequence was “assembled from $vaccine”. Does that mean whoever published this has backed into these sequences rather than having gathered this information directly from the source(s)?
“Assembly” in this case means that they merged several short sequences they obtained, each representing a fragment of the whole mRNA sequence.
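A toy illustration of that merging step, assuming error-free fragments that are already in the right order. Real assemblers also cope with sequencing errors, unknown ordering and reverse complements. The function names are hypothetical, and the fragments are cut by hand from the start of the coding region quoted elsewhere in this thread:

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(fragments: list[str], min_len: int = 3) -> str:
    """Merge ordered fragments by collapsing each overlap into one copy."""
    contig = fragments[0]
    for fragment in fragments[1:]:
        size = overlap(contig, fragment, min_len)
        if size == 0:
            raise ValueError(f"no overlap found for {fragment!r}")
        contig += fragment[size:]
    return contig


# Three overlapping reads covering CCACC + ATG TTC GTG TTC CTG GTG CTG CTG.
reads = ["CCACCATGTTCG", "GTTCGTGTTCCTG", "TCCTGGTGCTGCTG"]
contig = assemble(reads)
print(contig)
```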
So reverse engineering basically.
I don't know much about DNA and co, but it sounds as if microservice is not the right metaphor. Rather just 30k of source code?
Because a microservice is something that is already compiled and running.
https://github.com/bionicles/coronavirus
to make the trie use the function here. the variable K is the length of the Kmers (runs of RNA). Larger values are gonna take a lot longer. ( warning: big job, uses multiprocessing...pypy recommended for speed ) https://github.com/bionicles/coronavirus/blob/b6f0db9dd8aaf7...
then you could use this recursive function to generate potential matches within some cutoff https://github.com/bionicles/coronavirus/blob/b6f0db9dd8aaf7...
the function right below it converts the generator to a list. then you could save that
enjoy
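For readers who don't want to dig through the repository, here is a generic sketch of the idea described above: a trie of k-mers plus a recursive generator of matches within a cutoff. To be clear, this is not the repository's code. The function names, the nested-dict layout and the choice of Hamming distance (mismatches only) for the cutoff are all assumptions made for illustration:

```python
def build_trie(sequence: str, k: int) -> dict:
    """Insert every k-mer (run of k bases) of sequence into a nested-dict trie."""
    root: dict = {}
    for start in range(len(sequence) - k + 1):
        node = root
        for base in sequence[start:start + k]:
            node = node.setdefault(base, {})
    return root


def fuzzy_matches(node: dict, query: str, cutoff: int, prefix: str = ""):
    """Yield stored k-mers that differ from query in at most cutoff positions.

    The query is expected to have the same length k as the stored k-mers.
    """
    if not query:
        yield prefix
        return
    for base, child in node.items():
        cost = 0 if base == query[0] else 1
        if cost <= cutoff:
            yield from fuzzy_matches(child, query[1:], cutoff - cost, prefix + base)


trie = build_trie("ATGTTCGTGTTC", k=4)

# Converting the generator to a list, as described above.
exact = list(fuzzy_matches(trie, "GTTC", cutoff=0))
near = sorted(fuzzy_matches(trie, "GTGC", cutoff=1))
print(exact)  # ['GTTC']
print(near)   # ['GTGT', 'GTTC']
```

Larger values of k grow the trie quickly, which fits the warning above about long run times.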
https://en.wikipedia.org/wiki/Three_prime_untranslated_regio....
The next thing is the poly-A tail:
https://en.wikipedia.org/wiki/Polyadenylation
BLASTing the 3' UTR, we see that ~50% of it was copied from human mitochondria
tldr, extra regulatory signals (often not well understood)
What they actually do can vary, but essentially they can provide places for other things to bind and influence what happens with the mRNA. There are some fancier cases like riboswitches, but you don't see those in humans. The stuff at the start and end of the mRNA also determines stability of the mRNA.
The Moderna one has the 3'-UTR of the alpha subunit of human hemoglobin.
You want it as high as possible to make as much spike protein as possible.
It's proprietary information, mostly they try various possibilities until they find one with high expression.
https://berthub.eu/articles/posts/reverse-engineering-source...
It's a very interesting read and I hope the author makes another post explaining the differences of the two mrna vaccines.
> The injection contains volatile genetic material that describes the famous SARS-CoV-2 ‘Spike’ protein. Through clever chemical means, the vaccine manages to get this genetic material into some of our cells.
> These then dutifully start producing SARS-CoV-2 Spike proteins in large enough quantities that our immune system springs into action. Confronted with Spike proteins, and (importantly) tell-tale signs that cells have been taken over, our immune system develops a powerful response against multiple aspects of the Spike protein AND the production process.
What happens to the "volatile genetic material" at the end of this? Does it just linger in the body indefinitely? Or does it somehow get destroyed (and what does that mean)? From my reading of the above excerpt, it's the produced spike proteins that get destroyed but not the original genetic material that's injected. The reason I'm asking is to understand how the vaccine designers determine if there are any long-term effects of having this artificial material inside your body. They couldn't have tested it over a long time frame given how quickly all this moved.
> The very end of mRNA is polyadenylated. This is a fancy way of saying it ends on a lot of AAAAAAAAAAAAAAAAAAA. Even mRNA has had enough of 2020 it appears.
> mRNA can be reused many times, but as this happens, it also loses some of the A’s at the end. Once the A’s run out, the mRNA is no longer functional and gets discarded. In this way, the ‘poly-A’ tail is protection from degradation.
Also, your cells continuously make mRNAs, depending on what proteins they need to synthesize. And those (have to) get discarded too. And also this is what happens to the actual viral RNA when the virus attacks you for real.
The properties of mRNA are well known and have been for decades. Your cells are constantly producing more from the nucleus. It degrades, even more so when it gets translated. That's the beauty of this, it's self-limiting.
The only 'artificial' thing about it is the special base that's added to avoid detection by the immune system. Everything else is the exact same compounds present in your cells.
https://en.wikipedia.org/wiki/Ribosome
You can think of RNA as a copy of a section of DNA. They look very much like computer programs except rather than producing code, the Ribosome can read them and translate each codon for an amino acid into its corresponding actual amino acid that it then binds together into a protein. The execution engine is the environment of the cell. All highly probabilistic rather than deterministic. I can't imagine any programmer not finding them completely fascinating.
Something that might fit the computation vision of your comment are the various Ontologies for bioinformatics. The Gene Ontology is probably the most complete, although it lags many years behind the literature.
"Clasp: Common Lisp using LLVM and C++ for Designing Molecules": https://www.youtube.com/watch?v=0rSMt1pAlbE
What I don't understand is:
a) how the m-RNA code relates to the produced protein (i.e. I can read C code and get an idea of what it does fairly quickly, but can the same be said of m-RNA and the resulting protein)?
b) how did they get their hands on that code in the first place? Do the coronaviruses use m-RNA as well? Was then a coronavirus somehow "dissected" to get at the spike protein "source code"?
a) From the mRNA you can learn the amino acid sequence of the protein very quickly. You absolutely cannot (yet) learn the function of the protein from that sequence - normally, people just do comparisons with proteins whose functions ARE known. Oftentimes in enzymes there are "domains" or little functional regions that stay consistent over long periods of time, so that's a good way to assign function (given knowledge of other proteins in the same family)
b) Yep. Every virus uses mRNA at some point in its lifecycle. You can just sequence the virus and get all that data (I've done that on SARS-CoV-2, it's honestly pretty easy). Then you just do homology alignment (as stated above) and you can figure out approximately what each gene does.
The problem of de-novo protein prediction is ONE OF THE HARDEST PROBLEMS IN BIOTECH, but things like getting the amino acid sequence, doing homology searches, sequencing viruses, etc., are basic biotech and I'd expect an eager high schooler or undergrad to be able to do them.
b) Coronaviruses happen to be RNA viruses; that is, their genomes are RNA rather than DNA. DNA viruses also exist and are common. We got full genomes from sequencing early in the pandemic, and continue to use it to monitor the evolution of the virus (see e.g. [1], where the results are available for download). Sequencing is very cheap and easy these days - you take a sample from a patient, use chemicals to break down all the cell membranes and such, sequence all of the DNA and RNA in it, and look through the results for a virus genome (i.e. something that isn't a human chromosome and isn't a known virus or bacterial genome). "m"RNA is more a description of the function than the chemical - tRNA and rRNA are short snippets of RNA used for manufacturing purposes inside the cell, while mRNA is the long chunks that actually carry information from the DNA to the protein manufacturing sites. Virus RNA basically functions as imposter mRNA, getting those manufacturing systems to make more viruses.
[1] https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes/ - SARS-CoV-2 is the COVID-19 virus. As of my fetch, there are 71,509 full sequences of the virus, reflecting slight mutations over time and space.
b) Coronaviruses have an RNA genome. Researchers extracted it from wild-type viruses and then sequenced it.
[1]: mRNAs can undergo several maturation steps, such as splicing, which removes regions that won’t be translated into protein.
They have. https://en.wikipedia.org/wiki/Viral_envelope
https://en.wikipedia.org/wiki/Retrotransposon
The injection is important, however, as it gets the genetic material past a whole lot of nucleases that cover your epithelia.
But for now we can inject code to trigger protein configuration via the immune system
Except it is unfortunately not that simple, because it assumes that distinct components such as CPU, co-processors and even logic gates exist in that context, as is totally reasonable to assume on devices created by humans. Abstracting complex machines into distinct components is a proven strategy to engineer a system, but it's not a necessity for functioning systems to exist.
In the case of natural organisms, they "just" need to work. They don't have a blueprint, and they don't need to be organized in a way that allows for easy understanding by looking at individual parts in separation.
Consider also the difference between machine learning through neural networks ("we stuff a lot of training data in there and get what we want eventually, we hardly understand what the model does or why it fails"), and a QR code reader ("we carefully designed the format from the top down, including e.g. framing, error correction, and several invariants like rotation; if a QR code does not get recognized, we can usually tell exactly where and why it failed").
https://twitter.com/PowerDNS_Bert/status/1375091898797453326
https://gizmodo.com/stanford-scientists-post-entire-mrna-seq...
> Fire and Shoura told Motherboard that they had received permission from the FDA to collect scraps of vaccines that wouldn’t have otherwise been used from empty vials and that they’d notified Moderna in advance of their plans to publish the sequence without receiving any objection in turn.
Also:
> The research team told Motherboard that they didn’t “reverse engineer” the vaccine, they simply “posted the putative sequence of two synthetic RNA molecules that have become sufficiently prevalent in the general environment of medicine and human biology in 2021.”
I'm not familiar enough with how these sequences work to understand what's being discussed. Is it simply that they took a sample of the vaccine and studied its composition using some standard machine/process?
http://www.josiahzayner.com/2020/12/i-made-covid-19-vaccine-...
other than the obvious advantage of being shorter, it would also be easier to read: the boundaries would be unambiguous and each char would correspond directly to an amino acid (if applicable/coding)
I do think back to the early days of Covid when there were all these predictions around when a vaccine would show up. It seemed like there was knowledge that the mRNA platform would be the likely solution and probably by April we knew a vaccine would be possible - it just took 6+ months to test.
Thinking about that timeline amazes me.
"I could have done this in a weekend"
Cargo-cult much?
if covid?(dna)
block_virus(dna)
end
so, the virus is sort of like a ball with these spikes on top (that’s where the corona name comes from) and the vaccine helps your body develop antibodies against the spikes. so when the virus gets in your body, it actually receives a “haircut” which leads to it no longer being able to enter the cells and hijack their internals to produce more viruses.
it’s extremely clever, but it also means that your code is wrong ;))