Rosalind: Learn bioinformatics by programming it

(rosalind.info)

167 points · mnemonicsloth · 6y ago · 40 comments

eesmith · 6y ago

So, I picked one at semi-random - http://rosalind.info/problems/prtm/ and found a usability problem (a popup that doesn't work; in FF or Safari) and a wrong example answer. Here's the description.

> Given: A protein string P of length at most 1000 aa.

> Return: The total weight of P. Consult the monoisotopic mass table.

The "monoisotopic mass table" appears to be a link. I get a pop-up, but nothing appears in it, other than a spinner. I had to do a web search to find http://rosalind.info/glossary/monoisotopic-mass-table/ .

The page continues:

> Sample Dataset - SKADYEK

> Sample Output - 821.392

Using the monoisotopic mass table I computed:

    >>> d = {'A': 71.03711, 'C': 103.00919, 'D': 115.02694,
    ...      'E': 129.04259, 'F': 147.06841, 'G': 57.02146,
    ...      'H': 137.05891, 'I': 113.08406, 'K': 128.09496,
    ...      'L': 113.08406, 'M': 131.04049, 'N': 114.04293,
    ...      'P': 97.05276, 'Q': 128.05858, 'R': 156.10111,
    ...      'S': 87.03203, 'T': 101.04768, 'V': 99.06841,
    ...      'W': 186.07931, 'Y': 163.06333}
    >>> sum(d[c] for c in "SKADYEK")
    821.3919199999999

This matches the example. BUT!!!!

This is NOT the correct answer because as the expanded text says, "the mass of a protein is the sum of masses of all its residues plus the mass of a single water molecule."

The table says "the monoisotopic mass of water is considered to be 18.01056" so

    >>> 821.3919199999999 + 18.01056
    839.40248

This latter number matches the value given by https://web.expasy.org/cgi-bin/compute_pi/pi_tool .

Which means the example answer ... is wrong. Yes?

How (in)correct are the other answers? I-am-not-a-bioinformatics-programmer.
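For anyone who wants to reproduce both numbers, here is a minimal sketch of the two readings of the problem. The function name `peptide_mass` and the `include_water` flag are made up for illustration; the masses are the ones quoted from the Rosalind table above.

```python
# Monoisotopic residue masses (Da), as quoted from the Rosalind table.
MONOISOTOPIC_RESIDUE_MASS = {
    'A': 71.03711, 'C': 103.00919, 'D': 115.02694, 'E': 129.04259,
    'F': 147.06841, 'G': 57.02146, 'H': 137.05891, 'I': 113.08406,
    'K': 128.09496, 'L': 113.08406, 'M': 131.04049, 'N': 114.04293,
    'P': 97.05276, 'Q': 128.05858, 'R': 156.10111, 'S': 87.03203,
    'T': 101.04768, 'V': 99.06841, 'W': 186.07931, 'Y': 163.06333,
}
WATER_MASS = 18.01056  # monoisotopic mass of water (Da), per the same table


def peptide_mass(sequence, include_water=False):
    """Sum the residue masses; optionally add one water for a free peptide.

    `include_water` is a hypothetical switch between the two readings
    discussed in this thread, not something the Rosalind problem defines.
    """
    total = sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in sequence)
    return total + WATER_MASS if include_water else total


print(round(peptide_mass("SKADYEK"), 3))                      # residues only
print(round(peptide_mass("SKADYEK", include_water=True), 3))  # plus one water
```

The first call corresponds to the sample output on the problem page, the second to the value ExPASy reports.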

ihunter2839 · 6y ago

This site was developed by Pavel Pevzner, who teaches bioinformatics at UCSD. We used this site as the main curriculum in one of our final bioinformatics classes, and after solving ~10-15 problems a week for 10 weeks, I don't recall a single time where the error was in the problem set solutions.

Re: the problem - not a hundred percent on this, but I think the issue is that they are vague on the fact that this is a theoretical question, not a practical one. The key is that the question itself does not mention the addition of the water molecule, just that you have a sequence P with a dictionary of weights.

Edit 1: If memory serves me correctly, after the initial ionization phase of mass spectroscopy, the additional water molecule is discarded, making it insignificant in the analysis of your peptide sequences.

Edit 2: If anyone is interested in following through this site, I would highly recommend using the existing problem tracks http://rosalind.info/problems/list-view/?location=bioinforma... These will help lay out the problems in a logical order and ensure you have the skills you need to progress. Alignment problems are a great way to learn dynamic programming and will allow you to move on to some of these other problems (like mass spec and HMMs) more reasonably (at least, in my experience!). Good luck!

eesmith · 6y ago

I was thinking about this some more. You wrote "after the initial ionization phase of mass spectroscopy".

In high school I tried to build a mass spectrometer. It didn't work - I couldn't get a high enough vacuum, and only a few years later, as a physics undergrad, did I find that that was just one of several problems I had. It was fun to try though.

But I do know that the ionized particle has a charge, and that the electron affects the overall mass by about 1/1836 Dalton. That's 0.00054 Dalton, while the table lists masses to even higher precision, like 71.03711.

The example output gives a value down to 3 decimal digits, so at that precision there's a 50% chance that the electron mass will affect the result.

Isn't this problem therefore implicitly teaching an excessive trust in significant digits?

Now, I suspect that the mass spectrometers they use aren't that accurate. But it's bugging me now.

As mbreese wrote elsewhere here, I'm (clearly) reading too much into the problem. I don't think bioinformatics is the right field for me.

eesmith · 6y ago

A closer reading shows that I got tripped up by what "residue" means. But I'm not sure the author of the question got it right either? At the very least, I'm confused by it.

The first paragraph of the expanded question text has: "every pair of adjacent amino acids has lost one molecule of water, meaning that a polypeptide containing n amino acids has had n−1 water molecules removed"

The second paragraph has: "Thus, the mass of a protein is the sum of masses of all its residues plus the mass of a single water molecule."

The fifth paragraph has: "The mass of a protein is the sum of the monoisotopic masses of its amino acid residues plus the mass of a single water molecule"

And the monoisotopic mass table says "Note: the monoisotopic mass of water is considered to be 18.01056 Da."

So I thought that the water molecule was important in the calculation.

However, the last paragraph (which I only now closely read) says it isn't important, with "In the following several problems on applications of mass spectrometry, we avoid the complication of having to distinguish between residues and non-residues by only considering peptides excised from the middle of the protein. This is a relatively safe assumption because in practice, peptide analysis is often performed in tandem mass spectrometry."

Since it didn't mention "water", and instead used the specialist term "residue", I missed the connection earlier.

That said, the text seems to use "residue" inconsistently. There's the definition "a residue is a molecule from which a water molecule has been removed; every amino acid in a protein are residues except the leftmost and the rightmost ones."

but there's also the usage: "the mass of a protein is the sum of masses of all its residues plus the mass of a single water molecule"

Surely that should be "the mass of a protein is the sum of masses of all its residues plus the mass of its leftmost and rightmost amino acids minus the mass of a single water molecule", yes?

So I looked up the definition of "amino acid residue". It appears to be https://goldbook.iupac.org/terms/view/A00279 "α-Amino-acid residues are therefore structures that lack a hydrogen atom of the amino group (–NH–CHR–COOH), or the hydroxyl moiety of the carboxyl group (NH2–CHR–CO–), or both (–NH–CHR–COO–); all units of a peptide chain are therefore amino-acid residues".

https://en.wikipedia.org/wiki/Protein_sequencing#Whole-mass_... also agrees that "residue" includes the two amino acids at the ends, saying "The protein’s whole mass is the sum of the masses of its amino-acid residues plus the mass of a water molecule and adjusted for any post-translational modifications"

Which means ... I don't think the author uses the term "residue" correctly?

Or, more likely, I'm confused by the specialist terminology. Can someone clear up my confusion?
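One way to see how the quoted formula can hold under the IUPAC reading, where every unit of the chain (including both ends) counts as a residue, is a short derivation. The symbols are mine, not the Rosalind text's: $a_i$ is the mass of free amino acid $i$, $w$ is the mass of water, and $r_i = a_i - w$ is the tabulated residue mass. That last definition is an assumption about how the table was built; it is consistent with free glycine at roughly 75.03203 Da giving a residue of about 57.02147 Da.

```latex
\begin{aligned}
M_{\text{peptide}} &= \sum_{i=1}^{n} a_i - (n-1)\,w
    && \text{($n-1$ waters lost in forming the peptide bonds)} \\
&= \sum_{i=1}^{n} (r_i + w) - (n-1)\,w \\
&= \sum_{i=1}^{n} r_i + w
\end{aligned}
```

So with $n$ residues and $n-1$ bonds, exactly one water is left over, which matches "the sum of masses of all its residues plus the mass of a single water molecule".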

mnemonicsloth (OP) · 6y ago

You're right. Sometimes the answer key is wrong. I have to explain this to my professors from time to time, and it's always annoying. And in those cases I have paid money to be graded incorrectly.

I would be happy if I were you though. The point of this exercise is to learn, and I'll bet you'll remember that water molecule for a long time :-)

DrScientist · 6y ago

As an aside - monoisotopic mass is a strange one to use.

In the real world you are a mixture of isotopes, so it's better to use the average mass (the average of the different isotope masses, weighted by abundance) if you want to compare to experimentally determined masses - say from mass spec.

It's not as if average mass is more complex - for the sake of these calculations it's still just a number looked up from a table...

i.e. why oh why use the wrong value when it's just as easy to use the right one (*)?

(*) True, it's biology, so there isn't a right one in all circumstances - lots of interesting effects, e.g. enzymes having slightly different rates of incorporation for different isotopes - however it's closer to the truth than monoisotopic.
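To make the difference between the two kinds of mass concrete, here is a rough sketch for a single glycine residue (C2H3NO). The isotope masses and natural abundances are approximate textbook values that I am supplying, and the helper names are hypothetical.

```python
# (mass in Da, natural abundance) per isotope; approximate values, assumed.
ISOTOPES = {
    "C": [(12.0, 0.9893), (13.0033548, 0.0107)],
    "H": [(1.00782503, 0.999885), (2.01410178, 0.000115)],
    "N": [(14.0030740, 0.99636), (15.0001089, 0.00364)],
    "O": [(15.9949146, 0.99757), (16.9991317, 0.00038), (17.9991596, 0.00205)],
}

GLYCINE_RESIDUE = {"C": 2, "H": 3, "N": 1, "O": 1}  # glycine minus one water


def monoisotopic_mass(formula):
    """Mass using only the most abundant isotope of each element."""
    return sum(
        count * max(ISOTOPES[element], key=lambda iso: iso[1])[0]
        for element, count in formula.items()
    )


def average_mass(formula):
    """Mass using the abundance-weighted mean of each element's isotopes."""
    return sum(
        count * sum(mass * abundance for mass, abundance in ISOTOPES[element])
        for element, count in formula.items()
    )


print(monoisotopic_mass(GLYCINE_RESIDUE))
print(average_mass(GLYCINE_RESIDUE))
```

The monoisotopic value lands on the table's 57.02146 for G, and the average comes out about 0.03 Da heavier. Either way it is a table lookup in practice, as noted above.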

eesmith · 6y ago

https://en.wikipedia.org/wiki/Monoisotopic_mass#Monoisotopic... points out:

> The monoisotopic mass is not used frequently in fields outside of mass spectrometry because other fields cannot distinguish molecules of different isotopic composition. For this reason, mostly the average molecular mass or even more commonly the molar mass is used. For most purposes such as weighing out bulk chemicals only the molar mass is relevant since what one is weighing is a statistical distribution of varying isotopic compositions.

> This concept is most helpful in mass spectrometry because individual molecules (or atoms, as in ICP-MS) are measured, and not their statistical average as a whole. Since mass spectrometry is often used for quantifying trace-level compounds, maximizing the sensitivity of the analysis is usually desired. By choosing to look for the most abundant isotopic version of a molecule, the analysis is likely to be most sensitive, which enables even smaller amounts of the target compounds to be quantified. Therefore, the concept is very useful to analysts looking for trace-level residues of organic molecules, such as pesticide residue in foods and agricultural products.

DrScientist · 6y ago

However, for proteins - which, even if broken down to small peptides in the mass spec, have large numbers of C, N, O, H atoms - monoisotopic makes no sense.
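A back-of-the-envelope sketch of why atom count matters here: the chance that every carbon in a molecule is 12C falls off geometrically with the number of carbons. This looks at carbon only and assumes a 13C natural abundance of about 1.07%; the function name is made up.

```python
C13_ABUNDANCE = 0.0107  # approximate natural abundance of carbon-13 (assumed)


def all_c12_probability(n_carbons):
    """Probability that none of `n_carbons` independent carbons is 13C."""
    return (1 - C13_ABUNDANCE) ** n_carbons


for n in (10, 100, 1000):
    print(n, all_c12_probability(n))
```

With these numbers, roughly nine in ten molecules with 10 carbons are purely 12C, about a third at 100 carbons, and only a few in a hundred thousand at 1000 carbons. Counting N, O and H isotopes as well would lower these further.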

fao_ · 6y ago

I was going to suggest that the result was bad because of floating point error, but then I reread the value, and it doesn't seem like that amount of variance could be produced by errors introduced in the floating point calculations?

patrec · 6y ago

You sum 7 numbers all around 100. Assuming I didn't mess up, the maximal floating point error for that is ~100·7·2^-53 which is < 10^-13.
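The size of the rounding error can also be measured directly rather than bounded. This sketch compares the ordinary float sum against exact rational arithmetic using the standard-library `fractions` module; the variable names are mine, and only the residues that occur in SKADYEK are listed.

```python
from fractions import Fraction

# Table values as decimal strings so Fraction can represent them exactly.
MASS_TEXT = {
    "S": "87.03203", "K": "128.09496", "A": "71.03711",
    "D": "115.02694", "Y": "163.06333", "E": "129.04259",
}
WATER = Fraction("18.01056")
peptide = "SKADYEK"

float_sum = sum(float(MASS_TEXT[aa]) for aa in peptide)     # what the REPL did
exact_sum = sum(Fraction(MASS_TEXT[aa]) for aa in peptide)  # no rounding at all
error = abs(Fraction(float_sum) - exact_sum)

print(float_sum)
print(float(error))
print(float(WATER / error))  # how many times larger the water mass is
```

Whatever the exact figure on a given machine, the error is many orders of magnitude smaller than the 18.01056 Da difference being discussed, so floating point cannot explain it.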

xvilka · 6y ago

Would have been nice to have a Julia version too. Some time ago I suggested [1] creating a Julia flavor of the Biostar Handbook [2]. Now there is an initiative [3] to create a similar, but open-source, book instead, so anyone can contribute already.

[1] https://discourse.julialang.org/t/biostar-handbook-computati...

[2] https://www.biostarhandbook.com/

[3] https://github.com/BioJulia/biojulia_handbook/issues/1

computerfriend · 6y ago

Rosalind is language-agnostic, except for the mini tutorial at the start.

rjkennedy98 · 6y ago

Surprised to see this here as this has been around for quite some time. I used to do these problems on the weekends in 2013-14.

killjoywashere · 6y ago

I think some sites are like books: they are timeless. Rosalind is one of them. I'd add Philip Greenspun's /books (http://philip.greenspun.com/books/).

divbzero · 6y ago

If by chance anyone is not aware, the namesake is Rosalind Franklin [1] who made seminal contributions in the fields of X-ray crystallography and electron microscopy.

[1]: https://en.wikipedia.org/wiki/Rosalind_Franklin

It was her X-ray image that led to the discovery of the molecular structure of DNA.

dang · 6y ago

A thread from 2012: https://news.ycombinator.com/item?id=4761831

(Reposts are fine after a year: https://news.ycombinator.com/newsfaq.html)

fao_ · 6y ago

First off, the login page doesn't redirect to the HTTPS version of the page, so it's sending my password in plaintext. What makes this worse is that when I manually go to the TLS page, it gives me a PR_END_OF_FILE_ERROR (I'm running Firefox 72.0.2 on Alpine Linux).

The second thing came up when I picked the first example (the character counting problem). On clicking it, I was told that the important words are highlighted, and that the words 'figure N' refer to the figures on the right -- which felt unnecessary, because it's something that anyone visiting Wikipedia, or browsing a book, would know.

danielecook · 6y ago

It’s a great site and greatly accelerated my learning of programming.

The form of learning which I call “problem based” learning is a great format for me. You learn from reading up on a topic. You learn from trying different solutions. Finally, you learn from seeing other people’s answers once you’ve solved it.

Also check out:

Hackerrank.com - all-around focus

Project Euler - math focus

Leetcode - more oriented towards interview training but still useful and fun.

acomjean · 6y ago

We used a version of this site for a bioinformatics algorithms class a couple of years ago (we used the site for part of the homework assignments; I guess the auto-grading of code saves the instructors time...)

The problems are interesting and fun to solve. They didn’t have a lot of context, though they seem to have added some at the start of each problem.
