People do talk about the genome and its elements using a location given by chromosome number and coordinate range, much as you'd describe an index range in a string. There is even special notation for doing so [1]. However, it depends on _how_ you're looking at the biology.
I think an analogy would be: you can describe all code as machine code, but when there are higher level abstractions you wouldn't choose to do so.
Now, there are much more sophisticated answers, and downstream points to be made about graph genomes instead of a single reference, etc. (which would also get at your point about why geneticists don't talk about it this way). But that's a broader scope.
Each human started with between 1 and 5 copies of the X chromosome. Those copies differ in various ways. Many of the differences are single-nucleotide variants: identical in a region except for a single letter changed. There are also tandem repeats, where, say, a CAG sequence might occur once or dozens of times. (Counting the number of such repeats is often used for DNA fingerprinting.) There is also ample larger-scale structural variation, including whole regions of the genome present or absent in one copy or another, or copied multiple times in a row, or moved in from another chromosome, or reversed.
Complicated enough? On top of that, there are trillions of cells in each human, and across those trillions of cells you will find many slightly different copies of the original 1 to 5 X chromosomes from when that human was a single cell. You will definitely have changes at the ends of the chromosomes, the telomeres, which are made up of variable tandem repeats. You'll also have single-nucleotide mutations and, if you're unlucky, bigger changes. On some chromosomes (not chromosome X), there's also V(D)J recombination, where our immune "memory" is actually encoded as changes to the genome sequence in particular cells. Cancer or a pre-cancerous syndrome increases the frequency and severity of these changes.
If you want to sequence a whole chromosome, you have to contend with the fact that the most accurate sequencing methods generally give you reads of 1000 nucleotides or less, which you then have to assemble. People liken the problem to putting together a jigsaw puzzle, but it's not like assembling a jigsaw puzzle from a single box. It's more like taking hundreds of boxes of supposedly the same puzzle (but in reality with small differences that keep pieces from fitting together quite right), dumping them all in a pile, randomly removing a bunch of pieces, and then trying to figure out how everything fits together. Also, many parts of this puzzle have identical artwork and fit together identically! Good luck!
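To make the jigsaw analogy a bit more concrete, here's a minimal sketch of the merge-by-overlap idea that assembly builds on, using made-up reads (real assemblers are vastly more sophisticated than this greedy toy):

```python
# Toy sketch of greedy overlap assembly on hypothetical reads (made-up
# sequences, not real sequencing data).

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a matching a prefix of b (>= min_len), else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: contigs stay separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads) if n not in (best_i, best_j)] + [merged]
    return reads

# Reads from a short, unique sequence reassemble cleanly:
print(greedy_assemble(["TTACCG", "CCGTAG", "TAGGAT"]))  # ['TTACCGTAGGAT']
```

This works fine when every overlap is unambiguous; the "identical artwork" pieces are exactly where it falls apart.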
Scientists have been applying a lot of ingenuity to this puzzle for decades and getting a whole chromosome assembly like this is a big milestone.
Note: Don't take any of the specific numbers above as gospel. These technologies develop extremely quickly, so it's quite likely that my knowledge of typical error rates is out of date.
In any case, here's the relevant quote from the original link (to phys.org), before it was changed to the less technical press release, which doesn't mention any specific technologies used:
"The new project built on that effort, combining nanopore sequencing with other sequencing technologies from PacBio and Illumina, and optical maps from BioNano Genomics. Using these technologies, the team produced a whole-genome assembly that exceeds all prior human genome assemblies in terms of continuity, completeness, and accuracy, even surpassing the current human reference genome by some metrics."
And those error rate examples are way, way too high: Illumina is closer to Q30, which is a 1/1000 error rate [0]. 15% would result in an unusable sequence.
https://emea.illumina.com/science/technology/next-generation...
Single-read accuracy is not as important for such projects. As coverage gets to 50-60X, expected assembly accuracy on human is around Q30.
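A rough way to see why coverage compensates for noisy single reads, under an idealized model where read errors are independent and the consensus is a simple per-base majority vote (hypothetical numbers; real consensus polishing is more involved than this):

```python
# Back-of-the-envelope: per-base consensus error under a naive
# majority-vote model with independent read errors (an idealization).
from math import comb, log10

def phred(p_err):
    """Phred quality: Q30 corresponds to a 1-in-1000 error rate."""
    return -10 * log10(p_err)

def consensus_error(per_read_err, coverage):
    """P(a strict majority of the reads covering a base are wrong)."""
    return sum(
        comb(coverage, k) * per_read_err**k * (1 - per_read_err)**(coverage - k)
        for k in range(coverage // 2 + 1, coverage + 1)
    )

# Even a 15% per-read error rate gives a tiny consensus error at 51x:
p = consensus_error(0.15, 51)
print(round(phred(1e-3)), phred(p) > 30)  # 30 True
```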
Or, to make this simpler: finding the complete DNA sequence of chromosomes is difficult, because some parts of the sequence are highly repetitive. Using a new type of lab machine, the scientists were able to sequence the repetitive parts of the X chromosome. This gives a more complete picture of the X chromosome, and that can help scientists fight diseases and understand human biology better.
Ah, https://en.wikipedia.org/wiki/Clock_recovery !
Too bad DNA isn't a run-length limited code. (Wouldn't that be something.)
So despite having taken multiple photos of every square inch of land in your target area, there's no way you can assemble them into one big image just by matching up the overlaps. Without a source of larger-scale information about the region, like a satellite photograph or GPS coordinates for the photos, you have no way of knowing how wide that desert is. All you know is that it's wider than one or two photographs.
This is essentially same problem that current genome assemblies have: there are regions of repetitive sequence in the genome, so all the sequencing reads from those regions look identical to each other, just like the photographs of flat sandy desert, and there's no way to tell how they're supposed to overlap to form the full sequence. The only way to resolve these regions is with a technology that can read all the way through from one end to the other without stopping, producing a single contiguous sequence.
The link here describes the fruits of an effort using exactly those sorts of long-read technologies to fill in all the gaps in the X chromosome sequence, thus generating a single contiguous sequence from end to end, something that hasn't previously been possible for DNA sequences of this size.
As to why this is important, these repetitive sequences, despite being apparently featureless, still sometimes have important effects (not unlike the apparently dead and featureless desert in the analogy). In addition, sometimes there are "oases" of functionally important non-repetitive DNA sequence within the "desert" of repetition, and previous genome assembly methods would not be able to tell where these oases belonged. All of this is important because many functional DNA elements are cis-acting. That is, they exert effects on genes that are nearby on the genome. So if you don't know where they belong, then you don't know what they're doing.
If you can assemble one big chromosome sequence from end to end, all of the above problems go away, and you can finally get on with the analysis you wanted to do anyway and stop worrying about not being able to calculate meaningful distances between DNA elements.
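The desert analogy maps directly onto tandem repeats: once the repeat tract is longer than a read, genomes that differ only in the number of repeat copies produce exactly the same set of reads. A toy sketch with made-up sequences:

```python
# Two hypothetical "genomes" differing only in repeat copy number.
# With reads shorter than the repeat tract, their read sets are
# indistinguishable, so the copy count cannot be recovered.
def read_set(genome, read_len=6):
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}

short_genome = "TTACC" + "CAG" * 5 + "GGTAA"
long_genome = "TTACC" + "CAG" * 50 + "GGTAA"

# Identical read sets: no overlap-based method can count the copies.
print(read_set(short_genome) == read_set(long_genome))  # True
```

A read long enough to span from one flanking region to the other resolves the ambiguity in one shot, which is exactly what the ultra-long reads provide.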
Long sequence reads allowed them to map this highly repetitive chromosome. Most sequencing is done with high-throughput short reads.
The technology can theoretically be used to map other regions of the genome which are highly repetitive.
What's different about this?
Edit: Upon testing, that appears to not be the case. Probably a typo.
> To circumvent the complexity of assembling both haplotypes of a diploid genome, we selected the effectively haploid CHM13hTERT cell line for sequencing (abbr. CHM13)
Incidentally, they do capture the other chromosomes in this process:
> Several chromosomes were captured in two contigs, broken only at the centromere (Fig 1a).
> https://www.biorxiv.org/content/10.1101/735928v3.full.pdf
Step 2: Follow a procedure for DNA prep that results in long stretches of DNA (though not entire chromosome lengths) and amplify (make multiple copies of) the mixture, per this reference:
> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5889714/
Step 3: Run the mixture through a nanopore sequencer (essentially a hole a few nanometres across), reading the change in current in response to the different bases, including methylated bases.
Step 4: Repeat this many, many times to get multiple reads of each region of the genome:
> In total, we sequenced 98 MinION flow cells for a total of 155 Gb (50× coverage, 1.6 Gb/flow cell, SNote 2). Half of all sequenced bases were contained in reads of 70 kb or longer (78 Gb, 25× genome coverage) and the longest validated read was 1.04 Mb.
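The quoted throughput figures are easy to sanity-check (the ~3.1 Gb genome size here is my own assumption, not from the quote):

```python
# Sanity-checking the quoted sequencing numbers.
total_bases = 155e9   # 155 Gb across 98 MinION flow cells (from the quote)
flow_cells = 98
genome_size = 3.1e9   # approximate human genome size (assumption)

coverage = total_bases / genome_size          # ~50x, as quoted
per_cell_gb = total_bases / flow_cells / 1e9  # ~1.6 Gb/flow cell, as quoted
print(f"{coverage:.0f}x, {per_cell_gb:.1f} Gb/flow cell")  # 50x, 1.6 Gb/flow cell
```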
Step 5: Overlap and assemble the data from the long reads:
> Once we had collected sufficient sequencing coverage for de novo assembly, we combined 39× of the ultra-long reads with 70× coverage of previously generated PacBio data [18] and assembled the CHM13 genome using Canu [19]. This initial assembly totaled 2.90 Gbp with half of the genome contained in contiguous sequences (contigs) of length 75 Mbp or greater (NG50), which exceeds the continuity of the reference genome GRCh38 (75 vs. 56 Mbp NG50).
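The NG50 in the quote is the contig length such that contigs of that length or longer cover at least half of the estimated *genome* size (N50 uses half of the total assembly size instead). A small sketch:

```python
# NG50: contig length at which contigs of that size or larger cover
# at least half the (estimated) genome size.
def ng50(contig_lengths, genome_size):
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= genome_size / 2:
            return length
    return 0  # assembly covers less than half the genome

# Toy example with made-up contig lengths and a genome of size 100:
print(ng50([40, 30, 20, 10], 100))  # 30, since 40 + 30 >= 50
```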
> The read was placed in the location of the assembly having the most unique markers in common with the read. Alignments were further filtered to exclude short and low identity alignments. This process was repeated after each polishing round, with new unique markers and alignments recomputed after each round.
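The "unique markers" idea in that quote can be sketched as follows: treat k-mers that occur exactly once across the assembly as markers, and place a read in the region sharing the most markers with it. This is a toy reconstruction of the concept, not the authors' actual pipeline (their marker definition and alignment filtering are more involved):

```python
# Toy k-mer "unique marker" placement (conceptual sketch only).
from collections import Counter

def place_read(read, regions, k=4):
    """Index of the region sharing the most assembly-unique k-mers with the read."""
    counts = Counter()
    region_kmers = []
    for region in regions:
        kmers = [region[i:i + k] for i in range(len(region) - k + 1)]
        region_kmers.append(set(kmers))
        counts.update(kmers)
    markers = {m for m, n in counts.items() if n == 1}  # unique in the assembly
    read_kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    scores = [len(read_kmers & markers & kmers) for kmers in region_kmers]
    return scores.index(max(scores))

# Two regions sharing AAAA/CCCC context but differing in the middle:
regions = ["AAAATTTTCCCC", "AAAAGGGGCCCC"]
print(place_read("TTTTCC", regions), place_read("AAGGGG", regions))  # 0 1
```

The shared AAAA/CCCC k-mers contribute nothing to the score; only the k-mers unique to one region pin the read down, which is the whole point of restricting to unique markers.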
Step 6: Check the data against the reference genome:
> The corrected contigs were then ordered and oriented relative to one another using the optical map and assigned to chromosomes using the human reference genome.
> The final assembly consists of 2.94 Gbp in 590 contigs with a contig NG50 of 72 Mbp. We estimate the median consensus accuracy of this assembly to be >99.99%.
Essentially, this work closes up difficult-to-read gaps in the reference genome ( https://en.wikipedia.org/wiki/Reference_genome#Human_referen... )
Regarding step 1, how can any human have an entirely homozygous X chromosome?
Also (or rather), why not just use a male, who has only one X chromosome?
https://www.genome.gov/news/news-release/NHGRI-researchers-g...