People do talk about the genome and its elements using a location given by chromosome number and coordinate range, much as you'd describe an index range in a string. There is even special notation for doing so [1]. However, it depends on _how_ you're looking at the biology.
I think an analogy would be: you can describe all code as machine code, but when there are higher level abstractions you wouldn't choose to do so.
Now, there are much more sophisticated answers, and downstream points to be made about graph genomes instead of a single reference, etc. (which would also get at your point about why geneticists don't talk about it this way). But that's a broader scope.
Each human started with between 1 and 5 copies of the X chromosome. Those copies differ in various ways. Many of the differences are single-nucleotide variants: identical in a region except for a single letter changed. There are also tandem repeats, where, say, a CAG sequence might occur once or dozens of times. (Counting the number of such repeats is often used for DNA fingerprinting.) There is also ample larger-scale structural variation, including whole regions of the genome present or absent in one copy or another, or copied multiple times in a row, or moved in from another chromosome, or reversed.
Complicated enough? On top of that, there are trillions of cells in each human, and across those trillions of cells you will find many slightly different copies of the original 1 to 5 X chromosomes from when that human was a single cell. You will definitely have changes at the ends of the chromosomes, the telomeres, which are made up of variable tandem repeats. You'll also have single-nucleotide mutations and, if you're unlucky, bigger changes. On some chromosomes (not chromosome X), there's also V(D)J recombination, where our immune "memory" is actually encoded as changes to the genome sequence in particular cells. Cancer or a pre-cancerous syndrome increases the frequency and severity of these changes.
If you want to sequence a whole chromosome, you have to contend with the fact that the most accurate sequencing methods generally give you reads of 1000 nucleotides or less, which you then have to assemble. People liken the problem to putting together a jigsaw puzzle, but it's not like assembling a jigsaw puzzle from a single box. It's more like taking hundreds of boxes of supposedly the same puzzle (but in reality with small differences that keep pieces from fitting together quite right), dumping them all in a pile, randomly removing a bunch of pieces, and then trying to figure out how everything fits together. Also, many parts of this puzzle have identical artwork and fit together identically! Good luck!
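To make the jigsaw analogy a bit more concrete, here's a minimal sketch of the merge-by-overlap idea that assembly builds on, using made-up reads (real assemblers are vastly more sophisticated than this greedy toy):

```python
# Toy sketch of greedy overlap assembly on hypothetical reads (made-up
# sequences, not real sequencing data).

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a matching a prefix of b (>= min_len), else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: contigs stay separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads) if n not in (best_i, best_j)] + [merged]
    return reads

# Reads from a short, unique sequence reassemble cleanly:
print(greedy_assemble(["TTACCG", "CCGTAG", "TAGGAT"]))  # ['TTACCGTAGGAT']
```

This works fine when every overlap is unambiguous; the "identical artwork" pieces are exactly where it falls apart.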
Scientists have been applying a lot of ingenuity to this puzzle for decades and getting a whole chromosome assembly like this is a big milestone.
Note: Don't take any of the specific numbers above as gospel. These technologies develop extremely quickly, so it's quite likely that my knowledge of typical error rates is out of date.
In any case, here's the relevant quote from the original link (to phys.org), before it was changed to the less technical press release, which doesn't mention any specific technologies used:
"The new project built on that effort, combining nanopore sequencing with other sequencing technologies from PacBio and Illumina, and optical maps from BioNano Genomics. Using these technologies, the team produced a whole-genome assembly that exceeds all prior human genome assemblies in terms of continuity, completeness, and accuracy, even surpassing the current human reference genome by some metrics."
And those error rate examples are way, way too high: Illumina is closer to Q30, which is a 1/1000 error rate [0]. 15% would result in an unusable sequence.
https://emea.illumina.com/science/technology/next-generation...
Single-read accuracy is not as important for such projects. As coverage gets to 50-60X, expected assembly accuracy on human is around Q30.
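A rough way to see why coverage compensates for noisy single reads, under an idealized model where read errors are independent and the consensus is a simple per-base majority vote (hypothetical numbers; real consensus polishing is more involved than this):

```python
# Back-of-the-envelope: per-base consensus error under a naive
# majority-vote model with independent read errors (an idealization).
from math import comb, log10

def phred(p_err):
    """Phred quality: Q30 corresponds to a 1-in-1000 error rate."""
    return -10 * log10(p_err)

def consensus_error(per_read_err, coverage):
    """P(a strict majority of the reads covering a base are wrong)."""
    return sum(
        comb(coverage, k) * per_read_err**k * (1 - per_read_err)**(coverage - k)
        for k in range(coverage // 2 + 1, coverage + 1)
    )

# Even a 15% per-read error rate gives a tiny consensus error at 51x:
p = consensus_error(0.15, 51)
print(round(phred(1e-3)), phred(p) > 30)  # 30 True
```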
Or, to make this simpler: finding the complete DNA sequence of chromosomes is difficult, because some parts of the sequence are highly repetitive. Using a new type of lab machine, the scientists were able to sequence the repetitive parts of the X chromosome. This gives a more complete picture of the X chromosome, and that can help scientists fight diseases and understand human biology better.
Ah, https://en.wikipedia.org/wiki/Clock_recovery !
Too bad DNA isn't a run-length limited code. (Wouldn't that be something.)
So despite having taken multiple photos of every square inch of land in your target area, there's no way you can assemble them into one big image just by matching up the overlaps. Without a source of larger-scale information about the region, like a satellite photograph or GPS coordinates for the photos, you have no way of knowing how wide that desert is. All you know is that it's wider than one or two photographs.
This is essentially same problem that current genome assemblies have: there are regions of repetitive sequence in the genome, so all the sequencing reads from those regions look identical to each other, just like the photographs of flat sandy desert, and there's no way to tell how they're supposed to overlap to form the full sequence. The only way to resolve these regions is with a technology that can read all the way through from one end to the other without stopping, producing a single contiguous sequence.
The link here describes the fruits of an effort using exactly those sorts of long-read technologies to fill in all the gaps in the X chromosome sequence, thus generating a single contiguous sequence from end to end, something that hasn't previously been possible for DNA sequences of this size.
As to why this is important, these repetitive sequences, despite being apparently featureless, still sometimes have important effects (not unlike the apparently dead and featureless desert in the analogy). In addition, sometimes there are "oases" of functionally important non-repetitive DNA sequence within the "desert" of repetition, and previous genome assembly methods would not be able to tell where these oases belonged. All of this is important because many functional DNA elements are cis-acting. That is, they exert effects on genes that are nearby on the genome. So if you don't know where they belong, then you don't know what they're doing.
If you can assemble one big chromosome sequence from end to end, all of the above problems go away, and you can finally get on with the analysis you wanted to do anyway and stop worrying about not being able to calculate meaningful distances between DNA elements.
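The desert analogy maps directly onto tandem repeats: once the repeat tract is longer than a read, genomes that differ only in the number of repeat copies produce exactly the same set of reads. A toy sketch with made-up sequences:

```python
# Two hypothetical "genomes" differing only in repeat copy number.
# With reads shorter than the repeat tract, their read sets are
# indistinguishable, so the copy count cannot be recovered.
def read_set(genome, read_len=6):
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}

short_genome = "TTACC" + "CAG" * 5 + "GGTAA"
long_genome = "TTACC" + "CAG" * 50 + "GGTAA"

# Identical read sets: no overlap-based method can count the copies.
print(read_set(short_genome) == read_set(long_genome))  # True
```

A read long enough to span from one flanking region to the other resolves the ambiguity in one shot, which is exactly what the ultra-long reads provide.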
Long sequence reads allowed them to map this highly repetitive chromosome. Most sequencing is done with high-throughput short reads.
The technology can theoretically be used to map other regions of the genome which are highly repetitive.
What's different about this?
Edit: Upon testing, that appears to not be the case. Probably a typo.
> To circumvent the complexity of assembling both haplotypes of a diploid genome, we selected the effectively haploid CHM13hTERT cell line for sequencing (abbr. CHM13)
Incidentally, they do capture the other chromosomes in this process:
> Several chromosomes were captured in two contigs, broken only at the centromere (Fig 1a).
> https://www.biorxiv.org/content/10.1101/735928v3.full.pdf
Step 2: Follow a procedure for DNA prep that results in long stretches of DNA (though not entire chromosome lengths) and amplify (make multiple copies of) the mixture, per this reference:
> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5889714/
Step 3: Run the mixture through a nanopore sequencer (essentially a hole a few nanometres across), reading the change in current in response to the different bases, including methylated bases.
Step 4: Repeat this many, many times to get multiple reads of each region of the genome:
> In total, we sequenced 98 MinION flow cells for a total of 155 Gb (50× coverage, 1.6 Gb/flow cell, SNote 2). Half of all sequenced bases were contained in reads of 70 kb or longer (78 Gb, 25× genome coverage) and the longest validated read was 1.04 Mb.
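The quoted throughput figures are easy to sanity-check (the ~3.1 Gb genome size here is my own assumption, not from the quote):

```python
# Sanity-checking the quoted sequencing numbers.
total_bases = 155e9   # 155 Gb across 98 MinION flow cells (from the quote)
flow_cells = 98
genome_size = 3.1e9   # approximate human genome size (assumption)

coverage = total_bases / genome_size          # ~50x, as quoted
per_cell_gb = total_bases / flow_cells / 1e9  # ~1.6 Gb/flow cell, as quoted
print(f"{coverage:.0f}x, {per_cell_gb:.1f} Gb/flow cell")  # 50x, 1.6 Gb/flow cell
```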
Step 5: Overlap and assemble the data from the long reads:
> Once we had collected sufficient sequencing coverage for de novo assembly, we combined 39× of the ultra-long reads with 70× coverage of previously generated PacBio data [18] and assembled the CHM13 genome using Canu [19]. This initial assembly totaled 2.90 Gbp with half of the genome contained in contiguous sequences (contigs) of length 75 Mbp or greater (NG50), which exceeds the continuity of the reference genome GRCh38 (75 vs. 56 Mbp NG50).
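The NG50 in the quote is the contig length such that contigs of that length or longer cover at least half of the estimated *genome* size (N50 uses half of the total assembly size instead). A small sketch:

```python
# NG50: contig length at which contigs of that size or larger cover
# at least half the (estimated) genome size.
def ng50(contig_lengths, genome_size):
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= genome_size / 2:
            return length
    return 0  # assembly covers less than half the genome

# Toy example with made-up contig lengths and a genome of size 100:
print(ng50([40, 30, 20, 10], 100))  # 30, since 40 + 30 >= 50
```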
> The read was placed in the location of the assembly having the most unique markers in common with the read. Alignments were further filtered to exclude short and low identity alignments. This process was repeated after each polishing round, with new unique markers and alignments recomputed after each round.
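The "unique markers" idea in that quote can be sketched as follows: treat k-mers that occur exactly once across the assembly as markers, and place a read in the region sharing the most markers with it. This is a toy reconstruction of the concept, not the authors' actual pipeline (their marker definition and alignment filtering are more involved):

```python
# Toy k-mer "unique marker" placement (conceptual sketch only).
from collections import Counter

def place_read(read, regions, k=4):
    """Index of the region sharing the most assembly-unique k-mers with the read."""
    counts = Counter()
    region_kmers = []
    for region in regions:
        kmers = [region[i:i + k] for i in range(len(region) - k + 1)]
        region_kmers.append(set(kmers))
        counts.update(kmers)
    markers = {m for m, n in counts.items() if n == 1}  # unique in the assembly
    read_kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    scores = [len(read_kmers & markers & kmers) for kmers in region_kmers]
    return scores.index(max(scores))

# Two regions sharing AAAA/CCCC context but differing in the middle:
regions = ["AAAATTTTCCCC", "AAAAGGGGCCCC"]
print(place_read("TTTTCC", regions), place_read("AAGGGG", regions))  # 0 1
```

The shared AAAA/CCCC k-mers contribute nothing to the score; only the k-mers unique to one region pin the read down, which is the whole point of restricting to unique markers.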
Step 6: Check the data against the reference genome:
> The corrected contigs were then ordered and oriented relative to one another using the optical map and assigned to chromosomes using the human reference genome.
> The final assembly consists of 2.94 Gbp in 590 contigs with a contig NG50 of 72 Mbp. We estimate the median consensus accuracy of this assembly to be >99.99%.
Essentially, this work closes up difficult-to-read gaps in the reference genome ( https://en.wikipedia.org/wiki/Reference_genome#Human_referen... )
Regarding step 1, how can any human have an entirely homozygous X chromosome?
Also (or rather), why not just use a male, who has only one X chromosome?
https://www.genome.gov/news/news-release/NHGRI-researchers-g...