> title
bases with optional newlines
> title
bases with optional newlines
...
The author is talking about removing the non-semantic optional newlines (hard wrapping), not all the newlines in the file.It makes a lot of sense that this would work: bacteria have many subsequences in common, but if you insert non-semantic newlines at effectively random offsets then compression tools will not be able to use the repetition effectively.
;LCBO - Prolactin precursor - Bovine
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*
where "SS...EM", HL..VT", or "ED..AR" may be common subsequences, but the plaintext file arbitrarily wraps at column 65 so it renders on a DEC VT100 terminal from the 70s nicely.Or, for an even simpler example:
; plaintext
GATTAC
AGATTA
CAGATT
ACCAGA
TTACAG
ATTACA
becomes, on disk, something like ; plaintext\r\nGATTAC\r\nAGATTA\r\nCAGATT\r\nACCAGA\r\nTTACAG\r\nATTACA\r\n
which is hard to compress, while ; plaintext\r\nGATTACAGATTACAGATTACCAGATTACAGATTACA
is just "; plaintext\r\n" + "GATTACA" * 7
and then, if you want, you can reflow the text when it's time to render to the screen.Guessing which PNG filters to use can make a huge difference to compression with only a tiny change to write speed. Or (like Adobe 20+ years ago) you can screw it up and get worse compression and slower speeds. These days brutal "try everything" modes exist which can squeeze out those last few bytes by trying even the unlikeliest combinations.
I can imagine a filter layer which says this textual data comes in 78 character blocks punctuated with \n so we're going to strip those out, then compress and in the opposite direction we decompress then put back the newlines.
For FASTA we can just unconditionally choose to remove the extra newlines but that may not be true for most inputs, so the filters would help there.
Surprisingly, it gave an answer along the lines of the parent comment.
However, it seems it didn't figure this out by itself but it says:
> It’s widely known in bioinformatics that “one-line Fasta” files compress much better with LZ-based algorithms, and this is discussed in forums, papers, and practical guides on storing genomic data.
Ultimately, Zstd is a byte-oriented compressor that doesn't understand the semantics of the data it compresses. Improvements are certainly possible if you can recognize and separate that framing to recover a contiguous view of the underlying data.
[0] https://github.com/facebook/zstd/blob/v1.5.7/lib/compress/zs...
(I am one of the maintainers of Zstd.)
I absolutely adore ZSTD, it has worked so well for me compressing json metadata for a knowledge engine.
The first stage of Zstd does LZ77 matching, which transforms the input into "sequences", a series of instructions each of which describes some literals and one match. The literals component of the instruction says "the next L bytes of the message are these L bytes". The match component says "the next M bytes of the input are the M bytes N bytes ago".
If you want to construct a match between two strings that differ by one character, rather than saying "the next N bytes are the N bytes M bytes ago except for this one byte here which is X instead", Zstd just breaks it up into two sequences, the first part of the match, and then a single literal byte describing the changed byte, and then the rest of the match, which is described as being at offset 0. The encoding rules for Zstd define offset 0 to mean "the previously used match offset". This isn't as powerful as a Levenshtein edit, but it's a reasonable approximation.
The big advantage of this approach is that it doesn't require much additional machinery on the encoder or decoder, and thus remains very fast. Whereas implementing a whole edit description state machine would (I think) slow down decompression and especially compression enormously.
[0] https://datatracker.ietf.org/doc/html/rfc8878#name-repeat-of...
Using larger-than-default window sizes has the drawback of requiring that the same --long=xx argument be passed during decompression reducing compatibility somewhat.
Interesting. Any idea why this can't be stored in the metadata of the compressed file?[1] https://datatracker.ietf.org/doc/html/rfc8878#name-window-de...
You always need resource limits when dealing with untrusted data. RAM is one of the obvious ones. They could introduce a memory limit parameter; require passing --long with a value equal to or greater than what the stream requires to successfully decompress; require seeking support for the input stream so they can look back that way (TMTO); fall back to using temp files; or interactively prompt the user if there's a terminal attached. Lots of options, each with pros and cons of course, that would all allow a scenario where the required information for the decoder is stored in the compressed data file
Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I’ve worked with large genomic datasets on my own dime, and the default formats show their limits quickly. With FASTA, the first step for me is usually conversion: unzip headers from sequences, store them in Arrow-like tapes for CPU/GPU processing, and persist as Parquet when needed. It’s straightforward, but surprisingly underused in bioinformatics — most pipelines stick to plain text even when modern data tooling would make things much easier :(
> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I even confused myself about this while writing :-)
For me it was an increasing number (think of unix timestamps in a data logger that stores one entry per second, so you are just counting up until there's a gap in your data), in the article it's a fixed value every 60 bytes
Of course, our brains are exceedingly good at finding patterns (to the point where we often find phantom ones). I was just expecting some basic checks like "does it make sense to store the difference instead of the absolute value for some of these bytes here". Seeing as the difference is 0 between every 60th byte in the submitted article, that should fix both our issues
Bzip2 performed much better for me but it's also incredibly slow. If it were only the compressor, that might be fine for many applications, but also decompressing is an exercise in patience so I've moved to Zstandard at the standard thing to use
But Bzip2 is also a pretty bad BWT-based compressor. Not only does it use block sizes from a time when 8mb memory was a lot, it does silly things which doesn't help compression at all.
A binary format with a tool that renders it to text works the same as a text format; if the rendering is lossless, you could even consume the text format rather than the binary.
A "text" format is built to be understandable, but that's not a requirement; you could write a text format that isn't descriptive, and you'd have just as much trouble understanding what 'A' means as you would understanding what 'C0' means for a binary format.
Undocumented formats are a pain, whether they're in text or binary.
> "intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques"
1. https://madhadron.com/science/farewell_to_bioinformatics.htm...
When FASTA was invented, Sanger sequencing reads would be around a thousand bases in length. Even back then, disk space wasn't so precious that you couldn't spend several kilobytes on the results of your experiment. Plus, being able to view your results with `more` is a useful feature when you're working with data of that size.
And, despite its simplicity, it has worked for forty years.
The simplicity of FASTA seems like a dream compared to the GenBank flat file format used before then. And around the year 2000, less computationally-inclined scientists were storing sequence in Microsoft Word binary .doc files.
A lot of file formats (including bioinformatics formats!) have come and gone in that time period. I don't think many would design it this way today, but it has a lot of nice features like the ones you point out that led to its longevity.
This is not really a fair statement. Literally all of software bears the weight of some early poor choice that then keeps moving forward via weight of momentum. FASTA and FASTQ formats are exceptionally dumb though.
A parser to stream FASTA can be written in like 30 lines [0], much easier than say CSV where the edge cases can get hairy.
If you need something like fast random reads, use the FAIDX format [1], or even better just store it in an LMDB or SQLite embedded db.
People forget FASTA was from 1985, and it sticks around because (1) it's easy to parse and write (2) we have mountains of sequences in that format going back 4 decades.
[O] https://gist.github.com/jszym/9860a2671dabb45424f2673a49e4b5...
[1] https://seqan.readthedocs.io/en/main/Tutorial/InputOutput/In...
On those grounds, the lack of pre-tokenization in html/css/js ranks at this point as a planet killing level of poor choices.
Have any of you tested it?
When I was shown BioPerl I was tempted to write a better, C++ version, but was overwhelmed by other university stuff and let it go.
If the linefeeds were treated as semantic characters and not allowed to break the hash size, you would get similar results without pre-filtering and post-filtering. It occurs to me that this strategy is so obvious that there must be some reason it won't work.
What lessons can we take from this?
Endogenous retroviruses [1] are interesting bits of genetics that helps link together related species. A virus will inject a bit of it's genetics into the host which can effectively permanently scar the host's DNA and all their offspring's DNA.
I thought I'd raise yesterday's HN discussion on 'The unreasonable effectiveness of modern sort algorithms' https://news.ycombinator.com/item?id=45208828
That blog post isn't about DNA per se, but it is about sorting data when you know there are only 4 numbers. I guess DNA has 5 - A,T,G,C,N the unknown base - but there's a huge space of DNA-specific compression research that outperforms ZSTD.
(I currently store a lot of data as FASTQ, and smaller file sizes could save us a bunch of money. But FASTQ + zstd is very good.)
CRAM compresses unmapped fastq pretty well, and can do even better with reference-based compression. If your institution is okay with it, you can see additional savings by quantizing quality scores (modern Illumina sequencers already do this for you). If you're aligning your data anyways, probably retaining just the compressed CRAM file with unmapped reads included is your best bet.
There are also other fasta/fastq specific tools like fqzcomp or MZPAQ. Last I checked, both of these could about halve the size of our fastq.gz files.
Depending on what you need to represent, you can get a 4x reduction in data size without compression at all, by just representing a GATC with 2 bits, rather than 8.
Compression on top of that "should" result in the same compressed size as the original text (after all, the "information" being compressed is the same), except that compression isn't perfect.
Newlines are an example of something that's "information" in the text format that isn't relevant, yet the compression scheme didn't know that.