In fact, now that I think about it, parquet supports compression. Shouldn't this be just an option when saving to parquet format?
However, this only holds because the memcpy figures they report on several Intel x86_64 benchmarks are between 5 and 10 GB/s. Even a basic dual-channel DDR3 setup has about 20 GB/s of memory bandwidth, and a modern quad-channel DDR4 system can reach 76.8 GB/s. There is no reason for a properly implemented memcpy to be substantially slower than memory bandwidth: AVX can issue two 256-bit loads and one 256-bit store per cycle, which works out to 128 GB/s of memcpy at 4 GHz.
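For anyone who wants to check the arithmetic, here is a rough sketch of where those peak numbers come from. The speed grades (DDR3-1333, DDR4-2400) and the helper names are my assumptions, since the comment does not say which parts it has in mind, and these are theoretical peaks rather than measurements.

```python
# Back-of-the-envelope peak bandwidth figures (theoretical, not measured).
# Helper names and speed grades are illustrative assumptions.


def dram_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Peak DRAM bandwidth in GB/s: channels * MT/s * bytes per transfer."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


def avx_store_bandwidth_gbs(clock_ghz, store_bits_per_cycle=256):
    """Peak copy rate in GB/s if one 256-bit store completes every cycle."""
    return clock_ghz * store_bits_per_cycle / 8


ddr3_dual = dram_bandwidth_gbs(2, 1333)   # assumed DDR3-1333, dual channel
ddr4_quad = dram_bandwidth_gbs(4, 2400)   # assumed DDR4-2400, quad channel
avx_copy = avx_store_bandwidth_gbs(4.0)   # 4 GHz core, one store per cycle

print(f"DDR3 dual channel: {ddr3_dual:.1f} GB/s")
print(f"DDR4 quad channel: {ddr4_quad:.1f} GB/s")
print(f"AVX store limit:   {avx_copy:.1f} GB/s")
```

The AVX figure is a per-core limit that only applies while the data stays in cache; once the copy is larger than the caches, the DRAM numbers are the ceiling.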
Am I missing something or is this another case of "implausible claims = they screwed the benchmark = they are incompetent/malicious"?
As long as they are using the same memcpy routine in both the decompression case and the 'only memcpy' case, that seems reasonable. Obviously, the quicker memcpy becomes, the faster the decompression has to become to maintain the same performance ratios, but things like faster clock speeds or multi-threading can make that issue moot.
It shines when values are of fixed size with lots of similar bits, e.g. positive integers of the same magnitude. It's not so good for doubles, where the bits change a lot. Also, if storing diffs, it helps to take the diff from the initial value in a chunk rather than from the previous value, so that the deltas change sign less often (a sign change flips most of the bits).
From my own use case: for the same data, C# decimal (a 16-byte struct) compresses much better than double in terms of final absolute blob size, even though decimal takes 2x more memory uncompressed.
If the data items have few similar bits/bytes, then it's the underlying compressor that matters.
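A quick way to see both effects with nothing but the standard library. This is only a sketch: I am assuming the filter in question is a byte shuffle of the kind Blosc applies before compressing, zlib stands in for whatever codec you actually use, and the data and helper names are made up for illustration.

```python
import struct
import zlib


def shuffle(raw: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every item, then byte 1, and so on (a byte transpose)."""
    return b"".join(raw[k::itemsize] for k in range(itemsize))


def unshuffle(shuffled: bytes, itemsize: int) -> bytes:
    """Invert shuffle(): interleave the per-byte planes back into items."""
    n = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for k in range(itemsize):
        out[k::itemsize] = shuffled[k * n:(k + 1) * n]
    return bytes(out)


def sign_changes(deltas):
    """Count how often consecutive deltas switch between negative and non-negative."""
    return sum((a < 0) != (b < 0) for a, b in zip(deltas, deltas[1:]))


# Made-up data: 10,000 positive 32-bit integers of the same magnitude.
values = [1_000_000 + i for i in range(10_000)]
raw = struct.pack(f"<{len(values)}i", *values)

plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffle(raw, 4)))
print(f"zlib only: {plain_size} bytes, shuffle + zlib: {shuffled_size} bytes")

# Deltas against the previous value vs. against the first value of the chunk.
chunk = [100, 103, 101, 104, 102, 105]
prev_deltas = [b - a for a, b in zip(chunk, chunk[1:])]  # 3, -2, 3, -2, 3
base_deltas = [v - chunk[0] for v in chunk[1:]]          # 3, 1, 4, 2, 5
print(sign_changes(prev_deltas), sign_changes(base_deltas))

# Going from +3 to -2 changes almost every bit of a two's-complement int32.
flipped = bin((3 ^ -2) & 0xFFFFFFFF).count("1")
print(f"bits that differ between +3 and -2: {flipped} of 32")
```

The chunk here was picked so that every value sits at or above the first one. If the series dips below its starting value, the base deltas change sign too, so treat this as an illustration of the idea rather than a guarantee.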
It really shines first and foremost as a meta compressor, giving the developer a clean block-based API. Once integrated (which really is quite easy) you can easily experiment with different compressors and preconditioners to see what works best with your dataset. These things can be changed at runtime and give you great flexibility.
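To make the "meta compressor" idea concrete, here is a toy version built only on the Python standard library. To be clear, this is not Blosc's API: `compress_blocks`, `decompress_blocks`, and the two registries are hypothetical names, and zlib/bz2/lzma are stand-ins for the codecs Blosc actually ships.

```python
import bz2
import lzma
import struct
import zlib


def identity(block: bytes, itemsize: int) -> bytes:
    return block


def shuffle(block: bytes, itemsize: int) -> bytes:
    """Byte transpose: all first bytes, then all second bytes, and so on."""
    return b"".join(block[k::itemsize] for k in range(itemsize))


def unshuffle(block: bytes, itemsize: int) -> bytes:
    n = len(block) // itemsize
    out = bytearray(len(block))
    for k in range(itemsize):
        out[k::itemsize] = block[k * n:(k + 1) * n]
    return bytes(out)


# Hypothetical registries: name -> (forward, inverse).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}
FILTERS = {"none": (identity, identity), "shuffle": (shuffle, unshuffle)}


def compress_blocks(data, codec, prefilter, itemsize=4, block_size=1 << 14):
    """Split data into fixed-size blocks, precondition each, then compress it."""
    compress, forward = CODECS[codec][0], FILTERS[prefilter][0]
    return [compress(forward(data[i:i + block_size], itemsize))
            for i in range(0, len(data), block_size)]


def decompress_blocks(blocks, codec, prefilter, itemsize=4):
    """Undo compress_blocks(): decompress each block, then undo the filter."""
    decompress, inverse = CODECS[codec][1], FILTERS[prefilter][1]
    return b"".join(inverse(decompress(b), itemsize) for b in blocks)


# Made-up dataset: 10,000 little-endian int32 values.
data = struct.pack("<10000i", *range(1_000_000, 1_010_000))

# Codec and preconditioner are plain runtime parameters, so trying every
# combination on your own data is a short loop.
results = {}
for codec in CODECS:
    for prefilter in FILTERS:
        blocks = compress_blocks(data, codec, prefilter)
        assert decompress_blocks(blocks, codec, prefilter) == data
        results[(codec, prefilter)] = sum(len(b) for b in blocks)

for (codec, prefilter), size in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{codec:5s} + {prefilter:7s}: {size} bytes")
```

Because each block is compressed independently, a real implementation can also hand blocks to different threads or decompress only the blocks it needs, which this sketch does not attempt.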
Francesc has been advancing blosc consistently with a steady vision for years and years. It is one of the most underrated tools around IMO.
This is a really interesting library.