Like the fpng implementation, it's SSE (128-bit registers), but the inner loop eats 32 bytes at a time, not 16.
"Wuffs’ Adler-32 implementation is around 6.4x faster (11.3GB/s vs 1.76GB/s) than the one from zlib-the-library", which IIUC is roughly comparable to the article's defer32. https://nigeltao.github.io/blog/2021/fastest-safest-png-deco...
[0] https://github.com/libjxl/libjxl/tree/main/experimental/fast...
[1] https://github.com/veluca93/fpnge
[2] https://twitter.com/richgel999/status/1485976101692358656
fpnge (which I wasn’t aware of until now) appears to already be using a very similar (identical?) algorithm, so I suspect the relative performance of fpng and fpnge would not be significantly impacted by this change.
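As a scalar point of reference for the checksum these SIMD versions accelerate, here is a minimal sketch of Adler-32 with the modulo deferred to the end of each block. It is not the article's, fpng's, or fpnge's code; the function name `adler32_deferred` is made up for illustration, and the block length 5552 is the constant zlib uses (the longest run of bytes for which the running sums still fit in 32 bits). Python integers cannot overflow, so the blocking here only illustrates the structure that matters in C or SIMD code.

```python
import zlib

MOD = 65521   # largest prime below 2**16, the Adler-32 modulus
NMAX = 5552   # zlib's block length; sums stay below 2**32 for this many bytes

def adler32_deferred(data: bytes) -> int:
    """Adler-32 with the modulo applied once per block instead of once per byte."""
    a, b = 1, 0
    for start in range(0, len(data), NMAX):
        for byte in data[start:start + NMAX]:
            a += byte   # running sum of the bytes
            b += a      # running sum of the running sums
        # Deferred reduction: one pair of modulo operations per block.
        a %= MOD
        b %= MOD
    return (b << 16) | a

sample = b"Wikipedia"
checksum = adler32_deferred(sample)
# Cross-check against the reference implementation in the standard library.
assert checksum == zlib.adler32(sample)
```

The SIMD variants discussed in this thread keep the same two sums but accumulate 16 or 32 bytes per iteration before reducing.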
The frequency domain is a really powerful place to operate in when you are dealing with this amount of data.
- Does this change noticeably affect the total runtime? The checksum seems simple enough that the slight difference here wouldn't show up in PNG benchmarks.
- The proposed solution uses AVX2, which is not currently used in the original codebase. Would any other part of the processing benefit from using newer instructions?
If you put corrupted data into a PNG decoder, I don't think it's awfully important to most users whether they get a decode error or a garbled image out.
That sounds kinda slow; is there only 1 DIMM in the slots? I remember benchmarking 40GiB/s read speed on an older system that had 2 dual-rank DIMMs (4 ranks in total).
I'd expect 3200 Mbit/s * (64 data lines) * (2 memory channels) = ~48 GiB/s on a typical DDR4 desktop, and a lot more with overclocked RAM.
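Spelling that estimate out, as a quick sketch: the function name is made up for illustration, the inputs are the figures from the comment (3200 Mbit/s per data line, 64 data lines, 2 channels), and the result is a theoretical peak rather than anything measured.

```python
def peak_bandwidth_bytes_per_s(bits_per_s_per_line: int, data_lines: int, channels: int) -> int:
    """Theoretical peak memory bandwidth in bytes per second."""
    return bits_per_s_per_line * data_lines * channels // 8

# Figures taken from the estimate above.
peak = peak_bandwidth_bytes_per_s(3200 * 10**6, 64, 2)

print(f"{peak / 10**9:.1f} GB/s")   # prints 51.2 GB/s
print(f"{peak / 2**30:.1f} GiB/s")  # prints 47.7 GiB/s
```

So the "~48 GiB/s" figure is the same number as the 51.2 GB/s usually quoted for dual-channel DDR4-3200, just in binary units.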
Great writeup either way.
edit: For dual channel RAM, I would suspect the throughput depends on how the kernel decides to map physical memory to virtual addresses.
By default, whenever the memory modules in all channels have the same size, e.g. two 8 GB modules, the firmware maps the modules with interleaved addresses, so that throughput doubles with 2 channels, or triples/quadruples/etc. on workstation/server motherboards with more memory channels.
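A toy model of what interleaving buys, under loud assumptions: the 64-byte granule and the simple modulo mapping below are hypothetical, chosen only for illustration. Real memory controllers pick their own granularity and often hash several address bits, so treat this as a sketch of the idea rather than a description of any particular platform.

```python
from collections import Counter

def channel_for(address: int, channels: int = 2, granule: int = 64) -> int:
    """Toy interleave: consecutive granules rotate across channels (assumed mapping)."""
    return (address // granule) % channels

# A sequential 4 KiB read, walked in 64-byte steps.
hits = Counter(channel_for(addr) for addr in range(0, 4096, 64))
print(dict(hits))  # prints {0: 32, 1: 32}
```

With a mapping like this, a sequential scan alternates between channels, so both can transfer at once no matter how the kernel lays out virtual pages.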
Something that’s unfair about the world is that work like this could reach billions of people and save a million dollars' worth of time and electricity annually, yet it is being done gratis.
It would be amazing if there were charities that rewarded high-impact open source contributions like this proportionally to the benefits to humanity…
I doubt my current garage band could afford the OP just this moment, but I sure wish we could!
Well, I intend to finish high school at a minimum before pursuing employment :)
Might be interesting to benchmark their implementation too to see how it compares.