Computing Adler32 Checksums at 41 GB/s

wooo.sh

98 points · wooosh · 3y ago · 27 comments

dougall3y ago

Nice! (I've been meaning to write up this Apple M1 ~60GB/s version, which I think is similar: https://gist.github.com/dougallj/66151f1c509484a42fe0abd0d84... )

nigeltao3y ago

Here's another SIMD implementation, with commentary: https://github.com/google/wuffs/blob/main/std/adler32/common...

Like the fpng implementation, it's SSE (128-bit registers), but the inner loop eats 32 bytes at a time, not 16.
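Processing many bytes per loop iteration is possible because the modulo in Adler-32 does not have to be applied after every byte. A minimal scalar sketch of that idea, assuming the standard Adler-32 definition and the block size zlib uses (the function name `adler32_deferred` is hypothetical, and this is not the Wuffs, fpng, or article code):

```python
import zlib

MOD = 65521   # largest prime below 2**16, per the Adler-32 definition
NMAX = 5552   # zlib's block size: the sums stay within 32 bits for this many bytes

def adler32_deferred(data: bytes, adler: int = 1) -> int:
    """Adler-32 with the modulo applied once per block instead of once per byte."""
    a = adler & 0xFFFF           # running sum of bytes
    b = (adler >> 16) & 0xFFFF   # running sum of the running sums
    for start in range(0, len(data), NMAX):
        for byte in data[start:start + NMAX]:
            a += byte
            b += a
        # Python integers never overflow; the block size matters in C or SIMD
        # code, where a and b live in fixed-width lanes.
        a %= MOD
        b %= MOD
    return (b << 16) | a

sample = bytes(range(256)) * 100
print(adler32_deferred(sample) == zlib.adler32(sample))
```

SIMD versions keep the same structure but accumulate 16 or 32 bytes per instruction inside the inner loop.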

"Wuffs’ Adler-32 implementation is around 6.4x faster (11.3GB/s vs 1.76GB/s) than the one from zlib-the-library", which IIUC is roughly comparable to the article's defer32. https://nigeltao.github.io/blog/2021/fastest-safest-png-deco...

pizza3y ago

Ooh now that is very interesting. I would really love to see how this speeds up the run-time of fpng as a whole, if you have any numbers. It looks like fjxl [0] and fpnge [1] (which also uses AVX2) are at the Pareto front for lossless image compression right now [2], but if this speeds things significantly then it's possible there'll be a huge shakeup!

[0] https://github.com/libjxl/libjxl/tree/main/experimental/fast...

[1] https://github.com/veluca93/fpnge

[2] https://twitter.com/richgel999/status/1485976101692358656

woooshOP3y ago

Unfortunately I haven’t had the time to do a proper benchmark, and the fpng test executable only decodes/encodes a single image which produces very noisy/inconclusive results. However, I’m under the impression that it doesn’t make a large difference in terms of overall time.

fpnge (which I wasn’t aware of until now) appears to already be using a very similar (identical?) algorithm, so I suspect the relative performance of fpng and fpnge would not be significantly impacted by this change.

Nyan3y ago

As someone who has been recently optimising fpnge, Adler32 computation is pretty much negligible regarding overall runtime. The Huffman coding and filter search take up most of the time. (IIRC fpng doesn't do any filter search, but Huffman encoding isn't vectorized, so I'd expect that to dominate fpng's runtime)

bob10293y ago

If image encode/decode speed is the only concern, libjpeg-turbo is going to be orders of magnitude faster than any of these lossless schemes. With JPEG, you could encode 1080p bitmaps in <10 milliseconds (per thread) on any consumer PC made in the last decade.

The frequency domain is a really powerful place to operate in when you are dealing with this amount of data.

pizza3y ago

That's not true. libjpeg-turbo is ~50 MB/s last I tried - plus it's not lossless. fjxl and fpnge are basically an order of magnitude faster than that. libjpeg-turbo isn't even the fastest JPEG codec - you should check out the (relatively obscure) libmango - roughly 1 gbps decode on a 2020 MacBook Pro - or nvJPEG for GPU-based JPEG decoding. Supposedly there are even faster GPU-based decoders than nvJPEG, too.

ebiggers3y ago

Note that libdeflate has used essentially the same method since 2016 (https://github.com/ebiggers/libdeflate/blob/v0.4/lib/adler32...), though I recently switched it to use a slightly different method (https://github.com/ebiggers/libdeflate/blob/v1.12/lib/x86/ad...) that performs more consistently across different families of x86 CPUs.

josephg3y ago

Does anyone have any recommendations for checksumming algorithms in greenfield systems? It seems like there’s lots of innovation in crypto secure hashing functions. But I have a greenfield project where I need checksums but don’t care about crypto properties. Is CRC32c still a good choice or has the industry moved on?

diroussel3y ago

What are your requirements? Tamper resistance? Error detection? Error correction? Speed vs time vs space trade off?

josephg3y ago

Just error detection for corrupt or partial disk writes.
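For that use case, one common shape is a length-prefixed record with a checksum over the payload. A minimal sketch, assuming a hypothetical record layout (`HEADER`, `encode_record`, `decode_record` are illustrative names) and using `zlib.crc32` as a stand-in, since the Python standard library has no CRC32C:

```python
import struct
import zlib

# Hypothetical on-disk layout: payload length, then CRC-32 of the payload.
HEADER = struct.Struct("<II")

def encode_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def decode_record(buf: bytes):
    """Return the payload, or None if the record is truncated or corrupt."""
    if len(buf) < HEADER.size:
        return None                      # header itself was only partly written
    length, crc = HEADER.unpack_from(buf)
    payload = buf[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        return None                      # partial write: payload is short
    if zlib.crc32(payload) != crc:
        return None                      # corruption: checksum mismatch
    return payload

record = encode_record(b"hello")
print(decode_record(record))        # intact record
print(decode_record(record[:-1]))   # simulated partial write
```

The stored length catches truncation and the checksum catches bit flips in the payload; swapping in a different checksum only changes the two `zlib.crc32` calls.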

TAForObvReasons3y ago

While micro-optimizations are interesting, there are two questions left unanswered:

- Does this change noticeably affect the total runtime? The checksum seems simple enough that the slight difference here wouldn't show up in PNG benchmarks.

- The proposed solution uses AVX2, which is not currently used in the original codebase. Would any other part of the processing benefit from using newer instructions?
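The first question is an instance of Amdahl's law: the overall gain is bounded by the share of total runtime the checksum occupies. A small sketch of the arithmetic, where the 2% share and the 20x local speedup are made-up illustrative numbers, not measurements from fpng or the article:

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: total speedup when `fraction` of runtime gets `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

# Hypothetical inputs: checksum is 2% of encode time and becomes 20x faster.
print(f"{overall_speedup(0.02, 20.0):.3f}x overall")
```

Under those assumed numbers the whole encode gets only about 2% faster, however fast the checksum itself becomes.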

londons_explore3y ago

If checksum calculation was any substantial portion of image decoding, I think that would be a strong case for simply not checking the checksum.

If you put corrupted data into a PNG decoder, I don't think it's awfully important to most users whether they get a decode error or a garbled image out.

woooshOP3y ago

This was actually considered, and other libraries do ignore checksums, or at least have options to:

https://github.com/richgel999/fpng/issues/9

NavinF3y ago

>diminishing returns especially due to it working faster than the speed of my RAM (2667MT/s * 8 = ~21 GB/s).

That sounds kinda slow; Is there only 1 DIMM in the slots? I remember benchmarking 40GiB/s read speed on an older system that had 2 dual-rank DIMMs (4 ranks in total).

I'd expect 3200 Mbit/s * (64 data lines) * (2 memory channels) = ~48 GiB/s on a typical DDR4 desktop and a lot more with overclocked RAM.

Great writeup either way.
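The two bandwidth figures above can be reproduced with a short calculation. This is a sketch of theoretical peak numbers only, assuming a 64-bit (8-byte) data bus per channel; the function name is hypothetical:

```python
def peak_bandwidth_bytes_per_s(mega_transfers_per_s: float,
                               channels: int = 1,
                               bus_width_bits: int = 64) -> float:
    """Theoretical peak DRAM bandwidth: transfers/s * bytes per transfer * channels."""
    return mega_transfers_per_s * 1e6 * (bus_width_bits / 8) * channels

single = peak_bandwidth_bytes_per_s(2667)            # one DDR4-2667 channel
dual = peak_bandwidth_bytes_per_s(3200, channels=2)  # two DDR4-3200 channels

print(f"single channel: {single / 1e9:.1f} GB/s")
print(f"dual channel:   {dual / 1e9:.1f} GB/s = {dual / 2**30:.1f} GiB/s")
```

Note the unit difference: 51.2 GB/s (decimal) is the same quantity as roughly 47.7 GiB/s (binary), which matches the "~48 GiB/s" estimate.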

woooshOP3y ago

Yes, this is on a single 8GB 2667MHz DIMM in a laptop.

edit: For dual channel RAM, I would suspect the throughput depends on how the kernel decides to map physical memory to virtual addresses.

adrian_b3y ago

The memory is already mapped by the BIOS/EFI firmware, before the kernel takes control.

By default, whenever the memory modules used in all different channels have the same size, e.g. two 8 GB modules, the firmware maps the modules with interleaved addresses, to ensure a double throughput for 2 channels, or triple/quadruple/etc. for workstation/server motherboards with more memory channels.
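A toy model of interleaving shows why a sequential read ends up using both channels. The 64-byte granule and the simple modulo mapping are assumptions for illustration; real memory controllers choose their own granularity and often hash several address bits:

```python
from collections import Counter

def channel_for(addr: int, channels: int = 2, granule: int = 64) -> int:
    """Toy interleave: consecutive granules of physical address space alternate channels."""
    return (addr // granule) % channels

# Walk 4 KiB of physical addresses one granule at a time and count hits per channel.
hits = Counter(channel_for(addr) for addr in range(0, 4096, 64))
print(dict(hits))
```

Because the split happens on physical addresses at a granularity far below the page size, any page the kernel hands out is already spread evenly across the channels in this model.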

jiggawatts3y ago

I hope this brilliant work has been merged into the relevant open source libraries.

Something that’s unfair about the world is that work like this could reach billions of people and save a million dollars worth of time and electricity annually but is being done gratis.

It would be amazing if there were charities that rewarded high-impact open source contributions like this proportionally to the benefits to humanity…

benreesman3y ago

Effective hiring managers are always paying attention in the hopes of noticing the people doing this excellent stuff and asking whether “now” or “soon” is the right time to offer such people high paying jobs.

I doubt my current garage band could afford the OP just this moment, but I sure wish we could!

woooshOP3y ago

> I doubt my current garage band could afford the OP just this moment, but I sure wish we could!

Well, I intend to finish high school at a minimum before pursuing employment :)

james-redwood3y ago

It works well in science (think the Nobel Prizes), and there's certainly more than enough money floating around in the tech community for it to work.

daniel-cussen3y ago

I love this kind of writeup. This is my idea of fun: speedups.

profquail3y ago

zlib-ng also has adler32 implementations optimized for various architectures: https://github.com/zlib-ng/zlib-ng

Might be interesting to benchmark their implementation too to see how it compares.
