Skip to content

Top Best Ask Show New Jobs

Summing ASCII encoded integers on Haswell at almost the speed of memcpy (opens in new tab)

(blog.mattstuchlik.com)

132 pointsiliekcomputers1y ago36 comments

36 comments

19 comments · 4 top-level

raldi1y ago· 7 in thread

Is there an explanation of why it sometimes gives the wrong answer?

sYnfo1y ago

1) if you set BATCH_SIZE > 14 sums_acc may overflow

2) chunks with too many small numbers cannot be processed with just 2 shuffle-adds

3) (not mentioned in the post) HighLoad limits the size of the source code you can submit, so you can't put all possible values in the look-up table

camel-cdr1y ago

Couldn't you organize the accumulators in 8 byte chunks, and leave the upper byte unused. Then you map consecutive digits to those chunks and use 64 bit addition for the accumulation. Then overflow between the bytes would keep the correct result if you do the shuffles correctly, and you have a full byte of overflow buffer.

rurban1y ago

So SIMD would need to set the overflow flag also to catch em.

Which would be much faster than the checked add (adc). Does any hardware support such checked SIMD arithmetic already?

Or can you still assume that most arithmetic is still broken in most languages/libraries.

kardos1y ago

For 1, can you raise that to 28 with unsigned accumulators?

genter1y ago

> will only produce correct results with probability < 1, though very close to 1

That's terrifying

madars1y ago

It's worse: Pr[correct output | hard input] = 0, even though they estimate that Pr[correct output | random input] ~ 1. This means that you can't, for example, amplify your success probability by repeating the algorithm a bunch of times and taking the majority vote.

Why? So long as you know the probabilities and they are tolerable, why?

dist1ll1y ago· 4 in thread

First time I hear about HighLoad. Seems really interesting to me on the first glance. I personally find SIMD and ISA/μarch-specific optimizations more rewarding than pure algorithmic challenges (codeforces and such).

Though Haswell seems like a pretty obsolete platform to optimize for at this point. Even Skylake will be a decade old next year.

sgerenser1y ago

Realistically beyond Haswell there hasn’t been a ton of advancement in SIMD. Hawell introduced AVX2, which is what this blog post uses. AVX512 is certainly more powerful, but that’s not even available in the majority of Intel CPUs, even brand new ones.

jandrewrogers1y ago

AVX-512 has been ubiquitous on Intel server CPUs for a long time. Most people don't run high-performance throughput-oriented codes on consumer-grade CPUs with no ECC, which is the primary application for AVX-512. AVX-512 is a markedly better ISA than AVX2, aside from being wider.

As part of that Haswell brought FMA support which was a boon to those of us doing a lot of multiplication and addition (made those workloads twice as fast).

xipix1y ago

The correct solution to this optimization problem is to write the integers raw, not as ASCII.

ashleyn1y ago· 2 in thread

Knew it'd be SIMD. Such an underrated feature of modern CPUs. Hopefully with cross-platform SIMD in Rust and Golang, it'll be more commonly used.

Thinking parallel gets you enormous speed benefits for any number of arbitrary algorithms: https://mcyoung.xyz/2023/11/27/simd-base64/

neonsunset1y ago

Here's the tracking issue for Go if you're interested: https://github.com/golang/go/issues/67520

I wouldn't be holding my breath though - proper support of high-level portable SIMD abstraction requires quite a lot of compiler complexity due to how wide (heh) the API surface of SIMD extensions is in most ISAs, and because of details necessary to get right to keep data in appropriate (vector and/or mask) registers. This, naturally, goes in the complete opposite direction to the design philosophy of Go's compiler. Instead, you are supposed to write a custom Go ASM syntax, with byte literals used to encode opcodes if they are not natively supported (which is common).

If you're interested in what high-effort SIMD implementation in this kind of language looks like, take a look at C#'s cross-platform Vector API: https://github.com/dotnet/runtime/blob/main/docs/coding-guid...

https://lemire.me/blog/2024/07/05/scan-html-faster-with-simd... (uses platform intrisics, but showcases that you can go one abstraction level lower, retaining the same Vector128<T> type if you need to specialize a particular part of your algorithm for a platform, without having to maintain separate copy for each one)

Here's high-effort vectorized CRC64 implementation that uses these: https://github.com/dotnet/runtime/blob/283de5b5adf08c42d4945... (performs as fast as C++-based mnemonic variant)

Mojo should get a mention since we are in topic of SIMD.

wolf550e1y ago· 2 in thread

I think the trick with dereferencing unmapped memory is cool, but I only really care about techniques that work reliably and I can use in production.

sYnfo1y ago

To be clear, it’s not dereferencing unmapped memory, I just haven’t shown how it’s being mapped, because it’s a little complex. As I note in the post, you can imagine as if I mmap all the necessary addresses at the start of the program.

camel-cdr1y ago

Given that the input is "integers uniformly sampled from [0, 2³¹−1]" couldn't you use a LUT for the 99.99% case of just 10/9/8 digit numbers instead and have a cold branch the handle the very rare smaller numbers.

j / k navigate · click thread line to collapse