undefined | Better HN

0 pointsmaxwell864y ago0 comments

You can just verify using any language with powerful optimizations and whose generic system supports both boxed and unboxed generics, e.g., Rust.

The difference between passing an unboxed generic and triggering monomorphization vs passing a boxed type-erased generic is night and day in terms of performance.\

Boxed generics generate one version of the code that need to work for all types, independently of layout, and dispatch via the generic ABI (aka `interface {}` in Go).

Unboxed generics generate one version of the code per type, removing memory allocations, and allowing the compiler to optimize the function for each type and each layout independently.

This increases code size and compilation times, but in many cases allows dozens of compiler optimizations that aren't possible for the boxed generic case. The code produced by these optimizations allow others to run, etc. And a 100x improvement in perf isn't uncommon (i've seen order of magnitude larger improvements and regressions in perf than that from switching back and forth between boxed and unboxed generics in Rust).

0 comments

9 comments · 2 top-level

foldr4y ago· 6 in thread

We can already see the comparison in the benchmark in the original blog post.

I edited my previous post to point out that I’m familiar with Rust and other trendy languages. Please don’t assume that I’m not just because I don’t also hate Go.

The speed up (or even slow down!) from boxing is highly dependent on the nature of the code, the size of the relevant type, and the usage patterns for the data structure.

In the context of a generic data stucture, unboxing is not likely to unlock a huge number of optimizations. This is in contrast to e.g. generic code for performing matrix operations, where unboxing could make a huge performance difference.

My original post said “A generic data structure is exactly the kind of code that you wouldn’t expect to perform better with generics.” You then responded by taking about performance improvements deriving from autovectorization, which are relevant to generics in general but not to generic data strucures like a deque.

vgatherps4y ago

> In the context of a generic data stucture, unboxing is not likely to unlock a huge number of optimizations.

Aside from the optimisation of "Not having to allocate and deallocate memory", you can unlock the optimisations of "removing one step of pointer chasing" and "better memory locality" - arguably the top two things one should be thinking about when writing fast code.

If your datastructure has to access the memory, boxed values hurt you in two ways. One is that on each access you have to chase a pointer to who-knows-where. There's also no reason to believe that this value is going to be stored next to other values in the datastructure.

This hurts you in two ways - the obvious one is that extra pointer hop is another chance to miss cache. The less obvious way is that loading one element doesn't do anything to bring nearby elements in from the cache. If your datastructure tends to access nearby items together (say a btree), this is extremely valuable for performance. An even better feature is that if you start loading some sort of metadata nearby elements (again, like a btree might have), the prefetcher might already start loading your data.

You also get inlining benefits in this case - a virtual call out to some comparison function might compile down to a single instruction.

You bring up a good point, that there are cases where boxing matters like having extremely large elements (you still get inlining with generics though). It also matters less for datastructures that don't actually inspect their elements in any way. But in my experience, if you look at something like a map type datastructure, you don't see keys being 1KB objects - they tend towards things like strings, integer IDs, or fairly small structs consisting of the above.

Benchmarking these effects can be a little tricky - small microbenchmarks can reside entirely in L1 or L2 and get extremely unrealistic allocation patterns (only thing allocating == nice, contiguous, allocations), so a realistic benchmark has to apply some degree of cache/allocator pressure that's relevant to your workload.

maxwell86OP4y ago

> n the context of a generic data stucture, unboxing is not likely to unlock a huge number of optimizations.

This is not true, the difference between a Vec<Box<T>> and a Vec<T> is huge. You can't easily vectorize a Vec<Box<T>> cause the elements can be in different memory locations, and this is important because... people do loop over all elements in data-structures super often.

And not only vectorization, but removing the memory indirection would make much better use of caches, depending on the size of T, multiply your memory BW by a big factor, etc.

foldr4y ago

Hopefully you’re not looping over all the elements in your deques super often, or there wouldn’t be much point in using a deque!

We’re talking past each other here. Yes, unboxing will sometimes give a performance improvement. But I’m skeptical that any realistic code using the deque structure in the blog post would see much of a gain. To repeat, even in the micro benchmark in the blog post, the performance gains are only around 3x. In most realistic application-level code, the code paths that lead to insertion of a new element will also do some allocation, so that the cost of boxing is lost in the noise.

Vectorization strikes me as a fairly niche case. It’s revealing that you use Vec as an example. In any case where autovectorization is going to give a useful performance improvement, you probably can just use a Vec. And of course, the Go equivalent of an unboxed Vec in Rust doesn’t require generics anyway. So I doubt that there is a large body of existing Go code that will suddenly become ripe for autovectorization once data structure libraries switch over to generics.

maxwell86OP4y ago

> Hopefully you’re not looping over all the elements in your deques super often, or there wouldn’t be much point in using a deque!

I do this all the time: one thread fills the deque, while another thread consumes as much of it as passible.

A modern deque is just a contiguous (mirrored) array in virtual memory. From the point of view of the algorithms, it looks just like a Vec.

2 more replies

losvedir4y ago

> In the context of a generic data structure, unboxing is not likely to unlock a huge number of optimizations.

Reading between the lines here from your other comments, I think you might be excluding simple arrays here, because in Go you could already have monomorphized functions over unboxed arrays of elements without the new Generics feature? (I say "Generics feature" because that's what is new; I'd say Go already had support for "generics" special cases for arrays.)

Because generics over simple arrays is somewhere that unquestionably, vastly improves performance, in most normal situations. That's the impetus for "data oriented design", and the reason for, e.g. pandas using Arrow as a backing store, or why Clickhouse absolutely flies, or the basis for q/kdb: data locality to take advantage of CPU cache lines.

I think that in many situations, iterating over an array of contiguous data is faster than a more "optimal" algorithm (from a Big O perspective) juggling pointers. I've seen a couple talks on this, I think one by Godbolt and one by Strausberg, and they were kind of counterintuitive to me (what do you mean iterating over an array is faster than having a map?), but very thought provoking.

foldr4y ago

Yes, exactly.

I can see why people say that Go already had a sort of built-in generics feature. However, I think the same could be said about any language that allows you to define arrays with different element types (e.g. C). In common parlance, a language has generics iff it permits you to explicitly abstract over types.

I wish it had occurred to me earlier that C vs. C++ is a good test case here. If generics really led to huge performance improvements in typical application-level code, we'd expect C++ programs to run much faster than C programs. Typically, they don't. Of course there are exceptions, such as code that benefits from libraries like uBLAS.

(I'm aware that you can fake generics in C using macros, but there are also comparable code generation tools for Go.)

ogogmad4y ago· 1 in thread

Let me wade in: I think what you're saying is plausible, because the CPU cache should be an important variable here. Boxing should cause cache misses.

maxwell86OP4y ago

Its not only CPU cache, its the whole CPU.

Having one function per type results in more instructions in total, cause you end up with more copies of the function, but less instructions when dealing with one type, like in a collection of values of one type only.

Less instructions is linear speed up. Single copy of the code enables inlining, constant propagation for type sizes, which enables vectorization of loops. Vectorization alone can buy you up to 32x on modern hardware if the code is flops limited.

Avoiding pointer indirections also trash the cache less, and the BW of caches is orders of magnitude higher than the bandwidth of ram (multiple TB/s vs lower 100 GB/s).

And caches have lower latencies so CPU threads wait less on memory.

So your speed up ends up: less instructions * more FLOPs * more BW, and that can easily be 10x * 32x * 20 ~= 6000x in the worst pathological case.

A 1000x speed up / slowdown per element due to switching between boxed and unboxed generics is less common than a 100x speed up in Rust, but it does happen.

j / k navigate · click thread line to collapse

0 comments

9 comments · 2 top-level

foldr4y ago· 6 in thread

We can already see the comparison in the benchmark in the original blog post.

I edited my previous post to point out that I’m familiar with Rust and other trendy languages. Please don’t assume that I’m not just because I don’t also hate Go.

The speed up (or even slow down!) from boxing is highly dependent on the nature of the code, the size of the relevant type, and the usage patterns for the data structure.

vgatherps4y ago

> In the context of a generic data stucture, unboxing is not likely to unlock a huge number of optimizations.

You also get inlining benefits in this case - a virtual call out to some comparison function might compile down to a single instruction.

maxwell86OP4y ago

> n the context of a generic data stucture, unboxing is not likely to unlock a huge number of optimizations.

And not only vectorization, but removing the memory indirection would make much better use of caches, depending on the size of T, multiply your memory BW by a big factor, etc.

foldr4y ago

Hopefully you’re not looping over all the elements in your deques super often, or there wouldn’t be much point in using a deque!

maxwell86OP4y ago

> Hopefully you’re not looping over all the elements in your deques super often, or there wouldn’t be much point in using a deque!

I do this all the time: one thread fills the deque, while another thread consumes as much of it as passible.

A modern deque is just a contiguous (mirrored) array in virtual memory. From the point of view of the algorithms, it looks just like a Vec.

2 more replies

losvedir4y ago

> In the context of a generic data structure, unboxing is not likely to unlock a huge number of optimizations.

foldr4y ago

Yes, exactly.

(I'm aware that you can fake generics in C using macros, but there are also comparable code generation tools for Go.)

ogogmad4y ago· 1 in thread

Let me wade in: I think what you're saying is plausible, because the CPU cache should be an important variable here. Boxing should cause cache misses.

maxwell86OP4y ago

Its not only CPU cache, its the whole CPU.

Avoiding pointer indirections also trash the cache less, and the BW of caches is orders of magnitude higher than the bandwidth of ram (multiple TB/s vs lower 100 GB/s).

And caches have lower latencies so CPU threads wait less on memory.

So your speed up ends up: less instructions * more FLOPs * more BW, and that can easily be 10x * 32x * 20 ~= 6000x in the worst pathological case.

A 1000x speed up / slowdown per element due to switching between boxed and unboxed generics is less common than a 100x speed up in Rust, but it does happen.

j / k navigate · click thread line to collapse