Dense bitpacking

(writing.londonstartuptech.com)

34 points · hywel · 10y ago · 36 comments

22 comments · 10 top-level

eloff · 10y ago · 5 in thread

The article author talks of the number of CPU ops, as if they were all the same. Since he's using compile time constants, he probably never noticed that division (20-100 cycles) is so horribly slow. The compiler uses a trick like in Hacker's Delight to convert division into multiplication (4 cycles). But shifts and masks are cheap (1 cycle each). This kind of trickery is rarely worth the small space savings (to be fair, the author seems aware of that, by calling it immature optimization.)
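To make the two decoding styles concrete, here is a minimal sketch. It assumes the article's scheme decodes each field as a remainder modulo a fixed constant (inferred from the CRT discussion further down the thread), and it borrows the moduli 3, 5, 7 from that discussion. The function names are hypothetical, and Python says nothing about cycle counts; it only shows which operations each approach needs.

```python
# Hypothetical field moduli, borrowed from the 3, 5, 7 example later in the thread.
MODULI = (3, 5, 7)
# Bit widths wide enough for values below each modulus: 0..2, 0..4, 0..6.
WIDTHS = tuple((m - 1).bit_length() for m in MODULI)


def unpack_mod(packed):
    """Decode by remainder: one modulo (i.e. a division) per field."""
    return tuple(packed % m for m in MODULI)


def pack_shift(values):
    """Encode bitfield-style by giving each value its own run of bits."""
    packed, shift = 0, 0
    for value, width in zip(values, WIDTHS):
        packed |= value << shift
        shift += width
    return packed


def unpack_shift(packed):
    """Decode bitfield-style: one shift and one mask per field."""
    values, shift = [], 0
    for width in WIDTHS:
        values.append((packed >> shift) & ((1 << width) - 1))
        shift += width
    return tuple(values)


print(unpack_mod(59))                        # (2, 4, 3)
print(unpack_shift(pack_shift((2, 4, 3))))   # (2, 4, 3)
```

The remainder-based layout needs 7 bits here (105 possible values), while the shift-and-mask layout needs 8 (2 + 3 + 3), which is the space-versus-speed trade being argued about.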

hywel (OP) · 10y ago

Yeah, definitely aware that it's slower - I was severely bounded by memory when I came up with this, as I was trying to get my training data and all the variables in my model into 8GB of RAM for a machine learning project. Without the trick, swap was killing the speed; with the trick everything sailed along :)

eloff · 10y ago

I'm surprised saving 4 bits made such a difference; it's only about 6% smaller. But I guess that depends on how much of your memory these things are taking up. 6% of 8GB is about 500 MB, which could easily be the difference between swapping and not.

asQuirreL · 10y ago

Also worth noting that storing them in a long probably means 2 memory reads per access on most modern architectures, whereas storing them in bitfields will allow the compiler to select the containing CPU word, which will be 1 read. I would be interested to see actual performance measurements (maybe a performance vs memory usage chart).

eloff · 10y ago

By "most modern architectures" you mean 32-bit, since you seem to think accessing a 64-bit "long" (it's 32 bits on Windows, btw) takes two memory accesses. 32-bit architectures are not "modern"... Maybe you meant unaligned 64-bit accesses, which will be 2 reads, but those are pretty cheap since Sandy Bridge or earlier.

TheLoneWolfling · 10y ago

I thought that modern architectures tend to read in an entire cache line at once?

In which case bit packing can be problematic as it can stretch across a cache boundary?

TheLoneWolfling · 10y ago · 3 in thread

Are there any languages that suggest optimizations? Where you can explicitly enable them?

Something like this seems like something a compiler could relatively easily do. Have a bunch of different ways of storing structs (word-aligned, size-aligned, bit-aligned, modulo/remainder, modulo-next-prime, modulo-next-easy-prime, etc {what else am I missing here?}), and be able to specify to the compiler via an annotation or similar which you want, or what your typical sizes are and let the compiler choose from there.

hywel (OP) · 10y ago

That'd be hard for a compiler to suggest for this case - it requires knowing the range of each value that you're storing in the struct (and not just the number of bits).

But if you're happy to do that, there's no reason it couldn't be offered by a compiler. However it's only useful for an unusual use case, when you have data that could just fit in memory, but doesn't fit in memory without the bit-packing.

TheLoneWolfling · 10y ago

There are many cases where this sort of thing would be useful, not just this edge case. For instance, automatically rewriting between a recursive function and a function with an explicit external stack. Or swapping between an eagerly evaluated function and a lazily evaluated one. Or swapping between different types of allocations (stack, heap, region + bump pointer, etc). Or swapping between different ways of parallelizing a function or loop. Or swapping between row or column based evaluation. Or specifying two equivalent functions and having the compiler check as much as possible that they are equivalent, calling both and asserting on debug builds and calling the faster one on release builds (or potentially even calling both in separate threads and waiting until either one returns). Etc.

All of these are things that you can do manually, and that you can leave to the compiler and hope that it'll pick the correct one, but currently good luck telling the compiler that no, you really actually want to do <x>. Or to try <x> and <y> and see which is better. And the amount of time required to do so manually adds up, even though they are all things that could be done by a compiler.

I personally wish that you could specify / annotate the range that you are actually intending to store in an integer (and have it optionally bound-checked) anyway, but that is another matter.

(Actually, I wish that you could specify types with an arbitrary (probably-pure) boolean function to indicate if something is or isn't a valid value in the type. But that is another matter indeed.)

Someone · 10y ago

"it requires knowing the range of each value that you're storing in the struct (and not just the number of bits)."

Many high-level languages require programmers to specify what you want to store ('an integer between -6 and 43, inclusive'), not how ('6 bits'). That typically makes it possible for the compiler to do just that. For example, Pascal has the 'packed' keyword (http://www.gnu-pascal.de/gpc/packed.html)
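As a rough illustration of what a compiler can derive from a declared range, the sketch below computes the storage width for "an integer between -6 and 43, inclusive" and stores values as offsets from the lower bound. The helper names are hypothetical, and this is not a claim about how GNU Pascal actually lays out packed records.

```python
def bits_for_range(lo, hi):
    """Bits needed to store any integer in [lo, hi] as an offset from lo."""
    if hi < lo:
        raise ValueError("empty range")
    # The largest stored offset is hi - lo; its bit length is the width.
    return (hi - lo).bit_length()


def encode(value, lo, hi):
    """Bounds-check the value, then store it relative to the lower bound."""
    if not lo <= value <= hi:
        raise ValueError(f"{value} is outside [{lo}, {hi}]")
    return value - lo


def decode(stored, lo):
    """Recover the original value from its stored offset."""
    return stored + lo


print(bits_for_range(-6, 43))            # 6
print(decode(encode(-6, -6, 43), -6))    # -6
```

The range holds 50 distinct values, so 6 bits suffice, matching the '6 bits' in the comment above.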

periodontal · 10y ago · 2 in thread

The fastest way to encode these dense bitpackings is almost certainly with the Chinese remainder theorem.

hywel (OP) · 10y ago

Kind of. The Chinese Remainder Theorem tells you that there exists a number that satisfies the congruences, and that the number is unique (modulo the product of the moduli), but it doesn't tell you how to calculate it.

The standard way is to do it recursively - you can see my implementation here: https://github.com/hcarver/Netflix/blob/2136aa5d28a209f902d4...

periodontal · 10y ago

CRT is often given with a closed form for the satisfying solution, although it requires computing multiplicative inverses mod each of the n_i (your primes). Since all of these values are known at design time, they can be precomputed, and the run time for encoding becomes a constant: M multiplications, M-1 additions and one mod at the end, where M is the number of items you are encoding.

https://en.wikipedia.org/wiki/Chinese_remainder_theorem#Gene...

For example, the 3, 5, 7 packing yields a closed form of 70a_1+21a_2+15a_3 (mod 105). Plugging in 2, 4, 3 like your example yields 59 directly.
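The closed form can be checked with a short sketch. The function names are hypothetical and this is not the implementation linked above; it just precomputes one coefficient per modulus and then encodes with multiplications, additions and a single final mod. It assumes pairwise coprime moduli and Python 3.8+ (for `math.prod` and the three-argument `pow` with a negative exponent).

```python
from math import prod


def crt_coefficients(moduli):
    """Precompute one coefficient per modulus (can be done at design time)."""
    total = prod(moduli)
    coeffs = []
    for m in moduli:
        partial = total // m              # product of all the other moduli
        inverse = pow(partial, -1, m)     # modular inverse of partial mod m
        coeffs.append(partial * inverse % total)
    return coeffs, total


def crt_encode(values, coeffs, total):
    """M multiplications, M-1 additions and one mod at the end."""
    return sum(c * v for c, v in zip(coeffs, values)) % total


coeffs, total = crt_coefficients((3, 5, 7))
print(coeffs, total)                         # [70, 21, 15] 105
packed = crt_encode((2, 4, 3), coeffs, total)
print(packed)                                # 59
print(packed % 3, packed % 5, packed % 7)    # 2 4 3
```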

eska · 10y ago · 1 in thread

In the context of the overall algorithm, wouldn't you cluster the movie ratings first anyway (O(n)), so most of the algorithm would do computations on those clusters which would have less data than single movies to begin with? I'd worry about minimizing the size of the clusters instead. You'll also probably want some kind of hierarchical data structure to use the cache efficiently. This doesn't help with that. If the movie ratings are sorted, a lot of the data becomes redundant and can be left out entirely. Those values should be all 0-based too, since it saves you some bits (3 bits for 1-5 vs 2 bits for 0-4). With that "optimization" alone you save more bits than with the early bit packing attempt. The solution with primes also doesn't scale well. The bigger the primes get, the more empty space you create.

hywel (OP) · 10y ago

The actual rating is a tiny part of the data per movie, so there's not much saving there. And clustering would have to be done instead of indexing by movie / user, so it would probably make performance worse overall.

Indexing by movie / user is done exactly for the reason of using the cache efficiently. Unfortunately, you have to iterate through both movies and users, so you either store the sparse matrix twice (once movie-indexed, once user-indexed) OR you deal with lots of cache misses half the time.

And, yes, all the values are stored 0-based for exactly that reason :) It's an even bigger saving for storing timestamps.

Not sure what you mean about the prime solution not scaling well - 3 primes of ~2^20 can be stored in ~2^60 (i.e. within 8 bytes) as opposed to within 3 4-byte integers.

When it really sucks is when you're storing lots of small integers, e.g. 20 things in [0,1,2,3] - that gets very inefficient fast, and it'd be much more efficient to use normal bitfields.
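A quick way to see both ends of that trade-off is to count bits. The sketch below assumes the scheme needs a distinct prime modulus per field, each at least as large as the field's range (an assumption drawn from this thread, not from the article itself); the helper names are hypothetical.

```python
from math import isqrt, prod


def primes_at_least(lower, count):
    """Return the first `count` primes >= lower, by simple trial division."""
    found = []
    candidate = max(lower, 2)
    while len(found) < count:
        if all(candidate % d for d in range(2, isqrt(candidate) + 1)):
            found.append(candidate)
        candidate += 1
    return found


def bits_to_hold(moduli):
    """Bits needed to store any packed value below the product of the moduli."""
    return (prod(moduli) - 1).bit_length()


# 20 fields, each in [0, 3]: every field needs its own prime >= 4.
small_moduli = primes_at_least(4, 20)
prime_bits = bits_to_hold(small_moduli)
bitfield_bits = 20 * 2  # plain 2-bit fields

# 3 fields, each with a range of about 2^20.
large_moduli = primes_at_least(2**20, 3)
large_bits = bits_to_hold(large_moduli)

print(small_moduli)
print(prime_bits, "bits with primes vs", bitfield_bits, "bits with bitfields")
print(large_bits, "bits for three ~2^20 fields vs", 3 * 32, "bits as 4-byte ints")
```

With many small fields the primes are forced far above the range they encode, so the packed value ends up more than twice the size of the plain bitfield layout; with a few large fields the primes sit just above the range and the total still fits in 8 bytes.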

hywel (OP) · 10y ago · 1 in thread

It doesn't actually use bitfields, it describes an alternative to them that was a better choice in this specific instance.

Huffman would have been much slower to access, which would have been unacceptable for the use case (iterating 100 million data points multiple times a second).

uxcn · 10y ago

I'm not familiar with the dataset you were working with, but I would be willing to bet certain values occur a lot more frequently than others. In which case, you could work with the compressed values, and only uncompress them for the results.

uxcn · 10y ago

This is kind of an abuse of C/C++ bitfields. Arguably you don't gain much by using them as opposed to a memory chunk and simple accessors. Bitfields are a way to get an explicit memory layout, but using them subverts word alignments, word sizes, etc...

Is there any reason you didn't just pre-process the values and use Huffman coding on them?

stuaxo · 10y ago

Awesome, I wondered whether something like this might be possible when I first found out about bitfields, but maths isn't my strong point (and certainly wasn't 20 years ago), so I just forgot about it.

hywel (OP) · 10y ago

Although obviously there's still a preprocessing step to make that possible.

nickpsecurity · 10y ago

Clever and interesting scheme for encoding the data.

mc_hammer · 10y ago

good read cheers.
