In which case bit packing can be problematic as it can stretch across a cache boundary?
Something like this seems like something a compiler could relatively easily do. Have a bunch of different ways of storing structs (word-aligned, size-aligned, bit-aligned, modulo/remainder, modulo-next-prime, modulo-next-easy-prime, etc {what else am I missing here?}), and be able to specify to the compiler via an annotation or similar which you want, or what your typical sizes are and let the compiler choose from there.
But if you're happy to do that, there's no reason it couldn't be offered by a compiler. However it's only useful for an unusual use case, when you have data that could just fit in memory, but doesn't fit in memory without the bit-packing.
All of these are things that you can do manually, and that you can leave to the compiler and hope that it'll pick the correct one, but currently good luck telling the compiler that no, you really actually want to do <x>. Or to try <x> and <y> and see which is better. And the amount of time required to do so manually adds up, even though they are all things that could be done by a compiler.
I personally wish that you could specify / annotate the range that you are actually intending to store in an integer (and have it optionally bound-checked) anyway, but that is another matter.
(Actually, I wish that you could specify types with an arbitrary (probably-pure) boolean function to indicate if something is or isn't a valid value in the type. But that is another matter indeed.)
Many high-level languages require programmers to specify what you want to store ('an integer between -6 and 43, inclusive'), not how ('6 bits'). That typically makes it possible for the compiler to do just that. For example, Pascal has the 'packed' keyword (http://www.gnu-pascal.de/gpc/packed.html)
The standard way is to do it recursively - you can see my implementation here: https://github.com/hcarver/Netflix/blob/2136aa5d28a209f902d4...
https://en.wikipedia.org/wiki/Chinese_remainder_theorem#Gene...
For example, the 3, 5, 7 packing yields a closed form of 70a_1+21a_2+15a_3 (mod 105). Plugging in 2, 4, 3 like your example yields 59 directly.
Indexing by movie / user is done exactly for the reason of using the cache efficiently. Unfortunately, you have to iterate through both movies and users, so you either store the sparse matrix twice (once movie-indexed, once user-indexed) OR you deal with lots of cache misses half the time.
And, yes, all the values are stored 0-based for exactly that reason :) It's an even bigger saving for storing timestamps.
Not sure what you mean about the prime solution not scaling well - 3 primes of ~2^20 can be stored in ~2^60 (i.e. within 8 bytes) as opposed to within 3 4-byte integers.
When it really sucks is when you're storing lots of small integers, e.g. 20 things in [0,1,2,3] - that gets very inefficient fast, and it'd be much more efficient to use normal bitfields.
Huffman would have been much slower to access, which would have been unacceptable for the use case (iterating 100 million data points multiple times a second).
Is there any reason you didn't just pre-process the values and use a huffman coding on them?