One of the best written expositions of databases properly expressed as operations over hyper-dimensional spaces was produced by Rudolf Bayer, the guy that invented the B-Tree. In the late 1990s he invented a multidimensional indexing structure based on space-filling curves called a UB-Tree and they wrote at length about how various database operations are implemented using that representation. It is generally informative if you are unfamiliar with this aspect of database theory, not just in the context of UB-Trees about which it was written.
If it is such a great idea then why does no major database implement things this way? Despite several attempts by companies like Oracle, IBM, and Microsoft, no one has described a generalized data structure for databases with hyper-rectangle operands as your primitives. There are dozens of narrow algorithm solutions known, both published and unpublished, but none that you could legitimately use in a commercial database system because they all have limitations that will adversely affect real-world applications.
Hyperdex is a conventional algorithm from the standpoint of indexing hyper-dimensional spaces. They are not doing anything new there that has not been done before. However, the update value chaining element of it is actually pretty neat.
There is a big difference between space-filling curves and hyperspace hashing. Space-filling curves map a multidimensional space to a single path through that space that is then mapped to nodes. In the process, they do not retain locality. To our knowledge, hyperspace hashing is a direct intellectual descendant of consistent hashing and has not been done before. If you have pointers to work where data is mapped to nodes in a cluster using a multidimensional hash, please send them to us!
And one major reason why multidimensional databases failed to take off is a problem known as "the curse of dimensionality." If you implement a multi-dimensional representation naively, highly-dimensional data (say, an object with 10-20 attributes) will require a large number of nodes to be efficient. HyperDex solves this through something called space partitioning (I think the paper calls it "data partitioning," but we've changed the name to be a bit more descriptive). They're kind of analogous to materialized views, very loosely speaking.
Agreed completely that hyperspace hashing comes to its own when coupled with value-dependent chaining!
Having said that, I had trouble stopping and/or restarting the cluster in a clean way. To make me even consider using it on production, it should also offer me ways to backup data and as well as convince me that future upgrades will be somewhat smooth.
Very cool.
Agreed with you fully that value-dependent chains are neat. They allow the system to replicate and relocate data, without any need for background processes. VDCs are the key to HyperDex's strong consistency guarantees.
http://news.ycombinator.com/item?id=3622059
https://groups.google.com/forum/?fromgroups#!topic/redis-db/...
I think it's weird when people believe there's a tool that will do their job for them, like a hammer that builds a roof by itself.
I'm sure HyperDex is totally useful for some cases, but it has clear disadvantages when you try to use it for what it wasn't intended (like global HP databases). All of a sudden you find yourself building glue to make it fit with your hybrid architecture. Instead you could take something simple and customize it, and build a huge successful business off of it, like the biggest sites in the world do currently with various tools that weren't engineered to solve simple problems like the number of round trips to look up an object.