We managed to build our solution mostly in Python(!) using Numba for JIT and a number of compression tricks. More about it here:
How does your technology differ from bitmap indexes? Have you solved the performance problem of updating random rows, for example?
Compressed bitmap indexes are awesome. As with most indexes, the random-row update problem is best addressed with a log-structured merge tree that amortizes your index updates. Just keep an in-memory buffer of recently updated rows and merge it into the index in batches.
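To make the buffering idea concrete, here is a minimal sketch. It is not a real LSM tree and not how any particular product does it: the class name, `flush_threshold`, and the use of plain Python ints as uncompressed bitmaps are all assumptions for illustration. Updates land in an in-memory dict and are merged into the bitmaps in batches; reads overlay the buffer on top of the merged bitmaps.

```python
from collections import defaultdict


class BufferedBitmapIndex:
    """Toy bitmap index: one int per distinct value, bit i set if row i has that value."""

    def __init__(self, flush_threshold=4):
        self.bitmaps = defaultdict(int)  # value -> merged bitmap
        self.merged_values = {}          # row_id -> value recorded in the bitmaps
        self.buffer = {}                 # row_id -> latest unmerged value
        self.flush_threshold = flush_threshold

    def update(self, row_id, value):
        # Cheap write: only touch the in-memory buffer.
        self.buffer[row_id] = value
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Amortized merge: apply all buffered updates to the bitmaps at once.
        for row_id, value in self.buffer.items():
            old = self.merged_values.get(row_id)
            if old is not None:
                self.bitmaps[old] &= ~(1 << row_id)  # clear the stale bit
            self.bitmaps[value] |= 1 << row_id
            self.merged_values[row_id] = value
        self.buffer.clear()

    def rows_with(self, value):
        # Start from the merged bitmap ...
        bitmap = self.bitmaps.get(value, 0)
        rows = set()
        row_id = 0
        while bitmap:
            if bitmap & 1:
                rows.add(row_id)
            bitmap >>= 1
            row_id += 1
        # ... then let buffered updates override it.
        for row_id, buffered in self.buffer.items():
            if buffered == value:
                rows.add(row_id)
            else:
                rows.discard(row_id)
        return rows
```

A real system would use compressed bitmaps and several sorted on-disk levels; the point here is only that writes stay cheap and the expensive bitmap rewrite happens once per batch.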
If you are doing mostly sums & counts type work and can deal with some level of inaccuracy, you can consider HyperLogLog...
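For anyone who hasn't seen it: HyperLogLog estimates the number of distinct values (count-distinct), not sums, using a small fixed array of registers. Below is a rough sketch under my own assumptions: the precision `p=12`, the SHA-1 based 64-bit hash, and the class layout are illustrative choices, and it only includes the standard small-range (linear counting) correction.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (p >= 7 for this alpha formula)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128

    def add(self, item):
        # 64-bit hash of the item (assumption: the first 8 bytes of SHA-1 are good enough).
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        index = h >> (64 - self.p)              # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Sketches over the same p combine by taking the register-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        estimate = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate
```

With `p=12` the expected relative error is about 1.04 / sqrt(4096), roughly 1.6%, and because sketches merge by register-wise max you can keep one per partition and combine them at query time.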
We use bitmap indexes in a number of places internally.
You mentioned that there are some regularities in different data sets that can be used to increase the efficiency of their encoding. Does this mean that you need to write a different foreign data wrapper for each data set?
Did you do anything about join pushdown (which isn't supported in core PostgreSQL yet)? Apologies if this is in your talk - I looked at the slides and couldn't see anything.
Our data model makes sure that we don't have to do huge joins on the fly, which would be a bad idea anyways. We have a workaround to distribute medium-scale joins that occur frequently. Small joins are handled fine by Postgres as usual.
I would be curious to learn more about Vitesse. I couldn't find many technical details besides the mailing list post.
But a keeping-the-sauce-secret approach won't win over techies! We want to know how it works, why it works, what it can do and what it can't do!
Looking forward to the white papers and talks :)
PS: the name is very close to Google's Vitess?
1) They have optimized the (existing?) CSV import code to use SSE instructions, for faster CSV import
2) For a sufficiently complicated query, they will compile it to native code using LLVM. They presumably precompile most of Postgres (or at least the execution parts) to LLVM IR, and then convert enough of the execution plan into code so that the LLVM optimizer can optimize it (inlining, branch prediction, dead code elimination etc). If they are able to persuade LLVM to optimize the per-row decoding, I think that could be a huge win.
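To illustrate the general idea in (2), here is a toy comparison between interpreting an expression tree per row and compiling it once into a single function. This is only an analogy: Python's built-in `compile`/`eval` stands in for LLVM code generation, the tuple-based expression format and the function names are made up for the example, and nothing here reflects how Vitesse actually does it.

```python
import operator

# Supported binary operators (a deliberately tiny, hypothetical set).
BINARY_OPS = {
    "+": operator.add,
    "*": operator.mul,
    "<": operator.lt,
    ">": operator.gt,
    "and": lambda a, b: a and b,
}


def interpret(expr, row):
    """Walk the expression tree for every row (the classic interpreted executor)."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    return BINARY_OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Translate the tree into one flat Python expression string."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "const":
        return repr(expr[1])
    if kind not in BINARY_OPS:
        raise ValueError(f"unsupported operator: {kind}")
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"


def compile_predicate(expr):
    """Compile once per query; the result is called once per row with no tree walk."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<query>", "eval"))


# WHERE price > 10 AND qty * 2 < 100
where = ("and",
         (">", ("col", "price"), ("const", 10)),
         ("<", ("*", ("col", "qty"), ("const", 2)), ("const", 100)))
predicate = compile_predicate(where)
```

The interpreted version pays for tuple dispatch and recursion on every row; the compiled version pays once up front. A native-code backend gets the same structural win, plus whatever the optimizer can do once the whole expression is visible in one place.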
It's a great accomplishment; I do wish they had contributed it to Postgres, but I can't blame them for not doing so (they need to eat!).
Stop this. The two are not mutually exclusive. Postgres developers are not starving.
For OLTP queries, particularly with working-set > RAM or lots of writes, your disk I/O is probably the bottleneck (probably your IOPS, actually, which is why SSD can be so valuable). Pretty sure they're not targeting that use-case though!
Your point that great strides have been made in storage, such that we could be back at the CPU as the bottleneck, is well taken though.
e.g. http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf
EDIT: It also has NUMA-aware multi-threaded query execution.
[2] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
[3] http://www.vldb.org/pvldb/vol6/p1702-muehlbauer.pdf
[4] http://databasearchitects.blogspot.de/2014/05/trying-out-hyp...
"The approach of compiling SQL statements into machine code was one of the most successful parts of the System R project. We were able to generate a machine-language subroutine to execute any SQL statement of arbitrary complexity..."
See: http://www.cs.berkeley.edu/~brewer/cs262/SystemR-comments.pd...
I think this is a fascinating approach to improving query speed. It certainly won't be applicable to every use-case, but it seems like there's a lot of value in it.
https://wiki.postgresql.org/wiki/PGStrom
Someone got similar performance just using multithreaded native code on the CPU: