Vitesse Data: Postgres + LLVM (opens in new tab)

(vitessedata.com)

257 pointsyangyang11y ago37 comments

37 comments

31 comments · 12 top-level

vtuulos11y ago· 7 in thread

We have been running a similar setup (Postgres -> Foreign Data Wrappers -> LLVM) at AdRoll for over a year. We keep 100TBs+ of raw data in memory, compressed.

We managed to build our solution mostly in Python(!) using Numba for JIT and a number of compression tricks. More about it here:

http://tuulos.github.io/pydata-2014/#/

https://www.youtube.com/watch?v=xnfnv6WT1Ng

BenoitP11y ago

I saw an instance of your talk at another venue. At the time I was learning about bitmap indexes (because we needed OLAP capabilities on a another database engine that is not postgres). Your talk reminded me furiously of them, but in the end I was not able to further investigate the difference.

How does your technology differ from bitmap indexes? Have you solved the performance problem of updating random rows, for example?

cbsmith11y ago

Combining OLAP without OLTP (large aggregate queries + lots of real time updates) is the holy grail that Cassandra, for example, has addressed rather nicely.

Compressed bitmap indexes are awesome. Like most indexes, the updating random rows problem is best addressed using a log structured merge tree and amortizing your index updates. Just have an in-memory buffer of recently updated rows.

If you are doing mostly sums & counts type work and can deal with some level of inaccuracy, you can consider HyperLogLog...

vtuulos11y ago

Our DeliRoll is used for analytics. It doesn't support random updates, it is append-only.

We use bitmap indexes in a number of places internally.

bglazer11y ago

Very interesting slides and talk.

You mentioned that there are some regularities in different data sets that can be used to increase the efficiency of their encoding. Does this mean that you need to write a different foreign data wrapper for each data set?

vtuulos11y ago

Thanks! Encoding takes care of regularities - it is all transparent to the user and DBA.

yangyangOP11y ago

This is really interesting.

Did you do anything about join pushdown (which isn't supported in core PostgreSQL yet)? Apologies if this is in your talk - I looked at the slides and couldn't see anything.

vtuulos11y ago

Yeah, lack of aggregation function and join pushdown is annoying.

Our data model makes sure that we don't have to do huge joins on the fly, which would be a bad idea anyways. We have a workaround to distribute medium-scale joins that occur frequently. Small joins are handled fine by Postgres as usual.

I would be curious to learn more about Vitesse. I couldn't find many technical details besides the mailing list post.

2 more replies

willvarfar11y ago· 4 in thread

Interesting and upvoted even if just a sales pitch.

But a keeping-the-sauce-sectret aproach won't win over techies! We want to know how it works, why it works, what it can do and what it can't do!

Looking forward to the white papers and talks :)

PS name is very close the Google's Vitess?

justinsb11y ago

To be fair, they've made it very clear how it works:

1) They have optimized the (existing?) CSV import code to use SSE instructions, for faster CSV import

2) For a sufficiently complicated query, they will compile it to native code using LLVM. They presumably precompile most of Postgres (or at least the execution parts) to LLVM IR, and then convert enough of the execution plan into code so that the LLVM optimizer can optimize it (inlining, branch prediction, dead code elimination etc). If they are able to persuade LLVM to optimize the per-row decoding, I think that could be a huge win.

It's a great accomplishment; I do wish they had contributed it to Postgres, but I can't blame them for not doing so (they need to eat!).

ris11y ago

"I do wish they had contributed it to Postgres, but I can't blame them for not doing so (they need to eat!)."

Stop this. The two are not mutually exclusive. Postgres developers are not starving.

Keats11y ago

Vitesse means speed in french, so it's not a made up word

Zikes11y ago

I also thought this was somehow related to Google's Vitess. I was rather excited at the prospect of Postgres compatibility, which I know they've been working on, and which we use in our business.

Thaxll11y ago· 4 in thread

From my experience I/O is more important than CPU for DB, I've never seen the 24 cores of our servers having problems, however the raid 10...

justinsb11y ago

It really depends on your workload. If your working-set fits into memory and is read-mostly, your I/O is basically irrelevant. Even if you're not in-memory, if you are doing mostly sequential reads (like the OLAP queries they are benchmarking), each drive can do ~200MB/second, so it is not too difficult to become CPU bound if you are doing any significant processing on each row (particularly if your row size is small!)

For OLTP queries, particularly with working-set > RAM or lots of writes, your disk I/O is probably the bottleneck (probably your IOPS, actually, which is why SSD can be so valuable). Pretty sure they're not targeting that use-case though!

swasheck11y ago

in my experience (which isn't the sum-total of all human experience), our OLAP queries have a bottleneck at the fiber interconnect between the storage controller and the OS. cpu still isn't the problem for our OLAP systems.

your point that great strides have been made in storage such that we could be back at cpu as a bottleneck is well taken though.

jcampbell111y ago

I assume you are using cheap spinning disks, and not something like a fusion-io drive. How big is your dataset?

swasheck11y ago

that's a narrow assumption. why not get more information from the person instead of simply attacking a straw man.

1 more reply

yawniek11y ago· 2 in thread

isn't usually the transfer of data between memory and cpu the bottle neck and not the cpu itself?

e.g. http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

spacemanmatt11y ago

The most common bottleneck in my work experience has been disk I/O when the data set would not fit into memory. And every professional data set I've worked with exceeds 1TB. Perhaps there are other bottlenecks, but disk seeks and (for one place) their iSCSI over gbit ethernet nonsense ruled the performance challenges.

analogAndroid11y ago

Usually yes, but that is why you take advantage of specialized CPU instructions for bulk loading and operating on data. From what it says that is part of the optimization that these folks are taking advantage of (see comment mentioning SSE instructions).

florianfunke11y ago· 1 in thread

If you're interested in query compilation, check out the HyPer DBMS prototype at [1]. It efficiently compiles SQL (SQL92++) to super-fast LLVM code (faster than Vitesse). It's not open source, but you can download a binary [4] or try it out online using the "WebInterface" link and read about the query compiler here [2]. It also has lots of other goodies, like SIMD-powered CSV parsing (about 4x faster than Postgres) [3]. Rumor has it that the prototype will be commercialized soon ;)

EDIT: It also has NUMA-aware multi-threaded query execution.

[1] http://www.hyper-db.com

[2] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

[3] http://www.vldb.org/pvldb/vol6/p1702-muehlbauer.pdf

[4] http://databasearchitects.blogspot.de/2014/05/trying-out-hyp...

sbrakerm11y ago

Looks like the Vitesse product was inspired by this research. Even though the HyPer looks a lot more mature (though it's still a research project). Looking forward to see the product.

peatmoss11y ago· 1 in thread

I wonder how well this approach would extend to Postgres add ons like PostGIS and PGRouting?

cordite11y ago

As someone that has used PostGIS on every city in the US to map to every county in the US over the last two centuries for genealogical mapping purposes, it would probably make it nicer. (Some of the results are already in the library of congress)

hendzen11y ago

And what is old, is new again. For reference, the System R database used this technique of compiling SQL queries to machine language in the 70's:

"The approach of compiling SQL statements into machine code was one of the most successful parts of the System R project. We were able to generate a machine-language subroutine to execute any SQL statement of arbitrary complexity..."

See: http://www.cs.berkeley.edu/~brewer/cs262/SystemR-comments.pd...

mfenniak11y ago

There are a few posts on the pgsql-hackers mailing list that have some more detailed information, for anyone interested in reading more about the approach taken: http://www.postgresql.org/message-id/5CFE0CA1-E5CC-4CD1-9D0B...

I think this is a fascinating approach to improving query speed. It certainly won't be applicable to every use-case, but it seems like there's a lot of value in it.

twic11y ago

There was a vaguely similar thing a while ago called PGStrom, which compiled qualifiers into CUDA code which ran on the GPU:

https://wiki.postgresql.org/wiki/PGStrom

Someone got similar performance just using multithreaded native code on the CPU:

http://blog.notapaper.de/article5.html

uberneo11y ago

Shard-Query does the same work of utilising all the available CPUs but its for MySQL - https://github.com/greenlion/swanhart-tools/tree/master/shar...

bfrog11y ago

I would love to see how its done one day, though I get the need to make back the investment in time!

jeffdavis11y ago

Menu button doesn't seem to work in Firefox on android.

j / k navigate · click thread line to collapse

37 comments

31 comments · 12 top-level

vtuulos11y ago· 7 in thread

We have been running a similar setup (Postgres -> Foreign Data Wrappers -> LLVM) at AdRoll for over a year. We keep 100TBs+ of raw data in memory, compressed.

We managed to build our solution mostly in Python(!) using Numba for JIT and a number of compression tricks. More about it here:

http://tuulos.github.io/pydata-2014/#/

https://www.youtube.com/watch?v=xnfnv6WT1Ng

BenoitP11y ago

How does your technology differ from bitmap indexes? Have you solved the performance problem of updating random rows, for example?

cbsmith11y ago

Combining OLAP without OLTP (large aggregate queries + lots of real time updates) is the holy grail that Cassandra, for example, has addressed rather nicely.

If you are doing mostly sums & counts type work and can deal with some level of inaccuracy, you can consider HyperLogLog...

vtuulos11y ago

Our DeliRoll is used for analytics. It doesn't support random updates, it is append-only.

We use bitmap indexes in a number of places internally.

bglazer11y ago

Very interesting slides and talk.

vtuulos11y ago

Thanks! Encoding takes care of regularities - it is all transparent to the user and DBA.

yangyangOP11y ago

This is really interesting.

Did you do anything about join pushdown (which isn't supported in core PostgreSQL yet)? Apologies if this is in your talk - I looked at the slides and couldn't see anything.

vtuulos11y ago

Yeah, lack of aggregation function and join pushdown is annoying.

I would be curious to learn more about Vitesse. I couldn't find many technical details besides the mailing list post.

2 more replies

willvarfar11y ago· 4 in thread

Interesting and upvoted even if just a sales pitch.

But a keeping-the-sauce-sectret aproach won't win over techies! We want to know how it works, why it works, what it can do and what it can't do!

Looking forward to the white papers and talks :)

PS name is very close the Google's Vitess?

justinsb11y ago

To be fair, they've made it very clear how it works:

1) They have optimized the (existing?) CSV import code to use SSE instructions, for faster CSV import

It's a great accomplishment; I do wish they had contributed it to Postgres, but I can't blame them for not doing so (they need to eat!).

ris11y ago

"I do wish they had contributed it to Postgres, but I can't blame them for not doing so (they need to eat!)."

Stop this. The two are not mutually exclusive. Postgres developers are not starving.

Keats11y ago

Vitesse means speed in french, so it's not a made up word

Zikes11y ago

I also thought this was somehow related to Google's Vitess. I was rather excited at the prospect of Postgres compatibility, which I know they've been working on, and which we use in our business.

Thaxll11y ago· 4 in thread

From my experience I/O is more important than CPU for DB, I've never seen the 24 cores of our servers having problems, however the raid 10...

justinsb11y ago

swasheck11y ago

your point that great strides have been made in storage such that we could be back at cpu as a bottleneck is well taken though.

jcampbell111y ago

I assume you are using cheap spinning disks, and not something like a fusion-io drive. How big is your dataset?

swasheck11y ago

that's a narrow assumption. why not get more information from the person instead of simply attacking a straw man.

1 more reply

yawniek11y ago· 2 in thread

isn't usually the transfer of data between memory and cpu the bottle neck and not the cpu itself?

e.g. http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

spacemanmatt11y ago

analogAndroid11y ago

florianfunke11y ago· 1 in thread

EDIT: It also has NUMA-aware multi-threaded query execution.

[1] http://www.hyper-db.com

[2] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

[3] http://www.vldb.org/pvldb/vol6/p1702-muehlbauer.pdf

[4] http://databasearchitects.blogspot.de/2014/05/trying-out-hyp...

sbrakerm11y ago

Looks like the Vitesse product was inspired by this research. Even though the HyPer looks a lot more mature (though it's still a research project). Looking forward to see the product.

peatmoss11y ago· 1 in thread

I wonder how well this approach would extend to Postgres add ons like PostGIS and PGRouting?

cordite11y ago

hendzen11y ago

And what is old, is new again. For reference, the System R database used this technique of compiling SQL queries to machine language in the 70's:

See: http://www.cs.berkeley.edu/~brewer/cs262/SystemR-comments.pd...

mfenniak11y ago

I think this is a fascinating approach to improving query speed. It certainly won't be applicable to every use-case, but it seems like there's a lot of value in it.

twic11y ago

There was a vaguely similar thing a while ago called PGStrom, which compiled qualifiers into CUDA code which ran on the GPU:

https://wiki.postgresql.org/wiki/PGStrom

Someone got similar performance just using multithreaded native code on the CPU:

http://blog.notapaper.de/article5.html

uberneo11y ago

Shard-Query does the same work of utilising all the available CPUs but its for MySQL - https://github.com/greenlion/swanhart-tools/tree/master/shar...

bfrog11y ago

I would love to see how its done one day, though I get the need to make back the investment in time!

jeffdavis11y ago

Menu button doesn't seem to work in Firefox on android.

j / k navigate · click thread line to collapse