Vectorization in OLAP Databases (opens in new tab)

(aneesh.mataroa.blog)

48 pointsalterneesh4y ago18 comments

18 comments

13 comments · 5 top-level

lidavidm4y ago· 3 in thread

The paper "Everything you always wanted to know about compiled and vectorized queries but were afraid to ask" [1] goes into this in much more detail and also compares vectorization with an alternative approach.

[1]: https://dl.acm.org/doi/abs/10.14778/3275366.3284966

gavinray4y ago

Came here to post this, fantastic paper and probably the most comprehensive thing on the internet available about this

Something else I've taken away from research about columnar and vectorized databases: there doesn't seem a good reason why they aren't fit for OLTP workloads

Analytics from SaaS/line-of-business apps shows about a 90/10 read/write ratio for CRUD apps.

Pavlo et al. have a solid paper on an HTAP database that compares OLTP, OLAP, and a novel HTAP storage on varying read/write workloads:

  > "Arulraj, J., Pavlo, A., & Menon, P. (2016). Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads. Proceedings of the 2016 International Conference on Management of Data - SIGMOD  ’16. doi:10.1145/2882903.2915231 "

See "Figure 17: Concurrent Hybrid Workloads – The impact of the storage layout on the query processing time under different concurrent hybrid workloads."

The chart contains 4 workload types, one of which is "• Read-Heavy: 90% scans, 10% inserts"

You can see that the columnar storage model outperforms OLTP N-ary storage here

So my question is -- why don't people use columnar databases for CRUD apps?

AdamProut4y ago

For scan heavy apps columnstores will be much much faster. For simple CRUD apps that just want to read/write a few rows at a time (with high concurrency) they have a number of inefficiencies vs a rowstore.

  - Columnstore often don't support indexing at all (or have 
  weak support for it).  This means a scan is needed to find 
  any row (with min/max or segment elimination to avoid 
  opening up files with no matching rows at all).  Even if 
  this scan is very fast, its still going to use up more CPU 
  then an index seek.

  - Most columnstores don't use compression schemes that are 
  incremental.  To grab a single row the columnstore likely 
  decompresses many adjacent rows (could be millions of rows 
  - depends on the particular columnstore).

  - Most columnstores don't support fine grained locking 
  (row level locking).  They often lock the entire table or 
  an entire segment (millions?) of rows whenever a single 
  row is written to.  This damages concurrency.

  - The row reconstruction costs are high (gluing the 
  columns back together)

We have a SingleStoreDB paper in this years SIGMOD that goes into these implementation details (not publicly published just yet). The columnstore in SinglestoreDB is reasonably good at OLTP without sacrificing its OLAP performance (see some benchmarking results here: https://www.singlestore.com/blog/tpc-benchmarking-results/).

2 more replies

mamcx4y ago

Let's assume we want: Where are that databases?

Making one like that (with a hybrid model like PAX) with other things I consider important to improve the situation (like algebraic type support, full relational, etc) are desirable, but FUNDING that is other thing. Make a database is major effort and is not easy to get the proper support for it.

But I killed to have the chance, and I think the market is ready for go beyond the current crop of NoSql/NewSql...

snidane4y ago· 2 in thread

Isn't the typical big data sql task IO bound?

Vectorization only works when you have a table stored in an optimized columnar format and compute an run a function over a column or to combine multiple columns.

The moment you throw in group bys or windows the data turns into rows that you read from a hash table or after a sort - at which point you lose all opportunities of vectorization.

Since group bys break vectorization, the other use case is for map or reduce (sums, counts) operations over the entire table. In absence of filters you can precompute these for each column.

Plain map or sum like operations in presence of a filter is the only real use case for vectorization in OLAP, if I'm not missing anything.

In that case you need to implement the vectorized operation to work across together with a mask, so that you don't include the filtered out values, and over compressed data, otherwise you're wasting time on bringing the data from disk closer to cpu.

Most general big data sql tasks will not gain significant improvement using vectorization, unless they specialize on map after filter, no group bys, operations, such as perhaps log processing.

Vectorization and other kinds of hardware acceleration is highly useful for small array data that fits into memory such as geo data, APL, numpy, tensors on TPU processing and similar stuff.

jandrewrogers4y ago

In a well-designed system, you will typically be limited by effective bandwidth, often memory bandwidth or efficient use thereof which is an area where vectorization can help. Modern servers have tremendous storage bandwidth if you have an I/O scheduler capable of using it. Some newer database engines explicitly reject the assumption that storage throughput is precious as a design constraint, since it has become much less true over time due to advances in hardware.

Use of page layouts highly-optimized for vectorized evaluation is common now even if the implementation isn't vectorized. You lose nothing on modern hardware (they are good layouts regardless) and it allows you to easily do vector optimizations later. As a semantic distinction, columnar and vector layouts are organized differently and optimize for somewhat different things even though they have superficially similar appearance. Classic DSM-style columnar is largely obsolete.

Vectorization, first and foremost, is about optimizing selection operations in a database, but it can provide assists in other areas like joins, sorts, and aggregates. Most queries are a composed from these primitives, so many parts of the query plan may benefit. As a heuristic, operations that GPU databases excel at are the same kinds of operations that benefit from vectorization.

Obviously you can't just throw vectorization at an arbitrary database and expect major benefits, they need to be intentionally designed for it.

alterneeshOP4y ago

I can't seem to understand why vectorization wouldn't help, say if you read after a sort. Irrespective of whether it fits in memory, or you perform some sort of an external sort, any operation that you want to perform on top of that sorted vector, be it an aggregation to reduce it, or an arithmetic operation with another column, you could still leverage vectorization and would end up using fewer CPU cycles, no?

brutuscat4y ago· 2 in thread

isn't it what Apache Arrow [1], Apache CarbonData [2] and others are for?

[1] https://arrow.apache.org/docs/format/Columnar.html

[2] https://carbondata.apache.org/introduction.html

gavinray4y ago

The bit in Arrow closest to this would probably be Gandiva I think:

  > Gandiva - Vectorized processing for Apache Arrow

https://arrow.apache.org/blog/2018/12/05/gandiva-donation/

It's wicked neat tech.

profquail4y ago

I think it’s more like Gandiva or DataFusion (both from the Apache Arrow project).

einpoklum4y ago· 1 in thread

> A recent trend

It's not at all a recent trend, it goes back at least as far as the early 2000s. Also, the obvious next step from thinking about processing a few tuples together, is transposing your view of DB tables all the way, to consider _columns_ rather than _tuples_. This makes a lot of sense when you're processing analytic queries rather than transactions, which typically modify individual tuples. And _that_ view goes back at least another decade earlier if not more.

See my own brief historical survey on the matter in the intro my monograph regarding a computational model for column-oriented analytic engines:

https://arxiv.org/abs/1904.12217

(this is a bit of a shameless self-plug because other, much more distinguished people than myself have written about this subject.)

alterneeshOP4y ago

Fair point about the "recentness". I guess it seemed recent to me since I learned about it only a few months back :)

Also, yes, 100%. Columnar storage fits well with vectorized execution! Thanks for sharing

pstoll4y ago

And also some good articles by the CockroachDB folks on vectorization.

https://www.cockroachlabs.com/blog/vectorized-hash-joiner/

https://www.cockroachlabs.com/blog/vectorizing-the-merge-joi...

j / k navigate · click thread line to collapse

18 comments

13 comments · 5 top-level

lidavidm4y ago· 3 in thread

[1]: https://dl.acm.org/doi/abs/10.14778/3275366.3284966

gavinray4y ago

Came here to post this, fantastic paper and probably the most comprehensive thing on the internet available about this

Something else I've taken away from research about columnar and vectorized databases: there doesn't seem a good reason why they aren't fit for OLTP workloads

Analytics from SaaS/line-of-business apps shows about a 90/10 read/write ratio for CRUD apps.

Pavlo et al. have a solid paper on an HTAP database that compares OLTP, OLAP, and a novel HTAP storage on varying read/write workloads:

  > "Arulraj, J., Pavlo, A., & Menon, P. (2016). Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads. Proceedings of the 2016 International Conference on Management of Data - SIGMOD  ’16. doi:10.1145/2882903.2915231 "

See "Figure 17: Concurrent Hybrid Workloads – The impact of the storage layout on the query processing time under different concurrent hybrid workloads."

The chart contains 4 workload types, one of which is "• Read-Heavy: 90% scans, 10% inserts"

You can see that the columnar storage model outperforms OLTP N-ary storage here

So my question is -- why don't people use columnar databases for CRUD apps?

AdamProut4y ago

  - Columnstore often don't support indexing at all (or have 
  weak support for it).  This means a scan is needed to find 
  any row (with min/max or segment elimination to avoid 
  opening up files with no matching rows at all).  Even if 
  this scan is very fast, its still going to use up more CPU 
  then an index seek.

  - Most columnstores don't use compression schemes that are 
  incremental.  To grab a single row the columnstore likely 
  decompresses many adjacent rows (could be millions of rows 
  - depends on the particular columnstore).

  - Most columnstores don't support fine grained locking 
  (row level locking).  They often lock the entire table or 
  an entire segment (millions?) of rows whenever a single 
  row is written to.  This damages concurrency.

  - The row reconstruction costs are high (gluing the 
  columns back together)

2 more replies

mamcx4y ago

Let's assume we want: Where are that databases?

But I killed to have the chance, and I think the market is ready for go beyond the current crop of NoSql/NewSql...

snidane4y ago· 2 in thread

Isn't the typical big data sql task IO bound?

Vectorization only works when you have a table stored in an optimized columnar format and compute an run a function over a column or to combine multiple columns.

The moment you throw in group bys or windows the data turns into rows that you read from a hash table or after a sort - at which point you lose all opportunities of vectorization.

Since group bys break vectorization, the other use case is for map or reduce (sums, counts) operations over the entire table. In absence of filters you can precompute these for each column.

Plain map or sum like operations in presence of a filter is the only real use case for vectorization in OLAP, if I'm not missing anything.

Most general big data sql tasks will not gain significant improvement using vectorization, unless they specialize on map after filter, no group bys, operations, such as perhaps log processing.

Vectorization and other kinds of hardware acceleration is highly useful for small array data that fits into memory such as geo data, APL, numpy, tensors on TPU processing and similar stuff.

jandrewrogers4y ago

Obviously you can't just throw vectorization at an arbitrary database and expect major benefits, they need to be intentionally designed for it.

alterneeshOP4y ago

brutuscat4y ago· 2 in thread

isn't it what Apache Arrow [1], Apache CarbonData [2] and others are for?

[1] https://arrow.apache.org/docs/format/Columnar.html

[2] https://carbondata.apache.org/introduction.html

gavinray4y ago

The bit in Arrow closest to this would probably be Gandiva I think:

  > Gandiva - Vectorized processing for Apache Arrow

https://arrow.apache.org/blog/2018/12/05/gandiva-donation/

It's wicked neat tech.

profquail4y ago

I think it’s more like Gandiva or DataFusion (both from the Apache Arrow project).

einpoklum4y ago· 1 in thread

> A recent trend

See my own brief historical survey on the matter in the intro my monograph regarding a computational model for column-oriented analytic engines:

https://arxiv.org/abs/1904.12217

(this is a bit of a shameless self-plug because other, much more distinguished people than myself have written about this subject.)

alterneeshOP4y ago

Fair point about the "recentness". I guess it seemed recent to me since I learned about it only a few months back :)

Also, yes, 100%. Columnar storage fits well with vectorized execution! Thanks for sharing

pstoll4y ago

And also some good articles by the CockroachDB folks on vectorization.

https://www.cockroachlabs.com/blog/vectorized-hash-joiner/

https://www.cockroachlabs.com/blog/vectorizing-the-merge-joi...

j / k navigate · click thread line to collapse