Something else I've taken away from research about columnar and vectorized databases: there doesn't seem a good reason why they aren't fit for OLTP workloads
Analytics from SaaS/line-of-business apps shows about a 90/10 read/write ratio for CRUD apps.
Pavlo et al. have a solid paper on an HTAP database that compares OLTP, OLAP, and a novel HTAP storage on varying read/write workloads:
> "Arulraj, J., Pavlo, A., & Menon, P. (2016). Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads. Proceedings of the 2016 International Conference on Management of Data - SIGMOD ’16. doi:10.1145/2882903.2915231 "
See "Figure 17: Concurrent Hybrid Workloads – The impact of the storage layout on the query processing time under different concurrent hybrid workloads."The chart contains 4 workload types, one of which is "• Read-Heavy: 90% scans, 10% inserts"
You can see that the columnar storage model outperforms OLTP N-ary storage here
So my question is -- why don't people use columnar databases for CRUD apps?
- Columnstore often don't support indexing at all (or have
weak support for it). This means a scan is needed to find
any row (with min/max or segment elimination to avoid
opening up files with no matching rows at all). Even if
this scan is very fast, its still going to use up more CPU
then an index seek.
- Most columnstores don't use compression schemes that are
incremental. To grab a single row the columnstore likely
decompresses many adjacent rows (could be millions of rows
- depends on the particular columnstore).
- Most columnstores don't support fine grained locking
(row level locking). They often lock the entire table or
an entire segment (millions?) of rows whenever a single
row is written to. This damages concurrency.
- The row reconstruction costs are high (gluing the
columns back together)
We have a SingleStoreDB paper in this years SIGMOD that goes into these implementation details (not publicly published just yet). The columnstore in SinglestoreDB is reasonably good at OLTP without sacrificing its OLAP performance (see some benchmarking results here: https://www.singlestore.com/blog/tpc-benchmarking-results/).Making one like that (with a hybrid model like PAX) with other things I consider important to improve the situation (like algebraic type support, full relational, etc) are desirable, but FUNDING that is other thing. Make a database is major effort and is not easy to get the proper support for it.
But I killed to have the chance, and I think the market is ready for go beyond the current crop of NoSql/NewSql...
Vectorization only works when you have a table stored in an optimized columnar format and compute an run a function over a column or to combine multiple columns.
The moment you throw in group bys or windows the data turns into rows that you read from a hash table or after a sort - at which point you lose all opportunities of vectorization.
Since group bys break vectorization, the other use case is for map or reduce (sums, counts) operations over the entire table. In absence of filters you can precompute these for each column.
Plain map or sum like operations in presence of a filter is the only real use case for vectorization in OLAP, if I'm not missing anything.
In that case you need to implement the vectorized operation to work across together with a mask, so that you don't include the filtered out values, and over compressed data, otherwise you're wasting time on bringing the data from disk closer to cpu.
Most general big data sql tasks will not gain significant improvement using vectorization, unless they specialize on map after filter, no group bys, operations, such as perhaps log processing.
Vectorization and other kinds of hardware acceleration is highly useful for small array data that fits into memory such as geo data, APL, numpy, tensors on TPU processing and similar stuff.
Use of page layouts highly-optimized for vectorized evaluation is common now even if the implementation isn't vectorized. You lose nothing on modern hardware (they are good layouts regardless) and it allows you to easily do vector optimizations later. As a semantic distinction, columnar and vector layouts are organized differently and optimize for somewhat different things even though they have superficially similar appearance. Classic DSM-style columnar is largely obsolete.
Vectorization, first and foremost, is about optimizing selection operations in a database, but it can provide assists in other areas like joins, sorts, and aggregates. Most queries are a composed from these primitives, so many parts of the query plan may benefit. As a heuristic, operations that GPU databases excel at are the same kinds of operations that benefit from vectorization.
Obviously you can't just throw vectorization at an arbitrary database and expect major benefits, they need to be intentionally designed for it.
> Gandiva - Vectorized processing for Apache Arrow
https://arrow.apache.org/blog/2018/12/05/gandiva-donation/It's wicked neat tech.
It's not at all a recent trend, it goes back at least as far as the early 2000s. Also, the obvious next step from thinking about processing a few tuples together, is transposing your view of DB tables all the way, to consider _columns_ rather than _tuples_. This makes a lot of sense when you're processing analytic queries rather than transactions, which typically modify individual tuples. And _that_ view goes back at least another decade earlier if not more.
See my own brief historical survey on the matter in the intro my monograph regarding a computational model for column-oriented analytic engines:
https://arxiv.org/abs/1904.12217
(this is a bit of a shameless self-plug because other, much more distinguished people than myself have written about this subject.)
Also, yes, 100%. Columnar storage fits well with vectorized execution! Thanks for sharing
https://www.cockroachlabs.com/blog/vectorized-hash-joiner/
https://www.cockroachlabs.com/blog/vectorizing-the-merge-joi...