Faster hash joiner with vectorized execution (opens in new tab)

(cockroachlabs.com)

120 pointsjordanlewis7y ago19 comments

19 comments

19 comments · 9 top-level

angelachang277y ago· 5 in thread

Hey everyone, I'm the intern who did this work, happy to answer any questions if you have them!

fwip7y ago

Incredible work! 40x speedup is ludicrously good.

angelachang277y ago

Thank you!

DLA7y ago

Your work is awesome as is your excellent detailed writeup. Appreciate the links and refs to the background papers also. Thank you for helping make CRDB even better. You should be very proud of this work.

angelachang277y ago

curiousDog7y ago

Excellent work and write-up. Very happy to see the next generation of engineers are top notch like you!

nimish7y ago· 2 in thread

Awesome work. Really cool!

I have a hobby project to write an analytics DB that uses ISPC for vectorized execution. Currently not much (sums are real easy) but I really wonder if it could reduce the effort to vectorize these sorts of things.

gary__7y ago

Is it open source / online?

nimish7y ago

Nope, and by toy I really mean toy. It can handle a single type of query and that's it. Maybe one day I'll polish it up.

evrydayhustling7y ago· 1 in thread

Pretty impactful work for an intern! One thing I would have liked to see, both from a process and communication standpoint, is leading with some stats on how much faster the vectorized loops are in isolation from the surrounding engine.

It's always good practice to dig into a deep project like this with some napkin estimates of how much you stand to gain, and how much overhead you can afford to spend setting yourself up for the faster computation. (Not to mention how much of your own time is merited!)

asubiotto7y ago

Indeed! We spent some time running isolated benchmarks before deciding to dive into the project: https://github.com/jordanlewis/exectoy

blr2467y ago· 1 in thread

Great write-up. Is the long-term vision to go completely to the vectorised query execution model, or are there cases where a row-oriented plan might be better, such as cases when there are complex computations involving multiple columns of a single row?

jordanlewisOP7y ago

We don't have any plans to remove the row-by-row execution engine. Likely, we'll have some analysis during planning that can inform whether to use the row-oriented or column-oriented engine. I think the use cases for the row-oriented engine are exactly what you mention - things like single-row computations or more OLTP queries like point scans, inserts, update etc - where the overhead of setting up the data structures required to use the column-oriented engine would dominate.

sAbakumoff7y ago· 1 in thread

I looked at the title and thought that the article is about rolling joints with hash by using some advanced robotic technology and software. Damn!

ddorian437y ago

Pretty sure that they can use same tech as cigarettes.

obl7y ago

The "core" of the trick is nice : amortizing interpreter dispatch over many items. (ignoring the column layout/SIMD stuff which basically helps in any case)

Essentially it's turning :

  LOAD
    DISPATCH
    OP1
    DISPATCH
    OP2
    ... (once per operation in the expression)
  STORE
  ... (once per row)

into

  DISPATCH
    LOAD
    OP1
    STORE
    LOAD
    OP1
    STORE
    ... (once per row)
  DISPATCH
  ... (once per operation in the expression)

The nice trade-off here is that you don't require code generation to do that, but it's still not optimal.

If you can generate code it's even better to fuse the operations, to get something like :

  LOAD
    OP1
    OP2
    ...
  STORE
  LOAD
  ...

It helps because even though you can tune your batch size to get mostly cached loads and stores, it's still not free.

For example on Haswell you can only issue 1 store per cycle, so if OP is a single add you're leaving up to 3/4 of your theoretical ALU throughput on the table.

andonisus7y ago

It was a pleasant surprise to see the code base written in Go!

jnordwick7y ago

Column oriented dbs have been doing parallel (column per thread) joins for a while now, no? And I know they have been leaning heavily on vectorization for over a decade.

JanecekPetr7y ago

Genetic specialization (List<int>) is coming. See https://openjdk.java.net/projects/valhalla/. Not this year (value types will be first). But they're working in it really hard.

j / k navigate · click thread line to collapse

19 comments

19 comments · 9 top-level

angelachang277y ago· 5 in thread

Hey everyone, I'm the intern who did this work, happy to answer any questions if you have them!

fwip7y ago

Incredible work! 40x speedup is ludicrously good.

angelachang277y ago

Thank you!

DLA7y ago

angelachang277y ago

curiousDog7y ago

Excellent work and write-up. Very happy to see the next generation of engineers are top notch like you!

nimish7y ago· 2 in thread

Awesome work. Really cool!

gary__7y ago

Is it open source / online?

nimish7y ago

Nope, and by toy I really mean toy. It can handle a single type of query and that's it. Maybe one day I'll polish it up.

evrydayhustling7y ago· 1 in thread

asubiotto7y ago

Indeed! We spent some time running isolated benchmarks before deciding to dive into the project: https://github.com/jordanlewis/exectoy

blr2467y ago· 1 in thread

jordanlewisOP7y ago

sAbakumoff7y ago· 1 in thread

I looked at the title and thought that the article is about rolling joints with hash by using some advanced robotic technology and software. Damn!

ddorian437y ago

Pretty sure that they can use same tech as cigarettes.

obl7y ago

The "core" of the trick is nice : amortizing interpreter dispatch over many items. (ignoring the column layout/SIMD stuff which basically helps in any case)

Essentially it's turning :

  LOAD
    DISPATCH
    OP1
    DISPATCH
    OP2
    ... (once per operation in the expression)
  STORE
  ... (once per row)

into

  DISPATCH
    LOAD
    OP1
    STORE
    LOAD
    OP1
    STORE
    ... (once per row)
  DISPATCH
  ... (once per operation in the expression)

The nice trade-off here is that you don't require code generation to do that, but it's still not optimal.

If you can generate code it's even better to fuse the operations, to get something like :

  LOAD
    OP1
    OP2
    ...
  STORE
  LOAD
  ...

It helps because even though you can tune your batch size to get mostly cached loads and stores, it's still not free.

For example on Haswell you can only issue 1 store per cycle, so if OP is a single add you're leaving up to 3/4 of your theoretical ALU throughput on the table.

andonisus7y ago

It was a pleasant surprise to see the code base written in Go!

jnordwick7y ago

Column oriented dbs have been doing parallel (column per thread) joins for a while now, no? And I know they have been leaning heavily on vectorization for over a decade.

JanecekPetr7y ago

Genetic specialization (List<int>) is coming. See https://openjdk.java.net/projects/valhalla/. Not this year (value types will be first). But they're working in it really hard.

j / k navigate · click thread line to collapse