I have a hobby project to write an analytics DB that uses ISPC for vectorized execution. There isn't much to it yet (sums are really easy), but I do wonder whether ISPC could reduce the effort needed to vectorize these sorts of things.
It's always good practice to dig into a deep project like this with some napkin estimates of how much you stand to gain, and how much overhead you can afford to spend setting yourself up for the faster computation. (Not to mention how much of your own time is merited!)
Essentially it's turning:
LOAD
DISPATCH
OP1
DISPATCH
OP2
... (once per operation in the expression)
STORE
... (once per row)
into
DISPATCH
LOAD
OP1
STORE
LOAD
OP1
STORE
... (once per row)
DISPATCH
... (once per operation in the expression)
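To make the two shapes above concrete, here is a rough C sketch, assuming the expression is (x + 1) * 2 over a float column. All the names (scalar_op, batch_op, eval_row_at_a_time, eval_batch_at_a_time) are made up for illustration and aren't taken from any real engine:

```c
#include <stddef.h>

/* Hypothetical operator types, for illustration only. */
typedef float (*scalar_op)(float);                               /* one value per call */
typedef void (*batch_op)(const float *in, float *out, size_t n); /* one batch per call */

static float add_one(float x)   { return x + 1.0f; }
static float times_two(float x) { return x * 2.0f; }

static void add_one_batch(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = in[i] + 1.0f;  /* LOAD, OP, STORE */
}
static void times_two_batch(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = in[i] * 2.0f;  /* LOAD, OP, STORE */
}

/* First shape: an indirect call (DISPATCH) per operation, per row. */
void eval_row_at_a_time(const scalar_op *ops, size_t n_ops,
                        const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float v = in[i];                 /* LOAD */
        for (size_t k = 0; k < n_ops; k++)
            v = ops[k](v);               /* DISPATCH + OP */
        out[i] = v;                      /* STORE */
    }
}

/* Second shape: one DISPATCH per operation, with a tight loop inside it.
   Assumes n_ops >= 1; after the first op the batch is updated in place. */
void eval_batch_at_a_time(const batch_op *ops, size_t n_ops,
                          const float *in, float *out, size_t n) {
    const float *src = in;
    for (size_t k = 0; k < n_ops; k++) {
        ops[k](src, out, n);             /* DISPATCH once per operation */
        src = out;
    }
}
```

The inner loops of the batch operators are the part a compiler (or ISPC) can vectorize, since there's no indirect call inside them.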
The nice trade-off here is that you don't require code generation to do that, but it's still not optimal. If you can generate code, it's even better to fuse the operations, to get something like:
LOAD
OP1
OP2
...
STORE
LOAD
...
It helps because even though you can tune your batch size to get mostly cached loads and stores, they're still not free. For example, on Haswell you can only issue 1 store per cycle, so if every OP is a single add followed by its own store, you get at most one add per cycle and you're leaving up to 3/4 of your theoretical ALU throughput on the table.
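As a rough illustration of the difference, here is a C sketch of the unfused and fused forms, again assuming the expression is (x + 1) * 2 over floats. The function names and the tmp buffer are hypothetical; a real code generator would emit the fused loop from the expression tree:

```c
#include <stddef.h>

/* Unfused: one pass per operation. Every op stores its result and the
   next op loads it back, so each row costs 2 loads and 2 stores. */
void eval_unfused(const float *in, float *tmp, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) tmp[i] = in[i] + 1.0f;   /* LOAD, OP1, STORE */
    for (size_t i = 0; i < n; i++) out[i] = tmp[i] * 2.0f;  /* LOAD, OP2, STORE */
}

/* Fused: the intermediate stays in a register, so each row costs
   1 load and 1 store no matter how many ops are in the expression. */
void eval_fused(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = (in[i] + 1.0f) * 2.0f;                     /* LOAD, OP1, OP2, STORE */
}
```

With k operations in the expression, the unfused form does k stores per row and the fused form does one, which is where the store-port limit stops being the bottleneck.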