Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.
Or in other words: i+7 can depend on i-1 with no problem.
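A small sketch of that rule in C, assuming a made-up recurrence where each element depends only on the one eight positions back (the name `fill_stride8` and the `step` parameter are hypothetical, purely for illustration). Inside a block of 8 nothing depends on anything else, so a compiler is free to do the inner loop as one 8-wide SIMD add:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: x[i] depends only on x[i-8].
   Elements i .. i+7 never read each other, so the inner loop
   has no dependency between its 8 iterations. */
void fill_stride8(double *x, size_t n, double step)
{
    /* Keep the sketch simple: whole blocks of 8 only.
       The caller must have seeded x[0..7] already. */
    assert(n >= 8 && n % 8 == 0);

    for (size_t i = 8; i < n; i += 8) {
        for (size_t k = 0; k < 8; k++) {
            /* x[i+7] reads x[i-1]: a dependency, but 8 elements back. */
            x[i + k] = x[i + k - 8] + step;
        }
    }
}
```

Whether the compiler actually vectorizes this depends on the flags and target, so it is worth checking the generated assembly rather than assuming.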
> Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.
Do you mean like this? I got this to run about as fast as the first "unoptimized" version in the SO post, but not faster.
void compute()
{
    const double A = 1.1, B = 2.2, C = 3.3;
    const double A128 = 128*A;
    double Y[8], Z[8];
    Y[0] = C;
    Y[1] = A + B + C;
    Y[2] = 4*A + 2*B + C;
    Y[3] = 9*A + 3*B + C;
    Y[4] = 16*A + 4*B + C;
    Y[5] = 25*A + 5*B + C;
    Y[6] = 36*A + 6*B + C;
    Y[7] = 49*A + 7*B + C;
    Z[0] = 64*A + 8*B;
    Z[1] = 80*A + 8*B;
    Z[2] = 96*A + 8*B;
    Z[3] = 112*A + 8*B;
    Z[4] = 128*A + 8*B;
    Z[5] = 144*A + 8*B;
    Z[6] = 160*A + 8*B;
    Z[7] = 176*A + 8*B;
    int i;
    for(i=0; i<LEN; i+=8) {
        data[i  ] = Y[0];
        data[i+1] = Y[1];
        data[i+2] = Y[2];
        data[i+3] = Y[3];
        data[i+4] = Y[4];
        data[i+5] = Y[5];
        data[i+6] = Y[6];
        data[i+7] = Y[7];
        Y[0] += Z[0];
        Y[1] += Z[1];
        Y[2] += Z[2];
        Y[3] += Z[3];
        Y[4] += Z[4];
        Y[5] += Z[5];
        Y[6] += Z[6];
        Y[7] += Z[7];
        Z[0] += A128;
        Z[1] += A128;
        Z[2] += A128;
        Z[3] += A128;
        Z[4] += A128;
        Z[5] += A128;
        Z[6] += A128;
        Z[7] += A128;
    }
}

Yeah, something like that. I haven't double-checked your math, but the idea is what I was going for.
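For anyone who does want to check the math: assuming the code is meant to produce f(i) = A*i*i + B*i + C (which is what the Y[] initial values suggest), the per-lane step should be f(k+8) - f(k) = A*(16k + 64) + 8B, and that step should itself grow by 128*A each round. A rough sketch that compares the add-only scheme against the direct formula; the function names here are invented for the check and are not from the SO post:

```c
#include <assert.h>
#include <stddef.h>

/* Fill out[0..n) using the same 8-lane, add-only scheme as above. */
void fill_incremental(double *out, size_t n, double A, double B, double C)
{
    assert(n % 8 == 0);
    double Y[8], Z[8];
    for (int k = 0; k < 8; k++) {
        Y[k] = A * k * k + B * k + C;      /* f(k)            */
        Z[k] = A * (16 * k + 64) + 8 * B;  /* f(k+8) - f(k)   */
    }
    const double A128 = 128 * A;           /* growth of Z per round */

    for (size_t i = 0; i < n; i += 8) {
        for (int k = 0; k < 8; k++) {
            out[i + k] = Y[k];
            Y[k] += Z[k];
            Z[k] += A128;
        }
    }
}

/* Largest scaled difference between out[] and the direct formula. */
double max_scaled_error(const double *out, size_t n,
                        double A, double B, double C)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x = (double)i;
        double exact = A * x * x + B * x + C;
        double diff = out[i] - exact;
        double mag = exact < 0 ? -exact : exact;
        if (diff < 0) diff = -diff;
        diff /= (mag + 1.0);               /* +1 avoids dividing by zero */
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

With integer-valued A, B, C every intermediate is exact, so the error should be exactly zero; with 1.1, 2.2, 3.3 a small rounding drift is expected because the additions accumulate error.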
I'm always "surprised" by the fact that CPUs care more about throughput (bandwidth) than latency these days. A lot of CPUs (Intel, AMD, ARM, etc.) can start 1 or even 2 SIMD multiplications per clock tick, even though each one takes roughly 3 to 5 clock ticks to complete.
I guess the original "simple" code may have had a multiply in there, but that's not a big deal these days (throughput-wise), even though it's a big deal latency-wise.
So getting rid of those multiplies and cutting down the latency (i.e. using only add statements) barely helps at all, maybe with no measurable difference.
One of these days, I'll actually remember that fact, lol.
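To make the throughput-versus-latency point concrete, here is a sketch of the two loop shapes, assuming the quadratic A*i*i + B*i + C from the code above. The function names are invented, and `fill_direct` is only a guess at what the "simple" version looks like, not the SO post's actual code. The direct version multiplies twice per element, but no iteration waits on another; the add-only version uses cheaper operations, but each step has to wait for the previous one:

```c
#include <assert.h>
#include <stddef.h>

/* Direct evaluation: two multiplies per element, but every
   iteration is independent, so the CPU can overlap many of them. */
void fill_direct(double *out, size_t n, double A, double B, double C)
{
    assert(out != NULL);
    for (size_t i = 0; i < n; i++) {
        double x = (double)i;
        out[i] = (A * x + B) * x + C;   /* Horner form of A*x*x + B*x + C */
    }
}

/* Add-only evaluation: no multiplies in the loop, but y and z
   form one serial chain, so each add waits for the one before it. */
void fill_chained(double *out, size_t n, double A, double B, double C)
{
    assert(out != NULL);
    double y = C;               /* f(0)                         */
    double z = A + B;           /* f(1) - f(0)                  */
    const double a2 = 2 * A;    /* constant second difference   */
    for (size_t i = 0; i < n; i++) {
        out[i] = y;
        y += z;
        z += a2;
    }
}
```

Which one wins depends on the CPU, the compiler, and the flags, so the only reliable answer is to time both.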
4x 64-bit is 256-bit, which requires special compiler flags to enable 256-bit AVX (for doubles, plain AVX is enough; AVX2 mainly adds 256-bit integer operations), but most x86 CPUs should support it these days.
2x 64-bit is 128-bit, which fits in the 128-bit SSE2 registers that GCC and Visual Studio target by default on x86-64.
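As a rough illustration of the 2x 64-bit case, here is a sketch that adds two arrays of doubles two at a time with SSE2 intrinsics. The function name `add_pairs` is made up for this example, and the preprocessor guard is an assumption about how to detect SSE2; on targets without it the code simply falls back to the scalar loop:

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2: __m128d holds two doubles */
#define HAVE_SSE2 1
#endif

/* dst[i] = a[i] + b[i], two doubles per instruction where possible. */
void add_pairs(double *dst, const double *a, const double *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#ifdef HAVE_SSE2
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);          /* unaligned load of 2 doubles */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(dst + i, _mm_add_pd(va, vb));
    }
#endif
    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

The 256-bit version would use `__m256d` and the `_mm256_*_pd` intrinsics from `<immintrin.h>` instead, and that is where the extra compiler flag (for example `-mavx` with GCC) comes in.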
I think there's some GCC option that enables these "dangerous" optimizations. -ffast-math, or something like that?
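Assuming the "dangerous" optimizations meant here are floating-point reorderings (which -ffast-math does permit in GCC), this sketch shows why a compiler won't do them by default: the same three numbers summed in a different order give a different answer. The function names are invented for the example, and `volatile` is used only to force each intermediate to be rounded to a double:

```c
#include <assert.h>

/* (a + b) + c, with the intermediate rounded to double. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* a + (b + c), with the intermediate rounded to double. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}

/* With a = 1e16, b = -1e16, c = 1.0:
   left:  (1e16 - 1e16) + 1.0 = 1.0
   right: -1e16 + 1.0 rounds back to -1e16 (doubles near 1e16 are
          spaced 2 apart), so the total is 0.0. */
void check_orders_differ(void)
{
    assert(sum_left(1e16, -1e16, 1.0) != sum_right(1e16, -1e16, 1.0));
}
```

Vectorizing a running sum changes the order of the additions in exactly this way, which is why the compiler needs explicit permission to do it.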
I really don't see how that improves this.
You can only calculate i+8; to calculate i+9 you depend on i+8. And you can't go in strides either, since i+16 depends on i+15, which you haven't calculated yet, unless you want to intermix the stateful and non-stateful code. I'd rather not go there.