0 points · dragontamer · 3y ago · 0 comments

Unroll the dependency until you are longer than the SIMD width.

Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.

Or in other words: i+7 can depend on i-1 with no problem.
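
A minimal sketch of that idea, using a hypothetical running-sum recurrence rather than the code from the SO post. In the serial form, `x[i]` depends on `x[i-1]`. Rewriting it so that `x[i]` depends on `x[i-8]` leaves eight consecutive elements independent of each other. The function names and the restriction to lengths that are a multiple of 8 are assumptions made to keep the example short.

```c
#include <stddef.h>

/* Serial form: every element depends on the one right before it. */
void fill_serial(double *x, size_t n, double x0, double d)
{
    if (n == 0) return;
    x[0] = x0;
    for (size_t i = 1; i < n; i++)
        x[i] = x[i - 1] + d;
}

/* Unrolled form: element i depends only on element i-8, so the eight
   lanes are independent and the inner loop is a candidate for SIMD.
   Assumes n is a multiple of 8. */
void fill_stride8(double *x, size_t n, double x0, double d)
{
    double lane[8];
    const double step = 8 * d;          /* distance between x[i] and x[i+8] */

    for (int k = 0; k < 8; k++)
        lane[k] = x0 + k * d;           /* first eight values, one per lane */

    for (size_t i = 0; i < n; i += 8) {
        for (int k = 0; k < 8; k++) {
            x[i + k] = lane[k];
            lane[k] += step;            /* lane k never reads lane k-1 */
        }
    }
}
```

Whether a compiler actually emits vector instructions for the inner loop depends on the target and flags; the sketch only shows the dependency structure.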

sampo · 3y ago

> Unroll the dependency until you are longer than the SIMD width.

> Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.

Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.

    void compute()
    {
        const double A = 1.1, B = 2.2, C = 3.3;
        const double A128 = 128*A;
        double Y[8], Z[8];
    
        Y[0] =               C;
        Y[1] =     A +   B + C;
        Y[2] =   4*A + 2*B + C;
        Y[3] =   9*A + 3*B + C;
        Y[4] =  16*A + 4*B + C;
        Y[5] =  25*A + 5*B + C;
        Y[6] =  36*A + 6*B + C;
        Y[7] =  49*A + 7*B + C;
        Z[0] =  64*A + 8*B;
        Z[1] =  80*A + 8*B;
        Z[2] =  96*A + 8*B;
        Z[3] = 112*A + 8*B;
        Z[4] = 128*A + 8*B;
        Z[5] = 144*A + 8*B;
        Z[6] = 160*A + 8*B;
        Z[7] = 176*A + 8*B;
    
        int i;
        for(i=0; i<LEN; i+=8) {
            data[i  ] = Y[0];
            data[i+1] = Y[1];
            data[i+2] = Y[2];
            data[i+3] = Y[3];
            data[i+4] = Y[4];
            data[i+5] = Y[5];
            data[i+6] = Y[6];
            data[i+7] = Y[7];
            Y[0] += Z[0];
            Y[1] += Z[1];
            Y[2] += Z[2];
            Y[3] += Z[3];
            Y[4] += Z[4];
            Y[5] += Z[5];
            Y[6] += Z[6];
            Y[7] += Z[7];
            Z[0] += A128;
            Z[1] += A128;
            Z[2] += A128;
            Z[3] += A128;
            Z[4] += A128;
            Z[5] += A128;
            Z[6] += A128;
            Z[7] += A128;
        }
    }
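
The constants in the code above are consistent with stepping the quadratic f(i) = A*i*i + B*i + C eight elements at a time: f(i+8) - f(i) = 16*A*i + 64*A + 8*B, which gives the `Z[]` initial values for i = 0..7, and that difference itself grows by 16*A*8 = 128*A per step. The sketch below checks this numerically. That the target is this quadratic is an assumption inferred from the `Y[]` initial values, and `check_recurrence` is an illustrative name, not part of the original code.

```c
/* Runs the same stride-8 recurrence and returns the largest absolute
   deviation from the closed form A*i*i + B*i + C.
   len is assumed to be a multiple of 8. */
double check_recurrence(int len)
{
    const double A = 1.1, B = 2.2, C = 3.3;
    double Y[8], Z[8], max_err = 0.0;

    for (int k = 0; k < 8; k++) {
        Y[k] = A * k * k + B * k + C;        /* f(k)          */
        Z[k] = (64 + 16 * k) * A + 8 * B;    /* f(k+8) - f(k) */
    }

    for (int i = 0; i < len; i += 8) {
        for (int k = 0; k < 8; k++) {
            const double n = i + k;
            double err = Y[k] - (A * n * n + B * n + C);
            if (err < 0) err = -err;
            if (err > max_err) max_err = err;
            Y[k] += Z[k];                    /* advance value by 8 elements */
            Z[k] += 128 * A;                 /* advance first difference    */
        }
    }
    return max_err;
}
```

The returned error is not exactly zero because the repeated additions accumulate rounding error, but it stays tiny relative to the values themselves for moderate lengths.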

dragontamer (OP) · 3y ago

> Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.

Yeah, something like that. I haven't double-checked your math, but the idea is what I was going for.

I'm always "surprised" by the fact that CPUs care more about throughput than latency these days. A lot of CPUs (Intel, AMD, ARM, etc.) can start one or even two SIMD multiplications per clock tick, even though each one takes roughly 5 clock ticks to complete.

I guess the original "simple" code may have had a multiply in there, but that's not a big deal these days (throughput-wise), even though it's a big deal latency-wise.

So getting rid of those multiplies and cutting down the latency (i.e. using only add statements) barely helps at all, maybe with no measurable difference.

One of these days, I'll actually remember that fact, lol.
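
For comparison, the multiply-based form that the add-only recurrence replaces presumably looks like the sketch below. This is an assumption inferred from the `Y[]` initial values in sampo's code, not a copy of the SO post, and `compute_direct` is an illustrative name. Each iteration depends only on `i`, so there is no loop-carried chain and the multiply latency of one iteration can overlap with the next.

```c
#include <stddef.h>

/* Direct evaluation: no iteration reads a value produced by another,
   so multiplies from different iterations can be in flight together. */
void compute_direct(double *data, size_t len)
{
    const double A = 1.1, B = 2.2, C = 3.3;

    for (size_t i = 0; i < len; i++) {
        const double x = (double)i;
        data[i] = A * x * x + B * x + C;
    }
}
```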

btdmaster · 3y ago

On my machine, your code is faster for smaller LEN values. I'm not sure why this is though.

dragontamer (OP) · 3y ago

8x 64-bit is 512-bit, which is designed for AVX512. You'll probably need AVX512 to fully benefit from unrolling x8.

4x 64-bit is 256-bit, which requires non-default compiler flags for 256-bit AVX2 (e.g. -mavx2 on GCC), but most x86 CPUs should support it these days.

2x 64-bit is 128-bit, which fits in 128-bit SSE, the SIMD you get with default GCC / Visual Studio compiler flags.
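
One way to see which of these widths a given build targets is to look at the macros the compiler predefines. The sketch below relies on `__AVX512F__`, `__AVX__` and `__SSE2__`, which GCC and Clang define according to the `-m...` / `-march=` flags in effect; the function name `doubles_per_vector` is illustrative. Note that 256-bit double-precision arithmetic is already available with plain AVX, and enabling AVX2 also defines `__AVX__`.

```c
/* Number of doubles that fit in the widest SIMD register enabled at
   compile time, judged only from predefined compiler macros. */
int doubles_per_vector(void)
{
#if defined(__AVX512F__)
    return 8;   /* 512-bit registers */
#elif defined(__AVX__)
    return 4;   /* 256-bit registers; also defined whenever AVX2 is enabled */
#elif defined(__SSE2__)
    return 2;   /* 128-bit registers, the x86-64 baseline */
#else
    return 1;   /* none of these macros visible (e.g. a non-x86 target) */
#endif
}
```

This reports what the compiler was allowed to use, not what the CPU running the binary supports.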

yongjik · 3y ago

If they were integer variables, I guess the compiler would have done that, but you can't really do that with floats because i+A+A is not necessarily i+2*A. (Of course, in this particular example, the difference doesn't matter for the programmer, but the compiler doesn't know that!)

I think there's some gcc option that enables these "dangerous" optimizations. -ffast-math, or something like that?
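
A small sketch of the "i+A+A is not necessarily i+2*A" point, assuming IEEE 754 doubles with round-to-nearest. Around 1e16 the spacing between adjacent doubles is 2, so adding 1 twice rounds back to the starting value each time, while adding 2 once is exact. The function names are illustrative; `volatile` is there only to force each intermediate result to be rounded and to stop the compiler from folding the arithmetic.

```c
/* x + a + a, with the intermediate sum rounded to double. */
double add_twice(double x, double a)
{
    volatile double t = x + a;
    t = t + a;
    return t;
}

/* x + 2*a, a single addition and a single rounding. */
double add_scaled(double x, double a)
{
    volatile double t = x + 2 * a;
    return t;
}
```

With `x = 1e16` and `a = 1.0` the two functions return different values, which is why a compiler that must preserve floating-point results cannot rewrite one form into the other on its own.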

ummonk · 3y ago

No, the compiler would have been unlikely to be able to figure out the math to coalesce 8 recursive additions into one operation.

zeusk · 3y ago

> as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other

I really don't see how that works in improving this.

You can only calculate i+8; for calculating i+9 you depend on i+8. And you can't go in strides either, since i+16 depends on i+15, which you've not calculated so far, unless you want to intermix the stateful and non-stateful code. I'd rather not go there.
