// Compiler doesn't make independent sum* accumulators, so unroll manually.
// We cannot use an array because V might be a sizeless type. For reasonable
// code size, we unroll 4x, but 8x might help (2 FMA ports * 4 cycle latency).
That code needs 2 loads per FMA, so a CPU with 2 FMA ports would need at least 4 load ports to keep both FMA ports fed. Most CPUs with 2 FMA ports have just 2 load ports, which caps throughput at roughly one FMA per cycle; with a 4-cycle FMA latency, 4 independent accumulators are enough to cover that, so unrolling by 4 should be more or less ideal. But, ideally, the compiler could make the decision based on the target architecture.
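To make the accumulator idea concrete, here is a minimal scalar sketch of a 4x manually unrolled dot product. It is not Highway's code: `DotUnrolled4` is a hypothetical name, and it uses plain `float` instead of vector types. The four accumulators are separate named variables, mirroring the `sum*` variables the quoted comment describes.

```cpp
#include <cstddef>

// Hypothetical helper (not part of Highway): scalar dot product with four
// independent accumulators, so consecutive multiply-adds do not depend on
// each other's results.
float DotUnrolled4(const float* a, const float* b, std::size_t n) {
  float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
  std::size_t i = 0;
  // Main loop: each iteration issues 8 loads and 4 multiply-adds,
  // i.e. 2 loads per multiply-add.
  for (; i + 4 <= n; i += 4) {
    sum0 += a[i + 0] * b[i + 0];
    sum1 += a[i + 1] * b[i + 1];
    sum2 += a[i + 2] * b[i + 2];
    sum3 += a[i + 3] * b[i + 3];
  }
  // Remainder: fewer than 4 elements left.
  for (; i < n; ++i) {
    sum0 += a[i] * b[i];
  }
  // Combine the partial sums pairwise.
  return (sum0 + sum1) + (sum2 + sum3);
}
```

Because each `sumK` only depends on its own previous value, the CPU can overlap the four dependency chains instead of waiting out the full FMA latency on a single one.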
Without enabling associative math, it isn't legal for the compiler to duplicate floating-point accumulators and change the order of the accumulation. Perhaps compiling with `-funsafe-math-optimizations` (which enables `-fassociative-math`) would help. If you're using GCC, you'll probably need `-fvariable-expansion-in-unroller`, too.
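A small sketch of why the reordering isn't legal by default: splitting one accumulator into two changes which values get added together first, and with rounding that can change the result. The function names and the input values here are my own illustration, assuming IEEE 754 single-precision `float` and a build without fast-math style flags.

```cpp
#include <cstddef>

// Hypothetical helper: strict left-to-right sum with a single accumulator.
float SumSequential(const float* x, std::size_t n) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    sum += x[i];
  }
  return sum;
}

// Hypothetical helper: the same sum with two accumulators (even and odd
// indices), which is the reordering an unroller would have to introduce.
float SumTwoAccumulators(const float* x, std::size_t n) {
  float even = 0.0f, odd = 0.0f;
  std::size_t i = 0;
  for (; i + 2 <= n; i += 2) {
    even += x[i];
    odd += x[i + 1];
  }
  if (i < n) {
    even += x[i];  // leftover element when n is odd
  }
  return even + odd;
}
```

For the input `{1e8f, 1.0f, -1e8f, 1.0f}`, the sequential version loses the first `1.0f` when it is added to `1e8f` (the spacing between adjacent floats at that magnitude is 8) and ends up at `1.0f`, while the two-accumulator version cancels `1e8f` against `-1e8f` first and returns `2.0f`. The compiler can't know which answer you wanted, so it keeps the order you wrote unless you opt in.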
I think highway looks great. I'm sure I'll procrastinate on something important to play with it reasonably soon.