And if I recall correctly, there was no allocation in the hot loop: a single large array was initialized via numpy beforehand to store the values. Certainly that's one of the first things I would think to fix.
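To be concrete about the pattern, here is a rough sketch. It is not the original code; the function name and the update rule are made up for illustration, with the update standing in for a step that depends on the previous iteration's result.

```python
import numpy as np

def fill_preallocated(n_steps: int) -> np.ndarray:
    # Allocate the whole output buffer once, before the loop starts.
    values = np.empty(n_steps, dtype=np.float64)
    state = 0.5
    for i in range(n_steps):
        # Hypothetical update (a logistic map), standing in for the real
        # per-step work; each step needs the previous step's result.
        state = 3.5 * state * (1.0 - state)
        # Write into the existing buffer; no new array is created here.
        values[i] = state
    return values
```

The loop itself still runs in the interpreter, which is the part I could not get rid of.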
I was strongly convinced at the time that there was no significant improvement left in Python, with >99% of the time being spent in this one function and no way to move the loop into native code given the primitives available from numpy. Admittedly I could have been wrong, and I'm not about to revisit the code now, since it has been years and it is no longer in use, so everything I'm saying is based on years-old memories.