> I use AVX-512, and it's not even 2x faster, though it is faster--it should be more than 2x faster because AVX-512 has better instructions to work with. But when I combine this with doing the calculation in threaded parallel chunks on the array, it goes far slower than it should.
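You might be saturating your memory bandwidth, at which point it just can't go any faster. If the loop does only a little arithmetic per element loaded, a few cores can already consume data about as fast as RAM delivers it, so wider vectors and extra threads mostly end up waiting on the same memory bus. Since your problem seems easy to parallelize, you might want to experiment with the rust-gpu ecosystem: GPUs typically have much higher memory bandwidth, though you'd need to account for the cost of copying the array to the device.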
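One rough way to check whether bandwidth is the limit is to time a trivial pass over a large array with different thread counts and see where the effective GB/s stops growing. The sketch below is only an illustration under assumptions: it uses a plain sum rather than your actual kernel, relies on the compiler's auto-vectorization instead of explicit AVX-512 intrinsics, and the names `chunked_sum` and `effective_gbps`, the array size, and the thread counts are all made up for the example.

```rust
use std::hint::black_box;
use std::thread;
use std::time::Instant;

/// Sum `data` with `threads` scoped threads, one contiguous chunk per thread.
/// (Hypothetical stand-in for the real per-chunk calculation.)
fn chunked_sum(data: &[f32], threads: usize) -> f64 {
    let threads = threads.max(1);
    // Ceiling division so every element lands in a chunk; `max(1)` keeps
    // `chunks` from panicking when `data` is empty.
    let chunk_len = ((data.len() + threads - 1) / threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&x| x as f64).sum::<f64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<f64>()
    })
}

/// Bytes read divided by elapsed seconds, expressed in GB/s (1 GB = 1e9 bytes).
fn effective_gbps(bytes: usize, secs: f64) -> f64 {
    bytes as f64 / secs / 1e9
}

fn main() {
    // 64 Mi f32 values = 256 MiB, chosen to be far larger than typical caches.
    let data = vec![1.0f32; 64 * 1024 * 1024];
    let bytes = data.len() * std::mem::size_of::<f32>();

    for threads in [1, 2, 4, 8, 16] {
        let start = Instant::now();
        let total = black_box(chunked_sum(black_box(&data), threads));
        let secs = start.elapsed().as_secs_f64();
        println!(
            "{threads:>2} threads: {:.1} GB/s (sum = {total})",
            effective_gbps(bytes, secs)
        );
    }
}
```

Build with `--release` and compare the plateau against your machine's theoretical memory bandwidth. If the number flattens out after a handful of threads, more SIMD width or more threads is unlikely to help, and the remaining options are touching less memory per element or moving to hardware with more bandwidth.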