I also expect Halide[1], Taichi[2], DaCe[3] and Tiramisu[4] to be able to more easily reach those levels of performance hence why I want to add a compiler to Laser (implemented at compile-time via Nim macros) but prerequisite is excellent and composable multithreading support which OpenMP didn't provide.
In statistics / data analysis, my PCA though (Principal Component Analysis) beats all implementations I've seen from Scipy, Facebook and R.
[2]: https://github.com/taichi-dev/taichi