No doubt the fast SRAM helps, but from a computation POV, IMHO the bigger factor is that they've statically planned the computation and eliminated all locks.
Short explainer here: https://www.youtube.com/watch?v=H77tV1KcWIE (based on their paper).
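
To make "statically planned" a bit more concrete, here's a toy sketch. It is not their compiler or ISA: the `Op` fields, the unit numbers, and the latencies are all made up for illustration. The only assumption that matters is that every op has a fixed latency known ahead of time, so a planner can assign each op a start cycle up front and nothing has to lock or arbitrate at runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: int         # hypothetical functional unit index
    latency: int      # fixed latency in cycles, assumed known in advance
    deps: tuple = ()  # names of ops whose results this op reads


def plan(ops):
    """Assign every op a start cycle ahead of time (simple list scheduling).

    `ops` is assumed to already be in dependency (topological) order.
    """
    finish = {}     # op name -> cycle at which its result is ready
    unit_free = {}  # unit -> first cycle at which that unit is idle
    schedule = []
    for op in ops:
        # An op may start once all of its inputs are ready and its unit is idle.
        ready = max((finish[d] for d in op.deps), default=0)
        start = max(ready, unit_free.get(op.unit, 0))
        finish[op.name] = start + op.latency
        unit_free[op.unit] = finish[op.name]
        schedule.append((start, op.unit, op.name))
    return sorted(schedule), max(finish.values())


ops = [
    Op("load_a", unit=0, latency=2),
    Op("load_b", unit=1, latency=2),
    Op("matmul", unit=2, latency=4, deps=("load_a", "load_b")),
    Op("store", unit=0, latency=1, deps=("matmul",)),
]

schedule, total_cycles = plan(ops)
for start, unit, name in schedule:
    print(f"cycle {start}: unit {unit} runs {name}")
print("total cycles:", total_cycles)
```

Since the whole timeline is fixed before anything runs, two ops can never contend for the same unit at the same cycle, which is why no locks are needed in this model.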