The reason for the lag is that Julia has been focusing on general, composable compiler, codegen, and metaprogramming infrastructure that isn't domain-specific, whereas PyTorch and friends have been putting lots of dev money into ML-focused optimizers written in C++.
Once the new compiler stuff is in place, it would be relatively trivial to write such optimizations in user space, in pure Julia. Exceeding them would then be fairly simple too, plus you get things like static analysis of array shapes.