If you wanted to use Julia at petascale or above (like the Celeste folks), you'd probably want a case where you see fundamental algorithmic improvements you could readily make over the current SOTA with a higher-level, dispatch-oriented language - or else a case where you really need, say, certain types of AD or the DiffEq + ML capabilities of the SciML ecosystem (the latter of which very much depends on the level of composability that follows from the dispatch-oriented programming paradigm).
In general, my two cents on what it takes to get "good enough" performance from Julia for my sort of HPC: (1) embrace the dispatch-oriented paradigm and take type-stability seriously, and (2) either disable the GC and manage memory manually, or else (my usual approach) pre-allocate everything you need on the heap up-front, and subsequently restrict yourself to in-place methods and the stack.
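A minimal sketch of what I mean by (2), with a hypothetical iteration kernel (the function name `simulate!` and the toy matrix are mine, just for illustration): all heap allocation happens once, before the hot loop, and the loop itself only calls in-place ("bang") methods.

```julia
using LinearAlgebra

# Pre-allocate work buffers once, outside the hot loop, then use only
# in-place methods inside it so the GC never has to run mid-computation.
function simulate!(out::Vector{Float64}, A::Matrix{Float64},
                   x::Vector{Float64}, nsteps::Int)
    for _ in 1:nsteps
        mul!(out, A, x)   # in-place matrix-vector product: zero allocations
        x .= out          # broadcast assignment reuses x's existing memory
    end
    return x
end

A = [0.5 0.25; 0.25 0.5]
x = [1.0, 1.0]
out = similar(x)          # the one up-front heap allocation
simulate!(out, A, x, 10)
```

Checking the type-stability half of the advice is mechanical too: `@code_warntype simulate!(out, A, x, 10)` should show no red `Any`s, and `@allocated simulate!(out, A, x, 10)` should report zero.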
MPI.jl pretty much "just works" in my usage so far (you just have to point it at your cluster's OpenMPI/MPICH build), and things like LoopVectorization.jl are great for properly filling up your vector registers with minimal programmer effort.
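For a flavor of the "minimal programmer effort" point: LoopVectorization's `@turbo` macro on a plain loop is usually all it takes. A sketch on a hypothetical SAXPY-style kernel (the function name `saxpy!` is mine):

```julia
using LoopVectorization

# @turbo analyzes the loop and emits SIMD code that keeps the
# vector registers full, without hand-written intrinsics.
function saxpy!(y::Vector{Float32}, a::Float32, x::Vector{Float32})
    @turbo for i in eachindex(x)
        y[i] = a * x[i] + y[i]
    end
    return y
end
```

Writing the scalar loop and letting the macro do the vectorization is the whole workflow; swapping `@turbo` for `@inbounds @simd` is the stdlib-only fallback if you can't add the dependency.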