The efficiency comparison between Apple and Nvidia here is a bit misleading because it compares Apple's general-purpose ALUs to Nvidia’s specialized ALUs. For a more direct efficiency comparison, one would need to compare the Tensor Cores against the AMX or ANE coprocessors.
As to how Apple achieves such high efficiency, nobody knows. The fact that they are on a 5nm node might help, but there must be something special about the ALU design as well. My speculation is that their ALUs are wider and much simpler than those in other GPUs, which directly translates to efficiency wins.