One thing I'm still not sure about is whether the kernel could theoretically do the same reordering at load time using relocatable symbols.
There may be a code size cost on some architectures: since the call destination can be relocated far from the call site, the assembler would need to allocate enough space to reach the call target rather than emit a small PC-relative (PCREL) relocation.
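
To make the reach concern a bit more concrete, here is a rough sketch of the range check that decides whether a short PC-relative call can still be used. It is only an illustration under my own assumptions, not anything taken from the kernel or a real assembler: the helper name `pcrel_fits` is made up, and the two field widths mentioned in the comments (x86-64 `call rel32`, and AArch64 `BL` with a 26-bit immediate scaled by 4) are just examples of how small the short form can be.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if the signed byte displacement `disp`
 * from the call site to the target can be encoded in a PC-relative field of
 * `bits` bits, where the encoded value is disp / scale.
 *
 *   x86-64 call rel32 : bits = 32, scale = 1  (roughly +/- 2 GiB)
 *   AArch64 BL        : bits = 26, scale = 4  (roughly +/- 128 MiB)
 *
 * Assumes 2 <= bits <= 32 and scale >= 1.
 */
static bool pcrel_fits(int64_t disp, unsigned bits, unsigned scale)
{
    /* The displacement must be a multiple of the instruction alignment. */
    if (disp % (int64_t)scale != 0)
        return false;

    int64_t units = disp / (int64_t)scale;
    int64_t max = ((int64_t)1 << (bits - 1)) - 1;
    int64_t min = -((int64_t)1 << (bits - 1));

    return units >= min && units <= max;
}
```

If a target could end up anywhere after load-time reordering, a check like `pcrel_fits(target - site, 26, 4)` cannot be guaranteed to pass when the code is assembled, which is why a longer call sequence (or some kind of veneer) might have to be reserved up front.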