It seems some people here have trouble acknowledging that.
But here is where you are wrong: on modern micro-archs, mostly everything happens at runtime. Micro-arch-specific optimizations are not done anymore. The Linux kernel, for instance, does not bother anymore and is compiled for "generic" x86_64; it is not worth it (and may cause more harm in the end). Usually you only care about basic static optimizations, like cache lines, the code fetch window, and alignment, which are more about writing "correct" assembly code than anything else.
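To make the "basic static optimizations" point concrete, here is a minimal sketch of cache-line alignment in C11. The 64-byte line size is an assumption (it is typical for current x86_64 parts, but not guaranteed), and `padded_counter` is a hypothetical name used only for illustration.

```c
#include <stdalign.h>
#include <stddef.h>

/* Assumed cache line size; 64 bytes is common on x86_64 but not universal. */
#define CACHE_LINE 64

/* Hypothetical per-thread counter. Aligning the member to a full cache line
 * also rounds the struct size up to that line, so adjacent array elements
 * never share a line (which avoids false sharing between threads). */
struct padded_counter {
    alignas(CACHE_LINE) unsigned long value;
};

/* One slot per thread; each slot sits on its own cache line. */
static struct padded_counter counters[4];

/* Byte distance between two neighbouring slots in the array. */
static size_t counter_stride(void) {
    return (size_t)((char *)&counters[1] - (char *)&counters[0]);
}
```

Nothing here depends on a specific micro-arch; it only depends on the line size, which is why this kind of tuning tends to survive across CPU generations.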
And even then, in the worst case, one could write some micro-arch-specific code paths which would be "installed"/"branched to" at runtime. That is not an issue when you think long term about the life cycle of many software components. At least that knowledge would not be hidden deep in the absurd complexity of an optimizing compiler...
This is not the reason why. Indeed, x86_64 has a much broader instruction and register baseline than i386 did, so the impact of per-CPU tuning is smaller than it used to be. But even a generically compiled Linux kernel selects CPU instruction sets at runtime for things like hardware-accelerated cryptography, because those instruction sets actually matter. If you have evidence that modern microcode magically recognizes hand-written AES and replaces it with the equivalent AES-NI instructions, please be sure to send Linus your patches with benchmarks.
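For anyone who has not seen runtime selection spelled out, a rough userspace sketch in C follows. It is not how the kernel does it; it only shows the general shape: probe the CPU once, then route calls through a function pointer. `__builtin_cpu_init` and `__builtin_cpu_supports` are GCC/Clang builtins for x86; every other name is hypothetical, and the "accelerated" routine is a plain-C stand-in (a byte sum, not AES) so the example stays portable.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*checksum_fn)(const uint8_t *buf, size_t len);

/* Baseline path: runs on any CPU. */
static uint32_t checksum_generic(const uint8_t *buf, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Stand-in for a hardware-accelerated path. A real one would use
 * intrinsics or assembly; this is just an unrolled loop that returns
 * the same result as the generic version. */
static uint32_t checksum_accel(const uint8_t *buf, size_t len) {
    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 4 <= len; i += 4)
        sum += (uint32_t)buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];
    for (; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Returns nonzero if the CPU reports AES-NI. On other compilers or
 * architectures this conservatively reports "no". */
static int cpu_has_aes(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("aes") != 0;
#else
    return 0;
#endif
}

/* Pick the implementation once; callers keep the returned pointer. */
static checksum_fn select_checksum(void) {
    return cpu_has_aes() ? checksum_accel : checksum_generic;
}
```

A caller would do `checksum_fn f = select_checksum();` at startup and use `f(buf, len)` from then on, so the feature test is paid once rather than per call.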
While you can write those micro-optimizations for each CPU by hand, they are not worth the human cost except in very rare situations. In most cases, of course, you can't measure the difference, as only a couple of CPU cycles are saved.