Is there a tool that could profile or predict this ahead of time, so that one does not attempt to hand-write assembly before knowing for sure it will beat the compiled version?
I may be getting x86 CPU generations mixed up. But having wrestled with all this, I can certainly see the appeal of hand-optimising for older, simpler CPUs like the 6510 used in the C64, where things were a lot more deterministic.
A purely static tool for this was IACA (the Intel Architecture Code Analyzer); you simply inserted some marker macros around a bit of code in the source, and IACA analysed the compiled binary and simulated how those instructions would/could be scheduled on a given core: https://stackoverflow.com/questions/26021337/what-is-iaca-an...
Most of the time, a good optimized library will do pretty well.
It looks like it only supports Linux and macOS: no Windows, and nothing else either, such as mobile.
It seems it has existed for ten years; I wonder which optimizations still aren't picked up by recent compilers.
One should be able to do a best-case calculation, mostly assuming caches hit and branch prediction gets the answer right. Register renaming manages to stay out of the way.
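A minimal sketch of such a best-case calculation, in Python. The function name, the port labels and all the numbers are made up for illustration; the real per-port uop counts and latencies would have to come from the vendor's tables or a tool like IACA. It takes the largest of three lower bounds: front-end issue width, the busiest execution port, and the loop-carried dependency chain.

```python
def best_case_cycles(uops_per_port, total_uops, issue_width, chain_latency):
    """Lower bound on cycles per loop iteration.

    Assumes every load hits in cache, every branch is predicted
    correctly, and register renaming removes all false dependencies.
    """
    # Front end: the core can only issue `issue_width` uops per cycle.
    front_end_bound = total_uops / issue_width
    # Back end: the most heavily loaded port limits throughput.
    port_bound = max(uops_per_port.values())
    # Loop-carried dependency chain: iterations cannot overlap past this.
    return max(front_end_bound, port_bound, chain_latency)


# Hypothetical loop body: 6 uops on a 4-wide core, with a 3-cycle
# loop-carried dependency (all figures invented for the example).
pressure = {"p0": 1.5, "p1": 1.5, "p23": 2.0, "p5": 1.0}
bound = best_case_cycles(pressure, total_uops=6, issue_width=4, chain_latency=3)
print(bound)
```

With these made-up inputs the dependency chain is the limit, so unrolling or splitting the accumulator would be the thing to try before touching anything else.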
Getting more dubious, there is a statistical representation of program performance on unknown (or partially known) data. One might be able to estimate that usefully, though I haven't seen it done.
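For what it's worth, the crudest version of that statistical estimate might look like the sketch below. Everything here is an assumption: the function name, the miss and mispredict probabilities, and the penalties are invented, and the model simply adds expected stall cycles on top of the best case, ignoring the fact that an out-of-order core overlaps some of those stalls with useful work.

```python
def expected_cycles(best_case, n_loads, p_miss, miss_penalty,
                    n_branches, p_mispredict, mispredict_penalty):
    """Expected cycles per iteration under a naive additive model.

    Treats each cache miss and branch mispredict as an independent
    event that adds its full penalty (no overlap, no prefetching).
    """
    expected_miss_stall = n_loads * p_miss * miss_penalty
    expected_branch_stall = n_branches * p_mispredict * mispredict_penalty
    return best_case + expected_miss_stall + expected_branch_stall


# Hypothetical figures: 2 loads with a 5% miss rate costing 200 cycles,
# 1 branch mispredicted 2% of the time costing 15 cycles.
estimate = expected_cycles(3.0, n_loads=2, p_miss=0.05, miss_penalty=200,
                           n_branches=1, p_mispredict=0.02,
                           mispredict_penalty=15)
print(round(estimate, 2))
```

Even with invented numbers it shows why the best case can be misleading: a small miss rate can dominate the total, which is why the probabilities would need to be measured on representative data.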