Here are a few ideas:
Use an extensible compiler and targeted optimizations. https://mitpress.mit.edu/books/automatic-algorithm-recogniti... is an excellent book on this topic.
Use a cluster to evolve the best settings for compile options, executable layout, instruction scheduling, etc. There is a paper from a Google author about doing this for prefetching.
Use an ILP (integer linear programming) solver for register allocation, instruction scheduling, and other problems that are normally solved with heuristics. For large programs the problem size may make this intractable. There was a startup that used this approach for a custom programming language targeted at Intel's network processors.
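To make the second idea (evolving settings) a bit more concrete, here is a rough sketch of the search loop in Python. Everything in it is an assumption for illustration: the flag names are hypothetical, and `measure` is a made-up stand-in for "compile with these settings, run the benchmark, return the time" so that the sketch runs without a compiler. It is a plain genetic algorithm with elitism, not the method from the paper mentioned above.

```python
import random

# Hypothetical on/off settings; a real run would use the compiler's actual options.
FLAGS = ["unroll", "inline", "vectorize", "prefetch", "align_loops", "reorder_blocks"]


def measure(config):
    """Stand-in cost function (lower is better).

    In practice this would build the program with `config`, run a benchmark
    and return the wall-clock time. The numbers below are invented.
    """
    cost = 10.0
    if config["unroll"]:
        cost -= 1.0
    if config["inline"]:
        cost -= 1.5
    if config["vectorize"]:
        # Pretend vectorization only pays off together with unrolling.
        cost += -2.0 if config["unroll"] else 0.5
    if config["prefetch"]:
        cost += 0.3
    if config["align_loops"]:
        cost -= 0.2
    if config["reorder_blocks"]:
        cost -= 0.4
    return cost


def random_config(rng):
    """Pick each flag on or off at random."""
    return {flag: rng.random() < 0.5 for flag in FLAGS}


def evolve(measure_fn, generations=20, pop_size=12, mutation_rate=0.1, seed=0):
    """Evolve flag settings: keep the best half, refill with mutated crossovers."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Rank by measured cost and keep the better half unchanged (elitism).
        population.sort(key=measure_fn)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Uniform crossover: each flag comes from one of the two parents.
            child = {flag: rng.choice((a[flag], b[flag])) for flag in FLAGS}
            # Mutation: occasionally flip a flag.
            for flag in FLAGS:
                if rng.random() < mutation_rate:
                    child[flag] = not child[flag]
            children.append(child)
        population = survivors + children
    return min(population, key=measure_fn)


if __name__ == "__main__":
    best = evolve(measure)
    print(best, measure(best))
```

On a cluster, the calls to `measure_fn` within one generation are independent of each other, so that is the part you would farm out to the worker machines. With real timings you would also want to repeat each measurement, since benchmark noise can otherwise mislead the selection step.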