Memory is slow. Insanely slow (compared to the CPU). You can process stupid fast if your entire working set fits in a 32KB L1 cache, but the second you touch main memory you're hosed. You can't hide memory latency without out-of-order execution and/or SMT. You fundamentally need to be parallel to hide latency. CPUs do it with out-of-order and speculative execution. GPUs do it by being stupidly parallel and running something like 32-64 way SMT (huge simplification). Many high-performance CPUs do all of these things.
Instruction-level parallelism is simply not optional with the DRAM latency we have.
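To put rough numbers on that, here is a back-of-the-envelope sketch using Little's law (concurrency = latency x throughput). The 100 ns DRAM latency, 50 GB/s bandwidth, and 64-byte cache line are assumed ballpark figures rather than measurements of any particular chip, and the function names are made up for illustration.

```python
def lines_in_flight(latency_ns: float, bandwidth_gb_s: float, line_bytes: int = 64) -> float:
    """Cache lines that must be outstanding at once to sustain the given bandwidth.

    Little's law: bytes in flight = latency * throughput.
    1 GB/s equals 1 byte/ns, so ns * GB/s gives bytes directly.
    """
    bytes_in_flight = latency_ns * bandwidth_gb_s
    return bytes_in_flight / line_bytes


def achieved_bandwidth_gb_s(latency_ns: float, outstanding: int, line_bytes: int = 64) -> float:
    """Bandwidth you actually get with a fixed number of misses in flight."""
    return outstanding * line_bytes / latency_ns


# Assumed figures: 100 ns latency, 50 GB/s peak bandwidth, 64-byte lines.
print(lines_in_flight(100, 50))         # 78.125
# A core that stalls on every miss (one outstanding load at a time):
print(achieved_bandwidth_gb_s(100, 1))  # 0.64
```

Under those assumptions, a core that waits on each miss gets well under 1 GB/s, and you need dozens of independent loads in flight to use the memory system at all. Whether those come from out-of-order execution, SMT, or GPU-style massive threading is the design choice.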
But I was really just trying to point out that in-order CPUs are still around; they did not disappear with the in-order Atom.