It is rather silly for Phi to be positioned as "it's just like x86, oh wait, except for needing to use special SIMD instructions to get max performance". It's kind of like Atom being x86 for ultra-mobile platforms while not being able to match the power/performance of ARM. Once you start sacrificing things to maintain x86 compatibility, you really lose its benefits.
You really do need these changes to get max parallelism, though. Where it shines is in situations where you'd otherwise be porting to a GPU. On the Phi it's a recompile plus adding a few intrinsics to your inner loops. This is much faster than getting reasonable performance on a heterogeneous architecture, and you don't have to micro-manage the slow PCIe link between the CPU and the GPU.
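
To give a rough idea of what "a few intrinsics in your inner loops" can look like, here is a minimal sketch in C. It is only an illustration: the `saxpy` helper is a made-up name, and it assumes the AVX-512F intrinsics from `<immintrin.h>` (the first-generation Phi coprocessors used their own 512-bit instruction set, so the exact intrinsics there would differ). Without AVX-512 enabled at compile time it falls back to the plain scalar loop.

```c
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical example kernel: y[i] += a * x[i] for i in [0, n). */
static void saxpy(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;
#if defined(__AVX512F__)
    /* Vectorized inner loop: 16 floats per 512-bit register. */
    const __m512 va = _mm512_set1_ps(a);
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        /* Fused multiply-add: va * vx + vy */
        _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
    }
#endif
    /* Scalar loop handles the tail, or everything when AVX-512 is absent. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

The rest of the program stays ordinary C; only the hot loop changes, and the data stays in the same memory the rest of the code uses, so nothing has to be copied across PCIe.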