IME, SIMD very rarely gets used by the compiler or runtime unless you make some slight changes to your data structures or control flow, and those changes require specific knowledge of the SIMD hardware. Asking a compiler to target an unknown GPU architecture seems more likely to slow execution down than speed it up. Even when writing my own CUDA kernels, I sometimes realize that something I'm doing won't work well on a particular card and is actually leaving me slower than just running on the CPU. I'm sure we'll get there, but cards will have to converge a bit first.
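To make the data-structure point concrete, here's a minimal sketch in C of the kind of change I mean: going from an array of structures to a structure of arrays. All the names (`PointAoS`, `PointsSoA`, `MAX_POINTS`, `scale_x_aos`, `scale_x_soa`, `layouts_agree`) are made up for illustration. Whether either loop actually gets vectorized depends on your compiler, flags, and target, so treat it as an illustration of the layout difference, not a benchmark.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_POINTS 1024 /* hypothetical fixed capacity, just for this sketch */

/* Array of structures: x, y, z are interleaved in memory. */
typedef struct { float x, y, z; } PointAoS;

/* Structure of arrays: each component is stored contiguously. */
typedef struct {
    float x[MAX_POINTS];
    float y[MAX_POINTS];
    float z[MAX_POINTS];
} PointsSoA;

/* Strided access: consecutive x values sit sizeof(PointAoS) bytes apart,
 * so filling a vector register needs gathers or shuffles. */
void scale_x_aos(PointAoS *p, size_t n, float k) {
    for (size_t i = 0; i < n; ++i) {
        p[i].x *= k;
    }
}

/* Unit-stride access: consecutive x values are adjacent, which is the
 * shape auto-vectorizers usually handle most easily. */
void scale_x_soa(PointsSoA *p, size_t n, float k) {
    assert(n <= MAX_POINTS);
    for (size_t i = 0; i < n; ++i) {
        p->x[i] *= k;
    }
}

/* Sanity check: both layouts compute the same values.
 * Returns 1 if they agree, 0 otherwise. */
int layouts_agree(void) {
    enum { N = 8 };
    PointAoS aos[N];
    static PointsSoA soa; /* static: too large to want on the stack */

    for (size_t i = 0; i < N; ++i) {
        aos[i].x = (float)i;
        aos[i].y = 0.0f;
        aos[i].z = 0.0f;
        soa.x[i] = (float)i;
    }

    scale_x_aos(aos, N, 2.0f);
    scale_x_soa(&soa, N, 2.0f);

    for (size_t i = 0; i < N; ++i) {
        if (aos[i].x != soa.x[i]) {
            return 0;
        }
    }
    return 1;
}
```

Both functions do the same arithmetic; the only difference is how the data is laid out. That's the sort of change the compiler won't make for you, and picking the right layout usually means knowing something about the vector width and memory behavior of the hardware you're targeting.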