"but compiling "high-level" (i.e. C-level or above) code into SIMD is an open research problem (AFAIK)"
In general, "taking a language never written for X and applying X on it" is an open problem. See also, for instance, trying to statically type a program in a dynamically-typed language. Of course it's hard to statically type a language that was designed to be dynamically typed. Of course it's hard to take C and make it do SIMD operations. You'd have to write a higher-level language designed to do SIMD from the beginning, if you really want it to be slick.
"C matches NUMA and caching just as well as assembly does."
I'm going to make a slightly different point, which is that C does not support caching as well as a language could. For one particular example, C still lays structs out as rows, and has no support for laying them out in columns. (It doesn't stop you, but you're going to be rewriting a lot of code if you change your mind later. Or doing some really funky things with macros, which is also rewriting lots of code, too.)
Of course C doesn't support this crazy optimization... when it was written the order-of-magnitude difference between CPUs and memory was much smaller, indeed outright nonexistent on some architectures of the time (though I can't promise C was on them). Computers changed.
(I'll give another idea that may or may not work: Creating a struct with built-in accessors designed to mediate between the in-RAM representation and the program interface, which the compiler will analyze and determine whether to inline into the memory struct or not. For instance, suppose you have two enumerations in your struct, one with 9 values and one with 28. There's 252 possible combinations, which could be packed into one byte, but naively, languages won't do that. This language could transparently decide whether to compute the values upon access, or unpack them into two bytes.)
Of course, virtually nothing does support this sort of thing, which is why I said C isn't really uniquely badly off... almost no current high-level languages are written for the machines we actually have today. (I'm hedging. I don't know of any that truly are, but maybe one could argue there's some scientific languages that are.) C is the water we all swim through, even if we hardly touch it directly, and ever since GPUs become standard-issue the mismatch between our programming languages and our hardware has become comically large.
Assembly supports everything, but by so doing so, supports nothing. It will of course permit you to lay your structs out in rows or columns or anything in between, but it doesn't particularly support any of them, either.
Incidentally, unlike some of my age, I don't actually complain about this; it is what it is, there are reasons for it, and what we have in hand is still pretty darned powerful. But, again, I'd suggest that if you are a language designer and you're looking to write the Next Big Thing, you could do worse than figure out how to start integrating this stuff into a programming language that can cache smarter and use the GPU in some sensible manner and all these other things. It's one way you might actually be able to create a language that will not merely tie C, but straight-up beat it.