CUDA yes. Cg yes. HLSL yes. Verilog yes. VHDL yes. C++ or Java with MapReduce yes. PHP and MySQL with memcached yes. Erlang yes, and it really is functional inside each process, which is exactly the level where you aren't getting any parallelism. Octave or R, potentially, but not today, as far as I know. Mathematica yes, and it, too, is mostly functional.
In theory, side effects are what make parallelism hard, so languages whose semantics are side-effect-free (which F#, Mathematica, and Erlang are not) should make it easy. Or so we all thought in 1980. We then spent twenty years or so trying to make that happen, and it basically didn't work.
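To make that concrete, here's a minimal C sketch (f, a, and N are made up for illustration, and the OpenMP pragma assumes you compile with something like -fopenmp): the loop parallelizes safely only because f touches nothing but its argument.

    #include <stdio.h>

    #define N 1000000
    static double a[N];

    /* A pure function: it reads and writes nothing but its argument. */
    static double f(double x) { return x * x + 1.0; }

    int main(void) {
        for (int i = 0; i < N; i++) a[i] = i;

        /* Because f has no side effects, the iterations are independent,
           so they can run in any order, or all at once.  OpenMP just
           makes that freedom explicit: */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = f(a[i]);

        printf("%f\n", a[N - 1]);
        return 0;
    }

If f instead appended to a log or bumped a shared counter, running the iterations concurrently would change the program's behavior, and no compiler could make this transformation for you. That's the sense in which side effects are the obstacle.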
There are basically four kinds of parallelism within easy reach today. There's SIMD, like MMX, SSE, 3DNow!, AltiVec, and the like; you'd think that data-parallel languages and libraries like NumPy and Octave would be all over this, but except for Mathematica, that doesn't seem to be happening (there's a hand-written example after this paragraph). There's running imperative code on a bunch of tiny independent processors that share no data; AFAIK that's what the shader languages are doing. There's instruction-level parallelism on a superscalar processor, which benefits mostly from things like contiguous arrays in memory; or maybe what Sun is doing with Niagara, where the processor pretends to be a bunch of tiny, slow, independent processors. And then there's splitting up your data across a shared-nothing cluster, which is how every high-traffic web site works, and which is what MapReduce makes simpler.
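For a taste of what the first kind looks like when you write it by hand, here's a minimal SSE sketch in C (x86-specific; the function name and the assumption that n is a multiple of 4 are mine): one instruction performs four additions, which is exactly the sort of code a data-parallel language could be emitting for you.

    #include <xmmintrin.h>  /* SSE intrinsics, x86 only */

    /* Add two float arrays four lanes at a time; assumes n % 4 == 0. */
    void add_arrays(float *dst, const float *x, const float *y, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 a = _mm_loadu_ps(&x[i]);            /* load 4 floats */
            __m128 b = _mm_loadu_ps(&y[i]);
            _mm_storeu_ps(&dst[i], _mm_add_ps(a, b));  /* 4 adds at once */
        }
    }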
Uh, and then there's designing your own hardware or programming FPGAs, which is what Verilog and VHDL are for.
Languages like OCaml (I don't know anything about F# except that it's like OCaml, but for the CLR) have no special advantage in any of these scenarios. They don't even have the theoretical advantage of being side-effect-free, which would let you speculatively multithread them without breaking the language semantics. And in most of the scenarios I described they have serious practical disadvantages: they need unpredictable amounts of memory, they carry massive libraries, and they use pointers all over the place. Using pointers all over the place kills your locality of reference and your ILP. Massive libraries and unpredictable memory use make it impossible to run them inside your GPU, and mean they can't run on an FPGA (except by using external memory, like the awesome Reduceron). And nothing about the language semantics helps with SIMD either.
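Here's a minimal C sketch of that locality/ILP point (the types and names are mine, not any particular runtime's): both functions compute the same sum, but the first has the pointer-chasing layout a boxed, garbage-collected heap tends to produce, while the second has the contiguous layout superscalar hardware likes.

    #include <stddef.h>

    /* Pointer-chasing: each load depends on the previous one, so the
       pipeline stalls, and every node is a potential cache miss. */
    struct node { int value; struct node *next; };

    int sum_list(const struct node *p) {
        int s = 0;
        for (; p != NULL; p = p->next)
            s += p->value;
        return s;
    }

    /* Contiguous: addresses are predictable, the prefetcher keeps the
       cache warm, and the out-of-order core gets real ILP to exploit. */
    int sum_array(const int *a, size_t n) {
        int s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }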
So, sheesh, go read Alan Bawden's dissertation or whatever, but don't go around claiming that ML (or even Haskell) is going to magically make your algorithms parallel. We tried that. It didn't work. We're trying something else now.
And please, PHP and MySQL? For computation? Are you serious?