I agree that they can be too heavyweight for short-lived work. For example, if a program needs to sort a small(ish) amount of data, it would be great if the language could make the sort use all the available cores on the machine without the base program having to do any multi-processing or multi-threading itself. Spinning up threads just for that could cost more than the parallelism saves, although a thread pool would avoid most of the startup cost. This is the kind of problem that OpenMP tries to solve.
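To make the thread pool point concrete, here is a rough Python sketch (not OpenMP, just the same idea in a different language). The names `POOL` and `pooled_sort` and the worker count of 4 are made up for illustration. Caveat: in standard CPython builds the GIL stops threads from running a CPU-bound sort on several cores at once, so this shows the "create the workers once, reuse them for every call" pattern rather than a real speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical module-level pool: created once, reused by every call below,
# so no call pays the cost of starting fresh threads.
POOL = ThreadPoolExecutor(max_workers=4)


def pooled_sort(data, chunks=4):
    """Return a sorted copy of `data`, sorting slices on the shared pool."""
    if not data:
        return []
    size = -(-len(data) // chunks)  # ceiling division: items per slice
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    # Each slice is sorted by a pool worker; results come back in order.
    sorted_parts = POOL.map(sorted, parts)
    # Merge the already-sorted slices into one sorted list.
    return list(heapq.merge(*sorted_parts))
```

Calling `pooled_sort` many times reuses the same workers. Swapping in `ProcessPoolExecutor` would use multiple cores, but then you pay to copy the data between processes, which is exactly the overhead that hurts for small inputs.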
But for long-lived workloads, or lots and lots of tiny requests (the OP talks about 'web-scale', whatever that is), you would create the processes and threads once, at startup, and after that they just keep busy with little overhead.
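A minimal sketch of that "start once, keep busy" shape, again in Python and again with made-up names (`start_workers`, `handler`, and the default of 4 workers are all assumptions for illustration, not anything from the OP):

```python
import queue
import threading


def start_workers(handler, count=4):
    """Start `count` long-lived threads once; return (tasks, results, shutdown)."""
    tasks = queue.Queue()
    results = queue.Queue()

    def worker():
        # Each worker loops for the life of the program, pulling requests.
        while True:
            item = tasks.get()
            if item is None:  # sentinel value: stop this worker
                break
            results.put(handler(item))

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(count)]
    for t in threads:
        t.start()

    def shutdown():
        # One sentinel per worker, queued behind any pending requests.
        for _ in threads:
            tasks.put(None)
        for t in threads:
            t.join()

    return tasks, results, shutdown
```

The thread startup cost is paid once in `start_workers`; after that, each request costs only a `tasks.put(...)` and a queue hand-off, however many tiny requests arrive. `None` is reserved as the stop signal here, so it cannot be used as a request.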