Firstly, the models (particularly the language models) needed for state of the art performance are huge. It's not atypical for papers to discuss using a billion n-grams, for example ( https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenTerm1201415/s... ). That's several gigabytes of memory and storage at the very least, and you'd need a copy of that for every spoken language you'd want to support. Plus you need to keep that up to date with new words and phrases; it's much easier to keep models fresh on a server than on everyone's computer.
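To put a rough number on "several gigabytes", here is a back-of-envelope sketch. The bytes-per-n-gram figures and the language count are illustrative assumptions of mine, not measurements from the linked slides; real sizes depend heavily on how the model is stored and compressed.

```python
# Back-of-envelope estimate of n-gram language model size.
# All per-entry costs below are assumed values for illustration only.

def ngram_model_size_gb(num_ngrams: int, bytes_per_ngram: float) -> float:
    """Approximate size in gigabytes (1 GB = 1e9 bytes)."""
    return num_ngrams * bytes_per_ngram / 1e9


def total_size_gb(num_ngrams: int, bytes_per_ngram: float, num_languages: int) -> float:
    """Size if a separate model of the same scale is kept per language."""
    return ngram_model_size_gb(num_ngrams, bytes_per_ngram) * num_languages


if __name__ == "__main__":
    NUM_NGRAMS = 1_000_000_000   # "a billion n-grams", from the comment above
    NUM_LANGUAGES = 10           # hypothetical number of supported languages

    # Assumed storage cost per n-gram, from tightly packed to naive.
    for bytes_per_ngram in (3, 8, 16):
        one = ngram_model_size_gb(NUM_NGRAMS, bytes_per_ngram)
        many = total_size_gb(NUM_NGRAMS, bytes_per_ngram, NUM_LANGUAGES)
        print(f"{bytes_per_ngram:>2} bytes/n-gram: {one:.0f} GB per language, "
              f"{many:.0f} GB for {NUM_LANGUAGES} languages")
```

Even at the optimistic end of those assumed costs, a single language is in the multi-gigabyte range, which is a lot to ship to and keep updated on a phone.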
Power and CPU time are also a concern. Even big, beefy server farms can have trouble keeping up with state of the art speech recognition algorithms; a laptop, tablet or phone is at a huge disadvantage, especially when running off a battery.
But the biggest advantage of server-based speech recognition is indeed the data: more data is critical to improving accuracy and performance. There's no data like more data. And you don't just need more data, you need a lot more data. You can get big gains from just doing unsupervised training on 20 million utterances rather than 2 million: http://static.googleusercontent.com/media/research.google.co... There's simply no way you're going to get anything like 20 million utterances without getting data from millions of real world users.