The behavior I'm complaining about happens on both, though as best I can tell, voice typing decides whether to use on-device or cloud-based recognition depending on the conditions when you use it. If you cut off your data connection you'll get word-by-word recognition, whereas most of the time when you're connected the whole sentence pops in at once, which suggests it used the cloud.