There is at least one startup working on a local solution (I forget their name), but my understanding is that it's still technically inferior; for example, it can only support very small vocabularies.
The bottleneck is simply the raw GPU compute power available.
I completely agree with you that the technical gap might be caused in part by incentives from cloud companies. Why invest in researching something that makes your core business less valuable?
Amazon and Google both most likely have more to gain from reducing their compute requirements for in-house voice recognition than from keeping the cost high for others.