I disagree that the most natural of human interfaces is voice. It's like saying that listening to an audiobook will always be a better way to consume media than watching a video. Audio-only is great for some use cases, but not most. Humans developed the ability to see, touch, hear, and speak for a reason. After all, radio predated television by a long shot. Then when televisions were released, they pretty much cannibalized sales of the home radio, because they added the ability to see in real time what you were listening to.
Speech is most certainly an important part of human communication, but it's just one part, and it's only best for certain things. Simple queries like "what's the weather today?" or "play some music" are better said than typed, I would agree. But for lots of other tasks it's just not efficient. I'd rather pull out my phone and search for restaurant recommendations than fumble around trying to communicate by voice. When I pull up the Yelp app, I can instantly view a list of many restaurants, and because I'm used to the interface and its visual cues, like the number of stars, location, and reviews, I can discern what I think I'd like very quickly. Now imagine a human trying to describe to me what they saw on that app. It'd be impossible to do. It's just very difficult to convey subtleties with voice only.
As an aside, if Apple wanted to get into the IoT business and pose a threat to Amazon, I think they'd already be caught up if they released some sort of "Always On" listening mode and gave developers the ability to build apps that respond to it. If I could say "play some music" or "FaceTime Dave" and it solved the simple query problem, I don't know that I'd ever use my Echo again (I maybe use it once or twice a week now).