We definitely agree that voice interfaces are very rudimentary today. I try to run lots of things through dictation first that I would normally type out with my thumbs on my smartphone or on my computer's keyboard: text messages, search terms, commit messages, Slack conversations. Still, dictation can't perform very basic tasks like correcting or backspacing over a word or phrase, whether because it misheard you or because you changed your mind. (And in fact, as I dictated this paragraph on my 2018 MacBook Pro, it typed out everything I said twice and still required typed interventions, and eventually I just fell back to typing everything.)
You've laid out some good criteria, though. I wouldn't say voice interfaces have really "made it" until they get to the point where you don't have to ask how to ask them to do something (discoverability). You just ask it to do something and it does it. Although that's just one of many criteria.
The food menu problem is interesting, but pretty much everything that prints out on a ticket in a kitchen is structured data, so it should be possible to conversationalize it efficiently (personal preference notwithstanding, of course). Certainly there are many ways you could talk to someone about a menu: What kinds of dishes are there? Appetizers, grilled entrees, pasta, salads, desserts. What kinds of entrees? Vegetarian, pork, beef, seafood. OK, but what styles of cuisine? Jamaican, Italian, Szechuan. There's probably an analog to the 5 Whys for figuring out what someone wants to eat! Asking for yesterday's weather, though, is a specific case that could probably be solved by an intern, provided the data is easy to find on the Internet. (FWIW, I've searched for that very thing many times, and it's much harder to find than forecasts.)
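To make the structured-data point concrete, here's a minimal sketch of that drill-down conversation over a menu. The schema, the dishes, and the narrow() helper are all hypothetical, just to show how each question in the dialogue maps to another filter:

```python
# Hypothetical menu as structured data that a voice agent could drill into.
MENU = [
    {"name": "Jerk Chicken",     "course": "entree",    "style": "grilled",  "cuisine": "Jamaican", "diet": "meat"},
    {"name": "Penne alla Vodka", "course": "entree",    "style": "pasta",    "cuisine": "Italian",  "diet": "vegetarian"},
    {"name": "Mapo Tofu",        "course": "entree",    "style": "stir-fry", "cuisine": "Szechuan", "diet": "vegetarian"},
    {"name": "Caesar Salad",     "course": "appetizer", "style": "salad",    "cuisine": "Italian",  "diet": "vegetarian"},
]

def narrow(dishes, **criteria):
    """Each conversational turn just adds another filter, 5-Whys style."""
    return [d for d in dishes if all(d.get(k) == v for k, v in criteria.items())]

# "What kinds of entrees are there?" -> list the distinct styles
entrees = narrow(MENU, course="entree")
print(sorted({d["style"] for d in entrees}))   # ['grilled', 'pasta', 'stir-fry']

# "OK, a vegetarian entree, Szechuan style."
print([d["name"] for d in narrow(entrees, diet="vegetarian", cuisine="Szechuan")])  # ['Mapo Tofu']
```

Each question narrows the same underlying records, which is why the back-and-forth can stay short even when the menu is long.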
I concede that there will always be a need for graphical interfaces. How do you "speak" a map, or a CAD model? I guess I was just thinking of things that can be accomplished with a keyboard. You can speak anything you can type, even if it's as rudimentary as today, where you have to say "period newline newline" to end a sentence at the end of a paragraph while dictating.
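For illustration, the literal token substitution behind that "period newline newline" experience might look something like this toy sketch; the render() function and its rules are mine, not how any real dictation engine works internally:

```python
# Toy sketch: spoken tokens become characters by blunt substitution,
# which is exactly why you end up saying punctuation out loud.
def render(words):
    out = ""
    for w in words:
        if w == "period":
            out += "."                 # punctuation attaches to the previous word
        elif w == "comma":
            out += ","
        elif w == "newline":
            out += "\n"
        else:
            out += ("" if not out or out.endswith("\n") else " ") + w
    return out

print(repr(render("end of a paragraph period newline newline".split())))
# 'end of a paragraph.\n\n'
```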
I agree it might seem tough to multitask. But consider WiFi routers serving multiple computers, or hell, even CPUs serving different processes, "simultaneously." If voice recognition and NLP become sufficiently sophisticated, I could foresee them isolating multiple overlapping voices in a single audio sample. If not, consider that you could ask it to look something up, immediately followed by your wife dictating an email to send (or one of you could even interrupt the other), and it should be able to handle the context switching and queuing at speed.
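Here's a sketch of what I mean by queuing and context switching, assuming the "who said what" step has already happened upstream; all the names and the structure are made up for illustration:

```python
from collections import deque

# Speaker-tagged utterances handled in arrival order, the way a CPU
# time-slices processes. Diarization is assumed to happen upstream.
queue = deque([
    ("me",   "look up the capital of Mongolia"),
    ("wife", "start an email to Dana"),
    ("wife", "subject: dinner Friday"),
    ("me",   "actually, yesterday's weather first"),
])

contexts = {}  # one running context per speaker, so interruptions don't collide

while queue:
    speaker, utterance = queue.popleft()
    contexts.setdefault(speaker, []).append(utterance)
    print(f"[{speaker}] handling: {utterance}")

print(contexts)
```

The point being that keeping a separate running context per speaker is what lets two interleaved requests queue up instead of clobbering each other.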
And I understand there's a lot I don't know, and I do remain skeptical that this could ever be perfected. Would it really be able to handle dictated poetry? Would the forms I create or creatively destroy in free verse just totally confuse the voice interface? Would it be smart enough to sidestep the confusion via some pseudo-meta-cognitive process and ask me what the hell I'm doing?