First things that came to mind:
- facial hair
- getting people to learn to make bigger mouth movements and not mumble
- we're constantly self-correcting our speech as we hear our voice. This removes the feedback loop.
- non-English languages (god forbid bilingualism)
- camera angles and head movement
And that's after thinking about it for 30s. I'm sure there are some really good use cases, but will any research group/company push through for years and years to make it really good even if the response is lukewarm?