But we know it does matter: there's research showing that good sound quality on a voice call makes people more likely to believe what you say[1].
Now in any individual session, you probably can't make particularly big alterations, but imagine, say, Google or Amazon shipping a modified voice-assistant voice as "the default" with every new speaker box. Whether people keep the default voice or change it would all become data telling you what people respond to. And so right there, the new "voice of Google" or "voice of Amazon" you use in other places is now informed by wide-scale testing of whether people listen to it.
And that's presuming no one simply runs studies where they stick people in fMRI machines and play them an AI voice recording which they modulate according to neural feedback until it's "optimal".
[1] https://today.usc.edu/why-we-believe-something-audio-sound-q...