FWIW, streaming voice over the Internet isn't required for this attack - all the software needs is to send a tag a few bytes long indicating the topic of an overheard conversation.
The processing power required for this isn't big either - remember that 12+ years ago Microsoft Windows shipped with a speech recognition system that was in many ways better than what phones currently offer, and it worked off-line with an almost unnoticeable performance penalty. And if you're interested in probabilistic reporting ("there's an 86% chance I've heard a word matching this tag in the last hour..."), you can relax the performance requirements even further.
So, out of the things you mention, the only somewhat convincing piece of evidence would be that the apps in question are not accessing the microphone in the background.
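
To put a number on "a few bytes": here is a rough sketch of what such a probabilistic report could look like on the wire. Everything in it is made up for illustration (the function names, the 16-bit tag ID, the one-byte percentage, and the assumption that individual detections are independent); it isn't taken from any real app or SDK.

```python
import struct


def combined_confidence(scores):
    """Chance that at least one noisy detection was real.

    Assumes the detections are independent, which is a simplification.
    """
    miss = 1.0
    for p in scores:
        miss *= 1.0 - p
    return 1.0 - miss


def pack_report(tag_id, confidence):
    """Pack a hypothetical 16-bit topic tag and a 0-100 percentage into 3 bytes."""
    pct = round(confidence * 100)
    return struct.pack("!HB", tag_id, pct)


def unpack_report(payload):
    """Inverse of pack_report: returns (tag_id, confidence)."""
    tag_id, pct = struct.unpack("!HB", payload)
    return tag_id, pct / 100


# Three weak keyword matches within the hour, none convincing on its own.
confidence = combined_confidence([0.5, 0.5, 0.44])
payload = pack_report(42, confidence)  # tag 42 is an arbitrary example ID
print(len(payload), unpack_report(payload))
```

The whole report is 3 bytes, so it would be hard to tell apart from ordinary analytics traffic by size alone.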