I think it might be even less problematic with something like Whisper than with DALL-E/SD? Merely consuming data to train a system or build an index is not usually against the law (otherwise Google wouldn't exist). It's the
publication of copyrighted content that's thorny (and that's something visual models can start to approach when their results include a Getty Images logo, etc.)
I think it'd be a lot harder to argue that an accurate audio-to-text transcription violates the copyright of any of the training material in the way a generated image could.