A speech recognition model can give you a reading on how understandable the speech is, and you can use that reading to guide the speech channel's volume in the mix.
OTOH, a lot of the models end up trained on features that are very different from what humans hear.
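For what it's worth, here is a rough sketch of the first idea, with the recognizer left out. It assumes you already get one intelligibility score per time window, somewhere between 0 and 1 (for example an averaged word confidence). The scores, the function names `next_gain_db` and `db_to_linear`, and every threshold below are made up for illustration and are not taken from any particular tool.

```python
def next_gain_db(gain_db, score, target=0.85, deadband=0.05,
                 step_db=1.0, min_db=-6.0, max_db=6.0):
    """Nudge the speech channel gain one step toward the target score.

    All defaults are illustrative assumptions, not tuned values.
    """
    if score < target:
        # Speech is harder to understand than we want: bring it up.
        gain_db += step_db
    elif score > target + deadband:
        # Comfortably intelligible: give some room back to the other channels.
        gain_db -= step_db
    # Clamp so a bad score stream cannot push the fader to an extreme.
    return max(min_db, min(max_db, gain_db))


def db_to_linear(db):
    """Convert a gain in dB to a linear amplitude multiplier."""
    return 10 ** (db / 20)


# Hypothetical per-window scores standing in for recognizer output.
scores = [0.60, 0.70, 0.80, 0.88, 0.95]

gain = 0.0
history = []
for s in scores:
    gain = next_gain_db(gain, s)
    history.append(gain)

for s, g in zip(scores, history):
    print(f"score={s:.2f} gain={g:+.1f} dB multiplier={db_to_linear(g):.3f}")
```

The small steps and the clamp are there because of the caveat above: if the score does not track what a listener actually hears, the loop should only be able to drift the level a little rather than ride it hard.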