Forbidden West has some of the best mocap and voice acting I've seen since God of War (2016). It's one of those things that's part art and part technology, but I think better tools will get it over the hurdle.
Do companies use machine learning for face animation at this point? E.g. capture a bunch of facial mocap data alongside phoneme streams, then train a transformer model to predict how the face should move for a new stream? I guess it's probably easier to just translate phonemes directly to facial movements, with much the same outcome?
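To make the "just translate phonemes to facial movements" idea concrete, here's a rough sketch of what I have in mind: a plain lookup table from phonemes to mouth shapes (visemes). Everything in it is an assumption for illustration: the table entries, the shape names, and the `phonemes_to_keyframes` helper are made up, not taken from any engine or studio pipeline.

```python
from typing import List, Tuple

# Hypothetical phoneme -> viseme table (illustrative only, not a standard).
PHONEME_TO_VISEME = {
    "P": "lips_closed", "B": "lips_closed", "M": "lips_closed",
    "F": "lip_to_teeth", "V": "lip_to_teeth",
    "AA": "jaw_open", "AE": "jaw_open",
    "UW": "lips_round", "OW": "lips_round",
    "IY": "lips_wide",
}
REST = "rest"  # fallback shape for phonemes missing from the table


def phonemes_to_keyframes(
    stream: List[Tuple[str, float, float]],
) -> List[Tuple[float, str]]:
    """Turn (phoneme, start_sec, end_sec) entries into (time_sec, viseme) keyframes.

    Consecutive phonemes that map to the same viseme are merged into one keyframe.
    """
    keyframes: List[Tuple[float, str]] = []
    for phoneme, start, _end in stream:
        viseme = PHONEME_TO_VISEME.get(phoneme, REST)
        if keyframes and keyframes[-1][1] == viseme:
            continue  # same mouth shape as before, no new keyframe needed
        keyframes.append((start, viseme))
    return keyframes


if __name__ == "__main__":
    demo = [("M", 0.0, 0.1), ("AA", 0.1, 0.3), ("P", 0.3, 0.4), ("B", 0.4, 0.5)]
    for time_sec, viseme in phonemes_to_keyframes(demo):
        print(f"{time_sec:.2f}s -> {viseme}")
```

A table like this only gives you mouth shapes at fixed times. It knows nothing about timing between shapes or the rest of the face, which is presumably the part a model trained on mocap data would be filling in.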