Ever since the release of Whisper and others, text-to-speech and speech-to-text have been more or less solved, while image generation seems to still sometimes have trouble. Earlier this week was a thread about how no image model could draw a crocodile without a tail.
Meanwhile, the first photographs predate the first sound recordings. And moving images without sound, of course, predate moving images with sound.
The original poster was trying to sound profound as though there was some set sequence of things that always happens through human development. But the reality is a much more mundane "less complex things tend to be easier than more complex things".