More generally, I think the thing you are noticing is that visual and physical items offer random access.
Compare trying to find a specific piece of information in a book vs. on some training DVD.
If I'm just learning how to cook, watching a professional demonstrate the whole thing is going to be very helpful. But if I already know how to cook in general, it's easier to flick to the right section of a book and scan the page for the bit of information I need.
Or compare the difference between listening to a phone system's 7 different options vs seeing all the options available on a single screen.
The other side of this is precision. Not only do input methods like a keyboard let you give extremely explicit, information-dense instructions with no need for interpretation, they also have extremely fast feedback loops. Imagine trying to use your voice to click on a specific part of an image, or to draw a circle around it. It's far, far easier to move a pointer with your hand, watch where it goes, and then click when it's in the right position.
So visual comprehension probably is better than auditory, but I think the main things that matter are random access, specific and information-dense input, and low-latency feedback loops on input, all of which we are far better at achieving with physical/visual methods than with auditory or speech-based ones.