Second, you can verify fingerprints manually if you're in person. The biggest draw is there is no distinction between a user who you've simply exchanged keys with vs a user who's fingerprint you've verified. This is something the Threema messenger gets right. Trying to explain capabilities of the attacker and how you could be MITM'd to some random user is totally pointless and will just scare them, it's detrimental to adoption. It's far better to have a visual indication of how much relative 'trust' you have with your individual contacts, rather than write a novel implying there could be G-Men on the other side of the line.
Finally, for phone calling capability, the loop can be 'closed' through a second level of verification, because RedPhone/Signal give you a (matching) set of random words upon connection establishment, and each party says the secure words to each other before anything else. This was an idea taken from SilentCircle, I believe.
The idea here is that it is easy for a human to verify they're talking to someone they know simply if they hear their voice verify the words they're seeing, while it's probably going to be difficult for an attacker to imitate arbitrary voices on demand in real time to 'spoof' a human.