https://en.wikipedia.org/wiki/Binary-to-text_encoding

If the two people can agree on the pronunciation of various symbols, then you can choose a more "dense" encoding like Base85.
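To get a feel for what "dense" means here, this is a minimal sketch that compares how many symbols a key needs under a few standard binary-to-text encodings. It uses Python's standard `base64` module; the 32-byte key and the helper name `encoded_lengths` are assumptions made up for illustration.

```python
import base64


def encoded_lengths(key: bytes) -> dict:
    """Return the number of symbols needed to write `key` in each encoding."""
    encoders = {
        "Base16": base64.b16encode,
        "Base32": base64.b32encode,
        "Base64": base64.b64encode,
        # b85encode uses an 85-character alphabet that includes punctuation.
        "Base85": base64.b85encode,
    }
    return {name: len(encode(key)) for name, encode in encoders.items()}


if __name__ == "__main__":
    key = bytes(range(32))  # hypothetical 32-byte key, not a real secret
    for name, length in encoded_lengths(key).items():
        print(f"{name}: {length} symbols")
```

The denser encodings get shorter by pulling in mixed case and punctuation, which is exactly what makes them harder to read aloud.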
If the key is designed to be spoken and heard, then you're better off avoiding case-sensitive and sound-alike symbol sets altogether. Instead, build a dictionary of words chosen to reduce the possibility of mistaking one word for another (something like the phonetic alphabet: https://en.wikipedia.org/wiki/NATO_phonetic_alphabet), and then assign each symbol value to a word.
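As a rough sketch of the word-dictionary idea, the example below maps each 4-bit half of a byte to one of 16 words. The word list (the first 16 NATO code words) and the function names `key_to_words` and `words_to_key` are assumptions for illustration only; a real scheme would use a larger, purpose-built dictionary.

```python
# Stand-in dictionary: 16 words, so each word carries 4 bits.
WORDS = [
    "Alfa", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel",
    "India", "Juliett", "Kilo", "Lima", "Mike", "November", "Oscar", "Papa",
]
# Reverse lookup, case-insensitive so the listener's capitalization is irrelevant.
INDEX = {word.lower(): i for i, word in enumerate(WORDS)}


def key_to_words(key: bytes) -> list:
    """Encode each byte as two words: high nibble first, then low nibble."""
    words = []
    for byte in key:
        words.append(WORDS[byte >> 4])
        words.append(WORDS[byte & 0x0F])
    return words


def words_to_key(words: list) -> bytes:
    """Decode a list of words back into the original bytes."""
    if len(words) % 2 != 0:
        raise ValueError("expected an even number of words")
    nibbles = [INDEX[word.lower()] for word in words]
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


if __name__ == "__main__":
    spoken = key_to_words(b"\x01\xef")
    print(" ".join(spoken))
    assert words_to_key(spoken) == b"\x01\xef"
```

An unknown word raises `KeyError` on decode, which at least tells the listener that something was misheard rather than silently producing a wrong key.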
The larger your dictionary, the fewer words each key exchange needs, but the chance of including sound-alike words increases as well.
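A dictionary of 256 words encodes 8 bits per word and means you would need 68 words per key. Doubling the dictionary size to 512 words adds only one bit per word, which brings that down to 61 words; to halve the number of words per key you would have to square the dictionary size (65,536 words).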
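The trade-off between dictionary size and words per key can be checked with a short calculation. This is a sketch under one assumption: the key is taken to be 544 bits long, which is simply what 68 words from a 256-word dictionary (8 bits each) would carry. The function name `words_per_key` is made up for this example.

```python
def words_per_key(key_bits: int, dictionary_size: int) -> int:
    """Smallest number of words whose combinations cover every possible key.

    Uses exact integer arithmetic: find the smallest n such that
    dictionary_size ** n >= 2 ** key_bits.
    """
    if dictionary_size < 2:
        raise ValueError("dictionary must contain at least two words")
    target = 2 ** key_bits
    count = 0
    capacity = 1
    while capacity < target:
        capacity *= dictionary_size
        count += 1
    return count


if __name__ == "__main__":
    KEY_BITS = 544  # assumed key length: 68 words * 8 bits per word
    for size in (256, 512, 65536):
        print(f"{size} words in dictionary: {words_per_key(KEY_BITS, size)} words per key")
```

Each doubling of the dictionary buys only one extra bit per word, so the word count shrinks slowly while the risk of sound-alike words keeps growing.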
And of course this falls apart if the two people exchanging keys don't speak the language of the dictionary, although that is (usually) the case with just numbers and letters as well.