(Put another way, English text is a lossy representation of English speech.)
Perhaps if you fed the RNN the IPA transcription of each word alongside the text, it would do a bit better, though admittedly I'm not sure how you would do so.
If this is the case, I'd imagine training it on Lojban text, which is spelled phonemically, would see a similar improvement.
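
As for how the IPA might be fed in alongside the text: one naive option would be to interleave each word with its transcription, so that a character-level RNN sees both the spelling and the pronunciation in a single stream. The sketch below is only an illustration under assumptions. The lexicon `TOY_IPA`, the `word/ipa/` layout, and the function name `annotate_with_ipa` are all made up for this example; a real attempt would need a full pronunciation dictionary, and I have no evidence that this layout actually helps.

```python
# Hypothetical toy lexicon mapping spellings to IPA transcriptions.
# Illustration only; a real experiment would need a full pronunciation dictionary.
TOY_IPA = {
    "though": "ðoʊ",
    "through": "θruː",
    "tough": "tʌf",
}

UNKNOWN = "?"  # placeholder for words missing from the lexicon


def annotate_with_ipa(text, lexicon=TOY_IPA):
    """Return text with every word followed by its IPA between slashes.

    Words are split on whitespace and looked up case-insensitively;
    words not in the lexicon get the UNKNOWN placeholder.
    """
    annotated = []
    for word in text.split():
        ipa = lexicon.get(word.lower(), UNKNOWN)
        annotated.append(f"{word}/{ipa}/")
    return " ".join(annotated)


if __name__ == "__main__":
    # The annotated string is what the RNN would be trained on instead of the raw text.
    print(annotate_with_ipa("Tough though"))
```

The three toy words all end in "ough" but are pronounced differently, which is exactly the information the spelling alone throws away.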