Ensembles are well known to be more accurate than their individual members. But this is not an advantage exclusive to humans: an ensemble of NNs will typically do better than any of the individual NNs.
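To put a rough number on that, here is a minimal sketch of majority voting under a strong simplifying assumption: each model is independently right with the same probability `p`, and each decision is simply right or wrong. The function name `majority_vote_accuracy` is made up for illustration, and real models make correlated errors, so the actual gain is smaller than this idealized figure.

```python
from math import comb

def majority_vote_accuracy(n, p):
    """Chance that a strict majority of n voters is right.

    Assumes (hypothetically) that voters err independently and that each
    one is right with probability p. For even n, ties count as wrong.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# One model at 80% vs. an idealized ensemble of 3 or 5 such models.
for n in (1, 3, 5):
    print(n, round(majority_vote_accuracy(n, 0.8), 5))
```

With `p = 0.8`, three voters come out at about 0.896 and five at about 0.942, which is the whole appeal; the independence assumption is doing a lot of work, though.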
There's no reason one couldn't train 5 or 10 RNNs for transcription and ensemble them. (Indeed, one cute trick this ICLR was how to get an ensemble of NNs for free, so you don't have to spend 5 or 10x the time training: lower the learning rate during training until the model stops improving and save it; then jack the learning rate way up for a while, lower it again until the model stops improving, and save that model too; repeat, and when finished you have _n_ models you can ensemble.) And computing hardware is cheaper than humans, so it will be cheaper to have 5 or 10 RNNs process an audio file than to have 2 or 3 humans independently check it, so the ensembling advantage is actually bigger for the NNs in this scenario.
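The learning-rate trick can be sketched roughly as follows. This is not the method from the ICLR paper itself, only an illustration: a fixed-length cosine cycle stands in for "lower it until it stops improving", and the names `cyclic_lr`, `snapshot_steps`, and `average_predictions`, along with the cycle length and peak rate, are all assumptions. No actual training happens here; the code only shows when the rate restarts, when a snapshot would be saved, and how the saved models' outputs could be combined.

```python
import math

def cyclic_lr(step, steps_per_cycle, lr_max):
    """Learning rate that decays along a cosine curve and jumps back
    to lr_max at the start of every cycle (assumed schedule)."""
    pos = (step % steps_per_cycle) / steps_per_cycle  # position in [0, 1)
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * pos))

def snapshot_steps(total_steps, steps_per_cycle):
    """Steps at which to save a model: the last step of each cycle,
    where the learning rate is at its lowest."""
    return [s for s in range(total_steps) if (s + 1) % steps_per_cycle == 0]

def average_predictions(per_model_probs):
    """Ensemble by averaging the class-probability vectors that the
    saved snapshots produce for one input."""
    n = len(per_model_probs)
    return [sum(column) / n for column in zip(*per_model_probs)]

# 300 steps in cycles of 100 would yield 3 snapshots from one training run.
print(snapshot_steps(300, 100))
# Averaging two hypothetical snapshots' probabilities over two classes.
print(average_predictions([[0.6, 0.4], [0.2, 0.8]]))
```

Averaging probabilities is only one way to combine the snapshots; voting on the final transcription would be another.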
Humans still have the advantage of more semantic understanding, but RNNs can be trained on much larger corpora and read all related transcripts, so even there the human advantage is not guaranteed.