Do you have any intuition for whether the echoprint-codegen algorithm would be suitable for saying whether two voice recordings match? One would be a little lossy, the other pretty much perfect.
Echonest can work with voice, but it is optimized for music, so you might encounter a lot of false positives with it.
Check out the echonest board on Google. It's a recurring topic there.
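If you want to get a feel for the false-positive risk yourself, here is a rough sketch of one way to compare two sets of fingerprint codes. It assumes you have already decoded each recording's fingerprint into a plain list of integer hash codes; the function names, the sample codes, and the 0.5 threshold are all made up for illustration, and this is a simple set overlap, not the matching the Echoprint server actually does (it ignores the time offset attached to each code, for one).

```python
def code_overlap(codes_a, codes_b):
    """Jaccard similarity between two collections of fingerprint hash codes."""
    set_a, set_b = set(codes_a), set(codes_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def looks_like_match(codes_a, codes_b, threshold=0.5):
    """Hypothetical decision rule: the threshold is arbitrary and needs tuning."""
    return code_overlap(codes_a, codes_b) >= threshold


# Made-up codes standing in for decoded fingerprints.
clean = [101, 202, 303, 404, 505, 606]   # the near-perfect recording
lossy = [101, 202, 303, 404, 999]        # same speech, lossy copy
other = [101, 777, 888, 999, 111]        # a different recording

print(looks_like_match(clean, lossy))
print(looks_like_match(clean, other))
```

With real voice recordings you would want to run this over pairs you know match and pairs you know don't, and see whether any threshold separates them cleanly. If the two score distributions overlap a lot, that is the false-positive problem showing up.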