I think the trouble you're running into is that a spectrogram discards phase information, so it isn't informationally complete and can't be perfectly inverted. Basically, a Fourier transform represents a sound as a sum of many sine waves at different frequencies, each with its own magnitude and phase. To make a pretty picture, the phase is thrown away and only the magnitude of each wave is shown. The trouble is that to get back to a pleasant, accurate sound, we need the phase information that is missing.
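Here's a small sketch of what I mean, using NumPy on a single FFT frame. The random test signal and the variable names are just my own choices for illustration. It keeps the magnitude, drops the phase, and compares the two reconstructions.

```python
import numpy as np

# Hypothetical test signal: 256 samples of noise stand in for one audio frame.
rng = np.random.default_rng(0)
signal = rng.standard_normal(256)

spectrum = np.fft.rfft(signal)
magnitude = np.abs(spectrum)   # what one column of a spectrogram keeps
phase = np.angle(spectrum)     # what it throws away

# With both magnitude and phase, inversion is exact up to rounding error.
with_phase = np.fft.irfft(magnitude * np.exp(1j * phase), n=len(signal))

# With magnitude alone (phase assumed to be zero), we get a different signal.
without_phase = np.fft.irfft(magnitude, n=len(signal))

error_with_phase = np.max(np.abs(signal - with_phase))
error_without_phase = np.max(np.abs(signal - without_phase))
```

The first error should be at the level of floating-point noise, while the second is large: same magnitudes, very different waveform.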
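I thought that was the whole story until I stumbled upon a Stack Overflow question that explains how to recover the phase data from overlapping FFT frames. The key word here is "overlapping": because neighbouring frames share samples, their magnitudes are not independent, and that redundancy constrains what the phases can be.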
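I can't say whether this is the exact method from that question, but one well-known way to exploit the overlap is the Griffin-Lim iteration. You start from a random phase guess, then alternate between inverting and re-analysing, and keep the target magnitude each time. Below is a rough NumPy-only sketch. The helper names (`stft`, `istft`, `griffin_lim`), the Hann window, the 75% overlap and the two-tone test signal are all my own assumptions, not anything from the original question.

```python
import numpy as np

def stft(x, window, hop):
    """Windowed FFT of overlapping frames; returns (frames, bins)."""
    n = len(window)
    frames = [x[i:i + n] * window for i in range(0, len(x) - n + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, window, hop, length):
    """Least-squares overlap-add inverse of stft()."""
    n = len(window)
    frames = np.fft.irfft(spec, n=n, axis=1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for k, frame in enumerate(frames):
        start = k * hop
        out[start:start + n] += frame * window
        norm[start:start + n] += window ** 2
    # Guard against division by zero where the window is zero at the edges.
    return out / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, window, hop, length, iterations=50, seed=0):
    """Estimate a signal whose STFT magnitude matches `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random start
    for _ in range(iterations):
        x = istft(magnitude * phase, window, hop, length)
        # Keep the phase of the re-analysed signal, discard its magnitude.
        phase = np.exp(1j * np.angle(stft(x, window, hop)))
    return istft(magnitude * phase, window, hop, length)

def spectral_error(x, magnitude, window, hop):
    """Relative mismatch between |STFT(x)| and the target magnitude."""
    diff = np.abs(stft(x, window, hop)) - magnitude
    return np.linalg.norm(diff) / np.linalg.norm(magnitude)

# Hypothetical test signal: two tones, 4096 samples at 8 kHz.
t = np.arange(4096) / 8000.0
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

window = np.hanning(256)
hop = 64  # 75% overlap between neighbouring frames
magnitude = np.abs(stft(signal, window, hop))  # the "spectrogram"

random_guess = griffin_lim(magnitude, window, hop, len(signal), iterations=0)
estimate = griffin_lim(magnitude, window, hop, len(signal), iterations=30)

error_before = spectral_error(random_guess, magnitude, window, hop)
error_after = spectral_error(estimate, magnitude, window, hop)
```

The result is an estimate and not the original waveform. The iteration only tries to make the spectrogram of the output match the one you started with, so the recovered phases can still differ from the true ones. With no overlap (`hop` equal to the window length), the frames share no samples and there is nothing for the iteration to work with.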