Here's an experiment which makes it very obvious that we hear phase.
Take some audio signal X and compute its STFT (windowed FFT frames with a 50% overlap between frames). Now randomize (or zero out) the phase, invert back to the time domain with overlap-add, then take the STFT again and invert once more. The resulting signal has 'consistent' phase between frames thanks to the extra round trip, but it will still sound terrible. (This is essentially a single round of Griffin-Lim phase reconstruction. You can run more iterations to get something that sounds better, but it is still not perfect no matter how many iterations you run.)
This shows that the magnitude spectrogram carries only part of the information our ears pick up: there are signals with the same magnitude spectrogram that sound different.
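If you want to try the experiment yourself, here is a minimal NumPy sketch of it. Everything specific in it is my own assumption rather than part of the description above: the 512-sample frame, the Hann window, the least-squares overlap-add inverse, and the synthetic harmonic tone standing in for X (its length is a multiple of the hop size to keep the framing simple). It also prints how far the magnitude spectrogram of the result drifts from the original, which is worth looking at alongside the listening test.

```python
import numpy as np

N_FFT = 512                          # frame length (assumed, not from the text)
HOP = N_FFT // 2                     # 50% overlap between frames
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window (assumed choice)


def stft(x):
    """Windowed FFT frames with 50% overlap; returns shape (frames, bins)."""
    padded = np.pad(x, HOP)  # zero-pad so every real sample sits in two frames
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    frames = np.stack([padded[i * HOP:i * HOP + N_FFT] for i in range(n_frames)])
    return np.fft.rfft(frames * WINDOW, axis=1)


def istft(spec):
    """Weighted overlap-add inverse (least-squares synthesis)."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    length = N_FFT + HOP * (len(frames) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    out /= np.maximum(norm, 1e-8)
    return out[HOP:-HOP]  # drop the padding added in stft()


# Stand-in for the audio signal X: a decaying 220 Hz tone with five harmonics.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(64 * HOP) / sr
x = np.exp(-3 * t) * sum(np.sin(2 * np.pi * 220 * k * t) / k for k in range(1, 6))

# Keep the magnitudes, replace the phase with uniform random values.
mag = np.abs(stft(x))
random_phase = rng.uniform(-np.pi, np.pi, size=mag.shape)
y = istft(mag * np.exp(1j * random_phase))

# The extra round trip: y is a real signal, so its STFT is consistent
# and inverting it should hand back y unchanged.
y_again = istft(stft(y))

round_trip_is_noop = np.allclose(y_again, y)
waveform_error = np.linalg.norm(y - x) / np.linalg.norm(x)
magnitude_error = np.linalg.norm(np.abs(stft(y)) - mag) / np.linalg.norm(mag)

print("second round trip leaves y unchanged:", round_trip_is_noop)
print("relative waveform error vs x:", waveform_error)
print("relative magnitude-spectrogram error vs x:", magnitude_error)
```

To actually listen to the difference, write x and y out as audio, for example with scipy.io.wavfile.write if you have SciPy installed, and swap the synthetic tone for a real recording.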