Makes sense! My impression that phase matters from audio comes from when editing audio in a DAW or anything like that. We are very sensitive to sudden phase changes (which would be kind of like teleporting very fast from one point to another, from our heads point of view). Our ears kind of pick them up like sudden bursts of white noise (which also makes sense, given that they kind of look like an impulse when zoomed in a lot).
So when generating audio I think the next chunk needs to be continuous in phase to the last chunk, where in images a small discontinuity in phase would just result in a noisy patch in the image. That's why I think it should be somewhat like video models, where sudden, small phase changes from one frame to the next give that "AI graininess" that is so common in the current models