Yes, but not very well.
What I want is for "AI" to do something impressive. Why are we trying to make the system generate the sounds itself? We don't make artists do that; we give them instruments. Give the models actual instruments, and then have them play those instruments like a real artist. I would be much more impressed with an AI that understands composition and scoring, the use of musical voices, and key signatures. That would still be generative. I guess I just don't understand the point of the direction being taken. It's like a solution looking for a problem.
Similarly, we don't have recordings of the actions of painters; we have finished paintings -- but if you're not impressed with what AI can do in the visual sphere, your standards are, to put it mildly, high.
Yes, my standard is this: if it isn't at least as good as what's available now, what's the point?
In the same way, many successful musicians can't read sheet music and don't know music theory; they just know how to produce something that sounds good.
Right, because natural talent lets them operate the instruments that make the sound; they don't have to draw the waveforms. Audio generation is very different from image generation. It's just very odd to me.