Thanks for the feedback, glad to hear I'm not completely crazy ;). I think I saw that paper cited in my reading but haven't read it in full; I'll take a look, thanks!
Some of the results I've had have been from trying to apply it using 1D unets (also audio). I am getting slightly better results now using larger (and more standard) 2D unets but it's really taking a long time to train, especially given that I'm still experimenting with a subset of my data.
I'm beginning to suspect that because it's learning to predict very small signal residuals, improvement in output quality is very incremental in a way that is not directly correlated to the size or nature of the dataset. Like, even if I just train it on sinusoids, it takes a really long time to improve (compared to a GAN approach). None of these conclusions are very formal, mind you, so I would love to hear this confirmed. The training dynamics just seem very different from what I am used to with either MSE or discriminative loss.