I was gonna build something involving about three different neural nets:
a source separator: takes one audio stream as input and produces a set of audio streams as output
a segmentation regression neural net: takes an audio stream as input and returns the start and stop timestamps of individual samples as output, or alternatively, just trimmed copies of the audio stream
a sample classifier: takes an audio stream and returns a label such as “kick drum”, “snare drum”, “voice”, “guitar”, etc.
Then the pipeline would be:
source separator => segmenter => sample classifier
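To make the data flow concrete, here is a rough sketch of how the three stages might plug together. None of this is a real model: the type aliases, the decompose function, and the three toy_* stand-ins are all made-up names for illustration, audio is assumed to be a plain list of mono float samples, and segments are given as sample indices rather than timestamps. The toy stages use trivial rules (a loudness threshold, clip length) only so the example runs; the actual nets would replace them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: audio is a mono signal stored as a list of floats.
Audio = List[float]

# Hypothetical interfaces for the three stages described above.
Separator = Callable[[Audio], List[Audio]]            # one stream in, many out
Segmenter = Callable[[Audio], List[Tuple[int, int]]]  # (start, stop) sample indices
Classifier = Callable[[Audio], str]                   # clip in, label out


@dataclass
class LabeledSample:
    source_index: int  # which separated stream the clip came from
    start: int         # first sample index (inclusive)
    stop: int          # last sample index (exclusive)
    label: str


def decompose(mix: Audio, separate: Separator, segment: Segmenter,
              classify: Classifier) -> List[LabeledSample]:
    """Run source separator => segmenter => sample classifier."""
    results = []
    for i, stem in enumerate(separate(mix)):
        for start, stop in segment(stem):
            label = classify(stem[start:stop])
            results.append(LabeledSample(i, start, stop, label))
    return results


# --- Toy stand-ins, NOT real models -------------------------------------

def toy_separate(mix: Audio) -> List[Audio]:
    # Placeholder: pretend the mix contains a single source.
    return [list(mix)]


def toy_segment(stem: Audio, threshold: float = 0.1) -> List[Tuple[int, int]]:
    # Placeholder: contiguous runs where |amplitude| exceeds the threshold.
    spans, start = [], None
    for i, x in enumerate(stem):
        loud = abs(x) > threshold
        if loud and start is None:
            start = i
        elif not loud and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(stem)))
    return spans


def toy_classify(clip: Audio) -> str:
    # Placeholder rule based on clip length, purely so the demo has labels.
    return "kick drum" if len(clip) >= 3 else "snare drum"


demo_mix = [0.0, 0.0, 0.5, 0.6, 0.7, 0.0, 0.0, 0.9, 0.8, 0.0]
demo_result = decompose(demo_mix, toy_separate, toy_segment, toy_classify)
for sample in demo_result:
    print(sample)
```

Keeping each stage as a plain callable means any one of them could be swapped for a trained network later without touching the other two.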
Hopefully with this I would be able to decompose music into its constituent parts, which would be useful for remixing and other kinds of musique concrète.
I expect that a deep, pretrained generic image model, plus some fine-tuning on more niche training examples, will give satisfactory results. If not, it would be a good excuse to experiment with more traditionally sequence-oriented network architectures.