Then, treating these spectrograms as images, train a neural net to classify them using pre-labelled samples. Next, take samples from the unknown songs and let it classify them. I find it incredible that 2.5 seconds of sound, represented as a tiny picture, captures enough information for reliable classification, but apparently it does!
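For anyone who wants to see what the "2.5 seconds as a tiny picture" step might look like, here is a rough sketch using NumPy and SciPy. It is not the article's actual code: the function name, the FFT window sizes, and the log scaling are my own assumptions, and the input is a synthetic 440 Hz tone rather than a real song.

```python
import numpy as np
from scipy import signal


def spectrogram_slices(samples, sample_rate, slice_seconds=2.5):
    """Cut a mono signal into fixed-length slices and return one
    log-scaled spectrogram "image" per slice.

    Hypothetical helper: nperseg/noverlap are arbitrary choices.
    """
    slice_len = int(slice_seconds * sample_rate)
    n_slices = len(samples) // slice_len  # drop the incomplete tail
    if n_slices == 0:
        raise ValueError("signal is shorter than one slice")
    images = []
    for i in range(n_slices):
        chunk = samples[i * slice_len:(i + 1) * slice_len]
        _freqs, _times, power = signal.spectrogram(
            chunk, fs=sample_rate, nperseg=512, noverlap=256
        )
        # log1p compresses the dynamic range so quiet detail stays visible
        images.append(np.log1p(power))
    return np.stack(images)  # shape: (slice, frequency bin, time frame)


# Synthetic stand-in for a song: 10 seconds of a 440 Hz sine wave.
sample_rate = 22050
t = np.arange(10 * sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

images = spectrogram_slices(tone, sample_rate)
print(images.shape)
```

Each entry along the first axis is one 2.5 second "picture" that could be fed to an image classifier along with its genre label.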
Although I wonder what that would do to the metal scene if their main topic of discussion and contention got completely solved.
There's more on that (and some other pitfalls) in a paper linked elsewhere in the comments here: https://news.ycombinator.com/item?id=13085651
But please correct me if I'm wrong.
(Nice to see you show up for the discussion. I was worried that you'd given up hope before your article hit the front page.)
[1] https://sourceforge.net/p/sox/feature-requests/176/
Converting to pictures is unnecessary and makes the processing harder. The pooling should just happen on segments of the waveform itself, instead of on spectrogram pictures of its Fourier transform (the frequency domain).
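One possible reading of "pooling on segments of the waveform", sketched with NumPy. This is only an illustration: the function name is made up, and pooling the absolute amplitude (rather than a learned feature map) is my assumption, not something the comment specifies.

```python
import numpy as np


def pool_waveform(samples, window, mode="max"):
    """Pool the absolute amplitude of a 1-D waveform over
    non-overlapping segments of `window` samples.

    Hypothetical helper; trailing samples that do not fill a
    whole segment are dropped.
    """
    usable = len(samples) - len(samples) % window
    frames = np.abs(samples[:usable]).reshape(-1, window)
    if mode == "max":
        return frames.max(axis=1)
    if mode == "mean":
        return frames.mean(axis=1)
    raise ValueError(f"unknown mode: {mode}")


wave = np.array([0.1, -0.5, 0.2, 0.3, -0.1, 0.05, 0.9])
pooled = pool_waveform(wave, window=3)
print(pooled)
```

In a real network the pooling would normally sit after 1-D convolution layers, so treat this as the bare operation only.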
I agree it seems a little jank, but the features are pretty good - and a lot of network architectures / training techniques are most practiced in an image processing context.
2. The size of frequency analysis blocks seems arbitrary. I wonder if there is a "natural" block size based on a song's tempo, say 1 bar. This would of course require a priori tempo knowledge or a run-time estimate.
[1]: https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/...
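To make the "natural block size" idea concrete, here is a small sketch of the arithmetic for a block of exactly one bar. The function name and the 4/4 default are my assumptions, and it takes the tempo as given rather than estimating it.

```python
def bar_length_samples(bpm, sample_rate, beats_per_bar=4):
    """Number of samples in one bar at the given tempo.

    Hypothetical helper: assumes a constant, already-known tempo
    and a fixed time signature.
    """
    seconds_per_bar = beats_per_bar * 60.0 / bpm
    return round(seconds_per_bar * sample_rate)


print(bar_length_samples(120, 44100))  # one bar at 120 BPM lasts 2.0 s
print(bar_length_samples(96, 44100))   # one bar at 96 BPM lasts 2.5 s
```

Under these assumptions a fixed 2.5 second slice lines up with exactly one 4/4 bar only at 96 BPM; at any other tempo the slice boundaries drift relative to the bars.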
The creator wrote about it here:
http://blog.echonest.com/post/52385283599/how-we-understand-...
and writes a lot about it on their blog:
http://www.furia.com/page.cgi?terms=noise&type=search
Of course those are going in the other direction, not generating the classification from the data, but it's probably one of the best data sets as far as classifying existing music.
Demo available here: http://demo.niland.io/
For example, it can output Drum Machine: 87%, House: 88%, Female Voice: 55%, Groovy: 93%
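Note that those percentages add up to well over 100%, which is what you would expect from a multi-label tagger where each tag gets its own independent score. Here is a minimal sketch of that idea with NumPy. I don't know how the demo works internally; the logit values below are made up, chosen only so the output matches the example.

```python
import numpy as np


def sigmoid(x):
    """Map raw scores to independent probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


# Tag names come from the example above; the logits are invented.
tags = ["Drum Machine", "House", "Female Voice", "Groovy"]
logits = np.array([1.9, 2.0, 0.2, 2.6])

probs = sigmoid(logits)
for tag, p in zip(tags, probs):
    print(f"{tag}: {p:.0%}")

# Unlike a softmax over mutually exclusive classes, the scores are
# not forced to sum to 1, so several tags can be "on" at once.
print(f"sum of probabilities: {probs.sum():.2f}")
```

A softmax over the same four scores would force a single winner, which doesn't fit tags like "House" and "Groovy" that can easily both apply.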