Finally, if you pre-process the audio using an FFT, try different FFT sizes.
The trade-off with window size is between frequency resolution and time resolution. A bigger window gives you narrower bands, so more frequency resolution, but less temporal resolution, which matters when the onset of a transient is significant in the analysis. Similarly, hop size determines how 'leaky' the process is and how fine-grained the windows are. This can affect the detection of quick peaks or changes, possibly smearing them across a few windows.
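To put rough numbers on that trade-off, here is a minimal sketch in Python. The function name `stft_resolution` and the 44.1 kHz sample rate are my own assumptions for illustration, not anything from the code discussed here. It just computes the bin width, window length and hop length for a few FFT sizes.

```python
def stft_resolution(sample_rate, fft_size, hop_size):
    """Return (bin_width_hz, window_ms, hop_ms, overlap) for an STFT setup.

    Hypothetical helper for illustration only.
    """
    bin_width_hz = sample_rate / fft_size          # spacing between FFT bins
    window_ms = 1000.0 * fft_size / sample_rate    # time span covered by one window
    hop_ms = 1000.0 * hop_size / sample_rate       # time step between windows
    overlap = 1.0 - hop_size / fft_size            # fraction shared by adjacent windows
    return bin_width_hz, window_ms, hop_ms, overlap


if __name__ == "__main__":
    sample_rate = 44100  # assumed sample rate
    for fft_size in (256, 1024, 4096):
        hop_size = fft_size // 4  # assumed 75% overlap
        bin_hz, win_ms, hop_ms, overlap = stft_resolution(sample_rate, fft_size, hop_size)
        print(f"N={fft_size:5d}  bin={bin_hz:7.2f} Hz  "
              f"window={win_ms:6.2f} ms  hop={hop_ms:6.2f} ms  overlap={overlap:.0%}")
```

Each time the FFT size doubles, the bin width halves and the window length doubles, so a short transient gets averaged over a longer stretch of audio. A smaller hop gives you more frames per second, but each frame still spans the full window.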
I believe I based my code off this MATLAB code: https://github.com/ebrevdo/synchrosqueezing/tree/master/sync...
The above MATLAB code is ridiculously slow; I rewrote it using SSE intrinsics and got it several orders of magnitude faster.
I hope this helps someone out. I never really produced anything with it, but I still feel it is promising.