This is much easier than source separation. It would be different if I were asking to isolate a violin from a viola or another violin: you’d have to get much more specific about the timbre of each instrument and potentially understand what each instrument’s part was.
But a vibrating string produces a very distinctive waveform that is easy to pick out in a file.
Humans are amazing at it. You can discern the different instruments way better than any stem separating AI.
What did you compare it to? Ableton recently launched an audio separation feature too, and it's probably the highest ROI on simple/useful/accurate of what I've tried so far; other solutions have been lacking in one of those points.
https://www.reddit.com/r/LocalLLaMA/comments/1pp9w31/ama_wit...
How does that work? Correlating sound with movement?
If I had to do it synthetically, take single subjects with a single sound and combine them together. Then train a model to separate them again.
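A rough sketch of that data-generation step in Python, in case it helps. Everything here is an assumption for illustration: `sine_clip` and `make_training_pair` are made-up names, pure tones stand in for real single-source recordings, and the actual model training is left out. It only shows how mixing known sources gives you (input, target) pairs for free.

```python
import math
import random

def sine_clip(freq_hz, n_samples=800, sample_rate=8000):
    """Hypothetical stand-in for a single-source recording: a pure tone."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]

def make_training_pair(clips, n_sources=2, rng=None):
    """Pick n_sources single-source clips, apply random gains, and sum them.

    Returns (mixture, sources): the mixture is what a separation model
    would take as input, and the scaled sources are the targets it
    would be trained to recover.
    """
    rng = rng or random.Random()
    chosen = rng.sample(clips, n_sources)
    sources = []
    for clip in chosen:
        gain = rng.uniform(0.3, 1.0)  # vary loudness between examples
        sources.append([gain * s for s in clip])
    # Sample-wise sum of the scaled sources gives the synthetic mixture.
    mixture = [sum(vals) for vals in zip(*sources)]
    return mixture, sources

# Three toy "recordings" of equal length, then one synthetic training example.
clips = [sine_clip(f) for f in (220.0, 440.0, 880.0)]
mixture, sources = make_training_pair(clips, n_sources=2, rng=random.Random(0))
```

With real data you would swap the tones for isolated recordings and generate a fresh pair per training step, so the model rarely sees the same combination twice.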
Could you watch a music video and say "that's the snare drum, that's the lead singer, keyboard, bass, that's the truck that's making the engine noise, that's the crowd that's cheering, oh and that's a jackhammer in the background"? So can AI.
Could you point out who is lead guitar and who is rhythm guitar? So can AI.
That doesn't seem any better than typing "rhythm guitar". In fact, it seems worse and with extra steps. Sometimes the thing making the sound is not pictured. This thing is going to make me scrub through the video until the bass player is in frame instead of just typing "bass guitar". Then it will burn some power inferring that the thing I clicked on was a bass.
Also https://zanshin.sh, if you'd like speaker diarization when watching YouTube videos
At the time the first SAM was created, Meta was already spending over 2B/year on human labelers. Surely that number is higher now, and research like this can dramatically increase data labeling volume.
How is creating 3D objects and characters (and something resembling bones/armature but isn't) supposed to help with data labeling? As synthetic data for training other models, maybe, but it seems like this new release is aimed at improving their own tooling for content creators, which is hard to deny considering their demos.
For the original SAM releases, I agree, that was probably the purpose. But these new ones that generate stuff and do effects and whatnot clearly go beyond that initial scope.
But also agreed (with you, yes), for the vast majority of moments, ignore and don't add more noise. But sometimes... human after all.
GitHub: https://github.com/facebookresearch/sam-audio
I quite like adding effects such as making the isolated speech studio-quality or broadcast-ready.
* remove the background noise of tech products, but keep the nature sounds
* isolate the voice of a single person and feed it into an STT model to improve accuracy
* isolate the sound of events in games, and many more
I got much better results, though still not perfect, with the voice isolator in ElevenLabs.