Our new SAM audio model transforms audio editing

(about.fb.com)

168 points · ushakov · 6mo ago · 66 comments

47 comments · 16 top-level

yunwal · 6mo ago · 8 in thread

This is hilariously bad with music. Like I can type in the most basic thing like "string instruments" which should theoretically be super easy to isolate. You can generally one-shot this using spectral analysis libraries. And it just totally fails.

photon_garden · 6mo ago

I had the same experience. It did okay at isolating vocals but everything else it failed or half-succeeded at.

embedding-shape · 6mo ago

Like most models released for publicity rather than usefulness, they'll do great at benchmarks and single specific use cases, but no one seems to be able to release actually generalized models today.

hamza_q_ · 5mo ago

Use Demucs bruh https://github.com/adefossez/demucs

yunwal · 5mo ago

Hilarious that this is maintained by facebook and yet SAM fails so badly

duped · 6mo ago

what in theory makes those "super easy" to isolate? Humans are terrible at this to begin with, it takes years to train one of them to do it mildly well. Computers are even worse - blind source separation and the cocktail party problem have been the white whale of audio DSP for decades (and only very recently did tools become passable).

yunwal · 6mo ago

The fact that you can do it with spectral analysis libraries, no LLM required.

This is much easier than source separation. It would be different if I were asking to isolate a violin from a viola or another violin; you’d have to get much more specific about the timbre of each instrument and potentially understand what each instrument's part was.

But a vibration made from a string makes a very unique wave that is easy to pick out in a file.
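
For readers wondering what "spectral analysis" could mean here, the following is a minimal sketch of one such approach: a harmonic comb mask in the frequency domain. It assumes the fundamental frequency is already known and constant, and it uses a synthetic tone plus white noise rather than real music, so it does not settle whether real string instruments are easy to isolate. The names `harmonic_comb_mask` and `snr_db` are hypothetical helpers, not part of any library mentioned in this thread.

```python
import numpy as np

def harmonic_comb_mask(signal, sample_rate, f0, n_harmonics=10, width_hz=15.0):
    """Keep only FFT bins within width_hz of the first n_harmonics multiples of f0."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    mask = np.zeros(freqs.shape, dtype=bool)
    for k in range(1, n_harmonics + 1):
        mask |= np.abs(freqs - k * f0) <= width_hz
    # Zero every bin outside the comb, then go back to the time domain.
    return np.fft.irfft(spectrum * mask, n=len(signal))

def snr_db(reference, estimate):
    """Signal-to-noise ratio of estimate against a known clean reference, in dB."""
    residual = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))

# Synthetic stand-in for a bowed string: five harmonics of 220 Hz, decaying in amplitude.
sr = 16000
t = np.arange(sr) / sr
string = sum((1.0 / k) * np.sin(2 * np.pi * 220.0 * k * t) for k in range(1, 6))

# Assumed interference: white noise, which is far easier to reject than other instruments.
rng = np.random.default_rng(0)
mixture = string + rng.normal(0.0, 1.0, size=sr)

isolated = harmonic_comb_mask(mixture, sr, f0=220.0)
```

Comparing `snr_db(string, isolated)` with `snr_db(string, mixture)` shows how much the mask helps in this idealized case. Real recordings have changing pitch, vibrato, and overlapping harmonics from other instruments, which is where a fixed comb like this breaks down.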

coldtea · 5mo ago

>what in theory makes those "super easy" to isolate? Humans are terrible at this to begin with,

Humans are amazing at it. You can discern the different instruments way better than any stem separating AI.

lomase · 5mo ago

Like everything AI, you just have to lie a little and people with 0 clue about SOTA in audio will think this is amazing.

ks2048 · 6mo ago · 6 in thread

I recently discovered Audacity includes plug-ins for audio separation that work great (e.g. split into vocals track and instruments track). The model it uses also originated at Facebook (demucs).
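
Outside Audacity, the same vocals/instruments split can be requested from Demucs itself. The sketch below assumes the `demucs` command-line tool is installed and on the PATH; `demucs_command` and `separate` are hypothetical wrapper names, while the `--two-stems` and `-o` flags are the ones described in the Demucs README.

```python
import shutil
import subprocess

def demucs_command(track, stem="vocals", out_dir="separated"):
    """Build the argument list for a two-stem Demucs run (chosen stem vs. everything else)."""
    return ["demucs", "--two-stems", stem, "-o", out_dir, track]

def separate(track, stem="vocals", out_dir="separated"):
    """Run Demucs on one file; raises if the CLI is missing or exits with an error."""
    if shutil.which("demucs") is None:
        raise RuntimeError("demucs CLI not found; try `pip install demucs`")
    subprocess.run(demucs_command(track, stem, out_dir), check=True)

# Example (not executed here): separate("song.mp3") should leave the stems
# somewhere under the "separated" directory.
```

Keeping the command construction separate from the subprocess call makes it easy to inspect or log exactly what would be run before committing to a long separation job.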

embedding-shape · 6mo ago

> for audio separation that work great

What did you compare it to? Ableton recently launched an audio separation feature too, and it's probably the highest ROI on simple/useful/accurate of what I've tried so far; other solutions have been lacking on one of those points.

tantalor · 6mo ago

Is "demucs" a pun on demux (demultiplexer)?

ipsum2 · 6mo ago

Yes.

TylerE · 6mo ago

Audacity is very very very far from state of the art in that respect.

vhcr · 6mo ago

This new SAM model actually competes against SOTA models.

https://www.reddit.com/r/LocalLLaMA/comments/1pp9w31/ama_wit...

wellthisisgreat · 6mo ago

What’s a good alternative?

yjftsjthsd-h · 6mo ago · 4 in thread

> Visual prompting: Click on the person or object in the video that’s making a sound to isolate their audio.

How does that work? Correlating sound with movement?

janalsncm · 6mo ago

If it’s anything like the original SAM, thousands of hours of annotator time.

If I had to do it synthetically, take single subjects with a single sound and combine them together. Then train a model to separate them again.
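
The synthetic recipe described above can be sketched in a few lines: take clips that each contain one isolated source, sum them into a mixture, and keep the scaled originals as the separation targets. This is only an illustration of the data-generation idea, not a description of how SAM Audio was actually trained; `make_training_pair` and the gain range are assumptions chosen for the example.

```python
import numpy as np

def make_training_pair(sources, rng, gain_range=(0.5, 1.0)):
    """Mix single-source clips into one mixture and return (mixture, targets).

    targets has shape (n_sources, n_samples); a separation model would be
    trained to recover each row of targets from mixture.
    """
    length = min(len(s) for s in sources)  # crop everything to the shortest clip
    targets = np.stack([
        rng.uniform(*gain_range) * np.asarray(s[:length], dtype=np.float64)
        for s in sources
    ])
    mixture = targets.sum(axis=0)
    return mixture, targets

# Two toy "single subject, single sound" clips: a pure tone and a noise burst.
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440.0 * np.arange(1000) / 16000)
noise = rng.normal(0.0, 0.1, size=1200)

mixture, targets = make_training_pair([tone, noise], rng)
```

Because the targets are known exactly, any loss between a model's outputs and `targets` can be computed without human annotation, which is the appeal of the synthetic route.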

yodon · 6mo ago

Think about it conceptually:

Could you watch a music video and say "that's the snare drum, that's the lead singer, keyboard, bass, that's the truck that's making the engine noise, that's the crowd that's cheering, oh and that's a jackhammer in the background"? So can AI.

Could you point out who is lead guitar and who is rhythm guitar? So can AI.

recursive · 5mo ago

I thought about it. Still seems kind of pointless.

That doesn't seem any better than typing "rhythm guitar". In fact, it seems worse and with extra steps. Sometimes the thing making the sound is not pictured. This thing is going to make me scrub through the video until the bass player is in frame instead of just typing "bass guitar". Then it will burn some power inferring that the thing I clicked on was a bass.

scarecrowbob · 6mo ago

I mean, sometimes I'm -mixing- a show and I couldn't tell you where a specific sound is coming from....

m3kw9 · 6mo ago · 4 in thread

Can I create a continuous “who farted” detector? Would be great at parties

rmnclmnt · 6mo ago

Bighead is back! « Fart Alert »!

IncreasePosts · 6mo ago

Each person's unique fartprint is yet another way big tech will be tracking us

samat · 6mo ago

And ads based on a fart! I guess you could throw in some spectrography for content aware ads too!! ‘Hmm, I sense you like onions, you would love French soup in the restaurant downstairs today!’

BoorishBears · 6mo ago

They're already analyzing poop, what's a mic to go with your toilet camera?

https://www.kohlerhealth.com/dekoda/

websiteapi · 5mo ago · 3 in thread

I wonder if it works for speaker diarization out of the box. I've found that open source speaker diarization that doesn't require a lot of tweaking is basically non-existent.

hamza_q_ · 5mo ago

Yeah I was frustrated by slow and hard to use OSS diarization too; recently released a library to address that, check it out: https://github.com/narcotic-sh/senko

Also https://zanshin.sh, if you'd like speaker diarization when watching YouTube videos

noman-land · 5mo ago

Hey, thanks for this. Been trying it out and it's very fast but seems to hear more speakers than are in the audio. I didn't see a way to tweak speaker similarity settings or merge speakers in some way. Any advice?
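
One generic way to deal with over-segmented speakers, independent of any particular library, is to merge speakers whose average voice embeddings are close. The sketch below assumes one embedding vector (centroid) per detected speaker is available; `merge_similar_speakers` and the 0.75 threshold are hypothetical and are not part of senko's API.

```python
import numpy as np

def merge_similar_speakers(centroids, threshold=0.75):
    """Merge speakers whose centroid cosine similarity is at least threshold.

    Uses union-find, so merging is transitive (single linkage).
    Returns a list mapping each original speaker index to a merged label.
    """
    x = np.asarray(centroids, dtype=np.float64)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize rows
    sim = x @ x.T                                     # pairwise cosine similarity
    parent = list(range(len(x)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if sim[i, j] >= threshold:
                parent[find(j)] = find(i)

    roots = [find(i) for i in range(len(x))]
    # Renumber merged groups 0, 1, 2, ... in order of first appearance.
    relabel = {root: n for n, root in enumerate(dict.fromkeys(roots))}
    return [relabel[root] for root in roots]

# Three detected "speakers": the first two point in nearly the same direction.
labels = merge_similar_speakers([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
```

Lowering the threshold merges more aggressively, at the risk of collapsing genuinely different voices into one label.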

websiteapi · 5mo ago

looks interesting. will check it out.

ajcp · 6mo ago · 2 in thread

Given TikTok's insane creator adoption rate, is Meta developing these models to build out a content creation platform to compete?

mgraczyk · 6mo ago

I doubt it, although it's possible these models will be used for creator tools, I believe the main idea is to use them for data labeling.

At the time the first SAM was created, Meta was already spending over 2B/year on human labelers. Surely that number is higher now and research like this can dramatically increase data labeling volume

embedding-shape · 6mo ago

> I doubt it, although it's possible these models will be used for creator tools, I believe the main idea is to use them for data labeling.

How is creating 3D objects and characters (and something resembling bones/armature but isn't) supposed to help with data labeling? As synthetic data for training other models, maybe, but seems like this new release is aimed at improving their own tooling for content creators, hard to deny this considering their demos.

For the original SAM releases, I agree, that was probably the purpose. But these new ones that generate stuff and do effects and what not, clearly go beyond that initial scope.

7734128 · 6mo ago · 2 in thread

Finally a way to perhaps remove laugh tracks in the near future.

sefrost · 6mo ago

There are examples on YouTube of laughter tracks being removed and there are lots of awkward pauses, so I think you'd need to edit the video to cut the pauses out entirely.

- https://www.youtube.com/watch?v=23M3eKn1FN0

- https://www.youtube.com/watch?v=DgKgXehYnnw

embedding-shape · 6mo ago

Cutting the pauses will change the beats and rhythm of the scene, so you probably need to edit some of the voice lines and actual scenes too then. In the end, if you're not interested in the original performance and work, you might as well read the script instead and imagine it however you want, read it at the pace you want and so on.

IndySun · 5mo ago · 2 in thread

A lot of comments here exhibit the Gell-Mann amnesia effect writ large.

AlexeyBelov · 5mo ago

Your comment is just a meta-comment and that's just as bad. I suggest gently correcting people instead of just pointing out very non-specifically that someone is wrong.

IndySun · 5mo ago

I have. I did. I do. But like so many cocktail sticks launched towards the mammoth, eventually one lobs a final ineffectual remark.

But also agreed (with you, yes), for the vast majority of moments, ignore and don't add more noise. But sometimes... human after all.

keepamovin · 6mo ago

FB has been a pioneer in voice and audio, somehow. A couple of years ago FB-Research had a little repo on GitHub that was the best noise-removal / voice-isolation out there. I wanted to use it in Wisprnote and politely emailed the authors. Never heard back (that's okay), but I was so impressed with the perceptual quality and "wind removal" (so hard).

Oras · 6mo ago

To try: https://aidemos.meta.com/segment-anything/editor/segment-aud...

Github: https://github.com/facebookresearch/sam-audio

I quite like adding effects such as making the isolated speech studio-quality or broadcast-ready.

throwaw12 · 6mo ago

This is super cool. Of course, it is possible to separate instrument sounds using specialized tools, but I can't wait to see how people use this model for a bunch of other use cases where it's not trivial to use those specialized tools:

* remove background noise of tech products, but keep the nature

* isolate the voice of a single person and feed into STT model to improve accuracy

* isolating sound of events in games and many more

teeray · 6mo ago

I wonder if this would be nice for hearing aid users for reducing the background restaurant babble that overwhelms the people you want to hear.

AkshatJ27 · 6mo ago

You can try it out in the playground: https://aidemos.meta.com/segment-anything/gallery/ There seem to be many more fun little demos by meta here like automatic video masking, making 3d models from 2d images, etc.

samuell · 6mo ago

I tried this to try to extract some speech from an audio track with heavy noise from wind (filmed out on a windy sea shore without mic windscreen), and the result unfortunately was less intelligible than the original.

I got much better results, though still not perfect, with the voice isolator in ElevenLabs.

ac2u · 6mo ago

I wonder if the segmentation would work with a video of a ventriloquist and a dummy?

theflyestpilot · 6mo ago

sample anything model?
