Curious how this was allowed to be more open source compared to Llama's interesting new take on "open source". Are other projects restricted in some form due to technical/legal issues and the desire is to be more like this project? Or was there an initiative to break the mold this time round?
My favorite use case is that it slays for memes. Try getting a good alpha mask of Fassbender Turtleneck any other way.
Keep doing stuff like this. <3
Considering the instant flood of noisy issues/PRs on the repo and the limited fix/update support on SAM, are there plans/buy-in for support of SAM2 on the medium-term beyond quick fixes? Either way, thank you to the team for your work on this and the continued public releases!
A segment then is a collection of images that follow each other in time?
So if you have a video comprised of img1, img2, img3, img4 and an object shows up in img1, img2 and img4
Can you catch that as the sequence img1, img2, img3, img4? And can you also catch just the object in img1, img2, img4, but get some sort of information that there is a break between img2 and img4 - the number of images in the break, etc.?
On edit: Or am I totally off about the segment possibilities and what it means?
Or can you only catch img1 and img2 as a sequence?
The recognition logic doesn't have to be reviewing the video all the time, only when motion is detected.
I think some cameras already try to do this, however, they are really bad at it.
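For what it's worth, a minimal sketch of that motion-gating idea (my assumption of the setup, not anyone's actual pipeline; `run_segmentation` is a hypothetical hook for the heavy model):

```python
# Cheap frame differencing with OpenCV gates the expensive recognition call.
import cv2

def run_segmentation(frame):
    ...  # hypothetical: call SAM 2 / your recognition model here

cap = cv2.VideoCapture("camera_feed.mp4")  # placeholder source
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev_gray)
    prev_gray = gray
    # Only wake the heavy model when enough pixels changed.
    if (diff > 25).mean() > 0.01:
        run_segmentation(frame)
```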
Basically the same issue the EU has with demos not launching there. You fine tech firms under vague laws often enough, and they stop doing business there.
The first one was excellent. Now part of my Gimp toolbox. Thanks for your work!
It doesn't matter much because all the real computation happens on the GPU. But you could take their neural network and do inference using any language you want.
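Not SAM 2 specific, but to illustrate the "any language" point, the usual pattern is exporting the network to an interchange format and running it from whatever runtime you like. A toy sketch (toy model and file name are placeholders):

```python
# Export a (toy) PyTorch model to ONNX, then run it with onnxruntime.
# The same .onnx file can be loaded from C++, Rust, JS, etc.
import torch
import torch.nn as nn
import onnxruntime as ort

toy = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
dummy = torch.randn(1, 3, 64, 64)
torch.onnx.export(toy, dummy, "toy.onnx", input_names=["x"], output_names=["y"])

sess = ort.InferenceSession("toy.onnx")
(out,) = sess.run(None, {"x": dummy.numpy()})
print(out.shape)  # (1, 8, 64, 64)
```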
1. SAM 2 was trained on 256 A100 GPUs for 108 hours (SAM1 was 68 hrs on the same cluster). Taking the upper-end $2/hr A100 price off gpulist, that's 256 × 108 × $2 ≈ $55k to train - surprisingly cheap for adding video understanding?
2. new dataset: the new SA-V dataset is "only" 50k videos, with careful attention given to scene/object/geographical diversity incl that of annotators. I wonder if LAION or Datacomp (AFAICT the only other real players in the open image data space) can reach this standard..
3. bootstrapped annotation: similar to SAM1, a 3-phase approach where 16k initial annotations across 1.4k videos were then expanded to 63k + 197k more with SAM 1+2 assistance, with annotation time accelerating dramatically (89% faster than SAM1 only) by the end
4. memory attention: SAM2 is a transformer with memory across frames! Special "object pointer" tokens are stored in a "memory bank" FIFO queue of recent and prompted frames (rough sketch below). Has this been explored in language models? whoa?
(written up in https://x.com/swyx/status/1818074658299855262)
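For point 4, here's a rough sketch of how I read the memory-bank idea (illustrative names, not the actual SAM 2 code):

```python
# FIFO "memory bank" of embeddings from recent + prompted frames; the current
# frame cross-attends to it before producing masks.
from collections import deque
import torch
import torch.nn as nn

class MemoryBank:
    def __init__(self, max_recent: int = 6):
        self.recent = deque(maxlen=max_recent)  # FIFO: old frames fall out
        self.prompted = []                      # frames the user prompted are kept

    def add(self, feats: torch.Tensor, prompted: bool = False):
        (self.prompted if prompted else self.recent).append(feats)

    def as_tensor(self) -> torch.Tensor:
        # assumes at least one frame has been added
        return torch.cat(list(self.prompted) + list(self.recent), dim=1)

class MemoryAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor, bank: MemoryBank) -> torch.Tensor:
        memory = bank.as_tensor()                 # (B, mem_tokens, dim)
        out, _ = self.attn(frame_feats, memory, memory)
        return frame_feats + out                  # current frame conditioned on memory
```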
interesting, how do you think this could be introduced to LLMs? I imagine in video some special tokens are preserved in the input to the next frame, so it's kind of like how LLMs see previous messages in chat history, but filtered down to only some category of tokens to limit the size of the context.
I believe this is a trick already borrowed from LLMs into the video space.
(I didn't read the paper, so that's speculation on my side)
I need to read the SAM2 paper, but 4. seems a lot like what Rex has in CUTIE. CUTIE can consistently track segments across video frames even if they get occluded / go out of frame for a while.
The Canon R1, for example, will not only continually track a particular object even if partially occluded but will also pre-focus on where it predicts the object will be when it emerges from being totally hidden. It can also be programmed by the user to focus on a particular face to the exclusion of all else.
I selected each shoe as individual objects and the model was able to segment them even as they overlapped.
Don’t most websites require you to accept cookies?
I know a lot of people who reflexively reject all cookies, and the internet indeed does keep working for them.
I can’t think of a technical reason a website without auth needs cookies to function.
Edit: Found lower in thread: biometric privacy laws
Just a guess, maybe it's the VideoFrame API? It was the only video-related feature I could find that Chrome and Safari have and FF doesn't.
Are there laws stricter than in California or EU in those places?
How's OpenMMLab's MMSegmentation, if you've tried it? https://github.com/open-mmlab/mmsegmentation
It seems like Amazon is putting its weight behind it (from the papers they've published): https://github.com/amazon-science/bigdetection
Would be extremely useful to be able to semantically "chunk" text for RAG applications compared to the generally naive strategies employed today.
If I somehow overlooked it, would be very interested in hearing about what you've seen.
I feel like one could do this with a chain of LLM prompts -- extract the primary subjects or topics from this long document, then prompt again (1 at a time?) to pull out everything related to each topic from the document and collate it into one semantic chunk.
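A hedged sketch of that two-pass prompt chain (`call_llm` is a stand-in for whatever chat-completion client you'd use):

```python
# Pass 1: list the topics. Pass 2: one prompt per topic to collate the
# related text into a "semantic chunk".
def call_llm(prompt: str) -> str:
    ...  # plug in your OpenAI / local-model client here

def semantic_chunks(document: str) -> dict[str, str]:
    topics_raw = call_llm(
        "List the primary subjects or topics in this document, one per line:\n\n"
        + document
    )
    topics = [t.strip() for t in topics_raw.splitlines() if t.strip()]

    chunks = {}
    for topic in topics:  # one at a time, as suggested above
        chunks[topic] = call_llm(
            f"Extract and collate everything in the document related to "
            f"'{topic}'. Return only the relevant text.\n\n{document}"
        )
    return chunks
```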
At the very least, a dataset / benchmark centered around this task feels like it would be really useful.
I was initially thinking the obvious case would be some sort of system for monitoring your plant health. It could check for shrinkage / growth, colour change etc and build some sort of monitoring tool / automated watering system off that.
I haven't put it up on my website yet (and proper documentation is still coming) so unfortunately the best I can do is show you an Instagram link:
https://www.instagram.com/p/C98t1hlzDLx/?igsh=MWxuOHlsY2lvdT...
Not exactly functional, but fun. Artwork aside, it's quite interesting to see your life broken into all its little bits. It provides a new perspective (apparently, there are a lot more teacups in my life than I notice).
Sadly my knowledge of how to make use of these models is limited to what I learned playing with some (very ancient) MediaPipe and Tensorflow models. Those models provided some WASM code to run the model in the browser, and I was able to find the data from that to pipe it through to my canvas effects[2]. I'd love to get something similar working with SAM2!
[1] - https://scrawl-v8.rikweb.org.uk/demo/canvas-027.html
[2] - https://scrawl-v8.rikweb.org.uk/demo/mediapipe-003.html
[1] - page 11 of https://ai.meta.com/research/publications/sam-2-segment-anyt...
Splashing water or orange juice, spraying snow from skis, rain and snowfall, foliage, fences and meshes, veils, etc.
Does anyone know if this already happened?
The same thing happened with the Threads app which was withheld from European users last year for no actual technical reason. Now it’s been released and nothing changed in between.
These free models and apps are bargaining chips for Meta against the EU. Once the regulatory situation settles, they’ll do what they always do and adapt to reach the largest possible global audience.
This video segmentation model could be used by self-driving cars to detect pedestrians, or in road traffic management systems to detect vehicles, either of which would make it a Chapter III High-Risk AI System.
And if we instead say it's not specific to those high-risk applications, it is instead a general purpose model - wouldn't that make it a Chapter V General Purpose AI Model?
Obviously you and I know the "general purpose AI models" chapter was drafted with LLMs (and their successors) in mind, rather than image segmentation models - but it's the letter of the law, not the intent, that counts.
No technical reason, but legal reasons. IIRC it was about cross-account data sharing from Instagram to Threads, which is a lot more dicey legally in the EU than in NA.
They’ve also had separate apps in the past that shared an Instagram account, like IGTV (2018 - 2022).
The Threads delay was primarily a lobbying ploy.
It seems that https://mullvad.net is a necessary part of my Internet toolkit these days, for many reasons.
Alright, I'll bite, why not?
I can't make sense of this sentence. Is there some mistake?
As it is written, I don't see the link between "We extend SAM to video" and "by considering images as a video with a single frame".
- "We extend SAM to video", because it was previously only for images and its capabilities are being extended to videos
- "by considering images as a video with a single frame", explaining how they support and build upon the previous image functionality
The main assumptions here are that images -> videos is a level up as opposed to being a different thing entirely, and the previous level is always supported.
"retrofit" implies that the ability to handle images was bolted on afterwards. "extend to video" implies this is a natural continuation of the image functionality, so the next part of the sentence is explaining why there is a natural continuation.
Is SAM-2 useful to use as a base model to finetune a classifier layer on? Or are there better options today?
[0] based on comparing avg labeling session time on individual polygon creation vs SAM-powered polygon examples [1] https://github.com/autodistill/autodistill
Someone who knows Creative Suite can comment on what Photoshop can do on this front these days; one imagines it's something, but the SAM stuff is so fast it can run in low-spec settings.
[1] - https://github.com/IDEA-Research/Grounded-Segment-Anything
It does detection on the backend and then feeds those bounding boxes into SAM running in the browser. This is a little slow on the first pass but allows the user to adjust the bboxes and get new segmentations in nearly real time, without putting a ton of load on the server. Saved me having to label a bunch of holds with precise masks/polygons (I labeled 10k for the detection model and that was quite enough). I might try using SAM's output to train a smaller model in the future, haven't gotten around to it.
(Site is early in development and not ready for actual users, but feel free to mess around.)
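For anyone curious, the server-side half of that flow looks roughly like this with the SAM 1 predictor API (illustrative, not the site's actual code; checkpoint path is a placeholder):

```python
# A detector proposes boxes; SAM turns each box into a mask.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint
predictor = SamPredictor(sam)

def masks_for_boxes(image: np.ndarray, boxes: list[list[int]]) -> list[np.ndarray]:
    predictor.set_image(image)
    masks = []
    for box in boxes:  # boxes come from the hold-detection model on the backend
        m, _, _ = predictor.predict(box=np.array(box), multimask_output=False)
        masks.append(m[0])
    return masks
```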
You probably just have to wait a few weeks for SAM v2 to be available. Hugging Face might also have some offering.
I do have 2 questions: 1. Isn't addressing the video frame by frame expensive? 2. In the web demo, when the leg moves fast it loses its track of the shoe. Doesn't the memory part throw in some heuristics to overcome this edge case?
i.e. if I stand in the center of my room and take a video while spinning around slowly over 5 seconds, then reverse the spin for another 5 seconds.
Will it see the same couch? Or will it see two couches?
Quote: "Sorry Firefox users! The Firefox browser doesn’t support the video features we’ll need to run this demo. Please try again using Chrome or Safari."
Wtf is this shit? Seriously!
I was wondering why the original one got deprecated.
Is there now also a good way for finetuning from the official / your side?
Any benchmarks against SAM1?