Meta took the open path because their initial foray into AI was compromised, so they've been doing their best to kneecap everyone else ever since.
I like the result, but let's not pretend it was done out of gracious intent.
Don't basically all the "top labs" except Anthropic now have open weight models? And Zuckerberg said they were now going to be "careful about what we choose to open source" in the future, which is a shift from their previous rhetoric about "Open Source AI is the Path Forward".
Facebook is a deeply scummy company[2] and their stranglehold on online advertising spend (along with Google) allows them to pour enormous funds into side bets like this.
[1] I didn't take them up on the offer to interview in the wake of that and so it will be forever known as "I've made a huge mistake."
I put together a YOLO tune for climbing hold detection a while back (trained on 10k labels) and this is 90% as good out of the box - just misses some foot chips and low contrast wood holds, and can't handle as many instances. It would've saved me a huge amount of manual annotation though.
[1]: https://github.com/facebookresearch/dinov3 [2]: https://imgeditor.co/
I hope this makes sense and I'm using terms loosely. It is an amazing model but it doesn't work for my use case, that's all!
SAM3 seems to trace the images less precisely: it'll discard spots where kids draw outside the lines a bit, which is okay, but it also seems to struggle around sharp corners and includes bits of the white page that I'd like cut out.
Of course, SAM3 is significantly more powerful in that it does much more than simply cut out images. It seems to be able to identify what these kids' drawings represent. That's very impressive; AI models are typically trained on photos and adult illustrations, and they struggle with children's drawings. So I could perhaps still use this for identifying content, giving kids more freedom to draw what they like, and then, unprompted, attach appropriate behavior to their drawings in-game.
BiRefNet 2 seems to do a much better job of correctly removing backgrounds inside the content's outline: think hands on hips, the region that's fully enclosed but that you want removed. It's not just that, though; some other models will remove this too, but they'll be overly aggressive and also remove white areas where kids haven't coloured in perfectly, or the intentionally blank whites of eyes, for example.
I'm putting these images in a game world once they're cut out, so if things are too transparent, they look very odd.
[Update: I should have mentioned I got the 4-second figure from the roboflow.com links in this thread]
> This excellent performance comes with fast inference — SAM 3 runs in 30 milliseconds for a single image with more than 100 detected objects on an H200 GPU.
I don't even care about the numbers; a vision transformer encoder whose output is too heavy for many edge-compute CNNs to use as input isn't gonna cut it.
You can get an easy-to-use API endpoint by creating a workflow in Roboflow with just the SAM3 block in it (and hooking up an input parameter to forward the prompt to the model), which is then available as an HTTP endpoint. You can use the SAM3 template and remove the visualization block if you need just the JSON response, for slightly lower latency and a smaller payload.
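A minimal sketch of calling such a workflow over HTTP. The endpoint path, workflow id, and payload field names below are placeholders, not the real Roboflow API (check your workflow's deploy tab for those); only the overall shape is the point: an API key plus an inputs dict, one entry of which is the text prompt forwarded to the SAM3 block.

```python
import json

# Placeholder URL -- substitute the one shown for your deployed workflow.
API_URL = "https://serverless.roboflow.com/infer/workflows/<workspace>/<workflow-id>"

def build_request(api_key: str, image_url: str, prompt: str) -> dict:
    """Assemble the JSON body for one workflow run (field names are illustrative)."""
    return {
        "api_key": api_key,
        "inputs": {
            "image": {"type": "url", "value": image_url},
            "prompt": prompt,  # forwarded to the SAM3 block's text input
        },
    }

payload = build_request("<key>", "https://example.com/holds.jpg", "climbing hold")
# resp = requests.post(API_URL, json=payload, timeout=30)
# masks = resp.json()  # JSON-only once the visualization block is removed
print(json.dumps(payload["inputs"]))
```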
Internally we're seeing roughly ~200ms HTTP round trips, but our user-facing API currently has some additional latency because we have to proxy to a different cluster where we have more GPU capacity allocated for this model than we can currently get on GCP.
Two years ago we released autodistill[1], an open source framework that uses large foundation models to create training data for training small realtime models. I'm convinced the idea was right, but too early; there wasn't a big model good enough to be worth distilling from back then. SAM3 is finally that model (and will be available in Autodistill today).
We are also taking a big bet on SAM3 and have built it into Roboflow as an integral part of the entire build and deploy pipeline[2], including a brand new product called Rapid[3], which reimagines the computer vision pipeline in a SAM3 world. It feels really magical to go from an unlabeled video to a fine-tuned realtime segmentation model with minimal human intervention in just a few minutes (and we rushed the release of our new SOTA realtime segmentation model[4] last week because it's the perfect lightweight complement to the large & powerful SAM3).
We also have a playground[5] up where you can play with the model and compare it to other VLMs.
[1] https://github.com/autodistill/autodistill
[2] https://blog.roboflow.com/sam3/
[3] https://rapid.roboflow.com
I'm not sure if the work they did with DINOv3 went into SAM3. I don't see any mention of it in the paper, though I just skimmed it.
It makes a great target to distill SAM3 to.
But I'm impressed by the ability of this model to create an image encoding that is independent of the prompt. I feel like there may be lessons in the training approach that could carry over to a U-Net for a more valuable encoding.
Is there some functionality I'm missing? I've tried Safari and Firefox.
That used SAM 2, and in my experience SAM 2 was more or less perfect—I didn’t really see the need for a SAM 3. Maybe it could have been better at segmenting without input.
But the new text prompt input seems nice; much easier to automate things using text input.
I've been considering building something similar but focused on static stuff like watermarks, so just single masks. From the DiffuEraser page it seems performance is brutally slow: less than 1 fps at 720p.
For watermarks you can use ffmpeg's blur, which is of course super fast and looks good on content that's mostly uniform, like a sky, but terrible and very obvious on most backgrounds. I've gotten really good results with videos shot on static cameras by generating a single inpainted frame and then just using that as a "cover", cropped and blurred, over the watermark (or any object, really). Even better results come from completely stabilizing the video and balancing the color if it's changing slightly over time. This only works if nothing moving intersects with the removed target; if the camera is moving, you need every frame inpainted.
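The two tricks above can be sketched as ffmpeg invocations. Coordinates and filenames here are made up; the first filtergraph crops the watermark box, box-blurs it, and overlays it back, and the second overlays one pre-inpainted patch (cropped from a single cleaned frame) over the region in every frame.

```python
def blur_region_cmd(src, dst, x, y, w, h, strength=10):
    """ffmpeg filtergraph: crop the watermark box, box-blur it, overlay it back.
    (delogo is the other common choice; it interpolates from the box edges.)"""
    fg = (f"[0:v]crop={w}:{h}:{x}:{y},boxblur={strength}[b];"
          f"[0:v][b]overlay={x}:{y}")
    return ["ffmpeg", "-i", src, "-filter_complex", fg, "-c:a", "copy", dst]

def cover_patch_cmd(src, patch_png, dst, x, y):
    """Static-camera trick: paste one pre-inpainted patch over every frame."""
    return ["ffmpeg", "-i", src, "-i", patch_png,
            "-filter_complex", f"[0:v][1:v]overlay={x}:{y}", "-c:a", "copy", dst]

# Example: blur a 200x48 watermark box at (16, 16); run via subprocess.run(cmd).
cmd = blur_region_cmd("in.mp4", "clean.mp4", 16, 16, 200, 48)
```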
Thus far, all full-video inpainting like this has been too slow to be practically useful: casually removing watermarks takes tens of minutes instead of seconds, where I'd really want processing to be close to realtime. I've wondered what knobs, if any, can be turned to sacrifice quality for performance. My main ideas: automate detecting and applying that single-frame technique to as much of the video as possible, then separately process the remaining chunks with diffusion scaled down to something really small like 240p, and finally use AI-based upscaling on those chunks, which seems fairly fast these days compared to diffusion.
Masking is fast — more or less real-time, maybe even a bit faster.
However, infill is not real-time. It runs at about 0.8 FPS on an RTX 3090 at 860p (the default resolution of the underlying networks).
There are much faster models out there, but none that match the visual quality and can run on a consumer GPU as of now. The use case for VideoVanish is more geared towards professional or hobby video editing — e.g., you filmed a scene for a video or movie and don't want to spend two days doing manual inpainting.
VideoVanish does have an option to run the infill at a lower resolution, where it fills only the infilled areas using the low-resolution output; that way you can trade visual fidelity for speed. Depending on what's behind the patches, this can be a very viable approach.
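A rough sketch of that compositing step (not VideoVanish's actual code): run the model at 1/scale resolution, upsample its output, and paste it back only inside the masked region, so untouched pixels keep full fidelity. Nearest-neighbour upsampling is assumed here for simplicity; a real pipeline would presumably use a better resampler.

```python
import numpy as np

def composite_lowres_infill(frame, mask, lowres_fill, scale):
    """frame:       (H, W, 3) uint8 original frame
    mask:        (H, W)    bool, True where content was removed
    lowres_fill: (H//scale, W//scale, 3) low-resolution model output"""
    # Nearest-neighbour upsample of the low-res infill back to full size.
    up = np.repeat(np.repeat(lowres_fill, scale, axis=0), scale, axis=1)
    out = frame.copy()
    out[mask] = up[mask]  # only masked pixels take the low-res content
    return out

# Tiny demo: 4x4 black frame, one masked pixel filled from a 2x2 white infill.
frame = np.zeros((4, 4, 3), np.uint8)
mask = np.zeros((4, 4), bool)
mask[0, 0] = True
result = composite_lowres_infill(frame, mask, np.full((2, 2, 3), 255, np.uint8), 2)
```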
I'm curious how this works for hair and transparent/translucent things. Probably not the best, but does not seem to be mentioned anywhere? Presumably it's just a straight line or vector rather than alpha etc?
Curious if you find interesting results - https://playground.roboflow.com
I like seeing this
A few examples I encountered recently: if I take a picture of my living room, many random objects would be impossible for a stranger to identify but easy for household members. Or when driving at night, say I see a big dark shape coming from the side of the road: if I'm a local I'll know there are horses in that field and that it's fenced, or I might have read a warning sign earlier that lets me deduce what I'm seeing a few minutes later.
People are usually not conscious of this, but you can try to block out the additional information and process only what's really coming from your eyes, and realize how quickly it becomes insufficient.
Uneducated question, so it may sound silly: a sufficiently complex vision model must have seen a million living rooms and the random objects in them to make some good guesses, no?
Limitations like understanding...
"Krita plugin Smart Segments lets you easily select objects using Meta’s Segment Anything Model (SAM v2). Just run the tool, and it automatically finds everything on the current layer. You can click or shift-click to choose one or more segments, and it converts them into a selection."
Also LOL @ the pictures in the readme on GitHub
* Does Adobe have their version of this for use within Photoshop, with all of the new AI features they're releasing? Or are they using this behind the scenes?
* If so, how does this compare?
* What's the best-in-class segmentation model on the market?
I’ve seen versions where people use an in-memory FS to write frames of a stream with SAM2. Maybe that is good enough?
I used SAM2 for tracking tumors in real-time MRI images. With the default SAM2, loading images from disk, we could only process videos with 10^2 - 10^3 frames before running out of memory.
By developing/adapting a custom version (1) based on a modified implementation with real (almost) stateless streaming (2), we were able to increase that to 10^5 frames. While this was enough for our purposes, I spent way too much time debugging/investigating tiny differences between SAM2 versions. So it’s great that the canonical version now supports streaming as well.
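A toy illustration (nothing like the actual SAM2 code) of why stateless streaming bounds memory: keep a fixed-size bank of recent frame features instead of the whole video, so usage is O(K) whether you process 10^3 or 10^5 frames.

```python
from collections import deque

class StreamingMemoryBank:
    """Hypothetical tracker sketch: memory bank capped at `capacity` entries."""
    def __init__(self, capacity: int = 7):
        self.bank = deque(maxlen=capacity)  # oldest entries evicted automatically

    def step(self, frame_feature: float) -> float:
        self.bank.append(frame_feature)
        # the per-frame prediction conditions only on the bounded bank
        return sum(self.bank) / len(self.bank)

tracker = StreamingMemoryBank(capacity=7)
for t in range(100_000):
    pred = tracker.step(float(t))
print(len(tracker.bank))  # 7: bounded, regardless of frame count
```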
(Side note: I also know of people using SAM2 for real-time ultrasound imaging.)
1 https://github.com/LMUK-RADONC-PHYS-RES/mrgrt-target-localiz...
Roboflow has been long on zero-/few-shot concept segmentation. We've opened up a research preview exploring a SAM 3 native direction for creating your own model: https://rapid.roboflow.com/
No idea what they will do for their API, but from a compute perspective the prompt is free once the image is processed.
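A toy sketch of that amortization, encode once and prompt many times; all names here are made up, with a counter standing in for the expensive prompt-independent image encoder.

```python
from functools import lru_cache

calls = {"encode": 0}

def encode_image(image_id: str) -> str:
    """Stand-in for the expensive, prompt-independent image encoder."""
    calls["encode"] += 1
    return f"embedding({image_id})"

@lru_cache(maxsize=128)
def cached_encode(image_id: str) -> str:
    return encode_image(image_id)

def segment(image_id: str, prompt: str) -> str:
    emb = cached_encode(image_id)        # paid once per image
    return f"decode({emb}, '{prompt}')"  # cheap per-prompt decoding

for p in ["cat", "dog", "couch"]:
    segment("living_room.jpg", p)
print(calls["encode"])  # 1: three prompts, one encoding
```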
Relevant to that comic specifically: https://www.reddit.com/r/xkcd/comments/mi725t/yeardate_a_com...