Meta took the open path because their initial foray into AI was compromised, so they've been doing their best to kneecap everyone else ever since.
I like the result, but let's not pretend it was done out of gracious intent.
Don't basically all the "top labs" except Anthropic now have open weight models? And Zuckerberg said they were now going to be "careful about what we choose to open source" in the future, which is a shift from their previous rhetoric about "Open Source AI is the Path Forward".
Facebook is a deeply scummy company[2] and their stranglehold on online advertising spend (along with Google) allows them to pour enormous funds into side bets like this.
[1] I didn't take them up on the offer to interview in the wake of that and so it will be forever known as "I've made a huge mistake."
I put together a YOLO tune for climbing hold detection a while back (trained on 10k labels) and this is 90% as good out of the box - just misses some foot chips and low contrast wood holds, and can't handle as many instances. It would've saved me a huge amount of manual annotation though.
[1]: https://github.com/facebookresearch/dinov3 [2]: https://imgeditor.co/
I hope this makes sense and I'm using terms loosely. It is an amazing model but it doesn't work for my use case, that's all!
SAM3 seems to trace the images less precisely: it'll discard spots where kids draw outside the lines a bit, which is okay, but it also seems to struggle around sharp corners and includes bits of the white page that I'd like cut out.
Of course, SAM3 is significantly more powerful in that it does much more than simply cut out images. It seems to be able to identify what these kids' drawings represent. That's very impressive; AI models are typically trained on photos and adult illustrations, and they struggle with children's drawings. So I could perhaps still use this for identifying content, giving kids more freedom to draw what they like, and then, unprompted, attach appropriate behavior to their drawings in-game.
BiRefNet 2 seems to do a much better job of correctly removing backgrounds inside the content's outline: think hands on hips, the region that's fully enclosed but that you want removed. It's not just that, though; some other models will remove this too, but they'll be overly aggressive and also remove white areas where kids haven't coloured in perfectly, or the intentionally blank whites of eyes, for example.
I'm putting these images in a game world once they're cut out, so if things are too transparent, they look very odd.
[Update: I should have mentioned I got the 4-second figure from the roboflow.com links in this thread]
> This excellent performance comes with fast inference — SAM 3 runs in 30 milliseconds for a single image with more than 100 detected objects on an H200 GPU.
I don't even care about the numbers; a vision transformer encoder whose output is too heavy for many edge-compute CNNs to use as input isn't gonna cut it.
You can get an easy-to-use API endpoint by creating a workflow in Roboflow with just the SAM3 block in it (and hooking up an input parameter to forward the prompt to the model), which is then available as an HTTP endpoint. You can use the SAM3 template and remove the visualization block if you need just the JSON response, for slightly lower latency and a smaller payload.
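A minimal sketch of calling such a workflow over HTTP. The endpoint path, workflow id, and payload field names below are placeholders, not the real Roboflow API (check your workflow's deploy tab for those); only the overall shape is the point: an API key plus an inputs dict, one entry of which is the text prompt forwarded to the SAM3 block.

```python
import json

# Placeholder URL -- substitute the one shown for your deployed workflow.
API_URL = "https://serverless.roboflow.com/infer/workflows/<workspace>/<workflow-id>"

def build_request(api_key: str, image_url: str, prompt: str) -> dict:
    """Assemble the JSON body for one workflow run (field names are illustrative)."""
    return {
        "api_key": api_key,
        "inputs": {
            "image": {"type": "url", "value": image_url},
            "prompt": prompt,  # forwarded to the SAM3 block's text input
        },
    }

payload = build_request("<key>", "https://example.com/holds.jpg", "climbing hold")
# resp = requests.post(API_URL, json=payload, timeout=30)
# masks = resp.json()  # JSON-only once the visualization block is removed
print(json.dumps(payload["inputs"]))
```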
Internally we're seeing roughly ~200ms HTTP round trips, but our user-facing API currently has some additional latency because we have to proxy to a different cluster where we have more GPU capacity allocated for this model than we can currently get on GCP.
Two years ago we released autodistill[1], an open source framework that uses large foundation models to create training data for training small realtime models. I'm convinced the idea was right, but too early; there wasn't a big model good enough to be worth distilling from back then. SAM3 is finally that model (and will be available in Autodistill today).
We are also taking a big bet on SAM3 and have built it into Roboflow as an integral part of the entire build and deploy pipeline[2], including a brand new product called Rapid[3], which reimagines the computer vision pipeline in a SAM3 world. It feels really magical to go from an unlabeled video to a fine-tuned realtime segmentation model with minimal human intervention in just a few minutes (and we rushed the release of our new SOTA realtime segmentation model[4] last week because it's the perfect lightweight complement to the large & powerful SAM3).
We also have a playground[5] up where you can play with the model and compare it to other VLMs.
[1] https://github.com/autodistill/autodistill
[2] https://blog.roboflow.com/sam3/
[3] https://rapid.roboflow.com
I'm not sure if the work they did with DINOv3 went into SAM3. I don't see any mention of it in the paper, though I just skimmed it.
It makes a great target to distill SAM3 to.
But I'm impressed by the ability of this model to create an image encoding that is independent of the prompt. I feel like there may be lessons in the training approach that could carry over to a U-Net for a more valuable encoding.
Is there some functionality I'm missing? I've tried Safari and Firefox.
That used SAM 2, and in my experience SAM 2 was more or less perfect—I didn’t really see the need for a SAM 3. Maybe it could have been better at segmenting without input.
But the new text prompt input seems nice; much easier to automate things using text input.
I've been considering building something similar but focused on static stuff like watermarks, so just single masks. From the DiffuEraser page it seems performance is brutally slow: less than 1 fps at 720p.
For watermarks you can use ffmpeg's blur, which is of course super fast and looks good on content that's mostly uniform, like a sky, but terrible and very obvious on most backgrounds. I've gotten really good results with videos shot on static cameras by generating a single inpainted frame and then just using that as a "cover", cropped and blurred, over the watermark (or any object, really). Even better results come from completely stabilizing the video and balancing the color if it's changing slightly over time. This only works if nothing moving intersects with the removed target; if the camera is moving, you need every frame inpainted.
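The two tricks above can be sketched as ffmpeg invocations. Coordinates and filenames here are made up; the first filtergraph crops the watermark box, box-blurs it, and overlays it back, and the second overlays one pre-inpainted patch (cropped from a single cleaned frame) over the region in every frame.

```python
def blur_region_cmd(src, dst, x, y, w, h, strength=10):
    """ffmpeg filtergraph: crop the watermark box, box-blur it, overlay it back.
    (delogo is the other common choice; it interpolates from the box edges.)"""
    fg = (f"[0:v]crop={w}:{h}:{x}:{y},boxblur={strength}[b];"
          f"[0:v][b]overlay={x}:{y}")
    return ["ffmpeg", "-i", src, "-filter_complex", fg, "-c:a", "copy", dst]

def cover_patch_cmd(src, patch_png, dst, x, y):
    """Static-camera trick: paste one pre-inpainted patch over every frame."""
    return ["ffmpeg", "-i", src, "-i", patch_png,
            "-filter_complex", f"[0:v][1:v]overlay={x}:{y}", "-c:a", "copy", dst]

# Example: blur a 200x48 watermark box at (16, 16); run via subprocess.run(cmd).
cmd = blur_region_cmd("in.mp4", "clean.mp4", 16, 16, 200, 48)
```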
Thus far, all full-video inpainting like this has been too slow to be practically useful: casually removing watermarks takes tens of minutes instead of seconds, where I'd really want processing to be close to realtime. I've wondered what knobs, if any, can be turned to sacrifice quality for performance. My main ideas: automate detecting and applying that single-frame technique to as much of the video as possible, then separately process the remaining chunks with diffusion scaled down to something really small like 240p, and finally use AI-based upscaling on those chunks, which seems fairly fast these days compared to diffusion.
Masking is fast — more or less real-time, maybe even a bit faster.
However, infill is not real-time. It runs at about 0.8 FPS on an RTX 3090 at 860p (the default resolution of the underlying networks).
There are much faster models out there, but none that match the visual quality and can run on a consumer GPU as of now. The use case for VideoVanish is more geared towards professional or hobby video editing — e.g., you filmed a scene for a video or movie and don't want to spend two days doing manual inpainting.
VideoVanish does have an option to run the infill at a lower resolution, where it fills only the infilled areas using the low-resolution output; that way you can trade visual fidelity for speed. Depending on what's behind the patches, this can be a very viable approach.
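A rough sketch of that compositing step (not VideoVanish's actual code): run the model at 1/scale resolution, upsample its output, and paste it back only inside the masked region, so untouched pixels keep full fidelity. Nearest-neighbour upsampling is assumed here for simplicity; a real pipeline would presumably use a better resampler.

```python
import numpy as np

def composite_lowres_infill(frame, mask, lowres_fill, scale):
    """frame:       (H, W, 3) uint8 original frame
    mask:        (H, W)    bool, True where content was removed
    lowres_fill: (H//scale, W//scale, 3) low-resolution model output"""
    # Nearest-neighbour upsample of the low-res infill back to full size.
    up = np.repeat(np.repeat(lowres_fill, scale, axis=0), scale, axis=1)
    out = frame.copy()
    out[mask] = up[mask]  # only masked pixels take the low-res content
    return out

# Tiny demo: 4x4 black frame, one masked pixel filled from a 2x2 white infill.
frame = np.zeros((4, 4, 3), np.uint8)
mask = np.zeros((4, 4), bool)
mask[0, 0] = True
result = composite_lowres_infill(frame, mask, np.full((2, 2, 3), 255, np.uint8), 2)
```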
I'm curious how this works for hair and transparent/translucent things. Probably not the best, but does not seem to be mentioned anywhere? Presumably it's just a straight line or vector rather than alpha etc?
Curious if you find interesting results - https://playground.roboflow.com
I like seeing this
A few examples I encountered recently: if I take a picture of my living room, many random objects would be impossible for a stranger to identify but easy for household members. Or when driving at night, say I see a big dark shape coming from the side of the road: if I'm a local I'll know there are horses in that field and that it's fenced, or I might have read a warning sign earlier that lets me deduce what I'm seeing a few minutes later.
People are usually not conscious of this, but you can try to block out the additional information and process only what's really coming from your eyes, and realize how quickly it becomes insufficient.
Uneducated question, so it may sound silly: a sufficiently complex vision model must have seen a million living rooms and the random objects in them to make some good guesses, no?
Limitations like understanding...
"Krita plugin Smart Segments lets you easily select objects using Meta’s Segment Anything Model (SAM v2). Just run the tool, and it automatically finds everything on the current layer. You can click or shift-click to choose one or more segments, and it converts them into a selection."
Also LOL @ the pictures in the readme on GitHub
* Does Adobe have their version of this for use within Photoshop, with all of the new AI features they're releasing? Or are they using this behind the scenes?
* If so, how does this compare?
* What's the best-in-class segmentation model on the market?
I’ve seen versions where people use an in-memory FS to write frames of a stream with SAM2. Maybe that is good enough?
I used SAM2 for tracking tumors in real-time MRI images. With the default SAM2, loading images from disk, we could only process videos with 10^2 - 10^3 frames before running out of memory.
By developing/adapting a custom version (1) based on a modified implementation with real (almost) stateless streaming (2), we were able to increase that to 10^5 frames. While this was enough for our purposes, I spent way too much time debugging/investigating tiny differences between SAM2 versions. So it’s great that the canonical version now supports streaming as well.
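A toy illustration (nothing like the actual SAM2 code) of why stateless streaming bounds memory: keep a fixed-size bank of recent frame features instead of the whole video, so usage is O(K) whether you process 10^3 or 10^5 frames.

```python
from collections import deque

class StreamingMemoryBank:
    """Hypothetical tracker sketch: memory bank capped at `capacity` entries."""
    def __init__(self, capacity: int = 7):
        self.bank = deque(maxlen=capacity)  # oldest entries evicted automatically

    def step(self, frame_feature: float) -> float:
        self.bank.append(frame_feature)
        # the per-frame prediction conditions only on the bounded bank
        return sum(self.bank) / len(self.bank)

tracker = StreamingMemoryBank(capacity=7)
for t in range(100_000):
    pred = tracker.step(float(t))
print(len(tracker.bank))  # 7: bounded, regardless of frame count
```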
(Side note: I also know of people using SAM2 for real-time ultrasound imaging.)
1 https://github.com/LMUK-RADONC-PHYS-RES/mrgrt-target-localiz...
Roboflow has been long on zero-/few-shot concept segmentation. We've opened up a research preview exploring a SAM 3 native direction for creating your own model: https://rapid.roboflow.com/
No idea what they will do for their API, but from a compute perspective the prompt is free once the image is processed.
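A toy sketch of that amortization, encode once and prompt many times; all names here are made up, with a counter standing in for the expensive prompt-independent image encoder.

```python
from functools import lru_cache

calls = {"encode": 0}

def encode_image(image_id: str) -> str:
    """Stand-in for the expensive, prompt-independent image encoder."""
    calls["encode"] += 1
    return f"embedding({image_id})"

@lru_cache(maxsize=128)
def cached_encode(image_id: str) -> str:
    return encode_image(image_id)

def segment(image_id: str, prompt: str) -> str:
    emb = cached_encode(image_id)        # paid once per image
    return f"decode({emb}, '{prompt}')"  # cheap per-prompt decoding

for p in ["cat", "dog", "couch"]:
    segment("living_room.jpg", p)
print(calls["encode"])  # 1: three prompts, one encoding
```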
Relevant to that comic specifically: https://www.reddit.com/r/xkcd/comments/mi725t/yeardate_a_com...