Curious how this was allowed to be more open source compared to Llama's interesting new take on "open source". Are other projects restricted in some form due to technical/legal issues and the desire is to be more like this project? Or was there an initiative to break the mold this time round?
My favorite use case is that it slays for memes. Try getting a good alpha mask of Fassbender Turtleneck any other way.
Keep doing stuff like this. <3
Considering the instant flood of noisy issues/PRs on the repo and the limited fix/update support on SAM, are there plans/buy-in for support of SAM2 on the medium-term beyond quick fixes? Either way, thank you to the team for your work on this and the continued public releases!
A segment then is a collection of images that follow each other in time?
So if you have a video comprised of img1, img2, img3, img4 and an object shows up in img1, img2 and img4
Can you catch that as the sequence img1, img2, img3, img4? And can you also catch just the object in img1, img2, img4, but get some sort of information that there is a break between img2 and img4 - the number of images in the break, etc.?
On edit: Or am I totally off about the segment possibilities and what it means?
Or can you only catch img1 and img2 as a sequence?
The recognition logic doesn't have to be reviewing the video all the time, only when motion is detected.
I think some cameras already try to do this, however, they are really bad at it.
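For what it's worth, a minimal sketch of that motion-gating idea (my assumption of the setup, not anyone's actual pipeline; `run_segmentation` is a hypothetical hook for the heavy model):

```python
# Cheap frame differencing with OpenCV gates the expensive recognition call.
import cv2

def run_segmentation(frame):
    ...  # hypothetical: call SAM 2 / your recognition model here

cap = cv2.VideoCapture("camera_feed.mp4")  # placeholder source
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev_gray)
    prev_gray = gray
    # Only wake the heavy model when enough pixels changed.
    if (diff > 25).mean() > 0.01:
        run_segmentation(frame)
```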
Basically the same issue the EU has with demos not launching there. You fine tech firms under vague laws often enough, and they stop doing business there.
The first one was excellent. Now part of my Gimp toolbox. Thanks for your work!
It doesn't matter much because all the real computation happens on the GPU. But you could take their neural network and do inference using any language you want.
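Not SAM 2 specific, but to illustrate the "any language" point, the usual pattern is exporting the network to an interchange format and running it from whatever runtime you like. A toy sketch (toy model and file name are placeholders):

```python
# Export a (toy) PyTorch model to ONNX, then run it with onnxruntime.
# The same .onnx file can be loaded from C++, Rust, JS, etc.
import torch
import torch.nn as nn
import onnxruntime as ort

toy = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
dummy = torch.randn(1, 3, 64, 64)
torch.onnx.export(toy, dummy, "toy.onnx", input_names=["x"], output_names=["y"])

sess = ort.InferenceSession("toy.onnx")
(out,) = sess.run(None, {"x": dummy.numpy()})
print(out.shape)  # (1, 8, 64, 64)
```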
1. SAM 2 was trained on 256 A100 GPUs for 108 hours (SAM1 was 68 hrs on the same cluster). Taking the upper-end $2/hr A100 price off gpulist, that's 256 × 108 × $2 ≈ $55k to train - surprisingly cheap for adding video understanding?
2. new dataset: the new SA-V dataset is "only" 50k videos, with careful attention given to scene/object/geographical diversity incl that of annotators. I wonder if LAION or Datacomp (AFAICT the only other real players in the open image data space) can reach this standard..
3. bootstrapped annotation: similar to SAM1, a 3-phase approach where 16k initial annotations across 1.4k videos were then expanded to 63k + 197k more with SAM 1+2 assistance, with annotation time accelerating dramatically (89% faster than SAM1 only) by the end
4. memory attention: SAM2 is a transformer with memory across frames! Special "object pointer" tokens are stored in a "memory bank" FIFO queue of recent and prompted frames (rough sketch below). Has this been explored in language models? whoa?
(written up in https://x.com/swyx/status/1818074658299855262)
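For point 4, here's a rough sketch of how I read the memory-bank idea (illustrative names, not the actual SAM 2 code):

```python
# FIFO "memory bank" of embeddings from recent + prompted frames; the current
# frame cross-attends to it before producing masks.
from collections import deque
import torch
import torch.nn as nn

class MemoryBank:
    def __init__(self, max_recent: int = 6):
        self.recent = deque(maxlen=max_recent)  # FIFO: old frames fall out
        self.prompted = []                      # frames the user prompted are kept

    def add(self, feats: torch.Tensor, prompted: bool = False):
        (self.prompted if prompted else self.recent).append(feats)

    def as_tensor(self) -> torch.Tensor:
        # assumes at least one frame has been added
        return torch.cat(list(self.prompted) + list(self.recent), dim=1)

class MemoryAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor, bank: MemoryBank) -> torch.Tensor:
        memory = bank.as_tensor()                 # (B, mem_tokens, dim)
        out, _ = self.attn(frame_feats, memory, memory)
        return frame_feats + out                  # current frame conditioned on memory
```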
interesting, how do you think this could be introduced to LLMs? I imagine in video some special tokens are preserved in the input to the next frame, so it's kind of like how LLMs see previous messages in chat history, but filtered down to only some category of tokens to limit the size of the context.
I believe this is a trick already borrowed from LLMs into the video space.
(I didn't read the paper, so that's speculation on my side)
I need to read the SAM2 paper, but 4. seems a lot like what Rex has in CUTIE. CUTIE can consistently track segments across video frames even if they get occluded / go out of frame for a while.
The Canon R1, for example, will not only continually track a particular object even if partially occluded but will also pre-focus on where it predicts the object will be when it emerges from being totally hidden. It can also be programmed by the user to focus on a particular face to the exclusion of all else.
I selected each shoe as individual objects and the model was able to segment them even as they overlapped.
Don’t most websites require you to accept cookies?
I know a lot of people who reflexively reject all cookies, and the internet indeed does keep working for them.
I can’t think of a technical reason a website without auth needs cookies to function.
Edit: Found lower in thread: biometric privacy laws
Just a guess, maybe it's the VideoFrame API? It was the only video-related feature I could find that Chrome and Safari have and FF doesn't.
Are there laws stricter than in California or EU in those places?
How's OpenMMLab's MMSegmentation, if you've tried it? https://github.com/open-mmlab/mmsegmentation
It seems like Amazon is putting its weight behind it (from the papers they've published): https://github.com/amazon-science/bigdetection
Would be extremely useful to be able to semantically "chunk" text for RAG applications compared to the generally naive strategies employed today.
If I somehow overlooked it, would be very interested in hearing about what you've seen.
I feel like one could do this with a chain of LLM prompts -- extract the primary subjects or topics from this long document, then prompt again (1 at a time?) to pull out everything related to each topic from the document and collate it into one semantic chunk.
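A hedged sketch of that two-pass prompt chain (`call_llm` is a stand-in for whatever chat-completion client you'd use):

```python
# Pass 1: list the topics. Pass 2: one prompt per topic to collate the
# related text into a "semantic chunk".
def call_llm(prompt: str) -> str:
    ...  # plug in your OpenAI / local-model client here

def semantic_chunks(document: str) -> dict[str, str]:
    topics_raw = call_llm(
        "List the primary subjects or topics in this document, one per line:\n\n"
        + document
    )
    topics = [t.strip() for t in topics_raw.splitlines() if t.strip()]

    chunks = {}
    for topic in topics:  # one at a time, as suggested above
        chunks[topic] = call_llm(
            f"Extract and collate everything in the document related to "
            f"'{topic}'. Return only the relevant text.\n\n{document}"
        )
    return chunks
```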
At the very least, a dataset / benchmark centered around this task feels like it would be really useful.
I was initially thinking the obvious case would be some sort of system for monitoring your plant health. It could check for shrinkage / growth, colour change etc and build some sort of monitoring tool / automated watering system off that.
I haven't put it up on my website yet (and proper documentation is still coming) so unfortunately the best I can do is show you an Instagram link:
https://www.instagram.com/p/C98t1hlzDLx/?igsh=MWxuOHlsY2lvdT...
Not exactly functional, but fun. Artwork aside, it's quite interesting to see your life broken into all its little bits. It provides a new perspective (apparently, there are a lot more teacups in my life than I notice).
Sadly my knowledge of how to make use of these models is limited to what I learned playing with some (very ancient) MediaPipe and Tensorflow models. Those models provided some WASM code to run the model in the browser, and I was able to find the data from that to pipe it through to my canvas effects[2]. I'd love to get something similar working with SAM2!
[1] - https://scrawl-v8.rikweb.org.uk/demo/canvas-027.html
[2] - https://scrawl-v8.rikweb.org.uk/demo/mediapipe-003.html
[1] - page 11 of https://ai.meta.com/research/publications/sam-2-segment-anyt...
Splashing water or orange juice, spraying snow from skis, rain and snowfall, foliage, fences and meshes, veils, etc.
Does anyone know if this already happened?
The same thing happened with the Threads app which was withheld from European users last year for no actual technical reason. Now it’s been released and nothing changed in between.
These free models and apps are bargaining chips for Meta against the EU. Once the regulatory situation settles, they’ll do what they always do and adapt to reach the largest possible global audience.
This video segmentation model could be used by self-driving cars to detect pedestrians, or in road traffic management systems to detect vehicles, either of which would make it a Chapter III High-Risk AI System.
And if we instead say it's not specific to those high-risk applications, it is instead a general purpose model - wouldn't that make it a Chapter V General Purpose AI Model?
Obviously you and I know the "general purpose AI models" chapter was drafted with LLMs (and their successors) in mind, rather than image segmentation models - but it's the letter of the law, not the intent, that counts.
No technical reason, but legal reasons. IIRC it was about cross-account data sharing from Instagram to Threads, which is a lot more dicey legally in the EU than in NA.
They’ve also had separate apps in the past that shared an Instagram account, like IGTV (2018 - 2022).
The Threads delay was primarily a lobbying ploy.
It seems that https://mullvad.net is a necessary part of my Internet toolkit these days, for many reasons.
Alright, I'll bite, why not?
I can't make sense of this sentence. Is there some mistake?
As it is written, I don't see the link between "We extend SAM to video" and "by considering images as a video with a single frame".
- "We extend SAM to video", because it was previously only for images and its capabilities are being extended to videos
- "by considering images as a video with a single frame", explaining how they support and build upon the previous image functionality
The main assumptions here are that images -> videos is a level up as opposed to being a different thing entirely, and the previous level is always supported.
"retrofit" implies that the ability to handle images was bolted on afterwards. "extend to video" implies this is a natural continuation of the image functionality, so the next part of the sentence is explaining why there is a natural continuation.
Is SAM-2 useful to use as a base model to finetune a classifier layer on? Or are there better options today?
[0] based on comparing avg labeling session time on individual polygon creation vs SAM-powered polygon examples [1] https://github.com/autodistill/autodistill
Someone who knows Creative Suite can comment on what Photoshop can do on this front these days; one imagines it's something, but the SAM stuff is so fast it can run in low-spec settings.
[1] - https://github.com/IDEA-Research/Grounded-Segment-Anything
It does detection on the backend and then feeds those bounding boxes into SAM running in the browser. This is a little slow on the first pass but allows the user to adjust the bboxes and get new segmentations in nearly real time, without putting a ton of load on the server. Saved me having to label a bunch of holds with precise masks/polygons (I labeled 10k for the detection model and that was quite enough). I might try using SAM's output to train a smaller model in the future, haven't gotten around to it.
(Site is early in development and not ready for actual users, but feel free to mess around.)
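For anyone curious, the server-side half of that flow looks roughly like this with the SAM 1 predictor API (illustrative, not the site's actual code; checkpoint path is a placeholder):

```python
# A detector proposes boxes; SAM turns each box into a mask.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint
predictor = SamPredictor(sam)

def masks_for_boxes(image: np.ndarray, boxes: list[list[int]]) -> list[np.ndarray]:
    predictor.set_image(image)
    masks = []
    for box in boxes:  # boxes come from the hold-detection model on the backend
        m, _, _ = predictor.predict(box=np.array(box), multimask_output=False)
        masks.append(m[0])
    return masks
```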
You probably just have to wait a few weeks for SAM v2 to be available. Hugging Face might also have some offering.
I do have 2 questions: 1. Isn't addressing the video frame by frame expensive? 2. In the web demo, when the leg moves fast it loses its track of the shoe. Doesn't the memory part throw in some heuristics to overcome this edge case?
i.e. if I stand in the center of my room and take a video while spinning around slowly over 5 seconds, then reverse the spin for another 5 seconds.
Will it see the same couch? Or will it see two couches?
Quote: "Sorry Firefox users! The Firefox browser doesn’t support the video features we’ll need to run this demo. Please try again using Chrome or Safari."
Wtf is this shit? Seriously!
I was wondering why the original one got deprecated.
Is there now also a good way for finetuning from the official / your side?
Any benchmarks against SAM1?