It depends on the machine, the number of threads selected, and the model checkpoint used (ViT-B, ViT-L, or ViT-H). The video demo attached is running on an Apple M2 Ultra and using the ViT-B model. Generating the image embedding takes ~1.9s there, and all the subsequent mask segmentations take ~45ms.
However, I am now focusing on improving the inference speed by making better use of ggml and trying out quantization. Once I make some progress in this direction I will compare to other SAM alternatives and benchmark more thoroughly.
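For readers unfamiliar with what quantization buys you here: the idea is to store weights in a low-bit integer format plus per-block scale factors, shrinking memory traffic (often the bottleneck) at the cost of a small rounding error. The sketch below illustrates block-wise symmetric int8 quantization in NumPy; it is a simplified illustration of the general technique, not ggml's exact on-disk format (ggml uses several block-quantized types such as Q4/Q8 variants, with their own layouts).

```python
import numpy as np

def quantize_q8(weights, block_size=32):
    """Block-wise symmetric int8 quantization (illustrative sketch,
    not ggml's exact scheme). Each block of `block_size` floats is
    stored as int8 values plus one float32 scale."""
    w = weights.reshape(-1, block_size)
    # One scale per block: map the block's max magnitude to 127.
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.round(w / scales).astype(np.int8)
    return q, scales

def dequantize_q8(q, scales):
    """Recover approximate float32 weights from int8 values + scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

weights = np.random.randn(1024).astype(np.float32)
q, s = quantize_q8(weights)
restored = dequantize_q8(q, s)
# int8 storage is ~4x smaller than float32, and the per-element
# rounding error is bounded by half a quantization step.
max_err = np.abs(weights - restored).max()
```

The ~4x memory saving is what typically translates into faster inference on CPU, since matrix multiplications there are usually memory-bandwidth bound.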
We've been working on using them (often in conjunction with SAM) for auto-labeling datasets to train smaller, faster models that can run in real time at the edge: https://github.com/autodistill/autodistill
The Bark Python model is very compute-intensive and requires a powerful GPU to get bearable inference speed. I really hope that bark.cpp with GPU/Metal support and quantized models can bring useful inference speed on a laptop in the near future.
Next Step: Incorporate this library into image editors like Photopea (via WebAssembly) to boost the speed of common selection tasks. The magic wand is a tool of the past.
I'd pay for such a feature.
Another popular optimisation is to port models to WASM + GPU, because it makes it easy to support a variety of platforms (desktop, mobile, ...) with a single API while still offering great performance (see Google's MediaPipe as an example of this).
Python is really great for fast prototyping, and it can be argued that most AI products so far are the result of fast prototyping. So I'm not sure there is anything wrong with that.
As practical models emerge, at that point it indeed makes sense to port them to C++. But I would not in my wildest dreams suggest prototyping a model in C++ unless absolutely necessary.