Though, I wonder if the whole hassle of relying on a phone's RGB-D sensor, copying the data off the phone, and then using yet another annotation tool is worth it, when you could instead use a tracking bounding-box annotation tool that interpolates boxes across many frames. With those, you can even annotate moving and distant objects, which I would argue is even better for generalization (since the background changes).
But I bet there are some use cases and users that can benefit from it.