Is it possible to handle SfM out of band? For example, by precisely measuring the location and orientation of the camera?
The paper’s pipeline includes a stage that identifies the in-focus area of an image. Perhaps you could use that to partition the input images. Exclusively use the in-focus areas for SfM, perhaps supplemented by out of band POV information, then leverage the whole image for training the splat.
Overall this seems like a slow journey to building end-to-end model pipelines. We’ve seen that in a few other domains, such as translation. It’s interesting to see when specialized algorithms are appropriate and when a unified neural pipeline works better. I think the main determinant is how much benefit there is to sharing information between stages.