Lighting is the big issue, IMO. As soon as you want any kind of interactivity beyond moving the camera, you need dynamic lighting. The problem is that you then have to mix captured real-world lighting, which is effectively perfect, with very approximate real-time computed lighting. Real-time lighting is much worse than offline path tracing, and even path tracing wouldn't match real-world quality. The mismatch is going to look awful, at least until someone figures out a revolutionary neural relighting system, and we are pretty far from that today.
Scale is another issue. Two issues, really: rendering and storage. There's already a lot of research into scaling rendering up to large, detailed scenes, but I wouldn't say it's solved yet. And once rendering is handled, storage will be the next bottleneck. These scans will be massive, and we'll need very effective compression to be able to distribute large scenes to users.