As they say in ML, representation first -- and this is one of the most natural and elegant ways to represent 3D scenes and subjective viewpoints. It's great that it plugs into a rendering pipeline such that the whole thing is end-to-end differentiable.
This is the first leap toward true high-quality real-time ML-based rendering. I'm blown away.
The paper says it's a 5 MB network, takes 12 hours to train, and then 30 seconds to render novel views of the scene on an NVIDIA V100.
Sadly not something you can use in real time but still very cool.
Edit: 12 hours and a 5 MB NN, not 5 minutes.
EDIT: I was referring to the last paragraph of section 5.3 (Implementation details), but maybe I’m misunderstanding how they use rays / sampled coordinates.
Very impressive visual quality. But it seems like they need a LOT of data and computation for each scene. So it's still plausible that intelligently done photogrammetry will beat this approach in efficiency, but a bunch of important details need to be figured out to make that happen.
>All compared single scene methods take at least 12 hours to train per scene
But it seems to only need sparse images.
>Here, we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation
Not sure what you mean by "views". The comparisons in the paper use at most 100 input images per scene.
And we are all supposed to become AI developers this decade?!
Come back Visual Basic all is forgiven :-)
I'd also like to see what it does when you give it multiple views of scenes in a video game. Some from the direct pictures and some from pictures of the monitor.
Better results than other methods so far.
Take it one step further and make model B create photos from some text description, similar to the one described in https://news.ycombinator.com/item?id=22640407 (although that one does 3D designs using voxels).
In this work they take pictures of a scene from different angles and are able to train a neural network to render the scene from new angles that aren't in any source pictures.
The neural network takes in a location (x, y, z) and a viewing direction, and spits out an RGB colour and a volume density for that point.
Using this network together with traditional volume-rendering techniques, they composite those samples along camera rays to render the whole scene.
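The idea above can be sketched in a few lines of numpy. This is a toy, not the paper's implementation: `toy_field` is a hypothetical stand-in for the trained MLP (the real one uses positional encoding and learned weights), and the compositing loop is just classic emission-absorption volume rendering along one ray.

```python
import numpy as np

def toy_field(xyz, view_dir):
    """Stand-in for the trained NeRF MLP (hypothetical): maps a 3D point and
    a viewing direction to (RGB colour, volume density). A hard-coded unit
    sphere keeps the sketch runnable without any training."""
    dist = np.linalg.norm(xyz, axis=-1)            # distance from origin
    sigma = np.where(dist < 1.0, 10.0, 0.0)        # dense inside the sphere
    rgb = np.stack([0.5 + 0.5 * xyz[..., 0],       # colour varies with x
                    np.full(dist.shape, 0.3),
                    np.full(dist.shape, 0.7)], axis=-1)
    return np.clip(rgb, 0.0, 1.0), sigma

def render_ray(origin, direction, n_samples=64, near=0.0, far=4.0):
    """Classic volume rendering: sample points along the ray, query the
    field at each point, then alpha-composite front to back."""
    t = np.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction          # (n_samples, 3) sample points
    rgb, sigma = toy_field(pts, direction)
    delta = np.diff(t, append=far)                 # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)           # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = alpha * trans                        # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)    # composited pixel colour

# A ray shot straight through the sphere picks up its colour.
colour = render_ray(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
print(colour)
```

In the actual paper this per-ray integral is differentiable, so the MLP's weights are optimised directly against the input photos; repeating `render_ray` for every pixel of a virtual camera yields a novel view.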
i.e. few source images vs. traditional photogrammetry.
...but basically yes, tldr; photogrammetry using neural networks; this one is better than other recent attempts at the same thing, but takes a really long time (2 days for this vs 10 minutes for a voxel based approach in one of their comparisons).
Why bother?
Mmm... there's some speculation that you might be able to represent a photorealistic scene / 3D object as a neural model instead of voxels or meshes.
That might be useful for some things. E.g. a voxel representation of semi-transparent fog or of high-detail objects like hair is impractically huge, and as a mesh they're very difficult to represent.
Smart, high-dimensional interpolator.