As they say in ML, representation first -- and this is one of the most natural and elegant ways to represent 3D scenes and subjective viewpoints. It's great that it plugs into a rendering pipeline such that the whole thing is end-to-end differentiable.
This is the first leap toward true high-quality real-time ML-based rendering. I'm blown away.
The paper says it's a 5 MB network, takes 12 hours to train, and then 30 seconds to render novel views of the scene on an NVIDIA V100.
Sadly not something you can use in real time but still very cool.
Edit: 12 hours and a 5 MB NN, not 5 minutes.
EDIT: I was referring to the last paragraph of section 5.3 (Implementation details), but maybe I’m misunderstanding how they use rays / sampled coordinates.
Very impressive visual quality. But it seems like they need a LOT of data and computation for each scene. So it's still plausible that intelligently done photogrammetry will beat this approach in efficiency, but a bunch of important details need to be figured out to make that happen.
>All compared single scene methods take at least 12 hours to train per scene
But it seems to only need sparse images.
>Here, we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation
Not sure what you mean by "views". The comparisons in the paper use at most 100 input images per scene.
And we are all supposed to become AI developers this decade?!
Come back Visual Basic all is forgiven :-)
I'd also like to see what it does when you give it multiple views of scenes in a video game. Some from the direct pictures and some from pictures of the monitor.
Better results than other methods so far.
Take it one step further and make model B create photos from some text description, similar to the one described in https://news.ycombinator.com/item?id=22640407 (although that one does 3D designs using voxels).
In this work they take pictures of a scene from different angles and are able to train a neural network to render the scene from new angles that aren't in any source pictures.
The neural network takes in a location (x, y, z) and a viewing direction, and spits out an RGB colour and a volume density for that point.
Using this network together with traditional volume-rendering techniques, they composite those samples along camera rays to render the whole scene.
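The idea above can be sketched in a few lines of numpy. This is a toy, not the paper's implementation: `toy_field` is a hypothetical stand-in for the trained MLP (the real one uses positional encoding and learned weights), and the compositing loop is just classic emission-absorption volume rendering along one ray.

```python
import numpy as np

def toy_field(xyz, view_dir):
    """Stand-in for the trained NeRF MLP (hypothetical): maps a 3D point and
    a viewing direction to (RGB colour, volume density). A hard-coded unit
    sphere keeps the sketch runnable without any training."""
    dist = np.linalg.norm(xyz, axis=-1)            # distance from origin
    sigma = np.where(dist < 1.0, 10.0, 0.0)        # dense inside the sphere
    rgb = np.stack([0.5 + 0.5 * xyz[..., 0],       # colour varies with x
                    np.full(dist.shape, 0.3),
                    np.full(dist.shape, 0.7)], axis=-1)
    return np.clip(rgb, 0.0, 1.0), sigma

def render_ray(origin, direction, n_samples=64, near=0.0, far=4.0):
    """Classic volume rendering: sample points along the ray, query the
    field at each point, then alpha-composite front to back."""
    t = np.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction          # (n_samples, 3) sample points
    rgb, sigma = toy_field(pts, direction)
    delta = np.diff(t, append=far)                 # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)           # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = alpha * trans                        # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)    # composited pixel colour

# A ray shot straight through the sphere picks up its colour.
colour = render_ray(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
print(colour)
```

In the actual paper this per-ray integral is differentiable, so the MLP's weights are optimised directly against the input photos; repeating `render_ray` for every pixel of a virtual camera yields a novel view.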
i.e. few source images vs. traditional photogrammetry.
...but basically yes, tldr; photogrammetry using neural networks; this one is better than other recent attempts at the same thing, but takes a really long time (2 days for this vs 10 minutes for a voxel based approach in one of their comparisons).
Why bother?
Mmm... there's some speculation that you might be able to represent a photorealistic scene / 3D object as a neural model instead of voxels or meshes.
That might be useful for some things. E.g. a voxel representation of semi-transparent fog or of high-detail objects like hair is impractically huge, and as a mesh they're very difficult to represent.
Smart, high-dimensional interpolator.