The core idea behind this type of approach ("parametric encoding") is that you learn a scene as some spatial data + a (small) neural network. For example, a 128^3 grid of data values and a 10k parameter model. In the forward pass you feed whatever data is at the voxel(s) in question to the network, and the backward pass updates both the network and the same voxel(s).
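To make the "spatial data + small network" split concrete, here is a toy sketch, not the paper's setup: a 1D grid of 8 cells with one feature each, and a "network" that is just y = w * feature + b. The names (grid, net, train_step), the step-function target and the learning rate are all made up for illustration. The point is only that each step reads one cell, and the backward pass updates both the shared model and that same cell.

```python
import random

random.seed(0)

# Hypothetical toy scene: 8 grid cells, one learnable feature per cell.
grid = [random.uniform(-0.1, 0.1) for _ in range(8)]
# Hypothetical "network": a single affine map shared by all cells.
net = {"w": 0.5, "b": 0.0}
LR = 0.1

def target(x):
    # Made-up signal to fit, for x in [0, 1).
    return 1.0 if x >= 0.5 else -1.0

def cell_of(x):
    # Which grid cell (voxel) the position x falls into.
    return min(int(x * len(grid)), len(grid) - 1)

def predict(x):
    # Forward pass: feed the data stored at the cell to the model.
    return net["w"] * grid[cell_of(x)] + net["b"]

def train_step(x):
    cell = cell_of(x)
    feat = grid[cell]
    err = predict(x) - target(x)        # gradient of 0.5 * err^2 w.r.t. the prediction
    grad_w, grad_b, grad_feat = err * feat, err, err * net["w"]
    net["w"] -= LR * grad_w             # backward pass: update the network...
    net["b"] -= LR * grad_b
    grid[cell] -= LR * grad_feat        # ...and only the cell that was read

def mean_loss(n=64):
    xs = [(i + 0.5) / n for i in range(n)]
    return sum(0.5 * (predict(x) - target(x)) ** 2 for x in xs) / n

loss_before = mean_loss()
for _ in range(2000):
    train_step(random.random())
loss_after = mean_loss()
print(loss_before, loss_after)
```

In the real thing the grid is 3D, each cell holds a small feature vector, and the model is an MLP, but the division of labour is the same.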
The innovation in this paper is in how the spatial data is represented. Prior work includes dense grids, multi-resolution grids and octrees, to name a few, but all of them are either GPU-unfriendly or waste parameters on empty space. The authors figured that they can just hash the coordinates and use the result directly as an index into a data array (edit: a multi-resolution stack of data arrays - sorry for not getting this right initially), with hash collisions left to the network to sort out (I guess it figures out whether there's a collision on a fine layer from the info in the coarser ones).
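A rough sketch of the "hash the coordinates and use them as an index" part, for a single level only. The multiply-and-XOR hash and the prime constants are my assumption of what such a spatial hash looks like, and TABLE_SIZE, RES and the two-features-per-slot layout are hypothetical values picked to keep the demo small.

```python
# Assumed constants: per-axis primes for a multiply-and-XOR spatial hash.
PRIMES = (1, 2654435761, 805459861)

def hash_index(ix, iy, iz, table_size):
    """Map integer voxel coordinates to a slot in a fixed-size data array."""
    h = 0
    for coord, prime in zip((ix, iy, iz), PRIMES):
        h ^= coord * prime
    return h % table_size

TABLE_SIZE = 2 ** 12                              # hypothetical: 4096 slots
table = [[0.0, 0.0] for _ in range(TABLE_SIZE)]   # 2 learnable features per slot

def lookup(ix, iy, iz):
    # No collision handling at all: colliding voxels simply share a slot.
    return table[hash_index(ix, iy, iz, TABLE_SIZE)]

# A 32^3 grid has 32768 voxels but only 4096 slots, so collisions are unavoidable.
RES = 32
voxels_per_slot = {}
for ix in range(RES):
    for iy in range(RES):
        for iz in range(RES):
            slot = hash_index(ix, iy, iz, TABLE_SIZE)
            voxels_per_slot[slot] = voxels_per_slot.get(slot, 0) + 1

print(len(voxels_per_slot), "slots used by", RES ** 3, "voxels")
```

There is no probing or chaining here on purpose: a lookup is one multiply-XOR-modulo and one array read, which is what makes the structure GPU-friendly.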
(Relatively) few parameters + GPU-friendly data structure = fast training. Tempted to try and implement this myself...
Your comment made me realize that I forgot to mention the multi-resolution aspect of their hash encoding (there are several data arrays corresponding to different resolutions - coarse ones are 1:1 indexed but finer ones have hash collisions for the network to deal with). It's in the title, but I should still include it.
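Roughly how the multi-resolution part could look, as a sketch under my own assumptions: resolutions grow geometrically between a base and a maximum, each level gets its own array capped at TABLE_SIZE entries, levels that fit under the cap are indexed 1:1, and the rest fall back to a hash. The Level class, the constants and the nearest-corner lookup are all hypothetical simplifications (a real implementation would interpolate between the surrounding corners).

```python
PRIMES = (1, 2654435761, 805459861)   # assumed hash primes
N_FEATURES = 2                        # hypothetical features per table entry
TABLE_SIZE = 2 ** 14                  # hypothetical cap on entries per level

def level_resolutions(n_levels, base_res, max_res):
    """Geometrically spaced grid resolutions from base_res up to max_res."""
    growth = (max_res / base_res) ** (1.0 / (n_levels - 1))
    return [round(base_res * growth ** lvl) for lvl in range(n_levels)]

class Level:
    def __init__(self, res):
        self.res = res
        n_vertices = (res + 1) ** 3
        self.dense = n_vertices <= TABLE_SIZE      # coarse level: fits 1:1
        size = n_vertices if self.dense else TABLE_SIZE
        self.table = [[0.0] * N_FEATURES for _ in range(size)]

    def index(self, ix, iy, iz):
        if self.dense:
            stride = self.res + 1
            return ix + stride * (iy + stride * iz)   # plain 1:1 grid index
        h = 0
        for coord, prime in zip((ix, iy, iz), PRIMES):
            h ^= coord * prime
        return h % len(self.table)                    # hashed, may collide

def encode(levels, x, y, z):
    """Concatenate one feature vector per level for a point in [0, 1]^3."""
    feats = []
    for lvl in levels:
        # Simplification: use the lower grid corner instead of interpolating.
        ix, iy, iz = (min(int(v * lvl.res), lvl.res) for v in (x, y, z))
        feats.extend(lvl.table[lvl.index(ix, iy, iz)])
    return feats

levels = [Level(r) for r in level_resolutions(6, 16, 512)]
features = encode(levels, 0.3, 0.7, 0.1)
for lvl in levels:
    print(lvl.res, "dense" if lvl.dense else "hashed", len(lvl.table))
```

The concatenated features are what gets fed to the small network, so a collision at a fine level can in principle be told apart using the coarse-level features that come along with it.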
Every new deep learning paper that comes out, I'm disappointed that it needs...
- A $500-$1,000 GPU
- A huge proprietary NVidia driver
- Some odd language or language extensions, usually CUDA
- Python
I find it remarkable that most recent deep learning papers release the source code needed to reproduce their result -- and even more remarkable that many papers, like this one, can be reproduced on hardware that a hobbyist can afford.
And if you'd like this to run on a CPU, you're welcome to port it. The code is open source after all.
CPUs tend to have very few FPUs per core, so you max out a modern system's idealised CPU throughput at maybe 40-80 concurrent streams. On top of that, the FPUs on a CPU are generally required to perform fully compliant IEEE 754 arithmetic at a minimum of 32 bits of precision.
Modern GPUs can have that number of FPUs per hardware thread, and then have a few hundred of those hardware threads. Each of those GPU FPUs is also faster, as it can both elide some elements of IEEE 754 and operate at lower precision (fp16) to get even more performance.
So you could read the paper and implement it on a CPU, and the very best that you, or anyone, could do would be literally orders of magnitude slower than the GPU implementation.
That’s why you don’t see them doing it on a CPU, let alone in Python.
[0] https://pypi.org/project/tensorflow-directml/
[1] https://docs.microsoft.com/en-us/windows/ai/directml/gpu-ten...
Why everything's written in Python I couldn't tell you.
The days when you could run anything significant on a single $1k graphics card are long gone.
This is, ironically, the first time (that I’m aware of) that you could distill this NeRF stuff down into a size that runs on a single consumer GPU (RTX 2x or higher).
…so, some of your points are fair, but hey, at least these folk are trying to bring this down from “only usable by large corporations” to “runs on your desktop”.
I mean, it’s not perfect, but I think in this case you’re complaining about something abstract, when these folk are actually going in the right direction.
(I could not get the URL to load. Maybe HN hugged it)
Are there any commercial games currently doing this?
This is probably going to fight virtual geometry tech like Unreal's Nanite, which is still using triangles but using clever automated LoD and GPGPU rasterization so that rendering e.g. 20 million pixel-sized triangles is fast and looks just as good as rendering a trillion triangles. (normally very small or thin triangles are a pathological case for hardware rasterizers)