Heck, it is far simpler than general video, because the point of view and frame are fixed.
Further, the paper says "a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions". Note specifically "and actions".
User input is fed into the system, and subsequent frames take it into account. The user is "actually" firing a gun.
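To make the conditioning concrete, here is a minimal sketch of that idea. This is not the paper's architecture; the tiny ConvNet, the tensor shapes, the action space size, and the crude noising step are all illustrative assumptions. The point it shows is just that the denoiser receives past frames AND the player's actions as inputs, so it learns p(next frame | past frames, actions) rather than unconditional video.

```python
# Toy sketch of action-conditioned next-frame diffusion training.
# All names, shapes, and the network are assumptions for illustration.
import torch
import torch.nn as nn

NUM_ACTIONS = 8      # assumed size of the discrete action space
CONTEXT = 4          # number of past frames used as conditioning
C, H, W = 3, 64, 64  # assumed frame shape

class ActionConditionedDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.action_emb = nn.Embedding(NUM_ACTIONS, 32)
        # Input: noisy next frame + CONTEXT past frames stacked on channels,
        # plus a spatially broadcast action embedding.
        self.net = nn.Sequential(
            nn.Conv2d(C * (CONTEXT + 1) + 32, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, C, 3, padding=1),
        )

    def forward(self, noisy_next, past_frames, actions):
        b = noisy_next.shape[0]
        # past_frames: (B, CONTEXT, C, H, W) -> (B, CONTEXT*C, H, W)
        ctx = past_frames.reshape(b, CONTEXT * C, H, W)
        # Sum embeddings of the recent actions, broadcast over the image.
        act = self.action_emb(actions).sum(dim=1)        # (B, 32)
        act = act[:, :, None, None].expand(b, 32, H, W)
        x = torch.cat([noisy_next, ctx, act], dim=1)
        return self.net(x)  # predicted noise

# One training step: the target is the noise added to the REAL next frame,
# so the model is supervised to invert the noising given frames + actions.
model = ActionConditionedDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

past = torch.randn(2, CONTEXT, C, H, W)            # placeholder gameplay frames
actions = torch.randint(0, NUM_ACTIONS, (2, CONTEXT))
next_frame = torch.randn(2, C, H, W)
noise = torch.randn_like(next_frame)
noisy = next_frame + noise   # crude noising; real schedules scale both terms
pred = model(noisy, past, actions)
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
opt.step()
```

At inference you would run the sampler one frame at a time, feeding back the generated frame and the player's latest input, which is why a different action (e.g. "fire") produces a different next frame.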
There cannot be "video compression artifacts" because the model hasn't even seen any compressed video during training, as far as I can tell.
Seriously, how is this even a discussion? The article is clear that the novel thing is real-time frame generation conditioned on the previous frame(s) AND player actions. Just generating video would be nothing new.
I highly suggest you at least skim the paper before commenting on the topic. The whole point is that it's not just generating a video.