Purely feed-forward computation may not be sufficient to approximately learn a "real" graphics engine with support for basic physics. A more natural way to learn graphics and physics might be to model the temporal structure explicitly. Alternatively, it might be interesting to simply add temporal convolution-deconvolution structure to the existing model. This is still work in progress.
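To make the second option concrete, here is a minimal NumPy sketch of what a temporal convolution-deconvolution pair could look like when applied along the time axis of a sequence of per-frame feature vectors. It is only an illustration, not the design used in this work: the function names `temporal_conv` and `temporal_deconv`, the tensor shapes, the stride, and the random weights are all assumptions chosen for the example.

```python
import numpy as np


def temporal_conv(x, w, stride=2):
    """Strided 'valid' convolution along time.

    x: (T, C) sequence of per-frame features (hypothetical layout).
    w: (K, C, H) kernel spanning K time steps.
    Returns an array of shape ((T - K) // stride + 1, H).
    """
    T, _ = x.shape
    K, _, H = w.shape
    t_out = (T - K) // stride + 1
    out = np.zeros((t_out, H))
    for t in range(t_out):
        window = x[t * stride : t * stride + K]  # (K, C) slice of time
        out[t] = np.einsum("kc,kch->h", window, w)
    return out


def temporal_deconv(z, w, stride=2):
    """Transposed convolution along time (upsamples the sequence).

    z: (T_in, H) temporally compressed code.
    w: (K, C, H) kernel; each code step is spread over K output steps.
    Returns an array of shape ((T_in - 1) * stride + K, C).
    """
    T_in, _ = z.shape
    K, C, _ = w.shape
    out = np.zeros(((T_in - 1) * stride + K, C))
    for t in range(T_in):
        out[t * stride : t * stride + K] += np.einsum("h,kch->kc", z[t], w)
    return out


rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16))          # 8 time steps, 16-dim frame codes
w_enc = rng.standard_normal((4, 16, 32)) * 0.1  # assumed encoder kernel
w_dec = rng.standard_normal((4, 16, 32)) * 0.1  # assumed decoder kernel

code = np.tanh(temporal_conv(frames, w_enc))   # compress along time
recon = temporal_deconv(code, w_dec)           # expand back to 8 steps
print(code.shape, recon.shape)                 # (3, 32) (8, 16)
```

In this sketch the convolution compresses 8 time steps into 3, and the transposed convolution expands them back to 8, so such a pair could in principle wrap the per-frame representation of an existing encoder-decoder without changing its spatial layers. The weights here are random; in practice they would be trained jointly with the rest of the model.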