0 points · mjburgess · 2y ago

Alas, no, it doesn't. Language induces this sort of anthropomorphism in people, I guess, so consider images instead.

Suppose I take a billion images of all the coffee cups in the world, at a fixed set of angles on the cup, and then build an associative (i.e., frequency) statistical model of their pixels (i.e., statistical AI). Consider generating one pixel at a time, in sequence, through the image. My associative model tells me P(colour of next pixel | all previous pixels).
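
To make the setup above concrete, here is a minimal sketch of such an associative model in Python. It is an illustration under assumptions, not anything taken from a real image generator: images are flat lists of pixel values, the names `fit` and `sample` are made up for this example, and the context is truncated to the last `k` pixels instead of "all previous" so the count table stays small.

```python
import random
from collections import Counter, defaultdict


def fit(images, k=2):
    """Count how often each pixel value follows each context of k previous pixels."""
    counts = defaultdict(Counter)
    for img in images:
        # Pad with None so the first pixels also have a k-length context.
        padded = [None] * k + list(img)
        for i in range(k, len(padded)):
            context = tuple(padded[i - k:i])
            counts[context][padded[i]] += 1
    return counts


def sample(counts, length, k=2, rng=None):
    """Generate pixels one at a time from the counted frequencies."""
    rng = rng or random.Random(0)
    out = [None] * k
    for _ in range(length):
        options = counts.get(tuple(out[-k:]))
        if not options:
            break  # this context never occurred in the training images
        values, weights = zip(*options.items())
        out.append(rng.choices(values, weights=weights)[0])
    return out[k:]


# Two tiny 1-D "images" stand in for the billion coffee cup photos.
training = [[0, 0, 1, 1], [0, 1, 1, 0]]
model = fit(training)
generated = sample(model, length=4)
```

With this tiny training set, every generated sequence is stitched together from contexts that were actually counted, which is the behaviour the argument relies on. The toy says nothing either way about what large neural generators learn.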

Now, I can generate coffee cup images similar to any variation or combination of the images in the dataset. You might say, "well, you can only do that if you have a model of a coffee cup" (rather than of pixels). If so, just generate a coffee cup at one of the angles not in the dataset. This will not happen, because the model has not been provided with enough information to do so.

Namely, the model does not know the distance from the camera, the camera lens parameters, the angle to the coffee cup, etc. So there's literally a very, very large infinity of possible objects at unseen angles. Consider that underneath a coffee cup, the bottom might be missing entirely, etc.

Now it will appear to know all of these things, because it's just generating images with these same parameters (camera, angle, distance, etc.). But as soon as you want "a coffee further away than has been seen before", or "a coffee using a macro lens", etc., the whole thing will fall over.

It is you, the viewer, who attributes 3D knowledge to the model, because under ordinary circumstances the cause of a photo is the features of a 3D environment.

9 comments · 2 top-level

jameshart · 2y ago · 5 in thread

You’re saying this with confidence as if there isn’t a large body of working image and video generation algorithms out there that can produce physically plausible images of objects transposed into circumstances that don’t exist in their training set. A coffee using a macro lens for example.

Is it so hard to believe that such models have developed a sense for how light propagates through a scene, a sense for how physical objects change when viewed from different angles, a sense for how lens distortion interacts with light? For goodness’ sake, these same models have a sense of what Greg Rutkowski’s art style is - we are well beyond ‘they’re just remembering pixels from past coffeecups’

mjburgess (OP) · 2y ago

> Is it so hard to believe that such models have developed a sense for how light propagates

Well, it's not a matter of belief or otherwise. I'm a trained practitioner in statistics, AI, physics, and other areas, and you can show trivially that you cannot learn light physics from pixel distributions.

Pixel distributions aren't stationary, and are caused by a very, very large number of factors; likewise the physics of light for any given situation is subject to a large number of causes, all of them entirely absent from the pixel distributions. This is a pretty trivial thing to show.

> have a sense of what Greg Rutkowski’s art style is

Well what these models show is that when you have PBs of image data and TBs of associated text data, you can relate words and images together usefully. In particular, you can use patterns of text tokens to sample from image distributions, and combine and vary these samples to produce novel images.

The patterns in text and images are caused by people speaking, taking photos, etc. Those patterns necessarily obtain in any generated output. As in, if you train an LLM/etc. on how to speak, using vast amounts of conversational data, it cannot do anything other than appear to speak: that is the only thing the data distribution makes possible.

Likewise here, the image generator has a compressed representation of PBs of pixel data which can be sampled from using text. So when you say, "Greg Rutkowski" you select for a highly structured image space, whose structure the original artists placed there.

The generative model itself is not imparting structure to the data, and it isn't aware of style; it's sampling from structure that we placed there. When we did so, it was because we were, e.g., in the room taking a photo, or imagining what it would be like to apply Pre-Raphaelite painting styles to 60s psychedelic colour palettes because we sensed that the fashions of a century ago would now be regarded as cool.

TeMPOraL · 2y ago

The point of shoving so much data at those models is to help them pick up on the "very very large number of factors".

There was a story I saw on HN a few times in the past, but which I can't find anymore, of someone training a simple, dumb neural net to predict a product (or a sum?) of two numbers, and discovering to their surprise that, under optimization pressure, the network eventually picked up the Fourier transform.

It doesn't seem outside the realm of possibility for a large model to pick up on light propagation physics and the basic 3D structure of our reality just from watching enough images. After all, the information is implicitly encoded there, and you can handwave a Bayesian argument that it should be extractable.

infecto · 2y ago

Genuine question, what does it mean to be a trained practitioner in statistics, AI, physics and other areas?

gizmo · 2y ago

Humans painted with wonky perspective and impossible shadows, because they didn't know better, for literally 50,000 years. And those humans were just as smart as we are. Just look at 13th century paintings. Does this prove that humans back then didn't understand what a coffee cup looks like when rotated? No. So what does this prove about Midjourney? Nothing.

croon · 2y ago

> Is it so hard to believe that such models have developed a sense for how light propagates through a scene...

This specifically is the thing I usually notice in AI images (outside of the hand trope).

I'm not GP, and at best a layman in the field. It's not hard to believe it's possible to generate believable lighting given enough training data, but if I'm not mistaken it would be through sheer volume of associations like "lighting/shadow here usually follows item here".

But it's extremely inefficient, and not like we reason. It's like learning the multiplication table without understanding math. Just pairing an infinite amount of properties with each other.

We on the other hand develop a grasp of where lighting exists (sun/lamp) and surmise where shadows fall and can muster any image in our mind using that model instead.

jddj · 2y ago · 2 in thread

Is that really true?

I can go to a Hugging Face space right now and type in "koala wearing a suit serving coffee at a republican rally" and there's a reasonable chance I get a result that's something along those lines. Is that meaningfully different to "coffee using a macro lens"?

mjburgess (OP) · 2y ago

Those models were not trained on the restricted dataset I'm talking about.

I'm saying you deliberately construct a dataset which, say, does not include cups at various distances, angles, etc. but has as many as you like at a fixed range of these parameters (lens, distance, lighting, angle...).

Now, you will get, from this model, just coffee cup images with these same parameters (eg., distance from the camera).

Real-world generative systems are deliberately not constrained this way, and require many many PBs of images under various conditions to overcome this problem.

Nevertheless, you can actually still see this limitation: most generated photos etc. show subjects in "photographic distance/focus/etc. conditions", i.e., it's hard to get a photo of a person who isn't framed as if they were the subject of a photo.

Whereas, if you were in a room with a friend, you could take a photo at any angle/distance, even, say, from the top of their ears down. You will not get this freedom with a statistical model of pixel patterns.

jddj · 2y ago

I can't argue with that, so I think unfortunately I may have missed the original point.

The sun revolved around the earth for a long time until our own model was updated to include more data.
