- looking at your hands
- looking at clocks
- trying to read
It’s funny that diffusion models often make those exact same mistakes. There’s clearly a similar failure mode where both are drawing from a distribution and losing fine details. Has this been studied?
I don't know that it's necessarily the case that there's a strong relationship between the way these models work and the way human brains, particularly dreaming, work.
The similarity of things observed in dreams to AI is then because both procedures involve constructing coherence out of noise. "Gradient descent" or something, I wouldn't really know about that. Pareidolia.
This seems more like a post-hoc rationalization than a theory. If you can't trust memories of dreams, how can you even know you dream at all? What do you even base your assumptions on?
I theorize that our experience of life is more garbled than we realize, and that memories of life are actually post-hoc reconstructions formed at the point of recall.
to be fair, I don't recall that from generative images either. it's either 11 or 12 type of situations
If you recognize you're actually dreaming, though, is this a lucid dream?
I've only had a couple lucid dreams in my life, and I thought they were some of the most awesome things I've ever experienced. Real life holodeck!
I don't know, but my absolute favorite thing to do when I recognize it's a dream (and I've "trained" myself to do that) is to say: "Nice, this is a dream, so I can fly!". It's an awesome, awesome, awesome feeling. Usually doesn't last long but it's a cool thing to do.
While most of the time I recognize that I'm dreaming, only rarely can I actually control the outcome, and even more rarely do I gain admin control to change anything. I think those are what people have in mind when talking about lucid dreams.
Also, if I start doing weird stuff or make it obvious that I'm aware of my status, the NPCs in my dream become apathetic and plainly ask me to stop playing and just wake up.
OK, that sounds like nightmare material now that I wrote it, but it's not scary at all; they sound more annoyed than anything else.
There’s an anecdote about blind men whose sight was restored. They were adult men, who had felt cubes and heard about cubes, and could describe a cube. After their sight was restored, they were shown a cube and a sphere and were asked to identify them by sight. They were unable to, having never seen these objects before.
Many people (including very smart people) make the mistake of equating all forms of intelligence. They assume that computer programs have an intelligence level, and should be able to handle all tasks below that intelligence level, but machine learning models break our intuition for this. A model which has been trained on stock market data and is extremely intelligent in this area may be able to predict the stock market tomorrow. But if it has not been trained on words then it is no more able to write a sentence than a newborn baby. ChatGPT can eloquently generate words but it is completely unable to generate or understand pictures. (Ask ChatGPT to generate some ASCII-art.) Eventually OpenAI will create a sophisticated multi-modal model capable of generating poems or reading words in an image or predicting the stock market, but this model will be completely unable to answer questions about the physical world, because it’s only been trained on words and images.
Ok. I did both things.
I took a photo of my feet up on a stool in my living room, and told ChatGPT to describe it.
It was reasonably (and rather surprisingly) successful.
I also told it to generate an ASCII image of a car. It did that, too.
Feed original (not copy-pasted from the web) ASCII art of a foot into GPT-4 and I'd be very impressed if it can tell you it's a foot.
I'm actually mildly impressed it could generate ASCII art of a car, because that's a lot better than I've been able to get out of it (albeit on gpt3.5). Try anything more complex and I believe you'll see its limitations.
Think about how we chunk words[1] and recognize them. We have whole-word (shape) recognition, morpheme recognition, and spelling (letter-by-letter chunking). Text models receive tokens (akin to morpheme chunks) and don't have access to the underlying letters (spelling data) unless that was part of their training. For the most part, individual letters, something I think we can agree is necessary for rendering text, are not accessible.
An appropriate analogy is an illiterate artist: someone who can hear chunks of words and recognizes them verbally is asked to do their best job at painting text. They can piece together letter clusters based on inference, but they cannot spell.
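To make the chunking point concrete, here is a minimal sketch contrasting a subword view of a prompt with a character-level view. The vocabulary is made up for illustration and the greedy longest-match rule is a simplification, not the tokenizer of any real model. The point is only that the subword IDs carry no information about which letters are inside each chunk.

```python
# Toy illustration only: this vocabulary is hypothetical, not from any real model.
TOY_VOCAB = {"hack": 0, "er": 1, "news": 2, " ": 3, "<unk>": 4}


def subword_ids(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of text into IDs from a fixed vocabulary."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece starting at position i first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # Nothing matched: emit the unknown ID and skip one character.
            ids.append(vocab["<unk>"])
            i += 1
    return ids


def char_ids(text):
    """Character-level view: one ID per character (here, its Unicode code point)."""
    return [ord(c) for c in text]


print(subword_ids("hacker news"))  # [0, 1, 3, 2]
print(char_ids("hacker"))          # [104, 97, 99, 107, 101, 114]
```

A model that only ever sees `[0, 1, 3, 2]` has to infer the spelling of "hack" from context, while the character-level sequence hands it the letters directly.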
This was noted in the DALL-E 2 paper (https://arxiv.org/pdf/2204.06125.pdf#page=16&org=openai), and it can be experimentally established by swapping out even a very large LLM like PaLM for a humble, small, weak, but not badly-tokenized ByT5 and noting instant solution of the 'problem' (https://arxiv.org/abs/2212.10562#google). Skip to the appendix of the second paper if you have any doubts about the difference that switching to ByT5 makes in terms of spelling. The solution is just scaling up the LLM models (which is necessary to get better instruction-following and image quality in general, quite aside from spelling inside images) and eventually switching to character tokenization.* See, as always, https://gwern.net/gpt-3#bpes
(Hands and cats, however, are just genuinely difficult and require biting the bullet of scaling. And I wonder if it will take video supervision to truly solve them?)
* on a recent-news note, I suspect Claude-3 may have done something interesting with tokenization - possibly but not necessarily switching to character/byte encodings - and this is part of why it confabulates in ways unusual for ChatGPT but also is a lot more pleasant to use.
Or tattoos with Chinese/Japanese characters. More often than not they use the "wrong" character (even if it is technically correct) and the calligraphy is considered not artistic by native speakers.
The tattoo artist knows roughly what a lettershape is, but has no idea how to write it.
Leads to hilarious tattoos
Heh, this gives me the idea of training an image generator not on tokens but on rendered text of the prompt in some 8x8 font, that would be a fun experiment ;)
I'd also be curious how text rendering performance changes if the tokenizer could be made aware of quoted strings and instead tokenizes the contents as characters instead. Surely someone has tried this, (right?) but I haven't seen it in the literature.
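For what it's worth, the quoted-string idea could be prototyped with something like the sketch below. Everything here is an assumption for illustration: whitespace-split words stand in for ordinary subword tokens, and the `<quote>` / `</quote>` markers are hypothetical names, not part of any existing tokenizer.

```python
import re


def quote_aware_tokens(prompt):
    """Keep words outside double quotes whole; split quoted text into characters."""
    tokens = []
    # With a capture group, re.split keeps the quoted contents at odd indices.
    parts = re.split(r'"([^"]*)"', prompt)
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Inside quotes: emit one token per character, wrapped in markers.
            tokens.append("<quote>")
            tokens.extend(part)
            tokens.append("</quote>")
        else:
            # Outside quotes: whitespace words stand in for normal subword tokens.
            tokens.extend(part.split())
    return tokens


print(quote_aware_tokens('a banner that says "Hi!"'))
# ['a', 'banner', 'that', 'says', '<quote>', 'H', 'i', '!', '</quote>']
```

An unbalanced quote simply falls through to the word path, so a real version would need a policy for that case.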
text, like hands, belongs to a class of imagery satisfying two characteristics: 1) They are intricately structured, having many subcomponents which have precise spatial interrelationships over a range of scales; there are a lot of ways to make things that are like text/hands except wrong 2) The average person is intimately familiar with said structures, having spent thousands of hours looking at them while performing complex tasks involving a visuospatial feedback loop.
image generation models tend to have trouble with (1), but people only tend to notice it when paired with (2).
(1) can be improved by scale and more balanced training data; consider that for a person, their own hands are very frequently in their own field of view, but the photos they take only rarely feature hands as the focus. this creates a differential bias.
as for (2), image models tend to generate all kinds of implausibilities that the average person doesn't notice. try generating a complex landscape and ask a geologist how it formed.
Shouldn't this apply even more strongly to faces versus hands? AI seems to have a significantly easier time with those.
And, actually, faces still, especially outside of closeups of just the face, can be a problem, too, which is why a separate face restoration with a GAN or inpainting pass for faces with the same or different diffusion model is common.
in other words.. what have you actually tried? be specific.
Yes, this is the 'miracle of spelling' (as https://arxiv.org/abs/2212.10562#google calls it): for many words, larger models can manage to deduce the spelling somehow despite the tokenization. It may even fool you into thinking it understands spelling in general. But if you ask DALL-E 3 to generate a random string of ASCII, you'll quickly discover the limits to the 'miracle'.
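If anyone wants to try the random-string test, a small helper like this generates reproducible targets. The prompt wording and the choice of uppercase letters plus digits are my own assumptions, and the image model call itself is not shown; you would paste the prompt into whatever generator you are testing and compare the rendered text against the target by eye.

```python
import random
import string


def random_ascii_prompt(length=8, seed=None):
    """Return (target, prompt) where target is a random string with no dictionary spelling."""
    rng = random.Random(seed)  # a fixed seed makes the test repeatable
    alphabet = string.ascii_uppercase + string.digits
    target = "".join(rng.choice(alphabet) for _ in range(length))
    # Hypothetical prompt template; adjust the wording for the model being tested.
    prompt = f'A plain white sign with the text "{target}" in black block letters'
    return target, prompt


target, prompt = random_ascii_prompt(length=8, seed=0)
print(prompt)
```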
Generating a cat that looks LIKE a cat is fine because there are differences between cats.
The problem is that you can't make something that looks LIKE a letter K, it needs to satisfy the rules of K and can't just look LIKE a K and not some made up character.
They're LIKE generators and have trouble with the bits that need to be exact.
there are several fonts on the market that would profoundly disagree with this.
Newer models like Cascade or SD 3 are using multimodal LLMs to caption images, including text. DALL-E was at the forefront because they had access to GPT-4 Vision before everyone else. You will see that all new models will be able to spell. The problems we see are still mostly because of GIGO (garbage in, garbage out).
As a test I just tried ChatGPT with the prompt :-
Hi ChatGPT can you give me a picture of a banner that says "Hacker news"
And the resultant image does indeed have that text on it. Where I've seen this approach fall down is where the text is long and/or complex, or the words are uncommon.
so while there's some way to go, things are definitely improving here.
This could be entirely wrong however.
It would be interesting to see what would happen on a dataset with nothing but text.
My suggestion is to use Image to Image, start with the text of your son's name, and give it some gaussian noise background, and then paint out the parts you want to keep.
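A rough sketch of preparing that starting image, assuming a tiny made-up 5x5 bitmap font (in practice you would render the name with a real font in an image editor or imaging library). It returns a grayscale grid with the letters at full intensity over a Gaussian-noise background, plus a mask marking the text pixels to protect. Feeding these into an Image to Image or inpainting tool is left out, since that depends on the tool.

```python
import random

# Hypothetical 5x5 glyphs for illustration; '#' marks a text pixel.
GLYPHS = {
    "H": ["#...#", "#...#", "#####", "#...#", "#...#"],
    "I": ["#####", "..#..", "..#..", "..#..", "#####"],
}


def make_init_image(text, noise_std=0.2, seed=0):
    """Return (image, mask): text pixels are 1.0, the rest is clamped Gaussian noise."""
    rng = random.Random(seed)

    def background():
        # Noise around dark gray, clamped below the text intensity.
        return min(0.8, max(0.0, 0.3 + rng.gauss(0.0, noise_std)))

    image, mask = [], []
    for r in range(5):
        img_row, mask_row = [], []
        for ch in text:
            for cell in GLYPHS[ch][r]:
                is_text = cell == "#"
                img_row.append(1.0 if is_text else background())
                mask_row.append(is_text)
            # One noisy spacing column after each glyph.
            img_row.append(background())
            mask_row.append(False)
        image.append(img_row)
        mask.append(mask_row)
    return image, mask


image, mask = make_init_image("HI")
print(len(image), len(image[0]))  # 5 12
```

The mask is what you would use to "paint out" the letters so the model only repaints the background around them.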
https://medium.com/community-driven-ai/midjourney-can-spell-...
If an AI that could draw was also able to write, that would be artificial general intelligence. And pretty much everyone seems to agree we don't have that yet.