Gibberish has to map _somewhere_ in the model's concept space.
Whether it maps onto anything we'd recognise as consistent is another matter; as other people have noted, the gibberish breaks down when you move it into another context. But who's to say that DALL-E 2 isn't remaining consistent to some concept it understands that isn't immediately recognisable to us?
The interesting part is whether you can trick it into spitting out gibberish in targeted areas of that concept space using crafted queries.
I agree that this shows a focus on the appearance of the words rather than their meaning.
https://astralcodexten.substack.com/p/a-guide-to-asking-robo...
A counterpoint I'd raise is I wonder how aggressive Dall-E 2 is in making assumptions about words it hasn't seen before.
Hard to do, given that it's read essentially the entire internet; however, someone could make up some Latin-esque words whose meaning people would be able to guess.
If the model is as good as people at guessing the meaning of such made-up words, it stands to reason that, if it were aggressive enough about this, it might be doing the same thing with gibberish, ending up with its own interpretation of the word, which would land it back in a more targeted concept space.
I'd love to see someone craft some words that most people could guess the meaning of, and see how DALL-E 2 fares.
That the model would have a consistent form of some kind of gibberish would be a given. Even humans have it: https://en.wikipedia.org/wiki/Bouba/kiki_effect And I'm sure if you asked native English speakers, "Hey, we know this isn't a word, but if it was a word, what would it be? 'Apoploe vesrreaitars'" you would get something very far from a uniformly random distribution of all nameable concepts.
In hindsight, sure. Given enough time someone might have predicted the phenomenon. But I don't think most of us did.
What's more fascinating to me is how often this has happened in this space in just the last few years.
1. Some phenomenon is discovered
2. I'm surprised
3. It makes sense in hindsight
Why? It could just go to noise images, or vaguely real-looking objects that don't look like anything in particular.
The machine is always trying to associate words with other words that are semantically close together. E.g. strong_man, strng_man, and srong_man as inputs all mean the same thing, because that combination of letters is usually used with the word "man", and there is no competitor word other than "strong" to replace "srong".
Now, why that should be considered a secret language is beyond me. The input language for the machine is a natural human language, which is a very poorly defined language for the machine to recognize. That is always going to produce a lot of gibberish.
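The typo-robustness intuition above is easy to sketch with a toy fuzzy matcher (illustrative only; the real model works on learned token embeddings, not edit distance, but the effect is similar: "srong" has no plausible neighbour except "strong"):

```python
import difflib

# Toy vocabulary standing in for words the model knows (made up for
# illustration; a real system learns these associations from data).
vocab = ["strong", "string", "man", "bird", "cheese"]

def nearest_word(typo, cutoff=0.6):
    """Return the closest known word, or None if nothing is close enough."""
    matches = difflib.get_close_matches(typo, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(nearest_word("srong"))   # -> strong
print(nearest_word("qqqqq"))   # -> None (no plausible neighbour)
```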
Not really. It's a stochastic model, so after a bunch of random denoising steps it could easily just be mapping every bit of gibberish to a random image, with it being vanishingly unlikely for any of them to be similar or for the relationship to run in reverse.
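The "every input lands somewhere, but nowhere meaningful" possibility can be sketched with a toy hash-based stand-in (purely hypothetical; a real diffusion model samples from learned latents, but this captures the point that near-identical gibberish prompts could land in completely unrelated places):

```python
import hashlib
import random

def arbitrary_embedding(prompt, dim=8):
    """Deterministically map a prompt to a vector via a hash-seeded RNG.
    Every input lands somewhere, but neighbouring inputs land in
    unrelated places and the map is not meaningfully reversible."""
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# Two near-identical gibberish prompts land in unrelated regions:
a = arbitrary_embedding("Apoploe vesrreaitais")
b = arbitrary_embedding("Apoploe vesrreaitars")
```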
No, it doesn't. The model in use maps all input to some output, but that isn't a necessary feature of the problem at all. It's actually a terrible idea.
https://twitter.com/Thomas_Woodside/status/15317102510150819...
E.g. "Apoploe vesrreaitais" could refer to something along the lines of a "fan/wedge" or "wing-like".
If you look at the examples of cheese, compared with the "birds and cheese" ones, the cheese tends to be laid out in a fan-like pattern and shaped into sharp-angled wedges.
I'm unconvinced by the rebuttal as well, not to say I am convinced we have a fully formal language going on here, but there's definitely some shared concepts with the generated text.
I wonder what Imagen would come up with, or whether its 'language' is more correlated with real language.
"feathered" maybe?
A language should have syntax and meaning. We can see these phrases (tokens?) have meaning.
It is unclear what the syntax is. But DALL-E 2's idea of what the syntax of English is isn't how most people understand it either (as can be seen by how many rephrasing attempts people make to get what they want).
It's entirely possible (probable?) there is syntax here but we don't know it yet.
How many French people speak Breton?
Found this answer:
https://twitter.com/BarneyFlames/status/1531736708903051265?...
Serious question: what else do you think language is? How else would your brain associate the word "bird" with the concept?
This is an example of an application where uncertainty modelling would help greatly. Any and every input will lead to an output. That doesn’t mean that all regions of latent/embedding space are equally valid.
I’m in the camp that large/modern ML models are nearing human intelligence, in some aspects. What’s currently missing is the universal ability to estimate uncertainty and identify inputs that are out of distribution. Many groups are working on this and perhaps we already have the solution but are not combining the right uncertainty estimation approach with the right foundational model.
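A minimal sketch of the kind of out-of-distribution check described above (a toy z-score version with made-up data; real approaches use e.g. Mahalanobis distance, deep ensembles, or density models over the embedding space):

```python
import statistics

def ood_score(x, train_data):
    """Crude out-of-distribution score: the largest per-dimension z-score
    of x against the training set. A toy stand-in for proper uncertainty
    estimation."""
    score = 0.0
    for dim, xi in zip(zip(*train_data), x):
        mu = statistics.fmean(dim)
        sigma = statistics.stdev(dim) or 1e-9
        score = max(score, abs(xi - mu) / sigma)
    return score

# Hypothetical 2-D embeddings of in-distribution training inputs:
train = [(0.0, 1.0), (0.2, 0.9), (-0.1, 1.1), (0.1, 1.0)]
in_dist = ood_score((0.05, 1.0), train)    # near the training data
out_dist = ood_score((5.0, -3.0), train)   # far outside it
```

With a score like this, a generator could flag "Apoploe vesrreaitais" as far from anything it was trained on instead of silently producing birds.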
https://giannisdaras.github.io/publications/Discovering_the_...
[1] https://twitter.com/barneyflames/status/1531736708903051265?...
And for the record, they use BPE dropout for DALLE-1, see https://arxiv.org/pdf/2102.12092.pdf
> While the idea of AI agents inventing their own language may sound alarming/unexpected to people outside the field, it is a well-established sub-field of AI, with publications dating back decades.
> Simply put, agents in environments attempting to solve a task will often find unintuitive ways to maximize reward.
But I think it is an interesting discovery because I don’t think anyone could have predicted this.
One of my favorite examples is the classification model that will identify an apple with a sticker on it that says “pear” as a pear—it makes sense, but is still surprising when you first see it.
That classification model (CLIP) is the first stage of this image generator (DALLE) - and actually this shows that it doesn't think they're exactly the same thing, or at least that's not the full story, because DALL-E doesn't confuse the two.
However, other CLIP guided image generation models do like to start writing the prompt as text into the image if you push them too hard.
It'd be cool if this was true, but it looks like it mostly isn't.
I thought DALL-E's language model was tokenized, so it doesn't understand that, e.g., "car" is made up of the letters 'c', 'a', and 'r'.
So how could the generated pictures contain letters that form words that are tokenized into DALL-E's internal "language"? Shouldn't we expect that feeding those words to the model would give the same result as feeding it random invented words?
Actually, now that I think about it, how does DALL-E react when given words made of completely random letters?
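A toy greedy subword tokenizer (a much-simplified stand-in for BPE; the vocabulary below is invented for illustration, real merges are learned from data) shows how a nonsense word still decomposes into familiar pieces rather than failing:

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match subword tokenization."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Fall back to a single character if no longer piece matches.
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

vocab = {"car", "ap", "op", "loe", "vesr", "rea", "it", "ais"}
print(greedy_tokenize("car", vocab))      # -> ['car'] (a single known token)
print(greedy_tokenize("apoploe", vocab))  # -> ['ap', 'op', 'loe']
```

So random invented words don't hit an "unknown word" path; they just become unusual token sequences, which the model maps somewhere regardless.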
It's one thing if DALL-E 2 was trying to map words in the prompt to their letter sequences and failing because of BPEs; that shows an impressive amount of compositionality but it's still image-model territory. It's another if DALL-E 2 was trying to map the prompt to semantically meaningful content and then failing to finish converting that content to language because it's too small and diffusion is a poor fit for language generation. That makes for worse images but it says terrifying things about how much DALL-E 2 has understood the semantic structure of dialog in images, and how this is likely to change with scale. Normally I'd expect the physical representation to precede semantic understanding, not follow it!
That said I reiterate that a degree of skepticism seems warranted at this point.
https://en.wikipedia.org/wiki/Simlish
https://web.archive.org/web/20040722043906/http://thesims.ea...
https://web.archive.org/web/20121102012431/http://bbs.thesim...
The reason "Apoploe vesrreaitais" is detected as Greek is that the first "word" is phonetically similar to the word απόπλους, which means sailing/shipping and is rooted in ancient Greek. If we were to write απόπλους using Roman characters, we would write "apoplous" or "apoploi" (plural; in Greek, αποπλοΐ). So I think the model understands that the "oe" suffix is used to represent the Greek suffix "οι" used for plurals. The rest of the word is phonetically rather close, so there is some model that maps phonetic representations to the correct word.
The other phrase seems to be combined of words classified as Portuguese, Spanish, Lithuanian, and Luxembourgish.
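The phonetic-closeness claim can be loosely sanity-checked with a plain surface-similarity measure (a crude proxy; the model presumably works on subword statistics rather than edit distance):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Simple surface-form similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

close = similarity("apoploe", "apoplous")  # vs. the Greek romanisation
far = similarity("apoploe", "bird")        # vs. an unrelated word
```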
My hypothesis here is more that these models are trained more on Western languages than others, and thus the latent representation of "language" is going to look like Latin gibberish, due to a combination of how those languages evolved and human bias. ("It's all Greek to me")
"My first reaction to this was, "It probably has to do with tokenization. If there's a 'language' buried in here, its native alphabet is GPT-3 tokens, and the text we see is a concatenation of how it thinks those tokens map to Unicode text." Most randomly concatenated pairs of tokens simply do not occur in any training text, because their translation to Unicode doesn't correspond to any real word. There are also combinations that do correspond to real words ("pres" + "ident" + "ial") but still never occur in training because some other tokenization is preferred to represent the same string ("president" + "ial").
Maybe DALL-E 2 is assigning some sort of isolated (as in, no bound morphemes) meaning to tokens — e.g., combinations of letters that are statistically likely to mean "bird" in some language when more letters are revealed. When a group of such tokens are combined, you get a word that's more "birdlike" than the word "bird" could ever be, because it's composed exclusively of tokens that mean "bird": tokens that, unlike "bird" itself, never describe non-birds (e.g., a Pontiac Firebird). The exact tokens it uses to achieve this aren't directly accessible to us, because all we get is poorly rendered roman text."
I wonder if this is why the term for "bird" seemed to be in faux binomial nomenclature, the scientific names for animals. I assume that in the training set there were images of birds/insects with their scientific name. An image labeled with the scientific name would always be an image of an animal, unlike images with the word bird in them which could be of a birdhouse, Pontiac Firebird, or someone playing golf. That would mean that in the latent space when DALLE wants to represent a bird as accurately as possible, it will use the scientific name, or a gibberish/tokenized version of the scientific name-- like someone trying to make up a name that sounds regal might say "Sir Reginard Swellington III". Even though it's not a real name it encodes into the latent space of royal-sounding names.
I wonder if this could be extended to other things with very specific naming conventions. For example aircraft names: "Gruoeing B-26 Froovet" might encode into military aircraft latent space.
Seems like a useful enhancement would be to invert the text and image prior stages, so it'd be able to explain what it thinks your prompt meant along with making images of it.
[1] https://astralcodexten.substack.com/p/a-guide-to-asking-robo...
i.e. anything can be completely described in a more succinct manner than any current spoken language.
Or maybe some kind of universal language that naturally occurs, which any semi-intelligent life can understand.
Fun stuff!
However, optimality of encoding is entirely relative to the decoding scheme used and your purposes. Obviously a matrix of numbers representing a summary of a paragraph can be in some sense "more compressed" than the English equivalent, but it's useless if you don't speak matrices. Similarly, you could invent an encoding scheme with Latin characters that is more compressed than English, but it's again useless if you don't know it or want to take the time to learn it. If we wanted we could make English more regular and easier to learn/compress, but we don't, for a whole bunch of practical/real life reasons. There's no free lunch in information theory. You always have to keep the decoder/reader in mind.
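The "no free lunch" point is easy to demonstrate with a general-purpose coder: redundant English text shrinks dramatically, while structure-free noise barely shrinks at all (toy demo with made-up data):

```python
import random
import zlib

rng = random.Random(0)
english = b"on the contrary this rugged mountain range trails off " * 20
noise = bytes(rng.randrange(256) for _ in range(len(english)))

compressed_en = zlib.compress(english)
compressed_noise = zlib.compress(noise)
# Redundant text compresses to a small fraction of its size;
# incompressible noise stays about as large (it can even grow
# by a few header bytes).
```

The catch is exactly the one above: the compact form is only "better" if both sides share the decoder.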
Meaningful phrases or sentences can usually be expressed in Ithkuil with fewer linguistic units than natural languages.[2] For example, the two-word Ithkuil sentence "Tram-mļöi hhâsmařpţuktôx" can be translated into English as "On the contrary, I think it may turn out that this rugged mountain range trails off at some point."[2]
All human languages are about the same efficiency when spoken, but of course this mainly depends on having short enough words for the most common concepts in the specific thing you’re talking about.
https://www.science.org/content/article/human-speech-may-hav...
And there can’t be a universal language because the symbols (words) used are completely arbitrary even if the grammar has universal concepts.
I’ve been wondering if there is a way to do psychological experiments on these large language models that we couldn’t do with a person.
This one melts my brain a bit, I’m not going to lie. Whales talking about food, with subtitles. “Translate” the subtitles and you get food that whales would actually eat.
How does getting access work, do you need a referral?
This may act as a counter balance to the trends of the last few years of all major research becoming concentrated in a few tech companies.
Conclusion: the gibberish is the expression for birds eating things in DALL-E's secret language.
But, wait. Why is the same gibberish in the first image, that has the two men and the cabbages(?), but no birds?
Explanation: the two men are clearly talking about birds:
>> We then feed the words: "Apoploe vesrreaitars" and we get birds. It seems that the farmers are talking about birds, messing with their vegetables!
With apologies to my two compatriots, but that is circular thinking to make my head spin. I'm reminded of nothing so much as the scene in Monty Python and the Holy Grail where the wise Sir Bedivere explains why witches are made of wood:
All the cool images that DALL-E spits out are fun to look at, but this sort of thing is an even more interesting experiment in my book. I've been patiently sitting on the waitlist for access, but I can't wait to play around with it.
It will be fun to see people experimenting with extracting text prompts from generated images. I'd try something like "An open children book about animals" or "Random thought written on a paper". Maybe do a feedback loop of extracted prompts :)
I think there will be multiple words for the same thing. Also, unlike 'bird', the word 'Apoploe vesrreaitais' might actually mean a specific kind of bird in a specific setting.
No, DALL-E doesn’t have a secret language - https://news.ycombinator.com/item?id=31587316 - June 2022 (7 comments)