Hi, one of the authors here.
The thing to bear in mind when reading the dialogue examples in figure 11 is the custom prompt shown in Appendix D:
```
This is a conversation between a human, User, and an intelligent visual
AI, Flamingo. User sends images, and Flamingo describes them.
User: <a cat image>
Flamingo: That is a cat. It’s a tiny kitten with really cute big ears.
User: <a dinner image>
Flamingo: This is a picture of a group of people having dinner. They
are having a great time!
User: Can you guess what are they celebrating?
Flamingo: They might be celebrating the end of a successful project or
maybe a birthday?
User: <a graph image>
Flamingo: This is a graph, it looks like a cumulative density function
graph.
```
My personal opinion: once you're doing next-token prediction with this description of what Flamingo "is" sitting in the history, "I am not affected by this difference" is a pretty reasonable completion rather than a lucky guess. It was definitely exciting for the team that the whole example worked so nicely, but if you set the visual side aside, this "illusion of an unbelievable capability" has been observed in other work as well.
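To make the conditioning concrete, here is a minimal sketch of how such a dialogue prompt gets assembled for next-token prediction. Everything here (`IMAGE_TOKEN`, `build_prompt`, the history format) is illustrative, not the paper's actual code or API; the point is only that the model's reply is a completion of a text that already asserts what Flamingo "is".

```python
# Hypothetical sketch: assembling a Flamingo-style dialogue prompt.
# IMAGE_TOKEN and build_prompt are made-up names for illustration.

IMAGE_TOKEN = "<image>"

PREAMBLE = (
    "This is a conversation between a human, User, and an intelligent visual "
    "AI, Flamingo. User sends images, and Flamingo describes them."
)

def build_prompt(history, user_turn, user_sends_image=False):
    """Concatenate the persona preamble, the few-shot dialogue history, and
    the new user turn, ending with 'Flamingo:' so that whatever the language
    model generates next is read as Flamingo's reply."""
    lines = [PREAMBLE]
    for speaker, text in history:
        lines.append(f"{speaker}: {text}")
    if user_sends_image:
        lines.append(f"User: {IMAGE_TOKEN} {user_turn}")
    else:
        lines.append(f"User: {user_turn}")
    lines.append("Flamingo:")  # the model completes from here
    return "\n".join(lines)

prompt = build_prompt(
    history=[("User", f"{IMAGE_TOKEN}"),
             ("Flamingo", "That is a cat.")],
    user_turn="Can you guess what they are celebrating?",
)
```

So any claim in the preamble or few-shot turns (e.g. that Flamingo is an AI, not a human) is literally part of the conditioning context for every reply.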