It's a super cool paper that invents "vokenization" to generate large visually-grounded language datasets and trains visually-supervised language models on them.
Most language models are trained on pure text data. Although this approach has achieved significant success in recent years, it is not how humans acquire a language. It raises an interesting question: "Can language models achieve a high level of language understanding by reading text alone?" The answer is probably "no".
To push the boundary of language models, adding other learning signals to the training process is key. The first one that comes to my mind is vision (visual cues). However, existing visually-grounded datasets are an order of magnitude smaller than pure-text ones. This paper proposes the "vokenization" method to overcome this problem, and uses the data it generates to train visually-supervised language models.
More importantly, the paper reports that the visually-supervised models show consistent improvements over text-only models on pure-language tasks such as GLUE, SQuAD, and SWAG.
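To make the idea a bit more concrete, here is a toy sketch of what a "vokenizer" does as I understand it: each token is matched to its most relevant image (its "voken") by scoring token embeddings against image embeddings. Everything below is an illustrative assumption, not the authors' code: the `vokenize` helper, the hand-written 2-d vectors, and the plain dot-product scoring are all made up, whereas the real system relies on learned text and image encoders.

```python
# Toy sketch of the token-to-image retrieval step only.
# All names and vectors here are hypothetical, for illustration.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def vokenize(token_vecs, image_vecs):
    """Return, for each token vector, the index of the best-scoring image."""
    vokens = []
    for t in token_vecs:
        scores = [dot(t, img) for img in image_vecs]
        # Pick the image with the highest relevance score as the voken.
        vokens.append(max(range(len(scores)), key=scores.__getitem__))
    return vokens

# Hypothetical embeddings; a real system would compute these with encoders.
image_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
token_vecs = [[0.9, 0.1], [0.1, 0.8], [0.6, 0.6]]

vokens = vokenize(token_vecs, image_vecs)
print(vokens)
```

The retrieved voken indices can then act as extra prediction targets next to the usual masked-language-modeling objective, which is how a pure-text corpus ends up with a visual supervision signal.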
Paper https://arxiv.org/abs/2010.06775
Code https://github.com/airsplay/vokenization