Is there a way to confirm that this extra processing relates to the language structure, and not the processing of concepts?
I wouldn’t be surprised if the lack of video and 3D understanding in the image dataset training fails to understand things like the fear of heights, and the concept of gravity ends up being learned in the text processing weights.