Or is the idea to keep the network the same size and trade off some of its nodes for image, video, etc. data?
If so, has anyone shown that doing so results in better overall performance?
My lay observation is that GPT-4 seems to be on the border of usability for most applications, so if nothing is gained by simply changing the input data type, as opposed to expanding the model, then it feels like it won't be of much use yet.
Also, apologies if I'm not making sense; I'm almost certainly not using the correct technical terms to articulate what I'm thinking.