I have been looking into LLM training and inference for quite some time, and have developed a number of hypotheses about future SoTA models, which I believe very likely apply to GPT-4.
For example, I think Google's paper "Sparse is Enough in Scaling Transformers" was very underrated, as it provided more than an order of magnitude improvement in inference economy, and it included one OpenAI researcher among its authors.
"A Length-Extrapolatable Transformer"
https://arxiv.org/abs/2212.10554
"Language Is Not All You Need: Aligning Perception with Language Models"
https://arxiv.org/abs/2302.14045
Notably, this positional embedding (xPos, from the first paper above) has been implemented by lucidrains in his x-transformers package: https://github.com/lucidrains/x-transformers/blob/main/x_tra...
For me, the one slightly positive takeaway from OpenAI's paper is the good uncertainty calibration of the base pretrained GPT-4. It could be interpreted as one example of "awareness" of its own inner workings.
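To make "calibration" concrete: a model is well calibrated if, among the answers it gives with roughly 80% confidence, roughly 80% turn out to be right. Here is a rough sketch of one common way to measure it, expected calibration error (ECE). The function name, the equal-width binning and the toy numbers are my own assumptions for illustration, not anything taken from OpenAI's paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and actual accuracy.

    confidences: predicted probabilities in [0, 1] for the chosen answers
    correct:     booleans, True where the chosen answer was right
    n_bins:      number of equal-width confidence bins (an assumption here)
    """
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes its confidence/accuracy gap, weighted by its size.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Toy data: one model says 75% and is right 3 times out of 4,
# another says 95% and is right only 2 times out of 4.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
```

A lower value means the stated confidences track reality more closely, so the first toy model scores better than the second even though neither is always right.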
Of course, it's hard to say much about a model whose architecture we don't even know, not to mention such luxuries as access to the weights... Meta's LLaMA release did more to democratize deep learning than OpenAI's GPT-4, that's for sure.