GPT-4 architecture: what we can deduce from research literature

(kir-gadjello.github.io)

10 points · kir-gadjello · 3y ago · 6 comments

6 comments · 3 top-level

kir-gadjello (OP) · 3y ago · 1 in thread

As the discussion of GPT-4 heats up, the absence of details on its technical implementation only becomes more glaring. As an engineer, I learned nothing applicable from the newest OpenAI publication that I didn't already know yesterday!

I have been investigating issues of LLM training and inference for quite some time, and have developed a number of hypotheses about future SoTA models, which I believe very likely apply to GPT-4.

kir-gadjello (OP) · 3y ago

If you have questions about my rationale for any technique included in the list, please ask!

For example, I think Google's paper "Sparse is Enough in Scaling Transformers" was very underrated: it provided more than an order of magnitude improvement in inference economy, and its authors included one OpenAI researcher.

amrb · 3y ago · 1 in thread

I'd like to know how it can support 32k when all the other models I've seen are 2-4k. Does this mean it has a bigger attention layer, or that it has 4x as many billions of parameters?

kir-gadjello (OP) · 3y ago

It could be done in a dozen ways. One beautiful method is simply to use the xPos positional embedding pioneered by Microsoft and scale the context window size at runtime (even better if your attention is subquadratic; again, there are a dozen varieties to pick from). See:

"A Length-Extrapolatable Transformer"

https://arxiv.org/abs/2212.10554

"Language Is Not All You Need: Aligning Perception with Language Models"

https://arxiv.org/abs/2302.14045

Notably, this positional embedding has been implemented by lucidrains in his x-transformers package: https://github.com/lucidrains/x-transformers/blob/main/x_tra...
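
To make the idea concrete, here is a minimal pure-Python sketch of an xPos-style encoding. It is an illustration under stated assumptions, not the paper's or x-transformers' actual code: the helper names `xpos_encode` and `score` are hypothetical, and the constants `base`, `gamma` and `scale_base` are assumed illustrative defaults. Each pair of features is treated as one complex number; queries are rotated and scaled up with position, keys are rotated and scaled down, so the resulting attention score depends only on the distance between the two positions. That relative-only dependence is what lets the context window be changed at runtime.

```python
import cmath

def xpos_encode(vec, pos, is_key=False, base=10000.0, gamma=0.4, scale_base=512.0):
    """Apply an xPos-style rotation and scaling to one token vector.

    vec is a list of complex numbers, one per 2-D feature pair.
    All default constants are assumptions chosen for illustration.
    """
    half = len(vec)
    out = []
    for i, v in enumerate(vec):
        theta = base ** (-i / half)                 # rotary frequency of pair i
        zeta = (i / half + gamma) / (1.0 + gamma)   # per-pair decay base in (0, 1)
        scale = zeta ** (pos / scale_base)
        if is_key:
            scale = 1.0 / scale                     # keys get the inverse scale
        out.append(v * cmath.exp(1j * pos * theta) * scale)
    return out

def score(q, k):
    """Real inner product of the underlying 2-D feature pairs."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))

if __name__ == "__main__":
    q = [complex(1.0, 2.0), complex(-0.5, 0.3), complex(0.7, -1.1), complex(0.2, 0.9)]
    k = [complex(0.4, -0.6), complex(1.2, 0.1), complex(-0.3, 0.8), complex(0.5, 0.5)]
    # Same query-key distance (3) at two very different absolute offsets.
    near = score(xpos_encode(q, 5), xpos_encode(k, 2, is_key=True))
    far = score(xpos_encode(q, 1005), xpos_encode(k, 1002, is_key=True))
    print(near, far)  # the two values agree up to floating-point error
```

Because the scale factors cancel down to `zeta ** ((n - m) / scale_base)`, a query at position n attending to an earlier key at position m is damped more the further back the key is, while absolute positions drop out of the score entirely.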

seydor · 3y ago · 1 in thread

Well if the model is so smart, could it be that it is actually aware of its layers and parameters?

kir-gadjello (OP) · 3y ago

It's no problem to put the model's architecture and even some Python code into the generous 32k context window; the real problem seems to be, as you say, "awareness", or at least the facet of it that would allow the model to answer complex novel questions.

For me, the slightly positive takeaway from OpenAI's paper is the good uncertainty calibration of the base pretrained GPT-4. It could be interpreted as one example of "awareness" of its inner workings.

Of course, it's hard to say much about a model whose architecture we don't even know, not to mention such luxuries as access to weights... Meta's LLaMA release did more to democratize deep learning than OpenAI's GPT-4, that's for sure.
