I can't wait to see ideas from the diffusion image generation world (like ControlNet) work their way into language models.
There's nothing wrong with you; you just need the right background, and you can go get that. See, e.g., the fast.ai course.
A single paper is part of a conversation, not something that stands alone. Trying to read one random paper is like finding a 1,000-page thread on an obscure topic that has been running for 10+ years and reading only the last page. It won’t make any sense without reading back a ways.
I'd worry about learning the wrong things.