I would strongly caution anyone who thinks that they will be able to understand or explain LLM behavior better by studying the architecture closely. That is a trap. Big SotA models these days exhibit so many nontrivial emergent phenomena (in part due to the massive application of reinforcement learning techniques), giving them capabilities very few people expected to ever see when this architecture first arrived. Most of us confidently claimed even back in 2023 that, based on LLM architecture and training algorithms, LLMs would never be able to perform well on novel coding or mathematics tasks. We were wrong. That points towards some caution and humility about using network architecture alone to reason about how LLMs work and what they can do. You'd really need to be able to poke at the weights inside a big SotA model to even begin to answer those kinds of questions, but unfortunately that's only really possible if you're a "mechanistic interpretability" researcher at one of the major labs.
Regardless, this is a nice article, and this stuff is worth learning because it's interesting for its own sake! Right now I'm actually spending some vacation time implementing a transformer in PyTorch just to refresh my memory of it all. It's a lot of fun! If anyone else wants to get started with that I would highly recommend Sebastian Raschka's book and YouTube videos as a way into the subject: https://github.com/rasbt/LLMs-from-scratch .
Has anyone read TFA author Jay Alammar's book (published Oct 2024) and would they recommend it for a more up-to-date picture?
So sad that "reinforcement learning" is another term whose meaning has been completely destroyed by uneducated hype around LLMs (very similar to "agents"). 5 years ago nobody familiar with RL would consider what these companies are doing as "reinforcement learning".
RLHF and similar techniques are much, much closer to traditional fine-tuning than they are to reinforcement learning. RL has almost always, historically, assumed online training and interaction with an environment. RLHF is collecting preference data from users and using it to teach the LLM to be more engaging.
This fine-tuning also doesn't magically transform LLMs into something different, but it is largely responsible for their sycophantic behavior. RLHF makes LLMs more pleasing to humans (and of course can be exploited to help move the needle on benchmarks).
It's really unfortunate that people will throw away their knowledge of computing in order to maintain a belief that LLMs are something more than they are. LLMs are great, very useful, but they're not producing "nontrivial emergent phenomena". They're increasingly trained as products to drive engagement. I've found LLMs less useful in 2025 than in 2024. And the trend away from opening them up under the hood and playing around with them to explore what they can do has basically made me leave the field (I used to work in AI related research).
> I've found LLMs less useful in 2025 than in 2024.
I really don't know how to reply to this part without sounding insulting, so I won't.
I’d give similar advice to any coding bootcamp grad: yes you can get far by just knowing python and React, but to reach the absolute peak of your potential and join the ranks of the very best in the world in your field, you’ll eventually want to dive deep into computer architecture and lower level languages. Knowing these deeply will help you apply your higher level code more effectively than your coding bootcamp classmates over the course of a career.
I mostly do it because it's interesting and I don't like mysteries, and that's why I'm relearning transformers, but I hope knowing LLM internals will be useful one day too.
A common sentiment on HN is that LLMs generate too many comments in code.
But comment spam is going to help code quality, due to the way causal transformers and positional encoding work. The model has learned to dump locally-specific reasoning tokens where they're needed, in a tightly scoped cluster that can be attended to easily, and forgotten just as easily later on. It's like a disposable scratchpad to reduce the errors in the code it's about to write.
The solution to comment spam is textual/AST post-processing of generated code, rather than prompting the LLM to handicap itself by not generating as many comments.
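That kind of post-processing is cheap to do. As a minimal sketch (assuming the generated code is Python), the standard library's `tokenize` module can strip comment tokens while leaving code and strings intact:

```python
import io
import tokenize

def strip_comments(source: str) -> str:
    """Remove '#' comments from Python source, keeping code and strings intact."""
    kept = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            continue  # drop the comment token; positions of the rest are preserved
        kept.append(tok)
    return tokenize.untokenize(kept)

generated = "x = 1  # scratchpad note from the model\ny = x + 1\n"
print(strip_comments(generated))
```

So the model keeps its "scratchpad" during generation, and the reader never sees it.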
Like I said, it's a trap to reason from architecture alone to behavior.
A common sentiment on HN is that LLMs generate too many comments in code.
For good reason -- comment sparsity improves code quality, due to the way causal transformers and positional encoding work. The model has learned that real, in-distribution code carries meaning in structure, naming, and control flow, not dense commentary. Fewer comments keep next-token prediction closer to the statistical shape of the code it was trained on.
Comments aren’t a free scratchpad. They inject natural-language tokens into the context window, compete for attention, and bias generation toward explanation rather than implementation, increasing drift over longer spans.
The solution to comment spam isn’t post-processing. It’s keeping generation in-distribution. Less commentary forces intent into the code itself, producing outputs that better match how code is written in the wild, and forcing the model into more realistic context avenues.
We are only just beginning to understand how these things work. I imagine it will end up being similar to Freud’s Oedipal complex: when we failed to have a fully physical understanding of cognition, we employed a schematic narrative. Something similar is already emerging.
I'm not clear at all we were wrong. A lot of the mathematics announcements have been rolled back and "novel coding" is exactly where the LLMs seem to fail on a daily basis - things that are genuinely not represented in the training set.
What happens to an LLM without reinforcement learning?
Unfortunately the really interesting details of this are mostly secret sauce stuff locked up inside the big AI labs. But there are still people who know far more than I do who do post about it, e.g. Andrej Karpathy discusses RL a bit in his 2025 LLMs Year in Review: https://karpathy.bearblog.dev/year-in-review-2025/
However, most modern LLMs, even base models, are not trained just on raw internet text. Most of them were also fed a huge amount of synthetic data. You can often see the exact details in their model cards. As a result, if you sample from them, you will notice that they love to output text that looks like:
6. **You will win millions playing bingo.**
- **Sentiment Classification: Positive**
- **Reasoning:** This statement is positive as it suggests a highly favorable outcome for the person playing bingo.
This is not your typical internet page. The purpose of RL (applied to LLMs as a second "post-training" stage after pre-training) is to train the LLM to act as if it had planned ahead before "speaking", so that rather than just focusing on the next word it will instead try to choose a sequence of words that steer the output towards a particular type of response that was rewarded during RL training.
There are two types of RL generally applied to LLMs.
1) RLHF - RL from Human Feedback, where the goal is to generate responses that during A/B testing humans had indicated a preference for (for whatever reason).
2) RLVR - RL with Verifiable Rewards, used to promote the appearance of reasoning in domains like math and programming where the LLM's output can be verified in some way (e.g. math result or program output checked).
Without RLHF (as was the case pre-ChatGPT) the output of an LLM can be quite unhinged. Without RLVR, aka RL for reasoning, the ability of the model to reason (or give the appearance of reasoning) is a function of pre-training alone, and won't have the focus (like putting blinkers on a horse) to narrow generative output towards the desired goal.
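To make "verifiable reward" concrete, here's a toy sketch of the kind of check RLVR relies on: the model's completion is graded programmatically, and that binary grade becomes the RL reward signal. All names here are illustrative, not from any real training stack:

```python
import re

def verify_math_answer(model_output: str, expected: int) -> float:
    """Reward 1.0 if the final integer in the output matches the known answer."""
    numbers = re.findall(r"-?\d+", model_output)
    if numbers and int(numbers[-1]) == expected:
        return 1.0
    return 0.0

# A "reasoning" completion only earns reward if its final answer checks out.
print(verify_math_answer("First, 17 + 25 = 42. The answer is 42.", 42))  # 1.0
print(verify_math_answer("The answer is 41.", 42))                        # 0.0
```

Real pipelines use much more robust graders (unit tests for code, symbolic checkers for math), but the principle is the same: only outcomes that can be checked automatically get rewarded.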
I feel like there are three groups of people:
1. Those who think that LLMs are stupid slop-generating machines which couldn't ever possibly be of any use to anybody, because there's some problem that is simple for humans but hard for LLMs, which makes them unintelligent by definition.
2. Those who think we have already achieved AGI and don't need human programmers any more.
3. Those who believe LLMs will destroy the world in the next 5 years.
I feel like the composition of these three groups has been pretty much constant since the release of ChatGPT, and like with most political fights, evidence doesn't convince people either way.
But a lot of us have a more nuanced take! It's perfectly possible to believe simultaneously that 1) LLMs are more than stochastic parrots 2) LLMs are useful for software development 3) LLMs have all sorts of limitations and risks (you can produce unmaintainable slop with them, and many people will, there are massive security issues, I can go on and on...) 4) We're not getting AGI or world-destroying super-intelligence anytime soon, if ever 5) We're in a bubble and it's going to pop and cause a big mess 6) This tech is still going to be transformative long term, on a similar level to the web and smartphones.
Don't let the noise from the extreme people who formed their opinions back when ChatGPT came out drown out serious discussion! A lot of us try and walk a middle course with this and have been and still are open to changing our minds.
There's no rule that the internet is limited to a single explanation. Find the one that clicks for you, ignore the rest. Whenever I'm trying to learn about concepts in mathematics, computer science, physics, or electronics, I often find that the first or the "canonical" explanation is hard for me to parse. I'm thankful for having options 2 through 10.
If your mental model of an LLM is:
> a synthetic human performing reasoning
You are severely overestimating the capabilities of these models and not recognizing potential areas of failure (even if your prompt works for now in the happy case). Understanding how transformers work absolutely can help debug problems (or avoid them in the first place). People without a deep understanding of LLMs also tend to get fooled by them more frequently. When you have internalized the fact that LLMs are literally optimized to trick you, you tend to be much more skeptical of the initial results (which results in better eval suites etc).
Then there's people who actually do AI engineering. If you're working with local/open weights models or on the inference end of things, you aren't limited to poking at an API: you have a lot more control and observability into the model and should be making use of it.
I still hold that the best test of an AI Engineer, at any level of the "AI" stack, is how well they understand speculative decoding. It involves understanding quite a bit about how LLMs work and can still be implemented on a cheap laptop.
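For anyone who wants to try that test on themselves, here's a heavily simplified sketch of the speculative decoding control flow. The two "models" are toy deterministic functions (real speculative decoding compares full probability distributions with a rejection-sampling correction; this greedy version only captures the draft-then-verify, accept-longest-matching-prefix idea):

```python
import random

random.seed(0)

def target_model(ctx):
    """The big, slow model (toy: a deterministic function of the context)."""
    return (sum(ctx) * 7 + 3) % 10

def draft_model(ctx):
    """The small, fast model: agrees with the target most of the time."""
    guess = target_model(ctx)
    return guess if random.random() < 0.8 else (guess + 1) % 10

def speculative_step(ctx, k=4):
    """Draft k tokens cheaply, then verify them against the target model.
    Keep the agreeing prefix; on the first mismatch, substitute the target's
    token and stop. One expensive pass can thus yield up to k tokens."""
    drafted, c = [], list(ctx)
    for _ in range(k):
        t = draft_model(tuple(c))
        drafted.append(t)
        c.append(t)
    accepted, c = [], list(ctx)
    for t in drafted:
        want = target_model(tuple(c))
        if t == want:
            accepted.append(t)
            c.append(t)
        else:
            accepted.append(want)  # target overrides the first mismatch
            break
    return accepted

print(speculative_step((1, 2, 3)))
```

The key property: the accepted output is identical to what the target model would have produced on its own; the draft model only changes how many expensive verification passes you need.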
1) ‘human’ encompasses behaviours that include revenge cannibalism and recurrent sexual violence —- wish carefully.
2) not even a little bit, and if you want to pretend then pretend they're a deranged delusional psych patient who will look you in the eye and say genuinely "oops, I guess I was lying, it won't ever happen again" and then lie to you again, while making sure it happens again.
3) don’t anthropomorphize LLMs, they don’t like it.
The future is now! (Not because of "a synthetic human" per se but because of people thinking of them as something unremarkable.)
Whereas a standard deep layer in a network is matrix * input, where each row of the matrix holds the weights of one neuron in the next layer, an attention layer is basically three projections of the same input: Q = input*MatrixA, K = input*MatrixB, V = input*MatrixC (where input, a matrix of token embeddings, times a weight matrix is again a matrix), and the output is softmax(Q*K^T)*V. Just more matrix multiplies per layer.
And consequently, you can represent the entire transformer architecture as a set of deep layers as you unroll the matrices, with a lot of zeros for the multiplication pieces that are not needed.
This is a fairly complex blog post, but it shows that it's just matrix multiplication all the way down. https://pytorch.org/blog/inside-the-matrix/.
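The "all matrix multiplies plus one softmax" claim fits in a few lines of NumPy. A minimal single-head self-attention sketch (shapes are illustrative: 4 tokens, dimension 8):

```python
import numpy as np

rng = np.random.default_rng(0)

# Token embeddings: seq_len x d_model
X = rng.standard_normal((4, 8))

# The three learned projection matrices
W_q = rng.standard_normal((8, 8))
W_k = rng.standard_normal((8, 8))
W_v = rng.standard_normal((8, 8))

Q, K, V = X @ W_q, X @ W_k, X @ W_v    # three projections of the same input

scores = Q @ K.T / np.sqrt(8)          # token-token similarity (4 x 4)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax

out = weights @ V                      # each row: a weighted mix of value vectors
print(out.shape)                       # (4, 8)
```

Everything except the softmax normalization is a plain matmul, which is why the whole thing maps so well onto hardware built around systolic matrix-multiply units.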
There might be some unifying way to look at things though, maybe GNNs. I found this talk [1] and at 4:17 it shows how convolution and attention would be modeled in a GNN formalism
Once compute became more available, you could build bigger networks with more dimensionality (in the sense of layer sizes), which gave gradient descent more directions to move in during training, and things started happening with ML.
And all the architectures that you see today are basically simplifications of fully connected layers at maximum dimensionality. Any operation like attention, self-attention, or convolution can be unrolled into matrix multiplies.
I wouldn't be surprised if Google TPUs basically do this. It stands to reason that they are the most efficient because they don't move memory around, which means that the matrix multiply circuitry is hard wired, which means that the compiler basically has to lay out the data in the locations that are meant to be matrix multiplied together, so the compiler probably does that unrolling under the hood.
It also means more jobs for the people who understand them at a deeper level to advance the SOTA of specific widely used technologies such as operating systems, compilers, neural network architectures and hardware such as GPUs or TPU chips.
Someone has to maintain and improve them.
If you also get into more robust and/or specialized tasks (e.g. rotation invariant computer vision models, graph neural networks, models working on point-cloud data, etc) then transformers are also not obviously the right choice at all (or even usable in the first place). So plenty of other useful architectures out there.
What about DINOv2 and DINOv3, 1B and 7B, vision transformer models? This paper [1] suggests significant improvements over traditional YOLO-based object detection.
...but, if you have favorite resources on understanding Q & K, please drop them in comments below...
(I've watched the Grant Sanderson/3blue1brown videos [including his excellent talk at TNG Big Tech Day '24], but Q & K still escape me).
Thank you in advance.
Once you recognize this it's a wonderful re-framing of what a transformer is doing under the hood: you're effectively learning a bunch of sophisticated kernels (though the FF part) and then applying kernel smoothing in different ways through the attention layers. It makes you realize that Transformers are philosophically much closer to things like Gaussian Processes (which are also just a bunch of kernel manipulation).
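The kernel-smoothing reading can be made concrete: a single-query attention head is a Nadaraya-Watson estimator whose kernel happens to be the exponential of a dot product rather than a Gaussian on distance. A sketch of both, side by side (names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def nadaraya_watson(query, keys, values, bandwidth=1.0):
    """Classic kernel smoother: weight each value by a Gaussian kernel
    on the distance between the query and each key."""
    k = np.exp(-np.sum((keys - query) ** 2, axis=-1) / (2 * bandwidth ** 2))
    w = k / k.sum()
    return w @ values

def softmax_attention(query, keys, values):
    """Single-query dot-product attention, written in the same shape."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values

keys = rng.standard_normal((5, 4))
values = rng.standard_normal((5, 4))
q = rng.standard_normal(4)

# Both produce a convex combination of the value vectors;
# only the choice of kernel differs.
print(nadaraya_watson(q, keys, values).shape)    # (4,)
print(softmax_attention(q, keys, values).shape)  # (4,)
```

Seen this way, the learned Q/K projections are just learning the feature space in which the kernel comparison happens.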
0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...
Plus with different forms of attention, e.g. merged attention, and the research into why / how attention mechanisms might actually be working, the whole "they are motivated by key-value stores" thing starts to look really bogus. Really it is that the attention layer allows for modeling correlations and/or multiplicative interactions among a dimension-reduced representation.
Don't get me wrong, implementing attention is still great (and necessary), but even with something as simple as linear regression, implementing it doesn't really give you the entire conceptual model. I do think implementation helps to understand the engineering of these models, but it still requires reflection and study to start to understand conceptually why they are working and what they're really doing (I would, of course, argue I'm still learning about linear models in that regard!)
"he was red" - maybe color, maybe angry, the "red" token embedding carries both, but only one aspect is relevant for some particular prompt.
if I understand it all correctly.
implemented it in html a while ago and might do it in htmx sometime soon.
transformers are just slutty dictionaries that Papa Roach and kage bunshin no jutsu right away, again and again, spawning clones and variations based on requirements, which is why they tend to repeat themselves rather quickly and often. it's got almost nothing to do with the languages themselves; requirements and weights amount to playbooks and DEFCON levels