LLM Architecture Gallery (opens in new tab)

(sebastianraschka.com)

584 pointstzury10d ago42 comments

42 comments

This is great - always worth reading anything from Sebastian. I would also highly recommend his Build an LLM From Scratch book. I feel like I didn’t really understand the transformer mechanism until I worked through that book.

On the LLM Architecture Gallery, it’s interesting to see the variations between models, but I think the 30,000ft view of this is that in the last seven years since GPT-2 there have been a lot of improvements to LLM architecture but no fundamental innovations in that area. The best open weight models today still look a lot like GPT-2 if you zoom out: it’s a bunch of attention layers and feed forward layers stacked up.

Another way of putting this is that astonishing improvements in capabilities of LLMs that we’ve seen over the last 7 years have come mostly from scaling up and, critically, from new training methods like RLVR, which is responsible for coding agents going from barely working to amazing in the last year.

That’s not to say that architectures aren’t interesting or important or that the improvements aren’t useful, but it is a little bit of a surprise, even though it shouldn’t be at this point because it’s probably just a version of the Bitter Lesson.

imjonse10d ago

> On the LLM Architecture Gallery, it’s interesting to see the variations between models, but I think the 30,000ft view of this is that in the last seven years since GPT-2 there have been a lot of improvements to LLM architecture but no fundamental innovations in that area.

After years of showing up in papers and toy models, hybrid architectures like Qwen3.5 contain one such fundamental innovation - linear attention variants which replace the core of transformer, the self-attention mechanism. In Qwen3.5 in particular only one of every four layers is a self-attention layer.

MoEs are another fundamental innovation - also from a Google paper.

libraryofbabel10d ago

Thanks for the note about Qwen3.5. I should keep up with this more. If only it were more relevant to my day to day work with LLMs!

I did consider MoEs but decided (pretty arbitrarily) that I wasn’t going to count them as a truly fundamental change. But I agree, they’re pretty important. There’s also RoPE too, perhaps slightly less of a big deal but still a big difference from the earlier models. And of course lots of brilliant inference tricks like speculative decoding that have helped make big models more usable.

phanarch10d ago

I'd push back slightly on the "no fundamental innovations" read though — the innovations that stuck (MoE, GQA, RoPE) are almost entirely ones that improve GPU utilization: better KV-cache efficiency, more parallelism in attention, cheaper to serve per parameter. Mamba and SSM-based hybrids are interesting but kept running into hardwar friction.

iroddis10d ago

This is amazing, such a nice presentation. It reminds me of the Neural Network Zoo [1], which was also a nice visualization of different architectures.

[1] https://www.asimovinstitute.org/neural-network-zoo/

nxobject9d ago

Thank you for this! I help teach a "CS enrichment course", and I'm having students play with Keras (with my own written scaffolding of course.) I'm struggling to find a resource to help me plan beyond "this is a perceptron/FFNN", and with my lack of experience (I'm a statistician) this is going to be extremely helpful.

wood_spirit10d ago

Lovely!

Is there a sort order? Would be so nice to understand the threads of evolutions and revolution in the progression. A bit of a family tree and influence layout? It would also be nice to have a scaled view so you can sense the difference in sizes over time.

krackers10d ago

There is https://magazine.sebastianraschka.com/p/technical-deepseek which shows an evolution in deepseek family

andai10d ago

> The goal of the proof verifier (LLM 2) is to check the generated proofs (LLM 1), but who checks the proof verifier? To make the proof verifier more robust and prevent it from hallucinating issues, they developed a third LLM, a meta-verifier.

1 more reply

gasi10d ago

So cool — thanks for sharing! Here’s a zoomable version of the diagram: https://zoomhub.net/LKrpB

7777777phil9d ago

This is amazing, I just spent some time scrolling through these, most of the evolution is about inference cost not capability. GQA, MoE routing, sliding window attention, all trading theoretical capacity for practical efficiency.

Tbh might be the last generation of architectures designed entirely by humans. I dug into that (1) and might add another paragraph based on this if I find the time. The Big LLM Architecture Comparison (2) by Sebastian Raschka already inspired my ograph image for the blog -thanks again!

(1) https://philippdubach.com/posts/the-last-architecture-design...

(2) https://magazine.sebastianraschka.com/p/the-big-llm-architec...

charcircuit10d ago

I'm surprised at how similar all of them are with the main differences being the size of layers.

hrmtst9383710d ago

Most of the arch work is just scaling knobs.

If you swap in wierd layer types or move the objective much people run into ugly failure modes fast, so the field keeps circling the same Transformer blocks and then markets the change as novel when it's mostly a trianing and compute tradeoff.

bicepjai10d ago

Currently working on a similar project for myself. This looks like a great resource. Thanks for sharing. https://llm-lab.bicepjai.com/

Slugcat10d ago

What tool was used to draw the diagrams?

dr_kiszonka9d ago

https://github.com/rasbt/llm-architecture-gallery

khafra10d ago

Nice! Last time I had a custom temporary tattoo made, I had to copy and paste from Attention is All You Need; this provides a much cleaner and more varied source.

cagz10d ago

It is perhaps my eyes, but when I zoom in enough to make it readable, it gets blurry. A higher-res image would be much appreciated. Great idea otherwise.

travisgriggs10d ago

Darn. I clicked here hoping we were having LLMs design skyscrapers, dams, and bridges.

I even brought my popcorn :(

nxobject10d ago

Thank you so much! As a (bio)statistician, I've always wanted a "modular" way to go from "neural networks approximate functions" to a high-level understanding about how machine learning practitioners have engineered real-life models.

LuxBennu10d ago

Interesting collection. The architecture differences show up in surprising ways when you actually look at prompt patterns across models. Longer context windows don't just let you write more, they change what kind of input structure works best.

TedHerman9d ago

For comparison: https://cudl.lib.cam.ac.uk/collections/medievalmedicalrecipe...

neuroelectron10d ago

An older post from this blog, the linked article was updated recently: https://news.ycombinator.com/item?id=44622608

jasonjmcghee10d ago

What's the structurally simplest architecture that has worked to a reasonably competitive degree?

loveparade10d ago

Competitiveness doesn't really come from architecture, but from scale, data, and fine-tuning data. There has been little innovation in architecture over the last few years, and most innovations are for the purpose of making it more efficient to run training or inference (fit in more data), not "fundamentally smarter"

bigyabai10d ago

If your definition of "competitive" is loose enough, you can write your own Markov chain in an evening. Transformer models rely on a lot of prior art that has to be learned incrementally.

jasonjmcghee10d ago

Not that loose lol.

I’m thinking it’s still llama / dense decoder only transformer.

jrvarela5610d ago

Would be awesome to see something like this for agents/harnesses

vinhnx10d ago

I think Cognition DeepWiki's or Google CodeWiki's code map does generated a architecture map (Mermaid style). Eg: https://deepwiki.com/openai/codex#project-purpose-and-archit...

imfing10d ago

Thanks for putting all these model architectures together!

arikrahman10d ago

Thank you for the high quality diagrams!

mvrckhckr10d ago

What a great idea and nice execution.

jawarner10d ago

Looks like this may have received the HN Hug of Death. I'm getting "Too Many Requests" error trying to load the images.

brianjking10d ago

I'm getting that trying to load the content at all, text included.

celltalk10d ago

We’re literally seeing digital evolution in real-time. These are basically primitive life forms such as bacteria evolving just with tiniest differences.

Right now we’re engineering every bit of it to make it better but in the long run this is unsustainable. It’s going to be so complex that even these digital life forms won’t be able to understand their own digital DNAs, like us.

We know we have DNA, we can measure every letter but it doesn’t mean we understand what’s going on our 14 trillion cells and how each and every one of them is regulated.

I think this analogy not only explains us, or digital beings we see today. It explains everything, quite literally. Still it would be amazing to think about these systems from the perspective of biology, and try to understand the parts analogous to existing frame that we already have. Then we might figure out what to optimize better. For instance if we figure out a certain part of a layer corresponds to “genes” then we might find out there is alternative splicing within it. Wild but worth a shot.

j / k navigate · click thread line to collapse

42 comments

libraryofbabel10d ago

imjonse10d ago

MoEs are another fundamental innovation - also from a Google paper.

libraryofbabel10d ago

Thanks for the note about Qwen3.5. I should keep up with this more. If only it were more relevant to my day to day work with LLMs!

phanarch10d ago

iroddis10d ago

This is amazing, such a nice presentation. It reminds me of the Neural Network Zoo [1], which was also a nice visualization of different architectures.

[1] https://www.asimovinstitute.org/neural-network-zoo/

nxobject9d ago

wood_spirit10d ago

Lovely!

krackers10d ago

There is https://magazine.sebastianraschka.com/p/technical-deepseek which shows an evolution in deepseek family

andai10d ago

1 more reply

gasi10d ago

So cool — thanks for sharing! Here’s a zoomable version of the diagram: https://zoomhub.net/LKrpB

7777777phil9d ago

(1) https://philippdubach.com/posts/the-last-architecture-design...

(2) https://magazine.sebastianraschka.com/p/the-big-llm-architec...

charcircuit10d ago

I'm surprised at how similar all of them are with the main differences being the size of layers.

hrmtst9383710d ago

Most of the arch work is just scaling knobs.

bicepjai10d ago

Currently working on a similar project for myself. This looks like a great resource. Thanks for sharing. https://llm-lab.bicepjai.com/

Slugcat10d ago

What tool was used to draw the diagrams?

dr_kiszonka9d ago

https://github.com/rasbt/llm-architecture-gallery

khafra10d ago

Nice! Last time I had a custom temporary tattoo made, I had to copy and paste from Attention is All You Need; this provides a much cleaner and more varied source.

cagz10d ago

It is perhaps my eyes, but when I zoom in enough to make it readable, it gets blurry. A higher-res image would be much appreciated. Great idea otherwise.

travisgriggs10d ago

Darn. I clicked here hoping we were having LLMs design skyscrapers, dams, and bridges.

I even brought my popcorn :(

nxobject10d ago

LuxBennu10d ago

TedHerman9d ago

For comparison: https://cudl.lib.cam.ac.uk/collections/medievalmedicalrecipe...

neuroelectron10d ago

An older post from this blog, the linked article was updated recently: https://news.ycombinator.com/item?id=44622608

jasonjmcghee10d ago

What's the structurally simplest architecture that has worked to a reasonably competitive degree?

loveparade10d ago

bigyabai10d ago

If your definition of "competitive" is loose enough, you can write your own Markov chain in an evening. Transformer models rely on a lot of prior art that has to be learned incrementally.

jasonjmcghee10d ago

Not that loose lol.

I’m thinking it’s still llama / dense decoder only transformer.

jrvarela5610d ago

Would be awesome to see something like this for agents/harnesses

vinhnx10d ago

I think Cognition DeepWiki's or Google CodeWiki's code map does generated a architecture map (Mermaid style). Eg: https://deepwiki.com/openai/codex#project-purpose-and-archit...

imfing10d ago

Thanks for putting all these model architectures together!

arikrahman10d ago

Thank you for the high quality diagrams!

mvrckhckr10d ago

What a great idea and nice execution.

jawarner10d ago

Looks like this may have received the HN Hug of Death. I'm getting "Too Many Requests" error trying to load the images.

brianjking10d ago

I'm getting that trying to load the content at all, text included.

celltalk10d ago

We’re literally seeing digital evolution in real-time. These are basically primitive life forms such as bacteria evolving just with tiniest differences.

We know we have DNA, we can measure every letter but it doesn’t mean we understand what’s going on our 14 trillion cells and how each and every one of them is regulated.

j / k navigate · click thread line to collapse