The weird finding: single-layer duplication does nothing. Too few layers, nothing. Too many, it gets worse. Only circuit-sized blocks of ~7 layers work. This suggests pretraining carves out discrete functional circuits in the layer stack that only work when preserved whole.
The whole thing was developed on 2x RTX 4090s in my basement. I'm now running current models (GLM-4.7, Qwen3.5, MiniMax M2.5) on a dual GH200 rig (see my other post). Code and new models coming soon.
Happy to answer questions.
MoE notwithstanding, a model trained on the whole Internet and a few hundred thousand stolen books carries way more knowledge than is actually needed for any given workflow. It would be great if we could ship slimmed-down models into which we'd plug the knowledge banks useful for today's work, and only those.
It would also mean that you could keep a model's knowledge fresh without retraining the whole of it.
plugs in knowledge bank LLM: ... I know kung fu.
This wasn't something I really dug into in great detail but I remember my surprise back then at how all those merged models and those "expanded" models like Goliath still generated coherent output. IMO those were more community models made by small creators for entertainment rather than work, and only really of interest to the local LLM groups on Reddit, 4chan, and Discord. People might briefly discuss it on the board and say "that's cool" but papers aren't being written and it's less likely for academics or corpo researchers to notice it.
That being said I wonder if it's possible to combine the layers of completely different models like say a Llama and a Qwen and still get it to work.
Even with math probes, I hit unexpected problems. LLMs fail arithmetic in weird ways. They don’t get the answer wrong so much as get it almost right but forget to write the last digit, as if it got bored mid-number. Or they transpose two digits in the middle. Or they output the correct number with a trailing character that breaks the parser.
Would using grammar parsing help here by forcing the LLM to only output the expected tokens (i.e. numbers)? Or maybe on the scoring side you could look at the actual probabilities per token to see how far the correct digit is.
Even between two models of identical architecture, they may have landed on quite different internal representations if the training data recipe was substantially different.
But it would be fun to experiment with.
Nobody here or on Reddit has mentioned this, maybe bc it’s too obvious, but it’s clear to me that the residual connections are an absolutely necessary component to making this merging possible — that’s the only reason dimension 1 of a later layer is encouraged to mean something similar to dimension 1 of an earlier layer.
The code in the blog helps derive useful metrics from partial answers.
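As a rough illustration of what partial-credit scoring could look like (this is not the blog's code; `partial_credit` and its regex are hypothetical), a scorer can pull the first number out of the output and compare digit strings instead of demanding an exact match:

```python
import re
from difflib import SequenceMatcher


def partial_credit(output: str, expected: int) -> float:
    """Return a score in [0, 1]: 1.0 for an exact match, less for near misses."""
    # Take the first number in the output; trailing junk is ignored.
    match = re.search(r"-?\d[\d,]*", output)
    if match is None:
        return 0.0
    got = match.group(0).replace(",", "")
    want = str(expected)
    # Similarity of the two digit strings (2 * matches / total length).
    return SequenceMatcher(None, got, want).ratio()


exact = partial_credit("12345", 12345)
trailing_junk = partial_credit("12345x", 12345)
dropped_digit = partial_credit("1234", 12345)
transposed = partial_credit("12435", 12345)
no_number = partial_credit("no idea", 12345)
```

A dropped last digit or a transposition then scores below 1.0 but well above 0.0, so the probe still gives a usable gradient between "almost right" and "nonsense".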
Have you tried a simple inline loop over the duplicated layers? Would be interesting to see performance. Also, would be interesting to compare with a MOE model. See if these layers are acting like different agreeing "experts" or if there is reasoning happening in the latent space.
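For reference, duplicating a block in the layer list and looping over that block in place should compute the same thing. A toy sketch, with plain Python functions standing in for transformer layers (`run_unrolled`, `run_looped` and the four toy layers are all hypothetical):

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]  # toy stand-in for a transformer block


def run_unrolled(layers: Sequence[Layer], order: Sequence[int], x: float) -> float:
    """Apply layers in an explicit order; duplicates are just repeated indices."""
    for i in order:
        x = layers[i](x)
    return x


def run_looped(layers: Sequence[Layer], start: int, end: int, n_loops: int, x: float) -> float:
    """Run layers[:start] once, loop layers[start:end] n_loops times, then the rest."""
    for layer in layers[:start]:
        x = layer(x)
    for _ in range(n_loops):
        for layer in layers[start:end]:
            x = layer(x)
    for layer in layers[end:]:
        x = layer(x)
    return x


layers: List[Layer] = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x / 2]
unrolled = run_unrolled(layers, [0, 1, 2, 1, 2, 3], 1.0)
looped = run_looped(layers, start=1, end=3, n_loops=2, x=1.0)
```

Both paths reuse the same layer objects, so the difference is bookkeeping rather than extra weights.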
I think this hasn't been tried before because it's totally unintuitive that feeding the output from later layers into previous ones would actually do anything. And in fact, it usually is detrimental. I guess it takes really bored hobbyists with too much compute to check this stuff.
I have done some interesting work on applying multiple layer duplications in different regions of the model too, going so far as to train a meta-model (actually just XGBoost) to predict the merges. Seems to work, but that's a whole other blog post.
This works with MoE, and yes, I would be interested in looking into this in more detail. But my wife might disagree with this time sink...
Normal:
    L1 -> L2 -> L3 -> L4 -> out
Unrolled (current framing):
    L1 -> [L2->L3] -> [L2->L3] -> L4 -> out
Looped (proposed):
          --<--loop----
          |           |
    L1 -> [L2->L3] x N --> L4 -> out
          "reasoning loop"
Note: ASCII rendering on HN is not trivial
I think that these models have to learn to efficiently use their parameters, and the best way to do that is to 'evolve' (yes, a bad word for it) structures over pretraining time. Unfortunately, they don't have a way to access these structures 'from the inside'. I hope this new approach lets us boost performance in a more experimentally rigorous way.
I have a couple questions:
1. I think this quote should be raising *many more* eyebrows.
> The astounding thing about Goliath wasn’t that is was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.
You put a cat's brain into a dog's head and it's still breathing! It didn't flatline immediately! Is this yesterday's news? This seems like the biggest takeaway. Why isn't every <MODEL_PROVIDER> attempting LLM-surgery at this moment? Have you noticed any increased discourse in this area?
2. You mentioned you spent the beginning of your career looking at brains in biotech. How did you end up in a basement of GPUs, working not in biotech, but still kind of looking at brains?
Again, great post!
On your questions: 1) A few other papers have been mentioned in the thread, like SOLAR 10.7B. They duplicated the whole transformer stack, and it kinda helped. But as I found experimentally, that's probably not a great idea. You are duplicating 'organs' (i.e. input processing stuff) that should only have one copy. Also, that paper didn't see immediate improvements; they had to do continued pre-training to see benefits. At that point, I'm guessing the big labs stopped bothering. Limited by hardware, I had to find unusual angles to approach this topic.
2) Nah, no more wetware for me. I did a half decade of research at a big neurobiology institute, and while it was very enjoyable, I can truly say that grant writing and paper review are 'not my thing'. The reason this info was delayed so long is that I wanted a paper in the AI field to go along with my papers in other fields. But as a hobbyist with no official affiliation, and the attention span of a gnat, I gave up and started a blog instead. Maybe someone will cite it?
i think it isn't surprising given how, for example, kernels in the first layers of visual CNNs converge to Gabors, which are also the neuron transfer functions in the first layers of cat, human, etc. visual cortexes, and that there is math proving that such kernels are optimal (under some reasonable conditions).
And so i'd expect that the layers inside LLM reach or come close to some optimality which is universal across brains and LLMs (main reasons for such optimality is energy (various L2 like metrics), information compression and entropy)
There's a video on YouTube https://www.youtube.com/watch?v=pDsTcrRVNc0
about looping-layer models. After watching it, I poured some thoughts off the top of my head into a comment which, of course, promptly sank without a trace. I'll repost the gist of them here.
If you gain benefit from looping layers, at some level every layer of parameters is in front of and behind every other, the conclusion must be that the order of the layers does not need to be fixed at all.
If you cycle through the layers multiple times, are you doing so for the benefit of a particular layer on a particular problem? If so, can you skip the other layers that don't add anything on repetition? If you can skip (and know when to skip), and you can repeat (and know when to repeat)...
What you would need is a mechanism which can decide which layer is needed next. Is that then not a looping single-layer MoE model? Storing the layers as a wide set of selectable options rather than a deep set of unconditional layers. You would be picking what the next layer should be (or exiting the loop); the threshold for exit drops each iteration, so it always eventually exits. With a tunable 'how hard to think' knob to adjust the threshold.
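A minimal sketch of that control flow, assuming a toy setup where "layers" are plain functions and the router is a hand-written stand-in for something learned (`run_routed`, `toy_router`, `effort`, `decay` and `floor` are all hypothetical names):

```python
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[float], float]
# A router returns (score per layer, exit score) for the current state.
Router = Callable[[float], Tuple[List[float], float]]


def run_routed(layers: Sequence[Layer], router: Router, x: float,
               effort: float = 1.0, decay: float = 0.8,
               floor: float = 1e-3) -> Tuple[float, List[int]]:
    """Pick the best-scoring layer each step; exit once the exit score clears
    a threshold that decays every iteration, so the loop always terminates."""
    threshold = effort  # the 'how hard to think' knob
    trace: List[int] = []
    while True:
        layer_scores, exit_score = router(x)
        if exit_score >= threshold or threshold < floor:
            return x, trace
        best = max(range(len(layers)), key=layer_scores.__getitem__)
        x = layers[best](x)
        trace.append(best)
        threshold *= decay


def toy_router(x: float) -> Tuple[List[float], float]:
    # Always prefers layer 0; grows more willing to exit as x approaches 10.
    return [1.0, 0.0], min(1.0, x / 10.0)


toy_layers: List[Layer] = [lambda x: x + 1, lambda x: x * 2]
_, low_effort_trace = run_routed(toy_layers, toy_router, 0.0, effort=0.5)
_, high_effort_trace = run_routed(toy_layers, toy_router, 0.0, effort=1.0)
```

A higher `effort` starts the threshold higher, so the toy loop runs more steps before the exit score clears it. The `floor` guard bounds the number of iterations even if the router never votes to exit.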
But we could still try it out: randomize the order we call the transformer blocks, and see if it affects performance. If not, that’s extremely interesting.
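One way to set up that experiment is to generate reproducible layer orderings that shuffle only the middle of the stack. A sketch, assuming the first and last few layers stay pinned in place (`shuffled_middle` and its parameters are hypothetical, not from any existing tool):

```python
import random
from typing import List


def shuffled_middle(n_layers: int, keep_first: int, keep_last: int, seed: int) -> List[int]:
    """Return a layer order with the middle block shuffled and the ends fixed."""
    if keep_first < 0 or keep_last < 0 or keep_first + keep_last > n_layers:
        raise ValueError("pinned layers exceed the stack size")
    idx = list(range(n_layers))
    head = idx[:keep_first]
    middle = idx[keep_first:n_layers - keep_last]
    tail = idx[n_layers - keep_last:]
    # Seeded RNG so each ordering can be re-run and compared across probes.
    random.Random(seed).shuffle(middle)
    return head + middle + tail


order = shuffled_middle(10, keep_first=2, keep_last=2, seed=0)
same_again = shuffled_middle(10, keep_first=2, keep_last=2, seed=0)
```

Scoring many seeds against the unshuffled baseline would show whether order matters inside the middle block.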
There's probably a number of common sequences of layers that are inevitable when working on a problem though. I think of it like an expression calculator which could do various parts of an expression tree, merging leaf nodes on each iteration. I wouldn't expect it to be quite so explicit with neural nets, but I feel like the underlying principle of "do the sub-parts, then do the same thing on the result of the sub-parts" must be beneficial to some degree.
I think there's probably quite a lot to be revealed from study of representations in those middle layers. If there's a 'how-much-have-we-solved-so-far' signal to be detected from the data between layers, there would be quite a lot of options I think.
Author is right about the base64 part. Does seem weird that it can decode and understand it at same time. And I guess what makes it weird that we just sorta accept that for say English and German this works ie normal use but when framed as base64 then it suddenly stops feeling intuitive
They almost certainly have never seen regular conversations in Base64 in their training set, so it's weird that it 'just works'.
Does that make sense?
You could make the argument it's closer to the blocks of a CPU compared with a brain, and it's no different to copy-pasting some IP block for eg, HW JPEG decoding. But I feel like the difference here is we're 'discovering' these blocks / organs. They weren't designed, they were evolved.
You can enter the setting, and apply new re-layering architectures. It's very weird chatting with these brain-damaged models.
Altering these features isn’t messing with evolution anymore than tweaking a CAD file that used genetic algorithms: it’s all math, 1s and 0s.
Pretty cool though. LLM brain surgery.
I really think from the experiments that 'organs' (not sure what to term this), develop during massive pretraining. This also means maybe looping the entire models is actually not efficient. Maybe a better way is [linear input section -> loop 1 -> linear section -> loop 2 -> linear section -> ... -> loop n -> linear output]?
This would give 'organs' space to develop.
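The [linear -> loop -> linear -> loop -> ...] layout can be written down as an execution order over layer indices. A sketch, assuming each loop is given as a hypothetical `(start, end, repeats)` triple over a half-open range of layers:

```python
from typing import List, Sequence, Tuple


def build_order(n_layers: int, loops: Sequence[Tuple[int, int, int]]) -> List[int]:
    """Expand (start, end, repeats) loop segments into a flat layer order.
    Layers outside any loop run exactly once, in their original position."""
    order: List[int] = []
    cursor = 0
    for start, end, repeats in sorted(loops):
        if start < cursor or start >= end or end > n_layers or repeats < 1:
            raise ValueError(f"bad or overlapping loop segment: {(start, end, repeats)}")
        order += range(cursor, start)              # linear section before the loop
        order += list(range(start, end)) * repeats  # the looped block
        cursor = end
    order += range(cursor, n_layers)               # linear output section
    return order


single_loop = build_order(6, [(1, 3, 2)])
two_loops = build_order(8, [(1, 3, 2), (5, 7, 2)])
```

Here `single_loop` is `[0, 1, 2, 1, 2, 3, 4, 5]` and `two_loops` is `[0, 1, 2, 1, 2, 3, 4, 5, 6, 5, 6, 7]`; the length of the list relative to `n_layers` gives the extra layer evaluations per forward pass.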
finding them on the other hand is not easy! as you've shown, i guess brute force is one way.. it would be nice to find a short cut but unfortunately as your diagrams show, the landscape isn't exactly smooth.
I would also hypothesize that different circuits likely exist for different "problems" and that these are messy and overlapping so the repeated layers that improve math for example may not line up with the repeated layers that improve poetry or whatever, meaning the basic layer repetition is too "simple" to be very general. that said you've obviously shown that there is some amount of generalizing at work, which is definitely interesting.
Do you think karpathy's autoresearch would be useful here?
This sounds similar to Kimi's mixture-of-experts architecture, if I understood it correctly (likely I have not). Can you comment on this?
MoE (mixture of experts) is an architecture that forces sparsity (not all 'neurons' are active during the forward pass).
This is pretty much orthogonal to that; it works with dense and MoE models, by repeating 'vertical' sections of the transformer stack.
That's branching and then coalescing, right? It selects a path that is weighted as being most beneficial to the input?
Given you pointed out how even the vertical part of the architecture allows for skipping layers anyway, isn't that essentially the same thing?
First pass runs your input through, second pass runs its output as input?
Just, in double check it presumably runs the entire stack while you're trying to skip the translation steps and only double check the logic?
so thats about 15% more compute per forward pass with 0 extra memory which is just nuts, so for a streaming or disk-based setup its just free better answers. def wasnt gonna think of this myself.
config                  layers  overall  delta    math    reasoning  word problems
baseline                80      0.5391   +0.0000  0.5850  0.6357     0.3500
rys                     87      0.5452   +0.0061  0.6706  0.6000     0.2723
cartographer_repeat_x2  92      0.7741   +0.2350  0.8455  0.8214     0.6000
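The compute figure can be checked directly from the layer counts in the table. This sketch assumes per-pass compute scales with the number of layers executed, ignoring embeddings and the output head (that assumption is mine, not something measured here):

```python
# Figures copied from the table: (config, layers executed, overall score).
rows = [
    ("baseline", 80, 0.5391),
    ("rys", 87, 0.5452),
    ("cartographer_repeat_x2", 92, 0.7741),
]

base_layers, base_score = rows[0][1], rows[0][2]

# For each config: (relative extra layer evaluations, change in overall score).
summary = {
    name: ((layers - base_layers) / base_layers, score - base_score)
    for name, layers, score in rows
}

for name, (overhead, delta) in summary.items():
    print(f"{name:24s} overhead={overhead:6.2%} delta={delta:+.4f}")
```

Under that assumption, 92 layers against 80 is 12/80 = 15% more layer evaluations, and 87 against 80 is 8.75%.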
looks like the model gets a second/third go at figuring out how to approach the problem and it gets better answers.
i tried a matrix of other configurations and stuff gets totally weird. like playing em through backwards in that block doesnt make much of a difference / order doesnt seem to matter (?!). doubling each layer got a benefit, but if i doubled the layers and doubled that block there was interference. doubling the block where the model is architecting/crystallizing its plans improves reasoning but at the cost of other stuff. other mixes of blocks showed some improvements for certain kinds of prompts but didnt stand out as much.
my guess is that for large models trained on large corpuses there is just some ceiling of "reasoning you can do" given the internal geometry implied by the training data, cause text is lossy and low-bandwidth anyway, and theres only really so much of it. past some point you just have to have models learning from real-world interactions and my guess is we're already kind of there.
I will make another post if the topic is popular; it's pretty geeky though, even more than my usual blog posts...
Still, the result is really interesting: being able to stack abstract reasoning and get better performance, with the heat maps to show the probe results.
The academic literature seems to be catching up:
- *[SOLAR / DUS (Kim et al., 2023)](https://arxiv.org/abs/2312.15166)* — duplicated transformer layers to build a 10.7B model that outperformed 30B parameter baselines.
- *[The Curse of Depth (2025)](https://arxiv.org/abs/2502.05795)* — explains why this works: Pre-LN causes deep transformer layers to converge toward identity functions, meaning middle layers are where real computation happens, and duplicating them concentrates that capacity.
- *[Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (Geiping et al., NeurIPS 2025)](https://arxiv.org/abs/2502.05171)* — takes the idea to its logical conclusion: a model trained with a single recurrent block repeated at inference time, scaling reasoning depth without adding parameters.
On the other papers, models like SOLAR or training a model that uses a single layer are probably going to hit a wall, based on the heatmaps I found. The transformer stack starts with randomised weights (analogous to undifferentiated stem cells), and it seems they later form 'organs' during the trillions of pre-training tokens they undergo. My hypothesis is that you probably only want one copy of the 'token-to-thought' and 'thought-to-token' organs. It seems that you can make one layer do all three things (transforms in and out, and do the 'thinking'), but I think specialisation will always win.
It would go from a normal description of the item in the picture to suddenly seeing people clapping in the background that were not there, or making up some other stuff. I kinda stopped after a while, but I should pick that back up and do a more coherent experiment to see if I can find any correlation between vector dimensions and "meaning."
Great read, makes you wonder what else is encoded in these models that might be useful!
1. Should we be training models like this from the start? It seems that a model trained with layer loops would be able to take advantage of it better than rearranging the layers of a naive model.
2. Should we even be using a fixed number of layers? If models are this tolerant to their inner layers being meddled with, then it doesn't make sense to run all the layers on every single token.
Maybe we could make a model that changed the number of iterations through the compute layers based on how much computation it thought the problem needed. Send it through only once for easy problems (perhaps even zero times?) and two or more times for harder problems. This would allow easier prompts to complete faster, while allowing the model to potentially scale up to infinitely hard problems.
If we are training or fine-tuning the model, we can probably make the compute layers generate a confidence signal that predicts how likely it is for an extra compute iteration to meaningfully change the result.
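As a toy picture of that control loop (not a claim about how a real model would implement it), the sketch below repeats a block until a predicted-change signal falls under a tolerance. The block is a Newton step for sqrt(2), and `predict_change` cheats by evaluating the block itself; in a trained model that signal would have to be learned. All names are hypothetical:

```python
from typing import Callable, Tuple


def adaptive_loop(block: Callable[[float], float],
                  predict_change: Callable[[float], float],
                  x: float, tol: float = 1e-6,
                  max_loops: int = 50) -> Tuple[float, int]:
    """Repeat `block` while another pass is predicted to change the result."""
    loops = 0
    while loops < max_loops and predict_change(x) > tol:
        x = block(x)
        loops += 1
    return x, loops


def block(x: float) -> float:
    # Toy 'compute block': one Newton iteration towards sqrt(2).
    return 0.5 * (x + 2.0 / x)


def predict_change(x: float) -> float:
    # Stand-in for a learned confidence signal: the actual size of the next step.
    return abs(block(x) - x)


easy = adaptive_loop(block, predict_change, 2 ** 0.5)  # already solved
medium = adaptive_loop(block, predict_change, 1.5)
hard = adaptive_loop(block, predict_change, 100.0)
```

The already-solved input takes zero passes, the nearby input takes a few, and the distant one takes the most, which is the "easy prompts exit early" behaviour described above. `max_loops` caps the cost when the signal never settles.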
It's less a 'tool' than an assorted set of scripts, tailored to my unusual hardware setup. But it should be easy to extend; I would have released this earlier but I had the (stupid) idea to 'write a paper' on this. Aiming for that delayed this a year. Blogs are the way to go (for me).
But if there are sizes that are common, then that could also point to an architectural flaw, because whilst it could be universal constant-ness it could also be bounded by some inner working - and perhaps this is something that could be improved upon.
I do wish one of the big labs would sponsor with a rack of HGX Rubin NVL8's. I have lots of ideas to test, and I have probably hit the spending limit with the boss on hardware (she hasn't seen the new power bill yet...)
Interesting content still in the sea of useless AI slop, even if I couldn't understand anything after the first paragraph.
Extra thanks for making it written in a readable and approachable way! I don't have much of a background in this topic, but still managed to understand about 70-80% of it :) You're a good writer
Now it's making me wonder - instead of smashing things together more violently for MoE type stuff, perhaps it's more effective to create better toolsets to allow us to analyse smaller models.
Then small models can be trained (faster & cheaper) to be excellent at very specific tasks or domains, the toolset can be used to identify the organs and organ-selection layers, and a larger Frankenstein's monster model can be stitched together from these organs, with perhaps a little extra training/fine-tuning to improve its organ selection abilities.
That makes me imagine some sort of future of layer standardisation, in which for a standard and optimal architecture sets of layers can be dynamically downloaded, added, swapped out etc to maintain fastest inference speed whilst allowing for flexible skills. Almost like the concept of subagents but within the architecture of the model itself. Hmmm.
I'm only versed in transformer architecture at a high level, does anybody know of any architectures where the layers branch & then coalesce like that? Or is it majority linear layer by layer?
I say this naively as I’m not that familiar with how transformers work under the hood, but I wonder if you could combine the two approaches in a coherent way. Frankenmerges were often done naively, just smooshing things together, but knowing how the layers work under the hood I wonder if there’s a more intelligent way to combine merging and layer duplication to create even better performers.
The paper puts out an interesting hypothesis that these LLM-derived transformer layers have the ability to "refine" any set of learned tokens, even in different modalities. I wonder if what you're seeing here is related?
For example, we take for granted the context model of LLMs is necessary, that all you can do is append and anything that changes the beginning requires a recalculation of whatever comes after it. And that does match how training works.
But all sorts of things would become possible if it were possible to shift things in and out of context without recomputing it all; conservatively you could avoid compaction, optimistically it might be a way to get info to the model that's both more deeply integrated than search and more efficient than training larger and larger models.
I spend a lot of time wrestling with smaller LLMs for strict data extraction and JSON formatting. Have you noticed if duplicating these specific middle layers boosts a particular type of capability?
For example, does the model become more obedient to system prompts/strict formatting, or is the performance bump purely in general reasoning and knowledge retrieval?
Amazing work doing this on a basement 4090 rig!
"And now for the weirdness: There was never the case where any Transformer layer would have seen the output from a future layer!
Layer 10 is trained on layer 9’s output distribution. Layer 60 is trained on layer 59’s. If you rearrange them — feeding layer 60’s output into layer 10 — you’ve created a distribution the model literally never saw during training.
The astounding thing about Goliath wasn’t that is was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.
Experimentally, this proved that layers were far more interchangeable than anyone had reason to expect. The internal representations were homogenous enough that the model could digest out-of-order hidden states without collapsing. The architecture was far more flexible than a rigid pipeline.
Between the Base64 observation and Goliath, I had a hypothesis: Transformers have a genuine functional anatomy. Early layers translate input into abstract representations. Late layers translate back out. And the middle layers, the reasoning cortex, operate in a universal internal language that’s robust to architectural rearrangement. The fact that the layer block size for Goliath 120B was 16-layer block made me suspect the input and output ‘processing units’ sized were smaller that 16 layers. I guessed that Alpindale had tried smaller overlaps, and they just didn’t work.
If that was true, maybe I didn’t need to teach a model new facts to make it smarter. I didn’t need fine-tuning. I didn’t need RLHF. I just needed to give it a more layers to think with."
If more than two repetitions of the “thinking organ” lead to worse results (I think that’s what you’ve said in other comments), would it be possible to get better results by slicing and dicing some of the early-layer “preparatory organs” between the thinking organs?
Maybe that would still require fine tuning to “evolve” an intermediary organ that would allow for multiple repetitions.
I have to say that intuitively I wasn't at all surprised that duplicating a single layer didn't do much good, but I had never expected that you can identify and so clearly visualize these relatively short circuit blocks (and of course it's around the magic number 7! /jk). Super cool research and really well explained!
Another very interesting thing would be modulating compute at the token level. Default is 0 loops, maybe 1 loop is better, and 10 loops is even better than that.
Hopefully the cost per GPU will kick-it soon and we'll see people properly play, but frankly the "middle section" layers 2(ish) to (n-1)(ish) of a model can be shuffled up/down and left/right and still perform well.
The fun one will be an LLM router for LLM layers to apply the best reasoning to the best input so far, but frankly that would need the years and years of training that the author hints at.
The one that's still out of grasp is how to combine/manipulate per-layer k,v caches into a globally coherent state, i.e. if layers can be moved up/down, why can't the cached k,v be swapped/combined with different projections? Global k,v caches work, but they have to be _huge_ in order to prevent model collapse even on something as simple as owt.