When you train it to be an assistant model, it's better at compressing assistant transcripts than it is general text.
There is an eval I have a lot of interest in and respect for, called UncheatableEval (https://huggingface.co/spaces/Jellyfish042/UncheatableEval), which tests how good a language model an LLM is by applying it to a range of compression tasks.
This task is essentially impossible to 'cheat'. Compression is a benchmark you cannot game!
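A rough sketch of why compression works as a score: the ideal compressed size of a text under a model is the sum of -log2 p over its tokens, so a model that predicts unseen text better literally produces fewer bits. The function name and its inputs are hypothetical illustrations, not UncheatableEval's actual code.

```python
import math

def bits_per_byte(token_logprobs, n_bytes):
    """Ideal compressed size of a text, in bits per byte of raw input.

    token_logprobs: natural-log probabilities a model assigned to each
    token of the text (hypothetical input; any LM exposing logprobs works).
    n_bytes: length of the same text in bytes.
    """
    total_bits = -sum(token_logprobs) / math.log(2)  # convert nats to bits
    return total_bits / n_bytes

# A model that knows nothing assigns p = 1/256 to every byte,
# which costs about 8 bits per byte, i.e. no compression at all.
uniform = [math.log(1 / 256)] * 1000
print(bits_per_byte(uniform, 1000))
```

Lower is better, and the only way to get a lower number on fresh text is to predict it better.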
This is essentially just compression and decompression. It's just that with prior compression techniques, we never tried leveraging the inherent relationships encoded in a compressed data structure, because our compression schemes did not leverage semantic information in a generalized way and thus did not encode very meaningful relationships other than "this data uses the letter 'e' quite a lot".
A lot of that comes from the sheer amount of data we throw at these models, which provides enough substrate for semantic compression. Compare that to common compression schemes in the wild, where data is compressed in isolation without contributing its information to some model of the world. It turns out that because of this, we've been leaving a lot on the table with regard to compression. Another factor has been the speed/efficiency tradeoff. GPUs have allowed us to put a lot more into efficiency, and the expectation that many language models only need to produce text as fast as a human can read it means that we can optimize even further for efficiency over speed.
Also, shout out to Fabrice Bellard's ts_zip, which leverages LLMs to compress text files. https://bellard.org/ts_zip/
To me the amazing thing is that you can tell the model to do something, even follow simple instructions in plain English, like make a list or write some python code to do $x, that's the really amazing part.
Then ask for the same list sorted and get that nearly instantly.
These models have a short time context for now, but they already have a huge “working memory” relative to us.
It is very cool. And indicative that vastly smarter models are going to be achieved fairly easily, with new insight.
Our biology has had to ruthlessly work within our biological/ecosystem energy envelope, and with the limited value/effort returned by a pre-internet pre-vast economy.
So biology has never been able to scale. Just get marginally more efficient and effective within tight limits.
Suddenly, (in historical, biological terms), energy availability limits have been removed, and limits on the value of work have compounded and continue to do so. Unsurprising that those changes suddenly unlock easily achieved vast untapped room for cognitive upscaling.
So text wikipedia at 24G would easily hit 8G with many standard forms of compression, I'd think. If not better. And it would be 100% accurate, full text and data. Far more usable.
It's so easy for people to not realise how massive 8GB really is, in terms of text. Especially if you use ascii instead of UTF.
Lots of various sources that you can download locally to have available offline. They're even providing some pre-loaded devices in areas where there may not be reliable or any internet access.
> The English Wikipedia, as of June 26, 2025, contains over 7 million articles and 63 million pages. The text content alone is approximately 156 GB, according to Wikipedia's statistics page. When including all revisions, the total size of the database is roughly 26 terabytes (26,455 GB)
https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?
How close does it come?
It is 64,800,000,000 bits.
I can imagine 100 bits sure. And 1,000 bits why not. 10,000 you lose me. A million? That sounds like a lot. Now 64 million would be a number I can't well imagine. And this is a thousand times 64 million!
The more and faster a “mind” can infer, the less it needs to store.
Think how many fewer facts a symbolic system that can perform calculus needs to store, vs. an algebraic, or just arithmetic system, to cover the same numerical problem solving space. Many orders of magnitude fewer.
The same goes for higher orders of reasoning. General or specific subject related.
And higher order reasoning vastly increases capabilities extending into new novel problem spaces.
I think model sizes may temporarily drop significantly, after every major architecture or training advance.
In the long run, “A circa 2025 maxed M3 Ultra Mac Studio is all you need!” (/h? /s? Time will tell.)
the self-execution is the interactive chat interface.
wikipedia gets "trained" (compiled+compressed+lossy) into an executable you can chat with, and you can pass this through another pretrained A.I. that can talk out the text or transcribe it.
I think writing compilers is now officially a defunct skill, kept for historical and conservation purposes more than anything else; but I don't like saying "conservation", it's a bad framing. I'd rather say "legacy connectivity", which is a form of continuity or backwards compatibility.
One factor is the huge redundancy pervasive in our communication.
(1) There are so many ways to say the same thing, that (2) we have to add even more words to be precise at all. Without a verbal indexing system we (3) spend many words just setting up context for what we really want to say. And finally, (4) we pervasively add a great deal of intentionally non-informational creative and novel variability, and mood inducing color, which all require even more redundancy to maintain reliable interpretation, in order to induce our minds to maintain attention.
Our minds are active resistors of plain information!
All four factors add so much redundancy that it’s probably fair to say most of our communication (by bits, characters, words, etc., maybe 95%? 98%? or more!) is pure redundancy.
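One crude way to put a number on redundancy is to see how much a general-purpose compressor can strip out. This is only a sketch: zlib sees repeated byte patterns, not meaning, so it gives a floor on redundancy and says nothing about whether the 95% or 98% guess holds for real prose. The sample string is made up and deliberately repetitive.

```python
import zlib

def redundancy(text: str) -> float:
    """Fraction of bytes zlib can remove (can go negative for tiny inputs)."""
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, 9)  # 9 = maximum compression level
    return 1 - len(packed) / len(raw)

# Artificially repetitive sample, chosen only to make the effect obvious.
sample = "the cat sat on the mat and the dog sat on the log. " * 200
print(f"{redundancy(sample):.0%} of the sample is redundant to zlib")
```

Swapping in a real article for `sample` gives a more honest figure, and a stronger model than zlib would push it higher.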
Another helpful compressor is that many facts are one of a few “reasonably expected” alternative answers. So it takes just a little biasing information to encode the right option.
Finally, the way we reason seems to be highly common across everything that matters to us. Even though we have yet to identify and characterize this informal human logic. So once that is modeled, that itself must compress a lot of relations significantly.
Fuzzy Logic was a first approximation attempt at modeling human “logic”. But has not been very successful.
Models should eventually help us uncover that “human logic”, by analyzing how they model it. Doing so may let us create even more efficient architectures. Perhaps significantly more efficient, and even provide more direct non-gradient/data based “thinking” design.
Nevertheless, the level of compression is astounding!
We are far less complicated cognitive machines than we imagine! Scary, but inspiring too.
I personally believe that common PCs of today, maybe even high end smart phones circa 2025, will be large enough to run future super intelligence when we get it right, given internet access to look up information.
We have just begun to compress artificial minds.
"The English Wikipedia, when compressed, currently occupies approximately 24 GB of storage space without media files. This compressed size represents the current revisions of all articles, but excludes media files and previous revisions of pages, according to Wikipedia and Quora."
So 3x is correct but LLMs are lossy compression.
Although strictly speaking they have lots of information in a small package, they are F-tier compression algorithms because the loss is bad, unpredictable, and undetectable (i.e. a human has to check it). You would almost never use a transformer in place of any other compression algorithm for typical data compression uses.
In one view, you can view LLMs as SOTA lossless compression algorithms, where the number of weights don’t count towards the description length. Sounds crazy but it’s true.
All digitized books ever written/encoded compress to a few TB. The public web is ~50TB. I think a usable zip of all english electronic text publicly available would be on O(100TB). So we're at about 1% of that in model size, and we're in a diminishing-returns area of training -- ie., going to >1% has not yielded improvements (cf. gpt4.5 vs 4o).
This is why compute spend is moving to inference time with "reasoning" models. It's likely we're close to diminishing returns on inference-time compute now too, hence agents, whereby (mostly) deterministic tools supplement information/capability into the system.
I think to get any more value out of this model class, we'll be looking at domain-specific specialisation beyond instruction fine-tuning.
I'd guess targeting 1TB inference-time VRAM would be a reasonable medium-term target for high quality open source models -- that's within the reach of most SMEs today. That's about 250bn params.
After that, make the robots explore and interact with the world by themselves, to fetch even more data.
In all seriousness, adding image and interaction data will probably be enormously useful, even for generating text.
There are just a lot of avenues to try at this point.
Perhaps the 50TB estimate is unique information without any media or so, but OP can back up where they got that number from better than I can with guesswork.
Where are you getting these numbers from? Interested to see how that's calculated.
I read somewhere, but cannot find the source anymore, that all written text prior to this century was approx 50MB. (Might be misquoted as don't have source anymore).
50 MB feels too low, unless the quote meant text up until the 20th century, in which case it feels much more believable. In terms of text production and publishing, we're still riding an exponent, so a couple orders of magnitude increase between 1899 and 2025 is not surprising.
(Talking about S-curves is all the hotness these days, but I feel it's usually a way to avoid understanding what exponential growth means - if one assumes we're past the inflection point, one can wave their hands and pretend the change is linear, and continue to not understand it.)
Most people who blog could write 1k words a day. That's a million in 3 years. So not crazy numbers here.
That's 5MB. Maybe you meant 50GB. I'd hazard 50TB.
Extract just the plain text from that (+social media, etc.), remove symbols outside of a 64 symbol alphabet (6 bits) and compress. "Feels" to me around a 100TB max for absolutely everything.
Either way, full-fat LLMs are operating at 1-10% of this scale, depending how you want to estimate it.
If you run a more aggressive filter on that 100TB, eg., for a more semantic dedup, there's a plausible argument for "information" in english texts available being ~10TB -- then we're running close to 20% of that in LLMs.
If we take LLMs to just be that "semantic compression algorithm", and supposing the maximum useful size of an LLM is 2TB, then you could run the argument that everything "salient" ever written is <10TB.
Taking LLMs to be running at close to 50% of "everything useful" rather than 1% would be an explanation of why training has capped out.
I think the issue is at least as much to do with what we're using LLMs for -- ie., instruction fine-tuning requires some more general (proxy/quasi-) semantic structures in LLMs and I think you only need O(1%) of "everything ever written" to capture these. So it wouldn't really matter how much more we added, instruction-following LLMs don't really need it.
There's no way the entire Web fits in 400$ worth of hard drives.
Maybe text only, though...
I tried to estimate how much data this actually is:
# annas archive stats
papers = 105714890
books = 52670695
# word count estimates
avrg_words_per_paper = 10000
avrg_words_per_book = 100000
words = (papers*avrg_words_per_paper + books*avrg_words_per_book )
# quick test on 27 million words from a few books
sample_words = 27809550
sample_bytes = 158824661
sample_bytes_comp = 28839837 # using zpaq -m5
bytes_per_word = sample_bytes/sample_words
byte_comp_ratio = sample_bytes_comp/sample_bytes
word_comp_ratio = bytes_per_word*byte_comp_ratio
print("total:", words*bytes_per_word*1e-12, "TB") # total: 30.10238345855199 TB
print("compressed:", words*word_comp_ratio*1e-12, "TB") # compressed: 5.466077036085319 TB
So uncompressed ~30 TB and compressed ~5.5 TB of data. That fits on three 2TB micro SD cards, which you could buy for a total of 750$ from SanDisk.
Did you mean to type EB?
FWIW there is a huge difference between 4.5 and 4o.
You're right the T5 stuff is very important historically but they're below 11B and I don't have much to say about them. Definitely a very interesting and important set of models though.
Eh?
* Gemma 1 (2024): 2B, 7B
* Gemma 2 (2024): 2B, 9B, 27B
* Gemma 3 (2025): 1B, 4B, 12B, 27B
This is the same range as some Llama models which you do mention.
> important historically
Aren't you trying to give a historical perspective? What's the point of this?
Turns out, size really did matter, at least at the base model level. Only with the release of truly massive dense (405B) or high-activation MoE models (DeepSeek V3, DBRX, etc) did we start seeing GPT-4-level reasoning emerge outside closed labs.
If you like darker color scheme, here it is:
https://app.charts.quesma.com/s/f07qji
And active vs total:
I think in these scenarios, articles should include the prompt and generating model.
Thank you for spotting the error.
There are some signs it was possibly written by a non-native speaker.
That said, there's an unstated assumption here that these truly large language models are the most interesting thing. The big players have been somewhat quiet but my impression from the outside is that OpenAI let a little bit leak with their behavior. They built an even larger model and it turned out to be disappointing so they quietly discontinued it. The most powerful frontier reasoning models may actually be smaller than the largest publicly available models.
Something like 1.61B just doesn't mean much to me since I don't know much about the guts of LLMs. But I'm curious about how that translates to computer hardware -- what specs would I need to run these? What could I run now, what would require spending some money, and what I might hope to be able to run in a decade?
In practice, models can be quantized to smaller weights for inference. Usually, the performance loss going from 16 bit weights to 8 bit weights is very minor, so a 1 billion parameter model can take 1 gigabyte. Thinking about these models in terms of 8-bit quantized weights has the added benefit of making the math really easy. A 20B model needs 20G of memory. Simple.
Of course, models can be quantized down even further, at greater cost of inference quality. Depending on what you're doing, 5-bit weights or even lower might be perfectly acceptable. There's some indication that models that have been trained on lower bit weights might perform better than larger models that have been quantized down. For example, a model that was trained using 4-bit weights might perform better than a model that was trained at 16 bits, then quantized down to 4 bits.
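The arithmetic above is simple enough to write down. A minimal sketch, assuming memory is dominated by the weights and 1 GB means 1e9 bytes; the function name is made up for illustration, and KV cache, activations, and runtime overhead are ignored, so real usage is higher.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# The 20B example from above at a few common quantization levels.
for bits in (16, 8, 5, 4):
    print(f"20B params at {bits}-bit: {weight_memory_gb(20e9, bits):.1f} GB")
```

At 8 bits the parameter count in billions and the size in gigabytes are the same number, which is what makes that the convenient mental default.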
When running models, a lot of the performance bottleneck is memory bandwidth. This is why LLM enthusiasts are looking for GPUs with the most possible VRAM. Your computer might have 128G of RAM, but your GPU's access to that memory is so constrained by bandwidth that you might as well run the model on your CPU. Running a model on the CPU can be done, it's just much slower because the computation is so parallel.
Today's higher end consumer grade GPUs have up to 24G of dedicated VRAM (an Nvidia RTX 5090 has 32G of VRAM and they're like $2k). The dedicated VRAM on a GPU has a memory bandwidth of about 1 TB/s. Apple's M-series of ARM-based CPUs have 512 GB/s of bandwidth, and they're one of the most popular ways of being able to run larger LLMs on consumer hardware. AMD's new "Strix Halo" CPU+GPU chips have up to 128G of unified memory, with a memory bandwidth of about 256 GB/s.
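To see why bandwidth matters so much, here is a back-of-the-envelope sketch. It assumes a dense model that is purely memory-bound, with every weight read once per generated token, and it treats the bandwidth figures above as gigabytes per second. The function name is made up, and real throughput comes in below this ceiling.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound on decode speed when reading weights is the bottleneck."""
    return bandwidth_gb_s / model_gb

model_gb = 20  # e.g. a 20B model with 8-bit weights
for name, bw in [("discrete GPU", 1000), ("Apple M-series", 512), ("Strix Halo", 256)]:
    print(f"{name}: ~{max_tokens_per_second(bw, model_gb):.0f} tokens/s at best")
```

Halving the model size through quantization doubles this ceiling, which is part of why lower-bit weights are attractive on consumer hardware.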
Reddit's r/LocalLLaMA is a reasonable place to look to see what people are doing with consumer grade hardware. Of course, some of what they're doing is bonkers so don't take everything you see there as a guide.
And as far as a decade from now, who knows. Currently, the top silicon fabs of TSMC, Samsung, and Intel are all working flat-out to meet the GPU demand from hyperscalers rolling out capacity (Microsoft Azure, AWS, Google, etc). Silicon chip manufacturing has traditionally followed a boom/bust cycle. But with geopolitical tensions, global trade barriers, AI-driven advances, and whatever other black swan events, what the next few years will look like is anyone's guess.
https://gist.github.com/rain-1/cf0419958250d15893d8873682492...
2. "superintelligence"
https://en.m.wikipedia.org/wiki/Superintelligence
"Meta is uniquely positioned to deliver superintelligence to the world."
https://www.cnbc.com/2025/06/30/mark-zuckerberg-creating-met...
Is there any difference between 1 and 2
Yes. One is purely hypothetical
There is kind of a vague sense in which this metaphor holds, but there is a much more interesting and rigorous fact about LLMs which is that they are also _lossless_ compression algorithms.
There are at least two senses in which this is true:
1. You can use an LLM to losslessly compress any piece of text at a cost that approaches the log-likelihood of that text under the model, using arithmetic coding. A sender and receiver both need a copy of the LLM weights.
2. You can use an LLM plus SGD (i.e. the training code) as a lossless compression algorithm, where the communication cost is the area under the training curve (and the model weights don’t count towards description length!). See Jack Rae, “Compression for AGI”.
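Both points can be seen in miniature with a toy model. The sketch below stands in for "LLM plus SGD" with an adaptive byte-frequency model: each byte is charged -log2 p under what the model has learned so far, and then the model is updated. This is an illustration under that assumption, not the setup from the talk; an actual arithmetic coder would turn these probabilities into a bitstream of almost exactly this length.

```python
import math

def prequential_bits(data: bytes) -> float:
    """Code length of `data` under a byte model that learns as it goes.

    A receiver running the same update rule rebuilds the model step by
    step while decoding, so no parameters are ever transmitted.
    """
    counts = [1] * 256  # Laplace prior: pretend every byte was seen once
    total = 256
    bits = 0.0
    for b in data:
        bits += -math.log2(counts[b] / total)  # cost before learning this byte
        counts[b] += 1                          # the "training step"
        total += 1
    return bits

text = b"abracadabra " * 500
print(prequential_bits(text) / 8, "bytes vs", len(text), "raw")
```

The total is the area under the toy model's "training curve": early bytes are expensive, later ones get cheap as the counts sharpen.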
That parenthetical doesn't quite work for me.
If synthetic data always degraded performance, AI labs wouldn't use synthetic data. They use it because it helps them train better models.
There's a paper that shows that if you very deliberately train a model in its own output in a loop you can get worse performance. That's not what AI labs using synthetic data actually do.
That paper gets a lot of attention because the schadenfreude of models destroying themselves through eating their own tails is irresistible.
This is exactly what I did in a previous role, fine-tuning Llama and Mistral models on a mix of human and GPT-4 data for a domain-specific task. Adding (good) synthetic data definitely increased the output quality for our tasks.
For example, it somehow merged Llama 4 Maverick's custom Arena chatbot version with Behemoth, falsely claiming that the former is stopping the latter from being released. It also claims 40B of internet text data is 10B tokens, which seems a little odd. Llama 405B was also trained on more than 15 trillion tokens[1], but the post claims only 3.67 trillion for some reason. It also doesn't mention Mistral large for some reason, even though it's the first good European 100B+ dense model.
>The MoE arch. enabled larger models to be trained and used by more people - people without access to thousands of interconnected GPUs
You still need thousands of GPUs to train a MoE model of any actual use. This is true for inference in the sense that it's faster I guess, but even that has caveats because MoE models are less powerful than dense models of the same size, though the trade-off has apparently been worth it in many cases. You also didn't need thousands of GPUs to do inference before, even for the largest models.
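The trade-off can be sketched with two common rules of thumb: weight memory scales with total parameters, because every expert has to be resident, while per-token compute scales with the parameters actually activated, at roughly 2 FLOPs per active weight for a forward pass. The function name and model sizes below are hypothetical, chosen only to illustrate the gap.

```python
def moe_costs(total_params: float, active_params: float, bits_per_weight: int = 8):
    """Return (weight memory in GB, approximate FLOPs per generated token)."""
    memory_gb = total_params * bits_per_weight / 8 / 1e9
    flops_per_token = 2 * active_params
    return memory_gb, flops_per_token

# Hypothetical sizes: a dense 400B model vs. a 400B MoE with 40B active.
dense = moe_costs(total_params=400e9, active_params=400e9)
moe = moe_costs(total_params=400e9, active_params=40e9)
print("dense:", dense)
print("MoE:  ", moe)
```

So MoE buys faster tokens, not a smaller memory footprint.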
The conclusion is all over the place, and has lots of just weird and incorrect implications. The title is about how big LLMs are, why is there such a focus on token training count? Also no mention of quantized size. This is a bad AI slop article (whoops, turns out the author accidentally said it was AI generated, so it's a bad human slop article).
> it somehow merged Llama 4 Maverick's custom Arena chatbot version with Behemoth
I can clarify this part. I wrote 'There was a scandal as facebook decided to mislead people by gaming the lmarena benchmark site - they served one version of llama-4 there and released a different model' which is true.
But it is inside the section about the llama 4 model behemoth. So I see how that could be confusing/misleading.
I could restructure that section a little to improve it.
> Llama 405B was also trained on more than 15 trillion tokens[1],
You're talking about Llama 405B instruct, I'm talking about Llama 405B base. Of course the instruct model has been trained on more tokens.
> why is there such a focus on token training count?
I tried to include the rough training token count for each model I wrote about - plus additional details about training data mixture if available. Training data is an important part of an LLM.