How large are large language models?

(gist.github.com)

263 points · rain1 · 11mo ago · 150 comments

90 comments · 14 top-level

ljoshua11mo ago· 37 in thread

Less a technical comment and more just a mind-blown comment, but I still can’t get over just how much data is compressed into and available in these downloadable models. Yesterday I was on a plane with no WiFi, but had gemma3:12b downloaded through Ollama. Was playing around with it and showing my kids, and we fired history questions at it, questions about recent video games, and some animal fact questions. It wasn’t perfect, but holy cow the breadth of information that is embedded in an 8.1 GB file is incredible! Lossy, sure, but a pretty amazing way of compressing all of human knowledge into something incredibly contained.

rain1OP11mo ago

It's extremely interesting how powerful a language model is at compression.

When you train it to be an assistant model, it's better at compressing assistant transcripts than it is general text.

There is an eval which I have a lot of interest in and respect for, called UncheatableEval (https://huggingface.co/spaces/Jellyfish042/UncheatableEval), which tests how good of a language model an LLM is by applying it to a range of compression tasks.

This task is essentially impossible to 'cheat'. Compression is a benchmark you cannot game!

soulofmischief11mo ago

Knowledge is learning relationships by decontextualizing information into generalized components. Application of knowledge is recontextualizing these components based on the problem at hand.

This is essentially just compression and decompression. It's just that with prior compression techniques, we never tried leveraging the inherent relationships encoded in a compressed data structure, because our compression schemes did not leverage semantic information in a generalized way and thus did not encode very meaningful relationships other than "this data uses the letter 'e' quite a lot".

A lot of that comes from the sheer amount of data we throw at these models, which provides enough substrate for semantic compression. Compare that to common compression schemes in the wild, where data is compressed in isolation without contributing its information to some model of the world. It turns out that because of this, we've been leaving a lot on the table with regards to compression. Another factor has been the speed/efficiency tradeoff. GPUs have allowed us to put a lot more into efficiency, and the expectation that many language models only need to produce text as fast as it can be read by a human means that we can even further optimize for efficiency over speed.

Also, shout out to Fabrice Bellard's ts_zip, which leverages LLMs to compress text files. https://bellard.org/ts_zip/

MPSimmons11mo ago

Agreed. It's basically lossy compression for everything it's ever read. And the quantization impacts the lossiness, but since a lot of text is super fluffy, we tend not to notice as much as we would when we, say, listen to music that has been compressed in a lossy way.

exe3411mo ago

Wikipedia is about 24GB, so if you're allowed to drop 1/3 of the details and make up the missing parts by splicing in random text, 8GB doesn't sound too bad.

To me the amazing thing is that you can tell the model to do something, even follow simple instructions in plain English, like make a list or write some python code to do $x, that's the really amazing part.

Nevermark11mo ago

It blows my mind that I can ask for 50 synonyms, instantly get a great list with great meaning summaries.

Then ask for the same list sorted and get that nearly instantly.

These models have a short time context for now, but they already have a huge “working memory” relative to us.

It is very cool. And indicative that vastly smarter models are going to be achieved fairly easily, with new insight.

Our biology has had to ruthlessly work within our biological/ecosystem energy envelope, and with the limited value/effort returned by a pre-internet pre-vast economy.

So biology has never been able to scale. Just get marginally more efficient and effective within tight limits.

Suddenly, (in historical, biological terms), energy availability limits have been removed, and limits on the value of work have compounded and continue to do so. Unsurprising that those changes suddenly unlock easily achieved vast untapped room for cognitive upscaling.

b11211mo ago

Not to mention, Language Modeling is Compression https://arxiv.org/pdf/2309.10668

So text wikipedia at 24G would easily hit 8G with many standard forms of compression, I'd think. If not better. And it would be 100% accurate, full text and data. Far more usable.

It's so easy for people to not realise how massive 8GB really is, in terms of text. Especially if you use ascii instead of UTF.

thecosas11mo ago

A neat project you (and others) might want to check out: https://kiwix.org/

Lots of various sources that you can download locally to have available offline. They're even providing some pre-loaded devices in areas where there may not be reliable or any internet access.

nico11mo ago

For reference (according to Google):

> The English Wikipedia, as of June 26, 2025, contains over 7 million articles and 63 million pages. The text content alone is approximately 156 GB, according to Wikipedia's statistics page. When including all revisions, the total size of the database is roughly 26 terabytes (26,455 GB)

sharkjacobs11mo ago

better point of reference might be pages-articles-multistream.xml.bz2 (current pages without edit/revision history, no talk pages, no user pages) which is 20GB

https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?

pcrh11mo ago

Wikipedia itself describes its size as ~25GB without media [0]. And it's probably more accurate and with broader coverage in multiple languages compared to the LLM downloaded by the GP.

https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

mapt11mo ago

What happens if you ask this 8gb model "Compose a realistic Wikipedia-style page on the Pokemon named Charizard"?

How close does it come?

tasuki11mo ago

8.1 GB is a lot!

It is 64,800,000,000 bits.

I can imagine 100 bits sure. And 1,000 bits why not. 10,000 you lose me. A million? That sounds like a lot. Now 64 million would be a number I can't well imagine. And this is a thousand times 64 million!
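
To put that number next to the model itself, here is a rough sketch of how many bits of file each parameter gets. It assumes the nominal 12 billion parameters implied by the `gemma3:12b` tag and decimal gigabytes; both are assumptions, and the real parameter count and file layout will differ somewhat.

```python
# Rough bits-per-parameter estimate for the 8.1 GB gemma3:12b download.
# Assumptions (not from the thread): exactly 12e9 parameters, 1 GB = 1e9 bytes.
FILE_BYTES = 8_100_000_000
PARAMS = 12_000_000_000

file_bits = FILE_BYTES * 8
bits_per_param = file_bits / PARAMS

print(f"file size: {file_bits:,} bits")
print(f"about {bits_per_param:.1f} bits per parameter")
```

Under these assumptions the file works out to a little over 5 bits per parameter, which would fit a quantized model rather than 16-bit weights.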

swyx11mo ago

the study of language models from an information theory/compression POV is a small field but increasingly impt for efficiency/scaling - we did a discussion about this today https://www.youtube.com/watch?v=SWIKyLSUBIc&t=2269s

divbzero11mo ago

The Encyclopædia Britannica has about 40,000,000 words [1] or about 0.25 GB if you assume 6 bytes per word. It’s impressive but not outlandish that an 8.1 GB file could encode a large swath of human information.

[1]: https://en.wikipedia.org/wiki/Encyclopædia_Britannica

agumonkey11mo ago

Intelligence is compression some say

Nevermark11mo ago

Very much so!

The more and faster a “mind” can infer, the less it needs to store.

Think how many fewer facts a symbolic system that can perform calculus needs to store, vs. an algebraic, or just arithmetic system, to cover the same numerical problem solving space. Many orders of magnitude fewer.

The same goes for higher orders of reasoning. General or specific subject related.

And higher order reasoning vastly increases capabilities extending into new novel problem spaces.

I think model sizes may temporarily drop significantly, after every major architecture or training advance.

In the long run, “A circa 2025 maxed M3 Ultra Mac Studio is all you need!” (/h? /s? Time will tell.)

tshaddox11mo ago

Some say that. But what I value even more than compression is the ability to create new ideas which do not in any way exist in the set of all previously-conceived ideas.

goatlover11mo ago

How well does that apply to robotics or animal intelligence? Manipulating the real world is more fundamental to human intelligence than compressing text.

hamilyon211mo ago

Crystallized intelligence is. I am not sure about fluid intelligence.

penguin_booze11mo ago

I don't know why, but I was reminded of Douglas Hofstadter's talk: Analogy is cognition: https://www.youtube.com/watch?v=n8m7lFQ3njk&t=964s.

dgrabla11mo ago

Back in the '90s, we joked about putting “the internet” on a floppy disk. It’s kind of possible now.

Lu202511mo ago

Yeah, those guys managed to steal the internet.

Wowfunhappy11mo ago

How does this compare to, say, the compression ratio of a lossless 8K video and a 240p Youtube stream of the same video?

mr_toad11mo ago

I will never tire of pointing out that machine learning models are compression algorithms, not compressed data.

inopinatus11mo ago

I kinda made an argument the other day that they are high-dimensional lossy decompression algorithms, which might be the same difference but looking the other way through the lens.

dcl11mo ago

ML algorithms are compression algorithms, the trained models are compressed data.

ysofunny11mo ago

they're an upgraded version of self-executable zip files that compress knowledge like mp3 compresses music, without knowing exactly wtf either music or knowledge is

the self-execution is the interactive chat interface.

wikipedia gets "trained" (compiled+compressed+lossy) into an executable you can chat with, you can pass this through another pretrained A.I. that can talk out the text or transcribe it.

I think writing compilers is now officially a defunct skill, of historical and conservation purposes more than anything else; but I don't like saying "conservation", it's a bad framing, I'd rather say "legacy connectivity" which is a form of continuity or backwards compatibility

Nevermark11mo ago

It is truly incredible.

One factor, is the huge redundancies pervasive in our communication.

(1) There are so many ways to say the same thing, that (2) we have to add even more words to be precise at all. Without a verbal indexing system we (3) spend many words just setting up context for what we really want to say. And finally, (4) we pervasively add a great deal of intentionally non-informational creative and novel variability, and mood inducing color, which all require even more redundancy to maintain reliable interpretation, in order to induce our minds to maintain attention.

Our minds are active resistors of plain information!

All four factors add so much redundancy, it’s probably fair to say most of our communication (by bits, characters, words, etc., maybe 95%?, 98%? or more!) is pure redundancy.

Another helpful compressor, is many facts are among a few “reasonably expected” alternative answers. So it takes just a little biasing information to encode the right option.

Finally, the way we reason seems to be highly common across everything that matters to us. Even though we have yet to identify and characterize this informal human logic. So once that is modeled, that itself must compress a lot of relations significantly.

Fuzzy Logic was a first approximation attempt at modeling human “logic”. But has not been very successful.

Models should eventually help us uncover that “human logic”, by analyzing how they model it. Doing so may let us create even more efficient architectures. Perhaps significantly more efficient, and even provide more direct non-gradient/data based “thinking” design.

Nevertheless, the level of compression is astounding!

We are far less complicated cognitive machines than we imagine! Scary, but inspiring too.

I personally believe that common PCs of today, maybe even high end smart phones circa 2025, will be large enough to run future super intelligence when we get it right, given internet access to look up information.

We have just begun to compress artificial minds.

holoduke11mo ago

Yea. Same for a 8gb stable diffusion image generator. Sure not the best quality. But there is so much information inside.

ljlolel11mo ago

How big is Wikipedia text? Within 3X that size with 100% accuracy

phkahler11mo ago

Google AI response says this for compressed size of wikipedia:

"The English Wikipedia, when compressed, currently occupies approximately 24 GB of storage space without media files. This compressed size represents the current revisions of all articles, but excludes media files and previous revisions of pages, according to Wikipedia and Quora."

So 3x is correct but LLMs are lossy compression.

stronglikedan11mo ago

I've been doing the AI course on Brilliant lately, and it's mindblowing the techniques that they come up with to compress the data.

tomkaos11mo ago

Same thing with image models. A 4 GB Stable Diffusion model can draw and represent anything humanity knows.

alternatex11mo ago

How about a full glass of wine? Filled to the brim.

Workaccount211mo ago

I don't like the term "compression" used with transformers because it gives the wrong idea about how they function. Like that they are a search tool glued onto a .zip file, your prompts are just fancy search queries, and hallucinations are just bugs in the recall algo.

Although strictly speaking they have lots of information in a small package, they are F-tier compression algorithms because the loss is bad, unpredictable, and undetectable (i.e. a human has to check it). You would almost never use a transformer in place of any other compression algorithm for typical data compression uses.

Wowfunhappy11mo ago

A .zip is lossless compression. But we also have plenty of lossy compression algorithms. We've just never been able to use lossy compression on text.

angusturner11mo ago

There is an excellent talk by Jack Rae called “compression for AGI”, where he shows (what I believe to be) a little known connection between transformers and compression;

In one view, you can see LLMs as SOTA lossless compression algorithms, where the number of weights doesn’t count towards the description length. Sounds crazy but it’s true.

mjburgess11mo ago· 20 in thread

Deepseek v1 is ~670Bn which is ~1.4TB physical.

All digitized books ever written/encoded compress to a few TB. The public web is ~50TB. I think a usable zip of all english electronic text publicly available would be on O(100TB). So we're at about 1% of that in model size, and we're in a diminishing-returns area of training -- ie., going to >1% has not yielded improvements (cf. gpt4.5 vs 4o).

This is why compute spend is moving to inference time with "reasoning" models. It's likely we're close to diminishing returns on inference-time compute now too, hence agents whereby (mostly) deterministic tools are supplementing information/capability into the system.

I think to get any more value out of this model class, we'll be looking at domain-specific specialisation beyond instruction fine-tuning.

I'd guess targeting 1TB inference-time VRAM would be a reasonable medium-term target for high quality open source models -- that's within the reach of most SMEs today. That's about 250bn params.

smokel11mo ago

Simply add images and video, and these estimates start to sound like the "640 KB should be enough for everyone".

After that, make the robots explore and interact with the world by themselves, to fetch even more data.

In all seriousness, adding image and interaction data will probably be enormously useful, even for generating text.

netcan11mo ago

Likely both will be done. Idk what the ROI is on adding video data to the text models, but it's presumably lower than text.

There are just a lot of avenues to try at this point.

layer811mo ago

Just a nitpick, but please don’t misuse big O notation like that. Any fixed storage amount is O(100TB).

fouc11mo ago

Maybe you're thinking of Library of Congress when you say ~50TB? Internet is definitely larger..

Aachen11mo ago

Indeed, a quick lookup doesn't give many reliable-sounding sources but they're all on the order of zettabytes (tens to thousands of them), also for years before any LLM was halfway usable. One has to wonder how much of that is generated, thinking of one of my own websites where the pages are derived statistics from player highscores, or the websites that jokingly index all Bitcoin addresses and UUIDs

Perhaps the 50TB estimate is unique information without any media or so, but OP can better back up where they got that number from than I can with guesswork

account-511mo ago

> All digitized books ever written/encoded compress to a few TB. The public web is ~50TB. I think a usable zip of all english electronic text publicly available would be on O(100TB).

Where are you getting these numbers from? Interested to see how that's calculated.

I read somewhere, but cannot find the source anymore, that all written text prior to this century was approx 50MB. (Might be misquoted as don't have source anymore).

TeMPOraL11mo ago

> I read somewhere, but cannot find the source anymore, that all written text prior to this century was approx 50MB. (Might be misquoted as don't have source anymore).

50 MB feels too low, unless the quote meant text up until the 20th century, in which case it feels much more believable. In terms of text production and publishing, we're still riding an exponent, so a couple orders of magnitude increase between 1899 and 2025 is not surprising.

(Talking about S-curves is all the hotness these days, but I feel it's usually a way to avoid understanding what exponential growth means - if one assumes we're past the inflection point, one can wave their hands and pretend the change is linear, and continue to not understand it.)

bravesoul211mo ago

I reckon a prolific writer could publish a million words in their career.

Most people who blog could write 1k words a day. That's a million in 3 years. So not crazy numbers here.

That's 5MB. Maybe you meant 50GB. I'd hazard 50TB.

mjburgess11mo ago

Anna's Archive full torrent is O(1PB), project gutenberg is O(1TB), many AI training torrents are reported in the O(50TB) range.

Extract just the plain text from that (+social media, etc.), remove symbols outside of a 64 symbol alphabet (6 bits) and compress. "Feels" to me around a 100TB max for absolutely everything.

Either way, full-fat LLMs are operating at 1-10% of this scale, depending how you want to estimate it.

If you run a more aggressive filter on that 100TB, eg., for a more semantic dedup, there's a plausible argument for "information" in english texts available being ~10TB -- then we're running close to 20% of that in LLMs.

If we take LLMs to just be that "semantic compression algorithm", and supposing the maximum useful size of an LLM is 2TB, then you could run the argument that everything "salient" ever written is <10TB.

Taking LLMs to be running at close to 50% of "everything useful" rather than 1% would be an explanation of why training has capped out.

I think the issue is at least as much to do with what we're using LLMs for -- ie., instruction fine-tuning requires some more general (proxy/quasi-) semantic structures in LLMs and I think you only need O(1%) of "everything ever written" to capture these. So it wouldn't really matter how much more we added, instruction-following LLMs don't really need it.

zX41ZdbW11mo ago

I've recently made a presentation on this topic: https://www.youtube.com/watch?v=8yH3rY1fZEA

kmm11mo ago

Perhaps that's meant to be 50GB (and that still seems like a serious underestimation)? Just the Bible is already 5MB.

WesolyKubeczek11mo ago

Maybe prior to the prior century, and even then I smell a lot of bullshit. I mean, just look at the Project Gutenberg. Even plaintext only, even compressed.

andrepd11mo ago

> 50TB

There's no way the entire Web fits in 400$ worth of hard drives.

flir11mo ago

Nah, Common Crawl puts on 250TB a month.

Maybe text only, though...

AlienRobot11mo ago

Text is small.

camel-cdr11mo ago

> All digitized books ever written/encoded compress to a few TB.

I tried to estimate how much data this actually is:

    # annas archive stats
    papers = 105714890
    books = 52670695
    
    # word count estimates
    avrg_words_per_paper = 10000
    avrg_words_per_book = 100000
    
    words = (papers*avrg_words_per_paper + books*avrg_words_per_book )
    
    # quick text of 27 million words from a few books
    sample_words = 27809550
    sample_bytes = 158824661
    sample_bytes_comp = 28839837 # using zpaq -m5
    
    bytes_per_word = sample_bytes/sample_words
    byte_comp_ratio = sample_bytes_comp/sample_bytes
    word_comp_ratio = bytes_per_word*byte_comp_ratio
    
    print("total:", words*bytes_per_word*1e-12, "TB") # total: 30.10238345855199 TB
    print("compressed:", words*word_comp_ratio*1e-12, "TB") # compressed: 5.466077036085319 TB

So uncompressed ~30 TB and compressed ~5.5 TB of data.

That fits on three 2TB micro SD cards, which you could buy for a total of 750$ from SanDisk.

charcircuit11mo ago

>The public web is ~50TB

Did you mean to type EB?

gosub10011mo ago

Only if you included all images and video

rain1OP11mo ago

This is kind of related to the jack morris post https://blog.jxmo.io/p/there-are-no-new-ideas-in-ai-only he discusses how the big leaps in LLMs have mostly come - not so much from new training methods or arch. changes as such - but the ability of new archs. to ingest more data.

generalizations11mo ago

> has not yielded improvements (cf. gpt4.5 vs 4o).

FWIW there is a huge difference between 4.5 and 4o.

OtherShrezzing11mo ago· 5 in thread

>None of this document was not written by AI

I think in these scenarios, articles should include the prompt and generating model.

rain1OP11mo ago

I have corrected that. It was supposed to say "None of this document was written by AI."

Thank you for spotting the error.

OtherShrezzing11mo ago

Understood, thanks for updating it!

kylecazar11mo ago

I thought this was an accidental double negative by the author -- trying to declare they wrote it themselves.

There are some signs it's written by possibly a non-native speaker.

WesolyKubeczek11mo ago

I don’t think the author knows that double negatives in English in a sentence like this cancel, not reinforce, each other.

oc111mo ago

You are absolutely right! The AI slop is getting out of control.

dale_glass11mo ago· 4 in thread

How big are those in terms of size on disk and VRAM size?

Something like 1.61B just doesn't mean much to me since I don't know much about the guts of LLMs. But I'm curious about how that translates to computer hardware -- what specs would I need to run these? What could I run now, what would require spending some money, and what I might hope to be able to run in a decade?

mjburgess11mo ago

At 1 byte/param that's 1.6GB (f8), at 2 bytes (f16) that's 3.2GB -- but there are other space costs beyond loading the parameters for the GPU. So a rule of thumb is ~4x parameter count. So round up, 2B -> 2*4 = 8GB VRAM
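
The arithmetic here is easy to sketch. The helper names below are made up for illustration, and the 4x factor is just the rule of thumb stated above, not a measured value.

```python
def weight_size_gb(params_billion: float, bytes_per_param: float) -> float:
    """Size of the raw weights in GB, taking 1 GB = 1e9 bytes."""
    return params_billion * bytes_per_param


def vram_rule_of_thumb_gb(params_billion: float, factor: float = 4.0) -> float:
    """The '~4x parameter count' rule of thumb (an approximation, not a measurement)."""
    return params_billion * factor


for label, bytes_per_param in [("f8", 1), ("f16", 2)]:
    size = weight_size_gb(1.61, bytes_per_param)
    print(f"1.61B params at {label}: {size:.2f} GB of weights")

print(f"2B params with the 4x rule: {vram_rule_of_thumb_gb(2):.0f} GB VRAM")
```

The weights-only figure is a floor; the gap up to the rule-of-thumb number is the allowance for everything else that has to sit in GPU memory.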

171862744011mo ago

That sounds about the size of a modern browser (aka. any Electron et al. application)

loudmax11mo ago

Most of these models have been trained using 16-bit weights. So a 1 billion parameter model takes up 2 gigabytes.

In practice, models can be quantized to smaller weights for inference. Usually, the performance loss going from 16 bit weights to 8 bit weights is very minor, so a 1 billion parameter model can take 1 gigabyte. Thinking about these models in terms of 8-bit quantized weights has the added benefit of making the math really easy. A 20B model needs 20G of memory. Simple.

Of course, models can be quantized down even further, at greater cost of inference quality. Depending on what you're doing, 5-bit weights or even lower might be perfectly acceptable. There's some indication that models that have been trained on lower bit weights might perform better than larger models that have been quantized down. For example, a model that was trained using 4-bit weights might perform better than a model that was trained at 16 bits, then quantized down to 4 bits.

When running models, a lot of the performance bottleneck is memory bandwidth. This is why LLM enthusiasts are looking for GPUs with the most possible VRAM. Your computer might have 128G of RAM, but your GPU's access to that memory is so constrained by bandwidth that you might as well run the model on your CPU. Running a model on the CPU can be done, it's just much slower because the computation is so parallel.

Today's higher end consumer grade GPUs have up to 24G of dedicated VRAM (an Nvidia RTX 5090 has 32G of VRAM and they're like $2k). The dedicated VRAM on a GPU has a memory bandwidth of about 1 TB/s. Apple's M-series of ARM-based CPUs have 512 GB/s of bandwidth, and they're one of the most popular ways of being able to run larger LLMs on consumer hardware. AMD's new "Strix Halo" CPU+GPU chips have up to 128G of unified memory, with a memory bandwidth of about 256 GB/s.
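
One way to see why bandwidth dominates: for a dense model, generating each token means reading roughly all of the weights once, so memory bandwidth divided by model size gives a loose upper bound on tokens per second. The sketch below assumes that simplification (it ignores KV cache traffic, batching, and MoE sparsity), treats the bandwidth figures quoted above as bytes per second, and uses a hypothetical function name.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Loose upper bound: each generated token streams all weights once."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 20  # assumed: a 20B model with 8-bit weights

# Bandwidth figures in GB/s, taken from the paragraph above.
hardware = [
    ("discrete GPU VRAM", 1000),
    ("Apple M-series", 512),
    ("Strix Halo", 256),
]

for name, bandwidth in hardware:
    bound = max_tokens_per_second(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{bound:.0f} tokens/s")
```

Real throughput lands below these bounds, but the ordering between the three classes of hardware tends to follow the bandwidth numbers.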

Reddit's r/LocalLLaMA is a reasonable place to look to see what people are doing with consumer grade hardware. Of course, some of what they're doing is bonkers so don't take everything you see there as a guide.

And as far as a decade from now, who knows. Currently, the top silicon fabs of TSMC, Samsung, and Intel are all working flat-out to meet the GPU demand from hyperscalers rolling out capacity (Microsoft Azure, AWS, Google, etc). Silicon chip manufacturing has traditionally followed a boom/bust cycle. But with geopolitical tensions, global trade barriers, AI-driven advances, and whatever other black swan events, what the next few years will look like is anyone's guess.

ethan_smith11mo ago

As a rule of thumb, each billion parameters requires about 2GB of VRAM for the weights in FP16 (2 bytes per parameter), so a 7B model needs ~14GB, 70B needs ~140GB, while the 405B models need ~810GB of VRAM - though quantization can reduce this by 2-4x (4-bit models use only ~0.5GB per billion parameters).

kamranjon11mo ago· 3 in thread

This is somehow missing the Gemma and Gemini series of models from Google. I also think that not mentioning the T5 series of models is strange from a historical perspective because they sort of pioneered many of the concepts in transfer learning and kinda kicked off quite a bit of interest in this space.

rain1OP11mo ago

The Gemma models are too small to be included in this list.

You're right the T5 stuff is very important historically but they're below 11B and I don't have much to say about them. Definitely a very interesting and important set of models though.

tantalor11mo ago

> too small

Eh?

* Gemma 1 (2024): 2B, 7B

* Gemma 2 (2024): 2B, 9B, 27B

* Gemma 3 (2025): 1B, 4B, 12B, 27B

This is the same range as some Llama models which you do mention.

> important historically

Aren't you trying to give a historical perspective? What's the point of this?

kamranjon11mo ago

Since you included GPT-2, everything from Google including T5 would qualify for the list I would think.

stared11mo ago· 3 in thread

If you want it visually, here's a chart of total parameters as a function of year: https://app.charts.quesma.com/s/rmyk38

rain1OP11mo ago

I think that one thing that this chart makes visually very clear is the point I made about GPT-3 being such a huge leap, and there being a long gap before anybody was able to match it.

rain1OP11mo ago

This is really awesome. Thank you for creating that. I included a screenshot and link to the chart with credit to you in a comment to my post.

stared11mo ago

I am happy you like it!

If you like darker color scheme, here it is:

https://app.charts.quesma.com/s/f07qji

And active vs total:

https://app.charts.quesma.com/s/4bsqjs

simonw11mo ago· 2 in thread

> There were projects to try to match it, but generally they operated by fine tuning things like small (70B) llama models on a bunch of GPT-3 generated texts (synthetic data - which can result in degeneration when AI outputs are fed back into AI training inputs).

That parenthetical doesn't quite work for me.

If synthetic data always degraded performance, AI labs wouldn't use synthetic data. They use it because it helps them train better models.

There's a paper that shows that if you very deliberately train a model on its own output in a loop you can get worse performance. That's not what AI labs using synthetic data actually do.

That paper gets a lot of attention because the schadenfreude of models destroying themselves through eating their own tails is irresistible.

rybosome11mo ago

Agreed, especially when in this context of training a smaller model on a larger model’s outputs. Distillation is generally accepted as an effective technique.

This is exactly what I did in a previous role, fine-tuning Llama and Mistral models on a mix of human and GPT-4 data for a domain-specific task. Adding (good) synthetic data definitely increased the output quality for our tasks.

rain1OP11mo ago

Yes but just purely in terms of entropy, you can't make a model better than GPT-4 by training it on GPT-4 outputs. The limit you would converge towards is GPT-4.

angusturner11mo ago· 1 in thread

I wish people would stop parroting the view that LLMs are lossy compression.

There is kind of a vague sense in which this metaphor holds, but there is a much more interesting and rigorous fact about LLMs which is that they are also _lossless_ compression algorithms.

There are at least two senses in which this is true:

1. You can use an LLM to losslessly compress any piece of text at a cost that approaches the negative log-likelihood of that text under the model, using arithmetic coding. A sender and receiver both need a copy of the LLM weights.

2. You can use an LLM plus SGD (i.e. the training code) as a lossless compression algorithm, where the communication cost is the area under the training curve (and the model weights don’t count towards description length!) see: Jack Rae “compression for AGI”
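
Point 1 can be made concrete in a few lines. An arithmetic coder driven by a model spends about -log2(p) bits on a token the model predicted with probability p, so the total cost is the text's negative log-likelihood measured in bits. The probabilities below are made up for illustration; a real setup would take them from the LLM and feed them to an actual arithmetic coder, which this sketch leaves out.

```python
import math


def ideal_code_length_bits(token_probs):
    """Sum of -log2(p) over the probability the model gave each actual token."""
    return sum(-math.log2(p) for p in token_probs)


# Hypothetical per-token probabilities for a 4-token text (not from a real model).
probs = [0.5, 0.25, 0.9, 0.125]

bits = ideal_code_length_bits(probs)
print(f"{bits:.2f} bits for {len(probs)} tokens")
```

The better the model predicts the text, the closer each probability is to 1 and the fewer bits the text costs, which is why a stronger language model is also a stronger compressor.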

actionfromafar11mo ago

Re 1 - classical compression is also extremely effective if both sender and receiver have access to the same huge dictionary.
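
A small sketch of that effect using zlib's preset-dictionary support. The dictionary and message here are toy strings chosen for illustration; the point is only that a dictionary shared in advance lets the compressor refer back to text it never has to transmit.

```python
import zlib

# Hypothetical dictionary that sender and receiver both hold in advance.
shared = (
    b"large language models are lossless compression algorithms "
    b"when combined with arithmetic coding"
)
message = b"language models are compression algorithms"

# Baseline: compress the message on its own.
plain = zlib.compress(message, 9)

# Same message, but the compressor may reference the shared dictionary.
compressor = zlib.compressobj(level=9, zdict=shared)
with_dict = compressor.compress(message) + compressor.flush()

# The receiver needs the identical dictionary to decode.
decompressor = zlib.decompressobj(zdict=shared)
restored = decompressor.decompress(with_dict) + decompressor.flush()

print(len(message), len(plain), len(with_dict))
```

The dictionary plays the role the model weights play in the LLM case: a large shared prior that is paid for once and not counted against each message.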

christianqchung11mo ago· 1 in thread

This is a bad article. Some of the information is wrong, and it's missing lots of context.

For example, it somehow merged Llama 4 Maverick's custom Arena chatbot version with Behemoth, falsely claiming that the former is stopping the latter from being released. It also claims 40GB of internet text data is 10B tokens, which seems a little odd. Llama 405B was also trained on more than 15 trillion tokens[1], but the post claims only 3.67 trillion for some reason. It also doesn't mention Mistral Large for some reason, even though it's the first good European 100B+ dense model.

>The MoE arch. enabled larger models to be trained and used by more people - people without access to thousands of interconnected GPUs

You still need thousands of GPUs to train a MoE model of any actual use. This is true for inference in the sense that it's faster I guess, but even that has caveats because MoE models are less powerful than dense models of the same size, though the trade-off has apparently been worth it in many cases. You also didn't need thousands of GPUs to do inference before, even for the largest models.

The conclusion is all over the place, and has lots of just weird and incorrect implications. The title is about how big LLMs are, why is there such a focus on token training count? Also no mention of quantized size. This is a bad AI slop article (whoops, turns out the author accidentally said it was AI generated, so it's a bad human slop article).

[1] https://ai.meta.com/blog/meta-llama-3-1/

rain1OP11mo ago

I can correct mistakes.

> it somehow merged Llama 4 Maverick's custom Arena chatbot version with Behemoth

I can clarify this part. I wrote 'There was a scandal as facebook decided to mislead people by gaming the lmarena benchmark site - they served one version of llama-4 there and released a different model' which is true.

But it is inside the section about the llama 4 model behemoth. So I see how that could be confusing/misleading.

I could restructure that section a little to improve it.

> Llama 405B was also trained on more than 15 trillion tokens[1],

You're talking about Llama 405B instruct, I'm talking about Llama 405B base. Of course the instruct model has been trained on more tokens.

> why is there such a focus on token training count?

I tried to include the rough training token count for each model I wrote about - plus additional details about training data mixture if available. Training data is an important part of an LLM.

fossa111mo ago

It’s ironic: for years the open-source community was trying to match GPT-3 (175B dense) with 30B–70B models + RLHF + synthetic data—and the performance gap persisted.

Turns out, size really did matter, at least at the base model level. Only with the release of truly massive dense (405B) or high-activation MoE models (DeepSeek V3, DBRX, etc) did we start seeing GPT-4-level reasoning emerge outside closed labs.

lukeschlather11mo ago

This is a really nice writeup.

That said, there's an unstated assumption here that these truly large language models are the most interesting thing. The big players have been somewhat quiet but my impression from the outside is that OpenAI let a little bit leak with their behavior. They built an even larger model and it turned out to be disappointing so they quietly discontinued it. The most powerful frontier reasoning models may actually be smaller than the largest publicly available models.

1vuio0pswjnm711mo ago

1. "raw text continuation engine"

https://gist.github.com/rain-1/cf0419958250d15893d8873682492...

2. "superintelligence"

https://en.m.wikipedia.org/wiki/Superintelligence

"Meta is uniquely positioned to deliver superintelligence to the world."

https://www.cnbc.com/2025/06/30/mark-zuckerberg-creating-met...

Is there any difference between 1 and 2?

Yes. One is purely hypothetical

unwind11mo ago

Meta: The inclusion of the current year ("(2025)") in the title is strange, even though it's in the actual title of the linked-to post, repeating it here makes me look around for the time machine controls.

bobsmooth11mo ago

There's got to be tons of books that remain undigitized that can be mined for training data, hasn't there?


ljoshua11mo ago· 37 in thread


soulofmischief11mo ago

Knowledge is learning relationships by decontextualizing information into generalized components. Application of knowledge is recontextualizing these components based on the problem at hand.

Also, shout out to Fabrice Bellard's ts_zip, which leverages LLMs to compress text files. https://bellard.org/ts_zip/

MPSimmons11mo ago

exe3411mo ago

Wikipedia is about 24GB, so if you're allowed to drop 1/3 of the details and make up the missing parts by splicing in random text, 8GB doesn't sound too bad.

Nevermark11mo ago

It blows my mind that I can ask for 50 synonyms, instantly get a great list with great meaning summaries.

Then ask for the same list sorted and get that nearly instantly.

These models have a short time context for now, but they already have a huge “working memory” relative to us.

It is very cool. And indicative that vastly smarter models are going to be achieved fairly easily, with new insight.

Our biology has had to ruthlessly work within our biological/ecosystem energy envelope, and with the limited value/effort returned by a pre-internet pre-vast economy.

So biology has never been able to scale. Just get marginally more efficient and effective within tight limits.

b11211mo ago

Not to mention, Language Modeling is Compression https://arxiv.org/pdf/2309.10668

So text wikipedia at 24G would easily hit 8G with many standard forms of compression, I'd think. If not better. And it would be 100% accurate, full text and data. Far more usable.
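One way to check a claim like this is to measure the ratio on a sample and scale it up. The sketch below uses only Python's standard-library codecs; the sample text is a hypothetical stand-in that is far more repetitive than real articles, so its ratios are optimistic and only show the method.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the compressed size in bytes under three standard-library codecs."""
    return {
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data, preset=9)),
    }


def projected_size_gb(original_gb: float, sample_size: int, compressed_size: int) -> float:
    """Scale a corpus size by the compression ratio measured on a sample."""
    return original_gb * compressed_size / sample_size


# Hypothetical stand-in for real article text (an assumption, not real data).
sample = " ".join(
    f"Article {i} describes topic {i % 97} and links to article {i * 7 % 1000}."
    for i in range(5000)
).encode("utf-8")

for name, size in compressed_sizes(sample).items():
    projected = projected_size_gb(24, len(sample), size)
    print(f"{name}: ratio {size / len(sample):.3f}, 24 GB -> {projected:.2f} GB")
```

Swapping `sample` for bytes read from an actual text dump would give a ratio worth quoting.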

It's so easy for people to not realise how massive 8GB really is, in terms of text. Especially if you use ascii instead of UTF.

thecosas11mo ago

A neat project you (and others) might want to check out: https://kiwix.org/

Lots of various sources that you can download locally to have available offline. They're even providing some pre-loaded devices in areas where there may not be reliable or any internet access.

nico11mo ago

For reference (according to Google):

sharkjacobs11mo ago

better point of reference might be pages-articles-multistream.xml.bz2 (current pages without edit/revision history, no talk pages, no user pages) which is 20GB

https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?

pcrh11mo ago

Wikipedia itself describes its size as ~25GB without media [0]. And it's probably more accurate and with broader coverage in multiple languages compared to the LLM downloaded by the GP.

[0] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

mapt11mo ago

What happens if you ask this 8gb model "Compose a realistic Wikipedia-style page on the Pokemon named Charizard"?

How close does it come?

tasuki11mo ago

8.1 GB is a lot!

It is 64,800,000,000 bits.

swyx11mo ago

divbzero11mo ago

[1]: https://en.wikipedia.org/wiki/Encyclopædia_Britannica

agumonkey11mo ago

Intelligence is compression some say

Nevermark11mo ago

Very much so!

The more and faster a “mind” can infer, the less it needs to store.

The same goes for higher orders of reasoning. General or specific subject related.

And higher order reasoning vastly increases capabilities extending into new novel problem spaces.

I think model sizes may temporarily drop significantly, after every major architecture or training advance.

In the long run, “A circa 2025 maxed M3 Ultra Mac Studio is all you need!” (/h? /s? Time will tell.)

tshaddox11mo ago

Some say that. But what I value even more than compression is the ability to create new ideas which do not in any way exist in the set of all previously-conceived ideas.

goatlover11mo ago

How well does that apply to robotics or animal intelligence? Manipulating the real world is more fundamental to human intelligence than compressing text.

hamilyon211mo ago

Crystallized intelligence is. I am not sure about fluid intelligence.

penguin_booze11mo ago

I don't know why, but I was reminded of Douglas Hofstadter's talk: Analogy is cognition: https://www.youtube.com/watch?v=n8m7lFQ3njk&t=964s.

dgrabla11mo ago

Back in the '90s, we joked about putting “the internet” on a floppy disk. It’s kind of possible now.

Lu202511mo ago

Yeah, those guys managed to steal the internet.

Wowfunhappy11mo ago

How does this compare to, say, the compression ratio of a lossless 8K video and a 240p Youtube stream of the same video?

mr_toad11mo ago

I will never tire of pointing out that machine learning models are compression algorithms, not compressed data.

inopinatus11mo ago

I kinda made an argument the other day that they are high-dimensional lossy decompression algorithms, which might be the same difference but looking the other way through the lens.

dcl11mo ago

ML algorithms are compression algorithms, the trained models are compressed data.

ysofunny11mo ago

they're an upgraded version of self-executable zip files that compresses knowledge like mp3 compresses music, without knowing exactly wtf either music or knowledge is

the self-execution is the interactive chat interface.

wikipedia gets "trained" (compiled+compressed+lossy) into an executable you can chat with, you can pass this through another pretrained A.I. that can talk out the text or transcribe it.

Nevermark11mo ago

It is truly incredible.

One factor is the huge redundancy pervasive in our communication.

Our minds are active resistors of plain information!

All four factors add so much redundancy that it’s probably fair to say most of our communication (by bits, characters, words, etc., maybe 95%? 98%? or more!) is pure redundancy.

Another helpful compressor is that many facts are among a few “reasonably expected” alternative answers. So it takes just a little biasing information to encode the right option.

Fuzzy Logic was a first approximation attempt at modeling human “logic”. But has not been very successful.

Nevertheless, the level of compression is astounding!

We are far less complicated cognitive machines than we imagine! Scary, but inspiring too.

We have just begun to compress artificial minds.

holoduke11mo ago

Yea. Same for an 8GB Stable Diffusion image generator. Sure, not the best quality. But there is so much information inside.

ljlolel11mo ago

How big is Wikipedia text? Within 3X that size with 100% accuracy

phkahler11mo ago

Google AI response says this for compressed size of wikipedia:

So 3x is correct but LLMs are lossy compression.

stronglikedan11mo ago

I've been doing the AI course on Brilliant lately, and it's mindblowing the techniques that they come up with to compress the data.

tomkaos11mo ago

Same thing with image models. A 4 GB Stable Diffusion model can draw and represent anything humanity knows.

alternatex11mo ago

How about a full glass of wine? Filled to the brim.

Workaccount211mo ago

Wowfunhappy11mo ago

A .zip is lossless compression. But we also have plenty of lossy compression algorithms. We've just never been able to use lossy compression on text.

angusturner11mo ago

There is an excellent talk by Jack Rae called “compression for AGI”, where he shows (what I believe to be) a little known connection between transformers and compression;

In one view, you can view LLMs as SOTA lossless compression algorithms, where the number of weights doesn’t count towards the description length. Sounds crazy but it’s true.
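A toy sketch of that accounting, with an adaptive byte-frequency model standing in for the LLM (an assumption made for brevity; this is not how any real LLM-based compressor is implemented). The code length is the sum of -log2 p over the symbols, which an arithmetic coder can approach to within a few bits. The model itself is never transmitted, because a decoder running the same update rule rebuilds it symbol by symbol.

```python
import math
from collections import Counter


def adaptive_code_length_bits(data: bytes) -> float:
    """Ideal code length of data under an adaptive order-0 byte model.

    Each byte is predicted from the counts of the bytes seen so far (with
    add-one smoothing over 256 symbols), charged -log2(p) bits, and only
    then added to the counts. A decoder can repeat exactly the same updates,
    so no model parameters need to be sent.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for byte in data:
        probability = (counts[byte] + 1) / (seen + 256)
        total_bits += -math.log2(probability)
        counts[byte] += 1
        seen += 1
    return total_bits


# Hypothetical sample text, chosen only because it is highly predictable.
text = b"abracadabra " * 200
bits = adaptive_code_length_bits(text)
print(f"{bits / len(text):.2f} bits per byte versus 8 for raw storage")
```

A better predictor assigns higher probability to what actually comes next, so the same loop yields fewer bits; that is the sense in which a stronger language model is a stronger compressor.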

mjburgess11mo ago· 20 in thread

Deepseek v1 is ~670Bn which is ~1.4TB physical.

I think to get any more value out of this model class, we'll be looking at domain-specific specialisation beyond instruction fine-tuning.

I'd guess targeting 1TB inference-time VRAM would be a reasonable medium-term target for high quality open source models -- that's within the reach of most SMEs today. That's about 250bn params.

smokel11mo ago

Simply add images and video, and these estimates start to sound like the "640 KB should be enough for everyone".

After that, make the robots explore and interact with the world by themselves, to fetch even more data.

In all seriousness, adding image and interaction data will probably be enormously useful, even for generating text.

netcan11mo ago

Like both will be done. Idk what the roi is on adding video data to the text models, but it's presumably lower than text.

There are just a lot of avenues to try at this point.

layer811mo ago

Just a nitpick, but please don’t misuse big O notation like that. Any fixed storage amount is O(100TB).

fouc11mo ago

Maybe you're thinking of Library of Congress when you say ~50TB? Internet is definitely larger..

Aachen11mo ago

Perhaps the 50TB estimate is unique information without any media or so, but OP can better back up where they got that number from than I can with guesswork.

account-511mo ago

> All digitized books ever written/encoded compress to a few TB. The public web is ~50TB. I think a usable zip of all english electronic text publicly available would be on O(100TB).

Where are you getting these numbers from? Interested to see how that's calculated.

I read somewhere, but cannot find the source anymore, that all written text prior to this century was approx 50MB. (Might be misquoted as don't have source anymore).

TeMPOraL11mo ago

> I read somewhere, but cannot find the source anymore, that all written text prior to this century was approx 50MB. (Might be misquoted as don't have source anymore).

bravesoul211mo ago

I reckon a prolific writer could publish a million words in their career.

Most people who blog could write 1k words a day. That's a million in 3 years. So not crazy numbers here.

That's 5MB. Maybe you meant 50GB. I'd hazard 50TB.
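The arithmetic behind this, as a rough sketch. The 5 bytes per word figure is an assumption implied by "a million words is 5 MB", and the function names are made up for illustration.

```python
BYTES_PER_WORD = 5  # assumption implied by "a million words is 5 MB"
WORDS_PER_CAREER = 1_000_000  # the "prolific writer" estimate above


def career_bytes(words: int = WORDS_PER_CAREER, bytes_per_word: int = BYTES_PER_WORD) -> int:
    """Plain-text size of one writer's lifetime output."""
    return words * bytes_per_word


def writers_needed(target_bytes: int) -> int:
    """How many such writers it takes to fill target_bytes of text."""
    return target_bytes // career_bytes()


print(career_bytes() / 10**6, "MB per writer")
print(writers_needed(50 * 10**9), "writers for 50 GB")
print(writers_needed(50 * 10**12), "writers for 50 TB")
```

Under these assumptions 50 MB is the output of about ten prolific writers, which is why that figure looks too small for all text written before this century.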

mjburgess11mo ago

Anna's Archive full torrent is O(1PB), project gutenberg is O(1TB), many AI training torrents are reported in the O(50TB) range.

Extract just the plain text from that (+social media, etc.), remove symbols outside of a 64 symbol alphabet (6 bits) and compress. "Feels" to me around a 100TB max for absolutely everything.

Either way, full-fat LLMs are operating at 1-10% of this scale, depending how you want to estimate it.

Taking LLMs to be running at close to 50% of "everything useful" rather than 1% would be an explanation of why training has capped out.

zX41ZdbW11mo ago

I've recently made a presentation on this topic: https://www.youtube.com/watch?v=8yH3rY1fZEA

kmm11mo ago

Perhaps that's meant to be 50GB (and that still seems like a serious underestimation)? Just the Bible is already 5MB.

WesolyKubeczek11mo ago

Maybe prior to the prior century, and even then I smell a lot of bullshit. I mean, just look at the Project Gutenberg. Even plaintext only, even compressed.

andrepd11mo ago

> 50TB

There's no way the entire Web fits in $400 worth of hard drives.

flir11mo ago

Nah, Common Crawl puts on 250TB a month.

Maybe text only, though...

AlienRobot11mo ago

Text is small.

camel-cdr11mo ago

> All digitized books ever written/encoded compress to a few TB.

I tried to estimate how much data this actually is:

    # annas archive stats
    papers = 105714890
    books = 52670695
    
    # word count estimates
    avrg_words_per_paper = 10000
    avrg_words_per_book = 100000
    
    words = (papers*avrg_words_per_paper + books*avrg_words_per_book )
    
    # quick test of 27 million words from a few books
    sample_words = 27809550
    sample_bytes = 158824661
    sample_bytes_comp = 28839837 # using zpaq -m5
    
    bytes_per_word = sample_bytes/sample_words
    byte_comp_ratio = sample_bytes_comp/sample_bytes
    word_comp_ratio = bytes_per_word*byte_comp_ratio
    
    print("total:", words*bytes_per_word*1e-12, "TB") # total: 30.10238345855199 TB
    print("compressed:", words*word_comp_ratio*1e-12, "TB") # compressed: 5.466077036085319 TB

So uncompressed ~30 TB and compressed ~5.5 TB of data.

That fits on three 2TB micro SD cards, which you could buy for a total of $750 from SanDisk.

charcircuit11mo ago

>The public web is ~50TB

Did you mean to type EB?

gosub10011mo ago

Only if you included all images and video

rain1OP11mo ago

generalizations11mo ago

> has not yielded improvements (cf. gpt4.5 vs 4o).

FWIW there is a huge difference between 4.5 and 4o.

OtherShrezzing11mo ago· 5 in thread

>None of this document was not written by AI

I think in these scenarios, articles should include the prompt and generating model.

rain1OP11mo ago

I have corrected that. It was supposed to say "None of this document was written by AI."

Thank you for spotting the error.

OtherShrezzing11mo ago

Understood, thanks for updating it!

kylecazar11mo ago

I thought this was an accidental double negative by the author -- trying to declare they wrote it themselves.

There are some signs it was possibly written by a non-native speaker.

WesolyKubeczek11mo ago

I don’t think the author knows that double negatives in English in a sentence like this cancel, not reinforce, each other.

oc111mo ago

You are absolutely right! The AI slop is getting out of control.

dale_glass11mo ago· 4 in thread

How big are those in terms of size on disk and VRAM size?

mjburgess11mo ago

171862744011mo ago

That sounds about the size of a modern browser (aka. any Electron et al. application)

loudmax11mo ago

Most of these models have been trained using 16-bit weights. So a 1 billion parameter model takes up 2 gigabytes.
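As a quick sketch of that rule of thumb (the function name is made up for illustration, and real checkpoints carry some extra metadata, so treat the results as estimates):

```python
def model_size_gb(parameters: float, bits_per_weight: int = 16) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return parameters * bits_per_weight / 8 / 1e9


print(model_size_gb(1e9))       # 1B parameters at 16-bit
print(model_size_gb(405e9))     # a 405B dense model at 16-bit
print(model_size_gb(405e9, 4))  # the same model under a hypothetical 4-bit quantization
```

The same formula covers quantized sizes: only `bits_per_weight` changes, which is why the download size of a model can be much smaller than its 16-bit footprint.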

ethan_smith11mo ago

kamranjon11mo ago· 3 in thread

rain1OP11mo ago

The Gemma models are too small to be included in this list.

You're right the T5 stuff is very important historically but they're below 11B and I don't have much to say about them. Definitely a very interesting and important set of models though.

tantalor11mo ago

> too small

Eh?

* Gemma 1 (2024): 2B, 7B

* Gemma 2 (2024): 2B, 9B, 27B

* Gemma 3 (2025): 1B, 4B, 12B, 27B

This is the same range as some Llama models which you do mention.

> important historically

Aren't you trying to give a historical perspective? What's the point of this?

kamranjon11mo ago

Since you included GPT-2, everything from Google including T5 would qualify for the list I would think.

stared11mo ago· 3 in thread

If you want it visually, here's a chart of total parameters as a function of year: https://app.charts.quesma.com/s/rmyk38

rain1OP11mo ago

I think one thing this chart makes visually very clear is the point I made about GPT-3 being such a huge leap, and there being a long gap before anybody was able to match it.

rain1OP11mo ago

This is really awesome. Thank you for creating that. I included a screenshot and link to the chart with credit to you in a comment to my post.

stared11mo ago

I am happy you like it!

If you like darker color scheme, here it is:

https://app.charts.quesma.com/s/f07qji

And active vs total:

https://app.charts.quesma.com/s/4bsqjs

simonw11mo ago· 2 in thread

That parenthetical doesn't quite work for me.

If synthetic data always degraded performance, AI labs wouldn't use synthetic data. They use it because it helps them train better models.

There's a paper that shows that if you very deliberately train a model on its own output in a loop you can get worse performance. That's not what AI labs using synthetic data actually do.

That paper gets a lot of attention because the schadenfreude of models destroying themselves through eating their own tails is irresistible.

rybosome11mo ago

Agreed, especially in this context of training a smaller model on a larger model’s outputs. Distillation is generally accepted as an effective technique.

rain1OP11mo ago

Yes but just purely in terms of entropy, you can't make a model better than GPT-4 by training it on GPT-4 outputs. The limit you would converge towards is GPT-4.

angusturner11mo ago· 1 in thread

I wish people would stop parroting the view that LLMs are lossy compression.

There is kind of a vague sense in which this metaphor holds, but there is a much more interesting and rigorous fact about LLMs which is that they are also _lossless_ compression algorithms.

There are at least two senses in which this is true:
