What even is a small language model now?

(jigsawstack.com)

109 points · yoeven · 1y ago · 75 comments

53 comments · 17 top-level

antirez · 1y ago · 14 in thread

Very small: can run on the edge to allow something like a Raspberry Pi to make basic decisions for your appliance even if disconnected from the internet. Example: those are some time series parameters and instructions, decide if watering the plants or not; vision models that can watch a camera and transcribe what it is seeing in a basic way, ...

Small: runs in an average laptop not optimized for inference of LLMs, like Gemma 3 4B.

Medium: runs in a very high spec computer that people can buy for less than 5k. 30B, 70B dense models or larger MoEs.

Large: Models that big LLM providers sell as "mini", "flash", ...

Extra Large / SOTA: Gemini 2.5 PRO, Claude 4 Opus, ChatGPT O3, ...

mnahkies · 1y ago

I'm not sure if you're implying that very small language models would be run in your raspberry pi example, but for use cases like the time series one, wouldn't something like an LSTM or TiDE architecture make more sense than a language model?

These are typically small and performant both in compute and accuracy/utility from what I've seen.

I think with all the hype at the moment sometimes AI/ML has become too synonymous with LLM

antirez · 1y ago

Sure if you have a specific need you can specialize some NN with the right architecture, collecting the data, doing the training several times, testing the performances, ... Or: you can download an already built LLM and write a prompt.

greenavocado · 1y ago

He's talking about general purpose zero shot models.

mnky9800n · 1y ago

Why in the world do you need such sophistication to know whether to water the plants or not?

collingreen · 1y ago

When you have a golden hammer everything starts to look like a nail

kovezd · 1y ago

There are places where: a) weather predictions are unreliable, b) there is scarcity of water. Just making the right decision about what hour to water at is a huge monthly saving of water.

amelius · 1y ago

In this case, "sophistication" meaning throwing insane amounts of compute power and data at the problem? In older times we'd probably call that "brute forcing".

hugh-avherald · 1y ago

Today, I asked a colleague to pass me a pen. Was that an egregiously simple task for such a powerful intelligence?

layer8 · 1y ago

For “very small”, I would add “can be passively cooled” as a criterion.

SkiFire13 · 1y ago

> Example: those are some time series parameters and instructions, decide if watering the plants or not

How is that a "language model"?

tayo42 · 1y ago

Is "language model" used to mean a neural net with transformers and attention that takes in a series of tokens and outputs a prediction as a value?

Working with time series data would work in that case.

lloydatkinson · 1y ago

> Example: those are some time series parameters and instructions, decide if watering the plants or not; vision models that can watch a camera and transcribe what it is seeing in a basic way, ...

This is the problem I have with the general discourse of "AI" even on Hacker News, of all places. Everything you listed is not an example of a *language model*.

All of those can be implemented as a simple "if", a decision tree, or a decision table, with actual ML reserved for the camera and time series prediction examples.

Using an LLM is not just ridiculous here but totally the wrong fit and a waste of resources.
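
For illustration, the "simple if" version of the watering decision might look something like the sketch below. The field names and thresholds (`soil_moisture`, `rain_probability`, 0.30, 0.60, the watering hours) are made up for this example and are not from the thread.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One sensor snapshot; all fields are hypothetical."""
    soil_moisture: float     # fraction of saturation, 0.0 to 1.0
    rain_probability: float  # forecast chance of rain in the next 24 h, 0.0 to 1.0
    hour: int                # local hour of day, 0 to 23


def should_water(r: Reading,
                 dry_below: float = 0.30,
                 rain_above: float = 0.60) -> bool:
    """Rule-based watering decision; thresholds are illustrative assumptions."""
    if r.soil_moisture >= dry_below:
        return False  # soil is still wet enough
    if r.rain_probability >= rain_above:
        return False  # rain is likely, so wait
    # Water only early or late in the day to limit evaporation.
    return r.hour < 8 or r.hour >= 19
```

Whether rules like these are good enough, or whether a prompt to a small general-purpose model is the cheaper path, is exactly what the replies disagree about.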

bdzr · 1y ago

> Using an LLM is not just ridiculous here but totally the wrong fit and a waste of resources.

Time and labor are resources too. There's a whole host of problems where "good enough" is tremendously valuable.

oezi · 1y ago

What do we call the models beyond extra large, which are so big they can't be served publicly because their inference cost is too high? Do such models exist?

zellyn · 1y ago · 5 in thread

I think of “fits on the overpowered M1/2/3/4 64GB MacBook Pro my employer gave me” as the dividing line. We’re getting to within spitting distance of models that can code well at that size.

Maxious · 1y ago

https://mistral.ai/news/devstral and https://huggingface.co/nvidia/AceReason-Nemotron-14B were released in just the last couple of days and work in 24GB 4090 GPUs/32GB Macbook Pros just fine
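
A rough way to sanity-check "fits in 24GB" is parameter count times bits per weight. The sketch below assumes the weights dominate memory and ignores the KV cache, activations, and runtime overhead; the quantization levels are illustrative and not tied to any particular release.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9


# Two sizes mentioned in this thread, at three common weight precisions.
for name, size_b in [("14B model", 14), ("24B model", 24)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(size_b, bits):.0f} GB")
```

Under these assumptions a 24B model needs about 48 GB of weights at 16 bits but only about 12 GB at 4 bits, so it is the quantized builds that fit on a 24 GB card.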

mark_l_watson · 1y ago

+1 that is my experience. devstral 24B on my Mac does very well designing code. I am writing a book on AI first software development and I have been exploring using small models in a specific process of separate steps for analysis, design, implementation, etc.

api · 1y ago

I want my next laptop to be the 128gb M series monster. That will run not quite frontier models but ones that are close in performance, and run them fast.

danielbln · 1y ago

And, also quite important, leave your system enough RAM to do anything else.

adgjlsfhk1 · 1y ago

are you sure? lpddr5 is somewhere in the range of ~0.25 W/GB (for some reason this stat is hard to find good values for), so 128gb of RAM will mean your laptop idles at >25 watts.

nickpsecurity · 1y ago · 4 in thread

The term is too overloaded.

I'll add one more: an LLM small enough that it can be trained from scratch on one A100 in 24 hours. Is it really small if it takes $10,000 to train? Or leave that term for $200 models?

Back to your definitions, there are sub-1B models people are using. I think I saw one in the 400-600M range for audio. Another person posted here a 100M-200M model for extracting data from web pages. We told them to just use a rules-based approach where possible but they believed the SLM worked better.

Then, there's projects like BabyLM that can be useful at 10M:

https://babylm.github.io/
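
To get a feel for what "one A100 in 24 hours" buys, here is a back-of-the-envelope sketch. It assumes the common approximation that training costs about 6 FLOPs per parameter per token, a hypothetical 40% hardware utilization, and a rule-of-thumb 20 training tokens per parameter. None of these numbers come from the thread, and real runs can differ a lot.

```python
import math

A100_PEAK_FLOPS = 312e12   # dense BF16 peak from NVIDIA's A100 datasheet
MFU = 0.40                 # assumed fraction of peak actually achieved
TOKENS_PER_PARAM = 20      # assumed data-to-parameter ratio (rule of thumb)


def budget_flops(hours: float, gpus: int = 1) -> float:
    """Total usable FLOPs for a run of the given length."""
    return A100_PEAK_FLOPS * MFU * hours * 3600 * gpus


def max_params(flops: float) -> float:
    """Largest N satisfying 6 * N * (TOKENS_PER_PARAM * N) <= flops."""
    return math.sqrt(flops / (6 * TOKENS_PER_PARAM))


flops = budget_flops(hours=24)
n = max_params(flops)
print(f"budget: {flops:.2e} FLOPs, about {n / 1e6:.0f}M parameters")
```

With these assumptions the budget works out to a model on the order of a few hundred million parameters, which is in the same range as the sub-1B models mentioned above.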

GardenLetter27 · 1y ago

But you only have to train the foundational model once - so with open weights it's not really a problem.

Maybe resources needed for fine-tuning would be nice to see.

nickpsecurity · 1y ago

Most have been trained on illegally-distributed, copyrighted works. They might output them, too. People might want untainted models. Additionally, some have weaknesses due to tokenizers, pre-training data, or moral alignment (political bias).

For those reasons, users might want to train a new model from scratch.

Researchers of training methods have a different problem. They need to see whether a new technique, like an optimization algorithm, gets better results. They can try techniques more quickly and with less money if they have small training runs that are representative of what larger models do. If BabyLM-10M were representative, they could test each technique at the FLOPS/$ of a 10M model instead of a 1B model.

So, both researchers and users might want new models trained from scratch. The cheaper to train, the better.

monkeyisland · 1y ago

> Another person posted here a 100M-200M model for extracting data from web pages

Could you post a link to this comment or thread? I can't seem to find this model by searching but would love to try it out.

nickpsecurity · 1y ago

I think I found it. I could be getting the numbers mixed up with another SLM. That example's smaller model was 500M:

https://news.ycombinator.com/item?id=41515730

srikz · 1y ago · 4 in thread

I want to see more models that can be streamed to a browser and run locally via wasm. That would be my hope for small models. In the <100mb range.
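
As a rough guide to what a 100 MB download can hold, the sketch below converts a byte budget into a parameter count at a few weight precisions. It assumes the file is essentially all weights (no tokenizer, metadata, or runtime), which is a simplification.

```python
def max_params_millions(budget_mb: float, bits_per_weight: float) -> float:
    """Parameters (in millions) that fit in budget_mb, with 1 MB = 1e6 bytes."""
    return budget_mb * 1e6 * 8 / bits_per_weight / 1e6


for bits in (32, 16, 8, 4):
    print(f"{bits}-bit weights: ~{max_params_millions(100, bits):.0f}M parameters in 100 MB")
```

By the same arithmetic a 1B-parameter model is about 500 MB even at 4 bits per weight, so a <100 MB target means models well under 1B parameters.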

firejake308 · 1y ago

After experimenting with 1B models, I am starting to think that any model with 1B parameters or less will probably lack a lot of the general intelligence that we observe in the frontier models, because it seems physically impossible to encode that much information into so few parameters. I believe that in the range of very small models, the winner will be models that are fine tuned to a small range of tasks or domains, such as a model that can translate between English and any other language, or a legal summarization model, etc.

vindex10 · 1y ago

Have you heard of Transformers.js? They are running ONNX models inside the browser:

https://huggingface.co/docs/transformers.js/en/index

relaxing · 1y ago

Why? Just so user data stays local?

dainiusse · 1y ago

Yes. And also, cost to run it.

croes · 1y ago · 3 in thread

How can a Large Language Model be a small language model?

kelseyfrog · 1y ago

Because words are arbitrary. See Saussure.

baq · 1y ago

Why wouldn’t there be any? Right now there are large large language models, medium large language models and small large language models. You can say there are also tiny large language models and extra large large language models. Nothing confusing about it.

tialaramex · 1y ago

See also the Little Giant Girl who is part of The Sultan's Elephant and several other Royal de Luxe performances. She's clearly a little girl, but, she's also clearly a giant.

stephantul · 1y ago · 3 in thread

This post is 100% rewritten or fully generated by gpt-4o. It has the gpt smell all over it.

gwern · 1y ago

> In a world chasing ever-bigger models, small ones are quietly doing more with less—and that's exactly what makes them powerful.

100%. It has enough technical details that maybe a human did something. But who knows.

maksimur · 1y ago

Is there a problem with that? If so, what is it? I don't mind as long as it's not the boilerplate AI spits out by default.

stephantul · 1y ago

Nah not really, the information content is what counts of course. It’s just a bit cringe to see it happen.

alexpham14 · 1y ago · 2 in thread

I appreciate how it redefines “small” not by parameter count but by practical impact and deployability.

lblume · 1y ago

I do not — parameter count is objective, practical impact depends on such a multitude of factors that any comparison becomes virtually meaningless.

kergonath · 1y ago

The standard for parameter count is rapidly evolving. Something large now will be small tomorrow, so there is no point in using such a moving target as a criterion.

GolDDranks · 1y ago · 1 in thread

A traditional Markov model trained (rather, just "fitted") on tokens or words is a small language model.
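
As a minimal sketch of that idea, a first-order (bigram) Markov model over words can be fitted by counting transitions and then sampled from. The toy corpus and the function names here are invented for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_bigram(text: str) -> dict[str, Counter]:
    """Count word-to-next-word transitions; this is the whole 'fitting' step."""
    words = text.split()
    model: dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model


def generate(model: dict[str, Counter], start: str, length: int, seed: int = 0) -> str:
    """Sample a word sequence by following the transition counts."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = model.get(out[-1])
        if not followers:
            break  # dead end: this word was never followed by anything
        words, counts = zip(*followers.items())
        out.append(rng.choices(words, weights=counts)[0])
    return " ".join(out)


corpus = "the cat sat on the mat and the dog sat on the rug"
model = fit_bigram(corpus)
print(generate(model, "the", 8))
```

The "parameters" here are just the transition counts, so the model size is bounded by the number of distinct word pairs in the corpus.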

GolDDranks · 1y ago

(To share a recent personal experience about Markov models: I recently bootstrapped an HMM with hand-assigned weights. It was around 15x15 class transitions, 225 weights. That's small. Or rather, microscopic. Then I ran it against real data, picked out examples of wrong classifications, and made them auxiliary training data. Of course, it was not a language model; a language model is impossible to fit in such a small space. It was a model of transitions of chapter "types" in novels, where types are something like "Epilogue", "Prologue", "Chapter 23", "Table of Contents", "Afterword", etc.)

armcat · 1y ago

There is a "small language model", and then there is a "small LARGE language model". In late 2018, BERT (110 million params) would've been considered a "large" language model. A "small" LM would be some Markov chain or a topic model (e.g. latent Dirichlet allocation) - technically they would be considered generative language models since they learn joint distributions of params and data (words), and can then sample from that distribution. But today, we usually map "small" LMs to "small" LLMs, so in that sense a small LLM would be anything from BERT to around 3-4B params.

breckinloggins · 1y ago

Maybe we should appropriate the old DOS/x86 memory model names and give them “class-relative” sizes.

“tiny” can run on a microcontroller, “compact” on a Rpi, “small” on a phone, “medium” on a single GPU machine, “large” on AI class workstation hardware, and “huge” on a data center cluster.

mcswell · 1y ago

> Small models used to mean tiny. Now they mean "runs without drama."

Does this mean without a dedicated electric power plant?

I wanted to say "Right, big-sized. Do you want fries with that?", but I couldn't figure out how to work that in, so I won't say it.

rickstanley · 1y ago

On this topic, I've been wondering if models are capable of recommending other models for a given machine spec, for example: which model, if any, would be recommended for a laptop with a Ryzen 9 6000S and RTX 3060m (random spec).

Dwedit · 1y ago

These terms are all relative, but there's also "BabyLlama", which measures its parameter count in millions rather than billions.

Havoc · 1y ago

It’s always been a little arbitrary. “Can it fit on a 3090?” seems like a reasonable cutoff to me for now.

KasianFranks · 1y ago

This is also where MoE shines with a mixture of small and large language models.

option · 1y ago

Whatever fits into a gaming GPU such as a GeForce RTX 3080.

MiddleEndian · 1y ago

Just ask my ex-wife!
