Good results fine tuning a local LLM like Qwen 3:0.6B to categorize questions

(teachmecoolstuff.com)

205 points · dev-experiments · 2d ago · 49 comments

25 comments · 17 top-level

nl2d ago· 8 in thread

If you are going to go to the bother of fine-tuning for trivial problems like subject classification, then I think you'll find that scikit-learn with an SGDClassifier on 2-grams will probably do just as well, and the trained classifier will be under 1MB.

You can train it in under a minute, and it will work perfectly well on embedded devices.

Small LLMs are good choices for text classification in two cases:

- If you need to provide in-context examples and classify based on them.

- Your classification goes beyond simple subject-type classifiers. For example, multiple choice question answering is classification where a small LLM will work but traditional ML methods won't.
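
A minimal sketch of the n-gram baseline mentioned above, assuming scikit-learn is installed. The questions and labels are hypothetical toy data, not the article's dataset, and `ngram_range=(1, 2)` (unigrams plus 2-grams) is a choice made for this example rather than something the comment specifies.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; a real run would use the full labeled question set.
questions = [
    "When did we replace our pool pump?",
    "How often should the pool filter be cleaned?",
    "Who serviced the pool heater last year?",
    "When was the roof last inspected?",
    "Which company repaired the roof leak?",
    "How old are the roof shingles?",
    "When is the car due for an oil change?",
    "What tire pressure does the car need?",
    "Who replaced the car battery?",
]
labels = ["pool"] * 3 + ["roof"] * 3 + ["car"] * 3

# Word unigram and bigram counts feeding a linear model trained with SGD.
classifier = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), lowercase=True),
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0),
)
classifier.fit(questions, labels)

# Classify two unseen questions.
predictions = classifier.predict(
    ["Did someone fix the pool pump?", "Is the roof leaking again?"]
)
print(list(predictions))
```

With only a handful of examples per class the predictions are not reliable; the point is how little code the baseline needs before comparing it against a fine-tuned model.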

djsjajah2d ago

Not with 800 examples. If you are going to consider an ngram model, I think you are better off getting a frontier llm to write you an absurd regex.

nl1d ago

Hmm maybe. Turns out the author trained a logistic-regression classifier on the embeddings too, but didn't report the results:

https://github.com/thelgevold/fine-tuned-classifier/blob/mai...

dev-experimentsOP1d ago

Expanding on this experiment using logistic regression is an interesting continuation, detailed here: https://www.teachmecoolstuff.com/viewarticle/using-logistic-...

In summary: using logistic regression actually improves accuracy, and also improves performance both at runtime and during training.

IanCal1d ago

I would also recommend the approach of using an llm to create the examples, and then train from there.

You can even get fancy and do things like active learning with the llm taking the role of the human annotator and sending in trial statements (and you can use a cheap one for larger gen and a more expensive one for the classification).

I’d be interested in seeing how well LLMs do at writing the kind of code Snorkel AI used to have. There was open source code a while back that I assume is still around somewhere: you wrote code for a set of low-quality classifiers and it trained a model around those.

zubiaur1d ago

A small transformer like BERT or variants is a better fit. It only takes a few examples, which can be generated synthetically using an LLM.

Trains quickly and classifies speedily on modern hardware.

Had a lot of fun doing stuff like this years ago, before LLMs were a thing.

brokensegue2d ago

there are models between 2-grams and 600m param models that would be good options. i don't expect a 2-gram to do very well here. also i'm not sure why this model isn't a fine choice if it solves their problem

throwa3562621d ago

What would you suggest instead?

stephantul1d ago

A non-autoregressive transformer trained with a classification objective.

1 more reply

deepsquirrelnet2d ago

If you want to go deeper on language models, try these project ideas:

- Zero-shot encoders like tasksource or GliNER

- Natural language inference: https://huggingface.co/blog/dleemiller/nli-xenc-ways-to-use

- GRPO training

- GEPA prompt tuning Qwen 0.6B (or GEPA, then GRPO)

- Use an embedding model and train a classifier (MLP, logistic, svm)

- Use a larger LLM to generate a synthetic dataset (beware of lack of diversity, mine "seed text" from real sources first)

- Synthetically generate "hard examples" where more than one category may be valid and DPO tune your preferred responses

1 more reply

vb713222h ago

Isn’t a cosine similarity over the text embeddings a much more effective way to handle categorization?

The whole reason why embeddings work so well is because they encode the underlying meaning of the texts

mickael-kerjean2d ago

If you are interested in a small language model to fine-tune, gemma3:270m is quite interesting for its size.

nextaccountic2d ago

> The model invents new categories (e.g. apartments) and doesn’t stick to the provided list of allowed categories

Can this specific failure mode be solved by providing a grammar that the output must adhere to? (Not sure if Qwen has this feature; it's used, e.g., to ensure the output is parseable JSON.)
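
To illustrate the idea behind the question above, here is a toy sketch of constrained decoding: at each step, only continuations that can still lead to an allowed category are considered. It works on characters rather than model tokens, and `toy_scorer` is a hypothetical stand-in for a model's next-token scores (it "wants" to write the invented category "apartments"). The allowed list is also made up for the example.

```python
from typing import Callable, Sequence

END = "<end>"  # sentinel meaning "stop generating here"


def constrained_decode(
    allowed: Sequence[str],
    score_next: Callable[[str, str], float],
) -> str:
    """Greedy decoding restricted so the result is always one of `allowed`."""
    if not allowed or any(not label for label in allowed):
        raise ValueError("allowed must contain non-empty labels")
    prefix = ""
    while True:
        # Labels that are still reachable from the current prefix.
        live = [label for label in allowed if label.startswith(prefix)]
        # Legal next characters, plus END if the prefix is already a full label.
        options = {label[len(prefix)] for label in live if len(label) > len(prefix)}
        if prefix in live:
            options.add(END)
        best = max(sorted(options), key=lambda tok: score_next(prefix, tok))
        if best == END:
            return prefix
        prefix += best


def toy_scorer(prefix: str, token: str) -> float:
    """Hypothetical model that prefers to spell out 'apartments'."""
    wanted = "apartments"
    position = len(prefix)
    if token != END and position < len(wanted) and wanted[position] == token:
        return 1.0
    return 0.0


allowed_categories = ["pool", "appliances", "roof", "garage"]
result = constrained_decode(allowed_categories, toy_scorer)
print(result)  # appliances
```

The unconstrained scorer would produce "apartments", but masking forces the output into the allowed list. Note that a constraint guarantees a valid label, not a correct one.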

3 more replies

doubtfuluser2d ago

But why use a decoder model instead of a BERT-based encoder model? For pure classification, that should be easier to train and work quite well.

zwaps2d ago

Has anyone recently compared something like ModernBERT plus a classifier vs. full or LoRA fine-tuning of a small LM like Qwen?

1 more reply

pj_mukh2d ago

“As an example, the question “When did we replace our pool pump?” will be mapped to a category called “pool” before querying the Index database.”

Cool write up! Really appreciate it but incidentally how does this categorization help you get better retrieval results?

1 more reply

GardenLetter271d ago

If you're gonna fine-tune for a closed set classification problem like this, you could just fine-tune BERT and get a faster model with better performance.

electroglyph2d ago

existing embedding models like alibaba's modernbert tune or one of the jina v5s would probably map query to category automatically. (i.e. store embeddings of each category and calculate cosine sim for each incoming query vs. categories and pick the closest)

also, you could stick a classifier head on a BERT model as another option.
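
The category-embedding lookup described in this comment can be sketched as follows. This is a stdlib-only illustration: `embed` is a bag-of-words stand-in for a real embedding model, and the category descriptions are hypothetical. With a real model, only `embed` would change.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a sparse bag-of-words vector."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical categories, each expanded with a few related words.
category_texts = {
    "pool": "pool pump filter heater swimming",
    "roof": "roof shingles gutter leak attic",
    "car": "car tires oil battery engine",
}

# Embed every category once, up front.
category_vectors = {name: embed(text) for name, text in category_texts.items()}


def categorize(question: str) -> str:
    """Return the category whose embedding is closest to the question."""
    query = embed(question)
    return max(category_vectors, key=lambda name: cosine(query, category_vectors[name]))


print(categorize("When did we replace our pool pump?"))  # pool
```

Because the categories are embedded ahead of time, each incoming query costs one embedding call plus a handful of dot products.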

abhashanand15012d ago

Do small language models run on CPUs, or do you still need a GPU to run them?

3 more replies

throwa3562621d ago

Are 0.6b models useful without fine tuning?

Half of the time I ask qwen 0.6b "what is 1 + 2?" it ends up in a thinking loop of "but wait, the user is asking me to ..."

2 more replies

jszymborski2d ago

I think the Qwen 0.6B is so cool. It is super fast and as illustrated here it has a clear niche, esp. when fine-tuned.

I'm also interested in it as a student for distillation.

armcat1d ago

I mean it's always nice to play around with sLLM finetuning, but for practical purposes I would always start with a lazy learner using embeddings (something like a small Stella model), pre-embed the topics/categories, embed the question, perform a kNN using cosine distance. You can use an LLM to "expand" the topics before embedding to make them more contextual. This is usually super fast and super simple and gives you a nice baseline. Then I would add a classification head after embedding layer (with maybe some dropout + 2-3 MLP layers) and train my own classifier, and compare that to lazy learner. Only after that would I start finetuning an LLM.

danielhanchen1d ago

Very cool write-up and GitHub repo!

crimsoneer1d ago

Tangentially related, but the UK Gov Incubator for AI has quite a nifty LLM driven classification pipeline for survey answers.

https://github.com/i-dot-ai/consult

737max1d ago

Is it just me or do half these comments read like AI?
