hansvm · 1y ago

That's (obviously) a bit of an exaggeration. BERT is just another transformer architecture. Cut it down from ~100 layers to 1, from ~1k dimensions to ~10, and from ~10k tokens to 100, and you're only 1e6 times faster / more efficient. That still leaves the cost a factor of 10k above your estimate, and the resulting model is also too small to handle the detection you're describing with any reasonable degree of accuracy.
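For anyone who wants to check the napkin math, here is a minimal sketch in Python. It assumes cost scales linearly in each of the three quantities (that is what yields 1e6; a real transformer scales differently in width and sequence length), and it takes the disputed estimate to be the 0.00000001% figure quoted later in the thread. The variable names are illustrative only.

```python
from fractions import Fraction

# Approximate reduction factors quoted in the comment.
layer_reduction = Fraction(100, 1)        # ~100 layers -> 1
dim_reduction = Fraction(1_000, 10)       # ~1k dimensions -> ~10
token_reduction = Fraction(10_000, 100)   # ~10k tokens -> 100

# Assumption: cost is linear in each factor, so the factors simply multiply.
speedup = layer_reduction * dim_reduction * token_reduction

# Cost of the shrunken model as a fraction of the full-size model.
relative_cost = 1 / speedup

# The disputed estimate: 0.00000001% of the large model's cost.
claimed_cost = Fraction(1, 10**8) / 100

# How far the shrunken model's cost sits above the disputed estimate.
gap = relative_cost / claimed_cost

print("speedup:", speedup)
print("gap vs. estimate:", gap)
```

Under these assumptions the speedup comes out to 1e6 and the gap to 1e4, which matches the "factor of 10k" in the comment.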

griomnib · 1y ago

I literally have DistilBERT models that can do this exact task in ~14ms on an NVIDIA A6000. I don’t know the precise energy cost per inference, but it’s really fucking low.

I use LLMs to help with training data, as they are great at zero-shot, but once the training corpus is built, a small, well-trained model will smoke an LLM in classification accuracy and is way faster, which means you can get scale and a low carbon cost.

In my personal opinion there is a moral imperative to use the most efficient models possible at every step in a system design. LLMs are one type of architecture, and while they do a lot well, you can use a variety of energy-efficient techniques to do discrete tasks much better.

hansvm (OP) · 1y ago

Thanks for providing a concrete model to work with. Compared to GPT-3.5, the number you're looking for is ~0.04%. I pointed out the napkin math because 0.00000001% was so obviously wrong even at a glance that it was hurting your claim.
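The comment does not say how the ~0.04% was derived. One plausible route, sketched below, is a plain parameter-count ratio. Both counts are assumptions on my part: roughly 66M parameters for the base DistilBERT model, and 175B for GPT-3.5 (its size has not been officially disclosed, so this borrows the published GPT-3 figure).

```python
# Assumed parameter counts; the thread states neither of them.
DISTILBERT_PARAMS = 66_000_000      # assumption: base DistilBERT, ~66M parameters
GPT35_PARAMS = 175_000_000_000      # assumption: GPT-3-sized; not officially disclosed

# Size of the small model as a percentage of the large one.
ratio_percent = DISTILBERT_PARAMS / GPT35_PARAMS * 100

print(f"{ratio_percent:.2f}%")  # prints "0.04%"
```

A parameter ratio is only a rough proxy for energy per inference, since sequence length, batching, and hardware utilization all move the real number.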

And, yes, purpose-built models definitely have their place even with the advent of LLMs. I'm happy to see more people working on that sort of thing.

griomnib · 1y ago

I applaud you doing the math! Proves you aren’t an LLM :-D
