VibeThinker-3B achieves 80.2 on LCBv6

(twitter.com)

8 points · moondistance · 10d ago · 3 comments

So I downloaded the model and tried a few math prompts. The simple addition was a little tedious because it checked multiple times that the calculation was right. I then gave it a fairly long integral that is straightforward if you know the techniques, and it got it in 5 minutes on my MacBook Pro (M4 Pro, 24 GB); I just had to increase the context window. Finally, I tried giving it a full math exam, but there it wouldn't score many points: it takes so many shortcuts that it writes wrong steps in its answers. Still pretty good, as it generally identifies what it should do, but I haven't tried anything in that weight class before, so I can't really say whether that's impressive in the bigger picture.

SwellJoe · 10d ago

I added this to my benchmark of models looking for Mythos-reported security bugs. Unsurprisingly, it found 0. There is, after all, a lower bound on how small a model can be and still find security bugs. https://swelljoe.com/post/will-it-mythos/

It does seem to reliably write working Python code, though, which is impressive for such a little guy.

moondistance (OP) · 10d ago

Paper: https://huggingface.co/papers/2606.16140
