VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

(arxiv.org)

387 points · timhigins · 1d ago · 203 comments

155 comments · 35 top-level

secretslol1d ago· 33 in thread

Am I right in thinking this is a tiny model which has been trained well to reason, and that's it? Makes me think of a smart person who doesn't know anything about a given topic, but with the right tools will go and research the heck out of it. I really like the sound of this... why have models train on learning anything when you can just train them how to learn and let them get on with it from something as small as a Pi Zero and an internet connection.

numlock861d ago

This has been my dream ever since. Instead of encoding "all the knowledge" into those parameters, how about just making a model that has the same size, but all (or rather most) it does is reasoning? Just give it the ability to browse the net (e.g. language specifications, documentation and best practices) and just have it do its thing. Why does my coding agent need to know the population of New York, know a cheese cake recipe or the general lifespan of an ostrich? Just give it the bare minimum knowledge to think and reason about, and let it figure out the rest.

Sadly that's not how LLMs work, since all they do is "token prediction". At least the models we have today ...

dandaka1d ago

I think this is a well known concept, which we can't deliver yet. The LLM/transformer gives us a reasoning engine as a byproduct of its design, but it is quite ineffective. If we can distill reasoning, if reasoning can be achieved without general knowledge, it will be a very effective machine.

Some amount of knowledge is required for reasoning. Maybe such a model can dynamically load knowledge domains organized as a taxonomy. For example, a model can't effectively reason about a development task if it has no knowledge of development best practices. But the population of New York or recipes can definitely be loaded at run time with tools.

sigmoid101d ago

>Some amount of knowledge is required for reasoning.

This is the root of the problem. If you think about STEM universities, they don't really teach you things you need in the real world. They teach you what you need to know in order to go out there and accumulate the necessary information which can then be used to solve problems. Giving a person access to the internet or a super powerful calculator (like Mathematica) won't mean that they can do anything useful. They need tons of experience to use these tools in an effective way. That experience is basically all that implicit adjacent knowledge that we pick up along the way getting our degrees. And LLMs pick that up during pre-training. Drop this part and the outcome will be worthless.

XCSme1d ago

Yup, you still need knowledge. Even if you have access to all the data and tools, you still need to know what to search for, what tools to use and to understand what the user is asking.

Our computers can already do everything, have access to all the tools and information, yet they still need a human/intelligence to use it and apply to specific problems.

Even defining the problem requires knowledge.

As for the tools, if the model has access to 1000 tools, how would it know which one to use if it doesn't have any knowledge itself?

What if I ask about "table tennis spin" and it has a "magnus effect calculator"? How would it know to make the connection between the two?
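A toy illustration of that gap, assuming a naive selector that picks tools by keyword overlap. The tool names, descriptions and `pick_tool` helper are all hypothetical; the point is only that the match appears once someone (or something) with the relevant knowledge has written the connection into the tool description.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())

def pick_tool(query, tools):
    """Return the tool whose name + description shares the most words with the query.

    Returns None when no tool shares any word with the query.
    """
    query_words = tokenize(query)
    best_name, best_score = None, 0
    for name, description in tools.items():
        score = len(query_words & tokenize(name + " " + description))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical tool registry with purely technical descriptions.
bare_tools = {
    "magnus effect calculator": "lift force on a rotating sphere in airflow",
    "unit converter": "convert between metric and imperial units",
}

# Same registry, but the description now carries the domain knowledge.
annotated_tools = dict(bare_tools)
annotated_tools["magnus effect calculator"] += " such as a table tennis ball with spin"

print(pick_tool("table tennis spin", bare_tools))
print(pick_tool("table tennis spin", annotated_tools))
```

With the bare descriptions nothing matches; with the annotated one the calculator is selected. Either way the knowledge has to live somewhere.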

athrowaway3z1d ago

This is me vibe-splaining something I don't know a lot about, but I doubt there is such a thing.

If "all the knowledge" is what our models now do, what exactly would be the most extreme "none of the knowledge +search" ?

> language specifications.

It would load in all the knowledge to figure out what "language" means, then it would continue trying to decode what "specifications" means.

That might sound absurd, but to figure out the population of New York it's either just going to google it or derive it from primary sources.

But how is it ever going to interpret the primary sources? It needs to understand the question, how complex a question is, and how complete an answer is and how things relate. That's just _too_ much language.

There might be a way to compact this down into an LLM-native language such that the request of `the population of New York` or `use best practices` is encoded without our messy human language for a reasoning model to work with, but the encoding itself has to be done by the "all the knowledge" LLM. Now it seems we just rebuild something related to MoE with extra steps afaict.

tomaskafka1d ago

Education had this sad 15 year period where it thought “competences” are all you need.

Turns out that without the world knowledge to have a base of facts, it is not.

gmac1d ago

Basically: you can't teach people to think without giving them some facts and ideas to think with. It's like trying to teach woodworking without giving the students any wood.

g8oz23h ago

Competences were always supposed to be supported by demonstrable knowledge and skills and behavior.

So I don't think it's true that relevant knowledge was deprioritized. At least it wasn't supposed to be.

inopinatus1d ago

Any sufficiently general superintelligence will deduce the existence of rice pudding and income tax from Cartesian first principles.

3eb7988a16631d ago

It would also reduce training costs to nothing. Current methodology requires continual retraining to scoop up new facts. If you can do a one time "this is how to think" - that could conceptually work forever, just plug in a new database layer that can be queried as required.

fjsoxjdnwk1d ago

But isn’t that what “training” is anyway? They train LLM today like that and the database becomes the parameters. You can post train on smaller corpus for purpose-built stuff.

dminik1d ago

I mean, this really doesn't sound useful even if LLMs worked that way.

First, if you know nothing you don't even know what you're missing or what to search for.

Then, without unlimited context, you have to do research for every task all over again every time.

regularfry1d ago

> First, if you know nothing you don't even know what you're missing or what to search for.

RAG on the initial prompt would be the first thing to try.

> Then, without unlimited context, you have to do research for every task all over again every time.

Thing is, we're really really good at building very fast search engines. Doing research all over again every time shouldn't be a problem.
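As a rough sketch of what "RAG on the initial prompt" backed by a fast search index could look like: the snippet below builds a tiny inverted index and prepends the best-matching snippets to the prompt. The documents, function names and prompt template are made up for illustration, and word-overlap ranking stands in for a real search engine.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def retrieve(index, prompt, k=2):
    """Rank documents by how many distinct prompt words they contain."""
    hits = defaultdict(int)
    for word in set(prompt.lower().split()):
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    # Highest score first; ties broken by document id for determinism.
    return sorted(hits, key=lambda d: (-hits[d], d))[:k]

def augment(prompt, docs, index, k=2):
    """Prepend the retrieved snippets to the prompt before it reaches the model."""
    context = "\n".join(docs[d] for d in retrieve(index, prompt, k))
    return f"Context:\n{context}\n\nQuestion: {prompt}"

# Hypothetical knowledge base.
docs = {
    "ts-generics": "typescript generics let a function accept a type parameter",
    "py-typing": "python typing module provides generic type hints",
    "cheesecake": "a cheesecake recipe needs cream cheese and eggs",
}
index = build_index(docs)
print(augment("how do typescript generics work", docs, index, k=1))
```

Lookups are dictionary reads, so repeating the research step per task is cheap; whether the model knows what to ask for is the separate question raised above.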

scotty791d ago

The model they built knows a fair bit apparently. You can't get 94.3 on AIME26 knowing nothing.

LoganDark1d ago

Reasoning alone can’t always predict all the bits of knowledge you’d need to sufficiently solve a problem, that you would research when planning.

hypendev23h ago

Because reasoning is an emergent byproduct of training it on all knowledge. It still doesn't "know" things in this form and just generates tokens, no matter how weird we spin it.

So if you don't train it on a large dataset of a lot of words with a lot of sensible connections, it won't be able to reason, as it won't be able to make proper connections between words and sentences.

You can try training a really small model and seeing the gibberish outputs when you train it on only a small dataset.

Minmaxing the dataset to extract maximum generation with minimal data does sound like fun, but if you want to build SoTA models as a company, the economic tradeoff of doing that vs slapping a few more GPUs together is terrible.

cogman1022h ago

I think small expert models could be pretty powerful from open weight providers.

Imagine, for example, a model that's primarily trained on TypeScript and general programming. It would be faster to train and it could be a lot smaller than a generalist model. It might be the best model to pick when you are doing TypeScript programming. And if you could squeeze that into 3B parameters a lot of consumer hardware could run it locally.

You could even expand it to just "webdev tech" or the like.

Lerc1d ago

I think you could probably train a model to consider boolean logic, modal logic, and mathematics reasonably well, but there is still a pretty big leap between that and thinking about things.

Even the most basic questions such as put a ball in a cup and place it on a table upside down then pick up the cup and put it in a box.

Requires knowledge of things not mentioned in the question (notably gravity).

Strict definition of all terms quickly gets you into a quagmire of complexity. Some base level of knowledge about things is required for you to give it instructions. If it only knows how to reason, it lacks any idea of what to aim to achieve.

There is quite a pronounced disconnect between the vast stores of written data that models are trained on and robust consideration of a topic. I do wonder if the path can be directed by the order of training.

For example, if you train a model to basic literacy using tinystories, then math and philosophy texts, then psychology and sociology texts, and then finally the mass data of everything from conversations and rants to code and fiction.

Does that end up with a significantly different model to one that is trained on books on acting, creative writing, and fantasy novels, before introducing the same final mass data set?

How much does its current ability allow it to contextualise new training data?

placebo22h ago

>Even the most basic questions such as put a ball in a cup and place it on a table upside down then pick up the cup and put it in a box.

That reminds me - this used to be my go-to question for smaller models, on which they would always fail miserably:

A small strawberry is placed in a large cup. The cup is placed upside down on the kitchen table. Someone then lifts the cup as-is and puts it in the microwave. Where is the strawberry when the cup is in the microwave?

Here's what the 1.9GB VibeThinker-3B-GGUF:Q4_K_M answered:

Answer: The strawberry is still on the kitchen table – it fell out when the cup was turned upside‑down, and the subsequent lift‑and‑microwave move doesn’t change that.

So it seems there is definite progress here. Both specialized and yet improved common sense on things outside its domain of specialization.

Lerc21h ago

Is that learned common sense or has it learned the structure of that particular problem?

What happens if you ask

A small strawberry is placed in a large cup. The cup is placed upside down on a saucer on the kitchen table. Someone then lifts the cup and saucer as-is and puts them in the microwave. Where is the strawberry when the cup is in the microwave?

dmichulke22h ago

The hard part was always the number of 'r's

mejutoco1d ago

> Even the most basic questions such as put a ball in a cup and place it on a table upside down then pick up the cup and put it in a box.

I do not think this is a great example. First, it is not a question. Second, it seems very related to robotics. A model itself cannot put a ball anywhere, it can just call tools and answer in text, image, etc.

An LLM seeing "put a x in a y and place it on a z upside down then pick up the y and put it in a z2." and then a question about what happens could check a RAG store for properties of those x,y,z,z2 and still answer. Alternatively, this could be useful for coding, for example. And that is a very extreme example. Some basic language plus tool use could go quite far. I think it is a very interesting direction vs here is a GPU the price of a car.

Lerc21h ago

I wasn't explicitly stating the question; I was paraphrasing a common test question for world knowledge.

That you don't need to have a ball, cup, table, or even the ability to perform physical actions in order to consider where the ball ends up is in itself required knowledge.

grumbelbart223h ago

The thing is we tried that for decades, using more formal logic to build reasoning engines. And we never got it to be even a fraction as good and generic as learning-based LLMs are today.

kristjansson21h ago

other way around. it's trained to generate long CoT to reason through problems (and does it well!) but has ~no tool calling capability, and ~no ability to manage more than 1-2 messages.

see the warning at the top of https://huggingface.co/WeiboAI/VibeThinker-3B

giancarlostoro18h ago

I have been obsessed with the idea of this for a while, there's a Qwen with Opus reasoning distilled that works nicely as well. I think the next frontier is optimizing the models to be more capable on less hardware especially if it can learn on the fly.

kitd1d ago

"The right tools" in this case might presumably include, eg, a set of repos + docs and specs on the various technologies being used. Or a library of text/images and background docs on style and techniques used to create them.

That plus this model should give you a very powerful and focussed assistant.

seunosewa21h ago

The smaller the models are, the longer they have to reason when dealing with complex problems. The trade-off is real.

soulofmischief1d ago

Choosing between a model that can only "reason" and a model that has extensive knowledge and "reasoning", the latter will be undeniably better. Advanced reasoning requires cross-domain knowledge, superb pattern recognition, which can only be gained through the same mechanisms which give you a knowledgeable model.

Except for the most basic of tasks, such as "turn on my lights" or "cross-reference these two lists", I wouldn't trust a small model to be as conscientious and reliable as one with deep knowledge.

dominotw20h ago

> Am I right in thinking this is a tiny model which has been trained well to reason, and that's it?

I remember Karpathy mentioning this on the Dwarkesh podcast. But is reasoning really possible without all the knowledge?

supern0va20h ago

Even Karpathy acknowledged that this would require some baseline of human knowledge. The idea wasn't pure logic/reasoning, but some subset to bootstrap from.

witnessme19h ago

Sure it is small, 3B. But on Pi Zero, I can tell you from my experience, you'll be disappointed.

altmanaltman1d ago

Yeah, but don't you think the metaphor is an oversimplification if it assumes this model can do smart, human-level analysis and distillation of knowledge? I mean, if that were true (i.e. it's just like that) then yeah, there is no need for massive models, but I really doubt that.

Even recent massive models do not work anything like a smart human does at the moment so why are we assuming this can?

deftio1d ago· 24 in thread

There is some base level of intelligence any model needs to be useful, even in narrow tasks.

Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball. By the time a human is in their mid teens they have acquired the base knowledge...

Small models need to have enough base knowledge to be able to be good enough -- even in a seemingly narrow regime. Where is that? Obviously they don't need all the obscure knowledge of a frontier model but there is some base level which is probably more than it would first seem.

swiftcoder1d ago

> To drive a car requires being able to read

Emphatically, it does not. Passing your driver's test may require being able to read, but plenty of illiterate people around the world drive just fine.

There is a reason we made all the common road signs recognisable purely by shape/colour, after all.

varispeed20h ago

I don't think many drivers pay too much attention to signs apart from traffic lights.

avereveard1d ago

Until they reverse on a highway and kill a family. Being able to drive isn't where parent poster put the bar

swiftcoder1d ago

I don't see what reading has to do with knowing not to reverse on a highway. It's not like they put up big glowing signs that say "wrong way" like in a video game.

coldtea1d ago

Not reversing on a highway doesn't require reading, just driving sense.

And a whole lot of people have done stupid shit like that while perfectly able to read, many even with masters and PhDs.

delis-thumbs-7e23h ago

> To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball. By the time a human is in their mid teens they have acquired the base knowledge...

It is really strange to see comments like this here, where people seem to reduce a basic human action to how it would work in a text-only computer game. Driving itself mainly requires muscle memory for operating the car, which is why people who drive a lot can just go on autopilot and think about something completely different when driving long distances. That is of course a form of knowledge, but you only get it through repetition. Of course driving in traffic requires far more, such as a basic understanding of traffic law, but most of driving is muscle memory, understanding the vehicle and anticipating future occurrences. We apes are so good at this because we have some millions of years of evolution of just using our bodies and seeing what happens. And of course we have all seen the gif of an orangutan driving a golf cart (how real it is I’m uncertain), so there’s that.

I think it might help to think of models not as some future replicants, but as models with certain capabilities in certain domains. It probably doesn’t make much sense to ask Opus 4.8 to drive you around, just as it doesn’t make sense to expect a small image model made for edge devices to be able to write a novel. Perhaps we should just think of them as tools with certain applications they are made for.

ygjb1d ago

> Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball. By the time a human is in their mid teens they have acquired the base knowledge...

I would be interested to see a formal study of this. I say this based on nothing more than an observation: I think the only real blockers are a) judgement, and b) physical reflexes/strength. As a kid I was certainly aware of ice, snow, and rain, because I rode my bike year round and had low confidence in my own ability to control my bike on snowy or wet terrain, especially during season changes. That translated into learning to drive in northern Canada in the winter and applying those lessons to driving.

In an environment devoid of consequences, I have seen kids operate driving simulations (both real simulations, and video games) with a degree of precision that is shocking, including seeing several 9-11 year olds play the simulations and games with a much higher degree of confidence than adult drivers. Children have an awareness that the simulations are consequence free, unless given other motivation. Adults that are consistent drivers have muscle memory and preconceived expectations that govern the decisions they make when playing the game. I am curious about the level of training and exposure required for children to overcome their lack of awareness of the hard limits and consequences of driving and driver error, versus the amount of training and exposure required for expert drivers that are novice gamers to stop applying their learned experience to consequence free simulations.

attila-lendvai1d ago

if you don't sit in the car you lack a lot of information. driving without them is almost a different skill.

(i'm above average in both)

universa11d ago

A 10 year old definitely, and a 5 year old is close, but not unrealistic. To drive a car you don't need to be able to read... To drive a car on the road with other people is a whole other story :-)

3eb7988a16631d ago

I suspect plenty of five year olds can do a respectable job in Mario Kart, Gran Turismo, etc driving games. Gaming has too low of stakes to judge them on perfectly adhering to the rules of the road, but the ability is there.

smokel1d ago

Being able to drive a car properly also depends on having the right exploration-exploitation balance. A three-year-old is likely to explore too much in a situation where mistakes can be dangerous.

This requires not only knowledge, but also the control systems that develop with the prefrontal cortex. LLMs don't do much control yet.

threatripper1d ago

Ask people who grew up on a farm in a rural area. Sometimes you have to, even if you can't, and you do.

subscribed1d ago

I was driving a tractor since 12, including on the road with small farm equipment, and indeed, mostly out of the necessity, but I also received a lot of tuition (from licenced drivers) to know how to behave.

Different times though.

coldtea1d ago

Not that different. Still happens all the time all over the world (the west included).

madduci1d ago

True story, they can already drive a tractor at 10 and I know people who learned to drive a proper truck at 13 too

jmalicki1d ago

You can teach a dog to drive a car.

https://www.youtube.com/watch?v=BWAK0J8Uhzk

dkersten1d ago

And AI tried telling me that Uber for Dogs (dogs are the drivers) was a terrible idea…

satvikpendem1d ago

While I agree with your assessment, probably could've chosen a better example, as in many countries young kids even as young as 8 will learn how to drive.

embedding-shape1d ago

In some countries they even let kids as young as 16 drive, no wonder they have so many accidents.

swiftcoder1d ago

Several US states will give you a permit to drive a farm vehicle on public roads at 14. Illinois recently passed an amendment to allow farm kids to drive a semi-truck at 16. And there is absolutely no minimum age for driving so long as you are on private land - I have seen 8 year olds driving a pickup truck round a farm...

skeledrew1d ago

> To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball.

Conflation. That's to drive a car safely. To just drive a car one only need know to press gas to move, press brake to stop, turn steering wheel to change direction and maybe use a gear stick to shift into drive/park (car can be modified to abstract that away). Not much more complex than riding a bicycle; maybe even less since no need to learn to balance.

wilg1d ago

This is more of a question of the definition of "drive a car" than any specific issue about intelligence. Drive a car without errors? Impossible, and now we're into a subjective discussion about what feels intelligent. Pass the DMV test? Probably. How complicated are the conditions? There are plenty of drivers with bad judgement. It's a quicksand sort of discussion.

YetAnotherNick22h ago

> A 10 year old

https://www.youtube.com/watch?v=sLIAoW4QxIs

coldtea1d ago

>To drive a car requires being able to read

Millions of people do drive who can't read. It's very common in parts of Asia, Africa, Latin America, etc, especially rural, but even in cities.

There are places where oral exams and audio-assisted testing are allowed. And there are places where people just drive (and drive fine) not bothering with a license.

rbbydotdev1d ago· 18 in thread

Looks like we are seeing small but mighty model breakthroughs, outpacing the pure capital firepower of SOTA providers. I love rooting for the little guy, but is it too soon to call it? To play devil's advocate, could it just be that the benchmarks are not sufficient to capture the success of real developer workflows?

Catloafdev21h ago

I think people are going to continue to be surprised by the capability of small models.

Now, if you ask this model to have a conversation with you, it's gonna fail and be incoherent. But boy, does it sure reason through math problems well.

bakies1d ago

I've just started using qwen3.6:35b a couple days ago running on my framework desktop and am rather impressed. It runs really well and reminds me of probably the first Claude model I used. It's the first local model that's actually working for me in a coding agent I've tried. Very exciting!

smcleod1d ago

Try 27b, it's significantly smarter than 35b-a3b (although it is slower, it's not so bad with MTP).

stymaar17h ago

It is, but it's way too slow on a Strix Halo due to its limited bandwidth.

(I'm still sad that they didn't make a 122B-A10B version of it, as it's the kind of model that fits best on a Strix Halo, and for 3.5 it was comparable in performance to the dense 27B version).

ignoramous1d ago

At least according to gertlabs, Qwen3.6 27B outperforms every SoTA (closed) model at Kotlin: https://archive.vn/RYBCL / https://gertlabs.com/rankings?mode=agentic_coding&language=k...

bakies22h ago

Hmm, I just assumed bigger was better. How's it different?

diseasedyak23h ago

I'm running qwen36.:35b:iq4 IQ4_XS quant. Takes 18 GB of RAM with 131k context window. Seems to be really good. Have it running local stuff via Hermes, using a cloud model via Ollama (Deepseek V4-Pro) for heavy lifting.

tarruda22h ago

If your framework desktop is the 128G Strix Halo, I recommend giving Qwen 3.5 122B-A10B a shot.

This Q5_K_M quant should be near lossless and fit with full 256K context in about 100GB of RAM: https://huggingface.co/AesSedai/Qwen3.5-122B-A10B-GGUF
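A back-of-the-envelope check for figures like these: the size of quantized weights is roughly parameter count times average bits per weight, divided by 8. The bits-per-weight values below are rough assumptions (they are not from this thread or from any model card, and real GGUF files mix tensor types), and the estimate covers weights only, so KV cache and runtime overhead come on top.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# ASSUMED average bits per weight per quant type; actual values vary by model.
ASSUMED_BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.7}

# 122B parameters at Q5_K_M, and a nominal 3B parameters at Q4_K_M.
print(round(weight_size_gb(122e9, ASSUMED_BPW["Q5_K_M"]), 1))
print(round(weight_size_gb(3e9, ASSUMED_BPW["Q4_K_M"]), 1))
```

Under those assumptions the 122B weights land somewhat under 90GB, which would leave the remainder of a ~100GB budget for context.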

Catloafdev21h ago

3.6 scores better on coding across the board.

Edit: specifically Qwen 3.6 27B beats that on coding and agentic workflows.

bakies21h ago

I'll keep this in mind.

andy991d ago

Could you please share which coding agent you are using with it?

waezel23h ago

Crush: https://github.com/charmbracelet/crush/

The Q8_K_XL MTP model from Unsloth: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF

bakies22h ago

I settled on opencode after trying goose and aider as well. I'll probably try some more but opencode worked similar to Claude code which is my main agent.

I serve the model with ollama and am thinking about replacing ollama but haven't looked into it.

I have openwebui for chat if I want that too, but don't really use it.

oneshtein23h ago

npx @oh-my-pi/pi-coding-agent

npodbielski23h ago

I am using Mistral Vibe.

NamlchakKhandro23h ago

j451d ago

It feels sometimes like optimizations are only starting.

trollbridge23h ago

I’m beginning to suspect the closed SOTA labs were doing all these optimisations, keeping quiet about it, and just charging us out the yinyang for inference.

aero21461d ago· 13 in thread

I tried generating the classic pelican svg, but it failed horribly just showing me a rectangle and a black circle...

fwipsy1d ago

I think this is predicted? Part of the story is how they were able to preserve core reasoning ability while cutting knowledge like "pelicans have wings."

> these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios.

sheepscreek21h ago

So I think the takeaway here is, this is a super fast companion model to larger models, that reasons quickly. Perhaps this technique can be used to train a highly optimized reasoning "expert" in MoEs.

pylotlight1d ago

The only real essential item here is tool calling capability is it not? So I assume they tested a strong read/write/edit tool consistency?

nsingh21d ago

This model doesn't support tool calling; it was not part of its training. It's focused on Python (and I think C++) competitive programming and mathematics tasks, i.e. tasks with verifiable rewards. So if you have a task that fits that description, the size-to-capability ratio is good.

These kinds of models might be more useful as tools to be used by larger orchestrator models, than being the orchestrators themselves.

btown1d ago

I'm not seeing any mention of tools in the paper, much less a bias towards "curiosity" to use those tools when it encounters gaps in its knowledge. So perhaps this is a good proof-of-concept that single-pass code generation is viable with this small a model - but we're still a long way from a viable solution.

kristopolous1d ago

try it again but give a careful explanation of what a bicycle and a pelican are and how the pelican would sit atop the bicycle. Then give it a reference to the SVG tags you want it to use with documentation.

Here's what I got

https://9ol.es/tmp/pelican.png

with https://9ol.es/tmp/prompt_pelican.txt

using prithivMLmods/VibeThinker-3B-GGUF:Q4_K_M

realitysballs1d ago

That’s all I needed to hear

pylotlight1d ago

As in, you learnt that a useless test that no one should be using was tested here, that's what you meant right?

fransje2623h ago

right?

physPop1d ago

It's for reasoning, not generating art?

websap1d ago

Can you explain this a bit more?

tyre1d ago

Imagine you want to make a smaller model that is really good at one thing, say, driving a car. You could remove the parameters that lead it to correctly answer, "What is the powerhouse of the cell?" or, "Who was the first president of the United States?"

It would look really dumb if someone asked it that, but that's fine. You're trying to make a model that is optimized for efficiency for a specific task. As much as possible, you should prune uncorrelated things.

pylotlight1d ago

SVG generation is a useless test, what more is there to know?

gslepak1d ago· 8 in thread

Note that these are Python-only results, the model will not do as well with other languages.

I'm glad to see more domain-focused SLMs, we need more of them! A programming focused MoE should work well across many languages.

rcarmo1d ago

If it writes functional Python instead of cosplaying as a Java programmer and cramming code with classes and accessors, it's already better than Opus...

nsingh21d ago

Lots of confusion about what this model is actually focused on.

It is a cheap specialist for closed-world, verifiable reasoning tasks like math, self-contained coding problems, and similar.

"Closed-world" means the needed information is already in the context. It is not a tool-using agent that can discover missing context. "Verifiable" means answers are hard to generate but easy to check.

So no open ended research, repo wide agent work, factual Q&A, or SVG generation. More of a compact reasoning module for bounded problems.

nsingh21d ago

To follow up on this, I had it solve a nasty ODE problem that I saw in the recent Mathematica 15 release post:

    Solve the following first-order ODE for f(x):

    ((-1 - 2*x)*f(x)*tan(1 + x - exp(-61 - 2*x)*f(x)/x)
    + exp(61 + 2*x)*x*(1 - x*tan(1 + x - exp(-61 - 2*x)*f(x)/x))
    + x*tan(1 + x - exp(-61 - 2*x)*f(x)/x)*f'(x)) = 0

    Find the general solution f(x).

And surprisingly it found a valid solution! Extra impressive because it runs at 25 tok/s on my measly RTX 2070 Super.

    f(x) = x*exp(61 + 2*x)*(1 + x - arccos(C/x))

    C is an arbitrary constant.

Apparently Mathematica 14.3 couldn't solve this ODE.
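A claimed solution like this can be sanity-checked without a CAS by plugging it back into the ODE numerically. The sketch below is an illustration, not something posted in the thread: the helper names, the sample points and the choice C = 1 are assumptions, and f'(x) is approximated with a central finite difference, so the residual is expected to be tiny but not exactly zero. Because exp(61 + 2*x) makes every term enormous, the residual is divided by the sum of the term magnitudes. A deliberately wrong candidate is included as a negative control.

```python
import math

def candidate(x, c=1.0):
    """Solution quoted above: f(x) = x*exp(61 + 2x)*(1 + x - arccos(C/x)), with C assumed to be 1."""
    return x * math.exp(61 + 2 * x) * (1 + x - math.acos(c / x))

def wrong_candidate(x):
    """Deliberately wrong guess (arccos term dropped), used as a negative control."""
    return x * math.exp(61 + 2 * x) * (1 + x)

def relative_residual(f, x, h=1e-6):
    """Left-hand side of the ODE at x, divided by the sum of its term magnitudes."""
    fx = f(x)
    dfx = (f(x + h) - f(x - h)) / (2 * h)  # central difference for f'(x)
    t = math.tan(1 + x - math.exp(-61 - 2 * x) * fx / x)
    terms = [
        (-1 - 2 * x) * fx * t,
        math.exp(61 + 2 * x) * x * (1 - x * t),
        x * t * dfx,
    ]
    return abs(sum(terms)) / sum(abs(term) for term in terms)

# Sample points chosen so that C/x stays inside the domain of arccos.
for x in (2.0, 3.0, 5.0):
    print(x, relative_residual(candidate, x), relative_residual(wrong_candidate, x))
```

A near-zero residual at a few points is evidence, not proof; a symbolic substitution would be the stronger check.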

kanbankaren1d ago

Qwen 3.5 9B Q4_K_M solved this using 10K tokens in 5 mins on an RX 7600.

The answer is exactly what you have posted. I am impressed by Qwen!

le-mark1d ago

How do you know it’s a valid solution? Are you able to verify it yourself?

kame3d1d ago

Interesting!

I just tried the quantized Q4_K_M from [1] on my RTX 2070 Super, it ran at 110 tok/s with 1800 tok/s prefill, and found the same solution to your prompt. It generated valid LaTeX for the answer but its reasoning trace uses mostly compact ASCII math notation. Took 3min 22s to answer, spending 22k tokens almost all on thinking.

[1] https://huggingface.co/prithivMLmods/VibeThinker-3B-GGUF

trick-or-treat1d ago

How do we know the solution isn't in the weights though?

skeledrew1d ago

If it can code well then once you put it in a loop with an interpreter it can do anything.

noperator1d ago· 4 in thread

Having some success while testing this model out as a replacement for GPT-5 nano in source code security review. Running on RTX 3090 (24 GB VRAM) via vLLM. It's not great on structured output (as noted in the model card) but I'm working around that in my harness.

dummydummy12341d ago

Can't you just force it to do structured output via constrained generation?

noperator20h ago

Yes, I did end up figuring out a clean way to allow normal reasoning inside <think> and then force JSON _after_ the closing </think>. Example here: https://gist.github.com/noperator/6c711ab19027ea8056442df839...

hypfer1d ago

> but I'm working around that in my harness.

How?

uberex1d ago

Maybe limiting logits to what is syntactically correct? E.g. {"hello" has to be followed by whitespace or colon. Any other logits get dropped.
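
The logit-masking idea can be sketched with a toy example. This is a minimal illustration, assuming logits are a plain dict from token string to score; `mask_logits`, `allowed_after`, and `pick_token` are hypothetical names, and the "grammar" only covers the single case from the comment (the first key of a JSON object must be followed by whitespace or a colon). Real constrained decoding in an inference server works on token IDs and a full grammar.

```python
import math


def mask_logits(logits, allowed):
    """Return a copy of logits with every disallowed token set to -inf."""
    return {tok: (score if tok in allowed else -math.inf)
            for tok, score in logits.items()}


def allowed_after(prefix):
    """Toy grammar (hypothetical): right after the first closed key of a JSON
    object, only a space or ':' may follow. Returns None for 'no constraint'."""
    text = prefix.rstrip()
    if text.startswith("{") and text.endswith('"') and text.count('"') == 2:
        return {" ", ":"}
    return None


def pick_token(prefix, logits):
    """Greedy choice of the next token after applying the toy grammar mask."""
    allowed = allowed_after(prefix)
    if allowed is not None:
        logits = mask_logits(logits, allowed)
    return max(logits, key=logits.get)


if __name__ == "__main__":
    scores = {"!": 5.0, ":": 1.0, " ": 0.5}
    print(repr(pick_token('{"hello"', scores)))  # the mask removes "!"
    print(repr(pick_token('{"hel', scores)))     # no constraint applies here
```

The same pattern extends to the "free reasoning, then forced JSON" setup mentioned above: apply no mask until the closing think tag has been generated, then switch the mask on.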

achrono22h ago· 4 in thread

Beats Opus 4.5 on reasoning you say?

Prompt: If A goes to B who then goes to C, can A send something to C?

Response:

We need to interpret best. The phrase "If A goes to B who then goes to C, can A send something to C?" could be a puzzle about the concept of sending something (like passing a ball) and the relationships.

Scenario: A gives something to B, and B passes it on to C. Question: Can A also give the same thing to C? Answer: Only if A can obtain a second copy (e.g., the thing was duplicated). Otherwise, after handing it to B, A no longer holds it and cannot “send” it unless a copy exists.

[Lots of other unnecessary commentary and "scenarios" that make even less sense]

rapatel021h ago

Ran the same query and there is a ton of stuff, but it looks like it's reasoning through the ambiguity of the sentence. It still gets the right answer. Moreover, if we consider the FLOPs expended to get to the answer, and compare that to opus, I think it's still a net win.

My hunch is that Opus-scale models probably have shortcuts encoded into the model that handle these ambiguous cases, whereas this model has learned a program to reason through the edge case (crystallized vs fluid intelligence): remembering that probability (frontier) vs calculating it on the fly (VibeThinker).

nolist_policy21h ago

> Multi-level Quality Control.

> [...]

> LLM-based Query Quality Filtering. We utilize capable LLMs to assess query quality, filtering out samples with incomplete descriptions, unreasonable conditions, invalid logic, or an inability to effectively assess target knowledge points.

erdevs22h ago

I am a human and I don't know how to interpret this prompt.

postalrat18h ago

If A goes to B who then goes to C does C know A?

androiddrew1d ago· 4 in thread

I have been thinking about how to use this. Since it doesn’t support tool calling I have been considering a dual model deployment, where a small tool calling llm drives the majority of the user experience, and vibe thinker is tapped for reasoning by the other llm.

So who has suggestions on small models with excellent tool calling capabilities?
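
The dual-model setup described above could be wired up roughly like this. It is a minimal sketch under stated assumptions: both models are represented as plain callables from prompt to text, and the `REASON:` marker convention and the name `make_orchestrator` are hypothetical, invented here for illustration rather than taken from any real serving API.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def make_orchestrator(driver: Model, reasoner: Model,
                      marker: str = "REASON:") -> Model:
    """Build a chat function where the driver model may delegate one
    self-contained problem per turn to the reasoning model."""

    def run(user_message: str) -> str:
        draft = driver(user_message)
        if draft.startswith(marker):
            # The driver asked for help: pass only the bounded problem along.
            problem = draft[len(marker):].strip()
            answer = reasoner(problem)
            # Hand the result back so the driver writes the final reply.
            return driver(f"{user_message}\n\nReasoner result: {answer}")
        return draft

    return run
```

The design choice here is that the reasoner never sees tools or chat history, only the problem text, which matches the closed-world use the model card recommends.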

smallerize1d ago

Gemma 4 E4B and Qwen 3 4B are pretty good, but fine-tuning makes them really good. There are tradeoffs at this size, so you'll have to find (or make) a finetune that does what you need.

j-bos1d ago

Maybe bonsai 8b would make the duo, if you do try it, pls post here as I'm a bit curious too.

scotty7918h ago

Qwen3.6-35B-A3B is pretty amazing. I'm using it with 96k context on 24GB VRAM through ollama.

reddec1d ago

granite 4

NotSuspicious1d ago· 3 in thread

The interesting thing about models this small is they should be able to be put on a single Taalas chip (the HC1 already runs a Llama 3.1 8B model). We're already at the point where half-decent reasoning could be run on an ASIC (and at mind-boggling speeds).

pants21d ago

Yeah, if they can fit an 8B model that's really good at improving the output by thinking, running at 16K tok/s on Taalas would be mind-blowing.

le-mark1d ago

Given this and the quality of open models, it makes no sense to me that there’s a future for Anthropic et al.?

james_marks1d ago

Packaging a capability into a consumable form will still be business.

It's like web hosting; all the open source tools are there and free, and yet website tools, hosts, etc flourish.

1 more reply

yousif_12312322h ago· 3 in thread

I really hope that in a couple of years I can have a laptop that runs a reasonably good coding agent locally, that I can run fast and do most of my programming with, without running my laptop hot. I could keep open code and use other models when needed, but really for most of my work, I'm already breaking it down so that I can review code changes eventually, and I just need something reasonably decent and fast and unlimited. I think it's coming.

alkonaut22h ago

I hope so too. But I fear that it will feel inadequate if we know there is always a $20 online model that is an order of magnitude better. I don't think there will be a "good enough" local model so long as frontier models look so much better.

vadansky22h ago

Seems like most people have settled on Opus 4.6 as the breaking point (me as well).

Once I can spend 10k to run Opus 4.6 at home, I'm done.

yogthos22h ago

I'm very optimistic here as well. And it's also worth noting that tooling is improving along with the models. I really think we have to treat models and tools as a package. The model is your engine, but you need a chassis to run it.

I find what makes frontier models actually work well isn't just the capability of the model, but how well the harness is tuned to its expectations. I wrote about this in a bit more detail here: https://yogthos.net/posts/2026-06-08-dirge-code.html

uberex1d ago· 3 in thread

What is the idiot's guide to running this one locally now?

yousif_1231231d ago

Use LM Studio.

uberex16h ago

how do I get these weights in particular?

Landing76101d ago

omlx makes it quite easy

brainless1d ago· 2 in thread

I recently came across this model and I would love to try it with my coding agent soon.

I really like the idea of small models that can reason but do not have too much knowledge. Also, no emphasis on tool calls. I think the agent should do the heavy lifting and reach half way.

I use really small models, like Qwen 3.5 0.8B to 9B - no tool calling, no MCP, no skills, nothing. No multi-turn chat even. Models are given very specific tasks using a vast number of system prompts and all the response handling is done in the agent(s).

https://github.com/brainless/nocodo

SubiculumCode22h ago

Maybe no tool calling, but seems it could be really good at deciding which tool to use and when?

brainless21h ago

That is a good point. I do think these models would be good in the decision making. The large models are trained to use tool calling. Perhaps the small models can generate the text that would express their decision but not generate good JSON to reply with correct syntax. I do not know but this is my hunch.

SwellJoe1d ago· 1 in thread

It's terrible at hunting security bugs (I expected it to be, but I wanted to be sure). I added it to a benchmark I made with a corpus of some Mythos-discovered bugs, and it found zero. The smallest pretty successful models remain Qwen 3.6 and Gemma 4 (but I haven't tested the very small variants of those yet).

https://swelljoe.com/post/will-it-mythos/

nsingh21d ago

The lack of tool use will hinder it a lot I think, since bug hunting requires collecting context across a code base and stitching it together. It might be good in a more narrow sense, i.e "is there a bug in this block of code" and not considering how it interacts with the rest of the code base.

That's also more aligned to its leetcode style training data, the code under test is fully in the context window. It might be interesting to have a bigger tool use model go through the effort of collecting the context, and feeding it into this kind of model for analysis only. It becomes more of a thinking tool, instead of the orchestrator.

darkoob125h ago

I still cannot trust evaluations and benchmarks. How can you prove that the test datasets are truly unseen examples?

I think the only way to prove that these models are truly as good as they claim is to wait and see if they are getting adopted in practice.

1 more reply

mvitorino17h ago

Really enjoying seeing these really capable SLMs. Note that on HF they state: "This model was not trained on tool-calling or agent-based programming data. We therefore do not recommend using it for tasks that involve function calling, API orchestration, or autonomous coding agents." - https://huggingface.co/WeiboAI/VibeThinker-3B So we can't just hook it up to a coding harness like pi.dev or something.

troglodytetrain17h ago

Sounds like something that could be pretty useful as a 'validation' subagent. Provide it the details/context related to a larger LLM's run or turn in a harness and have it act as a gatekeeper. At this size and speed it looks like it could be economical to have it run every turn or even every tool call and inform the main agent about the result and success/failure.
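
A gatekeeper of this kind might look like the following. This is a minimal sketch, assuming the validator model is exposed as a plain callable from prompt to text; the `Verdict` type, the `gate_turn` name, and the PASS/FAIL reply convention are hypothetical choices made for illustration.

```python
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    ok: bool
    reason: str


def gate_turn(validator: Callable[[str], str], task: str,
              tool_output: str) -> Verdict:
    """Ask a small validator model whether one turn or tool call succeeded."""
    prompt = (
        f"Task: {task}\n"
        f"Result: {tool_output}\n"
        "Reply PASS or FAIL on the first line, then a one-line reason."
    )
    reply = validator(prompt).strip()
    first_line, _, rest = reply.partition("\n")
    ok = first_line.upper().startswith("PASS")
    # Fall back to the first line when no separate reason was given.
    return Verdict(ok, rest.strip() or first_line)
```

The main agent would call `gate_turn` after each turn or tool call and retry or escalate when `ok` is false.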

sorenjan1d ago

How would you best utilize a model like this for coding? I take it it's not meant for vibe coding a full app, and the reasoning probably makes it unsuitable for autocomplete. Would you use it to implement specific functions? I looked at one of the coding benchmarks used, Live Code Bench, and it seems to be problem descriptions with sample input and output, and then a solution with a single function or class.

Seems like a really good model to use in an IDE when you still want control over the code structure then.

2 more replies

virajk_311d ago

SLM when trained for single use case often beats the LLM. That's both the advantage and limitation.

nolist_policy22h ago

Notable:

  VibeThinker-3B is developed through a staged post-training pipeline built upon Qwen2.5-Coder-3B base, a compact 3B foundation model.

Qwen2.5 is ancient by LLM standards.

iamgopal1d ago

Two models: one optimised for system, reasoning, etc., the second optimised for a specific language (Rust or Go?), both small enough to run on a local computer. Will it work?

delis-thumbs-7e18h ago

I gave this a run on llama.cpp locally. My GPU is a GeForce 1080, so I needed a quantized version for even such a small model and…

This. Is. Amazing. I am flabbergasted.

I am not into the whole GenAI thing and I have very little need for anything agentic, but Python, C++ and Maths is exactly what I mostly used these for, so this might actually become my main work horse. This is so cool.

I even used it for stuff it is not built for, asking complex questions on history (Battle of Tours 732) and literature (Joyce’s Portrait of Artist), and it was surprisingly good, even though it started to hallucinate names and details (such as claiming Joyce’s father was a priest). For 3B I expected it to mainly spout complete nonsense.

nickalaso14h ago

So I went ahead and quickly vibecoded a working harness with a barebones tool interface and some constraints on output (credit to noperator for the idea). github: https://github.com/NickalasLight/VibeHarness.git

It's meant for a Windows machine using ollama, but I'm sure anyone who wants to mess with it can point Claude Code at it to convert it for their own operating system and requirements. After install you can ask it to do something with "vibe 'create me a poem about cheese in cheese.txt'". Its workspace is by default the directory the CLI was located in when you called it.

tracerbulletx17h ago

Man just need something like this with tool calling.

andai19h ago

I tried actually talking to it. It reminded me of GPT-2.

1 more reply

jpcompartir23h ago

The absolute worst name for a model I've seen

cold_harbor1d ago

GRPO skips the value network that makes PPO expensive — it scores candidates relative to each other within a group. that's what makes verifiable-reward training practical at 3B scale
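
The group-relative scoring described above can be sketched as follows. This is a minimal illustration of the normalization commonly used in GRPO (reward minus group mean, divided by group standard deviation), assuming scalar rewards from a verifier; the function name, the `eps` term, and the use of the population standard deviation are choices made here, not details taken from the VibeThinker paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to the other answers to the same
    prompt, so no learned value network is needed as a baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumption)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one prompt; a verifier marked two as correct.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct answers get a positive advantage and incorrect ones a negative advantage. When every answer in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal.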

1 more reply

makethembroke19h ago

I don't get how this beats Opus. It just hardcoded the tasks for the benchmarks; it doesn't even respond normally.

There's a lot of randomness in it.

Please don't hype

unfirehose1d ago

this is a good model. I benchmark reasoned answers to qwen 3.6 27b (no think)+ bash and it held up.

diimdeep22h ago

BF16 with no QAT quants == half-baked bread

scotty791d ago

If you could pair it somehow with a model that can code and describe code this could be a very powerful combo.

anonyfox1d ago

Wake me up when it does OCaml fine.

4gotunameagain23h ago

What are the implications of local SOTA inference, given the insane datacenter "investing" ?

It surely cannot be justified by training alone at this scale, especially since models nowadays are improved more by fine-tuning than by re-training from scratch.

Will a viable local model crash the US economy ?

More importantly, are the LLM companies aware, and are they deliberately buying out all the RAM and GPUs in order to prolong the inevitable ? Probably not, but I wouldn't be surprised if that is the case.

maxignol1d ago

3B params on par with Opus 4.5 sounds interesting. Will read the full article before making up my mind.

zkmon1d ago

Does python coding depend on political facts of the world?

It might appear not, but actually, the process of reasoning is not an isolated act. The right and wrong way of doing things is codified in social evolution that absorbed all facets of life. Why should you optimize a piece of code for performance? Why is performance needed? What is a bug? What features and UI themes would be more intuitive for humans?

There is a butterfly effect. Everything affects everything to some extent.

2 more replies

kmchandy22h ago

The paper makes a clear claim: "it provides an important and concrete proof: on well-constrained, verifiable reasoning tasks, first-tier performance is no longer the exclusive domain of ultra-large models" And that's exciting.


secretslol1d ago· 33 in thread

numlock861d ago

Sadly that's not how LLMs work, since all they do is "token prediction". At least the models we have today ...

dandaka1d ago

sigmoid101d ago

>Some amount of knowledge is required for reasoning.

1 more reply

XCSme1d ago

Yup, you still need knowledge. Even if you have access to all the data and tools, you still need to know what to search for, what tools to use and to understand what the user is asking.

Our computers can already do everything, have access to all the tools and information, yet they still need a human/intelligence to use it and apply to specific problems.

Even defining the problem requires knowledge.

As for the tools, if the model has access to 1000 tools, how would it know which one to use if it doesn't have any knowledge itself?

What if I ask for "table tennis spin" and it had a "magnus effect calculator", how would it know to make the connection between the two?

1 more reply

athrowaway3z1d ago

This is me vibe-splaining something I don't know a lot about, but I doubt there is such a thing.

If "all the knowledge" is what our models now do, what exactly would be the most extreme "none of the knowledge +search" ?

> language specifications.

It would load in all the knowledge to figure out what "language" means, then it would continue trying to decode what "specifications" means.

That might sound absurd, but to figure out the population of New York it's either going to just google it or derive it from primary sources.

tomaskafka1d ago

Education had this sad 15 year period where it thought “competences” are all you need.

Turns out that without the world knowledge to have a base of facts, it is not.

gmac1d ago

Basically: you can't teach people to think without giving them some facts and ideas to think with. It's like trying to teach woodworking without giving the students any wood.

1 more reply

g8oz23h ago

Competences were always supposed to be supported by demonstrable knowledge and skills and behavior.

So I don't think it's true that relevant knowledge was deprioritized. At least it wasn't supposed to be.

inopinatus1d ago

Any sufficiently general superintelligence will deduce the existence of rice pudding and income tax from Cartesian first principles.

3eb7988a16631d ago

fjsoxjdnwk1d ago

But isn’t that what “training” is anyway? They train LLM today like that and the database becomes the parameters. You can post train on smaller corpus for purpose-built stuff.

dminik1d ago

I mean, this really doesn't sound useful even if LLMs worked that way.

First, if you know nothing you don't even know what you're missing or what to search for.

Then, without unlimited context, you have to do research for every task all over again every time.

regularfry1d ago

> First, if you know nothing you don't even know what you're missing or what to search for.

RAG on the initial prompt would be the first thing to try.

> Then, without unlimited context, you have to do research for every task all over again every time.

Thing is, we're really really good at building very fast search engines. Doing research all over again every time shouldn't be a problem.

1 more reply

scotty791d ago

The model they built knows a fair bit apparently. You can't get 94.3 on AIME26 knowing nothing.

LoganDark1d ago

Reasoning alone can’t always predict all the bits of knowledge you’d need to sufficiently solve a problem, that you would research when planning.

hypendev23h ago

Because reasoning is an emergent byproduct of training it on all knowledge. It still doesn't "know" things in this form and just generates tokens, no matter how weird we spin it.

You can try training a really small model and seeing the gibberish outputs when you train it on only a small dataset.

cogman1022h ago

I think small expert models could be pretty powerful from open weight providers.

You could even expand it to just "webdev tech" or the like.

Lerc1d ago

I think you could probably train a model to consider boolean logic, modal logic, and mathematics reasonably well, but there is still a pretty big leap between that and thinking about things.

Even the most basic questions such as put a ball in a cup and place it on a table upside down then pick up the cup and put it in a box.

Requires knowledge of things not mentioned in the question (notably gravity).

Does that end up with a significantly different model to one that is trained on books on acting, creative writing, and fantasy novels, before introducing the same final mass data set.

How much does its current ability allow it to contextualise new training data?

placebo22h ago

>Even the most basic questions such as put a ball in a cup and place it on a table upside down then pick up the cup and put it in a box.

That reminds me - this used to be my go-to question for smaller models, and they would always fail miserably on it:

Here's what the 1.9GB VibeThinker-3B-GGUF:Q4_K_M answered:

Answer: The strawberry is still on the kitchen table – it fell out when the cup was turned upside‑down, and the subsequent lift‑and‑microwave move doesn’t change that.

So it seems there is definite progress here. Both specialized and yet improved common sense on things outside its domain of specialization.

Lerc21h ago

Is that learned common sense or has it learned the structure of that particular problem?

What happens if you ask

dmichulke22h ago

The hard part was always the number of 'r's

mejutoco1d ago

> Even the most basic questions such as put a ball in a cup and place it on a table upside down then pick up the cup and put it in a box.

Lerc21h ago

I wasn't explicitly stating the question; I was paraphrasing a common test question for world knowledge.

That you don't need to have a ball, cup, table, or even the ability to perform physical actions in order to consider where the ball ends up is in itself required knowledge.

grumbelbart223h ago

The thing is we tried that for decades, using more formal logic to build reasoning engines. And we never got it to be even a fraction as good and generic as learning-based LLMs are today.

1 more reply

kristjansson21h ago

other way around. it's trained to generate long CoT to reason through problems (and does it well!) but has ~no tool calling capability, and ~no ability to manage more than 1-2 messages.

see the warning at the top of https://huggingface.co/WeiboAI/VibeThinker-3B

giancarlostoro18h ago

kitd1d ago

That plus this model should give you a very powerful and focussed assistant.

seunosewa21h ago

The smaller the models are, the longer they have to reason when dealing with complex problems. The trade-off is real.

soulofmischief1d ago

Except for the most basic of tasks, such as "turn on my lights" or "cross-reference these two lists", I wouldn't trust a small model to be as conscientious and reliable as one with deep knowledge.

dominotw20h ago

> Am I right in thinking this is a tiny model which has been trained well to reason, and that's it?

I remember Karpathy mentioning this on the Dwarkesh podcast. But is reasoning really possible without all the knowledge?

supern0va20h ago

Even Karpathy acknowledged that this would require some baseline of human knowledge. The idea wasn't pure logic/reasoning, but some subset to bootstrap from.

witnessme19h ago

Sure it is small, 3B. But on Pi Zero, I can tell you from my experience, you'll be disappointed.

altmanaltman1d ago

Even recent massive models do not work anything like a smart human does at the moment so why are we assuming this can?

deftio1d ago· 24 in thread

There is some base level of intelligence any model needs to be useful, even in narrow tasks.

swiftcoder1d ago

> To drive a car requires being able to read

Emphatically, it does not. Passing your drivers test may require being able to read, but plenty of illiterate people around the world drive just fine.

There is a reason we made all the common road signs recognisable purely by shape/colour, after all.

varispeed20h ago

I don't think many drivers pay too much attention to signs apart from traffic lights.

avereveard1d ago

Until they reverse on a highway and kill a family. Being able to drive isn't where parent poster put the bar

swiftcoder1d ago

I don't see what reading has to do with knowing not to reverse on a highway. It's not like they put up big glowing signs that say "wrong way" like in a video game.

3 more replies

coldtea1d ago

Not reversing on a highway doesn't require reading, just driving sense.

And a whole lot of people have done stupid shit like that while perfectly able to read, many even with masters and PhDs.

delis-thumbs-7e23h ago

ygjb1d ago

attila-lendvai1d ago

if you don't sit in the car you lack a lot of information. driving without them is almost a different skill.

(i'm above average in both)

universa11d ago

A 10-year-old definitely, and a 5-year-old is close, but not unrealistic. To drive a car you don't need to be able to read... To drive a car on the road with other people is a whole other story :-)

3eb7988a16631d ago

smokel1d ago

Being able to drive a car properly also depends on having the right exploration-exploitation balance. A three-year-old is likely to explore too much in a situation where mistakes can be dangerous.

This requires not only knowledge, but also the control systems that develop with the prefrontal cortex. LLMs don't do much control yet.

threatripper1d ago

Ask people who grew up on a farm in a rural area. Sometimes you have to even if you can't and you do.

subscribed1d ago

Different times though.

coldtea1d ago

Not that different. Still happens all the time all over the world (the west included).

madduci1d ago

True story, they can already drive a tractor at 10 and I know people who learned to drive a proper truck at 13 too

jmalicki1d ago

You can teach a dog to drive a car.

https://www.youtube.com/watch?v=BWAK0J8Uhzk

dkersten1d ago

And AI tried telling me that Uber for Dogs (dogs are the drivers) was a terrible idea…

satvikpendem1d ago

While I agree with your assessment, probably could've chosen a better example, as in many countries young kids even as young as 8 will learn how to drive.

embedding-shape1d ago

In some countries they even let kids as young as 16 drive, no wonder they have so many accidents.

swiftcoder1d ago

1 more reply

skeledrew1d ago

> To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball.

wilg1d ago

YetAnotherNick22h ago

> A 10 year old

https://www.youtube.com/watch?v=sLIAoW4QxIs

coldtea1d ago

>To drive a car requires being able to read

Millions of people do drive who can't read. It's very common in parts of Asia, Africa, Latin America, etc, especially rural, but even in cities.

There are places where oral exams and audio-assisted testing are allowed. And there are places where people just drive (and drive fine) not bothering with a license.

rbbydotdev1d ago· 18 in thread

Catloafdev21h ago

I think people are going to continue to be surprised by the capability of small models.

Now, if you ask this model to have a conversation with you, it's gonna fail and be incoherent. But boy, does it sure reason through math problems well.

bakies1d ago

smcleod1d ago

Try 27b, it's significantly smarter than 35b-a3b (although it is slower, it's not so bad with MTP).

stymaar17h ago

It is, but it's way too slow on a Strix Halo due to its limited bandwidth.

(I'm still sad that they didn't make a 122B-A10B version of it, as it's the kind of model that fits best on a Strix Halo, and for 3.5 it was comparable in performance to the dense 27B version).

2 more replies

ignoramous1d ago

At least according to gertlabs, Qwen3.6 27B outperforms every SoTA (closed) model at Kotlin: https://archive.vn/RYBCL / https://gertlabs.com/rankings?mode=agentic_coding&language=k...

2 more replies

bakies22h ago

Hmm, I just assumed bigger was better. How's it different?

2 more replies

diseasedyak23h ago

tarruda22h ago

If your framework desktop is the 128G Strix Halo, I recommend giving Qwen 3.5 122B-A10B a shot.

This Q5_K_M quant should be near lossless and fit with full 256K context in about 100GB of RAM: https://huggingface.co/AesSedai/Qwen3.5-122B-A10B-GGUF

Catloafdev21h ago

3.6 scores better on coding across the board.

Edit: specifically Qwen 3.6 27B beats that on coding and agentic workflows.

1 more reply

bakies21h ago

I'll keep this in mind.

andy991d ago

Could you please share which coding agent you are using with it?

waezel23h ago

Crush: https://github.com/charmbracelet/crush/

The Q8_K_XL MTP model from Unsloth: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF

bakies22h ago

I settled on opencode after trying goose and aider as well. I'll probably try some more but opencode worked similar to Claude code which is my main agent.

I serve the model with ollama and am thinking about replacing ollama but haven't looked into it.

I have openwebui for chat if I want that too, but don't really use it.

oneshtein23h ago

npx @oh-my-pi/pi-coding-agent

npodbielski23h ago

I am using Mistral Vibe.

NamlchakKhandro23h ago

j451d ago

It feels sometimes like optimizations are only starting.

trollbridge23h ago

I’m beginning to suspect the closed SOTA labs were doing all these optimisations, keeping quiet about it, and just charging us out the yinyang for inference.

aero21461d ago· 13 in thread

I tried generating the classic pelican svg, but it failed horribly just showing me a rectangle and a black circle...

fwipsy1d ago

I think this is predicted? Part of the story is how they were able to preserve core reasoning ability while cutting knowledge like "pelicans have wings."

sheepscreek21h ago

pylotlight1d ago

The only real essential item here is tool calling capability is it not? So I assume they tested a strong read/write/edit tool consistency?

nsingh21d ago

These kinds of models might be more useful as tools to be used by larger orchestrator models, than being the orchestrators themselves.

btown1d ago

kristopolous1d ago

Here's what I got

https://9ol.es/tmp/pelican.png

with https://9ol.es/tmp/prompt_pelican.txt

using prithivMLmods/VibeThinker-3B-GGUF:Q4_K_M

realitysballs1d ago

That’s all I needed to hear

pylotlight1d ago

As in, you learnt that a useless test that no one should be using was tested here, that's what you meant right?

fransje2623h ago

right?

physPop1d ago

It's for reasoning, not generating art?

websap1d ago

Can you explain this a bit more

tyre1d ago

pylotlight1d ago

SVG generation is a useless test, what's there more to know?

1 more reply

gslepak1d ago· 8 in thread

Note that these are Python-only results, the model will not do as well with other languages.

I'm glad to see more domain-focused SLMs, we need more of them! A programming focused MoE should work well across many languages.

rcarmo1d ago

If it writes functional Python instead of cosplaying as a Java programmer and cramming code with classes and accessors, it's already better than Opus...

nsingh21d ago

Lots of confusion about what this model is actually focused on.

It is a cheap specialist for closed-world, verifiable reasoning tasks like math, self-contained coding problems, and similar.

So no open ended research, repo wide agent work, factual Q&A, or SVG generation. More of a compact reasoning module for bounded problems.

nsingh21d ago

To follow up on this, I had it solve a nasty ODE problem that I saw in the recent Mathematica 15 release post:

    Solve the following first-order ODE for f(x):

    ((-1 - 2*x)*f(x)*tan(1 + x - exp(-61 - 2*x)*f(x)/x)
    + exp(61 + 2*x)*x*(1 - x*tan(1 + x - exp(-61 - 2*x)*f(x)/x))
    + x*tan(1 + x - exp(-61 - 2*x)*f(x)/x)*f'(x)) = 0

    Find the general solution f(x).

And surprisingly it found a valid solution! Extra impressive because it runs 25 tok/s on my measly RTX 2070 super.

    f(x) = x*exp(61 + 2*x)*(1 + x - arccos(C/x))

    C is an arbitrary constant.

Apparently Mathematica 14.3 couldn't solve this ODE.

kanbankaren1d ago

Qwen 3.5 9B Q4_K_M solved this using 10K tokens in 5 mins on a RX 7600.

The answer is exactly what you have posted. I am impressed by Qwen!

le-mark1d ago

How do you know it’s a valid solution? Are you able to verify it yourself?

1 more reply

kame3d1d ago

Interesting!

[1] https://huggingface.co/prithivMLmods/VibeThinker-3B-GGUF

trick-or-treat1d ago

How do we know the solution isn't in the weights though?

skeledrew1d ago

If it can code well then once you put it in a loop with an interpreter it can do anything.

noperator1d ago· 4 in thread

dummydummy12341d ago

Can't you just force it to do structured output via constrained generation?

noperator20h ago

hypfer1d ago

> but I'm working around that in my harness.

How?

uberex1d ago

Maybe limiting logits to what is syntactically correct? E.g. {"hello" has to be followed by whitespace or colon. Any other logits get dropped.

achrono22h ago· 4 in thread

Beats Opus 4.5 on reasoning you say?

Prompt: If A goes to B who then goes to C, can A send something to C?

Response:

[Lots of other unnecessary commentary and "scenarios" that make even lesser sense]

rapatel021h ago

nolist_policy21h ago

> Multi-level Quality Control.

> [...]

erdevs22h ago

I am a human and I don't know how to interpret this prompt.

postalrat18h ago

If A goes to B who then goes to C does C know A?

androiddrew1d ago· 4 in thread

So who has suggestions on small models with excellent tool calling capabilities?

smallerize1d ago

Gemma 4 E4B and Qwen 3 4B are pretty good, but fine-tuning makes them really good. There are tradeoffs at this size, so you'll have to find (or make) a finetune that does what you need.

j-bos1d ago

Maybe bonsai 8b would make the duo, if you do try it, pls post here as I'm a bit curious too.

scotty7918h ago

Qwen3.6-35B-A3B is pretty amazing. I'm using it with 96k context on 24GB VRAM through ollama.

reddec1d ago

granite 4

NotSuspicious1d ago· 3 in thread

pants21d ago

Yeah, if they can fit an 8B model that's really good at improving the output by thinking, running at 16K tok/s on Taalas would be mind-blowing.

le-mark1d ago

Given this and the quality of open models, it makes no sense to me that there’s a future for Anthropic et al.?

james_marks1d ago

Packaging a capability into a consumable form will still be business.

It's like web hosting; all the open source tools are there and free, and yet website tools, hosts, etc flourish.

yousif_12312322h ago· 3 in thread

alkonaut22h ago

vadansky22h ago

Seems like most people have settled on Opus 4.6 as the breaking point (me as well).

Once I can spend 10k to run Opus 4.6 at home, I'm done.

yogthos22h ago

uberex1d ago· 3 in thread

What is the idiot's guide to running this one locally now?

yousif_1231231d ago

Use LM Studio.

uberex16h ago

how do I get these weights in particular?

Landing76101d ago

omlx makes it quite easy

brainless1d ago· 2 in thread

I recently came across this model and I would love to try it with my coding agent soon.

I really like the idea of small models that can reason but do not have too much knowledge. Also, no emphasis on tool calls. I think the agent should do the heavy lifting and reach half way.

https://github.com/brainless/nocodo

SubiculumCode22h ago

Maybe no tool calling, but seems it could be really good at deciding which tool to use and when?

brainless21h ago

SwellJoe1d ago· 1 in thread

https://swelljoe.com/post/will-it-mythos/

nsingh21d ago

darkoob125h ago

I still cannot trust evaluations and benchmarks. How can you prove that the test datasets are truly unseen examples?

I think the only way to prove that these models are truly as good as they claim is to wait and see if they are getting adopted in practice.

mvitorino17h ago

troglodytetrain17h ago

sorenjan1d ago

Seems like a really good model to use in an IDE when you still want control over the code structure then.

virajk_311d ago

An SLM trained for a single use case often beats an LLM. That's both the advantage and the limitation.

nolist_policy22h ago

Notable:

  VibeThinker-3B is developed through a staged post-training pipeline built upon Qwen2.5-Coder-3B base, a compact 3B foundation model.

Qwen2.5 is ancient by LLM standards.

iamgopal1d ago

Two models: one optimised for systems, reasoning, etc., the second optimised for a specific language (Rust or Go?), both small enough to run on a local computer. Would that work?

delis-thumbs-7e18h ago

I gave this a run on llama.cpp locally. My GPU is a GeForce 1080, so I needed a quantized version even for such a small model and…

This. Is. Amazing. I am flabbergasted.

nickalaso14h ago

tracerbulletx17h ago

Man, we just need something like this with tool calling.

andai19h ago

I tried actually talking to it. It reminded me of GPT-2.

jpcompartir23h ago

The absolute worst name for a model I've seen

cold_harbor1d ago

GRPO skips the value network that makes PPO expensive: it scores candidates relative to each other within a group. That's what makes verifiable-reward training practical at 3B scale.
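A rough sketch of the group-relative scoring, assuming binary verifiable rewards and normalization by the group's mean and population standard deviation. The function name and the `eps` value are illustrative choices, not taken from the VibeThinker paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer against its own group instead of a learned
    value network: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt; 1.0 = verifier accepted, 0.0 = rejected.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers get a positive advantage and incorrect ones a negative advantage. If every answer in the group gets the same reward, all advantages are zero and that prompt contributes no gradient signal.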

makethembroke19h ago

I don't get how this beats Opus. It just hardcoded the tasks for the benchmark; it doesn't even respond normally.

There's a lot of randomness in it.

Please don't hype

unfirehose1d ago

This is a good model. I benchmarked its reasoned answers against Qwen 3.6 27B (no think) + bash and it held up.

diimdeep22h ago

BF16 with no QAT quants == half-baked bread

scotty791d ago

If you could pair it somehow with a model that can code and describe code this could be a very powerful combo.

anonyfox1d ago

Wake me up when it does OCaml fine.

4gotunameagain23h ago

What are the implications of local SOTA inference, given the insane datacenter "investing" ?

It surely cannot be justified for training alone at this scale, especially since models nowadays are improved more by fine-tuning than by re-training from scratch.

Will a viable local model crash the US economy ?

maxignol1d ago

3B params on par with Opus 4.5 sounds interesting. Will read the full article before making up my mind.

zkmon1d ago

Does python coding depend on political facts of the world?

There is a butterfly effect. Everything affects everything to some extent.

kmchandy22h ago
