Sadly that's not how LLMs work, since all they do is "token prediction". At least the models we have today ...
Some amount of knowledge is required for reasoning. Maybe such a model could dynamically load knowledge domains organized by some taxonomy. For example, a model can't effectively reason about a development task if it has no knowledge of development best practices. But the population of New York or recipes can definitely be loaded at run time with tools.
This is the root of the problem. If you think about STEM universities, they don't really teach you things you need in the real world. They teach you what you need to know in order to go out there and accumulate the necessary information which can then be used to solve problems. Giving a person access to the internet or a super powerful calculator (like Mathematica) won't mean that they can do anything useful. They need tons of experience to use these tools in an effective way. That experience is basically all that implicit adjacent knowledge that we pick up along the way getting our degrees. And LLMs pick that up during pre-training. Drop this part and the outcome will be worthless.
Our computers can already do everything and have access to all the tools and information, yet they still need a human/intelligence to use them and apply them to specific problems.
Even defining the problem requires knowledge.
As for the tools, if the model has access to 1000 tools, how would it know which one to use if it doesn't have any knowledge itself?
What if I ask about "table tennis spin" and it had a "magnus effect calculator"? How would it know to make the connection between the two?
If "all the knowledge" is what our models do now, what exactly would the most extreme "none of the knowledge + search" look like?
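To make the tool-selection problem concrete, here is a minimal sketch of a router that has no knowledge of its own and can only match words. The tool names and descriptions are made up for illustration, and the Jaccard word overlap is just one naive stand-in for "no knowledge", not how any real agent picks tools.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())

def lexical_score(query, description):
    """Jaccard overlap between query words and description words."""
    q, d = tokens(query), tokens(description)
    union = q | d
    return len(q & d) / len(union) if union else 0.0

# Hypothetical tool registry (names and descriptions are assumptions).
TOOLS = {
    "magnus_effect_calculator": "compute the magnus force on a rotating ball moving through air",
    "table_booking": "book a table at a restaurant",
    "tennis_scores": "look up live tennis match scores",
}

def pick_tool(query):
    """Return the tool whose description shares the most words with the query."""
    return max(TOOLS, key=lambda name: lexical_score(query, TOOLS[name]))

if __name__ == "__main__":
    query = "table tennis spin"
    for name, description in TOOLS.items():
        print(name, round(lexical_score(query, description), 3))
    print("picked:", pick_tool(query))
```

The physics tool shares no words with the query, so it scores zero and one of the irrelevant tools wins. Bridging "spin" to "rotating ball" and "magnus" is exactly the kind of background knowledge the matcher lacks.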
> language specifications.
It would load in all the knowledge to figure out what "language" means, then it would continue trying to decode what "specifications" means.
That might sound absurd, but to figure out the population of New York it's either just going to google it or derive it from primary sources.
But how is it ever going to interpret the primary sources? It needs to understand the question, how complex a question is, how complete an answer is, and how things relate. That's just _too_ much language.
There might be a way to compact this down into an LLM-native language such that a request like `the population of New York` or `use best practices` is encoded without our messy human language for a reasoning model to work with, but the encoding itself has to be done by the "all the knowledge" LLM. Now it seems we've just rebuilt something related to MoE with extra steps, afaict.
Turns out that without the world knowledge to have a base of facts, it is not.
So I don't think it's true that relevant knowledge was deprioritized. At least it wasn't supposed to be.
First, if you know nothing you don't even know what you're missing or what to search for.
Then, without unlimited context, you have to do research for every task all over again every time.
RAG on the initial prompt would be the first thing to try.
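A minimal sketch of what "RAG on the initial prompt" could look like: retrieve the snippets that overlap most with the prompt and prepend them before the model sees anything. The corpus entries are placeholders, the word-overlap scoring is an assumption standing in for a real search engine, and `build_prompt` is a hypothetical helper name.

```python
def overlap(query, doc):
    """Number of words the query and the document have in common."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def build_prompt(user_prompt, corpus, k=2):
    """Prepend the k best-matching snippets (if any match) to the prompt."""
    ranked = sorted(corpus, key=lambda doc: overlap(user_prompt, doc), reverse=True)
    context = [doc for doc in ranked[:k] if overlap(user_prompt, doc) > 0]
    if not context:
        return user_prompt  # nothing relevant found, pass the prompt through
    bullet_list = "\n".join(f"- {doc}" for doc in context)
    return f"Context:\n{bullet_list}\n\nQuestion: {user_prompt}"

# Placeholder corpus; a real system would query an index or the web instead.
CORPUS = [
    "new york population figure goes here",
    "pancakes recipe goes here",
]

if __name__ == "__main__":
    print(build_prompt("what is the population of new york", CORPUS))
```

Note that this only works because the prompt already contains the right search terms, which is the objection above: if you know nothing, you don't know what to search for.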
> Then, without unlimited context, you have to do research for every task all over again every time.
Thing is, we're really really good at building very fast search engines. Doing research all over again every time shouldn't be a problem.
So if you don't train it on a large dataset of a lot of words with a lot of sensible connections, it won't be able to reason, as it won't be able to make proper connections between words and sentences.
You can try training a really small model and seeing the gibberish outputs when you train it on only a small dataset.
Minmaxing the dataset to extract maximum generation with minimal data does sound like fun, but if you want to build SoTA models as a company, the economic tradeoff of doing that vs slapping a few more GPUs together is terrible.
Imagine, for example, a model that's primarily trained on TypeScript and general programming. It would be faster to train and it could be a lot smaller than a generalist model. It might be the best model to pick when you are doing TypeScript programming. And if you could squeeze that into 3B parameters a lot of consumer hardware could run it locally.
You could even expand it to just "webdev tech" or the like.
Even the most basic questions, such as "put a ball in a cup and place it on a table upside down, then pick up the cup and put it in a box", require knowledge of things not mentioned in the question (notably gravity).
Strict definition of all terms quickly gets you into a quagmire of complexity. Some base level of knowledge about things is required for you to give it instructions. If it only knows how to reason, it lacks any idea of what to aim to achieve.
There is quite a pronounced disconnect between the vast stores of written data that models are trained on and robust consideration of a topic. I do wonder if the path can be directed by the order of training.
For example, say you train a model to basic literacy using TinyStories, then on math and philosophy texts, then psychology and sociology texts, and then finally on the mass data of everything from conversations and rants to code and fiction.
Does that end up with a significantly different model from one that is trained on books on acting, creative writing, and fantasy novels before introducing the same final mass data set?
How much does its current ability allow it to contextualise new training data?
That reminds me - this used to be my go-to question for smaller models, and one on which they would always fail miserably:
A small strawberry is placed in a large cup. The cup is placed upside down on the kitchen table. Someone then lifts the cup as-is and puts it in the microwave. Where is the strawberry when the cup is in the microwave?
Here's what the 1.9GB VibeThinker-3B-GGUF:Q4_K_M answered:
Answer: The strawberry is still on the kitchen table – it fell out when the cup was turned upside‑down, and the subsequent lift‑and‑microwave move doesn’t change that.
So it seems there is definite progress here. It is both specialized and yet shows improved common sense on things outside its domain of specialization.
What happens if you ask
A small strawberry is placed in a large cup. The cup is placed upside down on a saucer on the kitchen table. Someone then lifts the cup and saucer as-is and puts them in the microwave. Where is the strawberry when the cup is in the microwave?
I do not think this is a great example. First, it is not a question. Second, it seems very related to robotics. A model itself cannot put a ball anywhere, it can just call tools and answer in text, image, etc.
An LLM seeing "put a x in a y and place it on a z upside down then pick up the y and put it in a z2." and then a question about what happens could check a RAG store for properties of those x, y, z, z2 and still answer. Alternatively, this could be useful for coding, for example. And that is a very extreme example. Some basic language plus tool use could go quite far. I think it is a very interesting direction vs "here is a GPU the price of a car".
That you don't need to have a ball, cup, table, or even the ability to perform physical actions in order to consider where the ball ends up is in itself required knowledge.
see the warning at the top of https://huggingface.co/WeiboAI/VibeThinker-3B
That plus this model should give you a very powerful and focussed assistant.
Except for the most basic of tasks, such as "turn on my lights" or "cross-reference these two lists", I wouldn't trust a small model to be as conscientious and reliable as one with deep knowledge.
I remember Karpathy mentioning this on the Dwarkesh podcast. But is reasoning really possible without all the knowledge?
Even recent massive models do not work anything like a smart human does at the moment so why are we assuming this can?
Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball. By the time a human is in their mid-teens they have acquired the base knowledge...
Small models need to have enough base knowledge to be able to be good enough -- even in a seemingly narrow regime. Where is that? Obviously they don't need all the obscure knowledge of a frontier model but there is some base level which is probably more than it would first seem.
Emphatically, it does not. Passing your driver's test may require being able to read, but plenty of illiterate people around the world drive just fine.
There is a reason we made all the common road signs recognisable purely by shape/colour, after all.
And a whole lot of people have done stupid shit like that while perfectly able to read, many even with masters and PhDs.
It is really strange to see comments like this here, where people seem to reduce some basic human action to how it would work in a text-only computer game. Driving itself mainly requires muscle memory for operating the car, which is why people who drive a lot can just go on autopilot and think about something completely different when driving long distances. That is of course a form of knowledge, but you only get it through repetition. Of course driving in traffic requires far more, like a basic understanding of traffic law, but most of driving is muscle memory, understanding the vehicle and anticipating future occurrences. The reason we apes are so good at this is that we have some million years of evolution of just using our bodies and seeing what happens. And of course we've all seen the gif of an orangutan driving a golf cart (how real it is I’m uncertain), so there’s that.
I think it might help to think of models not as some future replicants, but as models with certain capabilities in certain domains. It probably doesn’t make much sense to ask Opus 4.8 to drive you around, just as it doesn’t make sense to expect a small image model made for edge devices to be able to write a novel. Perhaps we should just think of them as tools with certain applications they are made for.
I would be interested to see a formal study of this. I say this not out of anything other than an observation that I think the only real blockers are a) judgement, and b) physical reflexes/strength. As a kid I was certainly aware of ice, snow, and rain, because I rode my bike year round and had low confidence in my own ability to control my bike on snowy or wet terrain, especially during season changes. That translated into learning to drive in northern Canada in the winter and applying those lessons to driving.
In an environment devoid of consequences, I have seen kids operate driving simulations (both real simulations, and video games) with a degree of precision that is shocking, including seeing several 9-11 year olds play the simulations and games with a much higher degree of confidence than adult drivers. Children have an awareness that the simulations are consequence free, unless given other motivation. Adults that are consistent drivers have muscle memory and preconceived expectations that govern the decisions they make when playing the game. I am curious about the level of training and exposure required for children to overcome their lack of awareness of the hard limits and consequences of driving and driver error, versus the amount of training and exposure required for expert drivers that are novice gamers to stop applying their learned experience to consequence free simulations.
(i'm above average in both)
This requires not only knowledge, but also the control systems that develop with the prefrontal cortex. LLMs don't do much control yet.
Different times though.
Conflation. That's to drive a car safely. To just drive a car one only need know to press gas to move, press brake to stop, turn steering wheel to change direction and maybe use a gear stick to shift into drive/park (car can be modified to abstract that away). Not much more complex than riding a bicycle; maybe even less since no need to learn to balance.
Millions of people do drive who can't read. It's very common in parts of Asia, Africa, Latin America, etc, especially rural, but even in cities.
There are places where oral exams and audio-assisted testing are allowed. And there are places where people just drive (and drive fine) without bothering with a license.
Now, if you ask this model to have a conversation with you, it's gonna fail and be incoherent. But boy, does it sure reason through math problems well.
(I'm still sad that they didn't make a 122B-A10B version of it, as it's the kind of model that fits best on a Strix Halo, and for 3.5 it was comparable in performance to the dense 27B version).
This Q5_K_M quant should be near lossless and fit with full 256K context in about 100GB of RAM: https://huggingface.co/AesSedai/Qwen3.5-122B-A10B-GGUF
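A back-of-envelope check on that memory claim, as a minimal sketch. The bits-per-weight averages below are rough assumptions on my part, not measurements of these exact files, and the estimate covers weights only: the KV cache for a long context depends on the architecture and is not included.

```python
# Assumed average bits per weight for each quant type (approximate, not measured).
ASSUMED_BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.7}

def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights alone, in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8

if __name__ == "__main__":
    # 122B parameters at Q5_K_M, as in the linked quant
    print(f"122B @ Q5_K_M: ~{weight_gb(122, ASSUMED_BPW['Q5_K_M']):.0f} GB")
    # 3B parameters at Q4_K_M, to compare with the 1.9GB file mentioned in the thread
    print(f"3B @ Q4_K_M: ~{weight_gb(3, ASSUMED_BPW['Q4_K_M']):.1f} GB")
```

Under these assumptions the 122B weights land somewhat under 100GB, and whatever is left has to hold the context.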
Edit: specifically Qwen 3.6 27B beats that on coding and agentic workflows.
The Q8_K_XL MTP model from Unsloth: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF
I serve the model with ollama and am thinking about replacing ollama but haven't looked into it.
I have openwebui for chat if I want that too, but don't really use it.
> these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios.
These kinds of models might be more useful as tools to be used by larger orchestrator models, than being the orchestrators themselves.
Here's what I got
https://9ol.es/tmp/pelican.png
with https://9ol.es/tmp/prompt_pelican.txt
using prithivMLmods/VibeThinker-3B-GGUF:Q4_K_M
It would look really dumb if someone asked it that, but that's fine. You're trying to make a model that is optimized for efficiency for a specific task. As much as possible, you should prune uncorrelated things.
I'm glad to see more domain-focused SLMs, we need more of them! A programming focused MoE should work well across many languages.
It is a cheap specialist for closed-world, verifiable reasoning tasks like math, self-contained coding problems, and similar.
"Closed-world" means the needed information is already in the context. It is not a tool-using agent that can discover missing context. "Verifiable" means answers are hard to generate but easy to check.
So no open-ended research, repo-wide agent work, factual Q&A, or SVG generation. More of a compact reasoning module for bounded problems.
Solve the following first-order ODE for f(x):
((-1 - 2*x)*f(x)*tan(1 + x - exp(-61 - 2*x)*f(x)/x)
+ exp(61 + 2*x)*x*(1 - x*tan(1 + x - exp(-61 - 2*x)*f(x)/x))
+ x*tan(1 + x - exp(-61 - 2*x)*f(x)/x)*f'(x)) = 0
Find the general solution f(x).
And surprisingly it found a valid solution! Extra impressive because it runs at 25 tok/s on my measly RTX 2070 Super.
f(x) = x*exp(61 + 2*x)*(1 + x - arccos(C/x))
C is an arbitrary constant.
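This is a nice instance of "hard to generate, easy to check": the posted solution can be plugged back into the ODE numerically. A minimal sketch in plain Python, assuming sample points of my own choosing (0 < |C/x| < 1, so that arccos is defined and tan stays finite) and a central-difference step `h`. The residual is divided by exp(61 + 2*x)*x so the numbers stay of order one.

```python
import math

def f(x, c):
    """Candidate solution quoted above."""
    return x * math.exp(61 + 2 * x) * (1 + x - math.acos(c / x))

def scaled_residual(x, c, h=1e-5):
    """Left-hand side of the ODE at x, divided by exp(61 + 2*x)*x."""
    e = math.exp(61 + 2 * x)
    fx = f(x, c)
    dfx = (f(x + h, c) - f(x - h, c)) / (2 * h)  # central-difference f'(x)
    t = math.tan(1 + x - math.exp(-61 - 2 * x) * fx / x)
    lhs = (-1 - 2 * x) * fx * t + e * x * (1 - x * t) + x * t * dfx
    return lhs / (e * x)

if __name__ == "__main__":
    # Sample (x, C) pairs are arbitrary choices inside the valid domain.
    for x, c in [(2.0, 1.0), (3.0, -1.5), (1.5, 0.5)]:
        print(x, c, scaled_residual(x, c))
```

The scaled residuals should come out tiny compared with the individual terms, which are of order ten. A numerical check like this does not prove the solution is general, only that it satisfies the equation at the sampled points.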
Apparently Mathematica 14.3 couldn't solve this ODE. The answer is exactly what you have posted. I am impressed by Qwen!
I just tried the quantized Q4_K_M from [1] in my RTX 2070 Super, it ran at 110 tok/s with 1800 tok/s prefill, and found the same solution to your prompt. It generated valid LaTeX for the answer but its reasoning trace uses mostly compact ASCII math notation. Took 3min 22s to answer, spending 22k tokens almost all on thinking.
[1] https://huggingface.co/prithivMLmods/VibeThinker-3B-GGUF
Prompt: If A goes to B who then goes to C, can A send something to C?
Response:
We need to interpret best. The phrase "If A goes to B who then goes to C, can A send something to C?" could be a puzzle about the concept of sending something (like passing a ball) and the relationships.
Scenario: A gives something to B, and B passes it on to C. Question: Can A also give the same thing to C? Answer: Only if A can obtain a second copy (e.g., the thing was duplicated). Otherwise, after handing it to B, A no longer holds it and cannot “send” it unless a copy exists.
[Lots of other unnecessary commentary and "scenarios" that make even less sense]
My hunch is that Opus-scale models probably have shortcuts encoded into the model that handle these ambiguous cases, whereas this model has learned a program to reason through the edge case (crystallized vs fluid intelligence). Remembering that probability (frontier) vs calculating it on the fly (VibeThinker).
> [...]
> LLM-based Query Quality Filtering. We utilize capable LLMs to assess query quality, filtering out samples with incomplete descriptions, unreasonable conditions, invalid logic, or an inability to effectively assess target knowledge points.
So who has suggestions on small models with excellent tool calling capabilities?
It's like web hosting; all the open source tools are there and free, and yet website tools, hosts, etc flourish.
Once I can spend 10k to run Opus 4.6 at home, I'm done.
I find what makes frontier models actually work well isn't just the capability of the model, but how well the harness is tuned to its expectations. I wrote about this in a bit more detail here: https://yogthos.net/posts/2026-06-08-dirge-code.html
I really like the idea of small models that can reason but do not have too much knowledge. Also, no emphasis on tool calls. I think the agent should do the heavy lifting and reach half way.
I use really small models, like Qwen 3.5 0.8B to 9B - no tool calling, no MCP, no skills, nothing. No multi-turn chat even. Models are given very specific tasks using a vast number of system prompts and all the response handling is done in the agent(s).
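Roughly what that pattern could look like, as a minimal sketch: one narrow system prompt per task, a single-turn call, and all validation and retrying done by the agent code. `call_model` is a hypothetical callable standing in for whatever client you use, and the task name, prompt text and allowed labels are made up for illustration.

```python
# Hypothetical task table: one tightly scoped system prompt per task.
SYSTEM_PROMPTS = {
    "sentiment": "Classify the text. Reply with exactly one word: positive, negative, or neutral.",
}
ALLOWED = {"sentiment": {"positive", "negative", "neutral"}}

def run_task(task, text, call_model, max_attempts=3):
    """Single-turn call; the agent, not the model, validates and retries."""
    for _ in range(max_attempts):
        raw = call_model(SYSTEM_PROMPTS[task], text)
        reply = raw.strip().lower().rstrip(".")
        if reply in ALLOWED[task]:
            return reply
    return None  # the caller decides what to do when the model never complies

def fake_model(system_prompt, user_text):
    """Stand-in for a real model call so the sketch runs offline."""
    return " Positive.\n"

if __name__ == "__main__":
    print(run_task("sentiment", "I love this keyboard", fake_model))
```

The model never sees tools or history. It only has to get one small, checkable thing right per call.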
That's also more aligned with its LeetCode-style training data, where the code under test is fully in the context window. It might be interesting to have a bigger tool-use model go through the effort of collecting the context, and feeding it into this kind of model for analysis only. It becomes more of a thinking tool, instead of the orchestrator.
I think the only way to prove that these models are truly as good as they claim is to wait and see if they are getting adopted in practice.
Seems like a really good model to use in an IDE when you still want control over the code structure then.
VibeThinker-3B is developed through a staged post-training pipeline built upon Qwen2.5-Coder-3B base, a compact 3B foundation model.
Qwen2.5 is ancient by LLM standards.
This. Is. Amazing. I am flabbergasted.
I am not into the whole GenAI thing and I have very little need for anything agentic, but Python, C++ and Maths are exactly what I mostly used these for, so this might actually become my main workhorse. This is so cool.
I even used it for stuff it is not built for, asking complex questions on history (Battle of Tours 732) and literature (Joyce’s Portrait of the Artist) and it was surprisingly good, even though it started to hallucinate names and details (such as claiming Joyce’s father was a priest). For 3B I expected it to mainly spout complete nonsense.
It's meant for a Windows machine using ollama, but I'm sure anyone who wants to mess with it can point Claude Code at it to convert it for their own operating system and requirements. After install you can ask it to do something with "vibe 'create me a poem about cheese in cheese.txt'". Its workspace is by default the directory the CLI was located in when you called it.
A lot of randomness in it
Please don't hype
It surely cannot be justified only for training at this scale, especially since models nowadays are improved more and more by fine-tuning than by re-training from scratch.
Will a viable local model crash the US economy?
More importantly, are the LLM companies aware, and are they deliberately buying out all the RAM and GPUs in order to delay the inevitable? Probably not, but I wouldn't be surprised if that is the case.
It might appear not, but actually, the process of reasoning is not an isolated act. The right and wrong way of doing things is codified in social evolution that has absorbed all facets of life. Why should you optimize a piece of code for performance? Why is performance needed? What is a bug? What features and UI themes would be more intuitive for humans?
There is a butterfly effect. Everything affects everything to some extent.