There is also a lot of thought put into all the tiny bits of optimization to reduce memory usage, using FP8 effectively without significant loss of precision or dynamic range.
None of the techniques by themselves are really mind blowing, but the whole of it is very well done.
The DeepSeekV3 paper is really a good read: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSee...
Madness.
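To make the FP8 point concrete, here's a toy sketch of per-tensor scaled quantization, the basic trick that keeps dynamic range usable at 8 bits. A uniform 256-level grid stands in for the real (nonuniform) FP8 e4m3 format, so this illustrates the scaling idea only, not DeepSeek's actual recipe:

```python
# Toy sketch of per-tensor scaled quantization, the idea behind
# training in FP8 without losing dynamic range: rescale values into
# the representable range before casting, undo the scale afterwards.
# A uniform 256-level grid stands in for the real e4m3 format here.

FP8_MAX = 448.0  # largest finite value in FP8 e4m3

def quantize(xs, levels=256):
    """Scale values into [-FP8_MAX, FP8_MAX] and round to the grid."""
    scale = max(abs(x) for x in xs) / FP8_MAX or 1.0
    step = 2 * FP8_MAX / (levels - 1)
    codes = [round((x / scale) / step) for x in xs]
    return codes, scale, step

def dequantize(codes, scale, step):
    return [c * step * scale for c in codes]

xs = [0.003, -1.7, 42.0, -0.25]
codes, scale, step = quantize(xs)
ys = dequantize(codes, scale, step)
# The largest values survive with small relative error; the tiny
# 0.003 gets flushed to zero, which is why real recipes add
# fine-grained (per-block) scaling factors on top.
```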
Sure, training cost will go down with time, but if you are only using 10% of the compute of your competition (TFA: DeepSeek vs. LLaMA), then you could be saving hundreds of millions per training run!
It's also a science-based organization like OpenAI. Very intelligent people, but they aren't programmers first.
Making AI practical for self-hosting was the real disruption of DeepSeek.
With DeepSeek we can now run on non-GPU servers with a lot of RAM. But surely quite a lot of the 671 GB or whatever is knowledge that is usually irrelevant?
I guess what I sort of am thinking of is something like a model that comes with its own built in vector db and search as part of every inference cycle or something.
But I know that there is something about the larger models that is required for really intelligent responses. Or at least that is what it seems because smaller models are just not as smart.
If we could figure out how to change it so that you would rarely need to update the background knowledge during inference and most of that could live on disk, that would make this dramatically more economical.
Maybe a model could have retrieval built in, and trained on reducing the number of retrievals the longer the context is. Or something.
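A crude sketch of that retrieval idea, with a hypothetical in-memory store and bag-of-words similarity standing in for a real on-disk vector DB and learned embeddings:

```python
# Toy sketch: "knowledge" lives in an external store fetched per
# query, rather than being baked into hundreds of GB of weights.
# Bag-of-words cosine similarity stands in for learned embeddings;
# the store is a list here but would be an on-disk vector DB.
from collections import Counter
from math import sqrt

KNOWLEDGE = [
    "Paris is the capital of France.",
    "Tokyo is the capital of Japan.",
    "LEGO bricks interlock to build larger structures.",
]

def embed(text):
    cleaned = text.lower().replace(".", " ").replace("?", " ")
    return Counter(cleaned.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(KNOWLEDGE, key=lambda d: cosine(q, embed(d)),
                    reverse=True)
    return ranked[:k]

print(retrieve("what is the capital of France?"))
# → ['Paris is the capital of France.']
```

The interesting research question from the comment above is the training side: teaching the model when to issue a retrieval at all, and rewarding it for issuing fewer as context grows.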
Small correction - it's 671B parameters, not 671 gigabytes. Doing some rudimentary math: if you want to run the entire model in memory, it would take ~750 GiB (671B params × 1 byte per param (FP8 = 8 bits) × 1.2 for ~20% overhead ≈ 749.9 GiB).
It's a MoE model, so only a fraction of those ~750 GiB of parameters is active for any given token.
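For what it's worth, the back-of-envelope arithmetic from the correction above, spelled out:

```python
# Back-of-envelope memory estimate for holding DeepSeek-V3's weights:
# 671B parameters at 1 byte each (FP8 = 8 bits), plus a rough 20%
# overhead for KV cache, activations, and framework bookkeeping.
params = 671e9
bytes_per_param = 1      # FP8
overhead = 1.2           # ~20% fudge factor
total_gib = params * bytes_per_param * overhead / 2**30
print(f"{total_gib:.1f} GiB")  # → 749.9 GiB
```

Note the GB/GiB distinction is doing some work here: ~805 GB of raw bytes comes out to ~750 GiB.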
I think maybe what you are asking is "Why do more params make a better model?"
Generally speaking, it's because if you have more units of representation (params), you can encode more information about the relationships in the data used to train the model.
Think of it like building a LEGO city.
A model with fewer parameters is like having a small LEGO set with fewer blocks. You can still build something cool, like a little house or a car, but you're limited in how detailed or complex it can be.
A model with more parameters is like having a giant LEGO set with thousands of pieces in all shapes and colours. Now, you can build an entire city with skyscrapers, parks, and detailed streets.
---
In terms of "is a lot of it irrelevant?" - this is a hot area of research!
It's very difficult currently to know which parameters are relevant and which aren't. There is an area of research called mechanistic interpretability that aims to illuminate this - if you are interested, Anthropic's "Golden Gate Claude" demo and the accompanying "Scaling Monosemanticity" paper are a good introduction.
I tried the same thing on the 7B and 32B model today, neither are as effective as codellama.
No one (publicly) had really pushed any of these techniques far, especially not for such a big run.
the transformer was an entirely new architecture - a very different kind of step change than this
e: and alibaba
I think recomputing MLA and RMSNorm activations on backprop is something few would have done.
Dispensing with tensor parallelism by kind of overlapping forward and backprop. That would not have been intuitive to me. (I do, however, make room for the possibility that I'm just not terribly good at this anymore.)
I don't know? I just think there's a lot of new takes in there.
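For anyone unfamiliar with the recomputation trick mentioned above: instead of caching a layer's activations for the backward pass, you cache only its (cheaper) input and rerun the forward when gradients are needed, trading FLOPs for memory. A toy single-layer illustration (not DeepSeek's implementation; the finite-difference gradient is purely for the demo):

```python
# Toy sketch of activation recomputation (gradient checkpointing):
# cache only a layer's input, and rerun the forward pass during
# backprop instead of storing its output. This trades extra FLOPs
# for a smaller activation footprint - the idea behind recomputing
# MLA / RMSNorm outputs on the backward pass.
import math

def rmsnorm(xs, eps=1e-6):
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [x / rms for x in xs]

class CheckpointedRMSNorm:
    def forward(self, xs):
        self.saved_input = list(xs)   # cache the cheap thing only
        return rmsnorm(xs)            # output is deliberately NOT cached

    def backward(self, grad_out, h=1e-6):
        xs = self.saved_input
        y = rmsnorm(xs)               # recompute instead of loading
        # Finite-difference vector-Jacobian product, purely for the
        # demo; a real framework uses analytic gradients.
        grad_in = []
        for i in range(len(xs)):
            bumped = list(xs)
            bumped[i] += h
            y2 = rmsnorm(bumped)
            grad_in.append(sum(g * (b - a) / h
                               for g, a, b in zip(grad_out, y, y2)))
        return grad_in

layer = CheckpointedRMSNorm()
out = layer.forward([1.0, 2.0, 3.0])
grads = layer.backward([1.0, 0.0, 0.0])
```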
"Write Python code for the game of Tetris" resulting in working code that resembles Tetris is great. But the back and forth asking for clarification or details (or even post-solution adjustments) isn't there. The models dwell and draw almost entirely self-referentially from their own reasoning through the entire exchange.
"Do you want to keep score?" "How should scoring work?" "Do you want aftertouch?" "What about pushing down, should it be instantaneous or at some multiple of the normal drop speed?" "What should that multiple be?"
as well as questions from the prompter that inquire about capabilities and possibilities. "Can you add one 5-part piece that shows up randomly, on average every 100 pieces?" or "Is it possible to make the drop speed function as an acceleration rather than a linear drop speed?"... These are somewhat possible, but sometimes require the model to re-reason the entire solution again.
So right now, even the best models may or may not provide working code that generates something that resembles a Tetris game, but have no specifics beyond what some internal self-referential reasoning provides, even if that reasoning happens in stages.
Such a capability would help users of these models troubleshoot or fix specific problems or express specific desires... the Tetris game works but has no left-hand L blocks, for example. Or the scoring makes no sense. Everything happens in a sort of highly superficial way, where the reasoning is used to fill in gaps in the top-down understanding of the problem the model is working on.
For example, I can ask it to write a python script to get the public IP, geolocation, and weather, trying different known free public APIs until it succeeds. But the first successful try was dumping a ton of weather JSON to the console, so I gave feedback to make it human readable with one line each for IP, location, and weather, with a few details for the location and weather. That worked, but it used the wrong units for the region, so I asked it to also use local units, and then both the LLM and myself judged that the project was complete. Now, if I want to accomplish the same project in fewer prompts, I know to specify human-readable output in region-appropriate units.
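For reference, the finished script's shape was roughly like the sketch below. The specific endpoints (ipify, ip-api.com, Open-Meteo) are plausible free APIs chosen for illustration, not necessarily the ones the model picked:

```python
# Sketch of the described script: fetch public IP, geolocation, and
# current weather from free APIs, then print one human-readable line
# each. ipify / ip-api / Open-Meteo are illustrative endpoint choices.
import json
import urllib.request

def fetch_json(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def format_report(ip, city, country, temp_c, units):
    """Pure formatting step: one line each for IP, location, weather."""
    if units == "imperial":
        temp = f"{temp_c * 9 / 5 + 32:.1f} F"
    else:
        temp = f"{temp_c:.1f} C"
    return (f"IP:       {ip}\n"
            f"Location: {city}, {country}\n"
            f"Weather:  {temp}")

def main():
    ip = fetch_json("https://api.ipify.org?format=json")["ip"]
    geo = fetch_json(f"http://ip-api.com/json/{ip}")
    wx = fetch_json("https://api.open-meteo.com/v1/forecast"
                    f"?latitude={geo['lat']}&longitude={geo['lon']}"
                    "&current_weather=true")
    # Region-appropriate units: Fahrenheit for the few non-metric locales.
    units = "imperial" if geo["countryCode"] in ("US", "LR", "MM") else "metric"
    print(format_report(ip, geo["city"], geo["country"],
                        wx["current_weather"]["temperature"], units))

# main()  # uncomment to actually hit the network
```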
This only uses text based LLMs, but the logical next step would be to have a multimodal network review images or video of the running program to continue to self-evaluate and improve.
Sounds like a deficiency in theory of mind.
Maybe that explains some of the outputs I've seen from DeepSeek where it conjectures about the reasons why you said whatever you said. Perhaps this is where we're at in mitigating what you've noticed.
If your prompt instructs the model to ask such questions along the way, the model will, in fact, do so!
But yes, it would be nice if the model were smart enough to realize when it's in a situation where it should ask the user a few questions, and when it should just get on with things.
That's probably the key to understanding why the hallucination "problem" isn't going to be fixed: for language models, as probabilistic models, it's an inherent feature, and they were never designed to be expert systems in the first place.
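One way to see the "inherent feature" point: decoding always turns logits into a probability distribution and samples from it, and nothing in that machinery checks truth. A toy illustration with made-up logits for a hypothetical next-token choice:

```python
# Toy illustration: decoding turns logits into a probability
# distribution and samples from it. Nothing in this machinery checks
# truth, so a fluent wrong continuation with nonzero probability
# will eventually be sampled - hallucination as a feature, not a bug.
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up next-token candidates after "The capital of Australia is"
tokens = ["Canberra", "Sydney", "Melbourne"]
logits = [2.0, 1.8, 0.5]   # hypothetical scores; Sydney is close behind

probs = softmax(logits)
random.seed(0)
samples = [random.choices(tokens, weights=probs)[0] for _ in range(1000)]
wrong = sum(s != "Canberra" for s in samples) / len(samples)
# Roughly half the samples assert a wrong capital, and no sampling
# tweak pushes that probability to exactly zero.
```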
Building a knowledge representation system that can properly model the world is a problem that goes more into the foundations of mathematics and logic than engineering. The current frameworks, like FOL, are very lacking here, and there aren't many people in the world working on such problems.
https://www.mdpi.com/1999-4893/13/7/175
> the open-domain Frame Problem is equivalent to the Halting Problem and is therefore undecidable.
Diaconescu's theorem will help you understand where Rice's theorem comes into play here.
Littlestone and Warmuth's work will explain where PAC learning really depends on a many-to-one reduction that is similar to fixed points.
Viewing supervised learning as parametric regression (thus dependent on IID) and unsupervised learning as clustering (thus dependent on AC) will help with the above.
Another lens: both IID and AC imply PEM.
Basically, for problems like protein folding, whose rules have the Markovian and ergodic properties, it will work reliably well for science.
The three basic properties (confident, competent, and inevitably wrong) will always be with us.
Doesn't mean that we can't do useful things with them, but if you are waiting for the hallucinations problem to be 'solved' you will be waiting for a very long time.
What this new combo of elements does do is seriously help with being able to leverage base models to do very powerful things, while not waiting for some huge groups to train a general model that fits your needs.
This is a 'no effective procedure/algorithm exists' problem. Leveraging LLMs for frontier search will open up possible paths, but the limits of the tool will still be there.
The stability of planetary orbits is an example of another limit of math, but JPL still does a great job regardless.
Obviously someone may falsify this paper... but the safe bet is that it holds.
https://arxiv.org/abs/2401.11817
Heck, Laplacian determinism has been falsified, but since scientists are more interested in finding useful models, that doesn't mean it isn't useful.
"All models are wrong, but some are useful" is the TL;DR.
> the open-domain Frame Problem is equivalent to the Halting Problem and is therefore undecidable.
Thank you. Code-as-data problems are innate to the von Neumann architecture, but I could never articulate how LLMs are so huge that they are essentially Turing-complete and computationally equivalent. You _can_ enumerate through them, just not in our universe.
And yes, if you thought this means these models are being commonly misapplied, you'd be correct. This will continue until the bubble bursts.