Also this is the direction the small LLMs are moving in already. They are too small for general knowledge, but getting quite good at tool use (incl. Googling).
Now we just need them to be very strict about what they know and don't know! (I think this is still an open problem, even with big ones.)
And I don't think that LLM could just Google or check Wikipedia.
But I do agree that this architecture makes a lot of sense. I assume it will become the norm to use such edge LLMs.
While I understand some of the fundamental thoughts behind that comparison, it's slightly wonky... I'm not asking "compress wikipedia really well", but instead "can a 'model' reason its way through wikipedia" (and what does that reasoning look like?).
Theoretically with wikipedia-multi-lang you should be able to reasonably nail machine-translation, but if everyone is starting with "only wikipedia" then how well can they keep up with the wild-web-trained models on similar bar chart per task performance?
If your particular training technique (using only wikipedia) can go from 60% of SOTA to 80% of SOTA on "Explain why 6-degrees of Kevin Bacon is relevant for tensor operations" (which is interesting to plug into Google's AI => Dive Deeper...), then that's a clue that it's not just throwing piles of data at the problem, but instead getting closer to extracting the deeper meaning (and/or reasoning!) that the data enables.
Maybe not crawl the web, but hit a service with pre-hosted, pre-curated content it can digest (and cache) that doesn't change very often. You aren't necessarily using it for the latest news; programming, which is mostly static knowledge, is a good example.
How? They can validate thousands if not millions of queries, but nothing prevents the million-and-first from being a hallucination. People who pay extra for an "Encyclopedia Britannica validated LLM" would then, rightfully so IMHO, complain that "it" suggested they cook with a dangerous mushroom.
> What I often wonder is "what will be the minimally viable LLM" that can work from just enough information that if it googles the rest it can provide reasonable answers?
It depends what that word "reasonable" means for your specific use-case ;)
Does that make sense?
> As inhibitory synapses account for 20%-30%, this could map well to how biological brains are structured.
No concise HN comment will give you a complete picture of what's currently known about the human brain, so a platitude necessarily follows:
We call the nearly touching interfaces between neurons synapses. Small packets (droplets) of neurotransmitter are sent across this interface from the source neuron to the target neuron. Such signals can be excitatory (raising the probability that the target fires soon) or inhibitory (lowering that probability). There are two kinds of sensitive area on your average neuron. The dendrites are long branching tentacles that receive excitatory signals. The cell body, where all the signals are accumulated into a local instantaneous "sum", is also sensitive to synaptic activation, but the synapses on the cell body are inhibitory. When sufficiently inhibited, the neuron will refuse to fire its axons, so the inhibitory synapses on the cell body can gate the cumulative signal and temporarily prevent it from triggering the neuron. If the neuron does fire, the signal propagates along the axons (another kind of branching tentacle), which lead to yet other neurons, sometimes touching them excitatorily at a dendrite, sometimes inhibitorily at the cell body.
I hope that helped?
One bit or one trit? I am confused!
For something more practical, you can pack five three-state values within a byte because 3^5 = 243, which is smaller than 256. To unpack, you divide and modulo by 3 five separate times. This encodes data in bytes at 1.6 bits per symbol.
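A minimal sketch of the base-3 packing described above. The function names and the mapping of -1, 0, 1 to digits 0, 1, 2 are my own assumptions for illustration, not anything taken from the BitNet code:

```python
import math

def pack5(trits):
    """Pack five values from {-1, 0, 1} into one byte (0..242) using base 3."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    byte = 0
    for t in reversed(trits):
        byte = byte * 3 + (t + 1)   # map -1, 0, 1 -> digits 0, 1, 2
    return byte

def unpack5(byte):
    """Recover the five trits with five divide/modulo steps."""
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits

BITS_PER_TRIT_PACKED = 8 / 5         # 1.6 bits per symbol
BITS_PER_TRIT_IDEAL = math.log2(3)   # about 1.585, the "1.58" in the name
```

The gap between 1.6 and log2(3) is what you lose by rounding up to whole bytes.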
But the packing of 5 symbols into a byte was not done here. Instead, they packed 4 symbols into a byte to reduce computational complexity (no unpacking needed)
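For comparison, a sketch of a four-per-byte layout with two bits per symbol. The exact bit layout used by the project is not stated in this thread, so the offset-by-one encoding and the names below are hypothetical; the point is only that each value comes out with a shift and a mask instead of divide/modulo:

```python
def pack4(trits):
    """Pack four values from {-1, 0, 1} into one byte, two bits per value."""
    assert len(trits) == 4 and all(t in (-1, 0, 1) for t in trits)
    byte = 0
    for i, t in enumerate(trits):
        byte |= (t + 1) << (2 * i)   # codes 0, 1, 2; code 3 is unused
    return byte

def get_trit(byte, i):
    """Read the i-th trit with a shift and a mask, no division needed."""
    return ((byte >> (2 * i)) & 0b11) - 1
```

This costs 2 bits per symbol instead of 1.6, in exchange for simpler access.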
So it's not an inference framework for 1-bit models (two states per parameter) but for 1.58-bit models (three states per parameter). Annoying that they try to mix up the two.
There's a lot that you can do when the model size is that small, yet still powerful.
Our next step is that we want to put up a content distribution network for it where people can also share their diffs for their own fine-tuned model. I'll post the project if we finish all the parts.
[1] https://www.youtube.com/live/x791YvPIhFo?is=NfuDFTm9HjvA3nzN
If Microslop aren't gonna train the model themselves to prove their own thesis, why would others? They've had 2 years (I think?) to prove BitNet in at least some way, are you really saying they haven't tried so far?
Personally that makes it slightly worrisome to just take what they say at face value, why wouldn't they train and publish a model themselves if this actually led to worthwhile results?
But that doesn't mean the idea is worthless.
You could have said the same about Transformers: Google released the idea but didn't move forward with it, and it turned out to be a great one.
The "new" on huggingface banner has weights that were uploaded 11 months ago, and it's 2B params. Work on this in the repo is 2 years old.
The amount of publicity compared to the anemic delivery for BitNet is impressive.
You'd still need full GPUs for training, but for inference the hardware would be orders of magnitude simpler than what Nvidia is making
Interestingly, a trit x float multiplier is cheaper than a trit x integer multiplier in hardware if you're willing to ignore things like NaNs.
0 and 1 are trivial, just a mux for identity and zero. But because floats are sign-magnitude, multiply by -1 is just an inverter for the sign bit, whereas for integers you need a bitwise inverter and a full incrementer.
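A software sketch of that difference, assuming IEEE 754 float32 and two's-complement int32. The function names are made up for illustration, and like the comment above it ignores NaN, infinity and signed zero:

```python
import struct

def trit_times_float32(trit, x):
    """Multiply a float32 by a trit in {-1, 0, 1} using only bit operations."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if trit == 0:
        bits = 0                 # mux selects zero (ignores -0.0, NaN, inf)
    elif trit == -1:
        bits ^= 0x80000000       # flip only the sign bit
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def trit_times_int32(trit, n):
    """Same for a two's-complement int32: negation is invert plus increment."""
    bits = n & 0xFFFFFFFF
    if trit == 0:
        bits = 0
    elif trit == -1:
        bits = (~bits + 1) & 0xFFFFFFFF   # bitwise inverter + incrementer (carry chain)
    # Reinterpret the 32-bit pattern as a signed value
    return bits - (1 << 32) if bits & 0x80000000 else bits
```

The float path touches one bit; the integer path needs a carry that can ripple through all 32.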
I happened to "live" on 7.0-7.5 tok/sec output speed for a while, and it is an annoying experience. It is the equivalent of walking behind someone slightly slower on a sidewalk. I dealt with this by deliberately looking away for a minute until output was "buffered" and only then started reading.
For any local setup I'd try to reach for 10 tok/sec. Sacrifice some kv cache and shove a few more layers on your GPU, it's worth it.
In what way? On modern processors, a Fused Multiply-Add (FMA) instruction generally has the exact same execution throughput as a basic addition instruction
Typically for 1-bit matmul you can get away with XORs and popcounts, which should have a better throughput profile than FMA once you take the SIMD nature of the inputs/outputs into account.
So it's not that individual ops are faster — it's that the packed representation lets each instruction do more useful work, and you're moving far less data from memory to do it.
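A toy sketch of the XOR plus popcount trick for two-state (+1/-1) vectors. The encoding (+1 as a set bit, -1 as a clear bit) and the function names are assumptions for illustration; a real kernel would do this on wide SIMD registers rather than Python ints:

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an int, bit i set when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors from their packed bits."""
    mismatches = bin(a_bits ^ b_bits).count("1")   # popcount of the XOR
    # Each matching position contributes +1, each mismatch -1
    return n - 2 * mismatches
```

One XOR and one popcount replace n multiply-adds, and the operands take n bits instead of n floats.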
I'm hoping that today's complaints are tomorrow's innovations, like back when a 1 MB hard drive was $100,000, or when Gates said 640 KB is enough.
Perhaps someone in the (chip) industry can comment on what RAM manufacturers are doing at the moment - better, faster, larger? Or is there not much headroom left and it's down to MOBO manufacturers, and volume?
The last logical step of this process would be figuring out how to mix the CPU transistors with the RAM capacitors on the same chip as opposed to merely stacking separate chips on the same package.
A related stopgap is the AI startup (forget which) making accelerators on giant chips full of SRAM. Not a cost effective approach outside of ML.
AMD actually used HBM2 memory in their Radeon VII card back in 2019 (!!) for $700. It had 16 GB of HBM2 memory with 1 TB/s throughput.
The RTX 5080 in comparison also has 16 GB of VRAM, but was released in 2025 and has 960 GB/s throughput. The RTX 5090 does have an edge at 1.8 TB/s bandwidth and 32 GB of VRAM but it also costs several times more. Imagine if GPUs had gone down the path of the Radeon VII.
That being said, the data center cards from both are monstrous.
The Nvidia B200 has 180 GB of VRAM (2x90GB) offering 8.2 TB/s bandwidth (4.1 TB/s x2) released in 2024. It just costs as much as a car, but that doesn't matter, because afaik you can't even buy them individually. I think you need to buy a server system from Nvidia or Dell that will come with like 8 of these and cost you like $600k.
AMD has the Mi series, e.g. the AMD MI325x: 288 GB of VRAM doing 10 TB/s bandwidth, released in 2024. Same story as Nvidia: buy from an OEM that will sell you a full system with 8x of these (and if you do get your hands on one of these you need a special motherboard for them since they don't do PCIe). Supposedly a lot cheaper than Nvidia, but still probably $250k.
These are not even the latest and greatest for either company. The B300 and Mi355x are even better.
It's a shame about the socket for the Mi series GPUs (and the Nvidia ones too). The Mi200 and Mi250x would be pretty cool to get second-hand. They are 64 GB and 128 GB VRAM GPUs, but since they use the OAM socket you need the special motherboard to run them. They're from 2021, so in a few years' time they will likely be replaced, but as a regular joe you likely can't use them.
The systems exist, you just can't have them, but you can rent them in the cloud at about $2-4 per hour per GPU.
But it seems the biggest model available is 10B? Somewhat unusual and does make me wonder just how challenging it will be to train any model in the 100B order of magnitude.
The key insight of the BitNet paper was that using their custom BitLinear layer instead of normal Linear layers (as well as some more training and architecture changes) led to much, much better results than quantizing an existing model down to 1.58 bits. So you end up making a full training run in bf16 precision using the specially adapted model architecture.
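A rough sketch of the weight ternarization such a layer relies on, assuming the "absmean" scheme as I understand it from the BitNet b1.58 paper (scale by the mean absolute weight, then round and clip to -1, 0, 1). The function names and `eps` value are illustrative, and this is plain Python rather than the actual training code:

```python
def absmean_ternarize(weights, eps=1e-8):
    """Quantize a list of float weights to {-1, 0, 1} plus one float scale."""
    scale = sum(abs(w) for w in weights) / len(weights)   # mean absolute value
    # Round to the nearest integer, then clip into the ternary range
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale

def dequantize(ternary, scale):
    """Approximate the original weights from the ternary codes and the scale."""
    return [t * scale for t in ternary]
```

During training the full-precision weights would still be kept and updated, with this step applied in the forward pass; that part is not shown here.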
(only suggesting that it's intentional because it's been there so long)
> We evaluated bitnet.cpp in terms of both inference speed and energy cost. Comprehensive tests were conducted on models with various parameter sizes, ranging from 125M to 100B. specific configurations for each model are detailed in the Appendix A.
However, this user uses "—" in almost all his posts, and he was posting about one comment per minute across multiple different topics.
Edit: oh, just recalled dang restricted Show HNs the other day to only non-new users (possibly with some other thresholds). I wonder if word got out and some are filling accounts with activity.
It's not a question of if there are other bots out there, but only what % of comments on HN right now and elsewhere are bot generated. That number is only going to increase if nothing is done.
Also, aren't there different affinities for 8-bit vs 4-bit in inference?
can we stop already with these decimals and just call it "1 trit" which it exactly is?
I imagine you got 96gb because you thought you'd be running models locally? Did you not know the phrase Unified Memory is marketing speak?
```
Ecosystem Services and their impact on the Ecosystem
Ecosystem services refer to the services provided by ecosystems to the human society. These services include water, air, energy, nutrients, and soil (Jenkins, 2010). For instance, water is the most important service provided by an ecosystem and it helps in the conservation of water, irrigation and sanitation (Jenkins, 2010). On the other hand, air provides the oxygen needed for life.
The water cycle is a significant ecosystem service because it involves the cycling of water among the different parts of an ecosystem. It also involves the movement of water through the atmosphere, from one place to another. It is also the process of evaporation and condensation of water from the atmosphere. It also involves the movement of water from the air to the soil and water into the oceans.
The water cycle is a significant ecosystem service because it involves the cycling of water among the different parts of an ecosystem. It also involves the movement of water through the atmosphere, from one place to another. It is also the process of evaporation and condensation of water from the atmosphere. It also involves the movement of water from the air to the soil and water into the oceans.
```
Except on GSM8K and math...
A few months ago I used Whisper from OpenAI, an automatic speech recognition system released in 2022, on my modern 20-core Intel CPU to convert audio from a video file to text. It worked fine. Took a while and the machine got hot and the fans kicked in. I then found Intel's optimized version of Whisper that uses the NPU. It required a lot more steps to get working, but in the end it did work and was about 6x faster. And the machine remained cool and silent in the process. Since then I have become a fan of NPUs. They are not an NVIDIA GeForce RTX 5090, but they are significantly better than a modern CPU.
A claude 4.6 they are most certainly not, but if you get through the janky AF software ecosystem they can run small LLMs reasonably well with basically zero CPU/GPU usage
I was under the impression that they were primarily designed for low power use.
demo shows a huge love for water, this AI knows its home
Also as far as I know, this is more of a research curiosity - BitNet really doesn't perform that well on evals.
I think Qwen3.5 2B is the best you can get in the ~1GB class.
Furthermore, it was published 11 months ago, it's not a new release.
I had the same question. After some debate with ChatGPT: it's not the post-training "quantization" we often see these days; you have to use 1 trit from the very beginning, starting at pre-training.
The engineering/optimization work is nice, but this is not what people have been waiting for, which is an answer to: can the BitNet idea that seemed so promising really deliver in a competitive way?
With how much RAM? How much storage does it require?
Seems like that could end up as a situation where a fractional number of bits or bytes per parameter might make sense. Particularly with adverbs and adjectives, negators.
It seems to keep repeating that the water cycle is the main source of energy for all living things on the planet and then citing Jenkins 2010. There are also a ton of sentences beginning with “It also…”
I don’t even think it’s correct. The sun is the main source of energy for most living things but there’s also life near hydrothermal vents etc.
I don’t know who Jenkins is, but this model appears to be very fond of them and the particular fact about water.
I suppose fast and inaccurate is better than slow and inaccurate.
If you have an existing network, making an int4 quant is the better tradeoff. 1.58b quants only become interesting when you train the model specifically for it
On the other hand maybe it works much better than expected because llama3 is just a terrible baseline
My disappointment is immeasurable and my day is ruined.