BitNet: Inference framework for 1-bit LLMs

(github.com)

370 points · redm · 3mo ago · 167 comments

107 comments · 30 top-level

LuxBennu3mo ago· 38 in thread

The title is misleading — there's no trained 100B model, just an inference framework that claims to handle one. But the engineering is worth paying attention to. I run quantized 70B models locally (M2 Max 96GB, llama.cpp + LiteLLM), and memory bandwidth is always the bottleneck. The 1.58-bit approach is interesting because ternary weights turn matmuls into additions — a fundamentally different compute profile on commodity CPUs. If 5-7 tok/s on a single CPU for 100B-class models is reproducible, that's a real milestone for on-device inference. Framework is ready. Now we need someone to actually train the model.
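
A minimal sketch of the "ternary weights turn matmuls into additions" point, in Python for readability. The function name `ternary_matvec` and the list-of-lists weight layout are assumptions made for illustration; this shows only the arithmetic idea, not how bitnet.cpp's kernels are actually written.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries in {-1, 0, 1}) by a vector.

    No multiplications are performed: each weight either adds the
    activation, subtracts it, or skips it.
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            elif w != 0:
                raise ValueError("weights must be -1, 0, or 1")
        out.append(acc)
    return out


if __name__ == "__main__":
    W = [[1, 0, -1],
         [-1, -1, 1]]
    print(ternary_matvec(W, [2.0, 3.0, 5.0]))
```

A real kernel would operate on packed weights with SIMD instructions; the per-element branch here is only meant to make the add/subtract/skip structure visible.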

embedding-shape3mo ago

> Framework is ready. Now we need someone to actually train the model.

If Microslop aren't gonna train the model themselves to prove their own thesis, why would others? They've had 2 years (I think?) to prove BitNet in at least some way, are you really saying they haven't tried so far?

Personally that makes it slightly worrisome to just take what they say at face value, why wouldn't they train and publish a model themselves if this actually led to worthwhile results?

throwaw123mo ago

Because this is Microsoft: experimenting and failing is not encouraged; taking less risky bets and getting promoted is. Also, no customer asked them for a 1-bit model, hence the PM didn't prioritize it.

But that doesn't mean the idea is worthless.

You could have said the same about Transformers: Google released it but didn't move forward, and it turned out to be a great idea.

GorbachevyChase3mo ago

The most benign answer would be that they don’t want to further support an emerging competitor to OpenAI, which they have significant business ties to. I think the more likely answer which you hinted at is that the utility of the model falls apart as scale increases. They see the approach as a dead end so they are throwing the scraps out to the stray dogs.

observationist3mo ago

So is it finally time for a Beowulf cluster to do something amazing?

embeddnet3mo ago

Rest assured, all the big players (OpenAI, Google, DeepSeek, etc.) have run countless experiments with 4, 3, 2, 1.58, and 1 bits, and various sparse factors and shapes. This barrel has been scraped to the bottom.

gregman13mo ago

Cannot agree more!

deepsquirrelnet3mo ago

The title being misleading is important as well, because this has landed on the front page, and the claim in the title would be the only notable part of this submission.

The "new" on huggingface banner has weights that were uploaded 11 months ago, and it's 2B params. Work on this in the repo is 2 years old.

The amount of publicity compared to the anemic delivery for BitNet is impressive.

wongarsu3mo ago

I've also always thought that it's an interesting opportunity for custom hardware. Two-bit addition is incredibly cheap in hardware, especially compared to anything involving floating point. You could make huge vector instructions on the cheap, then connect it to the fastest memory you can buy, and you have a capable inference chip.

You'd still need full GPUs for training, but for inference the hardware would be orders of magnitude simpler than what Nvidia is making

monocasa3mo ago

These are trits, which provide their own efficiencies.

Interestingly, a trit x float multiplier is cheaper than a trit x integer multiplier in hardware if you're willing to ignore things like NaNs.

0 and 1 are trivial, just a mux for identity and zero. But because floats are sign-magnitude, multiply by -1 is just an inverter for the sign bit, whereas for integers you need a bitwise inverter and a full incrementer.
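
The sign-bit argument can be checked at the bit level. The sketch below assumes 32-bit IEEE 754 floats and 32-bit two's complement integers; the helper names are hypothetical and exist only for this illustration.

```python
import struct

MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


def float_to_bits(x: float) -> int:
    """Reinterpret a float as its 32-bit IEEE 754 bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def negate_float_bits(bits: int) -> int:
    """Sign-magnitude: negation flips exactly one bit."""
    return bits ^ SIGN_BIT


def negate_int_bits(bits: int) -> int:
    """Two's complement: invert every bit, then add one (a carry chain)."""
    return ((bits ^ MASK32) + 1) & MASK32


def bits_to_int(bits: int) -> int:
    """Interpret a 32-bit pattern as a signed two's complement integer."""
    return bits - (1 << 32) if bits & SIGN_BIT else bits


if __name__ == "__main__":
    print(bits_to_float(negate_float_bits(float_to_bits(1.5))))  # -1.5
    print(bits_to_int(negate_int_bits(6)))                       # -6
```

The increment in `negate_int_bits` is the part that costs hardware: it needs carry propagation across the full word, while the float case touches a single bit.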

regularfry3mo ago

You only need GPUs if you assume the training is gradient descent. GAs or anything else that can handle nonlinearities would be fine, and possibly fast enough to be interesting.

riidom3mo ago

Text is misleading too. 5-7 tok/sec is not reading speed, it's a tad slower. For me, at least, and I am an experienced reader, not especially schooled in quick-reading though.

I happened to "live" on 7.0-7.5 tok/sec output speed for a while, and it is an annoying experience. It is the equivalent of walking behind someone slightly slower on a footwalk. I dealt with this by deliberately looking away for a minute until output was "buffered" and only then started reading.

For any local setup I'd try to reach for 10 tok/sec. Sacrifice some kv cache and shove a few more layers on your GPU, it's worth it.

WithinReason3mo ago

> a fundamentally different compute profile on commodity CPU

In what way? On modern processors, a Fused Multiply-Add (FMA) instruction generally has the exact same execution throughput as a basic addition instruction

ismailmaj3mo ago

You drop the memory throughput requirements because of the packed representation of bits so an FMA can become the bottleneck, and you bypass the problem of needing to upscale the bits to whatever FP the FMA instruction needs.

Typically, for 1-bit matmul you can get away with XORs and popcounts, which should have a better throughput profile than FMA when taking into account the SIMD nature of the inputs/outputs.
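
A sketch of the XOR-plus-popcount trick for strictly binary (+1/-1) vectors, written in Python with arbitrary-precision ints standing in for SIMD registers. The encoding (bit 1 means +1, bit 0 means -1) and the function names are assumptions for illustration; ternary weights need an extra mask for zeros, which is not shown here.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an int, one bit per value."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
        elif v != -1:
            raise ValueError("values must be +1 or -1")
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    Positions where the bits differ contribute -1, positions where they
    agree contribute +1, so dot = n - 2 * popcount(a XOR b).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches


if __name__ == "__main__":
    a = [1, -1, 1, 1]
    b = [1, 1, -1, 1]
    print(binary_dot(pack_signs(a), pack_signs(b), len(a)))  # 0
    print(sum(x * y for x, y in zip(a, b)))                  # 0, the naive result
```

One XOR and one popcount replace `n` multiply-accumulates, which is where the throughput advantage over FMA would come from on wide registers.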

ActivePattern3mo ago

The win is in how many weights you process per instruction and how much data you load.

So it's not that individual ops are faster — it's that the packed representation lets each instruction do more useful work, and you're moving far less data from memory to do it.
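
Some back-of-the-envelope arithmetic for the "less data moved" point. The figures assumed here are a 64-byte cache line, 16 bits per fp16 weight, and 2 bits per packed ternary weight; the function names are hypothetical.

```python
def weights_per_load(load_bytes, bits_per_weight):
    """How many weights arrive in a single load of the given size."""
    return load_bytes * 8 // bits_per_weight


def model_gigabytes(params, bits_per_weight):
    """Raw weight storage in GB (10^9 bytes), ignoring scales and metadata."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    print(weights_per_load(64, 16))    # 32 fp16 weights per cache line
    print(weights_per_load(64, 2))     # 256 packed ternary weights per cache line
    print(model_gigabytes(100e9, 16))  # fp16 weights for 100B parameters
    print(model_gigabytes(100e9, 2))   # 2-bit weights for 100B parameters
```

Under these assumptions each cache line carries 8x as many weights, and every generated token has to stream roughly 8x fewer bytes of weights from memory.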

actionfromafar3mo ago

Perhaps the BitNet encoding is more information-dense per byte? CPUs have slow buses, so it would eke out more use of the bandwidth?

DrBazza3mo ago

> memory bandwidth is always the bottleneck

I'm hoping that today's complaints are tomorrow's innovations, like back when a 1 MB hard drive was $100,000, or when Gates said 640 KB is enough.

Perhaps someone in the (chip) industry can comment on what RAM manufacturers are doing at the moment: better, faster, larger? Or is there not much headroom left, and it's down to MOBO manufacturers and volume?

fc417fc8023mo ago

Chip speed has increased faster than memory speed for a long time now, leaving DRAM behind. GDDR was good for awhile but is no longer sufficient. HBM is what's used now.

The last logical step of this process would be figuring out how to mix the CPU transistors with the RAM capacitors on the same chip as opposed to merely stacking separate chips on the same package.

A related stopgap is the AI startup (forget which) making accelerators on giant chips full of SRAM. Not a cost effective approach outside of ML.

Aerroon3mo ago

We have faster memory, it's just all used in data center cards you can't buy (and can't afford to buy).

AMD actually used HBM2 memory in their Radeon VII card back in 2019 (!!) for $700. It had 16 GB of HBM2 memory with 1 TB/s throughput.

The RTX 5080, in comparison, also has 16 GB of VRAM, but was released in 2025 and has 960 GB/s throughput. The RTX 5090 does have an edge at 1.8 TB/s bandwidth and 32 GB of VRAM, but it also costs several times more. Imagine if GPUs had gone down the path of the Radeon VII.

That being said, the data center cards from both are monstrous.

The Nvidia B200 has 180 GB of VRAM (2x90GB) offering 8.2 TB/s bandwidth (4.1 TB/s x2) released in 2024. It just costs as much as a car, but that doesn't matter, because afaik you can't even buy them individually. I think you need to buy a server system from Nvidia or Dell that will come with like 8 of these and cost you like $600k.

AMD has the Mi series. Eg AMD MI325x. 288 GB of VRAM doing 10 TB/s bandwidth and released in 2024. Same story as Nvidia: buy from an OEM that will sell you a full system with 8x of these (and if you do get your hands on one of these you need a special motherboard for them since they don't do PCIe). Supposedly a lot cheaper than Nvidia, but still probably $250k.

These are not even the latest and greatest for either company. The B300 and Mi355x are even better.

It's a shame about the socket for the Mi series GPUs (and the Nvidia ones too). The Mi200 and Mi250x would be pretty cool to get second-hand. They are 64 GB and 128GB VRAM GPUs, but since they use the OAM socket you need the special motherboard to run them. They're from 2021, so in a few years' time they will likely be replaced, but as a regular joe you likely can't use them.

The systems exist, you just can't have them, but you can rent them in the cloud at about $2-4 per hour per GPU.

bigyabai3mo ago

For larger contexts, the bottleneck is probably token prefill instead of memory bandwidth. Supposedly prefill is faster on the M5+ GPUs, but still a big hurdle for pre-M5 chips.

joquarky3mo ago

It might be advantageous to have a different memory structure altogether, bespoke to the specific task.

rustyhancock3mo ago

Yes. I had to read it over twice, it does strike me as odd that there wasn't a base model to work with.

But it seems the biggest model available is 10B? Somewhat unusual and does make me wonder just how challenging it will be to train any model in the 100B order of magnitude.

wongarsu3mo ago

Approximately as challenging as training a regular 100B model from scratch. Maybe a bit more challenging because there's less experience with it

The key insight of the BitNet paper was that using their custom BitLinear layer instead of normal Linear layers (as well as some more training and architecture changes) led to much, much better results than quantizing an existing model down to 1.58 bits. So you end up making a full training run in bf16 precision using the specially adapted model architecture.

naasking3mo ago

What's unusual about it? It seems pretty standard to train small models to validate an approach, and then show that training scales with model size to 8B to 14B parameter models, which is what they did.

cat_plus_plus3mo ago

There are 1 bit average GGUFs of large models, not perfect quality but they will hold a conversation. These days, there is also quantized finetuning to heal the damage.

august113mo ago

In their demo they're running a 3B model.

webXL3mo ago

It comes from (intentionally?) misleading docs: https://github.com/microsoft/BitNet/issues/391

(only suggesting that it's intentional because it's been there so long)

verdverm3mo ago

That issue appears to be the one that's wrong. From the technical report

> We evaluated bitnet.cpp in terms of both inference speed and energy cost. Comprehensive tests were conducted on models with various parameter sizes, ranging from 125M to 100B. specific configurations for each model are detailed in the Appendix A.

cubefox3mo ago

LLM account

Springtime3mo ago

Hmm, the user joined in 2019 but had no submissions or comments until just 40 minutes ago (at least judging by the lack of a second page?) and all the comments are on AI related submissions. Benefit of doubt is it'd have to be a very dedicated lurker or dormant account they remembered they had.

Edit: oh, just recalled dang restricted Show HNs the other day to only non-new users (possibly with some other thresholds). I wonder if word got out and some are filling accounts with activity.

bottlepalm3mo ago

It's scary, without the em dashes, and the rapid fire commenting of the account - who would ever realize this is a bot? Two easy to fix things, and after that it'd be very difficult to tell that this is a bot.

It's not a question of if there are other bots out there, but only what % of comments on HN right now and elsewhere are bot generated. That number is only going to increase if nothing is done.

152334H3mo ago

Looks like gradual disempowerment is already happening - the minority of humans who are capable of spotting AI content are losing the struggle for attention on all major social networks

Jowsey3mo ago

Agreed. This is becoming an issue, see also: https://news.ycombinator.com/item?id=47259308

orbital-decay3mo ago

Funny enough I now involuntarily take RTFA as a slight slop signal, because all these accounts dutifully read the article before commenting, unlike most HNers who often respond to headlines.

nkohari3mo ago

I would love to understand the thought process behind this. I'm sure it's a fun experiment, to see if it's possible and so on... but what tangible benefit could there be to burning tokens to spam comments on every post?

cyanydeez3mo ago

Check out the new QWEN coder model.

Also, aren't there different affinities for 8-bit vs 4-bit inference?

RandomTeaParty3mo ago

> The 1.58-bit approach

can we stop already with these decimals and just call it "1 trit" which it exactly is?

hsbauauvhabzb3mo ago

Yeah because THAT won’t confuse the average reader.

butILoveLife3mo ago

> I run quantized 70B models locally (M2 Max 96GB, llama.cpp + LiteLLM), and memory bandwidth is always the bottleneck.

I imagine you got 96gb because you thought you'd be running models locally? Did you not know the phrase Unified Memory is marketing speak?

giancarlostoro3mo ago· 16 in thread

One of the things I often wonder is "what will be the minimally viable LLM" that can work from just enough information that if it googles the rest it can provide reasonable answers? I'm surprised something like Encyclopedia Britannica hasn't yet (afaik) tried to capitalize on AI by selling their data to LLMs and validating outputs for LLM companies; it would make a night-and-day difference in some areas, I would think. Wikipedia is nice, but there's so much room for human error and bias there.

andai3mo ago

Here's a short clip of Karpathy speaking on this subject.

https://youtu.be/UldqWmyUap4

Also this is the direction the small LLMs are moving in already. They are too small for general knowledge, but getting quite good at tool use (incl. Googling).

Now we just need them to be very strict about what they know and don't know! (I think this is still an open problem, even with big ones.)

intrasight3mo ago

It's not so much a "minimally viable LLM" but rather an LLM that knows natural language well but knows nothing else. Like me - as an engineer who knows how to troubleshoot in general but doesn't know about a specific device like my furnace (recent example).

And I don't think that LLM could just Google or check Wikipedia.

But I do agree that this architecture makes a lot of sense. I assume it will become the norm to use such edge LLMs.

ramses03mo ago

I asked this question a while back (the "only train w/ wikipedia LLM") and got pointed to the general-purpose "compression benchmarks" page: `https://www.mattmahoney.net/dc/text.html`

While I understand some of the fundamental thoughts behind that comparison, it's slightly wonky... I'm not asking "compress wikipedia really well", but instead "can a 'model' reason its way through wikipedia" (and what does that reasoning look like?).

Theoretically with wikipedia-multi-lang you should be able to reasonably nail machine-translation, but if everyone is starting with "only wikipedia" then how well can they keep up with the wild-web-trained models on similar bar chart per task performance?

If your particular training technique (using only wikipedia) can go from 60% of SOTA to 80% of SOTA on "Explain why 6-degrees of Kevin Bacon is relevant for tensor operations" (which is interesting to plug into Google's AI => Dive Deeper...), then that's a clue that it's not just throwing piles of data at the problem, but instead getting closer to extracting the deeper meaning (and/or reasoning!) that the data enables.

giancarlostoro3mo ago

Correct! I know RAG is a thing, but I wish we could have "DLCs" for LLMs, like image generation has LoRAs, which are cheaper to train than retraining the entire model and provide output more like what you want. I would love to pop in the CS "LoRA or DLC" and ask it about functional programming in Elixir, or whatever.

Maybe not crawl the web, but hit a service with pre-hosted, pre-curated content it can digest (and cache) that doesn't necessarily change often. You aren't necessarily using it for the latest news; programming, as a good example, is mostly static knowledge.

embedding-shape3mo ago

Your worry about Wikipedia is that there is "much room for human error and bias", yet earlier you seem to imply that a LLM that has access to the www somehow would have less human error and bias? Personally, I'd see it the other way around.

giancarlostoro3mo ago

When GPT 3.5 became a thing, it had crawled a very nuanced set of websites, this is what I mean. You basically curate where it sources data from.

bee_rider3mo ago

Isn’t that sort of what a RAG is? You’d need an LLM “smart” enough to turn natural-language user prompts into searches, then some kind of search, then an LLM “smart” enough to summarize the results.

giancarlostoro3mo ago

Yeah, I think RAG is the idea that will lead us there, though it's a little complicated, because for some subjects, say Computer Science, you need a little more than just "This is Hello World in Go": you might need to understand not just Go syntax on the fly, but more CS nuances that are not covered in one single simple document. The idea is to have a model that runs fully locally on a phone or laptop with minimal resources. On the other hand, I can also see smaller models talking to larger models that are cheaper to run in the cloud. I am wondering if this is the approach Apple might take with Siri, specifically in order to retain user privacy as much as possible.

andai3mo ago

I remember reading that hallucination is still a problem even with perfect context. You build a theoretical perfect RAG, give the LLM the exact correct information, and it will still make mistakes surprisingly often.

krychu3mo ago

Unfortunately reasoning ability depends on (or is enabled by) information intake during training. A model will know better what to search for and how to interpret it if the information was part of the training. So there is a trade off. Still I think the question is a practical one. Perhaps there are ideas to focus training on a) reasoning / conceptual modeling and b) reliance on external memory (search etc.) rather than internal memorization.

utopiah3mo ago

> validating outputs for LLM companies

How? They can validate thousands if not millions of queries, but nothing prevents the million-and-first from being a hallucination. People who would then pay extra for an "Encyclopedia Britannica validated LLM" would then, rightfully so IMHO, complain that "it" suggested they cook with a dangerous mushroom.

rablackburn3mo ago

I feel like I should say "spoiler alert" but:

> I often wonder is "what will be the minimally viable LLM" that can work from just enough information that if it googles the rest it can provide reasonable answers?

It depends what that word "reasonable" means for your specific use-case ;)

thinkingtoilet3mo ago

Wikipedia has proven to be as accurate as encyclopedias for decades now. Also, I'm betting AI companies have illegally trained their models on the Encyclopedia Britannica's data by now.

davidron3mo ago

It's perfectly legal to train a human on copyrighted work and I think, depending on the country, it's not settled that training ai on the same data is illegal.

naasking3mo ago

I think the idea is to train a small, minimal LLM thinking model that can run on edge devices, but that has very little knowledge embedded in its weights, and so performs a sort of RAG to Encyclopedia Britannica to ground answers to user queries.

uniq73mo ago

Since Google Search already includes an AI summary, your minimally viable "LLM" can be just an HTTP GET call

nickcw3mo ago· 4 in thread

> bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels, that support fast and lossless inference of 1.58-bit models on CPU and GPU (NPU support will coming next).

One bit or one trit? I am confused!

drsopp3mo ago

"1-bit LLMs" is just marketing. The Shannon entropy of one symbol drawn uniformly from a 3-symbol alphabet (-1, 0, 1) is log2(3) ≈ 1.58 bits.

Dwedit3mo ago

Log Base 2 of 3 = ~1.5849625, so that's the limit to how well you can pack three-state values into bits of data.

For something more practical, you can pack five three-state values within a byte because 3^5 = 243, which is smaller than 256. To unpack, you divide and modulo by 3 five separate times. This encodes data in bytes at 1.6 bits per symbol.

But the packing of 5 symbols into a byte was not done here. Instead, they packed 4 symbols into a byte to reduce computational complexity (no unpacking needed)
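
Both packings described above can be sketched in a few lines of Python. The function names and the specific 2-bit code (trit + 1 stored in each 2-bit field) are assumptions for illustration; the actual bit layout used by bitnet.cpp may differ.

```python
import math


def pack5(trits):
    """Pack five values from {-1, 0, 1} into one byte using base 3."""
    if len(trits) != 5:
        raise ValueError("expected exactly five trits")
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map -1, 0, 1 to digits 0, 1, 2
    return value                     # at most 3**5 - 1 = 242, so it fits in a byte


def unpack5(byte):
    """Recover five trits with five divide-and-modulo steps."""
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits


def pack4(trits):
    """Pack four trits into one byte, 2 bits each (simpler, less dense)."""
    if len(trits) != 4:
        raise ValueError("expected exactly four trits")
    value = 0
    for i, t in enumerate(trits):
        value |= (t + 1) << (2 * i)
    return value


def unpack4(byte):
    """Recover four trits with shifts and masks only, no division."""
    return [((byte >> (2 * i)) & 0b11) - 1 for i in range(4)]


if __name__ == "__main__":
    print(math.log2(3))                       # about 1.585 bits per trit
    print(unpack5(pack5([1, -1, 0, 0, 1])))   # [1, -1, 0, 0, 1]
    print(unpack4(pack4([-1, 0, 1, 1])))      # [-1, 0, 1, 1]
```

The trade-off is 8/5 = 1.6 bits per trit with division on unpack versus 8/4 = 2 bits per trit with only shifts and masks.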

cubefox3mo ago

Yeah, "1.58 bit" is 1 trit with three states, since log2(3)≈1.58.

So it's not an inference framework for 1-bit models (two states per parameter) but for 1.58-bit models (three states per parameter). Annoying that they try to mix up the two.

silon423mo ago

I always hope for "just a bunch of if statements" ... this is not it.

lemonish973mo ago· 4 in thread

I wonder when we begin to see the dividends of all the NPU PCs come into play. AMD have been doing some good work with their NPU/iGPU hybrid inference kernels. If these larger models could be scaled down to run on NPUs, you'd see much better power advantages, compared to running them on the CPU.

cheema333mo ago

> I wonder when we begin to see the dividends of all the NPU PCs come into play.

A few months ago I used Whisper from OpenAI, an automatic speech recognition system released in 2022, on my modern 20-core Intel CPU to convert audio from a video file to text. It worked fine. It took a while, the machine got hot, and the fans kicked in. I then found Intel's optimized version of Whisper that used the NPU. It required a lot more steps to get working, but in the end it did work and was about 6x faster. And the machine remained cool and silent in the process. Since then I have become a fan of NPUs. They are not an NVIDIA GeForce RTX 5090, but they are significantly better than a modern CPU.

Havoc3mo ago

You can already run some models on the NPUs in the Rockchip RK3588 SBCs which are pretty abundant.

A claude 4.6 they are most certainly not, but if you get through the janky AF software ecosystem they can run small LLMs reasonably well with basically zero CPU/GPU usage

throwa3562623mo ago

Are the NPUs really that powerful?

I was under the impression that they were primarily designed for low power use.

lemonish973mo ago

They seem to be getting better or more powerful. The newer Intel Panther lakes and AMD Ryzen are over 50 TOPS now, IIRC

radarsat13mo ago· 3 in thread

I'm curious if 1-bit params can be compared to 4- or 8-bit params. I imagine that 100B is equivalent to something like a 30B model? I guess only evals can say. Still, being able to run a 30B model at good speed on a CPU would be amazing.

regularfry3mo ago

At some point you hit information limits. With conventional quantisation you see marked capability fall-off below q5. All else being equal you'd expect an N-parameter 5-bit quant to be roughly comparable to a 3N-parameter ternary, if they are trained to the same level, just in terms of the amount of information they can possibly hold. So yes, 100B ternary would be within the ballpark of a 30B q5 conventional model, with a lot of hand-waving and sufficiently-smart-training
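
The "3N ternary is roughly N at 5 bits" estimate is just a ratio of raw bit budgets, which can be worked through directly. This sketch only compares storage capacity under the assumption that every bit is used equally well; it says nothing about measured model quality, and the function name is hypothetical.

```python
import math

BITS_PER_TRIT = math.log2(3)  # about 1.585


def equivalent_params(ternary_params, bits_per_param):
    """Parameter count at a given bit width with the same raw bit budget."""
    return ternary_params * BITS_PER_TRIT / bits_per_param


if __name__ == "__main__":
    # 100B ternary parameters expressed as 5-bit parameters: about 31.7B
    print(equivalent_params(100e9, 5) / 1e9)
    # The same budget expressed as 4-bit parameters: about 39.6B
    print(equivalent_params(100e9, 4) / 1e9)
```

The ratio 5 / log2(3) is about 3.15, which is where the "roughly 3N" figure comes from.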

cubefox3mo ago

I assume that theoretically, 1-bit models could be most efficient because modern models switched from 32 bit to 16 bit to 8 bit per parameter (without quantization).

throwa3562623mo ago

The paper has performance comparisons towards the end.

https://arxiv.org/abs/2402.17764

simonw3mo ago· 3 in thread

Anyone know how hard it would be to create a 1-bit variant of one of the recent Qwen 3.5 models?

regularfry3mo ago

There are q2 and q1 quants, if you want an idea of how much performance you'd drop. Not quite the same implementation-wise, but probably equivalent in terms of smarts.

nikhizzle3mo ago

Almost trivial using open source tools, the question is how it performs without calibration/fine tuning.

wongarsu3mo ago

The results would probably be underwhelming. The bitnet paper doesn't give great baselines to compare to, but in their tests a 2B network trained for 1.58bits using their architecture was better than Llama 3 8B quantized to 1.58bits. Though that 2B network was about on par with a 1.5B qwen2.5.

If you have an existing network, making an int4 quant is the better tradeoff. 1.58b quants only become interesting when you train the model specifically for it

On the other hand maybe it works much better than expected because llama3 is just a terrible baseline

htk3mo ago· 2 in thread

So Microsoft is actually using 2 bits instead of 1.58. In this case they could represent -1, 0, 1, 2. As inhibitory synapses account for 20%-30%, this could map well to how biological brains are structured.

Does that make sense?

hrimfaxi3mo ago

Can you explain your third statement?

> As inhibitory synapses account for 20%-30%, this could map well to how biological brains are structured.

DoctorOetker3mo ago

In the human brain most synapses are indeed excitatory, while a minority is inhibitory.

No concise HN comment will give you a complete picture of what's currently known about the human brain, so a platitude necessarily follows:

We call the nearly touching interfaces between neurons synapses; small packets / droplets of neurotransmitter are sent across this interface from the source to the target neuron. Such signals can be excitatory (raising the probability of the target firing soon) or inhibitory (lowering the probability of the target firing soon). There are 2 types of sensitive areas on your average neuron: the dendrites (long branching tentacles that receive excitatory signals) and the cell body, where all the signals are accumulated into a local instantaneous "sum". The cell body is also sensitive to synaptic activation, but the synapses on the cell body are inhibitory: when sufficiently inhibited, the neuron will refuse to fire its axons, so the inhibitory synapses on the cell body can gate the cumulative signal and temporarily prevent it from triggering this neuron. If the neuron does fire, this propagates along the axons (another type of branching tentacles, which lead to yet other neurons, sometimes touching them excitatorily at their dendrites, sometimes touching a neuron inhibitorily at its cell body).

I hope that helped?

StilesCrisis3mo ago· 2 in thread

The output from this model is horrible! It's GPT-2 level babble and repeats entire paragraphs verbatim. It also reuses the same fake citation `(Jenkins, 2010)` over and over again. From the start of their video (which scrolls by fast enough that you don't see the slop clearly...)

```
Ecosystem Services and their impact on the Ecosystem

Ecosystem services refer to the services provided by ecosystems to the human society. These services include water, air, energy, nutrients, and soil (Jenkins, 2010). For instance, water is the most important service provided by an ecosystem and it helps in the conservation of water, irrigation and sanitation (Jenkins, 2010). On the other hand, air provides the oxygen needed for life.

The water cycle is a significant ecosystem service because it involves the cycling of water among the different parts of an ecosystem. It also involves the movement of water through the atmosphere, from one place to another. It is also the process of evaporation and condensation of water from the atmosphere. It also involves the movement of water from the air to the soil and water into the oceans.
```

naasking3mo ago

It's a two year old base model that's only 3B parameters, trained on only 100B tokens. It's still a research project at this point.

gardnr3mo ago

The new model they just released has impressive benchmark results: https://huggingface.co/microsoft/bitnet-b1.58-2B-4T

Except on GSM8K and math...

QuadmasterXLII3mo ago· 2 in thread

The headline says a hundred billion parameters, yet none of the official models are over 10 billion parameters. Curious.

Tuna-Fish3mo ago

The project is an inference framework which should support a 100B parameter model at 5-7 tok/s on CPU. No one has quantized a 100B parameter model to 1 trit, but this existing is an incentive for someone to do so.

est3mo ago

> quantized a 100B parameter model to 1 trit

I had the same question. After some debates with ChatGPT: it's not the post-training "quantize" we often witness these days; you have to use 1 trit from the beginning, starting at pre-training.

Arcuru3mo ago· 1 in thread

It's good to see this getting some continued development. I looked into it last year[1] and I thought it showed a lot of promise so I've been very disappointed that I never saw a newer model.

[1] - https://jackson.dev/post/dont-sleep-on-bitnet/

cubefox3mo ago

I think this approach is not so interesting because it's just quantization of a full precision model. So it speeds up inference (at a quality penalty) but not training. It would be more interesting to train an actually binary model directly, without any floating point multiplication, like in this paper: https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...

152334H3mo ago· 1 in thread

but there is no trained 100b param model? "can run a 100B BitNet" is about the inference implementation, not about the existence of any such model

webXL3mo ago

I think they used a dummy model or else they would have linked to it. Just google '1-bit 100b model' and you'll only see references to this project without any download links.

itsthecourier3mo ago· 1 in thread

https://github-production-user-asset-6210df.s3.amazonaws.com...

demo shows a huge love for water, this AI knows its home

_fw3mo ago

Also, very influenced by the literature of Jenkins (2010).

herf3mo ago

https://arxiv.org/pdf/2310.11453 The original paper [fig 1, bottom-right] seems to say it needs about 4-5x the parameters of a fp16 model. You can build it and run some models, but the selection is limited because it has to be trained from scratch. I imagine inference speed is faster compared with modern PTQ (4- and 8-bit quants) though.

leventilo3mo ago

The energy numbers are the real story here, 70-82% reduction on CPU inference. If 1-bit models ever get good enough, running them on commodity hardware with no GPU budget changes who can deploy LLMs. That's more interesting than the speed benchmarks imo.

logicallee3mo ago

It might interest you to know that one or two months ago, I had Claude port BitNet to WebGPU from the reference implementation, so that it runs right in your browser as a local model. After some debugging, the port seemed to work, but the model didn't function as well as the reference implementation so I'll have to work on it for a while. You can see a debugging session livestreamed here[1]. The released model file was about a gigabyte, so it fits in most people's GPUs. We were also able to successfully fine-tune it right in the browser.

There's a lot that you can do when the model size is that small, yet still powerful.

Our next step is that we want to put up a content distribution network for it where people can also share their diffs for their own fine-tuned model. I'll post the project if we finish all the parts.

[1] https://www.youtube.com/live/x791YvPIhFo?is=NfuDFTm9HjvA3nzN

naasking3mo ago

I think the README [1] for the new CPU feature is of more interest, showing linear speedups with number of threads. Up to 73 tokens/sec with 8 threads (64 toks/s for their recommended Q6 quant):

[1] https://github.com/microsoft/BitNet/blob/main/src/README.md

kristopolous3mo ago

I don't see the news here ... there's https://huggingface.co/collections/microsoft/bitnet which is last updated 12/2025 ... am I just paying more attention here or is there something actually new about this?

Also as far as I know, this is more of a research curiosity - BitNet really doesn't perform that well on evals.

I think Qwen3.5 2B is the best you can get in the ~1GB class.

faldore3mo ago

Why did you call it a 100B parameter model? It is not 100B parameters. They published a 1B parameter and a 2B parameter model.

Furthermore, it was published 11 months ago, it's not a new release.

algoth13mo ago

Headline: 100B. Falcon 3 family: 10B. An order of magnitude off

bee_rider3mo ago

What’s the lower limit on the number of bits per parameter? If you use CSR-style sparse matrices to store the weights can it be less than 1?

WhitneyLand3mo ago

If they had a big result like, native 1.58-bit quality clearly matches top peers, they would be saying that prominently in the repo.

The engineering/optimization work is nice, but this is not what people have been waiting for. The real question is whether the BitNet idea that seemed so promising can really deliver in a competitive way.

Herring3mo ago

If this stuff was so revolutionary, don't you guys think Qwen/DeepSeek would have snapped it up already? Both those teams are highly innovative, picking up and inventing new techniques all the time. Hell, Deepseek-v3 was one of the first to do large scale fp8 training.

a1o3mo ago

> A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2

With how much RAM? How much storage does it require?

hinkley3mo ago

Do LLMs have a way to look at or consider dependent variables?

Seems like that could end up as a situation where a fractional number of bits or bytes per parameter might make sense. Particularly with adverbs and adjectives, negators.

philvas3mo ago

steve jobs would have loved the microsoft repo with demo on mac

syntaxing3mo ago

Misleading title but this is pretty exciting. Interesting how this is based on llama.cpp. It's nice to see some momentum since they released the paper in 2023.

janalsncm3mo ago

They have a demo video in the readme. I think they are trying to convey that BitNet is fast, which it is. But it is worth taking a moment to pause and actually see what the thing is doing so quickly.

It seems to keep repeating that the water cycle is the main source of energy for all living things on the planet and then citing Jenkins 2010. There are also a ton of sentences beginning with “It also…”

I don’t even think it’s correct. The sun is the main source of energy for most living things but there’s also life near hydrothermal vents etc.

I don’t know who Jenkins is, but this model appears to be very fond of them and the particular fact about water.

I suppose fast and inaccurate is better than slow and inaccurate.

1 more reply

almaight3mo ago

Could this ternary model be more easily replicated on the Taalas HC1?

knodi1233mo ago

Why would they film a demo video of it spewing out barely-coherent rambling repetitive drivel? If your model sucks at writing essays, maybe just tell us that, and film a demo of it doing something it IS good at?

rarisma3mo ago

No 100b model.

My disappointment is immeasurable and my day is ruined.


LuxBennu3mo ago· 38 in thread

observationist3mo ago

So is it finally time for a Beowulf cluster to do something amazing?

1 more reply

embeddnet3mo ago

1 more reply

gregman13mo ago

Cannot agree more!

deepsquirrelnet3mo ago

The title being misleading is important as well, because this has landed on the front page, and a 100B model would be the only notable part of this submission.

The "new" on huggingface banner has weights that were uploaded 11 months ago, and it's 2B params. Work on this in the repo is 2 years old.

The amount of publicity compared to the anemic delivery for BitNet is impressive.

wongarsu3mo ago

You'd still need full GPUs for training, but for inference the hardware would be orders of magnitude simpler than what Nvidia is making

monocasa3mo ago

These are trits, which provide their own efficiencies.

Interestingly, a trit x float multiplier is cheaper than a trit x integer multiplier in hardware if you're willing to ignore things like NaNs.
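To illustrate why a trit x float product is cheap: with weights restricted to -1, 0 and +1, each "multiply" reduces to add, subtract, or skip. A minimal sketch in plain Python, not the bitnet.cpp kernel; `ternary_matvec` is a hypothetical name used only for this example:

```python
def ternary_matvec(weights, x):
    """Multiply a matrix of {-1, 0, +1} weights by a float vector using only adds and subtracts."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1: add the activation
            elif w == -1:
                acc -= xi      # -1: subtract the activation
            # 0: the activation is skipped entirely
        out.append(acc)
    return out


W = [[1, 0, -1],
     [-1, -1, 1]]
x = [0.5, 2.0, 1.5]
print(ternary_matvec(W, x))
```

In hardware the same idea becomes a sign flip or a zeroing of the float, which is why no real multiplier is needed.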

1 more reply

regularfry3mo ago

You only need GPUs if you assume the training is gradient descent. GAs or anything else that can handle nonlinearities would be fine, and possibly fast enough to be interesting.

riidom3mo ago

Text is misleading too. 5-7 tok/sec is not reading speed, it's a tad slower. For me, at least, and I am an experienced reader, not especially schooled in quick-reading though.

For any local setup I'd try to reach for 10 tok/sec. Sacrifice some kv cache and shove a few more layers on your GPU, it's worth it.

WithinReason3mo ago

> a fundamentally different compute profile on commodity CPU

In what way? On modern processors, a Fused Multiply-Add (FMA) instruction generally has the exact same execution throughput as a basic addition instruction

ismailmaj3mo ago

typically for 1-bit matmul, you can get away with xors and pop_counts which should have a better throughput profile than FMA when taking into account the SIMD nature of the inputs/outputs.
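A rough sketch of that XOR + popcount trick for true 1-bit (+1/-1) vectors, assuming one bit per value with 1 meaning +1. The helper names are made up for illustration, and a real kernel would do this on SIMD registers rather than Python ints:

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into an int, one bit each (bit set means +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
        elif v != -1:
            raise ValueError("expected only +1 or -1")
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    # XOR marks the positions where the signs differ; each such position
    # contributes -1 to the dot product, every other position contributes +1.
    mismatches = bin(a_bits ^ b_bits).count("1")  # popcount
    return n - 2 * mismatches


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
print(binary_dot(pack_signs(a), pack_signs(b), len(a)))
```

Note this covers the two-state case only; ternary weights need an extra mask for the zeros.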

1 more reply

ActivePattern3mo ago

The win is in how many weights you process per instruction and how much data you load.

So it's not that individual ops are faster — it's that the packed representation lets each instruction do more useful work, and you're moving far less data from memory to do it.
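Some back-of-the-envelope arithmetic for the "far less data" part. This assumes a dense model where every weight is read once per generated token, and the 100 GB/s bandwidth figure is a hypothetical placeholder rather than a measurement of any machine:

```python
def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8


def max_tokens_per_sec(bandwidth_gb_s, model_gb):
    """Upper bound on decode speed if all weights are streamed once per token."""
    return bandwidth_gb_s / model_gb


N = 100_000_000_000      # the 100B figure from the headline
BANDWIDTH_GB_S = 100.0   # hypothetical memory bandwidth, assumption for illustration

for label, bits in [("fp16", 16), ("int4", 4),
                    ("ternary, 4 per byte", 2), ("ternary, 5 per byte", 8 / 5)]:
    size_gb = weight_bytes(N, bits) / 1e9
    bound = max_tokens_per_sec(BANDWIDTH_GB_S, size_gb)
    print(f"{label:>20}: {size_gb:6.1f} GB, at most {bound:4.1f} tok/s")
```

Under these assumptions the weights shrink from about 200 GB at fp16 to about 25 GB at 2 bits per weight, and the bandwidth-bound ceiling on tokens per second rises by the same factor of 8.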

actionfromafar3mo ago

Bitnet encoding more information dense per byte perhaps? CPUs have slow buses so would eke out more use of bandwidth?

DrBazza3mo ago

> memory bandwidth is always the bottleneck

I'm hoping that today's complaints are tomorrow's innovations. Think back to when a 1 MB hard drive was $100,000, or when Gates said 640 KB is enough.

fc417fc8023mo ago

Chip speed has increased faster than memory speed for a long time now, leaving DRAM behind. GDDR was good for a while but is no longer sufficient. HBM is what's used now.

The last logical step of this process would be figuring out how to mix the CPU transistors with the RAM capacitors on the same chip as opposed to merely stacking separate chips on the same package.

A related stopgap is the AI startup (forget which) making accelerators on giant chips full of SRAM. Not a cost effective approach outside of ML.

1 more reply

Aerroon3mo ago

We have faster memory, it's just all used in data center cards you can't buy (and can't afford to buy).

AMD actually used HBM2 memory in their Radeon VII card back in 2019 (!!) for $700. It had 16 GB of HBM2 memory with 1 TB/s throughput.

That being said, the data center cards from both are monstrous.

These are not even the latest and greatest for either company. The B300 and Mi355x are even better.

The systems exist, you just can't have them, but you can rent them in the cloud at about $2-4 per hour per GPU.

bigyabai3mo ago

For larger contexts, the bottleneck is probably token prefill instead of memory bandwidth. Supposedly prefill is faster on the M5+ GPUs, but still a big hurdle for pre-M5 chips.

joquarky3mo ago

It might be advantageous to have a different memory structure altogether, bespoke to the specific task.

rustyhancock3mo ago

Yes. I had to read it over twice, it does strike me as odd that there wasn't a base model to work with.

But it seems the biggest model available is 10B? Somewhat unusual and does make me wonder just how challenging it will be to train any model in the 100B order of magnitude.

wongarsu3mo ago

Approximately as challenging as training a regular 100B model from scratch. Maybe a bit more challenging because there's less experience with it

naasking3mo ago

cat_plus_plus3mo ago

There are 1 bit average GGUFs of large models, not perfect quality but they will hold a conversation. These days, there is also quantized finetuning to heal the damage.

august113mo ago

In their demo they're running 3B model.

webXL3mo ago

It comes from (intentionally?) misleading docs: https://github.com/microsoft/BitNet/issues/391

(only suggesting that it's intentional because it's been there so long)

verdverm3mo ago

That issue appears to be the one that's wrong. From the technical report

1 more reply

cubefox3mo ago

LLM account

Springtime3mo ago

Edit: oh, just recalled dang restricted Show HNs the other day to only non-new users (possibly with some other thresholds). I wonder if word got out and some are filling accounts with activity.

2 more replies

bottlepalm3mo ago

It's not a question of if there are other bots out there, but only what % of comments on HN right now and elsewhere are bot generated. That number is only going to increase if nothing is done.

152334H3mo ago

Looks like gradual disempowerment is already happening - the minority of humans who are capable of spotting AI content are losing the struggle for attention on all major social networks

Jowsey3mo ago

Agreed. This is becoming an issue, see also: https://news.ycombinator.com/item?id=47259308

orbital-decay3mo ago

Funny enough I now involuntarily take RTFA as a slight slop signal, because all these accounts dutifully read the article before commenting, unlike most HNers who often respond to headlines.

4 more replies

nkohari3mo ago

cyanydeez3mo ago

Check out the new QWEN coder model.

Also, aren't there different affinities for 8-bit vs 4-bit inference?

RandomTeaParty3mo ago

> The 1.58-bit approach

can we stop already with these decimals and just call it "1 trit" which it exactly is?

hsbauauvhabzb3mo ago

Yeah because THAT won’t confuse the average reader.

butILoveLife3mo ago

> I run quantized 70B models locally (M2 Max 96GB, llama.cpp + LiteLLM), and memory bandwidth is always the bottleneck.

I imagine you got 96gb because you thought you'd be running models locally? Did you not know the phrase Unified Memory is marketing speak?

giancarlostoro3mo ago· 16 in thread

andai3mo ago

Here's a short clip of Karpathy speaking on this subject.

https://youtu.be/UldqWmyUap4

Also this is the direction the small LLMs are moving in already. They are too small for general knowledge, but getting quite good at tool use (incl. Googling).

Now we just need them to be very strict about what they know and don't know! (I think this is still an open problem, even with big ones.)

intrasight3mo ago

And I don't think that LLM could just Google or check Wikipedia.

But I do agree that this architecture makes a lot of sense. I assume it will become the norm to use such edge LLMs.

ramses03mo ago

I asked this question a while back (the "only train w/ wikipedia LLM") and got pointed to the general-purpose "compression benchmarks" page: `https://www.mattmahoney.net/dc/text.html`

giancarlostoro3mo ago

1 more reply

embedding-shape3mo ago

giancarlostoro3mo ago

When GPT 3.5 became a thing, it had crawled a very nuanced set of websites, this is what I mean. You basically curate where it sources data from.

bee_rider3mo ago

giancarlostoro3mo ago

andai3mo ago

1 more reply

krychu3mo ago

utopiah3mo ago

> validating outputs for LLM companies

rablackburn3mo ago

I feel like I should say "spoiler alert" but:

> I often wonder is "what will be the minimally viable LLM" that can work from just enough information that if it googles the rest it can provide reasonable answers?

It depends what that word "reasonable" means for your specific use-case ;)

thinkingtoilet3mo ago

Wikipedia has proven to be as accurate as encyclopedias for decades now. Also, I'm betting AI companies have illegally trained their models on the Encyclopaedia Britannica's data by now.

davidron3mo ago

It's perfectly legal to train a human on copyrighted work and I think, depending on the country, it's not settled that training ai on the same data is illegal.

naasking3mo ago

uniq73mo ago

Since Google Search already includes an AI summary, your minimally viable "LLM" can be just an HTTP GET call

nickcw3mo ago· 4 in thread

One bit or one trit? I am confused!

drsopp3mo ago

"1-bit LLMs" is just marketing. The maximum Shannon entropy of one symbol from a 3-symbol alphabet (-1, 0, 1) is log2(3) ≈ 1.58 bits.

Dwedit3mo ago

Log Base 2 of 3 = ~1.5849625, so that's the limit to how well you can pack three-state values into bits of data.

But the packing of 5 symbols into a byte was not done here. Instead, they packed 4 symbols into a byte to reduce computational complexity (no unpacking needed)
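For anyone curious what the two packings look like, here is a small sketch. The function names and bit order are my own choices for illustration and are not claimed to match the layout bitnet.cpp actually uses. Five trits fit in a byte because 3^5 = 243 ≤ 256 (1.6 bits per weight, close to the log2(3) limit), while the 4-per-byte scheme spends a fixed 2 bits per weight and unpacks with shifts and masks only:

```python
import math


def pack5(trits):
    """Pack exactly 5 values from {-1, 0, 1} into one byte as a base-3 number."""
    if len(trits) != 5:
        raise ValueError("need exactly 5 trits")
    byte = 0
    for t in reversed(trits):
        byte = byte * 3 + (t + 1)   # map -1, 0, 1 to digits 0, 1, 2
    return byte                      # always in 0..242


def unpack5(byte):
    """Inverse of pack5; needs repeated division by 3."""
    out = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        out.append(digit - 1)
    return out


def pack4(trits):
    """Pack exactly 4 values from {-1, 0, 1} into one byte, 2 bits each."""
    if len(trits) != 4:
        raise ValueError("need exactly 4 trits")
    byte = 0
    for i, t in enumerate(trits):
        byte |= (t + 1) << (2 * i)
    return byte


def unpack4(byte):
    """Inverse of pack4; only shifts and masks."""
    return [((byte >> (2 * i)) & 0b11) - 1 for i in range(4)]


print(f"limit: {math.log2(3):.4f} bits/trit, 5-per-byte: {8 / 5} bits, 4-per-byte: {8 / 4} bits")
```

The trade is roughly 20% more storage for the 2-bit layout in exchange for skipping the base-3 decode.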

1 more reply

cubefox3mo ago

Yeah, "1.58 bit" is 1 trit with three states, since log2(3)≈1.58.

So it's not an inference framework for 1-bit models (two states per parameter) but for 1.58-bit models (three states per parameter). Annoying that they try to mix up the two.

silon423mo ago

I always hope for "just a bunch of if statements" ... this is not it.

2 more replies

lemonish973mo ago· 4 in thread

cheema333mo ago

> I wonder when we begin to see the dividends of all the NPU PCs come into play.

Havoc3mo ago

You can already run some models on the NPUs in the Rockchip RK3588 SBCs which are pretty abundant.

A claude 4.6 they are most certainly not, but if you get through the janky AF software ecosystem they can run small LLMs reasonably well with basically zero CPU/GPU usage

throwa3562623mo ago

Are the NPUs really that powerful?

I was under the impression that they were primarily designed for low power use.

lemonish973mo ago

They seem to be getting better or more powerful. The newer Intel Panther lakes and AMD Ryzen are over 50 TOPS now, IIRC

radarsat13mo ago· 3 in thread

regularfry3mo ago

cubefox3mo ago

I assume that theoretically, 1-bit models could be most efficient because modern models switched from 32 bit to 16 bit to 8 bit per parameter (without quantization).

1 more reply

throwa3562623mo ago

The paper has performance comparisons towards the end.

https://arxiv.org/abs/2402.17764

simonw3mo ago· 3 in thread

Anyone know how hard it would be to create a 1-bit variant of one of the recent Qwen 3.5 models?

regularfry3mo ago

There are q2 and q1 quants, if you want an idea of how much performance you'd drop. Not quite the same implementation-wise, but probably equivalent in terms of smarts.

nikhizzle3mo ago

Almost trivial using open source tools, the question is how it performs without calibration/fine tuning.

wongarsu3mo ago

If you have an existing network, making an int4 quant is the better tradeoff. 1.58-bit quants only become interesting when you train the model specifically for it.

On the other hand maybe it works much better than expected because llama3 is just a terrible baseline
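For context, the rounding step itself is simple. As I understand the BitNet b1.58 paper, it uses an "absmean" scheme: scale the weights by their mean absolute value, then round and clip to {-1, 0, +1}. A minimal sketch of that step applied after the fact to a flat list of weights; the function name and the `eps` value are assumptions for illustration, and this says nothing about how well an existing model survives it:

```python
def absmean_ternarize(weights, eps=1e-8):
    """Round float weights to {-1, 0, +1} after scaling by their mean absolute value.

    Returns (trits, scale) so the weights can be approximated as trit * scale.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    trits = []
    for w in weights:
        q = round(w / (scale + eps))      # nearest integer
        trits.append(max(-1, min(1, q)))  # clip to the ternary range
    return trits, scale


trits, scale = absmean_ternarize([0.9, -0.1, -1.2, 0.05])
print(trits, scale)
```

Every weight collapses to one of three values times a single shared scale, which is a far coarser grid than int4, hence the point about training for it from the start.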

htk3mo ago· 2 in thread

Does that make sense?

hrimfaxi3mo ago

Can you explain your third statement?

> As inhibitory synapses account for 20%-30%, this could map well to how biological brains are structured.

DoctorOetker3mo ago

In the human brain most synapses are indeed excitatory, while a minority is inhibitory.

No concise HN comment will give you a complete picture of what's currently known about the human brain, so a platitude necessarily follows:

I hope that helped?

2 more replies

StilesCrisis3mo ago· 2 in thread

> Ecosystem Services and their impact on the Ecosystem

naasking3mo ago

It's a two year old base model that's only 3B parameters, trained on only 100B tokens. It's still a research project at this point.

gardnr3mo ago

The new model they just released has impressive benchmark results: https://huggingface.co/microsoft/bitnet-b1.58-2B-4T

Except on GSM8K and math...

2 more replies

QuadmasterXLII3mo ago· 2 in thread

The headline says a hundred billion parameters, yet none of the official models are over 10 billion parameters. Curious.

Tuna-Fish3mo ago

est3mo ago

> quantized a 100B parameter model to 1 trit

I had the same question. After some back-and-forth with ChatGPT, the answer seems to be that it's not the post-training "quantize" we often see these days; you have to use 1 trit from the beginning, starting at pre-training.

Arcuru3mo ago· 1 in thread

It's good to see this getting some continued development. I looked into it last year[1] and I thought it showed a lot of promise so I've been very disappointed that I never saw a newer model.

[1] - https://jackson.dev/post/dont-sleep-on-bitnet/

cubefox3mo ago

152334H3mo ago· 1 in thread

but there is no trained 100b param model? "can run a 100B BitNet" is about the inference implementation, not about the existence of any such model

webXL3mo ago

I think they used a dummy model or else they would have linked to it. Just google '1-bit 100b model' and you'll only see references to this project without any download links.

itsthecourier3mo ago· 1 in thread

https://github-production-user-asset-6210df.s3.amazonaws.com...

demo shows a huge love for water, this AI knows its home

_fw3mo ago

Also, very influenced by the literature of Jenkins (2010).
