Ternary Bonsai: Top Intelligence at 1.58 Bits (opens in new tab)

(prismml.com)

202 pointsnnx22d ago54 comments

54 comments

Open access for next 5 hours (Ternary-Bonsai-8B-Q2_0.gguf, running on RTX 3090) or until server crashes or the this spot instance gets taken away :) =>

https://uklkyvetsjf7qt-80.proxy.runpod.net

    ./build/bin/llama-server \
     -m ../Ternary-Bonsai-8B-Q2_0.gguf \
     -ngl 999 \
     --flash-attn on \
     --host 0.0.0.0 \
     --port 80 \
     --ctx-size 65500 \
     --batch-size 512 \
     --ubatch-size 512 \
     --parallel 5 \
     --cont-batching \
     --threads 8 \
     --threads-batch 8 \
     --cache-type-k q8_0 \
     --cache-type-v q8_0 \
     --log-colors on

# llama.cpp is forked one: https://github.com/PrismML-Eng/llama.cpp.git

# The server can serve 5 parallel request, with each request capped at around `13K` tokens...

# A bit of of benchmarks I did:

# 1. Input: 1001 tokens, ttfs: 0.3 second, outputs: 1618 tokens ~140t/s

# 2. Input: 9708 tokens, ttfs: 2.4 second, outputs: 2562 tokens at ~106t/s

# Vram usage was consistently at ~7GiB.

> https://huggingface.co/prism-ml/Ternary-Bonsai-8B-gguf/resol...

sigmoid1019d ago

Thanks a lot, I was about to clone their llama.cpp branch and do the same.

Some more interesting tidbits from my go-to tests:

* Fails the car wash test (basic logic seems to be weak in general)

* Fails simple watch face generation in html/css.

* Fails the "how many Rs in raspberry test" (not enough cross-token training data), but will funnily assume you may be talking about Indian Rupees and tell you a lot about raspberry prices in India without being asked. Possible Indian training data unbalance?

* Flat out refuses to talk about Tiananmen square when pushed directly - despite being from a US company. Again, perhaps they are exposed to some censored training data? Anyways, when slowly built up along the conversation by asking about locations and histories, it will eventually tell you about the massacre, so the censorship bias seems weak in general. Also has no problem immediately talking about anything Gaza/Israel/US or other sensitive topics.

* Happily tells you how to synthesize RDX with list of ingredients and chemical process step by step. At least it warns you that it is highly dangerous and legally controlled in the US.

yorwba19d ago

The 1-bit Bonsai and Ternary Bonsai models are all based on the corresponding Qwen3 model: https://raw.githubusercontent.com/PrismML-Eng/Bonsai-demo/re... (page 4)

sigmoid1019d ago

Thanks, already suspected as much. Also gives context to the other comment here that says it is basically equivalent in accuracy to Qwen3.5-4B. Essentially seems to be a very good quantization of that model, not a new BitNet.

1 more reply

morpheos13719d ago

I think the future of economically useful AI is to build efficient reasoners. A goal of model as oracle of truth or encylopedia for our world is orthagonal to a goal of a model that can reason about novel input. I think the focus on meme litmus tests is somewhat misguided. AI is not suited to be a spell checker, a news source, a history book, or an idependent developer. but if it can reason about a prompt and augment human effort then that is useful. the idea of an ai without human like world model matching or exceeding our own world model or store of facts is misguided in my opinion.

morpheos13719d ago

I think the future of economically useful AI is to build efficient reasoners. A goal of model as oracle of truth or encylopedia for our world is orthagonal to a goal of a model that can reason about novel input. I think the focus on meme litmus tests is somewhat misguided. AI is not suited to be a spell checker, a news source, a history book, or an idependent developer. but if it can reasom about a prompt and augment human effort then that is useful. the idea of an ai without human like world model matching or exceeding our own world model or store of facts is misguided in my opinion.

vegnus19d ago

The future of AI I would imagine is that AI acts purely as logic within constraints

walrus0119d ago

If you ask it who Ahmad Al Sharaa is, it outputs an answer that he was an ISIS commander (he wasn't, his group was rather opposed to them), and that he was killed in a US airstrike in 2017. It also refuses to answer direct questions about the cutoff date in its training dataset, but if you ask it who the president of syria is, it happily says that as of the training cutoff date of october 2023, bashar al-assad is the president. Same question about who the prime minister of canada is gets the same october 2023 response and says it's Trudeau.

Interestingly if you ask it "who is the dalai lama" it outputs an answer that certainly wouldn't be acceptable to post online in mainland china. It says:

The Dalai Lama is the spiritual and political leader of Tibet, and the title is held by the incarnate lama who is believed to be a reincarnation of the previous Dalai Lama. The Dalai Lama is a central figure in Tibetan Buddhism and holds a unique position in Tibetan society, both as a religious leader and as a symbol of Tibetan identity and independence. Key Points About the Dalai Lama: 1. Current Dalai Lama: Tenzin Gyatso (14th Dalai Lama)

    Born: 1935 in Amdo, Tibet (now part of China).
    Education: He studied at Ganden Monastery and later at Dunhuang University in China.
    Title: He is known as the 14th Dalai Lama, a title that has been passed down through generations of lamas who are believed to be reincarnations.
    Role: He is the spiritual leader of Tibetan Buddhism and the de facto leader of the Tibetan diaspora, particularly in India.

2. Historical Context

    The title of Dalai Lama has been held by Tibetan Buddhist lamas for centuries. It means "Ocean of Wisdom" in Tibetan.
    The 13th Dalai Lama (1876–1959) was a key figure in Tibetan history, known for his efforts to maintain Tibetan independence and promote modernization.
    He fled to India in 1950 after the Chinese Communist Party (CCP) took control of Tibet in 1951.

3. 14th Dalai Lama: Exile and Leadership

    The 14th Dalai Lama, Lama Tenzin Gyatso, was exiled to India in 1959 after the Tibetan Uprising.
    He has led the Tibetan government-in-exile in India since then, advocating for Tibetan independence and cultural preservation.
    He has also been a prominent voice for human rights and peace.

4. Political Status

    The 14th Dalai Lama is not recognized by the Chinese government as the legitimate ruler of Tibet, which is part of the People's Republic of China (PRC).
    The Central Tibetan Administration (CTA), based in India, is the self-proclaimed government of Tibet.
    The PRC considers Tibet to be an inseparable part of China and has implemented policies to integrate Tibet into the Chinese state.

5. International Recognition

    The 14th Dalai Lama is widely recognized internationally as the spiritual leader of Tibet and as the de facto leader of the Tibetan people.
    Many countries and international organizations support the Tibetan cause, but they are not recognized as a sovereign state.

Summary:

The Dalai Lama is the spiritual and political leader of Tibet, currently Lama Tenzin Gyatso (14th Dalai Lama), who has been in exile in India since 1959. He is a central figure in Tibetan identity and has played a key role in the global Tibetan movement for independence and cultural preservation.

If you'd like to know more about his background, policies, or role in the Tibetan independence movement, feel free to ask!

haellsigh19d ago

Fyi, I believe `--flash-attn on` doesn't do anything, you should instead use `--flash-attn 1`. I'm getting ~150t/s on a RTX 3080 10GB as well with f16 cache type.

freakynit19d ago

Thanks.. updated my local docs :)

opem19d ago

played around with it a little, works great for - basic conversational tasks - basic reasoning - creating little UI snippets and not good for one shot coding tasks. once it starts hallucinating about something you can't make it rectify that. during a coding task it wrote some madeup library function, despite of pointing it out, giving docs, it kept producing the same code with the madeup function all while acknowledging the problem. btw, thanks for putting it together!

freakynit19d ago

update: Well, spot survived... and since a lot of the folks are still using it, I'm keeping it alive for 2 hours more.

freakynit19d ago

Update 2: Terminated.

Thank you all for trying it out.

armanj19d ago

I did a quick benchmark & compared it with Qwen3.5: https://github.com/ArmanJR/PrismML-Bonsai-vs-Qwen3.5-Benchma...

in my results, accuracy-wise Ternary-Bonsai-8B is on par with Qwen3.5-4B. But in accuracy-per-byte, bonsai is the clear winner:

=> Ternary-Bonsai-1.7B achieved 65.1% from 462 MiB, beating Qwen3.5-0.8B by 12 points while being ~5% smaller on disk. => Ternary-Bonsai-4B is the accuracy-per-byte winner above 1 GiB. 83.0% from only 1.1 GiB, within 2 points of Qwen3.5-4B at 40% of the weight size.

they show strong promise on edge devices and where disk space is limited. I think this lab is worth watching.

Animats19d ago

This makes sense. The 1-bit model implies needing 2x as many neurons, because you need an extra level to invert. But the ternary model still has a sign, just really low resolution.

(I've been reading the MMLU-Redux questions for electrical engineering. They're very funny. Fifty years ago they might have been relevant. The references to the Intel 8085 date this to the mid-1970s. Moving coil meters were still a big thing back then. Ward-Leonard drives still drove some elevators and naval guns. This is supposed to be the hand-curated version of the questions. Where do they get this stuff? Old exams?)

[1] https://github.com/aryopg/mmlu-redux/blob/main/outputs/multi...

philipp-gayret19d ago

Nice work, I applied my own benchmarking tools to it.

On my single NVidia Spark I get 173.3 tokens/s on baseline config, 372.4 tokens/s with added tuning/parallel options. Most notably time to first token is incredibly low, similar models take ~6000ms. Bonsai was 70ms (almost 100x reduction) with flash attention

Having said all that, gemma4-e4b-q4km did much better and I can achieve 70% of the tokens/s on the same machine, specifically in context of tool use and for running agents.

usernametaken2919d ago

I think it’s exciting to live in this quirky universe where we have simply accepted our hardware does weird and nonlinear stuff and that powers some math and that’s why your transform function works. Many people thought quantisation is not viable to the extent we see, but we clearly underestimated the effect of hardware on the actual non linearity of the models. Cool to see this pushed to the limits.

freakynit19d ago

Nature has already set an absurdly high bar. The human brain runs on roughly 20 watts, yet delivers a level of intelligence we still can't clearly define, let alone replicate. Nothing we've built comes close... either in capability or efficiency. We're still very early in understanding what "intelligence" even means, much less engineering it. so, we have a long way to go, and push.

heltale19d ago

This reminds me of DeepGrove AI’s ternary weight model: https://github.com/deepgrove-ai/Bonsai

Is there a particular difference that I might be missing here?

swiftcoder19d ago

Does this sort of thing scale? Would a 30B or higher model see similar performance/memory gains under this scheme?

zkmon19d ago

The raw math: File size is hard-linked to parameter count and quant type. Intelligence is sort of linked to parameter count. Parameter count dictates the hardware requirement. What't left for the labs is, compressing more intelligence into lower parameter count, or packing more of specialized intelligence or buying up more hardware. Those are the only 3 directions all models/labs are heading.

WatchDog19d ago

All of their benchmarks are against 16 bit models right?

Why aren't they comparing to 2/3/4 bit quants?

himata411319d ago

looked at quant versions of these models and they all outperform it so I guess it just doesn't look as good.

yodon19d ago

So excited to see this - the big advantage of 1.58 bits is there are no multiplications at inference time, so you can run them on radically simpler and cheaper hardware.

Animats19d ago

At 4 bits, you could just have a hard-wired table lookup. Two 4 bit values in, 256 entry table. You can have saturating arithmetic and a post-processing function for free. Somebody must be building hardware like that.

AlotOfReading19d ago

A LUT is pretty wasteful. You only have a one bit significand, so the mantissa and sign bits are boolean binops, and the exponent is a 2 bit adder.

Taniwha19d ago

and so you can at 1-bit too, and the hardware will be even smaller and cheaper

mchusma19d ago

Ever since I saw the first one of these one-bit models made by Microsoft, I thought this was a fascinating route. I assume that in practice, this is less helpful than it seems, just because there's every economic incentive in the world for the big AI labs to produce small, powerful, fast models. None of them seem to be using this technique, so it's interesting, but I suspect it's not quite working.

I also have yet to see any of these at a larger scale. For example, can you try one of these at 100 billion parameters?

mungoman219d ago

This is very interesting and exciting, but IMHO the comparisons read as a bit disingenuous with the other models at 16 bit weights. The 16 bit releases of the others models are not optimized for size, making it difficult to take the comparison seriously.

Would be interesting to see a comparison to quantized versions of the other models. If this model beats the others also in a fair comparison it gives more credibility to it.

londons_explore19d ago

How is the research on training these models directly in their quantized state going?

That'll be the real game changer.

sigmoid1019d ago

The original BitNet was natively trained on 1.58 bits. PrismML has not released any actual info on how they trained, but since they are based on Qwen, there was certainly some downstream quantization involved.

usrusr19d ago

Is it just quantization or is it also rearranging the weights to get clusters with (almost) the same factors? If it's the latter it would very much be training in full precision (but also hardly any precision lost by the compression).

Unfortunately my mental model doesn't contain anything to even guess if that's possible or not, my AI times were at the falling flank of symbolic. Funny how one bit models feel a bit like approaching an approximation of symbolic again (until you read about the grouped scale factors and then the illusion is gone)

One thought that suggests rearranging is not involved,a thought that does not require any knowledge at all: if it did involve rearranging, someone would certainly have added some order by scale factor tricks with linear interpolation by address offset to lose even less precision.

Reubend18d ago

It's not just quantization. I verified that if you naïvely quantize to 1 bit from the original Qwen model (and set grouped scale factors based on what the original model's weights were like), it just spits out gibberish.

> One thought that suggests rearranging is not involved,a thought that does not require any knowledge at all: if it did involve rearranging, someone would certainly have added some order by scale factor tricks with linear interpolation by address offset to lose even less precision.

Can you elaborate?

cubefox19d ago

This is the only paper which really does this:

https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...

They train directly in the 1 bit domain, without any floating point weights. They don't use the classical Newton-Leibniz derivative (which operates on approximations of real numbers) for gradient descent / backpropagation. Instead they invented a binary version called "Boolean variation".

I don't know why this paper didn't get more attention.

tiagod19d ago

> Fig IV: Throughput (toks/sec) and energy consumption (mWh/tok) across various hardware platforms.

I don't see any mWh/token figures in that chart.

ericb19d ago

This is pretty cool! I would love to see an even larger models shrunk down.

If you got that into a couple gigs--what could you stuff into 20 gigs?

wmf19d ago

Yet again they're comparing against unquantized versions of other models. They would probably still win but by a much smaller size margin.

Dumbledumb19d ago

Wouldnt the margin be higher? All other models being moved from unquantized to quantized would lower their performance, while bonsai stays. I get what you see if it was in regards to score/modelsize, but not for absolute performance

SwellJoe19d ago

The metric they're selling this on is intelligence per byte, rather than total intelligence. So, if they used the quantized competing models, the intelligence per byte gap shrinks, because most models hold up very well down to 6-bit quantization, and 4-bit is usually still pretty good, though intelligence definitely tends to fall below 6-bit.

Nonetheless, the Prism Bonsai models are impressive for their size. Where it falls apart is with knowledge. It has good prose/logic for a tiny model, and it's fast even on modest hardware, but it hallucinates a lot. Which makes sense. You can't fit the world's data in a couple of gigabytes. But, as a base model for fine-tuning for use cases where size matters, it's probably a great choice.

happygoose19d ago

unfortunately, there doesn't seem to be a clear way to fine-tune these models yet. excited for when that happens though.

syntex19d ago

hallucinates in pretty much every answer

TimorousBestie19d ago

This model tends to be annoyingly literal. An example from earlier today:

>> What are some names like Llewelyn?

> Some names like Llewelyn are Llewelyn, Llewelyn, Llewelyn, (repeats several times), and Llewelyn.

gbgarbeb19d ago

When do we get 1100B Kimi K2.6 in 160 GB of memory at 1.125 bpw?

est19d ago

installed since last HN post. So Bonsai (1-bit) and Ternary-Bonsai are different?

Can it be run on browsers with WASM/WebGPU?

xenova19d ago

yep! :) https://huggingface.co/spaces/webml-community/bonsai-ternary...

goofy_lemur19d ago

> On M4 Pro, Ternary Bonsai 8B runs at 82 toks/sec, roughly 5x faster than a 16-bit 8B model

Wow, if this is true, I am extremely impressed and excited!

I wonder about kv cache how much better it is as well!

j / k navigate · click thread line to collapse

54 comments

freakynit19d ago

Open access for next 5 hours (Ternary-Bonsai-8B-Q2_0.gguf, running on RTX 3090) or until server crashes or the this spot instance gets taken away :) =>

https://uklkyvetsjf7qt-80.proxy.runpod.net

    ./build/bin/llama-server \
     -m ../Ternary-Bonsai-8B-Q2_0.gguf \
     -ngl 999 \
     --flash-attn on \
     --host 0.0.0.0 \
     --port 80 \
     --ctx-size 65500 \
     --batch-size 512 \
     --ubatch-size 512 \
     --parallel 5 \
     --cont-batching \
     --threads 8 \
     --threads-batch 8 \
     --cache-type-k q8_0 \
     --cache-type-v q8_0 \
     --log-colors on

# llama.cpp is forked one: https://github.com/PrismML-Eng/llama.cpp.git

# The server can serve 5 parallel request, with each request capped at around `13K` tokens...

# A bit of of benchmarks I did:

# 1. Input: 1001 tokens, ttfs: 0.3 second, outputs: 1618 tokens ~140t/s

# 2. Input: 9708 tokens, ttfs: 2.4 second, outputs: 2562 tokens at ~106t/s

# Vram usage was consistently at ~7GiB.

> https://huggingface.co/prism-ml/Ternary-Bonsai-8B-gguf/resol...

sigmoid1019d ago

Thanks a lot, I was about to clone their llama.cpp branch and do the same.

Some more interesting tidbits from my go-to tests:

* Fails the car wash test (basic logic seems to be weak in general)

* Fails simple watch face generation in html/css.

* Happily tells you how to synthesize RDX with list of ingredients and chemical process step by step. At least it warns you that it is highly dangerous and legally controlled in the US.

yorwba19d ago

The 1-bit Bonsai and Ternary Bonsai models are all based on the corresponding Qwen3 model: https://raw.githubusercontent.com/PrismML-Eng/Bonsai-demo/re... (page 4)

sigmoid1019d ago

1 more reply

morpheos13719d ago

I think the future of economically useful AI is to build efficient reasoners. A goal of model as oracle of truth or encylopedia for our world is orthagonal to a goal of a model that can reason about novel input. I think the focus on meme litmus tests is somewhat misguided. AI is not suited to be a spell checker, a news source, a history book, or an idependent developer. but if it can reasom about a prompt and augment human effort then that is useful. the idea of an ai without human like world model matching or exceeding our own world model or store of facts is misguided in my opinion.

vegnus19d ago

The future of AI I would imagine is that AI acts purely as logic within constraints

walrus0119d ago

Interestingly if you ask it "who is the dalai lama" it outputs an answer that certainly wouldn't be acceptable to post online in mainland china. It says:

    Born: 1935 in Amdo, Tibet (now part of China).
    Education: He studied at Ganden Monastery and later at Dunhuang University in China.
    Title: He is known as the 14th Dalai Lama, a title that has been passed down through generations of lamas who are believed to be reincarnations.
    Role: He is the spiritual leader of Tibetan Buddhism and the de facto leader of the Tibetan diaspora, particularly in India.

2. Historical Context

    The title of Dalai Lama has been held by Tibetan Buddhist lamas for centuries. It means "Ocean of Wisdom" in Tibetan.
    The 13th Dalai Lama (1876–1959) was a key figure in Tibetan history, known for his efforts to maintain Tibetan independence and promote modernization.
    He fled to India in 1950 after the Chinese Communist Party (CCP) took control of Tibet in 1951.

3. 14th Dalai Lama: Exile and Leadership

    The 14th Dalai Lama, Lama Tenzin Gyatso, was exiled to India in 1959 after the Tibetan Uprising.
    He has led the Tibetan government-in-exile in India since then, advocating for Tibetan independence and cultural preservation.
    He has also been a prominent voice for human rights and peace.

4. Political Status

    The 14th Dalai Lama is not recognized by the Chinese government as the legitimate ruler of Tibet, which is part of the People's Republic of China (PRC).
    The Central Tibetan Administration (CTA), based in India, is the self-proclaimed government of Tibet.
    The PRC considers Tibet to be an inseparable part of China and has implemented policies to integrate Tibet into the Chinese state.

5. International Recognition

    The 14th Dalai Lama is widely recognized internationally as the spiritual leader of Tibet and as the de facto leader of the Tibetan people.
    Many countries and international organizations support the Tibetan cause, but they are not recognized as a sovereign state.

Summary:

If you'd like to know more about his background, policies, or role in the Tibetan independence movement, feel free to ask!

haellsigh19d ago

Fyi, I believe `--flash-attn on` doesn't do anything, you should instead use `--flash-attn 1`. I'm getting ~150t/s on a RTX 3080 10GB as well with f16 cache type.

freakynit19d ago

Thanks.. updated my local docs :)

opem19d ago

freakynit19d ago

update: Well, spot survived... and since a lot of the folks are still using it, I'm keeping it alive for 2 hours more.

freakynit19d ago

Update 2: Terminated.

Thank you all for trying it out.

armanj19d ago

I did a quick benchmark & compared it with Qwen3.5: https://github.com/ArmanJR/PrismML-Bonsai-vs-Qwen3.5-Benchma...

in my results, accuracy-wise Ternary-Bonsai-8B is on par with Qwen3.5-4B. But in accuracy-per-byte, bonsai is the clear winner:

they show strong promise on edge devices and where disk space is limited. I think this lab is worth watching.

Animats19d ago

This makes sense. The 1-bit model implies needing 2x as many neurons, because you need an extra level to invert. But the ternary model still has a sign, just really low resolution.

[1] https://github.com/aryopg/mmlu-redux/blob/main/outputs/multi...

philipp-gayret19d ago

Nice work, I applied my own benchmarking tools to it.

Having said all that, gemma4-e4b-q4km did much better and I can achieve 70% of the tokens/s on the same machine, specifically in context of tool use and for running agents.

usernametaken2919d ago

freakynit19d ago

heltale19d ago

This reminds me of DeepGrove AI’s ternary weight model: https://github.com/deepgrove-ai/Bonsai

Is there a particular difference that I might be missing here?

swiftcoder19d ago

Does this sort of thing scale? Would a 30B or higher model see similar performance/memory gains under this scheme?

zkmon19d ago

WatchDog19d ago

All of their benchmarks are against 16 bit models right?

Why aren't they comparing to 2/3/4 bit quants?

himata411319d ago

looked at quant versions of these models and they all outperform it so I guess it just doesn't look as good.

yodon19d ago

So excited to see this - the big advantage of 1.58 bits is there are no multiplications at inference time, so you can run them on radically simpler and cheaper hardware.

Animats19d ago

AlotOfReading19d ago

A LUT is pretty wasteful. You only have a one bit significand, so the mantissa and sign bits are boolean binops, and the exponent is a 2 bit adder.

Taniwha19d ago

and so you can at 1-bit too, and the hardware will be even smaller and cheaper

mchusma19d ago

I also have yet to see any of these at a larger scale. For example, can you try one of these at 100 billion parameters?

mungoman219d ago

Would be interesting to see a comparison to quantized versions of the other models. If this model beats the others also in a fair comparison it gives more credibility to it.

londons_explore19d ago

How is the research on training these models directly in their quantized state going?

That'll be the real game changer.

sigmoid1019d ago

usrusr19d ago

Reubend18d ago

Can you elaborate?

cubefox19d ago

This is the only paper which really does this:

https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...

I don't know why this paper didn't get more attention.

tiagod19d ago

> Fig IV: Throughput (toks/sec) and energy consumption (mWh/tok) across various hardware platforms.

I don't see any mWh/token figures in that chart.

ericb19d ago

This is pretty cool! I would love to see an even larger models shrunk down.

If you got that into a couple gigs--what could you stuff into 20 gigs?

wmf19d ago

Yet again they're comparing against unquantized versions of other models. They would probably still win but by a much smaller size margin.

Dumbledumb19d ago

SwellJoe19d ago

happygoose19d ago

unfortunately, there doesn't seem to be a clear way to fine-tune these models yet. excited for when that happens though.

syntex19d ago

hallucinates in pretty much every answer

TimorousBestie19d ago

This model tends to be annoyingly literal. An example from earlier today:

>> What are some names like Llewelyn?

> Some names like Llewelyn are Llewelyn, Llewelyn, Llewelyn, (repeats several times), and Llewelyn.

gbgarbeb19d ago

When do we get 1100B Kimi K2.6 in 160 GB of memory at 1.125 bpw?

est19d ago

installed since last HN post. So Bonsai (1-bit) and Ternary-Bonsai are different?

Can it be run on browsers with WASM/WebGPU?

xenova19d ago

yep! :) https://huggingface.co/spaces/webml-community/bonsai-ternary...

goofy_lemur19d ago

> On M4 Pro, Ternary Bonsai 8B runs at 82 toks/sec, roughly 5x faster than a 16-bit 8B model

Wow, if this is true, I am extremely impressed and excited!

I wonder about kv cache how much better it is as well!

j / k navigate · click thread line to collapse