Well, for one, Anthropic mostly uses Google TPUs and Amazon Inferentia2 chips, not Nvidia NVL72s. That's because... Google and Amazon are major investors in Anthropic.
Secondly, you missed the entire AI industry trend in 2024-2025: the failure of the GPT-4.5 pretrain run and the pullback from GPT-4 to GPT-4 Turbo to GPT-4o, each smaller in parameter count than the last. GPT-4 is ~1.6T, GPT-4 Turbo is generally considered 1/2 to 1/4 of that, and GPT-4o is smaller still (details below).
Thirdly, we KNOW that GPT-4o runs on Microsoft Maia 100 hardware, which has 64GB of memory per chip. That gives a hard limit on the size of GPT-4o and tells us it's a much smaller distilled version of GPT-4. Microsoft says each server has 4 Maia 100 chips and 256GB total, and we know Microsoft uses Maia 100s to serve GPT-4o on Azure! So we know that quantized GPT-4o fits in 256GB, and GPT-4 does not. GPT-4o cannot be some much larger model that requires a large cluster to serve; that would drop performance below what we see on Azure.
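To make the memory argument concrete, here's a back-of-envelope sketch (Python; the bytes-per-param values are assumed quantization widths and the model sizes are the leaked figures, not confirmed configs, and KV cache/activation memory is ignored, which only makes the fit harder):

```python
# Weight-memory check against one Maia 100 server: 4 chips x 64GB HBM = 256GB.
# Bytes-per-param values below are assumed quantization widths for illustration.

SERVER_GB = 4 * 64  # Microsoft's published Maia 100 server configuration

def weights_fit(params_b: float, bytes_per_param: float) -> bool:
    """True if the bare weights alone fit in one server's 256GB."""
    return params_b * bytes_per_param <= SERVER_GB

print(weights_fit(1600, 0.5))  # GPT-4 at ~1.6T even at 4-bit: False (800GB needed)
print(weights_fit(250, 1.0))   # a ~250B model at 8-bit: True (250GB)
print(weights_fit(300, 0.5))   # a ~300B model at 4-bit: True (150GB)
```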
Fourthly, it is not publicly KNOWN, but leaks put GPT-4o at 200B-300B parameters, which also makes the "GPT-4-sized model" theory nonsense. This matches the Microsoft Maia server math above.
Fifthly, OpenAI's Head of Research has since confirmed that o1, o3, and GPT-5 use the same pretrain run as 4o, so they would be the same size.[1] That means GPT-5 is not some 1T+ model! SemiAnalysis confirms that the only pretrain run since 4o is 4.5, a ~10T model that everyone knows was a failed run.
Sixthly, Amazon Bedrock and Google Vertex serve models at approximately similar effective memory bandwidths when you back them out of tokens/sec, giving ~4900GB/sec for Google Vertex. Opus 4.5 aligns very well with ~100B active params.
42 tps for Claude Opus 4.6 https://openrouter.ai/anthropic/claude-opus-4.6
143 tps for GLM 4.7 (32B active parameters) https://openrouter.ai/z-ai/glm-4.7
70 tps for Llama 3.3 70B (dense model) https://openrouter.ai/meta-llama/llama-3.3-70b-instruct
For GLM 4.7, that makes 143 * 32B = 4576B active parameters read per second, and for Llama 3.3, we get 70 * 70B = 4900B; at 8-bit weights that is ~4900GB/sec of weight traffic, as sketched below. There are calculations for Amazon Bedrock in the Opus 4.5 launch thread comparing it to gpt-oss-120b, with similar conclusions.
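Here's a minimal sketch of that back-calculation (Python; it assumes decode is memory-bandwidth bound and ~1 byte per active parameter, i.e. 8-bit serving; batching and speculative decoding shift the constant, but same-platform comparisons still hold):

```python
# During decode every generated token reads all active weights once, so
# tokens/sec ~= effective_bandwidth_GBps / active_weight_GB (at 1 byte/param).
# Inverting: active params (billions) ~= bandwidth / tokens_per_sec.

EFFECTIVE_BW_GBPS = 4900  # anchored on dense Llama 3.3 70B: 70 tps * 70GB

def implied_active_params_b(tps: float, bw: float = EFFECTIVE_BW_GBPS) -> float:
    """Active parameter count in billions implied by an observed tokens/sec."""
    return bw / tps

print(implied_active_params_b(143))  # GLM 4.7 -> ~34B (known: 32B active)
print(implied_active_params_b(70))   # Llama 3.3 70B -> 70B (dense, by construction)
print(implied_active_params_b(42))   # Opus 4.6 -> ~117B, i.e. a ~100B-active model
```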
Seventhly, Anthropic distilled Opus 4/4.1 into Opus 4.5, which is why it runs ~3x faster than Opus 4 while costing 1/3 as much in API fees.
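Under the same bandwidth-bound model, speed and per-token cost both scale roughly with active parameter count, so the ~3x speedup and ~1/3 price are mutually consistent. A toy sketch (Python; both speeds here are hypothetical figures picked to show the ratio, not measurements):

```python
# Toy consistency check: if decode is bandwidth-bound, tokens/sec scales
# inversely, and $/token directly, with active weights read per token.

old_tps, new_tps = 14.0, 42.0  # hypothetical before/after speeds; the ratio is the point
speedup = new_tps / old_tps    # 3.0x faster
active_ratio = 1 / speedup     # ~1/3 the active weights read per token
print(f"{speedup:.0f}x faster is consistent with ~{active_ratio:.0%} of the active params, hence ~1/3 the price")
```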
Eighthly, no respectable model has an active-parameter ratio ("sparsity") below 3% these days; push it ridiculously low and you get Llama 4. Every single cutting-edge model sits around 3-5% sparsity. Knowing the active param count for Opus 4.5 therefore gives you a very good estimate of its total param count.
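Putting the last two points together, a quick bracket (Python; the ~100B active figure is the estimate from the bandwidth calculation above, not a confirmed number):

```python
# Bracket total params from an active-param estimate and a 3-5% active ratio.

ACTIVE_B = 100  # assumed active params for Opus 4.5 (billions), per the tps math

for active_fraction in (0.03, 0.05):
    total_t = ACTIVE_B / active_fraction / 1000  # billions -> trillions
    print(f"{active_fraction:.0%} active -> ~{total_t:.1f}T total params")

# 3% active -> ~3.3T total params
# 5% active -> ~2.0T total params
```

Either way you land around 2-3T total, nowhere near 10T.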
The entire AI industry is moving AWAY from multi-trillion-parameter models. Everything now is about getting more out of the parameters you have, not hyperscaling like GPT-4.5, which was shown to be a bad way forward.
Nobody thinks Opus 4.5 is bigger than around 2T in size (so not 10T). Opus 4/4.1 may have been ~6T, but that's it. Any guess of 10T or above is patently ridiculous for both Opus 4/4.1 and Opus 4.5.
[1] https://x.com/petergostev/status/1995744289079656834