To quote The Next Platform: "An Ironwood cluster linked with Google’s absolutely unique optical circuit switch interconnect can bring to bear 9,216 Ironwood TPUs with a combined 1.77 PB of HBM memory... This makes a rackscale Nvidia system based on 144 “Blackwell” GPU chiplets with an aggregate of 20.7 TB of HBM memory look like a joke."
Nvidia may have the superior architecture at the single-chip level, but for large-scale distributed training (and inference) they currently have nothing that rivals Google's optical switching scalability.
For example, the currently very popular Mixture of Experts architectures require a lot of all-to-all traffic (for expert parallelism), which works a lot better on a switched NVLink fabric, where it doesn't need to traverse multiple links, than on a torus.
Bisection bandwidth is a useful metric, but is hop count? Per-hop cost tends to be pretty small.
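To put a rough number on "hop count": here's a toy calculation of the mean hop distance for uniform all-to-all traffic on a torus (dimensions are illustrative, not the actual TPU pod topology), versus a single-stage switched fabric where everything is one hop:

```python
from itertools import product

def torus_hops(dims):
    """Mean shortest-path hop count between distinct nodes of a torus."""
    nodes = list(product(*(range(d) for d in dims)))
    total = pairs = 0
    for a in nodes:
        for b in nodes:
            if a == b:
                continue
            # per-dimension wraparound (ring) distance, summed over dimensions
            total += sum(min(abs(x - y), d - abs(x - y))
                         for x, y, d in zip(a, b, dims))
            pairs += 1
    return total / pairs

print(torus_hops((4, 4, 4)))  # ~3 hops on average for a 4x4x4 torus, vs 1 on a switch
```

So all-to-all pays a few extra hops on a torus; whether that matters depends on exactly the per-hop cost question above.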
NVFP4 is, to put it mildly, a masterpiece: the UTF-8 of its domain, and in strikingly similar ways it is 1. general, 2. robust to gross misuse, 3. not optional if success and cost both matter.
It's not a gap that can be closed by a process node or an architecture tweak: it's an order of magnitude where the polynomials that were killing you on the way up are now working for you.
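For the curious, here's a toy sketch of the block-scaled FP4 idea (the E2M1 value grid is real; the simple max-abs scaling below is my own simplification, since real NVFP4 uses an FP8 E4M3 scale per 16-element block while MXFP4 uses a power-of-two scale per 32):

```python
import numpy as np

# FP4 (E2M1) representable magnitudes; sign is handled separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(x, block=16):
    """Toy block-scaled FP4 quantizer: pick a per-block scale that maps the
    block's max magnitude onto the largest FP4 value (6.0), then round each
    element to the nearest point on the FP4 grid."""
    x = np.asarray(x, dtype=np.float32)
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        b = x[i:i + block]
        scale = max(np.max(np.abs(b)) / 6.0, 1e-12)  # avoid divide-by-zero
        idx = np.abs(np.abs(b)[:, None] / scale - FP4_GRID).argmin(axis=1)
        out[i:i + block] = np.sign(b) * FP4_GRID[idx] * scale
    return out
```

The per-block scale is what makes this robust in practice: an outlier only poisons its own block, not the whole tensor.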
sm_120 (what NVIDIA's quiet repos call CTA1) consumer gear does softmax attention and projection/MLP block-scaled GEMM at a bit over a petaflop at 300W, and close to two (dense) at 600W.
This changes the whole game, and it's not clear anyone outside the lab even knows the new equilibrium points. It's nothing like Flash3 on Hopper: a lot of stuff looks FLOPs-bound, and GDDR7 looks like a better deal than HBM3e. The DGX Spark is in no way deficient; it has ample memory bandwidth.
This has been in the pipe for something like five years and even if everyone else started at the beginning of the year when this was knowable, it would still be 12-18 months until tape out. And they haven't started.
Years Until Anyone Can Compete With NVIDIA is back up to the 2-5 it was 2-5 years ago.
This was supposed to be the year ROCm and the new Intel stuff became viable.
They had a plan.
So if we look at what NVIDIA has to say about NVFP4, it sure sounds impressive [1]. But look closely: that initial graph never compares FP8 and FP4 on the same hardware. They jump from H100 to B200 while implying a 5x gain from going to FP4, which it isn't. This is accompanied by scary words about how, if you use MXFP4, there is a "Risk of noticeable accuracy drop compared to FP8".
Contrast that with what AMD has to say on the open MXFP4 approach, which is quite similar to NVFP4 [2]. Ohh, the horrors of getting 79.6 instead of 79.9 on GPQA Diamond when using MXFP4 instead of FP8.
[1] https://developer.nvidia.com/blog/introducing-nvfp4-for-effi...
[2] https://rocm.blogs.amd.com/software-tools-optimization/mxfp4...
The tweet gives their justification; CUDA isn't ASIC. Nvidia GPUs were popular for crypto mining, protein folding, and now AI inference too. TPUs are tensor ASICs.
FWIW I'm inclined to agree with Nvidia here. Scaling up a systolic array is impressive but nothing new.
a generation is 6 months
It’s better to have a faster, smaller network for model parallelism and a larger, slower one for data parallelism than a very large, but slower, network for everything. This is why NVIDIA wins.
^Even now I get capacity related error messages, so many days after the Gemini 3 launch. Also, Jules is basically unusable. Maybe Gemini 3 is a bigger resource hog than anyone outside of Google realizes.
While the B200 wins on raw FP8 throughput (~9000 vs 4614 TFLOPs), that makes sense given NVIDIA has optimized for the single-chip game for over 20 years. But the bottleneck here isn't the chip—it's the domain size.
NVIDIA's top-tier NVL72 tops out at an NVLink domain of 72 Blackwell GPUs. Meanwhile, Google is connecting 9216 chips at 9.6Tbps to deliver nearly 43 ExaFlops. NVIDIA has the ecosystem (CUDA, community, etc.), but until they can match that interconnect scale, they simply don't compete in this weight class.
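For what it's worth, the pod-level number checks out against the per-chip FP8 figure quoted upthread:

```python
# Sanity check: per-chip FP8 throughput (TFLOPs, figure quoted upthread)
# times pod size, converted to ExaFLOPs.
chips = 9216
tflops_fp8_per_chip = 4614
pod_eflops = chips * tflops_fp8_per_chip / 1e6  # 1 EFLOPs = 1e6 TFLOPs
print(pod_eflops)  # ~42.5 EFLOPs, consistent with "nearly 43 ExaFlops"
```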
The biggest problem though is trust. I'm still holding back from letting anyone under my authority in my org use Gemini because of the lack of any clear or reasonable statement or guidelines on how they use your data. I think it won't matter in the end if they execute their way to domination, but it's going to give everyone else a chance, at least for a while.
They’ve been very clear, in my opinion: https://cloud.google.com/gemini/docs/discover/data-governanc...
I suppose there will always be the people who refuse to trust them or choose to believe they’re secretly doing something different.
However I’m not sure what you’re referring to by saying they haven’t said anything about how data is used.
If you’re a business/enterprise, you get a different ToS that very clearly states that your data is yours.
If you use the free/consumer options, that’s where they are vague or direct about vacuuming up data.
Yes, but Google will never be able to compete with their greatest challenge... Google's attention span.
And outside of Google this is a very academic debate. Any efficiency gains over GPUs will primarily turn into profit for Google rather than benefit for me as a developer or user of AI systems. Since Google doesn't sell TPUs, they are extremely well-positioned to ensure no one else can profit from any advantages created by TPUs.
As you note, they'll set the margins to benefit themselves, but you can still eke out some benefit.
Also, you can buy Edge TPUs, but as the name says these are for edge AI inference and useless for any heavy lifting workloads like training or LLMs.
https://www.amazon.com/Google-Coral-Accelerator-coprocessor-...
First part is true at the moment, not sure the second follows. Microsoft is developing their own “Maia” chips for running AI on Azure with custom hardware, and everyone else is also getting in the game of hardware accelerators. Google is certainly ahead of the curve in making full-stack hardware that’s very very specialized for machine learning. But everyone else is moving in the same direction: lots of action is in buying up other companies that make interconnects and fancy networking equipment, and AMD/NVIDIA continue to hyper specialize their data center chips for neural networks.
Google is in a great position, for sure. But I don’t see how they can stop other players from converging on similar solutions.
Does anyone have a sense of why CUDA is more important for training than inference?
Once you have trained, you have a frozen feed-forward network: fixed weights that you can just program in and run data over. These weights can be duplicated across any number of devices, which then just sit there and run inference on new data.
If this turns out to be the future use-case for NNs (it is today), then Google is better set.
A real shame, BTW, that all that silicon doesn't do FP32 (very well). Once training ceases to be so needed, we could use all that number crunching for climate models and weather prediction.
Further, it's worth noting that Ironwood, Google's v7 TPU, supports only up to BF16 (a 16-bit floating point format that has the range of FP32 minus the precision). Many training processes rely on larger types, quantizing later, so this breaks a lot of assumptions. Yet Google surprised everyone by actually training Gemini 3 with just that type, so I think a lot of people are reconsidering their assumptions.
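Since BF16 is literally just the top half of an FP32, the conversion is nearly a one-liner (this sketch truncates; real hardware typically rounds to nearest even):

```python
import numpy as np

# BF16 keeps FP32's 8-bit exponent (so the same range) but only 7 mantissa
# bits (so roughly 3 decimal digits of precision).
def fp32_to_bf16(x):
    """Emulate BF16 by zeroing the low 16 mantissa bits of an FP32 value."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

print(fp32_to_bf16(3.14159265))  # 3.140625: range kept, precision lost
```

Note that 1e38 survives the round trip (it would overflow FP16), which is exactly the range-over-precision trade the comment above describes.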
Another factor is that training is always done with batches. Inference batching depends on the number of concurrent users. This means training tends to be compute bound where supporting the latest data types is critical, whereas inference speeds are often bottlenecked by memory which does not lend itself to product differentiation. If you put the same memory into your chip as your competitor, the difference is going to be way smaller.
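The compute-bound vs memory-bound split falls out of arithmetic intensity. A rough sketch, using a made-up accelerator ratio and an illustrative hidden size (both are assumptions, not figures from the thread):

```python
# Arithmetic intensity of an (m x k) @ (k x n) GEMM with 2-byte elements:
# 2*m*n*k FLOPs over the bytes needed to read A, B and write C once.
def intensity(m, n, k, bytes_per_el=2):
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

# Hypothetical accelerator: 1000 TFLOPs and 4 TB/s means any GEMM below
# ~250 FLOPs/byte is memory bound, anything above is compute bound.
ridge = 1000e12 / 4e12

d = 8192  # illustrative hidden size
print(intensity(1, d, d))     # batch 1 (inference): ~1 FLOP/byte, memory bound
print(intensity(4096, d, d))  # big training batch: ~2048 FLOPs/byte, compute bound
```

Which is why batch size, not the chip, often decides which regime you're in.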
Once you settle on a design then doing ASICs to accelerate it might make sense. But I'm not sure the gap is so big, the article says some things that aren't really true of datacenter GPUs (Nvidia dc gpus haven't wasted hardware on graphics related stuff for years).
What does it even mean in a neural net context?
> numerical stability
would also be nice to expand on a bit.
"Meta in talks to spend billions on Google's chips, The Information reports"
https://www.reuters.com/business/meta-talks-spend-billions-g...
The big difference is that Google is both the chip designer *and* the AI company. So they get both sets of profits.
Both Google and Nvidia contract TSMC for chips. Then Nvidia sells them at a huge profit. Then OpenAI (for example) buys them at that inflated rate and then puts them into production.
So while Nvidia is "selling shovels", Google is making their own shovels and has their own mines.
Having your own mines only pays off if you actually do strike gold. So far AI undercuts Google's profitable search ads, and loses money for OpenAI.
Citation needed. But the vertical integration is likely valuable right now, especially with NVidia being supply constrained.
And Google will end up with lots of useless super specialized custom hardware.
Everyone using Nvidia hardware has a lot of overlap in requirements, but they also all have enough architectural differences that they won't be able to match Google.
OpenAI announced they will be designing their own chips, exactly for this reason, but that also becomes another extremely capital intensive investment for them.
This also doesn't get into the fact that Google already has S-tier datacenters and datacenter construction/management capabilities.
You don't think Nvidia has field-service engineers and applications engineers with their big customers? Come on man. There is quite a bit of dialogue between the big players and the chipmaker.
They could make a systolic array TPU and software, perhaps. But it would mean abandoning 18 years of CUDA.
The top post right now is talking about the TPU's colossal advantage in scaling & throughput. Ironwood is already massively bigger & faster than what Nvidia is shooting for. And that's a huge advantage. But imo that is a replicable win. Throw gobs more at networking and scaling, and Nvidia could do similar with their architecture.
The architectural win of what TPU is more interesting. Google sort of has a working super powerful Connection Machine CM-1. The systolic array is a lot of (semi-)independent machines that communicate with nearby chips. There's incredible work going on to figure out how to map problems onto these arrays.
Whereas on a GPU, main memory is used to transfer intermediary results. It doesn't really matter who picks up work; there are lots of worklets with equal access time to that bit of main memory. The actual situation is a little more nuanced (even in consumer GPUs there are really multiple different main memories, which creates some locality), but there's much less need for data locality on a GPU, while the TPU has much, much tighter constraints: its whole premise is to exploit data locality. Sending data to a neighbor is cheap; storing and retrieving data from memory is slower and much more energy-intensive.
CUDA takes advantage of, and relies strongly on, the GPU's main memory being (somewhat) globally accessible. There are plenty of workloads folks do in CUDA that would never work on a TPU, on these much more specialized data-passing systolic arrays. That's why TPUs are so amazing: they are much more constrained devices that require much more careful workload planning to get the work to flow across the 2D array of the chip.
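If it helps, here's a toy simulation of that dataflow: an output-stationary systolic array where PEs only ever talk to their immediate neighbors, with no shared memory anywhere (real TPUs differ in plenty of detail; this is just the flavor):

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy output-stationary systolic array for n x n matrices: each tick,
    A-values flow one PE to the right and B-values one PE down; inputs are
    skewed so that a[i,k] and b[k,j] meet at PE (i,j) on the same tick,
    where the PE multiplies them and accumulates into its local result."""
    n = A.shape[0]
    acc = np.zeros((n, n))    # each PE's local accumulator
    a_reg = np.zeros((n, n))  # A value currently held in each PE
    b_reg = np.zeros((n, n))  # B value currently held in each PE
    for t in range(3 * n - 2):  # enough ticks for the last wavefront to drain
        # shift: every PE passes its value to its neighbor (right / down)
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # inject skewed inputs at the left and top edges of the array
        for i in range(n):
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < n else 0.0
        for j in range(n):
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < n else 0.0
        acc += a_reg * b_reg  # every PE does one multiply-accumulate per tick
    return acc
```

Every data movement above is a one-neighbor hop, which is the whole energy argument: the operands never round-trip through a big shared memory between multiplies.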
Google's work on projects like XLA and IREE is a wonderful & glorious general pursuit of how to map these big crazy machine learning pipelines down onto specific hardware. Nvidia could make their own or join forces here. And perhaps they will. But the CUDA moat would have to be left behind.
Tensor cores are specialized and have CUDA support.
To put it into perspective, the tensor cores deliver about 2,000 TFLOPs of FP8, and half that for FP16, and this is all tensor FMA/MAC (comprising the bulk of compute for AI workloads). The CUDA cores -- the rest of the GPU -- deliver more in the 70 TFLOP range.
So if data centres are buying nvidia hardware for AI, they already are buying focused TPU chips that almost incidentally have some other hardware that can do some other stuff.
I mean, GPUs still have a lot of non-tensor general uses in the sciences, finance, etc, and TPUs don't touch that, but yes a lot of nvidia GPUs are being sold as a focused TPU-like chip.
The real challenge is getting the TPU to do more general purpose computation. But that doesn't make for as good a story. And the point about Google arbitrarily raising the prices as soon as they think they have the upper hand is good old fashioned capitalism in action.
turning a giant lumbering ship around is not easy
Nothing prevents them per se, but it would risk cannibalising their highly profitable (IIRC 50% margin) higher end cards.
It might even be 'free' to fill it with more complicated logic (especially logic that lets you write clever algorithms that save on bandwidth).
But they're not.
There's a few confounding problems:
1. Actually using that hardware effectively isn't easy. It's not as simple as jacking up some constant values and reaping the benefits. Actually using the hardware is hard, and by the time you've optimized for it, you're already working on the next model.
2. This is a problem that, if you're not Google, you can just spend your way out of. A model doesn't take a petabyte of memory to train or run. Regular old H100s still mostly work fine. Faster models are nice, but Gemini 3 Pro having half the latency of Opus 4.5 or GPT 5.1 doesn't add enough value to matter to really anyone.
3. There's still a lot of clever tricks that work as low hanging fruit to improve almost everything about ML models. You can make stuff remarkably good with novel research without building your own chips.
4. A surprising amount of ML model development is boots on the ground work. Doing evals. Curating datasets. Tweaking system prompts. Having your own Dyson sphere doesn't obviate a lot of the typing and staring at a screen that necessarily has to be done to make a model half decent.
5. Fancy bespoke hardware means fancy bespoke failure modes. You can search stack overflow for CUDA problems, you can't just Bing your way to victory when your fancy TPU cluster isn't doing the thing you want it to do.
For example, OpenAI has announced trillion-dollar investments in data centers to continue scaling. They need to go through a middle-man (Nvidia), while Google does not, and will be able to use their investment much more efficiently to train and serve their own future models.
Performance per dollar doesn't "win" anything though. Performance (as in speed) hardly cracks the top five concerns that most folks have when choosing a model provider, because fast, good models already exist at price points that are acceptable. That might mean slightly better margins for Google, but ultimately isn't going to make them "win"
https://www.anthropic.com/news/expanding-our-use-of-google-c...
Slightly more seriously: what you say makes sense if and only if you're projecting Sam Altman and assuming that a) real legit superhuman AGI is just around the corner, and b) all the spoils will accrue to the first company that finds it, which means you need to be 100% in on building the next model that will finally unlock AGI.
But if this is not the case -- and it's increasingly looking like it's not -- it's going to continue to be a race of competing AIs, and that race will be won by the company that can deliver AI at scale the most cheaply. And the article is arguing that company will be Google.
I think you are missing the point. They are saying "weeks old" isn't very old.
> it's going to continue to be a race of competing AIs, and that race will be won by the company that can deliver AI at scale the most cheaply.
I don't see how that follows at all. Quality and distribution both matter a lot here.
Google has some advantages but some disadvantages here too.
If you are on AWS GovCloud, Anthropic is right there. Same on Azure, and on Oracle.
I believe Gemini will be available on the Oracle Cloud at some point (it has been announced) but they are still behind in the enterprise distribution race.
OpenAI is only available on Azure, although I believe their new contract lets them strike deals elsewhere.
On the consumer side, OpenAI and Google are well ahead of course.
Last week it looked like Google had won (hence the blog post), but now almost nobody is talking about Antigravity and Gemini 3 anymore, so yeah, what op says is relevant.
Arguably indeed, because I think it still is.
Which is to say, if Google was set up to win, it shouldn't even be a question that 3 Pro is the best. It should be obvious. But it's definitely not obvious that it's the best, and many benchmarks don't support it as being the best.
I am fairly pro-Google (they invented the LLM, FFS...) and recognize the advantages (price/token, efficiency, vertical integration, established DCs with power allocations), but I also know they have a habit of slightly sucking at everything but search.
Cerebras CS-3 specs:
• 4 trillion transistors
• 900,000 AI cores
• 125 petaflops of peak AI performance
• 44GB on-chip SRAM
• 5nm TSMC process
• External memory: 1.5TB, 12TB, or 1.2PB
• Trains AI models up to 24 trillion parameters
• Cluster size of up to 2048 CS-3 systems
• Memory B/W of 21 PB/s
• Fabric B/W of 214 Pb/s (~26.75 PB/s)
Comparing GPU to TPU is helpful for showcasing the advantages of the TPU in the same way that comparing CPU to Radeon GPU is helpful for showcasing the advantages of GPU, but everyone knows Radeon GPU's competition isn't CPU, it's Nvidia GPU!
TPU vs GPU is new paradigm vs old paradigm. GPUs aren't going away even after they "lose" the AI inference wars, but the winner isn't necessarily guaranteed to be the new paradigm chip from the most famous company.
Cerebras inference remains the fastest on the market to this day to my knowledge due to the use of massive on-chip SRAM rather than DRAM, and to my knowledge, they remain the only company focused on specialized inference hardware that has enough positive operating revenue to justify the costs from a financial perspective.
I get how valuable and important Google's OCS interconnects are, not just for TPUs or inference, but really as a demonstrated PoC for computing in general. Skipping the E-O-E translation in general is huge and the entire computing hardware industry would stand to benefit from taking notes here, but that alone doesn't automatically crown Google the victor here, does it?
Am I misunderstanding "TPU" in the context of the article?
- Google is not just owning the technology but builds a cohesive cloud around it; Tesla and Meta work on their own ASIC AI chips, and I guess others do too.
- A signal has already been given: SoftBank sold its entire Nvidia stake, and Berkshire added Google to their portfolio.
Microsoft "has" a lot of companies data, and google is probably building the most advanced ai cloud.
However, I can't help thinking: they had a cloud which was light-years ahead of AWS 15 years ago and now GCP is no. 3, and they released open-source transformer models more than 5 years ago that constituted the foundation for OpenAI's closed-source models.
Why? To me, it seems better for the market, if the best models and the best hardware were not controlled by the same company.
Sparse models give the same quality of results but have fewer coefficients to process: in the case described in the link above, 16 times fewer.
This means these models need 8 times less data to store, and can be 16+ times faster while using 16+ times less energy.
TPUs are not all that good in the case of sparse matrices. They can be used to train dense versions, but inference efficiency with sparse matrices may not be all that great.
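A quick back-of-envelope on how 16x fewer coefficients becomes only ~8x less storage (assuming BF16 values and 16-bit column indices; the exact formats are my assumption, not from the link):

```python
# Dense layer vs. one where only 1/16 of the weights survive pruning.
n_in, n_out = 4096, 4096
dense_bytes = n_in * n_out * 2          # BF16: 2 bytes per weight
nnz = n_in * n_out // 16                # 1/16 of coefficients kept
sparse_bytes = nnz * (2 + 2)            # BF16 value + 16-bit column index
print(dense_bytes / sparse_bytes)       # 8.0: index overhead halves the win
```

Storage only shrinks 8x because every surviving weight drags an index along, but the FLOPs do scale with nnz, hence the 16x compute/energy figures.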
https://docs.cloud.google.com/tpu/docs/system-architecture-t...
Here's another inference-efficient architecture where TPUs are useless: https://arxiv.org/pdf/2210.08277
There is no matrix-vector multiplication. Parameters are estimated using Gumbel-Softmax. TPUs are of no use here.
Inference is done bit-wise and most efficient inference is done after application of boolean logic simplification algorithms (ABC or mockturtle).
In my (not so) humble opinion, TPUs are example case of premature optimization.
The downside is the same thing that makes them fast: they’re very specialized. If your code already fits the TPU stack (JAX/TensorFlow), you get great performance per dollar. If not, the ecosystem gap and fear of lock-in make GPUs the safer default.
What I'm sure about is that having a processing unit purposed to a task is more optimal than a general processing unit designed to accommodate all tasks.
More and more of the economics of computing boils down to energy usage, and ultimately to physical rules; a more efficient process has the benefit of less energy consumed.
As a layman, it makes general sense to me. Maybe a future where productivity is based more on energy efficiency than on monetary gain pushes the economy in better directions.
Cryptocurrency and LLMs seem like they'll play out that story over the next 10 years.
In AI, we're still in the explosion phase. If you build the perfect ASIC for Transformers today, and tomorrow a paper drops with a new architecture, your chip becomes a brick. NVIDIA pays the "legacy tax" and keeps CUDA specifically as insurance against algorithm churn. As long as the industry moves this fast, flexibility beats raw efficiency.
- they were way ahead, and they didn't make any big mistakes
- they weren't waiting for others to catch up. They were aggressively improving
- memory bandwidth is almost always the bottleneck. Hence systolic array is "overrated". Furthermore, interconnect is the new bottleneck now
- cuda offers the most flexibility in the world of ever changing model requirements
https://aibusiness.com/companies/google-ceo-sundar-pichai-we...
I think we can be reasonably sure that search, Gmail, and some flavor of AI will live on, but other than that, Google apps are basically end-of-life at launch.
Agree there are lots of other contributing causes like culture, incentives, security, etc.
With simulations becoming key to training models doesn't this seem like a huge problem for Google?
to quote from their paper "In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is codesigned with the MoE gating algorithm and the network topology of our cluster."
Arguably, OpenAI's main raison d'être was to be a counterweight to that pre-2023 Google AI dominance. But I'd also argue that OpenAI lost its way.
They do voluntarily offer a way to signal that the data GoogleBot sees is not to be used for training, for now, and assuming you take them at their word, but AFAIK there is no way to stop them doing RAG on your content without destroying your SEO in the process.
Perhaps the assumptions are true. The mere presence of LLMs seems to have lowered the IQ of the Internet drastically, sopping up financial investors and resources that might otherwise be put to better use.
Google is a giant without a direction. The ads money is so good that it just doesn't have the gut to leave it on the table.
TAKE MY MONEY!!!
In practice it doesn't quite work out that way.
Nvidia is tied down to support previous and existing customers while Google can still easily shift things around without needing to worry too much about external dependencies.
Google will have no problem discontinuing Google "AI" if they finally notice that people want a computer to shut up rather than talk at them.
The truth is the LLM boom has opened the first major crack in Google as the front page of the web (the biggest since Facebook), in the same way the web in the long run made Windows so irrelevant Microsoft seemingly don’t care about it at all.
As long as "tomorrow" is a better day to invade Taiwan than today is, China will wait for tomorrow.
I'd guess most of their handicap comes from their hardware and software not being as refined as the US's