Who uses Google TPUs for inference in production?

116 points · arthurdelerue · 2y ago · 48 comments

I am really puzzled by TPUs. I've been reading everywhere that TPUs are powerful and a great alternative to NVIDIA.

I have been playing with TPUs for a couple of months now, and to be honest I don't understand how people can use them in production for inference:

- almost no resources online showing how to run modern generative models like Mistral, Yi 34B, etc. on TPUs
- poor compatibility between JAX and PyTorch
- very hard to understand the memory consumption of the TPU chips (no nvidia-smi equivalent)
- rotating IP addresses on TPU VMs
- almost impossible to get my hands on a TPU v5

Is it only me? Or did I miss something?

I totally understand that TPUs can be useful for training though.

hiddencost · 2y ago · 9 in thread

Google is using them in prod. I think they're so hungry for chips internally that cloud isn't getting much support in selling them.

_b · 2y ago

I think this is right, in part because I've been told exactly this by people who work for Google and whose job is to sell me cloud stuff, i.e., they say they have so much internal demand that they aren't pushing TPUs for external use. Hence external pricing and support just aren't that great right now. But presumably when capacity catches up they'll start pushing TPUs again.

Feels like a bad point in the curve to try and sell them. “Oh, our internal hype cycle is done… we’ll put them on the market now that they’re all worn out.”

VirusNewbie · 2y ago

They're getting swallowed up by Anthropic and the other huge spenders:

https://www.prnewswire.com/news-releases/google-announces-ex...

"Partnership includes important new collaborations on AI safety standards, committing to the highest standards of AI security, and use of TPU v5e accelerators for AI inference"

danjl · 2y ago

I would guess that Google's Vertex AI managed solution uses TPUs. Google also uses them internally for training and inference across all their research products.

sciencesama · 2y ago

80 to 90% are consumed internally!! Only from version 5 onward are they planned to be customer-focused!!

While you can use TPUs with Vertex AI, it's just virtual machines; you can have one with an NVIDIA card if you like.

Or maybe they are just using NVIDIA. Who knows...

vineyardmike · 2y ago

Beyond the fact that this is hardly a secret, there are lots of other signs.

1. They have bought far less from NVIDIA than other hyperscalers, and they literally can’t vomit without saying “AI”. They have to be running those models on something. They have purchased huge amounts of chips from fabs, and what else would that be?

2. They have said they use them. Should be pretty obvious here.

3. They maintain a whole software stack for them, they design the chips, etc. Then they don’t really try to sell the TPU. Why else would they do this?

Lots of people know.

htrp · 2y ago · 6 in thread

We've tried before and almost always regretted the decision. I think the tech stack needs another 12-18 months to mature (it doesn't help that almost all work outside Google is being done in torch).

mike_d · 2y ago

> I think the tech stack needs another 12-18 months to mature

Google has been doing AI before any other company even thought about it. They are on the 6th generation of TPU hardware.

I don't think there is any maturity issue, just an availability issue because they are all being used internally.

htrp · 2y ago

100% agree: if I had access to the TPU team internally, it would be very easy to use in production.

If you aren't internal, the documentation, support, and even just general bug fixes are impossible to get.

chatmasta · 2y ago

Google sells access to TPUs in its cloud platform, so you'd think they would be more open about sharing development and tooling frameworks for TPUs. It's like Borg (closed source, never used outside Google, made them no profit) vs. Kubernetes (open source, used everywhere, makes them profit).

> Google has been doing AI before any other company even thought about it

This is not even remotely true. SRI was working on AI in various forms long before Google existed.

danielcampos93 · 2y ago

I feel like I have been hearing that since the v1 TPU. I think Google is the perfect solution because there are teams whose job is to take a model and TPUify it. Elsewhere there is no team, so it's no fun.

arthurdelerue (OP) · 2y ago

I agree with that, and I'm not sure they'll be able to improve the stack dramatically by themselves without the open-source community being more involved.

ein0p · 2y ago · 4 in thread

They aren’t really an alternative to anything. For one thing, they’re now often slower on a per-accelerator basis than NVIDIA stuff. They’re cheaper, of course, but because of the disparity in performance you’ll need to estimate cost per flop on your own particular workload. They are also more difficult and slower to develop against, and SWE cost is always an issue if you don’t own a money printer like Google. Furthermore, advanced users who can write their own CUDA or Triton kernels can unlock additional efficiency from the GPU. Such capability can’t even be contemplated on the TPU side because you basically get a black box. Then there’s the issue of limited capacity, further exacerbated by the fact that this capacity is provided by a single supplier who is struggling to fulfill its internal needs (which is why you can’t get v5). You can’t just get TPUs elsewhere. You can’t get them under your desk for dev work either.

That said, it wouldn’t be too difficult to port most models to JAX, load in the existing weights, and export the result for serving. Should you bother? IMO, no, unless we’re talking really large-scale inference. Your time and money are almost certainly better spent iterating on the models.
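
To make the "estimate cost per flop on your own particular workload" advice concrete, here is a minimal sketch that compares accelerators by cost per million generated tokens rather than by hourly price. The accelerator names, prices, and throughputs are made-up placeholders, not measurements or published rates; substitute numbers from your own benchmarks.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Cost in USD to generate one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


# Hypothetical figures for illustration only (not real prices or benchmarks).
candidates = {
    "accelerator_a": {"hourly_price_usd": 1.20, "tokens_per_second": 800.0},
    "accelerator_b": {"hourly_price_usd": 2.50, "tokens_per_second": 2000.0},
}

# Normalize every candidate to the same unit before comparing.
costs = {name: cost_per_million_tokens(**spec) for name, spec in candidates.items()}
cheapest = min(costs, key=costs.get)

for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:.3f} per 1M tokens")
```

With these placeholder inputs, the option that is cheaper per hour comes out more expensive per token, which is exactly the situation the comment warns about.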

arthurdelerue (OP) · 2y ago

I agree, except about this statement: "it wouldn’t be too difficult to port most models to JAX"

--> We tried such ports at https://kwatch.io (the company I work for), and it turned out to be much harder than expected (at least for us). I don't think that many people are capable of porting an LLM based on PyTorch + GPU to JAX + TPU.
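
For a sense of what even the simplest part of such a port involves, here is a toy sketch of one layout difference: PyTorch's nn.Linear stores its weight as (out_features, in_features), while Flax's Dense layer expects a kernel of shape (in_features, out_features). The helper names, the "fc" prefix, and the plain nested lists standing in for tensors are assumptions for illustration only; a real port has to deal with far more than this one transpose.

```python
def transpose(matrix):
    """Transpose a 2-D nested list (rows become columns)."""
    return [list(column) for column in zip(*matrix)]


def torch_linear_to_flax_dense(state_dict, prefix):
    """Map '<prefix>.weight' / '<prefix>.bias' to a Flax-style Dense param dict.

    Hypothetical helper: nested lists stand in for real tensors.
    """
    return {
        # (out_features, in_features) -> (in_features, out_features)
        "kernel": transpose(state_dict[f"{prefix}.weight"]),
        "bias": list(state_dict[f"{prefix}.bias"]),
    }


# A 3-input, 2-output linear layer in PyTorch layout (made-up values).
torch_state = {
    "fc.weight": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    "fc.bias": [0.1, 0.2],
}

flax_params = torch_linear_to_flax_dense(torch_state, "fc")
print(flax_params["kernel"])
```

Every layer type needs its own mapping of this kind, and getting one of them subtly wrong typically produces a model that runs but returns garbage, which is part of why these ports eat time.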

ein0p · 2y ago

Well, I should have said “it wouldn’t be too difficult for me” then. I keep forgetting why I get paid so much.

emu · 2y ago

> Such capability can’t even be contemplated on the TPU side because you basically get a black box.

I'll just leave this here: https://jax.readthedocs.io/en/latest/pallas/index.html

ein0p · 2y ago

Pallas is very new. Given how difficult these things are to debug and how half-assed the XLA tooling generally is, I’d give it at least another year, if not two, before I touch it for anything prod-related.

emadm · 2y ago · 2 in thread

https://pytorch.org/blog/high-performance-llama-2/

htrp · 2y ago

> Cheers,

> The PyTorch/XLA Team at Google

Meanwhile you have an issue from 5 years ago with 0 support

https://github.com/pytorch/xla/issues/202

hcfman · 2y ago

5 years ago PyTorch wasn’t owned by the Linux Foundation. Give ‘em a chance now.

On my wish list for PyTorch is that the apt install version work out of the box on Jetson SBCs

kccqzy · 2y ago · 1 in thread

Apparently Midjourney uses it. GCP put out a press release a while ago: https://www.prnewswire.com/news-releases/midjourney-selects-...

trsohmers · 2y ago

The quote from the linked press release is that they do training on TPU v4, while inference runs on GPUs. I have also heard this separately and recently from people associated with Midjourney: they use TPUs solely for training.

ooterness · 2y ago · 1 in thread

There's a cubesat using a Coral TPU for pose estimation.

https://aerospace.org/article/aerospaces-slingshot-1-demonst...

mianos · 2y ago

They were lucky to get that going. The software support for the USB TPU was abandoned by Google years ago now. Works fine if you run Ubuntu 16, I think.

pogue · 2y ago · 1 in thread

I've seen people connecting these to Raspberry Pis to run local LLMs but I'm not sure how effective it is. Check YouTube for some videos about it.

Speaking of SBCs, prior to the Raspberry Pi, I was looking at the Orange Pi 5 which has a Rockchip RK3588S with an NPU (Neural Processing Unit). This was the first I had heard of such a thing but I was curious how/what exactly it does. Unfortunately, there's very little support for Orange Pi & not a large community for it so I couldn't find any feedback on how well it worked or what it did.

http://www.orangepi.org/html/hardWare/computerAndMicrocontro...

Havoc · 2y ago

The Rockchip NPU can do object recognition à la OpenCV, but not LLMs.

juliensalinas · 2y ago

We tried hard to move some of our inference workloads to TPUs at NLP Cloud, but eventually gave up (at least for the moment), basically for the reasons you mention. We now only perform our fine-tunings on TPUs using JAX (see https://nlpcloud.com/how-to-fine-tune-llama-openllama-xgen-w...) and we are happy like that.

It seems to me that Google does not really want to sell TPUs but only to showcase their AI work and maybe get some early-adopter feedback. It must be quite a challenge for them to create a dynamic community around JAX and TPUs if TPUs stay a vendor-locked product...

gperkins978 · 2y ago

I tried to use a Google Coral. I have no idea how to make it work. I could follow a tutorial using TensorFlow. I could not figure out how to use it for anything else. Is there some way to run CUDA stuff on it? I always assumed it required someone with actual skills (not me). I have used CUDA stuff before, but more for mass calculation and simulation (for financial stuff). It is great when it works. I worked at a shop that had these Xeon Phi systems that worked great, but I had no clue how, and they only worked with their pre-canned tools.

Just as an example, over a decade ago I replaced a few cases filled with racks and a SAN that made up a compute cluster with one box (plus SAN) and a backup box (both boxes were basically the same in case one failed), so basically dozens of servers were replaced by a two-CPU box with a couple of Tesla cards (probably one A100 later). The entire model had to be rewritten, but it was not that bad. I wanted to do it with AMD cards, but there was no easy way.

I would also say that modern networking has made all kinds of stuff more interesting (also lining Nvidia's pockets). Those TPUs do not make sense to me. I have no idea how to use them. They should release their version of CUDA.

derdrdirk · 2y ago

TPUs are tightly coupled to JAX and the XLA compiler. If your model is based on PyTorch you can use a bridge to export your model to StableHLO and then compile it for a TPU. In theory the XLA compiler should be more performant than PyTorch Inductor.

mrwilliamchang · 2y ago

To see memory consumption on the TPU while running on GKE you can look at kubernetes.io/node/accelerator/memory_used

https://cloud.google.com/kubernetes-engine/docs/how-to/tpus#...
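
On a plain TPU VM outside GKE, one possible stand-in for nvidia-smi is the memory_stats() method that JAX exposes on each device. A minimal sketch follows, assuming the returned dictionary contains bytes_in_use and bytes_limit keys; the exact keys depend on the backend, and some backends return None instead of a dictionary. The helper names are hypothetical.

```python
from typing import Optional

GIB = 1024 ** 3


def summarize_memory(stats: Optional[dict]) -> str:
    """Format a memory_stats() result as 'used / limit GiB (pct)'."""
    if not stats:
        return "memory stats unavailable on this backend"
    used = stats.get("bytes_in_use", 0)    # assumed key name
    limit = stats.get("bytes_limit", 0)    # assumed key name
    if limit <= 0:
        return f"{used / GIB:.2f} GiB in use (limit unknown)"
    return f"{used / GIB:.2f} / {limit / GIB:.2f} GiB ({100 * used / limit:.0f}%)"


def report_local_devices() -> list:
    """Return one summary line per local JAX device (empty if JAX is missing)."""
    try:
        import jax  # imported lazily so the formatter works without JAX
    except ImportError:
        return []
    return [f"{device}: {summarize_memory(device.memory_stats())}"
            for device in jax.local_devices()]


# Example with a made-up stats dictionary, so no accelerator is required.
example = {"bytes_in_use": 8 * GIB, "bytes_limit": 16 * GIB}
print(summarize_memory(example))
```

Calling report_local_devices() from inside the serving process gives a per-chip view; it only reflects memory held by that process's JAX runtime, so it is not a full system-wide monitor.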

ChrisArchitect · 2y ago

Ask HN:
