AMD Alveo V70 AI inference accelerator card (opens in new tab)

(xilinx.com)

109 pointstherealchiggs3y ago87 comments

87 comments

This is inference only. AMD should invest into the full AI stack starting from training. For this they need a product comparable to NVIDIA 4090, so that entry level researchers could use their hardware. Honestly, I don't know why AMD aren't doing that already, they are best positioned to do that in the industry landscape.

modeless3y ago

The hardware is not their major problem. They have been failing super hard at the software side of machine learning for a solid decade now.

It seems like pure management incompetence to me. They need to invest a whole lot more in software, integrating their stuff directly into pytorch/TF/XLA/etc and making sure it works on consumer cards too. The investment would be paid back tenfold. The market is crying out for more competition for Nvidia and there's huge money to be made on the datacenter side but it all needs to work on the consumer side too.

gymbeaux3y ago

AMD has finite resources, like any company, and they’ve been focusing on CPU/datacenter dominance, which to me is both the safer bet and the more-lucrative bet. It wasn’t that long ago that AMD was on the brink of bankruptcy (~2016), so I appreciate that they’re not trying to divide their attention.

Their attempts at entering the ML space so far have been failures, and they are wise to hold off on really competing with Nvidia until they have the bandwidth to go “all in”. Consciously NOT trying to compete with Nvidia is the reason they didn’t go bankrupt. Their Radeon division minted from 2016-2020 because they focused on a niche Nvidia was neglecting- low-end/eSports (also leveraging their APU expertise to win PS4/Xbox contracts).

I think Nvidia will eventually lose its monopoly on ML/AI stuff as AMD, Apple, Qualcomm, Amazon and Google chip away at their “moat” with their own accelerators/NPUs. As mentioned though, the Nvidia Edge really comes from CUDA and other software, not the hardware. I doubt that Apple, Qualcomm, Amazon or Google will be interested in selling hardware direct to consumers. They want that sweet, sweet cloud money and/or competitive advantages in their phones (like photo processing). I don’t want to be paying AWS $100/mo for a GPU I could pay $600 once for. I do think AMD/RTG will go hard on Nvidia eventually, and it will not matter whether you have an AMD or Nvidia GPU for Tensorflow or spaCy or whatever else.

2 more replies

hedgehog3y ago

Last I checked they see deep learning training as a niche market, their strategy is to try to win big contracts (HPC etc) and then supply software specifically for that. Then "the community" will supply software. Having spent a bunch of time beating my head on this and related walls it's not clear to me that they're entirely wrong from an economic standpoint. Remember that 2/3 public cloud providers have their own chips as well as NVIDIA's so it would be tough to negotiate a good deal. As a user it's super irritating to be stuck on NVIDIA especially when Jensen gets up on stage to say "haha, Moore's law is over, stop expecting our products to get cheaper."

1 more reply

dathinab3y ago

> product comparable to NVIDIA 4090

no, they need a product good at training and gpu compute at a reasonable price

that product doesn't need to be good at rendering, ray tracing and similar

sure students and some independent contractors probably love getting both a good graphic card and a CUDA card in one and it makes it easier for people to experiment with it but company PCs normally ban playing games on company PCs and the overlap of "needing max GPU compute" and "needing complicated 3D rendering tasks" is limited.

through having 1 product instead of two does make supply chain and pricing easier

but then 4090 is by now in a price range where students are unlikely to afford it and people will think twice about buying it just to play around with GPU compute.

So e.g. the 7900XTX having somewhat comparable GPU compute usability then a 4080 would have been good enough for the non company use case, where a dedicated compute-per-money cheaper GPU compute only card would be preferable for the company use case I think.

nl3y ago

Long time ML worker here. People work in one of 3 ways:

1) Consumer Nvidia GPU cards on custom PCs

2) Self hosted shared server

3) Cloud infrastructure.

There is no "GPU compute only card" that is widely used outside servers.

> company PCs normally ban playing games on company PCs and the overlap of "needing max GPU compute" and "needing complicated 3D rendering tasks" is limited.

The "don't play games thing" isn't a factor. Most companies just buy a 4090 or whatever, and if they have to tell staff not to play games, they say "don't play games". Fortnight runs just fine on pretty much anything anyway.

1 more reply

hedgehog3y ago

Big question is why? Would competing for entry level researchers buy them much?

kombine3y ago

Because entry level researchers shape the industry in the long term. I'm in academia, I worked at two universities and I have not seen a lab that uses non-NVIDIA hardware for research. Majority of graduates go to work in the industry.

1 more reply

wmf3y ago

MI200, MI250, and MI300 should work for training. They don't have an exact equivalent to the 4090 but that may be setting the bar too high. Nobody can deliver everything that Nvidia has but better.

kombine3y ago

It will be harder for a small academic lab to afford MI300, whereas everyone can purchase a few cards costing $1500. And even if I had money to buy MI300, I wouldn't - it is a too risky investment, because we have no idea how suitable they are for common AI research workflows. They need to lower the entry bar, so that people can try out their hardware. Even 80% of the performance of 4090 would be enough at an appropriate price point.

Hooray_Darakian3y ago

> AMD should invest into the full AI stack starting from training.

https://www.amd.com/en/graphics/servers-solutions-rocm-ml

> For this they need a product comparable to NVIDIA 4090, so that entry level researchers could use their hardware.

Why is a high end product a requirement for entry level research?

kombine3y ago

4090 (or 3090, 1080Ti and so on) is a high-end consumer GPU, but at the same time it is an entry level GPU for AI researchers. Don't forget that workstation cards (RTX 8000) let alone server-grade GPUs such as A100 are an order of magnitude more expensive.

2 more replies

marcyb5st3y ago

Because of VRAM. Even a simple model for language can easily max out a 4090.

Also, ROC-M is a bit of a mess to setup. With Nvidia i just need to install cuda, cudnn and then pip install tensorflow/pytorch.

jzwinck3y ago

Because high end research uses a fleet of them, not just one.

wyldfire3y ago

AMD XDNA – Versal AI Core / 2nd-gen AIE-ML tiles

Are these programmable by the end-user? The "software programmability" section describes "Vitis AI" frameworks supported. But can we write our own software on these?

Is this card FPGA-based?

EDIT: [1] more info on the AI-engine tiles: scalar cores + "adaptable hardware (FPGA?)" + {AI+DSP}.

[1] https://www.xilinx.com/products/technology/ai-engine.html

derefr3y ago

It's very likely FPGA-based; Xilinx is an FPGA company. This is being pitched as an "AI accelerator", but "Alveo" as a product line existed before AMD's acquisition of Xilinx, and other "Alveo" products exist (https://www.xilinx.com/products/boards-and-kits/alveo.html) that are marketed for other purposes, while really just being Xilinx FPGAs pre-programmed to perform specific other tasks, with some domain-specific DSPs + interconnects around the edges.

It's possible that AMD could have reworked an existing Xilinx design to incorporate RDNA chiplets in place of some of the FPGA-gate-grid chiplets, creating a heterogeneous mesh; but I find it just as likely that AMD just took their VLSI for an RDNA core and loaded it onto the existing FPGA.

typon3y ago

It's not a traditional FPGA chip (lots of luts and flip flops). The "AI Engine" is basically hardened chiplets that are working alongside soft logic chiplets and I/O. This is how they're able to get their performance/power numbers

1 more reply

YakBizzarro3y ago

Interesting. It seems then that the xdna architecture in the Ryzen 4070 is nothing more than a port of the existing Xilinx Versal cores (fpga+ai engine)

wmf3y ago

That's what AMD said.

djmips3y ago

Says RDNA based which is AMD's GPU tech.

mastax3y ago

No, it says XDNA.

1 more reply

Roark663y ago

If this is based on fpga tech (xilinx) I don't think it will have a cost/benefit edge over asics. Why not do their own TPU like Google did? Nowadays even embedded MCUs come with AI accelerators (last I heard was 5TOPS in a banana pi-cm4 board - that is sufficient for object detection stuff and perhaps even more).

westmeal3y ago

No price is listed on their site so I'm assuming its gonna be stupid expensive, but if anyone knows would you mind posting?

dragontamer3y ago

Previous AMD/Xilinx Alveo are in the $5000 to $20,000 range USD. I'd assume somewhere around there, or maybe even a bit higher.

EDIT: 75W is a smaller card than I expected. "Inference" also usually means "cheaper". so maybe we can be optimistic with $5000-ish ??

Anyone shocked by the price, remember that this is an FPGA-line from Xilinx. Not a GPU from Radeon. Expect very high prices.

dhruvdh3y ago

It’s 1,995$ - I tried to order when it was announced.

dhruvdh3y ago

I went through the checkout flow earlier, it was 1,995$ pre tax.

wyldfire3y ago

The price is shown as $1,995.00 + tax&shipping for A-V70-P16G-ES3-G.

ChuckNorris893y ago

Probably because it's a product aimed at datacenters and cloud providers who work directly with Xilinx/AMD to develop it, so they already know the price.

mcilie3y ago

It costs 1995 if you look at the "order now" section

messe3y ago

> High-Density Video Decoder**: 96 channels of 1920x1080p

> [...]

> **: @10 fps, H.264/H.265

Is 10 fps a standard measure for this kind of thing?

novaRom3y ago

10 fps should be fast enough to provide input tensors for real time inference with small scale transformers / convolutional nets.

scottlamb3y ago

Maybe. Running inference at 10 fps is probably plenty. But that doesn't mean you only have to do 10 fps of H.264/H.265 decoding. I think the most common scenario is for the input video to be e.g. 30 fps with mostly P frames that each depend on the prior frame in a chain. In that case, you need to decode almost [1] 30 fps to get 10 fps of evenly spaced frames to process.

[1] You could skip the last P frame before an IDR frame, but that doesn't buy you much.

1 more reply

kmeisthax3y ago

Every time I hear about an AI accelerator, I get really excited, then it turns out to be inference only.

psychphysic3y ago

Did I miss something did AMD buy Xilinx? Makes sense I suppose after Intel bought Altera.

Who owns lattice?

AnonMO3y ago

In the time it took you to write this you could have searched your two question on google clicked the first wiki link and got your answer

psychphysic3y ago

Bah! I'll go back to voicing all my thoughts to chagpt

p1esk3y ago

How much memory does it have?

novaRom3y ago

What TOPS means exactly in "... TOPS*|(INT8) 404 ..." ?

capableweb3y ago

TOPS - Trillions of Operations Per Second, used as a benchmark to figure out the performance of the accelerator.

In my experience, mostly a marketing number, higher TOPS doesn't actually mean it'll be faster than something with a lower TOPS.

As always, you need to do your own benchmarks with your use case in mind.

novaRom3y ago

What kind of operations is not clear. Wether it's a simple logic operation or a FMA is big difference.

1 more reply

tgtweak3y ago

Sad to see amds ROCm efforts essentially abandoned. They were close to universal interop for cudnn and cuda on amd (and other!) Architectures.

Hopefully Intel takes a stab at it with their ARC line out now.

dhruvdh3y ago

Abandoned? What?

rubatuga3y ago

In Their consumer card line

h2odragon3y ago

Douglas Adams said we'd have robots to watch TV for us. That seems to be the designed use case for this.

16gb RAM / 96 video channels ... I haven't done any of that work but it feels like they expect that "96" not to be fully used in practice.

inetsee3y ago

I have no problem imagining a security camera application needing to monitor quite a few video channels.

phkahler3y ago

>> I have no problem imagining a security camera application needing to monitor quite a few video channels.

As a joke I sometimes tell people the automatic flushing toilets in public bathrooms work by having a little camera monitored by someone in a 3rd world country who remotely flushes as needed, while monitoring a whole lot of video feeds. They usually don't buy it, but will often acknowledge that our world is uncomfortably close to having stuff like become reality.

h2odragon3y ago

Certainly. I'm suspecting that doing much of anything with all 96 channels would really need more RAM, for most users.

1 more reply

andy_ppp3y ago

/camera/state/g

hhh3y ago

I have models in production that currently monitor ~400 cameras with an addition of 2-3 cameras/month. If it were cheap enough, it would be useful for our use case (Quality Control). We generally pull from cameras roughly 6400 pixels per region of interest, of which one instance may have 4-30 RoIs across N cameras.

71a54xd3y ago

Curious where I could learn more about models like this / potentially see some open code outlining tooling / infra required as well?

1 more reply

hyuen3y ago

I won't even take a look at the numbers unless they show a PyTorch model running on it, the problem is the big disconnect between HW and SW, realistically, have you ever seen any off-the shelf model running on something other than NVidia?

Narew3y ago

It's for inference only not training. In this use case, there is lots of device that's not Nvidia. For server you have Google tpu, for more close to public there is the Apple Neural Engine for example.

hhh3y ago

I've ran models on Apple HW, Raspberry Pi, random CPUs, NVIDIA GPUs, TPUs. Waiting to get my hands on Tenstorrent gear.

kombine3y ago

That. I work on the research side and I am still waiting for non-NVIDIA hardware for training deep models.

frozenport3y ago

Google's TPU comes to mind

j / k navigate · click thread line to collapse

87 comments

kombine3y ago

modeless3y ago

The hardware is not their major problem. They have been failing super hard at the software side of machine learning for a solid decade now.

gymbeaux3y ago

2 more replies

hedgehog3y ago

1 more reply

dathinab3y ago

> product comparable to NVIDIA 4090

no, they need a product good at training and gpu compute at a reasonable price

that product doesn't need to be good at rendering, ray tracing and similar

through having 1 product instead of two does make supply chain and pricing easier

but then 4090 is by now in a price range where students are unlikely to afford it and people will think twice about buying it just to play around with GPU compute.

nl3y ago

Long time ML worker here. People work in one of 3 ways:

1) Consumer Nvidia GPU cards on custom PCs

2) Self hosted shared server

3) Cloud infrastructure.

There is no "GPU compute only card" that is widely used outside servers.

> company PCs normally ban playing games on company PCs and the overlap of "needing max GPU compute" and "needing complicated 3D rendering tasks" is limited.

1 more reply

hedgehog3y ago

Big question is why? Would competing for entry level researchers buy them much?

kombine3y ago

1 more reply

wmf3y ago

MI200, MI250, and MI300 should work for training. They don't have an exact equivalent to the 4090 but that may be setting the bar too high. Nobody can deliver everything that Nvidia has but better.

kombine3y ago

Hooray_Darakian3y ago

> AMD should invest into the full AI stack starting from training.

https://www.amd.com/en/graphics/servers-solutions-rocm-ml

> For this they need a product comparable to NVIDIA 4090, so that entry level researchers could use their hardware.

Why is a high end product a requirement for entry level research?

kombine3y ago

2 more replies

marcyb5st3y ago

Because of VRAM. Even a simple model for language can easily max out a 4090.

Also, ROC-M is a bit of a mess to setup. With Nvidia i just need to install cuda, cudnn and then pip install tensorflow/pytorch.

jzwinck3y ago

Because high end research uses a fleet of them, not just one.

wyldfire3y ago

AMD XDNA – Versal AI Core / 2nd-gen AIE-ML tiles

Are these programmable by the end-user? The "software programmability" section describes "Vitis AI" frameworks supported. But can we write our own software on these?

Is this card FPGA-based?

EDIT: [1] more info on the AI-engine tiles: scalar cores + "adaptable hardware (FPGA?)" + {AI+DSP}.

[1] https://www.xilinx.com/products/technology/ai-engine.html

derefr3y ago

typon3y ago

1 more reply

YakBizzarro3y ago

Interesting. It seems then that the xdna architecture in the Ryzen 4070 is nothing more than a port of the existing Xilinx Versal cores (fpga+ai engine)

wmf3y ago

That's what AMD said.

djmips3y ago

Says RDNA based which is AMD's GPU tech.

mastax3y ago

No, it says XDNA.

1 more reply

Roark663y ago

westmeal3y ago

No price is listed on their site so I'm assuming its gonna be stupid expensive, but if anyone knows would you mind posting?

dragontamer3y ago

Previous AMD/Xilinx Alveo are in the $5000 to $20,000 range USD. I'd assume somewhere around there, or maybe even a bit higher.

EDIT: 75W is a smaller card than I expected. "Inference" also usually means "cheaper". so maybe we can be optimistic with $5000-ish ??

Anyone shocked by the price, remember that this is an FPGA-line from Xilinx. Not a GPU from Radeon. Expect very high prices.

dhruvdh3y ago

It’s 1,995$ - I tried to order when it was announced.

dhruvdh3y ago

I went through the checkout flow earlier, it was 1,995$ pre tax.

wyldfire3y ago

The price is shown as $1,995.00 + tax&shipping for A-V70-P16G-ES3-G.

ChuckNorris893y ago

Probably because it's a product aimed at datacenters and cloud providers who work directly with Xilinx/AMD to develop it, so they already know the price.

mcilie3y ago

It costs 1995 if you look at the "order now" section

messe3y ago

> High-Density Video Decoder**: 96 channels of 1920x1080p

> [...]

> **: @10 fps, H.264/H.265

Is 10 fps a standard measure for this kind of thing?

novaRom3y ago

10 fps should be fast enough to provide input tensors for real time inference with small scale transformers / convolutional nets.

scottlamb3y ago

[1] You could skip the last P frame before an IDR frame, but that doesn't buy you much.

1 more reply

kmeisthax3y ago

Every time I hear about an AI accelerator, I get really excited, then it turns out to be inference only.

psychphysic3y ago

Did I miss something did AMD buy Xilinx? Makes sense I suppose after Intel bought Altera.

Who owns lattice?

AnonMO3y ago

In the time it took you to write this you could have searched your two question on google clicked the first wiki link and got your answer

psychphysic3y ago

Bah! I'll go back to voicing all my thoughts to chagpt

p1esk3y ago

How much memory does it have?

novaRom3y ago

What TOPS means exactly in "... TOPS*|(INT8) 404 ..." ?

capableweb3y ago

TOPS - Trillions of Operations Per Second, used as a benchmark to figure out the performance of the accelerator.

In my experience, mostly a marketing number, higher TOPS doesn't actually mean it'll be faster than something with a lower TOPS.

As always, you need to do your own benchmarks with your use case in mind.

novaRom3y ago

What kind of operations is not clear. Wether it's a simple logic operation or a FMA is big difference.

1 more reply

tgtweak3y ago

Sad to see amds ROCm efforts essentially abandoned. They were close to universal interop for cudnn and cuda on amd (and other!) Architectures.

Hopefully Intel takes a stab at it with their ARC line out now.

dhruvdh3y ago

Abandoned? What?

rubatuga3y ago

In Their consumer card line

h2odragon3y ago

Douglas Adams said we'd have robots to watch TV for us. That seems to be the designed use case for this.

16gb RAM / 96 video channels ... I haven't done any of that work but it feels like they expect that "96" not to be fully used in practice.

inetsee3y ago

I have no problem imagining a security camera application needing to monitor quite a few video channels.

phkahler3y ago

>> I have no problem imagining a security camera application needing to monitor quite a few video channels.

h2odragon3y ago

Certainly. I'm suspecting that doing much of anything with all 96 channels would really need more RAM, for most users.

1 more reply

andy_ppp3y ago

/camera/state/g

hhh3y ago

71a54xd3y ago

Curious where I could learn more about models like this / potentially see some open code outlining tooling / infra required as well?

1 more reply

hyuen3y ago

Narew3y ago

hhh3y ago

I've ran models on Apple HW, Raspberry Pi, random CPUs, NVIDIA GPUs, TPUs. Waiting to get my hands on Tenstorrent gear.

kombine3y ago

That. I work on the research side and I am still waiting for non-NVIDIA hardware for training deep models.

frozenport3y ago

Google's TPU comes to mind

j / k navigate · click thread line to collapse