You do this not because you expect consumers with five-year-old hardware to provide meaningful utilization, but as a demo ("let me grab my old gaming machine and do some supercomputing real quick") and as a signal that you intend to stay the course. AMD management hasn't realized this even after various Nvidia people said this was exactly why they did it. At some point the absence of that signal is itself a signal that the AMD compute ecosystem is an unreliable investment, no?
Don’t get me started on vLLM and AITER.
I can't count how many times over the last 30 years I've had AMD drivers crash the OS (Linux and Windows). Nvidia have been mostly rock solid.
The thing is, the die isn't much use without a stable driver (and AI stack).
All you need is a used GPU slapped onto any disused DDR4 mobo. New 5060s, the 16 GB models, can do basically everything now.
With multiple cards in normal PCI express slots, LLM layers are split across cards.
When you run inference, the token passes through the layers on one card, then the next. You can repeat this for as many cards as you want.
Only the activations are copied between cards, which is roughly 10 MB/s at runtime, so PCIe width and generation are irrelevant. Even PCIe 1.0 x1 would be sufficient.
There are other software optimisations (row split, tensor parallel) which require fast interconnects like NVLink, but you can get a long way without any of that.
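A back-of-the-envelope check of that bandwidth claim (the hidden size and token rate below are illustrative assumptions for a 7B-class model, not measurements):

```python
# Rough estimate of inter-GPU traffic for layer-split (pipeline) inference.
# Only the activation vector for the current token crosses the bus at each split.
hidden_size = 4096        # assumed model dimension (7B-class)
bytes_per_value = 2       # fp16 activations
tokens_per_second = 30    # assumed generation speed

bytes_per_token = hidden_size * bytes_per_value           # one activation vector
mb_per_second = bytes_per_token * tokens_per_second / 1e6

pcie1_x1_mb_s = 250       # approximate usable PCIe 1.0 x1 bandwidth

print(f"{mb_per_second:.3f} MB/s per split point")  # well under 1 MB/s here
print(mb_per_second < pcie1_x1_mb_s)                # the link is not the bottleneck
```

Even padding that estimate generously for larger models or batching, you stay in single-digit MB/s, orders of magnitude below what even an ancient x1 slot provides.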
I am running Open WebUI + Ollama + a 7B model in a Proxmox LXC container. It consumes less than 2 GB of VRAM (the GPU only has 4 GB) and about 50% CPU. It is very usable, sometimes faster than the online ones to start giving you an answer, and 100% offline.
If I replaced the GPU with a faster one, I would have no need for the online ones.
I've used several agent frameworks and they all support many different providers, from cloud to local. These are orthogonal responsibilities. I'm using VertexAI for cloud and ollama on a Minisforum with ROCm locally. There is a dropdown to switch between them.
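The switch works because both backends can be addressed the same way; a minimal sketch of what such a dropdown does under the hood (the `Provider` table, model names, and cloud endpoint are hypothetical; the local URL is ollama's default OpenAI-compatible endpoint):

```python
# Provider switching sketch: the agent code stays the same,
# only the endpoint/model config changes. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    base_url: str  # OpenAI-compatible endpoint
    model: str

PROVIDERS = {
    # ollama serves an OpenAI-compatible API on localhost:11434 by default
    "local": Provider("ollama", "http://localhost:11434/v1", "qwen2.5:7b"),
    # cloud entry is a placeholder; real VertexAI setup differs
    "cloud": Provider("vertexai", "https://example-cloud-endpoint/v1", "gemini-pro"),
}

def pick_provider(choice: str) -> Provider:
    """What the UI dropdown effectively does: map a label to a backend."""
    return PROVIDERS[choice]

print(pick_provider("local").base_url)
```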
My last experiment in January was trying to run a Qwen model locally (RTX 4080; 128GB RAM; 9950X3D). I must have been doing it extremely wrong because the models that I tried either hallucinated severely or got stuck in a loop. The funniest one was stuck in a "but wait, ..." loop.
I fortunately had started experimenting with Claude, so I opted to pay Anthropic more money for tokens (work already covers the bill, this was for personal use).
That whole experience, plus a noisy GPU, put me off the idea of running/building local agents.
If you mean that you can't just run the largest unquantized models, then it's indeed true.
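The weight-memory arithmetic makes that cutoff concrete (a rough sketch: it ignores KV cache and runtime overhead, and the parameter counts are just examples):

```python
# Approximate weight memory for a model at different precisions.
# Ballpark only: real usage adds KV cache and framework overhead.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # GB

print(f"70B fp16: {weight_gb(70, 16):.0f} GB")  # ~140 GB: out of consumer reach
print(f"70B Q4:   {weight_gb(70, 4):.0f} GB")   # ~35 GB: a couple of 24 GB cards
print(f"7B  Q4:   {weight_gb(7, 4):.1f} GB")    # ~3.5 GB: squeezes onto a 4 GB GPU
```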
But let’s be honest, AMD has been an extremely bad citizen to non-corporate users.
For my iGPU I have to fake gfx900 and build things from source or use staging packages to get it working. Support for gfx90c is finally in the pipeline…
The improvements feel like a bodyguard finally letting you through the door just because NVIDIA is eating their lunch and they don’t want their club to be empty.
They strongarm their customers into using “Enterprise” GPUs to be able to play with ROCm, and are only broadening their offerings for market-share purposes.
Really shouldn’t reward this behavior.
I have an RDNA4 card, and they are certainly prioritizing CDNA over a CDNA + RDNA strategy or a unification strategy.
It's also probably worth trying Vulkan inference. It is now faster than ROCm on Strix Halo, for both token generation and prompt processing at over 16k context, so maybe you'll see the benefits too.
https://www.tipranks.com/news/amd-stock-slips-despite-a-majo...
I read:
" In addition to that, the update allows these agents to be turned into desktop apps for multiple operating systems. "
This seems like a new way to create apps: create an (AI) app that creates apps.
Minimum requirements:
Processor: AMD Ryzen AI 300-series