You do this not because you expect consumers with five-year-old hardware to provide meaningful utilization, but as a demo ("let me grab my old gaming machine and do some supercomputing real quick") and as a signal that you intend to stay the course. AMD management hasn't realized this even after various Nvidia people said this was exactly why they did it. At some point the absence of that signal is itself a signal that the AMD compute ecosystem is an unreliable investment, no?
Don’t get me started on vLLM and AITER.
I can't count how many times over the last 30 years I've had AMD drivers crash the OS (Linux and Windows). Nvidia have been mostly rock solid.
The thing is, the die isn't much use without a stable driver (and AI stack).
All you need is a used GPU slapped onto any disused DDR4 mobo. New 5060s, the 16 GB models, can do basically everything now.
With multiple cards in normal PCI express slots, LLM layers are split across cards.
When you run inference, the token passes through the layers on one card, then the next. You can repeat this for as many cards as you want.
Only the activations are copied between cards, which is roughly 10 MB/s at runtime, so PCIe width and generation are irrelevant. Even PCIe 1.0 x1 would be sufficient.
There are other software optimisations (row split, tensor parallel) which require fast interconnects like NVLink, but you can get a long way without any of that.
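A back-of-the-envelope check of that bandwidth claim (the hidden size and token rate below are illustrative assumptions for a 7B-class model, not measurements):

```python
# Rough estimate of inter-GPU traffic for layer-split (pipeline) inference.
# Only the activation vector for the current token crosses the bus at each split.
hidden_size = 4096        # assumed model dimension (7B-class)
bytes_per_value = 2       # fp16 activations
tokens_per_second = 30    # assumed generation speed

bytes_per_token = hidden_size * bytes_per_value           # one activation vector
mb_per_second = bytes_per_token * tokens_per_second / 1e6

pcie1_x1_mb_s = 250       # approximate usable PCIe 1.0 x1 bandwidth

print(f"{mb_per_second:.3f} MB/s per split point")  # well under 1 MB/s here
print(mb_per_second < pcie1_x1_mb_s)                # the link is not the bottleneck
```

Even padding that estimate generously for larger models or batching, you stay in single-digit MB/s, orders of magnitude below what even an ancient x1 slot provides.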
I am running Open WebUI + Ollama + a 7B model in a Proxmox LXC container. It consumes less than 2 GB of VRAM (the GPU only has 4 GB) and about 50% CPU. It is very usable, sometimes faster than the online ones to start giving you an answer, and 100% offline.
If I replaced the GPU with a faster one, I would have no need for the online ones.
I've used several agent frameworks and they all support many different providers, from cloud to local. These are orthogonal responsibilities. I'm using VertexAI for cloud and ollama on a Minisforum with ROCm locally. There is a dropdown to switch between them.
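The switch works because both backends can be addressed the same way; a minimal sketch of what such a dropdown does under the hood (the `Provider` table, model names, and cloud endpoint are hypothetical; the local URL is ollama's default OpenAI-compatible endpoint):

```python
# Provider switching sketch: the agent code stays the same,
# only the endpoint/model config changes. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    base_url: str  # OpenAI-compatible endpoint
    model: str

PROVIDERS = {
    # ollama serves an OpenAI-compatible API on localhost:11434 by default
    "local": Provider("ollama", "http://localhost:11434/v1", "qwen2.5:7b"),
    # cloud entry is a placeholder; real VertexAI setup differs
    "cloud": Provider("vertexai", "https://example-cloud-endpoint/v1", "gemini-pro"),
}

def pick_provider(choice: str) -> Provider:
    """What the UI dropdown effectively does: map a label to a backend."""
    return PROVIDERS[choice]

print(pick_provider("local").base_url)
```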
My last experiment in January was trying to run a Qwen model locally (RTX 4080; 128GB RAM; 9950X3D). I must have been doing it extremely wrong because the models that I tried either hallucinated severely or got stuck in a loop. The funniest one was stuck in a "but wait, ..." loop.
I fortunately had started experimenting with Claude, so I opted to pay Anthropic more money for tokens (work already covers the bill, this was for personal use).
That whole experience, plus a noisy GPU, put me off the idea of running/building local agents.
If you mean that you can't just run the largest unquantized models, then it's indeed true.
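The weight-memory arithmetic makes that cutoff concrete (a rough sketch: it ignores KV cache and runtime overhead, and the parameter counts are just examples):

```python
# Approximate weight memory for a model at different precisions.
# Ballpark only: real usage adds KV cache and framework overhead.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # GB

print(f"70B fp16: {weight_gb(70, 16):.0f} GB")  # ~140 GB: out of consumer reach
print(f"70B Q4:   {weight_gb(70, 4):.0f} GB")   # ~35 GB: a couple of 24 GB cards
print(f"7B  Q4:   {weight_gb(7, 4):.1f} GB")    # ~3.5 GB: squeezes onto a 4 GB GPU
```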
But let’s be honest, AMD has been an extremely bad citizen to non-corporate users.
For my iGPU I have to fake gfx900 and build things from source or use staging packages to get it working. Support for gfx90c is finally in the pipeline…
The improvements feel like a bodyguard finally letting you through the door just because NVIDIA is eating their lunch and they don’t want their club to be empty.
They strongarm their customers into using “Enterprise” GPUs to be able to play with ROCm, and are only broadening their offerings for market-share purposes.
Really shouldn’t reward this behavior.
I have an RDNA4 card, and they are certainly prioritizing CDNA over a CDNA + RDNA strategy or a unification strategy.
It's also probably worth trying Vulkan inference. It is now faster than ROCm on Strix Halo, for both token generation and prompt processing at over 16k context, so maybe you'll see the benefits too.
https://www.tipranks.com/news/amd-stock-slips-despite-a-majo...
I read:
" In addition to that, the update allows these agents to be turned into desktop apps for multiple operating systems. "
This seems like a new way to create apps: create an (AI) app that creates apps.
Minimum requirements:
Processor: AMD Ryzen AI 300-series