Really?? That runs terribly for me. I also have 64GB of RAM, but meh. It gets so bad as soon as I can no longer offload everything; the tokens literally drizzle in. With full offloading they appear faster than I can read (Llama 3 8B at 8-bit quant, on a Radeon Pro VII with 16GB of HBM2!)
Oh man, I hate to say it, but it's likely your AMD card. Yes, they can run LLMs and SD, just badly. Larger models are usable for me with partial offloading, but you're right that fully loading the model into VRAM is really preferable.
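If it helps, here's roughly what the two setups look like with llama-cpp-python (a minimal sketch, assuming a ROCm/HIP build of llama.cpp for the AMD card; the GGUF filename and layer count are placeholders you'd tune for your hardware):

```python
from llama_cpp import Llama

# Full offload: n_gpu_layers=-1 pushes every layer to the GPU.
# Fast, but the whole model plus KV cache has to fit in VRAM
# (roughly 9-10 GB for an 8B model at 8-bit quant).
llm_full = Llama(
    model_path="llama-3-8b-instruct.Q8_0.gguf",  # hypothetical filename
    n_gpu_layers=-1,
)

# Partial offload: only the first N layers live in VRAM; the rest
# run on the CPU out of system RAM. Generation speed drops roughly
# in proportion to how many layers stay on the CPU, which is the
# "tokens drizzle in" effect.
llm_partial = Llama(
    model_path="llama-3-8b-instruct.Q8_0.gguf",
    n_gpu_layers=20,  # tune down until VRAM stops overflowing
)

out = llm_full("Q: Why is full offload faster? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

Same idea with the llama.cpp CLI is just the `-ngl` / `--n-gpu-layers` flag. Either way, the cliff between "everything in VRAM" and "anything spilling to RAM" is exactly what you're both describing.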