https://github.com/ggerganov/llama.cpp/discussions/4167
On my machine, Llama 3.3 70B runs at ~7 text-generation tokens per second on a 128GB MacBook Pro (M-series Max), while producing GPT-4-feeling text with more in-depth responses and fewer bullet lists. Llama 3.3 70B also doesn't fight the system prompt; it leans in.
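If you want to sanity-check the tokens/sec figure on your own hardware, here's a rough sketch using the llama-cpp-python bindings; the GGUF filename and settings are placeholders, not a specific recommendation:

```python
# Rough tokens/sec measurement with the llama-cpp-python bindings
# (pip install llama-cpp-python). The model path is a placeholder --
# point it at whichever quantization you actually downloaded.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/Llama-3.3-70B-Instruct-Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,  # offload all layers to Metal on Apple Silicon
    n_ctx=8192,
    verbose=False,
)

prompt = "Explain the tradeoffs of quantizing a 70B model to 4 bits."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```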
Consider e.g. LM Studio (0.3.5 or newer) for an MLX-centered (Apple Metal) UI; include "MLX" in your search term when downloading models.
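If you'd rather script it than click through a UI, the mlx-lm package drives the same MLX backend from Python. A minimal sketch; the mlx-community repo id is an assumption, so search Hugging Face for a current 4-bit conversion:

```python
# Minimal MLX generation sketch using the mlx-lm package
# (pip install mlx-lm). The repo id below is an assumption --
# browse Hugging Face's mlx-community org for current 4-bit conversions.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.3-70B-Instruct-4bit")

prompt = "Summarize the pros and cons of running a 70B model locally."
# verbose=True prints the generated text plus prompt/generation tok/s stats
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
```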
Also, do not scrimp on storage. At 60-100GB per model, it only takes a day of experimentation to fill 2.5TB of model cache. And remember to exclude that path from your Time Machine backups (tmutil addexclusion <path> does it from the terminal).
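A quick way to see what's eating the disk: walk the cache and print per-model sizes. The cache path below is a hypothetical default; point it at wherever your tool actually stores downloads:

```python
# Report per-model disk usage in a local model cache so you can decide
# what to evict. The cache path is an assumption -- substitute the
# directory your downloader of choice actually uses.
from pathlib import Path

CACHE = Path.home() / ".cache" / "lm-studio" / "models"  # hypothetical default

def dir_size_gb(path: Path) -> float:
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e9

sizes = sorted(
    ((dir_size_gb(d), d.name) for d in CACHE.iterdir() if d.is_dir()),
    reverse=True,
)
total = 0.0
for gb, name in sizes:
    total += gb
    print(f"{gb:7.1f} GB  {name}")
print(f"{total:7.1f} GB  total")
```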
The problem is that you often can't run anything else. I've had trouble running larger models on a 64GB machine when I've had a bunch of Firefox and VS Code tabs open at the same time.
Qwen2.5 Coder 14B at 4-bit quantization will run, but you'll need to be diligent about what else you have in memory at the same time.
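The back-of-the-envelope math for why a 4-bit 14B fits: weights plus KV cache. A sketch with assumed Qwen2.5 14B dimensions (the layer/head counts are from memory; verify them against the model's config.json):

```python
# Back-of-the-envelope RAM estimate for a quantized model: weights
# plus KV cache. Architecture numbers below are approximate values
# for Qwen2.5 14B -- check the model's config.json before trusting them.

def weights_gb(n_params_b: float, bits_per_weight: float = 4.5) -> float:
    # 4-bit quants average a bit over 4 bits/weight once you count
    # quantization scales and the tensors kept at higher precision
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values; fp16 elements by default
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

w = kv = 0.0
w = weights_gb(14.7)  # ~14.7B parameters
kv = kv_cache_gb(n_layers=48, n_kv_heads=8, head_dim=128, ctx_len=32768)
print(f"weights ~{w:.1f} GB, KV cache ~{kv:.1f} GB, total ~{w + kv:.1f} GB")
```

That lands around 15GB at a full 32k context, which is why it's workable on a 64GB machine only if you watch what else is resident.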
Yes, it is not "local" since I'll have to use the internet when not at home. But it also won't drain the battery very quickly, which I suspect would happen to a MacBook Pro running such models. Also, 70B models are out of reach of my setup, and I suspect they're painfully slow on Mac hardware anyway.