Using LM Studio, trying to load the model fails with an "insufficient system resources" error.
I disabled that guardrail, set the context length to 1024, and got 0.24 tokens per second. For comparison, the 32B distill gets about 20 tokens per second.
It also became incredibly flaky, using up all available RAM and crashing the whole system a few times.
While the M4 Max 128GB handles the 32B well, it seems to choke on this. Here's to hoping someone works on something in-between (or works out what the ideal settings are because nothing I fiddled with helped much).
In theory, half of the model fits in RAM, so it should be GPU-limited if memory management is smart.
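The "half fits in RAM" arithmetic can be sketched roughly. Assuming this is the full 671B-parameter model and using typical GGUF bits-per-weight figures (all of these numbers are my assumptions, not from the post):

```python
# Back-of-envelope check: does a quantized 671B model fit in 128 GB of
# unified memory? Quant sizes below are approximate bits per weight for
# common GGUF quantizations (assumed, not measured).

def model_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of the weights in GB (decimal)."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

RAM_GB = 128  # M4 Max unified memory

for quant, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6), ("IQ1_S", 1.6)]:
    size = model_size_gb(671, bpw)
    ratio = size / RAM_GB
    print(f"{quant:7s} ~{size:5.0f} GB -> {ratio:.1f}x available RAM")
```

Even at an aggressive ~2.6 bits/weight you land around 1.7x RAM, so roughly half the weights stay resident and the rest stream from SSD via mmap, which would explain speeds in the sub-1-token/s range rather than a GPU-limited one.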