The Rapid MLX team has done some interesting benchmarking that suggests Qwopus 27B is pretty solid. Their tool includes benchmarking features, so you can evaluate your own setup too.
They use a metric called the Model-Harness Index (MHI):
MHI = 0.50 × ToolCalling + 0.30 × HumanEval + 0.20 × MMLU (scale 0-100)
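If you want to sanity-check the weighting by hand, here is a minimal Python sketch of that formula. It is not Rapid-MLX's own code: the function name `mhi` and its argument names are made up for illustration, and it assumes each component score is already on a 0-100 scale.

```python
# Hypothetical helper, not part of Rapid-MLX: it only applies the weights
# from the formula above. Assumes every input score is on a 0-100 scale.
WEIGHTS = {"tool_calling": 0.50, "human_eval": 0.30, "mmlu": 0.20}


def mhi(tool_calling: float, human_eval: float, mmlu: float) -> float:
    """Weighted sum of the three component scores (result is 0-100)."""
    scores = {"tool_calling": tool_calling, "human_eval": human_eval, "mmlu": mmlu}
    for name, value in scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return sum(WEIGHTS[name] * value for name, value in scores.items())


if __name__ == "__main__":
    # Made-up example scores, not real benchmark results.
    print(round(mhi(tool_calling=80, human_eval=70, mmlu=60), 1))  # 73.0
```

Since tool calling carries half the weight, a model that is weak there will score low on MHI even with strong HumanEval and MMLU numbers.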
https://github.com/raullenchai/Rapid-MLX