It seems to me that most model providers are not running or testing their models on the most widely used backends (e.g. llama.cpp, Ollama, etc.), because if they were, they would see how broken their releases are.
Tool calling is the Achilles' heel: most models fail at it unless you either modify the system prompt or run through a proxy so you can inject into or munge the request and the reply.
Like seriously… how many billions and billions go into AI development (we actually saw one >800 billion valuation last week, so almost a whole trillion), and yet 99.999% of all models from the big names do not work straight out of the box with the most common backends. Blows my mind!
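
To make the "munge the reply" part concrete, here is a rough sketch of the kind of fix-up a proxy ends up doing. It assumes the backend returns an OpenAI-style chat completion dict and that the model dumped its tool call into the plain text content wrapped in `<tool_call>` tags. That tag format and the `munge_reply` name are illustrative assumptions, not something every model or backend does, so adjust the pattern to whatever your model actually emits.

```python
import json
import re
import uuid

# Assumed format: the model writes tool calls as JSON inside <tool_call> tags.
# Other models use different wrappers, so treat this pattern as a placeholder.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def munge_reply(reply: dict) -> dict:
    """Promote tool calls found in plain text to structured tool_calls."""
    for choice in reply.get("choices", []):
        message = choice.get("message", {})
        if message.get("tool_calls"):
            continue  # the backend already parsed the call, nothing to fix

        content = message.get("content") or ""
        calls = []
        for raw in TOOL_CALL_RE.findall(content):
            try:
                parsed = json.loads(raw)
            except json.JSONDecodeError:
                continue  # leave malformed JSON in the text untouched
            if not isinstance(parsed, dict) or "name" not in parsed:
                continue
            calls.append({
                "id": f"call_{uuid.uuid4().hex[:8]}",
                "type": "function",
                "function": {
                    "name": parsed["name"],
                    # OpenAI-style clients expect arguments as a JSON string
                    "arguments": json.dumps(parsed.get("arguments", {})),
                },
            })

        if calls:
            message["tool_calls"] = calls
            # Strip the raw tags so the client does not see them as text
            message["content"] = TOOL_CALL_RE.sub("", content).strip() or None
            choice["finish_reason"] = "tool_calls"
    return reply
```

You would call `munge_reply(response_json)` in the proxy right before passing the backend's response on to the client. The request side (injecting a system prompt that tells the model which format to use) is the same idea in the other direction.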