So you end up hitting roadblocks for seemingly simple Pydantic schemas.
But they seem to be considered disparate concepts. So I'm trying to understand if there's some additional nuance I'm missing.
Not really this application, but QvQ for visual reasoning is also impressive. https://qwenlm.github.io/blog/qvq-72b-preview/
Meta has used Qwen as the basis for their Apollo research. https://arxiv.org/abs/2412.10360
We’ve tested locally with Llama 3.2 11B Vision on Ollama: https://github.com/vlm-run/vlmrun-hub/blob/main/tests/benchm...
FWIW, I think Ollama's structured outputs API is quite buggy compared to the HF transformers variant.
If you haven’t heard of us, we provide a language and runtime that let you define your schemas in a simpler syntax and use them with _any_ model, not just those that implement tool calling or JSON mode, by relying on schema-aligned parsing. Check it out! https://github.com/BoundaryML/baml
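For anyone wondering what "schema-aligned parsing" can look like in general, here is a toy sketch in plain Python. It is not BAML's implementation or API; the `Invoice` schema and the `extract_json_object`, `coerce`, and `parse_aligned` helpers are all made up for illustration. The idea shown is only this: accept free-form model output, find the JSON inside it, and coerce near-miss values toward the declared field types instead of rejecting them.

```python
import json
import re
from dataclasses import dataclass, fields


@dataclass
class Invoice:
    # Hypothetical schema, used only for this illustration.
    invoice_id: str
    total: float
    paid: bool


def extract_json_object(text: str) -> dict:
    """Return the first parseable {...} span found in free-form model output."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _ = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; try the next one
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in model output")


def coerce(value, target):
    """Coerce common near-misses (e.g. "$1,250.50" for a float) to the target type."""
    if target is bool:
        if isinstance(value, bool):
            return value
        if isinstance(value, str):
            lowered = value.strip().lower()
            if lowered in {"true", "yes"}:
                return True
            if lowered in {"false", "no"}:
                return False
        raise ValueError(f"cannot read {value!r} as bool")
    if target is float:
        if isinstance(value, str):
            # Strip currency symbols and thousands separators before converting.
            value = value.replace("$", "").replace(",", "").strip()
        return float(value)
    return target(value)


def parse_aligned(text: str, schema):
    """Build a schema instance from loosely formatted model output."""
    raw = extract_json_object(text)
    kwargs = {f.name: coerce(raw[f.name], f.type) for f in fields(schema)}
    return schema(**kwargs)


reply = (
    "Sure, here is the invoice: "
    '{"invoice_id": 1042, "total": "$1,250.50", "paid": "yes"} '
    "Let me know if you need anything else."
)
invoice = parse_aligned(reply, Invoice)
print(invoice)
```

The model here never had to emit strict JSON of the right types: the chatty wrapper text, the integer ID, the currency string, and the "yes" are all absorbed by the parser. A real implementation would also need to handle missing fields, nested types, and malformed JSON, which this sketch does not attempt.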
What’s the use case, and what kind of latency do you require?
A few video schemas are already added to the main catalog: https://github.com/vlm-run/vlmrun-hub/blob/main/vlmrun/hub/c...
git config --global init.defaultBranch master
There's an equivalent setting in GitHub.