That said, I'm definitely glad to see work in this area, particularly with open weights.
It would be interesting to see whether fine-tuning on your KTANE test improves your results.
I thought we called models with test data in their training set “poisoned”
But your point about quality stands. Separately, this model emits the docling XML format, not the JSON format, so as far as I know that means you can only use the Python flavor of docling today; the JS variant does not support this yet.
It was fine-tuned from this: https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct
There's an example of fine-tuning the base model that would likely be applicable to this one as well.
But I agree that accurate OCR is kind of a prerequisite for adaptation.
The good:
- Open source.
- Can run locally (Apple Silicon) at a fair speed.
- Image detection is good.
The bad:
- Not detecting tables.
- Text in a perfectly clean PDF (resume) is not detected.
I know it's in preview, small, and open source, which is great, but it's far from being usable.