Run structured extraction on documents/images locally with Ollama and Pydantic (opens in new tab)

(github.com)

170 pointsEarlyOom1y ago29 comments

29 comments

24 comments · 10 top-level

jbmsf1y ago· 5 in thread

Interesting. We're using a SAAS solution for document extraction right now. I don't know if it's in our interest to build out more but I do like the idea of keeping extraction local.

jgalt2121y ago

Our customers insist we run everything on their docs locally.

fzysingularity1y ago

Absolutely, we’ve been hearing the same from our customers - which is why we thought it makes sense to open source a bunch of schemas so that they’re reusable and compatible across various inference providers (esp. Ollama/local ones).

fzysingularity1y ago

Cool, what types of documents do you currently handle? We could share some of our learnings/schemas here too.

andrewinardeer1y ago

Different commenter; Here I'm extracting data from commerical invoices, POs and bills of lading.

1 more reply

jbmsf1y ago

Mostly tax forms, state-specific formations documents (Articles of X), and state-specific payroll registration documents.

jasonjmcghee1y ago· 3 in thread

I've used "structured output" (with supplied schema) on Google and openai, and function calling / tool use on those, anthropic and others- and afaict they are functionally the same (if you force a specific function / schema). Has someone had a different experience?

fzysingularity1y ago

They’re slightly nuanced - every model provider has a slightly different Pydantic /JSON schema compatibility (i.e for handling Literals, Unions, nested subtypes etc).

So you end up hitting roadblocks for seemingly simple Pydantic schemas.

jasonjmcghee1y ago

I meant between "structured output" and "function calling". Afaict one is outputting according to a schema and the other is outputting according to a schema... which will be used as the parameters to a function.

But they seem to be considered disparate concepts. So I'm trying to understand if there's some additional nuance I'm missing.

2 more replies

potatoman221y ago

The model might not use the tools every completion, depending on your setup.

jauntywundrkind1y ago· 2 in thread

I'd really like to play with Qwen2.5-VL at some point, perhaps for reading data-sheets for microchips. Nicely for some applications, it's also very good at reporting position of what it finds, which many ML tools are pretty mediocre at. https://qwenlm.github.io/blog/qwen2.5-vl/

Not really this application, but QvQ for visual reasoning is also impressive. https://qwenlm.github.io/blog/qvq-72b-preview/

Meta has used Qwen as the basis for their Apollo research. https://arxiv.org/abs/2412.10360

fzysingularity1y ago

Is Qwen2.5-VL on Ollama? Could give it a try with a few of the schemas we have.

We’ve locally tested with Llama 3.2 11B Vision on Ollama: https://github.com/vlm-run/vlmrun-hub/blob/main/tests/benchm...

FWIW I think Ollama structured outputs API is quite buggy compared to the HF transformers variant.

fzysingularity1y ago

Just ran them for Qwen2.5-VL: https://github.com/vlm-run/vlmrun-hub/blob/main/tests/benchm...

joatmon-snoo1y ago· 1 in thread

Super cool! We at BAML had been thinking about doing something like this for our ecosystem as well - we’d love to add BAML models to this repo!

If you haven’t heard of us, we provide a language and runtime that enable defining your schemas in a simpler syntax, and allow usage with _any_ model, not just those that implement tool calling or json mode, by by relying on schema-aligned parsing. Check it out! https://github.com/BoundaryML/baml

EarlyOomOP1y ago

Would love to chat! reach out scott@vlm.run

kaushikbokka1y ago· 1 in thread

Have you folks tried finetuning models for data extraction from visual data?

EarlyOomOP1y ago

That's one of our main focuses, yes: https://docs.vlm.run/api-reference/v1/fine-tuning/post-finet...

Inviz1y ago· 1 in thread

What are the most promising ways to extract information from picture like this, if the domain has strict time constraints? What's the second best way that is still fast?

fzysingularity1y ago

You can always distill VLMs into much smaller / faster models that’s specific to your domain or use-case.

What’s the use-case and what kind of latency do you require?

youknowwhentous1y ago· 1 in thread

This seems to work for videos as well. Pretty cool demo and very nice interface for the pydantic types.

fzysingularity1y ago

Yes, good catch. We'll be adding several more schemas for videos in the next few weeks.

A few video schemas are already added to the main catalog: https://github.com/vlm-run/vlmrun-hub/blob/main/vlmrun/hub/c...

EarlyOomOP1y ago

We put together an open-source collection of Pydantic schemas for a variety of document categories (W2 filings, invoices etc.), including instructions for how to get structured JSON responses from any visual input with the model of your choosing. Run everything locally.

peterhadlaw1y ago

When making a new repo, reset your initial branch back to master with the following command:

git config --global init.defaultBranch master

There's the equivalent setting in GitHub.

18chetanpatel1y ago

This is something I was searching for..Thanks for creating!

j / k navigate · click thread line to collapse

29 comments

24 comments · 10 top-level

jbmsf1y ago· 5 in thread

Interesting. We're using a SAAS solution for document extraction right now. I don't know if it's in our interest to build out more but I do like the idea of keeping extraction local.

jgalt2121y ago

Our customers insist we run everything on their docs locally.

fzysingularity1y ago

Cool, what types of documents do you currently handle? We could share some of our learnings/schemas here too.

andrewinardeer1y ago

Different commenter; Here I'm extracting data from commerical invoices, POs and bills of lading.

1 more reply

jbmsf1y ago

Mostly tax forms, state-specific formations documents (Articles of X), and state-specific payroll registration documents.

jasonjmcghee1y ago· 3 in thread

fzysingularity1y ago

They’re slightly nuanced - every model provider has a slightly different Pydantic /JSON schema compatibility (i.e for handling Literals, Unions, nested subtypes etc).

So you end up hitting roadblocks for seemingly simple Pydantic schemas.

jasonjmcghee1y ago

But they seem to be considered disparate concepts. So I'm trying to understand if there's some additional nuance I'm missing.

2 more replies

potatoman221y ago

The model might not use the tools every completion, depending on your setup.

jauntywundrkind1y ago· 2 in thread

Not really this application, but QvQ for visual reasoning is also impressive. https://qwenlm.github.io/blog/qvq-72b-preview/

Meta has used Qwen as the basis for their Apollo research. https://arxiv.org/abs/2412.10360

fzysingularity1y ago

Is Qwen2.5-VL on Ollama? Could give it a try with a few of the schemas we have.

We’ve locally tested with Llama 3.2 11B Vision on Ollama: https://github.com/vlm-run/vlmrun-hub/blob/main/tests/benchm...

FWIW I think Ollama structured outputs API is quite buggy compared to the HF transformers variant.

fzysingularity1y ago

Just ran them for Qwen2.5-VL: https://github.com/vlm-run/vlmrun-hub/blob/main/tests/benchm...

joatmon-snoo1y ago· 1 in thread

Super cool! We at BAML had been thinking about doing something like this for our ecosystem as well - we’d love to add BAML models to this repo!

EarlyOomOP1y ago

Would love to chat! reach out scott@vlm.run

kaushikbokka1y ago· 1 in thread

Have you folks tried finetuning models for data extraction from visual data?

EarlyOomOP1y ago

That's one of our main focuses, yes: https://docs.vlm.run/api-reference/v1/fine-tuning/post-finet...

Inviz1y ago· 1 in thread

What are the most promising ways to extract information from picture like this, if the domain has strict time constraints? What's the second best way that is still fast?

fzysingularity1y ago

You can always distill VLMs into much smaller / faster models that’s specific to your domain or use-case.

What’s the use-case and what kind of latency do you require?

youknowwhentous1y ago· 1 in thread

This seems to work for videos as well. Pretty cool demo and very nice interface for the pydantic types.

fzysingularity1y ago

Yes, good catch. We'll be adding several more schemas for videos in the next few weeks.

A few video schemas are already added to the main catalog: https://github.com/vlm-run/vlmrun-hub/blob/main/vlmrun/hub/c...

EarlyOomOP1y ago

peterhadlaw1y ago

When making a new repo, reset your initial branch back to master with the following command:

git config --global init.defaultBranch master

There's the equivalent setting in GitHub.

18chetanpatel1y ago

This is something I was searching for..Thanks for creating!

j / k navigate · click thread line to collapse