https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
edit: Nm
I was actually scratching my head over how to structure a regular prompt to produce CSV data without extra nonsense like "Here is your data" and "Please note blah blah" at the beginning and end, so this is very welcome: I can define exactly what I want returned, then just push the structured output to CSV.
Interestingly, JSON Schema has much less of this problem than say CSV - when the model is forced to produce `{"first_key":` it will generally understand it's supposed to continue in JSON. It still helps to tell it the schema though, especially due to weird tokenization issues you can get otherwise.
"Encoding" CSV as JSON is trivial though, so make it output JSON then parse the array-of-arrays into CSV :)
Typically, you're interested in getting more than just one token out of them, so you run them in a loop. You start with the token list containing just the user's message, run the LLM with that list, append the newly obtained token at the end, run the LLM again to see what comes after that new token, and so on, until a special end-of-sequence token is generated and the loop terminates.
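That loop can be sketched in a few lines. Here `next_token_probs` is a hypothetical stand-in for a real model's forward pass, just to make the control flow concrete:

```python
import random

EOS = "<eos>"  # the special end-of-sequence token

def next_token_probs(tokens):
    # Stand-in for a real forward pass: a real LLM returns a probability
    # distribution over its whole vocabulary given the context so far.
    if tokens[-1] == "world":
        return {EOS: 1.0}
    return {"world": 0.9, EOS: 0.1}

def generate(prompt_tokens, max_tokens=16, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        probs = next_token_probs(tokens)
        # Sample the next token from the distribution.
        choices, weights = zip(*probs.items())
        tok = rng.choices(choices, weights=weights)[0]
        if tok == EOS:
            break  # the model signalled it is done talking
        tokens.append(tok)
    return tokens

print(generate(["Hello"]))
```

The important point is that the model only ever sees the growing token list; everything else (stopping, sampling strategy) lives in this outer loop.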
There's no reason why you have to start with nothing but the user's message, though. You can let the user specify the beginning of the LLM's supposed completion, and ask the LLM to continue from that point instead of generating its completion from scratch. This essentially ensures that what the LLM says begins with a specific string.
Not all APIs expose this feature (there are good safety reasons not to), but all LLMs are capable of doing it in principle, and doing it with the open ones is trivial.
LLMs are typically trained to output Markdown, which uses ```language_name to denote code blocks in language_name, so that user interfaces like ChatGPT's web UI can do proper syntax highlighting.
Therefore, if you make your LLM think that it already started a completion, and that completion began with ```json, it will predict what's most likely to come after that delimiter, and that would be a JSON block.
You turn the crank and you get a probability distribution for the next token in the sequence. You then sample the distribution to get the next token, append it to the vector, and do it again and again.
Thus the typical LLM has no memory as such: it infers what it was "thinking" by looking at what it has already said, and uses that to figure out what to say next, so to speak.
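The "sample the distribution" step usually means a softmax over the model's raw scores (logits), optionally rescaled by a temperature. A self-contained sketch:

```python
import math
import random

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature=1.0, rng=random):
    # Temperature rescales logits before the softmax: values < 1 sharpen
    # the distribution (more greedy), values > 1 flatten it (more random).
    probs = softmax([x / temperature for x in logits])
    return rng.choices(range(len(logits)), weights=probs)[0]
```

Constrained-output tools hook in exactly here: before sampling, they zero out the probabilities of all tokens the grammar or schema does not allow next.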
The characters in the input prompt are converted to these tokens, but there are also special tokens such as start of input, end of input, start of output and end of output. The end of output token is how the LLM "tells you" it's done talking.
Normally in a chat scenario these special tokens are inserted by the LLM front-end, say Ollama/llama.cpp in this case.
However, if you interface with the model more directly, you need to add these tokens yourself, which means you can prefill part of the output before feeding the vector to the LLM for the first time. The LLM will then "think" it has already started writing, say, code, and is therefore likely to continue doing so.
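Concretely, prompt assembly with a prefilled assistant turn looks something like this. The token names below are illustrative only (they resemble the ChatML convention); real models each have their own chat template, so check your model's documentation:

```python
# Illustrative special tokens; real names vary per model family.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"

def build_prompt(user_message: str, assistant_prefill: str = "") -> str:
    """Assemble a chat prompt, optionally prefilling the assistant turn.

    Because the assistant turn is left open (no end-of-output token),
    the model continues from the prefill instead of starting fresh.
    """
    return (
        f"{IM_START}user\n{user_message}{IM_END}\n"
        f"{IM_START}assistant\n{assistant_prefill}"
    )

prompt = build_prompt("Give me the data as JSON.", assistant_prefill="```json\n")
print(prompt)
```

Since the prompt ends right after ```json with no closing token, the most likely continuation is the body of a JSON code block, which is exactly the trick described above.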
You have spent 190 at Fresh Mart. Current balance: 5098
and it gave the output below:

{
  "amount": 190,
  "balance": 5098,
  "category": "Shopping",
  "place": "Fresh Mart"
}

Question though. Has anyone had luck running it on AMD GPUs? I've heard it's harder but I really want to support the competition when I get cards next year.
In some instances, I'd rather parse Markdown or plain text if it means the quality of the output is higher.
You basically use JSON schema mode to draw a clean boundary around the wishy-washy language bits, using the LLM as a preprocessor to capture its own output in a useful format.
Also, you need to tell the model the schema. If you don't, you will run into more of those weird tokenization issues.
For example, if the schema expects a JSON key "foobarbaz" and the canonical BPE tokenization is ["foobar", "baz"], the token mask generated by all current constrained-output libraries will let the model choose from "f", "foo", "foobar" (assuming these are all valid tokens). The model might then choose "foo", and the constraint will then force e.g. "bar" and "baz" as the next tokens. Now the model will see ["foo", "bar", "baz"] instead of ["foobar", "baz"] and will get confused. [0]
If the model knows from the prompt "foobarbaz" is one of the schema keys, it will generally prefer "foobar" over "foo".
[0] In modern models these tokens are related because of regularization, but they are not the same.
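A toy greedy tokenizer over a made-up vocabulary makes the mismatch concrete. This is only a stand-in for real BPE, but the failure mode is the same shape:

```python
# Made-up vocabulary; greedy longest-match stands in for canonical BPE.
VOCAB = ["foobar", "foo", "bar", "baz", "f"]

def tokenize(text):
    """Greedy longest-match tokenization."""
    tokens = []
    while text:
        match = max((t for t in VOCAB if text.startswith(t)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

canonical = tokenize("foobarbaz")   # ["foobar", "baz"]
# A constrained sampler masks tokens so any token that is a prefix of the
# remaining schema text is allowed. If the model picks "foo" first, the
# constraint then forces "bar" and "baz":
constrained = ["foo", "bar", "baz"]
# Same string, different token sequence, and the model was trained almost
# exclusively on the canonical one.
assert "".join(canonical) == "".join(constrained) == "foobarbaz"
print(canonical, constrained)
```

Putting the schema in the prompt mitigates this because the model then assigns high probability to "foobar" at the point where the mask would otherwise let it wander onto "foo".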
FWIW this was measured by me using a vibes-based method, nothing rigorous, just a lot of hours spent on various LLM projects. I have not used these particular tools yet, but Ollama was previously able to guarantee JSON output through what I assume are similar techniques, and my partner and I previously worked on a jsonformer-like thing for oobabooga, another LLM runtime tool.
Hopefully with those changes we can also enable general structured generation, not limited only to JSON.
The current implementation uses llama.cpp GBNF grammars. The more recent research (Outlines, XGrammar) points to potentially speeding up the sampling process through FSTs and GPU parallelism.
It’s easy to burn a lot of tokens, but if the thing you’re doing merits the cost? You can be a bully with it, and while it’s never the best, 95% as good for zero effort is a tool in one’s kit.
It looks like, so long as you're reasonable with the prompting, you tend to get better outputs when using structure.
asking the LLM to directly generate JSON gives much worse results, similar to either random guessing or intuition.
Hoping to be more on top of community PRs and get them merged in the coming year.
https://www.souzatharsis.com/tamingLLMs/notebooks/structured...
With the newer research (Outlines, XGrammar) coming out, I hope to be able to update the sampling to support more formats, increase accuracy, and improve performance.
I'm curious whether you have considered implementing Microsoft's Guidance (https://github.com/guidance-ai/guidance)? Their approach offers significant speed improvements, which I understand can sometimes be a shortcoming of GBNF (e.g. https://github.com/ggerganov/llama.cpp/issues/4218).
However, in this case you're best off giving the model sentences one by one to avoid it getting confused. If you structure the prompt like "Classify the following sentence, here are the rules ...." + sentence, then you should be hitting the prefix cache and get even better performance than when doing a single query. Of course, this only works if you have a prefix cache and are not paying per input token (though most providers now let you indicate you want to use the prefix cache and pay less).
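The prefix-cache-friendly structure amounts to keeping the long instruction block byte-identical across requests and only varying the tail. A small sketch (the rule text is a made-up placeholder):

```python
# Shared instruction prefix, identical across every request, so a server
# with prefix caching can reuse the KV cache for it on each call.
RULES_PREFIX = (
    "Classify the following sentence as POSITIVE or NEGATIVE.\n"
    "Rules: answer with exactly one word.\n\n"
    "Sentence: "
)

def build_requests(sentences):
    # Only the tail varies; even a one-character difference earlier in the
    # prompt would invalidate the cached prefix.
    return [RULES_PREFIX + s for s in sentences]

prompts = build_requests(["I love it.", "This is awful."])
for p in prompts:
    print(repr(p))
```

The same idea applies whether the cache is llama.cpp's local KV cache or a hosted provider's prompt-caching feature: stable prefix first, variable data last.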
Are llama.cpp and ollama leveraging llama's intrinsic structured output capability, or is this something else bolted ex-post on the output? (And if the former, how is the capability guaranteed across other models?)
Amazing work as always, looking forward to taking this for a spin!
type GenerateRequest struct {
	...

	// Format specifies the format to return a response in.
	Format json.RawMessage `json:"format,omitempty"`
}
https://github.com/ollama/ollama/blob/de52b6c2f90ff220ed9469...

Does llama.cpp do dynamic model loading and unloading? Will it fetch a model you request that isn't downloaded? Does it provide SDKs? Does it have startup services it provides? There's space for things that wrap llama.cpp and solve many of its pain points. You can find piles of reports of people struggling to build and compile llama.cpp for one reason or another who then clicked an Ollama installer and it worked right away.
It's also a free OSS project giving all this away, why are you being an ass and discouraging them?
But this submitted link is not even about any of that. It is about the one thing llama.cpp really does not do: produce more lines of marketing material than lines of code. That is what this marketing material is about, lines of code that mostly just wrap 10x more lines of code underneath, without making that clear as day.