User friendly library that connects to lots of OSS model serving backends: https://github.com/guidance-ai/guidance/
Core Rust library written for high performance mask computation (written mostly by my collaborator @mmoskal): http://github.com/guidance-ai/llguidance
TL;DR: instead of just sampling a token and checking whether the parser would accept it, you can zero out the probabilities of all invalid tokens, and do the computation for this in parallel at effectively zero cost:
> Here, compute_mask() can run on the CPU during the time it would be normally just waiting for the GPU to finish. The line prob[~mask] = 0.0 would normally be fused into the softmax kernel in the last stage of the LLM, with negligible overhead. Therefore, as long as the compute_mask() function completes faster than the LLM forward pass and parser.consume() is negligible (typically follows from compute_mask() speed), the constrained generation will be as fast as the unconstrained one.
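A minimal pure-Python sketch of that masked-sampling step (the tiny vocabulary, logits, and mask are invented for illustration; a real implementation would do this on tensors, fused into the softmax kernel):

```python
import math

def softmax(logits):
    # Standard softmax over raw model scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy vocabulary and logits from a hypothetical forward pass.
vocab = ['{', '}', 'hello', '"']
logits = [2.0, 1.0, 3.0, 0.5]

# Mask computed by the parser while the GPU was busy:
# suppose only '{' and '"' are grammatically valid next.
mask = [True, False, False, True]

probs = softmax(logits)
# prob[~mask] = 0.0, then renormalize over the surviving tokens.
masked = [p if m else 0.0 for p, m in zip(probs, mask)]
total = sum(masked)
masked = [p / total for p in masked]

print(masked)  # invalid tokens now have probability exactly 0
```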
I'm curious - have there been any research/conversations about pushing masking even earlier in the pipeline? In theory, there's a fair amount of compute that goes into computing the probability of tokens that will end up being masked away anyways.
Well, thank you for that; from a quick skim of Guidance, it looks like it is used when interfacing with the model directly - i.e. if I want to use Guidance I can't simply send input to my local Ollama instance; I have to stand up a small Python program that loads the model, accepts input from the user, pushes the user's input tokens into the model, and, for each output token, rejects it if it fails some criterion.
Is this correct? If so, it means that the current way LLMs are interfaced with (via stdin/stdout or an HTTP endpoint) can't be used with something like Guidance, correct?
Should work with any llama.cpp compatible model: https://github.com/sutt/innocuous
I didn't find any more on that comment below. Is there a list of supported LLMs?
We have support for Huggingface Transformers, llama.cpp, vLLM, SGLang, and TensorRT-LLM, along with some smaller providers (e.g. mistral.rs). Using any of these libraries as an inference host means you can use an OSS model with the guidance backend for full support. Most open source models will run on at least one of these backends (with vLLM probably being the most popular hosted solution, and transformers/llama.cpp being the most popular local model solutions)
We're also the backend used by OpenAI/Azure OpenAI for structured outputs on the closed source model side.
I have yet to see a thorough comparison of design, performance, and reliability between these options (along with Outlines etc.)
Happy to chat more about the benchmark. Note that these numbers are a bit out of date, though; I'm sure many of the providers we tested have made improvements (and some have switched to using llguidance wholesale as a backend)
I'm trying to write a really large book. I have a lot of material that I'm using RAG to help manage. I put into my prompts the top RAG cosine scores with some summaries of characters and previous chapters and scene sketches. I get scenes out and then work them over. LLMs are really helpful for my disability and have allowed me to make any progress at all on this.
Is your thing something I should look into for helping keep track of my material? I'm using Excel sheets and crappy Python code right now.
I'm pretty sure your stuff is some super technical backend thingy, but I figured I'd shoot my shot here. Thanks for any and all info, I appreciate it
In general I find that matching the most natural format for a document outperforms waiting for the big model trainers to convince the model that the format you want is a valid structure, so anything that lets me interweave structured and unstructured generation is very interesting to me right now.
The annoying bit with grammars is that they are unfortunately a bit complex to write properly. Fortunately, language models are getting better at this, so to get an XML grammar you can hopefully get most of the way there with just a GPT-5 prompt. I suppose it would be a good idea to have a better pre-built set of popular grammars (like a modified XML) in guidance so that we cut this headache out for users...!
One statement that surprised me was that the author thinks "models over time will just be able to output JSON perfectly without the need for constraining over time."
I'm not sure how this conclusion was reached. "Perfectly" is a bar that probabilistic sampling cannot meet.
I've a related observation. In my experience the number of hallucinated URLs with structured output (think of a field `url` or `link`) is pretty high. Especially compared to the alternative approach, where you let the LLM generate free text and then use a second LLM to convert that text into the desired structured format.
With structured output, it's like the LLM is forced to answer in a very specific way. So if there is no URL for the given field, it makes one up.
Here a related quote from the article:
> Structured outputs builds on top of sampling by constraining the model's output to a specific format.
E.g. if the LLM hallucinates non-existing URLs, you may add a boolean "contains_url" field to your entity's JSON schema, placing it before the URL field itself. This way, the URL extraction is split into two simpler steps, checking if the URL is there and actually extracting it. If the URL is missing, the `"contains_url": false` field in the context will strongly urge the LLM to output an empty string there.
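For example, a schema along these lines (field names illustrative; JSON Schema written as a Python dict) declares the boolean before the URL so it lands earlier in the generated context:

```python
# Hypothetical entity schema: "contains_url" is deliberately declared
# before "url", so the model commits to whether a URL exists first.
entity_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "contains_url": {"type": "boolean"},
        "url": {"type": "string"},
    },
    "required": ["name", "contains_url", "url"],
}

# Declaration order matters here: structured-output backends typically
# emit fields in schema order, so the boolean precedes the URL.
order = list(entity_schema["properties"])
print(order)
```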
This also comes up with quantities a lot. Imagine you're trying to sort job adverts by salary ranges, which you extract via LLM. These may be expressed as monthly instead of annual (common in some countries), in different currencies, pre/post tax, etc.
Instead of having an `annual_pretax_salary_usd` field, which is what you actually want, but which the LLM is extremely ill-equipped to generate, have a detailed schema like `type: monthly|yearly, currency:str, low:float, high:float, tax: pre_tax|post_tax`.
That schema is much easier for an LLM to generate, and you can then convert it to a single number via straight code.
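A sketch of that conversion step (the exchange rates are placeholder assumptions for illustration, not real data):

```python
# Placeholder exchange rates to USD -- assumptions for illustration only.
FX_TO_USD = {"USD": 1.0, "EUR": 1.1, "PLN": 0.25}

def annual_pretax_salary_usd(extracted: dict) -> tuple:
    """Convert the easy-to-extract schema into the number you actually want.

    `extracted` follows the detailed schema above, e.g.:
    {"type": "monthly", "currency": "EUR", "low": 5000, "high": 7000,
     "tax": "pre_tax"}
    """
    months = 12 if extracted["type"] == "monthly" else 1
    rate = FX_TO_USD[extracted["currency"]]
    low = extracted["low"] * months * rate
    high = extracted["high"] * months * rate
    # Net-to-gross conversion needs country-specific tax rules;
    # flag post-tax figures instead of silently guessing.
    if extracted["tax"] == "post_tax":
        raise ValueError("post-tax range: needs a tax model to gross up")
    return low, high

print(annual_pretax_salary_usd(
    {"type": "monthly", "currency": "EUR", "low": 5000, "high": 7000,
     "tax": "pre_tax"}
))
```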
As you know, (most current) LLMs build text autoregressively. This allows them to generate text with _exactly_ the same distribution as the training data.
When you constrain LLM output at each token, that gives a completely different distribution from letting the LLM generate a full output and then doing something with that (trying again, returning an error, post-processing, etc).
E.g.: Suppose the LLM has a training set of (aa, ab, ab, ba), noting that "ab" appears twice. Suppose your valid grammar is the set (ab, ba). Then your output distributions are:
Baseline: {invalid: 25%, ab: 50%, ba: 25%}
Constrained: {invalid: 0%, ab: 75%, ba: 25%}
Note that _all_ the previously invalid outputs were dumped into the "ab" bucket, skewing the ratio between "ab" and "ba". That skew may or may not be desirable, but assuming the training process was any good it's likely undesirable.
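The toy example above, written out in pure Python and modeling per-token masking. The key point is that the mass moves at the first character, before the model "knows" which full string it will emit, so all of "aa"'s probability is forced into "ab":

```python
from collections import Counter

training = ["aa", "ab", "ab", "ba"]   # toy training set from above
valid = {"ab", "ba"}                  # the grammar's language

counts = Counter(training)
n = len(training)

def p_next(prefix, ch):
    """P(next char = ch | prefix), estimated from the training set."""
    cont = [s for s in training if s.startswith(prefix) and len(s) > len(prefix)]
    hits = [s for s in cont if s[len(prefix)] == ch]
    return len(hits) / len(cont)

def constrained_prob(target):
    """Probability of `target` under token-level constrained decoding:
    at each step, invalid characters are masked and the rest renormalized."""
    p = 1.0
    for i, ch in enumerate(target):
        prefix = target[:i]
        # Characters that can still lead to some valid string.
        allowed = {s[i] for s in valid if s.startswith(prefix)}
        mass = sum(p_next(prefix, c) for c in allowed)
        p *= p_next(prefix, ch) / mass
    return p

baseline = {s: counts[s] / n for s in counts}
constrained = {s: constrained_prob(s) for s in valid}
print(baseline)     # {'aa': 0.25, 'ab': 0.5, 'ba': 0.25}
print(constrained)  # 'ab' gets 0.75: the mass of 'aa' is forced into 'ab'
```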
You've observed it in URLs, but I see it in JSON output as well. LLMs like to truncate long strings from time to time, but when they do they're more likely to provide invalid JSON (adding an ellipsis at the end of the fragment and doing nothing else). If that truncation starts to happen in a constrained environment, a period is a valid character in a long string, and eventually the grammar constraint will force a closing quote to appear. The result is still garbage, but instead of a detectable parse failure you have an undetectable corrupt field.
this sounds similar to what they discussed in the article with regards to "thinking" models, i.e. let them generate their <think>blah blah</think> preamble first before starting to constrain the output to structured format
The Gemini API has a canonical implementation of structured outputs where you can instead pass the JSON schema as a separate parameter to control the grammar more closely. However, this setting reorders the JSON schema fields alphabetically beforehand, which is especially undesirable because the order of fields in a schema is often chosen deliberately to steer generation.
You can specify ordering in the Gemini API with propertyOrdering:
"propertyOrdering": ["recipeName", "ingredients"]
Why wouldn't we apply the mask immediately for the first sampling? Is this an optimization somehow, is masking expensive?
Other libraries work by essentially pre-computing all the masks for all possible generations, but of course you're restricted to working with simple grammars in this case (like a subset of regular expressions)
> is masking expensive?
It's not expensive per se: a single element-wise multiplication of the output vector.
The real "expense" is that you need to prepare masks for every element of your grammar, since they are expensive to recompute on the fly; LLM tokens do not cleanly map onto elements of your grammar. (Consider JSON: a single LLM token often combines several special characters such as curly braces, colons, and quotes.)
This isn't that hard to compute, it's just more work to implement.
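A toy version of that precomputation, with a hand-rolled automaton standing in for a real parser and multi-character tokens spanning several grammar symbols each (both the vocabulary and the "grammar" are invented for illustration):

```python
# Degenerate toy grammar: the single string {"a":1}, consumed character
# by character. States are indices into the target string.
TARGET = '{"a":1}'
VOCAB = ['{"', 'a', '":', '1}', '{', '}', 'a":1']

def advance(state, token):
    """Walk the automaton over every character of `token`.
    Returns the new state, or None if the token leaves the grammar."""
    for ch in token:
        if state < len(TARGET) and TARGET[state] == ch:
            state += 1
        else:
            return None
    return state

def compute_mask(state):
    # One bit per vocabulary entry: can this whole token be consumed here?
    return [advance(state, tok) is not None for tok in VOCAB]

# Precompute masks for every parser state, as described above.
masks = {s: compute_mask(s) for s in range(len(TARGET) + 1)}
print(masks[0])  # at the start, only tokens spelling a prefix of {"a":1} fit
```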
The greedy accept is so that the mask doesn't need to be computed. Planning to make this more efficient from either end.
Human:

    4x1200 with 30 second rest

AI DSL output:

    Repeat 4 times:
    - Run 1200 meters
    - Rest 30 seconds
I hand-wrote a recursive-descent parser in Python to process the DSL. Human speech to DSL is pretty effective with a simple prompt and some examples.
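A minimal sketch of the kind of recursive-descent parser involved (the real DSL is surely richer; this only handles the example above, and the grammar and names are guesses):

```python
import re

def parse_block(lines, i=0):
    """Recursively parse a workout DSL block into a list of steps."""
    steps = []
    while i < len(lines):
        line = lines[i].strip()
        m = re.match(r'Repeat (\d+) times:', line)
        if m:
            # Recurse: the following step lines form the repeated body.
            body, i = parse_block(lines, i + 1)
            steps.append({"repeat": int(m.group(1)), "steps": body})
        elif line.startswith('- Run'):
            steps.append({"run_m": int(re.search(r'(\d+)', line).group(1))})
            i += 1
        elif line.startswith('- Rest'):
            steps.append({"rest_s": int(re.search(r'(\d+)', line).group(1))})
            i += 1
        else:
            break
    return steps, i

dsl = """Repeat 4 times:
- Run 1200 meters
- Rest 30 seconds"""

workout, _ = parse_block(dsl.splitlines())
print(workout)
# [{'repeat': 4, 'steps': [{'run_m': 1200}, {'rest_s': 30}]}]
```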
I created a tool that can program Garmin & Apple Watches for interval training based on what I wrote above.
Looking for beta testers- please give it a try :)
> Surprisingly, we observe a significant decline in LLMs’ reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.
A consequence of this seems to be that clicking the link to a different article leaves you at the bottom of the page even though the article itself has changed.
This seems to use JS to fetch the markdown and then render it, but I feel it would be better to simply pre-convert the markdown as part of the deployment process and serve a static page.
[1]: https://app.sqlai.ai
However, why not use a grammar that does not have invalid sentences, and from there convert to any grammar that you want?
With a 2nd pass you basically "condition" it on the text right above, hoping to get better semantic understanding.
The AI companies believe that these kinds of grammar mistakes will be solved by improving the models. To build out tools for grammar constrained inference like this is to suggest, on some level, that GPT-N+1 won't magically solve the problem.
The deeper level is that it's not just simple grammar constraints. Constraining to JSON is a nice party trick, but it opens the door to further ideas. How about constraining to a programming language's grammar? Those are well defined, you just swap the JSON grammar file for the Java grammar file, job done.
We can go further: Why not use a language server to constrain not only the grammar but also the content? Which variables and functions are in scope is known, so constraining a variable reference or function call to one of their names can be done with the same technique as grammar constraints. ("Monitor-guided decoding", figured out back in 2023.)
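A toy illustration of that idea: mask identifier tokens down to names the language server reports as in scope (the vocabulary and scope are invented for illustration):

```python
# In-scope names, as a language server would report them at the cursor.
in_scope = {"total_price", "total_count", "tax_rate"}

# Toy vocabulary of candidate next tokens at an identifier position.
vocab = ["total", "_price", "_cost", "tax", "rate", "("]

def allowed(partial: str, token: str) -> bool:
    """Is `partial + token` still a prefix of some in-scope name?"""
    cand = partial + token
    return any(name.startswith(cand) for name in in_scope)

# The model has emitted "total" so far; mask tokens that can't complete
# any real identifier. "_cost" is excluded: total_cost is not in scope.
mask = [allowed("total", t) for t in vocab]
print(dict(zip(vocab, mask)))
```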
Entire classes of hallucination problems can be eliminated this way. The marketing writes itself; "Our AI is literally incapable of making the errors humans make!"
What many AI developers, firms, and especially their leaders find grating about this is the implication. That AI is fallible and has to be constrained.
Another such inconvenience is that while these techniques improve grammar they highlight semantic problems. The code is correct & compiles, it just does the wrong thing.
As to why providers don't give you a nice API, maybe it's hard to implement efficiently.
It's not too bad if inference happens token by token, reverting to the CPU every time, but I understand high-performance LLM inference uses speculative decoding, with a smaller model guessing multiple tokens in advance and the main model doing verification. Doing grammar constraints across multiple tokens is tougher; there's an exponential number of states to precompute.
So you'd need to think about putting the parser automaton onto the GPU/TPU and use it during inference without needing to stall a pipeline by going back CPU.
And then you start thinking about how big that automaton is going to be. How many states, pushdown stack. You're basically taking code from the API call and running it on your hardware. There's dragons here, around fair use, denial of service etc.
https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
There's also a grammar validation tool in the default llama.cpp build, which is much easier to reason about for debugging grammars than having them bounce off the server.
https://fireworks.ai/docs/structured-responses/structured-ou...