Not only does this guarantee your output is JSON, it lowers your generation cost and latency by filling in many of the repetitive schema tokens without passing them through the LLM.
For the very common case of "extracting multiple structured fields from a piece of unstructured text," I believe there's an even stronger optimization possible that would further decrease costs, latency and potentially even improve accuracy.
Assuming the fields you want to extract are independent (and they often are), you don't need to generate them all in one go autoregressively. E.g., instead of running the following pseudo-prompt:
"Input: 'It's sunny and cold today'
Output schema: {"sunny": boolean, "temperature": string}"
You could instead run the following two:
"Input: 'It's sunny and cold today'
Output schema: {"sunny": boolean}"
"Input: 'It's sunny and cold today'
Output schema: {"temperature": string}"
We don't do that today because when done naively it's very inefficient -- you'd be tokenizing, passing to the GPU, and computing the KV cache of the shared part of the prompt twice. But a library with the right abstraction could run the latter two queries as a parallel batch and reuse the same tokenization and KV cache for both of them. It would actually be more efficient than generating both fields in one go, since when you factor out the shared prefix, both the generated text and its context are shorter!
I mentioned above that this could also improve accuracy. Of course it doesn't do that by default (except that by excluding all the irrelevant fields it makes self-attention's job easier). But what it does do is give you an independent prompt for each field you're interested in. And so for particularly tricky fields you're trying to extract, you have the flexibility to e.g. add several examples to make the generation N-shot.
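A minimal sketch of the per-field split described above, assuming a flat schema of independent fields. The names `build_field_prompts` and `merge_field_results` are hypothetical, and the actual reuse of tokenization and KV cache for the shared prefix would have to come from the inference library; it is not shown here.

```python
import json


def build_field_prompts(text, schema):
    """Split one extraction prompt into a shared prefix plus one suffix per field.

    The prefix is identical for every field, so a backend that supports
    prefix caching could compute it once and reuse it (assumption: that
    support exists in whatever library actually runs the model).
    """
    shared_prefix = f"Input: '{text}'\n"
    suffixes = {
        field: "Output schema: " + json.dumps({field: type_name})
        for field, type_name in schema.items()
    }
    return shared_prefix, suffixes


def merge_field_results(partials):
    """Combine the single-field objects returned by each call into one object."""
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


schema = {"sunny": "boolean", "temperature": "string"}
prefix, suffixes = build_field_prompts("It's sunny and cold today", schema)
# One prompt per field; all of them begin with the same prefix.
prompts = [prefix + suffix for suffix in suffixes.values()]
```

Because each field has its own suffix, extra few-shot examples can be appended to the suffix of a tricky field without touching the others.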
What I've been doing instead is having LLMs generate JSON, putting what jsonformer does in the first prompt for few-shot learning, and then combining it with CUE after, since you can easily intermix data files with CUE.
My latest experiment: https://twitter.com/verdverm/status/1652504163635347456
Creating the prompt for this was pretty interesting and illuminating. While it works for the full text there, you can also do it in parts, only outputting the new parts of the JSON that is merged with the CUE.
Adding a scheme like this reduces the area of potential off-roading that the LLM can do to a much smaller zone. Additionally, it breaks up the chain of dependencies between the two example outputs, because now we do not need to depend upon past inputs to correctly output this scheme.
Since the information for JSON semantic structure is no longer required to be driven by the LLM (it still has to understand it to be able to generate things with a modicum of sense, IIRC), we can look at our dependency graph for outputs. _This changes because now the fields really and truly are independent (if they are truly informationally independent)._
So now some kind of conjoined information requirement of ( autoregressive output ) <- (( field A ) <- ( field B )) becomes ( autoregressive output ) <- (( field A ) && ( field B )) which then can be factored out into separate calls instead of sequentially, which yields us a batched call of (( autoregressive output A ) <- ( field A ) && ( autoregressive output B ) <- ( field B )).
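One way to write this factorization down, as a sketch: take x to be the shared input text and A, B to be the two field values. The conditional-independence step is an assumption about the fields, not something the model guarantees.

```latex
\begin{align*}
p(A, B \mid x) &= p(A \mid x)\, p(B \mid A, x) && \text{(autoregressive chain rule)} \\
               &= p(A \mid x)\, p(B \mid x)    && \text{(assuming } B \perp A \mid x\text{)}
\end{align*}
```

Each factor on the last line conditions only on x, so the two can be generated as separate calls in one batch.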
From there it is just implementation. I likely would not have thought about the OP's way of handling things for a good while, though maybe I would have stumbled into it had I enough reason to think about structured/templated kinds of generation, which I do believe that I do now! <3 :) It really breaks a lot of assumptions that are easy to quietly make and I had not thought appropriately about the consequences of reframing things in this way, to be honest.
As for "how" to think about this, if I were to give my take, it would be to always turn whatever problem is in front of you into a puzzle that you simplify further each time. Optimizing for less computation, time, code, or even just what all of those are a kind of proxy for: less information to sufficiently solve a problem. We can see that this problem is reduced in complexity appropriately because we remove a redundancy that does not need to be there at all.
One way to look at this is in the relationships between parts of an idea. If you're able to understand, even vaguely, the concepts behind some other concept and how they interact, and maybe even have a 'standard toolkit' of relating to them, you can start finding/transferring/applying other skills to these parts of a concept. I don't think there's a guaranteed-efficient way to maybe reduce a concept down to its parts, or down to a more efficient representation without already, well, knowing that representation. It's an NP-hard problem to me personally, and is the reason why research and other academic pursuits can take a while. It is a good skill to learn I suppose and I certainly enjoy trying to use it, personally.
To tie this back to your question about language models -- yes, some things have to do with the language model, but oftentimes it's actually just the raw mathematical components underlying a model. If you look for that (and please, please, please, please, please do!!!!), then you don't necessarily _have_ to concern yourself with the implementation details (beyond runtime limits, etc.). As long as the math still applies, you should be able to reason really quite well about what else is happening/could happen with a model type like this.
In particular, LLMs being autoregressive models where each output depends upon its inputs lets us set up a dependency graph. Then, based upon some prior assumptions, we can maybe make some substitutions/changes that allow us to fragment the dependency graph and move it around as we wish. This is not just applicable to LLMs, however; dependency graphs are useful in a wide number of areas.
So one other thing that we're not talking about here is that we're optimizing for an objective we want (clean JSON) by explicitly...well, injecting that objective instead of living on just hopes and dreams, y'aknow. This is a pretty straightforward way of solving the problem by putting the answer in the question, though poor input content still can be a problem.
Stated a different way, we're collapsing the entropy of what the network can introduce (which should be JSON, but remember [!!!!!!!], neural networks are noisy estimators, and JSON errors are mathematically guaranteed (even if rare), which means any pipeline depending upon output like code can and will fail, and is brittle to all sorts of other kinds of complicated parsing errors. This is because to catch/detect/enumerate/correct these errors, we need to have all of the information needed to implement a JSON structure itself. So basically we'd be using the same exact information, just enforcing it in a horrendously inefficient manner, which is how people have been doing it until the present, which is okay as we humans are certainly not NP-optimal machines IMO. In any case, we're still in the parentheses, and the point was that any kind of variance can be a problem here beyond some extremely tiny limit, and that's not what LLMs are made to do. So at some point it's guaranteed to break, and at high volumes it's basically guaranteed to break in a way that's either unusable or requires so much effort to fix that you might as well have embedded a JSON prior into your network generation process because it would have required the same amount of information as external validation would, albeit with less effort (!!!!)), which is perfectly fine in our case if we're exclusively generating JSON as it gives us what we want. Most methods like this thankfully should have a low level of invasiveness to the model as well, freeing us up to use either the same or a similar model for multiple tasks.
This can create a bit of an ideological illusion as we technically are destroying information by collapsing the distributions of sentences/strings of tokens/etc that we are generating, and maybe can lend to a "oh, we can add whatever functionality we want!" kind of belief about this kind of modeling. It's important what we're adding and taking away. Also important is part of how/why/what is so powerful about training these models on next token prediction on large text corpora. We can trim them down to some smaller subproblem much much more easily than we can expand them to cover a larger subset. Which is pretty darn cool!
I know this sorta flew around a lot of places and touched on a lot of things, probably not as cogently as I'd want to if I had more time to review and revise it. Hope it was/is helpful for you and feel free to let me know if you have any questions. It's a very cool topic on the whole to me, tbh, and there's a number of interesting conversations that can branch off from this one. Honestly this whole general area is where I see the real value in LLM development in research. It's practical and it's helpful! :D :) <3 :)
Source for experience is a number of years of experience across a wide variety of ML models, though I'm sure I made an embarrassing blunder or two in this post. ;P
This is also probably why leading with a question works better in the first place. All later processing conditions on the question in this way.
BTW, in my very limited testing, GPT4 doesn’t care about the order.
I think the main differentiating factor here is that this is better if you have a simpler JSON schema without enums or oneOf constraints. If you do have these constraints, i.e. let's say you wanted an array of different types that represented items on a menu { kind: pizza, toppings: [pepperoni] } or { kind: ice_cream, flavor: vanilla | strawberry } then you would need something more sophisticated like clownfish that can ask the LLM to pick specific properties (and an ability to do some backtracking so you can do proper beam search).
For completeness, another common approach can be found here: https://github.com/ShreyaR/guardrails which essentially boils down to "provide the schema in the prompt and ask the LLM to correct things if it fails to get the schema right the first time."
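The "validate, then ask the model to correct itself" loop can be sketched roughly as follows. This is not the guardrails API; `ask_with_retries`, `validate`, the toy two-type schema, and the `llm` callable (prompt string in, reply string out) are all assumptions made for illustration.

```python
import json

# Toy type names for the sketch; a real schema language is much richer.
TYPE_MAP = {"boolean": bool, "string": str}


def validate(obj, schema):
    """Return a list of problems; an empty list means obj matches the schema."""
    problems = []
    for field, type_name in schema.items():
        if field not in obj:
            problems.append(f"missing field {field!r}")
        elif not isinstance(obj[field], TYPE_MAP[type_name]):
            problems.append(f"field {field!r} should be {type_name}")
    return problems


def ask_with_retries(llm, prompt, schema, max_attempts=3):
    """Call llm, check the reply, and re-ask with the problems listed on failure."""
    attempt_prompt = prompt
    for _ in range(max_attempts):
        reply = llm(attempt_prompt)
        try:
            obj = json.loads(reply)
        except json.JSONDecodeError as err:
            problems = [f"invalid JSON: {err.msg}"]
        else:
            if not isinstance(obj, dict):
                problems = ["top-level value must be an object"]
            else:
                problems = validate(obj, schema)
                if not problems:
                    return obj
        # Feed the failed reply and the problem list back to the model.
        attempt_prompt = (
            prompt
            + "\nYour previous reply was:\n" + reply
            + "\nProblems: " + "; ".join(problems)
            + "\nReply again with corrected JSON only."
        )
    raise ValueError("no valid reply after retries")
```

Every retry costs another full model call, which is the overhead that schema-constrained generation avoids.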
Another thing I thought about is integrating formatting for fields using a similar system. ISO-8601 dates come immediately to mind, but number and currency formatting are other examples.
Probabilistic enums are another thing I can think of that might be useful for fallback values. I am pretty sure there's a lot of work that can be done in this area, also for other parser kinds.
A related and highly recommended resource is https://github.com/mkuchnik/relm and https://arxiv.org/abs/2211.15458. It is a similar system used to validate LLMs using regexes, though built for completely different use cases. I imagine integrating regex checks on the output fields can also have a lot of use cases.
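As a rough illustration of per-field regex checks (this is not how ReLM works, just a plain post-hoc check), the sketch below validates the shape of a date and a price field. The field names and patterns are made up for the example, and the date pattern only checks the YYYY-MM-DD shape, not whether the date exists.

```python
import re

# Hypothetical per-field format rules.
FIELD_PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),  # ISO-8601 calendar date shape
    "price": re.compile(r"\d+\.\d{2}"),        # e.g. 9.99
}


def failing_fields(record, patterns=FIELD_PATTERNS):
    """Return the names of fields whose value does not fully match its pattern."""
    return [
        name
        for name, pattern in patterns.items()
        if not pattern.fullmatch(str(record.get(name, "")))
    ]
```

The same patterns could in principle be enforced during generation instead of checked afterwards, which is the direction the comments above are pointing at.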
One of the primary difficulties with writing LLM applications is that prompts are basically not composable, and any LLM library that modifies your prompt is going to be a nightmare to work with.
MEMORY EXAMPLE INSTRUCTION [COMPLETION]
it will basically not work to wrap it in a prompt that's structured
INSTRUCTION MEMORY EXAMPLE [COMPLETION]
I'm currently building something that leverages an ensemble of different LLMs depending on the difficulty of a task and ran into this issue.
Dolly V2 takes "###Instruction: <your stuff> ###Response" as the structure fed to the model, whereas GPT3.5 Turbo wasn't trained to treat that particular structure as important.
The nice thing is that GPT3.5 Turbo will just roll with the prompt structure Dolly uses, but that only works in very large LLMs; I'd imagine I wouldn't get away with it in other 12B-parameter models.
But realistically this could look like taking the "INSTRUCTION MEMORY EXAMPLE [COMPLETION]" schema represented in a library and each adapter would transform it into
"MEMORY EXAMPLE INSTRUCTION [COMPLETION]" schema or whatever is needed by the different model.
> Instruction: Write a poem and then emit a structure that follows a schema named X.
> Completion: [map-schema X "roses are red, violets are blue"]
Conceptually this is basically just a function call where context is local to the function.
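The adapter idea mentioned above (one canonical INSTRUCTION / MEMORY / EXAMPLE representation, reordered per model) might look something like this sketch. The `Prompt` class, the ordering names, and `render` are hypothetical; which ordering a given model actually prefers is something you would have to find out empirically.

```python
from dataclasses import dataclass


@dataclass
class Prompt:
    """Canonical prompt parts, kept separate until the last moment."""
    instruction: str
    memory: str
    example: str


# Each adapter is just an ordering of the canonical parts (assumed names).
ORDERINGS = {
    "instruction_first": ("instruction", "memory", "example"),
    "instruction_last": ("memory", "example", "instruction"),
}


def render(prompt, ordering):
    """Lay out the parts in the order a particular model expects."""
    parts = [getattr(prompt, name) for name in ORDERINGS[ordering]]
    # Skip empty parts so a missing memory or example leaves no blank line.
    return "\n".join(part for part in parts if part)
```

Keeping the parts structured until render time is what lets a library rearrange them; once they are flattened into one string, that information is gone.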
So, hurray! We've made it more accessible. And hopefully in years to come, even very much more so! <3 :)
- It is guaranteed to match your schema
- It is much lighter weight
You can force it by banning the vocabulary which violates a constraint for free.
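A toy sketch of what banning vocabulary means at a single decoding step: tokens that would violate the constraint get a score of negative infinity before the next token is chosen. The function name and the dictionary of logits are made up for illustration; a real implementation would mask a tensor of logits inside the sampling loop.

```python
import math


def pick_allowed_token(logits, allowed):
    """Greedy choice over logits after masking every token not in `allowed`."""
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token is present in the vocabulary")
    return best


# The model prefers "Sure", but a boolean field only permits true/false.
logits = {"Sure": 3.2, "true": 1.1, "false": 0.4}
choice = pick_allowed_token(logits, {"true", "false"})
```

The mask is cheap because the logits are already computed; no extra forward pass is needed.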
* Occasionally, random junk will get tossed in alongside valid JSON. If you're a human, it's easy to identify and fix. Programmatically, it can be much harder to fix.
* Context lengths can create issues.
This is an important definition to take note of: "bulletproof" doesn't mean that you'll get good or correct data. It only means that it'll be valid JSON and in a particular schema that you specify (because the LLM isn't building the JSON in the first place, the library is).
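To make the "the library builds the JSON, not the LLM" point concrete, here is a minimal sketch under stated assumptions. It is not Jsonformer's code: `fill_schema` and the `generate_value` callable (field name and type in, raw text out) are hypothetical, and only flat schemas with boolean, number, and string fields are handled.

```python
import json


def fill_schema(schema, generate_value):
    """Build the object in code; only leaf values come from the model."""
    result = {}
    for field, type_name in schema.items():
        raw = generate_value(field, type_name)
        if type_name == "boolean":
            result[field] = raw.strip().lower() == "true"
        elif type_name == "number":
            result[field] = float(raw)
        else:
            result[field] = str(raw)
    # Braces, quotes, commas, and key names are emitted here, never by the model.
    return json.dumps(result)
```

The output always parses and always has the requested keys, but nothing here checks that the values are right: a model that answers "true" for the wrong reason still produces valid JSON.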
It's an interesting idea. But it's not clear if they've validated the heuristics they use, to see how well it performs in terms of accuracy against, say, some kind of BeautifulSoup-like attempt to make sense of the JSON-ish that the LLM produces and correct that to be valid JSON, or any other approach to the problem.
Something always felt incredibly icky to me about any kind of ad-hoc 'fixer' scripts that were part of a pipeline that was fully controlled by a user.
[0] https://js.langchain.com/docs/modules/prompts/output_parsers...
It uses JSONSchema internally, but I’m thinking of revising it to just use TypeScript directly after learning more about the ChatGPT plugin implementation (via their hackathon).
To me this is a killer feature of GPT, being able to turn a document into a json or any other template.
The kind of prompt is just amazing for GPT (try it with a blog post, document or any other thing): "Analyze this document and transform it into the following format:
<title>
<summary (text conciseness: 5/10)>
<content bullet points (text conciseness 3/10)>
<content_item 1>
<content_item 2>
<content_item N>"
You can also phrase the same prompt as JSON, and GPT will gladly transform a PDF into JSON.
That said, if the model is unable to generate JSON due to its training/fine-tuning, this is indeed a clever solution!
Since you’re generating the variable fields anyway, it will actually require fewer forward passes even if broken up in multiple prompts than if you generated the static fields as well.
Of course this doesn’t work for OpenAI APIs, which charge for input context on a per-invocation basis.
{
"command": "find",
"param": { "type": "regex", "value": "[^.]*Final\\.[^.]*" }
}
which you can actually interpret and execute for the user.
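A small sketch of what interpreting such a command could look like. The `run_command` function is hypothetical and supports only the one "find" command with a regex parameter; running it against a list of file names is an assumption made for the example.

```python
import json
import re


def run_command(command_json, lines):
    """Parse a command object and return the lines matching its regex."""
    command = json.loads(command_json)
    if command["command"] != "find":
        raise ValueError(f"unsupported command: {command['command']!r}")
    param = command["param"]
    if param["type"] != "regex":
        raise ValueError(f"unsupported param type: {param['type']!r}")
    pattern = re.compile(param["value"])
    return [line for line in lines if pattern.search(line)]


# Raw string so the JSON text contains \\. (an escaped backslash plus a dot).
cmd = r'{"command": "find", "param": {"type": "regex", "value": "[^.]*Final\\.[^.]*"}}'
matches = run_command(cmd, ["report_Final.docx", "draft.txt"])
```

Since the regex comes from a model, it is worth treating it as untrusted input (for example, limiting which commands exist) before executing anything for a user.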
But I’m absolutely a novice at anything ML, so take my comments with a grain of salt.
See my work and the paper about it. I've got a lot of y'all beat on this (constrained decoding, not the templating and structuring) by about a year:
https://github.com/hellisotherpeople/constrained-text-genera...
What I currently have been doing:
The JSON template for your response is provided below. The parts to fill out are capitalized. Please do not modify the template. Please fill in the template with one of the above options for your response. <result> { "rating": "N. RATING", "reason": "REASON" } </result>
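With a template like the one above, the reply still has to be pulled out and checked on the way back. A minimal sketch, assuming the model wraps its answer in the `<result>` tags as asked; `extract_result` is a hypothetical helper, and it returns None instead of guessing when anything is off.

```python
import json
import re

RESULT_RE = re.compile(r"<result>(.*?)</result>", re.DOTALL)


def extract_result(reply):
    """Return the parsed template from between <result> tags, or None."""
    match = RESULT_RE.search(reply)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    # The template has exactly these two keys; anything else counts as a failure.
    if not isinstance(data, dict) or set(data) != {"rating", "reason"}:
        return None
    return data
```

This is the kind of after-the-fact checking that schema-constrained generation makes unnecessary for the structural part, though the values themselves still need their own validation.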
Never thought to use json schema. I'll check this out!
Efficiency: By generating only the content tokens and filling in the fixed tokens, Jsonformer is more efficient than generating a full JSON string and parsing it.
I was excited to try this in Replit... and realized it required PyTorch. Ouch. Replit was not happy about that!