If you're facing problems getting GPT to adhere to a schema (JSON, XML, etc.) or regex, need to bulk process some unstructured data, or generate synthetic data, check it out.
We run our own tuned model (you can self-host if you want), so we have very fine-grained control over text generation.
Repository: https://github.com/automorphic-ai/trex
Playground: https://automorphic.ai/playground
1. LMQL/guidance/JSONformer/OP's post
2. Finetuning the model to understand function calls and their (possibly JSON) schemas.
There was a comment here about OpenAI's approach (finetuning a model to understand function calls) that raised a good point: since finetuning is often forgetful (the model loses a little of what it previously learned), it's not clear whether OpenAI's approach has made GPT-4 less capable than it was before. Not to mention that you're still dealing with a statistical process (an LLM), not a locked-in algorithm that generates the desired schema 100% of the time.
Which brings me to the other approach: steering the LLM's output __as it is generating tokens__, which is what LMQL does. This results in less token usage (you don't send the function schema as part of your prompt/message to OpenAI) and 100% adherence to the format, because token probabilities are modified (e.g., 0% chance of any character except ":" after the closing quotation mark of a key).
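A toy sketch of that token-level masking, in case it helps. This is not LMQL's actual implementation: the four-token vocabulary, the scores, and the helper names (`mask_logits`, `softmax`) are all made up for illustration.

```python
import math

def mask_logits(logits, allowed):
    """Return a copy of `logits` where every token outside `allowed` is set to -inf."""
    return {tok: (score if tok in allowed else -math.inf)
            for tok, score in logits.items()}

def softmax(logits):
    """Turn scores into probabilities; -inf scores become exactly 0.0."""
    peak = max(s for s in logits.values() if s != -math.inf)
    exps = {tok: (math.exp(s - peak) if s != -math.inf else 0.0)
            for tok, s in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical raw scores right after the closing quote of a JSON key.
# Left alone, the model would rather emit " the" or "," than ":".
raw_logits = {":": 1.0, ",": 2.5, "}": 0.3, " the": 3.0}

# The format says only ":" may follow a key, so everything else is masked out.
allowed_after_key = {":"}
probs = softmax(mask_logits(raw_logits, allowed_after_key))
```

After masking, all of the probability mass sits on ":", no matter what the unconstrained model preferred.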
A relevant PR:
https://github.com/ggerganov/llama.cpp/pull/1773
The plan is to support arbitrary grammar files to constrain token generation, similar to the grammar files here:
This is table stakes now, but it doesn't seem that ANY open-source model has this capability.
I think approach #1 outlined above is the better (more cost- and time-efficient) technique: a pretrained model already understands JSON (among myriad other formats), and you merely constrain it at text-generation time to valid JSON (or another format).
Here's my question then: was the GPT 0613 update (which introduced functions) a completely new base model or simply a finetuned model? It seems to be the latter.
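To make "constrain it at text-gen time" concrete, here is a minimal sketch of a greedy decoding loop driven by a tiny grammar. Everything in it is an assumption for illustration: the grammar is a hand-written state machine over whole tokens (not a real grammar file), and `fake_model_scores` is a stand-in for an LLM that would much rather chat than emit JSON.

```python
import json

# Hypothetical grammar: state -> {allowed token: next state}.
# It accepts exactly one shape: {"name":"<one of two names>"}
TRANSITIONS = {
    "start": {"{": "key"},
    "key": {'"name"': "colon"},
    "colon": {":": "value"},
    "value": {'"Ada"': "close", '"Bob"': "close"},
    "close": {"}": "done"},
}

def fake_model_scores(prefix):
    """Stand-in for an LLM: fixed scores, with a strong pull toward 'Sure!'."""
    return {"{": 0.1, '"name"': 0.2, ":": 0.1,
            '"Ada"': 0.9, '"Bob"': 0.5, "}": 0.3, "Sure!": 5.0}

def constrained_generate():
    """Greedy decoding, but only over tokens the grammar allows in each state."""
    state, out = "start", []
    while state != "done":
        allowed = TRANSITIONS[state]
        scores = fake_model_scores(out)
        best = max(allowed, key=lambda tok: scores[tok])  # "Sure!" is never a candidate
        out.append(best)
        state = allowed[best]
    return "".join(out)

result = constrained_generate()
parsed = json.loads(result)  # always parses, because the grammar only admits valid JSON
```

The model still makes the real choice (here, which name to emit); the grammar only removes the options that would break the format.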
Though it may not seem too fast right now on account of the hundreds of simultaneous requests we're getting :)
1) You're wasting GPT tokens on outputting JSON instead of meaningful information.
2) GPT functions aren't guaranteed to return JSON in the schema you want. In 1% to 3% of cases they hallucinate fields, etc.
3) This also allows you to output data in arbitrary non-JSON formats.
4) You can't self-host OpenAI functions.
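On point 2, if you stay with unconstrained output, you end up checking every response yourself. A minimal sketch of such a check, assuming a flat JSON object and a known set of expected keys; the helper name `check_fields` and the sample payload are hypothetical.

```python
import json

def check_fields(raw, expected):
    """Return (missing, unexpected) key sets for a JSON object given as a string."""
    obj = json.loads(raw)  # raises json.JSONDecodeError if the text isn't valid JSON
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object at the top level")
    keys = set(obj)
    return expected - keys, keys - expected

# Hypothetical model output: one required field dropped, one field invented.
missing, unexpected = check_fields('{"name": "Ada", "nickname": "A"}', {"name", "age"})
```

Anything non-empty in either set means a retry, which is more tokens and more latency.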
As with the other poster, I’d be interested to hear a bit more about point 1.
When you say in another comment that using OpenAI functions to output JSON is a waste of tokens, how are you generating the JSON output? And why do your prompts then include few-shot examples of JSON objects?
We also prefill some tokens depending on the set of allowed tokens at a given state, so the model doesn't waste resources trying to predict them.
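Roughly, the prefill idea looks like the sketch below. This is a simplified illustration rather than our production code: it assumes the simplest case (a state where exactly one token is allowed), and the grammar table and the `pick_token` callback are hypothetical.

```python
# Hypothetical grammar: state -> {allowed token: next state}.
GRAMMAR = {
    "start": {"{": "key"},
    "key": {'"age"': "colon"},
    "colon": {":": "value"},
    "value": {"1": "close", "2": "close", "3": "close"},
    "close": {"}": "done"},
}

def generate_with_prefill(pick_token):
    """Walk the grammar, calling `pick_token` only when there is a real choice."""
    state, out, model_calls = "start", [], 0
    while state != "done":
        allowed = GRAMMAR[state]
        if len(allowed) == 1:
            token = next(iter(allowed))  # forced token: append it without asking the model
        else:
            token = pick_token(out, sorted(allowed))  # stands in for a model forward pass
            model_calls += 1
        out.append(token)
        state = allowed[token]
    return "".join(out), model_calls

# Stand-in "model" that always picks the last option it is offered.
text, calls = generate_with_prefill(lambda prefix, options: options[-1])
```

Five tokens come out, but only one of them costs a model call; the braces, key, and colon are filled in for free.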
The only reference I can find to this being a self-hosted model is a blurb in the GitHub README saying “If you'd like to self-host this in your own cloud, email us”. Sure, I can email my OpenAI/Microsoft rep and self-host GPT-4 in my own cloud for enough money too, but that doesn’t change the fact that the primary business model is SaaS. Just be up-front about this fact in community posts, rather than obfuscating it. Your website does a great job with that.
Our intention wasn't to obfuscate this, so thanks for the feedback. We'll try making that more apparent.
For instance: https://pastebin.com/QFZmEAJA
I use EDN, Clojure's JSON-equivalent format, and what you can read in this paste is an attempt to make GPT write its own prompt in a conversation where I gradually built a format for this narrative structure using Clojure.
It turns out GPT isn't able to produce EDN data using this prompt (it will produce something that looks like the "grammar" displayed in the paste above that GPT came up with, not Clojure data as instructed).
I can get it to output EDN, but only if I provide an example, and then the story in the example tends to leak into the generated story. It still has problems, too: missing keys, for instance, or it doesn't use nested subnarratives, or it just fails to output strict EDN, for instance by dropping parentheses or adding superfluous ones.
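The delimiter problem at least is cheap to detect before feeding anything back. A rough sketch of such a check, not a real EDN parser: it only tracks (), [], {} and string literals, ignores comments and character literals, and the `:title`/`:parts` keys in the sample are placeholders rather than my actual structure.

```python
CLOSERS = {")": "(", "]": "[", "}": "{"}

def delimiters_balanced(text):
    """Return True if (), [] and {} are properly nested outside string literals."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False      # the character after a backslash is taken literally
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "([{":
            stack.append(ch)
        elif ch in CLOSERS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != CLOSERS[ch]:
                return False
    return not stack and not in_string

good = delimiters_balanced('{:title "A (short) tale" :parts [{:title "One"}]}')
truncated = delimiters_balanced('{:title "x" :parts [{:title "y"}]')
extra = delimiters_balanced('{:a 1})')
```

It says nothing about missing keys or flattened nesting, but it catches the dropped or extra closing delimiter without another round-trip to GPT.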
Here's what the EDN structure I want to get might look like:
And here's what kind of text can be generated from it:
For now I haven't even used the parseable EDN programmatically; I just feed it back to GPT as a string (realistically, I'd need to use a vector database to store these narrative blocks). However, GPT will slowly erode the structure with every round-trip.