https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
edit: Nm
I was actually scratching my head over how to structure a regular prompt to produce CSV data without extra nonsense like "Here is your data" and "Please note blah blah" at the beginning and end, so this is very welcome: I can define exactly what I want returned, then just push the structured output to CSV.
Interestingly, JSON Schema has much less of this problem than say CSV - when the model is forced to produce `{"first_key":` it will generally understand it's supposed to continue in JSON. It still helps to tell it the schema though, especially due to weird tokenization issues you can get otherwise.
"Encoding" CSV as JSON is trivial though, so make it output JSON then parse the array-of-arrays into CSV :)
Typically, you're interested in getting more than just one token out of them, so you run them in a loop. You start with the token list containing just the user's message, run the LLM with that list, append the newly obtained token at the end, run the LLM again to see what comes after that new token, and so on, until a special end-of-sequence token is generated and the loop terminates.
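That loop can be sketched in a few lines. Here `next_token_probs` is a hypothetical stand-in for a real model's forward pass, just to make the control flow concrete:

```python
import random

EOS = "<eos>"  # the special end-of-sequence token

def next_token_probs(tokens):
    # Stand-in for a real forward pass: a real LLM returns a probability
    # distribution over its whole vocabulary given the context so far.
    if tokens[-1] == "world":
        return {EOS: 1.0}
    return {"world": 0.9, EOS: 0.1}

def generate(prompt_tokens, max_tokens=16, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        probs = next_token_probs(tokens)
        # Sample the next token from the distribution.
        choices, weights = zip(*probs.items())
        tok = rng.choices(choices, weights=weights)[0]
        if tok == EOS:
            break  # the model signalled it is done talking
        tokens.append(tok)
    return tokens

print(generate(["Hello"]))
```

The important point is that the model only ever sees the growing token list; everything else (stopping, sampling strategy) lives in this outer loop.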
There's no reason why you have to start with nothing but the user's message, though. You can let the user specify the beginning of the LLM's supposed completion, and ask the LLM to continue from that point instead of generating its completion from scratch. This essentially ensures that what the LLM says begins with a specific string.
Not all APIs expose this feature (there are good safety reasons not to), but all LLMs are capable of doing it in principle, and doing it with the open ones is trivial.
LLMs are typically trained to output Markdown, which uses ```language_name to denote code blocks in language_name, so that user interfaces like ChatGPT's web UI can do proper syntax highlighting.
Therefore, if you make your LLM think that it already started a completion, and that completion began with ```json, it will predict what's most likely to come after that delimiter, and that would be a JSON block.
You turn the crank and you get a probability distribution for the next token in the sequence. You then sample the distribution to get the next token, append it to the vector, and do it again and again.
Thus the typical LLM has no memory as such: it infers what it was "thinking" by looking at what it has already said, and uses that to figure out what to say next, so to speak.
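The "sample the distribution" step usually means a softmax over the model's raw scores (logits), optionally rescaled by a temperature. A self-contained sketch:

```python
import math
import random

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature=1.0, rng=random):
    # Temperature rescales logits before the softmax: values < 1 sharpen
    # the distribution (more greedy), values > 1 flatten it (more random).
    probs = softmax([x / temperature for x in logits])
    return rng.choices(range(len(logits)), weights=probs)[0]
```

Constrained-output tools hook in exactly here: before sampling, they zero out the probabilities of all tokens the grammar or schema does not allow next.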
The characters in the input prompt are converted to these tokens, but there are also special tokens such as start of input, end of input, start of output and end of output. The end of output token is how the LLM "tells you" it's done talking.
Normally in a chat scenario these special tokens are inserted by the LLM front-end, say Ollama/llama.cpp in this case.
However, if you interface with the model more directly, you need to add these tokens yourself, which means you can prefill part of the output before feeding the vector to the LLM for the first time. The LLM will then "think" it has already started writing, say, code, and is therefore likely to continue doing so.
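Concretely, prompt assembly with a prefilled assistant turn looks something like this. The token names below are illustrative only (they resemble the ChatML convention); real models each have their own chat template, so check your model's documentation:

```python
# Illustrative special tokens; real names vary per model family.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"

def build_prompt(user_message: str, assistant_prefill: str = "") -> str:
    """Assemble a chat prompt, optionally prefilling the assistant turn.

    Because the assistant turn is left open (no end-of-output token),
    the model continues from the prefill instead of starting fresh.
    """
    return (
        f"{IM_START}user\n{user_message}{IM_END}\n"
        f"{IM_START}assistant\n{assistant_prefill}"
    )

prompt = build_prompt("Give me the data as JSON.", assistant_prefill="```json\n")
print(prompt)
```

Since the prompt ends right after ```json with no closing token, the most likely continuation is the body of a JSON code block, which is exactly the trick described above.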
You have spent 190 at Fresh Mart. Current balance: 5098
and it gave the output below:

{
  "amount": 190,
  "balance": 5098,
  "category": "Shopping",
  "place": "Fresh Mart"
}

Question though. Has anyone had luck running it on AMD GPUs? I've heard it's harder but I really want to support the competition when I get cards next year.
In some instances, I'd rather parse Markdown or plain text if it means the quality of the output is higher.
You basically use JSON schema mode to draw a clean boundary around the wishy-washy language bits, using the LLM as a preprocessor to capture its own output in a useful format.
Also, you need to tell the model the schema. If you don't, you will run into more of those weird tokenization issues.
For example, if the schema expects a JSON key "foobarbaz" and the canonical BPE tokenization is ["foobar", "baz"], the token mask generated by all current constrained-output libraries will let the model choose from "f", "foo", "foobar" (assuming these are all valid tokens). The model might then choose "foo", and the constraint will then force e.g. "bar" and "baz" as the next tokens. Now the model will see ["foo", "bar", "baz"] instead of ["foobar", "baz"] and will get confused. [0]
If the model knows from the prompt "foobarbaz" is one of the schema keys, it will generally prefer "foobar" over "foo".
[0] In modern models these tokens are related because of regularization, but they are not the same.
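A toy greedy tokenizer over a made-up vocabulary makes the mismatch concrete. This is only a stand-in for real BPE, but the failure mode is the same shape:

```python
# Made-up vocabulary; greedy longest-match stands in for canonical BPE.
VOCAB = ["foobar", "foo", "bar", "baz", "f"]

def tokenize(text):
    """Greedy longest-match tokenization."""
    tokens = []
    while text:
        match = max((t for t in VOCAB if text.startswith(t)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

canonical = tokenize("foobarbaz")   # ["foobar", "baz"]
# A constrained sampler masks tokens so any token that is a prefix of the
# remaining schema text is allowed. If the model picks "foo" first, the
# constraint then forces "bar" and "baz":
constrained = ["foo", "bar", "baz"]
# Same string, different token sequence, and the model was trained almost
# exclusively on the canonical one.
assert "".join(canonical) == "".join(constrained) == "foobarbaz"
print(canonical, constrained)
```

Putting the schema in the prompt mitigates this because the model then assigns high probability to "foobar" at the point where the mask would otherwise let it wander onto "foo".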
FWIW this was measured by me using a vibes-based method, nothing rigorous, just a lot of hours spent on various LLM projects. I have not used these particular tools yet, but Ollama was previously able to guarantee JSON output through what I assume are similar techniques, and my partner and I previously worked on a jsonformer-like thing for oobabooga, another LLM runtime tool.
Hopefully with those changes we can also enable general structured generation, not limited only to JSON.
The current implementation uses llama.cpp GBNF grammars. The more recent research (Outlines, XGrammar) points to potentially speeding up the sampling process through FSTs and GPU parallelism.
It’s easy to burn a lot of tokens, but if the thing you’re doing merits the cost? You can be a bully with it, and while it’s never the best, 95% as good for zero effort is a tool in one’s kit.
It looks like, so long as you're reasonable with the prompting, you tend to get better outputs when using structure.
asking the LLM to directly generate JSON gives much worse results, similar to either random guessing or intuition.
Hoping to be more on top of community PRs and get them merged in the coming year.
https://www.souzatharsis.com/tamingLLMs/notebooks/structured...
With the newer research (Outlines, XGrammar) coming out, I hope to be able to update the sampling to support more formats, increase accuracy, and improve performance.
I'm curious whether you have considered implementing Microsoft's Guidance (https://github.com/guidance-ai/guidance)? Their approach offers significant speed improvements, which I understand can sometimes be a shortcoming of GBNF (e.g. https://github.com/ggerganov/llama.cpp/issues/4218).
However, in this case you're best off giving the model sentences one by one to avoid it getting confused. If you structure the prompt like "Classify the following sentence, here are the rules ...." + sentence, then you should be hitting the prefix cache and get even better performance than when doing a single query. Of course, this only works if you have a prefix cache and are not paying per input token (though most providers now let you indicate you want to use the prefix cache and pay less).
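The prefix-cache-friendly structure amounts to keeping the long instruction block byte-identical across requests and only varying the tail. A small sketch (the rule text is a made-up placeholder):

```python
# Shared instruction prefix, identical across every request, so a server
# with prefix caching can reuse the KV cache for it on each call.
RULES_PREFIX = (
    "Classify the following sentence as POSITIVE or NEGATIVE.\n"
    "Rules: answer with exactly one word.\n\n"
    "Sentence: "
)

def build_requests(sentences):
    # Only the tail varies; even a one-character difference earlier in the
    # prompt would invalidate the cached prefix.
    return [RULES_PREFIX + s for s in sentences]

prompts = build_requests(["I love it.", "This is awful."])
for p in prompts:
    print(repr(p))
```

The same idea applies whether the cache is llama.cpp's local KV cache or a hosted provider's prompt-caching feature: stable prefix first, variable data last.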
Are llama.cpp and ollama leveraging llama's intrinsic structured output capability, or is this something else bolted ex-post on the output? (And if the former, how is the capability guaranteed across other models?)
Amazing work as always, looking forward to taking this for a spin!
type GenerateRequest struct {
	...

	// Format specifies the format to return a response in.
	Format json.RawMessage `json:"format,omitempty"`
}
https://github.com/ollama/ollama/blob/de52b6c2f90ff220ed9469...

Does llama.cpp do dynamic model loading and unloading? Will it fetch a model you request that isn't downloaded? Does it provide SDKs? Does it have startup services it provides? There's space for things that wrap llama.cpp and solve many of its pain points. You can find piles of reports of people struggling to build and compile llama.cpp for one reason or another who then clicked an Ollama installer and it worked right away.
It's also a free OSS project giving all this away, why are you being an ass and discouraging them?
But this submitted link is not even about any of that. It is about the one thing llama.cpp really does not do: produce more lines of marketing material than lines of code. That is what this marketing material is about, lines of code that mostly just wrap 10x more lines of code underneath, without making that clear as day.