https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
edit: Nm
I was actually scratching my head over how to structure a regular prompt to produce CSV data without extra nonsense like "Here is your data" and "Please note blah blah" at the beginning and end, so this is most welcome: I can define exactly what I want returned and then just push the structured output to CSV.
Interestingly, JSON Schema has much less of this problem than say CSV - when the model is forced to produce `{"first_key":` it will generally understand it's supposed to continue in JSON. It still helps to tell it the schema though, especially due to weird tokenization issues you can get otherwise.
You have spent 190 at Fresh Mart. Current balance: 5098

and it gave the output below:

    {
      "amount": 190,
      "balance": 5098,
      "category": "Shopping",
      "place": "Fresh Mart"
    }

Question though. Has anyone had luck running it on AMD GPUs? I've heard it's harder but I really want to support the competition when I get cards next year.
In some instances, I'd rather parse Markdown or plain text if it means the quality of the output is higher.
You basically use JSON schema mode to draw a clean boundary around the wishy-washy language bits, using the LLM as a preprocessor to capture its own output in a useful format.
Also, you need to tell the model the schema. If you don't, you will get more weird tokenization issues.
For example, if the schema expects a JSON key "foobarbaz" and the canonical BPE tokenization is ["foobar", "baz"], the token mask generated by all current constrained output libraries will let the model choose from "f", "foo", "foobar" (assuming these are all valid tokens). The model might then choose "foo", after which the constraint will force e.g. "bar" and "baz" as the next tokens. Now the model will see ["foo", "bar", "baz"] instead of ["foobar", "baz"] and will get confused. [0]
If the model knows from the prompt "foobarbaz" is one of the schema keys, it will generally prefer "foobar" over "foo".
[0] In modern models these tokens are related because of regularization, but they are not the same.
FWIW this is measured by me using a vibes-based method; nothing rigorous, just a lot of hours spent on various LLM projects. I have not used these particular tools yet, but ollama was previously able to guarantee JSON output through what I assume are similar techniques, and my partner and I previously worked on a jsonformer-like thing for oobabooga, another LLM runtime tool.
Hopefully with those changes we might also enable general structured generation, not limited to JSON.
It’s easy to burn a lot of tokens, but if the thing you’re doing merits the cost? You can be a bully with it, and while it’s never the best, 95% as good for zero effort is a tool in one’s kit.
It looks like, so long as you're reasonable with the prompting, you tend to get better outputs when using structure.
Asking the LLM to directly generate JSON gives much worse results, similar to either random guessing or intuition.
Hoping to be more on top of community PRs and get them merged in the coming year.
https://www.souzatharsis.com/tamingLLMs/notebooks/structured...
With the newer research - outlines/xgrammar coming out, I hope to be able to update the sampling to support more formats, increase accuracy, and improve performance.
I'm curious if you have considered implementing Microsoft's Guidance (https://github.com/guidance-ai/guidance)? Their approach offers significant speed improvements, which I understand can sometimes be a shortcoming of GBNF (e.g. https://github.com/ggerganov/llama.cpp/issues/4218).
However, in this case you're best off giving the model sentences one by one to avoid confusing it. If you structure the prompt like "Classify the following sentence, here are the rules ...." + sentence, then you should be hitting the prefix cache and get even better performance than with a single query. Of course, this only works if you have a prefix cache and are not paying per input token (though most providers now let you indicate that you want to use the prefix cache and pay less).
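The prompt structure described above can be sketched like this. The instruction text and the idea of a byte-for-byte identical shared prefix are the point; the function and label names are illustrative:

```go
package main

import "fmt"

// buildPrompt concatenates a fixed instruction prefix with one sentence.
// Because the prefix is identical across requests, a server-side prefix
// cache can reuse the KV entries computed for it.
func buildPrompt(sharedPrefix, sentence string) string {
	return sharedPrefix + "\nSentence: " + sentence + "\nLabel:"
}

func main() {
	prefix := "Classify the following sentence. Rules: reply with exactly one of POSITIVE, NEGATIVE, NEUTRAL."
	sentences := []string{
		"The battery life is fantastic.",
		"It broke after two days.",
	}
	for _, s := range sentences {
		// One request per sentence, all sharing the same cached prefix.
		fmt.Println(buildPrompt(prefix, s))
		fmt.Println("---")
	}
}
```

Batching one sentence per request this way trades more requests for less confusion, and the shared prefix keeps the extra cost low when prefix caching is available.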
Are llama.cpp and ollama leveraging llama's intrinsic structured output capability, or is this something else bolted ex-post on the output? (And if the former, how is the capability guaranteed across other models?)
Amazing work as always, looking forward to taking this for a spin!
type GenerateRequest struct {
	...
	// Format specifies the format to return a response in.
	Format json.RawMessage `json:"format,omitempty"`
}
https://github.com/ollama/ollama/blob/de52b6c2f90ff220ed9469...

Does llama.cpp do dynamic model loading and unloading? Will it fetch a model you request but isn't downloaded? Does it provide SDKs? Does it have startup services it provides? There's space for things that wrap llama.cpp and solve many of its pain points. You can find piles of reports of people struggling to build and compile llama.cpp for some reason or another who then clicked an Ollama installer and it worked right away.
It's also a free OSS project giving all this away, why are you being an ass and discouraging them?
But this submitted link is not even about any of that; it is about what llama.cpp really does not do: write more lines of marketing material than lines of code. That is what the marketing material is about, lines of code that really just wrap 10x more lines of code down the line, all without making that clear as day.