I built this because I'm tuning a bunch of prompts and didn't have a good way to do it systematically.
This CLI tool helps you pick the best prompt and model by allowing you to configure multiple prompts and variables. It outputs "before" and "after" so you can easily compare LLM outputs side-by-side and determine if the prompt has improved the quality of each example.
Example use cases:
- Deciding whether it's worth using GPT-4 over GPT-3.5
- Evaluating quality improvements to your prompt across a large range of examples
- Catching regressions in edge cases as you iterate on your prompt
It supports several output formats (console, HTML table view, CSV, JSON, and YAML), so you can integrate it into your workflow as needed. It can also be used as a library instead of a CLI.
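To make the side-by-side idea concrete, here is a minimal sketch in Python of the comparison described above: every prompt is filled in with every set of variables, and the outputs for one test case are lined up in a single row. This is not the tool's actual implementation. The `{{name}}` placeholder syntax, the `render` and `build_matrix` helpers, and the `fake_model` stand-in for a real LLM call are all assumptions made for illustration.

```python
def render(template: str, variables: dict[str, str]) -> str:
    """Fill {{name}} placeholders (an assumed syntax) with the given values."""
    out = template
    for name, value in variables.items():
        out = out.replace("{{" + name + "}}", value)
    return out


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; it only echoes the prompt."""
    return f"<output for: {prompt}>"


def build_matrix(prompts: list[str], test_cases: list[dict[str, str]], model=fake_model):
    """Return one row per test case and one column per prompt."""
    rows = []
    for variables in test_cases:
        rows.append([model(render(p, variables)) for p in prompts])
    return rows


# "Before" and "after" versions of the same prompt.
prompts = [
    "Translate to French: {{text}}",
    "You are a translator. Render this in French: {{text}}",
]
test_cases = [{"text": "Hello"}, {"text": "Good night"}]

matrix = build_matrix(prompts, test_cases)
for variables, row in zip(test_cases, matrix):
    # Each printed line shows the outputs of all prompts for one test case.
    print(variables, "|", " | ".join(row))
```

Swapping `fake_model` for a function that calls a real model would give the actual comparison; the row-per-example layout is what makes regressions in a single edge case easy to spot.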
I'm interested in hearing your thoughts and suggestions on how to improve this tool further. Thanks!
How does this compare to Nat's OpenPlayground?
Vercel recently launched a playground too.
I'm running these tests in bulk, so I prefer to automate with the CLI, or integrate with a test framework like Jest. I think the web UI is good for tinkering, but does not fit as well into real workflows.
There are a lot of steps to run, so I would suggest:
1. Create a config file (YAML or JSON) that defines the prompts, variables, models, and output file.
2. Add an `init` command that creates empty files with the required structure. For example:
`promptfoo init`
The output would be:
config.yaml var.json prompts.json
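A rough sketch of what such an init step could do, written in Python purely for illustration. The three file names come from the suggestion above; the keys inside `config.yaml` (`prompts`, `vars`, `models`, `output`) and the `init_project` helper are hypothetical and not taken from the tool.

```python
import json
import tempfile
from pathlib import Path


def init_project(target_dir: Path) -> list[Path]:
    """Create empty scaffold files, skipping any that already exist."""
    scaffolds = {
        # Hypothetical config layout: it points at the other two files.
        "config.yaml": "prompts: prompts.json\nvars: var.json\nmodels: []\noutput:\n",
        "var.json": json.dumps([], indent=2) + "\n",
        "prompts.json": json.dumps([], indent=2) + "\n",
    }
    created = []
    for name, content in scaffolds.items():
        path = target_dir / name
        if path.exists():
            continue  # never overwrite a user's existing file
        path.write_text(content)
        created.append(path)
    return created


# Run against a throwaway directory so the example has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    created_names = sorted(p.name for p in init_project(Path(tmp)))
    print(created_names)
```

Skipping files that already exist keeps the command safe to rerun in a directory that has been partly set up.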
Good luck!