Show HN: Promptfoo – CLI for testing & improving LLM prompt quality

(github.com)

14 points by typpo · 3y ago · 5 comments

5 comments · 2 top-level

typpo (OP) · 3y ago · 2 in thread

Hi HN,

I built this because I'm tuning a bunch of prompts and don't have a great way to do this systematically.

This CLI tool helps you pick the best prompt and model by allowing you to configure multiple prompts and variables. It outputs "before" and "after" so you can easily compare LLM outputs side-by-side and determine if the prompt has improved the quality of each example.
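To make the side-by-side idea concrete, here is a minimal sketch in Python. It is not how promptfoo is implemented: `render`, `fake_model`, and `compare` are hypothetical names, and `fake_model` is a stand-in for a real LLM call so the example runs offline.

```python
def render(template: str, variables: dict) -> str:
    """Fill the {placeholders} in a prompt template."""
    return template.format(**variables)


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call (assumption: swap in a real client here)."""
    return f"<output for: {prompt}>"


def compare(prompts, cases, model=fake_model):
    """Return one row per test case and one column per prompt."""
    rows = []
    for variables in cases:
        rows.append([model(render(p, variables)) for p in prompts])
    return rows


# Two candidate prompts ("before" and "after") and two example inputs.
prompts = [
    "Translate to French: {text}",
    "You are a translator. Render this in French: {text}",
]
cases = [{"text": "Hello"}, {"text": "Good night"}]

table = compare(prompts, cases)
for variables, row in zip(cases, table):
    print(variables, "|", " | ".join(row))
```

Each printed row is one example, with the outputs of both prompts next to each other, which is roughly the shape of the comparison table described above.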

Example use cases:

- Deciding whether it's worth using GPT-4 over GPT-3.5

- Evaluating quality improvements to your prompt across a large range of examples

- Catching regressions in edge cases as you iterate on your prompt

It supports a handful of useful output formats (console, HTML table view, CSV, JSON, YAML), so you can integrate it into your workflow as needed. It can also be used as a library instead of a CLI.

I'm interested in hearing your thoughts and suggestions on how to improve this tool further. Thanks!

brianjking · 3y ago

https://github.com/nat/openplayground

How does this compare to Nat's OpenPlayground?

Vercel recently launched a playground too.

typpo (OP) · 3y ago

Looks like the playground is mainly for comparison between models, not prompts, and doesn't support templating? Vercel's is similar but not free and open-source.

I'm running these tests in bulk, so I prefer to automate with the CLI, or integrate with a test framework like Jest. I think the web UI is good for tinkering, but does not fit as well into real workflows.
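As a rough illustration of the test-framework angle, here is a sketch in Python with plain checks rather than Jest. Everything in it is an assumption for the sake of the example: `fake_model` stands in for the LLM, and `PROMPT`, `CASES`, and `run_cases` are hypothetical names, not part of promptfoo.

```python
def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call: upper-cases the text after the first colon."""
    return prompt.split(":", 1)[1].strip().upper()


PROMPT = "Shout this back: {text}"

# Each case pairs template variables with a check the output must pass.
CASES = [
    ({"text": "hello"}, lambda out: out == "HELLO"),
    ({"text": ""}, lambda out: out == ""),          # edge case: empty input
    ({"text": "a: b"}, lambda out: "A: B" in out),  # edge case: colon in input
]


def run_cases(model=fake_model):
    """Return the variables of every case whose check failed."""
    failures = []
    for variables, check in CASES:
        output = model(PROMPT.format(**variables))
        if not check(output):
            failures.append(variables)
    return failures


failures = run_cases()
print("failures:", failures)
```

Running this after every prompt change and failing the build when `failures` is non-empty is the kind of bulk, automated check that is awkward to do from a web UI.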

Oras · 3y ago · 1 in thread

This is a good idea. I've used Gradio and Streamlit to list outputs from different models and check them manually, but a CLI makes more sense for running and evaluating multiple use cases.

You have a lot of steps to run, so I would suggest:

1. Create a config file (yaml or json) to define prompts, variables, models, and output file.

2. Create an init command which will create empty files with the required structure. For example:

`promptfoo init`

output will be:

config.yaml var.json prompts.json
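A minimal sketch of what such an init step could do, in Python. Only the three file names come from the suggestion above; the starter contents, the `SCAFFOLD` table, and the `init` function are hypothetical and not promptfoo's actual behavior.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical starter contents; the real layout is up to the tool.
SCAFFOLD = {
    "config.yaml": "prompts: prompts.json\nvars: var.json\nmodels: []\noutput: results.csv\n",
    "var.json": json.dumps([{}], indent=2) + "\n",
    "prompts.json": json.dumps([], indent=2) + "\n",
}


def init(target: Path) -> list[str]:
    """Create any missing scaffold files and return the names written."""
    target.mkdir(parents=True, exist_ok=True)
    created = []
    for name, content in SCAFFOLD.items():
        path = target / name
        if not path.exists():  # never overwrite a file the user already has
            path.write_text(content)
            created.append(name)
    return created


# Demo in a throwaway directory: the second call has nothing left to create.
with tempfile.TemporaryDirectory() as tmp:
    first = init(Path(tmp))
    second = init(Path(tmp))
    print(first)
    print(second)
```

Skipping files that already exist keeps the command safe to re-run in a project that is already set up.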

Good luck!

typpo (OP) · 3y ago

Thanks for the suggestion! I've added a `promptfoo init` command so the initial scaffolding is much easier.
