Promptfoo: The Ultimate Tool for Ensuring LLM Quality and Reliability
After exploring MLflow’s “Prompt Engineering” feature, I reached a satisfactory “prompt + LLM” combination. That exploration was the subject of my previous post.
Check “Enhance LLM Prompt Quality and Results with MLflow Integration” at https://wp.me/p3Vuhl-3lS
I was wondering: how can I automatically test the LLM output to make sure quality and relevance are always there, so that an LLM application can be shipped online safely?
This question is partly answered by “promptfoo”, whose stated purpose is to “Test & secure your LLM apps”.
Source: https://www.promptfoo.dev/
For this post too, you can find all the files and prompts on my GitHub account. See https://github.com/bflaven/ia_usages/tree/main/ia_testing_llm
The answer is simple: “promptfoo” can, within the same test, simultaneously check the validity of the result produced by the LLM both in form and in content.
Source: promptfoo
So, what does “promptfoo” bring to the table? My testing scenario was the following:
- The first part of the test ensures, for example, that the LLM outputs valid JSON that conforms to a JSON schema. Obtaining structured output is extremely useful for LLM-based applications, since it makes parsing the results much easier.
- The second part of the test ensures, for example, that the generative AI produces content in the desired language and meets validation criteria such as: the summary contains between 2 and 3 sentences, the number of keywords is exactly five, and so on (see the sketch right after this list).
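To make this concrete, here is a minimal sketch of what such a test could look like in a promptfoo configuration. It assumes the prompt asks the LLM to return a JSON object with `title`, `summary` and `keywords` fields and that the article is externalized in a file; those field names and paths are my own illustration, not taken from the official examples.

```yaml
# promptfooconfig.yaml (extract) - hypothetical test combining a form check and a content check
tests:
  - vars:
      article: file://articles/article_1.txt  # assumed externalized article
    assert:
      # Form: the output must be valid JSON matching this JSON schema
      - type: is-json
        value:
          type: object
          required: ["title", "summary", "keywords"]
          properties:
            title:
              type: string
            summary:
              type: string
            keywords:
              type: array
              minItems: 5
              maxItems: 5
      # Content: the summary must contain between 2 and 3 sentences
      - type: javascript
        value: |
          const data = JSON.parse(output);
          const sentences = data.summary.split(/[.!?]+/).filter(s => s.trim().length > 0);
          return sentences.length >= 2 && sentences.length <= 3;
```

The language requirement itself could be checked with a model-graded assertion such as `llm-rubric`, but I kept this sketch to deterministic checks.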
Obviously, these tests at the LLM level would be a useful complement to the functional tests carried out with Cypress on the API, i.e. on the application side.
The combination of the two test suites would therefore offer a guarantee of quality on the results generated by the AI within a CI/CD pipeline.
Technically, my POC is based on an LLM (Mistral) served by Ollama, so I modified the `promptfooconfig.yaml` file to configure the `providers` so that everything is compatible with my development environment.
Good resources on how to use “promptfoo”
Here are some good resources that gave me useful insights on how to configure and use promptfoo:
“How to Use Promptfoo for LLM Testing” is a good introduction to promptfoo: https://medium.com/thedeephub/how-to-use-promptfoo-for-llm-testing-13e96a9a9773
A bunch of examples are provided by promptfoo so you can see the extent of possibilities offered by the framework. Check https://github.com/promptfoo/promptfoo/tree/main/examples
For me, these 8 examples in particular have been a good source of inspiration.
- python-assert-external: https://github.com/promptfoo/promptfoo/tree/main/examples/python-assert-external
- json-output: https://github.com/promptfoo/promptfoo/tree/main/examples/json-output
- prompts-per-model: https://github.com/promptfoo/promptfoo/tree/main/examples/prompts-per-model
- summarization: https://github.com/promptfoo/promptfoo/tree/main/examples/summarization
- simple-cli: https://github.com/promptfoo/promptfoo/tree/main/examples/simple-cli
- simple-csv: https://github.com/promptfoo/promptfoo/tree/main/examples/simple-csv
- simple-test: https://github.com/promptfoo/promptfoo/tree/main/examples/simple-test
- mistral-llama-comparison: https://github.com/promptfoo/promptfoo/tree/main/examples/mistral-llama-comparison
After reading this post and browsing the examples, my “shopping list” was the following. I am running the LLMs locally with the help of Ollama, and the model I am using is Mistral.
- Add providers via `id` to be able to test different models (mistral-openorca, mistral, openhermes, phi3, zephyr)
- Externalize the prompts and externalize the contents (articles)
- Run a test on the JSON output produced by the LLM
- Run a test using Python to qualify the quality of the result from a relevance point of view (see the sketch right after this list)
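For that last item, the python-assert-external example shows the pattern: the assertion points to a Python file that receives the LLM output and returns a pass/fail verdict. Below is a minimal sketch of such a file; the file name, field names and thresholds are my own assumptions, reusing the JSON structure sketched earlier.

```python
# assert_relevance.py - hypothetical external Python assertion for promptfoo
# promptfoo calls get_assert(output, context) and accepts a bool, a float, or a GradingResult-like dict.
import json


def get_assert(output: str, context) -> dict:
    """Check that the generated JSON is usable: a 2-3 sentence summary and exactly five keywords."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {"pass": False, "score": 0.0, "reason": "Output is not valid JSON"}

    summary = data.get("summary", "")
    keywords = data.get("keywords", [])

    # Naive sentence count: split on end-of-sentence punctuation
    sentence_count = len(
        [s for s in summary.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    )
    checks = {
        "summary_2_to_3_sentences": 2 <= sentence_count <= 3,
        "exactly_five_keywords": len(keywords) == 5,
    }

    passed = all(checks.values())
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": ", ".join(f"{name}={ok}" for name, ok in checks.items()),
    }
```

In the configuration, this file would be wired in with an assertion of type `python` whose `value` points to `file://assert_relevance.py`.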
Below is an extract showing how to set the `providers` in the `promptfooconfig.yaml` file:
```yaml
providers:
  - id: openrouter:mistralai/mistral-7b-instruct
    config:
      temperature: 0.5
  - id: openrouter:mistralai/mixtral-8x7b-instruct
    config:
      temperature: 0.5
  - id: openrouter:meta-llama/llama-3.1-8b-instruct
    config:
      temperature: 0.5
```
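Since my POC runs against a local Ollama instance rather than OpenRouter, the `providers` block I actually use follows the `ollama:<model>:<tag>` pattern used later in the walkthrough; the exact model tags below are assumptions based on the models pulled locally.

```yaml
# promptfooconfig.yaml (extract) - providers pointing to local Ollama models
providers:
  - id: ollama:mistral:latest
  - id: ollama:mistral-openorca:latest
  - id: ollama:phi3:latest
```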
The walkthrough
```bash
# create the dir
mkdir 001_promptfoo_running

# path
cd /Users/brunoflaven/Documents/01_work/blog_articles/ia_testing_llm/001_promptfoo_running/

# install promptfoo
npm install -g promptfoo@latest

# change providers in promptfooconfig.yaml
# - ollama:mistral:latest

# launch eval
npx promptfoo eval

# launch eval with debug logging
LOG_LEVEL=debug npx promptfoo eval

# view result
npx promptfoo view

# uninstall
npm uninstall -g promptfoo

# gain disk space
npm cache clean --force
```
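To give an idea of how the pieces fit together, here is the skeleton of the `promptfooconfig.yaml` used in this walkthrough; the paths and file names are illustrative, the real files are in the GitHub repository mentioned above.

```yaml
# promptfooconfig.yaml - illustrative skeleton (paths and names are assumptions)
description: "Check the form (JSON) and the content (summary, keywords) of the LLM output"

prompts:
  - file://prompts/summarize_article.txt   # externalized prompt

providers:
  - id: ollama:mistral:latest

tests:
  - vars:
      article: file://articles/article_1.txt   # externalized content
    assert:
      - type: is-json
      - type: python
        value: file://assert_relevance.py
```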
Conclusion
Integrating Promptfoo into your development workflow or into your CI/CD pipeline is probably the only way to enhance the quality and reliability of the LLM output.
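As an illustration of the CI/CD part, a pipeline job can simply install promptfoo and run the evals, failing the build when assertions fail. Here is a sketch of a GitHub Actions workflow; it is my own assumption of a setup, not an official promptfoo recipe, and it presumes the configured LLM provider is reachable from the CI runner (my local Ollama setup would need a self-hosted runner or a hosted endpoint instead).

```yaml
# .github/workflows/llm-evals.yml - hypothetical CI job running the promptfoo evals
name: llm-evals
on: [pull_request]

jobs:
  promptfoo-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run promptfoo evals
        # reads promptfooconfig.yaml at the repository root
        run: npx promptfoo@latest eval
```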
Below are the praises sung by promptfoo’s creators themselves:
- Developer friendly: promptfoo is fast, with quality-of-life features like live reloads and caching.
- Battle-tested: Originally built for LLM apps serving over 10 million users in production. Our tooling is flexible and can be adapted to many setups.
- Simple, declarative test cases: Define evals without writing code or working with heavy notebooks.
- Language agnostic: Use Python, Javascript, or any other language.
- Share & collaborate: Built-in share functionality & web viewer for working with teammates.
- Open-source: LLM evals are a commodity and should be served by 100% open-source projects with no strings attached.
- Private: This software runs completely locally. The evals run on your machine and talk directly with the LLM.
Source: https://www.promptfoo.dev/docs/intro/
Extra: Using NotebookLM
This post is also an experiment to test NotebookLM. So, here is this regular blog post, “Promptfoo: The Ultimate Tool for Ensuring LLM Quality and Reliability”, converted into a podcast using NotebookLM.
NotebookLM gives you a personalized AI collaborator that helps you do your best thinking. After uploading your documents, NotebookLM becomes an instant expert in those sources so you can read, take notes, and collaborate with it to refine and organize your ideas.
More on: https://support.google.com/notebooklm/#topic=14287611
Videos accompanying this post
Promptfoo: The Ultimate Tool for Ensuring LLM Quality and Reliability (Part 1)
Promptfoo: The Ultimate Tool for Ensuring LLM Quality and Reliability (Part 2)
More info
- Testing Language Models (and Prompts) Like We Test Software | by Marco Tulio Ribeiro | Towards Data Science: https://towardsdatascience.com/testing-large-language-models-like-we-test-software-92745d28a359
- GitHub – aws-samples/llm-based-advanced-summarization: https://github.com/aws-samples/llm-based-advanced-summarization/tree/main
- Prompt Engineering Testing Strategies with Python | Shiro: https://openshiro.com/articles/prompt-engineering-testing-strategies-with-python
- GitHub – duncantmiller/llm_prompt_engineering: Prompt engineering testing strategies, using the OpenAI API: https://github.com/duncantmiller/llm_prompt_engineering/tree/main
- Generative AI Evaluation with Promptfoo: A Comprehensive Guide | by Yuki Nagae | Medium: https://medium.com/@yukinagae/generative-ai-evaluation-with-promptfoo-a-comprehensive-guide-e23ea95c1bb7
- promptfoo/examples/assistant-cli · promptfoo/promptfoo · GitHub: https://github.com/promptfoo/promptfoo/tree/0665ec88d58369ede2d0615c24c1f023b7fafa9b/examples/assistant-cli
- Getting started | promptfoo: https://www.promptfoo.dev/docs/getting-started/
- Model-graded metrics | promptfoo: https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/
- Solving the “Punycode Module is Deprecated” Issue in Node.js | by Asimabas | Medium: https://medium.com/@asimabas96/solving-the-punycode-module-is-deprecated-issue-in-node-js-93437637948a
- Semantic Tagging: Create Meaningful Tags for your Text Data | by Gabriele Sgroi, PhD | Towards AI: https://pub.towardsai.net/semantic-tagging-create-meaningful-tags-for-your-text-data-dcf8d2f24960
- How to use Large Language Models to tag your data: A complete tutorial | by Research Graph | Medium: https://medium.com/@researchgraph/how-to-use-large-language-models-to-tag-your-data-a-complete-tutorial-4a3647ae0f05
- Welcome To Instructor – Instructor: https://jxnl.github.io/instructor/
- Ollama – Instructor: https://jxnl.github.io/instructor/examples/ollama/#ollama
- GitHub – guidance-ai/guidance: A guidance language for controlling large language models: https://github.com/guidance-ai/guidance
- Evaluate AI/LLM Performance with Effective Test Prompts: https://writingmate.ai/blog/ai-llm-perfomance-testing
- GitHub – promptfoo/promptfoo: Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs: https://github.com/promptfoo/promptfoo
- LLM evaluation techniques for JSON outputs | promptfoo: https://www.promptfoo.dev/docs/guides/evaluate-json/
- Test Driven PROMPT Engineering: Using Promptfoo to COMPARE Prompts, LLMs, and Providers – YouTube: https://www.youtube.com/watch?v=KhINc5XwhKs
- GitHub – disler/llm-prompt-testing-quick-start: LLM Prompt Testing Quick Start: https://github.com/disler/llm-prompt-testing-quick-start
- Phi vs Llama: Benchmark on your own data | promptfoo: https://www.promptfoo.dev/docs/guides/phi-vs-llama/
- Enhancing JSON Output with Large Language Models: A Comprehensive Guide | by Dina Berenbaum | Medium: https://medium.com/@dinber19/enhancing-json-output-with-large-language-models-a-comprehensive-guide-f1935aa724fb
- How to Get Only JSON response from Any LLM Using LangChain | by Harshit Dubey | Medium: https://medium.com/@harshitdy/how-to-get-only-json-response-from-any-llm-using-langchain-ed53bc2df50f
- LLM Evaluation: Comparing Four Methods to Automatically Detect Errors | Label Studio: https://labelstud.io/blog/llm-evaluation-comparing-four-methods-to-automatically-detect-errors/
- Your AI Product Needs Evals – Hamel’s Blog: https://hamel.dev/blog/posts/evals/
- GitHub – confident-ai/deepeval: The LLM Evaluation Framework: https://github.com/confident-ai/deepeval
- LLM Observability & Application Tracing (open source) – Langfuse: https://langfuse.com/docs/tracing
- Google Colab (Langfuse/Ollama integration cookbook): https://colab.research.google.com/github/langfuse/langfuse-docs/blob/main/cookbook/integration_ollama.ipynb
- llm-evaluation · GitHub Topics: https://github.com/topics/llm-evaluation
- prompt-testing · GitHub Topics: https://github.com/topics/prompt-testing
- Collecting user feedback on ML in Streamlit: https://blog.streamlit.io/collecting-user-feedback-on-ml-in-streamlit/
- Trubrics: https://www.trubrics.com/#pricing
- How to capture the feedback effectively – Using Streamlit – Streamlit forum: https://discuss.streamlit.io/t/how-to-capture-the-feedback-effectively/60138/3
- Collect user feedback on AI models from your Streamlit app – YouTube: https://www.youtube.com/watch?v=2Qt54qGwIdQ
- Getting Started | LMQL: https://lmql.ai/docs/