Promptfoo: The Ultimate Tool for Ensuring LLM Quality and Reliability
After exploring MLflow’s “Prompt Engineering” feature, I reached a satisfactory “prompt + LLM” combination. That exploration was the subject of my previous post.
Check “Enhance LLM Prompt Quality and Results with MLflow Integration” at https://wp.me/p3Vuhl-3lS
I was wondering: how can I automatically test the LLM output to make sure quality and relevance are always there, so that an LLM application can be shipped online safely?
This question is partly answered by “promptfoo”, whose stated purpose is to “Test & secure your LLM apps”.
Source: https://www.promptfoo.dev/
For this post too, you can find all the files and prompts on my GitHub account. See https://github.com/bflaven/ia_usages/tree/main/ia_testing_llm
The answer is simple: “promptfoo” can, within the same test, simultaneously check the validity of the result produced by the LLM both in form and in content.
Source: promptfoo
So, what does “promptfoo” bring to the table? My testing scenario was the following:
- The first part of the test ensures, for example, that the LLM outputs valid JSON that conforms to a JSON schema. Obtaining structured output is extremely useful for LLM-based applications, since it makes parsing the results much easier.
- The second part of the test ensures, for example, that the generative AI produces content in the desired language and meets validation criteria such as: the summary contains between 2 and 3 sentences, the number of keywords is exactly five, and so on (see the sketch right after this list).
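To make this concrete, here is a minimal sketch of what such a test could look like in a promptfoo configuration. It assumes the prompt asks the LLM to return a JSON object with `title`, `summary` and `keywords` fields and that the article is externalized in a file; those field names and paths are my own illustration, not taken from the official examples.

```yaml
# promptfooconfig.yaml (extract) - hypothetical test combining a form check and a content check
tests:
  - vars:
      article: file://articles/article_1.txt  # assumed externalized article
    assert:
      # Form: the output must be valid JSON matching this JSON schema
      - type: is-json
        value:
          type: object
          required: ["title", "summary", "keywords"]
          properties:
            title:
              type: string
            summary:
              type: string
            keywords:
              type: array
              minItems: 5
              maxItems: 5
      # Content: the summary must contain between 2 and 3 sentences
      - type: javascript
        value: |
          const data = JSON.parse(output);
          const sentences = data.summary.split(/[.!?]+/).filter(s => s.trim().length > 0);
          return sentences.length >= 2 && sentences.length <= 3;
```

The language requirement itself could be checked with a model-graded assertion such as `llm-rubric`, but I kept this sketch to deterministic checks.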
Obviously, these tests at the LLM level would be a useful complement to the functional tests carried out with Cypress on the API, i.e. on the application side.
The combination of the two test suites would therefore offer a guarantee of quality on the results generated by the AI within a CI/CD pipeline.
Technically, my POC is based on an LLM (Mistral) served by Ollama, so I modified the `promptfooconfig.yaml` file to configure the `providers` so that everything is compatible with my development environment.
Good resources on how to use “promptfoo”
Here are some good resources that gave me useful insights on how to configure and use promptfoo:
“How to Use Promptfoo for LLM Testing” is a good introduction to promptfoo: https://medium.com/thedeephub/how-to-use-promptfoo-for-llm-testing-13e96a9a9773
A bunch of examples are provided by promptfoo so you can see the extent of possibilities offered by the framework. Check https://github.com/promptfoo/promptfoo/tree/main/examples
For me, these 8 examples in particular have been a good source of inspiration.
- python-assert-external: https://github.com/promptfoo/promptfoo/tree/main/examples/python-assert-external
- json-output: https://github.com/promptfoo/promptfoo/tree/main/examples/json-output
- prompts-per-model: https://github.com/promptfoo/promptfoo/tree/main/examples/prompts-per-model
- summarization: https://github.com/promptfoo/promptfoo/tree/main/examples/summarization
- simple-cli: https://github.com/promptfoo/promptfoo/tree/main/examples/simple-cli
- simple-csv: https://github.com/promptfoo/promptfoo/tree/main/examples/simple-csv
- simple-test: https://github.com/promptfoo/promptfoo/tree/main/examples/simple-test
- mistral-llama-comparison: https://github.com/promptfoo/promptfoo/tree/main/examples/mistral-llama-comparison
After reading this post and browsing the examples, my “shopping list” was the following. I am running the LLMs locally with the help of Ollama, and the model I am using is Mistral.
- Add providers via `id` to be able to test different models (mistral-openorca, mistral, openhermes, phi3, zephyr)
- Externalize the prompts and externalize the contents (articles)
- Run a test on the JSON output produced by the LLM
- Run a test using Python to qualify the quality of the result from a relevance point of view (see the sketch right after this list)
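For that last item, the python-assert-external example shows the pattern: the assertion points to a Python file that receives the LLM output and returns a pass/fail verdict. Below is a minimal sketch of such a file; the file name, field names and thresholds are my own assumptions, reusing the JSON structure sketched earlier.

```python
# assert_relevance.py - hypothetical external Python assertion for promptfoo
# promptfoo calls get_assert(output, context) and accepts a bool, a float, or a GradingResult-like dict.
import json


def get_assert(output: str, context) -> dict:
    """Check that the generated JSON is usable: a 2-3 sentence summary and exactly five keywords."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {"pass": False, "score": 0.0, "reason": "Output is not valid JSON"}

    summary = data.get("summary", "")
    keywords = data.get("keywords", [])

    # Naive sentence count: split on end-of-sentence punctuation
    sentence_count = len(
        [s for s in summary.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    )
    checks = {
        "summary_2_to_3_sentences": 2 <= sentence_count <= 3,
        "exactly_five_keywords": len(keywords) == 5,
    }

    passed = all(checks.values())
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": ", ".join(f"{name}={ok}" for name, ok in checks.items()),
    }
```

In the configuration, this file would be wired in with an assertion of type `python` whose `value` points to `file://assert_relevance.py`.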
Below is an extract showing how to set the `providers` in the `promptfooconfig.yaml` file:
```yaml
providers:
  - id: openrouter:mistralai/mistral-7b-instruct
    config:
      temperature: 0.5
  - id: openrouter:mistralai/mixtral-8x7b-instruct
    config:
      temperature: 0.5
  - id: openrouter:meta-llama/llama-3.1-8b-instruct
    config:
      temperature: 0.5
```
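Since my POC runs against a local Ollama instance rather than OpenRouter, the `providers` block I actually use follows the `ollama:<model>:<tag>` pattern used later in the walkthrough; the exact model tags below are assumptions based on the models pulled locally.

```yaml
# promptfooconfig.yaml (extract) - providers pointing to local Ollama models
providers:
  - id: ollama:mistral:latest
  - id: ollama:mistral-openorca:latest
  - id: ollama:phi3:latest
```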
The walkthrough
```bash
# create the dir
mkdir 001_promptfoo_running

# path
cd /Users/brunoflaven/Documents/01_work/blog_articles/ia_testing_llm/001_promptfoo_running/

# install promptfoo
npm install -g promptfoo@latest

# change providers in promptfooconfig.yaml
# - ollama:mistral:latest

# launch eval
npx promptfoo eval

# launch eval with debug logging
LOG_LEVEL=debug npx promptfoo eval

# view result
npx promptfoo view

# uninstall
npm uninstall -g promptfoo

# gain disk space
npm cache clean --force
```
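To give an idea of how the pieces fit together, here is the skeleton of the `promptfooconfig.yaml` used in this walkthrough; the paths and file names are illustrative, the real files are in the GitHub repository mentioned above.

```yaml
# promptfooconfig.yaml - illustrative skeleton (paths and names are assumptions)
description: "Check the form (JSON) and the content (summary, keywords) of the LLM output"

prompts:
  - file://prompts/summarize_article.txt   # externalized prompt

providers:
  - id: ollama:mistral:latest

tests:
  - vars:
      article: file://articles/article_1.txt   # externalized content
    assert:
      - type: is-json
      - type: python
        value: file://assert_relevance.py
```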
Conclusion
Integrating Promptfoo into your development workflow or into your CI/CD pipeline is probably the only way to enhance the quality and reliability of the LLM output.
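As an illustration of the CI/CD part, a pipeline job can simply install promptfoo and run the evals, failing the build when assertions fail. Here is a sketch of a GitHub Actions workflow; it is my own assumption of a setup, not an official promptfoo recipe, and it presumes the configured LLM provider is reachable from the CI runner (my local Ollama setup would need a self-hosted runner or a hosted endpoint instead).

```yaml
# .github/workflows/llm-evals.yml - hypothetical CI job running the promptfoo evals
name: llm-evals
on: [pull_request]

jobs:
  promptfoo-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run promptfoo evals
        # reads promptfooconfig.yaml at the repository root
        run: npx promptfoo@latest eval
```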
Below are the praises sung by promptfoo’s creators themselves:
- Developer friendly: promptfoo is fast, with quality-of-life features like live reloads and caching.
- Battle-tested: Originally built for LLM apps serving over 10 million users in production. Our tooling is flexible and can be adapted to many setups.
- Simple, declarative test cases: Define evals without writing code or working with heavy notebooks.
- Language agnostic: Use Python, Javascript, or any other language.
- Share & collaborate: Built-in share functionality & web viewer for working with teammates.
- Open-source: LLM evals are a commodity and should be served by 100% open-source projects with no strings attached.
- Private: This software runs completely locally. The evals run on your machine and talk directly with the LLM.
Source: https://www.promptfoo.dev/docs/intro/
Extra: Using NotebookLM
This post is also an experiment to test NotebookLM. So, here is this regular blog post, “Promptfoo: The Ultimate Tool for Ensuring LLM Quality and Reliability”, converted into a podcast using NotebookLM.
NotebookLM gives you a personalized AI collaborator that helps you do your best thinking. After uploading your documents, NotebookLM becomes an instant expert in those sources so you can read, take notes, and collaborate with it to refine and organize your ideas.
More on: https://support.google.com/notebooklm/#topic=14287611
Videos accompanying this post
Promptfoo: The Ultimate Tool for Ensuring LLM Quality and Reliability (Part 1)
Promptfoo: The Ultimate Tool for Ensuring LLM Quality and Reliability (Part 2)
More info
- Testing Language Models (and Prompts) Like We Test Software | by Marco Tulio Ribeiro | Towards Data Science: https://towardsdatascience.com/testing-large-language-models-like-we-test-software-92745d28a359
- GitHub – aws-samples/llm-based-advanced-summarization: https://github.com/aws-samples/llm-based-advanced-summarization/tree/main
- Prompt Engineering Testing Strategies with Python | Shiro: https://openshiro.com/articles/prompt-engineering-testing-strategies-with-python
- GitHub – duncantmiller/llm_prompt_engineering: Prompt engineering testing strategies, using the OpenAI API: https://github.com/duncantmiller/llm_prompt_engineering/tree/main
- Generative AI Evaluation with Promptfoo: A Comprehensive Guide | by Yuki Nagae | Medium: https://medium.com/@yukinagae/generative-ai-evaluation-with-promptfoo-a-comprehensive-guide-e23ea95c1bb7
- promptfoo/examples/assistant-cli · promptfoo/promptfoo · GitHub: https://github.com/promptfoo/promptfoo/tree/0665ec88d58369ede2d0615c24c1f023b7fafa9b/examples/assistant-cli
- Getting started | promptfoo: https://www.promptfoo.dev/docs/getting-started/
- Model-graded metrics | promptfoo: https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/
- Solving the “Punycode Module is Deprecated” Issue in Node.js | by Asimabas | Medium: https://medium.com/@asimabas96/solving-the-punycode-module-is-deprecated-issue-in-node-js-93437637948a
- Semantic Tagging: Create Meaningful Tags for your Text Data | by Gabriele Sgroi, PhD | Towards AI: https://pub.towardsai.net/semantic-tagging-create-meaningful-tags-for-your-text-data-dcf8d2f24960
- How to use Large Language Models to tag your data: A complete tutorial | by Research Graph | Medium: https://medium.com/@researchgraph/how-to-use-large-language-models-to-tag-your-data-a-complete-tutorial-4a3647ae0f05
- Welcome To Instructor – Instructor: https://jxnl.github.io/instructor/
- Ollama – Instructor: https://jxnl.github.io/instructor/examples/ollama/#ollama
- GitHub – guidance-ai/guidance: A guidance language for controlling large language models: https://github.com/guidance-ai/guidance
- Evaluate AI/LLM Performance with Effective Test Prompts: https://writingmate.ai/blog/ai-llm-perfomance-testing
- GitHub – promptfoo/promptfoo: Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs: https://github.com/promptfoo/promptfoo
- LLM evaluation techniques for JSON outputs | promptfoo: https://www.promptfoo.dev/docs/guides/evaluate-json/
- Test Driven PROMPT Engineering: Using Promptfoo to COMPARE Prompts, LLMs, and Providers – YouTube: https://www.youtube.com/watch?v=KhINc5XwhKs
- GitHub – disler/llm-prompt-testing-quick-start: LLM Prompt Testing Quick Start: https://github.com/disler/llm-prompt-testing-quick-start
- Phi vs Llama: Benchmark on your own data | promptfoo: https://www.promptfoo.dev/docs/guides/phi-vs-llama/
- Enhancing JSON Output with Large Language Models: A Comprehensive Guide | by Dina Berenbaum | Medium: https://medium.com/@dinber19/enhancing-json-output-with-large-language-models-a-comprehensive-guide-f1935aa724fb
- How to Get Only JSON response from Any LLM Using LangChain | by Harshit Dubey | Medium: https://medium.com/@harshitdy/how-to-get-only-json-response-from-any-llm-using-langchain-ed53bc2df50f
- LLM Evaluation: Comparing Four Methods to Automatically Detect Errors | Label Studio: https://labelstud.io/blog/llm-evaluation-comparing-four-methods-to-automatically-detect-errors/
- Your AI Product Needs Evals – Hamel’s Blog: https://hamel.dev/blog/posts/evals/
- GitHub – confident-ai/deepeval: The LLM Evaluation Framework: https://github.com/confident-ai/deepeval
- LLM Observability & Application Tracing (open source) – Langfuse: https://langfuse.com/docs/tracing
- Google Colab (Langfuse/Ollama integration cookbook): https://colab.research.google.com/github/langfuse/langfuse-docs/blob/main/cookbook/integration_ollama.ipynb
- llm-evaluation · GitHub Topics: https://github.com/topics/llm-evaluation
- prompt-testing · GitHub Topics: https://github.com/topics/prompt-testing
- Collecting user feedback on ML in Streamlit: https://blog.streamlit.io/collecting-user-feedback-on-ml-in-streamlit/
- Trubrics: https://www.trubrics.com/#pricing
- How to capture the feedback effectively – Using Streamlit – Streamlit forum: https://discuss.streamlit.io/t/how-to-capture-the-feedback-effectively/60138/3
- Collect user feedback on AI models from your Streamlit app – YouTube: https://www.youtube.com/watch?v=2Qt54qGwIdQ
- Getting Started | LMQL: https://lmql.ai/docs/