promptfoo: make agent evals fail the build, not the user
Write a declarative promptfoo config with real assertions and wire promptfoo eval into CI, so a regression in prompt or agent behavior fails a check instead of reaching production.
Run this workflow
CI-verified, 2/2 fixtures passing.
Intended Use
Anyone shipping a prompt or agent that more than one person can change. CI parses promptfooconfig.yaml and asserts that it has providers, that its tests carry assertions, and that the CI workflow runs promptfoo eval. No keys, no model call. Whether the eval actually runs, and whether it passes, is fenced: CI does not check it.
Not for
- Leaning on LLM-as-judge alone: a judge shares the blind spots of the model doing the judging. Prefer an objective assert (contains/equals) wherever you can write one.
- Building deep on it without a glance at the roadmap: OpenAI announced it is acquiring promptfoo (March 2026); it stays MIT-licensed for now.
The Stack
Tested Against
- promptfoo docs (2026-06)
- ruby@3.x (YAML stdlib)
Side effects & data flow
- Network: none, local only
- Writes: ./promptfooconfig.yaml, ./.github/workflows/evals.yml
- Credentials: none required
Prerequisites
- npx promptfoo + provider keys (only to actually run the evals)
Steps
- 1
Author the eval config + CI workflow and validate them
Write promptfooconfig.yaml (providers, prompts, tests with assertions) and a CI workflow that runs promptfoo eval on every push. CI parses the YAML and asserts that providers exist, that tests carry assertions, and that the workflow runs the evals. Scoring a live model is fenced: CI does not check it.
mkdir -p .github/workflows

cat > promptfooconfig.yaml <<'YAML'
description: "Agent behavior eval"
providers:
  - anthropic:messages:claude-opus-4.8
prompts:
  - "Answer concisely: {{q}}"
tests:
  - vars: { q: "What is 2 + 2?" }
    assert:
      - type: contains
        value: "4"
  - vars: { q: "What is the capital of France?" }
    assert:
      - type: icontains
        value: "Paris"
YAML

cat > .github/workflows/evals.yml <<'YML'
name: evals
on: [push]
jobs:
  promptfoo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo@latest eval -c promptfooconfig.yaml
YML

ruby -ryaml -e '
c = YAML.load_file("promptfooconfig.yaml")
providers = c["providers"]
abort "BAD: no providers" unless providers.is_a?(Array) && providers.length >= 1
tests = c["tests"]
abort "BAD: no tests" unless tests.is_a?(Array) && tests.length >= 1
with_assert = tests.count { |t| t["assert"].is_a?(Array) && t["assert"].length >= 1 }
abort "BAD: tests carry no assertions" unless with_assert >= 1
wf = File.read(".github/workflows/evals.yml")
abort "BAD: CI does not run promptfoo eval" unless wf.include?("promptfoo") && wf.include?("eval")
puts "config OK: promptfoo config parses with " + providers.length.to_s + " provider(s) and " + with_assert.to_s + " test(s) carrying assertions; CI runs promptfoo eval"
'
- 2
Run the evals (the model step, not checked by CI)
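Set your provider key (ANTHROPIC_API_KEY for the Anthropic provider configured above) and run npx promptfoo eval locally or in CI. In GitHub Actions the key has to be exposed to the job, for example from a repository secret. Because model-backed evals are noisy, lean on concrete assertions and repeat the runs. The scoring is fenced: CI does not check it.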
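One way to act on "run repeats" is to gate on a pass rate rather than on a single run. The sketch below is not part of promptfoo; pass_rate, gate, and the 0.8 threshold are hypothetical names and values, and the outcomes array stands in for whatever per-run results you collect from repeated evals.

```ruby
# Hypothetical helpers: outcomes holds one boolean per repeated run of the
# same test (true = every assertion passed on that run).
def pass_rate(outcomes)
  return 0.0 if outcomes.empty?
  outcomes.count(true).to_f / outcomes.length
end

# Pass the gate only if the observed rate reaches min_rate (assumed default).
def gate(outcomes, min_rate: 0.8)
  pass_rate(outcomes) >= min_rate
end

runs = [true, true, false, true, true] # illustrative data, not real results
puts pass_rate(runs) # prints 0.8
puts gate(runs)      # prints true
```

A threshold below 1.0 tolerates occasional noise; a test that a concrete assertion should always satisfy is better held to 1.0.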
Eval, 2 fixtures
Last passed: verified today
- evals-ok (contains), timeout 30s · max $0
  Expected: config OK: promptfoo config parses with 1 provider(s) and 2 test(s) carrying assertions; CI runs promptfoo eval
- clean-exit (exit_code), timeout 30s · max $0
  Expected: 0
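The step 1 validator passes as soon as one test carries assertions and the words "promptfoo" and "eval" appear anywhere in the workflow file, including in a job or workflow name. A stricter variant might look like the sketch below. It is an illustration, not the check this page's CI runs: config_problems is a hypothetical helper, the sample YAML is inlined so the block runs without files, and the second test is deliberately missing its assert list.

```ruby
require "yaml"

# Hypothetical helper: lists problems in a parsed promptfoo config and a
# parsed GitHub Actions workflow. An empty list means the setup looks sound.
def config_problems(config, workflow)
  problems = []
  providers = config["providers"]
  problems << "no providers" unless providers.is_a?(Array) && !providers.empty?

  tests = config["tests"]
  if tests.is_a?(Array) && !tests.empty?
    # Require assertions on every test, not just on at least one.
    tests.each_with_index do |t, i|
      asserts = t["assert"]
      problems << "test #{i} has no assertions" unless asserts.is_a?(Array) && !asserts.empty?
    end
  else
    problems << "no tests"
  end

  # Inspect only `run:` commands, so a job merely named "promptfoo" does not count.
  steps = workflow.fetch("jobs", {}).values.flat_map { |job| job.fetch("steps", []) }
  runs = steps.map { |s| s["run"] }.compact
  unless runs.any? { |cmd| cmd.match?(/\bpromptfoo(@\S+)?\s+eval\b/) }
    problems << "CI does not run promptfoo eval"
  end
  problems
end

config = YAML.load(<<~CFG)
  providers:
    - anthropic:messages:claude-opus-4.8
  tests:
    - vars: { q: "What is 2 + 2?" }
      assert:
        - type: contains
          value: "4"
    - vars: { q: "Untested question" }
CFG

workflow = YAML.load(<<~WF)
  name: evals
  on: [push]
  jobs:
    promptfoo:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - run: npx promptfoo@latest eval -c promptfooconfig.yaml
WF

puts config_problems(config, workflow) # prints "test 1 has no assertions"
```

Returning a list instead of aborting on the first failure reports every gap in one CI run.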
Results
promptfoo runs declarative evals (and red-team probes for prompt injection, jailbreaks, PII leaks, tool misuse) in CI. A behavior regression becomes a failed check, not an incident. Prefer concrete assertions (output contains X) over LLM-as-judge, which shares the judge model's blind spots.
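To make "concrete" tangible, the two assertion types used in the config reduce to substring checks: contains is case-sensitive and icontains is not. The functions below are illustrative stand-ins written for this page, not promptfoo's implementation.

```ruby
# Stand-in for a case-sensitive `contains` assertion.
def contains?(output, value)
  output.include?(value)
end

# Stand-in for a case-insensitive `icontains` assertion.
def icontains?(output, value)
  output.downcase.include?(value.downcase)
end

puts contains?("2 + 2 = 4", "4")                   # prints true
puts icontains?("PARIS is the capital.", "Paris")  # prints true
puts contains?("paris", "Paris")                   # prints false
puts contains?("The answer is 14", "4")            # prints true
```

The last line shows the trade-off: a substring check is objective but loose, so a wrong answer that happens to include the expected text still passes. Choosing a more distinctive expected value, or an equals assertion where the output format allows it, narrows that gap.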