Agents · Open Source · Free · Active · Machine-verified · intermediate · ~15 min setup

promptfoo: make agent evals fail the build, not the user

Write a declarative promptfoo config with real assertions and wire promptfoo eval into CI, so a regression in prompt or agent behavior fails a check instead of reaching production.

by Shilpa Mitra · verified today · v1.0.0

CI-verified, 2/2 fixtures passing.

Intended Use

For anyone shipping a prompt or agent that more than one person can change. CI parses promptfooconfig.yaml and asserts that it defines providers and tests carrying assertions, and that the CI workflow runs promptfoo eval. This check needs no keys and makes no model call. Running the eval against a live model, and its pass/fail result, are fenced: they fall outside what CI verifies here.

Not for

  • Leaning on LLM-as-judge alone: it shares the blind spots of the judging model. Prefer an objective assert (contains/equals) wherever you can write one.
  • Building deep on it without a glance at the roadmap: OpenAI announced it is acquiring promptfoo (March 2026); it stays MIT-licensed for now.

The Stack

Tested Against

promptfoo docs (2026-06) · ruby@3.x (YAML stdlib)

Side effects & data flow

Network: none, local only
Writes: ./promptfooconfig.yaml, ./.github/workflows/evals.yml
Credentials: none required

Prerequisites

  • npx promptfoo + provider keys (only to actually run the evals)

Steps

  1. Author the eval config + CI workflow and validate them

    Write promptfooconfig.yaml (providers, prompts, tests with assertions) and a CI workflow that runs promptfoo eval on every push. The verification parses the YAML and asserts that providers and tests with assertions exist, and that the workflow runs the evals. Scoring a live model is fenced (not checked here).

    mkdir -p .github/workflows
    cat > promptfooconfig.yaml <<'YAML'
    description: "Agent behavior eval"
    providers:
      - anthropic:messages:claude-opus-4.8
    prompts:
      - "Answer concisely: {{q}}"
    tests:
      - vars: { q: "What is 2 + 2?" }
        assert:
          - type: contains
            value: "4"
      - vars: { q: "What is the capital of France?" }
        assert:
          - type: icontains
            value: "Paris"
    YAML
    cat > .github/workflows/evals.yml <<'YML'
    name: evals
    on: [push]
    jobs:
      promptfoo:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
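          # Assumes a repository secret named ANTHROPIC_API_KEY; the eval needs a provider key to call the model.
          - run: npx promptfoo@latest eval -c promptfooconfig.yaml
            env:
              ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}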
    YML
    ruby -ryaml -e '
    c = YAML.load_file("promptfooconfig.yaml")
    providers = c["providers"]
    abort "BAD: no providers" unless providers.is_a?(Array) && providers.length >= 1
    tests = c["tests"]
    abort "BAD: no tests" unless tests.is_a?(Array) && tests.length >= 1
    with_assert = tests.count { |t| t["assert"].is_a?(Array) && t["assert"].length >= 1 }
    abort "BAD: tests carry no assertions" unless with_assert >= 1
    wf = File.read(".github/workflows/evals.yml")
    # Require "promptfoo ... eval" as one command, so the workflow name "evals" alone cannot satisfy the check.
    abort "BAD: CI does not run promptfoo eval" unless wf.match?(/promptfoo\S*\s+eval\b/)
    puts "config OK: promptfoo config parses with " + providers.length.to_s + " provider(s) and " + with_assert.to_s + " test(s) carrying assertions; CI runs promptfoo eval"
    '
  2. Run the evals (the model step, not checked by CI)

    Set your provider keys and run npx promptfoo eval locally or in CI. Because model-backed evals are noisy, lean on concrete assertions and run each test several times before trusting a result. The scoring is fenced.
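One way to reason about noisy, repeated runs is to gate on a pass rate rather than on a single outcome. The sketch below is illustrative only: `passes_over_repeats?` and `min_pass_rate` are hypothetical names, not part of promptfoo, and the outcomes are assumed to be collected by you, one true/false per repeat of the same test.

```ruby
# Hypothetical helper (not a promptfoo API): decide whether a test passes
# given the outcomes of repeated runs. `outcomes` holds one true/false per repeat.
def passes_over_repeats?(outcomes, min_pass_rate: 1.0)
  raise ArgumentError, "no outcomes" if outcomes.empty?

  rate = outcomes.count(true).to_f / outcomes.length
  rate >= min_pass_rate
end

# Strict gate: a single failing repeat fails the test.
puts passes_over_repeats?([true, true, false])                      # false
# Lenient gate: 2 of 3 passing clears a 0.6 threshold.
puts passes_over_repeats?([true, true, false], min_pass_rate: 0.6)  # true
```

A strict default (1.0) suits objective assertions such as contains; a lower threshold is a judgment call that trades flakiness for sensitivity.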

Eval, 2 fixtures

Last passed: verified today
  • evals-ok · contains · timeout 30s · max $0

    Expected: config OK: promptfoo config parses with 1 provider(s) and 2 test(s) carrying assertions; CI runs promptfoo eval

  • clean-exit · exit_code · timeout 30s · max $0

    Expected: 0
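The validation in step 1 only requires that at least one test carries assertions, so a config can pass while some tests assert nothing. A stricter check might list every test that lacks a non-empty assert list. This is a minimal sketch under that assumption; `tests_missing_assertions` is a hypothetical helper, and the sample config (including the placeholder provider string) is made up for illustration.

```ruby
require "yaml"

# Hypothetical helper (not part of promptfoo): return the 0-based indexes of
# tests that have no non-empty `assert` list.
def tests_missing_assertions(config)
  tests = config["tests"]
  return [] unless tests.is_a?(Array)

  tests.each_index.reject do |i|
    t = tests[i]
    t.is_a?(Hash) && t["assert"].is_a?(Array) && !t["assert"].empty?
  end
end

# Illustrative config: the second test has no assertions.
SAMPLE = <<~YAML
  providers:
    - example:provider
  tests:
    - vars: { q: "What is 2 + 2?" }
      assert:
        - type: contains
          value: "4"
    - vars: { q: "No assertion here" }
YAML

missing = tests_missing_assertions(YAML.safe_load(SAMPLE))
puts "tests without assertions: #{missing.inspect}"  # prints [1]
```

To use it against the real file, pass `YAML.load_file("promptfooconfig.yaml")` and abort when the returned list is not empty.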

Results

promptfoo runs declarative evals (and red-team probes for prompt injection, jailbreaks, PII leaks, tool misuse) in CI. A behavior regression becomes a failed check, not an incident. The strongest assertions are concrete ones (output contains X); prefer them over LLM-as-judge, which shares the judge model's blind spots.
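To make "concrete" tangible, here is a simplified sketch of what the two assertion types used in the config check: contains is a case-sensitive substring match and icontains is its case-insensitive variant. These functions are stand-ins written for illustration, not promptfoo's implementation.

```ruby
# Simplified stand-ins for two assertion types; not promptfoo's own code.

# Case-sensitive substring check, in the spirit of `type: contains`.
def contains?(output, value)
  output.include?(value)
end

# Case-insensitive substring check, in the spirit of `type: icontains`.
def icontains?(output, value)
  output.downcase.include?(value.downcase)
end

output = "The capital of France is paris."
puts contains?(output, "Paris")   # false: the case differs
puts icontains?(output, "Paris")  # true
```

Either check gives the same verdict on the same output every time, which is the property a judge model cannot guarantee.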
