Eve: make evals the deploy gate, not a vibe check
Write a file-based Eve eval that asserts a large refund routes through approval, and wire eve eval into CI so a prompt change can't ship a regression.
Run this workflow
CI-verified, 2/2 fixtures passing.
Build this with your agent
One copy-paste hands Claude Code, Codex, or Cursor the full recipe, steps included, nothing to fetch.
Intended Use
Anyone whose agent prompt can be edited by more than one person. CI writes the files and runs a node structure check: refund-policy.eval.ts exports a defineEval whose test makes at least one real assertion (a tool-call check or an output check), the CI workflow runs eve eval, and agent.ts names a model. Deterministic, no install, no model call. The eval actually running and its pass or fail are fenced.
Not for
- Demanding green every run, an eval that calls a model is not deterministic, so a single run is a noisy signal; run repeats or set a pass threshold
- Trusting a pass as correctness, evals prove only what you asserted; lean on concrete checks (tool called, output contains X) over model-graded judgments
The Stack
Tested Against
github.com/vercel/eve + vercel.com/docs/eve (2026-06, eve@0.11.x)node@20Side effects & data flow
- Network
- none, local only
- Writes
- ./agent/agent.ts, ./evals/refund-policy.eval.ts, ./.github/workflows/agent-evals.yml, ./lint.mjs
- Credentials
- none required
Prerequisites
- Node 20+
- An Eve project with the refund tool to actually run the suite
Steps
- 1
Author the eval + CI workflow and structure-check them
Write an eval that sends a $250 refund and asserts the refund tool was called and the reply mentions approval, plus a CI workflow that runs eve eval on every push. CI checks the files are shaped right; the eval scoring a live model run is fenced.
mkdir -p evals agent .github/workflows cat > agent/agent.ts <<'TS' import { defineAgent } from "eve"; export default defineAgent({ model: "anthropic/claude-opus-4.8", name: "billing-assistant", }); TS cat > evals/refund-policy.eval.ts <<'TS' import { defineEval } from "eve/evals"; import { includes } from "eve/evals/expect"; export default defineEval({ description: "Refunds over the limit must route through approval, not auto-execute.", async test(t) { await t.send("Refund payment pay_123 for $250."); t.completed(); t.calledTool("refund_payment"); t.check(t.reply, includes("approval")); }, }); TS cat > .github/workflows/agent-evals.yml <<'YML' name: agent-evals on: [push] jobs: evals: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: actions/setup-node@v4 with: node-version: 20 - run: npm ci - run: eve eval YML cat > lint.mjs <<'MJS' import { readFileSync } from "node:fs"; const ev = readFileSync("evals/refund-policy.eval.ts", "utf8"); const wf = readFileSync(".github/workflows/agent-evals.yml", "utf8"); const agent = readFileSync("agent/agent.ts", "utf8"); function fail(m) { console.error("BAD: " + m); process.exit(1); } if (!ev.includes("defineEval")) fail("eval file has no defineEval"); const hasTool = ev.includes("calledTool"); const hasOut = ev.includes("check(") || ev.includes("includes("); if (!hasTool && !hasOut) fail("eval makes no real assertion (need a tool-call or output check)"); if (!wf.includes("eve eval")) fail("CI workflow does not run eve eval"); if (!agent.includes("model:")) fail("agent.ts declares no model"); console.log("config OK: refund-policy.eval.ts exports a defineEval with a real assertion (" + (hasTool ? "calledTool" : "output check") + "), CI runs eve eval, agent names a model"); MJS node lint.mjs - 2
Run the suite for real, with repeats (the live step, not checked by CI)
Run eve eval locally or against a deployment. Because a model-backed eval is noisy, run repeats or set a pass threshold rather than demanding green every time, and treat a pass as proof of what you asserted, not that the agent is correct. The live scoring is fenced.
Eval, 2 fixtures
Last passed: verified todayeval-okcontainstimeout 30s · max $0Expected:
config OK: refund-policy.eval.ts exports a defineEval with a real assertion (calledTool), CI runs eve eval, agent names a modelclean-exitexit_codetimeout 30s · max $0Expected:
0
Results
A change to an agent's prompt can break it as surely as a change to its code. An Eve eval is a file: send the agent a message, then assert on what it did. The strong assertions are the hard ones, that a specific tool was called and the reply contains a required string, not asking a model whether the answer seems good.
Did this work for you?
Our CI checks the setup runs. You tell us if the whole thing worked. Tell us straight.
Liked this workflow?
Get new verified workflows in WebAfterAI, three issues a week (Tue, Thu, Sat).