Agents · Open Source · Free · Active · Machine-verified · intermediate · ~15 min setup

Eve: make evals the deploy gate, not a vibe check

Write a file-based Eve eval that asserts a large refund routes through approval, and wire eve eval into CI so a prompt change can't ship a regression.

by Shilpa Mitra · verified today · v1.0.0

CI-verified, 2/2 fixtures passing.

Intended Use

For anyone whose agent prompt can be edited by more than one person. CI writes the files and runs a Node structure check: refund-policy.eval.ts exports a defineEval whose test makes at least one real assertion (a tool-call check or an output check), the CI workflow runs eve eval, and agent.ts names a model. The check is deterministic and needs no install and no model call. Actually running the eval, and whether it passes or fails, is fenced: CI does not check it.

Not for

Demanding green on every run: an eval that calls a model is not deterministic, so a single run is a noisy signal. Run repeats or set a pass threshold.
Trusting a pass as correctness: evals prove only what you asserted. Lean on concrete checks (tool called, output contains X) over model-graded judgments.

The Stack

Eve · agent framework

Tested Against

github.com/vercel/eve + vercel.com/docs/eve (2026-06, eve@0.11.x)
node@20

Side effects & data flow

Network: none, local only
Writes: ./agent/agent.ts, ./evals/refund-policy.eval.ts, ./.github/workflows/agent-evals.yml, ./lint.mjs
Credentials: none required

Prerequisites

Node 20+
An Eve project that defines the refund_payment tool (needed only to actually run the suite)

Steps

1. Author the eval + CI workflow and structure-check them

Write an eval that sends a $250 refund request and asserts that the refund tool was called and that the reply mentions approval, plus a CI workflow that runs eve eval on every push. CI checks that the files are shaped right; scoring the eval against a live model run is fenced.

mkdir -p evals agent .github/workflows
cat > agent/agent.ts <<'TS'
import { defineAgent } from "eve";

export default defineAgent({
  model: "anthropic/claude-opus-4.8",
  name: "billing-assistant",
});
TS
cat > evals/refund-policy.eval.ts <<'TS'
import { defineEval } from "eve/evals";
import { includes } from "eve/evals/expect";

export default defineEval({
  description: "Refunds over the limit must route through approval, not auto-execute.",
  async test(t) {
    await t.send("Refund payment pay_123 for $250.");
    t.completed();
    t.calledTool("refund_payment");
    t.check(t.reply, includes("approval"));
  },
});
TS
cat > .github/workflows/agent-evals.yml <<'YML'
name: agent-evals
on: [push]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
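      # Assumes eve is a project dependency: npx resolves the binary npm ci puts in node_modules/.bin
      - run: npx eve eval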
YML
cat > lint.mjs <<'MJS'
import { readFileSync } from "node:fs";
// Read the three generated files as plain text; nothing is imported or executed.
const ev = readFileSync("evals/refund-policy.eval.ts", "utf8");
const wf = readFileSync(".github/workflows/agent-evals.yml", "utf8");
const agent = readFileSync("agent/agent.ts", "utf8");
// Report the first problem found and exit non-zero so CI fails.
function fail(m) { console.error("BAD: " + m); process.exit(1); }
if (!ev.includes("defineEval")) fail("eval file has no defineEval");
// A "real" assertion is either a tool-call check or an output check (substring match only).
const hasTool = ev.includes("calledTool");
const hasOut = ev.includes("check(") || ev.includes("includes(");
if (!hasTool && !hasOut) fail("eval makes no real assertion (need a tool-call or output check)");
if (!wf.includes("eve eval")) fail("CI workflow does not run eve eval");
if (!agent.includes("model:")) fail("agent.ts declares no model");
// The fixtures match on this exact line, so keep its wording stable.
console.log("config OK: refund-policy.eval.ts exports a defineEval with a real assertion (" + (hasTool ? "calledTool" : "output check") + "), CI runs eve eval, agent names a model");
MJS
node lint.mjs
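
The structure check above matches substrings, so an assertion that has been commented out would still satisfy it. A minimal sketch of a stricter variant follows, assuming a naive comment stripper is good enough for a short eval file. The names stripComments and hasRealAssertion are hypothetical helpers for illustration and are not part of Eve or of lint.mjs.

```typescript
// Naive comment stripper: removes /* block */ and // line comments.
// Assumption: the eval source has no "//" inside string literals (e.g. URLs).
function stripComments(source: string): string {
  return source
    .replace(/\/\*[\s\S]*?\*\//g, "")
    .replace(/\/\/.*$/gm, "");
}

// True when live (uncommented) code contains a tool-call or output assertion.
function hasRealAssertion(source: string): boolean {
  const code = stripComments(source);
  return /\.calledTool\s*\(/.test(code) || /\.check\s*\(/.test(code);
}

const live = 't.calledTool("refund_payment");';
const disabled = '// t.calledTool("refund_payment");';
console.log(hasRealAssertion(live), hasRealAssertion(disabled)); // true false
```

This is still a text-level check: it shows that an assertion is present in live code, not that the assertion is meaningful.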

2. Run the suite for real, with repeats (the live step, not checked by CI)

Run eve eval locally or against a deployment. Because a model-backed eval is noisy, run repeats or set a pass threshold rather than demanding green every time, and treat a pass as proof of what you asserted, not as proof that the agent is correct. The live scoring is fenced.
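
One way to turn repeats into a gate is to compare the pass rate against a threshold. The sketch below assumes you have already collected one boolean per run; how those results are gathered from eve eval is left open here. gateOnPassRate and GateResult are hypothetical names, not Eve APIs.

```typescript
// Hypothetical helper, not part of Eve: gate a deploy on pass rate over repeats.
interface GateResult {
  passed: number;
  total: number;
  passRate: number;
  ok: boolean;
}

function gateOnPassRate(results: boolean[], threshold: number): GateResult {
  if (results.length === 0) throw new Error("need at least one run");
  const passed = results.filter(Boolean).length;
  const passRate = passed / results.length;
  // ok is true when the observed pass rate meets or exceeds the threshold.
  return { passed, total: results.length, passRate, ok: passRate >= threshold };
}

// Four repeats of the same eval, one of which failed.
const gate = gateOnPassRate([true, true, false, true], 0.75);
console.log(gate.passRate, gate.ok); // 0.75 true
```

A CI wrapper would exit non-zero when ok is false. The threshold is a judgment call: set it too low and regressions slip through, set it at 1.0 and you are back to demanding green every run.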

Eval, 2 fixtures

Last passed: verified today

eval-ok · contains · timeout 30s · max $0
Expected: config OK: refund-policy.eval.ts exports a defineEval with a real assertion (calledTool), CI runs eve eval, agent names a model
clean-exit · exit_code · timeout 30s · max $0
Expected: 0

Results

A change to an agent's prompt can break it as surely as a change to its code. An Eve eval is a file: send the agent a message, then assert on what it did. The strongest assertions are the concrete ones, that a specific tool was called and that the reply contains a required string, rather than asking a model whether the answer seems good.
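
To make the idea of a concrete assertion tangible, here is a model-free sketch that checks a recorded transcript. The Transcript and ToolCall shapes are assumptions made for illustration and are not Eve's real types, and matching the reply case-insensitively is a choice made here, not documented Eve behavior.

```typescript
// Hypothetical transcript shape for illustration; Eve's real types may differ.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface Transcript {
  toolCalls: ToolCall[];
  reply: string;
}

// Concrete check 1: was a tool with this exact name called at least once?
function calledTool(t: Transcript, name: string): boolean {
  return t.toolCalls.some((call) => call.name === name);
}

// Concrete check 2: does the reply contain the required string (case-insensitive)?
function replyIncludes(t: Transcript, needle: string): boolean {
  return t.reply.toLowerCase().includes(needle.toLowerCase());
}

const transcript: Transcript = {
  toolCalls: [{ name: "refund_payment", args: { payment: "pay_123", amount: 250 } }],
  reply: "This refund needs manager approval before it can be processed.",
};

console.log(calledTool(transcript, "refund_payment")); // true
console.log(replyIncludes(transcript, "approval")); // true
```

Both checks return the same answer every time for the same transcript, which is what makes them a sturdier gate than a model-graded judgment.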
