Pick a model with evidence: a GitHub Models bake-off that fits the free cap
Run the few prompts that actually matter across several models on GitHub Models' free tier, then keep the winner, with the daily call budget proven to fit before you start.
CI-verified, 2/2 fixtures passing.
Intended Use
For anyone choosing between models for a specific task. CI validates only the bake-off plan: at least two prompts and two models, a stated selection criterion, and a total call count (prompts × models) that stays within the free daily cap. CI needs no key and makes no model call. The actual runs and the scoring are fenced (not checked by CI).
Not for
- Production traffic: the free tier allows roughly 10 requests/min and 50 requests/day on top models, so this is for deciding, not serving
- Self-grading blind spots: if you score with an LLM judge, it shares the judged model's weaknesses, so prefer a concrete metric where you can
- Big contexts: each free request is capped near 8k tokens in and 4k out, so keep bake-off prompts compact
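The per-minute and per-day figures above imply a minimum spacing between calls and a minimum run time. A minimal sketch of that arithmetic, assuming the approximate limits quoted above (10 requests/min, 50 requests/day); `planPacing` is a hypothetical helper, not part of GitHub Models:

```javascript
// Hypothetical helper: derive call spacing and minimum run time
// from the approximate free-tier limits quoted in this recipe.
function planPacing(calls, perMinute = 10, perDay = 50) {
  if (calls > perDay) {
    throw new Error("plan needs " + calls + " calls but the daily limit is " + perDay);
  }
  // Even spacing that matches the per-minute rate.
  const gapMs = Math.ceil(60000 / perMinute);
  // The first call fires immediately, so only the gaps between calls count.
  const minDurationMs = (calls - 1) * gapMs;
  return { gapMs, minDurationMs };
}

const pacing = planPacing(5 * 3); // 5 prompts x 3 models, as in the example plan
console.log(pacing); // { gapMs: 6000, minDurationMs: 84000 }
```

At these assumed limits, the 15-call example plan needs about a minute and a half of wall-clock time if calls are spaced evenly.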
The Stack
Tested Against
GitHub Models free tier (2026-06)
node@20
Side effects & data flow
- Network: none, local only
- Writes: ./bakeoff.json
- Credentials: none required
Prerequisites
- A GitHub account (the only sign-in needed to actually run the bake-off)
Steps
- 1
Write the bake-off plan and prove it fits the free cap
List the prompts that actually matter, the models to compare, and how you will pick the winner. CI checks that the plan is a real comparison (at least two prompts and two models) and that prompts times models stays within the free daily cap, so you do not run out of requests halfway through. Running the prompts needs your GitHub login and is fenced.
cat > bakeoff.json <<'JSON'
{
  "prompts": ["summarize ticket", "extract fields", "classify intent", "draft reply", "rate sentiment"],
  "models": ["gpt-5.5", "deepseek-r1", "llama-4-maverick"],
  "select_by": "win_rate",
  "free_daily_cap": 50
}
JSON
node -e '
const fs = require("fs");
const c = JSON.parse(fs.readFileSync("bakeoff.json", "utf8"));
function bad(m) { console.error("BAD: " + m); process.exit(1); }
const P = (c.prompts || []).length;
const M = (c.models || []).length;
if (P < 2) bad("need at least 2 prompts to compare meaningfully");
if (M < 2) bad("need at least 2 models for a bake-off");
if (!c.select_by) bad("no selection criterion (how do you pick the winner?)");
const cap = c.free_daily_cap || 0;
const calls = P * M;
if (cap && calls > cap) bad("bake-off needs " + calls + " calls but the free cap is " + cap + "/day");
console.log("config OK: bake-off of " + P + " prompt(s) x " + M + " model(s) = " + calls + " call(s), within the " + cap + "/day free cap; pick by " + c.select_by);
'
- 2
Run the prompts and keep the winner (the model step, not checked by CI)
Point the same prompts at each model through GitHub Models, score them by your chosen metric, and keep the winner. Then move that workload to whichever provider hosts your pick. The runs and scoring are fenced.
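Once each model's answer to each prompt has a numeric score, picking by win rate is a small tally. The sketch below is one possible reading of `win_rate`, not the recipe's own scoring code: it assumes higher scores are better, counts a tie as a win for every tied model, and uses placeholder model names and scores that are illustrative only, not measured results.

```javascript
// Placeholder scores (illustrative only): one row per prompt,
// one numeric score per model, higher is better.
const placeholderScores = {
  "prompt-1": { "model-a": 3, "model-b": 5 },
  "prompt-2": { "model-a": 4, "model-b": 2 },
  "prompt-3": { "model-a": 1, "model-b": 4 },
};

// Fraction of prompts on which each model has the top score.
function winRates(scoresByPrompt) {
  const wins = {};
  const prompts = Object.keys(scoresByPrompt);
  for (const prompt of prompts) {
    const row = scoresByPrompt[prompt];
    const best = Math.max(...Object.values(row));
    for (const [model, score] of Object.entries(row)) {
      wins[model] = (wins[model] || 0) + (score === best ? 1 : 0);
    }
  }
  const rates = {};
  for (const model of Object.keys(wins)) {
    rates[model] = wins[model] / prompts.length;
  }
  return rates;
}

// Model with the highest win rate (first one listed wins an exact tie).
function pickWinner(rates) {
  return Object.entries(rates).sort((a, b) => b[1] - a[1])[0][0];
}

console.log(pickWinner(winRates(placeholderScores))); // model-b
```

Where a concrete metric exists (exact-match on extracted fields, for example), it can supply the per-prompt scores directly, with no LLM judge involved.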
Eval, 2 fixtures
Last passed: verified today
- bakeoff-ok (contains), timeout 30s · max $0
  Expected: config OK: bake-off of 5 prompt(s) x 3 model(s) = 15 call(s), within the 50/day free cap; pick by win_rate
- clean-exit (exit_code), timeout 30s · max $0
  Expected: 0
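The two fixtures above amount to a substring check on the script's output and a check on its exit code. The site's actual CI harness is not shown here, so the following is only a sketch under assumptions: the fixture object shape and the `checkFixture` helper are hypothetical, and `run` is a stand-in for a captured run of step 1.

```javascript
// Hypothetical fixture checker, modeled on the two fixture kinds listed above.
function checkFixture(fixture, run) {
  if (fixture.kind === "contains") {
    return run.stdout.includes(fixture.expected);
  }
  if (fixture.kind === "exit_code") {
    return run.exitCode === Number(fixture.expected);
  }
  throw new Error("unknown fixture kind: " + fixture.kind);
}

const expectedLine =
  "config OK: bake-off of 5 prompt(s) x 3 model(s) = 15 call(s), " +
  "within the 50/day free cap; pick by win_rate";

// Stand-in for the captured stdout and exit code of the step 1 script.
const run = { stdout: expectedLine + "\n", exitCode: 0 };

const fixtures = [
  { name: "bakeoff-ok", kind: "contains", expected: expectedLine },
  { name: "clean-exit", kind: "exit_code", expected: "0" },
];

for (const f of fixtures) {
  console.log(f.name, checkFixture(f, run) ? "pass" : "fail");
}
```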
Results
GitHub Models exposes 45+ models (including frontier ones) behind your existing GitHub login. The free cap is small, about 50 requests/day on top models, which is too tight for production but exactly enough for a careful one-afternoon comparison: a handful of prompts times a few models. Decide with evidence, then move the chosen workload to whichever provider hosts it.
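To see how much room "a handful of prompts times a few models" leaves, the budget arithmetic can be worked through directly. A minimal sketch, assuming the roughly 50 requests/day figure quoted above; `budgetHeadroom` is a hypothetical helper:

```javascript
// Hypothetical helper: how much of an assumed daily cap a bake-off plan uses.
function budgetHeadroom(prompts, models, dailyCap = 50) {
  const calls = prompts * models;
  return {
    calls,                                          // one call per prompt-model pair
    remaining: dailyCap - calls,                    // requests left after one full run
    maxModels: Math.floor(dailyCap / prompts),      // most models this prompt set allows
    fullRunsPerDay: Math.floor(dailyCap / calls),   // complete repeats that fit in a day
  };
}

console.log(budgetHeadroom(5, 3));
// { calls: 15, remaining: 35, maxModels: 10, fullRunsPerDay: 3 }
```

With the example plan, one pass uses 15 of the 50 daily requests, which leaves room to rerun prompts that failed or to add a model before the cap is reached.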