Research · Commercial · Free · Active · Machine-verified · beginner · ~20 min setup

Pick a model with evidence: a GitHub Models bake-off that fits the free cap

Run the few prompts that actually matter across several models on GitHub Models' free tier, then keep the winner, with the daily call budget proven to fit before you start.

by Shilpa Mitra · verified today · v1.0.0

CI-verified, 2/2 fixtures passing.

Intended Use

For anyone choosing between models for a specific task. CI validates only the bake-off plan: at least two prompts and two models, a stated selection criterion, and a total call count (prompts × models) within the free daily cap. That check needs no key and makes no model call. The actual runs and the scoring are fenced, meaning CI does not execute them.

Not for

Production traffic: the free tier allows roughly 10 requests/min and 50 requests/day on top models, so this is for deciding, not serving.
Self-grading blind spots: if you score with an LLM judge, it shares the judged model's weaknesses, so prefer a concrete metric where you can.
Big contexts: each free request is capped near 8k tokens in and 4k out, so keep bake-off prompts compact.
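
The per-minute limit above implies a minimum spacing between calls. A minimal sketch of that arithmetic, assuming the approximate figures quoted here (10 requests/min, 50 requests/day); `pacingPlan` is a hypothetical helper, not part of GitHub Models:

```javascript
// Hypothetical helper: works out how far apart calls must be spaced to stay
// under a per-minute limit, and whether the run fits the daily cap.
// The default limits are the approximate figures quoted above, not guarantees.
function pacingPlan(calls, perMinute = 10, perDay = 50) {
  const delayMs = Math.ceil(60000 / perMinute); // minimum gap between calls
  return {
    fitsDailyCap: calls <= perDay,
    delayMs,
    // The first call starts at t=0, so only (calls - 1) gaps are needed.
    minDurationSec: (Math.max(calls, 1) - 1) * delayMs / 1000,
  };
}

console.log(pacingPlan(15));
// { fitsDailyCap: true, delayMs: 6000, minDurationSec: 84 }
```

For the 15-call plan used in this workflow, that works out to a 6-second gap and about a minute and a half of wall-clock time, which is why the per-minute limit rarely bites for a small bake-off.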

The Stack

GitHub Models: free multi-model access for comparison

Tested Against

GitHub Models free tier (2026-06) · node@20

Side effects & data flow

Network: none, local only
Writes: ./bakeoff.json
Credentials: none required

Prerequisites

A GitHub account (the only sign-in needed to actually run the bake-off)

Steps

1
Write the bake-off plan and prove it fits the free cap

List the prompts that actually matter, the models to compare, and how you will pick the winner. CI checks that the plan is a real comparison and that the total call count (prompts × models) stays within the free daily cap, so you do not get rate-limited halfway through. Running the prompts needs your GitHub login and is fenced.

cat > bakeoff.json <<'JSON'
{
  "prompts": ["summarize ticket", "extract fields", "classify intent", "draft reply", "rate sentiment"],
  "models": ["gpt-5.5", "deepseek-r1", "llama-4-maverick"],
  "select_by": "win_rate",
  "free_daily_cap": 50
}
JSON
node -e '
const fs = require("fs");
const c = JSON.parse(fs.readFileSync("bakeoff.json", "utf8"));
// Print the reason and exit non-zero so CI fails the plan.
function bad(m) { console.error("BAD: " + m); process.exit(1); }
const P = (c.prompts || []).length; // number of prompts
const M = (c.models || []).length;  // number of models
if (P < 2) bad("need at least 2 prompts to compare meaningfully");
if (M < 2) bad("need at least 2 models for a bake-off");
if (!c.select_by) bad("no selection criterion (how do you pick the winner?)");
// Require a positive cap; a missing or zero cap would otherwise skip the budget check silently.
const cap = Number(c.free_daily_cap);
if (!(cap > 0)) bad("free_daily_cap must be a positive number");
// One call per prompt per model.
const calls = P * M;
if (calls > cap) bad("bake-off needs " + calls + " calls but the free cap is " + cap + "/day");
console.log("config OK: bake-off of " + P + " prompt(s) x " + M + " model(s) = " + calls + " call(s), within the " + cap + "/day free cap; pick by " + c.select_by);
'
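
To see how the budget check behaves on a plan that does not fit, the same arithmetic can be sketched as a pure function. This is an illustrative restatement, not the CI script itself; `checkPlan` and the oversized plan are hypothetical, and the sketch additionally assumes the cap must be a positive number:

```javascript
// Hypothetical restatement of the plan check as a pure function.
// Returns a list of problems; an empty list means the plan passes.
function checkPlan(plan) {
  const problems = [];
  const prompts = Array.isArray(plan.prompts) ? plan.prompts.length : 0;
  const models = Array.isArray(plan.models) ? plan.models.length : 0;
  const calls = prompts * models; // one call per prompt per model
  if (prompts < 2) problems.push("need at least 2 prompts");
  if (models < 2) problems.push("need at least 2 models");
  if (!plan.select_by) problems.push("no selection criterion");
  if (!(plan.free_daily_cap > 0)) {
    problems.push("free_daily_cap must be positive");
  } else if (calls > plan.free_daily_cap) {
    problems.push(calls + " calls exceeds the cap of " + plan.free_daily_cap);
  }
  return problems;
}

// Made-up oversized plan: 10 prompts x 6 models = 60 calls against a cap of 50.
const tooBig = {
  prompts: Array.from({ length: 10 }, (_, i) => "prompt " + (i + 1)),
  models: ["a", "b", "c", "d", "e", "f"],
  select_by: "win_rate",
  free_daily_cap: 50,
};
console.log(checkPlan(tooBig)); // [ '60 calls exceeds the cap of 50' ]
```

Trimming that plan to 8 prompts (48 calls) or 5 models (50 calls) would bring it back within the cap.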

2
Run the prompts and keep the winner (the model step, not checked by CI)

Point the same prompts at each model through GitHub Models, score the outputs by your chosen metric, and keep the winner. Then move that workload to whichever provider hosts your pick. The runs and scoring are fenced, so CI does not execute them.
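
One way to turn per-prompt scores into the `win_rate` criterion, sketched under assumptions: the `scores` table, its model names, and its numbers are made up for illustration, a win means the strictly highest score on a prompt, and a tie for first place awards no win. Your own metric and tie rule may differ.

```javascript
// Hypothetical scoring helper: scores[prompt][model] is a number where
// higher is better (for example, exact-match accuracy on extracted fields).
function winRates(scores) {
  const prompts = Object.keys(scores);
  const wins = {};
  for (const prompt of prompts) {
    const entries = Object.entries(scores[prompt]);
    for (const [model] of entries) wins[model] = wins[model] || 0;
    const best = Math.max(...entries.map(([, s]) => s));
    const leaders = entries.filter(([, s]) => s === best);
    // Assumption: a tie for first place counts as a win for nobody.
    if (leaders.length === 1) wins[leaders[0][0]] += 1;
  }
  const rates = {};
  for (const model of Object.keys(wins)) rates[model] = wins[model] / prompts.length;
  return rates;
}

// Made-up scores, for illustration only.
const scores = {
  "summarize ticket": { "model-a": 0.8, "model-b": 0.6 },
  "extract fields": { "model-a": 0.9, "model-b": 0.95 },
  "classify intent": { "model-a": 1.0, "model-b": 1.0 },
  "draft reply": { "model-a": 0.7, "model-b": 0.5 },
};
const rates = winRates(scores);
const winner = Object.keys(rates).sort((x, y) => rates[y] - rates[x])[0];
console.log(rates, winner); // { 'model-a': 0.5, 'model-b': 0.25 } model-a
```

Dividing by the number of prompts rather than the number of decided prompts keeps ties visible: here one of four prompts had no winner, so the rates sum to 0.75.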

Eval, 2 fixtures

Last passed: verified today

bakeoff-ok · contains · timeout 30s · max $0
Expected: config OK: bake-off of 5 prompt(s) x 3 model(s) = 15 call(s), within the 50/day free cap; pick by win_rate
clean-exit · exit_code · timeout 30s · max $0
Expected: 0

Results

GitHub Models exposes 45+ models (including frontier ones) behind your existing GitHub login. The free cap is small, about 50 requests/day on top models, which is too tight for production but exactly enough for a careful one-afternoon comparison: a handful of prompts times a few models. Decide with evidence, then move the chosen workload to whichever provider hosts it.
