Serve MiniMax M3 yourself for agentic coding (vLLM)
Stand up MiniMax M3 on an 8x H200 node as an OpenAI-compatible endpoint and point any coding agent at it, using validated serve flags and endpoint config.
CI-verified, 2/2 fixtures passing.
Intended Use
For teams self-hosting M3 for privacy, throughput, or fine-tuning (not to save money over the API). CI validates that the serve command pins MiniMaxAI/MiniMax-M3 with the mandatory --block-size 128 and the minimax_m3 tool-call and reasoning parsers, and that the agent's OpenAI endpoint (base_url, key, model) is well-formed. CI uses no GPU and makes no model call: actually serving the model and coding against it are fenced off from CI.
Not for
- Single-GPU or homelab setups: this is server-class (8x H200 in BF16; the MXFP8 variant roughly halves VRAM)
- Assuming open weights means do-anything: M3 is under the MiniMax Community License, so read the terms before commercial use
The Stack
Tested Against
vllm recipe MiniMaxAI/MiniMax-M3 (2026-06)
node@20.x
Side effects & data flow
- Network: none, local only
- Writes: ./serve.sh, ./endpoint.json
- Credentials: none required
Prerequisites
- A multi-GPU node (8x H200) or a rented GPU instance, to actually serve it
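A back-of-envelope estimate of weight memory shows why this is server-class hardware. The sketch below assumes 2 bytes per parameter for BF16 and about 1 byte per parameter for MXFP8 (the small per-block scale overhead is ignored), and combines the 427B total-parameter figure quoted on this page with the H200's 141 GB of HBM. It counts weights only, so treat the result as a rough lower bound.

```javascript
// Back-of-envelope weight memory for a 427B-parameter model.
// Assumptions: BF16 = 2 bytes/param, MXFP8 ~ 1 byte/param (scale overhead ignored).
const TOTAL_PARAMS = 427e9; // total parameters, as quoted on this page
const H200_GB = 141;        // HBM capacity of one H200
const GPUS = 8;

function weightGB(params, bytesPerParam) {
  return (params * bytesPerParam) / 1e9;
}

const bf16GB = weightGB(TOTAL_PARAMS, 2);
const fp8GB = weightGB(TOTAL_PARAMS, 1);
const nodeGB = H200_GB * GPUS;

console.log(`BF16 weights:  ${bf16GB} GB`);
console.log(`MXFP8 weights: ${fp8GB} GB (approx.)`);
console.log(`Node HBM:      ${nodeGB} GB, headroom in BF16: ${nodeGB - bf16GB} GB`);
```

Only 26B parameters are active per token, but all of the weights still have to be resident, which is why the total count drives the sizing. Whatever headroom remains is what the KV cache and activations have to fit into.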
Steps
- 1
Write the serve command + agent endpoint, and validate them
M3 support is not in a stable vLLM release yet, so use the dedicated image. --block-size 128 is mandatory (it matches the MSA sparse-attention cache; the default of 16 fails to start). Any agent that accepts a custom OpenAI base URL can drive it. CI checks that the serve command pins the model id and required flags, and that the endpoint config is well-formed.
cat > serve.sh <<'SH'
docker pull vllm/vllm-openai:minimax-m3 && \
vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice
SH
cat > endpoint.json <<'JSON'
{ "base_url": "http://localhost:8000/v1", "api_key": "EMPTY", "model": "MiniMaxAI/MiniMax-M3" }
JSON
node -e '
const fs = require("fs");
const s = fs.readFileSync("serve.sh", "utf8");
const e = JSON.parse(fs.readFileSync("endpoint.json", "utf8"));
// Strings the serve command must contain.
const need = ["MiniMaxAI/MiniMax-M3", "--block-size 128", "--tool-call-parser minimax_m3", "--reasoning-parser minimax_m3"];
for (const t of need) {
  if (!s.includes(t)) { console.log("BAD: serve command missing " + t); process.exit(1); }
}
// Endpoint config checks: base URL, placeholder key, model id.
if (!String(e.base_url).includes("localhost:8000/v1")) { console.log("BAD: endpoint base_url"); process.exit(1); }
if (e.api_key !== "EMPTY") { console.log("BAD: endpoint api_key should be EMPTY"); process.exit(1); }
if (e.model !== "MiniMaxAI/MiniMax-M3") { console.log("BAD: endpoint model id"); process.exit(1); }
console.log("config OK: serve pins MiniMaxAI/MiniMax-M3 with --block-size 128 + minimax_m3 parsers, agent endpoint well-formed");
'
- 2
Point your agent at it (the model step, not checked by CI)
In Aider: aider --model openai/MiniMaxAI/MiniMax-M3 --openai-api-base http://localhost:8000/v1 --openai-api-key EMPTY. In OpenCode, Cline, or Kilo Code, add a custom OpenAI-compatible provider at the same base URL. Serving and coding are fenced off from CI: no green check stands up the model, so this step is yours to verify.
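Before wiring up a full agent, a single chat request can confirm the endpoint answers. The following is a minimal sketch, assuming the server exposes the standard OpenAI-compatible /chat/completions route under the base URL. The endpoint values mirror endpoint.json from step 1, while buildChatRequest and ask are illustrative helper names, not part of this recipe.

```javascript
// Minimal sketch: build (and optionally send) one chat request to the local endpoint.
// Endpoint values mirror endpoint.json from step 1; helper names are illustrative.
const endpoint = {
  base_url: "http://localhost:8000/v1",
  api_key: "EMPTY",
  model: "MiniMaxAI/MiniMax-M3",
};

// Pure function: returns the URL and fetch options without touching the network.
function buildChatRequest(ep, prompt) {
  return {
    url: ep.base_url.replace(/\/+$/, "") + "/chat/completions",
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer " + ep.api_key,
      },
      body: JSON.stringify({
        model: ep.model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Sends the request with Node's built-in fetch; requires the server to be running.
async function ask(ep, prompt) {
  const { url, options } = buildChatRequest(ep, prompt);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error("HTTP " + res.status);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

With the server up, calling ask(endpoint, "Write a hello-world in Go").then(console.log) is one way to exercise it by hand. Keeping request construction separate from sending means the request shape can be inspected without a GPU.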
Eval, 2 fixtures
Last passed: verified today
serve-ok · contains · timeout 30s · max $0
Expected: config OK: serve pins MiniMaxAI/MiniMax-M3 with --block-size 128 + minimax_m3 parsers, agent endpoint well-formed
clean-exit · exit_code · timeout 30s · max $0
Expected: 0
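The two fixtures amount to a substring check on stdout and an exit-code check. The CI harness itself is not shown on this page, so the sketch below is a hypothetical stand-in: the fixture shape ({ name, type, expected }), the run object, and checkFixture are all assumptions made for illustration.

```javascript
// Hypothetical fixture checker; the real CI harness is not shown in this recipe.
const OK_LINE =
  "config OK: serve pins MiniMaxAI/MiniMax-M3 with --block-size 128 + minimax_m3 parsers, agent endpoint well-formed";

const fixtures = [
  { name: "serve-ok", type: "contains", expected: OK_LINE },
  { name: "clean-exit", type: "exit_code", expected: 0 },
];

// Returns true when the captured run satisfies the fixture.
function checkFixture(fixture, run) {
  switch (fixture.type) {
    case "contains":
      return run.stdout.includes(fixture.expected);
    case "exit_code":
      return run.exitCode === fixture.expected;
    default:
      throw new Error("unknown fixture type: " + fixture.type);
  }
}

// Example: what a successful validator run from step 1 would look like.
const run = { stdout: OK_LINE + "\n", exitCode: 0 };
for (const f of fixtures) {
  console.log(f.name, checkFixture(f, run) ? "pass" : "fail");
}
```

Either fixture failing on its own is informative: a missing flag makes the validator print a BAD line and exit with 1, so both checks fail together.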
Results
M3 is the lightest of the three big open coding models to run (427B total parameters but 26B active), has a 1M-token context, and is the only one of the three that also accepts images. It has the top published SWE-Bench Pro score of the three (59.0%, self-reported).