Local Inference · Hybrid · Free · Active · Machine-verified · advanced · ~30 min setup

Serve MiniMax M3 yourself for agentic coding (vLLM)

Stand up MiniMax M3 on an 8x H200 node as an OpenAI-compatible endpoint and point any coding agent at it, using validated serve flags and endpoint config.

by Shilpa Mitra · verified today · v1.0.0

CI-verified, 2/2 fixtures passing.

Intended Use

For teams self-hosting M3 for privacy, throughput, or fine-tuning (not to save money over the API). CI validates that the serve command pins MiniMaxAI/MiniMax-M3 with the mandatory --block-size 128 and the minimax_m3 tool-call and reasoning parsers, and that the agent's OpenAI endpoint config (base_url, api_key, model) is well-formed. CI uses no GPU and makes no model call; the actual serving and coding steps are fenced off from it and are not verified.

Not for

Single-GPU or homelab setups: this is server-class (8x H200 in BF16; the MXFP8 variant roughly halves VRAM)
Assuming open weights means do-anything: M3 is under the MiniMax Community License, so read the terms before commercial use
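
To see why this is server-class hardware, a back-of-the-envelope weight-memory estimate helps. The sketch below assumes 427B total parameters (the figure quoted under Results), 2 bytes per parameter for BF16, roughly 1 byte per parameter for MXFP8 (ignoring scale-factor overhead), and 141 GB of HBM per H200. It counts weights only; KV cache and activations have to fit in whatever headroom is left.

```javascript
// Rough weight-memory estimate; ignores KV cache, activations, and runtime overhead.
const TOTAL_PARAMS_B = 427; // billions of parameters (total, not active)
const H200_GB = 141;        // HBM per H200
const GPUS = 8;

// Billions of params x bytes per param = decimal gigabytes.
function weightGB(paramsBillions, bytesPerParam) {
  return paramsBillions * bytesPerParam;
}

const nodeGB = GPUS * H200_GB;
const bf16GB = weightGB(TOTAL_PARAMS_B, 2);
const mxfp8GB = weightGB(TOTAL_PARAMS_B, 1); // assumption: ~1 byte per parameter

console.log(`node HBM: ${nodeGB} GB`);
console.log(`BF16 weights: ${bf16GB} GB, headroom ${nodeGB - bf16GB} GB`);
console.log(`MXFP8 weights: ~${mxfp8GB} GB`);
```

Under these assumptions the BF16 weights alone take about three quarters of the node's memory, which is why a single GPU or a homelab box is out of scope.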

The Stack

MiniMax M3 (model) · vLLM (serving engine)

Tested Against

vllm recipe MiniMaxAI/MiniMax-M3 (2026-06) · node@20.x

Side effects & data flow

Network: none, local only
Writes: ./serve.sh, ./endpoint.json
Credentials: none required

Prerequisites

A multi-GPU node (8x H200) or a rented GPU instance, to actually serve it

Steps

1
Write the serve command + agent endpoint, and validate them

M3 support is not in a stable vLLM release yet, so use the dedicated image. --block-size 128 is mandatory (it matches the MSA sparse-attention cache; the server fails to start with the default of 16). Any agent that accepts a custom OpenAI base URL can drive it. CI checks that the serve command pins the model id and the required flags, and that the endpoint config is well-formed.

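cat > serve.sh <<'SH'
#!/usr/bin/env bash
# Run the server inside the dedicated image (pulled on first use). The image entrypoint is assumed to be `vllm serve`, so everything after the tag is passed to it.
docker run --gpus all --ipc=host -p 8000:8000 -v ~/.cache/huggingface:/root/.cache/huggingface vllm/vllm-openai:minimax-m3 MiniMaxAI/MiniMax-M3 --tensor-parallel-size 8 --block-size 128 --tool-call-parser minimax_m3 --reasoning-parser minimax_m3 --enable-auto-tool-choice
SH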
cat > endpoint.json <<'JSON'
{
  "base_url": "http://localhost:8000/v1",
  "api_key": "EMPTY",
  "model": "MiniMaxAI/MiniMax-M3"
}
JSON
node -e 'const fs=require("fs");const s=fs.readFileSync("serve.sh","utf8");const e=JSON.parse(fs.readFileSync("endpoint.json","utf8"));const need=["MiniMaxAI/MiniMax-M3","--block-size 128","--tool-call-parser minimax_m3","--reasoning-parser minimax_m3"];for(const t of need){if(!s.includes(t)){console.log("BAD: serve command missing "+t);process.exit(1)}}if(!String(e.base_url).includes("localhost:8000/v1")){console.log("BAD: endpoint base_url");process.exit(1)}if(e.api_key!=="EMPTY"){console.log("BAD: endpoint api_key should be EMPTY");process.exit(1)}if(e.model!=="MiniMaxAI/MiniMax-M3"){console.log("BAD: endpoint model id");process.exit(1)}console.log("config OK: serve pins MiniMaxAI/MiniMax-M3 with --block-size 128 + minimax_m3 parsers, agent endpoint well-formed")'
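
The one-liner above is dense, so here is the same set of checks unrolled as a readable sketch. The function name `validateConfig` and its return shape are illustrative assumptions, not part of the recipe; the required tokens and endpoint values are the ones the one-liner checks.

```javascript
// Tokens the serve command must contain (same list as the one-liner above).
const REQUIRED_TOKENS = [
  "MiniMaxAI/MiniMax-M3",
  "--block-size 128",
  "--tool-call-parser minimax_m3",
  "--reasoning-parser minimax_m3",
];

// Hypothetical helper: returns a list of problems; an empty list means the config is OK.
function validateConfig(serveScript, endpoint) {
  const problems = [];
  for (const token of REQUIRED_TOKENS) {
    if (!serveScript.includes(token)) problems.push("serve command missing " + token);
  }
  if (!String(endpoint.base_url).includes("localhost:8000/v1")) problems.push("endpoint base_url");
  if (endpoint.api_key !== "EMPTY") problems.push("endpoint api_key should be EMPTY");
  if (endpoint.model !== "MiniMaxAI/MiniMax-M3") problems.push("endpoint model id");
  return problems;
}

// Example with in-memory inputs instead of reading serve.sh and endpoint.json.
const serve = "vllm serve MiniMaxAI/MiniMax-M3 --tensor-parallel-size 8 --tool-call-parser minimax_m3 --reasoning-parser minimax_m3";
const endpoint = { base_url: "http://localhost:8000/v1", api_key: "EMPTY", model: "MiniMaxAI/MiniMax-M3" };
console.log(validateConfig(serve, endpoint)); // one problem: the --block-size 128 flag is missing
```

Unlike the one-liner, which exits on the first failure, this version collects every problem, which can be easier to act on when several flags are wrong at once.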

2
Point your agent at it (the model step, not checked by CI)
In Aider: aider --model openai/MiniMaxAI/MiniMax-M3 --openai-api-base http://localhost:8000/v1 --openai-api-key EMPTY. In OpenCode, Cline, or Kilo Code, add a custom OpenAI-compatible provider at the same base URL. Serving and coding are fenced off from CI, so a green check does not mean the model was ever stood up.
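
Before wiring up an agent, it can help to confirm that the endpoint answers a plain chat request. This is a minimal sketch, assuming the server from step 1 is running and exposes the standard OpenAI-compatible /chat/completions route. `buildChatRequest` and `smokeTest` are hypothetical helper names, and Node 20's built-in `fetch` is used.

```javascript
const fs = require("fs");

// Hypothetical helper: turn the endpoint config into a fetch() URL and options.
function buildChatRequest(endpoint, prompt) {
  return {
    url: endpoint.base_url.replace(/\/+$/, "") + "/chat/completions",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer " + endpoint.api_key,
      },
      body: JSON.stringify({
        model: endpoint.model,
        messages: [{ role: "user", content: prompt }],
        max_tokens: 256,
      }),
    },
  };
}

// Sends one request to the running server; not called automatically.
async function smokeTest(configPath = "endpoint.json") {
  const endpoint = JSON.parse(fs.readFileSync(configPath, "utf8"));
  const { url, init } = buildChatRequest(endpoint, "Reply with the word ready.");
  const res = await fetch(url, init);
  if (!res.ok) throw new Error("endpoint returned HTTP " + res.status);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Once the server is up, calling `smokeTest().then(console.log)` from the directory containing endpoint.json should print the model's reply. The content may come back empty if the response is cut off before the model finishes reasoning, in which case raising max_tokens is the first thing to try.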

Eval, 2 fixtures

Last passed: verified today

serve-ok · contains · timeout 30s · max $0
Expected: config OK: serve pins MiniMaxAI/MiniMax-M3 with --block-size 128 + minimax_m3 parsers, agent endpoint well-formed
clean-exit · exit_code · timeout 30s · max $0
Expected: 0
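
The two fixtures amount to a substring check on the output and an exit-code check, each under a 30-second timeout. The runner below is a hypothetical sketch of that idea, not the site's actual CI harness; `runFixture`, the fixture object shape, and the shortened expected string are assumptions made for the demo.

```javascript
const { spawnSync } = require("child_process");

// Hypothetical fixture runner: executes a command and applies one check to the result.
function runFixture(fixture, command, args) {
  const result = spawnSync(command, args, { encoding: "utf8", timeout: fixture.timeoutMs });
  if (result.error) return false; // spawn failure or timeout
  if (fixture.type === "contains") return result.stdout.includes(fixture.expected);
  if (fixture.type === "exit_code") return result.status === fixture.expected;
  throw new Error("unknown fixture type: " + fixture.type);
}

const fixtures = [
  // Expected string shortened for the demo; the real fixture expects the full "config OK: ..." line.
  { name: "serve-ok", type: "contains", expected: "config OK", timeoutMs: 30000 },
  { name: "clean-exit", type: "exit_code", expected: 0, timeoutMs: 30000 },
];

// Stand-in command that prints a passing message; swap in the real validation step.
const demoArgs = ["-e", 'console.log("config OK: demo")'];
for (const f of fixtures) {
  console.log(f.name, runFixture(f, process.execPath, demoArgs) ? "pass" : "fail");
}
```

Both checks run the same command independently, which mirrors how the fixtures are listed: one asserts on what the validator prints, the other on how it exits.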

Results

M3 is the lightest of the three big open coding models to run (427B total parameters but 26B active), has a 1M-token context, and is the only one of the three that also accepts images. It has the top published SWE-Bench Pro score of the three (59.0%, self-reported).
