Serve NVIDIA Nemotron 3 Ultra yourself for high-throughput agents (vLLM)
Stand up the NVFP4 Nemotron 3 Ultra checkpoint as an OpenAI-compatible endpoint for fast, long-running agent loops, with validated serve flags and a validated endpoint config.
CI-verified, 2/2 fixtures passing.
Intended Use
For teams that value throughput and a fully open, fine-tunable stack over a single coding leaderboard. CI validates that the serve command pins the NVFP4 checkpoint with the documented nemotron_v3 reasoning parser, qwen3_coder tool parser, and triton mamba backend, and that the agent endpoint (targeting the served model name) is well-formed. CI uses no GPU and makes no model call; serving and agent work are fenced.
Not for
- Picking it purely as a coding champion: it publishes no SWE-Bench Pro number, and M3 and GLM-5.1 have the clearer coding evidence
- Single-GPU setups: this is server-class (8x B200 or 8x H100 for NVFP4; BF16 needs 8x B200, 16x H100, or 8x H200)
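As a rough sanity check on those hardware figures, the sketch below compares a weights-only memory estimate with aggregate GPU memory. The bytes-per-parameter values and the nominal per-GPU memory sizes are assumptions for illustration, not numbers from this workflow; the estimate ignores KV cache, activations, and quantization scale factors, so real headroom is smaller than it suggests.

```javascript
// Weights-only estimate: ignores KV cache, activations, and quantization scales.
const PARAMS = 550e9; // taken from the "550B" in the checkpoint name

// Assumed bytes per parameter (NVFP4 treated as 4 bits, scale factors ignored).
const BYTES_PER_PARAM = { bf16: 2, nvfp4: 0.5 };

// Assumed nominal memory per GPU in GB; check your actual hardware.
const GPU_GB = { H100: 80, H200: 141, B200: 192 };

function weightsGB(precision) {
  return (PARAMS * BYTES_PER_PARAM[precision]) / 1e9;
}

function aggregateGB(gpu, count) {
  return GPU_GB[gpu] * count;
}

// The configurations named in the list above.
const configs = [
  ["nvfp4", "H100", 8],
  ["nvfp4", "B200", 8],
  ["bf16", "B200", 8],
  ["bf16", "H100", 16],
  ["bf16", "H200", 8],
];

for (const [precision, gpu, count] of configs) {
  console.log(
    `${precision} weights ~${weightsGB(precision)} GB vs ${count}x ${gpu} = ${aggregateGB(gpu, count)} GB`
  );
}
```

Under these assumptions the weights alone already exceed any single GPU, which is why this is listed as a multi-GPU-only workflow.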
The Stack
Tested Against
- vLLM Nemotron 3 Ultra blog + cookbook (2026-06)
- vllm/vllm-openai:v0.22.0
- node@20.x
Side effects & data flow
- Network: none, local only
- Writes: ./serve.sh, ./endpoint.json
- Credentials: none required
Prerequisites
- A multi-GPU node (8x B200 or 8x H100 NVFP4) or a rented GPU instance, to actually serve it
Steps
- 1
Write the serve command + agent endpoint, and validate them
This is NVIDIA's own 8x B200 NVFP4 example; the BF16 path and the full flag set are in the Nemotron vLLM cookbook. Note the --served-model-name flag: the agent must target that name, not the checkpoint path. CI checks that the serve command pins the NVFP4 checkpoint with the documented parsers and mamba backend, and that the endpoint is well-formed.
cat > serve.sh <<'SH'
docker pull vllm/vllm-openai:v0.22.0 && \
VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --served-model-name nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --reasoning-parser nemotron_v3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --speculative_config.method mtp \
  --speculative_config.num_speculative_tokens 5 \
  --mamba-backend triton
SH
cat > endpoint.json <<'JSON'
{
  "base_url": "http://localhost:8000/v1",
  "api_key": "EMPTY",
  "model": "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B"
}
JSON
node -e '
const fs = require("fs");
const s = fs.readFileSync("serve.sh", "utf8");
const e = JSON.parse(fs.readFileSync("endpoint.json", "utf8"));
// Strings the serve command must contain.
const need = [
  "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
  "--reasoning-parser nemotron_v3",
  "--tool-call-parser qwen3_coder",
  "--mamba-backend triton",
];
for (const t of need) {
  if (!s.includes(t)) { console.log("BAD: serve command missing " + t); process.exit(1); }
}
// The endpoint must point at the local server and use the served model name.
if (!String(e.base_url).includes("localhost:8000/v1")) { console.log("BAD: endpoint base_url"); process.exit(1); }
if (e.api_key !== "EMPTY") { console.log("BAD: endpoint api_key should be EMPTY"); process.exit(1); }
if (e.model !== "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B") { console.log("BAD: endpoint model should match --served-model-name"); process.exit(1); }
console.log("config OK: serve pins the NVFP4 checkpoint + nemotron_v3/qwen3_coder parsers + triton mamba backend, endpoint targets the served model name");
'
- 2
Point your agent at it (the model step, not checked by CI)
Use base_url http://localhost:8000/v1, key EMPTY, and model nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B (the --served-model-name, not the checkpoint path). The serving and agent loops are fenced.
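For agents that talk to the endpoint directly, a minimal sketch of one chat request might look like the following. It assumes the server from serve.sh is already running and exposes the standard OpenAI-compatible /chat/completions route under the base_url; the buildChatRequest and ask helpers are hypothetical names introduced here for illustration, and the values are copied from endpoint.json.

```javascript
// Build a chat-completions request from the endpoint.json fields.
function buildChatRequest(endpoint, userPrompt) {
  return {
    url: `${endpoint.base_url}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${endpoint.api_key}`,
      },
      body: JSON.stringify({
        model: endpoint.model, // must equal --served-model-name
        messages: [{ role: "user", content: userPrompt }],
      }),
    },
  };
}

// Values copied from endpoint.json in step 1.
const endpoint = {
  base_url: "http://localhost:8000/v1",
  api_key: "EMPTY",
  model: "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B",
};

// Hypothetical helper: only succeeds once the server from serve.sh is up.
async function ask(userPrompt) {
  const { url, init } = buildChatRequest(endpoint, userPrompt);
  const res = await fetch(url, init); // global fetch, available in Node 18+
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Calling ask("Say hello").then(console.log) would send one request. Building the request is kept separate from sending it, so the URL and model name can be inspected without a running server.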
Eval, 2 fixtures
Last passed: verified today
- serve-ok (contains), timeout 30s · max $0
  Expected: config OK: serve pins the NVFP4 checkpoint + nemotron_v3/qwen3_coder parsers + triton mamba backend, endpoint targets the served model name
- clean-exit (exit_code), timeout 30s · max $0
  Expected: 0
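To reproduce the two fixture checks locally, something like the sketch below could be used. It is not the site's actual CI harness: runFixtures is a hypothetical helper, and the stand-in command merely prints a line starting with "config OK" so the sketch runs anywhere Node is installed. Swap in the real validation step from step 1 to check your own files.

```javascript
const { spawnSync } = require("node:child_process");

// Run a command once and evaluate both fixture types against its result.
function runFixtures(command, args, expectedText) {
  const result = spawnSync(command, args, {
    encoding: "utf8",
    timeout: 30_000, // mirrors the 30s fixture timeout
  });
  const stdout = result.stdout ?? ""; // null if the process failed to spawn
  return {
    "serve-ok": stdout.includes(expectedText), // "contains" check
    "clean-exit": result.status === 0,         // "exit_code" check
  };
}

// Stand-in command (an assumption): prints a line beginning with "config OK".
const outcome = runFixtures(
  process.execPath,
  ["-e", 'console.log("config OK: demo")'],
  "config OK"
);
console.log(outcome);
```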
Results
The throughput-and-openness pick: a 550B Mamba-Transformer hybrid shipped with its weights, data, and training recipes, with vLLM day-0 support. NVIDIA claims roughly 30% cost savings vs other open models. It does not publish a SWE-Bench Pro score.