Local Inference · Open Source · Free · Active · Machine-verified · advanced · ~30 min setup

Serve NVIDIA Nemotron 3 Ultra yourself for high-throughput agents (vLLM)

Stand up the NVFP4 Nemotron 3 Ultra checkpoint as an OpenAI-compatible endpoint for fast, long-running agent loops, with validated serve flags and endpoint config.

by Shilpa Mitra · verified today · v1.0.0

CI-verified, 2/2 fixtures passing.

Intended Use

For teams that value throughput and a fully open, fine-tunable stack more than a top spot on a single coding leaderboard. CI validates that the serve command pins the NVFP4 checkpoint with the documented nemotron_v3 reasoning parser, qwen3_coder tool parser, and triton mamba backend, and that the agent endpoint (targeting the served model name) is well-formed. CI uses no GPU and makes no model call; the actual serving and agent work are fenced off from CI.

Not for

Picking it purely as a coding champion: it publishes no SWE-Bench Pro number, and M3 and GLM-5.1 have the clearer coding evidence.
Single-GPU setups: this is a server-class model (8x B200 or 8x H100 for NVFP4; BF16 needs 8x B200, 16x H100, or 8x H200).
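
The hardware floor above follows from rough weights-only arithmetic. The sketch below is illustrative, not a sizing tool. The bits-per-parameter figures and per-GPU memory sizes are assumptions (nominal values, not taken from this recipe), and the estimate ignores KV cache, activations, and any layers a real checkpoint keeps in higher precision.

```javascript
// Weights-only footprint estimate. Real deployments need extra headroom
// for KV cache, activations, and runtime overhead.
const PARAMS = 550e9; // total parameters, read off the "550B" in the checkpoint name

// ASSUMPTION: NVFP4 stores 4-bit values plus one 8-bit scale per 16-value block.
const BITS_PER_PARAM = { NVFP4: 4 + 8 / 16, BF16: 16 };

// ASSUMPTION: nominal per-GPU memory in GB; usable memory is lower in practice.
const GPU_GB = { H100: 80, H200: 141, B200: 192 };

// Size of the weights alone, in GB, for a given storage format.
function weightsGB(format) {
  return (PARAMS * BITS_PER_PARAM[format]) / 8 / 1e9;
}

// True when the weights alone fit in the combined memory of `count` GPUs.
function weightsFit(format, gpu, count) {
  return weightsGB(format) < GPU_GB[gpu] * count;
}

for (const [format, gpu, count] of [
  ["NVFP4", "B200", 1],
  ["NVFP4", "H100", 8],
  ["BF16", "H100", 8],
  ["BF16", "H100", 16],
]) {
  console.log(`${format} on ${count}x ${gpu}: weights ${weightsGB(format)} GB, fit = ${weightsFit(format, gpu, count)}`);
}
```

Under these assumptions the NVFP4 weights alone exceed any single GPU listed, and the BF16 weights exceed an 8x H100 node, which is consistent with the configurations named above.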

The Stack

NVIDIA Nemotron 3 Ultra (model) · vLLM (serving engine)

Tested Against

vllm Nemotron 3 Ultra blog + cookbook (2026-06) · vllm/vllm-openai:v0.22.0 · node@20.x

Side effects & data flow

Network: none, local only
Writes: ./serve.sh, ./endpoint.json
Credentials: none required

Prerequisites

A multi-GPU node (8x B200 or 8x H100 NVFP4) or a rented GPU instance, to actually serve it

Steps

1
Write the serve command + agent endpoint, and validate them

This is NVIDIA's own 8x B200 NVFP4 example; the BF16 path and full flag set are in the Nemotron vLLM cookbook. Note the --served-model-name flag: the agent must target that name, not the checkpoint path. CI checks that the serve command pins the NVFP4 checkpoint with the documented parsers and mamba backend, and that the endpoint is well-formed.

cat > serve.sh <<'SH'
docker pull vllm/vllm-openai:v0.22.0 && \
VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --served-model-name nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --reasoning-parser nemotron_v3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --speculative_config.method mtp \
  --speculative_config.num_speculative_tokens 5 \
  --mamba-backend triton
SH
cat > endpoint.json <<'JSON'
{
  "base_url": "http://localhost:8000/v1",
  "api_key": "EMPTY",
  "model": "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B"
}
JSON
node -e '
const fs = require("fs");
const s = fs.readFileSync("serve.sh", "utf8");
const e = JSON.parse(fs.readFileSync("endpoint.json", "utf8"));
// Tokens the serve command must contain.
const need = [
  "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
  "--reasoning-parser nemotron_v3",
  "--tool-call-parser qwen3_coder",
  "--mamba-backend triton",
];
for (const t of need) {
  if (!s.includes(t)) { console.log("BAD: serve command missing " + t); process.exit(1); }
}
// The endpoint must point at the local server, use the placeholder key, and name the served model.
if (!String(e.base_url).includes("localhost:8000/v1")) { console.log("BAD: endpoint base_url"); process.exit(1); }
if (e.api_key !== "EMPTY") { console.log("BAD: endpoint api_key should be EMPTY"); process.exit(1); }
if (e.model !== "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B") { console.log("BAD: endpoint model should match --served-model-name"); process.exit(1); }
console.log("config OK: serve pins the NVFP4 checkpoint + nemotron_v3/qwen3_coder parsers + triton mamba backend, endpoint targets the served model name");
'
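
The check above only inspects the two files. Once the server is actually running (outside CI), one way to confirm the served name is to list the models the endpoint reports. This is a minimal sketch, assuming the server exposes the standard OpenAI-style GET /v1/models route; servedModelIds and checkServedName are hypothetical helper names, not part of this recipe.

```javascript
// Pure helper: pull model ids out of an OpenAI-style /v1/models response body.
function servedModelIds(body) {
  return Array.isArray(body?.data) ? body.data.map((m) => m.id) : [];
}

// Hypothetical smoke test. It only works while the server from serve.sh is up,
// so it is defined here but not called.
async function checkServedName(endpoint) {
  const res = await fetch(`${endpoint.base_url}/models`, {
    headers: { Authorization: `Bearer ${endpoint.api_key}` },
  });
  if (!res.ok) throw new Error(`GET /models failed: HTTP ${res.status}`);
  const ids = servedModelIds(await res.json());
  // True when the name in endpoint.json matches a name the server reports.
  return ids.includes(endpoint.model);
}
```

If checkServedName resolves to false, the agent is most likely pointed at the checkpoint path rather than the --served-model-name value.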

2
Point your agent at it (the model step, not checked by CI)
Use base_url http://localhost:8000/v1, api_key EMPTY, and model nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B (the --served-model-name, not the checkpoint path). Serving the model and running agent loops are fenced off from CI, so run them on your own hardware.
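
For agents that do not take these three values directly, the request can be built by hand. The sketch below assumes the standard OpenAI-compatible POST /chat/completions route and Node 20's built-in fetch; buildChatRequest and ask are hypothetical names used for illustration.

```javascript
const fs = require("fs");

// Build the URL and fetch options for one chat completion call.
// Kept pure so it can be inspected without a running server.
function buildChatRequest(endpoint, messages) {
  const base = endpoint.base_url.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    url: `${base}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${endpoint.api_key}`,
      },
      // `model` must be the served model name, not the checkpoint path.
      body: JSON.stringify({ model: endpoint.model, messages }),
    },
  };
}

// Hypothetical one-shot call. Needs the server running, so it is not invoked here.
async function ask(prompt, path = "endpoint.json") {
  const endpoint = JSON.parse(fs.readFileSync(path, "utf8"));
  const { url, init } = buildChatRequest(endpoint, [{ role: "user", content: prompt }]);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`chat completion failed: HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Reading the values from endpoint.json rather than hard-coding them keeps the agent and the CI-validated config from drifting apart.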

Eval, 2 fixtures

Last passed: verified today

serve-ok · contains · timeout 30s · max $0
Expected: config OK: serve pins the NVFP4 checkpoint + nemotron_v3/qwen3_coder parsers + triton mamba backend, endpoint targets the served model name
clean-exit · exit_code · timeout 30s · max $0
Expected: 0
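
To reproduce both fixtures locally, something like the following could work. It is a rough sketch and not the site's actual CI harness: runFixtures is a hypothetical helper, and it assumes the step-1 commands have been saved to a script of your own.

```javascript
const { spawnSync } = require("child_process");

// Hypothetical local runner mirroring the two fixture types:
// "contains" checks stdout for a substring, "exit_code" checks for a 0 exit.
function runFixtures(command, args, expectedText, timeoutMs = 30000) {
  const r = spawnSync(command, args, { encoding: "utf8", timeout: timeoutMs });
  return {
    contains: (r.stdout || "").includes(expectedText),
    cleanExit: r.status === 0, // status is null when the process timed out or was killed
  };
}
```

For example, runFixtures("bash", ["step1.sh"], "config OK:") would run a saved copy of step 1, where step1.sh is a file name chosen here for illustration.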

Results

The throughput-and-openness pick: a 550B Mamba-Transformer hybrid shipped with its weights, data, and training recipes, with vLLM day-0 support. NVIDIA claims roughly 30% cost savings vs other open models. It does not publish a SWE-Bench Pro score.
