Local Inference · Open Source · Free · Active · Machine-verified · advanced · ~30 min setup

Serve GLM-5.1 yourself for long-horizon agentic coding (vLLM)

Stand up the MIT-licensed GLM-5.1 FP8 checkpoint as an OpenAI-compatible endpoint for long agentic runs, with a validated serve config and endpoint.

by Shilpa Mitra · verified today · v1.0.0

CI-verified, 2/2 fixtures passing.

Intended Use

For teams that need permissive licensing for a product and long-horizon agentic runs. CI validates that the serve command names zai-org/GLM-5.1-FP8 with a valid tensor-parallel size, and that the agent's OpenAI endpoint is well-formed. It uses no GPU and makes no model call. The model-specific parser flags live in the vLLM recipe / SGLang cookbook; serving and the agent run are fenced off from CI.

Not for

  • Reading 'SOTA on SWE-Bench Pro' as a current crown: it was true at launch, before M3 posted a marginally higher number
  • Small tasks: the long-horizon design is overhead unless the job is genuinely big

The Stack

Tested Against

  • vllm recipe zai-org/GLM-5.1 (2026-06)
  • vLLM v0.19.0+ / SGLang v0.5.10+
  • node@20.x

Side effects & data flow

Network: none, local only
Writes: ./serve.sh, ./endpoint.json
Credentials: none required

Prerequisites

  • A multi-GPU node (8x B200 class) or a rented GPU instance, to actually serve it

Steps

  1. Write the serve command + agent endpoint, and validate them

    Serve the FP8 checkpoint across an 8-GPU node with vLLM v0.19.0+ (SGLang v0.5.10+ is equally supported). The tool and reasoning parser flags live in the official vLLM recipe and SGLang cookbook, so follow those rather than guessing. CI checks that the serve command names the FP8 checkpoint with a valid tensor-parallel size and that the endpoint is well-formed.

    cat > serve.sh <<'SH'
    vllm serve zai-org/GLM-5.1-FP8 --tensor-parallel-size 8
    SH
    cat > endpoint.json <<'JSON'
    {
      "base_url": "http://localhost:8000/v1",
      "api_key": "EMPTY",
      "model": "zai-org/GLM-5.1-FP8"
    }
    JSON
    node -e '
    // Validate serve.sh and endpoint.json. On the first failed check, print "BAD: <reason>" and exit 1.
    const fs = require("fs");
    const fail = (msg) => { console.log("BAD: " + msg); process.exit(1); };
    const s = fs.readFileSync("serve.sh", "utf8");
    const e = JSON.parse(fs.readFileSync("endpoint.json", "utf8"));
    if (!s.includes("zai-org/GLM-5.1-FP8")) fail("serve command missing model id");
    const m = s.match(/--tensor-parallel-size\s+(\d+)/);
    if (!m || parseInt(m[1], 10) < 1) fail("invalid tensor-parallel size");
    if (!String(e.base_url).includes("localhost:8000/v1")) fail("endpoint base_url");
    if (e.api_key !== "EMPTY") fail("endpoint api_key should be EMPTY");
    if (e.model !== "zai-org/GLM-5.1-FP8") fail("endpoint model id");
    console.log("config OK: serve names zai-org/GLM-5.1-FP8 with tensor-parallel-size " + m[1] + ", agent endpoint well-formed");
    '
  2. Point your agent at it (the model step, not checked by CI)

    Any agent that accepts a custom OpenAI base URL can drive it: set base_url to http://localhost:8000/v1, the key to EMPTY, and the model to zai-org/GLM-5.1-FP8. For tool-calling, add the parser flags from the recipe to the serve command. Serving and the agent run are fenced off from CI.
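
To make the endpoint settings concrete, here is a minimal sketch of the request an OpenAI-compatible client would send to the local server. The `buildChatRequest` helper and the prompt text are illustrative assumptions, not part of the recipe; the URL, key, and model id mirror endpoint.json from step 1. The block only builds and prints the request, so it runs without a GPU or a live server.

```javascript
// Values copied from endpoint.json in step 1.
const endpoint = {
  base_url: "http://localhost:8000/v1",
  api_key: "EMPTY",
  model: "zai-org/GLM-5.1-FP8",
};

// Hypothetical helper: assemble an OpenAI-style chat completion request.
function buildChatRequest(ep, userPrompt) {
  return {
    // Strip any trailing slash before appending the chat completions path.
    url: ep.base_url.replace(/\/+$/, "") + "/chat/completions",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer " + ep.api_key,
      },
      body: JSON.stringify({
        model: ep.model,
        messages: [{ role: "user", content: userPrompt }],
      }),
    },
  };
}

const req = buildChatRequest(endpoint, "Reply with the single word: ready");
console.log(req.url); // http://localhost:8000/v1/chat/completions

// With the server from serve.sh running, Node 20's built-in fetch can send it:
//   const res = await fetch(req.url, req.init);
//   console.log((await res.json()).choices[0].message.content);
```

Keeping the request construction separate from the network call lets you inspect exactly what the agent will send before any GPU time is spent.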

Eval, 2 fixtures

Last passed: verified today
  • serve-ok · contains · timeout 30s · max $0

    Expected: config OK: serve names zai-org/GLM-5.1-FP8 with tensor-parallel-size

  • clean-exit · exit_code · timeout 30s · max $0

    Expected: 0
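
The two fixtures above amount to one substring check and one exit-code check. The sketch below shows how such checks could be evaluated against a captured run. It is a hypothetical evaluator, not the site's actual CI harness: the `checkFixture` function, the fixture object shape, and the `sampleRun` data are assumptions made for illustration.

```javascript
// Hypothetical fixture evaluator; the real CI harness is not shown in this recipe.
function checkFixture(fixture, run) {
  if (fixture.type === "contains") return run.stdout.includes(fixture.expected);
  if (fixture.type === "exit_code") return run.exitCode === fixture.expected;
  throw new Error("unknown fixture type: " + fixture.type);
}

// Fixture names, types, and expected values as listed above.
const fixtures = [
  {
    name: "serve-ok",
    type: "contains",
    expected: "config OK: serve names zai-org/GLM-5.1-FP8 with tensor-parallel-size",
  },
  { name: "clean-exit", type: "exit_code", expected: 0 },
];

// Sample captured run: what the step 1 validator prints for --tensor-parallel-size 8.
const sampleRun = {
  stdout:
    "config OK: serve names zai-org/GLM-5.1-FP8 with tensor-parallel-size 8, agent endpoint well-formed\n",
  exitCode: 0,
};

const results = fixtures.map((f) => ({ name: f.name, pass: checkFixture(f, sampleRun) }));
console.log(results);
```

Because serve-ok matches only a prefix of the success message, it passes for any valid tensor-parallel size, not just 8.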

Results

The permissive pick: genuinely MIT-licensed, so commercial use and fine-tuning come with no strings attached. It is purpose-built to stay productive across hundreds of tool-call rounds. It is also the heaviest model here (~754B parameters), so the FP8 checkpoint (~860GB across the node) is the realistic target.
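
As a rough sanity check on the sizing, the arithmetic below splits the ~860GB figure evenly across the 8 GPUs from the serve command. Two things here are assumptions rather than facts from this recipe: that tensor parallelism divides the checkpoint evenly, and the 180GB of memory per GPU, which is a placeholder to replace with your hardware's actual figure.

```javascript
const checkpointGB = 860;     // ~860GB FP8 checkpoint, from the text above
const tensorParallel = 8;     // --tensor-parallel-size in serve.sh
const assumedGpuMemGB = 180;  // ASSUMPTION: per-GPU memory; check your own hardware

// Even split of the checkpoint across the tensor-parallel group.
const perGpuGB = checkpointGB / tensorParallel;

// Whatever remains per GPU has to cover KV cache, activations, and runtime overhead.
const headroomGB = assumedGpuMemGB - perGpuGB;

console.log(perGpuGB.toFixed(1) + " GB of weights per GPU");
console.log(headroomGB.toFixed(1) + " GB per GPU left for everything else");
```

Under these assumptions each GPU holds 107.5GB of weights. Long-horizon runs tend to use long contexts, so the remaining headroom for KV cache is the number to watch when choosing hardware.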
