Agents · Open Source · Free · Active · Machine-verified · advanced · ~60 min setup

Run GLM-5.2 fully local on a Mac Studio and drive it with Hermes

Serve GLM-5.2's 2-bit GGUF on a Mac Studio over an OpenAI-compatible endpoint, point Hermes at it as a custom provider, and hand it long hands-off agentic jobs.

by Shilpa Mitra · verified today · v1.0.0

CI-verified, 2/2 fixtures passing.

Intended Use

For anyone with a 256GB+ Mac Studio who wants a local, private agent for long asynchronous work. CI checks only the wiring: config.yaml parses and points Hermes at a custom local endpoint with a context_length set, and the llama-server command pins the UD-IQ2_M GGUF with --jinja and Unsloth's recommended sampling (temp 1.0, top-p 0.95, min-p 0.01). CI uses no GPU, downloads nothing, and never runs the model. The agent's planning and execution are fenced, so CI does not exercise them.

Not for

Fast or interactive chat: a few tok/s is the cost of running a 744B model locally, so this is a background worker, not a snappy assistant
Treating the 2-bit quant as the full model: it is a compressed copy that trades accuracy for fit and makes more mistakes than the cloud version, so scope tasks the agent can verify (run tests, check output) and review what matters
Laptops: the ~239GB quant needs Mac Studio-class unified memory
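
The size figures above can be turned into a quick back-of-envelope check. The sketch below uses only the numbers quoted in this recipe and assumes "GB" means 10^9 bytes and "744B" means 744 × 10^9 parameters. The headroom it prints is an upper bound, since it ignores what macOS itself needs and the KV cache.

```ruby
# Average bits stored per weight across the whole quantized file.
def bits_per_weight(file_bytes, params)
  (file_bytes * 8.0) / params
end

# Memory left after the weights are loaded, in GB.
# Upper bound only: the OS and the KV cache also need room.
def headroom_gb(unified_gb, model_gb)
  unified_gb - model_gb
end

puts format("%.2f bits/weight", bits_per_weight(239e9, 744e9))
puts "#{headroom_gb(256, 239)} GB headroom on a 256GB machine"
puts "#{headroom_gb(512, 239)} GB headroom on a 512GB machine"
```

Under those assumptions the "2-bit" quant averages a little over 2.5 bits per weight, and a 256GB machine is left with 17GB of slack, which is why 512GB is the comfortable option.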

The Stack

Hermes Agent: agent runtime
LM Studio: local model server (or llama.cpp)
GLM-5.2: model (GLM-5.2, local GGUF)

Tested Against

huggingface.co/unsloth/GLM-5.2-GGUF (2026-06)
hermes-agent docs (2026-06)
ruby@3.x (YAML stdlib)

Side effects & data flow

Network: none, local only
Writes: ./config.yaml, ./serve.sh
Credentials: none required

Prerequisites

A Mac with 256GB+ unified memory (512GB comfortable)
LM Studio or llama.cpp, plus the ~239GB UD-IQ2_M download (to actually serve)
Hermes Agent installed (to actually run the agent)

Steps

1
Serve GLM-5.2 locally
Easiest on macOS is LM Studio: search its model browser for the Unsloth GLM-5.2 GGUF, download the UD-IQ2_M quant (LM Studio warns if it will not fit), and start the server from the Developer tab at http://localhost:1234/v1. For maximum control use llama.cpp: `pip install huggingface_hub`, then `hf download unsloth/GLM-5.2-GGUF --local-dir unsloth/GLM-5.2-GGUF --include "*UD-IQ2_M*"`, then run the serve command below, which listens on port 8080. Whichever server you run, base_url in config.yaml must point at its port (1234 for LM Studio, 8080 for llama-server). The download and serving need the hardware, so they are not CI steps.
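
Once a server is up, a quick way to confirm it answers is to list its models. This is a minimal sketch that assumes the server exposes the OpenAI-style `GET /v1/models` route returning `{"data": [{"id": ...}]}`. The helper names are illustrative and not part of Hermes or llama.cpp.

```ruby
require "json"
require "net/http"
require "uri"

# Build the models URL from an OpenAI-style base URL such as http://localhost:8080/v1.
def models_uri(base_url)
  URI("#{base_url.chomp('/')}/models")
end

# Pull model ids out of an OpenAI-style {"data": [{"id": ...}, ...]} body.
def model_ids(body)
  JSON.parse(body).fetch("data", []).map { |m| m["id"] }
end

# Returns the served model ids, or nil if nothing is listening or the call fails.
def served_models(base_url)
  res = Net::HTTP.get_response(models_uri(base_url))
  res.is_a?(Net::HTTPSuccess) ? model_ids(res.body) : nil
rescue SystemCallError, SocketError, Timeout::Error
  nil
end
```

Calling `served_models("http://localhost:1234/v1")` for LM Studio, or the same with port 8080 for llama-server, should return a non-empty list once the model has loaded. A `nil` result means the server is not reachable yet.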

2
Wire Hermes to the local endpoint and validate the config
Write the llama-server command (serve.sh) and the Hermes config (config.yaml) with provider custom and base_url set to your local endpoint. The script below writes config.yaml to the current directory so CI can parse it; for Hermes to use it, it belongs at ~/.hermes/config.yaml. --jinja turns on the chat template that tool calling needs. GLM is not on Hermes' tool-use auto-enforce list (which covers gpt/gemini/grok-style models), so tool_use_enforcement: true steers it back to calling tools if it starts narrating instead. CI parses the YAML and checks that the serve command pins the quant and the sampling flags; nothing is downloaded or run.

cat > serve.sh <<'SH'
# Bind to loopback only so the endpoint stays local.
./llama.cpp/llama-server \
  --model unsloth/GLM-5.2-GGUF/UD-IQ2_M/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf \
  --temp 1.0 --top-p 0.95 --min-p 0.01 \
  --ctx-size 32768 --jinja \
  --host 127.0.0.1 --port 8080
SH
cat > config.yaml <<'YAML'
model:
  default: glm-5.2
  provider: custom
  base_url: http://localhost:8080/v1
  api_key: local
  context_length: 32768
agent:
  tool_use_enforcement: true
YAML
ruby -ryaml -e '
c = YAML.load_file("config.yaml")
m = c["model"] || {}
abort "BAD: provider not custom" unless m["provider"] == "custom"
bu = m["base_url"].to_s
abort "BAD: base_url not a local endpoint" unless bu.include?("localhost") || bu.include?("127.0.0.1")
abort "BAD: no context_length" unless m["context_length"]
s = File.read("serve.sh")
abort "BAD: serve does not pin the UD-IQ2_M quant" unless s.include?("UD-IQ2_M")
abort "BAD: serve missing --jinja (tool calling needs the chat template)" unless s.include?("--jinja")
abort "BAD: serve missing Unsloth recommended sampling" unless s.include?("--temp 1.0") && s.include?("--top-p 0.95") && s.include?("--min-p 0.01")
puts "config OK: Hermes -> custom local endpoint (provider custom, base_url local, context_length set); serve pins UD-IQ2_M with --jinja + recommended sampling"
'
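
The substring test on base_url is deliberately loose: a URL such as http://example.com/localhost would pass it. As a hedged sketch of a stricter variant, the helpers below parse the host component and also confirm that context_length matches --ctx-size. The function names are hypothetical and this is not what CI runs.

```ruby
require "uri"
require "yaml"

# True only when the URL's host component is a loopback name or address.
def loopback_url?(url)
  host = URI.parse(url).host.to_s.delete("[]")
  %w[localhost 127.0.0.1 ::1].include?(host)
rescue URI::InvalidURIError
  false
end

# Extract the integer after --ctx-size from a llama-server command line, or nil.
def ctx_size(command)
  m = command.match(/--ctx-size\s+(\d+)/)
  m && m[1].to_i
end

# Collect human-readable problems instead of aborting on the first one.
def wiring_problems(config, command)
  model = config["model"] || {}
  problems = []
  problems << "base_url host is not loopback" unless loopback_url?(model["base_url"].to_s)
  if ctx_size(command) != model["context_length"]
    problems << "context_length does not match --ctx-size"
  end
  problems
end

config = YAML.safe_load(<<~CFG)
  model:
    base_url: http://localhost:8080/v1
    context_length: 32768
CFG
p wiring_problems(config, "llama-server --ctx-size 32768 --jinja")
```

An empty array means both checks passed. Returning a list rather than calling abort makes it easier to report every mismatch in one run.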

3
Hand it a long, checkable job (the model step, not checked by CI)
Sandbox anything that runs commands (`hermes config set terminal.backend docker`), then give GLM-5.2 work that suits a slow, private, tireless worker: a repo-wide refactor you describe once, a research pass over a folder of local documents, or an overnight scheduled job that produces a report by morning. The agent loop and the model run are fenced, so CI never exercises them.
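
Before committing to an overnight job, it may be worth confirming that the server actually returns tool calls rather than narration. The sketch below builds an OpenAI-style chat completion request offering one tool and extracts any tool calls from a response body. The `list_files` tool and both helper names are hypothetical, and the request and response shapes assume the standard OpenAI chat completions format.

```ruby
require "json"

# Build an OpenAI-style chat completion request that offers one tool.
# The tool "list_files" and its schema are made up for this probe.
def tool_probe_request(model)
  {
    model: model,
    messages: [{ role: "user", content: "List the files in the current directory." }],
    tools: [{
      type: "function",
      function: {
        name: "list_files",
        description: "List files in a directory",
        parameters: {
          type: "object",
          properties: { path: { type: "string" } },
          required: ["path"]
        }
      }
    }]
  }
end

# Return [name, parsed_arguments] pairs for every tool call in the first choice.
# An empty array means the model answered in prose instead of calling a tool.
def tool_calls(response_body)
  message = JSON.parse(response_body).dig("choices", 0, "message") || {}
  (message["tool_calls"] || []).map do |call|
    fn = call["function"]
    [fn["name"], JSON.parse(fn["arguments"])]
  end
end

puts JSON.pretty_generate(tool_probe_request("glm-5.2"))
```

POST the printed body to the /chat/completions route under your base_url and pass the response text to `tool_calls`. If it comes back empty, check that the server was started with --jinja.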

Eval, 2 fixtures

Last passed: verified today

wiring-ok · contains · timeout 30s · max $0
Expected: config OK: Hermes -> custom local endpoint (provider custom, base_url local, context_length set); serve pins UD-IQ2_M with --jinja + recommended sampling
clean-exit · exit_code · timeout 30s · max $0
Expected: 0
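
The two fixtures amount to a substring check on stdout and an equality check on the exit code. A minimal sketch of that logic follows, assuming a fixture is a hash with :name, :kind and :expected keys. That shape is an illustration, not the harness's real format, and the expected string is shortened to its opening words.

```ruby
# Decide whether one fixture passes, given a finished run's stdout and exit code.
def fixture_passes?(fixture, stdout, exit_code)
  case fixture[:kind]
  when "contains"  then stdout.include?(fixture[:expected])
  when "exit_code" then exit_code == fixture[:expected]
  else false
  end
end

fixtures = [
  { name: "wiring-ok",  kind: "contains",  expected: "config OK: Hermes -> custom local endpoint" },
  { name: "clean-exit", kind: "exit_code", expected: 0 }
]

# Stand-in values for a successful run of the validation script.
stdout = "config OK: Hermes -> custom local endpoint\n"
exit_code = 0

fixtures.each do |f|
  puts "#{f[:name]}: #{fixture_passes?(f, stdout, exit_code) ? 'pass' : 'fail'}"
end
```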

Results

A private, free, capable agent that never leaves your desk. The UD-IQ2_M quant is ~239GB (needs 256GB+ unified memory; 512GB is comfortable) and runs at anywhere from low single digits to ~9 tok/s on an M3 Ultra: miserable for chat, completely fine for an agent grinding through a long task while you do other things.
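
To get a feel for what those speeds mean for an unattended job, the arithmetic can be sketched as follows. The 50,000-token job size is a hypothetical figure chosen for illustration, and the estimate covers generation only, not prompt processing or tool execution time.

```ruby
# Wall-clock hours to generate a given number of tokens at a steady rate.
def generation_hours(tokens, tok_per_s)
  tokens / tok_per_s.to_f / 3600
end

# 50_000 output tokens is an assumed job size, not a measurement.
[2, 9].each do |rate|
  puts format("%d tok/s: %.1f h for 50,000 tokens", rate, generation_hours(50_000, rate))
end
```

At 2 tok/s the job takes roughly 6.9 hours and at 9 tok/s roughly 1.5 hours, which fits the overnight-worker framing.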
