Run GLM-5.2 fully local on a Mac Studio and drive it with Hermes
Serve GLM-5.2's 2-bit GGUF on a Mac Studio over an OpenAI-compatible endpoint, point Hermes at it as a custom provider, and hand it long hands-off agentic jobs.
CI-verified, 2/2 fixtures passing.
Intended Use
Anyone with a 256GB+ Mac Studio who wants a local, private agent for long async work. CI checks the wiring only: config.yaml parses and points Hermes at a custom local endpoint with a context_length set, and the llama-server command pins the UD-IQ2_M GGUF with --jinja and Unsloth's recommended sampling (temp 1.0, top-p 0.95, min-p 0.01). CI uses no GPU, downloads nothing, and never runs the model. The agent's planning and execution are fenced off from CI.
Not for
- Fast or interactive chat: a few tok/s is the cost of running a 744B model locally, so this is a background worker, not a snappy assistant
- Treating the 2-bit quant as the full model: it is a compressed copy that trades accuracy for fit and makes more mistakes than the cloud version, so scope tasks so the agent can verify its own work (run tests, check output) and review what matters yourself
- Laptops: the ~239GB quant needs Mac Studio-class unified memory
The Stack
Tested Against
- huggingface.co/unsloth/GLM-5.2-GGUF (2026-06)
- hermes-agent docs (2026-06)
- ruby@3.x (YAML stdlib)
Side effects & data flow
- Network: none, local only
- Writes: ./config.yaml, ./serve.sh
- Credentials: none required
Prerequisites
- A Mac with 256GB+ unified memory (512GB comfortable)
- LM Studio or llama.cpp, plus the ~239GB UD-IQ2_M download (to actually serve)
- Hermes Agent installed (to actually run the agent)
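The gap between "256GB+" and "512GB comfortable" comes down to headroom: the weights alone take about 239GB, and the KV cache, macOS, and Hermes share whatever is left. A rough sketch of that arithmetic, where `headroom_gb` is a hypothetical helper and the sizes are the ones quoted in this recipe:

```ruby
# Rough headroom arithmetic: memory left once the quant's weights are loaded.
# headroom_gb is a hypothetical helper; the sizes are the ones quoted in this recipe.
MODEL_GB = 239 # approximate size of the UD-IQ2_M quant

def headroom_gb(unified_memory_gb, model_gb = MODEL_GB)
  unified_memory_gb - model_gb
end

[256, 512].each do |total|
  puts "#{total}GB Mac: ~#{headroom_gb(total)}GB left for KV cache, macOS, and everything else"
end
```

On a 256GB machine the remainder is small, which is one reason the serve command in step 2 keeps the context size modest.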
Steps
- 1
Serve GLM-5.2 locally
Easiest on macOS is LM Studio: search its model browser for the Unsloth GLM-5.2 GGUF, download the UD-IQ2_M quant (LM Studio warns if it will not fit), and start the Developer-tab server at http://localhost:1234/v1. If you go this route, use that URL as base_url in step 2 instead of the llama-server one. For maximum control, use llama.cpp: `pip install huggingface_hub`, then `hf download unsloth/GLM-5.2-GGUF --local-dir unsloth/GLM-5.2-GGUF --include "*UD-IQ2_M*"`, then run the serve command from step 2. Downloading and serving need the hardware, so they are not CI steps.
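A download this large can stop partway, so it may be worth confirming every shard arrived before starting the server. A minimal sketch, assuming all six shards follow the naming of the first one in the serve command (`GLM-5.2-UD-IQ2_M-00001-of-00006.gguf`) and sit in the same directory; `expected_shards` and `missing_shards` are hypothetical helpers:

```ruby
# Sketch: list which GGUF shards are missing from a download directory.
# Assumes every shard follows the naming of the first one in the serve command:
# GLM-5.2-UD-IQ2_M-00001-of-00006.gguf through GLM-5.2-UD-IQ2_M-00006-of-00006.gguf
def expected_shards(prefix: "GLM-5.2-UD-IQ2_M", total: 6)
  (1..total).map { |i| format("%s-%05d-of-%05d.gguf", prefix, i, total) }
end

# Returns the expected shard names that are not present in dir.
def missing_shards(dir, **opts)
  expected_shards(**opts).reject { |name| File.exist?(File.join(dir, name)) }
end

dir = ARGV[0] || "unsloth/GLM-5.2-GGUF/UD-IQ2_M"
missing = missing_shards(dir)
if missing.empty?
  puts "all shards present in #{dir}"
else
  puts "missing #{missing.size} shard(s) in #{dir}:"
  missing.each { |name| puts "  #{name}" }
end
```

This only checks that the files exist, not that they are complete or uncorrupted.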
- 2
Wire Hermes to the local endpoint and validate the config
Write the llama-server command (serve.sh) and the Hermes config (config.yaml, with provider custom and base_url set to your local endpoint). The script writes config.yaml to the current directory so CI can check it; copy it to ~/.hermes/config.yaml for Hermes to use it. --jinja turns on the chat template that tool calling needs. GLM is not on Hermes' tool-use auto-enforce list (gpt/gemini/grok style), so tool_use_enforcement: true steers it back to calling tools if it starts narrating what it would do instead. CI parses the YAML and checks that the serve command pins the quant and sampling flags; nothing is downloaded or run.
```shell
# Write the serve command: pins the UD-IQ2_M quant, the recommended sampling, and --jinja.
cat > serve.sh <<'SH'
./llama.cpp/llama-server \
  --model unsloth/GLM-5.2-GGUF/UD-IQ2_M/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf \
  --temp 1.0 --top-p 0.95 --min-p 0.01 \
  --ctx-size 32768 \
  --jinja \
  --host 0.0.0.0 --port 8080
SH

# Write the Hermes config: custom provider pointed at the local endpoint.
cat > config.yaml <<'YAML'
model:
  default: glm-5.2
  provider: custom
  base_url: http://localhost:8080/v1
  api_key: local
  context_length: 32768
agent:
  tool_use_enforcement: true
YAML

# Validate the wiring: parse the YAML and check the serve flags. Nothing is downloaded or run.
ruby -ryaml -e '
c = YAML.load_file("config.yaml")
m = c["model"] || {}
abort "BAD: provider not custom" unless m["provider"] == "custom"
bu = m["base_url"].to_s
abort "BAD: base_url not a local endpoint" unless bu.include?("localhost") || bu.include?("127.0.0.1")
abort "BAD: no context_length" unless m["context_length"]
s = File.read("serve.sh")
abort "BAD: serve does not pin the UD-IQ2_M quant" unless s.include?("UD-IQ2_M")
abort "BAD: serve missing --jinja (tool calling needs the chat template)" unless s.include?("--jinja")
abort "BAD: serve missing Unsloth recommended sampling" unless s.include?("--temp 1.0") && s.include?("--top-p 0.95") && s.include?("--min-p 0.01")
puts "config OK: Hermes -> custom local endpoint (provider custom, base_url local, context_length set); serve pins UD-IQ2_M with --jinja + recommended sampling"
'
```
- 3
Hand it a long, checkable job (the model step, not checked by CI)
Sandbox anything that runs commands (`hermes config set terminal.backend docker`), then give GLM-5.2 work that suits a slow, private, tireless worker: a repo-wide refactor you describe once, a research pass over a folder of local documents, or an overnight scheduled job that produces a report by morning. The agent loop and the model run are fenced.
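Before committing to an overnight job, it can help to confirm the endpoint answers a single chat request. A minimal sketch, assuming the server exposes `POST <base_url>/chat/completions` as OpenAI-compatible servers do; `build_chat_request` and `smoke_test` are hypothetical helpers, not part of Hermes:

```ruby
require "json"
require "net/http"
require "uri"
require "yaml"

# Sketch: build an OpenAI-style chat completion request from the "model" section
# of config.yaml. Assumes the endpoint path is <base_url>/chat/completions.
def build_chat_request(model_cfg, prompt)
  uri = URI.join(model_cfg["base_url"].chomp("/") + "/", "chat/completions")
  req = Net::HTTP::Post.new(uri)
  req["Content-Type"] = "application/json"
  req["Authorization"] = "Bearer #{model_cfg["api_key"]}"
  req.body = JSON.generate(
    model: model_cfg["default"],
    messages: [{ role: "user", content: prompt }],
    max_tokens: 32
  )
  [uri, req]
end

# Sends the request and returns the reply text. Only call this while the local
# server is running; the long read timeout allows for a few tok/s.
def smoke_test(config_path = "config.yaml", prompt = "Reply with the single word: ready")
  model_cfg = YAML.load_file(config_path).fetch("model")
  uri, req = build_chat_request(model_cfg, prompt)
  res = Net::HTTP.start(uri.host, uri.port, read_timeout: 600) { |http| http.request(req) }
  JSON.parse(res.body).dig("choices", 0, "message", "content")
end
```

With the server up, `puts smoke_test` should print a short reply. If it hangs or errors, fix the endpoint before handing Hermes anything long.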
Eval, 2 fixtures
Last passed: verified today
- wiring-ok · contains · timeout 30s · max $0
  Expected: config OK: Hermes -> custom local endpoint (provider custom, base_url local, context_length set); serve pins UD-IQ2_M with --jinja + recommended sampling
- clean-exit · exit_code · timeout 30s · max $0
  Expected: 0
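The two fixtures amount to "stdout contains this line" and "the process exits 0". A hypothetical runner for those two checks could look like the sketch below; it is not the site's CI, `check_fixture` is an invented name, and the 30s timeout is left out for brevity:

```ruby
require "open3"
require "rbconfig"

# Hypothetical fixture runner: not the site's CI, just the two kinds of check it describes.
# cmd is an argv array; returns true only if every requested check passes.
def check_fixture(cmd, contains: nil, exit_code: nil)
  output, status = Open3.capture2e(*cmd)
  return false if contains && !output.include?(contains)
  return false if exit_code && status.exitstatus != exit_code
  true
end

# Stand-in command that prints a success line and exits 0.
demo = [RbConfig.ruby, "-e", 'puts "config OK: demo"']
puts check_fixture(demo, contains: "config OK") # wiring-ok style check
puts check_fixture(demo, exit_code: 0)          # clean-exit style check
```

To try it on the real recipe, the step 2 script could be saved to a file and passed as `["sh", "that-file"]`.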
Results
A private, free, capable agent that never leaves your desk. The UD-IQ2_M quant is ~239GB (needs 256GB+ unified memory, 512GB comfortable) and runs at low-single-digits to ~9 tok/s on an M3 Ultra: miserable for chat, completely fine for an agent grinding a long task while you do other things.
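To see what those speeds mean for an unattended job, a back-of-envelope conversion from tokens to hours is enough. In this sketch the 50,000-token budget is a made-up example, `hours_for` is a hypothetical helper, and prompt processing time is ignored:

```ruby
# Back-of-envelope: wall-clock hours to generate a token budget at a given speed.
# The 50_000-token job is a made-up example; the speeds span the range quoted above.
# Prompt processing time is ignored, so real runs will take longer.
def hours_for(tokens, tok_per_s)
  tokens / tok_per_s.to_f / 3600
end

[2, 5, 9].each do |speed|
  puts format("%d tok/s: %.1f hours for 50,000 tokens", speed, hours_for(50_000, speed))
end
```

At the slow end that is most of a working day, and at the fast end under two hours, which is why the recipe frames this as overnight or background work.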