Local model chore: read a photo with a vision model, on-device
Snap a receipt, a medication label, or a handwritten note, and have a free offline vision model read out the details so you do not have to squint and retype.
CI-verified, 2/2 fixtures passing.
Intended Use
For anyone who wants a photo read privately, without retyping. CI builds a fixture receipt image with Pillow, pulls the real vision model gemma3:4b, and runs it on the image through the local Ollama API, asserting a non-empty answer and a clean exit (no key, no cloud). What the model reads is non-deterministic, so the value itself is fenced off from CI assertions. Only the 4B and larger Gemma 3 sizes are vision-capable.
Not for
- The 1B and 270M Gemma 3 sizes: they are text-only, and vision needs 4B or larger
- Expecting CI to confirm the number the model read: CI confirms only that a local vision model ran on the image offline
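The size rule above can be checked before pulling a model. This is a minimal sketch, assuming tags of the plain form `gemma3:<size>b` or `gemma3:<size>m`; the helper names `gemma3_param_billions` and `is_vision_capable` are hypothetical, and the 4B threshold is taken from the text above rather than queried from Ollama.

```python
import re

# From the text: only the 4B and larger Gemma 3 sizes are vision-capable.
VISION_MIN_PARAMS_B = 4.0


def gemma3_param_billions(tag: str) -> float:
    """Parse the size suffix of a plain gemma3 tag into billions of parameters.

    Assumption: the tag looks like "gemma3:4b" or "gemma3:270m". Tags with
    extra suffixes are not handled by this sketch and raise ValueError.
    """
    match = re.fullmatch(r"gemma3:(\d+(?:\.\d+)?)([bm])", tag.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized gemma3 tag: {tag!r}")
    value, unit = float(match.group(1)), match.group(2)
    # "m" means millions, so convert to billions.
    return value if unit == "b" else value / 1000.0


def is_vision_capable(tag: str) -> bool:
    """Return True when the tag meets the 4B-or-larger rule stated above."""
    return gemma3_param_billions(tag) >= VISION_MIN_PARAMS_B
```

Calling `is_vision_capable("gemma3:1b")` before a pull is a cheap way to catch a text-only tag early, instead of waiting for the model to ignore the image.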
The Stack
Tested Against
- ollama@0.30
- gemma3:4b (vision)
- pillow@latest
- python@3.12
Side effects & data flow
- Network: ollama.com (install script and model pull) and PyPI (pillow install); inference is fully offline
- Writes: ~/.ollama/ (downloaded model), ./receipt.png, ./.venv/
- Credentials: none required
Data privacy
- The photo is shared with nobody. It is read on-device via the local Ollama API and never leaves the laptop (fully local after the one-time model download).
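One way to back the privacy claim in your own scripts is to refuse to send an image anywhere except a loopback address. The sketch below is an illustration only: `is_loopback_url` is a hypothetical helper, and it assumes the hostname `localhost` resolves to the loopback interface, which is the usual convention but is not verified here.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """Return True when the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        # Assumption: localhost maps to loopback on this machine.
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # A named host other than localhost; treat it as non-local.
        return False
```

A guard such as `assert is_loopback_url("http://localhost:11434/api/generate")` placed before the request makes an accidental remote endpoint fail loudly.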
Prerequisites
- A laptop with ~8GB RAM (gemma3:4b is about 3.3GB)
Steps
- 1
Read a fixture image on a real vision model (offline)
Install Ollama, pull the vision-capable gemma3:4b, build a small receipt image, and ask the model to read the total via the local API. After the download this runs offline. CI runs exactly this and checks that the vision model returned a non-empty answer; the number it reads is fenced off and not asserted.
# Install Ollama if missing, start the server if it is not already up, then wait for it.
command -v ollama >/dev/null 2>&1 || curl -fsSL https://ollama.com/install.sh | sh
curl -s http://localhost:11434/api/version >/dev/null 2>&1 || (ollama serve >/tmp/ollama-serve.log 2>&1 &)
for i in $(seq 1 60); do curl -s http://localhost:11434/api/version >/dev/null 2>&1 && break; sleep 1; done
# One-time model download, then an isolated Python environment with Pillow.
ollama pull gemma3:4b >/dev/null 2>&1
python3 -m venv .venv
.venv/bin/pip install -q pillow
.venv/bin/python - <<'EOF'
import base64, json, urllib.request
from PIL import Image, ImageDraw

# Draw a small fixture receipt.
img = Image.new("RGB", (480, 220), "white")
d = ImageDraw.Draw(img)
for pos, line in [((20, 30), "CAFE RECEIPT"), ((20, 90), "Latte 4.50"),
                  ((20, 120), "Muffin 3.25"), ((20, 170), "TOTAL 7.75")]:
    d.text(pos, line, fill="black")
img.save("receipt.png")

# Send the image, base64-encoded, to the local Ollama API.
b64 = base64.b64encode(open("receipt.png", "rb").read()).decode()
payload = json.dumps({
    "model": "gemma3:4b",
    "prompt": "Read this receipt. What is the total amount?",
    "images": [b64],
    "stream": False,
}).encode()
req = urllib.request.Request("http://localhost:11434/api/generate", data=payload,
                             headers={"Content-Type": "application/json"})
out = (json.load(urllib.request.urlopen(req, timeout=720)).get("response") or "").strip()

# Only a non-empty answer is asserted; the value read is not checked.
assert out, "vision model returned an empty response"
print(f"chore4 OK: gemma3:4b read the image and returned {len(out)} chars offline (quality fenced)")
EOF
- 2
Point it at your own photos (the quality step, not checked by CI)
In LM Studio or Open WebUI with a vision model, attach a real receipt, label, or handwritten note and ask 'what is the total' or 'type out this text.' What it reads is non-deterministic, so CI never claims the specific value.
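If you would rather stay in a script than a GUI, the request from step 1 can be pointed at a photo on disk. This is a minimal sketch, assuming the same local endpoint and request fields that step 1 uses; `build_payload` and `ask_about_photo` are hypothetical helper names, and the quality of the answer is no more guaranteed here than in the GUI route.

```python
import base64
import json
import urllib.request
from pathlib import Path

# Same local endpoint as step 1; nothing is sent off the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_path: str, prompt: str, model: str = "gemma3:4b") -> bytes:
    """Encode one image file and a prompt as a non-streaming request body."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    body = {"model": model, "prompt": prompt, "images": [b64], "stream": False}
    return json.dumps(body).encode("utf-8")


def ask_about_photo(image_path: str, prompt: str, timeout: int = 720) -> str:
    """Send the photo to the local Ollama server and return its text answer."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(image_path, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return (json.load(resp).get("response") or "").strip()
```

With the server from step 1 running, a call such as `ask_about_photo("my-receipt.jpg", "What is the total?")` returns the model's free-text answer; the file name is a placeholder for your own photo.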
Eval, 2 fixtures
Last passed: verified today
- vision-ran (contains), timeout 900s · max $0. Expected: chore4 OK: gemma3:4b read the image and returned
- clean-exit (exit_code), timeout 900s · max $0. Expected: 0
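The two fixtures amount to a substring check on the output and an exit-code check. How the CI harness implements them is not shown here, so the following is only a sketch of the idea: `run_fixture_checks` is a hypothetical function, and the 900-second default mirrors the timeout listed above.

```python
import subprocess


def run_fixture_checks(cmd: list[str], expected_substring: str,
                       timeout_s: int = 900) -> dict:
    """Run cmd once and evaluate a 'contains' check and an 'exit_code' check."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    return {
        # Passes when stdout contains the expected prefix; the value read
        # by the model is deliberately not part of the check.
        "vision-ran": expected_substring in proc.stdout,
        # Passes only on exit code 0.
        "clean-exit": proc.returncode == 0,
    }
```

Matching only the fixed prefix of the output line keeps the check stable even though the character count and the model's answer vary from run to run.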
Results
Vision-capable Gemma 3 (4B and up) reads images on-device. Attach the photo, ask 'what is the total' or 'type out this text,' and it pulls the details, fully local.