Local Inference · Open Source · Free · Active · Machine-verified · intermediate · ~15 min setup

Local model chore: read a photo with a vision model, on-device

Snap a receipt, a medication label, or a handwritten note, and have a free offline vision model read out the details so you do not have to squint and retype.

by Shilpa Mitra · verified today · v1.0.0

CI-verified, 2/2 fixtures passing.

Intended Use

Anyone who wants a photo read privately, without retyping. CI builds a fixture receipt image with Pillow, pulls the real vision model gemma3:4b, and runs it on the image through the local Ollama API with inference offline, asserting a non-empty answer and a clean exit (no key, no cloud). The text the model reads is non-deterministic, so it is fenced: CI never asserts the value. Only the 4B and larger Gemma 3 sizes are vision-capable.

Not for

The 1B and 270M Gemma 3 sizes: those are text-only; vision needs 4B or larger
Expecting CI to confirm the number the model read: CI confirms only that a local vision model ran on the image offline
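
The size rule above can be turned into a quick guard before pulling a model. This is a minimal sketch, assuming the size appears right after the colon in the tag (as in gemma3:4b); the function name gemma3_is_vision_capable is hypothetical and not part of Ollama.

```python
import re

# Assumption: the parameter count follows the colon, e.g. "gemma3:4b" or "gemma3:270m".
_SIZE = re.compile(r"^gemma3:(\d+(?:\.\d+)?)([mb])\b", re.IGNORECASE)


def gemma3_is_vision_capable(tag: str) -> bool:
    """Return True when a gemma3 tag names a 4B-or-larger size."""
    match = _SIZE.match(tag.strip())
    if match is None:
        raise ValueError(f"not a sized gemma3 tag: {tag!r}")
    value = float(match.group(1))
    unit = match.group(2).lower()
    # Convert millions of parameters to billions so both units compare alike.
    billions = value / 1000 if unit == "m" else value
    return billions >= 4
```

A tag without an explicit size raises ValueError instead of guessing which size it resolves to.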

The Stack

Google Gemma 3: vision model (Gemma 3 4B)
Ollama: local runtime

Tested Against

ollama@0.30 · gemma3:4b (vision) · pillow@latest · python@3.12

Side effects & data flow

Network: ollama.com, model pull only; inference is fully offline
Writes: ~/.ollama/ (downloaded model), ./receipt.png, ./.venv/
Credentials: none required
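
If you adapt the script, one way to keep the "inference is fully offline" property honest is to refuse any API URL that is not on the loopback interface. The sketch below uses only the standard library; is_loopback_url is a hypothetical helper, and it assumes the name localhost resolves to this machine.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """Return True when the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":  # assumption: localhost maps to this machine
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname is treated as remote.
        return False
```

Calling it on http://localhost:11434/api/generate before sending the image would catch an endpoint accidentally pointed at a remote server.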

Data privacy

The photo goes to nobody (fully local after the one-time model download). Retention: read on-device via the local Ollama API; the image never leaves the laptop.

Prerequisites

A laptop with ~8GB RAM (gemma3:4b is about 3.3GB)
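
The same ~3.3GB also has to fit on disk under ~/.ollama/. A rough pre-flight check, as a sketch: it treats 3.3GB as 3.3 billion bytes (an assumption, since the exact download size may differ) and checks the volume that holds your home directory.

```python
import shutil
from pathlib import Path

# Assumption: "about 3.3GB" is taken as 3.3 billion bytes.
MODEL_BYTES = 3_300_000_000


def has_room(free_bytes: int, needed_bytes: int = MODEL_BYTES) -> bool:
    """Return True when the free space covers the model download."""
    return free_bytes >= needed_bytes


if __name__ == "__main__":
    free = shutil.disk_usage(Path.home()).free
    print("enough disk for gemma3:4b" if has_room(free) else "free up disk space first")
```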

Steps

1
Read a fixture image on a real vision model (offline)

Install Ollama, pull the vision-capable gemma3:4b, build a small receipt image, and ask the model to read the total via the local API. After the download this runs offline. CI runs exactly this and checks that the vision model returned a non-empty answer; the number it reads is fenced and never asserted.

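# Install Ollama only if it is not already on PATH.
command -v ollama >/dev/null 2>&1 || curl -fsSL https://ollama.com/install.sh | sh
# Start the server in the background unless one is already answering.
curl -s http://localhost:11434/api/version >/dev/null 2>&1 || (ollama serve >/tmp/ollama-serve.log 2>&1 &)
# Wait up to 60s for the API to come up.
for i in $(seq 1 60); do curl -s http://localhost:11434/api/version >/dev/null 2>&1 && break; sleep 1; done
# Pull the model, retrying transient registry or network failures.
for attempt in 1 2 3; do
  ollama pull gemma3:4b && break
  echo "gemma3:4b pull attempt $attempt failed; retrying in 10s" >&2
  sleep 10
done
# Match the full tag so a text-only gemma3 size cannot satisfy the check.
ollama list | grep -q 'gemma3:4b' || { echo "BAD: gemma3:4b unavailable after 3 pull attempts (registry or network)" >&2; exit 1; }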
python3 -m venv .venv
.venv/bin/pip install -q pillow
.venv/bin/python - <<'EOF'
import base64, json, urllib.request
from PIL import Image, ImageDraw

img = Image.new("RGB", (480, 220), "white")
d = ImageDraw.Draw(img)
for pos, line in [((20, 30), "CAFE RECEIPT"), ((20, 90), "Latte      4.50"),
                  ((20, 120), "Muffin     3.25"), ((20, 170), "TOTAL      7.75")]:
    d.text(pos, line, fill="black")
img.save("receipt.png")

b64 = base64.b64encode(open("receipt.png", "rb").read()).decode()
payload = json.dumps({
    "model": "gemma3:4b",
    "prompt": "Read this receipt. What is the total amount?",
    "images": [b64],
    "stream": False,
}).encode()
req = urllib.request.Request("http://localhost:11434/api/generate", data=payload,
                            headers={"Content-Type": "application/json"})
out = (json.load(urllib.request.urlopen(req, timeout=1200)).get("response") or "").strip()
assert out, "vision model returned an empty response"
print(f"chore4 OK: gemma3:4b read the image and returned {len(out)} chars offline (quality fenced)")
EOF
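
CI stops at "non-empty answer", but on your own machine you may want a quick spot check against the fixture, whose TOTAL line is 7.75. The sketch below pulls two-decimal amounts out of a free-text answer; extract_amounts and fixture_total_seen are hypothetical helpers, and because the model's wording is non-deterministic, a miss here is a hint and not a failure.

```python
import re

# Two-decimal amounts such as 7.75, not embedded in a longer number.
_AMOUNT = re.compile(r"(?<![\d.])\d+\.\d{2}(?!\d)")


def extract_amounts(answer: str) -> list[str]:
    """Return every two-decimal amount found in the answer, in order."""
    return _AMOUNT.findall(answer)


def fixture_total_seen(answer: str) -> bool:
    """Return True when the fixture's total (7.75) appears in the answer."""
    return "7.75" in extract_amounts(answer)
```

You could call fixture_total_seen(out) after the assert in the script above and print a warning when it returns False.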

2
Point it at your own photos (the quality step, not checked by CI)
In LM Studio or Open WebUI with a vision model, attach a real receipt, label, or handwritten note and ask 'what is the total' or 'type out this text.' What it reads is non-deterministic, so CI never claims the specific value.
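
If you would rather stay in a script than open a GUI, the Step 1 request can be pointed at any image file. This is a sketch under the same assumptions as Step 1 (Ollama serving on localhost:11434 with gemma3:4b pulled); build_payload and read_photo are hypothetical names.

```python
import base64
import json
import urllib.request
from pathlib import Path

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as Step 1


def build_payload(image_bytes: bytes, prompt: str, model: str = "gemma3:4b") -> bytes:
    """Encode the image as base64 and wrap it in a non-streaming generate request."""
    encoded = base64.b64encode(image_bytes).decode()
    body = {"model": model, "prompt": prompt, "images": [encoded], "stream": False}
    return json.dumps(body).encode()


def read_photo(path: str, prompt: str) -> str:
    """Send one local image to the local model and return its text answer."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(Path(path).read_bytes(), prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=1200) as response:
        return (json.load(response).get("response") or "").strip()
```

For example, read_photo("my-receipt.jpg", "What is the total?") would return the model's answer for a file of that (made-up) name. Treat the result as a draft to verify by eye.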

Eval, 2 fixtures

Last passed: verified today

vision-ran · contains · timeout 1500s · max $0
Expected: chore4 OK: gemma3:4b read the image and returned
clean-exit · exit_code · timeout 1500s · max $0
Expected: 0
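
To make the two fixture types concrete, here is a guess at how such checks could be evaluated. The real CI checker is not shown in this recipe, so check_fixture and its arguments are assumptions for illustration only.

```python
def check_fixture(kind: str, expected: str, stdout: str, exit_code: int) -> bool:
    """Evaluate one fixture against a finished run (hypothetical checker)."""
    if kind == "contains":
        # Passes when the expected text appears anywhere in the output.
        return expected in stdout
    if kind == "exit_code":
        # Passes when the process exit status equals the expected number.
        return exit_code == int(expected)
    raise ValueError(f"unknown fixture kind: {kind!r}")
```

Note that the contains fixture matches only the fixed prefix of the success line, which is why the character count and the text the model read can vary from run to run without failing CI.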

Results

Vision-capable Gemma 3 (4B and up) reads images on-device. Attach the photo, ask 'what is the total' or 'type out this text,' and it pulls the details, fully local.
