Local Inference · Open Source · Free · Active · Machine-verified · intermediate · ~15 min setup

Local model chore: read a photo with a vision model, on-device

Snap a receipt, a medication label, or a handwritten note, and have a free offline vision model read out the details so you do not have to squint and retype.

by Shilpa Mitra · verified today · v1.0.0

CI-verified, 2/2 fixtures passing.

Intended Use

For anyone who wants a photo read privately, without retyping. CI builds a fixture receipt image with Pillow, pulls the real vision model gemma3:4b, and runs it on the image through the local Ollama API, offline, asserting a non-empty answer and a clean exit (no key, no cloud). The text the model reads back is non-deterministic, so it is fenced off from CI assertions; only the 4B and larger Gemma 3 sizes are vision-capable.

Not for

  • The 1B and 270M Gemma 3 sizes: they are text-only; vision needs 4B or larger
  • Expecting CI to confirm the number it read; CI confirms only that a local vision model ran on the image offline
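
Before pointing a photo at a model, it can help to confirm that the tag you pulled is actually vision-capable. The following is a minimal sketch, assuming your Ollama build reports a "capabilities" list in its /api/show response (older builds may not, in which case the check conservatively returns False). The helper names `has_vision` and `fetch_show` are hypothetical, not part of the recipe.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # same local server the recipe talks to


def has_vision(show_response: dict) -> bool:
    """Return True if a parsed /api/show response lists the "vision" capability."""
    # Assumption: the response carries a "capabilities" list. If the key is
    # missing or null, treat the model as text-only rather than guessing.
    return "vision" in (show_response.get("capabilities") or [])


def fetch_show(model: str) -> dict:
    """Ask the local Ollama server to describe an already-pulled model."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/show",
        data=json.dumps({"model": model}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

With the server running and the model pulled, `has_vision(fetch_show("gemma3:4b"))` would be the call to make; a text-only tag should come back False under the same assumption.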

The Stack

Tested Against

ollama@0.30 · gemma3:4b (vision) · pillow@latest · python@3.12

Side effects & data flow

Network: ollama.com, model pull only; inference is fully offline
Writes: ~/.ollama/ (downloaded model), ./receipt.png, ./.venv/
Credentials: none required

Data privacy

  • The photo: shared with nobody (fully local after the one-time model download). Retention: read on-device via the local Ollama API; the image never leaves the laptop.

Prerequisites

  • A laptop with ~8GB RAM (gemma3:4b is about 3.3GB)

Steps

  1. Read a fixture image on a real vision model (offline)

    Install Ollama, pull the vision-capable gemma3:4b, build a small receipt image, and ask the model to read the total via the local API. After the download this runs offline. CI runs exactly this and checks that the vision model returned a non-empty answer (the number it reads is fenced, so CI does not assert it).

    # Install Ollama only if it is not already on PATH.
    command -v ollama >/dev/null 2>&1 || curl -fsSL https://ollama.com/install.sh | sh
    # Start the local server if it is not already answering, then wait up to 60s for it.
    curl -s http://localhost:11434/api/version >/dev/null 2>&1 || (ollama serve >/tmp/ollama-serve.log 2>&1 &)
    for i in $(seq 1 60); do curl -s http://localhost:11434/api/version >/dev/null 2>&1 && break; sleep 1; done
    # One-time download of the vision-capable model.
    ollama pull gemma3:4b >/dev/null 2>&1
    python3 -m venv .venv
    .venv/bin/pip install -q pillow
    .venv/bin/python - <<'EOF'
    import base64, json, urllib.request
    from PIL import Image, ImageDraw
    
    # Draw a small synthetic receipt to use as the fixture image.
    img = Image.new("RGB", (480, 220), "white")
    d = ImageDraw.Draw(img)
    for pos, line in [((20, 30), "CAFE RECEIPT"), ((20, 90), "Latte      4.50"),
                      ((20, 120), "Muffin     3.25"), ((20, 170), "TOTAL      7.75")]:
        d.text(pos, line, fill="black")
    img.save("receipt.png")
    
    # The local API takes images as base64 strings in the "images" list.
    with open("receipt.png", "rb") as fh:
        b64 = base64.b64encode(fh.read()).decode()
    payload = json.dumps({
        "model": "gemma3:4b",
        "prompt": "Read this receipt. What is the total amount?",
        "images": [b64],
        "stream": False,  # one JSON object back instead of a stream of chunks
    }).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate", data=payload,
                                headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=720) as resp:
        out = (json.load(resp).get("response") or "").strip()
    # Only a non-empty answer is asserted; the text itself is non-deterministic.
    assert out, "vision model returned an empty response"
    print(f"chore4 OK: gemma3:4b read the image and returned {len(out)} chars offline (quality fenced)")
    EOF
  2. Point it at your own photos (the quality step, not checked by CI)

    In LM Studio or Open WebUI with a vision model, attach a real receipt, label, or handwritten note and ask 'what is the total' or 'type out this text.' What it reads is non-deterministic, so CI never claims the specific value.
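
If you would rather stay on the command line than open a GUI, the request from step 1 can be adapted to any image file. This is a sketch under stated assumptions, not part of the verified recipe: it reuses the same local /api/generate endpoint and payload shape as step 1, and the helper names `build_payload` and `read_photo` are hypothetical.

```python
import base64
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"  # endpoint used in step 1


def build_payload(image_path: str, prompt: str, model: str = "gemma3:4b") -> bytes:
    """Encode an image file and a prompt as a JSON request body."""
    with open(image_path, "rb") as fh:
        b64 = base64.b64encode(fh.read()).decode()
    return json.dumps(
        {"model": model, "prompt": prompt, "images": [b64], "stream": False}
    ).encode()


def read_photo(image_path: str, prompt: str) -> str:
    """Send the photo to the local model and return its free-text answer."""
    req = urllib.request.Request(
        GENERATE_URL,
        data=build_payload(image_path, prompt),
        headers={"Content-Type": "application/json"},
    )
    # Generous timeout, matching step 1: a first run may be slow on a laptop.
    with urllib.request.urlopen(req, timeout=720) as resp:
        return (json.load(resp).get("response") or "").strip()
```

Something like `print(read_photo("my_receipt.jpg", "Type out this text."))` would then read a photo of your own (the file name is a placeholder). As with the GUI route, the answer is non-deterministic, so check it against the photo.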

Eval, 2 fixtures

Last passed: verified today
  • vision-ran · contains · timeout 900s · max $0

    Expected: chore4 OK: gemma3:4b read the image and returned

  • clean-exit · exit_code · timeout 900s · max $0

    Expected: 0
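
To make the two fixture types concrete, here is a rough sketch of what a "contains" check and an "exit_code" check amount to. The actual CI harness is not shown in this recipe, so `RunResult`, `check_contains`, and `check_exit_code` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Captured outcome of running the step 1 script (hypothetical shape)."""
    stdout: str
    exit_code: int


def check_contains(result: RunResult, expected: str) -> bool:
    """A "contains" fixture passes if the expected text appears in stdout."""
    return expected in result.stdout


def check_exit_code(result: RunResult, expected: int) -> bool:
    """An "exit_code" fixture passes if the process exited with that code."""
    return result.exit_code == expected


# Sample run; the character count here is made up for illustration.
sample = RunResult(
    stdout="chore4 OK: gemma3:4b read the image and returned 42 chars offline (quality fenced)\n",
    exit_code=0,
)
vision_ran = check_contains(sample, "chore4 OK: gemma3:4b read the image and returned")
clean_exit = check_exit_code(sample, 0)
```

The expected string stops before the character count, so the check passes whatever length the model's answer happens to be; that is how the fixture stays stable while the reading itself remains fenced.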

Results

Vision-capable Gemma 3 (4B and up) reads images on-device. Attach the photo, ask 'what is the total' or 'type out this text,' and it pulls the details, fully local.
