← The Context Window
Models · June 26, 2026

Gemma 4 Vision: What It Can Actually Do on Your Local Machine

OpenClaw Sanctuary · 5 min read

Most local models are text-only. You give them words, they give you words back. Gemma 4 breaks that. Every size - the small E4B, the 12B, the 26B MoE - accepts image input natively through Ollama's API. Send it a screenshot, a photo, a diagram. It can read and reason about what it sees.

The part worth paying attention to: the E4B variant runs in about 4GB of RAM. If you have an 8GB machine that can't run much else, it can run this. That's unusual for a vision model.

Which size to run

Gemma 4 comes in three sizes with meaningfully different RAM requirements:

  • gemma4:e4b - 4B parameters, ~4GB RAM. Fits on 8GB systems. Fast, capable enough for most vision tasks. This is the default when you run ollama pull gemma4.
  • gemma4:12b - 12B parameters, ~8-10GB RAM, needs 16GB total to run comfortably. Better quality on complex images and longer descriptions. This is the one to use if you have the headroom.
  • gemma4:26b - MoE architecture, needs 32GB or more. Best quality, but out of reach for most setups.

For most people: start with E4B on 8GB, step up to 12B on 16GB. The quality difference between E4B and 12B is noticeable on tasks like reading small text or interpreting complex charts - less so on straightforward photo descriptions. If your machine is at the limit running 12B, the mini PC guide covers hardware options with 32GB that handle both sizes without thermal throttling.

How to send images from the command line

Ollama's /api/chat endpoint accepts a base64-encoded image in an images array on the message object. The simplest way to test it:

Terminal - encode and send an image
# Encode an image to base64 first
IMAGE=$(base64 -i /path/to/your/image.jpg)

# Send it to Gemma 4
curl http://localhost:11434/api/chat -s -d "{
  \"model\": \"gemma4:12b\",
  \"messages\": [{
    \"role\": \"user\",
    \"content\": \"What is in this image? Be specific.\",
    \"images\": [\"$IMAGE\"]
  }],
  \"stream\": false
}" | python3 -m json.tool | grep '"content"'

The images field is an array, so you can pass multiple images in a single message if you need to compare or analyze a sequence. Keep in mind that each image adds to your token count - large or high-resolution images can eat into your context budget.

Using vision in Open WebUI

If you have Open WebUI running, you don't need to touch the API at all. Select gemma4:12b (or gemma4:e4b) from the model dropdown, then drag and drop any image directly into the chat box, or use the image attach button. Type your question and send.

This is the quickest way to test what the model can and can't do. Open WebUI renders the image inline in the conversation so you can see exactly what you sent alongside the response.

What it's actually good at

Works well

  • Describing photos and scenes
  • Reading text from screenshots
  • OCR on clean, printed documents
  • Identifying objects and UI elements
  • Analyzing charts with clear labels
  • Comparing two images side by side
  • Reading error messages from terminal screenshots

Struggles with

  • Very small or low-contrast text
  • Handwriting
  • Precise counting of many similar items
  • Complex multi-panel diagrams
  • Subtle color differences
  • Heavily compressed or blurry images

The practical upshot: it's well-suited to reading your own screenshots, analyzing charts you're working with, and processing photos where you know what's in them and need the model to extract or describe specific things. It's less reliable for anything requiring pixel-level precision or fine visual discrimination.

A note on image size and context

Gemma 4 supports a 128K context window, but images consume a significant portion of that budget. A high-resolution photo can use thousands of tokens before your text prompt even starts. If you're running long conversations with multiple images, keep an eye on whether responses start to degrade - that's usually the context window filling up, not a model quality issue.

Resize large images before sending them. For most tasks, anything over 1024px wide is more than the model needs and just costs you context tokens. A quick resize in Preview or ImageMagick before the API call makes a real difference:

Terminal - resize before encoding
# Resize to 1024px wide, then encode
convert input.jpg -resize 1024x output.jpg
IMAGE=$(base64 -i output.jpg)

Using it in OpenClaw

The Ollama API is the same whether the model is text-only or vision-capable - you just include the images array when you have something to analyze. In an OpenClaw skill, that means you can write a skill that accepts an image path as input, reads and encodes the file, and passes it to gemma4 for analysis.

A practical example: a !readscreen skill that takes a screenshot filename, sends it to gemma4:12b with a structured prompt, and returns a plain-text description or extracted data. No cloud, no API keys, no image leaving your machine.

The full Gemma 4 setup guide - pull commands for all sizes, vision API examples, and OpenClaw integration - is at Run Gemma 4 Locally.