Can Gemma 4 analyze images locally?

Yes. All Gemma 4 sizes - E4B, 12B, and 26B - accept image input natively via Ollama. You pass each image as a base64 string in the images array of a message sent to the /api/chat endpoint. The E4B variant needs about 4GB of RAM and fits on an 8GB machine, which is unusually light for a local vision model.

How do I send an image to Gemma 4 in Ollama?

Use the /api/chat endpoint with an images array in your message. The image must be base64-encoded. Example: curl http://localhost:11434/api/chat -d '{"model":"gemma4:12b","messages":[{"role":"user","content":"What is in this image?","images":["BASE64_STRING"]}],"stream":false}'. In Open WebUI you can drag and drop images directly into the chat box.

What can Gemma 4 vision do?

Gemma 4 is good at describing photos and scenes, reading text from screenshots and clean documents (OCR), identifying objects and UI elements, and analyzing charts with clear labels. It struggles with very small or handwritten text, precise counting of many items, and complex multi-panel layouts.

How much RAM does Gemma 4 vision need?

Gemma 4 E4B (the 4B parameter variant) runs in approximately 4GB of RAM and fits comfortably on an 8GB system. Gemma 4 12B needs around 8-10GB RAM and requires 16GB total to run comfortably. Gemma 4 26B is a MoE model and needs 32GB or more.

Models · June 26, 2026

Gemma 4 Vision: What It Can Actually Do on Your Local Machine

OpenClaw Sanctuary · 5 min read

Most local models are text-only. You give them words, they give you words back. Gemma 4 breaks that. Every size - the small E4B, the 12B, the 26B MoE - accepts image input natively through Ollama's API. Send it a screenshot, a photo, a diagram. It can read and reason about what it sees.

The part worth paying attention to: the E4B variant runs in about 4GB of RAM. If you have an 8GB machine that can't run much else, it can run this. That's unusual for a vision model.

Which size to run

Gemma 4 comes in three sizes with meaningfully different RAM requirements:

gemma4:e4b - 4B parameters, ~4GB RAM. Fits on 8GB systems. Fast, capable enough for most vision tasks. This is the default when you run ollama pull gemma4.
gemma4:12b - 12B parameters, ~8-10GB RAM, needs 16GB total to run comfortably. Better quality on complex images and longer descriptions. This is the one to use if you have the headroom.
gemma4:26b - MoE architecture, needs 32GB or more. Best quality, but out of reach for most setups.

For most people: start with E4B on 8GB, step up to 12B on 16GB. The quality difference between E4B and 12B is noticeable on tasks like reading small text or interpreting complex charts - less so on straightforward photo descriptions. If your machine is at the limit running 12B, the mini PC guide covers hardware options with 32GB that handle both sizes without thermal throttling.
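The sizing advice above can be written down as a small lookup. This is only a sketch: pick_gemma4_tag is a hypothetical helper name, and the thresholds are the rough figures quoted in this article, not measured limits.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map total system RAM (whole GB) to a Gemma 4 tag,
# using the thresholds from this article (8GB -> E4B, 16GB -> 12B, 32GB+ -> 26B).
pick_gemma4_tag() {
  local ram_gb=$1
  if [ "$ram_gb" -ge 32 ]; then
    echo "gemma4:26b"
  elif [ "$ram_gb" -ge 16 ]; then
    echo "gemma4:12b"
  elif [ "$ram_gb" -ge 8 ]; then
    echo "gemma4:e4b"
  else
    echo "less than 8GB of RAM: below the sizes covered here" >&2
    return 1
  fi
}

# Example use (not run here):
#   Linux:  ram=$(awk '/MemTotal/ {print int($2/1024/1024 + 0.5)}' /proc/meminfo)
#   macOS:  ram=$(( $(sysctl -n hw.memsize) / 1073741824 ))
#   ollama pull "$(pick_gemma4_tag "$ram")"
```

Treat the result as a starting point. If other applications already use a lot of memory, dropping one size down is the safer choice.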

How to send images from the command line

Ollama's /api/chat endpoint accepts a base64-encoded image in an images array on the message object. The simplest way to test it:

Terminal - encode and send an image

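# Encode an image to base64 first. Strip newlines: GNU base64 wraps its
# output, and a line break inside the JSON string makes the request invalid.
IMAGE=$(base64 < /path/to/your/image.jpg | tr -d '\n')

# Send it to Gemma 4. The JSON goes to curl on stdin (-d @-) because a
# base64-encoded photo is often too long to pass as a command-line argument.
curl http://localhost:11434/api/chat -s -d @- <<EOF | python3 -m json.tool | grep '"content"'
{
  "model": "gemma4:12b",
  "messages": [{
    "role": "user",
    "content": "What is in this image? Be specific.",
    "images": ["$IMAGE"]
  }],
  "stream": false
}
EOF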

The images field is an array, so you can pass multiple images in a single message if you need to compare or analyze a sequence. Keep in mind that each image adds to your token count - large or high-resolution images can eat into your context budget.
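Sending several images in one message might look like the sketch below. The function names images_json_array and ask_about_images are made up for illustration, and the prompt is assumed to contain no double quotes or backslashes, since nothing here escapes it for JSON.

```shell
#!/usr/bin/env bash
# Print a JSON array of base64 strings, one per image file argument.
# Base64 output uses only A-Z, a-z, 0-9, +, / and =, so it needs no JSON escaping.
images_json_array() {
  local out="" f b64
  for f in "$@"; do
    [ -r "$f" ] || { echo "cannot read: $f" >&2; return 1; }
    b64=$(base64 < "$f" | tr -d '\n')
    out="${out:+$out,}\"$b64\""
  done
  printf '[%s]' "$out"
}

# Hypothetical wrapper: ask one question about several images at once.
# Assumes the prompt contains no double quotes or backslashes.
ask_about_images() {
  local prompt=$1 images
  shift
  images=$(images_json_array "$@") || return 1
  printf '{"model":"gemma4:12b","messages":[{"role":"user","content":"%s","images":%s}],"stream":false}' \
    "$prompt" "$images" |
    curl -s http://localhost:11434/api/chat -d @-
}
```

A call such as ask_about_images "What changed between these two screenshots?" before.png after.png would then post both files in a single request.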

Using vision in Open WebUI

If you have Open WebUI running, you don't need to touch the API at all. Select gemma4:12b (or gemma4:e4b) from the model dropdown, then drag and drop any image directly into the chat box, or use the image attach button. Type your question and send.

This is the quickest way to test what the model can and can't do. Open WebUI renders the image inline in the conversation so you can see exactly what you sent alongside the response.

What it's actually good at

Works well

Describing photos and scenes
Reading text from screenshots
OCR on clean, printed documents
Identifying objects and UI elements
Analyzing charts with clear labels
Comparing two images side by side
Reading error messages from terminal screenshots

Struggles with

Very small or low-contrast text
Handwriting
Precise counting of many similar items
Complex multi-panel diagrams
Subtle color differences
Heavily compressed or blurry images

The practical upshot: it's well-suited to reading your own screenshots, analyzing charts you're working with, and processing photos where you know what's in them and need the model to extract or describe specific things. It's less reliable for anything requiring pixel-level precision or fine visual discrimination.

A note on image size and context

Gemma 4 supports a 128K context window, but images consume a significant portion of that budget. A high-resolution photo can use thousands of tokens before your text prompt even starts. If you're running long conversations with multiple images, keep an eye on whether responses start to degrade - that's usually the context window filling up, not a model quality issue.
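One way to keep an eye on this is the prompt_eval_count field that Ollama includes in a finished response, which reports how many prompt tokens were evaluated. The sketch below makes several assumptions: the helper names are hypothetical, 128K is taken to mean 131072 tokens, and the field may be missing from some responses (for example when the prompt was cached). The budget that actually applies is whatever context length your Ollama instance is configured with, which may be smaller than the model's maximum.

```shell
#!/usr/bin/env bash
# Read a non-streaming /api/chat response on stdin and print prompt_eval_count.
# Prints nothing if the field is absent.
prompt_tokens() {
  grep -o '"prompt_eval_count":[[:space:]]*[0-9][0-9]*' | grep -o '[0-9][0-9]*$'
}

# Integer percentage of a context budget that a token count represents.
# The default budget of 131072 assumes "128K" means 128 * 1024 tokens.
context_pct() {
  local tokens=$1 budget=${2:-131072}
  echo $(( tokens * 100 / budget ))
}

# Example use (not run here), with response.json saved from a curl call:
#   tokens=$(prompt_tokens < response.json)
#   echo "prompt used ${tokens} tokens, $(context_pct "$tokens")% of the window"
```

If the percentage climbs quickly after adding an image, that image is a good candidate for resizing.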

Resize large images before sending them. For most tasks, anything over 1024px wide is more than the model needs and just costs you context tokens. A quick resize in Preview or ImageMagick before the API call makes a real difference:

Terminal - resize before encoding

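# Shrink to at most 1024px wide (the trailing > means never enlarge), then encode
# On ImageMagick 7, use `magick` in place of `convert`
convert input.jpg -resize '1024x>' output.jpg
IMAGE=$(base64 < output.jpg | tr -d '\n')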

Using it in OpenClaw

The Ollama API is the same whether the model is text-only or vision-capable - you just include the images array when you have something to analyze. In an OpenClaw skill, that means you can write a skill that accepts an image path as input, reads and encodes the file, and passes it to gemma4 for analysis.

A practical example: a !readscreen skill that takes a screenshot filename, sends it to gemma4:12b with a structured prompt, and returns a plain-text description or extracted data. No cloud, no API keys, no image leaving your machine.
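The core of such a skill could be sketched as a pair of shell functions. This is not OpenClaw's skill API, which is not shown in this article. The names readscreen_payload, readscreen, and READSCREEN_PROMPT are hypothetical, and the prompt is assumed to contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical structured prompt for the skill (no quotes or backslashes,
# because this sketch does not JSON-escape it).
READSCREEN_PROMPT="Describe this screenshot in plain text. Quote any error messages exactly."

# Build the /api/chat request body for one screenshot file.
# Usage: readscreen_payload FILE [MODEL]
readscreen_payload() {
  local file=$1 model=${2:-gemma4:12b} b64
  [ -r "$file" ] || { echo "cannot read: $file" >&2; return 1; }
  b64=$(base64 < "$file" | tr -d '\n')
  printf '{"model":"%s","messages":[{"role":"user","content":"%s","images":["%s"]}],"stream":false}' \
    "$model" "$READSCREEN_PROMPT" "$b64"
}

# Send the request to a local Ollama instance and print the raw JSON response.
readscreen() {
  local payload
  payload=$(readscreen_payload "$@") || return 1
  printf '%s' "$payload" | curl -s http://localhost:11434/api/chat -d @-
}
```

Running readscreen screenshot.png | python3 -m json.tool | grep '"content"' would print the model's description, the same way the earlier curl example does. Wiring this into an actual skill depends on how OpenClaw passes arguments, which this sketch does not assume.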

The full Gemma 4 setup guide - pull commands for all sizes, vision API examples, and OpenClaw integration - is at Run Gemma 4 Locally.
