Run Gemma 4 Locally

Google's Gemma 4 brings native vision and 86% tool calling accuracy to every model size. Install with Ollama in minutes - no GPU required for the E4B and 12B variants.

30 min | Beginner | Free | Updated June 2026

Gemma 4: Vision and Tool Calling in Every Size

Google released Gemma 4 on April 2, 2026. The headline feature is not the benchmark scores - it's that every variant in the family, from the 2B edge model to the 31B workstation model, ships with native vision and tool calling. Most model families gate these capabilities behind their larger sizes. Gemma 4 doesn't.

This guide covers installing Gemma 4 with Ollama, choosing the right size for your hardware, using the vision API to analyze images, and connecting tool calling to OpenClaw for agent work.

What's Actually Different

Vision on every variant. You can send an image to gemma4:e2b (the 2B model) and it will analyze it. This is not common. Most local model families add vision only to specific variants - Llama 3.2 Vision is separate from the base models, for example. With Gemma 4 you pick a size based on your hardware and vision comes with it.

Tool calling that actually works. Gemma 3 had 6.6% tool calling accuracy on the τ2-bench agentic benchmark - not usable for serious agent work. Gemma 4 sits at 86.4%. That's a real jump. The 26B MoE variant scores 85.5% specifically on the τ2-bench agentic tool use test. For OpenClaw setups that rely on structured function calls, this matters.

The E-series architecture. The E2B and E4B models use Google's Multi-Token Prediction (MTP) approach - generating multiple tokens per forward pass rather than one. On the same hardware, these variants run faster than a standard architecture with the same parameter count. The tradeoff is that they need more RAM than a standard model of the same size.

The 26B MoE variant. This is the most interesting size if you have 32GB RAM. It's a Mixture of Experts model with 26B total parameters but only ~4B active per token. Ollama loads all 26B into RAM at startup (so you need ~18GB), but inference speed is closer to a 4B model than a 26B model. Google claims it delivers 97% of the 31B dense model's quality at roughly 8x less compute per token.

MMLU and Benchmark Context

Gemma 4 12B scores 87.1% on MMLU. For reference, GPT-4-class performance was around 86-87% MMLU two years ago. That number doesn't tell you everything about real-world quality, but it does tell you this is not a toy model. On coding tasks, the 12B and 26B variants are competitive with much larger models from 2024.

Where Gemma 4 is weaker: complex multi-step reasoning chains. Dedicated reasoning models like DeepSeek-R1 still outperform it on hard math and logic problems. If your primary use case is reasoning, the model comparison guide covers this in detail. For vision, tool calling, and general instruction following, Gemma 4 is the strongest local option at its size class right now.

Prerequisites
  • Ollama installed and running - see Ollama Basics if you haven't done this yet
  • Ollama version 0.22 or newer (ollama --version to check)
  • Minimum 8GB RAM for the E4B, 16GB for the 12B, 32GB for the 26B and 31B
  • Disk space: 4GB for E4B, 8GB for 12B, 16GB for 26B, 22GB for 31B

Pick the Right Gemma 4 Size

Five variants, all with vision and tool calling. The decision comes down to RAM. The 26B MoE is not the obvious choice it looks like on paper - read the note below before defaulting to it.

Model                      Type                 RAM (4-bit)   Disk     CPU Speed
gemma4:e2b                 Dense 2B             ~3 GB         ~3 GB    5-10 t/s
gemma4:e4b (default)       Dense 4B             ~5 GB         ~4 GB    2-5 t/s
gemma4:12b (recommended)   Dense 12B            ~8 GB         ~8 GB    3-7 t/s
gemma4:26b                 MoE 26B, 4B active   ~18 GB        ~16 GB   5-9 t/s
gemma4:31b                 Dense 31B            ~20 GB        ~22 GB   2-4 t/s

Understanding the 26B MoE speed numbers: Mixture of Experts loads all 26 billion parameters into RAM - hence the 18GB requirement. But during inference, only ~4 billion parameters are active per token. This means the compute cost per token is similar to a 4B model, which is why the tokens-per-second is faster than the 31B despite holding more total parameters in memory. It is not faster than the 12B - you're still moving 26B worth of weights through memory bandwidth. The advantage is quality, not speed.

Which Size for Your Hardware

8GB RAM - use gemma4:e4b. It fits, but you're leaving almost nothing for your OS. Close other applications while running it. The E2B is faster and more comfortable if you don't need the full 4B quality.

16GB RAM - use gemma4:12b. This is the sweet spot. The 12B is a dense model released June 3, 2026 - added specifically because the jump from e4b to 26B MoE was too large for 16GB machines. It's the recommended starting point for most setups and handles vision and tool calling well.

32GB RAM - use gemma4:26b. The MoE variant gives you close to 31B quality with faster inference than the 31B dense model. At this RAM tier, 26B is the better choice over 31B for most workloads. Use 31B only if raw quality on a specific task matters more than response speed.
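
The RAM tiers above can be written as a small lookup. This is a minimal sketch that uses only the thresholds stated in this guide. The function name pick_gemma4_tag is hypothetical (it is not part of Ollama), and falling back to gemma4:e2b below 8GB is an assumption based on its ~3GB footprint rather than a recommendation made in the text.

```python
def pick_gemma4_tag(total_ram_gb: float) -> str:
    """Map total system RAM (in GB) to the tag this guide suggests for that tier."""
    if total_ram_gb >= 32:
        return "gemma4:26b"   # MoE, ~18 GB for the model alone
    if total_ram_gb >= 16:
        return "gemma4:12b"   # dense, ~8 GB for the model alone
    if total_ram_gb >= 8:
        return "gemma4:e4b"   # fits, but leaves little headroom for the OS
    return "gemma4:e2b"       # assumption: smallest variant for machines under 8 GB


for ram in (8, 16, 32):
    print(ram, "GB ->", pick_gemma4_tag(ram))
```

The 31B is deliberately left out of the lookup: the guide only suggests it when raw quality matters more than response speed, which is a judgment call rather than a RAM threshold.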

Storage Adds Up Fast

If you're downloading multiple sizes to test them, disk fills up quickly. E4B is 4GB, 12B is 8GB, 26B is 16GB, 31B is 22GB - downloading two or three variants alongside other Ollama models can easily push 60-80GB of model files.

Running out of model storage? Ollama stores models in ~/.ollama/models by default. You can move this to external storage by setting the OLLAMA_MODELS environment variable. A Samsung T7 Shield 2TB connected via USB-C gives you fast enough sequential reads (1,050 MB/s) to run models directly off it without noticeable loading delays. Affiliate link - costs you nothing extra.

Need More RAM?

The 12B requires 8GB of RAM just for the model - your OS and other processes need headroom on top of that. 16GB total is the realistic minimum. The 26B needs 18GB for the model alone, so 32GB is required.

Upgrading hardware: If you're on a laptop with accessible SODIMM slots, a 32GB SODIMM kit is usually the cheapest path to running larger models. If you're building a dedicated local AI machine, the Ryzen 7 6800H mini PC with 32GB runs all Gemma 4 variants up to and including the 26B MoE comfortably. Affiliate links - costs you nothing extra.

Install Gemma 4 with Ollama

Gemma 4 support requires Ollama 0.22 or newer. Check your version first, then pull whichever size fits your hardware.

Check Your Ollama Version

Terminal
ollama --version

You need 0.22 or newer. If you're behind, update Ollama:

Linux - Terminal
curl -fsSL https://ollama.com/install.sh | sh

On Mac, download the latest from ollama.com/download or run brew upgrade ollama. On Windows, re-run the installer from ollama.com.
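
If you want to script the version check, compare version numbers as tuples of integers rather than as strings (as strings, "0.9" sorts after "0.22"). The sketch below assumes only that the output of ollama --version contains a dotted version number somewhere. The helper names are hypothetical, and the 0.22 minimum is the one stated in this guide.

```python
import re
import subprocess

MIN_VERSION = (0, 22)  # minimum Ollama version stated in this guide


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_new_enough(version_output: str, minimum: tuple[int, ...] = MIN_VERSION) -> bool:
    """True if the version found in version_output is at least `minimum`."""
    return parse_version(version_output) >= minimum


def installed_ollama_is_new_enough() -> bool:
    """Run `ollama --version` and compare the result against MIN_VERSION."""
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=True
    )
    return is_new_enough(result.stdout + result.stderr)
```

Call installed_ollama_is_new_enough() on a machine with Ollama on the PATH; the two pure helpers can be used without Ollama installed.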

Pull Gemma 4

Running ollama pull gemma4 without a tag pulls the E4B (4B) variant by default. Specify a tag to get a different size:

Terminal
# Default - pulls gemma4:e4b (~4GB)
ollama pull gemma4

# All available sizes
ollama pull gemma4:e2b     # 2B dense, ~3GB disk, ~3GB RAM
ollama pull gemma4:e4b     # 4B dense, ~4GB disk, ~5GB RAM (default)
ollama pull gemma4:12b     # 12B dense, ~8GB disk, ~8GB RAM
ollama pull gemma4:26b     # 26B MoE (4B active), ~16GB disk, ~18GB RAM
ollama pull gemma4:31b     # 31B dense, ~22GB disk, ~20GB RAM

Verify the Model Loaded Correctly

Check that the model is available and inspect its capabilities:

Terminal
ollama list          # should show gemma4:12b (or whichever you pulled)
ollama show gemma4:12b   # inspect model metadata

The ollama show output will include a line confirming vision capability. If you don't see it, your Ollama version is likely too old.

Run a Quick Test

Start an interactive session to confirm it's working:

Terminal
ollama run gemma4:12b

Type a prompt and press Enter. Type /bye to exit when done.

Or test the REST API directly:

Terminal
curl -s http://localhost:11434/api/generate \
  -d '{"model":"gemma4:12b","prompt":"What is a mixture of experts model, in two sentences?","stream":false}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"

Context Window

All Gemma 4 variants support up to 128K tokens of context. The default context window Ollama allocates at startup is smaller - check your current setting with:

Terminal
ollama show gemma4:12b | grep context

If you're sending long documents or image-heavy prompts and getting truncation, override it in your API requests: "options": {"num_ctx": 32768}. Note that larger context windows increase RAM usage and slow inference - set it to what you actually need. The Ollama Advanced parameters guide covers this in detail.
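
As a sketch of where the num_ctx override sits in a request, the snippet below builds the same /api/generate call used in the quick test above and adds an options block. It uses only the Python standard library. The helper names are hypothetical, and the URL is the default local endpoint used throughout this guide.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_generate_payload(model: str, prompt: str, num_ctx: int) -> dict:
    """Build a non-streaming generate request with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context window override
    }


def generate(model: str, prompt: str, num_ctx: int = 32768) -> str:
    """POST the payload to a running Ollama server and return the response text."""
    payload = build_generate_payload(model, prompt, num_ctx)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)["response"]
```

For example, generate("gemma4:12b", long_document_text, num_ctx=32768) sends the whole prompt with a 32K window, provided the server is running and the model is pulled.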

Gemma 4 is installed. The next two sections cover what actually makes it worth using: vision input and tool calling.

Using Gemma 4's Vision Capabilities

Every Gemma 4 variant accepts image input alongside text. You pass the image as base64-encoded data in your API request, and the model processes both together. This works for OCR, diagram analysis, screenshot debugging, document processing, and anything else where you'd normally need a separate vision model.

How the API Works

The examples in this guide send vision requests to the /api/chat endpoint. Images are passed as an array of base64-encoded strings in the images field of the message object. The model receives the image and text together and generates a text response.

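For scripted or repeated use, go through the API directly, the Python SDK, or Open WebUI (which handles the encoding for you via drag-and-drop). The ollama run CLI can also read an image if you include its file path in the prompt, but the API is easier to automate.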

Via curl (Base64)

Encode the image and include it in your request:

Linux - Terminal
# Encode image to base64
IMAGE_B64=$(base64 -w 0 /path/to/image.png)

# Send to Gemma 4
curl -s http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gemma4:12b\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": \"Describe what you see in this image.\",
      \"images\": [\"$IMAGE_B64\"]
    }],
    \"stream\": false
  }" | python3 -c "import sys,json; print(json.load(sys.stdin)['message']['content'])"

Mac - Terminal
# macOS base64 has no -w flag; -i names the input file
IMAGE_B64=$(base64 -i /path/to/image.png)
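
If you would rather avoid the platform differences in the base64 command, the encoding step can be done in Python. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the payload mirrors the curl request above.

```python
import base64
from pathlib import Path


def encode_image_bytes(data: bytes) -> str:
    """Base64-encode raw image bytes as a single line of ASCII text."""
    return base64.b64encode(data).decode("ascii")


def build_vision_payload(model: str, prompt: str, image_paths: list[str]) -> dict:
    """Build an /api/chat request body with one or more images attached."""
    images = [encode_image_bytes(Path(p).read_bytes()) for p in image_paths]
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": images}],
        "stream": False,
    }
```

The resulting dictionary can be serialized with json.dumps and posted to http://localhost:11434/api/chat in the same way as the curl example.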

Via Python SDK

The Ollama Python library accepts local file paths directly - no manual base64 encoding needed:

Python
import ollama

response = ollama.chat(
    model='gemma4:12b',
    messages=[{
        'role': 'user',
        'content': 'Describe what you see in this image.',
        'images': ['/path/to/image.jpg']  # local file path
    }]
)
print(response['message']['content'])

Install if needed: pip install ollama

You can pass multiple images in the same message by adding more entries to the images list. You can also pass base64 strings directly instead of file paths.

Practical Use Cases

These are the tasks where local vision is most useful in practice:

OCR and text extraction. Pass a screenshot, scanned document, or photo of a whiteboard. Gemma 4 extracts text reliably, including handwriting at reasonable resolution. Prompt: "Extract all text from this image, preserving structure."

Error screenshots. Paste a terminal error or browser console screenshot directly instead of copying the text. Works well for longer stack traces where copy-paste introduces formatting issues. Prompt: "What is causing this error and how do I fix it?"

Diagram and chart analysis. Architecture diagrams, database schemas, flowcharts - describe them, identify components, or ask questions about relationships. Works better than most vision models for technical content.

Form and document processing. Pass a PDF page screenshot or image of a form and extract structured data. For programmatic use, ask for JSON output: "Extract the fields from this form as JSON with keys matching the field labels."

Code review from screenshots. Sometimes you have an image of code and can't copy it - from a video, a presentation, or a printed page. Gemma 4 reads and explains code from images accurately.
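
When you ask for JSON output, as in the form-processing case above, the reply may still arrive wrapped in a code fence or a sentence of explanation. The helper below is a hedged sketch for pulling the JSON object out of such a reply; parse_json_reply is a hypothetical name, and the approach (take the text between the first "{" and the last "}") is a simple heuristic, not a full parser.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_json_reply(text: str) -> dict:
    """Parse a JSON object from a model reply, tolerating a code fence or extra prose."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, text, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(candidate[start:end + 1])


print(parse_json_reply('Here are the fields: {"name": "Ada", "year": 1843}'))
```

Ollama's API also accepts a format field on requests to constrain the output to JSON, which reduces how often this kind of cleanup is needed.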

Prompt Tips for Vision

Put the image before the text instruction in your message content when using multi-image prompts - Google's guidance is that multimodal content placed first produces better results. Be specific about what you want: "Extract the text" or "Describe the architecture diagram" outperforms generic prompts like "What is this?".

For OCR tasks on dense text, the 12B or 26B variants are noticeably better than E4B. If you're doing high-volume document processing and the E4B results look weak, step up to the 12B before assuming the task is impossible locally.

Want drag-and-drop image chat? Open WebUI handles the base64 encoding for you - just drop an image into the chat input and it works. Easier than building the API call yourself for one-off image analysis tasks.

Tool Calling and OpenClaw Integration

Tool calling is the mechanism that makes a model useful as an agent backend. Instead of generating text, the model returns a structured JSON object describing a function it wants to call - your code calls that function, returns the result, and the model continues. Gemma 4 does this at 86.4% accuracy. For comparison, Gemma 3 was at 6.6%. The difference in practice is the difference between a model that works for agent tasks and one that doesn't.

How Tool Calling Works with Ollama

You define tools in your API request as JSON schema objects. The model reads your tools list and decides whether to respond with text or with a tool call. When it decides to use a tool, the response contains a tool_calls field instead of content text.

Your application receives the tool call, executes the function, and sends the result back in the conversation as a message with role: "tool". The model then generates its final response using the function output.

Basic Tool Calling Example

Request with a tool definition:

Terminal
curl -s http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:12b",
    "messages": [
      {"role": "user", "content": "What files are in the /tmp directory?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "list_files",
        "description": "List files in a directory",
        "parameters": {
          "type": "object",
          "properties": {
            "path": {
              "type": "string",
              "description": "Directory path to list"
            }
          },
          "required": ["path"]
        }
      }
    }],
    "stream": false
  }'

When the model decides to call the tool, the response looks like this:

JSON Response
{
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [{
      "function": {
        "name": "list_files",
        "arguments": {
          "path": "/tmp"
        }
      }
    }]
  }
}

Your application then executes list_files("/tmp"), gets the directory listing, and sends it back to the model:

Python - Complete Loop
import ollama
import os

def list_files(path):
    try:
        return {'files': os.listdir(path)}
    except Exception as e:
        return {'error': str(e)}

messages = [{'role': 'user', 'content': 'What files are in /tmp?'}]
tools = [{
    'type': 'function',
    'function': {
        'name': 'list_files',
        'description': 'List files in a directory',
        'parameters': {
            'type': 'object',
            'properties': {
                'path': {'type': 'string', 'description': 'Directory path'}
            },
            'required': ['path']
        }
    }
}]

# First call - model decides whether to use the tool
response = ollama.chat(model='gemma4:12b', messages=messages, tools=tools)
msg = response['message']

if msg.get('tool_calls'):
    # Add the assistant's tool-call message once, before any tool results
    messages.append(msg)

    for tool_call in msg['tool_calls']:
        fn_name = tool_call['function']['name']
        fn_args = tool_call['function']['arguments']

        # Execute the function (list_files is the only tool defined here)
        if fn_name == 'list_files':
            result = list_files(**fn_args)
        else:
            result = {'error': f'unknown tool: {fn_name}'}

        # Add tool result to conversation
        messages.append({
            'role': 'tool',
            'content': str(result),
            'name': fn_name
        })

    # Second call - model generates final response with the data
    final = ollama.chat(model='gemma4:12b', messages=messages, tools=tools)
    print(final['message']['content'])
else:
    # Model answered directly without calling a tool
    print(msg['content'])
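
Once an agent has more than one tool, the execution step is easier to manage as a lookup table. The sketch below is one possible way to structure it, not how OpenClaw or Ollama do it internally. TOOL_REGISTRY and run_tool_calls are hypothetical names, and the tool message uses the same role, content, and name keys as the loop above.

```python
import json
import os


def list_files(path):
    """Return the directory listing, or an error message the model can read."""
    try:
        return {"files": sorted(os.listdir(path))}
    except OSError as exc:
        return {"error": str(exc)}


# Map tool names (as declared in the tools list) to Python callables.
TOOL_REGISTRY = {"list_files": list_files}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each requested tool and return the role='tool' messages to append."""
    results = []
    for call in tool_calls:
        name = call["function"]["name"]
        args = call["function"]["arguments"]
        func = registry.get(name)
        if func is None:
            output = {"error": f"unknown tool: {name}"}
        else:
            try:
                output = func(**args)
            except TypeError as exc:  # model supplied wrong or missing arguments
                output = {"error": str(exc)}
        results.append({"role": "tool", "content": json.dumps(output), "name": name})
    return results
```

In the loop above you would append the assistant message once, then call messages.extend(run_tool_calls(msg['tool_calls'])) before the second ollama.chat call. Returning errors as tool output, rather than raising, lets the model see what went wrong and try again.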

OpenClaw Integration

OpenClaw uses the /api/chat endpoint natively. Setting Gemma 4 as your primary model is a one-line change in openclaw.json:

openclaw.json
{
  "model": {
    "primary": "ollama/gemma4:12b"
  }
}

Validate with openclaw doctor, then restart the gateway: openclaw gateway restart.

The full integration walkthrough - including configuring the Ollama endpoint URL, handling streaming, and testing with Discord - is in the Ollama Advanced OpenClaw Integration section.

Which Gemma 4 Size for Agent Work

12B for most setups. Tool calling accuracy is close across the 12B and 26B variants (the 26B MoE scores 85.5% on τ2-bench against the 86.4% headline figure). The 12B is meaningfully faster on CPU, which matters for agent tasks where the model may make 5-10 tool calls in a single workflow. Use the 26B when the quality difference on the underlying reasoning matters more than response time.

Model choice for different tasks. Gemma 4 is the better choice when your OpenClaw agent needs to call tools reliably or process image input. For tasks that are primarily long-form text reasoning or complex multi-step problem solving without tool use, Qwen3:8b and DeepSeek-R1:7b are still competitive and run faster at their size.

Running both simultaneously. If you have 32GB RAM, you can keep Gemma 4 12B and Qwen3 8B loaded at the same time. Use OLLAMA_MAX_LOADED_MODELS=2 to prevent automatic unloading. OpenClaw can then be configured to route different request types to different models. Details in Ollama Advanced - Multiple Models.

Tool calling vs. function calling: These terms are used interchangeably. In Ollama's API, the field is tools and the response field is tool_calls. OpenAI's API uses the same naming convention, so most tool calling examples you find online for GPT-4 will translate directly to the Ollama API with minimal changes.
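
The routing idea from "Running both simultaneously" can be sketched on the client side as a simple rule: requests that carry images or tools go to Gemma 4, plain text goes to the other model. This is an illustration only and does not show OpenClaw's own routing configuration. The function name is hypothetical, and the qwen3:8b tag is assumed to match the Qwen3 8B model mentioned above.

```python
VISION_TOOL_MODEL = "gemma4:12b"  # handles image input and tool calls
TEXT_MODEL = "qwen3:8b"           # assumed tag for the text-only model


def choose_model(messages, tools=None):
    """Pick a model tag based on whether the request needs vision or tools."""
    has_images = any(message.get("images") for message in messages)
    if tools or has_images:
        return VISION_TOOL_MODEL
    return TEXT_MODEL


print(choose_model([{"role": "user", "content": "Summarize this paragraph."}]))
print(choose_model([{"role": "user", "content": "Read this.", "images": ["..."]}]))
```

With both models kept loaded, switching between them per request avoids the reload delay you would otherwise pay each time the model changes.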

Where to Go From Here

You have Gemma 4 running with vision and tool calling. Here's what's worth doing next depending on your use case.

Wire It Into OpenClaw

If you haven't set up OpenClaw's Ollama integration yet, that's the main next step. Ollama Advanced - OpenClaw Integration covers the full configuration: setting the Ollama endpoint, choosing your primary and fallback models, validating with openclaw doctor, and testing through Discord. Gemma 4's tool calling accuracy makes it a better agent backend than the general models covered in that guide - the configuration steps are identical.

Add a Web Interface for Vision

Open WebUI is the easiest way to use Gemma 4's vision interactively. Drag an image into the chat input and it handles the encoding automatically - no API calls to write. If you're using Gemma 4 for regular document review, screenshot debugging, or anything where you want a proper chat interface rather than API scripts, Open WebUI is worth setting up.

Compare Gemma 4 to Other Local Models

Gemma 4 is the strongest local option for vision and tool calling, but it's not the best at everything. Best Local AI Models 2026 has a full comparison against Qwen3, DeepSeek-R1, Llama 3.2, and Phi4 by task type. Worth reading before committing to Gemma 4 as your primary model if your use case is primarily reasoning or coding.

Tune Performance

The default Ollama settings are not optimized for any specific model. Ollama Advanced covers the parameters that actually move the needle on Gemma 4 performance: num_ctx (context window - the single biggest lever on CPU inference speed), OLLAMA_KEEP_ALIVE (keeps the model hot in RAM between requests), and num_thread (CPU thread allocation). These apply to Gemma 4 the same as any other model.

Coding Assistant with Gemma 4

Continue.dev supports Ollama models directly. The Local Coding Assistant guide uses qwen2.5-coder by default, but you can substitute gemma4:12b in the config.yaml for the chat model. Gemma 4 handles code well, and the vision capability means you can paste code screenshots directly into the Continue.dev chat instead of copying the text.