How do I connect OpenClaw to Ollama?

In your openclaw.json, set agents.defaults.model.primary to "ollama/modelname", where modelname exactly matches a name shown by "ollama list". For example: "ollama/qwen3:8b". Run "openclaw doctor" to validate the config, then run "openclaw gateway restart" to apply it.

What does temperature do in Ollama?

Temperature controls how creative or random the output is. A value of 0.1 to 0.3 gives focused, consistent, factual responses - good for coding and structured tasks. A value of 0.7 to 1.0 gives more varied and creative output. The default is usually 0.8.

How do I make Ollama faster on CPU?

Three things help most: use a smaller or more aggressively quantized model (q4_K_M is faster than q8_0), keep num_ctx (the context window) at 2048-4096 unless a task genuinely needs more, and make sure your model fits in RAM without spilling into swap.

Can Ollama run multiple models at the same time?

Yes. Ollama keeps recently used models loaded in RAM. Run "ollama ps" to see what is currently active. Models unload automatically after a timeout period. On 32GB RAM you can comfortably keep two 7B models loaded simultaneously.

How do I measure Ollama performance in tokens per second?

Send a test prompt via the REST API with "stream": false and examine the JSON response. Divide the eval_count field by eval_duration (which is in nanoseconds) and multiply by 1,000,000,000 to get tokens per second. The advanced tutorial includes the exact curl command for this.
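As a rough sketch of that arithmetic, the helper below turns the two fields into a rate. The function name tokens_per_second is made up for this example; only eval_count and eval_duration come from the API response.

```shell
# tokens_per_second EVAL_COUNT EVAL_DURATION_NS
# Prints generation speed in tokens per second, rounded to two decimals.
# (Hypothetical helper name; the inputs are fields of Ollama's JSON response.)
tokens_per_second() {
  awk -v count="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) { print "eval_duration must be positive" > "/dev/stderr"; exit 1 }
    # eval_duration is in nanoseconds, so scale by 1e9 to get seconds
    printf "%.2f\n", count / ns * 1e9
  }'
}
```

For instance, 82 tokens generated in 8,200,000,000 ns (8.2 seconds) works out to 10.00 tokens per second.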

Ollama Optimization Guide 2026 - Faster Local LLMs & Multi-Model

01 / Introduction

Beyond the Basics: Optimization and Integration

You've completed Ollama Basics. You have Llama 3.2 running, you've tested the REST API, and you've pulled a few models. Now: let's hook it into OpenClaw and then make it better.

This tutorial starts with the thing you're here for - getting OpenClaw to use your local models. Once that's working, we dig into parameter tuning, benchmarking, and optimization so you can make it faster and smarter over time.

Where You Left Off (Quick Review)

You have:

✓ Ollama installed and running as a service
✓ At least one model downloaded (Llama 3.2 or Qwen3)
✓ REST API working (tested with curl)
✓ Baseline performance metrics (10-15 tokens/sec on the Ryzen)

If you don't have these, go back to Ollama Basics first. Not sure which model to use? The Best Local AI Models 2026 comparison covers every model worth knowing with pull commands ready to go.

What You'll Learn (Advanced)

In this 90-minute tutorial:

OpenClaw Integration: Wire your bot to local Ollama - the whole reason you're here
Parameters: Temperature, top_p, context windows - the knobs that control your model's behavior
Use-Case Tuning: Different settings for different tasks (structured output, creative writing, code)
Benchmarking: Objective methods to measure quality and speed
GPU Detection: Understand when GPUs help (spoiler: not for you)
Multiple Models: Run models concurrently, queue requests, instant switching
Optimization: Memory tuning, CPU optimization, inference speed tuning
Monitoring: Health checks, logs, continuous operation
Troubleshooting: Common issues and how to fix them

Your Hardware

This tutorial works on any modern hardware capable of running Ollama. Here's what to keep in mind:

Recommended Specs

CPU: 4+ cores (8+ cores recommended for smooth performance)
RAM: 16GB minimum, 32GB recommended (more RAM = more models loaded simultaneously)
Storage: SSD recommended (fast model loading)
GPU: Optional - CPU-only inference works well for 7B models

Performance numbers in this tutorial are based on the Ryzen 7 6800H mini PC with 32GB RAM - the same hardware recommended throughout these tutorials. If you're on different hardware, your numbers will vary but the tuning principles are the same.

Why This Matters

For OpenClaw: A well-tuned local LLM can power a bot that feels just as smart as cloud-based solutions, but faster, cheaper, and completely private.

For Learning: Understanding parameters teaches you how language models actually work. You'll develop intuition for LLM behavior.

For Optimization: Squeezing another 20% performance from your hardware means faster responses and better user experience.

A Note on Complexity

This tutorial assumes you're comfortable with terminal commands and basic system administration. You don't need to be an expert, but "comfortable with Linux" is the baseline.

If something is unclear, go back to Ollama Basics or skip to the sections that interest you. You don't have to do everything in order.

Integration first, optimization later. Section 2 gets OpenClaw working with your local models right away. Everything after that - parameters, benchmarking, optimization - makes it better over time. You can always come back to the tuning sections later.

Let's Go Deep: You're about to understand Ollama at a level beyond "it works." Ready?

02 / Integration with OpenClaw

Connect OpenClaw to Your Local Ollama

This is the payoff: your Discord bot running on a local LLM you control completely. No API keys, no cloud dependency, no monthly bills. Your conversations stay on your hardware.

What You Need Before Starting

Prerequisites

Ollama installed and running (covered in Ollama Basics)
At least one model pulled - qwen3:8b recommended
OpenClaw installed (see EC2 setup or Mini PC setup)
Node 24 or Node 22 LTS (22.19+)

Verify Ollama is ready before touching OpenClaw config:

🖥️ Mini PC - Verify Ollama

curl http://localhost:11434/api/tags
# Should return JSON with your pulled models listed
# If you get "connection refused" - start Ollama first: ollama serve

The Fast Path: Onboard Command

OpenClaw has a built-in onboarding command that handles Ollama configuration automatically. This is the quickest way to get wired up:

🖥️ Mini PC - Onboard OpenClaw

# Interactive mode (recommended for first time)
openclaw onboard

# Non-interactive with your chosen model
openclaw onboard --non-interactive \
  --auth-choice ollama \
  --custom-model-id "qwen3:8b" \
  --accept-risk

The --auth-choice ollama flag tells OpenClaw to use local Ollama instead of a cloud provider. It auto-discovers models from http://127.0.0.1:11434 and sets all costs to $0.

After onboarding, open the Control UI at http://localhost:18789 to verify the agent is running and connected.

Manual Config (More Control)

If you want to specify exact models, a custom host, or tune parameters, edit the config file directly. OpenClaw stores its settings at:

📁 Config Location

~/.openclaw/openclaw.json

Here is a working Ollama config block. Add or merge this into your existing openclaw.json:

📄 ~/.openclaw/openclaw.json

{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://127.0.0.1:11434",
        "apiKey": "ollama-local",
        "api": "ollama",
        "models": [
          {
            "id": "qwen3:8b",
            "name": "Qwen3 8B",
            "reasoning": false,
            "input": ["text"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 8192,
            "maxTokens": 4096
          },
          {
            "id": "llama3.2",
            "name": "Llama 3.2 3B",
            "reasoning": false,
            "input": ["text"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 4096,
            "maxTokens": 2048
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen3:8b",
        "fallbacks": ["ollama/llama3.2"]
      }
    }
  }
}

Critical: Do not add /v1 to the baseUrl. Using http://127.0.0.1:11434/v1 activates the OpenAI-compatible endpoint, which breaks tool calling - your bot will output raw tool JSON as plain text instead of actually calling tools. Always use the native Ollama URL with "api": "ollama".

Auto-Discovery (Simplest Option)

If you just want OpenClaw to pick up whatever models you have installed without listing them manually, set one environment variable and restart OpenClaw:

🖥️ Mini PC - Environment Variable

export OLLAMA_API_KEY="ollama-local"

# Add to your shell profile so it persists
echo 'export OLLAMA_API_KEY="ollama-local"' >> ~/.bashrc
source ~/.bashrc

With this set, OpenClaw automatically queries /api/tags to find all your models, reads their context windows, and marks anything with "r1", "reasoning", or "think" in the name as a reasoning-capable model. All costs are set to $0.
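The name-based reasoning check described above can be pictured with a small sketch. This is not OpenClaw's actual code: the function name is hypothetical, and case-insensitive matching is an assumption made for the example.

```shell
# is_reasoning_model NAME
# Returns 0 when NAME contains "r1", "reasoning", or "think".
# Illustration of the heuristic only; matching is lowercased as an assumption.
is_reasoning_model() {
  local name
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *r1*|*reasoning*|*think*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Under this sketch a name like deepseek-r1:8b would be flagged, while llama3.2 would not. A substring check like this can misfire on unrelated names, which is one reason to list models manually when you need exact control.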

Switch the Active Model

Once configured, you can verify OpenClaw sees your Ollama models and swap between them:

🖥️ Mini PC - Model Management

# See what models OpenClaw has available
openclaw models list

# Set qwen3:8b as primary
openclaw models set ollama/qwen3:8b

# Switch to the fast 3B model for lighter tasks
openclaw models set ollama/llama3.2

The ollama/ prefix tells OpenClaw which provider the model lives on. You'll use this prefix any time you reference a model in config or CLI commands.
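If you script around these references, splitting one into its provider and model parts takes only shell parameter expansion. The helper names below are hypothetical, and the sketch assumes the reference contains at least one "/".

```shell
# provider_of REF -> text before the first "/"   (hypothetical helper)
provider_of() { printf '%s\n' "${1%%/*}"; }

# model_of REF    -> text after the first "/"    (hypothetical helper)
model_of() { printf '%s\n' "${1#*/}"; }
```

For "ollama/qwen3:8b" these print "ollama" and "qwen3:8b". The model part is what should match a name from ollama list.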

Test It End to End

Before you start routing Discord messages through it, confirm the full chain is working:

🖥️ Mini PC - End to End Test

# 1. Confirm Ollama is responding
curl http://localhost:11434/api/tags

# 2. Confirm OpenClaw sees the models
openclaw models list

# 3. Check provider connectivity
openclaw models status

# 4. Open the Control UI and send a test prompt
# http://localhost:18789

If openclaw models list shows your Ollama models and the Control UI returns a response to a test prompt, you are fully wired up.

Performance on Your Hardware

What to expect on a Ryzen 7 6800H mini PC (CPU-only inference):

💻 Typical Discord Bot Response Times

Short factual question (e.g. "What is HTTP 429?")
  → llama3.2 (3B):   3-5 seconds
  → qwen3:8b:      6-10 seconds

Medium response (paragraph explanation)
  → llama3.2 (3B):   8-12 seconds
  → qwen3:8b:      12-20 seconds

Multiple users queued simultaneously
  → Ollama processes one at a time
  → Each user waits for the one ahead of them
  → Fine for personal bots and small communities

For a personal Discord server this is completely usable. Users perceive the "typing..." indicator as the bot thinking, which feels natural. If you need faster responses, use llama3.2 as the primary and save qwen3:8b for when quality matters more than speed.

New to openclaw.json? The Configuration Reference explains every key in the config file - dmPolicy, workspace, gateway settings, and environment variable syntax for keeping API keys out of the file.

Wired up: Your OpenClaw bot now runs on a local LLM you own. No API fees, no data leaving your machine. Keep going - the rest of this tutorial covers tuning that model to perform exactly the way you want it.

03 / Understanding Parameters

The Knobs That Control Your LLM

LLMs generate text one token at a time. But how they choose which token to generate is controlled by parameters. These are your tuning knobs. Understand them, and you control the model's behavior.

Temperature (Randomness)

Range: 0.0 to 2.0 (or higher)
Default: Usually 0.8 (Ollama's built-in default; a model's Modelfile can override it)

Temperature controls how "creative" or "random" the model gets:

Temperature Examples

0.0 (Cold, Deterministic): Always picks the most likely next token. Same input = identical output every time. Good for: Precise answers, code generation, structured output.
0.5 (Cool, Focused): Mostly predictable but with some variation. Good for: Professional writing, Q&A, technical content.
0.7 (Balanced, close to the 0.8 default): A mix of creativity and consistency. Good for: General chat, creative but coherent responses.
1.2 (Warm, Creative): More randomness, more interesting responses. Good for: Brainstorming, creative writing, poetry.
1.5+ (Hot, Chaotic): Very random, often incoherent. Good for: Experimentation (usually a mistake).

Try it:

🖥️ Mini PC - Compare temperatures

# Sampling parameters go inside "options"; a top-level "temperature" key is ignored.

# Cold (deterministic)
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Complete: The future of AI is",
  "options": { "temperature": 0.0 },
  "stream": false
}' | jq -r '.response'

# Hot (creative)
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Complete: The future of AI is",
  "options": { "temperature": 1.5 },
  "stream": false
}' | jq -r '.response'

Run both and compare. At 0.0 you should get essentially the same answer on every run; at 1.5 the answers vary widely and can drift off topic.
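To see why temperature has this effect, here is a minimal sketch of the usual formulation: divide the model's raw scores (logits) by the temperature, then apply softmax. The function name and the three example logits are invented for illustration, and treating temperature 0 as "always pick the top token" is the common convention rather than something read from Ollama's source.

```shell
# softmax_with_temperature TEMP LOGIT...
# Prints the probability of each logit after temperature scaling.
# (Hypothetical helper; a sketch of the standard softmax formulation.)
softmax_with_temperature() {
  local temp="$1"; shift
  printf '%s\n' "$@" | awk -v t="$temp" '
    { logit[NR] = $1 + 0; if (NR == 1 || logit[NR] > max) max = logit[NR] }
    END {
      for (i = 1; i <= NR; i++) {
        # t = 0 is treated as greedy decoding: all mass on the top logit(s)
        w[i] = (t > 0) ? exp((logit[i] - max) / t) : (logit[i] == max)
        sum += w[i]
      }
      for (i = 1; i <= NR; i++) printf "%.3f%s", w[i] / sum, (i < NR ? " " : "\n")
    }'
}

# softmax_with_temperature 1 2 1 0   -> 0.665 0.245 0.090
# softmax_with_temperature 0 2 1 0   -> 1.000 0.000 0.000
```

Lower temperatures push more probability onto the top token; higher temperatures flatten the distribution so unlikely tokens are sampled more often.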

Top-P (Nucleus Sampling)

Range: 0.0 to 1.0
Default: 0.9

Top-P is a more sophisticated diversity control than temperature. It works by:

Model predicts probabilities for the next token
Sort tokens by likelihood (highest first)
Keep the smallest set of top tokens whose probabilities add up to at least p
Randomly sample from that restricted set

In practice:

0.1: Ultra-focused (only the few tokens that make up the top 10% of probability mass)
0.5: Focused (tokens covering the top 50% of probability mass)
0.9: Balanced (tokens covering 90% of probability mass, allows more creativity)

Works best with temperature: Top-P controls diversity among the remaining candidates after temperature does its work. Usually keep at 0.9 unless you're experimenting.
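The filtering step of nucleus sampling can be sketched in a few lines. The function name and the example probabilities are made up, and a real sampler would go on to renormalize the kept probabilities and draw one token at random, which this sketch leaves out.

```shell
# top_p_filter P PROB...
# Prints the probabilities that survive top-p filtering, highest first,
# one per line. (Hypothetical helper; filtering step only.)
top_p_filter() {
  local p="$1"; shift
  printf '%s\n' "$@" | LC_ALL=C sort -rn | awk -v p="$p" '
    # keep tokens until their cumulative probability reaches p
    !full { print; cum += $1; if (cum >= p) full = 1 }'
}
```

With probabilities 0.5, 0.3, 0.15, and 0.05, a top-p of 0.9 keeps the first three tokens (cumulative 0.95) and drops the last; a top-p of 0.5 keeps only the top token.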

Top-K

Range: 1 to infinity
Default: 40 (varies by Ollama version)

Top-K is the simpler cousin of top-P. It limits the model to only consider the K most likely tokens:

K=1: Only consider the most likely token (extremely constrained)
K=10: Consider top 10 most likely tokens
K=40: Consider top 40 (good balance)
K=100+: Very permissive (almost all tokens allowed)

Practical advice: Leave top-K at default and use temperature+top-P instead. Top-K is older and less intuitive than top-P.

Repeat Penalty

Range: 0.0 to 2.0
Default: 1.1

Repeat penalty prevents the model from repeating the same phrase over and over (common LLM failure mode):

1.0: No penalty (tokens can repeat freely)
1.1: Mild penalty (slight discouragement to repeat)
1.5: Strong penalty (heavily discourage repetition)

Use when: Model generates repetitive text ("the the the..." or "and and and...").
Default is good: Usually leave at 1.1.

Context Window (num_ctx)

Range: 128 up to the model's maximum (often ~32000 tokens or more)
Default: 2048

Context window is how many tokens the model can "see" when generating. It's like short-term memory:

256: Very short memory (forgets everything quickly)
2048: Good balance (remember ~1500 words of conversation)
4096: Long memory (remember ~3000 words)
8192: Very long (remember entire documents)

Tradeoff: Larger context = slower inference (more computation). On a typical multi-core CPU with a 7B model:

💻 Context window vs speed

2048 tokens:  ~15 tokens/sec (fast)
4096 tokens:  ~12 tokens/sec (slightly slower)
8192 tokens:  ~8 tokens/sec (noticeably slower)
16384+:       ~3-5 tokens/sec (CPU gets saturated)

Recommendation: Use 2048 for chat. Use 4096 for document analysis. Skip 8192+ unless you really need the memory.

Prediction Tokens (num_predict)

Range: -1 (unlimited) to any positive number
Default: -1 (unlimited)

Maximum tokens to generate before stopping. Prevents runaway responses:

-1: Unlimited (model stops when it feels done)
128: Stop after 128 tokens (~100 words)
512: Stop after 512 tokens (~400 words)
2048: Stop after 2048 tokens (full page)

Use when: You want predictable response lengths. Good for APIs where you need bounded latency.
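The context window has to hold both the prompt and the generated tokens, so a num_predict cap only helps if the two together fit inside num_ctx. The sketch below does a back-of-the-envelope check. The function name is hypothetical, and the ratio of about 0.75 words per token is an assumption taken from the word estimates in this section; real tokenizers vary.

```shell
# fits_in_context PROMPT_WORDS NUM_PREDICT NUM_CTX
# Returns 0 if the estimated prompt tokens plus num_predict fit in num_ctx.
# (Hypothetical helper; assumes roughly 0.75 words per token.)
fits_in_context() {
  local prompt_tokens
  # ceil(words * 4 / 3) using integer arithmetic
  prompt_tokens=$(( ($1 * 4 + 2) / 3 ))
  [ $(( prompt_tokens + $2 )) -le "$3" ]
}
```

A 1500-word prompt (about 2000 tokens) with num_predict 512 fits in a 4096-token window. A 3000-word document with num_predict 1024 does not, which is the case where 8192 starts to make sense.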

Threads (num_thread)

Range: 1 to your CPU core count
Default: Auto-detect (all cores)

How many CPU cores Ollama uses. For example, on an 8-core CPU:

1-2: Slow, leaves CPU idle
8: All cores active (default, good)
>8: Hyperthreading exposes extra logical threads (usually 16 on an 8-core CPU), so higher values are possible

Leave it on auto. Ollama detects your CPU and uses all available threads.

Quick Parameter Summary

💻 Parameter Quick Reference

temperature      → 0.0–2.0   (randomness, default 0.8)
top_p            → 0.0–1.0   (diversity, default 0.9)
top_k            → 1–∞       (token limit, default 40, skip it)
repeat_penalty   → 0–2       (penalize repetition, default 1.1)
num_ctx          → 128–32k   (memory, default 2048)
num_predict      → -1–∞      (max length, default -1)
num_thread       → 1–cores   (CPU usage, default all)

Parameters Demystified: You now understand the knobs. Next: how to tune them for different use cases.

04 / Tuning for Different Use Cases

Parameters That Fit Your Task

Now that you understand parameters, let's apply them. Different tasks need different settings. A Discord bot behaves differently than a creative writer's assistant. Let's build configurations for real scenarios.

This is the main event. If you're here to connect Ollama to OpenClaw, Use Case 1 below is the configuration you need. The other use cases are handy for experimentation, but Use Case 1 is the one that powers your bot.

Use Case 1: OpenClaw Integration (Structured Output)

Goal: Consistent, deterministic responses. OpenClaw needs predictable JSON or structured text.

Recommended Settings:

📝 OpenClaw Configuration

curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Your prompt here",
    "options": {
      "temperature": 0.1,
      "top_p": 0.9,
      "num_predict": 2048,
      "num_ctx": 4096
    },
    "stream": false
  }'

Why these values:

temperature: 0.1 → Focused, deterministic (reproducible output)
top_p: 0.9 → Still allows valid variation, not robotic
num_predict: 2048 → Reasonable max length for API
num_ctx: 4096 → Enough memory for conversation context

Expected behavior: Responses are consistent. Same prompt = mostly same answer (good for testing and debugging).

Use Case 2: Creative Writing

Goal: Variety and creativity. You want different results each time, but still coherent.

Recommended Settings:

📝 Creative Writing Configuration

curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Write a short story about...",
    "options": {
      "temperature": 1.2,
      "top_p": 0.95,
      "num_predict": 1000,
      "num_ctx": 2048,
      "repeat_penalty": 1.2
    },
    "stream": true
  }'

Why these values:

temperature: 1.2 → Creative, more varied outputs
top_p: 0.95 → Very permissive vocabulary
repeat_penalty: 1.2 → Prevent repetitive phrases
stream: true → See creativity unfold in real-time

Expected behavior: Each run produces unique, interesting variations on the theme.

Use Case 3: Fast Inference (Real-Time Chat)

Goal: Speed matters. Users are waiting for a response. Trade some context for speed.

Recommended Settings:

📝 Fast Chat Configuration

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "What is...",
    "options": {
      "temperature": 0.7,
      "top_p": 0.9,
      "num_predict": 512,
      "num_ctx": 2048
    },
    "stream": true
  }'

Why these values:

num_predict: 512 → Shorter responses (faster generation)
num_ctx: 2048 → Not too large (faster processing)
stream: true → First token appears faster (perceived speed)
model: llama3.2 → 3B model, faster than qwen3:8b for simple tasks

Expected behavior: First token appears in <500ms, total response in 5-10 seconds.

Use Case 4: Code Generation

Goal: Correct, working code. Logic must be sound.

Recommended Settings:

📝 Code Generation Configuration

curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Write a Python function that...",
    "options": {
      "temperature": 0.3,
      "top_p": 0.9,
      "num_predict": 1024,
      "num_ctx": 4096,
      "repeat_penalty": 1.1
    },
    "stream": false
  }'

Why these values:

temperature: 0.3 → Focused on correct syntax
num_ctx: 4096 → Understand full code context
repeat_penalty: 1.1 → Avoid redundant code

Expected behavior: Code is syntactically correct and logically sound most of the time.

Use Case 5: Long-Form Document Analysis

Goal: Understand and analyze large texts. Need big context window.

Recommended Settings:

📝 Document Analysis Configuration

curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Summarize this document:\n\n[LARGE TEXT HERE]",
    "options": {
      "temperature": 0.5,
      "top_p": 0.9,
      "num_predict": 1024,
      "num_ctx": 8192
    },
    "stream": false
  }'

Why these values:

num_ctx: 8192 → Can see entire document (trade-off: slower)
temperature: 0.5 → Faithful to source material
stream: false → You're doing analysis, not interactive chat

Expected behavior: Accurate summaries that capture main points. Slower (expect 20–30 seconds), but comprehensive.
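One way to keep the five configurations above straight is to store each as a named preset. The sketch below is illustrative only: preset_options and the preset names are hypothetical, the values are copied from the use cases above, and the output is shaped as the options object that Ollama's API reads sampling parameters from.

```shell
# preset_options NAME
# Prints the options JSON for one of the five use cases in this section.
# (Hypothetical helper and preset names; values copied from the text above.)
preset_options() {
  case "$1" in
    openclaw) printf '%s\n' '{"temperature": 0.1, "top_p": 0.9, "num_predict": 2048, "num_ctx": 4096}' ;;
    creative) printf '%s\n' '{"temperature": 1.2, "top_p": 0.95, "num_predict": 1000, "num_ctx": 2048, "repeat_penalty": 1.2}' ;;
    fast)     printf '%s\n' '{"temperature": 0.7, "top_p": 0.9, "num_predict": 512, "num_ctx": 2048}' ;;
    code)     printf '%s\n' '{"temperature": 0.3, "top_p": 0.9, "num_predict": 1024, "num_ctx": 4096, "repeat_penalty": 1.1}' ;;
    docs)     printf '%s\n' '{"temperature": 0.5, "top_p": 0.9, "num_predict": 1024, "num_ctx": 8192}' ;;
    *)        echo "unknown preset: $1" >&2; return 1 ;;
  esac
}
```

The preset covers only the sampling options. The model choice (llama3.2 for the fast case, qwen3:8b elsewhere) and the stream flag still belong at the top level of the request.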

Testing Your Configuration

Don't just trust recommendations. Test with your own prompts:

🖥️ Mini PC - A/B Test Script

#!/bin/bash
PROMPT="Write a haiku about programming"

# "'"$PROMPT"'" closes the single quotes, expands the variable safely inside
# double quotes, then reopens the single quotes for the rest of the JSON.
echo "=== Configuration A ==="
time curl -s http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "'"$PROMPT"'",
    "options": { "temperature": 0.5 },
    "stream": false
  }' | jq -r '.response'

echo ""
echo "=== Configuration B ==="
time curl -s http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "'"$PROMPT"'",
    "options": { "temperature": 1.0 },
    "stream": false
  }' | jq -r '.response'

Run the same prompt with different configurations. Notice speed and quality differences. Build intuition.

Save Your Configs

Once you find good settings, save them to files for reuse:

📝 ~/.ollama/configs.sh

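# Shared helper. Usage: ollama-run MODEL OPTIONS_JSON PROMPT
ollama-run() {
  jq -cn --arg m "$1" --argjson o "$2" --arg p "$3" \
    '{model: $m, prompt: $p, stream: false, options: $o}' |
    curl -s http://localhost:11434/api/generate -d @- | jq -r '.response'
}

# OpenClaw Config
ollama-openclaw() { ollama-run qwen3:8b '{"temperature": 0.1, "top_p": 0.9}' "$1"; }

# Creative Config
ollama-creative() { ollama-run qwen3:8b '{"temperature": 1.2, "top_p": 0.95}' "$1"; }

# Fast Config
ollama-fast() { ollama-run llama3.2 '{"temperature": 0.7, "num_predict": 512}' "$1"; }

Source the file and call a helper with your prompt, for example: ollama-fast "What is HTTP 429?". Saves typing and keeps configs consistent.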

Task-Specific Tuning Ready: You can now configure Ollama for your specific needs. Next: benchmarking to measure results objectively.

05 / Performance Benchmarking

Measure Speed and Quality Objectively

You can feel that a model is fast, but how fast exactly? How does Qwen3 compare to Llama 3.2 on your hardware? Benchmarking gives you objective data to guide optimization decisions.

Metrics That Matter

Four key metrics for LLM inference:

Benchmark Metrics

Time to First Token (TTFT): How long before the model starts responding. Goal: <500ms
Tokens Per Second (TPS): Generation speed. Goal: 10-15 for CPU, 50+ for GPU
Memory Used: RAM footprint during inference. Goal: <8GB for your setup
Response Quality: Does the answer make sense? Subjective but important

Benchmark Script

Create a reusable benchmarking script:

📝 benchmark.sh

#!/bin/bash

MODEL="${1:-qwen3:8b}"
PROMPT="Explain machine learning in 3 paragraphs"

echo "Benchmarking $MODEL..."
echo ""

# Run inference and capture timing
START=$(date +%s%N)
RESPONSE=$(curl -s http://localhost:11434/api/generate \
  -d "{
    \"model\": \"$MODEL\",
    \"prompt\": \"$PROMPT\",
    \"options\": { \"temperature\": 0.5 },
    \"stream\": false
  }")
END=$(date +%s%N)

# Extract metrics
TOTAL_TIME=$(( (END - START) / 1000000 ))  # wall-clock milliseconds
TOKENS=$(echo "$RESPONSE" | jq -r '.eval_count')
# eval_duration is in nanoseconds, so scale by 1e9 to get tokens per second
TPS=$(echo "$RESPONSE" | jq -r '.eval_count / .eval_duration * 1e9')

echo "Model: $MODEL"
echo "Total Time: ${TOTAL_TIME}ms"
echo "Tokens: $TOKENS"
printf 'Tokens/Sec: %.2f\n' "$TPS"
echo ""
echo "Response:"
echo "---"
echo "$RESPONSE" | jq -r '.response'
echo "---"

Run it:

🖥️ Mini PC

bash benchmark.sh qwen3:8b
bash benchmark.sh llama3.2
bash benchmark.sh llama3.2:1b

Compare Models Side-by-Side

Test the same prompt across models to find your best balance of speed and quality:

🖥️ Mini PC - Comparison Output

Model          Time    Tokens  TPS     Quality
qwen3:8b       8.2s    82      10.0    Excellent
llama3.2       4.5s    85      18.9    Good
llama3.2:1b    2.1s    78      37.1    Basic

The 1B model is fastest, Qwen3 has the best quality. Your choice depends on your priority.

Monitor Resources During Benchmark

In another terminal, watch resource usage:

🖥️ Mini PC - Terminal 2

watch -n 0.5 'free -h && echo && top -n 1 -b | head -n 3'

Look for:

Peak memory usage (should be <8GB)
CPU utilization (should be 80%+ during inference)
No thermal throttling (CPU stays stable frequency)

Benchmarking Baseline Set: You can now objectively measure model performance. Use these numbers to guide optimization.

06 / GPU Detection and Acceleration

Understanding Hardware Acceleration (You Probably Don't Need It)

GPUs are much faster than CPUs for LLM inference. Many systems have an integrated GPU, but integrated graphics are typically too small to meaningfully accelerate LLMs. Let's understand when GPU acceleration actually helps.

Integrated vs Discrete GPUs

Most CPUs include an integrated GPU, but these are far too weak for LLM acceleration. For reference:

Typical integrated GPU: ~1–2 TFLOPs
RTX 4070: ~29 TFLOPs (roughly 15–30x more powerful)
RTX 4090: ~82 TFLOPs (40–80x more powerful)

Integrated GPUs are one to two orders of magnitude weaker than discrete gaming GPUs. For LLMs on integrated graphics, the CPU is actually competitive.

Check if Ollama Detects Your GPU

Ollama logs what hardware it's using. Check:

🖥️ Mini PC

# Ollama running as a systemd service logs to the journal
journalctl -u ollama --no-pager | grep -i gpu | tail -20

If you see GPU references, Ollama detected it. If not, it's using CPU (default, which is fine for you).

When GPU Acceleration Helps

GPU acceleration is worth it if you have:

Nvidia RTX GPU: 2080 Ti or newer (30x-60x speedup)
AMD Radeon: RX 6700 or newer (10x-20x speedup)
Apple Silicon: Built-in GPU (3x-5x speedup)

An integrated GPU? Not worth the complexity. CPU inference is simpler and nearly as fast.

Should You Upgrade Hardware?

For local LLMs, consider GPU if:

You want to run 13B+ models (currently too slow on CPU)
You need sub-second response times (production use)
You want to run multiple concurrent inferences

For now, stick with CPU. A modern multi-core CPU is plenty fast for a single Ollama instance running 7B models.

GPU Decision Made: If you only have integrated graphics, CPU inference is the right choice. No GPU upgrade needed for 7B models.

07 / Running Multiple Models Concurrently

Leverage Your RAM - Load Once, Use Instantly

With 16GB+ RAM and OLLAMA_KEEP_ALIVE set, you can keep multiple models loaded simultaneously and switch between them with zero cold-start delay. Different models for different tasks - all ready to go.

Check What's Currently Loaded

Before thinking about concurrent models, know what's already in memory:

🖥️ Mini PC

ollama ps

🖥️ Mini PC - Output

NAME              ID              SIZE      PROCESSOR    UNTIL
qwen3:8b        845dbda0ea48    5.5 GB    100% CPU     28 minutes from now
llama3.2          a80c4f17acd5    2.5 GB    100% CPU     25 minutes from now

PROCESSOR shows whether your GPU is involved. On a CPU-only machine you'll see 100% CPU. On a machine with a compatible GPU (NVIDIA with CUDA, AMD with ROCm, or Apple Silicon), you'll see a split like 45%/55% CPU/GPU - higher GPU % means faster inference. The UNTIL column shows the keep-alive expiry for each loaded model.

Concurrent Model Memory Layout

On a 32GB system, loading your full stack looks like this:

💻 Memory Layout - 32GB System

OS + System:           ~2GB (always used)
Ollama Runtime:        ~1GB
qwen3:8b (q4_K_M):  ~5.5GB  - all-rounder, 128K context
llama3.2 (3B):         ~2.5GB  - speed tier, quick queries
phi4 (14B):            ~9.0GB  - heavy reasoning, complex problems
Reserve Buffer:        ~12GB   (available for gemma3 or larger models)
────────────────────────────────
Total Used:            ~20GB
Total Available:       ~12GB

With OLLAMA_KEEP_ALIVE set to -1 or a long duration, all three models stay resident. Switching between them is instant - the model is already warm in memory.
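The arithmetic behind that layout is worth redoing for your own machine before pinning models. A minimal sketch, with a hypothetical function name and the sizes taken from the table above:

```shell
# ram_headroom TOTAL_GB USED_GB...
# Prints how many GB remain after subtracting every listed consumer.
# (Hypothetical helper; sizes are whatever you read from ollama ps and free -h.)
ram_headroom() {
  local total="$1"; shift
  printf '%s\n' "$@" | awk -v total="$total" '
    { used += $1 }
    END { printf "%.1f\n", total - used }'
}

# OS 2 + runtime 1 + qwen3:8b 5.5 + llama3.2 2.5 + phi4 9, on a 32GB system:
# ram_headroom 32 2 1 5.5 2.5 9   -> 12.0
```

The same stack on a 16GB machine comes out negative, which means swapping or evicted models rather than instant switching.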

Practical Multi-Model Setup (2026)

A sensible three-tier stack for most use cases:

Recommended Model Stack

qwen3:8b - Default tier. Complex reasoning, code, long documents (128K context). Best all-rounder.
llama3.2 - Speed tier. Quick lookups, simple tasks, when you want an instant answer.
phi4 - Heavy reasoning (needs ~10GB RAM). Pull when qwen3:8b isn't cutting it on complex problems.
gemma3:12b - Alternative heavy model. Strong at instruction-following and multilingual tasks.

Total disk for the first two: ~8GB. Add phi4 and you're at ~17GB. On 32GB RAM, that's comfortable headroom.

API Request Routing by Model

With multiple models loaded, route requests to the right model for the task:

💻 Route by task type

# Complex task → Qwen3 for reasoning + 128K context
curl http://localhost:11434/api/generate \
  -d '{"model":"qwen3:8b","prompt":"Analyze this long document..."}'

# Quick factual lookup → Llama 3.2 for speed
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"What does HTTP 429 mean?"}'

# Code review → Qwen3 for reliable instruction following
curl http://localhost:11434/api/generate \
  -d '{"model":"qwen3:8b","prompt":"Review this function for bugs..."}'

If all three are loaded in memory (keep-alive), these requests complete without any model loading delay. Ollama queues concurrent requests and processes them in order, switching models transparently.

Pinning Models with OLLAMA_KEEP_ALIVE

The key to no-delay model switching is keeping models in memory. Set keep-alive system-wide via systemd:

🖥️ Mini PC - Pin models via systemd

sudo systemctl edit ollama

📝 Add to the override file

[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"

🖥️ Mini PC - Reload

sudo systemctl daemon-reload && sudo systemctl restart ollama

Use -1 only if you have headroom. With -1, models never unload until Ollama restarts. On 16GB RAM with one 7B model loaded, that's fine. Loading a second 7B while the first is pinned will consume ~12GB - check ollama ps and your free RAM before pinning multiple large models permanently.
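Keep-alive can also be set per request through the keep_alive field of the API, which lets you pin one model without changing the service-wide default. The sketch below only builds the request body. The function name is hypothetical, and it assumes the model name needs no JSON escaping.

```shell
# keep_alive_payload MODEL KEEP_ALIVE
# Prints a JSON body that sets keep_alive for MODEL.
# (Hypothetical helper; assumes MODEL contains no quotes or backslashes.)
keep_alive_payload() {
  local ka
  case "$2" in
    ''|*[!0-9-]*) ka="\"$2\"" ;;  # durations such as 30m are JSON strings
    *)            ka="$2" ;;       # plain numbers such as -1 or 0 stay numeric
  esac
  printf '{"model": "%s", "keep_alive": %s}\n' "$1" "$ka"
}
```

Sending that body to /api/generate without a prompt, for example with curl http://localhost:11434/api/generate -d "$(keep_alive_payload qwen3:8b 30m)", should load the model and apply the new expiry. Confirm the result in the UNTIL column of ollama ps.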

Multiple Models Ready: With a tiered stack and keep-alive tuned to your RAM, you have a local AI infrastructure that routes tasks intelligently. ollama ps is your dashboard - check it whenever you're curious about what's running.

08 / Optimization Techniques

Squeeze Maximum Performance

Your CPU is capable, but there are ways to go further. Environment variables, context tuning, CPU pinning - small tweaks compound into noticeable improvements. This section covers the full toolkit.

Key Environment Variables

Ollama exposes a set of environment variables that control how it loads and runs models. These are the most useful ones for a Mini PC setup:

OLLAMA_KEEP_ALIVE

How long to keep a model loaded in RAM after its last request. Default: 5m. Set to 30m or -1 (forever) to avoid cold-start delays. On 16GB+ systems, longer keep-alive means faster responses - the model is already warm.
Environment="OLLAMA_KEEP_ALIVE=30m"

OLLAMA_MAX_LOADED_MODELS

Maximum number of models to keep loaded simultaneously, provided they fit in memory. Default: 3 × the number of GPUs, or 3 for CPU inference. Set it explicitly to 3 or 4 on 32GB systems so your full model stack stays warm without unloading.
Environment="OLLAMA_MAX_LOADED_MODELS=3"

OLLAMA_NUM_PARALLEL

How many requests to process in parallel per model. Default: 1 on CPU. Increasing this can improve throughput if you have multiple concurrent users or applications hitting Ollama at the same time, at the cost of higher RAM usage per request. Start at 1 on CPU - most single-user setups don't need more.
Environment="OLLAMA_NUM_PARALLEL=2"

OLLAMA_FLASH_ATTENTION

Enable Flash Attention, an optimized attention mechanism that reduces memory usage during long-context inference. Set to 1 to enable. Most beneficial when running models with large context windows (like qwen3:8b at 128K tokens). Can noticeably reduce RAM pressure on long conversations.
Environment="OLLAMA_FLASH_ATTENTION=1"

🖥️ Mini PC - Apply via systemd (persists across reboots)

sudo systemctl edit ollama

📝 Recommended override block for 16–32GB systems

[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_FLASH_ATTENTION=1"

🖥️ Mini PC - Reload and restart

sudo systemctl daemon-reload && sudo systemctl restart ollama

CPU Affinity

Pin Ollama to specific cores to reduce OS scheduling overhead and give the model consistent CPU access:

🖥️ Mini PC - via systemd override

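# Add to your [Service] block in systemctl edit ollama.
# Pins to cores 0–7 (adjust to your core count). systemd does not allow a
# trailing comment on a setting line, so keep comments on their own lines.
CPUAffinity=0-7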

On a 12-core machine, you might pin Ollama to 8 cores (0–7) and leave 4 cores (8–11) for the OS and other services. Experiment - the benefit varies by workload.
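
The arithmetic behind that split can be sketched as a small helper. affinity_range is a hypothetical name used only for illustration; note that nproc reports logical CPUs, so on a machine with hyper-threading the count includes sibling threads.

```shell
# affinity_range TOTAL RESERVED: print a CPUAffinity range that leaves
# RESERVED cores free for the OS. Hypothetical helper for illustration.
affinity_range() {
  last=$(( $1 - $2 - 1 ))
  if [ "$last" -le 0 ]; then
    echo "0"            # never pin to fewer than one core
  else
    echo "0-$last"
  fi
}

# Usage: 12 cores with 4 reserved gives 0-7, matching the example above.
#   affinity_range "$(nproc)" 4
# After restarting the service, check that the pinning actually applied:
#   taskset -cp "$(systemctl show ollama --property=MainPID --value)"
```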

Context Window Tuning

Context window size directly affects inference speed and RAM usage. Smaller context = faster tokens:

💻 Speed vs. Context Tradeoff (7B model)

Context 1024:   ~20 tokens/sec (very fast, short conversations)
Context 2048:   ~15 tokens/sec (balanced - good default)
Context 4096:   ~10 tokens/sec (slower, good for most tasks)
Context 8192:   ~7 tokens/sec  (long docs, complex reasoning)
Context 32768+: ~3–5 tokens/sec (Qwen3 long context, only when needed)

You can override context size per request via the API using the options.num_ctx field. For most chat tasks, 2048–4096 is the sweet spot. Only go higher when you're actually sending long inputs.

💻 Set context size per API request

curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Summarize this document...",
    "options": {
      "num_ctx": 8192
    }
  }'
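
Rather than hard-coding num_ctx, a script can pick the smallest tier from the table above that fits the input. The sketch below is only a rough guide: pick_num_ctx is a hypothetical helper, and the figure of about 4 characters per token is a rule-of-thumb assumption for English text, not something Ollama reports.

```shell
# pick_num_ctx PROMPT_CHARS [REPLY_TOKENS]: print the smallest context tier
# that fits the prompt plus the expected reply.
# Assumption: ~4 characters per token, a rough heuristic for English text.
pick_num_ctx() {
  need=$(( $1 / 4 + ${2:-512} ))
  for tier in 2048 4096 8192 32768; do
    if [ "$need" -le "$tier" ]; then
      echo "$tier"
      return 0
    fi
  done
  echo 32768   # cap at the largest tier; longer inputs need trimming
}

# Usage: size the window from a file before building the request.
#   num_ctx=$(pick_num_ctx "$(wc -c < document.txt)" 1024)
```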

Memory Pressure Relief

If you're hitting RAM limits - models are being unexpectedly unloaded, or you see swap activity - try these:

🖥️ Mini PC - Diagnose memory pressure

# See current model memory usage
ollama ps

# Check system RAM
free -h

# Check if swap is being used (bad for inference speed)
swapon --show

If you're hitting swap, either reduce OLLAMA_MAX_LOADED_MODELS, switch to smaller model variants (e.g. llama3.2:3b instead of a 7B), or reduce OLLAMA_KEEP_ALIVE so models unload faster.
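
To track these numbers over time instead of eyeballing free -h, the same values can be read from /proc/meminfo. A minimal sketch, assuming a Linux system; both function names are hypothetical, and the optional path argument only exists so the functions can be tried against a saved copy of the file.

```shell
# Report swap in use and RAM still available, in MB, from a meminfo file.
# Defaults to /proc/meminfo (Linux only).
swap_used_mb() {
  awk '/^SwapTotal:/ { t = $2 } /^SwapFree:/ { f = $2 }
       END { printf "%d\n", (t - f) / 1024 }' "${1:-/proc/meminfo}"
}
mem_available_mb() {
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Usage: run before and after loading a model and compare.
#   echo "swap used: $(swap_used_mb) MB, available: $(mem_available_mb) MB"
```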

Optimized: Keep-alive, Flash Attention, and context tuning are the three highest-impact knobs for a CPU-based Mini PC. Set the first two once via systemd, tune num_ctx per request, then verify with ollama ps.

09 / Monitoring and Maintenance

Keep Your LLM Healthy

A running system needs monitoring. Health checks, logs, and resource tracking: simple practices prevent surprises.

Health Check Script

📝 health-check.sh

#!/bin/bash

echo "=== Ollama Health Check ==="
echo ""

# Service status
echo "✓ Service Status:"
systemctl status ollama --no-pager | head -3

echo ""
echo "✓ API Connectivity:"
curl -s http://localhost:11434/api/tags | jq '.models | length' | xargs echo "  Models available:"

echo ""
echo "✓ Resources:"
ps aux | grep "ollama serve" | grep -v grep | awk '{print "  CPU: " $3 "%, RAM: " $6 " KB"}'

echo ""
echo "✓ Logs (last error):"
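# The systemd service logs to the journal; ${var:-...} prints the fallback when nothing matched
last_error=$(journalctl -u ollama --no-pager -n 500 2>/dev/null | grep -i error | tail -1)
echo "  ${last_error:-No errors}"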

Log Location

On a systemd install, Ollama logs to the system journal rather than to a file under ~/.ollama. Check for issues:

🖥️ Mini PC

# View recent logs
journalctl -u ollama --no-pager -n 50

# Find errors
journalctl -u ollama --no-pager | grep -i error | tail -10

Regular Maintenance

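Weekly: Check disk usage (sudo du -sh /usr/share/ollama/.ollama/models for the systemd service, or du -sh ~/.ollama/models if you run ollama serve as your own user)
Monthly: Update Ollama by re-running the official install script (curl -fsSL https://ollama.com/install.sh | sh)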
Quarterly: Clean unused models (ollama rm model_name)
Yearly: Backup models to external drive

Monitored: Your Ollama system is healthy and maintainable.

10 / Performance Tuning Checklist

Before You Go Live

You've learned parameters, benchmarking, and optimization. This checklist ensures your Ollama setup is production-ready for OpenClaw.

Pre-Launch Checklist

Production Ready?

☐ Ollama running as systemd service (auto-start on boot)
☐ Model selected and downloaded (qwen3:8b recommended)
☐ API endpoint verified (curl http://localhost:11434/api/tags)
☐ Temperature set appropriately (0.1-0.3 for OpenClaw)
☐ Context window configured (4096 minimum)
☐ Max tokens set (2048 to prevent runaway)
☐ Baseline performance benchmarked
☐ Health check script running

Configuration Backup

Save your working configuration:

🖥️ Mini PC

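# Backup models (the systemd service stores them under /usr/share/ollama/.ollama;
# use ~/.ollama/models instead if you run ollama serve as your own user)
sudo cp -r /usr/share/ollama/.ollama/models ~/ollama-models-backup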

# Document your settings
cat > ~/ollama-config.txt << EOC
Model: qwen3:8b
Temperature: 0.1
Top-P: 0.9
Context: 4096
Num Predict: 2048
EOC
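
The text file records the settings but does not apply them. One way to make them reproducible is an Ollama Modelfile, which bakes the parameters into a named model variant. A minimal sketch: write_modelfile is a hypothetical helper, and the model name "openclaw-qwen3" and the file path are examples only.

```shell
# write_modelfile PATH: save the settings above as an Ollama Modelfile.
# Hypothetical helper; the values mirror ollama-config.txt.
write_modelfile() {
  cat > "$1" << 'EOF'
FROM qwen3:8b
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER num_predict 2048
EOF
}

# Usage: build a named variant (the name "openclaw-qwen3" is only an example).
#   write_modelfile ~/Modelfile.openclaw
#   ollama create openclaw-qwen3 -f ~/Modelfile.openclaw
```

Keeping the Modelfile alongside the backup means the tuned variant can be rebuilt after a reinstall with a single ollama create.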

OpenClaw Integration Checklist

☐ Ollama endpoint configured in OpenClaw (localhost:11434)
☐ Model name matches what you pulled
☐ Test prompt sent and received successfully
☐ Response quality acceptable
☐ Response time reasonable (5-20 seconds)
☐ No memory leaks after sustained use
☐ Reboot test: Ollama starts automatically, works after restart

Launch Ready: Your Ollama + OpenClaw setup is production-ready.

11 / Troubleshooting Guide

Common Issues and Fixes

Things go wrong. Here's how to diagnose and fix common Ollama problems.

Ollama Won't Start

Symptom: Service shows inactive or fails to start

🖥️ Mini PC - Diagnosis

systemctl status ollama
# Read error message

# Try manual start to see error
ollama serve

Common fixes:

Permission denied: sudo chown -R ollama:ollama /usr/share/ollama/.ollama (the service user's data directory on a systemd install)
Port conflict: Check sudo lsof -i :11434
Corrupted model: Delete and re-pull

Out of Memory Errors

Symptom: "OOM Killer" in dmesg, processes killed

🖥️ Mini PC - Check Memory

free -h
# If available < 2GB during inference, you're hitting limits

sudo dmesg | grep -i killed | tail -5
# Shows what got OOM killed

Fixes:

Reduce context window (num_ctx: 2048 instead of 8192)
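Unload unused models (ollama stop model_name; ollama rm would delete the model from disk instead)
Close other apps consuming memory
Add swap as a safety net against OOM kills (inference that spills into swap will be very slow)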
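
To tell whether Ollama itself was the OOM victim, the kernel log can be filtered by process name. This is a sketch under one assumption: that the kernel uses its usual "Killed process PID (name)" wording. oom_kills_for is a hypothetical helper.

```shell
# oom_kills_for NAME: count kernel OOM-kill lines on stdin whose victim
# process name starts with NAME. Hypothetical helper for illustration.
oom_kills_for() {
  # grep -c prints 0 but exits non-zero when nothing matches; keep exit status 0.
  grep -c "Killed process [0-9]* ($1" || true
}

# Usage (dmesg may need sudo on some distributions):
#   sudo dmesg | oom_kills_for ollama
```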

Very Slow Inference

Symptom: <5 tokens/sec (should be 10-15)

🖥️ Mini PC - Check System

top -n 1
# CPU usage <80%? Issue might be elsewhere
# Check if CPU is being shared with other apps

iostat -x 1
# High wait time? Disk bottleneck

Fixes:

Close heavy apps (browser, IDE, etc)
Reduce num_ctx to 2048
Use faster model (llama3.2:3b)
Check for thermal throttling (watch -n 1 'grep MHz /proc/cpuinfo')
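
To compare your machine against the 10-15 tokens/sec expectation, compute the rate from the eval_count and eval_duration fields of a non-streaming /api/generate response (eval_duration is in nanoseconds). A minimal sketch; tokens_per_sec is a hypothetical helper name.

```shell
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS: generation speed in tokens/sec.
tokens_per_sec() {
  awk -v count="$1" -v ns="$2" 'BEGIN {
    if (ns > 0) printf "%.1f\n", count / ns * 1000000000
    else print "0.0"
  }'
}

# Usage with a live server (requires jq, as in the health check script):
#   resp=$(curl -s http://localhost:11434/api/generate \
#     -d '{"model": "qwen3:8b", "prompt": "Say hello.", "stream": false}')
#   tokens_per_sec "$(echo "$resp" | jq .eval_count)" "$(echo "$resp" | jq .eval_duration)"
```

Run it before and after each fix so you can see which change actually moved the number.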

API Not Responding

Symptom: curl returns Connection refused

🖥️ Mini PC - Check Service

systemctl status ollama
# Make sure it's running

ss -ltn | grep 11434   # or: netstat -an | grep 11434 (if net-tools is installed)
# Should show LISTEN on port 11434

curl http://localhost:11434/api/tags
# Should return JSON, not error

Fixes:

Start service: sudo systemctl start ollama
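Check firewall (remote clients only): Ollama listens on 127.0.0.1 by default, so the firewall does not affect local requests. To reach it from another machine, set OLLAMA_HOST=0.0.0.0 in the systemd override, then sudo ufw allow 11434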
Restart: sudo systemctl restart ollama
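
After a start or restart the API can take a moment to come up, so a script that sends a request immediately may still see Connection refused. One way to handle that is to poll until the endpoint answers. A minimal sketch; wait_for_api is a hypothetical helper, and the default of 10 tries is an arbitrary choice.

```shell
# wait_for_api [URL] [TRIES]: poll the Ollama API until it answers.
# Returns 0 once the endpoint responds, 1 if it never does.
wait_for_api() {
  url=${1:-http://localhost:11434/api/tags}
  tries=${2:-10}
  i=1
  while :; do
    if curl -sf --max-time 2 -o /dev/null "$url"; then
      return 0
    fi
    if [ "$i" -ge "$tries" ]; then
      return 1
    fi
    i=$(( i + 1 ))
    sleep 1
  done
}

# Usage after a restart:
#   sudo systemctl restart ollama && wait_for_api && echo "API is up"
```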

Bad Quality Responses

Symptom: Responses are nonsensical or repetitive

Fixes:

Lower temperature (0.3 or less)
Increase repeat_penalty (1.5)
Reduce num_predict (limit length)
Try different model (llama3.2 instead of qwen3:8b)
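
These fixes can be tried per request through the options field before changing any defaults. The sketch below builds such a request body; quality_body is a hypothetical helper, the num_predict value of 512 is only an illustration, and the escaping assumes a single-line prompt.

```shell
# quality_body MODEL PROMPT: print an /api/generate request body with
# conservative sampling options. Assumes a single-line prompt.
quality_body() {
  # Escape backslashes and double quotes so the prompt stays valid JSON.
  prompt=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model": "%s", "prompt": "%s", "stream": false, "options": {"temperature": 0.3, "repeat_penalty": 1.5, "num_predict": 512}}\n' "$1" "$prompt"
}

# Usage:
#   curl -s http://localhost:11434/api/generate \
#     -d "$(quality_body qwen3:8b 'Explain swap in one sentence.')"
```

Change one option at a time so you can tell which one actually improved the output.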

Fixed: Most issues are solvable with these techniques.

12 / Next Steps

Where to Go From Here

You've completed Ollama Advanced. You understand parameters, tuning, integration, optimization, and troubleshooting. Your local LLM setup is sophisticated and production-ready.

You've Accomplished

✓ Understand all Ollama parameters and their effects
✓ Tune for specific use cases (OpenClaw, creative, code, etc)
✓ Benchmark models objectively
✓ Know when GPU acceleration matters (and doesn't for you)
✓ Run multiple models concurrently
✓ Integrate Ollama seamlessly with OpenClaw
✓ Optimize your hardware for maximum LLM performance
✓ Monitor and maintain a healthy Ollama system
✓ Troubleshoot common problems

Option 1: Deploy Your OpenClaw Bot

You have everything you need. Configure OpenClaw to use your local Ollama, then deploy:

Point OpenClaw to http://localhost:11434
Select your tuned model and parameters
Launch your Discord bot
Enjoy your private, offline-capable AI agent

Option 2: Explore Advanced Topics

If you want to go deeper:

Model Fine-Tuning: Customize models for specific tasks (advanced)
Quantization: Compress models further (4-bit, 2-bit)
Distributed Inference: Run Ollama across multiple machines
Web UI: Build a web interface for Ollama
Monitoring: Prometheus/Grafana metrics tracking

Option 3: Compare with Other LLM Tools

Ollama isn't the only option. If you want to explore:

LM Studio: Desktop GUI app for local models (easier than Ollama CLI)
vLLM: High-performance inference server (more complex)
Text Generation WebUI: Feature-rich but steeper learning curve
GPT4All: Lightweight, beginner-friendly

But honestly? Ollama is the best balance of simplicity and power for your use case.

Keep Learning

Understand Transformers Better:

Read: "Attention Is All You Need" (the original paper)
Watch: YouTube tutorials on how LLMs work
Experiment: Try different models, prompt engineering

Follow the Community:

Ollama GitHub (issues, discussions)
Hugging Face model hub (find new models)
Reddit r/LocalLLM (community sharing)

You're Part of the AI Revolution

A few years ago, running local LLMs meant compiling C++, wrestling with dependencies, and getting 1-2 tokens/sec. Now? You install Ollama, pull a model, and get 10-15 tokens/sec on CPU. You've got a private, offline-capable AI brain that costs nothing to run.

Your data is yours. Your LLM is yours. No cloud vendor, no API limits, no surveillance.

That's power. Use it wisely.

Advanced Complete: You're now an Ollama expert. Go build amazing things. 🚀