Ollama Advanced

Optimize Ollama for your hardware - faster inference, multi-model loading, benchmarking, and OpenClaw integration. Squeeze maximum performance from CPU and GPU.

90 min Intermediate Free Updated February 2026

Beyond the Basics: Optimization and Integration

You've completed Ollama Basics. You have Llama 3.2 running, you've tested the REST API, and you've pulled a few models. Now: let's hook it into OpenClaw and then make it better.

This tutorial starts with the thing you're here for - getting OpenClaw to use your local models. Once that's working, we dig into parameter tuning, benchmarking, and optimization so you can make it faster and smarter over time.

Where You Left Off (Quick Review)

You have:

  • ✓ Ollama installed and running as a service
  • ✓ At least one model downloaded (Llama 3.2 or Qwen3)
  • ✓ REST API working (tested with curl)
  • ✓ Baseline performance metrics (10-15 tokens/sec on the Ryzen)

If you don't have these, go back to Ollama Basics first. Not sure which model to use? The Best Local AI Models 2026 comparison covers every model worth knowing with pull commands ready to go.

What You'll Learn (Advanced)

In this 90-minute tutorial:

  • OpenClaw Integration: Wire your bot to local Ollama - the whole reason you're here
  • Parameters: Temperature, top_p, context windows - the knobs that control your model's behavior
  • Use-Case Tuning: Different settings for different tasks (structured output, creative writing, code)
  • Benchmarking: Objective methods to measure quality and speed
  • GPU Detection: Understand when GPUs help (spoiler: not for you)
  • Multiple Models: Run models concurrently, queue requests, instant switching
  • Optimization: Memory tuning, CPU optimization, inference speed tuning
  • Monitoring: Health checks, logs, continuous operation
  • Troubleshooting: Common issues and how to fix them

Your Hardware

This tutorial works on any modern hardware capable of running Ollama. Here's what to keep in mind:

Recommended Specs
  • CPU: 4+ cores (8+ cores recommended for smooth performance)
  • RAM: 16GB minimum, 32GB recommended (more RAM = more models loaded simultaneously)
  • Storage: SSD recommended (fast model loading)
  • GPU: Optional - CPU-only inference works well for 7B models

Performance numbers in this tutorial are based on the Ryzen 7 6800H mini PC with 32GB RAM - the same hardware recommended throughout these tutorials. If you're on different hardware, your numbers will vary but the tuning principles are the same.

Why This Matters

For OpenClaw: A well-tuned local LLM can power a bot that feels just as smart as cloud-based solutions, but faster, cheaper, and completely private.

For Learning: Understanding parameters teaches you how language models actually work. You'll develop intuition for LLM behavior.

For Optimization: Squeezing another 20% performance from your hardware means faster responses and better user experience.

A Note on Complexity

This tutorial assumes you're comfortable with terminal commands and basic system administration. You don't need to be an expert, but "comfortable with Linux" is the baseline.

If something is unclear, go back to Ollama Basics or skip to the sections that interest you. You don't have to do everything in order.

Integration first, optimization later. Section 2 gets OpenClaw working with your local models right away. Everything after that - parameters, benchmarking, optimization - makes it better over time. You can always come back to the tuning sections later.
Let's Go Deep: You're about to understand Ollama at a level beyond "it works." Ready?

Connect OpenClaw to Your Local Ollama

This is the payoff: your Discord bot running on a local LLM you control completely. No API keys, no cloud dependency, no monthly bills. Your conversations stay on your hardware.

What You Need Before Starting

Prerequisites
  • Ollama installed and running (covered in Ollama Basics)
  • At least one model pulled - qwen3:8b recommended
  • OpenClaw installed (see EC2 setup or Mini PC setup)
  • Node 24 or Node 22 LTS (22.19+)

Verify Ollama is ready before touching OpenClaw config:

🖥️ Mini PC - Verify Ollama
curl http://localhost:11434/api/tags
# Should return JSON with your pulled models listed
# If you get "connection refused" - start Ollama first: sudo systemctl start ollama (or run ollama serve in a separate terminal)

The Fast Path: Onboard Command

OpenClaw has a built-in onboarding command that handles Ollama configuration automatically. This is the quickest way to get wired up:

🖥️ Mini PC - Onboard OpenClaw
# Interactive mode (recommended for first time)
openclaw onboard

# Non-interactive with your chosen model
openclaw onboard --non-interactive \
  --auth-choice ollama \
  --custom-model-id "qwen3:8b" \
  --accept-risk

The --auth-choice ollama flag tells OpenClaw to use local Ollama instead of a cloud provider. It auto-discovers models from http://127.0.0.1:11434 and sets all costs to $0.

After onboarding, open the Control UI at http://localhost:18789 to verify the agent is running and connected.

Manual Config (More Control)

If you want to specify exact models, a custom host, or tune parameters, edit the config file directly. OpenClaw stores its settings at:

📁 Config Location
~/.openclaw/openclaw.json

Here is a working Ollama config block. Add or merge this into your existing openclaw.json:

📄 ~/.openclaw/openclaw.json
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://127.0.0.1:11434",
        "apiKey": "ollama-local",
        "api": "ollama",
        "models": [
          {
            "id": "qwen3:8b",
            "name": "Qwen3 8B",
            "reasoning": false,
            "input": ["text"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 8192,
            "maxTokens": 4096
          },
          {
            "id": "llama3.2",
            "name": "Llama 3.2 3B",
            "reasoning": false,
            "input": ["text"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 4096,
            "maxTokens": 2048
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen3:8b",
        "fallbacks": ["ollama/llama3.2"]
      }
    }
  }
}
Critical: Do not add /v1 to the baseUrl. Using http://127.0.0.1:11434/v1 activates the OpenAI-compatible endpoint, which breaks tool calling - your bot will output raw tool JSON as plain text instead of actually calling tools. Always use the native Ollama URL with "api": "ollama".
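A small guard can catch the /v1 mistake before it reaches your config. This is only a sketch: check_base_url is a hypothetical helper name, not an OpenClaw command, and it checks nothing beyond the trailing path segment.

```shell
# Hypothetical helper (not part of OpenClaw): reject a baseUrl that points
# at Ollama's OpenAI-compatible /v1 endpoint.
check_base_url() {
  url="${1%/}"                     # drop one trailing slash, if present
  case "$url" in
    */v1) echo "bad: remove /v1 from $1"; return 1 ;;
    *)    echo "ok: $url" ;;
  esac
}

# Example:
#   check_base_url "http://127.0.0.1:11434"      # prints: ok: http://127.0.0.1:11434
#   check_base_url "http://127.0.0.1:11434/v1"   # prints a warning and returns 1
```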

Auto-Discovery (Simplest Option)

If you just want OpenClaw to pick up whatever models you have installed without listing them manually, set one environment variable and restart OpenClaw:

🖥️ Mini PC - Environment Variable
export OLLAMA_API_KEY="ollama-local"

# Add to your shell profile so it persists
echo 'export OLLAMA_API_KEY="ollama-local"' >> ~/.bashrc
source ~/.bashrc

With this set, OpenClaw automatically queries /api/tags to find all your models, reads their context windows, and marks anything with "r1", "reasoning", or "think" in the name as a reasoning-capable model. All costs are set to $0.
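To predict which of your models would be tagged as reasoning-capable, you can mimic the name check described above. This is an illustration only, not OpenClaw's actual code; the function name is made up, and treating the match as case-insensitive is an assumption.

```shell
# Illustration of the name-based check described above (not OpenClaw's code).
# Assumption: the match ignores case.
looks_like_reasoning_model() {
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *r1*|*reasoning*|*think*) return 0 ;;
    *)                        return 1 ;;
  esac
}

# Usage: flag each installed model (ollama list prints a header row first).
#   ollama list | awk 'NR > 1 { print $1 }' | while read -r m; do
#     if looks_like_reasoning_model "$m"; then echo "$m: reasoning"; else echo "$m: standard"; fi
#   done
```

Because the check is purely name-based, a model whose tag happens to contain "r1" would also match, so treat the result as a hint.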

Switch the Active Model

Once configured, you can verify OpenClaw sees your Ollama models and swap between them:

🖥️ Mini PC - Model Management
# See what models OpenClaw has available
openclaw models list

# Set qwen3:8b as primary
openclaw models set ollama/qwen3:8b

# Switch to the fast 3B model for lighter tasks
openclaw models set ollama/llama3.2

The ollama/ prefix tells OpenClaw which provider the model lives on. You'll use this prefix any time you reference a model in config or CLI commands.
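If you script around these references, splitting at the first slash separates provider from model ID. A minimal sketch, assuming the first "/" is always the separator; both function names are hypothetical.

```shell
# Split a "provider/model" reference at the first slash.
# The model ID keeps its ":" tag (and any later slashes) intact.
model_provider() { printf '%s\n' "${1%%/*}"; }
model_id()       { printf '%s\n' "${1#*/}"; }

# Example:
#   model_provider "ollama/qwen3:8b"   # prints: ollama
#   model_id "ollama/qwen3:8b"         # prints: qwen3:8b
```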

Test It End to End

Before you start routing Discord messages through it, confirm the full chain is working:

🖥️ Mini PC - End to End Test
# 1. Confirm Ollama is responding
curl http://localhost:11434/api/tags

# 2. Confirm OpenClaw sees the models
openclaw models list

# 3. Check provider connectivity
openclaw models status

# 4. Open the Control UI and send a test prompt
# http://localhost:18789

If openclaw models list shows your Ollama models and the Control UI returns a response to a test prompt, you are fully wired up.

Performance on Your Hardware

What to expect on a Ryzen 7 6800H mini PC (CPU-only inference):

💻 Typical Discord Bot Response Times
Short factual question (e.g. "What is HTTP 429?")
  → llama3.2 (3B):   3-5 seconds
  → qwen3:8b:      6-10 seconds

Medium response (paragraph explanation)
  → llama3.2 (3B):   8-12 seconds
  → qwen3:8b:      12-20 seconds

Multiple users queued simultaneously
  → Ollama processes one at a time
  → Each user waits for the one ahead of them
  → Fine for personal bots and small communities

For a personal Discord server this is completely usable. Users perceive the "typing..." indicator as the bot thinking, which feels natural. If you need faster responses, use llama3.2 as the primary and save qwen3:8b for when quality matters more than speed.

New to openclaw.json? The Configuration Reference explains every key in the config file - dmPolicy, workspace, gateway settings, and environment variable syntax for keeping API keys out of the file.
Wired up: Your OpenClaw bot now runs on a local LLM you own. No API fees, no data leaving your machine. Keep going - the rest of this tutorial covers tuning that model to perform exactly the way you want it.

The Knobs That Control Your LLM

LLMs generate text one token at a time. But how they choose which token to generate is controlled by parameters. These are your tuning knobs. Understand them, and you control the model's behavior.

Temperature (Randomness)

Range: 0.0 to 2.0 (or higher)
Default: 0.8 in Ollama, unless the model's Modelfile sets its own value

Temperature controls how "creative" or "random" the model gets:

Temperature Examples
  • 0.0 (Cold, Deterministic): Always picks the most likely next token. Same input = effectively identical output every time. Good for: Precise answers, code generation, structured output.
  • 0.5 (Cool, Focused): Mostly predictable but with some variation. Good for: Professional writing, Q&A, technical content.
  • 0.7 (Balanced): A mix of creativity and consistency, close to the default. Good for: General chat, creative but coherent responses.
  • 1.2 (Warm, Creative): More randomness, more interesting responses. Good for: Brainstorming, creative writing, poetry.

Try it:

🖥️ Mini PC - Compare temperatures
# Cold (deterministic)
# Sampling parameters must be nested under "options"; top-level values are ignored
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Complete: The future of AI is",
  "stream": false,
  "options": { "temperature": 0.0 }
}' | jq -r '.response'

# Hot (creative)
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Complete: The future of AI is",
  "stream": false,
  "options": { "temperature": 1.5 }
}' | jq -r '.response'

Run each one a few times and compare. At 0.0 the answer should be nearly identical from run to run; at 1.5 it changes every time and may drift off topic.

Top-P (Nucleus Sampling)

Range: 0.0 to 1.0
Default: 0.9

Top-P is a more sophisticated diversity control than temperature. It works by:

  1. Model predicts probabilities for the next token
  2. Sort tokens by likelihood (highest first)
  3. Keep the smallest set of top tokens whose probabilities add up to at least p
  4. Randomly sample from that restricted set

In practice:

  • 0.1: Ultra-focused (only the most likely tokens covering 10% of the probability mass, often a single token)
  • 0.5: Focused (the most likely tokens covering 50% of the probability mass)
  • 0.9: Balanced (the most likely tokens covering 90% of the probability mass, allows more creativity)

Works best with temperature: Top-P controls diversity among the remaining candidates after temperature does its work. Usually keep at 0.9 unless you're experimenting.
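The cutoff step is easy to try by hand. The sketch below counts how many tokens survive a given top-p value for a made-up probability list; nucleus_size is a hypothetical helper that illustrates the general technique, not Ollama's sampler.

```shell
# nucleus_size P PROB... : number of most-likely tokens kept by a top-p cutoff of P.
# Probabilities may be passed in any order; they are sorted highest first.
nucleus_size() {
  p=$1; shift
  printf '%s\n' "$@" | LC_ALL=C sort -rn | awk -v p="$p" '
    { cum += $1; kept++ }
    cum >= p { exit }          # stop once the running total reaches p
    END { print kept }'
}

# Example: five candidate tokens with probabilities 0.5, 0.3, 0.1, 0.05, 0.05
#   nucleus_size 0.1  0.5 0.3 0.1 0.05 0.05    # 1
#   nucleus_size 0.85 0.5 0.3 0.1 0.05 0.05    # 3
```

Note how a low top-p can shrink the candidate set to a single token when the model is already confident.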

Top-K

Range: 1 to infinity
Default: 40 (varies by Ollama version)

Top-K is the simpler cousin of top-P. It limits the model to only consider the K most likely tokens:

  • K=1: Only consider the most likely token (extremely constrained)
  • K=10: Consider top 10 most likely tokens
  • K=40: Consider top 40 (good balance)
  • K=100+: Very permissive (almost all tokens allowed)

Practical advice: Leave top-K at default and use temperature+top-P instead. Top-K is older and less intuitive than top-P.

Repeat Penalty

Range: 0.0 to 2.0
Default: 1.1

Repeat penalty prevents the model from repeating the same phrase over and over (common LLM failure mode):

  • 1.0: No penalty (tokens can repeat freely)
  • 1.1: Mild penalty (slight discouragement to repeat)
  • 1.5: Strong penalty (heavily discourage repetition)

Use when: Model generates repetitive text ("the the the..." or "and and and...").
Default is good: Usually leave at 1.1.

Context Window (num_ctx)

Range: 128 to ~32000 tokens
Default: 2048

Context window is how many tokens the model can "see" when generating. It's like short-term memory:

  • 256: Very short memory (forgets everything quickly)
  • 2048: Good balance (remember ~1500 words of conversation)
  • 4096: Long memory (remember ~3000 words)
  • 8192: Very long (remember entire documents)

Tradeoff: Larger context = slower inference (more computation). On a typical multi-core CPU with a 7B model:

💻 Context window vs speed
2048 tokens:  ~15 tokens/sec (fast)
4096 tokens:  ~12 tokens/sec (slightly slower)
8192 tokens:  ~8 tokens/sec (noticeably slower)
16384+:       ~3-5 tokens/sec (CPU gets saturated)

Recommendation: Use 2048 for chat. Use 4096 for document analysis. Skip 8192+ unless you really need the memory.

Prediction Tokens (num_predict)

Range: -1 (unlimited) to any positive number
Default: -1 (unlimited)

Maximum tokens to generate before stopping. Prevents runaway responses:

  • -1: Unlimited (model stops when it feels done)
  • 128: Stop after 128 tokens (~100 words)
  • 512: Stop after 512 tokens (~400 words)
  • 2048: Stop after 2048 tokens (full page)

Use when: You want predictable response lengths. Good for APIs where you need bounded latency.
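Bounded latency here is simple arithmetic: the cap on generated tokens divided by your measured generation speed. A rough sketch with a hypothetical helper; it ignores model load and prompt processing time, so real requests take somewhat longer.

```shell
# worst_case_seconds NUM_PREDICT TOKENS_PER_SEC
# Upper bound on generation time only (excludes model load and prompt processing).
worst_case_seconds() {
  awk -v n="$1" -v tps="$2" 'BEGIN { printf "%.1f\n", n / tps }'
}

# Example:
#   worst_case_seconds 512 16    # 32.0
#   worst_case_seconds 128 10    # 12.8
```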

Threads (num_thread)

Range: 1 to your CPU thread count
Default: Auto-detect

How many CPU threads Ollama uses. Note that the option name is num_thread (singular). For example, on an 8-core CPU:

  • 1-2: Slow, leaves CPU idle
  • 8: All cores active (default, good)
  • >8: Hyperthreads count as threads, so you can go up to 16 on an 8-core/16-thread CPU, but gains beyond the physical core count are usually small

Leave it on auto. Ollama detects your CPU and uses all available threads.
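If you want to know what "auto" has to work with, compare physical cores against logical CPUs. A minimal sketch for Linux, assuming lscpu from util-linux is installed; count_physical_cores is a hypothetical helper that reads lscpu's parseable output.

```shell
# count_physical_cores: read `lscpu -p=Core,Socket` output on stdin and count
# distinct core,socket pairs, which equals the number of physical cores.
count_physical_cores() {
  grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Usage:
#   lscpu -p=Core,Socket | count_physical_cores   # physical cores
#   nproc                                         # logical CPUs (includes hyperthreads)
```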

Quick Parameter Summary

💻 Parameter Quick Reference
temperature      → 0.0–2.0   (randomness, default 0.8)
top_p            → 0.0–1.0   (diversity, default 0.9)
top_k            → 1–∞       (token limit, default 40, skip it)
repeat_penalty   → 0–2       (penalize repetition, default 1.1)
num_ctx          → 128–32k   (memory, default 2048)
num_predict      → -1–∞      (max length, default -1)
num_thread       → 1–cores   (CPU usage, default auto)
Parameters Demystified: You now understand the knobs. Next: how to tune them for different use cases.

Parameters That Fit Your Task

Now that you understand parameters, let's apply them. Different tasks need different settings. A Discord bot behaves differently than a creative writer's assistant. Let's build configurations for real scenarios.

This is the main event. If you're here to connect Ollama to OpenClaw, Use Case 1 below is the configuration you need. The other use cases are handy for experimentation, but Use Case 1 is the one that powers your bot.

Use Case 1: OpenClaw Integration (Structured Output)

Goal: Consistent, deterministic responses. OpenClaw needs predictable JSON or structured text.

Recommended Settings:

📝 OpenClaw Configuration
curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Your prompt here",
    "stream": false,
    "options": {
      "temperature": 0.1,
      "top_p": 0.9,
      "num_predict": 2048,
      "num_ctx": 4096
    }
  }'

Why these values:

  • temperature: 0.1 → Focused, deterministic (reproducible output)
  • top_p: 0.9 → Still allows valid variation, not robotic
  • num_predict: 2048 → Reasonable max length for API
  • num_ctx: 4096 → Enough memory for conversation context

Expected behavior: Responses are consistent. Same prompt = mostly same answer (good for testing and debugging).

Use Case 2: Creative Writing

Goal: Variety and creativity. You want different results each time, but still coherent.

Recommended Settings:

📝 Creative Writing Configuration
curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Write a short story about...",
    "stream": true,
    "options": {
      "temperature": 1.2,
      "top_p": 0.95,
      "num_predict": 1000,
      "num_ctx": 2048,
      "repeat_penalty": 1.2
    }
  }'

Why these values:

  • temperature: 1.2 → Creative, more varied outputs
  • top_p: 0.95 → Very permissive vocabulary
  • repeat_penalty: 1.2 → Prevent repetitive phrases
  • stream: true → See creativity unfold in real-time

Expected behavior: Each run produces unique, interesting variations on the theme.

Use Case 3: Fast Inference (Real-Time Chat)

Goal: Speed matters. Users are waiting for a response. Trade some context for speed.

Recommended Settings:

📝 Fast Chat Configuration
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "What is...",
    "stream": true,
    "options": {
      "temperature": 0.7,
      "top_p": 0.9,
      "num_predict": 512,
      "num_ctx": 2048
    }
  }'

Why these values:

  • num_predict: 512 → Shorter responses (faster generation)
  • num_ctx: 2048 → Not too large (faster processing)
  • stream: true → First token appears faster (perceived speed)
  • model: llama3.2 → 3B model, faster than qwen3:8b for simple tasks

Expected behavior: First token appears in <500ms, total response in 5-10 seconds.

Use Case 4: Code Generation

Goal: Correct, working code. Logic must be sound.

Recommended Settings:

📝 Code Generation Configuration
curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Write a Python function that...",
    "stream": false,
    "options": {
      "temperature": 0.3,
      "top_p": 0.9,
      "num_predict": 1024,
      "num_ctx": 4096,
      "repeat_penalty": 1.1
    }
  }'

Why these values:

  • temperature: 0.3 → Focused on correct syntax
  • num_ctx: 4096 → Understand full code context
  • repeat_penalty: 1.1 → Avoid redundant code

Expected behavior: Code is syntactically correct and logically sound most of the time.

Use Case 5: Long-Form Document Analysis

Goal: Understand and analyze large texts. Need big context window.

Recommended Settings:

📝 Document Analysis Configuration
curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Summarize this document:\n\n[LARGE TEXT HERE]",
    "stream": false,
    "options": {
      "temperature": 0.5,
      "top_p": 0.9,
      "num_predict": 1024,
      "num_ctx": 8192
    }
  }'

Why these values:

  • num_ctx: 8192 → Can see entire document (trade-off: slower)
  • temperature: 0.5 → Faithful to source material
  • stream: false → You're doing analysis, not interactive chat

Expected behavior: Accurate summaries that capture main points. Slower (expect 20–30 seconds), but comprehensive.
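The five presets above can live in one place so every script uses the same values. A sketch, assuming a hypothetical options_for helper; it prints the JSON object that Ollama's /api/generate expects under its "options" key, and the use-case labels are made up for this example.

```shell
# options_for USE_CASE : print the "options" JSON for one of the five presets above.
options_for() {
  case "$1" in
    openclaw) echo '{"temperature": 0.1, "top_p": 0.9, "num_predict": 2048, "num_ctx": 4096}' ;;
    creative) echo '{"temperature": 1.2, "top_p": 0.95, "num_predict": 1000, "num_ctx": 2048, "repeat_penalty": 1.2}' ;;
    fast)     echo '{"temperature": 0.7, "top_p": 0.9, "num_predict": 512, "num_ctx": 2048}' ;;
    code)     echo '{"temperature": 0.3, "top_p": 0.9, "num_predict": 1024, "num_ctx": 4096, "repeat_penalty": 1.1}' ;;
    docs)     echo '{"temperature": 0.5, "top_p": 0.9, "num_predict": 1024, "num_ctx": 8192}' ;;
    *)        echo "unknown use case: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   curl -s http://localhost:11434/api/generate \
#     -d "{\"model\": \"qwen3:8b\", \"prompt\": \"Hello\", \"stream\": false, \"options\": $(options_for openclaw)}"
```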

Testing Your Configuration

Don't just trust recommendations. Test with your own prompts:

🖥️ Mini PC - A/B Test Script
#!/bin/bash
PROMPT="Write a haiku about programming"

# "$PROMPT" is double-quoted between the single-quoted JSON pieces so that
# the spaces in the prompt do not split the curl argument.
echo "=== Configuration A ==="
time curl -s http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "'"$PROMPT"'",
    "stream": false,
    "options": { "temperature": 0.5 }
  }' | jq -r '.response'

echo ""
echo "=== Configuration B ==="
time curl -s http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "'"$PROMPT"'",
    "stream": false,
    "options": { "temperature": 1.0 }
  }' | jq -r '.response'

Run the same prompt with different configurations. Notice speed and quality differences. Build intuition.

Save Your Configs

Once you find good settings, save them to files for reuse:

📝 ~/.ollama/configs.sh
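# Helper: ollama_run MODEL OPTIONS_JSON PROMPT
ollama_run() {
  curl -s http://localhost:11434/api/generate \
    -d "$(jq -n --arg m "$1" --argjson o "$2" --arg p "$3" \
      '{model: $m, prompt: $p, stream: false, options: $o}')" | jq -r '.response'
}

# OpenClaw Config
alias ollama-openclaw='ollama_run qwen3:8b "{\"temperature\": 0.1, \"top_p\": 0.9}"'

# Creative Config
alias ollama-creative='ollama_run qwen3:8b "{\"temperature\": 1.2, \"top_p\": 0.95}"'

# Fast Config
alias ollama-fast='ollama_run llama3.2 "{\"temperature\": 0.7, \"num_predict\": 512}"'

Source the file from your interactive shell and pass the prompt as the argument, for example: ollama-fast "What does HTTP 429 mean?". Saves typing and keeps configs consistent.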

Task-Specific Tuning Ready: You can now configure Ollama for your specific needs. Next: benchmarking to measure results objectively.

Measure Speed and Quality Objectively

You can feel that a model is fast, but how fast exactly? How does Qwen3 compare to Llama 3.2 on your hardware? Benchmarking gives you objective data to guide optimization decisions.

Metrics That Matter

Four key metrics for LLM inference:

Benchmark Metrics
  • Time to First Token (TTFT): How long before the model starts responding. Goal: <500ms
  • Tokens Per Second (TPS): Generation speed. Goal: 10-15 for CPU, 50+ for GPU
  • Memory Used: RAM footprint during inference. Goal: <8GB for your setup
  • Response Quality: Does the answer make sense? Subjective but important
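Ollama's non-streaming response includes timing fields (eval_count, eval_duration, load_duration, prompt_eval_duration) with durations in nanoseconds, so the first two metrics can be derived without a stopwatch. A minimal sketch with hypothetical helper names; treating load time plus prompt evaluation time as TTFT is an approximation assumed here, not an official definition.

```shell
# Durations in Ollama responses are reported in nanoseconds.
tokens_per_second() {   # tokens_per_second EVAL_COUNT EVAL_DURATION_NS
  awk -v c="$1" -v ns="$2" 'BEGIN { printf "%.2f\n", c * 1e9 / ns }'
}

approx_ttft_ms() {      # approx_ttft_ms LOAD_DURATION_NS PROMPT_EVAL_DURATION_NS
  awk -v l="$1" -v p="$2" 'BEGIN { printf "%.0f\n", (l + p) / 1e6 }'
}

# Usage with a saved response (requires jq):
#   tokens_per_second "$(jq -r .eval_count resp.json)" "$(jq -r .eval_duration resp.json)"
#   approx_ttft_ms "$(jq -r .load_duration resp.json)" "$(jq -r .prompt_eval_duration resp.json)"
```

For example, 82 tokens generated in 8,200,000,000 ns works out to 10.00 tokens per second.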

Benchmark Script

Create a reusable benchmarking script:

📝 benchmark.sh
#!/bin/bash

MODEL="${1:-qwen3:8b}"
PROMPT="Explain machine learning in 3 paragraphs"

echo "Benchmarking $MODEL..."
echo ""

# Run inference and capture timing
START=$(date +%s%N)
RESPONSE=$(curl -s http://localhost:11434/api/generate \
  -d "{
    \"model\": \"$MODEL\",
    \"prompt\": \"$PROMPT\",
    \"stream\": false,
    \"options\": { \"temperature\": 0.5 }
  }")
END=$(date +%s%N)

# Extract metrics (Ollama reports durations in nanoseconds)
TOTAL_TIME=$(( (END - START) / 1000000 ))  # wall-clock milliseconds
TOKENS=$(echo "$RESPONSE" | jq -r '.eval_count')
EVAL_NS=$(echo "$RESPONSE" | jq -r '.eval_duration')
TPS=$(echo "scale=2; $TOKENS * 1000000000 / $EVAL_NS" | bc)

echo "Model: $MODEL"
echo "Total Time: ${TOTAL_TIME}ms"
echo "Tokens: $TOKENS"
echo "Tokens/Sec: $TPS"
echo ""
echo "Response:"
echo "---"
echo "$RESPONSE" | jq -r '.response'
echo "---"

Run it:

🖥️ Mini PC
bash benchmark.sh qwen3:8b
bash benchmark.sh llama3.2
bash benchmark.sh llama3.2:1b

Compare Models Side-by-Side

Test the same prompt across models to find your best balance of speed and quality:

🖥️ Mini PC - Comparison Output
Model            Time    Tokens  TPS     Quality
qwen3:8b         8.2s    82      10.0    Excellent
llama3.2         4.5s    85      18.9    Good
llama3.2:1b      2.1s    78      37.1    Basic

The 1B model is fastest, Qwen3 has the best quality. Your choice depends on your priority.

Monitor Resources During Benchmark

In another terminal, watch resource usage:

🖥️ Mini PC - Terminal 2
watch -n 0.5 'free -h && echo && top -n 1 -b | head -n 3'

Look for:

  • Peak memory usage (should be <8GB)
  • CPU utilization (should be 80%+ during inference)
  • No thermal throttling (CPU frequency stays stable)
Benchmarking Baseline Set: You can now objectively measure model performance. Use these numbers to guide optimization.

Understanding Hardware Acceleration (You Probably Don't Need It)

GPUs are much faster than CPUs for LLM inference. Many systems have an integrated GPU, but integrated graphics are typically too small to meaningfully accelerate LLMs. Let's understand when GPU acceleration actually helps.

Integrated vs Discrete GPUs

Most CPUs include an integrated GPU, but these are far too weak for LLM acceleration. For reference:

  • Typical integrated GPU: ~1–2 TFLOPs
  • RTX 4070: ~20 TFLOPs (10–20x more powerful)
  • RTX 4090: ~82 TFLOPs (40–80x more powerful)

Integrated GPUs are orders of magnitude weaker than discrete gaming GPUs. For LLMs on integrated graphics, the CPU is actually competitive.

Check if Ollama Detects Your GPU

Ollama logs what hardware it's using. When Ollama runs as a systemd service, those logs go to the systemd journal. Check:

🖥️ Mini PC
journalctl -u ollama --no-pager -n 200 | grep -i gpu

If you see GPU references, Ollama detected it. If not, it's using CPU (default, which is fine for you).

When GPU Acceleration Helps

GPU acceleration is worth it if:

  • Nvidia RTX GPU: 2080 Ti or newer (30x-60x speedup)
  • AMD Radeon: RX 6700 or newer (10x-20x speedup)
  • Apple Silicon: Built-in GPU (3x-5x speedup)

An integrated GPU? Not worth the complexity. CPU inference is simpler and nearly as fast.

Should You Upgrade Hardware?

For local LLMs, consider GPU if:

  • You want to run 13B+ models (currently too slow on CPU)
  • You need sub-second response times (production use)
  • You want to run multiple concurrent inferences

For now, stick with CPU. A modern multi-core CPU is plenty fast for a single Ollama instance running 7B models.

GPU Decision Made: If you only have integrated graphics, CPU inference is the right choice. No GPU upgrade needed for 7B models.

Leverage Your RAM - Load Once, Use Instantly

With 16GB+ RAM and OLLAMA_KEEP_ALIVE set, you can keep multiple models loaded simultaneously and switch between them with zero cold-start delay. Different models for different tasks - all ready to go.

Check What's Currently Loaded

Before thinking about concurrent models, know what's already in memory:

🖥️ Mini PC
ollama ps
🖥️ Mini PC - Output
NAME              ID              SIZE      PROCESSOR    UNTIL
qwen3:8b        845dbda0ea48    5.5 GB    100% CPU     28 minutes from now
llama3.2          a80c4f17acd5    2.5 GB    100% CPU     25 minutes from now

PROCESSOR shows whether your GPU is involved. On a CPU-only machine you'll see 100% CPU. On a machine with a compatible GPU (NVIDIA with CUDA, AMD with ROCm, or Apple Silicon), you'll see a split like 45%/55% CPU/GPU - higher GPU % means faster inference. The UNTIL column shows the keep-alive expiry for each loaded model.

Concurrent Model Memory Layout

On a 32GB system, loading your full stack looks like this:

💻 Memory Layout - 32GB System
OS + System:           ~2GB (always used)
Ollama Runtime:        ~1GB
qwen3:8b (q4_K_M):  ~5.5GB  - all-rounder, 128K context
llama3.2 (3B):         ~2.5GB  - speed tier, quick queries
phi4 (14B):            ~9.0GB  - heavy reasoning, complex problems
Reserve Buffer:        ~12GB   (available for gemma3 or larger models)
────────────────────────────────
Total Used:            ~20GB
Total Available:       ~12GB

With OLLAMA_KEEP_ALIVE set to -1 or a long duration, all three models stay resident. Switching between them is instant - the model is already warm in memory.
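Before pinning a stack, it is worth redoing the arithmetic from the layout above for your own machine. A small sketch with a hypothetical ram_budget helper; the sizes are the approximate figures from the table, and real usage also grows with context length.

```shell
# ram_budget TOTAL_GB OVERHEAD_GB MODEL_GB... : print "used free" in GB.
# OVERHEAD_GB covers the OS plus the Ollama runtime.
ram_budget() {
  total=$1; overhead=$2; shift 2
  printf '%s\n' "$@" | awk -v total="$total" -v used="$overhead" '
    { used += $1 }
    END { printf "%.1f %.1f\n", used, total - used }'
}

# Example: 32GB system, 3GB overhead, qwen3:8b + llama3.2 + phi4
#   ram_budget 32 3 5.5 2.5 9.0    # 20.0 12.0
# The same stack on 16GB would leave a negative balance:
#   ram_budget 16 3 5.5 2.5 9.0    # 20.0 -4.0
```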

Practical Multi-Model Setup (2026)

A sensible three-tier stack for most use cases:

Recommended Model Stack
  • qwen3:8b - Default tier. Complex reasoning, code, long documents (128K context). Best all-rounder.
  • llama3.2 - Speed tier. Quick lookups, simple tasks, when you want an instant answer.
  • phi4 - Heavy reasoning (needs ~10GB RAM). Pull when qwen3:8b isn't cutting it on complex problems.
  • gemma3:12b - Alternative heavy model. Strong at instruction-following and multilingual tasks.

Total disk for the first two: ~8GB. Add phi4 and you're at ~17GB. On 32GB RAM, that's comfortable headroom.

API Request Routing by Model

With multiple models loaded, route requests to the right model for the task:

💻 Route by task type
# Complex task → Qwen3 for reasoning + 128K context
curl http://localhost:11434/api/generate \
  -d '{"model":"qwen3:8b","prompt":"Analyze this long document..."}'

# Quick factual lookup → Llama 3.2 for speed
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"What does HTTP 429 mean?"}'

# Code review → Qwen3 for reliable instruction following
curl http://localhost:11434/api/generate \
  -d '{"model":"qwen3:8b","prompt":"Review this function for bugs..."}'

If all three are loaded in memory (keep-alive), these requests complete without any model loading delay. Ollama queues concurrent requests and processes them in order, switching models transparently.
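In a script, the routing decision can be a single lookup. A sketch only: pick_model and its task labels are hypothetical and simply mirror the tiers described above.

```shell
# pick_model TASK : map a made-up task label to a model tier.
pick_model() {
  case "$1" in
    quick|lookup)        echo "llama3.2" ;;   # speed tier
    code|analysis|long)  echo "qwen3:8b" ;;   # default tier
    *)                   echo "qwen3:8b" ;;   # anything unrecognised goes to the default
  esac
}

# Usage:
#   MODEL=$(pick_model quick)
#   curl -s http://localhost:11434/api/generate \
#     -d "{\"model\": \"$MODEL\", \"prompt\": \"What does HTTP 429 mean?\", \"stream\": false}"
```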

Pinning Models with OLLAMA_KEEP_ALIVE

The key to no-delay model switching is keeping models in memory. Set keep-alive system-wide via systemd:

🖥️ Mini PC - Pin models via systemd
sudo systemctl edit ollama
📝 Add to the override file
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
🖥️ Mini PC - Reload
sudo systemctl daemon-reload && sudo systemctl restart ollama
Use -1 only if you have headroom. With -1, models never unload until Ollama restarts. On 16GB RAM with one 7B model loaded, that's fine. Loading a second 7B while the first is pinned will consume ~12GB - check ollama ps and your free RAM before pinning multiple large models permanently.
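Keep-alive can also be set per request: the /api/generate endpoint accepts a keep_alive field, and a request with no prompt just loads the model. That lets you pin one model without changing the system-wide setting. A sketch with a hypothetical helper that builds the request body.

```shell
# keep_alive_payload MODEL KEEP_ALIVE : JSON body that loads MODEL without generating text.
# Plain numbers (for example -1 or 0) are sent bare; durations such as 30m are sent as strings.
keep_alive_payload() {
  case "$2" in
    ''|*[!0-9-]*) printf '{"model": "%s", "keep_alive": "%s"}\n' "$1" "$2" ;;
    *)            printf '{"model": "%s", "keep_alive": %s}\n' "$1" "$2" ;;
  esac
}

# Usage:
#   curl -s http://localhost:11434/api/generate -d "$(keep_alive_payload llama3.2 -1)"    # pin
#   curl -s http://localhost:11434/api/generate -d "$(keep_alive_payload llama3.2 0)"     # unload now
```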
Multiple Models Ready: With a tiered stack and keep-alive tuned to your RAM, you have a local AI infrastructure that routes tasks intelligently. ollama ps is your dashboard - check it whenever you're curious about what's running.

Squeeze Maximum Performance

Your CPU is capable, but there are ways to go further. Environment variables, context tuning, CPU pinning - small tweaks compound into noticeable improvements. This section covers the full toolkit.

Key Environment Variables

Ollama exposes a set of environment variables that control how it loads and runs models. These are the most useful ones for a Mini PC setup:

OLLAMA_KEEP_ALIVE
How long to keep a model loaded in RAM after its last request. Default: 5m. Set to 30m or -1 (forever) to avoid cold-start delays. On 16GB+ systems, longer keep-alive means faster responses - the model is already warm.
Environment="OLLAMA_KEEP_ALIVE=30m"
OLLAMA_MAX_LOADED_MODELS
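Maximum number of models to keep loaded simultaneously, provided they fit in memory. Default: 3 × the number of GPUs, or 3 for CPU-only inference. On a 32GB system the default already covers a three-model stack; raise it to 4 only if you have RAM to spare, and consider lowering it on 16GB systems.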
Environment="OLLAMA_MAX_LOADED_MODELS=3"
OLLAMA_NUM_PARALLEL
How many requests to process in parallel per model. Default: 1 on CPU. Increasing this can improve throughput if you have multiple concurrent users or applications hitting Ollama at the same time, at the cost of higher RAM usage per request. Start at 1 on CPU - most single-user setups don't need more. The line below shows the syntax if you do decide to raise it to 2.
Environment="OLLAMA_NUM_PARALLEL=2"
OLLAMA_FLASH_ATTENTION
Enable Flash Attention, an optimized attention mechanism that reduces memory usage during long-context inference. Set to 1 to enable. Most beneficial when running models with large context windows (like qwen3:8b at 128K tokens). Can noticeably reduce RAM pressure on long conversations.
Environment="OLLAMA_FLASH_ATTENTION=1"
🖥️ Mini PC - Apply via systemd (persists across reboots)
sudo systemctl edit ollama
📝 Recommended override block for 16–32GB systems
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_FLASH_ATTENTION=1"
🖥️ Mini PC - Reload and restart
sudo systemctl daemon-reload && sudo systemctl restart ollama

CPU Affinity

Pin Ollama to specific cores to reduce OS scheduling overhead and give the model consistent CPU access:

🖥️ Mini PC - via systemd override
# Add to your [Service] block in systemctl edit ollama.
# Pins Ollama to cores 0–7 (adjust to your core count).
# systemd does not support trailing comments, so keep comments on their own lines.
CPUAffinity=0-7

On a 12-core machine, you might pin Ollama to 8 cores (0–7) and leave 4 cores (8–11) for the OS and other services. Experiment - the benefit varies by workload.
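The range itself is simple to compute from how many CPUs you want to hold back. A sketch with a hypothetical helper; note that systemd's CPUAffinity takes logical CPU indices (as listed by lscpu), so on a hyperthreaded machine count logical CPUs rather than physical cores.

```shell
# affinity_range TOTAL_CPUS RESERVED_CPUS : print a CPUAffinity value that
# leaves the highest-numbered RESERVED_CPUS for the OS and other services.
affinity_range() {
  last=$(( $1 - $2 - 1 ))
  if [ "$last" -lt 0 ]; then
    echo "error: nothing left for Ollama" >&2
    return 1
  fi
  echo "0-$last"
}

# Example:
#   affinity_range 12 4    # 0-7
```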

Context Window Tuning

Context window size directly affects inference speed and RAM usage. Smaller context = faster tokens:

💻 Speed vs. Context Tradeoff (7B model)
Context 1024:   ~20 tokens/sec (very fast, short conversations)
Context 2048:   ~15 tokens/sec (balanced - good default)
Context 4096:   ~10 tokens/sec (slower, good for most tasks)
Context 8192:   ~7 tokens/sec  (long docs, complex reasoning)
Context 32768+: ~3–5 tokens/sec (Qwen3 long context, only when needed)

You can override context size per request via the API using the options.num_ctx field. For most chat tasks, 2048–4096 is the sweet spot. Only go higher when you're actually sending long inputs.

💻 Set context size per API request
curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Summarize this document...",
    "options": {
      "num_ctx": 8192
    }
  }'

Memory Pressure Relief

If you're hitting RAM limits - models are being unexpectedly unloaded, or you see swap activity - try these:

🖥️ Mini PC - Diagnose memory pressure
# See current model memory usage
ollama ps

# Check system RAM
free -h

# Check if swap is being used (bad for inference speed)
swapon --show

If you're hitting swap, either reduce OLLAMA_MAX_LOADED_MODELS, switch to smaller model variants (e.g. llama3.2:3b instead of a 7B), or reduce OLLAMA_KEEP_ALIVE so models unload faster.
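
To check for swap use from a script rather than by eye, you can read the SwapTotal and SwapFree fields in /proc/meminfo. This is a minimal sketch; `swap_used_kb` is a hypothetical helper name, and the optional file argument exists only so it can be tried against a saved copy.

```shell
# swap_used_kb [meminfo_file]
# Prints swap currently in use, in kB (SwapTotal - SwapFree).
# Defaults to /proc/meminfo. (Illustrative helper, Linux only.)
swap_used_kb() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^SwapTotal:/ { total = $2 }
       /^SwapFree:/  { free = $2 }
       END { print total - free }' "$meminfo"
}

# Usage: warn when any swap is in use while a model is loaded
#   [ "$(swap_used_kb)" -gt 0 ] && echo "Swap in use: expect slower inference"
```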

Optimized: Keep-alive, Flash Attention, and context tuning are the three highest-impact knobs for a CPU-based Mini PC. Set them once via systemd, then verify with ollama ps.

Keep Your LLM Healthy

A running system needs monitoring. Health checks, logs, and resource tracking are simple practices that prevent surprises.

Health Check Script

📝 health-check.sh
#!/bin/bash

echo "=== Ollama Health Check ==="
echo ""

# Service status
echo "✓ Service Status:"
systemctl status ollama --no-pager | head -3

echo ""
echo "✓ API Connectivity:"
curl -s http://localhost:11434/api/tags | jq '.models | length' | xargs echo "  Models available:"

echo ""
echo "✓ Resources:"
ps aux | grep "ollama serve" | grep -v grep | awk '{print "  CPU: " $3 "%, RAM: " $6 " KB"}'

echo ""
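echo "✓ Logs (last error):"
# The systemd service logs to the journal. A pipeline ending in tail always
# exits 0, so test for empty output instead of relying on ||.
last_err=$(journalctl -u ollama --no-pager -n 500 2>/dev/null | grep -i error | tail -1)
echo "  ${last_err:-No errors}"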

Log Location

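When Ollama runs as a systemd service (the setup used here), its logs go to the systemd journal rather than a file under ~/.ollama. Check for issues:

🖥️ Mini PC
# View recent logs
journalctl -u ollama --no-pager -n 50

# Find errors
journalctl -u ollama --no-pager | grep -i error | tail -10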
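
One scripting pitfall when reporting errors: a pipeline that ends in tail exits 0 even when grep found nothing, so a trailing || echo "No errors" never fires. A minimal sketch that checks for empty output instead (`last_error` is a hypothetical helper name; it reads log text on stdin, so it works with any log source):

```shell
# last_error
# Reads log text on stdin and prints the most recent line containing
# "error" (case-insensitive), or "No errors" if there is none.
# (Illustrative helper, not part of Ollama.)
last_error() {
  local line
  line="$(grep -i 'error' | tail -n 1 || true)"
  if [ -n "$line" ]; then
    echo "  $line"
  else
    echo "  No errors"
  fi
}

# Usage:
#   journalctl -u ollama --no-pager -n 500 | last_error
```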

Regular Maintenance

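  • Weekly: Check disk usage (du -sh ~/.ollama/models, or sudo du -sh /usr/share/ollama/.ollama/models when models are stored by the systemd service user)
  • Monthly: Update Ollama by re-running the official install script (curl -fsSL https://ollama.com/install.sh | sh), which is how the standard Linux install is upgraded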
  • Quarterly: Clean unused models (ollama rm model_name)
  • Yearly: Backup models to external drive
Monitored: Your Ollama system is healthy and maintainable.

Before You Go Live

You've learned parameters, benchmarking, and optimization. This checklist ensures your Ollama setup is production-ready for OpenClaw.

Pre-Launch Checklist

Production Ready?
  • ☐ Ollama running as systemd service (auto-start on boot)
  • ☐ Model selected and downloaded (qwen3:8b recommended)
  • ☐ API endpoint verified (curl http://localhost:11434/api/tags)
  • ☐ Temperature set appropriately (0.1-0.3 for OpenClaw)
  • ☐ Context window configured (4096 minimum)
  • ☐ Max tokens set (2048 to prevent runaway)
  • ☐ Baseline performance benchmarked
  • ☐ Health check script running

Configuration Backup

Save your working configuration:

🖥️ Mini PC
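# Backup models (a default systemd install stores them in
# /usr/share/ollama/.ollama/models instead; adjust the source path if so)
cp -r ~/.ollama/models ~/ollama-models-backup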

# Document your settings
cat > ~/ollama-config.txt << EOC
Model: qwen3:8b
Temperature: 0.1
Top-P: 0.9
Context: 4096
Num Predict: 2048
EOC
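
The documented settings map directly onto the API's options object (temperature, top_p, num_ctx, num_predict), so the same values can be turned into a request fragment. This is a sketch: `build_options` is a hypothetical helper name, and it does no validation of the values you pass in.

```shell
# build_options <temperature> <top_p> <num_ctx> <num_predict>
# Prints a JSON "options" object for an Ollama API request.
# (Illustrative helper; values are inserted as-is, without validation.)
build_options() {
  printf '{"temperature":%s,"top_p":%s,"num_ctx":%s,"num_predict":%s}\n' \
    "$1" "$2" "$3" "$4"
}

# Usage: reuse the saved settings in a request
#   opts="$(build_options 0.1 0.9 4096 2048)"
#   curl -s http://localhost:11434/api/generate \
#     -d "{\"model\":\"qwen3:8b\",\"prompt\":\"Hello\",\"stream\":false,\"options\":$opts}"
```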

OpenClaw Integration Checklist

  • ☐ Ollama endpoint configured in OpenClaw (localhost:11434)
  • ☐ Model name matches what you pulled
  • ☐ Test prompt sent and received successfully
  • ☐ Response quality acceptable
  • ☐ Response time reasonable (5-20 seconds)
  • ☐ No memory leaks after sustained use
  • ☐ Reboot test: Ollama starts automatically, works after restart
Launch Ready: Your Ollama + OpenClaw setup is production-ready.

Common Issues and Fixes

Things go wrong. Here's how to diagnose and fix common Ollama problems.

Ollama Won't Start

Symptom: Service shows inactive or fails to start

🖥️ Mini PC - Diagnosis
systemctl status ollama
# Read error message

# Try manual start to see error
ollama serve

Common fixes:

  • Permission denied: Fix ownership of the service's data directory (on a default systemd install: sudo chown -R ollama:ollama /usr/share/ollama/.ollama)
  • Port conflict: Check sudo lsof -i :11434
  • Corrupted model: Delete and re-pull

Out of Memory Errors

Symptom: "OOM Killer" in dmesg, processes killed

🖥️ Mini PC - Check Memory
free -h
# If available < 2GB during inference, you're hitting limits

sudo dmesg | grep -i "killed process" | tail -5
# Shows what got OOM killed

Fixes:

  • Reduce context window (num_ctx: 2048 instead of 8192)
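  • Unload unused models from memory (ollama stop model_name). Note that ollama rm deletes the model from disk instead
  • Close other apps consuming memory
  • Add swap only as a last resort: it prevents OOM kills, but inference slows sharply once the model spills into swap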
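
To see which processes the OOM killer has been hitting, you can pull the process names out of the kernel log. The sketch assumes the common kernel message shape "Killed process <pid> (<name>)"; the exact wording can vary between kernel versions, and `oom_victims` is a hypothetical helper name.

```shell
# oom_victims
# Reads kernel log text on stdin and prints the name of each process the
# OOM killer terminated, one per line.
# Assumes lines containing: Killed process <pid> (<name>)
oom_victims() {
  sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p'
}

# Usage: count OOM kills per process
#   sudo dmesg | oom_victims | sort | uniq -c
```

If ollama shows up here repeatedly, the model plus its context does not fit in RAM at the current settings.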

Very Slow Inference

Symptom: <5 tokens/sec (should be 10-15)

🖥️ Mini PC - Check System
top -n 1
# CPU usage <80%? Issue might be elsewhere
# Check if CPU is being shared with other apps

iostat -x 1
# High %iowait or %util? Disk bottleneck (iostat is in the sysstat package; Ctrl+C to stop)

Fixes:

  • Close heavy apps (browser, IDE, etc)
  • Reduce num_ctx to 2048
  • Use faster model (llama3.2:3b)
  • Check for thermal throttling (watch -n 1 'cat /proc/cpuinfo | grep MHz')
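
For the thermal throttling check, a single averaged number is easier to compare over time than a scrolling list of per-core values. This sketch averages the "cpu MHz" lines that x86 Linux exposes in /proc/cpuinfo; `avg_mhz` is a hypothetical helper name, and the optional file argument is only there so it can be tried on a saved copy.

```shell
# avg_mhz [cpuinfo_file]
# Prints the average clock speed across all CPUs, in MHz (rounded).
# Defaults to /proc/cpuinfo. Fails if no "cpu MHz" lines are present.
# (Illustrative helper; relies on the x86 /proc/cpuinfo format.)
avg_mhz() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  awk -F': *' '/^cpu MHz/ { sum += $2; n++ }
       END { if (n == 0) exit 1; printf "%.0f\n", sum / n }' "$cpuinfo"
}

# Usage: sample once per second while a prompt is generating
#   while true; do echo "$(date +%T) $(avg_mhz) MHz"; sleep 1; done
```

A sustained average well below your CPU's rated clock while inference is running suggests throttling, though it is not proof on its own.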

API Not Responding

Symptom: curl returns Connection refused

🖥️ Mini PC - Check Service
systemctl status ollama
# Make sure it's running

ss -ltn | grep 11434
# Should show LISTEN on port 11434

curl http://localhost:11434/api/tags
# Should return JSON, not error

Fixes:

  • Start service: sudo systemctl start ollama
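  • Check firewall (remote clients only): sudo ufw allow 11434. Requests to localhost are not blocked by ufw by default, and Ollama listens only on 127.0.0.1 unless OLLAMA_HOST is changed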
  • Restart: sudo systemctl restart ollama
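
Right after a start or restart, the API can take a moment to come up, so a single curl may still report Connection refused. A small retry wrapper avoids false alarms. This is a generic sketch; `retry` is a hypothetical helper name and the attempt count and delay are arbitrary.

```shell
# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts run out.
# Returns 0 on the first success, 1 if every attempt failed.
# (Illustrative helper, not part of Ollama.)
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$(( i + 1 ))
  done
  return 1
}

# Usage: restart, then wait up to ~10 seconds for the API
#   sudo systemctl restart ollama
#   retry 10 1 curl -sf -o /dev/null http://localhost:11434/api/tags \
#     && echo "API is up" || echo "API still not responding"
```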

Bad Quality Responses

Symptom: Responses are nonsensical or repetitive

Fixes:

  • Lower temperature (0.3 or less)
  • Increase repeat_penalty (1.5)
  • Reduce num_predict (limit length)
  • Try different model (llama3.2 instead of qwen3:8b)
Fixed: Most issues are solvable with these techniques.

Where to Go From Here

You've completed Ollama Advanced. You understand parameters, tuning, integration, optimization, and troubleshooting. Your local LLM setup is sophisticated and production-ready.

You've Accomplished

  • ✓ Understand all Ollama parameters and their effects
  • ✓ Tune for specific use cases (OpenClaw, creative, code, etc)
  • ✓ Benchmark models objectively
  • ✓ Know when GPU acceleration matters (and doesn't for you)
  • ✓ Run multiple models concurrently
  • ✓ Integrate Ollama seamlessly with OpenClaw
  • ✓ Optimize your hardware for maximum LLM performance
  • ✓ Monitor and maintain a healthy Ollama system
  • ✓ Troubleshoot common problems

Option 1: Deploy Your OpenClaw Bot

You have everything you need. Configure OpenClaw to use your local Ollama, then deploy:

  • Point OpenClaw to http://localhost:11434
  • Select your tuned model and parameters
  • Launch your Discord bot
  • Enjoy your private, offline-capable AI agent

Option 2: Explore Advanced Topics

If you want to go deeper:

  • Model Fine-Tuning: Customize models for specific tasks (advanced)
  • Quantization: Compress models further (4-bit, 2-bit)
  • Distributed Inference: Run Ollama across multiple machines
  • Web UI: Build a web interface for Ollama
  • Monitoring: Prometheus/Grafana metrics tracking

Option 3: Compare with Other LLM Tools

Ollama isn't the only option. If you want to explore:

  • LM Studio: Desktop GUI app for local models (easier than the Ollama CLI)
  • vLLM: High-performance inference server (more complex)
  • Text Generation WebUI: Feature-rich but steeper learning curve
  • GPT4All: Lightweight, beginner-friendly

But honestly? Ollama is the best balance of simplicity and power for your use case.

Keep Learning

Understand Transformers Better:

  • Read: "Attention Is All You Need" (the original Transformer paper)
  • Watch: YouTube tutorials on how LLMs work
  • Experiment: Try different models, prompt engineering

Follow the Community:

  • Ollama GitHub (issues, discussions)
  • Hugging Face model hub (find new models)
  • Reddit r/LocalLLM (community sharing)

You're Part of the AI Revolution

A few years ago, running local LLMs meant compiling C++, wrestling with dependencies, and getting 1-2 tokens/sec. Now? You install Ollama, pull a model, and get 10-15 tokens/sec on CPU. You've got a private, offline-capable AI brain that costs nothing to run.

Your data is yours. Your LLM is yours. No cloud vendor, no API limits, no surveillance.

That's power. Use it wisely.

Advanced Complete: You're now an Ollama expert. Go build amazing things. 🚀