Beyond the Basics: Optimization and Integration
You've completed Ollama Basics. You have Llama 3.2 running, you've tested the REST API, and you've pulled a few models. Now: let's hook it into OpenClaw and then make it better.
This tutorial starts with the thing you're here for - getting OpenClaw to use your local models. Once that's working, we dig into parameter tuning, benchmarking, and optimization so you can make it faster and smarter over time.
Where You Left Off (Quick Review)
You have:
- ✓ Ollama installed and running as a service
- ✓ At least one model downloaded (Llama 3.2 or Qwen3)
- ✓ REST API working (tested with curl)
- ✓ Baseline performance metrics (10-15 tokens/sec on the Ryzen)
If you don't have these, go back to Ollama Basics first. Not sure which model to use? The Best Local AI Models 2026 comparison covers every model worth knowing with pull commands ready to go.
What You'll Learn (Advanced)
In this 90-minute tutorial:
- OpenClaw Integration: Wire your bot to local Ollama - the whole reason you're here
- Parameters: Temperature, top_p, context windows - the knobs that control your model's behavior
- Use-Case Tuning: Different settings for different tasks (structured output, creative writing, code)
- Benchmarking: Objective methods to measure quality and speed
- GPU Detection: Understand when GPUs help (spoiler: not for you)
- Multiple Models: Run models concurrently, queue requests, instant switching
- Optimization: Memory tuning, CPU optimization, inference speed tuning
- Monitoring: Health checks, logs, continuous operation
- Troubleshooting: Common issues and how to fix them
Your Hardware
This tutorial works on any modern hardware capable of running Ollama. Here's what to keep in mind:
- CPU: 4+ cores (8+ cores recommended for smooth performance)
- RAM: 16GB minimum, 32GB recommended (more RAM = more models loaded simultaneously)
- Storage: SSD recommended (fast model loading)
- GPU: Optional - CPU-only inference works well for 7B models
Performance numbers in this tutorial are based on the Ryzen 7 6800H mini PC with 32GB RAM - the same hardware recommended throughout these tutorials. If you're on different hardware, your numbers will vary but the tuning principles are the same.
Why This Matters
For OpenClaw: A well-tuned local LLM can power a bot that feels just as smart as cloud-based solutions, but faster, cheaper, and completely private.
For Learning: Understanding parameters teaches you how language models actually work. You'll develop intuition for LLM behavior.
For Optimization: Squeezing another 20% performance from your hardware means faster responses and better user experience.
A Note on Complexity
This tutorial assumes you're comfortable with terminal commands and basic system administration. You don't need to be an expert, but "comfortable with Linux" is the baseline.
If something is unclear, go back to Ollama Basics or skip to the sections that interest you. You don't have to do everything in order.
Connect OpenClaw to Your Local Ollama
This is the payoff: your Discord bot running on a local LLM you control completely. No API keys, no cloud dependency, no monthly bills. Your conversations stay on your hardware.
What You Need Before Starting
- Ollama installed and running (covered in Ollama Basics)
- At least one model pulled - qwen3:8b recommended
- OpenClaw installed (see EC2 setup or Mini PC setup)
- Node 24 or Node 22 LTS (22.19+)
Verify Ollama is ready before touching OpenClaw config:
curl http://localhost:11434/api/tags
# Should return JSON with your pulled models listed
# If you get "connection refused" - start Ollama first: ollama serve
The Fast Path: Onboard Command
OpenClaw has a built-in onboarding command that handles Ollama configuration automatically. This is the quickest way to get wired up:
# Interactive mode (recommended for first time)
openclaw onboard
# Non-interactive with your chosen model
openclaw onboard --non-interactive \
--auth-choice ollama \
--custom-model-id "qwen3:8b" \
--accept-risk
The --auth-choice ollama flag tells OpenClaw to use local Ollama instead of a cloud provider.
It auto-discovers models from http://127.0.0.1:11434 and sets all costs to $0.
After onboarding, open the Control UI at http://localhost:18789 to verify the agent is running and connected.
Manual Config (More Control)
If you want to specify exact models, a custom host, or tune parameters, edit the config file directly. OpenClaw stores its settings at:
~/.openclaw/openclaw.json
Here is a working Ollama config block. Add or merge this into your existing openclaw.json:
{
"models": {
"providers": {
"ollama": {
"baseUrl": "http://127.0.0.1:11434",
"apiKey": "ollama-local",
"api": "ollama",
"models": [
{
"id": "qwen3:8b",
"name": "Qwen3 8B",
"reasoning": false,
"input": ["text"],
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
"contextWindow": 8192,
"maxTokens": 4096
},
{
"id": "llama3.2",
"name": "Llama 3.2 3B",
"reasoning": false,
"input": ["text"],
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
"contextWindow": 4096,
"maxTokens": 2048
}
]
}
}
},
"agents": {
"defaults": {
"model": {
"primary": "ollama/qwen3:8b",
"fallbacks": ["ollama/llama3.2"]
}
}
}
}
Warning: Do not add /v1 to the baseUrl. Using
http://127.0.0.1:11434/v1 activates the OpenAI-compatible endpoint, which breaks
tool calling - your bot will output raw tool JSON as plain text instead of actually calling tools.
Always use the native Ollama URL with "api": "ollama".
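One way to catch the /v1 mistake before restarting OpenClaw is a small pre-flight check on the config. The sketch below is a hypothetical helper, not part of OpenClaw: the name check_ollama_provider is made up for illustration, and it only inspects the keys shown in the manual config block above.

```python
def check_ollama_provider(config):
    """Return a list of problems found in an openclaw.json-style dict."""
    problems = []
    provider = config.get("models", {}).get("providers", {}).get("ollama", {})

    base_url = provider.get("baseUrl", "").rstrip("/")
    if base_url.endswith("/v1"):
        problems.append("baseUrl ends with /v1; use the native Ollama URL")
    if provider.get("api") != "ollama":
        problems.append('"api" should be "ollama"')

    # Every model the agent references should be declared under the provider.
    declared = {"ollama/" + m["id"] for m in provider.get("models", [])}
    model = config.get("agents", {}).get("defaults", {}).get("model", {})
    referenced = [model.get("primary"), *model.get("fallbacks", [])]
    for ref in referenced:
        if ref and ref not in declared:
            problems.append(f"{ref} is not declared under providers.ollama.models")
    return problems


# Example: a config that repeats the /v1 mistake and references an undeclared model.
bad_config = {
    "models": {"providers": {"ollama": {
        "baseUrl": "http://127.0.0.1:11434/v1",
        "api": "ollama",
        "models": [{"id": "qwen3:8b"}],
    }}},
    "agents": {"defaults": {"model": {
        "primary": "ollama/qwen3:8b",
        "fallbacks": ["ollama/llama3.2"],
    }}},
}
for problem in check_ollama_provider(bad_config):
    print("WARNING:", problem)
```

To check a real file, load it with json.load and pass the resulting dict in. The declared-model check only applies to manual configs; with auto-discovery there is no model list to compare against.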
Auto-Discovery (Simplest Option)
If you just want OpenClaw to pick up whatever models you have installed without listing them manually, set one environment variable and restart OpenClaw:
export OLLAMA_API_KEY="ollama-local"
# Add to your shell profile so it persists
echo 'export OLLAMA_API_KEY="ollama-local"' >> ~/.bashrc
source ~/.bashrc
With this set, OpenClaw automatically queries /api/tags to find all your models,
reads their context windows, and marks anything with "r1", "reasoning", or "think" in the name
as a reasoning-capable model. All costs are set to $0.
Switch the Active Model
Once configured, you can verify OpenClaw sees your Ollama models and swap between them:
# See what models OpenClaw has available
openclaw models list
# Set qwen3:8b as primary
openclaw models set ollama/qwen3:8b
# Switch to the fast 3B model for lighter tasks
openclaw models set ollama/llama3.2
The ollama/ prefix tells OpenClaw which provider the model lives on.
You'll use this prefix any time you reference a model in config or CLI commands.
Test It End to End
Before you start routing Discord messages through it, confirm the full chain is working:
# 1. Confirm Ollama is responding
curl http://localhost:11434/api/tags
# 2. Confirm OpenClaw sees the models
openclaw models list
# 3. Check provider connectivity
openclaw models status
# 4. Open the Control UI and send a test prompt
# http://localhost:18789
If openclaw models list shows your Ollama models and the Control UI returns
a response to a test prompt, you are fully wired up.
Performance on Your Hardware
What to expect on a Ryzen 7 6800H mini PC (CPU-only inference):
Short factual question (e.g. "What is HTTP 429?")
→ llama3.2 (3B): 3-5 seconds
→ qwen3:8b: 6-10 seconds
Medium response (paragraph explanation)
→ llama3.2 (3B): 8-12 seconds
→ qwen3:8b: 12-20 seconds
Multiple users queued simultaneously
→ Ollama processes one at a time
→ Each user waits for the one ahead of them
→ Fine for personal bots and small communities
For a personal Discord server this is completely usable. Users perceive the "typing..." indicator
as the bot thinking, which feels natural. If you need faster responses, use llama3.2
as the primary and save qwen3:8b for when quality matters more than speed.
The Knobs That Control Your LLM
LLMs generate text one token at a time. But how they choose which token to generate is controlled by parameters. These are your tuning knobs. Understand them, and you control the model's behavior.
Temperature (Randomness)
Range: 0.0 to 2.0 (or higher)
Default: 0.8 in Ollama, unless the model's Modelfile sets its own value
Temperature controls how "creative" or "random" the model gets:
- 0.0 (Cold, Deterministic): Always picks the most likely next token. Same input = essentially identical output every time. Good for: Precise answers, code generation, structured output.
- 0.5 (Cool, Focused): Mostly predictable but with some variation. Good for: Professional writing, Q&A, technical content.
- 0.7-0.8 (Balanced, Default range): A mix of creativity and consistency. Good for: General chat, creative but coherent responses.
- 1.2 (Warm, Creative): More randomness, more interesting responses. Good for: Brainstorming, creative writing, poetry.
- 1.5+ (Hot, Chaotic): Very random, often incoherent. Good for: Experimentation (usually a mistake).
Try it:
# Sampling parameters must sit inside "options"; a top-level "temperature" is silently ignored
# Cold (deterministic)
curl -s http://localhost:11434/api/generate -d '{
"model": "qwen3:8b",
"prompt": "Complete: The future of AI is",
"options": { "temperature": 0.0 },
"stream": false
}' | jq -r '.response'
# Hot (creative)
curl -s http://localhost:11434/api/generate -d '{
"model": "qwen3:8b",
"prompt": "Complete: The future of AI is",
"options": { "temperature": 1.5 },
"stream": false
}' | jq -r '.response'
Run both a few times and compare. At 0.0 the output should be nearly identical on every run; at 1.5 it changes each time and often drifts off topic.
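What temperature does to the token probabilities can be sketched in a few lines. This is an illustration of the general technique (a softmax over scores divided by the temperature), not Ollama's actual sampler code. The three scores are invented, and treating temperature 0 as "always pick the top token" is an assumption that matches how most samplers behave.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 means greedy, all mass on the best token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
for temperature in (0.0, 0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Low temperatures concentrate probability on the top token, while high temperatures flatten the distribution so unlikely tokens get picked more often.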
Top-P (Nucleus Sampling)
Range: 0.0 to 1.0
Default: 0.9
Top-P limits sampling to the most probable tokens, and the size of that set adapts to how confident the model is. It works like this:
- The model predicts a probability for every possible next token
- Tokens are sorted by probability (highest first)
- The smallest set of top tokens whose probabilities add up to at least p is kept
- The next token is sampled from that restricted set
In practice:
- 0.1: Ultra-focused (only the tokens covering the top 10% of probability mass, often a single token)
- 0.5: Focused (tokens covering the top 50% of probability mass)
- 0.9: Balanced (tokens covering the top 90% of probability mass; only the unlikely tail is cut)
Works best with temperature: temperature reshapes the whole distribution, while Top-P cuts off its unlikely tail. Usually keep it at 0.9 unless you're experimenting.
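The nucleus selection steps can be sketched as follows. This is a minimal illustration of the general top-p technique with made-up probabilities; the function name top_p_filter is hypothetical and Ollama's internal implementation may differ in detail.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p.

    Returns {token_index: renormalized_probability}.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}


# Hypothetical distribution over four candidate tokens.
probs = [0.5, 0.3, 0.15, 0.05]
for p in (0.1, 0.5, 0.9):
    print(p, sorted(top_p_filter(probs, p)))
```

Note that a small p does not mean "10% of the tokens": with a confident model, a single token can already cover the whole nucleus.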
Top-K
Range: 1 to infinity
Default: 40 (varies by Ollama version)
Top-K is the simpler cousin of top-P. It limits the model to only consider the K most likely tokens:
- K=1: Only consider the most likely token (extremely constrained)
- K=10: Consider top 10 most likely tokens
- K=40: Consider top 40 (good balance)
- K=100+: Very permissive (almost all tokens allowed)
Practical advice: Leave top-K at default and use temperature+top-P instead. Top-K is older and less intuitive than top-P.
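For comparison, a top-K cut-off is even simpler. The sketch below (hypothetical helper name, invented probabilities) keeps a fixed number of candidates regardless of how the probability is distributed, which is why it adapts less well than top-P.

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalize their probabilities."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {index: prob / total for index, prob in ranked}


# Hypothetical distribution over four candidate tokens.
probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 1))  # only the single most likely token survives
print(top_k_filter(probs, 2))  # the top two tokens, rescaled to sum to 1
```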
Repeat Penalty
Range: 0.0 to 2.0
Default: 1.1
Repeat penalty prevents the model from repeating the same phrase over and over (common LLM failure mode):
- 1.0: No penalty (tokens can repeat freely)
- 1.1: Mild penalty (slight discouragement to repeat)
- 1.5: Strong penalty (heavily discourage repetition)
Use when: Model generates repetitive text ("the the the..." or "and and and...").
Default is good: Usually leave at 1.1.
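To make the penalty concrete, here is a sketch of one common formulation: scores of recently used tokens are divided by the penalty when positive and multiplied by it when negative, so they always become less attractive. This is an assumption about the general technique, not a copy of Ollama's code, and the scores are invented.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty):
    """Lower the scores of tokens that already appeared in the recent output."""
    adjusted = list(logits)
    for token in set(recent_tokens):
        if adjusted[token] > 0:
            adjusted[token] /= penalty  # positive score shrinks toward zero
        else:
            adjusted[token] *= penalty  # negative score becomes more negative
    return adjusted


# Hypothetical scores for three tokens; tokens 0 and 1 were just generated.
logits = [2.0, -1.0, 0.5]
print(apply_repeat_penalty(logits, recent_tokens=[0, 1], penalty=1.0))
print(apply_repeat_penalty(logits, recent_tokens=[0, 1], penalty=1.5))
```

With a penalty of 1.0 nothing changes, which is why 1.0 means "no penalty".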
Context Window (num_ctx)
Range: 128 to ~32000 tokens
Default: 2048
Context window is how many tokens the model can "see" when generating. It's like short-term memory:
- 256: Very short memory (forgets everything quickly)
- 2048: Good balance (remember ~1500 words of conversation)
- 4096: Long memory (remember ~3000 words)
- 8192: Very long (remember entire documents)
Tradeoff: Larger context = slower inference (more computation). On a typical multi-core CPU with a 7B model:
2048 tokens: ~15 tokens/sec (fast)
4096 tokens: ~12 tokens/sec (slightly slower)
8192 tokens: ~8 tokens/sec (noticeably slower)
16384+: ~3-5 tokens/sec (CPU gets saturated)
Recommendation: Use 2048 for chat. Use 4096 for document analysis. Skip 8192+ unless you really need the memory.
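A quick way to choose a context size is to estimate how many tokens the prompt needs and add the room reserved for the reply, since both share the same window. The sketch below assumes roughly 0.75 words per token, which is the ratio implied by the figures above (2048 tokens for about 1500 words). Real tokenizers vary, so treat the result as a rough guide; the function names are hypothetical.

```python
import math

WORDS_PER_TOKEN = 0.75  # rough assumption for English text


def estimate_tokens(text):
    """Estimate the token count of a text from its word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def pick_num_ctx(prompt, num_predict, sizes=(2048, 4096, 8192)):
    """Return the smallest listed context size that fits prompt plus reply."""
    needed = estimate_tokens(prompt) + num_predict
    for size in sizes:
        if needed <= size:
            return size
    return None  # too large for the listed sizes: shorten or split the input


chat_prompt = "word " * 1000      # stand-in for a ~1000-word conversation
document = "word " * 3000         # stand-in for a ~3000-word document
print(pick_num_ctx(chat_prompt, num_predict=512))
print(pick_num_ctx(document, num_predict=1024))
```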
Prediction Tokens (num_predict)
Range: -1 (unlimited) to any positive number
Default: -1 (unlimited)
Maximum tokens to generate before stopping. Prevents runaway responses:
- -1: Unlimited (model stops when it feels done)
- 128: Stop after 128 tokens (~100 words)
- 512: Stop after 512 tokens (~400 words)
- 2048: Stop after 2048 tokens (full page)
Use when: You want predictable response lengths. Good for APIs where you need bounded latency.
Threads (num_thread)
Range: 1 to your CPU thread count
Default: Auto-detect
How many CPU threads Ollama uses for inference. For example, on an 8-core, 16-thread CPU:
- 1-2: Slow, leaves most of the CPU idle
- 8: One thread per physical core (usually the best setting)
- >8: Also uses hyperthreads (up to 16 here); this rarely helps and can be slower
Leave it on auto. Ollama detects your CPU and picks a sensible thread count.
Quick Parameter Summary
temperature    → 0.0–2.0  (randomness, default 0.8)
top_p          → 0.0–1.0  (diversity, default 0.9)
top_k          → 1–∞      (token limit, default 40, skip it)
repeat_penalty → 0–2      (penalize repetition, default 1.1)
num_ctx        → 128–32k  (memory, default 2048)
num_predict    → -1–∞     (max length, default -1)
num_thread     → 1–cores  (CPU usage, default auto)
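When calling the REST API, these parameters belong inside an "options" object in the request body, not at the top level. The sketch below builds such a body and rejects values outside the ranges from the summary. The helper name and the range table are illustrative assumptions, not part of any Ollama client library.

```python
import json

# Ranges taken from the summary above; None means "no upper bound".
RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "top_k": (1, None),
    "repeat_penalty": (0.0, 2.0),
    "num_ctx": (128, 32768),
    "num_predict": (-1, None),
}


def build_generate_request(model, prompt, stream=False, **options):
    """Build a JSON-ready body for POST /api/generate with validated options."""
    for name, value in options.items():
        if name not in RANGES:
            raise ValueError(f"unknown option: {name}")
        low, high = RANGES[name]
        if value < low or (high is not None and value > high):
            raise ValueError(f"{name}={value} is outside the expected range")
    return {"model": model, "prompt": prompt, "stream": stream, "options": options}


body = build_generate_request(
    "qwen3:8b", "Explain HTTP 429", temperature=0.1, num_ctx=4096
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent with curl -d or any HTTP client pointed at http://localhost:11434/api/generate.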
Parameters That Fit Your Task
Now that you understand parameters, let's apply them. Different tasks need different settings. A Discord bot behaves differently than a creative writer's assistant. Let's build configurations for real scenarios.
Use Case 1: OpenClaw Integration (Structured Output)
Goal: Consistent, deterministic responses. OpenClaw needs predictable JSON or structured text.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "qwen3:8b",
"prompt": "Your prompt here",
"options": {
  "temperature": 0.1,
  "top_p": 0.9,
  "num_predict": 2048,
  "num_ctx": 4096
},
"stream": false
}'
Why these values:
- temperature: 0.1 → Focused, deterministic (reproducible output)
- top_p: 0.9 → Still allows valid variation, not robotic
- num_predict: 2048 → Reasonable max length for API
- num_ctx: 4096 → Enough memory for conversation context
Expected behavior: Responses are consistent. Same prompt = mostly same answer (good for testing and debugging).
Use Case 2: Creative Writing
Goal: Variety and creativity. You want different results each time, but still coherent.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "qwen3:8b",
"prompt": "Write a short story about...",
"options": {
  "temperature": 1.2,
  "top_p": 0.95,
  "num_predict": 1000,
  "num_ctx": 2048,
  "repeat_penalty": 1.2
},
"stream": true
}'
Why these values:
- temperature: 1.2 → Creative, more varied outputs
- top_p: 0.95 → Very permissive vocabulary
- repeat_penalty: 1.2 → Prevent repetitive phrases
- stream: true → See creativity unfold in real-time
Expected behavior: Each run produces unique, interesting variations on the theme.
Use Case 3: Fast Inference (Real-Time Chat)
Goal: Speed matters. Users are waiting for a response. Trade some context for speed.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "What is...",
"options": {
  "temperature": 0.7,
  "top_p": 0.9,
  "num_predict": 512,
  "num_ctx": 2048
},
"stream": true
}'
Why these values:
- num_predict: 512 → Shorter responses (faster generation)
- num_ctx: 2048 → Not too large (faster processing)
- stream: true → First token appears faster (perceived speed)
- model: llama3.2 → 3B model, faster than qwen3:8b for simple tasks
Expected behavior: First token appears in <500ms, total response in 5-10 seconds.
Use Case 4: Code Generation
Goal: Correct, working code. Logic must be sound.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "qwen3:8b",
"prompt": "Write a Python function that...",
"options": {
  "temperature": 0.3,
  "top_p": 0.9,
  "num_predict": 1024,
  "num_ctx": 4096,
  "repeat_penalty": 1.1
},
"stream": false
}'
Why these values:
- temperature: 0.3 → Focused on correct syntax
- num_ctx: 4096 → Understand full code context
- repeat_penalty: 1.1 → Avoid redundant code
Expected behavior: Code is syntactically correct and logically sound most of the time.
Use Case 5: Long-Form Document Analysis
Goal: Understand and analyze large texts. Need big context window.
Recommended Settings:
curl http://localhost:11434/api/generate \
-d '{
"model": "qwen3:8b",
"prompt": "Summarize this document:\n\n[LARGE TEXT HERE]",
"options": {
  "temperature": 0.5,
  "top_p": 0.9,
  "num_predict": 1024,
  "num_ctx": 8192
},
"stream": false
}'
Why these values:
- num_ctx: 8192 → Can see entire document (trade-off: slower)
- temperature: 0.5 → Faithful to source material
- stream: false → You're doing analysis, not interactive chat
Expected behavior: Accurate summaries that capture main points. Slower (expect 20–30 seconds), but comprehensive.
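If you call Ollama from a script rather than curl, the five configurations above fit naturally into one lookup table. The sketch below uses the values from this section. The preset names and the request_for helper are hypothetical, and the model choices simply mirror the examples.

```python
PRESETS = {
    "structured": {"model": "qwen3:8b", "options": {
        "temperature": 0.1, "top_p": 0.9, "num_predict": 2048, "num_ctx": 4096}},
    "creative": {"model": "qwen3:8b", "options": {
        "temperature": 1.2, "top_p": 0.95, "num_predict": 1000, "num_ctx": 2048,
        "repeat_penalty": 1.2}},
    "fast": {"model": "llama3.2", "options": {
        "temperature": 0.7, "top_p": 0.9, "num_predict": 512, "num_ctx": 2048}},
    "code": {"model": "qwen3:8b", "options": {
        "temperature": 0.3, "top_p": 0.9, "num_predict": 1024, "num_ctx": 4096,
        "repeat_penalty": 1.1}},
    "documents": {"model": "qwen3:8b", "options": {
        "temperature": 0.5, "top_p": 0.9, "num_predict": 1024, "num_ctx": 8192}},
}


def request_for(use_case, prompt, stream=False):
    """Build an /api/generate body for one of the named use cases."""
    preset = PRESETS[use_case]
    return {
        "model": preset["model"],
        "prompt": prompt,
        "stream": stream,
        # Copy so callers can tweak one request without changing the preset.
        "options": dict(preset["options"]),
    }


print(request_for("fast", "What does HTTP 429 mean?"))
```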
Testing Your Configuration
Don't just trust recommendations. Test with your own prompts:
#!/bin/bash
PROMPT="Write a haiku about programming"
echo "=== Configuration A ==="
time curl -s http://localhost:11434/api/generate \
-d '{
"model": "qwen3:8b",
"prompt": "'"$PROMPT"'",
"options": { "temperature": 0.5 },
"stream": false
}' | jq -r '.response'
echo ""
echo "=== Configuration B ==="
time curl -s http://localhost:11434/api/generate \
-d '{
"model": "qwen3:8b",
"prompt": "'"$PROMPT"'",
"options": { "temperature": 1.0 },
"stream": false
}' | jq -r '.response'
Run the same prompt with different configurations. Notice speed and quality differences. Build intuition.
Save Your Configs
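Once you find good settings, save them as shell functions in a file you can source (for example ~/.ollama_presets; the name is up to you):
# Shared helper: ollama_ask MODEL OPTIONS_JSON PROMPT (requires jq)
ollama_ask() {
jq -n --arg m "$1" --argjson o "$2" --arg p "$3" \
'{model: $m, prompt: $p, options: $o, stream: false}' |
curl -s http://localhost:11434/api/generate -d @- | jq -r '.response'
}
# OpenClaw Config
ollama-openclaw() { ollama_ask qwen3:8b '{"temperature": 0.1, "top_p": 0.9}' "$1"; }
# Creative Config
ollama-creative() { ollama_ask qwen3:8b '{"temperature": 1.2, "top_p": 0.95}' "$1"; }
# Fast Config
ollama-fast() { ollama_ask llama3.2 '{"temperature": 0.7, "num_predict": 512}' "$1"; }
Source the file and call a preset with your prompt, e.g. ollama-fast "What is HTTP 429?". This saves typing and keeps configs consistent.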
Measure Speed and Quality Objectively
You can feel that a model is fast, but how fast exactly? How does Qwen3 compare to Llama 3.2 on your hardware? Benchmarking gives you objective data to guide optimization decisions.
Metrics That Matter
Four key metrics for LLM inference:
- Time to First Token (TTFT): How long before the model starts responding. Goal: <500ms
- Tokens Per Second (TPS): Generation speed. Goal: 10-15 for CPU, 50+ for GPU
- Memory Used: RAM footprint during inference. Goal: <8GB for your setup
- Response Quality: Does the answer make sense? Subjective but important
Benchmark Script
Create a reusable benchmarking script:
#!/bin/bash
MODEL="${1:-qwen3:8b}"
PROMPT="Explain machine learning in 3 paragraphs"
echo "Benchmarking $MODEL..."
echo ""
# Run inference and capture timing
START=$(date +%s%N)
RESPONSE=$(curl -s http://localhost:11434/api/generate \
-d "{
\"model\": \"$MODEL\",
\"prompt\": \"$PROMPT\",
\"options\": { \"temperature\": 0.5 },
\"stream\": false
}")
END=$(date +%s%N)
# Extract metrics
TOTAL_TIME=$(( (END - START) / 1000000 )) # milliseconds
TOKENS=$(echo "$RESPONSE" | jq -r '.eval_count')
EVAL_NS=$(echo "$RESPONSE" | jq -r '.eval_duration') # generation time in nanoseconds
TPS=$(echo "scale=2; $TOKENS * 1000000000 / $EVAL_NS" | bc)
echo "Model: $MODEL"
echo "Total Time: ${TOTAL_TIME}ms"
echo "Tokens: $TOKENS"
echo "Tokens/Sec: $TPS"
echo ""
echo "Response:"
echo "---"
echo "$RESPONSE" | jq -r '.response'
echo "---"
Run it:
bash benchmark.sh qwen3:8b
bash benchmark.sh llama3.2
bash benchmark.sh llama3.2:1b
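The same metrics can be derived from the timing fields Ollama includes in a non-streaming response, all of which are reported in nanoseconds. The sketch below works on an already-parsed response. The sample numbers are invented, and treating load time plus prompt evaluation time as an approximate time to first token is an assumption, not an official metric.

```python
NS_PER_SECOND = 1_000_000_000


def summarize_timing(response):
    """Derive speed metrics from the timing fields of an /api/generate response."""
    tokens = response["eval_count"]
    tokens_per_sec = tokens * NS_PER_SECOND / response["eval_duration"]
    # Assumption: the model must load and read the prompt before the first token.
    ttft_ns = response.get("load_duration", 0) + response.get("prompt_eval_duration", 0)
    return {
        "tokens": tokens,
        "tokens_per_sec": round(tokens_per_sec, 2),
        "approx_ttft_ms": ttft_ns / 1_000_000,
        "total_s": response["total_duration"] / NS_PER_SECOND,
    }


# Invented sample values shaped like a real response (durations in nanoseconds).
sample = {
    "eval_count": 82,
    "eval_duration": 8_200_000_000,
    "load_duration": 150_000_000,
    "prompt_eval_duration": 250_000_000,
    "total_duration": 8_600_000_000,
}
print(summarize_timing(sample))
```

Using eval_duration rather than wall-clock time keeps model loading and prompt processing out of the tokens-per-second figure.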
Compare Models Side-by-Side
Test the same prompt across models to find your best balance of speed and quality:
Model        Time   Tokens  TPS    Quality
qwen3:8b     8.2s   82      10.0   Excellent
llama3.2     4.5s   85      18.9   Good
llama3.2:1b  2.1s   78      37.1   Basic
The 1B model is fastest, Qwen3 has the best quality. Your choice depends on your priority.
Monitor Resources During Benchmark
In another terminal, watch resource usage:
watch -n 0.5 'free -h && echo && top -n 1 -b | head -n 3'
Look for:
- Peak memory usage (should be <8GB)
- CPU utilization (should be 80%+ during inference)
- No thermal throttling (CPU stays stable frequency)
Understanding Hardware Acceleration (You Probably Don't Need It)
GPUs are much faster than CPUs for LLM inference. Many systems have an integrated GPU, but integrated graphics are typically too small to meaningfully accelerate LLMs. Let's understand when GPU acceleration actually helps.
Integrated vs Discrete GPUs
Most CPUs include an integrated GPU, but these are far too weak for LLM acceleration. For reference:
- Typical integrated GPU: ~1–2 TFLOPs
- RTX 4070: ~29 TFLOPs (roughly 15–30x more powerful)
- RTX 4090: ~82 TFLOPs (40–80x more powerful)
Integrated GPUs are one to two orders of magnitude weaker than discrete gaming GPUs, and they share system RAM with the CPU instead of having dedicated VRAM. For LLMs on integrated graphics, the CPU is actually competitive.
Check if Ollama Detects Your GPU
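Ollama logs what hardware it's using. When Ollama runs as a systemd service, those logs are in the journal:
journalctl -u ollama --no-pager | grep -i gpu | tail -20
If you see GPU references, read them to find out whether a GPU was detected. If there are none, Ollama is using the CPU (the default, which is fine for you).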
When GPU Acceleration Helps
GPU acceleration is worth it if:
- Nvidia RTX GPU: 2080 Ti or newer (30x-60x speedup)
- AMD Radeon: RX 6700 or newer (10x-20x speedup)
- Apple Silicon: Built-in GPU (3x-5x speedup)
An integrated GPU? Not worth the complexity. CPU inference is simpler and nearly as fast.
Should You Upgrade Hardware?
For local LLMs, consider GPU if:
- You want to run 13B+ models (currently too slow on CPU)
- You need sub-second response times (production use)
- You want to run multiple concurrent inferences
For now, stick with CPU. A modern multi-core CPU is plenty fast for a single Ollama instance running 7B models.
Leverage Your RAM - Load Once, Use Instantly
With 16GB+ RAM and OLLAMA_KEEP_ALIVE set, you can keep multiple models loaded simultaneously and switch between them with zero cold-start delay. Different models for different tasks - all ready to go.
Check What's Currently Loaded
Before thinking about concurrent models, know what's already in memory:
ollama ps
NAME      ID            SIZE    PROCESSOR  UNTIL
qwen3:8b  845dbda0ea48  5.5 GB  100% CPU   28 minutes from now
llama3.2  a80c4f17acd5  2.5 GB  100% CPU   25 minutes from now
PROCESSOR shows whether your GPU is involved. On a CPU-only machine you'll see 100% CPU. On a machine with a compatible GPU (NVIDIA with CUDA, AMD with ROCm, or Apple Silicon), you'll see a split like 45%/55% CPU/GPU - higher GPU % means faster inference. The UNTIL column shows the keep-alive expiry for each loaded model.
Concurrent Model Memory Layout
On a 32GB system, loading your full stack looks like this:
OS + System: ~2GB (always used)
Ollama Runtime: ~1GB
qwen3:8b (q4_K_M): ~5.5GB - all-rounder, 128K context
llama3.2 (3B): ~2.5GB - speed tier, quick queries
phi4 (14B): ~9.0GB - heavy reasoning, complex problems
Reserve Buffer: ~12GB (available for gemma3 or larger models)
────────────────────────────────
Total Used: ~20GB
Total Available: ~12GB
With OLLAMA_KEEP_ALIVE set to -1 or a long duration, all three models stay resident.
Switching between them is instant - the model is already warm in memory.
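The arithmetic behind this layout is easy to re-run for your own model mix. The sketch below adds up the sizes listed above plus the roughly 3GB of OS and runtime overhead. It is a rough planning aid that ignores the extra memory a large context window needs, and the helper name memory_plan is hypothetical.

```python
def memory_plan(total_gb, models, overhead_gb=3.0):
    """Check whether a set of models fits in RAM alongside OS and runtime overhead."""
    used = overhead_gb + sum(models.values())
    return {"used_gb": used, "free_gb": total_gb - used, "fits": used <= total_gb}


# Approximate loaded sizes in GB, taken from the layout above.
stack = {"qwen3:8b": 5.5, "llama3.2": 2.5, "phi4": 9.0}

print(memory_plan(32, stack))   # the 32GB layout shown above
print(memory_plan(16, stack))   # the same stack on a 16GB machine
```

On a 16GB machine the full stack does not fit, so you would drop phi4 or let models unload between uses.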
Practical Multi-Model Setup (2026)
A sensible three-tier stack for most use cases:
- qwen3:8b - Default tier. Complex reasoning, code, long documents (128K context). Best all-rounder.
- llama3.2 - Speed tier. Quick lookups, simple tasks, when you want an instant answer.
- phi4 - Heavy reasoning (needs ~10GB RAM). Pull when qwen3:8b isn't cutting it on complex problems.
- gemma3:12b - Alternative heavy model. Strong at instruction-following and multilingual tasks.
Total size for the first two: ~8GB. Add phi4 and you're at ~17GB. With 32GB of RAM, that leaves comfortable headroom even with all three loaded.
API Request Routing by Model
With multiple models loaded, route requests to the right model for the task:
# Complex task → Qwen3 for reasoning + 128K context
curl http://localhost:11434/api/generate \
-d '{"model":"qwen3:8b","prompt":"Analyze this long document..."}'
# Quick factual lookup → Llama 3.2 for speed
curl http://localhost:11434/api/generate \
-d '{"model":"llama3.2","prompt":"What does HTTP 429 mean?"}'
# Code review → Qwen3 for reliable instruction following
curl http://localhost:11434/api/generate \
-d '{"model":"qwen3:8b","prompt":"Review this function for bugs..."}'
If all three are loaded in memory (keep-alive), these requests complete without any model loading delay. Ollama queues concurrent requests and processes them in order, switching models transparently.
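In application code, this routing usually comes down to a small lookup from task type to model. The sketch below is one hypothetical way to express it. The task labels are made up for illustration, and the model choices follow the curl examples above.

```python
# Hypothetical task labels mapped to the models used in the examples above.
ROUTES = {
    "lookup": "llama3.2",       # quick factual questions: speed tier
    "analysis": "qwen3:8b",     # long documents and complex reasoning
    "code_review": "qwen3:8b",  # reliable instruction following
}
DEFAULT_MODEL = "qwen3:8b"


def route(task, prompt):
    """Build an /api/generate body that targets the right model for the task."""
    return {
        "model": ROUTES.get(task, DEFAULT_MODEL),
        "prompt": prompt,
        "stream": False,
    }


print(route("lookup", "What does HTTP 429 mean?"))
print(route("code_review", "Review this function for bugs..."))
```

Unknown task types fall back to the default model, so new kinds of request still get an answer.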
Pinning Models with OLLAMA_KEEP_ALIVE
The key to no-delay model switching is keeping models in memory. Set keep-alive system-wide via systemd:
sudo systemctl edit ollama
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
sudo systemctl daemon-reload && sudo systemctl restart ollama
Warning: Use -1 only if you have headroom. With -1, models never unload
until Ollama restarts. On 16GB RAM with one 7B model loaded, that's fine. Loading a second 7B while the first
is pinned will consume ~12GB - check ollama ps and your free RAM before pinning multiple
large models permanently.
Tip: ollama ps is your dashboard - check it whenever
you're curious about what's running.
Squeeze Maximum Performance
Your CPU is capable, but there are ways to go further. Environment variables, context tuning, CPU pinning - small tweaks compound into noticeable improvements. This section covers the full toolkit.
Key Environment Variables
Ollama exposes a set of environment variables that control how it loads and runs models. These are the most useful ones for a Mini PC setup:
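OLLAMA_KEEP_ALIVE: How long a model stays in memory after its last request. Default: 5m.
Set to 30m or -1 (forever) to avoid cold-start delays.
On 16GB+ systems, longer keep-alive means faster responses - the model is already warm.
Environment="OLLAMA_KEEP_ALIVE=30m"
OLLAMA_MAX_LOADED_MODELS: How many models can be loaded at the same time. Default: 3 on systems with GPU,
1 on CPU-only (defaults have changed between Ollama releases, so confirm against your version). Raise to 3 or 4 on 32GB systems to keep
your full model stack warm without unloading.
Environment="OLLAMA_MAX_LOADED_MODELS=3"
OLLAMA_NUM_PARALLEL: How many requests each loaded model handles in parallel. Default: 1 on CPU.
Increasing this can improve throughput if you have multiple concurrent users or
applications hitting Ollama at the same time, at the cost of higher RAM usage per request.
Start at 1 on CPU - most single-user setups don't need more. The value below is an example for a multi-user setup.
Environment="OLLAMA_NUM_PARALLEL=2"
OLLAMA_FLASH_ATTENTION: Enables Flash Attention, which lowers memory use as the context grows.
Set to 1 to enable. Most beneficial when running models
with large context windows (like qwen3:8b at 128K tokens). Can noticeably
reduce RAM pressure on long conversations.
Environment="OLLAMA_FLASH_ATTENTION=1"
To apply several of these together, put them in one systemd override: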
sudo systemctl edit ollama
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl daemon-reload && sudo systemctl restart ollama
CPU Affinity
Pin Ollama to specific cores to reduce OS scheduling overhead and give the model consistent CPU access:
# Add to your [Service] block in systemctl edit ollama.
# systemd does not support trailing comments on a setting, so keep notes on their own line.
# Pins to cores 0–7 (adjust to your core count)
CPUAffinity=0-7
On a 12-core machine, you might pin Ollama to 8 cores (0–7) and leave 4 cores (8–11) for the OS and other services. Experiment - the benefit varies by workload.
Context Window Tuning
Context window size directly affects inference speed and RAM usage. Smaller context = faster tokens:
Context 1024: ~20 tokens/sec (very fast, short conversations)
Context 2048: ~15 tokens/sec (balanced - good default)
Context 4096: ~10 tokens/sec (slower, good for most tasks)
Context 8192: ~7 tokens/sec (long docs, complex reasoning)
Context 32768+: ~3–5 tokens/sec (Qwen3 long context, only when needed)
You can override context size per request via the API using the options.num_ctx field.
For most chat tasks, 2048–4096 is the sweet spot. Only go higher when you're actually sending long inputs.
curl http://localhost:11434/api/generate \
-d '{
"model": "qwen3:8b",
"prompt": "Summarize this document...",
"options": {
"num_ctx": 8192
}
}'
Memory Pressure Relief
If you're hitting RAM limits - models are being unexpectedly unloaded, or you see swap activity - try these:
# See current model memory usage
ollama ps
# Check system RAM
free -h
# Check if swap is being used (bad for inference speed)
swapon --show
If you're hitting swap, either reduce OLLAMA_MAX_LOADED_MODELS, switch to smaller model variants
(e.g. llama3.2:3b instead of a 7B), or reduce OLLAMA_KEEP_ALIVE so models unload faster.
After each change, confirm what is still loaded with ollama ps.
Keep Your LLM Healthy
A running system needs monitoring. Health checks, logs, and resource tracking are simple practices that prevent surprises.
Health Check Script
#!/bin/bash
echo "=== Ollama Health Check ==="
echo ""
# Service status
echo "✓ Service Status:"
systemctl status ollama --no-pager | head -3
echo ""
echo "✓ API Connectivity:"
curl -s http://localhost:11434/api/tags | jq '.models | length' | xargs echo " Models available:"
echo ""
echo "✓ Resources:"
ps aux | grep "ollama serve" | grep -v grep | awk '{print " CPU: " $3 "%, RAM: " $6 " KB"}'
echo ""
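echo "✓ Logs (last error):"
# grep . fails on empty input, so the fallback prints only when no error line was found
journalctl -u ollama --no-pager -n 500 | grep -i error | tail -1 | grep . || echo " No errors"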
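If you prefer to run the API part of this check from a script, the response of /api/tags is plain JSON with a "models" list whose entries carry a "name" and a "size" in bytes. The sketch below parses such a response and verifies that the models you depend on are installed. The sample payload is invented and the helper names are hypothetical. In a real check you would fetch the text from http://localhost:11434/api/tags first.

```python
import json


def summarize_tags(raw_json):
    """Return (name, size_in_GB) pairs from an /api/tags response body."""
    models = json.loads(raw_json).get("models", [])
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in models]


def normalize(name):
    """Ollama lists untagged models as name:latest, so compare in that form."""
    return name if ":" in name else name + ":latest"


def is_healthy(raw_json, required):
    """True when every required model appears in the /api/tags response."""
    installed = {normalize(name) for name, _ in summarize_tags(raw_json)}
    return all(normalize(model) in installed for model in required)


# Invented sample shaped like a real /api/tags response.
sample_tags = json.dumps({"models": [
    {"name": "qwen3:8b", "size": 5_200_000_000},
    {"name": "llama3.2:latest", "size": 2_000_000_000},
]})
print(summarize_tags(sample_tags))
print(is_healthy(sample_tags, ["qwen3:8b", "llama3.2"]))
```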
Log Location
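When Ollama runs as a systemd service on Linux, its logs go to the systemd journal rather than a file (on macOS they are written to ~/.ollama/logs/server.log). Check for issues:
# View recent logs
journalctl -u ollama --no-pager -n 50
# Find errors
journalctl -u ollama --no-pager | grep -i error | tail -10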
Regular Maintenance
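- Weekly: Check disk usage (du -sh ~/.ollama/models, or /usr/share/ollama/.ollama/models when Ollama runs as the systemd service user)
- Monthly: Update Ollama by re-running the official install script (curl -fsSL https://ollama.com/install.sh | sh)
- Quarterly: Clean unused models (ollama rm model_name)
- Yearly: Backup models to external drive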
Before You Go Live
You've learned parameters, benchmarking, optimization. This checklist ensures your Ollama setup is production-ready for OpenClaw.
Pre-Launch Checklist
- ☐ Ollama running as systemd service (auto-start on boot)
- ☐ Model selected and downloaded (qwen3:8b recommended)
- ☐ API endpoint verified (curl http://localhost:11434/api/tags)
- ☐ Temperature set appropriately (0.1-0.3 for OpenClaw)
- ☐ Context window configured (4096 minimum)
- ☐ Max tokens set (2048 to prevent runaway)
- ☐ Baseline performance benchmarked
- ☐ Health check script running
Configuration Backup
Save your working configuration:
# Backup models
cp -r ~/.ollama/models ~/ollama-models-backup
# Document your settings
cat > ~/ollama-config.txt << EOC
Model: qwen3:8b
Temperature: 0.1
Top-P: 0.9
Context: 4096
Num Predict: 2048
EOC
OpenClaw Integration Checklist
- ☐ Ollama endpoint configured in OpenClaw (localhost:11434)
- ☐ Model name matches what you pulled
- ☐ Test prompt sent and received successfully
- ☐ Response quality acceptable
- ☐ Response time reasonable (5-20 seconds)
- ☐ No memory leaks after sustained use
- ☐ Reboot test: Ollama starts automatically, works after restart
Common Issues and Fixes
Things go wrong. Here's how to diagnose and fix common Ollama problems.
Ollama Won't Start
Symptom: Service shows inactive or fails to start
systemctl status ollama
# Read error message
# Try manual start to see error
ollama serve
Common fixes:
- Permission denied: sudo chown ollama:ollama ~/.ollama
- Port conflict: Check sudo lsof -i :11434
- Corrupted model: Delete and re-pull
Out of Memory Errors
Symptom: "OOM Killer" in dmesg, processes killed
free -h
# If available < 2GB during inference, you're hitting limits
dmesg | grep -i killed | tail -5
# Shows what got OOM killed
Fixes:
- Reduce context window (num_ctx: 2048 instead of 8192)
- Unload unused models (ollama stop model_name frees the RAM; ollama rm model_name also deletes the model from disk)
- Close other apps consuming memory
- Add swap if permanently needed
Very Slow Inference
Symptom: <5 tokens/sec (should be 10-15)
top -n 1
# CPU usage <80%? Issue might be elsewhere
# Check if CPU is being shared with other apps
iostat -x 1
# High wait time? Disk bottleneck
Fixes:
- Close heavy apps (browser, IDE, etc)
- Reduce num_ctx to 2048
- Use faster model (llama3.2:3b)
- Check for thermal throttling (watch -n 1 'grep MHz /proc/cpuinfo')
API Not Responding
Symptom: curl returns Connection refused
systemctl status ollama
# Make sure it's running
netstat -an | grep 11434
# Should show LISTEN on port 11434
curl http://localhost:11434/api/tags
# Should return JSON, not error
Fixes:
- Start service: sudo systemctl start ollama
- Check firewall: sudo ufw allow 11434 (only matters when connecting from another machine)
- Restart: sudo systemctl restart ollama
Bad Quality Responses
Symptom: Responses are nonsensical or repetitive
Fixes:
- Lower temperature (0.3 or less)
- Increase repeat_penalty (1.5)
- Reduce num_predict (limit length)
- Try different model (llama3.2 instead of qwen3:8b)
Where to Go From Here
You've completed Ollama Advanced. You understand parameters, tuning, integration, optimization, and troubleshooting. Your local LLM setup is sophisticated and production-ready.
You've Accomplished
- ✓ Understand all Ollama parameters and their effects
- ✓ Tune for specific use cases (OpenClaw, creative, code, etc)
- ✓ Benchmark models objectively
- ✓ Know when GPU acceleration matters (and doesn't for you)
- ✓ Run multiple models concurrently
- ✓ Integrate Ollama seamlessly with OpenClaw
- ✓ Optimize your hardware for maximum LLM performance
- ✓ Monitor and maintain a healthy Ollama system
- ✓ Troubleshoot common problems
Option 1: Deploy Your OpenClaw Bot
You have everything you need. Configure OpenClaw to use your local Ollama, then deploy:
- Point OpenClaw to http://localhost:11434
- Select your tuned model and parameters
- Launch your Discord bot
- Enjoy your private, offline-capable AI agent
Option 2: Explore Advanced Topics
If you want to go deeper:
- Model Fine-Tuning: Customize models for specific tasks (advanced)
- Quantization: Compress models further (4-bit, 2-bit)
- Distributed Inference: Run Ollama across multiple machines
- Web UI: Build a web interface for Ollama
- Monitoring: Prometheus/Grafana metrics tracking
Option 3: Compare with Other LLM Tools
Ollama isn't the only option. If you want to explore:
- LM Studio: Desktop app with a GUI for local models (easier than Ollama CLI)
- vLLM: High-performance inference server (more complex)
- Text Generation WebUI: Feature-rich but steeper learning curve
- GPT4All: Lightweight, beginner-friendly
But honestly? Ollama is the best balance of simplicity and power for your use case.
Keep Learning
Understand Transformers Better:
- Read: "Attention is All You Need" (the original paper)
- Watch: YouTube tutorials on how LLMs work
- Experiment: Try different models, prompt engineering
Follow the Community:
- Ollama GitHub (issues, discussions)
- Hugging Face model hub (find new models)
- Reddit r/LocalLLM (community sharing)
You're Part of the AI Revolution
A few years ago, running local LLMs meant compiling C++, wrestling with dependencies, and getting 1-2 tokens/sec. Now? You install Ollama, pull a model, and get 10-15 tokens/sec on CPU. You've got a private, offline-capable AI brain that costs nothing to run.
Your data is yours. Your LLM is yours. No cloud vendor, no API limits, no surveillance.
That's power. Use it wisely.