What is Ollama and Why Should You Care?
You've probably heard the hype: run AI models locally, no API keys, no cloud costs, fully offline. That's Ollama. It's a tool that makes running large language models (LLMs) on your own hardware dead simple. Install it, pull a model, start chatting. That's it.
This tutorial assumes you're coming in cold - no deep learning background, no AI experience. Just curiosity and some decent hardware. By the end, you'll have a working local LLM and understand what you're actually doing (no black boxes). Whether you're on a laptop, mini PC, or server, the concepts are identical - only the speed varies. If you're still deciding on hardware, the Mini PC Setup tutorial covers how to pick the right machine for local AI work.
Why Ollama Matters
Privacy: Your prompts stay on your laptop. No OpenAI, no Anthropic, no third party. Your data is yours.
Cost: Free. Download once, run forever. No $20/month subscriptions, no pay-per-token API fees.
Speed: No network latency and no waiting on a cloud API. How fast tokens arrive depends entirely on your own hardware (expect roughly 3–25 tokens/sec on CPU, covered below).
Offline: Internet down? Your LLM still works. Perfect for training, experimentation, or just using AI when connectivity is spotty.
What You'll Actually Learn
- ✓ Install Ollama on Linux, Mac, or Windows
- ✓ Understand which models fit your hardware (spoiler: lots of them)
- ✓ Run your first LLM and generate text
- ✓ Use Ollama's REST API for programmatic access
- ✓ Monitor performance and understand what's happening under the hood
- ✓ Manage multiple models simultaneously
What This Is NOT
This tutorial is deliberately beginner-focused. We're not covering:
- Model fine-tuning or training (that's Tutorial 2+)
- Deep learning or transformer internals
- Advanced optimization or quantization
- Production deployment or scaling
What we ARE covering: Getting Ollama running and understanding how it works. That's the goal.
Does Your Hardware Cut It? (Spoiler: Yes)
Before we install, let's talk hardware. The good news: Ollama runs on almost any modern CPU. The question isn't "Can I run it?" but "How fast will it run?" That depends on your CPU cores, RAM, and storage speed. Let's figure out what to expect - and pick the right model for your machine.
Minimum Hardware Requirements
- CPU: Any modern multi-core processor (Intel, AMD, Apple Silicon)
- RAM: 8GB minimum (16GB+ recommended for 7B models with headroom)
- Storage: 20GB+ free (models range from ~1GB to ~9GB each)
- OS: Linux (Ubuntu 24.04+), macOS, or Windows (native installer or WSL2)
If you have these, you can run Ollama. The key variable is speed - which depends on your CPU cores and RAM bandwidth.
Model Selection: 2026 Recommended Picks
The model landscape has moved fast. Llama 2 and Neural Chat are gone from the recommended list - there are better, faster, smarter options that run just as well on consumer hardware. Here are the three best starting points:
llama3.2:3b (fast starter):
RAM Used: ~3GB
Speed (8-core CPU): 15–30 tokens/sec
Quality: Excellent for its size
Why: Instant responses, works on 8GB RAM, great for first test
qwen3:8b (all-rounder):
RAM Used: ~6GB
Speed (8-core CPU): 8–14 tokens/sec
Quality: Exceptional - by Qwen's own benchmarks, even the 4B variant rivals qwen2.5-72B
Why: 40K context, outstanding code + reasoning + 100+ languages
deepseek-r1:7b (reasoning):
RAM Used: ~5.5GB
Speed (8-core CPU): 8–14 tokens/sec
Quality: Outstanding at math, logic, and code problems
Why: Shows its reasoning chain - great when you need to understand why
Have 16GB+ RAM? Step up to qwen3:14b (9.3GB, excellent across the board), deepseek-r1:14b (9GB, serious reasoning), or phi4 (Microsoft's strong 14B model). All pull the same way: ollama pull qwen3:14b. Worth the disk space if you have the RAM.
Understanding Quantization Tags
When you browse ollama.com/library, you'll see tags
like q4_0, q4_K_M, and q8_0 next to model names. These are quantization levels -
how much the model has been compressed to fit in memory. Here's what they mean:
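q4_0: Basic 4-bit quantization. Small files, lowest quality of the three.
q4_K_M: What :latest usually resolves to. The sweet spot for most users. Roughly the same size savings as q4_0 but noticeably better output on complex prompts.
q8_0: 8-bit quantization. Closer to the original model's quality, at roughly twice the size and RAM of a 4-bit build.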
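As a rough sanity check on how quantization affects size, file size scales with parameter count times bits per weight. The sketch below assumes approximate bits-per-weight figures (4.5 for q4_0, 8.5 for q8_0, and about 4.85 for q4_K_M, which mixes precisions across layers); the helper name is illustrative, and real files differ a little because of metadata and per-layer choices.

```python
# Approximate bits per weight for common quantization levels.
# The q4_K_M figure is an average over mixed-precision layers: treat it as an assumption.
BITS_PER_WEIGHT = {"q4_0": 4.5, "q4_K_M": 4.85, "q8_0": 8.5, "fp16": 16.0}


def estimate_size_gb(params_billion: float, quant: str) -> float:
    """Estimate model file size in GB (10^9 bytes) for a quantization level."""
    bits = BITS_PER_WEIGHT[quant]
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bits / 8


for quant in BITS_PER_WEIGHT:
    print(f"8B model at {quant}: ~{estimate_size_gb(8.0, quant):.1f} GB")
```

The estimate lands close to the download sizes quoted in this tutorial for 4-bit models, which is a decent hint that the rule of thumb is good enough for planning disk and RAM.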
When you pull a model (after installing Ollama in the next section), you can specify a quantization tag or just use the default:
# :latest usually resolves to q4_K_M (recommended)
ollama pull qwen3:8b
# Explicitly request q8 for better quality (more RAM)
ollama pull qwen3:8b-q8_0
# 1B model - tiny and fast for simple tasks
ollama pull llama3.2:1b
For most users, :latest is the right call. Ollama picks a good default. Only specify a quantization tag
if you're optimizing for a specific RAM budget or quality ceiling. You'll pull your first model in the "Download and Run Your First Model" section.
What NOT to Run (On Low RAM)
Some models need more headroom than others. If you're on 8GB total RAM, stick to 3B models:
- 8GB RAM: 1B–3B models comfortably (llama3.2:1b, llama3.2:3b). 7B models are tight - other apps may push you into swap.
- 16GB RAM: 7B models comfortably. Can experiment with 12–14B (phi4, gemma3:12b). Two 7B models loaded simultaneously.
- 32GB RAM: 14B models and below comfortably. Multiple concurrent 7B models. Some 32B quantized models with patience.
If you're on 8GB and a 7B model feels sluggish, try llama3.2:3b instead - it's genuinely impressive
for its size and will be noticeably more responsive.
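If you want that rule of thumb in script form, here is a minimal sketch. The tier table simply restates the three RAM bullets above; the function name is made up for illustration.

```python
# RAM tiers from the list above: (minimum GB of RAM, largest comfortable model class).
RAM_TIERS = [(32, "14B"), (16, "7B"), (8, "3B")]


def comfortable_model_class(ram_gb: int) -> str:
    """Return the largest model class the tiers above call comfortable for this much RAM."""
    for min_ram, model_class in RAM_TIERS:
        if ram_gb >= min_ram:
            return model_class
    return "below the 8GB minimum"


print(comfortable_model_class(16))
```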
Hardware Tiers & What to Expect
Speed varies dramatically based on your CPU. Here's what to expect running a 7B model (q4_K_M):
Budget / Older CPU (4 cores, 8 threads): 3–6 tokens/sec (usable, slower)
Mid-Range CPU (8 cores, 16 threads): 8–15 tokens/sec (very good!)
High-End CPU (12+ cores, 24+ threads): 15–25+ tokens/sec (excellent)
Apple Silicon M2/M3: 30–50+ tokens/sec (Ollama uses the built-in GPU via Metal)
Note: Tested on systems with 16GB+ RAM and NVMe SSD storage
To put it in perspective: 10 tokens/second means a 500-token response (roughly 375 words, several paragraphs) takes about 50 seconds. That's respectable for CPU-only inference. You'll be pleasantly surprised.
Our recommendation: start with llama3.2:3b - it's fast, free, and works on any modern machine.
Then pull qwen3:8b when you're ready for something more capable. Both will surprise you.
Need a full breakdown of every model worth knowing in 2026?
Best Local AI Models 2026 - full comparison.
Gear Worth Having for Local AI
Models pile up fast - Qwen3:8b is 5.2GB, DeepSeek-R1:7b is 4.7GB, and once you start experimenting you'll want 6-10 models on hand. That's 30-50GB before you know it. Here's the gear that makes the experience better, from a genuine "we ran these tutorials on this hardware" perspective:
Full hardware buying guide with specific model recommendations: Mini PC Setup - Hardware Guide.
Getting Ollama Running
Ollama installation is genuinely simple. No compilation, no complex setup, no configuration files to fiddle with. Download, run the installer, done. Let's do it.
Prerequisites
- ✓ Linux OS (Ubuntu 24.04+, Fedora, Debian, etc.)
- ✓ 8GB+ RAM (16GB+ recommended for multiple models)
- ✓ 20GB+ free disk space (models range from ~1GB to ~9GB each)
- ✓ Stable internet connection (for downloading models once)
- ✓ Ability to run sudo commands (or use your user password)
Step 1 - Download and Install Ollama
Open a terminal and run:
curl -fsSL https://ollama.com/install.sh | sh
This script:
- Downloads the Ollama binary for your system
- Places it in `/usr/local/bin/` (in your PATH)
- Sets up a systemd service to auto-start on boot
- Creates the ollama user and group
The installation takes 1–2 minutes. You'll see output as it progresses. When it finishes, you're done.
Step 2 - Verify Installation
Check that Ollama is installed and in your PATH:
ollama --version
You should see a single line in this form (your version number will be newer; recent models such as qwen3 need a recent release):
ollama version is 0.1.45
Step 3 - The Daemon is Already Running
During installation, the Ollama daemon started automatically in the background. It's listening on http://127.0.0.1:11434.
Do not run ollama serve manually - the port is already in use by the running daemon. You can verify it's active by testing the API:
curl -s http://127.0.0.1:11434/api/tags
If it responds with JSON (like {"models":[]}), the daemon is running. Good to go!
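The same check works from Python, which is handy once you start scripting. This is a minimal sketch using only the standard library; it assumes the /api/tags response has a models list whose entries carry a name field, as in the JSON above. The function names are illustrative.

```python
import json
import urllib.request

TAGS_URL = "http://127.0.0.1:11434/api/tags"


def installed_models(tags: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags.get("models", [])]


def fetch_tags(url: str = TAGS_URL, timeout: float = 3.0) -> dict:
    """GET /api/tags. Raises urllib.error.URLError if the daemon is not reachable."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With the daemon running, installed_models(fetch_tags()) returns an empty list on a fresh install and fills up as you pull models.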
Step 4 - Three Ways to Interact with Ollama
Now that the daemon is running, here's how you can use it:
- ollama run <model>: Interactive chat mode. Type prompts, get responses in your terminal. Great for testing and learning.
- REST API (curl/Python/etc): Programmatic access. Send JSON requests to localhost:11434, get JSON responses. Perfect for integrations and scripts.
- Background daemon: The daemon runs automatically on boot and stays running. You don't manage it manually; it just works.
Step 5 - Ready for Your First Model
The daemon is running and the API is responding. Installation is complete.
You're now ready to download and run your first model. Head to the next section to pull Llama 3.2.
Installation Summary
Ollama is installed and running as a background service on http://127.0.0.1:11434.
No configuration needed. No manual daemon management. Just pull a model and start using it.
You can now:
- Pull models with ollama pull <model>
- Run interactive chat with ollama run <model>
- Make API calls to http://127.0.0.1:11434 from scripts
- Start using local AI immediately - no API keys, no cloud, no costs
Download and Run Your First Model
Time for the moment of truth. We're going to download Llama 3.2 3B - the fast, lightweight model we recommended in the hardware section - and run our first interactive chat session. This is where it gets real.
Step 1 - Pull Llama 3.2
"Pulling" a model means downloading it from Ollama's registry and storing it locally. Run:
ollama pull llama3.2
You'll see output like:
pulling manifest
pulling 418956b73c34... (downloading layer 1)
pulling e1cd8f6a5d4a... (downloading layer 2)
verifying sha256 digest
writing manifest
success
The model is about 2GB. On a 100 Mbps connection that's roughly 3 minutes; faster links finish in a minute or two.
Step 2 - Run Interactive Chat
Once the download completes, start an interactive chat session:
ollama run llama3.2
You'll see the prompt appear with the model ready:
>>>
(Run ollama list anytime to see all installed models and their exact versions.)
Now type a question or statement. Let's try something simple:
>>> What is Ollama?
Watch as the model generates a response in real-time. You'll see tokens appearing one by one. This is your local LLM doing inference right now, on your CPU, with no API calls, no cloud, no tracking.
Step 3 - Try More Prompts
Keep the chat session open and try different questions. Here are some good ones to test:
>>> Explain machine learning in simple terms
>>> Write a Python function that checks if a number is prime
>>> What's a good name for a Discord bot?
>>> Why is the sky blue?
>>> Tell me a joke
Notice:
- Response speed (start the session with ollama run llama3.2 --verbose, or type /set verbose inside it, to see an "eval rate" such as 14 tokens/s after each reply)
- Response quality (is it coherent? Accurate?)
- CPU usage (all your cores working hard)
- No waiting for external APIs
Step 4 - Exit the Chat
To exit the interactive session, type:
>>> /bye
Or press Ctrl+D. Either works.
What Actually Happened Here?
Let's be concrete about the workflow:
- Pull: Download the model (~2GB) to Ollama's model directory (/usr/share/ollama/.ollama/models for the Linux service install; ~/.ollama/models on macOS or when the server runs as your own user)
- Load: When you run Llama 3.2, Ollama loads it into RAM (~3GB used)
- Inference: Your prompt goes to the model, which generates tokens one at a time
- Display: Each token appears on your screen as it's generated
- Repeat: You type, the model responds, until you exit
Performance Notes
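By default, interactive mode doesn't display timing information. Start it with ollama run llama3.2 --verbose (or type /set verbose inside the session) to print timing stats after each reply. Ollama's REST API also returns complete timing metrics including generation speed, which you'll explore in the next section.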
For now, what you can observe from the interactive experience:
- The response appears token-by-token in real-time
- You can estimate speed by watching how fast tokens appear (roughly 10–15 tokens/sec on mid-range CPUs is typical)
- The model loads into memory the first time you run it (notice a slight delay before responses start)
- Subsequent responses should be faster since the model stays loaded
In the next section, you'll use the REST API to make requests and see exact timing metrics (tokens/sec, total duration, load time, etc.) in JSON responses. That's where you get precise performance data.
Programmatic Access to Your LLM
Interactive chat is fun for testing, but the real power comes from using Ollama's REST API. This lets you send prompts programmatically and get responses as JSON. Perfect for integrating with OpenClaw, scripts, or custom applications.
How It Works
Ollama runs a simple HTTP server on localhost:11434. You send JSON requests, you get JSON responses.
No authentication, no setup. Just HTTP.
This is how you'll integrate Ollama with OpenClaw later. For now, let's test it with curl.
Step 1 - Simple Text Generation (Synchronous)
Open a terminal and run:
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "Explain quantum computing in 2 sentences",
"stream": false
}'
The -d flag sends JSON data in the request body. The stream: false means "wait for the full
response before returning."
You'll get a JSON response that looks like:
{
"model": "llama3.2",
"created_at": "2026-02-22T10:30:45.123456Z",
"response": "Quantum computing exploits quantum mechanics (superposition and entanglement) to process data in fundamentally different ways than classical computers. A quantum computer can explore many solutions simultaneously, making certain problems exponentially faster to solve.",
"done": true,
"total_duration": 2500000000,
"load_duration": 300000000,
"prompt_eval_count": 12,
"prompt_eval_duration": 800000000,
"eval_count": 30,
"eval_duration": 1400000000
}
Key fields:
- response: The generated text (what you want)
- done: Whether generation is complete
- eval_count: Number of tokens generated
- eval_duration: Time spent generating (nanoseconds)
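Those two eval fields are all you need to work out generation speed. A minimal sketch, using the timing values from the sample response above (the function name is illustrative):

```python
def tokens_per_second(result: dict) -> float:
    """Generation speed from a parsed /api/generate response (durations are nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


# Timing fields copied from the sample response above.
sample = {
    "eval_count": 30,
    "eval_duration": 1_400_000_000,
    "prompt_eval_count": 12,
    "prompt_eval_duration": 800_000_000,
}

print(f"{tokens_per_second(sample):.1f} tokens/sec")  # 30 tokens / 1.4 s -> 21.4
```

The same division with prompt_eval_count and prompt_eval_duration gives prompt-processing speed, which is usually much higher than generation speed.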
Step 2 - Parse with jq (Optional But Nice)
The full JSON response includes several fields you may not need, including a context array (used for multi-turn conversations). If you have jq installed, you can extract just what you want:
curl -s http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "Tell me a joke",
"stream": false
}' | jq -r '.response'
The -r flag means "raw output" (no quotes around the text). You'll just see:
Why don't scientists trust atoms? Because they make up everything!
Or extract key fields without the context clutter:
curl -s http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "Tell me a joke",
"stream": false
}' | jq '{response, eval_count, eval_duration}'
Much cleaner. Install jq if you don't have it: sudo apt install jq
Note on the context field: The actual API response includes a context array containing token IDs from your prompt and response. This is used for multi-turn conversations (sending context back to maintain conversation history). For single requests, you can safely ignore it or filter it out with jq as shown above.
Step 3 - Streaming API (Real-Time Responses)
For longer responses, you might want tokens to stream in real-time (like in interactive chat). Set stream: true:
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "Write a haiku about programming",
"stream": true
}'
With streaming, you get multiple JSON objects (one per token), streamed line-by-line:
{"model":"llama3.2","created_at":"...","response":"Code","done":false}
{"model":"llama3.2","created_at":"...","response":" flows","done":false}
{"model":"llama3.2","created_at":"...","response":" like","done":false}
...
{"model":"llama3.2","created_at":"...","response":"","done":true}
Parse each line and print the response field to see tokens appear in real-time. This is how chat interfaces work.
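Here is one way to do that parsing, as a small sketch. It takes any iterable of lines (strings or bytes), so you can feed it sample data like below or the lines of a real HTTP response; the function name is made up for this example.

```python
import json


def join_stream(lines) -> str:
    """Concatenate the response fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk has done=true and an empty response
    return "".join(parts)


sample_lines = [
    '{"model":"llama3.2","response":"Code","done":false}',
    '{"model":"llama3.2","response":" flows","done":false}',
    '{"model":"llama3.2","response":"","done":true}',
]
print(join_stream(sample_lines))  # Code flows
```

For a live display you would print each fragment as it arrives instead of collecting them, but the per-line parsing stays the same.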
Step 4 - API Parameters
You can pass additional parameters to control generation behavior:
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "Complete this: The future of AI is...",
"stream": false,
"options": {
"temperature": 0.7,
"top_p": 0.9,
"num_predict": 100
}
}'
Parameters explained (they go inside the options object; you'll dive deeper in Tutorial 2):
- temperature: Randomness (0.0=deterministic, 1.0=balanced, >1.5=creative)
- top_p: Diversity control (lower=more focused)
- num_predict: Max tokens to generate (prevents runaway responses)
For now, the defaults (no parameters) are fine. We'll explore these in Advanced.
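For scripts, the request body is easy to build in Python. One detail worth knowing: Ollama's API reads sampling settings from an "options" object inside the request body, not from top-level keys. The sketch below uses only the standard library; build_request and generate are names invented for this example.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, **options) -> dict:
    """Build a non-streaming /api/generate body; sampling settings go under "options"."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if options:
        body["options"] = options
    return body


def generate(body: dict, url: str = GENERATE_URL) -> dict:
    """POST the body to Ollama and return the parsed JSON reply (needs a running daemon)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


body = build_request(
    "llama3.2",
    "Complete this: The future of AI is...",
    temperature=0.7,
    top_p=0.9,
    num_predict=100,
)
print(json.dumps(body, indent=2))
```

With the daemon running, generate(body)["response"] gives you the generated text.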
Why This Matters for Integration
When you integrate Ollama with OpenClaw, this is what happens under the hood:
- OpenClaw receives a user message from Discord
- It constructs a JSON request to your local Ollama API
- Ollama generates a response
- OpenClaw parses the response and sends it back to Discord
All locally. All on your machine. The request itself adds only milliseconds; generation takes as long as your hardware needs, typically a few seconds. That's powerful.
Monitoring and Understanding Resource Usage
Now that you have Ollama running, let's look at what's actually happening under the hood. How much memory is it using? How hard is your CPU working? What can you realistically expect?
Monitor While Running
While Ollama is generating text, open another terminal and watch resource usage:
watch -n 1 'free -h'
This updates every second showing your RAM usage. A typical 7B model uses about 5–6GB.
               total        used        free      shared  buff/cache   available
Mem:            31Gi       6.2Gi        22Gi       1.3Gi       3.4Gi        23Gi
This example shows ~6GB in use across the whole system (Ollama plus everything else) out of 31GB total - plenty of breathing room. Your system will vary.
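On Linux, the numbers free prints come from /proc/meminfo, so a script can read them directly. A minimal, Linux-only sketch (function names are illustrative):

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo text into {field: GiB}. The file reports sizes in kB."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name] = int(fields[0]) / (1024 * 1024)
    return info


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


sample = "MemTotal:       32505856 kB\nMemAvailable:   24117248 kB\n"
print(parse_meminfo(sample))
```

MemAvailable is the figure to watch before loading a bigger model: it estimates how much memory can be claimed without swapping.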
For CPU usage, in another terminal:
top -n 1 -o %CPU
Or use htop for a nicer interface:
htop
During inference, you'll see all your CPU cores at high utilization (70–95%). This is normal and expected. Your CPU is working hard, which is why you get good performance.
Performance Benchmarks by Hardware
Here's what to expect for a 7B-8B class model (like Qwen3 8B, which you'll pull in the next section) on different hardware. All numbers assume 16GB+ RAM and SSD storage. Your current Llama 3.2 3B runs roughly 2x faster and uses about half the RAM - these numbers represent the all-rounder tier:
Budget / Older CPU (4 cores):
Tokens/Sec: 3–6 tokens/sec (slow but usable)
Time to First Token: 1–2 seconds
RAM Used: ~6GB
Mid-Range CPU (8 cores):
Tokens/Sec: 8–15 tokens/sec (very good)
Time to First Token: 500ms–1 second
RAM Used: ~6GB
High-End CPU (12+ cores):
Tokens/Sec: 15–25+ tokens/sec (excellent)
Time to First Token: 300–500ms
RAM Used: ~6GB
Universal:
Model Size on Disk: ~5.2GB
Model Load Time: 1–3 seconds (SSD-dependent)
CPU Utilization: 70–95% during generation
Where you land on this spectrum depends on your CPU core count and clock speed. The good news: even budget CPUs generate text at usable speeds.
What Affects Performance?
Token generation speed varies based on several factors:
- Prompt Length: Longer prompts take longer to process before generating
- Response Length: More tokens to generate = longer total time (linear relationship)
- System Load: Other apps running = fewer CPU cycles for Ollama
- Model Size: Bigger models (13B+) are much slower on CPU
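- SSD Speed: A slow disk delays the first model load; once the model is in RAM, storage no longer matters
- RAM Bandwidth: Generation is largely limited by how fast weights stream from memory, so faster RAM helps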
Is 10–15 Tokens/Sec Fast Enough?
Let's put it in perspective:
Short Answer (50 tokens): ~3–5 seconds
Medium Answer (200 tokens): ~13–20 seconds
Long Answer (500 tokens): ~33–50 seconds
Full Essay (1000 tokens): ~67–100 seconds
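The arithmetic behind that table is just tokens divided by rate. A quick sketch you can rerun with your own measured speed (it ignores model load and prompt-processing time):

```python
def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    """Time to generate `tokens` at a steady rate."""
    return tokens / tokens_per_sec


for label, tokens in [("Short", 50), ("Medium", 200), ("Long", 500), ("Essay", 1000)]:
    fast = seconds_for(tokens, 15)  # upper end of the 10-15 tokens/sec range
    slow = seconds_for(tokens, 10)  # lower end
    print(f"{label:<6} {tokens:>4} tokens: {fast:.0f}-{slow:.0f} s")
```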
For comparison:
- Hosted APIs (OpenAI, Claude): Usually stream faster than this, but cost money and require internet
- Reading speed: Most people read around 200–300 words/min, which is roughly 4–7 tokens/sec
- Typing speed: 40–60 words/min is only about 1 token/sec
So yes, 10–15 tokens/sec is fast enough: text arrives faster than you can read it, locally and for free.
Sustained Running (24/7 Concerns)
Most hardware can run Ollama continuously without issues:
- Thermals: Inference keeps every core busy, so expect fans to spin up. Desktops and mini PCs with adequate cooling handle it fine; thin laptops may throttle during long sessions.
- Battery Drain: On laptops with battery, expect 1–2 hours per charge during heavy continuous use.
- Reliability: Modern CPUs are designed for sustained loads within their thermal limits.
- Memory: Idle models are unloaded after the keep-alive window, so RAM use drops back once requests stop.
For occasional or development use on a laptop/desktop, your current hardware is fine. If you want to run Ollama 24/7 at scale with multiple concurrent requests, consider a dedicated server or device with more consistent power delivery.
Download, List, and Switch Between Models
One model is useful. A library of models is powerful. Different models have different strengths - a small 3B model for quick queries, a 7B all-rounder for complex work, a reasoning model when you need to think things through. Ollama makes managing all of them trivial.
Step 1 - Download More Models
Let's pull two more models to build out your library. The first is Qwen3:8b - the current all-rounder pick with exceptional reasoning, code ability, and 100+ language support:
ollama pull qwen3:8b
And the tiny-but-capable Llama 3.2 1B - useful when you want an instant response and don't need heavy reasoning:
ollama pull llama3.2:1b
The 1B model is only ~1.3GB and loads in seconds. It's your speed tier - great for quick lookups and simple tasks when you don't want to wait for a 7B model to warm up.
Step 2 - List Your Models
See everything you've downloaded:
ollama list
Output:
NAME               ID              SIZE      MODIFIED
llama3.2:latest    a4f39e04031c    2.0 GB    3 hours ago
qwen3:8b           9a3f5f67c0c9    5.2 GB    2 hours ago
llama3.2:1b        baf6a787fdff    1.3 GB    1 hour ago
Three models, ~8.5GB total. They cover different use cases - speed (1B), everyday tasks (3B), and serious reasoning (8B).
Step 3 - See What's Actually Running
ollama list shows what's downloaded. ollama ps shows what's currently
loaded in memory - this is the command you want when troubleshooting performance or wondering why
your RAM is full:
ollama ps
NAME               ID              SIZE      PROCESSOR    UNTIL
qwen3:8b           9a3f5f67c0c9    5.2 GB    100% CPU     4 minutes from now
llama3.2:latest    a4f39e04031c    2.0 GB    100% CPU     2 minutes from now
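If you'd rather check this from a script, the daemon exposes the same information at GET /api/ps. The sketch below assumes the response carries a models list whose entries have name and size (in bytes) fields; the function name is illustrative.

```python
def summarize_loaded(ps: dict) -> list[tuple[str, float]]:
    """Return (model name, size in GB) for each entry in a parsed /api/ps response."""
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in ps.get("models", [])]


# Sample shaped like an /api/ps reply, trimmed to the fields used here.
sample = {"models": [{"name": "qwen3:8b", "size": 5_200_000_000}]}
print(summarize_loaded(sample))
```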
Step 4 - Control Keep-Alive (OLLAMA_KEEP_ALIVE)
By default, Ollama unloads a model from RAM 5 minutes after it was last used. This is good for shared machines and 8GB systems. But if you're the only user and have the RAM, keeping models loaded means instant responses with no cold-start delay.
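# Note: these exports only affect a server you start from this shell (ollama serve).
# The systemd service installed earlier does not read your shell environment;
# use the systemctl override shown below for that.
# Keep models loaded for 30 minutes of idle time
export OLLAMA_KEEP_ALIVE=30m
# Keep models loaded indefinitely (until you restart Ollama)
export OLLAMA_KEEP_ALIVE=-1
# Use the default 5-minute unload (default behavior)
export OLLAMA_KEEP_ALIVE=5m
# Persist it for future shells (add to ~/.bashrc or ~/.zshrc)
echo 'export OLLAMA_KEEP_ALIVE=30m' >> ~/.bashrc
source ~/.bashrc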
On 16GB+ RAM, set 30m or longer - a loaded 7B model uses
~5GB and the performance gain is significant. On 8GB RAM, keep the default 5m so the model frees
memory when you switch tasks.
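The Ollama systemd service does not inherit variables from your shell, so to apply OLLAMA_KEEP_ALIVE to the daemon (and have it persist across reboots), add it to the service's environment. Open an override file:
sudo systemctl edit ollama
In the editor that opens, add these two lines, then save and exit:
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
Then reload systemd and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama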
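Keep-alive can also be set per request: the API accepts a keep_alive field in the request body, which overrides the server-wide setting for that model. A minimal sketch (the helper name is made up for this example):

```python
def with_keep_alive(body: dict, keep_alive) -> dict:
    """Return a copy of a request body with a per-request keep_alive value.

    keep_alive can be a duration string such as "30m", a number of seconds,
    0 to unload right after the reply, or -1 to keep the model loaded.
    """
    updated = dict(body)  # copy so the caller's dict is left untouched
    updated["keep_alive"] = keep_alive
    return updated


base = {"model": "llama3.2", "prompt": "Hello", "stream": False}
print(with_keep_alive(base, "30m"))
```

This is useful when one script wants a model kept warm while everything else on the machine sticks to the default.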
Step 5 - Switch Between Models
Switching is simple - just run a different model name:
ollama run qwen3:8b
You're now in Qwen3's interactive session. When done:
>>> /bye
Then switch to the fast tier:
ollama run llama3.2:1b
Step 6 - Run Different Models via API
With the REST API, you can specify which model to use per request. This is how OpenClaw routes different task types to different models:
curl http://localhost:11434/api/generate \
-d '{
"model": "qwen3:8b",
"prompt": "Analyze this error and suggest a fix...",
"stream": false
}'
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2:1b",
"prompt": "What does HTTP 429 mean?",
"stream": false
}'
Ollama queues requests and processes them in order. On 16GB+ RAM with keep-alive set, both models can stay loaded, so switching between them skips the unload/reload delay.
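Per-request routing can be as simple as a lookup table. The sketch below is hypothetical: the task labels and the default are made up for illustration and are not how OpenClaw itself is configured.

```python
# Hypothetical routing table: the task labels are invented for this example.
MODEL_FOR_TASK = {
    "quick_lookup": "llama3.2:1b",
    "everyday": "llama3.2",
    "reasoning": "qwen3:8b",
}
DEFAULT_MODEL = "llama3.2"


def request_for(task: str, prompt: str) -> dict:
    """Build an /api/generate body, choosing the model by task label."""
    model = MODEL_FOR_TASK.get(task, DEFAULT_MODEL)
    return {"model": model, "prompt": prompt, "stream": False}


print(request_for("quick_lookup", "What does HTTP 429 mean?"))
```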
Current Popular Models Worth Trying
All of these are 4-bit quantized by default and pull directly from Ollama's library:
ollama pull llama3.2:1b # 1.3GB - instant speed, simple tasks
ollama pull llama3.2:3b # 2GB - small and surprisingly capable
ollama pull qwen3:4b # 2.5GB - tiny but rivals much larger models
ollama pull deepseek-r1:7b # 4.7GB - reasoning model, shows its work
ollama pull qwen3:8b # 5.2GB - best all-rounder, 100+ languages
ollama pull qwen3:14b # 9.3GB - step up in quality, needs 16GB RAM
ollama pull deepseek-r1:14b # 9.0GB - serious reasoning, needs 16GB RAM
Start with llama3.2:3b and qwen3:8b. That covers 90% of use cases.
Add qwen3:14b or deepseek-r1:14b if you have 16GB+ and want stronger reasoning.
Disk Space Management
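Where models live depends on how Ollama was installed: the Linux service install stores them in /usr/share/ollama/.ollama/models, while macOS (or a server run as your own user) uses ~/.ollama/models. Check your usage:
sudo du -sh /usr/share/ollama/.ollama/models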
To remove a model and free disk space:
ollama rm llama3.2:1b
Re-pull anytime, but expect a full download: ollama rm deletes the model's weight files. Ollama stores each model as content-addressed layers, so layers that are byte-identical across models (for example a license or prompt template) are kept only once. Different sizes such as Llama 3.2 1B and 3B have separate weights, so each is a full-size download.
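You can also total up model sizes from the API instead of the filesystem. This sketch assumes each entry in the /api/tags response has a size field in bytes; because identical layers are stored once, the real disk usage can be slightly lower than this sum.

```python
def total_size_gb(tags: dict) -> float:
    """Sum the reported sizes of all models in a parsed /api/tags response, in GB."""
    return sum(m["size"] for m in tags.get("models", [])) / 1e9


# Sample shaped like an /api/tags reply, trimmed to the fields used here.
sample = {
    "models": [
        {"name": "llama3.2:latest", "size": 2_000_000_000},
        {"name": "qwen3:8b", "size": 5_200_000_000},
        {"name": "llama3.2:1b", "size": 1_300_000_000},
    ]
}
print(f"{total_size_gb(sample):.1f} GB")
```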
Memory Limits (Hardware Dependent)
Ollama loads models on-demand into RAM. Here's what you need to know:
- One 7B model (q4_K_M): ~5–6GB RAM (plus OS overhead, so ~8GB total consumed)
- One 3B model: ~2.5GB RAM - very comfortable on 8GB systems
- Two 7B models loaded: ~12GB - comfortable on 16GB systems
- One 14B model (phi4): ~9GB - needs 16GB, leaves headroom
Ollama gracefully unloads models you're not using when memory gets tight. With OLLAMA_KEEP_ALIVE set, models stay warm as long as you have the RAM to support them.
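To plan which models can stay loaded together, add up their footprints and compare against your RAM. The figures below come from the list above (using the upper end for 7B); the 2GB OS overhead is an assumption derived from the "~8GB total consumed" note.

```python
# Approximate RAM per loaded model in GB, from the list above.
MODEL_RAM_GB = {"3b": 2.5, "7b": 6.0, "14b": 9.0}
OS_OVERHEAD_GB = 2.0  # assumption: ~6GB model + overhead = ~8GB total consumed


def fits(models: list[str], total_ram_gb: float) -> bool:
    """True if the listed models plus OS overhead fit in the given amount of RAM."""
    needed = sum(MODEL_RAM_GB[m] for m in models) + OS_OVERHEAD_GB
    return needed <= total_ram_gb


print(fits(["7b", "7b"], 16))
print(fits(["14b", "7b"], 16))
```

Treat a True that lands close to the limit as a warning: a browser or IDE running alongside can push you into swap.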
You can now pull and switch models, check what's loaded with ollama ps, and tune keep-alive to match your hardware. Next: integrating Ollama with OpenClaw.
Where to Go From Here
You have a working local LLM. You can download models, run them interactively, hit the API, and manage multiple models. That's the foundation. Here's where to go next.
If you want OpenClaw to use your local models instead of cloud APIs, the Advanced tutorial is where that happens. It covers parameter tuning (so your bot gives consistent answers) and walks you through wiring Ollama directly into OpenClaw. That's the whole point of Book III. Budget about 90 minutes.
The core path
Book III has a main road and side roads. The main road is:
- Ollama Basics (you just finished this) → install, run models, learn the API
- Best Local AI Models 2026 → which model to actually pull for your hardware and use case
- Ollama Advanced → tune parameters, benchmark, and integrate with OpenClaw
Once your OpenClaw bot is running on local models, the rest of Book III is optional extras you can add whenever you want.
Optional add-ons (do these anytime after Advanced)
These tutorials are standalone and don't depend on each other. Pick whichever sounds useful:
- Open WebUI - a ChatGPT-style web interface for your local models. Great for non-terminal users or sharing with family.
- Document Q&A with RAG - upload PDFs, notes, and manuals, then ask questions about them. Uses AnythingLLM.
- Local Coding Assistant - Copilot-style autocomplete and AI chat in VS Code, powered by Ollama. Your code never leaves your machine.
Resources for self-study
- Ollama GitHub: Source code, issues, community discussions
- Hugging Face Model Hub: Thousands of models, descriptions, benchmarks
- Ollama Model Library: Browse every model Ollama supports, with sizes and descriptions
Common Next Questions
Q: Can I run this in the background permanently?
A: Yes! Ollama already auto-starts as a systemd service. It'll run on boot and keep running.
Q: Can I access Ollama from other machines (not localhost)?
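A: Not by default: the daemon listens only on 127.0.0.1. To reach it from another machine, use an SSH tunnel or a reverse proxy such as nginx, or change the bind address with the OLLAMA_HOST environment variable (advanced). The API has no authentication, so don't expose it to untrusted networks.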
Q: What if I want to upgrade to a bigger model later?
A: You can, but 13B+ models are significantly slower on CPU. If you plan to run them regularly, consider adding a GPU.
Q: Is there a web UI for Ollama?
A: Not built-in, but there are excellent options. Our Open WebUI tutorial walks you through setting up a polished, ChatGPT-style interface that connects directly to your local Ollama instance.
Q: Can I use Ollama on Windows/Mac?
A: Yes! Ollama has installers for all platforms. Same concepts apply.
You're Ready
You've completed Ollama Basics. You understand:
- ✓ What Ollama is and why it matters
- ✓ How to install and verify it works
- ✓ How to download and run models
- ✓ How to use the REST API programmatically
- ✓ What performance to expect from modern CPUs
- ✓ How to manage multiple models
That's a solid foundation. From here, you can:
- → Ollama Advanced - optimization, parameter tuning, and OpenClaw integration
- → Open WebUI - give your local models a real chat interface
- → Local Coding Assistant - use Ollama to power VS Code autocomplete and chat
- → Document Q&A with RAG - search and ask questions about your own documents
Whatever you choose, you're now part of the local AI revolution. Your data is yours. Your LLM is yours. No cloud, no subscriptions, just you and your hardware. That's powerful. Enjoy!