Ollama Basics

Install Ollama and run local LLMs on your own hardware - free, private, no API key. Works on Mac, Linux, and Windows without a GPU.

45 min Beginner Free Updated February 2026

What is Ollama and Why Should You Care?

You've probably heard the hype: run AI models locally, no API keys, no cloud costs, fully offline. That's Ollama. It's a tool that makes running large language models (LLMs) on your own hardware dead simple. Install it, pull a model, start chatting. That's it.

This tutorial assumes you're coming in cold - no deep learning background, no AI experience. Just curiosity and some decent hardware. By the end, you'll have a working local LLM and understand what you're actually doing (no black boxes). Whether you're on a laptop, mini PC, or server, the concepts are identical - only the speed varies. If you're still deciding on hardware, the Mini PC Setup tutorial covers how to pick the right machine for local AI work.

Why Ollama Matters

Privacy: Your prompts stay on your laptop. No OpenAI, no Anthropic, no third party. Your data is yours.

Cost: Free. Download once, run forever. No $20/month subscriptions, no pay-per-token API fees.

Speed: No network round trip and no rate limits. Responses start as soon as your hardware can produce them, so how fast it feels depends on your CPU and RAM rather than on someone else's servers.

Offline: Internet down? Your LLM still works. Perfect for training, experimentation, or just using AI when connectivity is spotty.

What You'll Actually Learn

  • ✓ Install Ollama on Linux, Mac, or Windows
  • ✓ Understand which models fit your hardware (spoiler: lots of them)
  • ✓ Run your first LLM and generate text
  • ✓ Use Ollama's REST API for programmatic access
  • ✓ Monitor performance and understand what's happening under the hood
  • ✓ Manage multiple models simultaneously
Teaser: A recent 8-core CPU can run 7B-parameter models at roughly 8–15 tokens per second, and older 4-core chips still manage 3–6. That's genuinely usable. You likely don't need a GPU. If you have 8GB+ RAM and a multi-core CPU, you have what you need. Let's prove it.

What This Is NOT

This tutorial is deliberately beginner-focused. We're not covering:

  • Model fine-tuning or training (that's Tutorial 2+)
  • Deep learning or transformer internals
  • Advanced optimization or quantization
  • Production deployment or scaling

What we ARE covering: Getting Ollama running and understanding how it works. That's the goal.

Where this is all headed: Everything in Book III builds toward one goal - replacing cloud AI APIs with models you run yourself. The core path is Basics (this tutorial) then Advanced (where you wire Ollama into OpenClaw). The other tutorials - Open WebUI, Document Q&A, Coding Assistant - are optional add-ons you can layer on once that core is working.

Does Your Hardware Cut It? (Spoiler: Yes)

Before we install, let's talk hardware. The good news: Ollama runs on almost any modern CPU. The question isn't "Can I run it?" but "How fast will it run?" That depends on your CPU cores, RAM, and storage speed. Let's figure out what to expect - and pick the right model for your machine.

Minimum Hardware Requirements

To Run Ollama
  • CPU: Any modern multi-core processor (Intel, AMD, Apple Silicon)
  • RAM: 8GB minimum (16GB+ recommended for 7B models with headroom)
  • Storage: 20GB+ free (models range from ~2GB to ~9GB each)
  • OS: Linux (Ubuntu 24.04+), macOS, or Windows (native installer or WSL2)

If you have these, you can run Ollama. The key variable is speed - which depends on your CPU cores and RAM bandwidth.

Model Selection: 2026 Recommended Picks

The model landscape has moved fast. Llama 2 and Neural Chat are gone from the recommended list - there are better, faster, smarter options that run just as well on consumer hardware. Here are the three best starting points:

DeepSeek-R1:7b (4-bit)
Reasoning Model
Disk: ~4.7GB
RAM Used: ~5.5GB
Speed (8-core CPU): 8–14 tokens/sec
Quality: Outstanding at math, logic, and code problems
Why: Shows its reasoning chain - great when you need to understand why
Got 16GB+ RAM? Step up to qwen3:14b (9.3GB, excellent across the board), deepseek-r1:14b (9GB, serious reasoning), or phi4 (Microsoft's strong 14B model). All pull the same way: ollama pull qwen3:14b. Worth the disk space if you have the RAM.

Understanding Quantization Tags

When you browse ollama.com/library, you'll see tags like q4_0, q4_K_M, and q8_0 next to model names. These are quantization levels - how much the model has been compressed to fit in memory. Here's what they mean:

q4_0 (smallest)
4-bit quantization, aggressive compression. Smallest file size and lowest RAM usage. Slight quality tradeoff on nuanced reasoning tasks. Good starting point if storage is tight.
q4_K_M (best balance)
4-bit "K-quant" quantization (the M stands for medium) - a smarter scheme that keeps more precision where it matters, so it preserves quality better. This is usually what the default tag resolves to. The sweet spot for most users. Slightly larger than q4_0, with noticeably better output on complex prompts.
q8_0 (near-full quality)
8-bit quantization. Nearly indistinguishable from unquantized. Needs considerably more RAM than a q4 model. Use this when quality matters most and you have the headroom. A Qwen3:8b-q8_0 uses ~8GB RAM vs ~5.2GB for q4_K_M.
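
To make the size differences concrete, here is a minimal sketch that estimates the size of a model's weights from its parameter count. The bits-per-weight figures are assumptions of this sketch: q4_0 and q8_0 follow from storing 32 weights plus one 16-bit scale per block, while the q4_K_M value is only a rough average. Actual RAM use is higher, because the runtime also needs room for the context and working buffers.

```python
# Approximate bits stored per weight for each quantization tag.
# q4_0 / q8_0: 32 weights plus one 16-bit scale per block.
# q4_K_M: rough average for a mixed-precision scheme (assumption).
BITS_PER_WEIGHT = {"q4_0": 4.5, "q4_K_M": 4.85, "q8_0": 8.5}


def estimate_weights_gb(n_params: float, quant: str) -> float:
    """Estimate the size of the model weights in GB (10^9 bytes)."""
    bits = n_params * BITS_PER_WEIGHT[quant]
    return bits / 8 / 1e9


if __name__ == "__main__":
    for tag in BITS_PER_WEIGHT:
        size = estimate_weights_gb(8e9, tag)
        print(f"8B model at {tag}: ~{size:.1f} GB of weights")
```

Treat the result as a lower bound when deciding whether a tag fits your machine.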

When you pull a model (after installing Ollama in the next section), you can specify a quantization tag or just use the default:

📝 Example syntax (don't run these yet - install Ollama first)
# No quantization suffix: you get the default build, usually q4_K_M (recommended)
ollama pull qwen3:8b

# Explicitly request q8 for better quality (more RAM)
ollama pull qwen3:8b-q8_0

# 1B model - tiny and fast for simple tasks
ollama pull llama3.2:1b

For most users, :latest is the right call. Ollama picks a good default. Only specify a quantization tag if you're optimizing for a specific RAM budget or quality ceiling. You'll pull your first model in Section 04.

What NOT to Run (On Low RAM)

Some models need more headroom than others. If you're on 8GB total RAM, stick to 3B models:

RAM Guide - What Fits Where
  • 8GB RAM: 1B–3B models comfortably (llama3.2:1b, llama3.2:3b). 7B models are tight - other apps may push you into swap.
  • 16GB RAM: 7B models comfortably. Can experiment with 12–14B (phi4, gemma3:12b). Two 7B models loaded simultaneously.
  • 32GB RAM: 14B models and below comfortably. Multiple concurrent 7B models. Some 32B quantized models with patience.

If you're on 8GB and a 7B model feels sluggish, try llama3.2:3b instead - it's genuinely impressive for its size and will be noticeably more responsive.

Hardware Tiers & What to Expect

Speed varies dramatically based on your CPU. Here's what to expect running a 7B model (q4_K_M):

💻 Performance by CPU Class (7B Model, q4_K_M)
Budget / Older CPU (4 cores, 8 threads):   3–6 tokens/sec (usable, slower)
Mid-Range CPU (8 cores, 16 threads):      8–15 tokens/sec (very good!)
High-End CPU (12+ cores, 24+ threads):    15–25+ tokens/sec (excellent)
Apple Silicon M2/M3:                      30–50+ tokens/sec (GPU-accelerated via Metal)

Note: Tested on systems with 16GB+ RAM and NVMe SSD storage

To put it in perspective: 10 tokens/second means a 500-token response (a long paragraph) takes about 50 seconds. That's fast for CPU-only inference. You'll be pleasantly surprised.

Bottom Line: Start with llama3.2:3b - it's fast, free, and works on any modern machine. Then pull qwen3:8b when you're ready for something more capable. Both will surprise you. Need a full breakdown of every model worth knowing in 2026? Best Local AI Models 2026 - full comparison.

Gear Worth Having for Local AI

Models pile up fast - Qwen3:8b is 5.2GB, DeepSeek-R1:7b is 4.7GB, and once you start experimenting you'll want 6-10 models on hand. That's 30-50GB before you know it. Here's the gear that makes the experience better, from a genuine "we ran these tutorials on this hardware" perspective:

Storage (everyone needs this)
Portable SSD for model storage - keep your boot drive clean and your model library portable. A 2TB drive fits 20+ models with room to spare. The Samsung T7 Shield is fast, durable, and works on any machine you're ever going to run Ollama on. Browse Samsung T7 on Amazon →
Budget Option (~$150-180)
Intel N100 Mini PC (16GB RAM) - the cheapest real entry point for a dedicated local AI machine. Handles OpenClaw well and runs 3B models at a usable pace. Not great for 7B+ models but it's a solid start and sips power 24/7. Browse N100 Mini PCs on Amazon →
Sweet Spot (~$479) ⭐
Ryzen 7 6800H Mini PC (32GB RAM) - everything in these tutorials was built and tested on hardware like this. Runs OpenClaw and Ollama simultaneously without breaking a sweat. 7B-14B models at 8-15 tokens/sec. Not Mac Mini money, not toy-spec hardware. View on Amazon →
RAM Upgrade (if your laptop allows it)
32GB DDR4 SODIMM kit - if you're running Ollama on a laptop with upgradeable RAM, going from 16GB to 32GB is the single highest-impact change you can make for local LLM performance. More RAM = bigger models, less swap, noticeably faster responses. Browse RAM Upgrades on Amazon →

Amazon links above are affiliate links - they cost you nothing extra and help keep these tutorials free and updated. Full hardware buying guide with specific model recommendations: Mini PC Setup - Hardware Guide.

Getting Ollama Running

Ollama installation is genuinely simple. No compilation, no complex setup, no configuration files to fiddle with. Download, run the installer, done. Let's do it.

Prerequisites

  • ✓ Linux OS (Ubuntu 24.04+, Fedora, Debian, etc.)
  • ✓ 8GB+ RAM (16GB+ recommended for multiple models)
  • ✓ 20GB+ free disk space (models range from ~2GB to ~9GB each)
  • ✓ Stable internet connection (for downloading models once)
  • ✓ Ability to run sudo commands (the installer will ask for your password)

Step 1 - Download and Install Ollama

Open a terminal and run:

🖥️ Mini PC
curl -fsSL https://ollama.com/install.sh | sh

This script:

  • Downloads the Ollama binary for your system
  • Places it in `/usr/local/bin/` (in your PATH)
  • Sets up a systemd service to auto-start on boot
  • Creates the ollama user and group

The installation takes 1–2 minutes. You'll see output as it progresses. When it finishes, you're done.

Step 2 - Verify Installation

Check that Ollama is installed and in your PATH:

🖥️ Mini PC
ollama --version

You should see something like:

🖥️ Mini PC - Output
ollama version is X.Y.Z (the exact number depends on when you install)

Step 3 - The Daemon is Already Running

During installation, the Ollama daemon started automatically in the background. It's listening on http://127.0.0.1:11434.

Do not run ollama serve manually - the port is already in use by the running daemon. You can verify it's active by testing the API:

🖥️ Mini PC
curl -s http://127.0.0.1:11434/api/tags

If it responds with JSON (like {"models":[]}), the daemon is running. Good to go!
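
The same health check can be scripted. Below is a minimal Python sketch, assuming the default address from the text above and that each entry in the models array carries a name field; the function names are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # default address from the text above


def installed_models(tags_json: str) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    data = json.loads(tags_json)
    return [m["name"] for m in data.get("models", [])]


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> str:
    """Return the raw /api/tags body; raises an error if the daemon is down."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # An empty list right after installation is expected.
    print(installed_models(fetch_tags()))
```

An empty list means the daemon is up but no models have been pulled yet; a connection error means the service is not running.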

Step 4 - Three Ways to Interact with Ollama

Now that the daemon is running, here's how you can use it:

Three Modes of Operation
  • ollama run <model>
    Interactive chat mode. Type prompts, get responses in your terminal. Great for testing and learning.
  • REST API (curl/Python/etc)
    Programmatic access. Send JSON requests to localhost:11434, get JSON responses. Perfect for integrations and scripts.
  • Background daemon
    The daemon runs automatically on boot and stays running. You don't manage it manually; it just works.

Step 5 - Ready for Your First Model

The daemon is running and the API is responding. Installation is complete.

You're now ready to download and run your first model. Head to the next section to pull Llama 3.2.

Installation Summary

Ollama is installed and running as a background service on http://127.0.0.1:11434. No configuration needed. No manual daemon management. Just pull a model and start using it.

You can now:

  • Pull models with ollama pull <model>
  • Run interactive chat with ollama run <model>
  • Make API calls to http://127.0.0.1:11434 from scripts
  • Start using local AI immediately - no API keys, no cloud, no costs
Installation Complete: Ollama is running and ready. Next: download and run your first model.

Download and Run Your First Model

Time for the moment of truth. We're going to download Llama 3.2 3B - the fast, lightweight model we recommended in the last section - and run our first interactive chat session. This is where it gets real.

Step 1 - Pull Llama 3.2

"Pulling" a model means downloading it from Ollama's registry and storing it locally. Run:

🖥️ Mini PC
ollama pull llama3.2

You'll see output like:

🖥️ Mini PC - Output
pulling manifest
pulling 418956b73c34... (downloading layer 1)
pulling e1cd8f6a5d4a... (downloading layer 2)
verifying sha256 digest
writing manifest
success

The model is about 2GB. On a fast connection this takes a minute or two; on slower links, allow several minutes.

Why Llama 3.2 first? It's small (~2GB), fast (15-30 tokens/sec on a Ryzen 7 6800H-class machine), and impressive for its size. Perfect for learning the basics without waiting around. You'll pull a bigger model (Qwen3 8B) later in the multiple models section.

Step 2 - Run Interactive Chat

Once the download completes, start an interactive chat session:

🖥️ Mini PC
ollama run llama3.2

You'll see the prompt appear with the model ready:

🖥️ Mini PC - Output
>>> 

(Run ollama list anytime to see all installed models and their exact versions.)

Now type a question or statement. Let's try something simple:

💻 Your Input
>>> What is Ollama?

Watch as the model generates a response in real-time. You'll see tokens appearing one by one. This is your local LLM doing inference right now, on your CPU, with no API calls, no cloud, no tracking.

Step 3 - Try More Prompts

Keep the chat session open and try different questions. Here are some good ones to test:

💻 Example Prompts
>>> Explain machine learning in simple terms
>>> Write a Python function that checks if a number is prime
>>> What's a good name for a Discord bot?
>>> Why is the sky blue?
>>> Tell me a joke

Notice:

  • Response speed (start the session with ollama run llama3.2 --verbose to print an eval rate in tokens/s after each response)
  • Response quality (is it coherent? Accurate?)
  • CPU usage (all your cores working hard)
  • No waiting for external APIs

Step 4 - Exit the Chat

To exit the interactive session, type:

🖥️ Mini PC
>>> /bye

Or press Ctrl+D. Either works.

What Actually Happened Here?

Let's be concrete about the workflow:

  1. Pull: Download the model (~2GB) to Ollama's model store (/usr/share/ollama/.ollama/models with the Linux service install, ~/.ollama/models on macOS)
  2. Load: When you run Llama 3.2, Ollama loads it into RAM (~3GB used)
  3. Inference: Your prompt goes to the model, which generates tokens one at a time
  4. Display: Each token appears on your screen as it's generated
  5. Repeat: You type, the model responds, until you exit

Performance Notes

By default, interactive mode doesn't display timing information (adding --verbose to ollama run turns it on). Ollama's REST API returns complete timing metrics with every response, including generation speed, which you'll explore in the next section.

For now, what you can observe from the interactive experience:

  • The response appears token-by-token in real-time
  • You can estimate speed by watching how fast tokens appear (roughly 10–15 tokens/sec on mid-range CPUs is typical)
  • The model loads into memory the first time you run it (notice a slight delay before responses start)
  • Subsequent responses should be faster since the model stays loaded

In the next section, you'll use the REST API to make requests and see exact timing metrics (tokens/sec, total duration, load time, etc.) in JSON responses. That's where you get precise performance data.

You Did It: You have a working local LLM generating intelligent responses. Now let's see the timing metrics and learn to access Ollama programmatically via its REST API.

Programmatic Access to Your LLM

Interactive chat is fun for testing, but the real power comes from using Ollama's REST API. This lets you send prompts programmatically and get responses as JSON. Perfect for integrating with OpenClaw, scripts, or custom applications.

How It Works

Ollama runs a simple HTTP server on localhost:11434. You send JSON requests, you get JSON responses. No authentication, no setup. Just HTTP.

This is how you'll integrate Ollama with OpenClaw later. For now, let's test it with curl.

Step 1 - Simple Text Generation (Synchronous)

Open a terminal and run:

🖥️ Mini PC
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "Explain quantum computing in 2 sentences",
    "stream": false
  }'

The -d flag sends JSON data in the request body. The stream: false means "wait for the full response before returning."

You'll get a JSON response that looks like:

🖥️ Mini PC - JSON Response
{
  "model": "llama3.2",
  "created_at": "2026-02-22T10:30:45.123456Z",
  "response": "Quantum computing exploits quantum mechanics (superposition and entanglement) to process data in fundamentally different ways than classical computers. A quantum computer can explore many solutions simultaneously, making certain problems exponentially faster to solve.",
  "done": true,
  "total_duration": 2500000000,
  "load_duration": 300000000,
  "prompt_eval_count": 12,
  "prompt_eval_duration": 800000000,
  "eval_count": 30,
  "eval_duration": 1400000000
}

Key fields:

  • response: The generated text (what you want)
  • done: Whether generation is complete
  • eval_count: Number of tokens generated
  • eval_duration: Time spent generating (nanoseconds)
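
Generation speed is not a field in the response, but it follows directly from two that are. A minimal sketch using the timing values from the sample response above (the function name is illustrative):

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


# The timing fields from the sample response above.
sample = {
    "total_duration": 2_500_000_000,
    "load_duration": 300_000_000,
    "prompt_eval_duration": 800_000_000,
    "eval_count": 30,
    "eval_duration": 1_400_000_000,
}

print(f"{tokens_per_second(sample):.1f} tokens/sec")
```

The other durations break down where the rest of the time went: loading the model, reading the prompt, then generating.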

Step 2 - Parse with jq (Optional But Nice)

The full JSON response includes several fields you may not need, including a context array (used for multi-turn conversations). If you have jq installed, you can extract just what you want:

🖥️ Mini PC - extract response
curl -s http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "Tell me a joke",
    "stream": false
  }' | jq -r '.response'

The -r flag means "raw output" (no quotes around the text). You'll just see:

🖥️ Mini PC - Output
Why don't scientists trust atoms? Because they make up everything!

Or extract key fields without the context clutter:

🖥️ Mini PC - response + timing
curl -s http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "Tell me a joke",
    "stream": false
  }' | jq '{response, eval_count, eval_duration}'

Much cleaner. Install jq if you don't have it: sudo apt install jq

Note on the context field: The API response may include a context array containing token IDs from your prompt and response. It was designed for multi-turn conversations (sending context back to maintain conversation history). Newer Ollama releases mark it as deprecated in favor of the /api/chat endpoint, which takes the conversation as a list of messages. For single requests, you can safely ignore it or filter it out with jq as shown above.

Step 3 - Streaming API (Real-Time Responses)

For longer responses, you might want tokens to stream in real-time (like in interactive chat). Set stream: true:

🖥️ Mini PC
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "Write a haiku about programming",
    "stream": true
  }'

With streaming, you get multiple JSON objects (one per token), streamed line-by-line:

🖥️ Mini PC - Output (Streaming)
{"model":"llama3.2","created_at":"...","response":"Code","done":false}
{"model":"llama3.2","created_at":"...","response":" flows","done":false}
{"model":"llama3.2","created_at":"...","response":" like","done":false}
...
{"model":"llama3.2","created_at":"...","response":"","done":true}

Parse each line and print the response field to see tokens appear in real-time. This is how chat interfaces work.
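
As a rough illustration of that parsing loop, here is a small Python sketch. It works on any iterable of JSON lines, so it is shown here with lines shaped like the output above rather than a live connection.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragment from each streamed JSON line until done is true."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chunk = json.loads(line)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break


# Lines shaped like the streaming output above (created_at shortened).
stream = [
    '{"model":"llama3.2","created_at":"...","response":"Code","done":false}',
    '{"model":"llama3.2","created_at":"...","response":" flows","done":false}',
    '{"model":"llama3.2","created_at":"...","response":" like","done":false}',
    '{"model":"llama3.2","created_at":"...","response":"","done":true}',
]

for token in iter_tokens(stream):
    print(token, end="", flush=True)
print()
```

With a real request, iterating over the HTTP response object line by line should feed the same function.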

Step 4 - API Parameters

You can pass additional parameters to control generation behavior:

🖥️ Mini PC - Advanced Example
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "Complete this: The future of AI is...",
    "stream": false,
    "options": {
      "temperature": 0.7,
      "top_p": 0.9,
      "num_predict": 100
    }
  }'

Parameters explained (they go inside the options object; you'll dive deeper in Tutorial 2):

  • temperature: Randomness (0.0=deterministic, 1.0=balanced, >1.5=creative)
  • top_p: Diversity control (lower=more focused)
  • num_predict: Max tokens to generate (prevents runaway responses)

For now, the defaults (no parameters) are fine. We'll explore these in Advanced.
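
If you build requests from a script, keep in mind that Ollama reads sampling settings from a nested options object rather than from the top level of the request. A minimal sketch of a helper that does the nesting (the helper name is made up for illustration):

```python
import json


def build_generate_request(model: str, prompt: str, **options) -> str:
    """Build a non-streaming /api/generate body with sampling options nested."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if options:
        body["options"] = options  # temperature, top_p, num_predict, ...
    return json.dumps(body)


payload = build_generate_request(
    "llama3.2",
    "Complete this: The future of AI is...",
    temperature=0.7,
    top_p=0.9,
    num_predict=100,
)
print(payload)
```

Calling it without keyword arguments leaves options out entirely, which gives you the model's defaults.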

Why This Matters for Integration

When you integrate Ollama with OpenClaw, this is what happens under the hood:

  1. OpenClaw receives a user message from Discord
  2. It constructs a JSON request to your local Ollama API
  3. Ollama generates a response
  4. OpenClaw parses the response and sends it back to Discord

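The inference itself happens entirely on your machine; only the Discord messages cross the network. Expect replies in seconds rather than milliseconds, in line with the token speeds covered in this tutorial. That's powerful.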
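
Steps 2 to 4 can be sketched in a few lines of Python. This is a hypothetical illustration of the flow, not OpenClaw's actual code: answer_message and post_json are made-up names, and the transport is passed in as a parameter so the logic can be tried without a running server.

```python
import json
import urllib.request
from typing import Callable

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def post_json(url: str, body: dict) -> dict:
    """Send a JSON POST request and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())


def answer_message(
    user_message: str,
    model: str = "llama3.2",
    send: Callable[[str, dict], dict] = post_json,
) -> str:
    """Build the request, call Ollama, and return the reply text."""
    body = {"model": model, "prompt": user_message, "stream": False}
    reply = send(OLLAMA_GENERATE_URL, body)
    return reply["response"].strip()
```

A bot would call answer_message with the incoming chat text and post the returned string back to the channel.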

API Documentation: For a complete list of parameters and endpoints, check Ollama's API docs.
API Ready: You now know how to use Ollama programmatically. Next: understanding performance and managing multiple models.

Monitoring and Understanding Resource Usage

Now that you have Ollama running, let's look at what's actually happening under the hood. How much memory is it using? How hard is your CPU working? What can you realistically expect?

Monitor While Running

While Ollama is generating text, open another terminal and watch resource usage:

🖥️ Mini PC - Terminal 2
watch -n 1 'free -h'

This updates every second showing your RAM usage. A typical 7B model uses about 5–6GB.

🖥️ Mini PC - Output
              total        used        free      shared  buff/cache   available
Mem:           31Gi       6.2Gi        22Gi       1.3Gi      3.4Gi        23Gi

This example shows about 6GB in use out of 31GB total (the used column covers the whole system, not just Ollama) - plenty of breathing room. Your system will vary.
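
If you'd rather log memory from a script than watch a terminal, the same numbers are available on Linux in /proc/meminfo. A minimal sketch, with sample values invented for illustration:

```python
def meminfo_gib(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: GiB} for kB-valued fields."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if len(parts) == 2 and parts[1] == "kB":
            values[name.strip()] = int(parts[0]) / (1024 * 1024)
    return values


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read the live values on a Linux machine."""
    with open(path) as f:
        return meminfo_gib(f.read())


# Sample text in the /proc/meminfo format (values invented for illustration).
sample = "MemTotal:       32505856 kB\nMemAvailable:   24117248 kB\n"
info = meminfo_gib(sample)
print(f"{info['MemAvailable']:.1f} GiB available of {info['MemTotal']:.1f} GiB")
```

MemAvailable is the figure to watch: it corresponds to the available column in free -h.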

For CPU usage, in another terminal:

🖥️ Mini PC - Terminal 3
top -n 1 -o %CPU

Or use htop for a nicer interface:

🖥️ Mini PC
htop

During inference, you'll see all your CPU cores at high utilization (70–95%). This is normal and expected. Your CPU is working hard, which is why you get good performance.

Performance Benchmarks by Hardware

Here's what to expect for a 7B-8B class model (like Qwen3 8B, which you'll pull in the next section) on different hardware. All numbers assume 16GB+ RAM and SSD storage. Your current Llama 3.2 3B runs roughly 2x faster and uses about half the RAM - these numbers represent the all-rounder tier:

💻 8B Model Performance Ranges (e.g. Qwen3 8B)
Budget / Older CPU (4 cores):
  Tokens/Sec:         3–6 tokens/sec (slow but usable)
  Time to First Token: 1–2 seconds
  RAM Used:           ~6GB

Mid-Range CPU (8 cores):
  Tokens/Sec:         8–15 tokens/sec (very good)
  Time to First Token: 500ms–1 second
  RAM Used:           ~6GB

High-End CPU (12+ cores):
  Tokens/Sec:         15–25+ tokens/sec (excellent)
  Time to First Token: 300–500ms
  RAM Used:           ~6GB

Universal:
  Model Size on Disk: ~5.2GB
  Model Load Time:    1–3 seconds (SSD-dependent)
  CPU Utilization:    70–95% during generation

Where you land on this spectrum depends on your CPU core count and clock speed. The good news: even budget CPUs generate text at usable speeds.

What Affects Performance?

Token generation speed varies based on several factors:

Performance Factors
  • Prompt Length: Longer prompts take longer to process before generating
  • Response Length: More tokens to generate = longer total time (linear relationship)
  • System Load: Other apps running = fewer CPU cycles for Ollama
  • Model Size: Bigger models (13B+) are much slower on CPU
  • SSD Speed: The first load of a model is slow if the drive is the bottleneck (NVMe drives keep it to a few seconds)

Is 10–15 Tokens/Sec Fast Enough?

Let's put it in perspective:

💻 Response Time Examples
Short Answer (50 tokens):       ~3–5 seconds
Medium Answer (200 tokens):     ~13–20 seconds
Long Answer (500 tokens):       ~33–50 seconds
Full Essay (1000 tokens):       ~67–100 seconds
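
These ranges are just the token count divided by the generation speed. A minimal sketch of the arithmetic, ignoring model load and prompt-processing time:

```python
def response_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Time to generate a response, ignoring load and prompt-processing time."""
    return tokens / tokens_per_sec


examples = [("Short", 50), ("Medium", 200), ("Long", 500), ("Essay", 1000)]
for label, tokens in examples:
    fast = response_seconds(tokens, 15)  # upper end of the speed range
    slow = response_seconds(tokens, 10)  # lower end of the speed range
    print(f"{label:<7}{tokens:>5} tokens: {fast:.0f}-{slow:.0f} s")
```

Swap in the tokens/sec you measure on your own machine to get estimates that match your hardware.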

For comparison:

  • Cloud APIs: usually stream faster than this, but they cost money, need an internet connection, and send your prompts to a third party
  • Reading speed: most adults read around 200–300 words/min, which is roughly 4–7 tokens/sec, so text arrives faster than you can read it
  • Typing speed: 40–60 words/min works out to only about 1 token/sec

So yes, 10–15 tokens/sec is comfortably fast for interactive use. You're getting good performance locally, for free.

Sustained Running (24/7 Concerns)

Most hardware can run Ollama continuously without issues:

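  • Thermals: Sustained inference keeps every core busy, so expect fans to spin up. Desktops and mini PCs with decent cooling handle it fine; thin laptops may throttle and slow down.
  • Battery Drain: On laptops with battery, expect 1–2 hours per charge during heavy continuous use.
  • Reliability: Modern CPUs are designed for sustained loads and protect themselves by throttling if they get too hot.
  • Memory Leaks: Ollama is generally stable over long sessions. If RAM use does creep up, restarting the service (sudo systemctl restart ollama) clears it.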

For occasional or development use on a laptop/desktop, your current hardware is fine. If you want to run Ollama 24/7 at scale with multiple concurrent requests, consider a dedicated server or device with more consistent power delivery.

Performance Baseline Set: You now understand what your CPU can do and what to expect. Now let's see how to manage multiple models.

Download, List, and Switch Between Models

One model is useful. A library of models is powerful. Different models have different strengths - a small 3B model for quick queries, a 7B all-rounder for complex work, a reasoning model when you need to think things through. Ollama makes managing all of them trivial.

Step 1 - Download More Models

Let's pull two more models to build out your library. The first is Qwen3:8b - the current all-rounder pick with exceptional reasoning, code ability, and 100+ language support:

🖥️ Mini PC
ollama pull qwen3:8b

And the tiny-but-capable Llama 3.2 1B - useful when you want an instant response and don't need heavy reasoning:

🖥️ Mini PC
ollama pull llama3.2:1b

The 1B model is only ~1.3GB and loads in seconds. It's your speed tier - great for quick lookups and simple tasks when you don't want to wait for a 7B model to warm up.

Step 2 - List Your Models

See everything you've downloaded:

🖥️ Mini PC
ollama list

Output:

🖥️ Mini PC - Output
NAME                 ID              SIZE    MODIFIED
llama3.2:latest      a4f39e04031c    2.0 GB  3 hours ago
qwen3:8b             9a3f5f67c0c9    5.2 GB  2 hours ago
llama3.2:1b          baf6a787fdff    1.3 GB  1 hour ago

Three models, ~8.5GB total. They cover different use cases - speed (1B), everyday tasks (3B), and serious reasoning (8B).

Step 3 - See What's Actually Running

ollama list shows what's downloaded. ollama ps shows what's currently loaded in memory - this is the command you want when troubleshooting performance or wondering why your RAM is full:

🖥️ Mini PC
ollama ps
🖥️ Mini PC - Output
NAME             ID              SIZE      PROCESSOR    UNTIL
qwen3:8b         9a3f5f67c0c9    5.2 GB    100% CPU     4 minutes from now
llama3.2:latest  a4f39e04031c    2.0 GB    100% CPU     2 minutes from now
PROCESSOR column
Shows where the model is running. 100% CPU means pure CPU inference. On systems with a compatible GPU, you'll see something like 78%/22% CPU/GPU, meaning part of the model has been offloaded to the GPU while the rest stays on the CPU. More GPU % = faster.
UNTIL column
When Ollama will unload this model from memory. By default, models unload 5 minutes after their last request. "Forever" means the model is pinned in memory indefinitely (see OLLAMA_KEEP_ALIVE below). Once unloaded, the next request reloads the model - a cold start takes a few seconds.
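
The split can be reasoned about as a simple ratio. The sketch below assumes you know a model's total size and how many of those bytes sit in GPU memory (the /api/ps endpoint is assumed here to report these as size and size_vram). It mirrors how the column reads, not necessarily how Ollama computes it internally.

```python
def processor_split(size: int, size_vram: int) -> str:
    """Describe where a loaded model lives, given total and GPU-resident bytes."""
    if size <= 0:
        raise ValueError("size must be positive")
    if size_vram <= 0:
        return "100% CPU"
    if size_vram >= size:
        return "100% GPU"
    gpu_pct = round(100 * size_vram / size)
    return f"{100 - gpu_pct}%/{gpu_pct}% CPU/GPU"


if __name__ == "__main__":
    print(processor_split(5_200_000_000, 0))              # CPU-only machine
    print(processor_split(5_200_000_000, 1_144_000_000))  # partial offload
```

On the CPU-only machines this tutorial targets, the GPU share is always zero.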

Step 4 - Control Keep-Alive (OLLAMA_KEEP_ALIVE)

By default, Ollama unloads a model from RAM 5 minutes after it was last used. This is good for shared machines and 8GB systems. But if you're the only user and have the RAM, keeping models loaded means instant responses with no cold-start delay.

🖥️ Mini PC - Set keep-alive duration
# OLLAMA_KEEP_ALIVE is read by the Ollama server, so an export only affects
# an "ollama serve" process started from the same shell. For the systemd
# service installed earlier, set it on the service as described after this block.

# Keep models loaded for 30 minutes of idle time
export OLLAMA_KEEP_ALIVE=30m

# Keep models loaded indefinitely (until you restart Ollama)
export OLLAMA_KEEP_ALIVE=-1

# Use the default 5-minute unload (default behavior)
export OLLAMA_KEEP_ALIVE=5m
How to choose: On 16GB RAM, set keep-alive to 30m or longer - a loaded 7B model uses ~5GB and the performance gain is significant. On 8GB RAM, keep the default 5m so the model frees memory when you switch tasks.

To apply OLLAMA_KEEP_ALIVE to the Ollama systemd service (so it persists across reboots), add it to the service's environment:

🖥️ Mini PC - Persist via systemd
sudo systemctl edit ollama
📝 Add this block in the editor that opens
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
🖥️ Mini PC - Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart ollama
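
Keep-alive can also be set per request through the API's keep_alive field, which overrides the server default for that model. A minimal sketch of the request bodies (the helper name is illustrative); sending one to /api/generate without a prompt only loads or unloads the model:

```python
import json


def keep_alive_request(model: str, keep_alive) -> str:
    """Body for /api/generate that only loads (or unloads) a model.

    keep_alive may be a duration string such as "30m", -1 to keep the
    model loaded indefinitely, or 0 to unload it right away.
    """
    return json.dumps({"model": model, "keep_alive": keep_alive})


print(keep_alive_request("qwen3:8b", "30m"))  # stay loaded for 30 idle minutes
print(keep_alive_request("qwen3:8b", 0))      # unload immediately
```

This is handy for warming a model before a batch of requests, or for freeing RAM without restarting the service.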

Step 5 - Switch Between Models

Switch is simple - just run a different model name:

🖥️ Mini PC
ollama run qwen3:8b

You're now in Qwen3's interactive session. When done:

🖥️ Mini PC
>>> /bye

Then switch to the fast tier:

🖥️ Mini PC
ollama run llama3.2:1b

Step 6 - Run Different Models via API

With the REST API, you can specify which model to use per request. This is how OpenClaw routes different task types to different models:

🖥️ Mini PC - Heavy reasoning task → Qwen3
curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Analyze this error and suggest a fix...",
    "stream": false
  }'
🖥️ Mini PC - Quick lookup → Llama 3.2 1B
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2:1b",
    "prompt": "What does HTTP 429 mean?",
    "stream": false
  }'

Ollama queues requests and processes them in order. On 16GB+ RAM with keep-alive set, both models can stay loaded and the switch between them is instantaneous - no unload/reload delay.
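
Per-request routing can be as simple as a lookup table. The sketch below is a hypothetical illustration: the task labels and the mapping are made up, and it is not how OpenClaw itself decides.

```python
# Hypothetical task labels mapped to the models pulled in this section.
MODEL_FOR_TASK = {
    "quick_lookup": "llama3.2:1b",
    "everyday": "llama3.2",
    "reasoning": "qwen3:8b",
}
DEFAULT_MODEL = "llama3.2"


def build_request(task: str, prompt: str) -> dict:
    """Pick a model for the task type and build the /api/generate body."""
    model = MODEL_FOR_TASK.get(task, DEFAULT_MODEL)
    return {"model": model, "prompt": prompt, "stream": False}


print(build_request("reasoning", "Analyze this error and suggest a fix..."))
print(build_request("quick_lookup", "What does HTTP 429 mean?"))
```

Unknown task labels fall back to the default model, so a typo in a label never leaves a request without a model.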

Current Popular Models Worth Trying

All of these are 4-bit quantized by default and pull directly from Ollama's library:

💻 2026 Model Picks
ollama pull llama3.2:1b       # 1.3GB - instant speed, simple tasks
ollama pull llama3.2:3b       # 2GB   - small and surprisingly capable
ollama pull qwen3:4b          # 2.5GB - tiny but rivals much larger models
ollama pull deepseek-r1:7b    # 4.7GB - reasoning model, shows its work
ollama pull qwen3:8b          # 5.2GB - best all-rounder, 100+ languages
ollama pull qwen3:14b         # 9.3GB - step up in quality, needs 16GB RAM
ollama pull deepseek-r1:14b   # 9.0GB - serious reasoning, needs 16GB RAM

Start with llama3.2:3b and qwen3:8b. That covers 90% of use cases. Add qwen3:14b or deepseek-r1:14b if you have 16GB+ and want stronger reasoning.

Disk Space Management

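Models live in Ollama's model directory. With the Linux install script used here, the service runs as the ollama user and stores them in /usr/share/ollama/.ollama/models (on macOS the location is ~/.ollama/models). Check your usage:

🖥️ Mini PC
sudo du -sh /usr/share/ollama/.ollama/models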

To remove a model and free disk space:

🖥️ Mini PC - remove a model
ollama rm llama3.2:1b

Re-pull anytime with ollama pull. Ollama stores each model as a set of content-addressed layers, so tags that share identical layers (such as a license or prompt template) don't download those again. The weights themselves are unique to each model size, so a removed model is downloaded in full when you pull it again.

Memory Limits (Hardware Dependent)

Ollama loads models on-demand into RAM. Here's what you need to know:

  • One 7B model (q4_K_M): ~5–6GB RAM (plus OS overhead, so ~8GB total consumed)
  • One 3B model: ~2.5GB RAM - very comfortable on 8GB systems
  • Two 7B models loaded: ~12GB - comfortable on 16GB systems
  • One 14B model (phi4): ~9GB - needs 16GB, leaves headroom

Ollama gracefully unloads models you're not using when memory gets tight. With OLLAMA_KEEP_ALIVE set, models stay warm as long as you have the RAM to support them.
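
A quick way to sanity-check a combination of models before loading them is to add up the figures above. A minimal sketch, using midpoints from the list and an assumed 2.5GB for the OS and other apps:

```python
# Approximate RAM per loaded model in GB, taken from the list above.
MODEL_RAM_GB = {"3b": 2.5, "7b": 5.5, "14b": 9.0}


def fits_in_ram(
    models: list[str], total_ram_gb: float, os_overhead_gb: float = 2.5
) -> bool:
    """Check whether the given models can be loaded together."""
    needed = sum(MODEL_RAM_GB[m] for m in models) + os_overhead_gb
    return needed <= total_ram_gb


if __name__ == "__main__":
    print(fits_in_ram(["7b", "7b"], 16))  # two 7B models on a 16GB machine
    print(fits_in_ram(["7b", "7b"], 8))   # the same pair on 8GB
```

The overhead figure is a guess; raise it if you keep a browser or an IDE open alongside Ollama.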

Multiple Models Ready: You now know how to manage a model library, check what's loaded with ollama ps, and tune keep-alive to match your hardware. Next: integrating Ollama with OpenClaw.

Where to Go From Here

You have a working local LLM. You can download models, run them interactively, hit the API, and manage multiple models. That's the foundation. Here's where to go next.

Recommended next step: Ollama Advanced

If you want OpenClaw to use your local models instead of cloud APIs, the Advanced tutorial is where that happens. It covers parameter tuning (so your bot gives consistent answers) and walks you through wiring Ollama directly into OpenClaw. That's the whole point of Book III. Budget about 90 minutes.

The core path

Book III has a main road and side roads. The main road is:

  • Ollama Basics (you just finished this) → install, run models, learn the API
  • Best Local AI Models 2026 → which model to actually pull for your hardware and use case
  • Ollama Advanced → tune parameters, benchmark, and integrate with OpenClaw

Once your OpenClaw bot is running on local models, the rest of Book III is optional extras you can add whenever you want.

Optional add-ons (do these anytime after Advanced)

These tutorials are standalone and don't depend on each other. Pick whichever sounds useful:

  • Open WebUI - a ChatGPT-style web interface for your local models. Great for non-terminal users or sharing with family.
  • Document Q&A with RAG - upload PDFs, notes, and manuals, then ask questions about them. Uses AnythingLLM.
  • Local Coding Assistant - Copilot-style autocomplete and AI chat in VS Code, powered by Ollama. Your code never leaves your machine.

Common Next Questions

Q: Can I run this in the background permanently?
A: Yes! Ollama already auto-starts as a systemd service. It'll run on boot and keep running.

Q: Can I access Ollama from other machines (not localhost)?
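A: Not by default. Ollama listens only on 127.0.0.1, and the API has no authentication. To reach it from another machine, use an SSH tunnel or a reverse proxy such as nginx, or set OLLAMA_HOST on the service to bind another address (advanced - only do this on a network you trust).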

Q: What if I want to upgrade to a bigger model later?
A: You can, but 13B+ models are significantly slower on CPU. Consider upgrading hardware to GPU.

Q: Is there a web UI for Ollama?
A: Not built-in, but there are excellent options. Our Open WebUI tutorial walks you through setting up a polished, ChatGPT-style interface that connects directly to your local Ollama instance.

Q: Can I use Ollama on Windows/Mac?
A: Yes! Ollama has installers for all platforms. Same concepts apply.

You're Ready

You've completed Ollama Basics. You understand:

  • ✓ What Ollama is and why it matters
  • ✓ How to install and verify it works
  • ✓ How to download and run models
  • ✓ How to use the REST API programmatically
  • ✓ What performance to expect from modern CPUs
  • ✓ How to manage multiple models

That's a solid foundation.

Whatever you choose, you're now part of the local AI revolution. Your data is yours. Your LLM is yours. No cloud, no subscriptions, just you and your hardware. That's powerful. Enjoy!

Ollama Basics Complete: You're ready for the next step. Good luck! 🚀