Ask Your Own Documents Questions
Your local AI is smart, but it only knows what it was trained on. Ask it about your company's internal docs, your class notes, or that 200-page PDF manual sitting on your desktop and it'll either make something up or admit it doesn't know.
RAG fixes this. It lets your local model read your actual documents and answer questions about them - with citations pointing back to the source.
What is RAG?
RAG stands for Retrieval-Augmented Generation. The name is technical but the idea is simple: before the AI answers your question, it first searches through your documents to find the relevant parts, then uses those parts to generate an accurate answer.
Instead of relying purely on what the model "remembers" from training, it reads the actual source material first. This means it can answer questions about things it was never trained on - your personal files, your company's documentation, last week's meeting notes.
Real use cases people actually build
- Personal knowledge base - dump all your notes, bookmarks, and research into one searchable system
- Technical documentation - ask "how do I configure X?" and get the answer from your own docs
- Legal and contract review - upload a contract and ask "what are the termination clauses?"
- Study assistant - feed it your textbooks and quiz yourself on the material
- Codebase Q&A - upload README files and docs, ask "how does authentication work in this project?"
- Meeting notes search - upload a month of meeting notes, ask "what did we decide about the pricing change?"
The common thread: you have documents with answers in them, and you want to find those answers without reading through everything yourself.
What you'll build in this tutorial
By the end of this guide, you'll have a working document Q&A system running locally. You'll upload a document, ask it questions, and get answers with references to where in the document the answer came from. Everything stays on your machine.
What you'll need
- Ollama installed and running with at least one chat model (see Ollama Basics)
- Docker installed (see Open WebUI tutorial for Docker setup if needed)
- 16GB+ RAM recommended - RAG runs a chat model and an embedding model at the same time (the Ryzen 7 6800H with 32GB handles both without breaking a sweat)
- A document you want to search through (PDF, text file, or Markdown)
- About 30 minutes for setup, then you're exploring
The Simple Version
Before we set anything up, let's understand what's actually happening under the hood. No math, no jargon overload - just the mental model you need to use this effectively.
Step 1: Your documents get chopped up
When you upload a document, the system breaks it into smaller pieces called "chunks." A 50-page PDF might become 200 chunks, each containing a paragraph or two. This is because AI models work better with focused, relevant snippets than with entire books shoved into the prompt.
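The chunking step can be sketched in a few lines of Python. This is a simplified fixed-size splitter; real tools like AnythingLLM prefer to break on sentence and paragraph boundaries, but the principle is the same.

```python
def chunk_text(text, chunk_size=1000):
    """Split text into pieces of at most chunk_size characters.

    Real RAG tools split on sentence/paragraph boundaries;
    fixed-size slicing just keeps the idea visible.
    """
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

document = "word " * 500   # stand-in for a long document (2,500 characters)
chunks = chunk_text(document)
print(len(chunks))         # 3 chunks of up to 1,000 characters each
```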
Step 2: Each chunk becomes a set of numbers
Each chunk gets converted into a list of numbers (called an "embedding") that represents its meaning. Think of it like a GPS coordinate for the content - similar ideas end up with similar coordinates. A chunk about "installing software" would have coordinates close to a chunk about "setting up applications," even if they use completely different words.
This conversion is done by a small, specialized model called an embedding model. It's fast and lightweight - not the same thing as your chat model.
Step 3: Numbers go into a searchable database
Those number-coordinates get stored in a vector database. It's basically a search index optimized for finding things by meaning rather than by exact keywords. Traditional search finds "install" when you search for "install." Vector search finds "install" when you search for "how do I set this up?" - because the meaning is similar.
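Here's a toy illustration of "similar meaning, similar coordinates." The three-number vectors below are made up for the example; real embeddings have hundreds of dimensions, but the similarity math is identical.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two vectors: close to 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for three chunks
installing_software = [0.9, 0.1, 0.2]
setting_up_apps     = [0.8, 0.2, 0.3]
chocolate_cake      = [0.1, 0.9, 0.7]

# Similar topics score high even though the words differ...
print(cosine_similarity(installing_software, setting_up_apps))   # ~0.98
# ...unrelated topics score low
print(cosine_similarity(installing_software, chocolate_cake))    # ~0.30
```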
When you ask a question
Here's what happens when you type a question:
- Your question gets converted into the same kind of number-coordinates
- The database finds the chunks with the most similar coordinates (the most relevant content)
- Those chunks get inserted into the prompt alongside your question
- The chat model reads the relevant chunks and generates an answer based on them
- You get an answer with citations pointing back to which chunks it used
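The steps above can be sketched end to end. The chunks, their vectors, and the pre-embedded question below are toy stand-ins (a real system gets every vector from the embedding model), but the ranking and prompt assembly mirror what AnythingLLM does for you.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy chunk store: text plus a made-up embedding for each chunk
chunks = [
    {"text": "Hold the reset button for 10 seconds to restore factory defaults.",
     "vector": [0.9, 0.1]},
    {"text": "The warranty covers replacement parts for two years.",
     "vector": [0.1, 0.9]},
]

def build_prompt(question, question_vector, k=1):
    # Rank chunks by similarity to the question, keep the top k
    ranked = sorted(chunks, key=lambda c: cosine(question_vector, c["vector"]),
                    reverse=True)
    context = "\n".join(c["text"] for c in ranked[:k])
    # The retrieved chunks get inserted into the prompt for the chat model
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How do I factory reset?", [0.8, 0.2]))
```

The chat model never sees the whole document - only the top-ranked chunks, which is what keeps answers focused and fast.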
Why not just paste the whole document into the chat?
Good question. Three reasons:
- Context window limits - models can only process so much text at once. A 7B model typically handles 4,000-8,000 tokens (roughly 3,000-6,000 words). Your document might be 50,000 words.
- Accuracy drops - even models with large context windows tend to miss things in the middle of long inputs. Focused, relevant chunks produce better answers.
- Speed - processing 200 words of relevant content is way faster than processing 50,000 words of everything.
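A quick back-of-the-envelope check makes the first point concrete (assuming the common rule of thumb of roughly 0.75 English words per token):

```python
# Rough rule of thumb: one token is about 0.75 English words
words_in_document = 50_000
estimated_tokens = int(words_in_document / 0.75)   # ~66,600 tokens

context_window = 8_000   # a generous window for a small local model
print(estimated_tokens > context_window)           # True: it won't fit
```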
RAG gives the model exactly the parts it needs, nothing more. Better answers, faster responses.
RAG doesn't make the model smarter. It gives the model the right information at the right time. The model's job is to read and summarize - RAG's job is to find the right pages to read.
Your Document Q&A Hub
There are several tools that can do RAG with local models. We're using AnythingLLM because it bundles everything you need into one package - no Python scripts, no dependency hell, no stitching five different tools together. It connects to Ollama out of the box and handles the chunking, embedding, and vector storage for you.
Option A: Docker (recommended)
If you already have Docker running (from the Open WebUI setup or earlier tutorials), this is one command:
docker run -d \
--name anythingllm \
-p 3001:3001 \
-v anythingllm:/app/server/storage \
--restart unless-stopped \
mintplexlabs/anythingllm
What each flag does:
- -p 3001:3001 - exposes the web interface on port 3001
- -v anythingllm:/app/server/storage - persists your data (workspaces, documents, chat history)
- --restart unless-stopped - auto-starts on boot
Wait about 30 seconds for it to start up, then open your browser.
Option B: Desktop app
If you prefer a standalone app over Docker, AnythingLLM has desktop downloads for macOS, Windows, and Linux. Grab it from anythingllm.com and install it like any other application.
The desktop version works the same way - it just runs locally as an app instead of a Docker container.
First launch
Open your browser and go to:
http://localhost:3001
AnythingLLM will walk you through a setup wizard on first launch. Here's what to select:
- LLM Provider - select "Ollama" and point it to http://localhost:11434 (or http://host.docker.internal:11434 if on macOS with Docker)
- Chat Model - pick whichever Ollama model you want to use for answering questions (llama3.2 or mistral are good defaults)
- Embedding Provider - we'll set this up in the next section
If AnythingLLM is running in Docker and can't reach Ollama on localhost, try http://host.docker.internal:11434 (macOS/Windows) or http://172.17.0.1:11434 (Linux) as the Ollama URL. This is a Docker networking thing - containers sometimes can't see "localhost" the same way your host machine does.
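If you want to confirm a URL can actually reach Ollama before typing it into the wizard, a few lines of Python will tell you. Ollama's root endpoint answers a plain HTTP GET with status 200 when the server is up.

```python
import urllib.request

def ollama_reachable(url="http://localhost:11434"):
    """Return True if an Ollama server answers at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:          # connection refused, timeout, bad hostname...
        return False

print(ollama_reachable())    # True if Ollama is running locally
```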
Quick tour
Once setup is done, you'll land on the main screen. The key areas:
- Workspaces (sidebar) - these are like project folders. Each workspace has its own documents and chat history.
- Settings (gear icon) - where you configure LLM, embedding, and vector database options
- Upload - drag documents into a workspace to make them searchable
Don't upload anything yet - we need to set up the embedding model first so your documents get indexed properly.
The Engine Behind Search
Remember those "number-coordinates" from the previous section? The embedding model is what creates them. It's a small, specialized model that turns text into searchable vectors. Without it, RAG can't find the right chunks to answer your questions.
Pull an embedding model
Ollama has embedding models in its library just like chat models. The one we want is nomic-embed-text - it's small (274MB), fast, and accurate enough for most use cases.
ollama pull nomic-embed-text
This downloads in under a minute on most connections. It's tiny compared to chat models.
Configure AnythingLLM to use it
In AnythingLLM:
- Go to Settings (gear icon)
- Click Embedding Preference
- Select Ollama as the provider
- Set the Ollama URL (same as before - http://localhost:11434)
- Select nomic-embed-text from the model dropdown
- Save
Why nomic-embed-text?
A few reasons it's a solid default:
- Size - 274MB, barely noticeable next to your multi-gigabyte chat models
- Speed - embeds thousands of chunks in seconds
- Quality - competitive with much larger embedding models on standard benchmarks
- Long context - handles chunks up to 8,192 tokens, so it won't choke on longer passages
There are other options (mxbai-embed-large is slightly more accurate, all-minilm is even smaller), but nomic-embed-text hits the sweet spot for most people.
These are different tools for different jobs. Your chat model (llama3.2, mistral) generates text responses. The embedding model (nomic-embed-text) converts text into searchable coordinates. They work together but don't compete for the same resources - on the recommended Ryzen 7 6800H with 32GB RAM, nomic-embed-text uses under 300MB and embedding runs so fast you won't even notice it happening alongside your chat model.
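Under the hood, AnythingLLM asks Ollama for embeddings over plain HTTP. Here's a sketch of that request using Ollama's /api/embeddings endpoint; the payload shape is Ollama's documented one, while the helper name is just ours for illustration.

```python
import json
import urllib.request

def embedding_request(text, model="nomic-embed-text",
                      url="http://localhost:11434/api/embeddings"):
    """Build the HTTP request Ollama expects for an embedding."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    return urllib.request.Request(url, data=payload,
                                  headers={"Content-Type": "application/json"})

req = embedding_request("How do I configure the database connection?")
# Sending it with urllib.request.urlopen(req) returns JSON like
# {"embedding": [0.01, -0.42, ...]} - one number per dimension.
print(json.loads(req.data)["model"])  # nomic-embed-text
```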
Upload, Ask, Get Answers
Everything is configured. Ollama is running with a chat model and an embedding model. AnythingLLM is connected to both. Time to upload a document and actually use this thing.
Step 1 - Create a workspace
In AnythingLLM, click the + button in the sidebar to create a new workspace. Give it a name that describes what you're putting in it - something like "Product Docs" or "Meeting Notes" or "Tax Stuff."
Workspaces keep things organized. Documents in one workspace don't mix with documents in another. Think of them like project folders.
Step 2 - Upload a document
Click the upload icon in your workspace and drag in a document. For your first try, pick something you know well - maybe a README file, a recipe collection, some meeting notes, or a manual for something you own. Knowing the content helps you verify the answers are accurate.
Supported formats include:
- PDF - the most common format people use
- TXT - plain text files
- MD - Markdown documents
- DOCX - Word documents
- CSV - spreadsheet data
If your PDF is actually images of text (like a scanned document), the system can't read it properly. You'll need to run OCR on it first. Most modern PDFs with selectable text work fine.
Step 3 - Watch it process
After uploading, AnythingLLM will chunk the document and create embeddings. You'll see a progress indicator. For a typical 10-20 page PDF, this takes a few seconds. Larger documents (hundreds of pages) might take a minute or two.
This is a one-time cost per document. Once it's embedded, searching through it is nearly instant.
Step 4 - Ask your first question
Switch to the chat tab in your workspace and type a question about the document you just uploaded. Be specific. Instead of "tell me about this document," try something like:
- "What are the main requirements listed in section 3?"
- "How do I configure the database connection?"
- "Summarize the key decisions from the March 15 meeting"
- "What ingredients do I need for the chocolate cake recipe?"
Step 5 - Check the citations
Look at the response. You should see citation markers that point back to specific chunks of your document. Click on them to see exactly which part of the original text the answer came from.
This is one of the most valuable parts of RAG. You're not just getting an answer - you're getting proof of where the answer came from. If something looks wrong, you can check the source directly.
You just asked a question about a local document and got an answer from your own AI, running on your own hardware, without any of your data leaving your machine. That's the whole point.
Putting It to Work
You've seen the basics work. Now let's look at some practical scenarios where document Q&A actually saves you time - the kind of thing you'll find yourself reaching for regularly.
Example 1: Product manual lookup
You bought a new router, NAS, or some piece of hardware with a 150-page PDF manual. Instead of scrolling through it or using Ctrl+F with the exact right keyword, upload it and ask naturally:
- "How do I reset this to factory defaults?"
- "What ports need to be open for remote access?"
- "What's the default admin password?"
- "How do I set up port forwarding?"
The model finds the relevant section and gives you a direct answer. Way faster than hunting through a table of contents.
Example 2: Meeting notes search
Upload a month's worth of meeting notes into a workspace. Then ask things like:
- "What action items were assigned to me in the last two weeks?"
- "What did we decide about the pricing change?"
- "When did we last discuss the API migration?"
- "Summarize what happened in the March 10 standup"
This works especially well if your meeting notes follow a consistent format. The model gets better at finding relevant information when the source material is well-structured.
Example 3: Codebase documentation
Upload your project's README, architecture docs, and API documentation into a workspace. Now you've got a searchable assistant that knows your project:
- "How does authentication work in this project?"
- "What environment variables need to be set?"
- "What's the deployment process?"
- "Which API endpoint handles user registration?"
This is particularly useful for onboarding or for jumping back into a project you haven't touched in a while.
Tips for getting better answers
- Be specific - "What are the system requirements for installation?" works better than "tell me about installation"
- Ask follow-ups - if the first answer is close but not quite right, ask a more targeted follow-up question
- Check citations - always glance at the source chunks to verify the answer makes sense in context
- One topic per workspace - don't dump unrelated documents together. A workspace about "tax documents" will give better results than one with tax docs, recipes, and meeting notes mixed together
- Structured documents work best - documents with clear headings, sections, and formatting produce better chunks and better answers
RAG is really good at finding and summarizing specific information from your documents. It's less good at synthesizing across many documents or answering questions that require reasoning between multiple sources. For those tasks, you might need to ask multiple focused questions and piece things together yourself. Think of it as a very fast research assistant, not an oracle.
Getting the Most Out of It
RAG works well out of the box, but knowing a few details about how it handles documents will help you get better results and fix common issues.
Best document formats
Not all documents are created equal when it comes to RAG:
- PDF with selectable text - works great. This is what most people use.
- Plain text (.txt) - works perfectly. No formatting to get in the way.
- Markdown (.md) - excellent. Headings and structure help with chunking.
- Word documents (.docx) - works well. Tables and formatting are preserved.
- Scanned PDFs (image-based) - won't work without OCR first. The system can't read images of text.
- Heavily formatted PDFs - tables, multi-column layouts, and complex formatting can sometimes chunk poorly. If results seem off, try converting to plain text first.
Chunk size and overlap
AnythingLLM lets you configure how documents get split up. The two settings that matter:
- Chunk size - how big each piece is (default is usually 1,000 characters). Bigger chunks give more context but might include irrelevant info. Smaller chunks are more precise but might miss context.
- Chunk overlap - how much consecutive chunks share (default is usually 20-50 characters). Overlap prevents information from being lost at chunk boundaries.
The defaults work fine for most documents. Only adjust if you're getting answers that seem to be missing context (try bigger chunks) or including too much irrelevant information (try smaller chunks).
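Overlap is easiest to see in code. A sketch of a splitter where consecutive chunks share their boundary characters:

```python
def chunk_with_overlap(text, chunk_size=1000, overlap=50):
    """Split text so each chunk shares `overlap` characters with the next."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("abcdefghij" * 200)   # 2,000-character document
# The tail of each chunk repeats at the head of the next, so a sentence
# sitting on a boundary still appears whole in at least one chunk.
print(chunks[0][-50:] == chunks[1][:50])          # True
```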
"I'm getting wrong answers"
Common causes and fixes:
- The document wasn't properly parsed - check if the document has selectable text. If it's a scanned image, you'll need OCR.
- The question is too vague - "tell me about this" won't help the search find the right chunks. Be specific about what you're looking for.
- The answer spans multiple sections - RAG finds individual chunks. If the answer requires connecting information from different parts of the document, you might need to ask multiple focused questions.
- Wrong model for the job - smaller models (3B-7B) sometimes miss nuance. If accuracy matters, try a larger chat model.
- The information isn't in the document - check the citations. If the model is citing irrelevant chunks, the information you're looking for might not actually be in the uploaded document.
Performance notes
- Embedding is a one-time cost - each document only needs to be processed once. After that, searches are nearly instant.
- Storage is minimal - the vector database for thousands of chunks takes up very little disk space (megabytes, not gigabytes)
- RAM usage - the embedding model (nomic-embed-text) uses very little RAM. Your chat model is still the main consumer.
- Speed scales well - having 100 documents doesn't make queries slower than having 10. Vector search is designed for this.
Backup your data
If you're using Docker, your workspaces and embedded documents live in the anythingllm Docker volume. To back it up:
docker run --rm \
-v anythingllm:/data \
-v $(pwd):/backup \
alpine tar czf /backup/anythingllm-backup.tar.gz /data
This creates a compressed backup of everything - your workspaces, chat history, and embedded documents. Keep a copy somewhere safe.
Where to Go From Here
You've built a local document Q&A system that runs entirely on your own hardware. No cloud APIs, no per-query costs, no data leaving your network. Let's talk about what you can do with it next.
What you've accomplished
- ✓ Learned how RAG works (chunking, embeddings, vector search)
- ✓ Set up AnythingLLM with Ollama
- ✓ Pulled and configured an embedding model
- ✓ Created your first workspace and uploaded a document
- ✓ Asked questions and got cited answers from your own files
Ideas to build on
- Personal knowledge base - upload all your notes, bookmarks, and research. Build a second brain you can actually search.
- Team documentation hub - create workspaces for different projects. Onboard new team members by pointing them at the workspace instead of sending them a pile of docs.
- Study assistant - upload textbook chapters before an exam. Quiz yourself and get explanations with page references.
- Recipe search - dump all your saved recipes into one workspace. Ask "what can I make with chicken and rice?" and get actual answers from your collection.
- Contract review - upload a lease or service agreement. Ask "what happens if I cancel early?" and get the relevant clause cited back to you.
Alternative tools worth knowing
AnythingLLM is great for getting started, but the RAG ecosystem is growing fast:
- PrivateGPT - similar concept, different interface. Worth trying if you want a comparison.
- Open WebUI's built-in RAG - if you're already using Open WebUI (from the previous tutorial), it has document upload built in. Less powerful than AnythingLLM for heavy document work, but convenient for quick uploads.
- LangChain + ChromaDB - for developers who want to build custom RAG pipelines with Python. More work, more control.
Explore next
- → Local Coding Assistant - use your local models to power VS Code autocomplete and AI chat
- → AnythingLLM documentation - API access, multi-user setup, advanced configuration
- → AnythingLLM on GitHub - star the project, follow updates
- → Ollama Advanced - tune your models for better RAG performance
- → Open WebUI - if you haven't set up a chat interface yet
You've built something genuinely useful - a private, local system that can search through your documents and answer questions about them. No subscriptions, no data leaving your network, no limits on how many documents you upload. This is your information, on your terms.