Ask Your Own Documents Questions
Your local AI is smart, but it only knows what it was trained on. Ask it about your company's internal docs, your class notes, or that 200-page PDF manual sitting on your desktop and it'll either make something up or admit it doesn't know.
RAG fixes this. It lets your local model read your actual documents and answer questions about them - with citations pointing back to the source.
What is RAG?
RAG stands for Retrieval-Augmented Generation. The name is technical but the idea is simple: before the AI answers your question, it first searches through your documents to find the relevant parts, then uses those parts to generate an accurate answer.
Instead of relying purely on what the model "remembers" from training, it reads the actual source material first. This means it can answer questions about things it was never trained on - your personal files, your company's documentation, last week's meeting notes.
Real use cases people actually build
- Personal knowledge base - dump all your notes, bookmarks, and research into one searchable system
- Technical documentation - ask "how do I configure X?" and get the answer from your own docs
- Legal and contract review - upload a contract and ask "what are the termination clauses?"
- Study assistant - feed it your textbooks and quiz yourself on the material
- Codebase Q&A - upload README files and docs, ask "how does authentication work in this project?"
- Meeting notes search - upload a month of meeting notes, ask "what did we decide about the pricing change?"
The common thread: you have documents with answers in them, and you want to find those answers without reading through everything yourself.
What you'll build in this tutorial
By the end of this guide, you'll have a working document Q&A system running locally. You'll upload a document, ask it questions, and get answers with references to where in the document the answer came from. Everything stays on your machine.
What you'll need
- Ollama installed and running with at least one chat model (see Ollama Basics)
- Docker installed (see Open WebUI tutorial for Docker setup if needed)
- 16GB+ RAM recommended - RAG runs a chat model and an embedding model at the same time (the Ryzen 7 6800H with 32GB handles both without breaking a sweat)
- A document you want to search through (PDF, text file, or Markdown)
- About 30 minutes for setup, then you're exploring
The Simple Version
Before we set anything up, let's understand what's actually happening under the hood. No math, no jargon overload - just the mental model you need to use this effectively.
Step 1: Your documents get chopped up
When you upload a document, the system breaks it into smaller pieces called "chunks." A 50-page PDF might become 200 chunks, each containing a paragraph or two. This is because AI models work better with focused, relevant snippets than with entire books shoved into the prompt.
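The chunking step can be sketched in a few lines of Python. This is a simplified fixed-size splitter; real tools like AnythingLLM prefer to break on sentence and paragraph boundaries, but the principle is the same.

```python
def chunk_text(text, chunk_size=1000):
    """Split text into pieces of at most chunk_size characters.

    Real RAG tools split on sentence/paragraph boundaries;
    fixed-size slicing just keeps the idea visible.
    """
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

document = "word " * 500   # stand-in for a long document (2,500 characters)
chunks = chunk_text(document)
print(len(chunks))         # 3 chunks of up to 1,000 characters each
```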
Step 2: Each chunk becomes a set of numbers
Each chunk gets converted into a list of numbers (called an "embedding") that represents its meaning. Think of it like a GPS coordinate for the content - similar ideas end up with similar coordinates. A chunk about "installing software" would have coordinates close to a chunk about "setting up applications," even if they use completely different words.
This conversion is done by a small, specialized model called an embedding model. It's fast and lightweight - not the same thing as your chat model.
Step 3: Numbers go into a searchable database
Those number-coordinates get stored in a vector database. It's basically a search index optimized for finding things by meaning rather than by exact keywords. Traditional search finds "install" when you search for "install." Vector search finds "install" when you search for "how do I set this up?" - because the meaning is similar.
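Here's a toy illustration of "similar meaning, similar coordinates." The three-number vectors below are made up for the example; real embeddings have hundreds of dimensions, but the similarity math is identical.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two vectors: close to 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for three chunks
installing_software = [0.9, 0.1, 0.2]
setting_up_apps     = [0.8, 0.2, 0.3]
chocolate_cake      = [0.1, 0.9, 0.7]

# Similar topics score high even though the words differ...
print(cosine_similarity(installing_software, setting_up_apps))   # ~0.98
# ...unrelated topics score low
print(cosine_similarity(installing_software, chocolate_cake))    # ~0.30
```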
When you ask a question
Here's what happens when you type a question:
- Your question gets converted into the same kind of number-coordinates
- The database finds the chunks with the most similar coordinates (the most relevant content)
- Those chunks get inserted into the prompt alongside your question
- The chat model reads the relevant chunks and generates an answer based on them
- You get an answer with citations pointing back to which chunks it used
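The steps above can be sketched end to end. The chunks, their vectors, and the pre-embedded question below are toy stand-ins (a real system gets every vector from the embedding model), but the ranking and prompt assembly mirror what AnythingLLM does for you.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy chunk store: text plus a made-up embedding for each chunk
chunks = [
    {"text": "Hold the reset button for 10 seconds to restore factory defaults.",
     "vector": [0.9, 0.1]},
    {"text": "The warranty covers replacement parts for two years.",
     "vector": [0.1, 0.9]},
]

def build_prompt(question, question_vector, k=1):
    # Rank chunks by similarity to the question, keep the top k
    ranked = sorted(chunks, key=lambda c: cosine(question_vector, c["vector"]),
                    reverse=True)
    context = "\n".join(c["text"] for c in ranked[:k])
    # The retrieved chunks get inserted into the prompt for the chat model
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How do I factory reset?", [0.8, 0.2]))
```

The chat model never sees the whole document - only the top-ranked chunks, which is what keeps answers focused and fast.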
Why not just paste the whole document into the chat?
Good question. Three reasons:
- Context window limits - models can only process so much text at once. A 7B model typically handles 4,000-8,000 tokens (roughly 3,000-6,000 words). Your document might be 50,000 words.
- Accuracy drops - even models with large context windows tend to miss things in the middle of long inputs. Focused, relevant chunks produce better answers.
- Speed - processing 200 words of relevant content is way faster than processing 50,000 words of everything.
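A quick back-of-the-envelope check makes the first point concrete (assuming the common rule of thumb of roughly 0.75 English words per token):

```python
# Rough rule of thumb: one token is about 0.75 English words
words_in_document = 50_000
estimated_tokens = int(words_in_document / 0.75)   # ~66,600 tokens

context_window = 8_000   # a generous window for a small local model
print(estimated_tokens > context_window)           # True: it won't fit
```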
RAG gives the model exactly the parts it needs, nothing more. Better answers, faster responses.
RAG doesn't make the model smarter. It gives the model the right information at the right time. The model's job is to read and summarize - RAG's job is to find the right pages to read.
Your Document Q&A Hub
There are several tools that can do RAG with local models. We're using AnythingLLM because it bundles everything you need into one package - no Python scripts, no dependency hell, no stitching five different tools together. It connects to Ollama out of the box and handles the chunking, embedding, and vector storage for you.
Option A: Docker (recommended)
If you already have Docker running (from the Open WebUI setup or earlier tutorials), this is one command:
docker run -d \
--name anythingllm \
-p 3001:3001 \
-v anythingllm:/app/server/storage \
--restart unless-stopped \
mintplexlabs/anythingllm
What each flag does:
- -p 3001:3001 - exposes the web interface on port 3001
- -v anythingllm:/app/server/storage - persists your data (workspaces, documents, chat history)
- --restart unless-stopped - auto-starts on boot
Wait about 30 seconds for it to start up, then open your browser.
Option B: Desktop app
If you prefer a standalone app over Docker, AnythingLLM has desktop downloads for macOS, Windows, and Linux. Grab it from anythingllm.com and install it like any other application.
The desktop version works the same way - it just runs locally as an app instead of a Docker container.
First launch
Open your browser and go to:
http://localhost:3001
AnythingLLM will walk you through a setup wizard on first launch. Here's what to select:
- LLM Provider - select "Ollama" and point it to http://localhost:11434 (or http://host.docker.internal:11434 if on macOS with Docker)
- Chat Model - pick whichever Ollama model you want to use for answering questions (llama3.2 or mistral are good defaults)
- Embedding Provider - we'll set this up in the next section
If AnythingLLM is running in Docker and can't reach Ollama on localhost, try http://host.docker.internal:11434 (macOS/Windows) or http://172.17.0.1:11434 (Linux) as the Ollama URL. This is a Docker networking thing - containers sometimes can't see "localhost" the same way your host machine does.
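If you want to confirm a URL can actually reach Ollama before typing it into the wizard, a few lines of Python will tell you. Ollama's root endpoint answers a plain HTTP GET with status 200 when the server is up.

```python
import urllib.request

def ollama_reachable(url="http://localhost:11434"):
    """Return True if an Ollama server answers at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:          # connection refused, timeout, bad hostname...
        return False

print(ollama_reachable())    # True if Ollama is running locally
```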
Quick tour
Once setup is done, you'll land on the main screen. The key areas:
- Workspaces (sidebar) - these are like project folders. Each workspace has its own documents and chat history.
- Settings (gear icon) - where you configure LLM, embedding, and vector database options
- Upload - drag documents into a workspace to make them searchable
Don't upload anything yet - we need to set up the embedding model first so your documents get indexed properly.
The Engine Behind Search
Remember those "number-coordinates" from the previous section? The embedding model is what creates them. It's a small, specialized model that turns text into searchable vectors. Without it, RAG can't find the right chunks to answer your questions.
Pull an embedding model
Ollama has embedding models in its library just like chat models. The one we want is nomic-embed-text - it's small (274MB), fast, and accurate enough for most use cases.
ollama pull nomic-embed-text
This downloads in under a minute on most connections. It's tiny compared to chat models.
Configure AnythingLLM to use it
In AnythingLLM:
- Go to Settings (gear icon)
- Click Embedding Preference
- Select Ollama as the provider
- Set the Ollama URL (same as before - http://localhost:11434)
- Select nomic-embed-text from the model dropdown
- Save
Why nomic-embed-text?
A few reasons it's a solid default:
- Size - 274MB, barely noticeable next to your multi-gigabyte chat models
- Speed - embeds thousands of chunks in seconds
- Quality - competitive with much larger embedding models on standard benchmarks
- Long context - handles chunks up to 8,192 tokens, so it won't choke on longer passages
There are other options (mxbai-embed-large is slightly more accurate, all-minilm is even smaller), but nomic-embed-text hits the sweet spot for most people.
These are different tools for different jobs. Your chat model (llama3.2, mistral) generates text responses. The embedding model (nomic-embed-text) converts text into searchable coordinates. They work together but don't compete for the same resources - on the recommended Ryzen 7 6800H with 32GB RAM, nomic-embed-text uses under 300MB and embedding runs so fast you won't even notice it happening alongside your chat model.
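Under the hood, AnythingLLM asks Ollama for embeddings over plain HTTP. Here's a sketch of that request using Ollama's /api/embeddings endpoint; the payload shape is Ollama's documented one, while the helper name is just ours for illustration.

```python
import json
import urllib.request

def embedding_request(text, model="nomic-embed-text",
                      url="http://localhost:11434/api/embeddings"):
    """Build the HTTP request Ollama expects for an embedding."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    return urllib.request.Request(url, data=payload,
                                  headers={"Content-Type": "application/json"})

req = embedding_request("How do I configure the database connection?")
# Sending it with urllib.request.urlopen(req) returns JSON like
# {"embedding": [0.01, -0.42, ...]} - one number per dimension.
print(json.loads(req.data)["model"])  # nomic-embed-text
```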
Upload, Ask, Get Answers
Everything is configured. Ollama is running with a chat model and an embedding model. AnythingLLM is connected to both. Time to upload a document and actually use this thing.
Step 1 - Create a workspace
In AnythingLLM, click the + button in the sidebar to create a new workspace. Give it a name that describes what you're putting in it - something like "Product Docs" or "Meeting Notes" or "Tax Stuff."
Workspaces keep things organized. Documents in one workspace don't mix with documents in another. Think of them like project folders.
Step 2 - Upload a document
Click the upload icon in your workspace and drag in a document. For your first try, pick something you know well - maybe a README file, a recipe collection, some meeting notes, or a manual for something you own. Knowing the content helps you verify the answers are accurate.
Supported formats include:
- PDF - the most common format people use
- TXT - plain text files
- MD - Markdown documents
- DOCX - Word documents
- CSV - spreadsheet data
If your PDF is actually images of text (like a scanned document), the system can't read it properly. You'll need to run OCR on it first. Most modern PDFs with selectable text work fine.
Step 3 - Watch it process
After uploading, AnythingLLM will chunk the document and create embeddings. You'll see a progress indicator. For a typical 10-20 page PDF, this takes a few seconds. Larger documents (hundreds of pages) might take a minute or two.
This is a one-time cost per document. Once it's embedded, searching through it is nearly instant.
Step 4 - Ask your first question
Switch to the chat tab in your workspace and type a question about the document you just uploaded. Be specific. Instead of "tell me about this document," try something like:
- "What are the main requirements listed in section 3?"
- "How do I configure the database connection?"
- "Summarize the key decisions from the March 15 meeting"
- "What ingredients do I need for the chocolate cake recipe?"
Step 5 - Check the citations
Look at the response. You should see citation markers that point back to specific chunks of your document. Click on them to see exactly which part of the original text the answer came from.
This is one of the most valuable parts of RAG. You're not just getting an answer - you're getting proof of where the answer came from. If something looks wrong, you can check the source directly.
You just asked a question about a local document and got an answer from your own AI, running on your own hardware, without any of your data leaving your machine. That's the whole point.
Putting It to Work
You've seen the basics work. Now let's look at some practical scenarios where document Q&A actually saves you time - the kind of thing you'll find yourself reaching for regularly.
Example 1: Product manual lookup
You bought a new router, NAS, or some piece of hardware with a 150-page PDF manual. Instead of scrolling through it or using Ctrl+F with the exact right keyword, upload it and ask naturally:
- "How do I reset this to factory defaults?"
- "What ports need to be open for remote access?"
- "What's the default admin password?"
- "How do I set up port forwarding?"
The model finds the relevant section and gives you a direct answer. Way faster than hunting through a table of contents.
Example 2: Meeting notes search
Upload a month's worth of meeting notes into a workspace. Then ask things like:
- "What action items were assigned to me in the last two weeks?"
- "What did we decide about the pricing change?"
- "When did we last discuss the API migration?"
- "Summarize what happened in the March 10 standup"
This works especially well if your meeting notes follow a consistent format. The model gets better at finding relevant information when the source material is well-structured.
Example 3: Codebase documentation
Upload your project's README, architecture docs, and API documentation into a workspace. Now you've got a searchable assistant that knows your project:
- "How does authentication work in this project?"
- "What environment variables need to be set?"
- "What's the deployment process?"
- "Which API endpoint handles user registration?"
This is particularly useful for onboarding or for jumping back into a project you haven't touched in a while.
Tips for getting better answers
- Be specific - "What are the system requirements for installation?" works better than "tell me about installation"
- Ask follow-ups - if the first answer is close but not quite right, ask a more targeted follow-up question
- Check citations - always glance at the source chunks to verify the answer makes sense in context
- One topic per workspace - don't dump unrelated documents together. A workspace about "tax documents" will give better results than one with tax docs, recipes, and meeting notes mixed together
- Structured documents work best - documents with clear headings, sections, and formatting produce better chunks and better answers
RAG is really good at finding and summarizing specific information from your documents. It's less good at synthesizing across many documents or answering questions that require reasoning between multiple sources. For those tasks, you might need to ask multiple focused questions and piece things together yourself. Think of it as a very fast research assistant, not an oracle.
Getting the Most Out of It
RAG works well out of the box, but knowing a few details about how it handles documents will help you get better results and fix common issues.
Best document formats
Not all documents are created equal when it comes to RAG:
- PDF with selectable text - works great. This is what most people use.
- Plain text (.txt) - works perfectly. No formatting to get in the way.
- Markdown (.md) - excellent. Headings and structure help with chunking.
- Word documents (.docx) - works well. Tables and formatting are preserved.
- Scanned PDFs (image-based) - won't work without OCR first. The system can't read images of text.
- Heavily formatted PDFs - tables, multi-column layouts, and complex formatting can sometimes chunk poorly. If results seem off, try converting to plain text first.
Chunk size and overlap
AnythingLLM lets you configure how documents get split up. The two settings that matter:
- Chunk size - how big each piece is (default is usually 1,000 characters). Bigger chunks give more context but might include irrelevant info. Smaller chunks are more precise but might miss context.
- Chunk overlap - how much consecutive chunks share (default is usually 20-50 characters). Overlap prevents information from being lost at chunk boundaries.
The defaults work fine for most documents. Only adjust if you're getting answers that seem to be missing context (try bigger chunks) or including too much irrelevant information (try smaller chunks).
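Overlap is easiest to see in code. A sketch of a splitter where consecutive chunks share their boundary characters:

```python
def chunk_with_overlap(text, chunk_size=1000, overlap=50):
    """Split text so each chunk shares `overlap` characters with the next."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("abcdefghij" * 200)   # 2,000-character document
# The tail of each chunk repeats at the head of the next, so a sentence
# sitting on a boundary still appears whole in at least one chunk.
print(chunks[0][-50:] == chunks[1][:50])          # True
```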
"I'm getting wrong answers"
Common causes and fixes:
- The document wasn't properly parsed - check if the document has selectable text. If it's a scanned image, you'll need OCR.
- The question is too vague - "tell me about this" won't help the search find the right chunks. Be specific about what you're looking for.
- The answer spans multiple sections - RAG finds individual chunks. If the answer requires connecting information from different parts of the document, you might need to ask multiple focused questions.
- Wrong model for the job - smaller models (3B-7B) sometimes miss nuance. If accuracy matters, try a larger chat model.
- The information isn't in the document - check the citations. If the model is citing irrelevant chunks, the information you're looking for might not actually be in the uploaded document.
Performance notes
- Embedding is a one-time cost - each document only needs to be processed once. After that, searches are nearly instant.
- Storage is minimal - the vector database for thousands of chunks takes up very little disk space (megabytes, not gigabytes)
- RAM usage - the embedding model (nomic-embed-text) uses very little RAM. Your chat model is still the main consumer.
- Speed scales well - having 100 documents doesn't make queries slower than having 10. Vector search is designed for this.
Backup your data
If you're using Docker, your workspaces and embedded documents live in the anythingllm Docker volume. To back it up:
docker run --rm \
-v anythingllm:/data \
-v $(pwd):/backup \
alpine tar czf /backup/anythingllm-backup.tar.gz /data
This creates a compressed backup of everything - your workspaces, chat history, and embedded documents. Keep a copy somewhere safe.
Where to Go From Here
You've built a local document Q&A system that runs entirely on your own hardware. No cloud APIs, no per-query costs, no data leaving your network. Let's talk about what you can do with it next.
What you've accomplished
- ✓ Learned how RAG works (chunking, embeddings, vector search)
- ✓ Set up AnythingLLM with Ollama
- ✓ Pulled and configured an embedding model
- ✓ Created your first workspace and uploaded a document
- ✓ Asked questions and got cited answers from your own files
Ideas to build on
- Personal knowledge base - upload all your notes, bookmarks, and research. Build a second brain you can actually search.
- Team documentation hub - create workspaces for different projects. Onboard new team members by pointing them at the workspace instead of sending them a pile of docs.
- Study assistant - upload textbook chapters before an exam. Quiz yourself and get explanations with page references.
- Recipe search - dump all your saved recipes into one workspace. Ask "what can I make with chicken and rice?" and get actual answers from your collection.
- Contract review - upload a lease or service agreement. Ask "what happens if I cancel early?" and get the relevant clause cited back to you.
Alternative tools worth knowing
AnythingLLM is great for getting started, but the RAG ecosystem is growing fast:
- PrivateGPT - similar concept, different interface. Worth trying if you want a comparison.
- Open WebUI's built-in RAG - if you're already using Open WebUI (from the previous tutorial), it has document upload built in. Less powerful than AnythingLLM for heavy document work, but convenient for quick uploads.
- LangChain + ChromaDB - for developers who want to build custom RAG pipelines with Python. More work, more control.
Explore next
- → Local Coding Assistant - use your local models to power VS Code autocomplete and AI chat
- → AnythingLLM documentation - API access, multi-user setup, advanced configuration
- → AnythingLLM on GitHub - star the project, follow updates
- → Ollama Advanced - tune your models for better RAG performance
- → Open WebUI - if you haven't set up a chat interface yet
You've built something genuinely useful - a private, local system that can search through your documents and answer questions about them. No subscriptions, no data leaving your network, no limits on how many documents you upload. This is your information, on your terms.