Local Coding Assistant

Set up a private AI coding assistant in VS Code with tab autocomplete and chat. Qwen 2.5 Coder runs locally through Ollama - your proprietary code never leaves your machine.

30 min · Beginner · Free · Updated March 2026

Your Code Stays on Your Machine

GitHub Copilot, Cursor, Codeium - they're all great until you realize your proprietary code is being sent to someone else's server every time you type. If you're working on anything sensitive - client projects, internal tools, startup code - that's a real problem.

The fix is simple: run a coding model locally with Ollama and connect it to VS Code. Same autocomplete, same chat, same "explain this code" workflow - but everything stays on your machine. Zero cloud dependency. Zero per-month cost after setup.

What you'll set up

By the end of this tutorial, you'll have two things working in VS Code:

  • Tab autocomplete - as you type, a local model suggests completions inline (like Copilot)
  • AI chat sidebar - highlight code and ask "explain this," "write tests for this," "refactor this" - answered by your local model

We're using Continue.dev, an open source VS Code extension with native Ollama support. It's the most mature option for local coding assistance and it's completely free.

Will it be as good as Copilot?

Honest answer: no, not quite. Cloud-based tools run massive models on expensive GPU clusters. Your Ryzen mini PC is running a much smaller model on CPU. But here's the thing:

  • Autocomplete is surprisingly good with the right small model (qwen2.5-coder:1.5b)
  • Chat answers are solid for everyday questions - explaining code, writing boilerplate, debugging
  • Your proprietary code never leaves your network
  • No $10-20/month subscription
  • Works offline - on a plane, in a cafe with bad wifi, anywhere

For most working developers, a local coding assistant handles 70-80% of what you'd use Copilot for. You trade some speed and the occasional weaker answer; privacy and zero recurring cost are the payoff.

How this fits into Book III: This tutorial is an optional add-on. The core path is Ollama Basics then Ollama Advanced (where you wire local models into OpenClaw). The coding assistant gives you Copilot-style features in VS Code using the same Ollama instance - handy if you code, but separate from OpenClaw.

Prerequisites

  • Ollama installed and running (see Ollama Basics)
  • VS Code installed (any recent version)
  • The recommended Ryzen 7 6800H mini PC with 32GB RAM, or similar hardware (8+ cores, 16GB+ RAM)
  • About 20 minutes

Two Models, Two Jobs

You need two different models for coding assistance, and this is where most tutorials get it wrong - they tell you to use one model for everything. That doesn't work well on CPU hardware.

Autocomplete needs to be fast - a suggestion that takes 3 seconds to appear is useless. Chat can be slower because you're waiting for a full response anyway. So we use a tiny model for autocomplete and a bigger one for chat.
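
The split is easiest to see as a role-to-model map - one entry per job (an illustrative dict, not Continue's internals):

🖥️ Python
```python
# Role-based model routing, mirroring the two-model setup in this tutorial.
MODELS = {
    "autocomplete": "qwen2.5-coder:1.5b",  # tiny and fast: sub-second ghost text
    "chat": "qwen2.5-coder:7b",            # bigger and slower: full answers
}

print(MODELS["autocomplete"])  # qwen2.5-coder:1.5b
```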

Pull both models

🖥️ Terminal
# Small, fast model for autocomplete (274MB)
ollama pull qwen2.5-coder:1.5b

# Larger model for chat and code generation (4.7GB)
ollama pull qwen2.5-coder:7b

The 1.5B model downloads in under a minute. The 7B model takes a few minutes depending on your connection. Both fit comfortably in 32GB RAM with plenty of room to spare.

Why Qwen 2.5 Coder?

There are several coding models out there (CodeLlama, DeepSeek-Coder, StarCoder). We're using Qwen 2.5 Coder because:

  • Best quality at 7B size - scores 73.7 on the Aider code repair benchmark, competitive with models 10x its size
  • Good at fill-in-the-middle - the 1.5B version is specifically trained for code completion, not just chat
  • Wide language support - Python, JavaScript, TypeScript, Go, Rust, Java, C++, and dozens more
  • Active development - regularly updated by the Qwen team

Performance on the Ryzen 7 6800H

Here's what to expect on the recommended 32GB Ryzen mini PC with CPU-only inference:

  • qwen2.5-coder:1.5b (autocomplete) - ~20-30 tokens/sec, suggestions appear in under a second. Fast enough to feel natural.
  • qwen2.5-coder:7b (chat) - ~10-15 tokens/sec, a typical 200-token answer takes 15-20 seconds. Fine for "explain this function" or "write a test for this." Not instant, but usable.
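
Those throughput numbers translate directly into wall-clock latency: a streamed response takes roughly its token count divided by tokens per second. A quick sanity check:

🖥️ Python
```python
def response_time_s(tokens: int, tokens_per_sec: float) -> float:
    """Rough wall-clock estimate for a fully streamed response."""
    return tokens / tokens_per_sec

# A typical 200-token chat answer from the 7B model at ~12.5 tok/s on CPU:
print(f"{response_time_s(200, 12.5):.0f}s")   # 16s

# A short 15-token autocomplete suggestion from the 1.5B model at ~25 tok/s:
print(f"{response_time_s(15, 25):.1f}s")      # 0.6s
```

This is why the small model handles autocomplete: at suggestion-sized outputs it stays under a second, which the 7B model can't do on CPU.
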

On different hardware

If you have 16GB RAM, stick with the 1.5B model for both autocomplete and chat - the 7B model will swap to disk and become unusable. If you have a dedicated GPU (even a modest one), both models will run significantly faster.

Alternative models worth trying

If Qwen doesn't click for you, these are solid alternatives:

  • deepseek-coder:6.7b - similar quality to Qwen 7B, some people prefer its Python output
  • starcoder2:3b - a middle ground between 1.5B and 7B, good for autocomplete if you want slightly better quality and can tolerate slightly slower suggestions
  • codellama:7b - Meta's coding model, strong at Python and general code tasks

Connect VS Code to Ollama

Continue.dev is an open source VS Code extension that connects to Ollama and gives you AI-powered coding features - tab autocomplete, inline chat, code explanations, refactoring, and test generation. All pointing at your local models.

Step 1 - Install the extension

In VS Code:

  • Open the Extensions panel (Ctrl+Shift+X)
  • Search for "Continue"
  • Install the one by Continue.dev (it has the most downloads)
  • A new icon appears in your sidebar - that's the Continue chat panel

Step 2 - Open the config file

Continue uses a YAML config file to know which models to use and how to connect to them.

Open the Command Palette (Ctrl+Shift+P) and type:

🖥️ VS Code Command Palette
Continue: Open Config File

This opens ~/.continue/config.yaml (or creates it if it doesn't exist).

Step 3 - Configure your models

Replace the contents of config.yaml with this:

🖥️ ~/.continue/config.yaml
name: Local Coding Assistant
version: 0.0.1
schema: v1

models:
  # Chat and code generation (7B - good quality, reasonable speed)
  - name: Qwen 2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    roles:
      - chat
      - edit
    defaultCompletionOptions:
      temperature: 0.7

  # Tab autocomplete (1.5B - fast, lightweight)
  - name: Qwen 1.5B Autocomplete
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete
    autocompleteOptions:
      disable: false
      debounceDelay: 300
      maxPromptTokens: 1024

Here's what the key settings do:

  • roles: [chat, edit] - this model handles the sidebar chat and inline code edits
  • roles: [autocomplete] - this model handles tab completion suggestions
  • debounceDelay: 300 - waits 300ms after you stop typing before requesting a completion. Prevents hammering the CPU while you're still typing.
  • maxPromptTokens: 1024 - limits context sent to the autocomplete model. Lower = faster on CPU.
  • temperature: 0.7 - controls creativity. Lower (0.2-0.5) for more predictable code, higher for more creative suggestions.
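
To make the debounce idea concrete, here's a minimal sketch of the fire-after-quiet behaviour (a hypothetical Debouncer class for illustration, not Continue's actual implementation):

🖥️ Python
```python
import time

class Debouncer:
    """Request a completion only after `delay_ms` of keyboard quiet."""

    def __init__(self, delay_ms: int):
        self.delay_s = delay_ms / 1000
        self.last_keystroke = float("-inf")

    def keystroke(self) -> None:
        # Called on every keypress; resets the quiet timer.
        self.last_keystroke = time.monotonic()

    def should_request(self) -> bool:
        return time.monotonic() - self.last_keystroke >= self.delay_s

d = Debouncer(300)
d.keystroke()
print(d.should_request())   # False - still inside the 300ms window
time.sleep(0.35)
print(d.should_request())   # True - typing has paused long enough
```

A higher delay means fewer wasted requests while you're mid-thought, at the cost of suggestions appearing a beat later.
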

Step 4 - Enable tab autocomplete

Look at the VS Code status bar (bottom right). You should see a Continue icon. Click it and make sure "Enable Tab Autocomplete" is checked.

Once enabled, start typing in any file and you'll see ghost text suggestions appear after a brief pause. Press Tab to accept, Esc to dismiss.

Step 5 - Test the chat

Click the Continue icon in the sidebar to open the chat panel. Type a question like "write a Python function that checks if a string is a palindrome" and hit Enter.

The response will stream in from your local Qwen 7B model. It'll take 10-20 seconds for a typical answer on the Ryzen 6800H - slower than cloud APIs, but completely private.
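
Under the hood, Continue talks to Ollama's HTTP API on localhost:11434. The chat request it sends looks roughly like this (shape follows Ollama's /api/chat endpoint; the exact fields Continue sets are an assumption):

🖥️ Python
```python
import json

# Approximate shape of the request body POSTed to
# http://localhost:11434/api/chat (Ollama's chat endpoint).
payload = {
    "model": "qwen2.5-coder:7b",  # the chat model from config.yaml
    "messages": [
        {"role": "user",
         "content": "Write a Python function that checks if a string is a palindrome"}
    ],
    "stream": True,  # tokens stream back as they are generated
}
print(json.dumps(payload, indent=2))
```

If chat ever stops working, sending a payload like this to the same endpoint with curl tells you whether the problem is Ollama or the extension.
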

You're up and running

If you see autocomplete suggestions and chat responses, everything is working. The next section covers the workflows that make this actually useful day-to-day.

How to Actually Use This

Having an AI coding assistant is one thing. Knowing when and how to use it effectively is another. Here are the workflows that actually save time in day-to-day development.

Explain unfamiliar code

You're reading someone else's code (or your own from six months ago). Highlight a function or block, then press Ctrl+L to send it to the Continue chat. Ask:

  • "What does this function do?"
  • "Why would someone write it this way?"
  • "What are the edge cases here?"

This is probably the most valuable daily use case. Understanding code faster means moving faster. And your codebase never leaves your machine to get that explanation.

Write boilerplate

Nobody likes writing boilerplate. In the Continue chat, ask for the boring stuff:

  • "Write a React component for a login form with email and password fields"
  • "Create a Python class for a database connection with retry logic"
  • "Write an Express middleware that validates JWT tokens"
  • "Generate a Dockerfile for a Node.js app with multi-stage build"

Review the output, paste what's useful, modify what isn't. The model handles the structure and you handle the details.

Generate tests

Highlight a function, send it to chat with Ctrl+L, and ask:

  • "Write unit tests for this function"
  • "Write edge case tests - what could break this?"
  • "Generate pytest tests for this with mocking"

The 7B model is surprisingly good at generating test scaffolding. You'll almost always need to adjust the assertions, but getting the test structure and setup/teardown written for you saves real time.

Debug errors

Copy a stack trace or error message, paste it into the Continue chat, and ask "what's causing this?" or "how do I fix this?" Include the relevant code if the model needs context.

Example:

🖥️ Continue Chat
I'm getting this error:

TypeError: Cannot read properties of undefined (reading 'map')
  at UserList (UserList.jsx:12)

Here's the component:

function UserList({ users }) {
  return users.map(u => <div key={u.id}>{u.name}</div>);
}

What's wrong and how do I fix it?

The model will identify that users could be undefined and suggest adding a default value or guard clause. Not groundbreaking, but faster than searching Stack Overflow when you already know it's something simple.
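
The same bug and its fix translate to any language. In Python terms (an illustrative analog, not the model's literal answer):

🖥️ Python
```python
# Iterating over None raises TypeError - the Python analog of calling
# .map on undefined. The fix is the same: guard with a default.
def user_names(users=None):
    return [u["name"] for u in (users or [])]

print(user_names())                            # []
print(user_names([{"id": 1, "name": "Ada"}]))  # ['Ada']
```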

Refactor inline

Highlight code in your editor and press Ctrl+I to open an inline edit prompt. Type what you want changed:

  • "Convert this to async/await"
  • "Add error handling to this function"
  • "Rewrite this using a reduce instead of a for loop"
  • "Add TypeScript types to these parameters"

Continue shows a diff of the proposed changes. Accept them, reject them, or modify them before they're applied. You stay in control.

Tab autocomplete patterns that work well

The autocomplete model works best when it has context about what you're building. A few patterns that consistently get good suggestions:

  • Write a comment first - type // fetch user by ID and return formatted response then start the function. The model uses the comment as context.
  • Start with the function signature - type function validateEmail(email: string): boolean { and let autocomplete fill in the body
  • Pattern continuation - if you've written a similar function above, the model picks up on the pattern and suggests consistent code for the next one
  • Import suggestions - type import and the model suggests likely imports based on what you're using in the file
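
Here's the comment-first pattern in Python: the comment plus the signature acts as the prompt, and the body is the kind of completion it nudges the 1.5B model toward (the body below is a hand-written plausible completion, not captured model output):

🖥️ Python
```python
import re

# validate an email address and return True/False
def validate_email(email: str) -> bool:
    # The comment and signature above steer the completion model toward
    # a body like this one.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email) is not None

print(validate_email("dev@example.com"))  # True
print(validate_email("not-an-email"))     # False
```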

When not to use it

Local models struggle with very project-specific logic, complex architecture decisions, or code that depends heavily on your particular codebase context. Use it for the mechanical parts of coding - boilerplate, tests, explanations, refactoring - and do the thinking yourself.

Making It Feel Right

The defaults work, but a few tweaks can make your local coding assistant feel noticeably better on CPU-only hardware. These are the settings that actually matter.

Autocomplete speed tuning

If autocomplete feels laggy or too aggressive, adjust these values in your config.yaml:

🖥️ ~/.continue/config.yaml (autocomplete model section)
    autocompleteOptions:
      disable: false
      debounceDelay: 300      # ms to wait after typing stops (increase if CPU is struggling)
      maxPromptTokens: 1024   # context sent to model (lower = faster)
      modelTimeout: 3000      # max ms to wait for a response (increase if timing out)
      onlyMyCode: true        # only use your project files for context

What to change and when:

  • Suggestions too slow? Lower maxPromptTokens to 512. Less context = faster responses.
  • Suggestions appearing while you're still typing? Increase debounceDelay to 400-500.
  • Suggestions timing out? Increase modelTimeout to 5000.
  • CPU usage too high? Set onlyMyCode: true to limit context to your project files only.

Keep Ollama models loaded

By default, Ollama unloads models from RAM after 5 minutes of inactivity. This means the first autocomplete suggestion after a break has a cold-start delay while the model reloads. To keep models in memory longer:

🖥️ Terminal
# Keep models loaded for 1 hour
export OLLAMA_KEEP_ALIVE=1h

# Or keep loaded indefinitely (until Ollama restarts)
export OLLAMA_KEEP_ALIVE=-1

Add this to your ~/.bashrc or ~/.zshrc to make it permanent. On the 32GB Ryzen machine, both the 1.5B and 7B models fit in RAM together with plenty of headroom - keeping them loaded costs very little.

Limit parallel requests

On CPU-only hardware, parallel model requests split your processing power and slow everything down. Tell Ollama to handle one request at a time:

🖥️ Terminal
export OLLAMA_NUM_PARALLEL=1

This means if you're getting a chat response and autocomplete fires at the same time, one waits for the other. On CPU, this actually gives better total throughput than trying to run both simultaneously.
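
A back-of-envelope model shows why. Splitting the cores makes both requests finish late, while serializing gets the first answer out in half the time (throughput figure assumed; real parallel inference also adds cache contention, which is why serial usually wins on total throughput too):

🖥️ Python
```python
# Idealized model: all cores give ~12 tok/s on one request; two
# simultaneous requests split the cores and each runs at half speed.
tps = 12.0    # tokens/sec with all cores on one request (assumed figure)
tokens = 200  # response length per request

parallel = tokens / (tps / 2)        # both responses finish together
sequential_first = tokens / tps      # first response done
sequential_both = 2 * tokens / tps   # second response done

# parallel ~33.3s for both; sequential ~16.7s for the first, ~33.3s for both
print(round(parallel, 1), round(sequential_first, 1), round(sequential_both, 1))
```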

Chat temperature for different tasks

The temperature setting in your config controls how creative vs. predictable the model's output is. You can adjust it per-conversation in the Continue chat, or change the default in config.yaml:

  • 0.1-0.3 - for test generation, bug fixes, type annotations (you want predictable, correct output)
  • 0.5-0.7 - for general coding help, explanations, refactoring (good balance)
  • 0.8-1.0 - for brainstorming approaches, generating examples, creative solutions (more varied output)
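
If you want a rule of thumb in code form, the ranges above collapse to a small lookup (my own helper for illustration, not part of Continue or Ollama):

🖥️ Python
```python
# Rule-of-thumb temperature picker matching the ranges above.
def pick_temperature(task: str) -> float:
    precise = {"tests", "bugfix", "types"}
    creative = {"brainstorm", "examples"}
    if task in precise:
        return 0.2   # predictable, repeatable output
    if task in creative:
        return 0.9   # more varied output
    return 0.6       # general coding help, explanations, refactoring

print(pick_temperature("tests"))     # 0.2
print(pick_temperature("refactor"))  # 0.6
```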

Memory usage on the 32GB Ryzen

Here's what running both coding models looks like on the recommended hardware:

  • qwen2.5-coder:1.5b - uses about 1.5GB RAM when loaded
  • qwen2.5-coder:7b - uses about 5GB RAM when loaded
  • Both loaded together - about 6.5GB total, leaving ~25GB free for your OS, VS Code, browser, Docker containers, and everything else

You won't notice any system slowdown. The Ryzen 6800H handles this without breaking a sweat.

Running other Ollama models at the same time?

If you're also running Open WebUI or the RAG setup from earlier tutorials, keep in mind that each loaded model takes RAM. On 32GB you can comfortably run 2-3 models simultaneously. Ollama automatically unloads idle models when memory gets tight, so it manages itself pretty well.

Where to Go From Here

You've got a working local coding assistant - tab autocomplete and AI chat running entirely on your Ryzen mini PC. Here's what to explore next.

What you've accomplished

  • ✓ Pulled purpose-built coding models (qwen2.5-coder 1.5B + 7B)
  • ✓ Installed and configured Continue.dev in VS Code
  • ✓ Set up tab autocomplete with a fast, lightweight model
  • ✓ Set up AI chat for code explanations, tests, and refactoring
  • ✓ Tuned settings for optimal CPU-only performance

Try different models

The Qwen models are a great starting point, but new coding models come out regularly. Worth trying:

  • deepseek-coder-v2:16b - if you want to push quality (slower on CPU, but noticeably smarter for complex tasks)
  • starcoder2:3b - a middle ground for autocomplete if 1.5B feels too limited
  • codellama:7b - different style of code generation, worth comparing to Qwen

Swap models in your config.yaml and restart VS Code. Keep what works best for your language and coding style.

Other tools worth knowing about

  • Twinny - another open source VS Code extension for local code completion. Simpler than Continue, focused purely on autocomplete.
  • CodeGPT - VS Code extension that supports Ollama. Free tier available, nice UI.
  • Cline - open source VS Code extension for agentic coding (the model can edit files, run commands). More experimental, best with larger models.

Explore next

Local coding assistant tutorial complete.

You're writing code with AI assistance that never leaves your machine. No subscriptions, no data collection, no proprietary code sent to third-party servers. Your code, your tools, your rules.