
Run LLMs Locally with Ollama: Llama, Mistral, Phi, Qwen, and DeepSeek on Your Machine

Intermediate · 90 min · 3 exercises · 55 XP

Your API bill hit $200 last month. Your client's legal team says patient data can't leave the building. Your flight has no Wi-Fi and you need to prototype a feature. These aren't hypotheticals — they're the three conversations that pushed me to run LLMs locally.

Ollama makes it embarrassingly simple: one install, one command, and you have a model that rivals GPT-3.5 running entirely on your laptop.

Why Run LLMs Locally?

Last year a client told me: "Our patient records can't leave our servers. Period." No amount of OpenAI's privacy policies or BAA agreements changed their mind. I needed a model that ran inside their firewall. That project is what sold me on local LLMs.

The case for running models locally comes down to five things. Privacy — your data never leaves your machine, full stop. Cost — after the initial download, every inference is free. Speed — no network round-trip means lower latency for short prompts. Offline capability — airports, trains, submarines. No rate limits — run 10,000 requests an hour if your hardware can handle it.

Cloud API (OpenAI, Anthropic)
# Pros:
# - Best-in-class models (GPT-4o, Claude)
# - No hardware requirements
# - Always up-to-date
#
# Cons:
# - Pay per token ($2-15 per million tokens)
# - Data leaves your machine
# - Rate limits (TPM, RPM)
# - Requires internet
# - Vendor lock-in risk
Local LLM (Ollama)
# Pros:
# - 100% private — data stays on your machine
# - Free after download
# - No rate limits
# - Works offline
# - Full control over model version
#
# Cons:
# - Smaller models than cloud (7B-70B vs 1T+)
# - Requires decent hardware (8GB+ RAM)
# - You manage updates

This tutorial uses Ollama 0.5+ and targets models available as of early 2026. The landscape moves fast — model names and sizes may shift, but the patterns you learn here stay the same.

Installing Ollama and Your First Model

Install Ollama
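The Linux one-liner below is the documented install script; on macOS, Homebrew or the app download from ollama.com both work, and Windows has a standalone installer:

```shell
# Linux: official install script
curl -fsSL https://ollama.com/install.sh | sh

# macOS alternative
brew install ollama

# Confirm it's on your PATH
ollama --version
```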

After installation, the ollama command is available in your terminal. The Ollama server starts automatically in the background and listens on localhost:11434. Pull your first model:

Pull your first model
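llama3.2 below is an example tag; any model from the Ollama library works the same way:

```shell
# Download the model weights (a few GB; progress is shown)
ollama pull llama3.2
```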

Once the download finishes, you can chat with the model directly in your terminal. This is the fastest way to kick the tires on a new model before writing any Python.

Run and list models
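Three commands cover the day-to-day workflow:

```shell
# Chat interactively in the terminal (type /bye to exit)
ollama run llama3.2

# List every model on disk, with sizes
ollama list

# Show which models are loaded in memory right now
ollama ps
```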

The Ollama Python Library

What if you want to use Ollama from Python, not the terminal? The official ollama Python library wraps the REST API in a clean interface.

Install the ollama Python package
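It installs like any other package:

```shell
pip install ollama
```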
Basic chat with ollama.chat()
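A minimal call, sketched as a reusable helper. The model tag is an example, and the timing-field names follow the ollama 0.4+ response object; all durations are reported in nanoseconds:

```python
def chat_once(model: str, prompt: str):
    """Send one user message; return (reply_text, timings_in_seconds)."""
    import ollama  # requires `pip install ollama` and a running server

    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    timings = {
        "total_s": response.total_duration / 1e9,  # durations arrive in nanoseconds
        "load_s": response.load_duration / 1e9,
        "eval_s": response.eval_duration / 1e9,
    }
    return response.message.content, timings

# Usage (with the server running and llama3.2 pulled):
# reply, timings = chat_once("llama3.2", "Explain Python decorators in one paragraph.")
```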

The response object includes timing data that cloud APIs don't give you — total duration, model load time, prompt evaluation time. I find this invaluable for performance tuning.

For long responses, streaming prints tokens as they arrive instead of waiting for the full response. The user experience difference is dramatic.

Streaming responses
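A sketch of the streaming pattern: with stream=True, ollama.chat returns an iterator of chunks instead of one final response (chunk fields per the 0.4+ library):

```python
def stream_chat(model: str, prompt: str) -> str:
    """Print tokens as they arrive; return the assembled reply."""
    import ollama  # requires a running Ollama server

    pieces = []
    for chunk in ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # yield partial responses instead of one final object
    ):
        piece = chunk.message.content
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return "".join(pieces)

# Usage:
# stream_chat("llama3.2", "Write a haiku about local inference.")
```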

Multi-turn conversations work the same way as the OpenAI API. You maintain a list of messages and append each exchange.

Multi-turn conversations
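A sketch of the bookkeeping: one shared list, two appends per turn (the model tag is an example):

```python
def chat_turn(model: str, messages: list, user_text: str) -> str:
    """Append the user message, get a reply, append the reply, return it."""
    import ollama

    messages.append({"role": "user", "content": user_text})
    reply = ollama.chat(model=model, messages=messages).message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

# Usage: the shared `history` list is what gives the model memory.
# history = []
# chat_turn("llama3.2", history, "My name is Sam.")
# chat_turn("llama3.2", history, "What's my name?")  # sees the earlier exchange
```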

Model loading is worth remembering: Ollama loads a model into memory on the first call, which can take several seconds, then keeps it resident for subsequent calls. If your script makes one call then exits, you pay that loading cost every time. For interactive tools or servers that make repeated calls, the first-call penalty is negligible.

One thing that tripped me up early: the messages list is your responsibility. Unlike the chat interface in the terminal, the Python library doesn't track history for you. Append each response to the list yourself, or the model loses context between calls.

Quick Check

What happens if you call ollama.chat() with a model you haven't pulled yet?

Answer: The call fails with a "model not found" error. Unlike ollama run in the terminal, which auto-pulls missing models, the API expects the model to already be on disk, so download it first with ollama pull. I always pre-pull models to avoid surprises in production code.

The OpenAI-Compatible API

Here's why Ollama is winning the local LLM race: it exposes an OpenAI-compatible endpoint at localhost:11434/v1. That means any code written for the OpenAI API works with Ollama — change one line.


Study the local version in the comparison below. It's the standard OpenAI library; the only differences are base_url and api_key. The response object has the same .choices[0].message.content structure, so your existing OpenAI code ports over with zero refactoring.

Cloud OpenAI

```python
from openai import OpenAI

client = OpenAI(
    # Uses OPENAI_API_KEY env var
    # Sends data to api.openai.com
)
response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[
        {'role': 'user',
         'content': 'Explain decorators'}
    ],
)
```

Local Ollama (same code)

```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',
)
response = client.chat.completions.create(
    model='llama3.2',
    messages=[
        {'role': 'user',
         'content': 'Explain decorators'}
    ],
)
```
Exercise: Build an LLM Client Config
Write Code

Write a function build_llm_config(provider) that returns a dictionary with the correct base_url, api_key, and default_model for a given provider.

Supported providers:

  • "ollama" → base_url: "http://localhost:11434/v1", api_key: "ollama", default_model: "llama3.2"
  • "openai" → base_url: "https://api.openai.com/v1", api_key: "sk-...", default_model: "gpt-4o-mini"
  • Any other string → raise ValueError with message f"Unknown provider: {provider}"
This is the pattern you'd use to switch between local Ollama and cloud APIs with one config change.

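If you get stuck, here is one solution that satisfies the spec (try it yourself first):

```python
def build_llm_config(provider):
    """Return connection settings for a known provider, per the spec above."""
    configs = {
        "ollama": {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",
            "default_model": "llama3.2",
        },
        "openai": {
            "base_url": "https://api.openai.com/v1",
            "api_key": "sk-...",
            "default_model": "gpt-4o-mini",
        },
    }
    if provider not in configs:
        raise ValueError(f"Unknown provider: {provider}")
    return configs[provider]
```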

    Choosing the Right Model

    The model you choose matters more than any parameter you tune. I've watched people spend hours tweaking temperature and top_p on a 3B model when switching to a 7B model would have solved their problem instantly.

    Six model families dominate the open-source LLM space right now. Each has a different sweet spot.

    Llama 3.2 / 3.3 (Meta) — The default choice. Strong all-around performance, huge community. Sizes: 1B, 3B, 8B, 70B, 405B. The 8B model is the community's workhorse.

    Mistral / Mixtral (Mistral AI) — Punches above its weight on reasoning and code. Mixtral is a "mixture of experts" architecture that's faster than its parameter count suggests. Sizes: 7B (Mistral), 8x7B (Mixtral).

    Phi-3 / Phi-4 (Microsoft) — Small but surprisingly capable. The 3.8B model competes with models twice its size on benchmarks. Great for resource-constrained environments.

    Qwen 2.5 (Alibaba) — Excellent multilingual support, especially CJK languages. Strong on code and math. Sizes: 0.5B to 72B.

    DeepSeek-R1 (DeepSeek) — Specializes in chain-of-thought reasoning. Shows its "thinking" process before giving a final answer. Impressive on math and logic puzzles.

    Gemma 2 (Google) — Google's open model. The 9B variant is competitive with larger models. Good instruction-following out of the box.

    Pull multiple models for comparison
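The tags below are examples of current family names; check the Ollama library for the exact tags available when you read this:

```shell
ollama pull llama3.2
ollama pull mistral
ollama pull phi3
ollama pull qwen2.5
ollama pull gemma2
```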
Size tiers give you a rough quality-to-resource tradeoff:

| Size | RAM needed | Speed | Quality | Best for |
|---------|------------|-----------|------------|-----------------------------------------|
| 1B-3B | 4-6 GB | Very fast | Limited | Quick tasks, autocomplete, drafts |
| 7B-8B | 8-10 GB | Fast | Good | General use, coding, chat |
| 13B-14B | 16 GB | Moderate | Better | Complex reasoning, detailed writing |
| 70B+ | 48+ GB | Slow | Near-cloud | When you need the best quality locally |
    Inspect model details
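One command prints a model's card:

```shell
# Parameter count, quantization, context length, and license for a model
ollama show llama3.2
```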

    Quantization — Speed vs Quality Trade-off

    Your 70B model doesn't fit in memory. Do you really need all those decimal places? Probably not. That's what quantization solves.

    A model's parameters are numbers — weights in a neural network. At full precision (FP16), each parameter takes 2 bytes. A 7B model at FP16 needs ~14GB of RAM. Quantization reduces each parameter to fewer bits: 8-bit (Q8) cuts the size in half, 4-bit (Q4) cuts it to a quarter. The tradeoff is a small loss in output quality.
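The back-of-envelope formula is just parameters times bytes per parameter. A tiny sketch (weights only; runtime overhead such as the KV cache comes on top):

```python
def approx_model_ram_gb(params_billions: float, bits_per_param: int) -> float:
    """Rough weight-memory estimate: parameter count x bytes per parameter."""
    bytes_total = params_billions * 1e9 * (bits_per_param / 8)
    return bytes_total / 1e9  # decimal GB, good enough for a ballpark

print(approx_model_ram_gb(7, 16))  # 14.0  (FP16: ~14 GB)
print(approx_model_ram_gb(7, 8))   # 7.0   (Q8)
print(approx_model_ram_gb(7, 4))   # 3.5   (Q4)
```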

    Pull specific quantizations
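Quantization lives in the model tag. Exact tag names vary per model, so confirm them on the model's tags page in the Ollama library; these follow the common pattern:

```shell
# Example tags only; check the model's page for what actually exists
ollama pull llama3.2:3b-instruct-q8_0
ollama pull llama3.2:3b-instruct-q4_K_M
```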
Here's what those tags mean in practice:

| Tag | Bits | Size vs FP16 | Quality loss | Use it for |
|--------|-----------|-----------------|--------------|----------------------------------------|
| fp16 | 16 | 1.0x (baseline) | None | Benchmarking, reference quality |
| q8_0 | 8 | 0.5x | Negligible | When quality matters and RAM allows |
| q4_K_M | 4 (mixed) | 0.3x | Small | Default choice — best bang for buck |
| q4_0 | 4 | 0.25x | Moderate | Maximum speed, tight RAM |

    My rule of thumb: use Q4_K_M unless you have a specific reason not to. If you're doing evaluations or comparing model quality, use Q8_0 or FP16 to eliminate quantization as a variable.

    Exercise: Recommend the Right Quantization
    Write Code

    Write a function recommend_quantization(ram_gb, priority) that recommends a quantization level based on available RAM and whether the user prioritizes "speed" or "quality".

    Rules:

  • If ram_gb < 8: always return "q4_0" (only option that fits)
  • If ram_gb >= 8 and priority == "speed": return "q4_K_M"
  • If ram_gb >= 8 and ram_gb < 16 and priority == "quality": return "q4_K_M" (can't fit q8)
  • If ram_gb >= 16 and priority == "speed": return "q4_K_M"
  • If ram_gb >= 16 and ram_gb < 32 and priority == "quality": return "q8_0"
  • If ram_gb >= 32 and priority == "quality": return "fp16"
  • If ram_gb >= 32 and priority == "speed": return "q4_K_M"
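If you get stuck, here is one solution that encodes the rules above (attempt it yourself first):

```python
def recommend_quantization(ram_gb, priority):
    """Map available RAM and priority to a quantization tag, per the rules above."""
    if ram_gb < 8:
        return "q4_0"      # only option that fits
    if priority == "speed":
        return "q4_K_M"    # speed picks q4_K_M at every tier from 8 GB up
    if ram_gb < 16:
        return "q4_K_M"    # quality, but q8_0 won't fit
    if ram_gb < 32:
        return "q8_0"
    return "fp16"
```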

    Benchmarking Local Models

    Numbers don't lie. Before committing to a model for a project, measure it. I've been burned by benchmark leaderboards that don't match real-world performance on my specific tasks.

    Benchmark function
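A sketch that combines wall-clock time with the response's eval_count and eval_duration fields (the API reports durations in nanoseconds):

```python
import time

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's timing fields (duration in nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str, prompt: str) -> dict:
    """Time one generation and report speed for this model on this hardware."""
    import ollama  # requires a running server and a pulled model

    start = time.perf_counter()
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    wall = time.perf_counter() - start
    return {
        "model": model,
        "wall_seconds": round(wall, 2),
        "tokens": response.eval_count,
        "tokens_per_sec": round(
            tokens_per_second(response.eval_count, response.eval_duration), 1
        ),
    }
```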

    To compare models head-to-head, run the same prompt through each one. The function below benchmarks multiple models and prints a comparison table.

    Compare models side by side
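One way to sketch the comparison loop. It takes the benchmarking function as a parameter, so any callable returning a dict with 'model', 'wall_seconds', and 'tokens_per_sec' keys will do:

```python
def compare_models(models, prompt, bench):
    """Run bench(model, prompt) for each model and print an aligned table."""
    rows = [bench(m, prompt) for m in models]
    lines = [f"{'model':<20} {'seconds':>8} {'tok/s':>8}"]
    for r in rows:
        lines.append(
            f"{r['model']:<20} {r['wall_seconds']:>8} {r['tokens_per_sec']:>8}"
        )
    table = "\n".join(lines)
    print(table)
    return table
```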

    Real-World Pattern — Local Code Review Assistant

    Let's build something useful. This code review assistant reads a Python file from disk, sends it to your local model, and streams the review back. No data leaves your machine.

    Local code review assistant
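A sketch of the assistant; the prompt wording and default model tag are illustrative choices:

```python
from pathlib import Path

REVIEW_PROMPT = (
    "You are a careful code reviewer. Review the following Python file. "
    "Point out bugs, style issues, and concrete improvements.\n\n{code}"
)

def review_file(path: str, model: str = "llama3.2") -> str:
    """Read a local file and ask a local model to review it."""
    import ollama  # requires a running Ollama server

    code = Path(path).read_text()
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": REVIEW_PROMPT.format(code=code)}],
    )
    return response.message.content

# Usage:
# print(review_file("my_script.py"))
```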

    The streaming version gives a better experience for longer files. Tokens appear as they're generated instead of making you wait for the full review.

    Streaming code review
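The same idea with stream=True, so tokens print as they're generated (prompt wording is illustrative):

```python
from pathlib import Path

def stream_review(path: str, model: str = "llama3.2") -> None:
    """Stream the review token by token so long files don't feel frozen."""
    import ollama  # requires a running Ollama server

    code = Path(path).read_text()
    prompt = f"Review this Python code for bugs and style issues:\n\n{code}"
    for chunk in ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        print(chunk.message.content, end="", flush=True)
    print()
```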
    Exercise: Build a Review Report Formatter
    Write Code

    Write a function format_review_report(reviews) that takes a dictionary of {filename: review_text} and returns a formatted report string.

    Rules:

  • Start with a header line: "Code Review Report (X files)"
  • For each file, add: "\n--- filename ---\n" followed by the review text
  • End with "\n=== END OF REPORT ==="
  • Files should appear in sorted order by filename
This is the output format you'd use to display results from a batch code review with your local LLM.

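If you get stuck, here is one solution that satisfies the spec (attempt it yourself first):

```python
def format_review_report(reviews):
    """Build the batch-review report string described above."""
    report = f"Code Review Report ({len(reviews)} files)"
    for filename in sorted(reviews):
        report += f"\n--- {filename} ---\n{reviews[filename]}"
    report += "\n=== END OF REPORT ==="
    return report
```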

    Common Mistakes

    I've debugged all of these at least twice. Save yourself the headache.

    Not Enough RAM

    If your model runs at 2 tokens/second instead of 30, it's probably swapping to disk. Check with ollama ps — if the model size exceeds your available RAM, switch to a smaller model or a more aggressive quantization.

    Forgetting to Start the Ollama Server

    If you get ConnectionRefusedError, the Ollama server isn't running. On macOS and Windows, it starts automatically. On Linux, you may need to run ollama serve in a separate terminal or set it up as a systemd service.

    Ignoring the Context Window

    Ollama defaults to a 2048-token context window. If your input is longer, the model silently truncates it. For code review or long documents, increase it:

    Set context window size
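With the Python library, the window is a per-request option. num_ctx is the real option name; 8192 is an example size, and larger windows use more RAM:

```python
def chat_long(model: str, prompt: str, num_ctx: int = 8192) -> str:
    """Chat with an enlarged context window for long inputs."""
    import ollama  # requires a running Ollama server

    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options={"num_ctx": num_ctx},  # overrides the 2048-token default
    )
    return response.message.content
```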

    Using Too Large a Model

    Bigger isn't always better. A 70B model running in 4-bit quantization on a 32GB machine will be slower than a 7B model at Q8. And for many tasks — summarization, simple Q&A, code formatting — the quality difference is negligible. Profile your specific use case before upsizing.

    Summary and Next Steps

    You went from zero to running LLMs on your own machine. Here's what you now know:

  • Install Ollama with one command and pull models with ollama pull
  • Use the `ollama` Python library for native chat, streaming, and multi-turn conversations
  • Use the OpenAI-compatible API to swap between cloud and local with one line
  • Choose models based on your task, hardware, and quality requirements
  • Quantization trades precision for speed — Q4_K_M is the sweet spot for most use cases
  • Benchmark before committing — measure tokens/second on your actual hardware
Next steps to level up:

  • Pull 3-4 models and benchmark them on your specific tasks
  • Build a RAG (Retrieval-Augmented Generation) pipeline with local embeddings
  • Use Ollama in Docker for reproducible deployments: docker run -d -p 11434:11434 ollama/ollama
  • Explore custom Modelfiles to fine-tune system prompts and parameters per model
Frequently Asked Questions

    Can I use Ollama with a GPU?

    Yes. Ollama auto-detects NVIDIA GPUs (via CUDA) and Apple Silicon (via Metal). No configuration needed — if a GPU is available, Ollama uses it. Check with ollama ps to confirm. On multi-GPU setups, Ollama can split models across GPUs automatically.

    Is Ollama free for commercial use?

    Ollama itself is MIT-licensed — use it however you want. But each model has its own license. Llama models require accepting Meta's community license. Mistral and Phi use Apache 2.0. DeepSeek uses its own permissive license. Always check the model's license on the Ollama model page before shipping.

    Can I run multiple models simultaneously?

    Yes, but each model consumes RAM. If you load Llama 3.2 (5GB) and Mistral (5GB) at the same time, that's 10GB of RAM. Ollama unloads inactive models after 5 minutes by default. You can change this with the OLLAMA_KEEP_ALIVE environment variable.

    How does Ollama compare to llama.cpp?

    Ollama wraps llama.cpp (and other backends) with a user-friendly CLI and REST API. You get the same inference speed as raw llama.cpp but with model management, an OpenAI-compatible endpoint, and automatic GPU detection built in. If you need maximum control over inference parameters, use llama.cpp directly. For everything else, Ollama saves you hours of setup.

    References

  • Ollama Official Documentation — installation, CLI reference, model library
  • Ollama Python Library (GitHub) — Python client source code and examples
  • Ollama OpenAI Compatibility — using the OpenAI-compatible endpoint
  • GGUF Format Specification — the model format Ollama uses internally
  • Hugging Face Open LLM Leaderboard — benchmark comparisons across open models
  • llama.cpp (GitHub) — the inference engine behind Ollama
Versions used in this tutorial: Ollama 0.5+, Python 3.12, ollama library 0.4+, openai library 1.x. Tested March 2026.
