Run LLMs Locally with Ollama: Llama, Mistral, Phi, Qwen, and DeepSeek on Your Machine
Your API bill hit $200 last month. Your client's legal team says patient data can't leave the building. Your flight has no Wi-Fi and you need to prototype a feature. These aren't hypotheticals — they're the three conversations that pushed me to run LLMs locally.
Ollama makes it embarrassingly simple: one install, one command, and you have a model that rivals GPT-3.5 running entirely on your laptop.
Why Run LLMs Locally?
Last year a client told me: "Our patient records can't leave our servers. Period." No OpenAI privacy policy or BAA changed their mind. I needed a model that ran inside their firewall. That project is what sold me on local LLMs.
The case for running models locally comes down to five things. Privacy — your data never leaves your machine, full stop. Cost — after the initial download, every inference is free. Speed — no network round-trip means lower latency for short prompts. Offline capability — airports, trains, submarines. No rate limits — run 10,000 requests an hour if your hardware can handle it.
```
# Cloud APIs
#
# Pros:
# - Best-in-class models (GPT-4o, Claude)
# - No hardware requirements
# - Always up-to-date
#
# Cons:
# - Pay per token ($2-15 per million tokens)
# - Data leaves your machine
# - Rate limits (TPM, RPM)
# - Requires internet
# - Vendor lock-in risk
```

```
# Local with Ollama
#
# Pros:
# - 100% private — data stays on your machine
# - Free after download
# - No rate limits
# - Works offline
# - Full control over model version
#
# Cons:
# - Smaller models than cloud (7B-70B vs 1T+)
# - Requires decent hardware (8GB+ RAM)
# - You manage updates
```

This tutorial uses Ollama 0.5+ and targets models available as of early 2026. The landscape moves fast — model names and sizes may shift, but the patterns you learn here stay the same.
Installing Ollama and Your First Model
After installation, the ollama command is available in your terminal. The Ollama server starts automatically in the background and listens on localhost:11434. Pull your first model:
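For example, to grab Llama 3.2 (any tag from the Ollama model library works here):

```shell
# Download the default llama3.2 tag; progress is shown in the terminal
ollama pull llama3.2
```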
Once the download finishes, you can chat with the model directly in your terminal. This is the fastest way to kick the tires on a new model before writing any Python.
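For example, assuming llama3.2 is already pulled:

```shell
# Opens an interactive chat session in the terminal; /bye exits
ollama run llama3.2
```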
The Ollama Python Library
What if you want to use Ollama from Python, not the terminal? The official ollama Python library wraps the REST API in a clean interface.
The response object includes timing data that cloud APIs don't give you — total duration, model load time, prompt evaluation time. I find this invaluable for performance tuning.
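Those durations are reported in nanoseconds, so a small conversion helper (my own, not part of the library) makes them readable:

```python
def timing_summary(response):
    """Convert Ollama's nanosecond timing fields into seconds and tokens/sec.

    Assumes the response exposes total_duration, load_duration,
    eval_count, and eval_duration, as ollama.chat() responses do.
    """
    ns = 1_000_000_000
    return {
        'total_s': response['total_duration'] / ns,
        'load_s': response['load_duration'] / ns,
        'tokens_per_s': response['eval_count'] / (response['eval_duration'] / ns),
    }
```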
For long responses, streaming prints tokens as they arrive instead of waiting for the full response. The user experience difference is dramatic.
Multi-turn conversations work the same way as the OpenAI API. You maintain a list of messages and append each exchange.
Model loading is worth remembering: Ollama loads the model into memory on the first request, which can take several seconds, then keeps it warm for subsequent calls. If your script makes one call then exits, you pay the loading cost every time. For interactive tools or servers that make repeated calls, the first-call penalty is negligible.
One thing that tripped me up early: the messages list is your responsibility. Unlike the chat interface in the terminal, the Python library doesn't track history for you. Append each response to the list yourself, or the model loses context between calls.
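Sketched with a hypothetical record_turn helper (the ask call requires a running Ollama server with the model pulled):

```python
def record_turn(history, user_text, assistant_text):
    """Append one user/assistant exchange to the running message list."""
    history.append({'role': 'user', 'content': user_text})
    history.append({'role': 'assistant', 'content': assistant_text})
    return history

def ask(history, user_text, model='llama3.2'):
    """Send the full history plus the new question, then record both sides."""
    import ollama  # deferred so the pure helpers above work without the server
    response = ollama.chat(
        model=model,
        messages=history + [{'role': 'user', 'content': user_text}],
    )
    answer = response['message']['content']
    record_turn(history, user_text, answer)
    return answer
```

Because record_turn mutates the list you pass in, the same history object carries context across every ask call.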
Quick Check
What happens if you call ollama.chat() with a model you haven't pulled yet?
Answer: Ollama automatically pulls (downloads) the model before running it. Convenient, but the first call will take minutes instead of seconds. I always pre-pull models to avoid surprises in production code.
The OpenAI-Compatible API
Here's why Ollama is winning the local LLM race: it exposes an OpenAI-compatible endpoint at localhost:11434/v1. That means any code written for the OpenAI API works with Ollama — change one line.
Read the code below carefully. It's the standard OpenAI library. The only differences are base_url and api_key. The response object has the same .choices[0].message.content structure. Your existing OpenAI code ports over with zero refactoring.
```python
from openai import OpenAI

client = OpenAI(
    # Uses OPENAI_API_KEY env var
    # Sends data to api.openai.com
)

response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[
        {'role': 'user', 'content': 'Explain decorators'}
    ],
)
```

```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Explain decorators'}
    ],
)
```

Write a function build_llm_config(provider) that returns a dictionary with the correct base_url, api_key, and default_model for a given provider.
Supported providers:
- "ollama" → base_url: "http://localhost:11434/v1", api_key: "ollama", default_model: "llama3.2"
- "openai" → base_url: "https://api.openai.com/v1", api_key: "sk-...", default_model: "gpt-4o-mini"
- Anything else → raise ValueError with message f"Unknown provider: {provider}"

This is the pattern you'd use to switch between local Ollama and cloud APIs with one config change.
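One solution that satisfies the spec (the hard-coded "sk-..." placeholder stands in for a real key you'd read from the environment):

```python
def build_llm_config(provider):
    """Return connection settings for a given LLM provider."""
    configs = {
        'ollama': {
            'base_url': 'http://localhost:11434/v1',
            'api_key': 'ollama',       # Ollama ignores the key; any string works
            'default_model': 'llama3.2',
        },
        'openai': {
            'base_url': 'https://api.openai.com/v1',
            'api_key': 'sk-...',       # placeholder -- load from env in real code
            'default_model': 'gpt-4o-mini',
        },
    }
    if provider not in configs:
        raise ValueError(f"Unknown provider: {provider}")
    return configs[provider]
```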
Choosing the Right Model
The model you choose matters more than any parameter you tune. I've watched people spend hours tweaking temperature and top_p on a 3B model when switching to a 7B model would have solved their problem instantly.
Six model families dominate the open-source LLM space right now. Each has a different sweet spot.
Llama 3.2 / 3.3 (Meta) — The default choice. Strong all-around performance, huge community. Sizes: 1B, 3B, 8B, 70B, 405B. The 8B model is the community's workhorse.
Mistral / Mixtral (Mistral AI) — Punches above its weight on reasoning and code. Mixtral is a "mixture of experts" architecture that's faster than its parameter count suggests. Sizes: 7B (Mistral), 8x7B (Mixtral).
Phi-3 / Phi-4 (Microsoft) — Small but surprisingly capable. The 3.8B model competes with models twice its size on benchmarks. Great for resource-constrained environments.
Qwen 2.5 (Alibaba) — Excellent multilingual support, especially CJK languages. Strong on code and math. Sizes: 0.5B to 72B.
DeepSeek-R1 (DeepSeek) — Specializes in chain-of-thought reasoning. Shows its "thinking" process before giving a final answer. Impressive on math and logic puzzles.
Gemma 2 (Google) — Google's open model. The 9B variant is competitive with larger models. Good instruction-following out of the box.
Size tiers give you a rough quality-to-resource tradeoff:

| Size | RAM needed | Speed | Quality | Best for |
|---|---|---|---|---|
| 1B-3B | 4-6 GB | Very fast | Limited | Quick tasks, autocomplete, drafts |
| 7B-8B | 8-10 GB | Fast | Good | General use, coding, chat |
| 13B-14B | 16 GB | Moderate | Better | Complex reasoning, detailed writing |
| 70B+ | 48+ GB | Slow | Near-cloud | When you need the best quality locally |
Quantization — Speed vs Quality Trade-off
Your 70B model doesn't fit in memory. Do you really need all those decimal places? Probably not. That's what quantization solves.
A model's parameters are numbers — weights in a neural network. At full precision (FP16), each parameter takes 2 bytes. A 7B model at FP16 needs ~14GB of RAM. Quantization reduces each parameter to fewer bits: 8-bit (Q8) cuts the size in half, 4-bit (Q4) cuts it to a quarter. The tradeoff is a small loss in output quality.
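The arithmetic is easy to sketch; this helper and its name are mine, and it ignores KV-cache and runtime overhead, so treat the result as a floor:

```python
def approx_model_size_gb(params_billions, bits_per_param):
    """Rough memory footprint: parameters x bytes per parameter."""
    bytes_per_param = bits_per_param / 8
    # 1e9 params x N bytes is roughly N GB
    return params_billions * bytes_per_param

# A 7B model: ~14 GB at FP16, ~7 GB at Q8, ~3.5 GB at Q4
```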
Here's what those tags mean in practice:

| Tag | Bits per weight | Size vs FP16 | Quality loss | Use when |
|---|---|---|---|---|
| fp16 | 16 | 1.0x (baseline) | None | Benchmarking, reference quality |
| q8_0 | 8 | 0.5x | Negligible | When quality matters and RAM allows |
| q4_K_M | 4 (mixed) | 0.3x | Small | Default choice — best bang for buck |
| q4_0 | 4 | 0.25x | Moderate | Maximum speed, tight RAM |
My rule of thumb: use Q4_K_M unless you have a specific reason not to. If you're doing evaluations or comparing model quality, use Q8_0 or FP16 to eliminate quantization as a variable.
Write a function recommend_quantization(ram_gb, priority) that recommends a quantization level based on available RAM and whether the user prioritizes "speed" or "quality".
Rules:
- ram_gb < 8: always return "q4_0" (the only option that fits)
- ram_gb >= 8 and priority == "speed": return "q4_K_M"
- ram_gb >= 8 and ram_gb < 16 and priority == "quality": return "q4_K_M" (can't fit q8)
- ram_gb >= 16 and priority == "speed": return "q4_K_M"
- ram_gb >= 16 and ram_gb < 32 and priority == "quality": return "q8_0"
- ram_gb >= 32 and priority == "quality": return "fp16"
- ram_gb >= 32 and priority == "speed": return "q4_K_M"

Benchmarking Local Models
Numbers don't lie. Before committing to a model for a project, measure it. I've been burned by benchmark leaderboards that don't match real-world performance on my specific tasks.
To compare models head-to-head, run the same prompt through each one. The function below benchmarks multiple models and prints a comparison table.
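Here is one way such a benchmark can be structured. The run_fn parameter is my addition so the timing harness isn't tied to a live server; in real use it would wrap ollama.chat:

```python
import time

def benchmark_models(models, prompt, run_fn):
    """Time run_fn(model, prompt) once per model; return (model, seconds) rows.

    In practice run_fn would be something like:
    lambda m, p: ollama.chat(model=m, messages=[{'role': 'user', 'content': p}])
    """
    rows = []
    for model in models:
        start = time.perf_counter()
        run_fn(model, prompt)
        rows.append((model, time.perf_counter() - start))
    return rows

def print_comparison(rows):
    """Print a simple fixed-width comparison table."""
    print(f"{'Model':<20}{'Seconds':>10}")
    for model, seconds in rows:
        print(f"{model:<20}{seconds:>10.2f}")
```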
Real-World Pattern — Local Code Review Assistant
Let's build something useful. This code review assistant reads a Python file from disk, sends it to your local model, and streams the review back. No data leaves your machine.
The streaming version gives a better experience for longer files. Tokens appear as they're generated instead of making you wait for the full review.
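A minimal sketch of that assistant, assuming a running Ollama server with llama3.2 pulled (the prompt wording and helper names are mine):

```python
from pathlib import Path

REVIEW_PROMPT = (
    "You are a senior Python reviewer. Review the following file for bugs, "
    "style issues, and missing error handling.\n\n{code}"
)

def build_review_prompt(path):
    """Read a Python file from disk and wrap it in the review prompt."""
    code = Path(path).read_text()
    return REVIEW_PROMPT.format(code=code)

def review_file(path, model='llama3.2'):
    """Stream the review to the terminal token by token."""
    import ollama  # deferred so build_review_prompt works without the server
    for chunk in ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': build_review_prompt(path)}],
        stream=True,
    ):
        print(chunk['message']['content'], end='', flush=True)
    print()
```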
Write a function format_review_report(reviews) that takes a dictionary of {filename: review_text} and returns a formatted report string.
Rules:
- Start with "Code Review Report (X files)", where X is the number of files
- For each file, add "\n--- filename ---\n" followed by the review text
- End with "\n=== END OF REPORT ==="

This is the output format you'd use to display results from a batch code review with your local LLM.
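One reading of that spec:

```python
def format_review_report(reviews):
    """Assemble per-file reviews into one report string."""
    parts = [f"Code Review Report ({len(reviews)} files)"]
    for filename, review_text in reviews.items():
        parts.append(f"\n--- {filename} ---\n{review_text}")
    parts.append("\n=== END OF REPORT ===")
    return ''.join(parts)
```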
Common Mistakes
I've debugged all of these at least twice. Save yourself the headache.
Not Enough RAM
If your model runs at 2 tokens/second instead of 30, it's probably swapping to disk. Check with ollama ps — if the model size exceeds your available RAM, switch to a smaller model or a more aggressive quantization.
Forgetting to Start the Ollama Server
If you get ConnectionRefusedError, the Ollama server isn't running. On macOS and Windows, it starts automatically. On Linux, you may need to run ollama serve in a separate terminal or set it up as a systemd service.
Ignoring the Context Window
Ollama defaults to a 2048-token context window. If your input is longer, the model silently truncates it. For code review or long documents, increase it:
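In the Python library this goes through the options dict. The helper names here are mine, and the right num_ctx value depends on your RAM, since a larger context window costs more memory:

```python
def long_context_options(num_ctx=8192):
    # num_ctx overrides Ollama's 2048-token default context window
    return {'num_ctx': num_ctx}

def review_long_document(text, model='llama3.2'):
    """Send a long document in one call (requires a running Ollama server)."""
    import ollama  # deferred so long_context_options works without the server
    response = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': text}],
        options=long_context_options(),
    )
    return response['message']['content']
```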
Using Too Large a Model
Bigger isn't always better. A 70B model running in 4-bit quantization on a 32GB machine will be slower than a 7B model at Q8. And for many tasks — summarization, simple Q&A, code formatting — the quality difference is negligible. Profile your specific use case before upsizing.
Summary and Next Steps
You went from zero to running LLMs on your own machine. Here's what you now know:
- Installing Ollama and pulling models with ollama pull
- Chatting from the terminal and from the official ollama Python library, including streaming and multi-turn history
- Pointing any OpenAI-client code at localhost:11434/v1 with a one-line change
- Choosing a model family and size tier for your hardware
- Quantization tradeoffs, with q4_K_M as the sensible default

Next steps to level up:
- Run Ollama in Docker for server deployments: docker run -d -p 11434:11434 ollama/ollama

Frequently Asked Questions
Can I use Ollama with a GPU?
Yes. Ollama auto-detects NVIDIA GPUs (via CUDA) and Apple Silicon (via Metal). No configuration needed — if a GPU is available, Ollama uses it. Check with ollama ps to confirm. On multi-GPU setups, Ollama can split models across GPUs automatically.
Is Ollama free for commercial use?
Ollama itself is MIT-licensed — use it however you want. But each model has its own license. Llama models require accepting Meta's community license. Mistral uses Apache 2.0 and Phi uses MIT. DeepSeek uses its own permissive license. Always check the model's license on the Ollama model page before shipping.
Can I run multiple models simultaneously?
Yes, but each model consumes RAM. If you load Llama 3.2 (5GB) and Mistral (5GB) at the same time, that's 10GB of RAM. Ollama unloads inactive models after 5 minutes by default. You can change this with the OLLAMA_KEEP_ALIVE environment variable.
How does Ollama compare to llama.cpp?
Ollama wraps llama.cpp (and other backends) with a user-friendly CLI and REST API. You get the same inference speed as raw llama.cpp but with model management, an OpenAI-compatible endpoint, and automatic GPU detection built in. If you need maximum control over inference parameters, use llama.cpp directly. For everything else, Ollama saves you hours of setup.
References
Versions used in this tutorial: Ollama 0.5+, Python 3.12, ollama library 0.4+, openai library 1.x. Tested March 2026.