
Run LLMs Locally with Ollama: Llama, Mistral, Phi, Qwen, and DeepSeek on Your Machine

Intermediate · 90 min · 3 exercises · 55 XP

Your API bill hit $200 last month. Your client's legal team says patient data can't leave the building. Your flight has no Wi-Fi and you need to prototype a feature. These aren't hypotheticals — they're the three conversations that pushed me to run LLMs locally.

Ollama makes it embarrassingly simple: one install, one command, and you have a model that rivals GPT-3.5 running entirely on your laptop.

Why Run LLMs Locally?

Last year a client told me: "Our patient records can't leave our servers. Period." No amount of OpenAI's privacy policies or BAA agreements changed their mind. I needed a model that ran inside their firewall. That project is what sold me on local LLMs.

The case for running models locally comes down to five things. Privacy — your data never leaves your machine, full stop. Cost — after the initial download, every inference is free. Speed — no network round-trip means lower latency for short prompts. Offline capability — airports, trains, submarines. No rate limits — run 10,000 requests an hour if your hardware can handle it.

Cloud API (OpenAI, Anthropic)
# Pros:
# - Best-in-class models (GPT-4o, Claude)
# - No hardware requirements
# - Always up-to-date
#
# Cons:
# - Pay per token ($2-15 per million tokens)
# - Data leaves your machine
# - Rate limits (TPM, RPM)
# - Requires internet
# - Vendor lock-in risk
Local LLM (Ollama)
# Pros:
# - 100% private — data stays on your machine
# - Free after download
# - No rate limits
# - Works offline
# - Full control over model version
#
# Cons:
# - Smaller models than cloud (7B-70B vs 1T+)
# - Requires decent hardware (8GB+ RAM)
# - You manage updates

This tutorial uses Ollama 0.5+ and targets models available as of early 2026. The landscape moves fast — model names and sizes may shift, but the patterns you learn here stay the same.

Installing Ollama and Your First Model

Install Ollama
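The Linux one-liner below is the documented install script; on macOS, Homebrew or the app download from ollama.com both work, and Windows has a standalone installer:

```shell
# Linux: official install script
curl -fsSL https://ollama.com/install.sh | sh

# macOS alternative
brew install ollama

# Confirm it's on your PATH
ollama --version
```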

After installation, the ollama command is available in your terminal. The Ollama server starts automatically in the background and listens on localhost:11434. Pull your first model:

Pull your first model
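llama3.2 below is an example tag; any model from the Ollama library works the same way:

```shell
# Download the model weights (a few GB; progress is shown)
ollama pull llama3.2
```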

Once the download finishes, you can chat with the model directly in your terminal. This is the fastest way to kick the tires on a new model before writing any Python.

Run and list models
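Three commands cover the day-to-day workflow:

```shell
# Chat interactively in the terminal (type /bye to exit)
ollama run llama3.2

# List every model on disk, with sizes
ollama list

# Show which models are loaded in memory right now
ollama ps
```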

The Ollama Python Library

What if you want to use Ollama from Python, not the terminal? The official ollama Python library wraps the REST API in a clean interface.

Install the ollama Python package
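It installs like any other package:

```shell
pip install ollama
```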
Basic chat with ollama.chat()
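A minimal call, sketched as a reusable helper. The model tag is an example, and the timing-field names follow the ollama 0.4+ response object; all durations are reported in nanoseconds:

```python
def chat_once(model: str, prompt: str):
    """Send one user message; return (reply_text, timings_in_seconds)."""
    import ollama  # requires `pip install ollama` and a running server

    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    timings = {
        "total_s": response.total_duration / 1e9,  # durations arrive in nanoseconds
        "load_s": response.load_duration / 1e9,
        "eval_s": response.eval_duration / 1e9,
    }
    return response.message.content, timings

# Usage (with the server running and llama3.2 pulled):
# reply, timings = chat_once("llama3.2", "Explain Python decorators in one paragraph.")
```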

The response object includes timing data that cloud APIs don't give you — total duration, model load time, prompt evaluation time. I find this invaluable for performance tuning.

For long responses, streaming prints tokens as they arrive instead of waiting for the full response. The user experience difference is dramatic.

Streaming responses
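A sketch of the streaming pattern: with stream=True, ollama.chat returns an iterator of chunks instead of one final response (chunk fields per the 0.4+ library):

```python
def stream_chat(model: str, prompt: str) -> str:
    """Print tokens as they arrive; return the assembled reply."""
    import ollama  # requires a running Ollama server

    pieces = []
    for chunk in ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # yield partial responses instead of one final object
    ):
        piece = chunk.message.content
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return "".join(pieces)

# Usage:
# stream_chat("llama3.2", "Write a haiku about local inference.")
```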

Multi-turn conversations work the same way as the OpenAI API. You maintain a list of messages and append each exchange.

Multi-turn conversations
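A sketch of the bookkeeping: one shared list, two appends per turn (the model tag is an example):

```python
def chat_turn(model: str, messages: list, user_text: str) -> str:
    """Append the user message, get a reply, append the reply, return it."""
    import ollama

    messages.append({"role": "user", "content": user_text})
    reply = ollama.chat(model=model, messages=messages).message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

# Usage: the shared `history` list is what gives the model memory.
# history = []
# chat_turn("llama3.2", history, "My name is Sam.")
# chat_turn("llama3.2", history, "What's my name?")  # sees the earlier exchange
```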

Model loading is worth remembering: Ollama loads a model into memory on the first call, which can take several seconds, then keeps it resident for subsequent calls. If your script makes one call then exits, you pay that loading cost every time. For interactive tools or servers that make repeated calls, the first-call penalty is negligible.

One thing that tripped me up early: the messages list is your responsibility. Unlike the chat interface in the terminal, the Python library doesn't track history for you. Append each response to the list yourself, or the model loses context between calls.

Quick Check

What happens if you call ollama.chat() with a model you haven't pulled yet?

Answer: The call fails with a "model not found" error. Unlike ollama run in the terminal, which auto-pulls missing models, the API expects the model to already be on disk, so download it first with ollama pull. I always pre-pull models to avoid surprises in production code.

The OpenAI-Compatible API

Here's why Ollama is winning the local LLM race: it exposes an OpenAI-compatible endpoint at localhost:11434/v1. That means any code written for the OpenAI API works with Ollama — change one line.


Study the local version in the comparison below. It's the standard OpenAI library; the only differences are base_url and api_key. The response object has the same .choices[0].message.content structure, so your existing OpenAI code ports over with zero refactoring.

Cloud OpenAI

```python
from openai import OpenAI

client = OpenAI(
    # Uses OPENAI_API_KEY env var
    # Sends data to api.openai.com
)
response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[
        {'role': 'user',
         'content': 'Explain decorators'}
    ],
)
```

Local Ollama (same code)

```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',
)
response = client.chat.completions.create(
    model='llama3.2',
    messages=[
        {'role': 'user',
         'content': 'Explain decorators'}
    ],
)
```
Exercise: Build an LLM Client Config
Write Code

Write a function build_llm_config(provider) that returns a dictionary with the correct base_url, api_key, and default_model for a given provider.

Supported providers:

  • "ollama" → base_url: "http://localhost:11434/v1", api_key: "ollama", default_model: "llama3.2"
  • "openai" → base_url: "https://api.openai.com/v1", api_key: "sk-...", default_model: "gpt-4o-mini"
  • Any other string → raise ValueError with message f"Unknown provider: {provider}"
This is the pattern you'd use to switch between local Ollama and cloud APIs with one config change.

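If you get stuck, here is one solution that satisfies the spec (try it yourself first):

```python
def build_llm_config(provider):
    """Return connection settings for a known provider, per the spec above."""
    configs = {
        "ollama": {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",
            "default_model": "llama3.2",
        },
        "openai": {
            "base_url": "https://api.openai.com/v1",
            "api_key": "sk-...",
            "default_model": "gpt-4o-mini",
        },
    }
    if provider not in configs:
        raise ValueError(f"Unknown provider: {provider}")
    return configs[provider]
```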

    Choosing the Right Model

    The model you choose matters more than any parameter you tune. I've watched people spend hours tweaking temperature and top_p on a 3B model when switching to a 7B model would have solved their problem instantly.

    Six model families dominate the open-source LLM space right now. Each has a different sweet spot.

    Llama 3.2 / 3.3 (Meta) — The default choice. Strong all-around performance, huge community. Sizes: 1B, 3B, 8B, 70B, 405B. The 8B model is the community's workhorse.

    Mistral / Mixtral (Mistral AI) — Punches above its weight on reasoning and code. Mixtral is a "mixture of experts" architecture that's faster than its parameter count suggests. Sizes: 7B (Mistral), 8x7B (Mixtral).

    Phi-3 / Phi-4 (Microsoft) — Small but surprisingly capable. The 3.8B model competes with models twice its size on benchmarks. Great for resource-constrained environments.

    Qwen 2.5 (Alibaba) — Excellent multilingual support, especially CJK languages. Strong on code and math. Sizes: 0.5B to 72B.

    DeepSeek-R1 (DeepSeek) — Specializes in chain-of-thought reasoning. Shows its "thinking" process before giving a final answer. Impressive on math and logic puzzles.

    Gemma 2 (Google) — Google's open model. The 9B variant is competitive with larger models. Good instruction-following out of the box.

    Pull multiple models for comparison
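The tags below are examples of current family names; check the Ollama library for the exact tags available when you read this:

```shell
ollama pull llama3.2
ollama pull mistral
ollama pull phi3
ollama pull qwen2.5
ollama pull gemma2
```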
Size tiers give you a rough quality-to-resource tradeoff:

| Size | RAM needed | Speed | Quality | Best for |
|---------|------------|-----------|------------|-----------------------------------------|
| 1B-3B | 4-6 GB | Very fast | Limited | Quick tasks, autocomplete, drafts |
| 7B-8B | 8-10 GB | Fast | Good | General use, coding, chat |
| 13B-14B | 16 GB | Moderate | Better | Complex reasoning, detailed writing |
| 70B+ | 48+ GB | Slow | Near-cloud | When you need the best quality locally |
    Inspect model details
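One command prints a model's card:

```shell
# Parameter count, quantization, context length, and license for a model
ollama show llama3.2
```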

    Quantization — Speed vs Quality Trade-off

    Your 70B model doesn't fit in memory. Do you really need all those decimal places? Probably not. That's what quantization solves.

    A model's parameters are numbers — weights in a neural network. At full precision (FP16), each parameter takes 2 bytes. A 7B model at FP16 needs ~14GB of RAM. Quantization reduces each parameter to fewer bits: 8-bit (Q8) cuts the size in half, 4-bit (Q4) cuts it to a quarter. The tradeoff is a small loss in output quality.
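The back-of-envelope formula is just parameters times bytes per parameter. A tiny sketch (weights only; runtime overhead such as the KV cache comes on top):

```python
def approx_model_ram_gb(params_billions: float, bits_per_param: int) -> float:
    """Rough weight-memory estimate: parameter count x bytes per parameter."""
    bytes_total = params_billions * 1e9 * (bits_per_param / 8)
    return bytes_total / 1e9  # decimal GB, good enough for a ballpark

print(approx_model_ram_gb(7, 16))  # 14.0  (FP16: ~14 GB)
print(approx_model_ram_gb(7, 8))   # 7.0   (Q8)
print(approx_model_ram_gb(7, 4))   # 3.5   (Q4)
```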

    Pull specific quantizations
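Quantization lives in the model tag. Exact tag names vary per model, so confirm them on the model's tags page in the Ollama library; these follow the common pattern:

```shell
# Example tags only; check the model's page for what actually exists
ollama pull llama3.2:3b-instruct-q8_0
ollama pull llama3.2:3b-instruct-q4_K_M
```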
Here's what those tags mean in practice:

| Tag | Bits | Size vs FP16 | Quality loss | Use it for |
|--------|-----------|-----------------|--------------|----------------------------------------|
| fp16 | 16 | 1.0x (baseline) | None | Benchmarking, reference quality |
| q8_0 | 8 | 0.5x | Negligible | When quality matters and RAM allows |
| q4_K_M | 4 (mixed) | 0.3x | Small | Default choice — best bang for buck |
| q4_0 | 4 | 0.25x | Moderate | Maximum speed, tight RAM |

    My rule of thumb: use Q4_K_M unless you have a specific reason not to. If you're doing evaluations or comparing model quality, use Q8_0 or FP16 to eliminate quantization as a variable.

    Exercise: Recommend the Right Quantization
    Write Code

    Write a function recommend_quantization(ram_gb, priority) that recommends a quantization level based on available RAM and whether the user prioritizes "speed" or "quality".

    Rules:

  • If ram_gb < 8: always return "q4_0" (only option that fits)
  • If ram_gb >= 8 and priority == "speed": return "q4_K_M"
  • If ram_gb >= 8 and ram_gb < 16 and priority == "quality": return "q4_K_M" (can't fit q8)
  • If ram_gb >= 16 and priority == "speed": return "q4_K_M"
  • If ram_gb >= 16 and ram_gb < 32 and priority == "quality": return "q8_0"
  • If ram_gb >= 32 and priority == "quality": return "fp16"
  • If ram_gb >= 32 and priority == "speed": return "q4_K_M"
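If you get stuck, here is one solution that encodes the rules above (attempt it yourself first):

```python
def recommend_quantization(ram_gb, priority):
    """Map available RAM and priority to a quantization tag, per the rules above."""
    if ram_gb < 8:
        return "q4_0"      # only option that fits
    if priority == "speed":
        return "q4_K_M"    # speed picks q4_K_M at every tier from 8 GB up
    if ram_gb < 16:
        return "q4_K_M"    # quality, but q8_0 won't fit
    if ram_gb < 32:
        return "q8_0"
    return "fp16"
```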

    Benchmarking Local Models

    Numbers don't lie. Before committing to a model for a project, measure it. I've been burned by benchmark leaderboards that don't match real-world performance on my specific tasks.

    Benchmark function
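A sketch that combines wall-clock time with the response's eval_count and eval_duration fields (the API reports durations in nanoseconds):

```python
import time

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's timing fields (duration in nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str, prompt: str) -> dict:
    """Time one generation and report speed for this model on this hardware."""
    import ollama  # requires a running server and a pulled model

    start = time.perf_counter()
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    wall = time.perf_counter() - start
    return {
        "model": model,
        "wall_seconds": round(wall, 2),
        "tokens": response.eval_count,
        "tokens_per_sec": round(
            tokens_per_second(response.eval_count, response.eval_duration), 1
        ),
    }
```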

    To compare models head-to-head, run the same prompt through each one. The function below benchmarks multiple models and prints a comparison table.

    Compare models side by side
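One way to sketch the comparison loop. It takes the benchmarking function as a parameter, so any callable returning a dict with 'model', 'wall_seconds', and 'tokens_per_sec' keys will do:

```python
def compare_models(models, prompt, bench):
    """Run bench(model, prompt) for each model and print an aligned table."""
    rows = [bench(m, prompt) for m in models]
    lines = [f"{'model':<20} {'seconds':>8} {'tok/s':>8}"]
    for r in rows:
        lines.append(
            f"{r['model']:<20} {r['wall_seconds']:>8} {r['tokens_per_sec']:>8}"
        )
    table = "\n".join(lines)
    print(table)
    return table
```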

    Real-World Pattern — Local Code Review Assistant

    Let's build something useful. This code review assistant reads a Python file from disk, sends it to your local model, and streams the review back. No data leaves your machine.

    Local code review assistant
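A sketch of the assistant; the prompt wording and default model tag are illustrative choices:

```python
from pathlib import Path

REVIEW_PROMPT = (
    "You are a careful code reviewer. Review the following Python file. "
    "Point out bugs, style issues, and concrete improvements.\n\n{code}"
)

def review_file(path: str, model: str = "llama3.2") -> str:
    """Read a local file and ask a local model to review it."""
    import ollama  # requires a running Ollama server

    code = Path(path).read_text()
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": REVIEW_PROMPT.format(code=code)}],
    )
    return response.message.content

# Usage:
# print(review_file("my_script.py"))
```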

    The streaming version gives a better experience for longer files. Tokens appear as they're generated instead of making you wait for the full review.

    Streaming code review
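The same idea with stream=True, so tokens print as they're generated (prompt wording is illustrative):

```python
from pathlib import Path

def stream_review(path: str, model: str = "llama3.2") -> None:
    """Stream the review token by token so long files don't feel frozen."""
    import ollama  # requires a running Ollama server

    code = Path(path).read_text()
    prompt = f"Review this Python code for bugs and style issues:\n\n{code}"
    for chunk in ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        print(chunk.message.content, end="", flush=True)
    print()
```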
    Exercise: Build a Review Report Formatter
    Write Code

    Write a function format_review_report(reviews) that takes a dictionary of {filename: review_text} and returns a formatted report string.

    Rules:

  • Start with a header line: "Code Review Report (X files)"
  • For each file, add: "\n--- filename ---\n" followed by the review text
  • End with "\n=== END OF REPORT ==="
  • Files should appear in sorted order by filename
This is the output format you'd use to display results from a batch code review with your local LLM.

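If you get stuck, here is one solution that satisfies the spec (attempt it yourself first):

```python
def format_review_report(reviews):
    """Build the batch-review report string described above."""
    report = f"Code Review Report ({len(reviews)} files)"
    for filename in sorted(reviews):
        report += f"\n--- {filename} ---\n{reviews[filename]}"
    report += "\n=== END OF REPORT ==="
    return report
```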

    Common Mistakes

    I've debugged all of these at least twice. Save yourself the headache.

    Not Enough RAM

    If your model runs at 2 tokens/second instead of 30, it's probably swapping to disk. Check with ollama ps — if the model size exceeds your available RAM, switch to a smaller model or a more aggressive quantization.

    Forgetting to Start the Ollama Server

    If you get ConnectionRefusedError, the Ollama server isn't running. On macOS and Windows, it starts automatically. On Linux, you may need to run ollama serve in a separate terminal or set it up as a systemd service.

    Ignoring the Context Window

    Ollama defaults to a 2048-token context window. If your input is longer, the model silently truncates it. For code review or long documents, increase it:

    Set context window size
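With the Python library, the window is a per-request option. num_ctx is the real option name; 8192 is an example size, and larger windows use more RAM:

```python
def chat_long(model: str, prompt: str, num_ctx: int = 8192) -> str:
    """Chat with an enlarged context window for long inputs."""
    import ollama  # requires a running Ollama server

    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options={"num_ctx": num_ctx},  # overrides the 2048-token default
    )
    return response.message.content
```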

    Using Too Large a Model

    Bigger isn't always better. A 70B model running in 4-bit quantization on a 32GB machine will be slower than a 7B model at Q8. And for many tasks — summarization, simple Q&A, code formatting — the quality difference is negligible. Profile your specific use case before upsizing.

    Summary and Next Steps

    You went from zero to running LLMs on your own machine. Here's what you now know:

  • Install Ollama with one command and pull models with ollama pull
  • Use the `ollama` Python library for native chat, streaming, and multi-turn conversations
  • Use the OpenAI-compatible API to swap between cloud and local with one line
  • Choose models based on your task, hardware, and quality requirements
  • Quantization trades precision for speed — Q4_K_M is the sweet spot for most use cases
  • Benchmark before committing — measure tokens/second on your actual hardware
Next steps to level up:

  • Pull 3-4 models and benchmark them on your specific tasks
  • Build a RAG (Retrieval-Augmented Generation) pipeline with local embeddings
  • Use Ollama in Docker for reproducible deployments: docker run -d -p 11434:11434 ollama/ollama
  • Explore custom Modelfiles to fine-tune system prompts and parameters per model
Frequently Asked Questions

    Can I use Ollama with a GPU?

    Yes. Ollama auto-detects NVIDIA GPUs (via CUDA) and Apple Silicon (via Metal). No configuration needed — if a GPU is available, Ollama uses it. Check with ollama ps to confirm. On multi-GPU setups, Ollama can split models across GPUs automatically.

    Is Ollama free for commercial use?

    Ollama itself is MIT-licensed — use it however you want. But each model has its own license. Llama models require accepting Meta's community license. Mistral and Phi use Apache 2.0. DeepSeek uses its own permissive license. Always check the model's license on the Ollama model page before shipping.

    Can I run multiple models simultaneously?

    Yes, but each model consumes RAM. If you load Llama 3.2 (5GB) and Mistral (5GB) at the same time, that's 10GB of RAM. Ollama unloads inactive models after 5 minutes by default. You can change this with the OLLAMA_KEEP_ALIVE environment variable.

    How does Ollama compare to llama.cpp?

    Ollama wraps llama.cpp (and other backends) with a user-friendly CLI and REST API. You get the same inference speed as raw llama.cpp but with model management, an OpenAI-compatible endpoint, and automatic GPU detection built in. If you need maximum control over inference parameters, use llama.cpp directly. For everything else, Ollama saves you hours of setup.

    References

  • Ollama Official Documentation — installation, CLI reference, model library
  • Ollama Python Library (GitHub) — Python client source code and examples
  • Ollama OpenAI Compatibility — using the OpenAI-compatible endpoint
  • GGUF Format Specification — the model format Ollama uses internally
  • Hugging Face Open LLM Leaderboard — benchmark comparisons across open models
  • llama.cpp (GitHub) — the inference engine behind Ollama
Versions used in this tutorial: Ollama 0.5+, Python 3.12, ollama library 0.4+, openai library 1.x. Tested March 2026.
