
LangChain Model Switching: Use OpenAI, Claude, Gemini, and Ollama in One App

Intermediate · 60 min · 2 exercises · 30 XP

You've built an app on GPT-4o. It works great — until OpenAI has an outage and your users see nothing but error messages. Or your costs spike because GPT-4o is overkill for simple summarisation tasks. Or your enterprise client demands that sensitive data never leaves their network, so you need a local model.

The fix isn't rewriting your app for each provider. It's writing your app once and swapping the model with a single config change. That's exactly what LangChain's model abstraction gives you, and by the end of this tutorial, you'll have a production-ready model selector that handles four providers, automatic fallbacks, and rate-limit retries.

Why Model Switching Matters

I've shipped three different LLM-powered products, and every single one ended up needing multiple models. Not because I planned it that way, but because reality forced it. Here are the four forces that push every production app toward multi-model setups:

Reliability — Every provider has outages. OpenAI, Anthropic, Google — they all go down. If your app depends on a single provider, your app goes down too. A fallback chain that tries Provider B when Provider A fails is table stakes for production.


Cost — Provider pricing varies by an order of magnitude between flagship and small models. Routing simple tasks to a cheaper model cuts the bill without hurting quality.

Privacy and compliance — Some clients require that data never leaves their infrastructure. Ollama running a local Llama model satisfies that constraint. Your cloud-hosted app won't.

Model strengths — Claude excels at long-context document analysis. GPT-4o is strong at structured output. Gemini handles multimodal inputs natively. Matching the task to the model's strengths gives better results.

Setting Up Four Providers

Before we build the switcher, we need each provider working individually. Each one requires its own package and API key (except Ollama, which runs locally). If you haven't used LangChain before, start with the LangChain quickstart tutorial first.

Prerequisites

Python version: 3.10+

Required packages: langchain (0.3+), langchain-openai, langchain-anthropic, langchain-google-genai, langchain-ollama

Install:

Install all four provider packages
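Assuming a standard pip setup, one command covers the core library and all four provider packages:

```shell
pip install -U langchain langchain-openai langchain-anthropic langchain-google-genai langchain-ollama
```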

API keys: You need keys for the cloud providers. Set them as environment variables — never hardcode them in your scripts.

Set API keys as environment variables
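A sketch for macOS/Linux shells; every key value here is a placeholder:

```shell
# Placeholder values: replace with your real keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."
```

On Windows, use `setx` or your shell's equivalent. Ollama needs no key.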

Initialising Each Model

Each provider has its own LangChain class, but they all share the same interface. Watch how the invoke() call is identical across all four — that's the abstraction at work.

Initialise all four providers
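A sketch of the four constructors. The model identifiers are current as of writing and may need updating:

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama

# Four providers, one shared chat-model interface
openai_model = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
claude_model = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0.7)
gemini_model = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.7)
local_model = ChatOllama(model="llama3.1", temperature=0.7)  # assumes `ollama pull llama3.1`
```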

Every one of these objects is a BaseChatModel. That means they all respond to .invoke(), .stream(), .batch(), and .ainvoke() with the same signature. The code below loops through all four models, sends the same HumanMessage to each, and prints the response. You'll see four lines of output — one per provider — each answering the same question.

Same invoke() call across all four providers
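A sketch of the loop, assuming valid API keys and a running Ollama server:

```python
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama

models = {
    "openai": ChatOpenAI(model="gpt-4o-mini"),
    "anthropic": ChatAnthropic(model="claude-sonnet-4-20250514"),
    "google": ChatGoogleGenerativeAI(model="gemini-1.5-flash"),
    "ollama": ChatOllama(model="llama3.1"),
}

question = [HumanMessage(content="In one sentence, what is an LLM?")]
for name, model in models.items():
    response = model.invoke(question)  # identical call for every provider
    print(f"{name}: {response.content}")
```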

Each provider returns an AIMessage with a .content attribute. The response format is identical regardless of which model generated it. This is the foundation everything else builds on.

Building a Config-Driven Model Selector

Hardcoding ChatOpenAI(model="gpt-4o-mini") everywhere in your app is the same mistake as hardcoding database connection strings. When you need to change the model, you're hunting through files.

A simple model selector function
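A minimal sketch of the selector; the dict-based lookup is one reasonable implementation:

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama

PROVIDERS = {
    "openai": ChatOpenAI,
    "anthropic": ChatAnthropic,
    "google": ChatGoogleGenerativeAI,
    "ollama": ChatOllama,
}

def get_model(provider: str, model: str, temperature: float = 0.0):
    """Map a provider name to its LangChain class and instantiate it."""
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown provider: '{provider}'")
    return PROVIDERS[provider](model=model, temperature=temperature)
```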

The get_model function maps a provider name string to the corresponding LangChain class, then instantiates it with the given model name and temperature. Your application code never imports provider-specific classes directly — it calls this selector instead. Switching models means changing two strings, not rewriting imports and constructor calls:

Using the model selector in application code
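A sketch of application code, assuming the get_model() selector described above is defined in the same module:

```python
from langchain_core.messages import HumanMessage
# Assumes the get_model() selector described above is in scope

model = get_model("openai", "gpt-4o-mini", temperature=0.2)
response = model.invoke([HumanMessage(content="Summarise LangChain in one sentence.")])
print(response.content)

# Switching providers is a one-line change:
# model = get_model("anthropic", "claude-sonnet-4-20250514", temperature=0.2)
```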

To switch from OpenAI to Claude, change get_model("openai", "gpt-4o-mini") to get_model("anthropic", "claude-sonnet-4-20250514"). Nothing else changes. Not the message format, not the invoke call, not the response parsing.

Config File Approach

For real applications, I prefer pulling the model choice from a config file rather than hardcoding it in Python. A YAML config separates deployment decisions from code, so ops teams can switch models without touching your source.

Loading model config from a YAML file
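One way to wire this up, assuming a config.yaml with an llm section (the file path and key names here are illustrative) and the get_model() selector described above:

```python
# config.yaml might look like:
#   llm:
#     provider: openai
#     model: gpt-4o-mini
#     temperature: 0.2
import yaml  # pip install pyyaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

llm_cfg = config["llm"]
model = get_model(llm_cfg["provider"], llm_cfg["model"], llm_cfg.get("temperature", 0.0))
```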

Runtime Model Switching with configurable_alternatives

The get_model() selector works at startup — you pick a model when the app loads. But what if you need to switch models at runtime, per-request, without rebuilding the chain? LangChain has an official mechanism for this: configurable_alternatives.

The idea: you define a default model and a set of named alternatives. Then, when you invoke the chain, you pass a config dict to pick which alternative runs. The chain structure stays the same — only the model swaps out. This is especially useful when the same chain serves multiple clients or task types.

Define a model with configurable alternatives
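A sketch of the setup; the model identifiers may need updating:

```python
from langchain_core.runnables import ConfigurableField
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI

model = ChatOpenAI(model="gpt-4o-mini").configurable_alternatives(
    ConfigurableField(id="llm_provider"),
    default_key="openai",  # used when no config is passed
    anthropic=ChatAnthropic(model="claude-sonnet-4-20250514"),
    google=ChatGoogleGenerativeAI(model="gemini-1.5-flash"),
)
```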

The configurable_alternatives call registers three named options: "openai" (default), "anthropic", and "google". At invocation time, you pass a config dict with the key "llm_provider" to select which model runs. The rest of the chain — prompt templates, parsers, callbacks — stays identical.

Switch models at runtime with config dicts
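A sketch of per-request switching, assuming the configurable model defined above:

```python
from langchain_core.messages import HumanMessage
# Assumes the `model` with configurable alternatives defined above

msg = [HumanMessage(content="Name one use of embeddings.")]

# Default alternative (OpenAI)
print(model.invoke(msg).content)

# Per-request switch to Claude
print(model.with_config(configurable={"llm_provider": "anthropic"}).invoke(msg).content)

# Per-request switch to Gemini
print(model.with_config(configurable={"llm_provider": "google"}).invoke(msg).content)
```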
Exercise 1: Build a Model Selector Function
Write Code

Write a function called select_model that takes a provider string and returns the corresponding model class name as a string. Support three providers: "openai" should return "ChatOpenAI", "anthropic" should return "ChatAnthropic", and "google" should return "ChatGoogleGenerativeAI". If the provider is not recognised, raise a ValueError with the message Unknown provider: '<name>' (where <name> is the provider that was passed in).


Fallback Chains — Automatic Provider Failover

A model selector is great for manual switching, but what happens when OpenAI returns a 500 error at 2 AM? You need automatic failover — try the primary model, and if it fails, fall back to an alternative without any human intervention.

LangChain's .with_fallbacks() method chains models together so they are tried in sequence. The code below creates three model instances — GPT-4o-mini as the primary, Claude Sonnet as the first fallback, and Gemini Flash as the second — then links them into a single fallback chain. If the primary raises an exception on .invoke(), LangChain catches it and tries the next model in the list.

Creating a fallback chain with three providers
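A sketch of the chain; the model identifiers may need updating:

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI

primary = ChatOpenAI(model="gpt-4o-mini")
fallback_1 = ChatAnthropic(model="claude-sonnet-4-20250514")
fallback_2 = ChatGoogleGenerativeAI(model="gemini-1.5-flash")

# Tried in order: primary first, then each fallback
model_with_fallbacks = primary.with_fallbacks([fallback_1, fallback_2])
```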

The model_with_fallbacks object behaves exactly like a regular model. You call .invoke() on it the same way. If GPT-4o-mini raises an exception, LangChain catches it and tries Claude. If Claude also fails, it tries Gemini. If all three fail, it raises the last exception.

Using a fallback chain — same invoke() interface
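A sketch of a call against the fallback chain described above:

```python
from langchain_core.messages import HumanMessage
# Assumes the model_with_fallbacks chain described above

response = model_with_fallbacks.invoke(
    [HumanMessage(content="Explain fallbacks in one sentence.")]
)
print(response.content)  # answered by whichever provider succeeded first
```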

This is genuinely one of the most useful patterns in production LLM apps. The caller has no idea which model actually answered — it just works. I keep this pattern in every project now, even when I think I only need one provider.

Testing Fallback Behaviour

You need to verify that fallbacks actually trigger. The test technique here is simple: create a primary model with an intentionally invalid API key so it fails on every call, chain it with a working Claude model as the fallback, then invoke the chain. If the fallback is wired correctly, Claude handles the request and you see a response instead of an error.

Testing that fallback triggers on primary failure
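A sketch of the test, assuming a valid Anthropic key is set:

```python
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

# Deliberately broken primary: the invalid key makes every call fail
broken_primary = ChatOpenAI(model="gpt-4o-mini", api_key="sk-invalid", max_retries=0)
working_fallback = ChatAnthropic(model="claude-sonnet-4-20250514")

chain = broken_primary.with_fallbacks([working_fallback])
response = chain.invoke([HumanMessage(content="Who answered this?")])
print(response.content)  # a response here proves the fallback fired
```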
Exercise 2: Implement a Fallback Chain
Write Code

Write a function called create_fallback_chain that takes a list of provider names (strings) and returns the provider that would handle the request if earlier providers fail. The function should simulate a fallback chain: it receives a list of providers and a set of failed_providers. It should return the first provider from the list that is NOT in the failed_providers set. If all providers have failed, return "ALL_FAILED".


Handling Rate Limits Gracefully

Rate limits are the failure mode I hit most often in production. Send too many requests per minute and the provider returns a 429 error. Without handling, your app crashes. With proper handling, it waits and retries automatically.

LangChain models support a max_retries parameter out of the box. When the provider returns a rate-limit error, LangChain waits with exponential backoff and retries.

Built-in retry with exponential backoff
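A minimal sketch; the retry count is a reasonable starting point, not a rule:

```python
from langchain_openai import ChatOpenAI

# Retries with exponential backoff on transient errors such as 429s
model = ChatOpenAI(model="gpt-4o-mini", max_retries=5)
```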

Proactive Rate Limiting with InMemoryRateLimiter

The max_retries approach is reactive — it waits until the provider rejects your request, then backs off. For batch workloads that process hundreds of documents, you already know you'll hit the limit. LangChain's InMemoryRateLimiter lets you throttle requests before they hit the provider, avoiding 429 errors entirely.

Proactive rate limiting with InMemoryRateLimiter
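A sketch of the setup; tune the rate to your provider tier:

```python
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

# Roughly 30 requests per minute, checked every 100 ms, small burst allowance
rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.5,
    check_every_n_seconds=0.1,
    max_bucket_size=5,
)

model = ChatOpenAI(model="gpt-4o-mini", rate_limiter=rate_limiter)
```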

The rate limiter sits between your code and the provider's API. When you call .invoke() or .batch(), it queues requests to stay within the limit. For heavy batch processing, combining the rate limiter with retries and fallbacks gives you the most robust pipeline.

Combining retries with fallbacks
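A sketch combining all three layers:

```python
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

limiter = InMemoryRateLimiter(requests_per_second=0.5, check_every_n_seconds=0.1)

primary = ChatOpenAI(model="gpt-4o-mini", max_retries=3, rate_limiter=limiter)
fallback = ChatAnthropic(model="claude-sonnet-4-20250514", max_retries=3)

# Throttle first, retry on transient errors, fail over if retries run out
robust_model = primary.with_fallbacks([fallback])
```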

Model-Based Routing — Matching Tasks to Models

Why routing matters — cost comparison
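A back-of-the-envelope comparison. The per-token prices below are hypothetical figures for illustration only; check each provider's pricing page for current rates:

```python
# Illustrative prices only (USD per 1M input tokens), NOT current rates
PRICE_PER_M_INPUT = {
    "gpt-4o": 2.50,       # hypothetical figure
    "gpt-4o-mini": 0.15,  # hypothetical figure
}

monthly_tokens = 50_000_000  # 50M input tokens of simple classification

cost_large = monthly_tokens / 1_000_000 * PRICE_PER_M_INPUT["gpt-4o"]
cost_small = monthly_tokens / 1_000_000 * PRICE_PER_M_INPUT["gpt-4o-mini"]
print(f"gpt-4o: ${cost_large:.2f}/month, gpt-4o-mini: ${cost_small:.2f}/month")
```

At these illustrative rates the small model is over 16x cheaper for the same volume.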

Not every request deserves your most expensive model. A simple yes/no classification doesn't need GPT-4o. A complex code generation task doesn't belong on a tiny local model. The pattern is a routing function that examines the request type and picks the appropriate model — this is different from the fallback chain. Fallbacks handle failures; routing handles intent.

The route_to_model function below maps seven task types to specific models. Simple tasks (classification, extraction, summarisation) go to GPT-4o-mini with temperature=0 for deterministic output. Complex tasks (code generation, long documents) go to Claude Sonnet. Privacy-sensitive tasks route to a local Ollama model so data never leaves the machine.

A task-based model router
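A simplified sketch: the task names are assumptions, and it returns (provider, model, temperature) tuples to feed into the get_model() selector described earlier rather than model objects, so the routing logic stays testable without API keys:

```python
# Assumed task names for illustration
ROUTES = {
    "classification":    ("openai", "gpt-4o-mini", 0.0),
    "extraction":        ("openai", "gpt-4o-mini", 0.0),
    "summarisation":     ("openai", "gpt-4o-mini", 0.0),
    "code_generation":   ("anthropic", "claude-sonnet-4-20250514", 0.2),
    "long_document":     ("anthropic", "claude-sonnet-4-20250514", 0.2),
    "privacy_sensitive": ("ollama", "llama3.1", 0.2),
    "default":           ("openai", "gpt-4o-mini", 0.3),
}

def route_to_model(task_type: str):
    """Pick a (provider, model, temperature) config for a task type."""
    return ROUTES.get(task_type, ROUTES["default"])
```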

With routing in place, application code declares what it needs rather than which model to use. The three calls below demonstrate the routing in action: a spam classification task goes to GPT-4o-mini, a code generation request goes to Claude Sonnet, and a medical record summary routes to the local Ollama instance.

Using the router for different task types
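A sketch of the three calls, assuming the route_to_model() function and get_model() selector described above are in scope:

```python
# Assumes route_to_model() and get_model() described above

provider, model_name, temp = route_to_model("classification")
spam_model = get_model(provider, model_name, temp)       # cheap, deterministic

provider, model_name, temp = route_to_model("code_generation")
code_model = get_model(provider, model_name, temp)       # Claude Sonnet

provider, model_name, temp = route_to_model("privacy_sensitive")
medical_model = get_model(provider, model_name, temp)    # local Ollama model
```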

Putting It All Together — Production Model Manager

So far we've built four separate patterns: a config-driven selector, configurable_alternatives for runtime switching, fallback chains, and task-based routing. In a real project, I keep all of this in one class — a ModelManager that the rest of the application imports.

The ModelManager class below has three key pieces. First, a PROVIDERS dict that maps provider name strings to LangChain classes — the same pattern as get_model() but stored as class state. Second, a get_model() method that instantiates a model with retry config and handles the Ollama special case (Ollama doesn't support max_retries the same way as cloud providers). Third, a get_model_with_fallbacks() method that builds a fallback chain from a primary and a list of backup (provider, model) tuples.

A production ModelManager class
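A sketch of the class under the assumptions above (model names current as of writing):

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama


class ModelManager:
    """Central place for model selection, retries, and fallbacks."""

    PROVIDERS = {
        "openai": ChatOpenAI,
        "anthropic": ChatAnthropic,
        "google": ChatGoogleGenerativeAI,
        "ollama": ChatOllama,
    }

    def __init__(self, default_provider: str = "openai", max_retries: int = 2):
        self.default_provider = default_provider
        self.max_retries = max_retries

    def get_model(self, provider=None, model="gpt-4o-mini", temperature=0.0):
        provider = provider or self.default_provider
        if provider not in self.PROVIDERS:
            raise ValueError(f"Unknown provider: '{provider}'")
        kwargs = {"model": model, "temperature": temperature}
        if provider != "ollama":  # local models have no cloud-style retry config
            kwargs["max_retries"] = self.max_retries
        return self.PROVIDERS[provider](**kwargs)

    def get_model_with_fallbacks(self, primary, fallbacks):
        """primary and each fallback are (provider, model_name) tuples."""
        primary_model = self.get_model(*primary)
        fallback_models = [self.get_model(p, m) for p, m in fallbacks]
        return primary_model.with_fallbacks(fallback_models)
```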

Using the ModelManager is clean and explicit. The application code below creates a manager with OpenAI as the default, then demonstrates two usage patterns: a simple single-model call and a production setup with a fallback chain that tries OpenAI first, then Claude, then Gemini.

Using the ModelManager in application code
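A sketch of both patterns, assuming the ModelManager class is saved as model_manager.py:

```python
from langchain_core.messages import HumanMessage
from model_manager import ModelManager  # the class shown in this tutorial

manager = ModelManager(default_provider="openai")

# Simple single-model call
model = manager.get_model(model="gpt-4o-mini")
print(model.invoke([HumanMessage(content="Hello!")]).content)

# Production setup: OpenAI first, then Claude, then Gemini
robust = manager.get_model_with_fallbacks(
    primary=("openai", "gpt-4o-mini"),
    fallbacks=[("anthropic", "claude-sonnet-4-20250514"),
               ("google", "gemini-1.5-flash")],
)
print(robust.invoke([HumanMessage(content="Hello again!")]).content)
```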

Common Mistakes and How to Fix Them

Mistake 1: Wrong Parameter Names Across Providers

The LangChain abstraction normalises the .invoke() interface, but constructor parameters still differ between providers. I got bitten by this when I tried to pass OpenAI-style kwargs to the Anthropic client.

Wrong — using OpenAI parameters on Anthropic
# This will error — Anthropic uses "timeout", not "request_timeout"
model = ChatAnthropic(
    model="claude-sonnet-4-20250514",
    request_timeout=30,  # OpenAI parameter name
)
Correct — use the right parameter name for each provider
# OpenAI uses timeout (or request_timeout for older versions)
openai_model = ChatOpenAI(model="gpt-4o-mini", timeout=30)

# Anthropic uses timeout
claude_model = ChatAnthropic(model="claude-sonnet-4-20250514", timeout=30)

Mistake 2: Creating Models Inside Loops

Each model constructor creates a new HTTP client and validates the API key. Creating a model inside a loop means you spin up a new client for every iteration — slow and wasteful.

Wrong — new model object per iteration
questions = ["Q1?", "Q2?", "Q3?"]
for q in questions:
    model = ChatOpenAI(model="gpt-4o-mini")  # New client each time!
    response = model.invoke([HumanMessage(content=q)])
    print(response.content)
Correct — create once, reuse
model = ChatOpenAI(model="gpt-4o-mini")  # Create once
questions = ["Q1?", "Q2?", "Q3?"]
for q in questions:
    response = model.invoke([HumanMessage(content=q)])
    print(response.content)

Mistake 3: No Fallbacks in Production

Running with a single provider and no fallback is asking for trouble. Every LLM provider has downtime — building without fallbacks is like running a web server without health checks.

Risky — single point of failure
# If OpenAI goes down, your entire app goes down
model = ChatOpenAI(model="gpt-4o-mini")
response = model.invoke(messages)
Resilient — automatic failover
primary = ChatOpenAI(model="gpt-4o-mini", max_retries=2)
fallback = ChatAnthropic(model="claude-sonnet-4-20250514", max_retries=2)
model = primary.with_fallbacks([fallback])
response = model.invoke(messages)

Mistake 4: Forgetting That Ollama Requires a Running Server

Unlike cloud providers where an API key is enough, Ollama requires a local server process. If you try to invoke an Ollama model without ollama serve running, you'll get a connection error.

Start Ollama before using ChatOllama
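The two setup commands; llama3.1 is an example model choice:

```shell
# Start the Ollama server (keep it running in a separate terminal)
ollama serve

# Pull a model before first use
ollama pull llama3.1
```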
Ollama requires a running server
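A sketch of what the failure looks like from Python if the server isn't up:

```python
from langchain_core.messages import HumanMessage
from langchain_ollama import ChatOllama

model = ChatOllama(model="llama3.1")
try:
    print(model.invoke([HumanMessage(content="Hi")]).content)
except Exception as err:
    # Typically a connection error when `ollama serve` is not running
    print(f"Is the Ollama server running? {err}")
```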

Provider Comparison — Quick Reference

When deciding which provider to use for which task, this table summarises the key tradeoffs. Pricing changes frequently, so check each provider's pricing page for current rates.

| Feature | OpenAI (GPT-4o-mini) | Anthropic (Claude Sonnet) | Google (Gemini Flash) | Ollama (Local) |
| --- | --- | --- | --- | --- |
| Setup | API key | API key | API key | Local install |
| Latency | Low | Low | Low | Depends on hardware |
| Max context | 128K tokens | 200K tokens | 1M tokens | Model-dependent |
| Strengths | Structured output, function calling | Long docs, careful reasoning | Multimodal, large context | Privacy, no API costs |
| LangChain class | ChatOpenAI | ChatAnthropic | ChatGoogleGenerativeAI | ChatOllama |
| Package | langchain-openai | langchain-anthropic | langchain-google-genai | langchain-ollama |

Complete Code

Here's the full ModelManager class with the routing function and fallback support in a single script. Copy this into a model_manager.py file and import it across your project.

Complete model_manager.py
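A consolidated sketch under the same assumptions as the snippets above (illustrative task names, model identifiers current as of writing):

```python
"""model_manager.py: model selection, retries, fallbacks, and task routing."""
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama


class ModelManager:
    PROVIDERS = {
        "openai": ChatOpenAI,
        "anthropic": ChatAnthropic,
        "google": ChatGoogleGenerativeAI,
        "ollama": ChatOllama,
    }

    # Assumed task names for illustration
    ROUTES = {
        "classification":    ("openai", "gpt-4o-mini", 0.0),
        "extraction":        ("openai", "gpt-4o-mini", 0.0),
        "summarisation":     ("openai", "gpt-4o-mini", 0.0),
        "code_generation":   ("anthropic", "claude-sonnet-4-20250514", 0.2),
        "long_document":     ("anthropic", "claude-sonnet-4-20250514", 0.2),
        "privacy_sensitive": ("ollama", "llama3.1", 0.2),
        "default":           ("openai", "gpt-4o-mini", 0.3),
    }

    def __init__(self, default_provider="openai", max_retries=2):
        self.default_provider = default_provider
        self.max_retries = max_retries

    def get_model(self, provider=None, model="gpt-4o-mini", temperature=0.0):
        provider = provider or self.default_provider
        if provider not in self.PROVIDERS:
            raise ValueError(f"Unknown provider: '{provider}'")
        kwargs = {"model": model, "temperature": temperature}
        if provider != "ollama":  # no cloud-style retries for local models
            kwargs["max_retries"] = self.max_retries
        return self.PROVIDERS[provider](**kwargs)

    def get_model_with_fallbacks(self, primary, fallbacks):
        """primary and each fallback are (provider, model_name) tuples."""
        first = self.get_model(*primary)
        rest = [self.get_model(p, m) for p, m in fallbacks]
        return first.with_fallbacks(rest)

    def route_to_model(self, task_type):
        """Pick a model suited to the task type."""
        provider, model, temp = self.ROUTES.get(task_type, self.ROUTES["default"])
        return self.get_model(provider, model, temp)
```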

Frequently Asked Questions

Can I mix streaming and non-streaming across providers?

Yes. All four providers support .stream() and .astream() through the same LangChain interface. The streaming chunks may differ slightly in structure, but you iterate them the same way:

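A streaming sketch; swapping ChatOpenAI for any other provider class leaves the loop unchanged:

```python
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini")
# Same iteration pattern regardless of provider
for chunk in model.stream([HumanMessage(content="Count to five.")]):
    print(chunk.content, end="", flush=True)
```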

What happens if my fallback model also fails?

LangChain tries each fallback in order. If the last model in the chain also raises an exception, that exception propagates to your calling code. Wrap the .invoke() call in a try/except block to handle the case where all providers are down.

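A sketch of the guard:

```python
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

model = ChatOpenAI(model="gpt-4o-mini").with_fallbacks(
    [ChatAnthropic(model="claude-sonnet-4-20250514")]
)
try:
    response = model.invoke([HumanMessage(content="Hello")])
except Exception as err:
    # Raised only when every model in the chain has failed
    print(f"All providers failed: {err}")
```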

How do I track which provider actually responded?

The response's response_metadata dictionary typically includes the model name. You can also use LangChain callbacks to log exactly which model in the fallback chain handled each request. This is essential for cost tracking and debugging.

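A sketch of reading the metadata; the exact keys vary by provider:

```python
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini")
response = model.invoke([HumanMessage(content="Hi")])
# Key names differ between providers; "model_name" is common for OpenAI
print(response.response_metadata.get("model_name"))
```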

Does model switching work with LCEL chains and agents?

Absolutely. Since every model returned by the selector is a BaseChatModel, it plugs into any LCEL chain, agent, or tool pipeline. A prompt template piped to a model with fallbacks works exactly as you'd expect: prompt | model_with_fallbacks | output_parser. The fallback logic is invisible to the rest of the chain.

How does configurable_alternatives differ from get_model()?

get_model() creates a fresh model instance — you decide the provider at code time or startup. configurable_alternatives bakes multiple models into a single runnable and lets you pick at invocation time via a config dict. Use get_model() for static deployments (one environment = one model). Use configurable_alternatives when you need per-request switching without rebuilding the chain.

References

  • LangChain documentation — How to add fallbacks to a runnable
  • LangChain documentation — Chat model integrations
  • LangChain documentation — Configurable alternatives
  • LangChain documentation — Rate limiting
  • LangChain documentation — langchain-openai package
  • LangChain documentation — langchain-anthropic package
  • LangChain documentation — langchain-google-genai package
  • LangChain documentation — langchain-ollama package
  • OpenAI API documentation — Rate limits
  • Ollama documentation — Getting started
