LangChain Model Switching: Use OpenAI, Claude, Gemini, and Ollama in One App
You've built an app on GPT-4o. It works great — until OpenAI has an outage and your users see nothing but error messages. Or your costs spike because GPT-4o is overkill for simple summarisation tasks. Or your enterprise client demands that sensitive data never leaves their network, so you need a local model.
The fix isn't rewriting your app for each provider. It's writing your app once and swapping the model with a single config change. That's exactly what LangChain's model abstraction gives you, and by the end of this tutorial, you'll have a production-ready model selector that handles four providers, automatic fallbacks, and rate-limit retries.
Why Model Switching Matters
I've shipped three different LLM-powered products, and every single one ended up needing multiple models. Not because I planned it that way, but because reality forced it. Here's what happens in practice:
Cost control — GPT-4o costs roughly 15x more than GPT-4o-mini per token. If 70% of your requests are simple classification tasks, you're burning money on a vastly overqualified model. Routing those tasks to a cheaper model cuts your bill dramatically.
Reliability — Every provider has outages. OpenAI, Anthropic, Google — they all go down. If your app depends on a single provider, your app goes down too. A fallback chain that tries Provider B when Provider A fails is table stakes for production.
Privacy and compliance — Some clients require that data never leaves their infrastructure. Ollama running a local Llama model satisfies that constraint. Your cloud-hosted app won't.
Model strengths — Claude excels at long-context document analysis. GPT-4o is strong at structured output. Gemini handles multimodal inputs natively. Matching the task to the model's strengths gives better results.
Setting Up Four Providers
Before we build the switcher, let's get each provider working individually. Each one requires its own package and API key (except Ollama, which runs locally).
Prerequisites
Python version: 3.10+
Required packages: langchain (0.3+), langchain-openai, langchain-anthropic, langchain-google-genai, langchain-ollama
Install:
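With pip (one package per provider, matching the list above):

```shell
pip install -U langchain langchain-openai langchain-anthropic langchain-google-genai langchain-ollama
```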
API keys: You need keys for the cloud providers. Set them as environment variables — never hardcode them in your scripts.
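For example, in your shell (the values below are placeholders):

```shell
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."
```

Ollama needs no key, but it does need the local server running (`ollama serve`).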
Initialising Each Model
Each provider has its own LangChain class, but they all share the same interface. Watch how the invoke() call is identical across all four — that's the abstraction at work.
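A sketch of the four constructors, assuming the packages are installed and the environment variables above are set. The model names (and the `llama3.1` tag) are examples; substitute whatever versions you have access to.

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama

# Each cloud constructor reads its key from the environment
# (OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY).
openai_model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
claude_model = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)
gemini_model = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0)
local_model = ChatOllama(model="llama3.1", temperature=0)  # needs `ollama serve` running
```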
Every one of these objects is a BaseChatModel. That means they all respond to .invoke(), .stream(), .batch(), and .ainvoke() with the same signature. Let's prove it:
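A sketch of that proof, assuming the four model objects (`openai_model`, `claude_model`, `gemini_model`, `local_model`) initialised as described above:

```python
from langchain_core.messages import HumanMessage

models = {
    "openai": openai_model,
    "anthropic": claude_model,
    "google": gemini_model,
    "ollama": local_model,
}

prompt = [HumanMessage(content="Say hello in five words.")]
for name, model in models.items():
    response = model.invoke(prompt)  # same call, regardless of provider
    print(f"{name}: {response.content}")
```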
Each provider returns an AIMessage with a .content attribute. The response format is identical regardless of which model generated it. This is the foundation everything else builds on.
Building a Config-Driven Model Selector
The idea is straightforward: define a function that takes a provider name and model name, then returns the right LangChain model object. Your application code never imports provider-specific classes directly — it calls the selector.
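A minimal sketch of such a selector. The imports are deferred inside each branch, which is a design choice (not required by LangChain): it means you only need the packages for the providers you actually use.

```python
def get_model(provider: str, model_name: str, **kwargs):
    """Return a LangChain chat model for the given provider and model name."""
    if provider == "openai":
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model=model_name, **kwargs)
    if provider == "anthropic":
        from langchain_anthropic import ChatAnthropic
        return ChatAnthropic(model=model_name, **kwargs)
    if provider == "google":
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model=model_name, **kwargs)
    if provider == "ollama":
        from langchain_ollama import ChatOllama
        return ChatOllama(model=model_name, **kwargs)
    raise ValueError(f"Unknown provider: {provider!r}")
```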
Now your app code looks like this — and switching models means changing two strings, not rewriting imports and constructor calls:
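For instance, assuming a `get_model` selector as described above:

```python
from langchain_core.messages import HumanMessage

model = get_model("openai", "gpt-4o-mini")  # the only provider-specific line
response = model.invoke([HumanMessage(content="Summarise LangChain in one sentence.")])
print(response.content)
```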
To switch from OpenAI to Claude, change get_model("openai", "gpt-4o-mini") to get_model("anthropic", "claude-sonnet-4-20250514"). Nothing else changes. Not the message format, not the invoke call, not the response parsing.
Config File Approach
For real applications, I prefer pulling the model choice from a config file rather than hardcoding it in Python. A YAML or JSON config separates deployment decisions from code, so ops teams can switch models without touching your source.
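A minimal sketch using JSON (standard library only). The filename and schema here are assumptions, not a LangChain convention:

```python
import json
from pathlib import Path

# Example config an ops team can edit without touching application code.
Path("model_config.json").write_text(json.dumps({
    "provider": "anthropic",
    "model": "claude-sonnet-4-20250514",
    "temperature": 0.2,
}))

def load_model_config(path: str = "model_config.json") -> dict:
    """Read the deployment's model choice from a JSON config file."""
    with open(path) as f:
        return json.load(f)

config = load_model_config()
# model = get_model(config["provider"], config["model"],
#                   temperature=config["temperature"])
print(config["provider"])  # anthropic
```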
Write a function called select_model that takes a provider string and returns the corresponding model class name as a string. Support three providers: "openai" should return "ChatOpenAI", "anthropic" should return "ChatAnthropic", and "google" should return "ChatGoogleGenerativeAI". If the provider is not recognised, raise a ValueError with the message Unknown provider: '<name>' (where <name> is the provider that was passed in).
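One possible solution to the exercise:

```python
def select_model(provider: str) -> str:
    """Map a provider name to its LangChain chat-model class name."""
    class_names = {
        "openai": "ChatOpenAI",
        "anthropic": "ChatAnthropic",
        "google": "ChatGoogleGenerativeAI",
    }
    if provider not in class_names:
        raise ValueError(f"Unknown provider: '{provider}'")
    return class_names[provider]
```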
Fallback Chains — Automatic Provider Failover
A model selector is great for manual switching, but what happens when OpenAI returns a 500 error at 2 AM? You need automatic failover — try the primary model, and if it fails, fall back to an alternative without any human intervention.
LangChain has a built-in .with_fallbacks() method for exactly this. You chain models together, and LangChain tries them in order until one succeeds.
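A sketch of a three-deep chain (model names are examples):

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI

primary = ChatOpenAI(model="gpt-4o-mini")
model_with_fallbacks = primary.with_fallbacks([
    ChatAnthropic(model="claude-sonnet-4-20250514"),
    ChatGoogleGenerativeAI(model="gemini-1.5-flash"),
])

# Called exactly like a single model; fallbacks fire only on exceptions.
response = model_with_fallbacks.invoke("What is a fallback chain?")
print(response.content)
```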
The model_with_fallbacks object behaves exactly like a regular model. You call .invoke() on it the same way. If the primary model (GPT-4o-mini) raises an exception, LangChain catches it and tries Claude. If Claude also fails, it tries Gemini. If all three fail, it raises the last exception.
This is genuinely one of the most useful patterns in production LLM apps. I've seen systems where the fallback chain saved entire deployments during provider outages. The caller has no idea which model actually answered — it just works.
Testing Fallback Behaviour
You'll want to verify that fallbacks actually trigger. The simplest way is to intentionally misconfigure the primary model so it fails, then check that the fallback responds.
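One way to do that, sketched below: pass an invalid API key to the primary so it fails with an auth error, then confirm the fallback answers. The `response_metadata` check at the end is a typical way to see which model responded, though the exact keys vary by provider.

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

# Deliberately broken primary: an invalid key forces an auth error immediately.
broken_primary = ChatOpenAI(model="gpt-4o-mini", api_key="sk-invalid", max_retries=0)
working_fallback = ChatAnthropic(model="claude-sonnet-4-20250514")

chain = broken_primary.with_fallbacks([working_fallback])
response = chain.invoke("Which model answered this?")
print(response.response_metadata)  # typically names the fallback's model
```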
Write a function called create_fallback_chain that takes a list of provider names (strings) and returns the provider that would handle the request if earlier providers fail. The function should simulate a fallback chain: it receives a list of providers and a set of failed_providers. It should return the first provider from the list that is NOT in the failed_providers set. If all providers have failed, return "ALL_FAILED".
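One possible solution to the exercise:

```python
def create_fallback_chain(providers: list[str], failed_providers: set[str]) -> str:
    """Return the first provider that has not failed, simulating failover order."""
    for provider in providers:
        if provider not in failed_providers:
            return provider
    return "ALL_FAILED"
```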
Handling Rate Limits Gracefully
Rate limits are the most common failure mode in LLM applications. Hit your provider's requests-per-minute ceiling, and you get a 429 error. Without handling, your app crashes. With proper handling, it waits and retries automatically.
LangChain models support a max_retries parameter out of the box. When the provider returns a rate-limit error, LangChain waits with exponential backoff and retries.
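For example (the retry count here is an arbitrary choice):

```python
from langchain_openai import ChatOpenAI

# Retry up to 5 times on rate-limit (429) and transient errors,
# backing off exponentially between attempts.
model = ChatOpenAI(model="gpt-4o-mini", max_retries=5)
response = model.invoke("Classify this ticket: 'My invoice total is wrong.'")
```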
For heavier workloads — batch processing thousands of documents, for instance — you might hit rate limits repeatedly. Combining retries with fallbacks gives you a robust pipeline: retry the primary model a few times, and if it still can't get through, switch to a different provider entirely.
Model-Based Routing — Matching Tasks to Models
Not every request deserves your most expensive model. A simple yes/no classification doesn't need GPT-4o. A complex code generation task doesn't belong on a tiny local model. Routing requests to the right model based on task type is where you see real cost savings.
The pattern is a routing function that examines the request and picks the appropriate model. This is different from the fallback chain — fallbacks handle failures, routing handles intent.
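A sketch of that pattern. The task labels and model choices below are illustrative assumptions; the point is that the table, not the business logic, decides which model runs.

```python
# Hypothetical routing table: task type -> (provider, model).
ROUTING_TABLE = {
    "classification": ("openai", "gpt-4o-mini"),                  # cheap, fast
    "long_document": ("anthropic", "claude-sonnet-4-20250514"),   # big context window
    "multimodal": ("google", "gemini-1.5-flash"),                 # native image input
    "sensitive": ("ollama", "llama3.1"),                          # never leaves the machine
}

def route(task_type: str) -> tuple[str, str]:
    """Pick a (provider, model) pair for a task, defaulting to the cheap model."""
    return ROUTING_TABLE.get(task_type, ("openai", "gpt-4o-mini"))

provider, model_name = route("sensitive")
# model = get_model(provider, model_name)  # a selector like the one built earlier
```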
Now your application code declares what it needs, not which model to use. The routing table — separate from your business logic — makes the cost/quality tradeoff explicit and easy to adjust.
Putting It All Together — Production Model Manager
Let's combine the config-driven selector, fallback chains, retries, and routing into a single class that you can drop into any project. This is the pattern I use in production — a ModelManager that handles all the complexity so the rest of the app doesn't have to think about it.
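A sketch of such a class. The task labels, model names, and default fallback are illustrative assumptions; imports are deferred so only the providers you use need their package installed.

```python
class ModelManager:
    """Routes tasks to models, adds retries, and attaches fallbacks."""

    DEFAULT_ROUTES = {
        "cheap": ("openai", "gpt-4o-mini"),
        "long_context": ("anthropic", "claude-sonnet-4-20250514"),
        "multimodal": ("google", "gemini-1.5-flash"),
        "private": ("ollama", "llama3.1"),
    }

    def __init__(self, routes=None, fallbacks=None, max_retries=2):
        self.routes = dict(routes or self.DEFAULT_ROUTES)
        # (provider, model) pairs tried in order when the primary fails.
        self.fallbacks = list(fallbacks or [("anthropic", "claude-sonnet-4-20250514")])
        self.max_retries = max_retries

    def resolve(self, task_type):
        """Return the (provider, model) pair a task type routes to."""
        if task_type not in self.routes:
            raise ValueError(f"Unknown task type: {task_type!r}")
        return self.routes[task_type]

    def _build(self, provider, model_name):
        # Lazy imports: only install the packages for the providers you use.
        if provider == "openai":
            from langchain_openai import ChatOpenAI
            return ChatOpenAI(model=model_name, max_retries=self.max_retries)
        if provider == "anthropic":
            from langchain_anthropic import ChatAnthropic
            return ChatAnthropic(model=model_name, max_retries=self.max_retries)
        if provider == "google":
            from langchain_google_genai import ChatGoogleGenerativeAI
            return ChatGoogleGenerativeAI(model=model_name, max_retries=self.max_retries)
        if provider == "ollama":
            from langchain_ollama import ChatOllama
            return ChatOllama(model=model_name)
        raise ValueError(f"Unknown provider: {provider!r}")

    def get(self, task_type):
        """Build the routed model with retries and fallbacks attached."""
        provider, model_name = self.resolve(task_type)
        primary = self._build(provider, model_name)
        alternates = [self._build(p, m) for p, m in self.fallbacks if p != provider]
        return primary.with_fallbacks(alternates) if alternates else primary
```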
Using the ModelManager is clean and explicit. The application code stays model-agnostic while the manager handles all the provider-specific details.
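Roughly like this, assuming a `ModelManager` as described above:

```python
manager = ModelManager()

cheap = manager.get("cheap")      # GPT-4o-mini with fallbacks attached
private = manager.get("private")  # local Ollama model, data stays on-box

response = cheap.invoke("Is this email spam? Answer yes or no.")
print(response.content)
```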
Common Mistakes and How to Fix Them
Mistake 1: Wrong Parameter Names Across Providers
The LangChain abstraction normalises the interface, but constructor parameters still differ slightly between providers. This catches people who assume all providers accept the same kwargs.
```python
from langchain_anthropic import ChatAnthropic

# This will error — Anthropic uses "timeout", not "request_timeout"
model = ChatAnthropic(
    model="claude-sonnet-4-20250514",
    request_timeout=30,  # OpenAI parameter name
)
```

The fix is to use each provider's own parameter name:

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

# OpenAI uses request_timeout
openai_model = ChatOpenAI(model="gpt-4o-mini", request_timeout=30)

# Anthropic uses timeout
claude_model = ChatAnthropic(model="claude-sonnet-4-20250514", timeout=30)
```

Mistake 2: Creating Models Inside Loops
Each model constructor creates a new HTTP client and validates the API key. Creating a model inside a loop means you spin up a new client for every iteration — slow and wasteful.
```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

questions = ["Q1?", "Q2?", "Q3?"]
for q in questions:
    model = ChatOpenAI(model="gpt-4o-mini")  # New client each time!
    response = model.invoke([HumanMessage(content=q)])
    print(response.content)
```

The fix is to create the model once and reuse it:

```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

model = ChatOpenAI(model="gpt-4o-mini")  # Create once
questions = ["Q1?", "Q2?", "Q3?"]
for q in questions:
    response = model.invoke([HumanMessage(content=q)])
    print(response.content)
```

Mistake 3: No Fallbacks in Production
If your app uses a single provider with no fallback, you're one outage away from a production incident. Every LLM provider has downtime — building without fallbacks is like running a web server without health checks.
```python
from langchain_openai import ChatOpenAI

# If OpenAI goes down, your entire app goes down
model = ChatOpenAI(model="gpt-4o-mini")
response = model.invoke(messages)
```

The fix is a retry-plus-fallback chain:

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

primary = ChatOpenAI(model="gpt-4o-mini", max_retries=2)
fallback = ChatAnthropic(model="claude-sonnet-4-20250514", max_retries=2)
model = primary.with_fallbacks([fallback])
response = model.invoke(messages)
```

Mistake 4: Forgetting That Ollama Requires a Running Server
Unlike cloud providers where an API key is enough, Ollama requires a local server process. If you try to invoke an Ollama model without ollama serve running, you'll get a connection error.
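A defensive sketch (the `llama3.1` tag is an example; use whatever you have pulled with `ollama pull`):

```python
from langchain_ollama import ChatOllama

model = ChatOllama(model="llama3.1")
try:
    model.invoke("ping")
except Exception as exc:
    # Typically a connection error when `ollama serve` isn't running.
    print(f"Could not reach Ollama. Is `ollama serve` running? ({exc})")
```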
Provider Comparison — Quick Reference
When deciding which provider to use for which task, this table summarises the key tradeoffs. Pricing changes frequently, so check each provider's pricing page for current rates.
| Feature | OpenAI (GPT-4o-mini) | Anthropic (Claude Sonnet) | Google (Gemini Flash) | Ollama (Local) |
|---|---|---|---|---|
| Setup | API key | API key | API key | Local install |
| Latency | Low | Low | Low | Depends on hardware |
| Max context | 128K tokens | 200K tokens | 1M tokens | Model-dependent |
| Strengths | Structured output, function calling | Long docs, careful reasoning | Multimodal, large context | Privacy, no API costs |
| LangChain class | ChatOpenAI | ChatAnthropic | ChatGoogleGenerativeAI | ChatOllama |
| Package | langchain-openai | langchain-anthropic | langchain-google-genai | langchain-ollama |
Complete Code
Here's the full ModelManager class with the routing function and fallback support in a single script. Copy this into a model_manager.py file and import it across your project.
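The original listing isn't preserved here, so the following is a consolidated sketch matching the pieces built above. Task labels, model names, and the fallback order are illustrative assumptions; adjust them to your deployment.

```python
"""model_manager.py - config-driven model selection with routing and fallbacks."""

DEFAULT_ROUTES = {
    "cheap": ("openai", "gpt-4o-mini"),
    "long_context": ("anthropic", "claude-sonnet-4-20250514"),
    "multimodal": ("google", "gemini-1.5-flash"),
    "private": ("ollama", "llama3.1"),
}


def build_model(provider: str, model_name: str, max_retries: int = 2):
    """Instantiate the LangChain chat model for one provider (lazy imports)."""
    if provider == "openai":
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model=model_name, max_retries=max_retries)
    if provider == "anthropic":
        from langchain_anthropic import ChatAnthropic
        return ChatAnthropic(model=model_name, max_retries=max_retries)
    if provider == "google":
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model=model_name, max_retries=max_retries)
    if provider == "ollama":
        from langchain_ollama import ChatOllama
        return ChatOllama(model=model_name)
    raise ValueError(f"Unknown provider: {provider!r}")


class ModelManager:
    """Routes tasks to models, adds retries, and attaches fallback providers."""

    def __init__(self, routes=None, fallbacks=None, max_retries=2):
        self.routes = dict(routes or DEFAULT_ROUTES)
        # (provider, model) pairs tried in order when the primary fails.
        self.fallbacks = list(fallbacks or [
            ("anthropic", "claude-sonnet-4-20250514"),
            ("google", "gemini-1.5-flash"),
        ])
        self.max_retries = max_retries

    def resolve(self, task_type: str):
        """Return the (provider, model) pair a task type routes to."""
        if task_type not in self.routes:
            raise ValueError(f"Unknown task type: {task_type!r}")
        return self.routes[task_type]

    def get(self, task_type: str):
        """Build the routed model with retries and fallbacks attached."""
        provider, model_name = self.resolve(task_type)
        primary = build_model(provider, model_name, self.max_retries)
        alternates = [build_model(p, m, self.max_retries)
                      for p, m in self.fallbacks if p != provider]
        return primary.with_fallbacks(alternates) if alternates else primary


# Usage (requires the provider packages and API keys):
#   manager = ModelManager()
#   model = manager.get("cheap")
#   print(model.invoke("Say hello.").content)
```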
Frequently Asked Questions
Can I mix streaming and non-streaming across providers?
Yes. All four providers support .stream() and .astream() through the same LangChain interface. The streaming chunks may differ slightly in structure, but you iterate them the same way:
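For example, given any chat model `model` from the selector:

```python
from langchain_core.messages import HumanMessage

# Works the same whether `model` is ChatOpenAI, ChatAnthropic,
# ChatGoogleGenerativeAI, or ChatOllama.
for chunk in model.stream([HumanMessage(content="Tell me a short joke.")]):
    print(chunk.content, end="", flush=True)
```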
What happens if my fallback model also fails?
LangChain tries each fallback in order. If the last model in the chain also raises an exception, that exception propagates to your calling code. Wrap the .invoke() call in a try/except block to handle the case where all providers are down.
How do I track which provider actually responded?
The response's response_metadata dictionary typically includes the model name. You can also use LangChain callbacks to log exactly which model in the fallback chain handled each request. This is essential for cost tracking and debugging.
Does model switching work with LCEL chains and agents?
Absolutely. Since every model returned by the selector is a BaseChatModel, it plugs into any LCEL chain, agent, or tool pipeline. A prompt template piped to a model with fallbacks works exactly as you'd expect: prompt | model_with_fallbacks | output_parser. The fallback logic is invisible to the rest of the chain.