LangChain Model Switching: Use OpenAI, Claude, Gemini, and Ollama in One App
You've built an app on GPT-4o. It works great — until OpenAI has an outage and your users see nothing but error messages. Or your costs spike because GPT-4o is overkill for simple summarisation tasks. Or your enterprise client demands that sensitive data never leaves their network, so you need a local model.
The fix isn't rewriting your app for each provider. It's writing your app once and swapping the model with a single config change. That's exactly what LangChain's model abstraction gives you, and by the end of this tutorial, you'll have a production-ready model selector that handles four providers, automatic fallbacks, and rate-limit retries.
Why Model Switching Matters
I've shipped three different LLM-powered products, and every single one ended up needing multiple models. Not because I planned it that way, but because reality forced it. Here are the four forces that push every production app toward multi-model setups:
Reliability — Every provider has outages. OpenAI, Anthropic, Google — they all go down. If your app depends on a single provider, your app goes down too. A fallback chain that tries Provider B when Provider A fails is table stakes for production.
Cost — Not every task needs your most expensive model. Per-token prices differ by an order of magnitude or more between frontier models and their smaller siblings, so routing simple tasks to a cheap model directly cuts your bill.

Privacy and compliance — Some clients require that data never leaves their infrastructure. Ollama running a local Llama model satisfies that constraint. Your cloud-hosted app won't.

Model strengths — Claude excels at long-context document analysis. GPT-4o is strong at structured output. Gemini handles multimodal inputs natively. Matching the task to the model's strengths gives better results.
Setting Up Four Providers
Before we build the switcher, we need each provider working individually. Each one requires its own package and API key (except Ollama, which runs locally). If you haven't used LangChain before, start with the LangChain quickstart tutorial first.
Prerequisites
Python version: 3.10+
Required packages: langchain (0.3+), langchain-openai, langchain-anthropic, langchain-google-genai, langchain-ollama
Install:
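Assuming you're using pip, all five packages from the prerequisites list install in one command:

```shell
pip install -U langchain langchain-openai langchain-anthropic langchain-google-genai langchain-ollama
```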
API keys: You need keys for the cloud providers. Set them as environment variables — never hardcode them in your scripts.
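On macOS or Linux, export them in your shell (the key values below are placeholders, not real keys):

```shell
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."
```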
Initialising Each Model
Each provider has its own LangChain class, but they all share the same interface. Watch how the invoke() call is identical across all four — that's the abstraction at work.
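A minimal sketch of the four constructors. The model names were current when this was written and may need updating; the Ollama line assumes you've pulled a Llama model locally:

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama

# Four providers, four classes, one shared interface
openai_model = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
anthropic_model = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0.7)
google_model = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.7)
ollama_model = ChatOllama(model="llama3.1", temperature=0.7)  # needs `ollama serve` running
```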
Every one of these objects is a BaseChatModel. That means they all respond to .invoke(), .stream(), .batch(), and .ainvoke() with the same signature. The code below loops through all four models, sends the same HumanMessage to each, and prints the response. You'll see four lines of output — one per provider — each answering the same question.
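A sketch of that loop. Model names are assumptions; running it requires all three cloud API keys plus a local Ollama server:

```python
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama

models = {
    "openai": ChatOpenAI(model="gpt-4o-mini"),
    "anthropic": ChatAnthropic(model="claude-sonnet-4-20250514"),
    "google": ChatGoogleGenerativeAI(model="gemini-1.5-flash"),
    "ollama": ChatOllama(model="llama3.1"),
}
question = [HumanMessage(content="In one sentence, what is LangChain?")]

for name, model in models.items():
    response = model.invoke(question)  # identical call for every provider
    print(f"{name}: {response.content}")
```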
Each provider returns an AIMessage with a .content attribute. The response format is identical regardless of which model generated it. This is the foundation everything else builds on.
Building a Config-Driven Model Selector
The get_model function maps a provider name string to the corresponding LangChain class, then instantiates it with the given model name and temperature. Your application code never imports provider-specific classes directly — it calls this selector instead. Switching models means changing two strings, not rewriting imports and constructor calls:
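A sketch of that selector. The dict-of-classes shape is the core idea; the default temperature is an arbitrary choice:

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama

PROVIDERS = {
    "openai": ChatOpenAI,
    "anthropic": ChatAnthropic,
    "google": ChatGoogleGenerativeAI,
    "ollama": ChatOllama,
}

def get_model(provider: str, model_name: str, temperature: float = 0.7):
    """Map a provider name string to its LangChain class and instantiate it."""
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown provider: '{provider}'")
    return PROVIDERS[provider](model=model_name, temperature=temperature)

model = get_model("openai", "gpt-4o-mini")
```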
To switch from OpenAI to Claude, change get_model("openai", "gpt-4o-mini") to get_model("anthropic", "claude-sonnet-4-20250514"). Nothing else changes. Not the message format, not the invoke call, not the response parsing.
Config File Approach
For real applications, I prefer pulling the model choice from a config file rather than hardcoding it in Python. A YAML config separates deployment decisions from code, so ops teams can switch models without touching your source.
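One possible shape for that config file (the file name and key names are illustrative, not a convention):

```yaml
# config.yaml
llm:
  provider: anthropic
  model: claude-sonnet-4-20250514
  temperature: 0.3
```

Load it with `yaml.safe_load` (PyYAML) at startup and pass the values straight into your selector, so a model change is a config edit and a redeploy, never a code change.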
Runtime Model Switching with configurable_alternatives
The get_model() selector works at startup — you pick a model when the app loads. But what if you need to switch models at runtime, per-request, without rebuilding the chain? LangChain has an official mechanism for this: configurable_alternatives.
The idea: you define a default model and a set of named alternatives. Then, when you invoke the chain, you pass a config dict to pick which alternative runs. The chain structure stays the same — only the model swaps out. This is especially useful when the same chain serves multiple clients or task types.
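A sketch of the pattern. The field id "llm_provider" and the alternative keys are names you choose, not fixed LangChain identifiers:

```python
from langchain_core.runnables import ConfigurableField
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI

model = ChatOpenAI(model="gpt-4o-mini").configurable_alternatives(
    ConfigurableField(id="llm_provider"),
    default_key="openai",
    anthropic=ChatAnthropic(model="claude-sonnet-4-20250514"),
    google=ChatGoogleGenerativeAI(model="gemini-1.5-flash"),
)

# Default: runs on OpenAI
print(model.invoke("Summarise LangChain in one sentence.").content)

# Per-request switch to Claude via the config dict
print(model.invoke(
    "Summarise LangChain in one sentence.",
    config={"configurable": {"llm_provider": "anthropic"}},
).content)
```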
The configurable_alternatives call registers three named options: "openai" (default), "anthropic", and "google". At invocation time, you pass a config dict with the key "llm_provider" to select which model runs. The rest of the chain — prompt templates, parsers, callbacks — stays identical.
Write a function called select_model that takes a provider string and returns the corresponding model class name as a string. Support three providers: "openai" should return "ChatOpenAI", "anthropic" should return "ChatAnthropic", and "google" should return "ChatGoogleGenerativeAI". If the provider is not recognised, raise a ValueError with the message Unknown provider: '<name>' (where <name> is the provider that was passed in).
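One possible solution, using a plain dict lookup:

```python
def select_model(provider: str) -> str:
    """Return the LangChain class name for a provider string."""
    mapping = {
        "openai": "ChatOpenAI",
        "anthropic": "ChatAnthropic",
        "google": "ChatGoogleGenerativeAI",
    }
    if provider not in mapping:
        raise ValueError(f"Unknown provider: '{provider}'")
    return mapping[provider]
```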
Fallback Chains — Automatic Provider Failover
A model selector is great for manual switching, but what happens when OpenAI returns a 500 error at 2 AM? You need automatic failover — try the primary model, and if it fails, fall back to an alternative without any human intervention.
LangChain's .with_fallbacks() method chains models together so they are tried in sequence. The code below creates three model instances — GPT-4o-mini as the primary, Claude Sonnet as the first fallback, and Gemini Flash as the second — then links them into a single fallback chain. If the primary raises an exception on .invoke(), LangChain catches it and tries the next model in the list.
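A sketch of that chain (model names are assumptions):

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI

primary = ChatOpenAI(model="gpt-4o-mini")
first_fallback = ChatAnthropic(model="claude-sonnet-4-20250514")
second_fallback = ChatGoogleGenerativeAI(model="gemini-1.5-flash")

# Tried in order: OpenAI, then Claude, then Gemini
model_with_fallbacks = primary.with_fallbacks([first_fallback, second_fallback])

response = model_with_fallbacks.invoke("Explain fallback chains in one sentence.")
print(response.content)
```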
The model_with_fallbacks object behaves exactly like a regular model. You call .invoke() on it the same way. If GPT-4o-mini raises an exception, LangChain catches it and tries Claude. If Claude also fails, it tries Gemini. If all three fail, it raises the last exception.
This is genuinely one of the most useful patterns in production LLM apps. The caller has no idea which model actually answered — it just works. I keep this pattern in every project now, even when I think I only need one provider.
Testing Fallback Behaviour
You need to verify that fallbacks actually trigger. The test technique here is simple: create a primary model with an intentionally invalid API key so it fails on every call, chain it with a working Claude model as the fallback, then invoke the chain. If the fallback is wired correctly, Claude handles the request and you see a response instead of an error.
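A sketch of that test, assuming a valid Anthropic key is set in the environment:

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

# Primary is guaranteed to fail: the key is deliberately invalid
broken_primary = ChatOpenAI(model="gpt-4o-mini", api_key="sk-invalid-key-for-testing")
working_fallback = ChatAnthropic(model="claude-sonnet-4-20250514")

chain = broken_primary.with_fallbacks([working_fallback])

# OpenAI rejects the bad key; LangChain silently retries with Claude
response = chain.invoke("Which provider answered this request?")
print(response.content)  # a response here proves the fallback fired
```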
Write a function called create_fallback_chain that takes a list of provider names (strings) and returns the provider that would handle the request if earlier providers fail. The function should simulate a fallback chain: it receives a list of providers and a set of failed_providers. It should return the first provider from the list that is NOT in the failed_providers set. If all providers have failed, return "ALL_FAILED".
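One possible solution:

```python
def create_fallback_chain(providers, failed_providers):
    """Return the first provider that has not failed, or 'ALL_FAILED'."""
    for provider in providers:
        if provider not in failed_providers:
            return provider
    return "ALL_FAILED"
```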
Handling Rate Limits Gracefully
Rate limits are the failure mode I hit most often in production. Send too many requests per minute and the provider returns a 429 error. Without handling, your app crashes. With proper handling, it waits and retries automatically.
LangChain models support a max_retries parameter out of the box. When the provider returns a rate-limit error, LangChain waits with exponential backoff and retries.
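A sketch, with 5 as an illustrative retry count:

```python
from langchain_openai import ChatOpenAI

# Retry up to 5 times on transient failures such as 429s, backing off between attempts
model = ChatOpenAI(model="gpt-4o-mini", max_retries=5)
```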
Proactive Rate Limiting with InMemoryRateLimiter
The max_retries approach is reactive — it waits until the provider rejects your request, then backs off. For batch workloads that process hundreds of documents, you already know you'll hit the limit. LangChain's InMemoryRateLimiter lets you throttle requests before they hit the provider, avoiding 429 errors entirely.
The rate limiter sits between your code and the provider's API. When you call .invoke() or .batch(), it queues requests to stay within the limit. For heavy batch processing, combining the rate limiter with retries and fallbacks gives you the most robust pipeline.
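A sketch combining both layers. The limits here (2 requests/second, bursts of 10) are placeholder numbers; set them below your actual plan's quota:

```python
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

rate_limiter = InMemoryRateLimiter(
    requests_per_second=2,      # steady-state throughput cap
    check_every_n_seconds=0.1,  # how often the limiter checks for a free slot
    max_bucket_size=10,         # allow short bursts
)

# Proactive throttling plus reactive retries as a safety net
model = ChatOpenAI(model="gpt-4o-mini", rate_limiter=rate_limiter, max_retries=2)

# .batch() now paces itself instead of slamming the API
responses = model.batch([f"Summarise document {i}" for i in range(20)])
```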
Model-Based Routing — Matching Tasks to Models
Not every request deserves your most expensive model. A simple yes/no classification doesn't need GPT-4o. A complex code generation task doesn't belong on a tiny local model. The pattern is a routing function that examines the request type and picks the appropriate model — this is different from the fallback chain. Fallbacks handle failures; routing handles intent.
The route_to_model function below maps seven task types to specific models. Simple tasks (classification, extraction, summarisation) go to GPT-4o-mini with temperature=0 for deterministic output. Complex tasks (code generation, long documents) go to Claude Sonnet. Privacy-sensitive tasks route to a local Ollama model so data never leaves the machine.
With routing in place, application code declares what it needs rather than which model to use. The three calls below demonstrate the routing in action: a spam classification task goes to GPT-4o-mini, a code generation request goes to Claude Sonnet, and a medical record summary routes to the local Ollama instance.
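A sketch of the router and the three demo calls. The task names and model assignments are one reasonable mapping, not a standard:

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_ollama import ChatOllama

def route_to_model(task_type: str):
    """Pick a model based on what the task needs, not what the caller asks for."""
    simple_tasks = {"classification", "extraction", "summarisation"}
    complex_tasks = {"code_generation", "long_document"}
    private_tasks = {"medical", "pii"}

    if task_type in simple_tasks:
        # Cheap, fast, deterministic
        return ChatOpenAI(model="gpt-4o-mini", temperature=0)
    if task_type in complex_tasks:
        return ChatAnthropic(model="claude-sonnet-4-20250514")
    if task_type in private_tasks:
        # Data never leaves the machine
        return ChatOllama(model="llama3.1")
    raise ValueError(f"Unknown task type: '{task_type}'")

spam_model = route_to_model("classification")      # GPT-4o-mini, temperature 0
codegen_model = route_to_model("code_generation")  # Claude Sonnet
medical_model = route_to_model("medical")          # local Ollama, data stays put
```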
Putting It All Together — Production Model Manager
So far we've built four separate patterns: a config-driven selector, configurable_alternatives for runtime switching, fallback chains, and task-based routing. In a real project, I keep all of this in one class — a ModelManager that the rest of the application imports.
The ModelManager class below has three key pieces. First, a PROVIDERS dict that maps provider name strings to LangChain classes — the same pattern as get_model() but stored as class state. Second, a get_model() method that instantiates a model with retry config and handles the Ollama special case (Ollama doesn't support max_retries the same way as cloud providers). Third, a get_model_with_fallbacks() method that builds a fallback chain from a primary and a list of backup (provider, model) tuples.
Using the ModelManager is clean and explicit. The application code below creates a manager with OpenAI as the default, then demonstrates two usage patterns: a simple single-model call and a production setup with a fallback chain that tries OpenAI first, then Claude, then Gemini.
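A sketch of the class and both usage patterns together. Defaults and model names are assumptions to adjust for your deployment:

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama

class ModelManager:
    """Central place for model selection, retries, and fallbacks."""

    PROVIDERS = {
        "openai": ChatOpenAI,
        "anthropic": ChatAnthropic,
        "google": ChatGoogleGenerativeAI,
        "ollama": ChatOllama,
    }

    def __init__(self, default_provider: str = "openai", max_retries: int = 2):
        self.default_provider = default_provider
        self.max_retries = max_retries

    def get_model(self, provider=None, model_name="gpt-4o-mini", temperature=0.7):
        provider = provider or self.default_provider
        if provider not in self.PROVIDERS:
            raise ValueError(f"Unknown provider: '{provider}'")
        kwargs = {"model": model_name, "temperature": temperature}
        if provider != "ollama":
            # The local Ollama server handles retries differently from cloud SDKs
            kwargs["max_retries"] = self.max_retries
        return self.PROVIDERS[provider](**kwargs)

    def get_model_with_fallbacks(self, primary, backups):
        """primary and each backup are (provider, model_name) tuples."""
        primary_model = self.get_model(*primary)
        backup_models = [self.get_model(*b) for b in backups]
        return primary_model.with_fallbacks(backup_models)


manager = ModelManager(default_provider="openai")

# Pattern 1: simple single-model call
model = manager.get_model(model_name="gpt-4o-mini")
print(model.invoke("Hello!").content)

# Pattern 2: production fallback chain — OpenAI, then Claude, then Gemini
robust_model = manager.get_model_with_fallbacks(
    primary=("openai", "gpt-4o-mini"),
    backups=[("anthropic", "claude-sonnet-4-20250514"),
             ("google", "gemini-1.5-flash")],
)
print(robust_model.invoke("Hello!").content)
```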
Common Mistakes and How to Fix Them
Mistake 1: Wrong Parameter Names Across Providers
The LangChain abstraction normalises the .invoke() interface, but constructor parameters still differ between providers. I got bitten by this when I tried to pass OpenAI-style kwargs to the Anthropic client.
```python
# This will error — Anthropic uses "timeout", not "request_timeout"
model = ChatAnthropic(
    model="claude-sonnet-4-20250514",
    request_timeout=30,  # OpenAI parameter name
)
```

```python
# OpenAI uses timeout (or request_timeout for older versions)
openai_model = ChatOpenAI(model="gpt-4o-mini", timeout=30)

# Anthropic uses timeout
claude_model = ChatAnthropic(model="claude-sonnet-4-20250514", timeout=30)
```

Mistake 2: Creating Models Inside Loops
Each model constructor creates a new HTTP client and validates the API key. Creating a model inside a loop means you spin up a new client for every iteration — slow and wasteful.
```python
questions = ["Q1?", "Q2?", "Q3?"]
for q in questions:
    model = ChatOpenAI(model="gpt-4o-mini")  # New client each time!
    response = model.invoke([HumanMessage(content=q)])
    print(response.content)
```

```python
model = ChatOpenAI(model="gpt-4o-mini")  # Create once
questions = ["Q1?", "Q2?", "Q3?"]
for q in questions:
    response = model.invoke([HumanMessage(content=q)])
    print(response.content)
```

Mistake 3: No Fallbacks in Production
Running with a single provider and no fallback is asking for trouble. Every LLM provider has downtime — building without fallbacks is like running a web server without health checks.
```python
# If OpenAI goes down, your entire app goes down
model = ChatOpenAI(model="gpt-4o-mini")
response = model.invoke(messages)
```

```python
primary = ChatOpenAI(model="gpt-4o-mini", max_retries=2)
fallback = ChatAnthropic(model="claude-sonnet-4-20250514", max_retries=2)
model = primary.with_fallbacks([fallback])
response = model.invoke(messages)
```

Mistake 4: Forgetting That Ollama Requires a Running Server
Unlike cloud providers where an API key is enough, Ollama requires a local server process. If you try to invoke an Ollama model without ollama serve running, you'll get a connection error.
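Pull a model and start the server before your app runs (the model name is an example):

```shell
ollama pull llama3.1   # download the model weights once
ollama serve           # start the local server (may already run as a background service)
```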
Provider Comparison — Quick Reference
When deciding which provider to use for which task, this table summarises the key tradeoffs. Pricing changes frequently, so check each provider's pricing page for current rates.
| Feature | OpenAI (GPT-4o-mini) | Anthropic (Claude Sonnet) | Google (Gemini Flash) | Ollama (Local) |
|---|---|---|---|---|
| Setup | API key | API key | API key | Local install |
| Latency | Low | Low | Low | Depends on hardware |
| Max context | 128K tokens | 200K tokens | 1M tokens | Model-dependent |
| Strengths | Structured output, function calling | Long docs, careful reasoning | Multimodal, large context | Privacy, no API costs |
| LangChain class | ChatOpenAI | ChatAnthropic | ChatGoogleGenerativeAI | ChatOllama |
| Package | langchain-openai | langchain-anthropic | langchain-google-genai | langchain-ollama |
Complete Code
Here's the full ModelManager class with the routing function and fallback support in a single script. Copy this into a model_manager.py file and import it across your project.
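A consolidated sketch of everything above. Model names, task names, and defaults are assumptions you should adjust before shipping:

```python
"""model_manager.py — config-driven model selection, routing, and fallbacks."""
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama


class ModelManager:
    """Central place for model selection, retries, and fallbacks."""

    PROVIDERS = {
        "openai": ChatOpenAI,
        "anthropic": ChatAnthropic,
        "google": ChatGoogleGenerativeAI,
        "ollama": ChatOllama,
    }

    def __init__(self, default_provider="openai", max_retries=2):
        self.default_provider = default_provider
        self.max_retries = max_retries

    def get_model(self, provider=None, model_name="gpt-4o-mini", temperature=0.7):
        provider = provider or self.default_provider
        if provider not in self.PROVIDERS:
            raise ValueError(f"Unknown provider: '{provider}'")
        kwargs = {"model": model_name, "temperature": temperature}
        if provider != "ollama":
            # Local Ollama server handles retries differently from cloud SDKs
            kwargs["max_retries"] = self.max_retries
        return self.PROVIDERS[provider](**kwargs)

    def get_model_with_fallbacks(self, primary, backups):
        """primary and each entry of backups are (provider, model_name) tuples."""
        return self.get_model(*primary).with_fallbacks(
            [self.get_model(*b) for b in backups]
        )


def route_to_model(manager: ModelManager, task_type: str):
    """Cheap model for simple tasks, strong model for complex, local for private."""
    if task_type in {"classification", "extraction", "summarisation"}:
        return manager.get_model("openai", "gpt-4o-mini", temperature=0)
    if task_type in {"code_generation", "long_document"}:
        return manager.get_model("anthropic", "claude-sonnet-4-20250514")
    if task_type in {"medical", "pii"}:
        return manager.get_model("ollama", "llama3.1")
    raise ValueError(f"Unknown task type: '{task_type}'")


if __name__ == "__main__":
    manager = ModelManager(default_provider="openai")
    robust = manager.get_model_with_fallbacks(
        primary=("openai", "gpt-4o-mini"),
        backups=[("anthropic", "claude-sonnet-4-20250514"),
                 ("google", "gemini-1.5-flash")],
    )
    print(robust.invoke("Say hello from whichever provider answered.").content)
```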
Frequently Asked Questions
Can I mix streaming and non-streaming across providers?
Yes. All four providers support .stream() and .astream() through the same LangChain interface. The streaming chunks may differ slightly in structure, but you iterate them the same way:
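A sketch with two of the providers (the same loop works for the other two):

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

for model in (ChatOpenAI(model="gpt-4o-mini"),
              ChatAnthropic(model="claude-sonnet-4-20250514")):
    # Each chunk has a .content attribute regardless of provider
    for chunk in model.stream("Count to five."):
        print(chunk.content, end="", flush=True)
    print()
```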
What happens if my fallback model also fails?
LangChain tries each fallback in order. If the last model in the chain also raises an exception, that exception propagates to your calling code. Wrap the .invoke() call in a try/except block to handle the case where all providers are down.
How do I track which provider actually responded?
The response's response_metadata dictionary typically includes the model name. You can also use LangChain callbacks to log exactly which model in the fallback chain handled each request. This is essential for cost tracking and debugging.
Does model switching work with LCEL chains and agents?
Absolutely. Since every model returned by the selector is a BaseChatModel, it plugs into any LCEL chain, agent, or tool pipeline. A prompt template piped to a model with fallbacks works exactly as you'd expect: prompt | model_with_fallbacks | output_parser. The fallback logic is invisible to the rest of the chain.
How does configurable_alternatives differ from get_model()?
get_model() creates a fresh model instance — you decide the provider at code time or startup. configurable_alternatives bakes multiple models into a single runnable and lets you pick at invocation time via a config dict. Use get_model() for static deployments (one environment = one model). Use configurable_alternatives when you need per-request switching without rebuilding the chain.