
Multi-Provider LLM Router: Groq, Together AI, and OpenRouter with Automatic Fallbacks

Intermediate · 60 min · 2 exercises · 40 XP

Your app sends every request to OpenAI. Monday at 3 PM, their API goes down for 47 minutes. Your users stare at spinners. Your Slack lights up. I have been through exactly this scenario, and the fix is not "hope OpenAI stays up" -- it is building a router that automatically fails over to a backup provider before the user notices.

This tutorial builds a multi-provider LLM router in Python. You will wire up Groq, Together AI, and OpenRouter using the same OpenAI-compatible SDK, add health checks, and implement automatic failover. Every code block runs in your browser.

Why You Need More Than One LLM Provider

Relying on a single LLM provider is the AI equivalent of running your database on one server with no replicas. Each major provider has had multi-hour outages in the past year. When yours goes down, your entire AI feature goes dark.

But resilience is only half the reason. Different providers have genuinely different strengths. Groq serves open-source models with sub-200ms latency on custom LPU hardware. Together AI gives you access to dozens of open-source models with fine-tuning options. OpenRouter acts as a universal proxy routing to 200+ models from a single API key.

The key insight that makes multi-provider routing practical: Groq, Together AI, and OpenRouter all expose OpenAI-compatible APIs. You use the same openai Python SDK for all of them. Only the base_url and api_key change.

I keep three providers configured in every project I ship. Not because I expect daily outages, but because the one time it matters, it saves a 2 AM scramble.

Setting Up Three Providers with One SDK

Three providers, one SDK

Three clients. Same class. The only differences are the URL and key. Let's confirm it works by sending a test request to Groq:

Test request to Groq

That is the same chat.completions.create() call from the OpenAI API tutorial. Swap groq_client for together_client or openrouter_client (with the matching model name) and it works identically.

Provider Deep Dive: Groq, Together AI, and OpenRouter

Each provider has a sweet spot. Understanding when to route to each one is the difference between a router that just fails over and one that actively picks the best option for each request.

Groq -- The Speed Demon

Measuring Groq latency

Groq runs models on custom chips called LPUs (Language Processing Units) designed for sequential token generation. In my testing, simple requests consistently come back in under 200ms -- faster than most database queries. The catch: Groq only offers a curated set of open-source models. No GPT-4o, no Claude. For tasks where Llama 3.1 8B is sufficient -- classification, extraction, simple Q&A -- the speed is hard to beat.

Together AI -- The Open-Source Catalog

Together AI is where I go when I need a specific open-source model that Groq does not carry. They host everything from small 7B models to Mixtral, DeepSeek, and Qwen variants. They also offer fine-tuning as a service -- if you fine-tune a Llama model through their platform, the fine-tuned version appears as a model callable through the same API.

Querying Together AI

OpenRouter -- The Universal Proxy

OpenRouter takes a different approach entirely. Instead of hosting models, they proxy your request to the cheapest or fastest available backend. One API key gets you models from OpenAI, Anthropic, Google, Meta, Mistral, and more. The :free suffix on a model name routes to a free provider.

OpenRouter -- unified access

Building the Provider Configuration

We need a clean data structure that our router can work with. Each provider config stores its client, model name mapping, priority order, and health status. We will build one class and evolve it throughout the tutorial.

ProviderConfig with model mapping and latency tracking
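The config class described above can be sketched as a dataclass. The field names and the alpha=0.3 default are the ones the surrounding text assumes:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ProviderConfig:
    name: str
    client: Any                # an AsyncOpenAI instance in practice
    model_map: dict[str, str]  # generic name -> provider-specific name
    priority: int              # lower number = tried first
    is_healthy: bool = True
    failures: int = 0
    avg_latency_ms: float = 0.0

    def update_latency(self, latency_ms: float, alpha: float = 0.3) -> None:
        # Exponential moving average: recent samples weigh more than old ones.
        if self.avg_latency_ms == 0.0:
            self.avg_latency_ms = latency_ms
        else:
            self.avg_latency_ms = alpha * latency_ms + (1 - alpha) * self.avg_latency_ms
```

For example, a provider averaging 100 ms that suddenly returns a 200 ms response moves to 130 ms (0.3 × 200 + 0.7 × 100), so a slowdown shows up within a few samples.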

The model_map dictionary is the critical piece. Your application code asks for "llama-3.1-8b" and the router translates that to the provider-specific name. Business logic never references a provider directly.

The update_latency method uses an exponential moving average. With alpha=0.3, recent measurements influence the average more than old ones. If a provider suddenly slows down, the average rises quickly. This matters when we add latency-aware routing later.

Health Checks -- Detecting Issues Before Users Do

A router that does not know which providers are down is useless. The health check function below probes each provider with a minimal request and marks it healthy or unhealthy.

Health check function
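A sketch of that probe, assuming the ProviderConfig shape from the previous section (a .client, a .model_map, and an .is_healthy flag):

```python
import asyncio

async def check_health(provider) -> bool:
    """Probe a provider with a one-token request and update is_healthy."""
    try:
        await asyncio.wait_for(
            provider.client.chat.completions.create(
                model=next(iter(provider.model_map.values())),
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,  # smallest possible paid response
            ),
            timeout=5.0,  # bound the wait on a degraded provider
        )
        provider.is_healthy = True
    except Exception:
        # Timeouts, auth errors, 5xx -- anything marks it unhealthy.
        provider.is_healthy = False
    return provider.is_healthy
```

In production you would run this for every provider on a timer (e.g. every 30 seconds) so the health flags stay fresh between requests.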

The health check sends the smallest possible request -- one token of output for a trivial prompt. It costs almost nothing and takes under 500ms for a healthy provider. Any exception marks that provider as unhealthy.

The Router -- Automatic Failover

This is the core of the system. The router tries providers in priority order. If the first fails, it retries with the next healthy provider. The caller gets their response without knowing a failover happened.

The LLMRouter class
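The router described above can be sketched as follows, assuming the ProviderConfig from earlier (with its failures counter and update_latency method):

```python
import time

class LLMRouter:
    FAILURE_THRESHOLD = 3  # consecutive failures before auto-unhealthy

    def __init__(self, providers):
        # Try providers in ascending priority order (lowest number first).
        self.providers = sorted(providers, key=lambda p: p.priority)

    async def complete(self, model: str, messages: list[dict]) -> dict:
        last_error = None
        for provider in self.providers:
            if not provider.is_healthy or model not in provider.model_map:
                continue
            try:
                start = time.perf_counter()
                response = await provider.client.chat.completions.create(
                    model=provider.model_map[model],  # translate generic name
                    messages=messages,
                )
                latency_ms = (time.perf_counter() - start) * 1000
                provider.failures = 0  # success resets the counter
                provider.update_latency(latency_ms)
                return {
                    "content": response.choices[0].message.content,
                    "provider": provider.name,
                    "latency_ms": round(latency_ms, 1),
                }
            except Exception as exc:
                last_error = exc
                provider.failures += 1
                if provider.failures >= self.FAILURE_THRESHOLD:
                    provider.is_healthy = False  # stop sending traffic here
        raise RuntimeError(f"All providers failed for {model!r}") from last_error
```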

The complete() method loops through healthy providers in priority order. On success, it records the latency and returns immediately. On failure, it increments the failure counter and moves on. After 3 consecutive failures, that provider is automatically marked unhealthy.

Notice the return value includes provider and latency_ms. These are invaluable for monitoring -- you can track which provider is handling traffic and how fast it responds.

Making a routed request
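Calling the router looks like this -- a sketch that assumes a router instance built from the three providers:

```python
import asyncio

# Assumes `router` is an LLMRouter built from the three ProviderConfigs.
async def main() -> None:
    result = await router.complete(
        model="llama-3.1-8b",  # generic name; the router translates it
        messages=[{"role": "user", "content": "Summarize HTTP/2 in one sentence."}],
    )
    # provider and latency_ms come back alongside the content -- log them.
    print(f"[{result['provider']} {result['latency_ms']}ms] {result['content']}")

# asyncio.run(main())  # run once the router is constructed
```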

Your application code calls router.complete() and gets the response. It does not know or care which provider handled it. If Groq goes down, the request silently goes to Together AI. If that is also down, OpenRouter handles it.

Build a Provider Selection Function
Write Code

Write a function called select_provider that takes a list of provider dictionaries and a requested model name. Each provider dict has keys "name" (str), "is_healthy" (bool), "priority" (int), and "models" (list of str). The function should return the name of the highest-priority (lowest priority number) healthy provider that supports the requested model. If no provider matches, return None.


Simulating Failover -- Watching the Router Recover

Simulating provider failure
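One way to simulate an outage, assuming the router from the previous section: flip the primary provider's health flag off, as if its health checks had just failed, and issue the same request again:

```python
import asyncio

# Assumes `router` is the LLMRouter from earlier.
async def simulate_outage() -> None:
    primary = router.providers[0]
    primary.is_healthy = False  # pretend health checks just failed

    result = await router.complete(
        model="llama-3.1-8b",
        messages=[{"role": "user", "content": "Still there?"}],
    )
    print(f"Handled by: {result['provider']}")  # a lower-priority provider

    primary.is_healthy = True  # simulate recovery

# asyncio.run(simulate_outage())
```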

The caller's code did not change. The router silently routed to the next available provider. In production, you would log this failover event, but the user experience stays uninterrupted.

Let's also see how the status dashboard looks after some traffic:

Provider status dashboard
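A minimal text dashboard over the router's provider list might look like this -- it only reads the fields the ProviderConfig sketch already tracks:

```python
def print_status(router) -> None:
    # One row per provider: health flag, failure count, EMA latency.
    print(f"{'provider':<12}{'healthy':<9}{'failures':<10}{'avg ms':<8}")
    for p in sorted(router.providers, key=lambda p: p.priority):
        print(f"{p.name:<12}{str(p.is_healthy):<9}{p.failures:<10}{p.avg_latency_ms:<8.1f}")
```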

Common Mistakes and How to Avoid Them

I have seen each of these mistakes in production codebases. They look harmless but cause real issues when providers misbehave.

Hardcoded provider-specific model names
# Scattered across your codebase
response = await groq_client.chat.completions.create(
    model="llama-3.1-8b-instant",  # Groq-specific
    messages=messages,
)
Generic names with a mapping layer
# Your code uses generic names
result = await router.complete(
    model="llama-3.1-8b",  # Router translates
    messages=messages,
)

Hardcoding provider-specific model names locks you into one provider. A model mapping layer means you change the mapping in one place when you need to switch.

No timeout on health checks
# Can hang for 30+ seconds on a degraded provider
response = await client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "test"}],
)
Health checks with a timeout
import asyncio

try:
    response = await asyncio.wait_for(
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "test"}],
            max_tokens=1,
        ),
        timeout=5.0,
    )
except asyncio.TimeoutError:
    provider.is_healthy = False

Without a timeout, a health check against a degraded provider can block for 30 seconds or more. asyncio.wait_for() bounds the wait time.

Build a Failover Tracker
Write Code

Write a class called FailoverTracker that tracks provider failures and determines health status. It should have:

1. __init__(self, threshold: int) -- sets the failure threshold

2. record_success(self, provider: str) -- resets the failure count for that provider to 0

3. record_failure(self, provider: str) -- increments the failure count

4. is_healthy(self, provider: str) -- returns True if failures are below the threshold, False otherwise. Unknown providers are healthy.

5. get_status(self) -- returns a dict mapping each tracked provider name to its failure count


Frequently Asked Questions

Can I add OpenAI or Anthropic to this router?

OpenAI works immediately -- set base_url to "https://api.openai.com/v1" and add it as another ProviderConfig. Anthropic uses a different API format, so you would either use a wrapper or access Claude models through OpenRouter's OpenAI-compatible interface.

How do I handle rate limits across providers?

Track x-ratelimit-remaining headers from each provider's responses. When a provider approaches its limit, de-prioritize it before hitting a 429 error:

Rate limit awareness
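A sketch of that bookkeeping. The helper and its rate_limited flag are hypothetical names, and the threshold of 10 remaining requests is an assumption to tune; with the openai SDK you can read response headers via client.chat.completions.with_raw_response.create(...):

```python
def update_rate_limit(provider, headers) -> None:
    """Flag a provider as rate-limited when it nears its quota.

    `headers` is the response header mapping from the provider. The
    threshold below is an assumption -- tune it to your traffic.
    """
    remaining = headers.get("x-ratelimit-remaining")
    provider.rate_limited = remaining is not None and int(remaining) < 10
```

The router's selection loop can then skip (or down-rank) any provider with rate_limited set, so you shed load before the first 429 instead of after.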

Should I use LiteLLM instead of building my own?

LiteLLM is a mature open-source library that does everything this tutorial teaches and more. If you need production-grade multi-provider routing today, start with LiteLLM. Building your own, as we did here, teaches the underlying concepts so you understand what LiteLLM does and can debug issues when they arise.

What about cost differences between providers?

The same model on different providers can have different per-token prices. Add a cost_per_1k_tokens field to ProviderConfig and the router can factor cost into routing decisions. Route to the cheapest provider for batch jobs and the fastest for interactive requests.

References

  • Groq API Documentation. Link
  • Together AI API Reference. Link
  • OpenRouter API Documentation. Link
  • OpenAI Python SDK -- GitHub Repository. Link
  • LiteLLM -- Open-source LLM proxy and router. Link
  • Circuit Breaker Pattern -- Martin Fowler. Link
  • Exponential Moving Average -- Wikipedia. Link