
Hugging Face for LLM Inference: Pipeline API and Inference Endpoints in Python

Intermediate · 90 min · 3 exercises · 55 XP

You've been calling OpenAI and Anthropic APIs, and they work great. Then you need a specialized code model, a fine-tuned medical summarizer, or a 400M-parameter classifier that costs nothing to run. That's when you land on Hugging Face, staring at 800,000+ models.

I use Hugging Face almost daily — sometimes for quick experiments with the Inference API, sometimes for local deployment with pipeline(). This tutorial covers both paths.

What Is Hugging Face and Why Does It Matter?

Hugging Face is the GitHub of machine learning models. Just like GitHub hosts code repositories, Hugging Face hosts trained models — over 800,000 of them — that you can run with a few lines of Python.

OpenAI and Anthropic give you access to their models through their APIs. Hugging Face is different: it's a platform where anyone can publish and share models. Meta releases Llama there. Google publishes Gemma. Thousands of researchers upload fine-tuned models for everything from sentiment analysis to protein folding.

There are three ways to run a Hugging Face model, and each fits a different stage of your workflow:

1. Locally, with pipeline() from the transformers library — the model downloads to your machine.
2. In the cloud, through the free Inference API, using InferenceClient from huggingface_hub.
3. On dedicated, paid Inference Endpoints when you need guaranteed uptime at production scale.

We'll cover approaches 1 and 2 in depth. By the end, you'll have working code for both and know when to use each.

The first time I opened the Model Hub, I was overwhelmed. 800,000 models, dozens of task categories, cryptic names like microsoft/phi-2. After months of use, I've developed a system for finding the right model in under 2 minutes.

Every model has a model card — a structured page with everything you need to know before downloading.

The search filters are your best friend. Select a Task (e.g., Text Generation), then sort by Most Downloads. This surfaces battle-tested models and pushes experimental ones to page 50.

You can also browse models programmatically, which is useful for comparing options in a script.

Search the Model Hub from Python
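Here's a sketch of that programmatic search, assuming a recent huggingface_hub (`pip install huggingface_hub`); `top_models` and `format_leaderboard` are helper names of my own, and `downloads` can be missing for some listings:

```python
def top_models(task="text-generation", limit=5):
    """Most-downloaded Hub models for a task, as (model_id, downloads) pairs."""
    from huggingface_hub import list_models  # deferred import; this call needs network access
    models = list_models(task=task, sort="downloads", direction=-1, limit=limit)
    return [(m.id, m.downloads) for m in models]

def format_leaderboard(rows):
    """Align (model_id, downloads) pairs into a printable leaderboard."""
    return "\n".join(f"{downloads:>12,}  {model_id}" for model_id, downloads in rows)

# print(format_leaderboard(top_models()))  # requires network + a recent huggingface_hub
```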

The output shows the most popular text-generation models. You'll typically see Meta's Llama, GPT-2, and Mistral near the top. These rankings shift as new models are released.

Local Inference with pipeline() — Three Lines to a Working Model

Your first pipeline() call
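A minimal sketch of that first call, assuming transformers is installed (`pip install transformers`); the `analyze_sentiment` and `is_confident` helpers are my own wrapping:

```python
def analyze_sentiment(text):
    """Classify sentiment with the default pipeline model (downloads ~250MB on first run)."""
    from transformers import pipeline  # deferred import; pip install transformers
    classifier = pipeline("sentiment-analysis")
    return classifier(text)[0]  # a dict like {'label': ..., 'score': ...}

def is_confident(result, threshold=0.9):
    """True when the classifier's score clears the given threshold."""
    return result["score"] >= threshold

# analyze_sentiment("I love how simple this API is!")
# returns a dict with a 'label' of POSITIVE/NEGATIVE and a confidence 'score'
```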

Three lines of logic: import, create pipeline, call it. The pipeline() function from transformers handles downloading, loading, tokenizing, and post-processing. I reach for it whenever I need to prototype something quickly or want zero cloud dependency.

For sentiment analysis, pipeline() picks distilbert-base-uncased-finetuned-sst-2-english as the default model. You can override this with any compatible model from the Hub.

Choosing a Specific Model

The defaults are fine for learning. In practice, you'll want a specific model — maybe multilingual support, or one fine-tuned on your domain.

Using a specific model for star-rating sentiment
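As a sketch, here's one such override using nlptown/bert-base-multilingual-uncased-sentiment, a Hub model that rates text 1-5 stars (the model choice is an example; any compatible Hub ID works, and `stars_from_label` is a helper of mine):

```python
def rate_review(text):
    """Score a review 1-5 stars with a multilingual sentiment model from the Hub."""
    from transformers import pipeline  # deferred import
    classifier = pipeline(
        "sentiment-analysis",
        model="nlptown/bert-base-multilingual-uncased-sentiment",
    )
    return classifier(text)[0]  # a dict like {'label': '5 stars', 'score': ...}

def stars_from_label(label):
    """Convert a label like '4 stars' into the integer 4."""
    return int(label.split()[0])
```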

Common Pipeline Tasks

The pipeline() function supports over 25 NLP tasks. Here are the three I use most, with the exact task string and a practical example.

Three common pipeline tasks
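A sketch of those three tasks in one place. The generation and summarization models follow the text below; the NER checkpoint (dslim/bert-base-NER) is one popular choice I'm assuming, and `run_three_tasks` is my own wrapper:

```python
EXAMPLE_MODELS = {
    "text-generation": "gpt2",                    # ~500MB
    "summarization": "facebook/bart-large-cnn",   # ~1.6GB
    "ner": "dslim/bert-base-NER",                 # ~400MB, one popular NER checkpoint
}

def run_three_tasks(text):
    """Run generation, summarization, and NER over the same input."""
    from transformers import pipeline  # deferred import
    generator = pipeline("text-generation", model=EXAMPLE_MODELS["text-generation"])
    summarizer = pipeline("summarization", model=EXAMPLE_MODELS["summarization"])
    ner = pipeline("ner", model=EXAMPLE_MODELS["ner"], aggregation_strategy="simple")
    return {
        "generation": generator(text, max_new_tokens=30)[0]["generated_text"],
        "summary": summarizer(text, max_length=60, min_length=10)[0]["summary_text"],
        "entities": [(e["word"], e["entity_group"]) for e in ner(text)],
    }
```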

Each task downloads a different model on first run. GPT-2 is about 500MB. BART summarization is about 1.6GB. The NER model is a lean 400MB.

Text Generation Parameters

Text generation deserves a closer look. The parameters matter more here than in classification — wrong settings produce gibberish or repetitive loops.

Temperature and sampling with text generation
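A sketch of sampling-controlled generation with GPT-2 (the `generate` wrapper and its injectable `generator` argument are my own conveniences, not a transformers API):

```python
def generate(prompt, generator=None, temperature=0.7, top_p=0.9, max_new_tokens=60):
    """Sample text from GPT-2 with explicit decoding controls."""
    if generator is None:
        from transformers import pipeline  # deferred import
        pipe = pipeline("text-generation", model="gpt2")
        # GPT-2 has no pad token; passing eos (50256) as pad silences a warning
        generator = lambda p, **kw: pipe(p, pad_token_id=50256, **kw)
    out = generator(
        prompt,
        do_sample=True,             # enable sampling; greedy decoding ignores temperature
        temperature=temperature,    # <1.0 = more focused, >1.0 = more adventurous
        top_p=top_p,                # nucleus sampling: truncate the unlikely tail
        max_new_tokens=max_new_tokens,  # cap output length explicitly
    )
    return out[0]["generated_text"]
```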

These parameters work exactly like what you learned in the sampling parameters tutorial. The difference is that here you control the model directly instead of calling an API.

Parse Model Metadata
Write Code

Write a function parse_model_id(model_id) that takes a Hugging Face model ID string (like "meta-llama/Llama-3-8B" or "gpt2") and returns a dictionary with two keys: "organization" (the part before the slash, or "community" if there's no slash) and "model_name" (the part after the slash, or the full string if there's no slash).

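If you want to check your work afterwards, here's one possible solution:

```python
def parse_model_id(model_id):
    """Split a Hub model id into its organization and model name."""
    if "/" in model_id:
        org, name = model_id.split("/", 1)  # split on the first slash only
        return {"organization": org, "model_name": name}
    # No slash means a community model like "gpt2"
    return {"organization": "community", "model_name": model_id}
```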

Cloud Inference with the Hugging Face Inference API

What if you don't want to download a 2GB model just to test it? The Inference API lets you call any model on the Hub through HTTP — no downloads, no GPU, no setup. I use it constantly for comparing models before committing to a local deployment.

You'll need a free Hugging Face token. Go to huggingface.co/settings/tokens, create a token with "Read" access, and paste it below.

Your first Inference API call
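A sketch of that first API call. It assumes your token is in an HF_TOKEN environment variable; the model ID is just an example of a hosted text-generation model (which models the free tier serves changes over time), and the injectable `client` argument is my own testing convenience:

```python
def ask(prompt, client=None, model="mistralai/Mistral-7B-Instruct-v0.2"):
    """Send a prompt to the Inference API and return the generated text."""
    if client is None:
        import os
        from huggingface_hub import InferenceClient  # deferred import
        client = InferenceClient(token=os.environ["HF_TOKEN"])
    return client.text_generation(prompt, model=model, max_new_tokens=100)

# print(ask("Explain tokenization in one sentence."))  # needs network + HF_TOKEN
```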

That call sent your prompt to Hugging Face's servers and returned the result. No downloads, no GPU. The free tier gives you rate-limited access to most Hub models.

Chat-Style Conversations

For chat models, InferenceClient has a dedicated chat_completion method. The message format matches OpenAI's API — same roles, same structure.

Chat completion with the Inference API
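A sketch of a chat_completion call (the model ID is an example chat model on the Hub, and the injectable `client` is my own convenience):

```python
def chat(user_message, client=None, model="HuggingFaceH4/zephyr-7b-beta"):
    """OpenAI-style chat call against the Inference API."""
    if client is None:
        import os
        from huggingface_hub import InferenceClient  # deferred import
        client = InferenceClient(token=os.environ["HF_TOKEN"])
    response = client.chat_completion(
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": user_message},
        ],
        model=model,
        max_tokens=150,
    )
    # Same access pattern as the OpenAI client
    return response.choices[0].message.content
```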

If you've used the OpenAI client before, this looks familiar: a messages list with roles, max_tokens, choices[0].message.content. Hugging Face adopted this format so you can swap providers with minimal code changes.

Beyond Text: Classification and Summarization

The Inference API handles more than text generation. Classification, summarization, and dozens of other tasks work through the same client.

Classification and summarization via the API
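A sketch combining both tasks through one client. The model IDs are examples, the injectable `client` is my own convenience, and I'm assuming a recent huggingface_hub where these methods return objects with .label/.score and .summary_text attributes:

```python
def classify_and_summarize(text, client=None):
    """Run text classification and summarization through the Inference API."""
    if client is None:
        import os
        from huggingface_hub import InferenceClient  # deferred import
        client = InferenceClient(token=os.environ["HF_TOKEN"])
    top = client.text_classification(
        text, model="distilbert-base-uncased-finetuned-sst-2-english"
    )[0]
    summary = client.summarization(text, model="facebook/bart-large-cnn")
    return {"label": top.label, "score": top.score, "summary": summary.summary_text}
```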

Each method maps to a task type on the Hub. If the model is cold (hasn't been used recently), the first call might take 20-30 seconds while Hugging Face loads it.

Build a Model Response Formatter
Write Code

Write a function format_model_comparison(results) that takes a list of dictionaries — each with keys "model", "output", and "latency_ms" — and returns a formatted string. Each line follows the format: "[{latency_ms}ms] {model}: {output}". Sort by latency, fastest first.

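One possible solution, if you want to compare against your own:

```python
def format_model_comparison(results):
    """Format benchmark rows as '[<latency>ms] <model>: <output>', fastest first."""
    ordered = sorted(results, key=lambda r: r["latency_ms"])
    return "\n".join(
        f"[{r['latency_ms']}ms] {r['model']}: {r['output']}" for r in ordered
    )
```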

Working with Model Outputs

Raw model outputs rarely match what your application needs. Classification returns a label and score. Text generation returns your prompt concatenated with the response. Here's how to wrangle these into something useful.

Parsing common model output formats
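A sketch of the two most common cases (the helper names are mine):

```python
def parse_classification(raw):
    """pipeline('sentiment-analysis') returns [{'label': ..., 'score': ...}];
    pull out the top label and a rounded confidence."""
    top = raw[0]
    return top["label"], round(top["score"], 3)

def strip_prompt(generated_text, prompt):
    """Text generation echoes the input prompt; remove it to keep only the new text."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text
```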

The pattern is consistent: get raw output, extract what you need, format it. For text generation, always strip the input prompt unless you want it echoed.

Local vs Cloud Inference — Picking the Right Path

After using both on real projects, the choice comes down to three factors: privacy requirements, call volume, and iteration speed.

Local vs cloud decision matrix
Local with pipeline() (transformers)
# Pros:
# - 100% private — data stays on your machine
# - Free after initial download
# - No rate limits or quotas
# - Works offline
#
# Cons:
# - Downloads are large (250MB - 16GB+)
# - Needs sufficient RAM/GPU
# - Slower on CPU than cloud GPU
# - You manage updates and dependencies
Cloud with InferenceClient (huggingface_hub)
# Pros:
# - No downloads or GPU needed
# - Try any of 800K+ models instantly
# - Hugging Face provides the GPU
# - Lightweight dependency
#
# Cons:
# - Data leaves your machine
# - Free tier: ~1000 requests/day
# - Cold start delays (20-30s)
# - Requires internet connection

Building a Reusable Inference Wrapper

Switching between local and cloud shouldn't mean rewriting your application. This wrapper abstracts the backend choice behind a consistent interface.

Unified inference wrapper

The _pipeline_cache dictionary stores loaded pipelines so you avoid re-downloading models on every call. The backend parameter is the only thing that changes between local and cloud.

Using the wrapper — same code, different backend
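The call pattern looks like this (the `run_inference` name and signature in the comments are placeholders for whatever interface you build):

```python
# The only supported switch is the backend string; everything else stays identical.
BACKENDS = ("local", "cloud")

# Local during development: free, private, no rate limits.
# result = run_inference("summarization", text, model="facebook/bart-large-cnn", backend="local")

# Cloud when you'd rather not download 1.6GB: Hugging Face provides the GPU.
# result = run_inference("summarization", text, model="facebook/bart-large-cnn",
#                        backend="cloud", token=HF_TOKEN)
```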

Error Handling and Rate Limits

The Inference API will fail on you. Models go cold. Rate limits kick in. Tokens expire. I've hit every one of these in production, and the fix is always the same pattern: catch, wait, retry.

Retry logic with exponential backoff
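A sketch of the catch-wait-retry pattern (the injectable `sleep` argument is my own convenience so tests don't actually wait):

```python
import time

def with_retries(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay, 2x, 4x, ... before each retry.
    Makes up to max_retries + 1 attempts, then re-raises the last error."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries; surface the error
            sleep(delay)
            delay *= 2  # exponential backoff

# usage: result = with_retries(lambda: client.text_generation(prompt, model=model_id))
```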

Exponential backoff doubles the wait on each retry — 1s, 2s, 4s. Hammering a rate-limited API makes things worse, not better.

Common Inference API errors and fixes
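A quick reference, kept as a lookup table. The status codes follow standard HTTP semantics; the cold-start wait is the rough 20-30s figure from above, not a guarantee:

```python
COMMON_ERRORS = {
    401: "Invalid or missing token - regenerate it at huggingface.co/settings/tokens",
    404: "Model id not found - check spelling, and pass your token for private models",
    429: "Rate limit hit - back off and retry, or upgrade your plan",
    503: "Model is cold - the first call loads it; retry after 20-30 seconds",
}

def explain_error(status_code):
    """Map an HTTP status code to a likely fix."""
    return COMMON_ERRORS.get(status_code, "Unexpected error - check the response body")
```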

Four Mistakes Everyone Makes with Hugging Face

I've made all of these. Watching colleagues repeat them is what convinced me to write this section.

Wrong: Using a model for the wrong task
# BERT is a classification model, NOT generative
generator = pipeline(
    "text-generation",
    model="bert-base-uncased"
)
# This errors or produces garbage
Right: Match the task to the model type
# GPT-2 IS a generative model
generator = pipeline(
    "text-generation",
    model="gpt2"
)
# Check the model's task tag on the Hub first
Wrong: No max_new_tokens for generation
# Default max length is often 20 tokens
# Output gets cut off mid-sentence
result = generator("Tell me about Python")
Right: Explicit output length control
# Control exactly how much text you get
result = generator(
    "Tell me about Python",
    max_new_tokens=100
)
Mistakes 3 and 4: Memory and licensing
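A sketch of both. The back-of-envelope memory rule (bytes per parameter times parameter count) is standard; the helper name is mine:

```python
def estimate_memory_gb(n_params, bytes_per_param=2):
    """Rough RAM needed just for the weights (2 bytes/param for fp16, 4 for fp32).
    Activations and framework overhead push real usage higher."""
    return n_params * bytes_per_param / 1e9

# Mistake 3: loading a 7B model on an 8GB laptop.
# estimate_memory_gb(7e9) gives 14.0 GB for the weights alone - it won't fit.
# Check the parameter count on the model card before calling pipeline().

# Mistake 4: ignoring the license. The license tag is on every model page.
# Some models (e.g. the Llama family) require accepting a license first,
# and some are research-only - check before you ship.
```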
Build a Simple Rate Limiter
Write Code

Write a function should_allow_request(request_log, current_time, max_requests, window_seconds) that implements a sliding-window rate limiter. It takes a list of previous request timestamps (floats), the current time (float), a max allowed requests, and a window size in seconds. Return True if the request is allowed, False otherwise.

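One possible solution to compare against:

```python
def should_allow_request(request_log, current_time, max_requests, window_seconds):
    """Sliding-window rate limiter: allow only if fewer than max_requests
    timestamps fall inside the last window_seconds."""
    window_start = current_time - window_seconds
    recent = [t for t in request_log if t > window_start]  # drop expired entries
    return len(recent) < max_requests
```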

Putting It Together — A Multi-Task NLP Pipeline

Here's a scenario I build for clients regularly: customer feedback arrives as text, and you need to extract the sentiment, a summary, and key entities — all using Hugging Face models.

Multi-task analysis pipeline
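A sketch of that pipeline. The three task runners are injectable (my own design choice, which also makes the function testable); by default each builds a transformers pipeline, with model choices as examples:

```python
def analyze_feedback(text, sentiment=None, summarize=None, extract_entities=None):
    """Extract sentiment, a summary, and named entities from one piece of feedback."""
    if sentiment is None or summarize is None or extract_entities is None:
        from transformers import pipeline  # deferred import, only if defaults are used
        sentiment = sentiment or pipeline("sentiment-analysis")
        summarize = summarize or pipeline("summarization", model="facebook/bart-large-cnn")
        extract_entities = extract_entities or pipeline("ner", aggregation_strategy="simple")
    return {
        "sentiment": sentiment(text)[0]["label"],
        "summary": summarize(text, max_length=60, min_length=5)[0]["summary_text"],
        "entities": [e["word"] for e in extract_entities(text)],
    }
```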

In production, you'd add the retry logic from the error handling section and run tasks concurrently with asyncio. The synchronous version works for moderate volumes.

Summary

Hugging Face gives you the largest collection of ML models anywhere — and two clean ways to use them. pipeline() from transformers runs models locally. InferenceClient from huggingface_hub calls them in the cloud.

The key mental model: Hub for discovery, pipeline for local, InferenceClient for cloud. Start with the Inference API to prototype. Switch to local when you need privacy, speed, or zero ongoing cost. Both use the same model IDs — that's the whole point of the Hub.

Hugging Face inference cheat sheet
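The essentials, condensed into one lookup table (call shapes are abbreviated reminders, not runnable snippets):

```python
CHEAT_SHEET = {
    "find models":  "huggingface.co/models - filter by Task, sort by Most Downloads",
    "local":        'pipeline("task", model="org/name")  # from transformers',
    "cloud":        'InferenceClient(token=...).text_generation(prompt, model="org/name")',
    "chat":         "InferenceClient(...).chat_completion(messages=[...], model=..., max_tokens=...)",
    "cold starts":  "first cloud call to an idle model may take 20-30s",
    "rate limits":  "free tier is limited; retry with exponential backoff",
}
```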

Frequently Asked Questions

Can I fine-tune models through the Inference API?

No. The Inference API is read-only — you can run models but not train them. For fine-tuning, use transformers locally or Hugging Face's AutoTrain service. Fine-tuning is covered in a later tutorial in this path.

How do Inference Endpoints differ from the free API?

Inference Endpoints are dedicated, paid instances with guaranteed uptime and no cold starts. The free API uses shared infrastructure with rate limits. Endpoints start at about $0.06/hour for a small GPU.

Can I use private models with the Inference API?

Yes, if the model is hosted in your Hugging Face organization. Pass your token and the full model ID (your-org/your-model). Private models on the free tier have the same rate limits.

What models work best for each approach?

For local pipeline(): stick to models under 1B parameters on CPU (DistilBERT, MiniLM, GPT-2). For the Inference API: you can use larger models (7B+) since Hugging Face provides the GPU. For Inference Endpoints: any model the Hub supports.

References and Further Reading

  • Hugging Face Model Hub — Browse 800K+ models by task, language, and license
  • Transformers Documentation — Pipelines — Official pipeline() reference
  • Hugging Face Hub Python Library — InferenceClient API reference
  • Inference API Documentation — Rate limits, supported models, pricing
  • Hugging Face Inference Endpoints — Dedicated deployment guide
  • Hugging Face Blog — Latest model releases and platform updates
  • Transformers GitHub Repository — Source code and examples