
Hugging Face for LLM Inference: Pipeline API and Inference Endpoints in Python

Intermediate · 90 min · 3 exercises · 55 XP

You've been calling OpenAI and Anthropic APIs, and they work great. Then you need a specialized code model, a fine-tuned medical summarizer, or a 400M-parameter classifier that costs nothing to run. That's when you land on Hugging Face, staring at 800,000+ models.

I use Hugging Face almost daily — sometimes for quick experiments with the Inference API, sometimes for local deployment with pipeline(). This tutorial covers both paths.

What Is Hugging Face and Why Does It Matter?

Hugging Face is the GitHub of machine learning models. Just like GitHub hosts code repositories, Hugging Face hosts trained models — over 800,000 of them — that you can run with a few lines of Python.

OpenAI and Anthropic give you access to their models through their APIs. Hugging Face is different: it's a platform where anyone can publish and share models. Meta releases Llama there. Google publishes Gemma. Thousands of researchers upload fine-tuned models for everything from sentiment analysis to protein folding.

There are three ways to run a Hugging Face model, and each fits a different stage of your workflow:

1. Locally, with pipeline() from the transformers library — the model downloads to your machine.
2. In the cloud, through the free Inference API, using InferenceClient from huggingface_hub.
3. On dedicated, paid Inference Endpoints when you need guaranteed uptime at production scale.

We'll cover approaches 1 and 2 in depth. By the end, you'll have working code for both and know when to use each.

The first time I opened the Model Hub, I was overwhelmed. 800,000 models, dozens of task categories, cryptic names like microsoft/phi-2. After months of use, I've developed a system for finding the right model in under 2 minutes.

Every model has a model card — a structured page with everything you need to know before downloading.

The search filters are your best friend. Select a Task (e.g., Text Generation), then sort by Most Downloads. This surfaces battle-tested models and pushes experimental ones to page 50.

You can also browse models programmatically, which is useful for comparing options in a script.

Search the Model Hub from Python
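Here's a sketch of that programmatic search, assuming a recent huggingface_hub (`pip install huggingface_hub`); `top_models` and `format_leaderboard` are helper names of my own, and `downloads` can be missing for some listings:

```python
def top_models(task="text-generation", limit=5):
    """Most-downloaded Hub models for a task, as (model_id, downloads) pairs."""
    from huggingface_hub import list_models  # deferred import; this call needs network access
    models = list_models(task=task, sort="downloads", direction=-1, limit=limit)
    return [(m.id, m.downloads) for m in models]

def format_leaderboard(rows):
    """Align (model_id, downloads) pairs into a printable leaderboard."""
    return "\n".join(f"{downloads:>12,}  {model_id}" for model_id, downloads in rows)

# print(format_leaderboard(top_models()))  # requires network + a recent huggingface_hub
```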

The output shows the most popular text-generation models. You'll typically see Meta's Llama, GPT-2, and Mistral near the top. These rankings shift as new models are released.

Local Inference with pipeline() — Three Lines to a Working Model

Your first pipeline() call
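A minimal sketch of that first call, assuming transformers is installed (`pip install transformers`); the `analyze_sentiment` and `is_confident` helpers are my own wrapping:

```python
def analyze_sentiment(text):
    """Classify sentiment with the default pipeline model (downloads ~250MB on first run)."""
    from transformers import pipeline  # deferred import; pip install transformers
    classifier = pipeline("sentiment-analysis")
    return classifier(text)[0]  # a dict like {'label': ..., 'score': ...}

def is_confident(result, threshold=0.9):
    """True when the classifier's score clears the given threshold."""
    return result["score"] >= threshold

# analyze_sentiment("I love how simple this API is!")
# returns a dict with a 'label' of POSITIVE/NEGATIVE and a confidence 'score'
```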

Three lines of logic: import, create pipeline, call it. The pipeline() function from transformers handles downloading, loading, tokenizing, and post-processing. I reach for it whenever I need to prototype something quickly or want zero cloud dependency.

For sentiment analysis, pipeline() picks distilbert-base-uncased-finetuned-sst-2-english as the default model. You can override this with any compatible model from the Hub.

Choosing a Specific Model

The defaults are fine for learning. In practice, you'll want a specific model — maybe multilingual support, or one fine-tuned on your domain.

Using a specific model for star-rating sentiment
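As a sketch, here's one such override using nlptown/bert-base-multilingual-uncased-sentiment, a Hub model that rates text 1-5 stars (the model choice is an example; any compatible Hub ID works, and `stars_from_label` is a helper of mine):

```python
def rate_review(text):
    """Score a review 1-5 stars with a multilingual sentiment model from the Hub."""
    from transformers import pipeline  # deferred import
    classifier = pipeline(
        "sentiment-analysis",
        model="nlptown/bert-base-multilingual-uncased-sentiment",
    )
    return classifier(text)[0]  # a dict like {'label': '5 stars', 'score': ...}

def stars_from_label(label):
    """Convert a label like '4 stars' into the integer 4."""
    return int(label.split()[0])
```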

Common Pipeline Tasks

The pipeline() function supports over 25 NLP tasks. Here are the three I use most, with the exact task string and a practical example.

Three common pipeline tasks
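A sketch of those three tasks in one place. The generation and summarization models follow the text below; the NER checkpoint (dslim/bert-base-NER) is one popular choice I'm assuming, and `run_three_tasks` is my own wrapper:

```python
EXAMPLE_MODELS = {
    "text-generation": "gpt2",                    # ~500MB
    "summarization": "facebook/bart-large-cnn",   # ~1.6GB
    "ner": "dslim/bert-base-NER",                 # ~400MB, one popular NER checkpoint
}

def run_three_tasks(text):
    """Run generation, summarization, and NER over the same input."""
    from transformers import pipeline  # deferred import
    generator = pipeline("text-generation", model=EXAMPLE_MODELS["text-generation"])
    summarizer = pipeline("summarization", model=EXAMPLE_MODELS["summarization"])
    ner = pipeline("ner", model=EXAMPLE_MODELS["ner"], aggregation_strategy="simple")
    return {
        "generation": generator(text, max_new_tokens=30)[0]["generated_text"],
        "summary": summarizer(text, max_length=60, min_length=10)[0]["summary_text"],
        "entities": [(e["word"], e["entity_group"]) for e in ner(text)],
    }
```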

Each task downloads a different model on first run. GPT-2 is about 500MB. BART summarization is about 1.6GB. The NER model is a lean 400MB.

Text Generation Parameters

Text generation deserves a closer look. The parameters matter more here than in classification — wrong settings produce gibberish or repetitive loops.

Temperature and sampling with text generation
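A sketch of sampling-controlled generation with GPT-2 (the `generate` wrapper and its injectable `generator` argument are my own conveniences, not a transformers API):

```python
def generate(prompt, generator=None, temperature=0.7, top_p=0.9, max_new_tokens=60):
    """Sample text from GPT-2 with explicit decoding controls."""
    if generator is None:
        from transformers import pipeline  # deferred import
        pipe = pipeline("text-generation", model="gpt2")
        # GPT-2 has no pad token; passing eos (50256) as pad silences a warning
        generator = lambda p, **kw: pipe(p, pad_token_id=50256, **kw)
    out = generator(
        prompt,
        do_sample=True,             # enable sampling; greedy decoding ignores temperature
        temperature=temperature,    # <1.0 = more focused, >1.0 = more adventurous
        top_p=top_p,                # nucleus sampling: truncate the unlikely tail
        max_new_tokens=max_new_tokens,  # cap output length explicitly
    )
    return out[0]["generated_text"]
```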

These parameters work exactly like what you learned in the sampling parameters tutorial. The difference is that here you control the model directly instead of calling an API.

Parse Model Metadata
Write Code

Write a function parse_model_id(model_id) that takes a Hugging Face model ID string (like "meta-llama/Llama-3-8B" or "gpt2") and returns a dictionary with two keys: "organization" (the part before the slash, or "community" if there's no slash) and "model_name" (the part after the slash, or the full string if there's no slash).

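If you want to check your work afterwards, here's one possible solution:

```python
def parse_model_id(model_id):
    """Split a Hub model id into its organization and model name."""
    if "/" in model_id:
        org, name = model_id.split("/", 1)  # split on the first slash only
        return {"organization": org, "model_name": name}
    # No slash means a community model like "gpt2"
    return {"organization": "community", "model_name": model_id}
```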

Cloud Inference with the Hugging Face Inference API

What if you don't want to download a 2GB model just to test it? The Inference API lets you call any model on the Hub through HTTP — no downloads, no GPU, no setup. I use it constantly for comparing models before committing to a local deployment.

You'll need a free Hugging Face token. Go to huggingface.co/settings/tokens, create a token with "Read" access, and paste it below.

Your first Inference API call
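A sketch of that first API call. It assumes your token is in an HF_TOKEN environment variable; the model ID is just an example of a hosted text-generation model (which models the free tier serves changes over time), and the injectable `client` argument is my own testing convenience:

```python
def ask(prompt, client=None, model="mistralai/Mistral-7B-Instruct-v0.2"):
    """Send a prompt to the Inference API and return the generated text."""
    if client is None:
        import os
        from huggingface_hub import InferenceClient  # deferred import
        client = InferenceClient(token=os.environ["HF_TOKEN"])
    return client.text_generation(prompt, model=model, max_new_tokens=100)

# print(ask("Explain tokenization in one sentence."))  # needs network + HF_TOKEN
```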

That call sent your prompt to Hugging Face's servers and returned the result. No downloads, no GPU. The free tier gives you rate-limited access to most Hub models.

Chat-Style Conversations

For chat models, InferenceClient has a dedicated chat_completion method. The message format matches OpenAI's API — same roles, same structure.

Chat completion with the Inference API
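A sketch of a chat_completion call (the model ID is an example chat model on the Hub, and the injectable `client` is my own convenience):

```python
def chat(user_message, client=None, model="HuggingFaceH4/zephyr-7b-beta"):
    """OpenAI-style chat call against the Inference API."""
    if client is None:
        import os
        from huggingface_hub import InferenceClient  # deferred import
        client = InferenceClient(token=os.environ["HF_TOKEN"])
    response = client.chat_completion(
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": user_message},
        ],
        model=model,
        max_tokens=150,
    )
    # Same access pattern as the OpenAI client
    return response.choices[0].message.content
```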

If you've used the OpenAI client before, this looks familiar: a messages list with roles, max_tokens, choices[0].message.content. Hugging Face adopted this format so you can swap providers with minimal code changes.

Beyond Text: Classification and Summarization

The Inference API handles more than text generation. Classification, summarization, and dozens of other tasks work through the same client.

Classification and summarization via the API
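A sketch combining both tasks through one client. The model IDs are examples, the injectable `client` is my own convenience, and I'm assuming a recent huggingface_hub where these methods return objects with .label/.score and .summary_text attributes:

```python
def classify_and_summarize(text, client=None):
    """Run text classification and summarization through the Inference API."""
    if client is None:
        import os
        from huggingface_hub import InferenceClient  # deferred import
        client = InferenceClient(token=os.environ["HF_TOKEN"])
    top = client.text_classification(
        text, model="distilbert-base-uncased-finetuned-sst-2-english"
    )[0]
    summary = client.summarization(text, model="facebook/bart-large-cnn")
    return {"label": top.label, "score": top.score, "summary": summary.summary_text}
```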

Each method maps to a task type on the Hub. If the model is cold (hasn't been used recently), the first call might take 20-30 seconds while Hugging Face loads it.

Build a Model Response Formatter
Write Code

Write a function format_model_comparison(results) that takes a list of dictionaries — each with keys "model", "output", and "latency_ms" — and returns a formatted string. Each line follows the format: "[{latency_ms}ms] {model}: {output}". Sort by latency, fastest first.

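One possible solution, if you want to compare against your own:

```python
def format_model_comparison(results):
    """Format benchmark rows as '[<latency>ms] <model>: <output>', fastest first."""
    ordered = sorted(results, key=lambda r: r["latency_ms"])
    return "\n".join(
        f"[{r['latency_ms']}ms] {r['model']}: {r['output']}" for r in ordered
    )
```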

Working with Model Outputs

Raw model outputs rarely match what your application needs. Classification returns a label and score. Text generation returns your prompt concatenated with the response. Here's how to wrangle these into something useful.

Parsing common model output formats
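A sketch of the two most common cases (the helper names are mine):

```python
def parse_classification(raw):
    """pipeline('sentiment-analysis') returns [{'label': ..., 'score': ...}];
    pull out the top label and a rounded confidence."""
    top = raw[0]
    return top["label"], round(top["score"], 3)

def strip_prompt(generated_text, prompt):
    """Text generation echoes the input prompt; remove it to keep only the new text."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text
```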

The pattern is consistent: get raw output, extract what you need, format it. For text generation, always strip the input prompt unless you want it echoed.

Local vs Cloud Inference — Picking the Right Path

After using both on real projects, the choice comes down to three factors: privacy requirements, call volume, and iteration speed.

Local vs cloud decision matrix
Local with pipeline() (transformers)
# Pros:
# - 100% private — data stays on your machine
# - Free after initial download
# - No rate limits or quotas
# - Works offline
#
# Cons:
# - Downloads are large (250MB - 16GB+)
# - Needs sufficient RAM/GPU
# - Slower on CPU than cloud GPU
# - You manage updates and dependencies
Cloud with InferenceClient (huggingface_hub)
# Pros:
# - No downloads or GPU needed
# - Try any of 800K+ models instantly
# - Hugging Face provides the GPU
# - Lightweight dependency
#
# Cons:
# - Data leaves your machine
# - Free tier: ~1000 requests/day
# - Cold start delays (20-30s)
# - Requires internet connection

Building a Reusable Inference Wrapper

Switching between local and cloud shouldn't mean rewriting your application. This wrapper abstracts the backend choice behind a consistent interface.

Unified inference wrapper

The _pipeline_cache dictionary stores loaded pipelines so you avoid re-downloading models on every call. The backend parameter is the only thing that changes between local and cloud.

Using the wrapper — same code, different backend
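The call pattern looks like this (the `run_inference` name and signature in the comments are placeholders for whatever interface you build):

```python
# The only supported switch is the backend string; everything else stays identical.
BACKENDS = ("local", "cloud")

# Local during development: free, private, no rate limits.
# result = run_inference("summarization", text, model="facebook/bart-large-cnn", backend="local")

# Cloud when you'd rather not download 1.6GB: Hugging Face provides the GPU.
# result = run_inference("summarization", text, model="facebook/bart-large-cnn",
#                        backend="cloud", token=HF_TOKEN)
```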

Error Handling and Rate Limits

The Inference API will fail on you. Models go cold. Rate limits kick in. Tokens expire. I've hit every one of these in production, and the fix is always the same pattern: catch, wait, retry.

Retry logic with exponential backoff
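A sketch of the catch-wait-retry pattern (the injectable `sleep` argument is my own convenience so tests don't actually wait):

```python
import time

def with_retries(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay, 2x, 4x, ... before each retry.
    Makes up to max_retries + 1 attempts, then re-raises the last error."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries; surface the error
            sleep(delay)
            delay *= 2  # exponential backoff

# usage: result = with_retries(lambda: client.text_generation(prompt, model=model_id))
```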

Exponential backoff doubles the wait on each retry — 1s, 2s, 4s. Hammering a rate-limited API makes things worse, not better.

Common Inference API errors and fixes
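A quick reference, kept as a lookup table. The status codes follow standard HTTP semantics; the cold-start wait is the rough 20-30s figure from above, not a guarantee:

```python
COMMON_ERRORS = {
    401: "Invalid or missing token - regenerate it at huggingface.co/settings/tokens",
    404: "Model id not found - check spelling, and pass your token for private models",
    429: "Rate limit hit - back off and retry, or upgrade your plan",
    503: "Model is cold - the first call loads it; retry after 20-30 seconds",
}

def explain_error(status_code):
    """Map an HTTP status code to a likely fix."""
    return COMMON_ERRORS.get(status_code, "Unexpected error - check the response body")
```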

Four Mistakes Everyone Makes with Hugging Face

I've made all of these. Watching colleagues repeat them is what convinced me to write this section.

Wrong: Using a model for the wrong task
# BERT is a classification model, NOT generative
generator = pipeline(
    "text-generation",
    model="bert-base-uncased"
)
# This errors or produces garbage
Right: Match the task to the model type
# GPT-2 IS a generative model
generator = pipeline(
    "text-generation",
    model="gpt2"
)
# Check the model's task tag on the Hub first
Wrong: No max_new_tokens for generation
# Default max length is often 20 tokens
# Output gets cut off mid-sentence
result = generator("Tell me about Python")
Right: Explicit output length control
# Control exactly how much text you get
result = generator(
    "Tell me about Python",
    max_new_tokens=100
)
Mistakes 3 and 4: Memory and licensing
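A sketch of both. The back-of-envelope memory rule (bytes per parameter times parameter count) is standard; the helper name is mine:

```python
def estimate_memory_gb(n_params, bytes_per_param=2):
    """Rough RAM needed just for the weights (2 bytes/param for fp16, 4 for fp32).
    Activations and framework overhead push real usage higher."""
    return n_params * bytes_per_param / 1e9

# Mistake 3: loading a 7B model on an 8GB laptop.
# estimate_memory_gb(7e9) gives 14.0 GB for the weights alone - it won't fit.
# Check the parameter count on the model card before calling pipeline().

# Mistake 4: ignoring the license. The license tag is on every model page.
# Some models (e.g. the Llama family) require accepting a license first,
# and some are research-only - check before you ship.
```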
Build a Simple Rate Limiter
Write Code

Write a function should_allow_request(request_log, current_time, max_requests, window_seconds) that implements a sliding-window rate limiter. It takes a list of previous request timestamps (floats), the current time (float), a max allowed requests, and a window size in seconds. Return True if the request is allowed, False otherwise.

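One possible solution to compare against:

```python
def should_allow_request(request_log, current_time, max_requests, window_seconds):
    """Sliding-window rate limiter: allow only if fewer than max_requests
    timestamps fall inside the last window_seconds."""
    window_start = current_time - window_seconds
    recent = [t for t in request_log if t > window_start]  # drop expired entries
    return len(recent) < max_requests
```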

Putting It Together — A Multi-Task NLP Pipeline

Here's a scenario I build for clients regularly: customer feedback arrives as text, and you need to extract the sentiment, a summary, and key entities — all using Hugging Face models.

Multi-task analysis pipeline
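A sketch of that pipeline. The three task runners are injectable (my own design choice, which also makes the function testable); by default each builds a transformers pipeline, with model choices as examples:

```python
def analyze_feedback(text, sentiment=None, summarize=None, extract_entities=None):
    """Extract sentiment, a summary, and named entities from one piece of feedback."""
    if sentiment is None or summarize is None or extract_entities is None:
        from transformers import pipeline  # deferred import, only if defaults are used
        sentiment = sentiment or pipeline("sentiment-analysis")
        summarize = summarize or pipeline("summarization", model="facebook/bart-large-cnn")
        extract_entities = extract_entities or pipeline("ner", aggregation_strategy="simple")
    return {
        "sentiment": sentiment(text)[0]["label"],
        "summary": summarize(text, max_length=60, min_length=5)[0]["summary_text"],
        "entities": [e["word"] for e in extract_entities(text)],
    }
```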

In production, you'd add the retry logic from the error handling section and run tasks concurrently with asyncio. The synchronous version works for moderate volumes.

Summary

Hugging Face gives you the largest collection of ML models anywhere — and two clean ways to use them. pipeline() from transformers runs models locally. InferenceClient from huggingface_hub calls them in the cloud.

The key mental model: Hub for discovery, pipeline for local, InferenceClient for cloud. Start with the Inference API to prototype. Switch to local when you need privacy, speed, or zero ongoing cost. Both use the same model IDs — that's the whole point of the Hub.

Hugging Face inference cheat sheet
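The essentials, condensed into one lookup table (call shapes are abbreviated reminders, not runnable snippets):

```python
CHEAT_SHEET = {
    "find models":  "huggingface.co/models - filter by Task, sort by Most Downloads",
    "local":        'pipeline("task", model="org/name")  # from transformers',
    "cloud":        'InferenceClient(token=...).text_generation(prompt, model="org/name")',
    "chat":         "InferenceClient(...).chat_completion(messages=[...], model=..., max_tokens=...)",
    "cold starts":  "first cloud call to an idle model may take 20-30s",
    "rate limits":  "free tier is limited; retry with exponential backoff",
}
```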

Frequently Asked Questions

Can I fine-tune models through the Inference API?

No. The Inference API is read-only — you can run models but not train them. For fine-tuning, use transformers locally or Hugging Face's AutoTrain service. Fine-tuning is covered in a later tutorial in this path.

How do Inference Endpoints differ from the free API?

Inference Endpoints are dedicated, paid instances with guaranteed uptime and no cold starts. The free API uses shared infrastructure with rate limits. Endpoints start at about $0.06/hour for a small GPU.

Can I use private models with the Inference API?

Yes, if the model is hosted in your Hugging Face organization. Pass your token and the full model ID (your-org/your-model). Private models on the free tier have the same rate limits.

What models work best for each approach?

For local pipeline(): stick to models under 1B parameters on CPU (DistilBERT, MiniLM, GPT-2). For the Inference API: you can use larger models (7B+) since Hugging Face provides the GPU. For Inference Endpoints: any model the Hub supports.

References and Further Reading

  • Hugging Face Model Hub — Browse 800K+ models by task, language, and license
  • Transformers Documentation — Pipelines — Official pipeline() reference
  • Hugging Face Hub Python Library — InferenceClient API reference
  • Inference API Documentation — Rate limits, supported models, pricing
  • Hugging Face Inference Endpoints — Dedicated deployment guide
  • Hugging Face Blog — Latest model releases and platform updates
  • Transformers GitHub Repository — Source code and examples