Hugging Face for LLM Inference: Pipeline API and Inference Endpoints in Python
You've been calling OpenAI and Anthropic APIs, and they work great. Then you need a specialized code model, a fine-tuned medical summarizer, or a 400M-parameter classifier that costs nothing to run. That's when you land on Hugging Face, staring at 800,000+ models.
I use Hugging Face almost daily — sometimes for quick experiments with the Inference API, sometimes for local deployment with pipeline(). This tutorial covers both paths.
What Is Hugging Face and Why Does It Matter?
Hugging Face is the GitHub of machine learning models. Just like GitHub hosts code repositories, Hugging Face hosts trained models — over 800,000 of them — that you can run with a few lines of Python.
OpenAI and Anthropic give you access to their models through their APIs. Hugging Face is different: it's a platform where anyone can publish and share models. Meta releases Llama there. Google publishes Gemma. Thousands of researchers upload fine-tuned models for everything from sentiment analysis to protein folding.
There are three ways to run a Hugging Face model: locally with pipeline() from the transformers library, in the cloud through the free Inference API, or on dedicated paid Inference Endpoints. Each fits a different stage of your workflow.
We'll cover approaches 1 and 2 in depth. By the end, you'll have working code for both and know when to use each.
Navigating the Model Hub Like a Pro
The first time I opened the Model Hub, I was overwhelmed. 800,000 models, dozens of task categories, cryptic names like microsoft/phi-2. After months of use, I've developed a system for finding the right model in under 2 minutes.
Every model has a model card — a structured page with everything you need to know before downloading.
The search filters are your best friend. Select a Task (e.g., Text Generation), then sort by Most Downloads. This surfaces battle-tested models and pushes experimental ones to page 50.
You can also browse models programmatically, which is useful for comparing options in a script.
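One way to do that is with list_models from the huggingface_hub package. A sketch, assuming huggingface_hub is installed and you have network access to the Hub (the task and sort filters follow current huggingface_hub parameter names):

```python
from huggingface_hub import list_models

# Top text-generation models by all-time downloads (one network call to the Hub)
for model in list_models(task="text-generation", sort="downloads", limit=5):
    print(f"{model.id}: {model.downloads:,} downloads")
```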
The output shows the most popular text-generation models. You'll typically see Meta's Llama, GPT-2, and Mistral near the top. These rankings shift as new models are released.
Local Inference with pipeline() — Three Lines to a Working Model
Three lines of logic: import, create pipeline, call it. The pipeline() function from transformers handles downloading, loading, tokenizing, and post-processing. I reach for it whenever I need to prototype something quickly or want zero cloud dependency.
For sentiment analysis, pipeline() picks distilbert-base-uncased-finetuned-sst-2-english as the default model. You can override this with any compatible model from the Hub.
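A minimal sketch of those three lines (the first run downloads the default model, roughly 250MB):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads the default model on first run
result = classifier("Hugging Face makes local inference easy!")
print(result)  # shape: [{'label': 'POSITIVE' or 'NEGATIVE', 'score': ...}]
```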
Choosing a Specific Model
The defaults are fine for learning. In practice, you'll want a specific model — maybe multilingual support, or one fine-tuned on your domain.
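For example, you can swap in a multilingual sentiment model. The model ID below is just one popular choice from the Hub, used here as an illustration:

```python
from transformers import pipeline

# nlptown's model rates sentiment on a 1-5 star scale and handles several languages
classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)
french = classifier("Ce produit est fantastique !")
print(french)  # label will be "1 star" through "5 stars"
```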
Common Pipeline Tasks
The pipeline() function supports over 25 NLP tasks. Here are the three I use most, with the exact task string and a practical example.
Each task downloads a different model on first run. GPT-2 is about 500MB. BART summarization is about 1.6GB. The NER model is a lean 400MB.
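Sketches for all three tasks follow. The model IDs are common choices rather than the only options, and each downloads on first use:

```python
from transformers import pipeline

# Task string "text-generation" (gpt2, ~500MB)
generator = pipeline("text-generation", model="gpt2")
story = generator("Once upon a time", max_new_tokens=25)
print(story[0]["generated_text"])

# Task string "summarization" (BART CNN checkpoint, ~1.6GB)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = (
    "Hugging Face hosts hundreds of thousands of models covering text, "
    "vision, and audio. The pipeline API wraps download, tokenization, "
    "inference, and post-processing behind a single call, which makes it "
    "the fastest way to prototype NLP features locally."
)
summary = summarizer(article, max_length=30, min_length=10)
print(summary[0]["summary_text"])

# Task string "ner" (compact NER checkpoint, ~400MB)
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
entities = ner("Sundar Pichai leads Google in Mountain View.")
print(entities)
```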
Text Generation Parameters
Text generation deserves a closer look. The parameters matter more here than in classification — wrong settings produce gibberish or repetitive loops.
These parameters work exactly like what you learned in the sampling parameters tutorial. The difference is that here you control the model directly instead of calling an API.
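A sketch of a tuned generation call; the parameter values are illustrative starting points, not recommendations:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

result = generator(
    "Python is a language that",
    max_new_tokens=60,       # cap on newly generated tokens (prompt not counted)
    do_sample=True,          # sample from the distribution instead of greedy decoding
    temperature=0.8,         # below 1.0 sharpens, above 1.0 flattens the distribution
    top_p=0.9,               # nucleus sampling: smallest token set covering 90% mass
    repetition_penalty=1.2,  # discourages the loops GPT-2 is prone to
)
print(result[0]["generated_text"])
```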
Write a function parse_model_id(model_id) that takes a Hugging Face model ID string (like "meta-llama/Llama-3-8B" or "gpt2") and returns a dictionary with two keys: "organization" (the part before the slash, or "community" if there's no slash) and "model_name" (the part after the slash, or the full string if there's no slash).
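One possible solution sketch:

```python
def parse_model_id(model_id: str) -> dict:
    """Split a Hub model ID into organization and model name."""
    if "/" in model_id:
        organization, model_name = model_id.split("/", 1)
    else:
        # Bare IDs like "gpt2" are legacy community models with no org prefix
        organization, model_name = "community", model_id
    return {"organization": organization, "model_name": model_name}

print(parse_model_id("meta-llama/Llama-3-8B"))
# {'organization': 'meta-llama', 'model_name': 'Llama-3-8B'}
print(parse_model_id("gpt2"))
# {'organization': 'community', 'model_name': 'gpt2'}
```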
Cloud Inference with the Hugging Face Inference API
What if you don't want to download a 2GB model just to test it? The Inference API lets you call any model on the Hub through HTTP — no downloads, no GPU, no setup. I use it constantly for comparing models before committing to a local deployment.
You'll need a free Hugging Face token. Go to huggingface.co/settings/tokens, create a token with "Read" access, and paste it below.
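A minimal sketch of a first Inference API call. The model ID is an example, and the snippet skips the network call when no HF_TOKEN environment variable is set, so it is safe to run before you have a token:

```python
import os
from huggingface_hub import InferenceClient

# Create the token at huggingface.co/settings/tokens, then: export HF_TOKEN=hf_...
client = InferenceClient(token=os.environ.get("HF_TOKEN"))

if os.environ.get("HF_TOKEN"):  # skip the network call until a token is configured
    response = client.text_generation(
        "Explain transformers in one sentence:",
        model="mistralai/Mistral-7B-Instruct-v0.2",
        max_new_tokens=60,
    )
    print(response)
```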
That call sent your prompt to Hugging Face's servers and returned the result. No downloads, no GPU. The free tier gives you rate-limited access to most Hub models.
Chat-Style Conversations
For chat models, InferenceClient has a dedicated chat_completion method. The message format matches OpenAI's API — same roles, same structure.
If you've used the OpenAI client before, this looks familiar. messages list with roles, max_tokens, choices[0].message.content. Hugging Face adopted this format so you can swap providers with minimal code changes.
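A sketch of that call, again guarded on HF_TOKEN. The Zephyr model ID is an example of a chat-tuned model on the Hub, not the only choice:

```python
import os
from huggingface_hub import InferenceClient

client = InferenceClient(token=os.environ.get("HF_TOKEN"))

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What does a tokenizer do?"},
]

if os.environ.get("HF_TOKEN"):  # skip the network call until a token is configured
    response = client.chat_completion(
        messages,
        model="HuggingFaceH4/zephyr-7b-beta",
        max_tokens=100,
    )
    print(response.choices[0].message.content)  # same access pattern as OpenAI's client
```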
Beyond Text: Classification and Summarization
The Inference API handles more than text generation. Classification, summarization, and dozens of other tasks work through the same client.
Each method maps to a task type on the Hub. If the model is cold (hasn't been used recently), the first call might take 20-30 seconds while Hugging Face loads it.
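A sketch of both calls. The model IDs are examples, and since return types have shifted across huggingface_hub versions, printing the whole object is the safe move:

```python
import os
from huggingface_hub import InferenceClient

client = InferenceClient(token=os.environ.get("HF_TOKEN"))

if os.environ.get("HF_TOKEN"):  # skip network calls until a token is configured
    # Classification: a list of labels with confidence scores
    labels = client.text_classification(
        "The checkout flow is confusing and slow.",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    print(labels)

    # Summarization: one condensed version of the input text
    summary = client.summarization(
        "Customer feedback often arrives as long, rambling paragraphs that "
        "mix praise, complaints, and feature requests in a single message.",
        model="facebook/bart-large-cnn",
    )
    print(summary)
```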
Write a function format_model_comparison(results) that takes a list of dictionaries — each with keys "model", "output", and "latency_ms" — and returns a formatted string. Each line follows the format: "[{latency_ms}ms] {model}: {output}". Sort by latency, fastest first.
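One possible solution sketch:

```python
def format_model_comparison(results: list[dict]) -> str:
    """Render benchmark rows as text, fastest model first."""
    ordered = sorted(results, key=lambda r: r["latency_ms"])
    return "\n".join(
        f"[{r['latency_ms']}ms] {r['model']}: {r['output']}" for r in ordered
    )

rows = [
    {"model": "gpt2", "output": "POSITIVE", "latency_ms": 820},
    {"model": "distilbert", "output": "POSITIVE", "latency_ms": 140},
]
print(format_model_comparison(rows))
# [140ms] distilbert: POSITIVE
# [820ms] gpt2: POSITIVE
```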
Working with Model Outputs
Raw model outputs rarely match what your application needs. Classification returns a label and score. Text generation returns your prompt concatenated with the response. Here's how to wrangle these into something useful.
The pattern is consistent: get raw output, extract what you need, format it. For text generation, always strip the input prompt unless you want it echoed.
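A sketch of both patterns, assuming the standard pipeline output shapes (the helper names are illustrative):

```python
def extract_label(classification: list[dict]) -> tuple[str, float]:
    """Classification pipelines return [{'label': ..., 'score': ...}]."""
    top = classification[0]
    return top["label"], round(top["score"], 3)

def strip_prompt(prompt: str, generated_text: str) -> str:
    """Text-generation output echoes the prompt; keep only the completion."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text

print(extract_label([{"label": "POSITIVE", "score": 0.9987}]))
# ('POSITIVE', 0.999)
print(strip_prompt("Python is", "Python is a great language."))
# a great language.
```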
Local vs Cloud Inference — Picking the Right Path
After using both on real projects, the choice comes down to three factors: privacy requirements, call volume, and iteration speed.
# Local inference (pipeline)
# Pros:
# - 100% private — data stays on your machine
# - Free after initial download
# - No rate limits or quotas
# - Works offline
#
# Cons:
# - Downloads are large (250MB - 16GB+)
# - Needs sufficient RAM/GPU
# - Slower on CPU than cloud GPU
# - You manage updates and dependencies

# Cloud inference (Inference API)
# Pros:
# - No downloads or GPU needed
# - Try any of 800K+ models instantly
# - Hugging Face provides the GPU
# - Lightweight dependency
#
# Cons:
# - Data leaves your machine
# - Free tier: ~1000 requests/day
# - Cold start delays (20-30s)
# - Requires internet connection

Building a Reusable Inference Wrapper
Switching between local and cloud shouldn't mean rewriting your application. This wrapper abstracts the backend choice behind a consistent interface.
The _pipeline_cache dictionary stores loaded pipelines so you avoid re-downloading models on every call. The backend parameter is the only thing that changes between local and cloud.
Error Handling and Rate Limits
The Inference API will fail on you. Models go cold. Rate limits kick in. Tokens expire. I've hit every one of these in production, and the fix is always the same pattern: catch, wait, retry.
Exponential backoff doubles the wait on each retry — 1s, 2s, 4s. Hammering a rate-limited API makes things worse, not better.
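The catch-wait-retry pattern can be sketched with a generic helper (the helper name and the fake flaky endpoint below are illustrative):

```python
import time

def call_with_retries(fn, max_retries: int = 3, base_delay: float = 1.0):
    """Call fn(); on failure wait base_delay, then 2x, 4x, ... and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)

# Demo with a fake endpoint that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("503 Service Unavailable")
    return "ok"

print(call_with_retries(flaky, base_delay=0.01))  # ok, after two retries
```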
Four Mistakes Everyone Makes with Hugging Face
I've made all of these. Watching colleagues repeat them is what convinced me to write this section.
# BERT is a classification model, NOT generative
generator = pipeline(
    "text-generation",
    model="bert-base-uncased"
)
# This errors or produces garbage

# GPT-2 IS a generative model
generator = pipeline(
    "text-generation",
    model="gpt2"
)
# Check the model's task tag on the Hub first

# Default max length is often 20 tokens,
# so output gets cut off mid-sentence
result = generator("Tell me about Python")

# Control exactly how much text you get
result = generator(
    "Tell me about Python",
    max_new_tokens=100
)

Write a function should_allow_request(request_log, current_time, max_requests, window_seconds) that implements a sliding-window rate limiter. It takes a list of previous request timestamps (floats), the current time (float), a max allowed requests, and a window size in seconds. Return True if the request is allowed, False otherwise.
Putting It Together — A Multi-Task NLP Pipeline
Here's a scenario I build for clients regularly: customer feedback arrives as text, and you need to extract the sentiment, a summary, and key entities — all using Hugging Face models.
In production, you'd add the retry logic from the error handling section and run tasks concurrently with asyncio. The synchronous version works for moderate volumes.
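A sketch of that workflow. The model choices are examples, and the analyzers are passed in explicitly so they load once at startup and can be swapped or mocked in tests:

```python
def build_analyzers():
    """Load all three models once at startup (several GB of downloads on first run)."""
    from transformers import pipeline  # deferred so the rest of the module imports fast

    return {
        "sentiment": pipeline("sentiment-analysis"),
        "summary": pipeline("summarization", model="facebook/bart-large-cnn"),
        "entities": pipeline("ner", aggregation_strategy="simple"),
    }

def analyze_feedback(text: str, analyzers: dict) -> dict:
    """Run sentiment, summarization, and entity extraction on one feedback item."""
    sentiment = analyzers["sentiment"](text)[0]
    summary = analyzers["summary"](text, max_length=40, min_length=10)
    entities = analyzers["entities"](text)
    return {
        "sentiment": sentiment["label"],
        "confidence": round(sentiment["score"], 3),
        "summary": summary[0]["summary_text"],
        "entities": [e["word"] for e in entities],
    }

# analyzers = build_analyzers()
# report = analyze_feedback(
#     "The new dashboard is great, but exports to Excel keep timing out.",
#     analyzers,
# )
```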
Summary
Hugging Face gives you the largest collection of ML models anywhere — and two clean ways to use them. pipeline() from transformers runs models locally. InferenceClient from huggingface_hub calls them in the cloud.
The key mental model: Hub for discovery, pipeline for local, InferenceClient for cloud. Start with the Inference API to prototype. Switch to local when you need privacy, speed, or zero ongoing cost. Both use the same model IDs — that's the whole point of the Hub.
Frequently Asked Questions
Can I fine-tune models through the Inference API?
No. The Inference API is read-only — you can run models but not train them. For fine-tuning, use transformers locally or Hugging Face's AutoTrain service. Fine-tuning is covered in a later tutorial in this path.
How do Inference Endpoints differ from the free API?
Inference Endpoints are dedicated, paid instances with guaranteed uptime and no cold starts. The free API uses shared infrastructure with rate limits. Endpoints are billed hourly, starting at about $0.06/hour for a small CPU instance, with GPUs costing more.
Can I use private models with the Inference API?
Yes, if the model is hosted in your Hugging Face organization. Pass your token and the full model ID (your-org/your-model). Private models on the free tier have the same rate limits.
What models work best for each approach?
For local pipeline(): stick to models under 1B parameters on CPU (DistilBERT, MiniLM, GPT-2). For the Inference API: you can use larger models (7B+) since Hugging Face provides the GPU. For Inference Endpoints: any model the Hub supports.
References and Further Reading
pipeline() reference
InferenceClient API reference