Zero-Shot vs Few-Shot Prompting: Build a Classifier Without Training Data
You need a sentiment classifier for customer reviews. The traditional ML path: collect thousands of labeled examples, choose a model architecture, train for hours, tune hyperparameters, deploy. The prompt engineering path: write five example reviews in a prompt and ask the LLM to classify new ones. Same task, ten minutes instead of ten days.
This tutorial teaches you the two most fundamental prompting techniques — zero-shot (no examples) and few-shot (a handful of examples) — and shows you exactly when each one works, when it fails, and how to pick the right examples to get reliable results. By the end, you will have a working sentiment classifier that handles edge cases most beginners miss.
What Are Zero-Shot and Few-Shot Prompting?
In the previous tutorial on prompt engineering basics, you learned how to structure prompts with instruction, context, format, and role. Those four building blocks are the foundation. Zero-shot and few-shot prompting are the two techniques that determine how much you show the model before asking it to perform.
Zero-shot prompting means asking the model to do a task without showing it any examples. You give an instruction and trust the model's training to figure out what you mean. Every time you type "Summarize this article" into ChatGPT, that is zero-shot prompting.
Few-shot prompting means including a few examples of the task in your prompt before asking the model to handle a new input. You show the model "here is an input and here is the correct output" two to five times, then give it a new input and let it follow the pattern. The difference is concrete — here is the same classification task done both ways:
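Here is a minimal sketch of both versions. It assumes the openai Python SDK, an OPENAI_API_KEY in the environment, and gpt-4o-mini (the model used throughout this tutorial); the review text is illustrative.

```python
import asyncio

review = "Terrible quality. Broke after two days."

# Zero-shot: an instruction and the input, no examples.
zero_shot_prompt = f"""Classify the sentiment of this review as positive, negative, or neutral.

Review: "{review}"
Sentiment:"""

# Few-shot: the same task, preceded by three labeled examples.
few_shot_prompt = f"""Classify the sentiment of each review.

Review: "Absolutely love it. Works perfectly."
Sentiment: positive

Review: "Arrived late and the box was crushed."
Sentiment: negative

Review: "It does what it says. Nothing special."
Sentiment: neutral

Review: "{review}"
Sentiment:"""

async def main():
    from openai import AsyncOpenAI  # pip install openai
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    for name, prompt in [("zero-shot", zero_shot_prompt), ("few-shot", few_shot_prompt)]:
        r = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"{name}: {r.choices[0].message.content}")

# Run with: asyncio.run(main())
```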
Both will likely return "negative" for that review. But the few-shot version does something the zero-shot version cannot: it locks the output format. The examples establish that the answer should be a single lowercase word — "positive", "negative", or "neutral" — with no explanation. The zero-shot version might return "Negative", "NEGATIVE", or "Negative. The review expresses strong dissatisfaction..."
When Zero-Shot Is Enough
I default to zero-shot prompting for most tasks. It is cheaper (fewer tokens), simpler to write, and modern LLMs are remarkably good at following instructions without examples. The key is knowing when it works and when it falls short.
Zero-shot works well when the task is common and well-defined. Sentiment analysis, language translation, summarization, question answering — LLMs have seen millions of these during training. They know what "classify as positive or negative" means without you showing them.
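Three such tasks, sketched as plain zero-shot prompts (the input texts are invented for illustration); each would be sent with the same chat-completions call used elsewhere in this tutorial:

```python
# Zero-shot: instructions only, no examples.
translate_prompt = 'Translate to French: "The meeting is at 3 PM tomorrow."'

summarize_prompt = """Summarize in one sentence:

The quarterly report shows revenue grew 12% year over year, driven mainly
by the new subscription tier, while support costs rose 8% due to onboarding."""

classify_prompt = """Classify the sentiment as positive or negative.

Review: "Shipping was fast and the product works great."
Sentiment:"""

for prompt in (translate_prompt, summarize_prompt, classify_prompt):
    print(prompt)
    print("---")
```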
Those three tasks work perfectly without examples because the model has a strong prior understanding of what "translate," "summarize," and "classify" mean. You do not need to teach it these concepts.
Zero-shot struggles when the task involves custom categories, ambiguous boundaries, or domain-specific conventions. If you ask an LLM to classify support tickets into your company's specific categories — "billing-error", "feature-request", "integration-bug", "account-access" — it has no way to know what each category means without examples.
Few-Shot Prompting — Teaching by Example
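A sketch of a few-shot prompt for that support-ticket classifier (the ticket texts are invented; the categories are the ones from the previous section). Sending it uses the same chat-completions call as the earlier examples.

```python
# Few-shot classification into custom support-ticket categories.
ticket_prompt = """Classify each support ticket into one of: billing-error, feature-request, integration-bug, account-access.

Ticket: "I was charged twice for my March subscription."
Category: billing-error

Ticket: "It would be great if exports supported CSV as well as JSON."
Category: feature-request

Ticket: "Messages stopped syncing to our Slack channel yesterday."
Category: integration-bug

Ticket: "I can't log in after resetting my password."
Category: account-access

Ticket: "The invoice shows the annual price, but I'm on the monthly plan."
Category:"""

print(ticket_prompt)
```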
The examples do three things that the zero-shot version could not. First, they show exactly what each category means through concrete instances. "Billing-error" is about charges, not about account issues. Second, they establish the output format — one category label, no explanation. Third, they show the boundary between similar categories. The Slack integration is an "integration-bug" not a "feature-request," which tells the model that connectivity problems with third-party tools go under integration.
How many examples do you need? In my experience, three to five examples hit the sweet spot for most classification tasks. Fewer than three and the model may not pick up the pattern consistently. More than five and you are burning tokens without improving accuracy.
Let me show you the difference that example count makes. We will classify the same set of tricky reviews with one, three, and five examples and compare consistency:
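A sketch of that comparison. The helper assembles an n-shot prompt from a shared example pool (pool contents and review texts are illustrative); each prompt would then be sent with the same chat-completions call as before.

```python
# Pool of labeled examples; the n-shot prompt uses the first n of them.
EXAMPLE_POOL = [
    ("Absolutely love it. Works perfectly.", "positive"),
    ("Broke after two days. Avoid.", "negative"),
    ("It does what it says. Nothing special.", "neutral"),
    ("Great screen, but the battery is a letdown.", "neutral"),
    ("Exceeded every expectation I had.", "positive"),
]

TRICKY_REVIEWS = ["Meh.", "Works great when it works.", "It's fine, I suppose."]

def build_n_shot_prompt(n, review):
    """Assemble a prompt with the first n pool examples plus the new review."""
    lines = ["Classify the sentiment as positive, negative, or neutral.", ""]
    for text, label in EXAMPLE_POOL[:n]:
        lines += [f'Review: "{text}"', f"Sentiment: {label}", ""]
    lines += [f'Review: "{review}"', "Sentiment:"]
    return "\n".join(lines)

for n in (1, 3, 5):
    for review in TRICKY_REVIEWS:
        prompt = build_n_shot_prompt(n, review)
        # Send `prompt` via client.chat.completions.create(...) as earlier,
        # then compare how consistent the answers are across several runs.
        print(f"{n}-shot prompt for {review!r}: {len(prompt.splitlines())} lines")
```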
You will likely see that the 1-shot version struggles with "Meh." — it might call it negative or positive because it only has one example to anchor on. The 3-shot and 5-shot versions are more consistent because the model has enough context to distinguish the three categories. The jump from three to five examples usually has diminishing returns for simple tasks like this.
Example Quality Matters More Than Quantity
This is where most people go wrong with few-shot prompting. They pick examples carelessly — usually obvious cases that the model would get right anyway. The examples that matter most are the ones at the decision boundary: cases where two categories are easy to confuse.
I learned this the hard way while building a content moderation system. My first few-shot prompt had examples like "I love this!" (positive) and "This is terrible garbage!" (negative). Obviously the model got those right. But it consistently miscategorized sarcastic reviews and mixed-sentiment reviews because my examples never showed how to handle ambiguity.
# Assumes an AsyncOpenAI `client` created earlier, in an async context
# Every example is a clear-cut case
weak_prompt = """Classify as positive, negative, or neutral.
Review: "Best product ever! Love it!"
Sentiment: positive
Review: "Horrible. Waste of money."
Sentiment: negative
Review: "It's okay I guess."
Sentiment: neutral
Review: "The screen is gorgeous but the speakers are tinny."
Sentiment:"""
r = await client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": weak_prompt}],
)
print(f"Weak examples: {r.choices[0].message.content}")

# Examples cover the tricky boundaries
strong_prompt = """Classify as positive, negative, or neutral.
Review: "Best product ever! Love it!"
Sentiment: positive
Review: "The camera is great but the battery life is terrible."
Sentiment: neutral
Review: "Works fine for the price. Not premium quality."
Sentiment: neutral
Review: "The screen is gorgeous but the speakers are tinny."
Sentiment:"""
r = await client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": strong_prompt}],
)
print(f"Strong examples: {r.choices[0].message.content}")

The strong examples include a mixed-sentiment review classified as neutral. This teaches the model your specific rule: when a review has both positive and negative elements, call it neutral. Without this boundary example, the model has to guess your convention — and different runs may produce different guesses.
Building a Reusable Few-Shot Classifier
Hardcoding examples in prompt strings gets messy fast. Let me show you how I structure few-shot classifiers in practice — with examples stored as data and the prompt assembled programmatically.
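One way to structure it (the names, example data, and validation rule here are illustrative; the API call assumes the AsyncOpenAI client and gpt-4o-mini used throughout this tutorial):

```python
SENTIMENT_CATEGORIES = ["positive", "negative", "neutral"]

# Examples live in data, not hardcoded into the prompt string.
SENTIMENT_EXAMPLES = [
    ("Absolutely love it. Works perfectly.", "positive"),
    ("Broke after two days. Avoid.", "negative"),
    ("It does what it says. Nothing special.", "neutral"),
    ("Great screen, but the battery is a letdown.", "neutral"),
]

def build_prompt(examples, categories, text):
    """Assemble the few-shot prompt from example data."""
    lines = [f"Classify as {', '.join(categories)}.", ""]
    for ex_text, label in examples:
        lines += [f'Review: "{ex_text}"', f"Sentiment: {label}", ""]
    lines += [f'Review: "{text}"', "Sentiment:"]
    return "\n".join(lines)

def validate_label(raw, categories):
    """Normalize the model's reply; fall back to 'unknown' if it is off-list."""
    label = raw.strip().lower().rstrip(".")
    return label if label in categories else "unknown"

async def classify(text, examples=SENTIMENT_EXAMPLES, categories=SENTIMENT_CATEGORIES):
    from openai import AsyncOpenAI  # pip install openai
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    r = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": build_prompt(examples, categories, text)}],
    )
    return validate_label(r.choices[0].message.content, categories)

# Usage, in an async context:
#   label = await classify("Works great when it works.")
```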
That classify function separates the examples from the logic. You can swap in different examples for different tasks — support tickets, content moderation, intent detection — without rewriting the prompt construction code. The validation step at the end catches cases where the model returns something outside your category list, which happens more often than you would expect with zero-shot but rarely with well-chosen few-shot examples.
Practice: Build a Few-Shot Prompt
Time to build the prompt assembly logic yourself. This exercise focuses on the pure Python string construction — no API calls, so the test cases are fully deterministic.
Write a function build_few_shot_prompt(examples, new_input, task_description) that assembles a few-shot prompt string.
Rules:
- Put task_description on the first line.
- For each example (input_text, label), add:
  Input: "<input_text>"
  Label: <label>
- Add a blank line after each example.
- Finally, add the new input:
  Input: "<new_input>"
  Label: (with no value — the model fills this in)
Beyond Classification — Extraction, Formatting, and Transformation
Few-shot prompting is not limited to classification. Any task where you can show input-output pairs benefits from examples. Extraction, reformatting, translation between data formats, and text transformation all work well. The principle is always the same: show the pattern, then give a new input.
Few-Shot Data Extraction
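A sketch of a few-shot extraction prompt (the product names and sentences are invented). The call is the same chat-completions pattern as the classification examples; only the prompt changes.

```python
# Few-shot extraction: product name and normalized price from free text.
extract_prompt = """Extract the product and price from each sentence. Format prices as $X.XX.

Sentence: "I picked up the AeroMug for $14.99 at the airport."
Product: AeroMug
Price: $14.99

Sentence: "The TrailLamp was on sale for 32 dollars."
Product: TrailLamp
Price: $32.00

Sentence: "Grabbed a PocketFan for twenty-five bucks last week."
Product: PocketFan
Price: $25.00

Sentence: "My sister paid sixty dollars for the SnapStand."
Product:"""

print(extract_prompt)
```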
Notice how the third example uses "twenty-five bucks" — an informal price that the model must normalize to "$25.00". This boundary example teaches the model your formatting convention: always output prices as dollar amounts with two decimal places, regardless of how the input expresses them. Without that example, the model might output "twenty-five bucks" verbatim.
Few-Shot Format Transformation
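A sketch of a few-shot natural-language-to-SQL prompt (the orders table and its columns are invented for illustration):

```python
# Few-shot format transformation: English request -> SQL query.
sql_prompt = """Translate each request into SQL for the table orders(id, customer, status, total, created_at).

Request: "Show the customers and totals for orders over 100 dollars."
SQL: SELECT customer, total FROM orders WHERE total > 100;

Request: "How many orders are still pending?"
SQL: SELECT COUNT(*) FROM orders WHERE status = 'pending';

Request: "What is the average order total per customer?"
SQL: SELECT customer, AVG(total) FROM orders GROUP BY customer;

Request: "List the totals of shipped orders."
SQL:"""

print(sql_prompt)
```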
The examples establish conventions the zero-shot version would have to guess: use single quotes for strings, end with semicolons, select only relevant columns rather than SELECT *. Each example teaches a different SQL pattern — filtering, counting, aggregation — so the model has a repertoire to draw from.
When to Use Which: A Decision Framework
After building dozens of LLM-powered features, I have a simple decision tree. Start with zero-shot. If the output format is inconsistent, add examples. If the categories are custom or ambiguous, add boundary examples. If accuracy on edge cases still is not good enough, improve your examples before adding more of them.
Four Mistakes That Ruin Few-Shot Prompts
Mistake 1: All Examples From the Same Category
If three of your five examples are "positive" and only one is "negative," the model develops a bias toward positive. I have seen this create a classifier that calls everything positive except the most aggressively negative text. Balance your examples across categories.
Mistake 2: Examples That Are Too Easy
An example like "This is the best product ever!!!" classified as positive teaches the model nothing — it already knows that. Include examples near the boundary. "It is okay, I suppose" is neutral, not negative. "Works great when it works" is neutral, not positive. These borderline cases are what the model actually needs help with.
Mistake 3: Inconsistent Formatting Across Examples
# Mixed formats in examples
bad_prompt = """
Text: "Great product" -> Positive
Text: "Bad product"
Label: negative
Text: "OK product" sentiment=neutral
Text: "New product"
Label:"""
print("Inconsistent labels confuse the model")

# Uniform format in every example
good_prompt = """Text: "Great product"
Label: positive
Text: "Bad product"
Label: negative
Text: "OK product"
Label: neutral
Text: "New product"
Label:"""
print("Consistent formatting produces reliable output")

Consistency is not optional. Every example must use exactly the same format: same field names, same delimiter, same casing. If one example uses "Sentiment:" and another uses "Label:", the model does not know which convention to follow.
Mistake 4: Too Many Examples
More examples is not always better. Beyond five or six examples for a classification task, accuracy plateaus but token cost keeps climbing. Worse, if your examples contain contradictions or inconsistent labeling, more examples amplify the noise. I have seen prompts with 20 examples that performed worse than the same prompt with 4 carefully chosen ones.
Practice: Fix a Broken Few-Shot Classifier
This exercise tests whether you can identify and fix the problems in a poorly constructed few-shot example set. The function receives a list of examples and must validate them against quality rules.
Write a function validate_examples(examples, categories) that checks a list of few-shot examples for quality issues.
Parameters:
- examples: List of (text, label) tuples
- categories: List of valid category strings

Rules to check:
1. Invalid labels: If any example has a label not in categories, add "invalid_label" to the issues list
2. Missing categories: If any category in categories has zero examples, add "missing_category" to the issues list
3. Imbalanced: If the most common label appears more than twice as often as the least common label (among categories that have examples), add "imbalanced" to the issues list
Return a dictionary with:
- "valid": True if no issues, False otherwise
- "issues": List of issue strings (empty if valid)
- "category_counts": Dict mapping each category to its count in the examples

Putting It All Together — A Production-Ready Classifier
Let me combine everything from this tutorial into a classifier that handles real-world messiness. It uses few-shot examples with boundary cases, validates its own output, and falls back to "unknown" when the model returns garbage.
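A sketch of that combined classifier (the example texts are invented; the API call assumes the AsyncOpenAI client and gpt-4o-mini as before):

```python
CATEGORIES = ["positive", "negative", "neutral"]

# Six examples, two per category, with deliberate boundary cases.
EXAMPLES = [
    ("Best purchase I have made this year.", "positive"),
    ("Setup took two minutes and it just works.", "positive"),
    ("Stopped working after a week. Returning it.", "negative"),
    ("Customer service never answered my emails.", "negative"),
    ("The camera quality is excellent but the battery drains fast.", "neutral"),
    ("Does the job. Nothing more, nothing less.", "neutral"),
]

def build_prompt(text):
    """Assemble the few-shot prompt from the example data."""
    lines = [f"Classify each review as one of: {', '.join(CATEGORIES)}.",
             "Answer with the single label only.", ""]
    for ex_text, label in EXAMPLES:
        lines += [f'Review: "{ex_text}"', f"Sentiment: {label}", ""]
    lines += [f'Review: "{text}"', "Sentiment:"]
    return "\n".join(lines)

def normalize(raw):
    """Map a messy reply (extra words, odd casing) onto a category, else 'unknown'."""
    answer = raw.strip().lower()
    for category in CATEGORIES:
        if answer == category or answer.startswith(category):
            return category
    return "unknown"

async def classify_review(text):
    from openai import AsyncOpenAI  # pip install openai
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    r = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": build_prompt(text)}],
    )
    return normalize(r.choices[0].message.content)

# Usage, in an async context:
#   label = await classify_review("Great hardware, shame about the app.")
```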
The classifier uses six examples — two per category — with deliberate boundary cases. The mixed-sentiment review ("camera quality is excellent but battery drains fast") teaches the neutral convention. The output normalization handles the common failure mode where the model adds extra words or changes capitalization.
Summary
Zero-shot prompting works when the task is standard and the model has strong priors — translation, summarization, basic sentiment analysis. No examples needed. Start here.
Few-shot prompting works when you need consistent output formats, custom categories, or correct handling of edge cases. Three to five examples is the sweet spot. Quality beats quantity — one boundary example is worth three obvious ones.
The key insight from this tutorial: examples in a few-shot prompt are not training data. They are communication tools. They tell the model "this is what I mean by this category" and "this is exactly how I want the output formatted." Pick examples that clarify the ambiguous cases, not the obvious ones.
The next tutorial covers chain-of-thought prompting — a technique that makes LLMs explain their reasoning step by step, dramatically improving accuracy on tasks that require logic, math, or multi-step analysis.
Frequently Asked Questions
Can I use few-shot prompting with any LLM?
Yes. Few-shot prompting works with every major LLM — GPT-4, Claude, Gemini, Llama, Mistral. Larger models generally perform better with fewer examples, while smaller models may need more examples to pick up the pattern. If you are using a small open-source model locally, try five examples instead of three.
Do few-shot examples go in the system message or the user message?
Either works, but I put examples in the user message for most cases. The system message is better for persistent behavior rules ("always respond in JSON," "never include disclaimers"). For task-specific examples that might change between requests, the user message is more natural. Some providers handle system messages differently, so keeping examples in the user message is the more portable choice.
What if my few-shot prompt exceeds the context window?
If you have so many examples that they push against the context limit, you are probably using too many. For classification, 3-5 examples should suffice. For complex extraction tasks, you might need up to 10. If you genuinely need more, consider fine-tuning the model instead — it moves the examples into the model weights, eliminating per-request token costs entirely.