Zero-Shot vs Few-Shot Prompting: Build a Classifier Without Training Data
You need a sentiment classifier for customer reviews. The traditional ML path: collect thousands of labeled examples, choose a model architecture, train for hours, tune hyperparameters, deploy. The prompt engineering path: write five example reviews in a prompt and ask the LLM to classify new ones. Same task, ten minutes instead of ten days.
This tutorial teaches you the two most fundamental prompting techniques — zero-shot (no examples) and few-shot (a handful of examples) — and shows you exactly when each one works, when it fails, and how to pick the right examples to get reliable results. By the end, you will have a working sentiment classifier that handles edge cases most beginners miss.
What Are Zero-Shot and Few-Shot Prompting?
In the previous tutorial on prompt engineering basics, you learned how to structure prompts with instruction, context, format, and role. Those four building blocks are the foundation. Zero-shot and few-shot prompting are the two techniques that determine how much you show the model before asking it to perform.
Zero-shot prompting means asking the model to do a task without showing it any examples. You give an instruction and trust the model's training to figure out what you mean. Every time you type "Summarize this article" into ChatGPT, that is zero-shot prompting.
Few-shot prompting means including a few examples of the task in your prompt before asking the model to handle a new input. You show the model "here is an input and here is the correct output" two to five times, then give it a new input and let it follow the pattern. The difference is concrete — here is the same classification task done both ways:
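Here is a minimal sketch of both versions. It assumes the openai Python SDK, an OPENAI_API_KEY in the environment, and gpt-4o-mini (the model used throughout this tutorial); the review text is illustrative.

```python
import asyncio

review = "Terrible quality. Broke after two days."

# Zero-shot: an instruction and the input, no examples.
zero_shot_prompt = f"""Classify the sentiment of this review as positive, negative, or neutral.

Review: "{review}"
Sentiment:"""

# Few-shot: the same task, preceded by three labeled examples.
few_shot_prompt = f"""Classify the sentiment of each review.

Review: "Absolutely love it. Works perfectly."
Sentiment: positive

Review: "Arrived late and the box was crushed."
Sentiment: negative

Review: "It does what it says. Nothing special."
Sentiment: neutral

Review: "{review}"
Sentiment:"""

async def main():
    from openai import AsyncOpenAI  # pip install openai
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    for name, prompt in [("zero-shot", zero_shot_prompt), ("few-shot", few_shot_prompt)]:
        r = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"{name}: {r.choices[0].message.content}")

# Run with: asyncio.run(main())
```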
Both will likely return "negative" for that review. But the few-shot version does something the zero-shot version cannot: it locks the output format. The examples establish that the answer should be a single lowercase word — "positive", "negative", or "neutral" — with no explanation. The zero-shot version might return "Negative", "NEGATIVE", or "Negative. The review expresses strong dissatisfaction..."
When Zero-Shot Is Enough
I default to zero-shot prompting for most tasks. It is cheaper (fewer tokens), simpler to write, and modern LLMs are remarkably good at following instructions without examples. The key is knowing when it works and when it falls short.
Zero-shot works well when the task is common and well-defined. Sentiment analysis, language translation, summarization, question answering — LLMs have seen millions of these during training. They know what "classify as positive or negative" means without you showing them.
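Three such tasks, sketched as plain zero-shot prompts (the input texts are invented for illustration); each would be sent with the same chat-completions call used elsewhere in this tutorial:

```python
# Zero-shot: instructions only, no examples.
translate_prompt = 'Translate to French: "The meeting is at 3 PM tomorrow."'

summarize_prompt = """Summarize in one sentence:

The quarterly report shows revenue grew 12% year over year, driven mainly
by the new subscription tier, while support costs rose 8% due to onboarding."""

classify_prompt = """Classify the sentiment as positive or negative.

Review: "Shipping was fast and the product works great."
Sentiment:"""

for prompt in (translate_prompt, summarize_prompt, classify_prompt):
    print(prompt)
    print("---")
```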
Those three tasks work perfectly without examples because the model has a strong prior understanding of what "translate," "summarize," and "classify" mean. You do not need to teach it these concepts.
Zero-shot struggles when the task involves custom categories, ambiguous boundaries, or domain-specific conventions. If you ask an LLM to classify support tickets into your company's specific categories — "billing-error", "feature-request", "integration-bug", "account-access" — it has no way to know what each category means without examples.
Few-Shot Prompting — Teaching by Example
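A sketch of a few-shot prompt for that support-ticket classifier (the ticket texts are invented; the categories are the ones from the previous section). Sending it uses the same chat-completions call as the earlier examples.

```python
# Few-shot classification into custom support-ticket categories.
ticket_prompt = """Classify each support ticket into one of: billing-error, feature-request, integration-bug, account-access.

Ticket: "I was charged twice for my March subscription."
Category: billing-error

Ticket: "It would be great if exports supported CSV as well as JSON."
Category: feature-request

Ticket: "Messages stopped syncing to our Slack channel yesterday."
Category: integration-bug

Ticket: "I can't log in after resetting my password."
Category: account-access

Ticket: "The invoice shows the annual price, but I'm on the monthly plan."
Category:"""

print(ticket_prompt)
```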
The examples do three things that the zero-shot version could not. First, they show exactly what each category means through concrete instances. "Billing-error" is about charges, not about account issues. Second, they establish the output format — one category label, no explanation. Third, they show the boundary between similar categories. The Slack integration is an "integration-bug" not a "feature-request," which tells the model that connectivity problems with third-party tools go under integration.
How many examples do you need? In my experience, three to five examples hit the sweet spot for most classification tasks. Fewer than three and the model may not pick up the pattern consistently. More than five and you are burning tokens without improving accuracy.
Let me show you the difference that example count makes. We will classify the same set of tricky reviews with one, three, and five examples and compare consistency:
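A sketch of that comparison. The helper assembles an n-shot prompt from a shared example pool (pool contents and review texts are illustrative); each prompt would then be sent with the same chat-completions call as before.

```python
# Pool of labeled examples; the n-shot prompt uses the first n of them.
EXAMPLE_POOL = [
    ("Absolutely love it. Works perfectly.", "positive"),
    ("Broke after two days. Avoid.", "negative"),
    ("It does what it says. Nothing special.", "neutral"),
    ("Great screen, but the battery is a letdown.", "neutral"),
    ("Exceeded every expectation I had.", "positive"),
]

TRICKY_REVIEWS = ["Meh.", "Works great when it works.", "It's fine, I suppose."]

def build_n_shot_prompt(n, review):
    """Assemble a prompt with the first n pool examples plus the new review."""
    lines = ["Classify the sentiment as positive, negative, or neutral.", ""]
    for text, label in EXAMPLE_POOL[:n]:
        lines += [f'Review: "{text}"', f"Sentiment: {label}", ""]
    lines += [f'Review: "{review}"', "Sentiment:"]
    return "\n".join(lines)

for n in (1, 3, 5):
    for review in TRICKY_REVIEWS:
        prompt = build_n_shot_prompt(n, review)
        # Send `prompt` via client.chat.completions.create(...) as earlier,
        # then compare how consistent the answers are across several runs.
        print(f"{n}-shot prompt for {review!r}: {len(prompt.splitlines())} lines")
```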
You will likely see that the 1-shot version struggles with "Meh." — it might call it negative or positive because it only has one example to anchor on. The 3-shot and 5-shot versions are more consistent because the model has enough context to distinguish the three categories. The jump from three to five examples usually has diminishing returns for simple tasks like this.
Example Quality Matters More Than Quantity
This is where most people go wrong with few-shot prompting. They pick examples carelessly — usually obvious cases that the model would get right anyway. The examples that matter most are the ones at the decision boundary: cases where two categories are easy to confuse.
I learned this the hard way while building a content moderation system. My first few-shot prompt had examples like "I love this!" (positive) and "This is terrible garbage!" (negative). Obviously the model got those right. But it consistently miscategorized sarcastic reviews and mixed-sentiment reviews because my examples never showed how to handle ambiguity.
# Assumes an AsyncOpenAI `client` created earlier, in an async context
# Every example is a clear-cut case
weak_prompt = """Classify as positive, negative, or neutral.
Review: "Best product ever! Love it!"
Sentiment: positive
Review: "Horrible. Waste of money."
Sentiment: negative
Review: "It's okay I guess."
Sentiment: neutral
Review: "The screen is gorgeous but the speakers are tinny."
Sentiment:"""
r = await client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": weak_prompt}],
)
print(f"Weak examples: {r.choices[0].message.content}")

# Examples cover the tricky boundaries
strong_prompt = """Classify as positive, negative, or neutral.
Review: "Best product ever! Love it!"
Sentiment: positive
Review: "The camera is great but the battery life is terrible."
Sentiment: neutral
Review: "Works fine for the price. Not premium quality."
Sentiment: neutral
Review: "The screen is gorgeous but the speakers are tinny."
Sentiment:"""
r = await client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": strong_prompt}],
)
print(f"Strong examples: {r.choices[0].message.content}")

The strong examples include a mixed-sentiment review classified as neutral. This teaches the model your specific rule: when a review has both positive and negative elements, call it neutral. Without this boundary example, the model has to guess your convention — and different runs may produce different guesses.
Building a Reusable Few-Shot Classifier
Hardcoding examples in prompt strings gets messy fast. Let me show you how I structure few-shot classifiers in practice — with examples stored as data and the prompt assembled programmatically.
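One way to structure it (the names, example data, and validation rule here are illustrative; the API call assumes the AsyncOpenAI client and gpt-4o-mini used throughout this tutorial):

```python
SENTIMENT_CATEGORIES = ["positive", "negative", "neutral"]

# Examples live in data, not hardcoded into the prompt string.
SENTIMENT_EXAMPLES = [
    ("Absolutely love it. Works perfectly.", "positive"),
    ("Broke after two days. Avoid.", "negative"),
    ("It does what it says. Nothing special.", "neutral"),
    ("Great screen, but the battery is a letdown.", "neutral"),
]

def build_prompt(examples, categories, text):
    """Assemble the few-shot prompt from example data."""
    lines = [f"Classify as {', '.join(categories)}.", ""]
    for ex_text, label in examples:
        lines += [f'Review: "{ex_text}"', f"Sentiment: {label}", ""]
    lines += [f'Review: "{text}"', "Sentiment:"]
    return "\n".join(lines)

def validate_label(raw, categories):
    """Normalize the model's reply; fall back to 'unknown' if it is off-list."""
    label = raw.strip().lower().rstrip(".")
    return label if label in categories else "unknown"

async def classify(text, examples=SENTIMENT_EXAMPLES, categories=SENTIMENT_CATEGORIES):
    from openai import AsyncOpenAI  # pip install openai
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    r = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": build_prompt(examples, categories, text)}],
    )
    return validate_label(r.choices[0].message.content, categories)

# Usage, in an async context:
#   label = await classify("Works great when it works.")
```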
That classify function separates the examples from the logic. You can swap in different examples for different tasks — support tickets, content moderation, intent detection — without rewriting the prompt construction code. The validation step at the end catches cases where the model returns something outside your category list, which happens more often than you would expect with zero-shot but rarely with well-chosen few-shot examples.
Practice: Build a Few-Shot Prompt
Time to build the prompt assembly logic yourself. This exercise focuses on the pure Python string construction — no API calls, so the test cases are fully deterministic.
Write a function build_few_shot_prompt(examples, new_input, task_description) that assembles a few-shot prompt string.
Rules:
- Put task_description on the first line.
- For each example (input_text, label), add:
  Input: "<input_text>"
  Label: <label>
- Add a blank line after each example.
- Finally, add the new input:
  Input: "<new_input>"
  Label: (with no value — the model fills this in)
Beyond Classification — Extraction, Formatting, and Transformation
Few-shot prompting is not limited to classification. Any task where you can show input-output pairs benefits from examples. Extraction, reformatting, translation between data formats, and text transformation all work well. The principle is always the same: show the pattern, then give a new input.
Few-Shot Data Extraction
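A sketch of a few-shot extraction prompt (the product names and sentences are invented). The call is the same chat-completions pattern as the classification examples; only the prompt changes.

```python
# Few-shot extraction: product name and normalized price from free text.
extract_prompt = """Extract the product and price from each sentence. Format prices as $X.XX.

Sentence: "I picked up the AeroMug for $14.99 at the airport."
Product: AeroMug
Price: $14.99

Sentence: "The TrailLamp was on sale for 32 dollars."
Product: TrailLamp
Price: $32.00

Sentence: "Grabbed a PocketFan for twenty-five bucks last week."
Product: PocketFan
Price: $25.00

Sentence: "My sister paid sixty dollars for the SnapStand."
Product:"""

print(extract_prompt)
```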
Notice how the third example uses "twenty-five bucks" — an informal price that the model must normalize to "$25.00". This boundary example teaches the model your formatting convention: always output prices as dollar amounts with two decimal places, regardless of how the input expresses them. Without that example, the model might output "twenty-five bucks" verbatim.
Few-Shot Format Transformation
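A sketch of a few-shot natural-language-to-SQL prompt (the orders table and its columns are invented for illustration):

```python
# Few-shot format transformation: English request -> SQL query.
sql_prompt = """Translate each request into SQL for the table orders(id, customer, status, total, created_at).

Request: "Show the customers and totals for orders over 100 dollars."
SQL: SELECT customer, total FROM orders WHERE total > 100;

Request: "How many orders are still pending?"
SQL: SELECT COUNT(*) FROM orders WHERE status = 'pending';

Request: "What is the average order total per customer?"
SQL: SELECT customer, AVG(total) FROM orders GROUP BY customer;

Request: "List the totals of shipped orders."
SQL:"""

print(sql_prompt)
```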
The examples establish conventions the zero-shot version would have to guess: use single quotes for strings, end with semicolons, select only relevant columns rather than SELECT *. Each example teaches a different SQL pattern — filtering, counting, aggregation — so the model has a repertoire to draw from.
When to Use Which: A Decision Framework
After building dozens of LLM-powered features, I have a simple decision tree. Start with zero-shot. If the output format is inconsistent, add examples. If the categories are custom or ambiguous, add boundary examples. If accuracy on edge cases still is not good enough, improve your examples before adding more of them.
Four Mistakes That Ruin Few-Shot Prompts
Mistake 1: All Examples From the Same Category
If three of your five examples are "positive" and only one is "negative," the model develops a bias toward positive. I have seen this create a classifier that calls everything positive except the most aggressively negative text. Balance your examples across categories.
Mistake 2: Examples That Are Too Easy
An example like "This is the best product ever!!!" classified as positive teaches the model nothing — it already knows that. Include examples near the boundary. "It is okay, I suppose" is neutral, not negative. "Works great when it works" is neutral, not positive. These borderline cases are what the model actually needs help with.
Mistake 3: Inconsistent Formatting Across Examples
# Mixed formats in examples
bad_prompt = """
Text: "Great product" -> Positive
Text: "Bad product"
Label: negative
Text: "OK product" sentiment=neutral
Text: "New product"
Label:"""
print("Inconsistent labels confuse the model")

# Uniform format in every example
good_prompt = """Text: "Great product"
Label: positive
Text: "Bad product"
Label: negative
Text: "OK product"
Label: neutral
Text: "New product"
Label:"""
print("Consistent formatting produces reliable output")

Consistency is not optional. Every example must use exactly the same format: same field names, same delimiter, same casing. If one example uses "Sentiment:" and another uses "Label:", the model does not know which convention to follow.
Mistake 4: Too Many Examples
More examples is not always better. Beyond five or six examples for a classification task, accuracy plateaus but token cost keeps climbing. Worse, if your examples contain contradictions or inconsistent labeling, more examples amplify the noise. I have seen prompts with 20 examples that performed worse than the same prompt with 4 carefully chosen ones.
Practice: Fix a Broken Few-Shot Classifier
This exercise tests whether you can identify and fix the problems in a poorly constructed few-shot example set. The function receives a list of examples and must validate them against quality rules.
Write a function validate_examples(examples, categories) that checks a list of few-shot examples for quality issues.
Parameters:
- examples: List of (text, label) tuples
- categories: List of valid category strings

Rules to check:
1. Invalid labels: If any example has a label not in categories, add "invalid_label" to the issues list
2. Missing categories: If any category in categories has zero examples, add "missing_category" to the issues list
3. Imbalanced: If the most common label appears more than twice as often as the least common label (among categories that have examples), add "imbalanced" to the issues list
Return a dictionary with:
- "valid": True if no issues, False otherwise
- "issues": List of issue strings (empty if valid)
- "category_counts": Dict mapping each category to its count in the examples

Putting It All Together — A Production-Ready Classifier
Let me combine everything from this tutorial into a classifier that handles real-world messiness. It uses few-shot examples with boundary cases, validates its own output, and falls back to "unknown" when the model returns garbage.
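A sketch of that combined classifier (the example texts are invented; the API call assumes the AsyncOpenAI client and gpt-4o-mini as before):

```python
CATEGORIES = ["positive", "negative", "neutral"]

# Six examples, two per category, with deliberate boundary cases.
EXAMPLES = [
    ("Best purchase I have made this year.", "positive"),
    ("Setup took two minutes and it just works.", "positive"),
    ("Stopped working after a week. Returning it.", "negative"),
    ("Customer service never answered my emails.", "negative"),
    ("The camera quality is excellent but the battery drains fast.", "neutral"),
    ("Does the job. Nothing more, nothing less.", "neutral"),
]

def build_prompt(text):
    """Assemble the few-shot prompt from the example data."""
    lines = [f"Classify each review as one of: {', '.join(CATEGORIES)}.",
             "Answer with the single label only.", ""]
    for ex_text, label in EXAMPLES:
        lines += [f'Review: "{ex_text}"', f"Sentiment: {label}", ""]
    lines += [f'Review: "{text}"', "Sentiment:"]
    return "\n".join(lines)

def normalize(raw):
    """Map a messy reply (extra words, odd casing) onto a category, else 'unknown'."""
    answer = raw.strip().lower()
    for category in CATEGORIES:
        if answer == category or answer.startswith(category):
            return category
    return "unknown"

async def classify_review(text):
    from openai import AsyncOpenAI  # pip install openai
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    r = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": build_prompt(text)}],
    )
    return normalize(r.choices[0].message.content)

# Usage, in an async context:
#   label = await classify_review("Great hardware, shame about the app.")
```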
The classifier uses six examples — two per category — with deliberate boundary cases. The mixed-sentiment review ("camera quality is excellent but battery drains fast") teaches the neutral convention. The output normalization handles the common failure mode where the model adds extra words or changes capitalization.
Summary
Zero-shot prompting works when the task is standard and the model has strong priors — translation, summarization, basic sentiment analysis. No examples needed. Start here.
Few-shot prompting works when you need consistent output formats, custom categories, or correct handling of edge cases. Three to five examples is the sweet spot. Quality beats quantity — one boundary example is worth three obvious ones.
The key insight from this tutorial: examples in a few-shot prompt are not training data. They are communication tools. They tell the model "this is what I mean by this category" and "this is exactly how I want the output formatted." Pick examples that clarify the ambiguous cases, not the obvious ones.
The next tutorial covers chain-of-thought prompting — a technique that makes LLMs explain their reasoning step by step, dramatically improving accuracy on tasks that require logic, math, or multi-step analysis.
Frequently Asked Questions
Can I use few-shot prompting with any LLM?
Yes. Few-shot prompting works with every major LLM — GPT-4, Claude, Gemini, Llama, Mistral. Larger models generally perform better with fewer examples, while smaller models may need more examples to pick up the pattern. If you are using a small open-source model locally, try five examples instead of three.
Do few-shot examples go in the system message or the user message?
Either works, but I put examples in the user message for most cases. The system message is better for persistent behavior rules ("always respond in JSON," "never include disclaimers"). For task-specific examples that might change between requests, the user message is more natural. Some providers handle system messages differently, so keeping examples in the user message is the more portable choice.
What if my few-shot prompt exceeds the context window?
If you have so many examples that they push against the context limit, you are probably using too many. For classification, 3-5 examples should suffice. For complex extraction tasks, you might need up to 10. If you genuinely need more, consider fine-tuning the model instead — it moves the examples into the model weights, eliminating per-request token costs entirely.