
LangChain Output Parsers: Extract Structured Data from LLM Responses

Intermediate · 60 min · 3 exercises · 50 XP

Your LCEL chain works. The LLM responds. And then you stare at a raw string wondering how to pull the product name, price, and rating out of a paragraph that changes shape every time you run it. I spent a frustrating afternoon writing regex to parse LLM output before discovering that LangChain already solved this problem — with output parsers that slot right into the pipe operator.

The fix is surprisingly clean. LangChain output parsers plug into the end of your chain with the same pipe operator you already know. They handle the format instructions, the parsing, and the validation — so your downstream code gets a typed Python object instead of a string you have to disassemble yourself.

Why Output Parsers Exist

Without a parser, every LCEL chain returns an AIMessage object. You can grab .content from it, but that gives you a string — and strings are where bugs hide. Consider a chain that extracts three fields from a customer review. Sometimes the model returns valid JSON. Sometimes it wraps it in markdown code fences. Sometimes it adds a polite preamble before the JSON. Your downstream code breaks on two of those three cases.

Output parsers solve three problems at once:

  • Format instructions — The parser generates a description of the expected output format, which you inject into the prompt template. The LLM sees "respond with JSON matching this schema" instead of you manually typing format specs.
  • Parsing — The parser converts the raw string into a Python object (dict, Pydantic model, list).
  • Validation — Pydantic-based parsers reject responses that are missing fields or have wrong types, so you catch problems immediately instead of three function calls later.

StrOutputParser — The Starting Point

    The simplest parser strips the AIMessage wrapper and hands you back a plain Python string. I use StrOutputParser on almost every chain because, without it, you get an AIMessage object instead of the text your application actually needs.

    Without StrOutputParser

    Running that prints something like <class 'langchain_core.messages.ai.AIMessage'> followed by the full object with content, response_metadata, and token usage data. Not what you want to pass to the next function.

    With StrOutputParser

    The result is now <class 'str'> and just the sentence itself. The pipe operator makes parsing feel natural: prompt produces a formatted message, the LLM produces an AIMessage, and the parser extracts the content string.

    JsonOutputParser — Getting Dictionaries from LLMs

    Strings are fine for chatbots. But the moment you need to store data, route decisions, or feed output into another function, you need a dictionary. JsonOutputParser handles the messy middle ground — it strips markdown fences, parses the JSON, and returns a Python dict.

    JsonOutputParser format instructions

    That prints a block of text telling the LLM to return valid JSON. The key insight: you do not write these instructions yourself. The parser generates them, and you inject them into your prompt template using a variable.

    Full chain with JsonOutputParser

The result is a plain Python dictionary: no AIMessage wrapper and no markdown fences left to strip.

    You can now access result["product_name"] or iterate over result["pros"] without any string parsing. The model might occasionally vary the exact wording inside lists, but the structure — keys, nesting, types — stays consistent because the format instructions told it what to produce.

    PydanticOutputParser — Validated, Typed Output

    This is the parser I reach for in any production chain. PydanticOutputParser takes a Pydantic model class, generates detailed format instructions from the field names, types, and descriptions, and then validates the LLM output against that model. If a field is missing or has the wrong type, you get a clear ValidationError instead of a silent bug three functions downstream.

    First, define your expected output shape as a Pydantic model:

    Define the Pydantic model
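A sketch of such a model. The exact field set (product_name, rating, pros, cons) is an assumption for illustration; the `ge=1, le=5` constraint matches the rating rule discussed below.

```python
from typing import List

from pydantic import BaseModel, Field


class ProductReview(BaseModel):
    product_name: str = Field(description="The name of the product being reviewed")
    rating: int = Field(description="Star rating from 1 to 5", ge=1, le=5)
    pros: List[str] = Field(description="Positive points the reviewer mentions")
    cons: List[str] = Field(description="Negative points the reviewer mentions")
```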

    The Field(description=...) matters here. LangChain includes those descriptions in the format instructions it sends to the LLM, so descriptive field docs produce better extractions. I always write descriptions even for obvious fields like product_name — the LLM uses them as disambiguation hints.

    Create the parser and inspect format instructions

    The format instructions include the full JSON schema with field names, types, descriptions, and constraints. The LLM receives a detailed specification it can follow. Wire everything into an LCEL chain:

    Complete PydanticOutputParser chain

    The result is a ProductReview instance, not a dictionary. You access fields with dot notation — result.product_name, result.rating — and you get IDE autocompletion and type checking for free. If the LLM returns a rating of 6, Pydantic raises a ValidationError because of the le=5 constraint.

    Injecting Format Instructions with partial

    Passing format_instructions manually every time you call invoke() gets tedious fast. The cleaner pattern is to use prompt.partial() to bake the instructions into the template once:

    Using partial() for cleaner invocations

    The partial() call fills in format_instructions ahead of time. From this point on, every call to chain.invoke() only needs to provide the dynamic variables (review in this case). This is the pattern I use in every production chain that has a parser.

    When the Parser Fails — Handling Malformed Output

    LLMs are stochastic. Even with format instructions, a model might occasionally return malformed JSON — an unclosed bracket, a trailing comma, or a conversational preamble before the JSON block. Here is what happens and how to handle it.

    Catching parser exceptions

    The OutputParserException includes llm_output — the raw string the model returned before parsing failed. This is invaluable for debugging. You can log it, send it to an error tracker, or use it to refine your prompt.

    CommaSeparatedListOutputParser — Quick Lists


    This prints a short instruction telling the model to respond with comma separated values (for example: foo, bar, baz). Simple and effective when you need a flat list without the overhead of JSON schemas.

    List parser in a chain

    The result is a Python list of strings. I find this parser useful for brainstorming chains, tag extraction, and anywhere you need a flat enumeration without the ceremony of a Pydantic model.

    Real-World Pattern: Multi-Step Chain with Parsers

    Output parsers shine when you chain multiple steps. Here is a practical pattern I use often: a two-step chain where the first step extracts structured data and the second step generates a response based on that data.

    Step 1 — Extract structured ticket data

    The first chain parses a raw customer message into a SupportTicket with category, urgency, and escalation flag. The second chain uses those fields to generate an appropriate response:

    Step 2 — Generate response from structured data

    The structured intermediate step — the SupportTicket — is what makes this pattern powerful. Without it, you would feed raw text into the response generator and hope the LLM figures out the urgency and category on its own. With the parsed ticket, you control exactly what information the response chain sees.

    Advanced Pydantic Patterns: Enums, Optional Fields, and Nested Models

    Pydantic models can express much richer schemas than flat key-value pairs. Here are three patterns I use regularly that make LLM output significantly more reliable.

    Constraining Values with Enums

    When a field should be one of a fixed set of values — like a sentiment label or a priority level — use a Python Enum. The parser includes the allowed values in the format instructions, and Pydantic rejects anything outside that set.

    Enum-constrained fields

    The format instructions now list the allowed values for sentiment (positive, negative, neutral, mixed) as an enum in the JSON schema. The LLM knows its options, and Pydantic enforces them. No more "Somewhat Positive" or "5/10" sneaking through.

    Nested Models for Complex Structures

    Nested Pydantic models

    The Optional[str] with default=None handles fields that might not appear in the input text. If the email is not mentioned, the model returns null for that field instead of hallucinating one. Nested models like Address inside Person let you represent hierarchical data without flattening everything into one level.

    Build a Pydantic Model for Job Posting Extraction
    Write Code

    Create a Pydantic model called JobPosting that represents a parsed job listing. It should have these fields:

  • title (str): The job title
  • company (str): The company name
  • salary_min (int): Minimum salary
  • salary_max (int): Maximum salary
  • remote (bool): Whether the position is remote
  • skills (List[str]): Required skills

    All fields must have Field(description=...) for good format instructions. Then create a PydanticOutputParser from your model and print the format instructions.


    Building a Custom Output Parser

    Sometimes the built-in parsers do not fit your use case. Maybe you want to extract key-value pairs from a custom format, or parse markdown tables, or split a response at a specific delimiter. LangChain lets you build custom parsers by subclassing BaseOutputParser.

    Custom BulletListParser

    Three methods make a custom parser: parse() does the actual conversion, get_format_instructions() tells the LLM what format to use, and _type provides a string identifier. The generic type parameter BaseOutputParser[List[str]] declares what parse() returns.

    Using the custom parser in a chain

    Custom parsers plug into LCEL chains exactly like built-in ones — the pipe operator does not care whether you wrote the parser or LangChain did. I have built custom parsers for XML extraction, markdown table parsing, and splitting multi-part LLM responses at --- delimiters.

    Common Mistakes with Output Parsers

    After debugging parser failures across many projects, these are the three mistakes I see most often.

    Mistake 1: Forgetting Format Instructions in the Prompt

    Missing format instructions — parser fails
    # The prompt never tells the LLM to return JSON
    prompt = ChatPromptTemplate.from_template(
        "Extract the product name and price from: {text}"
    )
    # Parser expects JSON but LLM returns prose
    chain = prompt | llm | parser  # Crashes at parse time
    Format instructions included — parser succeeds
    prompt = ChatPromptTemplate.from_template(
        "Extract the product name and price from: {text}\n"
        "{format_instructions}"
    ).partial(
        format_instructions=parser.get_format_instructions()
    )
    chain = prompt | llm | parser  # Works reliably

    Mistake 2: Pydantic Fields Without Descriptions

    Field descriptions are not just documentation — they are part of the format instructions sent to the LLM. A field named qty with no description leaves the model guessing. A field with Field(description="Quantity ordered, as an integer") removes all ambiguity.

    Mistake 3: Using the Wrong Parser for the Job

    If you need...                      Use this parser
    Raw text string                     StrOutputParser
    Simple flat JSON (no validation)    JsonOutputParser
    Validated, typed objects            PydanticOutputParser
    A Python list                       CommaSeparatedListOutputParser
    Something else entirely             Custom BaseOutputParser subclass
    Build a Key-Value Parser
    Write Code

    Create a custom output parser called KeyValueParser that inherits from BaseOutputParser[dict]. It should parse text in the format:

    Name: Alice
    Age: 30
    City: New York

    into a Python dictionary {"Name": "Alice", "Age": "30", "City": "New York"}.

    Implement the parse() method that splits on newlines, then splits each line at the first : to get key-value pairs. Strip whitespace from both keys and values. Skip empty lines.


    Create an Enum-Constrained Pydantic Model
    Write Code

    Create an Enum called Priority with values LOW, MEDIUM, HIGH, and CRITICAL. Then create a Pydantic model called BugReport with these fields:

  • title (str): Bug title
  • priority (Priority): The priority enum
  • steps_to_reproduce (List[str]): Steps to reproduce the bug
  • is_regression (bool): Whether this is a regression, default False

    Then create a sample BugReport instance and print its priority value.


    Choosing the Right Parser — Quick Reference

    Parser                          Returns           Validates Schema?          Best For
    StrOutputParser                 str               No                         Chat responses, text generation
    JsonOutputParser                dict              No (valid JSON only)       Quick prototyping, flexible schemas
    PydanticOutputParser            Pydantic model    Yes (types + constraints)  Production chains, APIs
    CommaSeparatedListOutputParser  List[str]         No                         Tag extraction, brainstorming
    OutputFixingParser              Wraps any parser  Delegates to inner parser  Auto-retry on parse failures
    Custom BaseOutputParser         Anything          Your logic                 Non-standard formats

    Summary

    Output parsers transform raw LLM text into structured Python objects. StrOutputParser strips the AIMessage wrapper. JsonOutputParser returns dictionaries. PydanticOutputParser adds type validation and schema constraints. And custom parsers handle everything else.

    The key workflow is always the same: define your expected output shape, create a parser, inject get_format_instructions() into your prompt template using partial(), and pipe the parser at the end of your chain. The parser handles the rest — format instructions upstream, parsing and validation downstream.

    Three things to remember: always include format instructions in your prompt, always add descriptions to Pydantic fields, and always wrap parser calls in try/except for production code. If you need automatic retry on malformed output, reach for OutputFixingParser.

    Frequently Asked Questions

    Can I use output parsers with streaming?

    StrOutputParser supports streaming natively — tokens flow through as they arrive. JsonOutputParser also supports streaming via chain.astream(), yielding partial parsed objects as JSON accumulates. PydanticOutputParser does not support streaming because it needs the complete JSON before validation. If you need streaming with structured output, use JsonOutputParser for the stream and validate the final result with Pydantic separately.

    Do output parsers work with non-OpenAI models?

    Yes. Output parsers are model-agnostic — they operate on the text content of any AIMessage, regardless of which LLM produced it. The format instructions go into the prompt, and any model that can follow instructions will produce parseable output. I have used PydanticOutputParser successfully with Claude, Gemini, Llama, and Mistral through LangChain.

    Should I use output parsers or with_structured_output()?

    LangChain also provides llm.with_structured_output(MyModel), which uses provider-native structured output (like OpenAI's response_format). If your provider supports it, with_structured_output() is more reliable because the constraint is enforced at the model level, not just through prompt instructions. Use output parsers when your provider does not support native structured output, or when you need custom parsing logic that goes beyond JSON schema validation.

    References

  • LangChain Output Parsers Documentation — Official guide to all built-in parsers
  • LangChain LCEL Guide — How parsers integrate with the pipe operator
  • Pydantic V2 Documentation — Field types, validators, and model configuration
  • LangChain Structured Output How-To — with_structured_output() as an alternative
  • LangChain JsonOutputParser API — API reference for JsonOutputParser
  • LangChain PydanticOutputParser API — API reference for PydanticOutputParser
  • OpenAI Structured Outputs — Provider-native alternative