
LangChain Output Parsers: Extract Structured Data from LLM Responses

Intermediate · 60 min · 3 exercises · 50 XP

Your LCEL chain works. The LLM responds. And then you stare at a raw string wondering how to pull the product name, price, and rating out of a paragraph that changes shape every time you run it. I spent a frustrating afternoon writing regex to parse LLM output before discovering that LangChain already solved this problem — with output parsers that slot right into the pipe operator.

The fix is surprisingly clean. LangChain output parsers plug into the end of your chain with the same pipe operator you already know. They handle the format instructions, the parsing, and the validation — so your downstream code gets a typed Python object instead of a string you have to disassemble yourself.

Why Output Parsers Exist

Without a parser, every LCEL chain returns an AIMessage object. You can grab .content from it, but that gives you a string — and strings are where bugs hide. Consider a chain that extracts three fields from a customer review. Sometimes the model returns valid JSON. Sometimes it wraps it in markdown code fences. Sometimes it adds a polite preamble before the JSON. Your downstream code breaks on two of those three cases.

Output parsers solve three problems at once:

  • Format instructions — The parser generates a description of the expected output format, which you inject into the prompt template. The LLM sees "respond with JSON matching this schema" instead of you manually typing format specs.
  • Parsing — The parser converts the raw string into a Python object (dict, Pydantic model, list).
  • Validation — Pydantic-based parsers reject responses that are missing fields or have wrong types, so you catch problems immediately instead of three function calls later.

StrOutputParser — The Starting Point

    The simplest parser strips the AIMessage wrapper and hands you back a plain Python string. I use StrOutputParser on almost every chain because, without it, you get an AIMessage object instead of the text your application actually needs.

    Without StrOutputParser

    Running that prints something like <class 'langchain_core.messages.ai.AIMessage'> followed by the full object with content, response_metadata, and token usage data. Not what you want to pass to the next function.

    With StrOutputParser

    The result is now <class 'str'> and just the sentence itself. The pipe operator makes parsing feel natural: prompt produces a formatted message, the LLM produces an AIMessage, and the parser extracts the content string.

    JsonOutputParser — Getting Dictionaries from LLMs

    Strings are fine for chatbots. But the moment you need to store data, route decisions, or feed output into another function, you need a dictionary. JsonOutputParser handles the messy middle ground — it strips markdown fences, parses the JSON, and returns a Python dict.

    JsonOutputParser format instructions

    That prints a block of text telling the LLM to return valid JSON. The key insight: you do not write these instructions yourself. The parser generates them, and you inject them into your prompt template using a variable.

    Full chain with JsonOutputParser

The result is a plain Python dictionary: no AIMessage wrapper and no markdown fences left to strip.

    You can now access result["product_name"] or iterate over result["pros"] without any string parsing. The model might occasionally vary the exact wording inside lists, but the structure — keys, nesting, types — stays consistent because the format instructions told it what to produce.

    PydanticOutputParser — Validated, Typed Output

    This is the parser I reach for in any production chain. PydanticOutputParser takes a Pydantic model class, generates detailed format instructions from the field names, types, and descriptions, and then validates the LLM output against that model. If a field is missing or has the wrong type, you get a clear ValidationError instead of a silent bug three functions downstream.

    First, define your expected output shape as a Pydantic model:

    Define the Pydantic model
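A sketch of such a model. The exact field set (product_name, rating, pros, cons) is an assumption for illustration; the `ge=1, le=5` constraint matches the rating rule discussed below.

```python
from typing import List

from pydantic import BaseModel, Field


class ProductReview(BaseModel):
    product_name: str = Field(description="The name of the product being reviewed")
    rating: int = Field(description="Star rating from 1 to 5", ge=1, le=5)
    pros: List[str] = Field(description="Positive points the reviewer mentions")
    cons: List[str] = Field(description="Negative points the reviewer mentions")
```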

    The Field(description=...) matters here. LangChain includes those descriptions in the format instructions it sends to the LLM, so descriptive field docs produce better extractions. I always write descriptions even for obvious fields like product_name — the LLM uses them as disambiguation hints.

    Create the parser and inspect format instructions

    The format instructions include the full JSON schema with field names, types, descriptions, and constraints. The LLM receives a detailed specification it can follow. Wire everything into an LCEL chain:

    Complete PydanticOutputParser chain

    The result is a ProductReview instance, not a dictionary. You access fields with dot notation — result.product_name, result.rating — and you get IDE autocompletion and type checking for free. If the LLM returns a rating of 6, Pydantic raises a ValidationError because of the le=5 constraint.

    Injecting Format Instructions with partial

    Passing format_instructions manually every time you call invoke() gets tedious fast. The cleaner pattern is to use prompt.partial() to bake the instructions into the template once:

    Using partial() for cleaner invocations

    The partial() call fills in format_instructions ahead of time. From this point on, every call to chain.invoke() only needs to provide the dynamic variables (review in this case). This is the pattern I use in every production chain that has a parser.

    When the Parser Fails — Handling Malformed Output

    LLMs are stochastic. Even with format instructions, a model might occasionally return malformed JSON — an unclosed bracket, a trailing comma, or a conversational preamble before the JSON block. Here is what happens and how to handle it.

    Catching parser exceptions

    The OutputParserException includes llm_output — the raw string the model returned before parsing failed. This is invaluable for debugging. You can log it, send it to an error tracker, or use it to refine your prompt.

    CommaSeparatedListOutputParser — Quick Lists


    This prints a short instruction telling the model to respond with comma separated values (for example: foo, bar, baz). Simple and effective when you need a flat list without the overhead of JSON schemas.

    List parser in a chain

    The result is a Python list of strings. I find this parser useful for brainstorming chains, tag extraction, and anywhere you need a flat enumeration without the ceremony of a Pydantic model.

    Real-World Pattern: Multi-Step Chain with Parsers

    Output parsers shine when you chain multiple steps. Here is a practical pattern I use often: a two-step chain where the first step extracts structured data and the second step generates a response based on that data.

    Step 1 — Extract structured ticket data

    The first chain parses a raw customer message into a SupportTicket with category, urgency, and escalation flag. The second chain uses those fields to generate an appropriate response:

    Step 2 — Generate response from structured data

    The structured intermediate step — the SupportTicket — is what makes this pattern powerful. Without it, you would feed raw text into the response generator and hope the LLM figures out the urgency and category on its own. With the parsed ticket, you control exactly what information the response chain sees.

    Advanced Pydantic Patterns: Enums, Optional Fields, and Nested Models

    Pydantic models can express much richer schemas than flat key-value pairs. Here are three patterns I use regularly that make LLM output significantly more reliable.

    Constraining Values with Enums

    When a field should be one of a fixed set of values — like a sentiment label or a priority level — use a Python Enum. The parser includes the allowed values in the format instructions, and Pydantic rejects anything outside that set.

    Enum-constrained fields

    The format instructions now list the allowed values for sentiment (positive, negative, neutral, mixed) as an enum in the JSON schema. The LLM knows its options, and Pydantic enforces them. No more "Somewhat Positive" or "5/10" sneaking through.

    Nested Models for Complex Structures

    Nested Pydantic models

    The Optional[str] with default=None handles fields that might not appear in the input text. If the email is not mentioned, the model returns null for that field instead of hallucinating one. Nested models like Address inside Person let you represent hierarchical data without flattening everything into one level.

    Build a Pydantic Model for Job Posting Extraction
    Write Code

    Create a Pydantic model called JobPosting that represents a parsed job listing. It should have these fields:

  • title (str): The job title
  • company (str): The company name
  • salary_min (int): Minimum salary
  • salary_max (int): Maximum salary
  • remote (bool): Whether the position is remote
  • skills (List[str]): Required skills

    All fields must have Field(description=...) for good format instructions. Then create a PydanticOutputParser from your model and print the format instructions.


    Building a Custom Output Parser

    Sometimes the built-in parsers do not fit your use case. Maybe you want to extract key-value pairs from a custom format, or parse markdown tables, or split a response at a specific delimiter. LangChain lets you build custom parsers by subclassing BaseOutputParser.

    Custom BulletListParser

    Three methods make a custom parser: parse() does the actual conversion, get_format_instructions() tells the LLM what format to use, and _type provides a string identifier. The generic type parameter BaseOutputParser[List[str]] declares what parse() returns.

    Using the custom parser in a chain

    Custom parsers plug into LCEL chains exactly like built-in ones — the pipe operator does not care whether you wrote the parser or LangChain did. I have built custom parsers for XML extraction, markdown table parsing, and splitting multi-part LLM responses at --- delimiters.

    Common Mistakes with Output Parsers

    After debugging parser failures across many projects, these are the three mistakes I see most often.

    Mistake 1: Forgetting Format Instructions in the Prompt

    Missing format instructions — parser fails
    # The prompt never tells the LLM to return JSON
    prompt = ChatPromptTemplate.from_template(
        "Extract the product name and price from: {text}"
    )
    # Parser expects JSON but LLM returns prose
    chain = prompt | llm | parser  # Crashes at parse time
    Format instructions included — parser succeeds
    prompt = ChatPromptTemplate.from_template(
        "Extract the product name and price from: {text}\n"
        "{format_instructions}"
    ).partial(
        format_instructions=parser.get_format_instructions()
    )
    chain = prompt | llm | parser  # Works reliably

    Mistake 2: Pydantic Fields Without Descriptions

    Field descriptions are not just documentation — they are part of the format instructions sent to the LLM. A field named qty with no description leaves the model guessing. A field with Field(description="Quantity ordered, as an integer") removes all ambiguity.

    Mistake 3: Using the Wrong Parser for the Job

    If you need...                      Use this parser
    Raw text string                     StrOutputParser
    Simple flat JSON (no validation)    JsonOutputParser
    Validated, typed objects            PydanticOutputParser
    A Python list                       CommaSeparatedListOutputParser
    Something else entirely             Custom BaseOutputParser subclass
    Build a Key-Value Parser
    Write Code

    Create a custom output parser called KeyValueParser that inherits from BaseOutputParser[dict]. It should parse text in the format:

    Name: Alice
    Age: 30
    City: New York

    into a Python dictionary {"Name": "Alice", "Age": "30", "City": "New York"}.

    Implement the parse() method that splits on newlines, then splits each line at the first : to get key-value pairs. Strip whitespace from both keys and values. Skip empty lines.


    Create an Enum-Constrained Pydantic Model
    Write Code

    Create an Enum called Priority with values LOW, MEDIUM, HIGH, and CRITICAL. Then create a Pydantic model called BugReport with these fields:

  • title (str): Bug title
  • priority (Priority): The priority enum
  • steps_to_reproduce (List[str]): Steps to reproduce the bug
  • is_regression (bool): Whether this is a regression, default False

    Then create a sample BugReport instance and print its priority value.


    Choosing the Right Parser — Quick Reference

    Parser                          Returns           Validates Schema?          Best For
    StrOutputParser                 str               No                         Chat responses, text generation
    JsonOutputParser                dict              No (valid JSON only)       Quick prototyping, flexible schemas
    PydanticOutputParser            Pydantic model    Yes (types + constraints)  Production chains, APIs
    CommaSeparatedListOutputParser  List[str]         No                         Tag extraction, brainstorming
    OutputFixingParser              Wraps any parser  Delegates to inner parser  Auto-retry on parse failures
    Custom BaseOutputParser         Anything          Your logic                 Non-standard formats

    Summary

    Output parsers transform raw LLM text into structured Python objects. StrOutputParser strips the AIMessage wrapper. JsonOutputParser returns dictionaries. PydanticOutputParser adds type validation and schema constraints. And custom parsers handle everything else.

    The key workflow is always the same: define your expected output shape, create a parser, inject get_format_instructions() into your prompt template using partial(), and pipe the parser at the end of your chain. The parser handles the rest — format instructions upstream, parsing and validation downstream.

    Three things to remember: always include format instructions in your prompt, always add descriptions to Pydantic fields, and always wrap parser calls in try/except for production code. If you need automatic retry on malformed output, reach for OutputFixingParser.

    Frequently Asked Questions

    Can I use output parsers with streaming?

    StrOutputParser supports streaming natively — tokens flow through as they arrive. JsonOutputParser also supports streaming via chain.astream(), yielding partial parsed objects as JSON accumulates. PydanticOutputParser does not support streaming because it needs the complete JSON before validation. If you need streaming with structured output, use JsonOutputParser for the stream and validate the final result with Pydantic separately.

    Do output parsers work with non-OpenAI models?

    Yes. Output parsers are model-agnostic — they operate on the text content of any AIMessage, regardless of which LLM produced it. The format instructions go into the prompt, and any model that can follow instructions will produce parseable output. I have used PydanticOutputParser successfully with Claude, Gemini, Llama, and Mistral through LangChain.

    Should I use output parsers or with_structured_output()?

    LangChain also provides llm.with_structured_output(MyModel), which uses provider-native structured output (like OpenAI's response_format). If your provider supports it, with_structured_output() is more reliable because the constraint is enforced at the model level, not just through prompt instructions. Use output parsers when your provider does not support native structured output, or when you need custom parsing logic that goes beyond JSON schema validation.

    References

  • LangChain Output Parsers Documentation — Official guide to all built-in parsers
  • LangChain LCEL Guide — How parsers integrate with the pipe operator
  • Pydantic V2 Documentation — Field types, validators, and model configuration
  • LangChain Structured Output How-To — with_structured_output() as an alternative
  • LangChain JsonOutputParser API — API reference for JsonOutputParser
  • LangChain PydanticOutputParser API — API reference for PydanticOutputParser
  • OpenAI Structured Outputs — Provider-native alternative