
Output Formatting Masterclass: Make LLMs Output JSON, XML, Markdown, and Custom Formats

Intermediate · 60 min · 2 exercises · 35 XP

You ask an LLM to return a list of products as JSON. It responds with a lovely paragraph of prose, a few bullet points, and maybe some JSON buried inside a markdown code fence. Your json.loads() call explodes. You have been there — I certainly have, probably a hundred times before I figured out the patterns that actually work.

This tutorial shows you how to make LLMs reliably output JSON, XML, Markdown tables, and custom-delimited formats. More importantly, you will learn how to parse each format safely and what to do when the model inevitably gets creative with your instructions.

Why Output Format Matters More Than You Think

Every time you build an application on top of an LLM, you hit the same wall: the model produces text, but your code needs data. A chatbot can get away with free-form responses. But the moment you need to feed the LLM's answer into a database, render it in a UI, or chain it into another API call, you need structure.

I think of output formatting as the contract between your prompt and your parser. Get the contract right and your pipeline runs cleanly. Get it wrong and you spend more time writing error-recovery code than you spent on the actual feature.

The formats we will cover, and when each one shines:

  • JSON — the default choice for APIs and structured data. Every language has a parser.
  • XML — better than JSON for hierarchical or nested documents with metadata attributes.
  • Markdown — best for human-readable output that still has lightweight structure (tables, headers, lists).
  • Custom delimiters — the simplest option when you just need to extract a few fields reliably.
Setup: install openai and create a reusable helper
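A minimal helper along these lines, assuming the openai>=1.x Python SDK (`pip install openai`) and an `OPENAI_API_KEY` environment variable; the function name and model name are placeholders, not fixed by this tutorial:

```python
def ask(prompt: str, system: str = "You are a helpful assistant.",
        temperature: float = 0.2) -> str:
    # Hypothetical reusable helper; assumes the openai>=1.x SDK.
    # Imported inside the function so the rest of this file runs without the SDK.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",            # example model name
        temperature=temperature,        # low temperature = consistent formatting
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
```

Every later example funnels through a helper like this, so format-related settings live in one place.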

    We set temperature=0.2 because lower temperature means less creative variation — exactly what you want when you need the model to follow a strict format. We will talk more about when to adjust this later.

    JSON via Prompting — The Most Common Format

    JSON is the workhorse of structured LLM output. Every language can parse it, every API speaks it, and models are trained on enormous amounts of it. But asking for JSON and getting clean JSON are two different things.

    The naive approach — just saying "respond in JSON" — fails more often than you would expect. The model wraps the JSON in a markdown code fence, adds an introductory sentence, or produces keys that do not match your schema. Here is a prompt pattern I have found reliable across GPT-4o, Claude, and Gemini:

    A reliable JSON prompt pattern
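One possible rendering of that pattern; the schema keys, enumeration values, and review text are illustrative, not prescribed:

```python
# Example review to analyze (placeholder text).
review = "Battery life is great, but the screen scratches easily."

# The prompt shows the exact schema, uses pipe notation for enumerations,
# and explicitly forbids markdown fences.
prompt = f"""Analyze the product review below.

Return ONLY a JSON object with exactly this schema:
{{"sentiment": "positive" | "negative" | "mixed", "score": <float between 0 and 1>, "summary": "<one sentence>"}}

Rules:
- Output ONLY the JSON object, nothing else.
- Do NOT wrap the output in markdown code fences.
- Use exactly the key names shown above.

Review: {review}"""
```

Sending this through a helper like the one above and calling json.loads() on the reply is the whole pipeline for simple cases.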

Three things make this prompt work: showing the exact schema (not just describing it), using pipe notation for enumerations, and the explicit "no markdown fences" instruction. Without that last rule, most models wrap the output in a Markdown code fence (```json ... ```), which breaks json.loads().

    Parsing JSON Safely

    Even with a good prompt, you should never trust that the response is valid JSON. Models occasionally add trailing commas, use single quotes, or sneak in a comment. Here is a defensive parsing function that handles the most common failure modes:

    Defensive JSON parser for LLM output
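A sketch of such a parser; the function name is mine, and the recovery order (fences first, then a straight parse, then the outermost bracketed span) is one reasonable choice among several:

```python
import json
import re

def parse_llm_json(text: str):
    """Parse JSON out of an LLM response, tolerating common decorations."""
    cleaned = text.strip()

    # 1. Strip markdown code fences (``` or ```json) wrapping the payload.
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", cleaned, re.DOTALL)
    if fence:
        cleaned = fence.group(1)

    # 2. Try a straight parse first.
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # 3. Fall back to the outermost {...} object or [...] array in the text,
    #    which handles "Here is your JSON: {...}" style responses.
    for opener, closer in (("{", "}"), ("[", "]")):
        start, end = cleaned.find(opener), cleaned.rfind(closer)
        if start != -1 and end > start:
            try:
                return json.loads(cleaned[start:end + 1])
            except json.JSONDecodeError:
                continue

    raise ValueError(f"No parseable JSON found in: {text[:80]!r}")
```

Raising a ValueError (rather than returning None) keeps failures loud, which matters when the result feeds a database write.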

    This parser handles three common issues: markdown code fences around the JSON, extra text before or after the JSON object, and JSON arrays instead of objects. I use a version of this function in every LLM project I build.

    Batch JSON — Multiple Objects in One Call

    Sometimes you need the model to return multiple structured objects — say, analyzing five reviews at once. Asking for a JSON array works, but you need to be explicit about it:

    Getting a JSON array from the model
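An illustrative batch prompt, with a simulated response to show the shape you would parse (the review texts and keys are placeholders):

```python
import json

reviews = [
    "Love it. Works perfectly.",
    "Broke after two days.",
    "Decent for the price.",
]

numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(reviews))

# Be explicit that you want a JSON array, one object per input, in order.
prompt = f"""Analyze each review below.

Return ONLY a JSON array with one object per review, in the same order:
[{{"id": <review number>, "sentiment": "positive" | "negative" | "mixed", "score": <float 0-1>}}, ...]

Do not wrap the output in markdown fences. Output the array and nothing else.

Reviews:
{numbered}"""

# What a well-formed response would look like (simulated here, not an API call):
simulated = '[{"id": 1, "sentiment": "positive", "score": 0.95}]'
parsed = json.loads(simulated)
```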

The model returns a JSON array with one object per review, each using the exact keys we specified. Batch processing like this is cheaper than making separate API calls per review because you pay for fewer prompt tokens (the system message and instructions are only sent once).

    Exercise 1: Parse and Validate LLM JSON

    Write a function extract_fields(text) that takes a string containing JSON (possibly wrapped in markdown code fences) and returns a dictionary with only the keys "name", "age", and "city". If the JSON contains extra keys, ignore them. If any of the three required keys are missing, set their value to None.

    The function should:

    1. Strip markdown code fences if present

    2. Parse the JSON

    3. Return a dict with exactly three keys: name, age, city


    XML for Hierarchical Data

    JSON is great for flat or lightly nested structures. But when your data is deeply hierarchical — think a document outline, a conversation tree, or configuration with attributes and metadata — XML can actually be a better fit. This might sound old-fashioned, but there is a practical reason: XML tags are self-closing and unambiguous, which makes partial or malformed output easier to recover from.

    The key advantage of XML over JSON for LLM output is that attributes and content are separate. A JSON object uses the same mechanism (keys) for metadata and data. An XML element can carry metadata in attributes and data in its text content. That distinction matters when you are building document-processing pipelines.

    Prompting for XML output
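A prompt along these lines works well; the tag names and attribute set are examples you would adapt to your own schema:

```python
# Illustrative XML prompt: show the exact structure, put metadata in
# attributes and data in element text.
prompt = """Summarize the company description below as XML.

Use exactly this structure (attributes for metadata, text for content):
<company name="...">
  <department name="..." headcount="...">
    <role level="junior|senior">...</role>
  </department>
</company>

Output ONLY the XML, with no markdown fences and no commentary.

Description: Acme has a 12-person engineering team and a 5-person sales team."""
```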

    Parsing XML in Python is straightforward with the built-in xml.etree.ElementTree module — no pip install needed:

    Parsing XML with ElementTree
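A sketch of the parsing side, run here against a hand-written sample response in the shape the prompt above asks for:

```python
import xml.etree.ElementTree as ET

# A sample model response (hard-coded here so the example is self-contained).
xml_response = """<company name="Acme">
  <department name="Engineering" headcount="12">
    <role level="senior">Backend Developer</role>
    <role level="junior">QA Analyst</role>
  </department>
  <department name="Sales" headcount="5">
    <role level="senior">Account Executive</role>
  </department>
</company>"""

root = ET.fromstring(xml_response)

departments = []
for dept in root.findall("department"):
    departments.append({
        "name": dept.get("name"),                  # metadata lives in attributes
        "headcount": int(dept.get("headcount")),
        "roles": [role.text for role in dept.findall("role")],  # data in text
    })
```

In production you would wrap ET.fromstring in a try/except for xml.etree.ElementTree.ParseError, for the same reason you never trust raw JSON.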

    Notice how cleanly the attributes map to metadata (department name, headcount) while the nesting represents the hierarchy (company contains departments, departments contain roles). This would work in JSON too, but the XML version reads more naturally when the hierarchy is the point.

    Custom Delimiters — The Simplest Reliable Format

    Using custom delimiters for structured extraction
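An example of the pattern; the field names and article text are placeholders:

```python
article = "The new phone launch exceeded sales projections in every region."

# Show the template literally; the model fills in the angle-bracket slots.
prompt = f"""Analyze the article below and answer using EXACTLY this template:

===TITLE===
<a short title>
===SUMMARY===
<2-3 sentence summary>
===SENTIMENT===
<positive, negative, or neutral>

Use the ===FIELD=== markers exactly as shown. Do not add any other text.

Article: {article}"""
```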

    I reach for custom delimiters when I need to extract 3-6 fields and do not want the overhead of JSON schema definitions. The triple-equals pattern (===FIELD===) works well because it is visually distinctive, unlikely to appear in normal text, and trivial to parse with a regex or simple string split.

    Parser for custom-delimited output
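One way to write that parser (the function name is mine; the marker pattern matches the ===FIELD=== convention above):

```python
import re

def parse_delimited(text: str) -> dict:
    """Parse ===FIELD=== delimited output into a dict of field -> value."""
    fields = {}
    current = None
    buffer = []

    for line in text.splitlines():
        marker = re.fullmatch(r"===([A-Z_]+)===", line.strip())
        if marker:
            if current is not None:
                # Hit a new marker: save everything collected for the previous field.
                fields[current] = "\n".join(buffer).strip()
            current = marker.group(1).lower()
            buffer = []
        elif current is not None:
            buffer.append(line)  # multi-line values accumulate here

    if current is not None:  # flush the final field
        fields[current] = "\n".join(buffer).strip()
    return fields
```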

    The parser walks through the lines, tracking which field it is currently inside. When it hits a new ===FIELD=== marker, it saves the previous field and starts a new one. This handles multi-line values gracefully — if the model puts a paragraph between two markers, you get the whole paragraph.

    Exercise 2: Build a Delimiter Parser

    Write a function parse_sections(text) that parses text using ###SECTION_NAME### delimiters (note: three # on each side). The function should return a dictionary mapping lowercase section names to their content (stripped of leading/trailing whitespace). Ignore any text before the first delimiter.

    Example input:

    ###TITLE###
    My Report
    ###SUMMARY###
    This is a brief summary.
    It has two lines.
    ###END###

    Expected output: {"title": "My Report", "summary": "This is a brief summary.\nIt has two lines."}

    The ###END### marker signals the end of parsing — do not include it as a key.


    Markdown — Structured Output That Humans Can Read

    Markdown sits in a sweet spot: it has enough structure for light parsing (headers, tables, lists) but is still comfortable for humans to read. When your output needs to be both machine-parseable and directly displayable — think reports, summaries, or documentation — Markdown is the right choice.

    Markdown tables are especially useful. They are the one format where I find models are almost always reliable, probably because LLMs have seen millions of Markdown tables in training data.

    Getting a Markdown table from the model
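An example prompt; the comparison topic and column names are illustrative:

```python
# Pin down the columns explicitly so every row has the same shape.
prompt = """Compare Python, JavaScript, and Go for backend web development.

Return ONLY a Markdown table with exactly these columns:
| Language | Typing | Performance | Best For |

One row per language. No text before or after the table."""
```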

    Parsing a Markdown table into a list of dictionaries takes just a few lines:

    Parsing Markdown tables into dictionaries
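A minimal sketch of that parser (the function name is mine; it assumes a well-formed pipe table with a `|---|` separator row):

```python
def parse_markdown_table(text: str) -> list[dict]:
    """Turn a Markdown pipe table into a list of row dicts keyed by header."""
    rows = []
    lines = [ln.strip() for ln in text.strip().splitlines()
             if ln.strip().startswith("|")]
    if len(lines) < 2:
        return rows

    def cells(line: str) -> list[str]:
        # "| a | b |" -> ["a", "b"]
        return [cell.strip() for cell in line.strip("|").split("|")]

    headers = cells(lines[0])
    for line in lines[2:]:  # skip the |---|---| separator row
        row = cells(line)
        if len(row) == len(headers):  # ignore malformed rows
            rows.append(dict(zip(headers, row)))
    return rows
```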

    Each row becomes a dictionary with the header names as keys. This is handy when you need to render the data in a different format — you could convert these dicts to a pandas DataFrame, a CSV, or pass them to a template.

    Format Reliability Comparison — Which Format Fails Least?

    After running thousands of structured output requests across GPT-4o, GPT-4o-mini, Claude, and Gemini, I have a rough reliability ranking. This is practical experience, not a controlled benchmark — but the patterns are consistent enough to be useful.

| Format | Reliability | Parsing Difficulty | Best For |
|---|---|---|---|
| Custom delimiters | Highest | Easiest | 3-6 simple fields |
| Markdown table | High | Easy | Tabular comparisons |
| JSON (with schema) | High | Medium | APIs, databases, structured data |
| JSON (without schema) | Medium | Medium | Quick prototyping |
| XML | Medium | Medium | Hierarchical documents |
| Free-form with structure | Low | Hard | Avoid in production |

    Custom delimiters are the most reliable because they are the simplest — the model just needs to put text between markers. JSON with an explicit schema is close behind, especially with temperature=0. Markdown tables are surprisingly reliable because models produce them constantly during training.

    Hardening Your Output Formatting

    Even the best prompts fail sometimes. Production code needs fallback strategies. These are the techniques I use to push format compliance from ~90% to ~99%.

    Technique 1: Temperature and Top-P

    For structured output, set temperature between 0.0 and 0.3. Higher temperatures increase creativity — exactly the opposite of what you want when asking for precise formatting. If you need varied content within a strict format, keep temperature at 0.2 and use top_p=0.9.

    Technique 2: Few-Shot Examples

    Showing the model a completed example is one of the most effective ways to get consistent formatting. The model mirrors the structure it sees:

    Few-shot example for consistent JSON formatting
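A sketch of a few-shot message list; the product strings and key names are invented for illustration:

```python
# A completed user/assistant exchange teaches the model the exact output shape;
# it will mirror the example's keys and value types on the real request.
messages = [
    {"role": "system", "content": "You extract product data as JSON. "
     "Respond with ONLY a JSON object, no fences, no commentary."},
    # The worked example the model will mirror:
    {"role": "user", "content": "Extract: 'SoundMax headphones, $79.99, in stock'"},
    {"role": "assistant",
     "content": '{"name": "SoundMax headphones", "price": 79.99, "in_stock": true}'},
    # The real request:
    {"role": "user", "content": "Extract: 'AquaPro water bottle, $24.50, sold out'"},
]
```

Pass this list straight to the chat completions API in place of a single user message.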

    The model follows the example's exact key names, value types, and structure. One example is usually enough for simple schemas. For complex schemas with edge cases, two or three examples work better.

    Technique 3: Retry with Correction

    When parsing fails, you can send the malformed output back to the model and ask it to fix the formatting. This succeeds roughly 95% of the time on the retry:

    Auto-retry with correction for JSON output
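A sketch of such a retry loop. It takes the model-calling function as a parameter (`ask_fn(prompt) -> str`, a convention I am assuming here) so it works with any provider and can be tested without an API key:

```python
import json

def ask_json_with_retry(ask_fn, prompt: str, max_retries: int = 2):
    """Ask for JSON; on a parse failure, send the error back for correction."""
    response = ask_fn(prompt)
    for _ in range(max_retries):
        try:
            return json.loads(response)
        except json.JSONDecodeError as err:
            # Feed the parse error and the malformed text back to the model.
            correction = (
                f"The following was supposed to be valid JSON but failed to "
                f"parse ({err}). Return ONLY the corrected JSON, nothing else:"
                f"\n\n{response}"
            )
            response = ask_fn(correction)
    return json.loads(response)  # final attempt: let any error propagate
```

Combining this with a defensive parser instead of bare json.loads() cuts down on retries even further.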

    The retry loop tries to parse the response, and if it fails, sends the error message and the malformed text back to the model for correction. This is cheap — the correction call uses very few tokens — and dramatically improves reliability.

    Common Mistakes and How to Fix Them

    These are the formatting failures I see most often in code reviews and Slack channels. Each one is easy to fix once you know the pattern.

    Vague format instruction
    prompt = "Analyze this review and give me JSON."
    Explicit schema with rules
    prompt = """Analyze this review. Return JSON:
    {"sentiment": "positive"|"negative", "score": <float 0-1>}
    Return ONLY the JSON object."""

    Without an explicit schema, the model invents its own key names every time. One call returns "sentiment", the next returns "feeling", the next returns "opinion". Your downstream code breaks on every variation.

    Trusting raw output
    data = json.loads(response)  # crashes on markdown fences
    Defensive parsing
    data = parse_llm_json(response)  # handles fences and extra text

    Even GPT-4o wraps JSON in code fences roughly 15-20% of the time, depending on the prompt. A raw json.loads() call is a ticking time bomb in any production system.

    Real-World Example: Combining Formats in a Data Pipeline

    Real-world applications rarely use just one format. A pipeline might need JSON for an API, Markdown for email, and tagged items for a task tracker — all from the same LLM call. Custom delimiters work as the outer container, with each section using whatever inner format fits best.

    Here is a practical example. Imagine you are building a tool that takes raw meeting notes and produces structured output for three different consumers:

    Multi-format pipeline: processing meeting notes
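An illustrative version of that prompt: delimiters as the outer container, with JSON, Markdown, and a line-per-item format inside (all section names and fields are examples):

```python
meeting_notes = (
    "Discussed Q3 roadmap. Sarah will draft the budget by Friday. "
    "Launch delayed to October. Tom to contact the vendor."
)

prompt = f"""Process the meeting notes below. Answer using EXACTLY this template:

===API_JSON===
{{"topics": ["..."], "decisions": ["..."]}}
===EMAIL_MARKDOWN===
## Meeting Summary
<2-3 bullet points in Markdown>
===ACTION_ITEMS===
- owner: <name> | task: <task> | due: <date or "none">

Keep the ===SECTION=== markers exactly as shown. Do not add any other text.

Notes: {meeting_notes}"""
```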
    Parsing each section for its downstream consumer
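The parsing side might look like this, run against a hand-written response in the template above so the example is self-contained:

```python
import json
import re

# A simulated model response in the multi-format template.
response = """===API_JSON===
{"topics": ["Q3 roadmap"], "decisions": ["Launch delayed to October"]}
===EMAIL_MARKDOWN===
## Meeting Summary
- Q3 roadmap reviewed
- Launch moved to October
===ACTION_ITEMS===
- owner: Sarah | task: Draft budget | due: Friday
- owner: Tom | task: Contact vendor | due: none"""

# Split on the ===SECTION=== markers; the capture group keeps the names.
parts = re.split(r"===([A-Z_]+)===", response)
sections = {parts[i].lower(): parts[i + 1].strip()
            for i in range(1, len(parts) - 1, 2)}

api_payload = json.loads(sections["api_json"])   # -> API consumer
email_body = sections["email_markdown"]          # -> email template, as-is
action_items = [                                 # -> task tracker
    dict(field.split(": ", 1) for field in line.lstrip("- ").split(" | "))
    for line in sections["action_items"].splitlines()
]
```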

    One LLM call produces output for three systems. The JSON goes to an API, the Markdown goes to an email template, and the action items go to a task tracker. The custom delimiters let us extract each section and parse it with the appropriate method.

    Frequently Asked Questions

    Should I use the OpenAI `response_format` parameter instead of prompt-based formatting?

    The response_format: { type: 'json_object' } parameter is a great option when available — it guarantees valid JSON from the API level. But it is provider-specific (OpenAI and a few others), and it still does not guarantee your schema. You get valid JSON, but the keys and structure are whatever the model decides. The prompt-based techniques in this article work across all providers and give you schema control. In practice, I use both: response_format for the structural guarantee plus a schema in the prompt for key control.

    What about Pydantic or structured outputs via function calling?

    Function calling (also called tool use) and Pydantic-based structured outputs are the most reliable way to get schema-compliant JSON. They are covered in separate tutorials. The prompt-based approach in this article is valuable because it works with any model, including open-source models via Ollama or Hugging Face that may not support function calling. It is also simpler — you do not need to define Pydantic models or tool schemas.

    How do I handle very long outputs that might get truncated?

    If your expected output is large (>2000 tokens), set max_tokens explicitly to a high enough value. If the model hits the token limit mid-JSON, the output will be truncated and unparseable. For very large structured outputs, break the task into smaller chunks — process 10 items at a time instead of 100.

    Does this work with open-source models like Llama or Mistral?

    Yes. The prompt patterns work with any instruction-tuned model. Smaller models (7B-13B) are less reliable at following complex schemas, so I recommend simpler formats (custom delimiters or flat JSON) with smaller models and save nested JSON/XML for larger models (70B+ or GPT-4 class).

    References

  • OpenAI API documentation — Chat Completions: response_format. Link
  • OpenAI documentation — Structured Outputs. Link
  • Python documentation — json module. Link
  • Python documentation — xml.etree.ElementTree. Link
  • Anthropic documentation — Tool Use (Structured Output). Link
  • Google Gemini API — Structured Output. Link