Output Formatting Masterclass: Make LLMs Output JSON, XML, Markdown, and Custom Formats
You ask an LLM to return a list of products as JSON. It responds with a lovely paragraph of prose, a few bullet points, and maybe some JSON buried inside a markdown code fence. Your json.loads() call explodes. You have been there — I certainly have, probably a hundred times before I figured out the patterns that actually work.
This tutorial shows you how to make LLMs reliably output JSON, XML, Markdown tables, and custom-delimited formats. More importantly, you will learn how to parse each format safely and what to do when the model inevitably gets creative with your instructions.
Why Output Format Matters More Than You Think
Every time you build an application on top of an LLM, you hit the same wall: the model produces text, but your code needs data. A chatbot can get away with free-form responses. But the moment you need to feed the LLM's answer into a database, render it in a UI, or chain it into another API call, you need structure.
I think of output formatting as the contract between your prompt and your parser. Get the contract right and your pipeline runs cleanly. Get it wrong and you spend more time writing error-recovery code than you spent on the actual feature.
The formats we will cover, and when each one shines:

- JSON: the default for APIs, databases, and anything another program consumes
- XML: deeply hierarchical data, where attributes and nesting carry meaning
- Custom delimiters: the simplest, most reliable option for extracting a handful of fields
- Markdown (especially tables): output that must be both parseable and human-readable
Throughout the examples, I set temperature=0.2. Lower temperature means less creative variation, which is exactly what you want when you need the model to follow a strict format. We will talk more about when to adjust this later.
JSON via Prompting — The Most Common Format
JSON is the workhorse of structured LLM output. Every language can parse it, every API speaks it, and models are trained on enormous amounts of it. But asking for JSON and getting clean JSON are two different things.
The naive approach — just saying "respond in JSON" — fails more often than you would expect. The model wraps the JSON in a markdown code fence, adds an introductory sentence, or produces keys that do not match your schema. Here is a prompt pattern I have found reliable across GPT-4o, Claude, and Gemini:
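A sketch of such a prompt (the product schema here is a made-up example; adapt the fields to your task):

```python
# Illustrative prompt template; the product schema is an assumption for this example.
PRODUCT_PROMPT = """Extract the products mentioned in the text below.

Return JSON matching EXACTLY this schema:
{
  "products": [
    {
      "name": "<string>",
      "category": "electronics"|"clothing"|"food"|"other",
      "price": <float or null>
    }
  ]
}

Rules:
- Return ONLY the JSON object.
- Do NOT wrap the output in markdown code fences.
- Do NOT add any explanation before or after the JSON.

Text: {text}"""

def build_prompt(text: str) -> str:
    # Plain string replacement so the JSON braces in the template stay literal
    return PRODUCT_PROMPT.replace("{text}", text)
```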
Three things make this prompt work: showing the exact schema (not just describing it), using the pipe notation for enumerations, and the explicit "no markdown fences" instruction. Without that last rule, most models wrap the output in a markdown code fence, which breaks json.loads().
Parsing JSON Safely
Even with a good prompt, you should never trust that the response is valid JSON. Models occasionally add trailing commas, use single quotes, or sneak in a comment. Here is a defensive parsing function that handles the most common failure modes:
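A sketch of such a function, standard library only (your production version may need more cases):

```python
import json
import re

def parse_llm_json(response: str):
    """Defensively parse JSON out of an LLM response."""
    text = response.strip()

    # 1. Strip a markdown code fence if present
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        text = fence.group(1).strip()

    # 2. Try a direct parse first; this is the happy path
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # 3. Fall back to the outermost {...} or [...] span, ignoring
    #    any prose the model added before or after it
    for open_ch, close_ch in (("{", "}"), ("[", "]")):
        start = text.find(open_ch)
        end = text.rfind(close_ch)
        if start != -1 and end > start:
            try:
                return json.loads(text[start : end + 1])
            except json.JSONDecodeError:
                continue

    raise ValueError(f"No parseable JSON found in response: {text[:200]!r}")
```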
This parser handles three common issues: markdown code fences around the JSON, extra text before or after the JSON object, and JSON arrays instead of objects. I use a version of this function in every LLM project I build.
Batch JSON — Multiple Objects in One Call
Sometimes you need the model to return multiple structured objects — say, analyzing a batch of reviews at once. Asking for a JSON array works, but you need to be explicit about it:
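For example, with three reviews and an illustrative sentiment schema:

```python
# The reviews and the schema are illustrative placeholders.
REVIEWS = [
    "Great phone, battery lasts all day.",
    "Screen cracked after a week. Disappointed.",
    "Decent value for the price.",
]

numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(REVIEWS))

batch_prompt = f"""Analyze each review below. Return a JSON ARRAY with one object
per review, in the same order, matching EXACTLY this schema:

[
  {{"review_number": <int>, "sentiment": "positive"|"negative"|"neutral", "score": <float 0-1>}}
]

Return ONLY the JSON array, with no prose and no markdown fences.

Reviews:
{numbered}"""
```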
The model returns a list of three objects, each with the exact keys we specified. Batch processing like this is cheaper than making three separate API calls because you pay for fewer prompt tokens (the system message and instructions are only sent once).
Practice exercise: write a function extract_fields(text) that takes a string containing JSON (possibly wrapped in markdown code fences) and returns a dictionary with only the keys "name", "age", and "city". If the JSON contains extra keys, ignore them. If any of the three required keys are missing, set their value to None.
The function should:
1. Strip markdown code fences if present
2. Parse the JSON
3. Return a dict with exactly three keys: name, age, city
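One possible solution sketch:

```python
import json
import re

def extract_fields(text: str) -> dict:
    # 1. Strip markdown code fences if present
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        text = fence.group(1)

    # 2. Parse the JSON
    data = json.loads(text.strip())

    # 3. Keep exactly the three required keys; .get() defaults missing ones to None
    return {key: data.get(key) for key in ("name", "age", "city")}
```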
XML for Hierarchical Data
JSON is great for flat or lightly nested structures. But when your data is deeply hierarchical — think a document outline, a conversation tree, or configuration with attributes and metadata — XML can actually be a better fit. This might sound old-fashioned, but there is a practical reason: XML tags are self-closing and unambiguous, which makes partial or malformed output easier to recover from.
The key advantage of XML over JSON for LLM output is that attributes and content are separate. A JSON object uses the same mechanism (keys) for metadata and data. An XML element can carry metadata in attributes and data in its text content. That distinction matters when you are building document-processing pipelines.
Parsing XML in Python is straightforward with the built-in xml.etree.ElementTree module — no pip install needed:
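A short example, using a made-up company schema:

```python
import xml.etree.ElementTree as ET

# Sample of the kind of XML we might ask the model for; the schema is illustrative.
xml_output = """<company>
  <department name="Engineering" headcount="42">
    <role>Backend Developer</role>
    <role>SRE</role>
  </department>
  <department name="Design" headcount="7">
    <role>UX Designer</role>
  </department>
</company>"""

root = ET.fromstring(xml_output)
departments = []
for dept in root.findall("department"):
    departments.append({
        "name": dept.get("name"),            # metadata lives in attributes
        "headcount": int(dept.get("headcount")),
        "roles": [role.text for role in dept.findall("role")],  # data lives in text content
    })
```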
Notice how cleanly the attributes map to metadata (department name, headcount) while the nesting represents the hierarchy (company contains departments, departments contain roles). This would work in JSON too, but the XML version reads more naturally when the hierarchy is the point.
Custom Delimiters — The Simplest Reliable Format
I reach for custom delimiters when I need to extract 3-6 fields and do not want the overhead of JSON schema definitions. The triple-equals pattern (===FIELD===) works well because it is visually distinctive, unlikely to appear in normal text, and trivial to parse with a regex or simple string split.
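A sketch of a parser for this pattern:

```python
import re

def parse_delimited(text: str) -> dict:
    """Parse ===FIELD=== delimited output into a dict of field -> content."""
    fields = {}
    current_field = None
    buffer = []

    for line in text.splitlines():
        marker = re.fullmatch(r"===([A-Z_]+)===", line.strip())
        if marker:
            # Hitting a new marker: save the previous field first
            if current_field is not None:
                fields[current_field] = "\n".join(buffer).strip()
            current_field = marker.group(1).lower()
            buffer = []
        elif current_field is not None:
            buffer.append(line)

    # Save the last field once the text runs out
    if current_field is not None:
        fields[current_field] = "\n".join(buffer).strip()
    return fields
```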
The parser walks through the lines, tracking which field it is currently inside. When it hits a new ===FIELD=== marker, it saves the previous field and starts a new one. This handles multi-line values gracefully — if the model puts a paragraph between two markers, you get the whole paragraph.
Practice exercise: write a function parse_sections(text) that parses text using ###SECTION_NAME### delimiters (note: three # on each side). The function should return a dictionary mapping lowercase section names to their content (stripped of leading/trailing whitespace). Ignore any text before the first delimiter.
Example input:
###TITLE###
My Report
###SUMMARY###
This is a brief summary.
It has two lines.
###END###

Expected output: {"title": "My Report", "summary": "This is a brief summary.\nIt has two lines."}
The ###END### marker signals the end of parsing — do not include it as a key.
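A solution sketch, using the same line-walking idea as the ===FIELD=== parser:

```python
import re

def parse_sections(text: str) -> dict:
    sections = {}
    current = None
    buffer = []

    for line in text.splitlines():
        marker = re.fullmatch(r"###([A-Z_]+)###", line.strip())
        if marker:
            # Save whatever section we were in before this marker
            if current is not None:
                sections[current] = "\n".join(buffer).strip()
            if marker.group(1) == "END":
                break  # END terminates parsing and is never a key
            current = marker.group(1).lower()
            buffer = []
        elif current is not None:  # ignore text before the first delimiter
            buffer.append(line)
    else:
        # Input ended without ###END###; save the section in progress
        if current is not None:
            sections[current] = "\n".join(buffer).strip()

    return sections
```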
Markdown — Structured Output That Humans Can Read
Markdown sits in a sweet spot: it has enough structure for light parsing (headers, tables, lists) but is still comfortable for humans to read. When your output needs to be both machine-parseable and directly displayable — think reports, summaries, or documentation — Markdown is the right choice.
Markdown tables are especially useful. They are the one format where I find models are almost always reliable, probably because LLMs have seen millions of Markdown tables in training data.
Parsing a Markdown table into a list of dictionaries takes just a few lines:
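For example (assuming well-formed rows with leading and trailing pipes):

```python
def parse_markdown_table(table: str) -> list[dict]:
    rows = [line.strip() for line in table.strip().splitlines() if line.strip()]
    # Split each row on pipes, dropping the empty edges from the leading/trailing |
    cells = [[c.strip() for c in row.strip("|").split("|")] for row in rows]
    headers = cells[0]
    # cells[1] is the |---|---| separator row, so skip it
    return [dict(zip(headers, row)) for row in cells[2:]]
```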
Each row becomes a dictionary with the header names as keys. This is handy when you need to render the data in a different format — you could convert these dicts to a pandas DataFrame, a CSV, or pass them to a template.
Format Reliability Comparison — Which Format Fails Least?
After running thousands of structured output requests across GPT-4o, GPT-4o-mini, Claude, and Gemini, I have a rough reliability ranking. This is practical experience, not a controlled benchmark — but the patterns are consistent enough to be useful.
| Format | Reliability | Parsing Difficulty | Best For |
|---|---|---|---|
| Custom delimiters | Highest | Easiest | 3-6 simple fields |
| Markdown table | High | Easy | Tabular comparisons |
| JSON (with schema) | High | Medium | APIs, databases, structured data |
| JSON (without schema) | Medium | Medium | Quick prototyping |
| XML | Medium | Medium | Hierarchical documents |
| Free-form with structure | Low | Hard | Avoid in production |
Custom delimiters are the most reliable because they are the simplest — the model just needs to put text between markers. JSON with an explicit schema is close behind, especially with temperature=0. Markdown tables are surprisingly reliable because models produce them constantly during training.
Hardening Your Output Formatting
Even the best prompts fail sometimes. Production code needs fallback strategies. These are the techniques I use to push format compliance from ~90% to ~99%.
Technique 1: Temperature and Top-P
For structured output, set temperature between 0.0 and 0.3. Higher temperatures increase creativity — exactly the opposite of what you want when asking for precise formatting. If you need varied content within a strict format, keep temperature at 0.2 and use top_p=0.9.
Technique 2: Few-Shot Examples
Showing the model a completed example is one of the most effective ways to get consistent formatting. The model mirrors the structure it sees:
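For example (the contact-extraction schema is illustrative):

```python
# One worked example in the prompt; the model mirrors its keys and value types.
few_shot_prompt = """Extract contact info as JSON.

Example input: "Reach out to Jane Doe at jane@example.com or 555-0100."
Example output: {"name": "Jane Doe", "email": "jane@example.com", "phone": "555-0100"}

Now process this input. Return ONLY the JSON object:
"Contact Bob Smith via bob@example.org."
"""
```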
The model follows the example's exact key names, value types, and structure. One example is usually enough for simple schemas. For complex schemas with edge cases, two or three examples work better.
Technique 3: Retry with Correction
When parsing fails, you can send the malformed output back to the model and ask it to fix the formatting. This succeeds roughly 95% of the time on the retry:
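A sketch of such a retry loop; call_model here is a placeholder for whatever function sends a prompt to your LLM and returns its text:

```python
import json

def parse_with_retry(response_text: str, call_model, max_retries: int = 2):
    """Try to parse JSON; on failure, ask the model to fix its own output.

    `call_model` is a stand-in for your real API client: it takes a prompt
    string and returns the model's reply as a string.
    """
    text = response_text
    for attempt in range(max_retries + 1):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            if attempt == max_retries:
                raise
            # Send the error message and the malformed output back for correction
            text = call_model(
                f"The following text was supposed to be valid JSON but fails "
                f"to parse with this error: {err}\n\n"
                f"Text:\n{text}\n\n"
                f"Return ONLY the corrected JSON, nothing else."
            )
```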
The retry loop tries to parse the response, and if it fails, sends the error message and the malformed text back to the model for correction. This is cheap — the correction call uses very few tokens — and dramatically improves reliability.
Common Mistakes and How to Fix Them
These are the formatting failures I see most often in code reviews and Slack channels. Each one is easy to fix once you know the pattern.
Mistake 1: asking for JSON without a schema.

```python
# Bad: no schema, so the model picks its own keys
prompt = "Analyze this review and give me JSON."

# Good: an explicit schema pins down the keys and value types
prompt = """Analyze this review. Return JSON:
{"sentiment": "positive"|"negative", "score": <float 0-1>}
Return ONLY the JSON object."""
```

Without an explicit schema, the model invents its own key names every time. One call returns "sentiment", the next returns "feeling", the next returns "opinion". Your downstream code breaks on every variation.
Mistake 2: trusting json.loads() directly.

```python
data = json.loads(response)      # crashes on markdown fences
data = parse_llm_json(response)  # handles fences and extra text
```

Even GPT-4o wraps JSON in code fences roughly 15-20% of the time, depending on the prompt. A raw json.loads() call is a ticking time bomb in any production system.
Real-World Example: Combining Formats in a Data Pipeline
Real-world applications rarely use just one format. A pipeline might need JSON for an API, Markdown for email, and tagged items for a task tracker — all from the same LLM call. Custom delimiters work as the outer container, with each section using whatever inner format fits best.
Here is a practical example. Imagine you are building a tool that takes raw meeting notes and produces structured output for three different consumers:
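A sketch of what that can look like; the section names and metrics schema are illustrative:

```python
import json
import re

# Outer container: ===SECTION=== delimiters. Inner formats: JSON, Markdown, plain lines.
pipeline_prompt = """Summarize the meeting notes below into three sections:

===METRICS===
A JSON object: {"decisions_made": <int>, "attendees": <int>, "follow_up_needed": <true|false>}

===EMAIL_SUMMARY===
A short Markdown summary with a header and bullet points.

===ACTION_ITEMS===
One action item per line, formatted as: OWNER: task description

Use the ===SECTION=== markers exactly as shown. Meeting notes:
{notes}"""

def split_sections(text: str) -> dict:
    # re.split with a capture group yields [before, name1, body1, name2, body2, ...]
    parts = re.split(r"^===([A-Z_]+)===$", text, flags=re.MULTILINE)
    return {parts[i].lower(): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

# Simulated model response, for demonstration
fake_response = """===METRICS===
{"decisions_made": 2, "attendees": 5, "follow_up_needed": true}
===EMAIL_SUMMARY===
## Weekly Sync
- Shipped the parser
===ACTION_ITEMS===
ANA: update the roadmap"""

sections = split_sections(fake_response)
metrics = sections["metrics"] and json.loads(sections["metrics"])  # -> API
email_md = sections["email_summary"]                               # -> email template
actions = sections["action_items"].splitlines()                    # -> task tracker
```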
One LLM call produces output for three systems. The JSON goes to an API, the Markdown goes to an email template, and the action items go to a task tracker. The custom delimiters let us extract each section and parse it with the appropriate method.
Frequently Asked Questions
Should I use the OpenAI `response_format` parameter instead of prompt-based formatting?
The response_format: { type: 'json_object' } parameter is a great option when available — it guarantees valid JSON from the API level. But it is provider-specific (OpenAI and a few others), and it still does not guarantee your schema. You get valid JSON, but the keys and structure are whatever the model decides. The prompt-based techniques in this article work across all providers and give you schema control. In practice, I use both: response_format for the structural guarantee plus a schema in the prompt for key control.
What about Pydantic or structured outputs via function calling?
Function calling (also called tool use) and Pydantic-based structured outputs are the most reliable way to get schema-compliant JSON. They are covered in separate tutorials. The prompt-based approach in this article is valuable because it works with any model, including open-source models via Ollama or Hugging Face that may not support function calling. It is also simpler — you do not need to define Pydantic models or tool schemas.
How do I handle very long outputs that might get truncated?
If your expected output is large (>2000 tokens), set max_tokens explicitly to a high enough value. If the model hits the token limit mid-JSON, the output will be truncated and unparseable. For very large structured outputs, break the task into smaller chunks — process 10 items at a time instead of 100.
Does this work with open-source models like Llama or Mistral?
Yes. The prompt patterns work with any instruction-tuned model. Smaller models (7B-13B) are less reliable at following complex schemas, so I recommend simpler formats (custom delimiters or flat JSON) with smaller models and save nested JSON/XML for larger models (70B+ or GPT-4 class).