LLM Streaming in Python: Build Token-by-Token Output with SSE and Async Generators
You call an LLM API, wait four seconds, and a wall of text appears all at once. Your user stares at a blank screen the entire time, wondering if the app crashed. Meanwhile, ChatGPT starts printing words immediately — each token arriving as the model generates it. That fluid, typewriter-style output isn't magic. It's streaming, and you can build it yourself in about 30 lines of Python.
This tutorial covers streaming end-to-end. You'll learn how the SSE protocol works under the hood, how to consume OpenAI streaming responses, and how to build async generators for token delivery. The final sections show how to expose a streaming endpoint with FastAPI. Every example that can run in the browser is runnable — hit Run and watch tokens arrive one by one.
Why Streaming Matters for LLM Applications
LLMs generate text one token at a time internally, but the default API behavior waits until every token is generated before sending the full response. For a 500-token answer at ~50 tokens per second, that's a 10-second blank screen. Streaming flips that default: each token is sent to your client the moment it's produced.
I timed both approaches on the same prompt: the non-streaming call took 6.2 seconds before any text appeared. The streaming version showed the first word in 0.3 seconds. Same total time — but the user experience is night and day.
Three concrete reasons to stream in production. First, perceived latency drops by 10-20x because users see immediate output. Second, you can cancel generation early if the first sentence is off-topic. Third, you can process tokens as they arrive for real-time translation, content filtering, or progress indicators.
Streaming with the OpenAI API
The fastest way to see streaming in action is the OpenAI Python client. One parameter changes everything: stream=True. Instead of returning a complete response object, the API returns an async iterator that yields chunks as the model generates them.
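Here's a minimal sketch of the pattern. It assumes the `openai` SDK is installed and `OPENAI_API_KEY` is set in your environment; the model name is a placeholder — any chat model works.

```python
import asyncio

async def stream_completion(prompt: str) -> None:
    # Requires `pip install openai` and OPENAI_API_KEY in the environment.
    from openai import AsyncOpenAI

    client = AsyncOpenAI()
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # the one parameter that switches on streaming
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # first and last chunks carry no content
            print(delta, end="", flush=True)
    print()

# asyncio.run(stream_completion("Explain SSE in one sentence."))
```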
Run that and watch the text appear word by word. The key line is stream=True in the API call. Without it, you get a single ChatCompletion object after the model finishes. With it, you get an async iterator of ChatCompletionChunk objects — one per token (or small group of tokens).
Each chunk has a .choices[0].delta object instead of .choices[0].message. The delta contains only the new content for that chunk — usually one or a few tokens. The first chunk's delta often has a role field ("assistant") but no content. The last chunk has an empty delta, signaling the stream is done.
Collecting Streamed Tokens into a Complete Response
Printing tokens to the console is a demo. In production, you almost always need the complete response too — for logging, for database storage, for passing to the next step in your pipeline. The pattern is simple: accumulate tokens in a list while you stream them.
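The accumulate-while-streaming pattern looks like this — sketched against a simulated token source so it runs anywhere; with a real OpenAI stream you'd append `chunk.choices[0].delta.content` instead.

```python
import asyncio

async def fake_stream():
    # Simulated token source standing in for the OpenAI chunk iterator.
    for token in ["Stream", "ing ", "is ", "simple."]:
        await asyncio.sleep(0.01)
        yield token

async def consume() -> str:
    parts = []  # accumulate in a list, join once at the end
    async for token in fake_stream():
        print(token, end="", flush=True)  # show the token immediately...
        parts.append(token)               # ...and keep it for later
    print()
    return "".join(parts)  # single O(n) join

full_response = asyncio.run(consume())
```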
I use a list and "".join() instead of string concatenation (full_response += token) because joining a list is O(n) while repeated string concatenation is O(n^2). With 500+ tokens, the performance difference is measurable.
How Server-Sent Events (SSE) Work Under the Hood
When you set stream=True, what actually travels over the network? The answer is Server-Sent Events — a simple HTTP protocol where the server keeps the connection open and pushes text messages to the client. It's the same protocol that powers ChatGPT's streaming output, GitHub's notification feeds, and stock ticker updates.
SSE is simpler than WebSockets. A regular HTTP request opens a connection, sends a response, and closes. With SSE, the server sends the headers (including Content-Type: text/event-stream), then keeps the connection open and writes data lines whenever it has new content. The client reads these lines as they arrive.
Each SSE message follows a dead-simple format:
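A hypothetical three-token stream looks like this on the wire:

```
data: {"content": "Hello"}

data: {"content": " world"}

data: [DONE]
```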
Every line starts with data: followed by a JSON payload. A blank line separates messages. The final data: [DONE] signals the stream is complete. That's the entire protocol — no binary framing, no handshake, no upgrade negotiation. Just text over HTTP.
Without streaming:

Client: GET /api/chat
Server: ...processing 6 seconds...
Server: 200 OK
        {"content": "Here is the full 500-token answer..."}

With streaming:

Client: GET /api/chat
Server: 200 OK (Content-Type: text/event-stream)
Server: data: {"content": "Here"}
Server: data: {"content": " is"}
Server: data: {"content": " the"}
... (tokens arrive every ~20ms) ...
Server: data: [DONE]

Async Generators — The Python Pattern Behind Streaming
OpenAI's streaming API returns an async iterator. But what if you want to build your own streaming pipeline? Maybe you need to transform tokens before sending them, or you're building a custom LLM backend. The Python primitive for this is the async generator — a function that yields values one at a time, asynchronously.
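Here's a minimal sketch: a fake model that yields words with a small delay, consumed with async for. The word list and delay are stand-ins for a real backend.

```python
import asyncio

async def fake_llm_stream(text: str):
    # An async generator: `async def` + `yield` = values produced one at a time.
    for word in text.split():
        await asyncio.sleep(0.02)  # simulate per-token generation time
        yield word + " "

async def main() -> str:
    parts = []
    async for token in fake_llm_stream("tokens arrive one at a time"):
        print(token, end="", flush=True)
        parts.append(token)
    print()
    return "".join(parts)

result = asyncio.run(main())
```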
The function uses async def and yield — making it an async generator. Each yield pauses the function and sends a value to whoever is consuming it with async for. The await asyncio.sleep() simulates the time a real model takes to generate each token.
This pattern is powerful because the consumer controls the pace. If the consumer is slow (say, writing to a database), the generator pauses at the yield until the consumer asks for the next value. No buffering, no overflow, no dropped tokens.
Create an async generator function called word_count_stream that takes an async token stream and yields each token along with a running word count.
The function should:
1. Accept an async iterable of strings as input
2. Keep a running count of total words seen so far
3. Yield tuples of (token, cumulative_word_count) for each token
Count words by splitting each token on whitespace and counting non-empty parts.
Building a Token Processing Pipeline
In production, you rarely just print tokens. You need to filter them, transform them, log them, or route them to multiple consumers. Async generators compose beautifully — you can chain them together like Unix pipes, where each stage processes tokens from the previous one.
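A sketch of a two-stage pipeline — token_counter and content_filter are illustrative stage names, and the source tokens and blocklist are made up for the demo.

```python
import asyncio

async def source(tokens):
    for t in tokens:
        await asyncio.sleep(0)  # yield control, as a real network stream would
        yield t

async def token_counter(stream):
    # Stage 1: attach a running index to each token (adds metadata).
    count = 0
    async for token in stream:
        count += 1
        yield count, token

async def content_filter(stream, blocklist):
    # Stage 2: mask tokens that appear in the blocklist.
    async for count, token in stream:
        yield count, ("***" if token in blocklist else token)

async def run_pipeline():
    stream = source(["public", "secret", "data"])
    pipeline = content_filter(token_counter(stream), blocklist={"secret"})
    # Nothing above runs until this loop pulls tokens through the chain.
    return [item async for item in pipeline]

items = asyncio.run(run_pipeline())
print(items)
```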
Each function takes a stream as input and yields transformed items. token_counter adds metadata. content_filter modifies tokens that match a blocklist. The pipeline is lazy — nothing runs until the final async for loop pulls the first token through the entire chain.
This lazy evaluation is exactly how production streaming systems work. The pipeline only processes one token at a time, keeping memory usage constant regardless of response length. I've used this pattern to stream 100K+ token responses without any memory issues.
Building a FastAPI SSE Streaming Endpoint
So far we've consumed streaming responses and built async generators. The missing piece is serving a streaming endpoint — letting your frontend receive tokens over HTTP. FastAPI combined with sse-starlette makes this surprisingly clean.
The structure mirrors exactly what we built earlier. generate_tokens is an async generator that wraps the OpenAI stream. EventSourceResponse from sse-starlette handles the SSE formatting — it takes our generator and converts each yielded string into a properly formatted data: line with the right headers.
Handling Connection Drops and Errors
Real networks are unreliable. WiFi drops, proxies time out, and API rate limits kick in mid-stream. A production streaming system needs to handle all of these without losing data or leaving the user confused.
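One way to sketch the retry logic. The exception classes here are local stand-ins for openai.APIConnectionError and openai.RateLimitError so the example runs without the SDK, and the demo uses a tiny initial delay instead of 2 seconds.

```python
import asyncio

class APIConnectionError(Exception):  # stand-in for openai.APIConnectionError
    pass

class RateLimitError(Exception):      # stand-in for openai.RateLimitError
    pass

async def stream_with_retry(make_stream, max_retries=3, initial_delay=2.0):
    """Restart the stream on transient errors, doubling the wait each time."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            parts = []
            async for token in make_stream():
                parts.append(token)
            return "".join(parts)
        except (APIConnectionError, RateLimitError):
            if attempt == max_retries:
                raise  # out of retries: surface the error
            await asyncio.sleep(delay)  # exponential backoff: 2s, 4s, 8s...
            delay *= 2

# Demo: a stream that drops twice, then succeeds.
calls = {"n": 0}

async def flaky_stream():
    calls["n"] += 1
    if calls["n"] < 3:
        raise APIConnectionError("connection dropped mid-stream")
    for token in ["recovered ", "response"]:
        yield token

result = asyncio.run(stream_with_retry(flaky_stream, initial_delay=0.01))
print(result)
```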
This implementation catches two common failures: network drops (APIConnectionError) and rate limits (RateLimitError). Each retry doubles the wait time — 2 seconds, then 4, then 8. This exponential backoff prevents hammering the API when it's struggling.
One limitation: if the connection drops mid-stream, the retry starts a fresh generation from scratch. The LLM has no memory of what it already produced. For long responses, you could include the already-collected tokens in a follow-up prompt with instructions to continue, but that adds complexity and cost.
Real-World Example: Streaming AI Summary with Progress
Let's combine everything into a practical example: a streaming summarizer that shows real-time progress as it generates. This is the kind of component you'd embed in a dashboard or internal tool.
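A sketch with a simulated stream — the fake summary text and delay are placeholders; swap in a real OpenAI stream for production use.

```python
import asyncio
import time

async def fake_summary_stream():
    # Stand-in for a real model stream.
    for word in "A short simulated summary of the document".split():
        await asyncio.sleep(0.01)
        yield word + " "

async def summarize_with_progress():
    start = time.perf_counter()
    parts = []
    async for token in fake_summary_stream():
        parts.append(token)
        elapsed = time.perf_counter() - start
        rate = len(parts) / elapsed if elapsed > 0 else 0.0
        # \r rewrites the status line in place on each token
        print(f"\r{len(parts)} tokens | {elapsed:.1f}s | {rate:.0f} tok/s",
              end="", flush=True)
    print()
    return "".join(parts), len(parts)

summary, token_count = asyncio.run(summarize_with_progress())
```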
The summarizer tracks three metrics as it streams: token count, elapsed time, and throughput in tokens per second. These stats are invaluable for monitoring in production — a sudden drop in tokens/sec usually signals API congestion or network issues.
Common Mistakes and How to Fix Them
I've debugged streaming issues in enough projects to recognize the usual suspects. These three mistakes cause the most confusion.
Mistake 1: Forgetting to Flush Output
Python's stdout is line-buffered when attached to a terminal, and end="" suppresses the newline that would normally trigger a flush. Without flush=True, the entire response accumulates in the buffer and dumps at the end — defeating the purpose of streaming.
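Here's the fix in miniature — writing to a StringIO so the output can be inspected programmatically; on a real terminal the effect is visual.

```python
import io
import sys
import time

def stream_print(tokens, out=sys.stdout):
    for token in tokens:
        # flush=True pushes each token out immediately instead of buffering
        print(token, end="", file=out, flush=True)
        time.sleep(0.01)
    print(file=out)

buf = io.StringIO()
stream_print(["Hello", ",", " ", "world"], out=buf)
```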
Mistake 2: Using the Non-Streaming Response Pattern
# This crashes with AttributeError
async for chunk in stream:
    text = chunk.choices[0].message.content
    print(text)

# Correct: chunks use .delta, not .message
async for chunk in stream:
    text = chunk.choices[0].delta.content
    if text:
        print(text, end="", flush=True)

Streaming chunks use .delta instead of .message. The delta contains only the new content — it's an incremental update, not the full message. Also note the if text: guard — some chunks have None content (especially the first and last chunks).
Mistake 3: Not Handling the Empty Final Chunk
The first chunk carries the role assignment but no content. The last chunk has an empty delta but includes the finish_reason. Accessing .content without checking for None first will either print "None" to your output or crash your pipeline.
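A defensive loop, shown against fake chunk objects that mimic the shapes described above (SimpleNamespace stand-ins, not real SDK types):

```python
import asyncio
from types import SimpleNamespace as NS

async def fake_chunks():
    # role-only first chunk, one content chunk, empty final chunk
    yield NS(choices=[NS(delta=NS(content=None), finish_reason=None)])
    yield NS(choices=[NS(delta=NS(content="Hello"), finish_reason=None)])
    yield NS(choices=[NS(delta=NS(content=None), finish_reason="stop")])

async def collect():
    parts = []
    async for chunk in fake_chunks():
        choice = chunk.choices[0]
        if choice.delta.content:  # guard: skip None-content chunks
            parts.append(choice.delta.content)
        if choice.finish_reason == "stop":
            break  # the final chunk carries finish_reason, not content
    return "".join(parts)

text = asyncio.run(collect())
print(text)
```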
Create a function called parse_sse_messages that takes a raw SSE string and returns a list of parsed data payloads.
The function should:
1. Split the input on double newlines to separate messages
2. For each message, extract the content after data:
3. Skip the final data: [DONE] message
4. Parse each remaining data string as JSON
5. Return a list of the parsed dictionaries
Performance Tips and When Not to Stream
Streaming isn't free, and it's not always the right choice. Understanding the tradeoffs helps you decide when to use it.
When streaming hurts more than it helps. If your application needs to parse the complete response as structured JSON (like function calling results), streaming adds complexity with no user-facing benefit. The downstream parser can't start until all tokens arrive anyway. Same for batch processing pipelines — streaming 10,000 prompts sequentially is slower than sending them in parallel without streaming.
Connection overhead. Each SSE connection holds an open HTTP connection for the entire generation. A server handling 1,000 concurrent streams needs 1,000 open connections. This is fine with async frameworks like FastAPI + uvicorn, but would choke a threaded server like Flask without careful tuning.
Token buffering. Some proxy servers and CDNs buffer responses before forwarding them. If you deploy behind Cloudflare or nginx, you may need to disable response buffering (send the X-Accel-Buffering: no response header for nginx, or disable buffering in your Cloudflare settings); otherwise tokens batch up and arrive in clumps instead of individually.
Frequently Asked Questions
Can I stream with the synchronous OpenAI client?
Yes. The synchronous OpenAI() client supports streaming too — use a regular for loop instead of async for:
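A sketch, assuming the `openai` SDK and an API key in the environment; the model name is a placeholder.

```python
def stream_sync(prompt: str) -> None:
    # Requires `pip install openai` and OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:  # plain `for` — no asyncio required
        text = chunk.choices[0].delta.content
        if text:
            print(text, end="", flush=True)
    print()

# stream_sync("Explain SSE in one sentence.")
```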
Does streaming cost more tokens than non-streaming?
No. Token usage is identical. Streaming changes how tokens are delivered, not how many are generated. The same prompt with stream=True and stream=False produces the same number of tokens and costs the same amount. One caveat: the usage statistics (prompt_tokens, completion_tokens) aren't included in individual chunks. You'll need to add stream_options={"include_usage": True} to get them in the final chunk.
How do I stream with Anthropic or Google Gemini?
Both follow the same async iterator pattern. With Anthropic's Claude, use async with client.messages.stream() and iterate over text events. With Google's Gemini, pass stream=True to generate_content() and iterate over the response chunks. The core concept — async iteration over chunks — is identical across all major providers.