
OpenAI API Crash Course: Chat Completions, Streaming, and Error Handling in Python

Intermediate · 90 min · 3 exercises · 55 XP

You've built a chatbot prototype in ChatGPT. The demo went well. Now a client wants it embedded in their app — custom system prompts, streaming output, usage billing, and retry logic when OpenAI's servers hiccup at 2 AM. The ChatGPT interface can't do any of that — the API can.

This tutorial takes you from your first client.chat.completions.create() call to a production-ready ChatClient class with streaming, cost tracking, and exponential backoff. Every code block runs directly in your browser — hit Run and see the results.

Your First API Call

Let's start by making a real API call. I'll show you working code first, then we'll break it down piece by piece.

Your first API call — install, connect, ask

The exact wording will differ every time you run it. Three lines of actual logic — create client, make request, print result — and you have a working AI assistant.

Anatomy of the Response Object

That response object contains more than just the AI's answer. Understanding its structure saves you hours of confused debugging later.

Inspecting the response object
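To make the structure concrete, here is a sketch of the fields you'll use most often. The `summarize` helper is illustrative; the stand-in object below only mimics the shape of a real openai 1.x response so the example runs without an API key:

```python
from types import SimpleNamespace


def summarize(response) -> dict:
    """Pull the fields you'll actually use out of a chat completion response."""
    choice = response.choices[0]
    return {
        "content": choice.message.content,
        "finish_reason": choice.finish_reason,  # "stop" = complete, "length" = truncated
        "model": response.model,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
    }


# A stand-in with the same attribute names as a real response:
fake = SimpleNamespace(
    model="gpt-4o-mini",
    usage=SimpleNamespace(prompt_tokens=12, completion_tokens=30),
    choices=[SimpleNamespace(
        finish_reason="stop",
        message=SimpleNamespace(content="An API lets programs talk to each other."),
    )],
)
print(summarize(fake))
```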

Four moving parts drive every API call. The client connects to OpenAI's servers. The model (like gpt-4o-mini) determines capability and cost. The messages list carries your conversation. And the response wraps the AI's answer plus metadata.

The finish_reason field tells you why the model stopped generating. "stop" means it completed normally. "length" means it hit the token limit and your response got cut off — always check this in production code.

System Messages — Programming AI Behavior Without Code

Here's a problem you'll hit immediately. You ask the AI to explain a for loop, and it writes a 500-word essay with advanced examples. You wanted a beginner-friendly two-liner. How do you control the style without rewriting your question every time?

The messages list supports three roles:

  • `"system"` — Background instructions that shape every response. The end user never sees this, but it controls tone, format, and scope.
  • `"user"` — The question or input from the person using your app.
  • `"assistant"` — Previous AI responses (for multi-turn conversations).
Watch how the system message transforms the AI's output. Same question, completely different response:

    Without a system message
    With a system message

    The system message didn't change what the AI knows — it changed how it communicates. Every production AI app I've built uses a system message. Think of it as programming the AI's behavior without writing any extra logic.

    Temperature — Controlling Creativity vs. Precision

    Ask the AI the same question three times and you'll get three different answers. Sometimes that's great — brainstorming, creative writing. Other times you need identical output every time — code generation, data extraction. The temperature parameter controls this tradeoff.

| Temperature | Behavior | Best for |
| --- | --- | --- |
| 0.0 | Deterministic — same input gives nearly identical output | Code generation, data extraction, factual Q&A |
| 0.5 | Balanced — some variation, mostly predictable | Tutoring, summaries |
| 1.0 | Creative — varied responses, may take unexpected angles | Brainstorming, creative writing |
| 1.5+ | Very creative — occasionally wild or incoherent | Rarely useful in practice |
    Temperature 0 vs. temperature 1

    At temperature 0, all three answers are nearly identical. At temperature 1, each takes a different angle. My rule of thumb: start with temperature=0 for anything where correctness matters, and only raise it when you actually want variety.

    Exercise: Build a Message Formatter
    Write Code

    Write a function called format_messages that takes a system_prompt string and a user_prompt string, and returns a properly formatted messages list for the OpenAI API.

    The function should return a list of two dictionaries — one with role "system" and one with role "user", each with the corresponding content.

    Do not call the API — this exercise tests that you can build the messages structure correctly.

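If you want to check your work afterward, here is one possible solution:

```python
def format_messages(system_prompt: str, user_prompt: str) -> list[dict]:
    """Build the two-message list the Chat Completions API expects."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
```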

    Building Real Tools — Tutor, Debugger, and Translator

    Now that you understand the three building blocks (model, messages, temperature), let's wrap them into tools you'd actually use. All three follow the exact same pattern — the only thing that changes is the system message.

    Tool 1: Python tutor. Takes any question and returns a beginner-friendly explanation. Low temperature (0.3) for consistency:

    Tool 1: A Python tutor

    Tool 2: Code debugger. This is the tool I reach for most often. Paste broken code, get it fixed with an explanation. Temperature 0 because you want the correct fix, not a creative one:

    Tool 2: A code debugger

    That's a genuinely tricky bug — modifying a list while iterating by index causes skipped elements. Most beginners wouldn't catch it. The AI identifies the root cause and suggests a list comprehension as the fix.

    Tool 3: Code translator. Translates Python to any language — showing how flexible the system message pattern really is:

    Tool 3: A code translator

    All three tools use the exact same API pattern. The only difference is the system message. That's the central insight: the system message is your primary programming interface for AI behavior.

    Streaming Responses — Real-Time Token Output

    Without streaming, you wait in silence for the model to finish its entire response before seeing anything. For short answers that's fine. For a 500-word explanation, you're staring at a blank screen for 5-10 seconds. Streaming sends tokens as they're generated — the experience users expect from ChatGPT.

    The change is minimal. Add stream=True to your call and iterate over the chunks with async for:

    Streaming a response token by token

    Each chunk contains a delta object with a content fragment — usually a few characters or a word. We collect them into a list and join at the end to get the full response for logging or storage.

    Let's build a reusable streaming helper that handles both modes. This pattern appears in almost every production app I've shipped:

    Reusable chat function with optional streaming

    Token Tracking and Cost Control

    Once you move past prototyping, tracking token usage becomes essential. A runaway loop or an unexpectedly long prompt can blow through credits fast. Building cost awareness into your code from day one prevents surprises.

    Token tracker class
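A minimal tracker might look like this. The prices are assumptions based on published gpt-4o-mini/gpt-4o rates; always verify against OpenAI's pricing page. The demo feeds it simulated `usage` objects so it runs without an API key:

```python
from types import SimpleNamespace


class TokenTracker:
    """Accumulate token usage across calls and estimate the running cost."""

    # Assumed prices in USD per 1M (input, output) tokens — verify before relying on them
    PRICES = {"gpt-4o-mini": (0.15, 0.60), "gpt-4o": (2.50, 10.00)}

    def __init__(self, model: str = "gpt-4o-mini"):
        self.model = model
        self.input_tokens = 0
        self.output_tokens = 0

    def record(self, usage) -> None:
        """Call with response.usage after each API call."""
        self.input_tokens += usage.prompt_tokens
        self.output_tokens += usage.completion_tokens

    @property
    def cost(self) -> float:
        price_in, price_out = self.PRICES.get(self.model, self.PRICES["gpt-4o-mini"])
        return (self.input_tokens * price_in + self.output_tokens * price_out) / 1_000_000


# Simulate three calls' worth of usage data:
tracker = TokenTracker()
for _ in range(3):
    tracker.record(SimpleNamespace(prompt_tokens=50, completion_tokens=200))
print(f"{tracker.input_tokens} in, {tracker.output_tokens} out, ${tracker.cost:.6f}")
```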

    After just three calls, you can see the total cost is fractions of a cent. But multiply that by thousands of users, and tracking becomes critical for budgeting and alerting.

    Here's a helper that enforces a per-session budget ceiling. Once you hit the limit, it raises an exception instead of making more API calls:

    Budget-capped API calls
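A self-contained sketch of the budget ceiling: call `check()` before each API request and `add()` after it with the call's estimated cost (class and method names are mine):

```python
class BudgetExceededError(Exception):
    """Raised when the session has spent its allotted budget."""


class BudgetGuard:
    """Track spend and refuse further calls once the ceiling is hit."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def check(self) -> None:
        """Call before each API request; raises instead of overspending."""
        if self.spent_usd >= self.limit_usd:
            raise BudgetExceededError(
                f"Budget of ${self.limit_usd:.2f} reached "
                f"(spent ${self.spent_usd:.4f})"
            )

    def add(self, cost_usd: float) -> None:
        """Call after each request with that call's estimated cost."""
        self.spent_usd += cost_usd


guard = BudgetGuard(limit_usd=0.01)
guard.check()     # fine — nothing spent yet
guard.add(0.012)  # one expensive call blows the budget
try:
    guard.check()
except BudgetExceededError as err:
    print(err)
```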

    Error Handling and Retry Logic

    OpenAI's API will fail on you. Rate limits, server overloads, network timeouts — it's not a question of if but when. Production code must handle these gracefully. The good news: there are only a handful of errors you need to care about.

| Error | Cause | Fix |
| --- | --- | --- |
| AuthenticationError | Invalid or expired API key | Check your key at platform.openai.com |
| RateLimitError | Too many requests per minute | Wait and retry with backoff |
| APIConnectionError | Network issue | Retry after a short delay |
| APITimeoutError | Request took too long | Retry with a longer timeout |
| BadRequestError | Malformed request (e.g. too many tokens) | Fix the request, don't retry |

    Let's build an exponential backoff retry wrapper. This is the single most important production pattern in this entire tutorial:

    Exponential backoff retry wrapper

    The key insight: only retry transient errors (rate limits, timeouts, connection issues). Never retry permanent errors (bad request, auth failure). Retrying a bad request just wastes time and money.

    Exercise: Calculate Backoff Delays
    Write Code

    Write a function called backoff_delays that takes max_retries (int) and base (float, default 2.0), and returns a list of delay values for exponential backoff.

    The delay for attempt i (0-indexed) should be base ** i. For example, with max_retries=4 and base=2.0, return [1.0, 2.0, 4.0, 8.0].

    This is the math behind every retry system you will ever build.

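If you want to check your work afterward, one possible solution:

```python
def backoff_delays(max_retries: int, base: float = 2.0) -> list[float]:
    """Delay before each retry attempt i (0-indexed) is base ** i."""
    return [base ** i for i in range(max_retries)]


print(backoff_delays(4))            # [1.0, 2.0, 4.0, 8.0]
print(backoff_delays(3, base=3.0))  # [1.0, 3.0, 9.0]
```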

    The Reusable ChatClient Class

    We've built five separate pieces: basic calls, streaming, token tracking, cost control, and retry logic. In a real project, you don't want to copy-paste these into every file. Let's combine them into one reusable class.

    The complete ChatClient class

    That's about 70 lines for a client that handles streaming, retries, and cost tracking. Let's take it for a test drive:

    Testing the ChatClient

    Different system prompts create different tools — same class, different personality. A debugger, a translator, and a reviewer are each just a ChatClient with a specialized system prompt.

    Common Mistakes and How to Fix Them

    Before you go build, here are the mistakes that trip up every beginner. I've hit every one of these in production.

    Mistake 1: Hardcoding Your API Key

    Key visible in your code
    client = openai.AsyncOpenAI(api_key="sk-abc123...")
    Key stored in environment variable
    import os
    client = openai.AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

    In production, set it in your terminal with export OPENAI_API_KEY="sk-your-key" or use a .env file with python-dotenv. Never commit API keys to Git — even in private repos.

    Mistake 2: Ignoring the finish_reason

If the model hits its token limit, it stops mid-sentence. Your code processes a chopped-off response as if it were complete. Always check `response.choices[0].finish_reason`: "stop" means completed, "length" means truncated.

    Mistake 3: No Error Handling on API Calls

    Crashes at 2 AM when rate limited
    response = await client.chat.completions.create(...)
    Retries gracefully, logs the issue
    response = await robust_chat(messages, max_retries=3)

    Rate limits hit at the worst possible time. Wrap every production API call in retry logic. The robust_chat function or ChatClient class from earlier handles this automatically.

    Mistake 4: Vague Prompts

    Vague — generic boilerplate output
    {"role": "user", "content": "Write code"}
    Specific — production-quality output
    {"role": "user", "content": "Write a Python function that takes a list of integers and returns the second largest, handling duplicates and empty lists"}

    Specificity is free and dramatically improves output quality. Tell the AI exactly what inputs to handle, what format to return, and what edge cases to cover.

    Exercise: Token Cost Calculator
    Write Code

    Write a function called calculate_cost that takes input_tokens (int), output_tokens (int), and model (str, default "gpt-4o-mini"), and returns the total cost as a float.

    Use these prices:

  • gpt-4o-mini: $0.15 per 1M input tokens, $0.60 per 1M output tokens
  • gpt-4o: $2.50 per 1M input tokens, $10.00 per 1M output tokens
  • If the model is not recognized, use the gpt-4o-mini pricing as a default.

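If you want to check your work afterward, one possible solution:

```python
def calculate_cost(input_tokens: int, output_tokens: int,
                   model: str = "gpt-4o-mini") -> float:
    """Total cost in USD, falling back to gpt-4o-mini pricing for unknown models."""
    prices = {"gpt-4o-mini": (0.15, 0.60), "gpt-4o": (2.50, 10.00)}
    price_in, price_out = prices.get(model, prices["gpt-4o-mini"])
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000
```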

    Frequently Asked Questions

    Can I use Claude or Gemini instead of OpenAI?

    Absolutely. The concepts — messages, system prompts, temperature, streaming — transfer directly. The API syntax differs slightly: Claude uses anthropic.Anthropic() and Gemini uses google.generativeai. We cover multi-provider setups in a later tutorial.

    Is the OpenAI API free?

    New accounts typically receive free credits (check OpenAI's pricing page for current offers). After that, you pay per token. For learning, you'll rarely spend more than a few cents per session.

    Do I need a GPU to use the API?

    No. The AI runs on OpenAI's servers. Your Python script sends text over the internet and gets text back. Any computer with Python and internet access works.

    What's the difference between GPT-4o and GPT-4o-mini?

    GPT-4o is more capable at complex reasoning, math, and nuanced instructions — but costs about 17x more. GPT-4o-mini is faster and handles most tasks equally well. Start with mini, upgrade only if needed.

    Should I use streaming in production?

    For any user-facing application, yes. Users perceive streaming responses as faster even when total generation time is the same. For batch processing or background tasks, non-streaming is simpler and returns usage data by default.

    Summary and Next Steps

    You covered a lot of ground in this crash course. Let me recap the six core skills you now have.

  • Chat completions — Send messages, get responses, inspect token usage.
  • System messages — Program AI behavior without writing logic.
  • Temperature — Control the creativity-precision tradeoff.
  • Streaming — Display tokens in real time with async for.
  • Token tracking — Monitor costs and enforce budgets.
  • Error handling — Retry transient failures with exponential backoff.
Every AI app you'll build — chatbots, RAG systems, AI agents — extends the pattern below with additional messages, tools, or retrieval layers:

    The core pattern

    Next up: Build a Chatbot with Memory — where the AI remembers your entire conversation. That's where the "assistant" role and multi-turn conversation management come in.

    References

  • OpenAI Chat Completions Guide — official guide to text generation
  • OpenAI API Reference — chat.completions.create — full parameter documentation
  • OpenAI Pricing — current model pricing
  • OpenAI Tokenizer Tool — interactive token counter
  • OpenAI Prompt Engineering Guide — best practices for writing effective prompts
  • Versions used in this tutorial: Python 3.12, openai library 1.x, model gpt-4o-mini. Tested March 2026.
