Google Gemini API: Build a Multimodal Document Analyzer in Python
Most AI APIs charge for everything and cap your context at 128K tokens. Gemini gives you a free tier with a 1-million-token context window. It handles text, images, PDFs, audio, and video through a single generate_content() call.
In this tutorial you will build text generators, multi-turn chat systems, and structured output pipelines. Every text-based code block runs directly in your browser. I will flag the multimodal sections that need local files.
What Makes Gemini Different?
Three things set Gemini apart from competitors. The context window is 1 million tokens — enough for a 700-page PDF in a single call. Native audio and video support removes transcription preprocessing. The free tier offers 15 requests per minute with full model access.
OpenAI:

```python
# 128K context, text + vision only
# No free tier
# Separate Whisper API for audio
# No native PDF support
from openai import AsyncOpenAI

client = AsyncOpenAI()

# Image: must base64 encode
# Audio: must use whisper endpoint
# PDF: must extract text first
# Video: not supported
```

Gemini:

```python
# 1M context, text + vision + audio + video + PDF
# Free tier: 15 RPM
# Native support for all modalities
# Direct PDF upload
from google import genai

client = genai.Client(api_key="...")

# Image: pass bytes directly
# Audio: pass bytes directly
# PDF: pass bytes directly
# Video: pass bytes directly
```

The google-genai SDK uses a client-based pattern similar to the OpenAI SDK: you create a genai.Client, then call methods on it. If you have worked through the OpenAI tutorial, this pattern will feel familiar.
Installation and Your First Gemini Call
Let's make your first Gemini API call. The code below installs the SDK, sets your API key, and generates a response. Replace "your-api-key-here" with your real key and hit Run.
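If you are following along locally instead of in the browser, the same first call looks like this. A minimal sketch: it assumes you have run pip install google-genai and exported GOOGLE_API_KEY, and the import is deferred inside the function so the file still loads without the SDK. The name first_call is mine, not part of the API.

```python
import os

def first_call(prompt: str) -> str:
    """Send a single prompt to Gemini and return the generated text."""
    # Deferred import so this file loads even where the SDK is not installed.
    from google import genai

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
    )
    return response.text

# Only fire the live call when a key is actually present.
if os.environ.get("GOOGLE_API_KEY"):
    print(first_call("Explain what an API key is in one sentence."))
```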
That single generate_content() call is the foundation of everything in this tutorial. The response comes back as a GenerateContentResponse object. The .text property extracts the generated text directly.
Anatomy of the Response Object
The response object contains more than just text. It includes token usage, finish reason, and safety ratings. Let's inspect it to understand what Gemini returns.
The finish_reason tells you why the model stopped generating. A value of STOP means it completed normally. If you see MAX_TOKENS, the response was cut off and you need to increase your token limit.
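A small helper makes the inspection repeatable. Since it only reads attributes, it works on any object with the same shape; the attribute names below follow the google-genai GenerateContentResponse, and summarize_response itself is a name I made up for this sketch.

```python
def summarize_response(response) -> dict:
    """Pull the fields you usually care about out of a Gemini response."""
    usage = response.usage_metadata
    return {
        "text": response.text,
        # finish_reason is an enum; str() gives a readable value like
        # "FinishReason.STOP" or "FinishReason.MAX_TOKENS".
        "finish_reason": str(response.candidates[0].finish_reason),
        "prompt_tokens": usage.prompt_token_count,
        "output_tokens": usage.candidates_token_count,
        "total_tokens": usage.total_token_count,
    }
```

Call it right after generate_content; if finish_reason is anything other than STOP, treat the text as suspect.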
System Instructions — Controlling Gemini's Behavior
In the OpenAI API, you use a system message in the messages list. Gemini takes a different approach — system instructions are a separate top-level parameter. This is cleaner because it keeps the system prompt out of the conversation history.
The system instruction did not change what Gemini knows. It changed how it communicates. I use system instructions in every production app to control tone, format, and scope.
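The separate-parameter pattern looks like this in practice — a sketch assuming the SDK is installed and GOOGLE_API_KEY is set; the function name and the persona text are illustrative, not prescribed by the API.

```python
import os

def ask_concise(question: str) -> str:
    """Same question, but tone and format are pinned once in the config."""
    from google import genai  # deferred so the file loads without the SDK
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=question,
        config=types.GenerateContentConfig(
            # Lives in the config, not in the conversation history.
            system_instruction="Answer in exactly two sentences. No markdown.",
        ),
    )
    return response.text
```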
Generation Parameters — Temperature, Top-P, and Top-K
Gemini supports the same temperature parameter as OpenAI, plus two extras: top_p and top_k. Together, these three knobs control how creative or deterministic the output is.
| Parameter | Range | What It Does |
|---|---|---|
| temperature | 0.0–2.0 | Higher = more random. 0 = nearly deterministic |
| top_p | 0.0–1.0 | Nucleus sampling. Lower = fewer word choices considered |
| top_k | 1–∞ | Only consider the top K most likely tokens |
| max_output_tokens | 1–8192 | Hard cap on response length |
Run the same prompt three times at temperature 0 and the responses come back nearly identical. At temperature 1, each run takes a different angle. My rule of thumb: start with temperature=0 for anything where correctness matters.
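You can see all four knobs in one config — a sketch assuming the SDK is installed and GOOGLE_API_KEY is set; the helper name and the specific top_p/top_k values are illustrative defaults, not recommendations from the API docs.

```python
import os

def sample_at_temperatures(prompt: str, temps=(0.0, 1.0)) -> dict:
    """Run the same prompt at several temperatures to compare outputs."""
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    results = {}
    for t in temps:
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=prompt,
            config=types.GenerateContentConfig(
                temperature=t,
                top_p=0.95,              # nucleus sampling cutoff
                top_k=40,                # only the 40 most likely tokens
                max_output_tokens=100,   # hard cap on response length
            ),
        )
        results[t] = response.text
    return results
```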
Write a function called ask_tutor that takes a question string and returns Gemini's response as text. Use a system instruction that tells Gemini to:
1. Explain with a real-world analogy
2. Show one code example
3. Keep answers under 100 words
Use temperature=0.3 and the gemini-2.0-flash model. The client variable is already set up.
Test it with "What is a Python dictionary?" and print the result followed by "DONE" on a new line.
Multi-Turn Chat — Conversations with Memory
A single generate_content() call has no memory of previous interactions. For conversations, Gemini provides a chat interface that automatically tracks history. Each message you send includes all prior turns.
Notice how Turn 3 says "them" without specifying what. Gemini understands from context that "them" means list comprehensions. The chat object tracks the full conversation history automatically.
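A three-turn conversation with the chat interface looks like this — a sketch assuming the SDK is installed and GOOGLE_API_KEY is set. client.chats.create and send_message are the SDK's chat API; the questions are my own.

```python
import os

def demo_chat() -> None:
    """Three turns; the chat object resends the full history each time."""
    from google import genai

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    chat = client.chats.create(model="gemini-2.0-flash")

    print(chat.send_message("What are list comprehensions in Python?").text)
    print(chat.send_message("Show me one with a condition.").text)
    # "them" resolves because the prior turns travel with this message.
    print(chat.send_message("When should I avoid them?").text)
```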
Structured Output — JSON Mode and Response Schemas
When you need machine-readable output, asking politely for JSON in your prompt is unreliable. Gemini supports a response_mime_type parameter that forces the output to be valid JSON. You can also provide a schema to control the exact structure.
Setting response_mime_type="application/json" guarantees valid JSON output. Without it, the model might wrap JSON in markdown code fences or add explanatory text. This parameter eliminates parsing headaches.
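Combining the MIME type with a schema might look like this recipe extractor — a sketch with a hypothetical RECIPE_SCHEMA and function name; it assumes the SDK is installed, GOOGLE_API_KEY is set, and that response_schema accepts a plain JSON-schema dict.

```python
import json
import os

# The schema is plain data; only the API call needs the SDK.
RECIPE_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "minutes": {"type": "integer"},
        "ingredients": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "minutes", "ingredients"],
}

def extract_recipe(description: str) -> dict:
    """Force Gemini to emit JSON matching RECIPE_SCHEMA, then parse it."""
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=f"Extract the recipe details: {description}",
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=RECIPE_SCHEMA,
            temperature=0.0,
        ),
    )
    return json.loads(response.text)  # no markdown fences to strip
```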
Token Counting — Know Your Costs Before You Call
Gemini provides a dedicated count_tokens endpoint. This lets you check the token count of your prompt before making a generation call. Useful for staying within rate limits or estimating costs upfront.
Use count_tokens() whenever you build something that processes variable-length input. A user might paste a 10-word question or a 10,000-word document. Knowing the token count upfront prevents surprise rate limit errors.
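A pre-flight check built on count_tokens might look like this — a sketch assuming the SDK is installed and GOOGLE_API_KEY is set; the guard function and its default limit are mine.

```python
import os

def tokens_for(text: str, limit: int = 1_000_000) -> int:
    """Count tokens before generating; raise if the prompt is too large."""
    from google import genai

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    result = client.models.count_tokens(
        model="gemini-2.0-flash",
        contents=text,
    )
    if result.total_tokens > limit:
        raise ValueError(
            f"Prompt is {result.total_tokens} tokens; limit is {limit}"
        )
    return result.total_tokens
```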
Building Real Tools — Debugger, Reviewer, Translator
Now that you understand generate_content(), system instructions, and generation config, let's wrap them into tools you would actually use. All three follow the same pattern — the only thing that changes is the system instruction.
That is a genuinely tricky bug — modifying a list while iterating by index causes skipped elements. Gemini identifies the root cause and suggests a list comprehension fix.
The code reviewer combines system instructions with JSON schema output. Every production AI tool I have built uses structured output when the result feeds into downstream code. Unstructured text is fine for humans, but machines need schemas.
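Because only the system instruction changes between the three tools, you can sketch them as a small factory — make_tool is a hypothetical helper of mine, not an SDK feature; it assumes the SDK is installed and GOOGLE_API_KEY is set.

```python
import os

def make_tool(system_instruction: str, temperature: float = 0.0):
    """Build a one-argument tool; only the system instruction varies."""
    def tool(text: str) -> str:
        from google import genai
        from google.genai import types

        client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=text,
            config=types.GenerateContentConfig(
                system_instruction=system_instruction,
                temperature=temperature,
            ),
        )
        return response.text
    return tool

# Three tools, one pattern:
debugger = make_tool("Find the bug, explain the root cause, suggest a fix.")
reviewer = make_tool("Review this code for style, safety, and performance.")
translator = make_tool("Translate this Python code to idiomatic JavaScript.")
```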
Create a function called analyze_sentiment that takes a text string and returns a dictionary with:
- sentiment: "positive", "negative", or "neutral"
- confidence: a number from 1 to 10
- keywords: a list of key words from the text

Use response_mime_type="application/json" and a response_schema to enforce the structure. Use temperature=0.0 and the gemini-2.0-flash model.
Test it on "Python is an amazing language for beginners" and print the result followed by "DONE" on a new line. The client variable is already set up.
OpenAI vs Gemini — Side-by-Side API Comparison
If you have used the OpenAI API, the Gemini patterns will feel familiar but the syntax differs. Here is a direct comparison of the key operations.
OpenAI:

```python
# Installation
# pip install openai
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="sk-...")

# Basic generation
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Be concise."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0,
    max_tokens=200,
)
print(response.choices[0].message.content)

# JSON mode
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
    response_format={"type": "json_object"},
)
```

Gemini:

```python
# Installation
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client(api_key="...")

# Basic generation
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Hello!",
    config=types.GenerateContentConfig(
        system_instruction="Be concise.",
        temperature=0,
        max_output_tokens=200,
    ),
)
print(response.text)

# JSON mode
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="...",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
    ),
)
```

The core concepts transfer directly between the two SDKs. The main differences are structural: Gemini uses a config object instead of keyword arguments, puts the system instruction in the config rather than the messages list, and exposes .text instead of .choices[0].message.content.
Multimodal Capabilities — Images, PDFs, Audio, Video
Gemini's biggest differentiator is native multimodal support. You can pass images, PDFs, audio files, and video to the same generate_content() method. No separate APIs, no preprocessing, no transcription step.
Image Analysis
PDF Processing
Notice how both image and PDF processing use the same method. You create a Part from raw bytes with the correct MIME type, then pass it alongside your text prompt. Gemini handles the rest.
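The shared pattern can be sketched as one helper for any file type — describe_file is a name of mine; the sketch assumes the SDK is installed, GOOGLE_API_KEY is set, and a local file exists at the path you pass in.

```python
import os
import pathlib

def describe_file(path: str, mime_type: str, prompt: str) -> str:
    """Send raw file bytes plus a text prompt in one generate_content call."""
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    part = types.Part.from_bytes(
        data=pathlib.Path(path).read_bytes(),
        mime_type=mime_type,  # e.g. "image/png", "application/pdf"
    )
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[part, prompt],  # file part and text prompt side by side
    )
    return response.text

# Usage (requires local files):
# describe_file("chart.png", "image/png", "What trend does this chart show?")
# describe_file("report.pdf", "application/pdf", "Summarize the key findings.")
```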
Audio Transcription and Analysis
Video Analysis
Putting It All Together — A Document Analyzer Pipeline
Let's combine everything into a text-based analysis pipeline that runs in your browser. This simulates a document analysis workflow using text content instead of file uploads.
This pipeline demonstrates the power of combining system instructions, JSON output, and a clean function interface. In production, you would extend this to accept file uploads using the multimodal patterns from the previous section.
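As a shape for such a pipeline, a two-step document analyzer might look like this — analyze_document and its prompts are illustrative choices of mine; the sketch assumes the SDK is installed and GOOGLE_API_KEY is set.

```python
import json
import os

def analyze_document(text: str) -> dict:
    """Two-step pipeline: free-text summary, then structured metadata."""
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

    # Step 1: human-readable summary.
    summary = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=text,
        config=types.GenerateContentConfig(
            system_instruction="Summarize the document in three sentences.",
        ),
    ).text

    # Step 2: machine-readable metadata via JSON mode.
    metadata = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=text,
        config=types.GenerateContentConfig(
            system_instruction="Extract the main topics and overall sentiment.",
            response_mime_type="application/json",
            temperature=0.0,
        ),
    ).text

    return {"summary": summary, "metadata": json.loads(metadata)}
```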
Create a function called code_quality_pipeline that takes a Python code string and performs two analysis steps:
1. Step 1: Use Gemini to identify bugs (return JSON with a bugs array)
2. Step 2: Use Gemini to suggest improvements (return JSON with an improvements array)
Combine both results into a single dictionary with keys bugs and improvements. Use response_mime_type="application/json" for both calls.
Test it on the code "def add(x, y): return x + y" and print len(result["bugs"]) and len(result["improvements"]) followed by "DONE" on a new line. The client variable is already set up.
Common Mistakes and How to Fix Them
I have hit every one of these while building with the Gemini API. Learn from my debugging sessions.
Mistake 1: Wrong SDK Package
The deprecated SDK:

```python
# pip install google-generativeai  (deprecated)
import google.generativeai as genai

genai.configure(api_key="...")
model = genai.GenerativeModel("gemini-pro")
```

The current SDK:

```python
# pip install google-genai
from google import genai

client = genai.Client(api_key="...")
client.models.generate_content(...)
```

The old google-generativeai package uses a different API surface entirely. If you find tutorials using genai.configure() or GenerativeModel(), they are built on the deprecated SDK. Always use google-genai with the client pattern.
Mistake 2: Hardcoding Your API Key
Hardcoded (don't do this):

```python
client = genai.Client(api_key="AIzaSyB...")
```

From the environment:

```python
import os

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
```

Set your key in the terminal with export GOOGLE_API_KEY="your-key". Never commit API keys to Git. In this tutorial we set keys inline for convenience, but always use environment variables in real projects.
Mistake 3: Not Checking Finish Reason
If the model hits its token limit, it stops mid-sentence. Your code processes a truncated response without warning. Always check response.candidates[0].finish_reason — it should be STOP for a complete response.
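A small guard catches this — ensure_complete is a helper name of mine; it only reads attributes, so it works on any response-shaped object, and the string match on the enum's repr is an assumption about how the SDK prints finish reasons.

```python
def ensure_complete(response) -> str:
    """Return the text, or raise instead of processing a truncated response."""
    reason = str(response.candidates[0].finish_reason)
    if "STOP" not in reason:  # e.g. "FinishReason.MAX_TOKENS"
        raise RuntimeError(f"Generation did not finish cleanly: {reason}")
    return response.text
```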
Frequently Asked Questions
Is the Gemini API really free?
Yes — the free tier gives you 15 requests per minute with full model access. No credit card required. For production workloads, the paid tier removes rate limits and offers priority access.
Which model should I use — Flash or Pro?
Start with gemini-2.0-flash. It is fast, cheap, and handles most tasks well. Only switch to Pro when you need stronger reasoning on complex problems. Flash covers 90% of use cases.
Can I switch between OpenAI and Gemini easily?
The concepts transfer directly — system prompts, temperature, JSON mode all work the same way. Only the syntax differs. Many teams use both providers and route tasks based on cost and capability.
What about streaming responses?
Gemini supports streaming via generate_content_stream(). This returns tokens as they are generated instead of waiting for the full response. We cover streaming in the Chatbot with Memory tutorial.
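As a quick taste before that tutorial, the call shape looks like this — a sketch assuming the SDK is installed and GOOGLE_API_KEY is set; the function name is mine.

```python
import os

def stream_answer(prompt: str) -> None:
    """Print tokens as they arrive instead of waiting for the full response."""
    from google import genai

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    for chunk in client.models.generate_content_stream(
        model="gemini-2.0-flash",
        contents=prompt,
    ):
        print(chunk.text, end="", flush=True)
    print()
```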
Summary and Next Steps
You built a code debugger, a code translator, a structured reviewer, and a document analysis pipeline — all using the same Gemini API pattern.
Every Gemini app you will build extends this pattern. Chatbots add client.chats.create(). Multimodal apps add Part.from_bytes(). RAG systems add retrieval before the call. The foundation stays the same.
Next up: Build a Chatbot with Memory — where the AI remembers your entire conversation history across turns.
References
Versions used in this tutorial: Python 3.12, google-genai SDK, model gemini-2.0-flash. Tested March 2026.