Google Gemini API: Build a Multimodal Document Analyzer in Python
Most AI APIs charge for everything and cap your context at 128K tokens. Gemini gives you a free tier with a 1-million-token context window. It handles text, images, PDFs, audio, and video through a single generate_content() call.
In this tutorial you will build text generators, multi-turn chat systems, and structured output pipelines. Every text-based code block runs directly in your browser. I will flag the multimodal sections that need local files.
What Makes Gemini Different?
Three things set Gemini apart from competitors. The context window is 1 million tokens — enough for a 700-page PDF in a single call. Native audio and video support removes transcription preprocessing. The free tier offers 15 requests per minute with full model access.
OpenAI:

```python
# 128K context, text + vision only
# No free tier
# Separate Whisper API for audio
# No native PDF support
from openai import AsyncOpenAI

client = AsyncOpenAI()

# Image: must base64 encode
# Audio: must use whisper endpoint
# PDF: must extract text first
# Video: not supported
```

Gemini:

```python
# 1M context, text + vision + audio + video + PDF
# Free tier: 15 RPM
# Native support for all modalities
# Direct PDF upload
from google import genai

client = genai.Client(api_key="...")

# Image: pass bytes directly
# Audio: pass bytes directly
# PDF: pass bytes directly
# Video: pass bytes directly
```

The google-genai SDK uses a client-based pattern similar to the OpenAI SDK: you create a genai.Client, then call methods on it. If you have worked through the OpenAI tutorial, this pattern will feel familiar.
Installation and Your First Gemini Call
Let's make your first Gemini API call. The code below installs the SDK, sets your API key, and generates a response. Replace "your-api-key-here" with your real key and hit Run.
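If you are following along locally instead of in the browser, the same first call looks like this. A minimal sketch: it assumes you have run pip install google-genai and exported GOOGLE_API_KEY, and the import is deferred inside the function so the file still loads without the SDK. The name first_call is mine, not part of the API.

```python
import os

def first_call(prompt: str) -> str:
    """Send a single prompt to Gemini and return the generated text."""
    # Deferred import so this file loads even where the SDK is not installed.
    from google import genai

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
    )
    return response.text

# Only fire the live call when a key is actually present.
if os.environ.get("GOOGLE_API_KEY"):
    print(first_call("Explain what an API key is in one sentence."))
```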
That single generate_content() call is the foundation of everything in this tutorial. The response comes back as a GenerateContentResponse object. The .text property extracts the generated text directly.
Anatomy of the Response Object
The response object contains more than just text. It includes token usage, finish reason, and safety ratings. Let's inspect it to understand what Gemini returns.
The finish_reason tells you why the model stopped generating. A value of STOP means it completed normally. If you see MAX_TOKENS, the response was cut off and you need to increase your token limit.
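A small helper makes the inspection repeatable. Since it only reads attributes, it works on any object with the same shape; the attribute names below follow the google-genai GenerateContentResponse, and summarize_response itself is a name I made up for this sketch.

```python
def summarize_response(response) -> dict:
    """Pull the fields you usually care about out of a Gemini response."""
    usage = response.usage_metadata
    return {
        "text": response.text,
        # finish_reason is an enum; str() gives a readable value like
        # "FinishReason.STOP" or "FinishReason.MAX_TOKENS".
        "finish_reason": str(response.candidates[0].finish_reason),
        "prompt_tokens": usage.prompt_token_count,
        "output_tokens": usage.candidates_token_count,
        "total_tokens": usage.total_token_count,
    }
```

Call it right after generate_content; if finish_reason is anything other than STOP, treat the text as suspect.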
System Instructions — Controlling Gemini's Behavior
In the OpenAI API, you use a system message in the messages list. Gemini takes a different approach — system instructions are a separate top-level parameter. This is cleaner because it keeps the system prompt out of the conversation history.
The system instruction did not change what Gemini knows. It changed how it communicates. I use system instructions in every production app to control tone, format, and scope.
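The separate-parameter pattern looks like this in practice — a sketch assuming the SDK is installed and GOOGLE_API_KEY is set; the function name and the persona text are illustrative, not prescribed by the API.

```python
import os

def ask_concise(question: str) -> str:
    """Same question, but tone and format are pinned once in the config."""
    from google import genai  # deferred so the file loads without the SDK
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=question,
        config=types.GenerateContentConfig(
            # Lives in the config, not in the conversation history.
            system_instruction="Answer in exactly two sentences. No markdown.",
        ),
    )
    return response.text
```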
Generation Parameters — Temperature, Top-P, and Top-K
Gemini supports the same temperature parameter as OpenAI, plus two extras: top_p and top_k. Together, these three knobs control how creative or deterministic the output is.
| Parameter | Range | What It Does |
|---|---|---|
| temperature | 0.0–2.0 | Higher = more random. 0 = nearly deterministic |
| top_p | 0.0–1.0 | Nucleus sampling. Lower = fewer word choices considered |
| top_k | 1–∞ | Only consider the top K most likely tokens |
| max_output_tokens | 1–8192 | Hard cap on response length |
Run the same prompt three times at temperature 0 and the responses come back nearly identical. At temperature 1, each run takes a different angle. My rule of thumb: start with temperature=0 for anything where correctness matters.
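You can see all four knobs in one config — a sketch assuming the SDK is installed and GOOGLE_API_KEY is set; the helper name and the specific top_p/top_k values are illustrative defaults, not recommendations from the API docs.

```python
import os

def sample_at_temperatures(prompt: str, temps=(0.0, 1.0)) -> dict:
    """Run the same prompt at several temperatures to compare outputs."""
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    results = {}
    for t in temps:
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=prompt,
            config=types.GenerateContentConfig(
                temperature=t,
                top_p=0.95,              # nucleus sampling cutoff
                top_k=40,                # only the 40 most likely tokens
                max_output_tokens=100,   # hard cap on response length
            ),
        )
        results[t] = response.text
    return results
```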
Write a function called ask_tutor that takes a question string and returns Gemini's response as text. Use a system instruction that tells Gemini to:
1. Explain with a real-world analogy
2. Show one code example
3. Keep answers under 100 words
Use temperature=0.3 and the gemini-2.0-flash model. The client variable is already set up.
Test it with "What is a Python dictionary?" and print the result followed by "DONE" on a new line.
Multi-Turn Chat — Conversations with Memory
A single generate_content() call has no memory of previous interactions. For conversations, Gemini provides a chat interface that automatically tracks history. Each message you send includes all prior turns.
Notice how Turn 3 says "them" without specifying what. Gemini understands from context that "them" means list comprehensions. The chat object tracks the full conversation history automatically.
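A three-turn conversation with the chat interface looks like this — a sketch assuming the SDK is installed and GOOGLE_API_KEY is set. client.chats.create and send_message are the SDK's chat API; the questions are my own.

```python
import os

def demo_chat() -> None:
    """Three turns; the chat object resends the full history each time."""
    from google import genai

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    chat = client.chats.create(model="gemini-2.0-flash")

    print(chat.send_message("What are list comprehensions in Python?").text)
    print(chat.send_message("Show me one with a condition.").text)
    # "them" resolves because the prior turns travel with this message.
    print(chat.send_message("When should I avoid them?").text)
```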
Structured Output — JSON Mode and Response Schemas
When you need machine-readable output, asking politely for JSON in your prompt is unreliable. Gemini supports a response_mime_type parameter that forces the output to be valid JSON. You can also provide a schema to control the exact structure.
Setting response_mime_type="application/json" guarantees valid JSON output. Without it, the model might wrap JSON in markdown code fences or add explanatory text. This parameter eliminates parsing headaches.
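Combining the MIME type with a schema might look like this recipe extractor — a sketch with a hypothetical RECIPE_SCHEMA and function name; it assumes the SDK is installed, GOOGLE_API_KEY is set, and that response_schema accepts a plain JSON-schema dict.

```python
import json
import os

# The schema is plain data; only the API call needs the SDK.
RECIPE_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "minutes": {"type": "integer"},
        "ingredients": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "minutes", "ingredients"],
}

def extract_recipe(description: str) -> dict:
    """Force Gemini to emit JSON matching RECIPE_SCHEMA, then parse it."""
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=f"Extract the recipe details: {description}",
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=RECIPE_SCHEMA,
            temperature=0.0,
        ),
    )
    return json.loads(response.text)  # no markdown fences to strip
```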
Token Counting — Know Your Costs Before You Call
Gemini provides a dedicated count_tokens endpoint. This lets you check the token count of your prompt before making a generation call. Useful for staying within rate limits or estimating costs upfront.
Use count_tokens() whenever you build something that processes variable-length input. A user might paste a 10-word question or a 10,000-word document. Knowing the token count upfront prevents surprise rate limit errors.
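A pre-flight check built on count_tokens might look like this — a sketch assuming the SDK is installed and GOOGLE_API_KEY is set; the guard function and its default limit are mine.

```python
import os

def tokens_for(text: str, limit: int = 1_000_000) -> int:
    """Count tokens before generating; raise if the prompt is too large."""
    from google import genai

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    result = client.models.count_tokens(
        model="gemini-2.0-flash",
        contents=text,
    )
    if result.total_tokens > limit:
        raise ValueError(
            f"Prompt is {result.total_tokens} tokens; limit is {limit}"
        )
    return result.total_tokens
```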
Building Real Tools — Debugger, Reviewer, Translator
Now that you understand generate_content(), system instructions, and generation config, let's wrap them into tools you would actually use. All three follow the same pattern — the only thing that changes is the system instruction.
That is a genuinely tricky bug — modifying a list while iterating by index causes skipped elements. Gemini identifies the root cause and suggests a list comprehension fix.
The code reviewer combines system instructions with JSON schema output. Every production AI tool I have built uses structured output when the result feeds into downstream code. Unstructured text is fine for humans, but machines need schemas.
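Because only the system instruction changes between the three tools, you can sketch them as a small factory — make_tool is a hypothetical helper of mine, not an SDK feature; it assumes the SDK is installed and GOOGLE_API_KEY is set.

```python
import os

def make_tool(system_instruction: str, temperature: float = 0.0):
    """Build a one-argument tool; only the system instruction varies."""
    def tool(text: str) -> str:
        from google import genai
        from google.genai import types

        client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=text,
            config=types.GenerateContentConfig(
                system_instruction=system_instruction,
                temperature=temperature,
            ),
        )
        return response.text
    return tool

# Three tools, one pattern:
debugger = make_tool("Find the bug, explain the root cause, suggest a fix.")
reviewer = make_tool("Review this code for style, safety, and performance.")
translator = make_tool("Translate this Python code to idiomatic JavaScript.")
```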
Create a function called analyze_sentiment that takes a text string and returns a dictionary with:
- sentiment: "positive", "negative", or "neutral"
- confidence: a number from 1 to 10
- keywords: a list of key words from the text

Use response_mime_type="application/json" and a response_schema to enforce the structure. Use temperature=0.0 and the gemini-2.0-flash model.
Test it on "Python is an amazing language for beginners" and print the result followed by "DONE" on a new line. The client variable is already set up.
OpenAI vs Gemini — Side-by-Side API Comparison
If you have used the OpenAI API, the Gemini patterns will feel familiar but the syntax differs. Here is a direct comparison of the key operations.
OpenAI:

```python
# Installation
# pip install openai
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="sk-...")

# Basic generation
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Be concise."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0,
    max_tokens=200,
)
print(response.choices[0].message.content)

# JSON mode
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
    response_format={"type": "json_object"},
)
```

Gemini:

```python
# Installation
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client(api_key="...")

# Basic generation
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Hello!",
    config=types.GenerateContentConfig(
        system_instruction="Be concise.",
        temperature=0,
        max_output_tokens=200,
    ),
)
print(response.text)

# JSON mode
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="...",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
    ),
)
```

The core concepts transfer directly between the two SDKs. The main differences are structural: Gemini uses a config object instead of keyword arguments, puts the system instruction in the config rather than the messages list, and exposes .text instead of .choices[0].message.content.
Multimodal Capabilities — Images, PDFs, Audio, Video
Gemini's biggest differentiator is native multimodal support. You can pass images, PDFs, audio files, and video to the same generate_content() method. No separate APIs, no preprocessing, no transcription step.
Image Analysis
PDF Processing
Notice how both image and PDF processing use the same method. You create a Part from raw bytes with the correct MIME type, then pass it alongside your text prompt. Gemini handles the rest.
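The shared pattern can be sketched as one helper for any file type — describe_file is a name of mine; the sketch assumes the SDK is installed, GOOGLE_API_KEY is set, and a local file exists at the path you pass in.

```python
import os
import pathlib

def describe_file(path: str, mime_type: str, prompt: str) -> str:
    """Send raw file bytes plus a text prompt in one generate_content call."""
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    part = types.Part.from_bytes(
        data=pathlib.Path(path).read_bytes(),
        mime_type=mime_type,  # e.g. "image/png", "application/pdf"
    )
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[part, prompt],  # file part and text prompt side by side
    )
    return response.text

# Usage (requires local files):
# describe_file("chart.png", "image/png", "What trend does this chart show?")
# describe_file("report.pdf", "application/pdf", "Summarize the key findings.")
```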
Audio Transcription and Analysis
Video Analysis
Putting It All Together — A Document Analyzer Pipeline
Let's combine everything into a text-based analysis pipeline that runs in your browser. This simulates a document analysis workflow using text content instead of file uploads.
This pipeline demonstrates the power of combining system instructions, JSON output, and a clean function interface. In production, you would extend this to accept file uploads using the multimodal patterns from the previous section.
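As a shape for such a pipeline, a two-step document analyzer might look like this — analyze_document and its prompts are illustrative choices of mine; the sketch assumes the SDK is installed and GOOGLE_API_KEY is set.

```python
import json
import os

def analyze_document(text: str) -> dict:
    """Two-step pipeline: free-text summary, then structured metadata."""
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

    # Step 1: human-readable summary.
    summary = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=text,
        config=types.GenerateContentConfig(
            system_instruction="Summarize the document in three sentences.",
        ),
    ).text

    # Step 2: machine-readable metadata via JSON mode.
    metadata = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=text,
        config=types.GenerateContentConfig(
            system_instruction="Extract the main topics and overall sentiment.",
            response_mime_type="application/json",
            temperature=0.0,
        ),
    ).text

    return {"summary": summary, "metadata": json.loads(metadata)}
```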
Create a function called code_quality_pipeline that takes a Python code string and performs two analysis steps:
1. Step 1: Use Gemini to identify bugs (return JSON with a bugs array)
2. Step 2: Use Gemini to suggest improvements (return JSON with an improvements array)
Combine both results into a single dictionary with keys bugs and improvements. Use response_mime_type="application/json" for both calls.
Test it on the code "def add(x, y): return x + y" and print len(result["bugs"]) and len(result["improvements"]) followed by "DONE" on a new line. The client variable is already set up.
Common Mistakes and How to Fix Them
I have hit every one of these while building with the Gemini API. Learn from my debugging sessions.
Mistake 1: Wrong SDK Package
The deprecated SDK:

```python
# pip install google-generativeai  (deprecated)
import google.generativeai as genai

genai.configure(api_key="...")
model = genai.GenerativeModel("gemini-pro")
```

The current SDK:

```python
# pip install google-genai
from google import genai

client = genai.Client(api_key="...")
client.models.generate_content(...)
```

The old google-generativeai package uses a different API surface entirely. If you find tutorials using genai.configure() or GenerativeModel(), they are built on the deprecated SDK. Always use google-genai with the client pattern.
Mistake 2: Hardcoding Your API Key
Hardcoded (don't do this):

```python
client = genai.Client(api_key="AIzaSyB...")
```

From the environment:

```python
import os

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
```

Set your key in the terminal with export GOOGLE_API_KEY="your-key". Never commit API keys to Git. In this tutorial we set keys inline for convenience, but always use environment variables in real projects.
Mistake 3: Not Checking Finish Reason
If the model hits its token limit, it stops mid-sentence. Your code processes a truncated response without warning. Always check response.candidates[0].finish_reason — it should be STOP for a complete response.
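A small guard catches this — ensure_complete is a helper name of mine; it only reads attributes, so it works on any response-shaped object, and the string match on the enum's repr is an assumption about how the SDK prints finish reasons.

```python
def ensure_complete(response) -> str:
    """Return the text, or raise instead of processing a truncated response."""
    reason = str(response.candidates[0].finish_reason)
    if "STOP" not in reason:  # e.g. "FinishReason.MAX_TOKENS"
        raise RuntimeError(f"Generation did not finish cleanly: {reason}")
    return response.text
```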
Frequently Asked Questions
Is the Gemini API really free?
Yes — the free tier gives you 15 requests per minute with full model access. No credit card required. For production workloads, the paid tier removes rate limits and offers priority access.
Which model should I use — Flash or Pro?
Start with gemini-2.0-flash. It is fast, cheap, and handles most tasks well. Only switch to Pro when you need stronger reasoning on complex problems. Flash covers 90% of use cases.
Can I switch between OpenAI and Gemini easily?
The concepts transfer directly — system prompts, temperature, JSON mode all work the same way. Only the syntax differs. Many teams use both providers and route tasks based on cost and capability.
What about streaming responses?
Gemini supports streaming via generate_content_stream(). This returns tokens as they are generated instead of waiting for the full response. We cover streaming in the Chatbot with Memory tutorial.
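As a quick taste before that tutorial, the call shape looks like this — a sketch assuming the SDK is installed and GOOGLE_API_KEY is set; the function name is mine.

```python
import os

def stream_answer(prompt: str) -> None:
    """Print tokens as they arrive instead of waiting for the full response."""
    from google import genai

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    for chunk in client.models.generate_content_stream(
        model="gemini-2.0-flash",
        contents=prompt,
    ):
        print(chunk.text, end="", flush=True)
    print()
```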
Summary and Next Steps
You built a code debugger, a code translator, a structured reviewer, and a document analysis pipeline — all using the same Gemini API pattern.
Every Gemini app you will build extends this pattern. Chatbots add client.chats.create(). Multimodal apps add Part.from_bytes(). RAG systems add retrieval before the call. The foundation stays the same.
Next up: Build a Chatbot with Memory — where the AI remembers your entire conversation history across turns.
References
Versions used in this tutorial: Python 3.12, google-genai SDK, model gemini-2.0-flash. Tested March 2026.