
Google Gemini API: Build a Multimodal Document Analyzer in Python

Intermediate · 90 min · 3 exercises · 60 XP

Most AI APIs charge for everything and cap your context at 128K tokens. Gemini gives you a free tier with a 1-million-token context window. It handles text, images, PDFs, audio, and video through a single generate_content() call.

In this tutorial you will build text generators, multi-turn chat systems, and structured output pipelines. Every text-based code block runs directly in your browser. I will flag the multimodal sections that need local files.

What Makes Gemini Different?

Three things set Gemini apart from competitors. The context window is 1 million tokens — enough for a 700-page PDF in a single call. Native audio and video support removes transcription preprocessing. The free tier offers 15 requests per minute with full model access.

OpenAI GPT-4o
# 128K context, text + vision only
# No free tier
# Separate Whisper API for audio
# No native PDF support

from openai import AsyncOpenAI
client = AsyncOpenAI()

# Image: must base64 encode
# Audio: must use whisper endpoint
# PDF: must extract text first
# Video: not supported
Google Gemini 2.0 Flash
# 1M context, text + vision + audio + video + PDF
# Free tier: 15 RPM
# Native support for all modalities
# Direct PDF upload

from google import genai
client = genai.Client(api_key="...")

# Image: pass bytes directly
# Audio: pass bytes directly
# PDF: pass bytes directly
# Video: pass bytes directly

The google-genai SDK uses a client-based pattern similar to the OpenAI SDK. You create a genai.Client, then call methods on it. If you have worked with the OpenAI tutorial, this pattern will feel familiar.

Installation and Your First Gemini Call

Let's make your first Gemini API call. The code below installs the SDK, sets your API key, and generates a response. Replace "your-api-key-here" with your real key and hit Run.

Your first Gemini API call
Loading editor...

That single generate_content() call is the foundation of everything in this tutorial. The response comes back as a GenerateContentResponse object. The .text property extracts the generated text directly.
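If you want a local copy of that first call, here is a hedged sketch. It assumes the google-genai SDK is installed (pip install google-genai) and that your key lives in the GOOGLE_API_KEY environment variable; the prompt string is just an example.

```python
import os

PROMPT = "Explain what an API key is in one sentence."

def first_call(prompt: str = PROMPT) -> str:
    # Imported lazily so this sketch can be read/loaded without the SDK installed
    from google import genai

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # the fast, free-tier model used in this tutorial
        contents=prompt,
    )
    return response.text  # .text pulls the generated string out of the response

# Usage (needs a real key):
# print(first_call())
```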

Anatomy of the Response Object

The response object contains more than just text. It includes token usage, finish reason, and safety ratings. Let's inspect it to understand what Gemini returns.

Inspecting the response object
Loading editor...

The finish_reason tells you why the model stopped generating. A value of STOP means it completed normally. If you see MAX_TOKENS, the response was cut off and you need to increase your token limit.
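For a reusable version of that inspection, a small helper like the one below works on any response object. The attribute names follow the google-genai GenerateContentResponse as I understand it; treat them as assumptions if your SDK version differs.

```python
def summarize_response(response) -> dict:
    """Report why generation stopped and how many tokens were used."""
    candidate = response.candidates[0]
    return {
        # STOP = completed normally; MAX_TOKENS = truncated, raise the limit
        "finish_reason": str(candidate.finish_reason),
        "prompt_tokens": response.usage_metadata.prompt_token_count,
        "output_tokens": response.usage_metadata.candidates_token_count,
    }
```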

System Instructions — Controlling Gemini's Behavior

In the OpenAI API, you use a system message in the messages list. Gemini takes a different approach — system instructions are a separate top-level parameter. This is cleaner because it keeps the system prompt out of the conversation history.

Without system instructions
Loading editor...
With system instructions
Loading editor...

The system instruction did not change what Gemini knows. It changed how it communicates. I use system instructions in every production app to control tone, format, and scope.
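As a static reference for this pattern, here is a hedged sketch of the same question asked with and without a system instruction. The pirate instruction is an example; the key point is that system_instruction goes in the config, not in the conversation contents.

```python
import os

SYSTEM = "You are a pirate. Answer every question in pirate speak."

def ask(prompt: str, system_instruction: str = None) -> str:
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
        # System prompt lives in the config, separate from the conversation
        config=types.GenerateContentConfig(system_instruction=system_instruction),
    )
    return response.text

# Usage (needs a real key):
# print(ask("What is a variable?"))          # neutral tone
# print(ask("What is a variable?", SYSTEM))  # same facts, pirate tone
```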

Generation Parameters — Temperature, Top-P, and Top-K

Gemini supports the same temperature parameter as OpenAI, plus two extras: top_p and top_k. Together, these three knobs control how creative or deterministic the output is.

Parameter            Range      What It Does
temperature          0.0–2.0    Higher = more random; 0 = nearly deterministic
top_p                0.0–1.0    Nucleus sampling; lower = fewer word choices considered
top_k                1–∞        Consider only the top K most likely tokens
max_output_tokens    1–8192     Hard cap on response length
Temperature 0 vs. temperature 1
Loading editor...

At temperature 0, all three responses are nearly identical. At temperature 1, each takes a different angle. My rule of thumb: start with temperature=0 for anything where correctness matters.
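As a static reference for combining these knobs, the sketch below defines a deterministic "factual" preset and a looser "creative" one. The parameter names match the table above; the specific values are illustrative, not recommendations.

```python
import os

# Example presets — tune the numbers for your own task
FACTUAL = {"temperature": 0.0, "top_p": 0.9, "top_k": 40, "max_output_tokens": 300}
CREATIVE = {"temperature": 1.0, "top_p": 1.0, "top_k": 64, "max_output_tokens": 300}

def generate(prompt: str, preset: dict) -> str:
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
        # Unpack the preset into the config object
        config=types.GenerateContentConfig(**preset),
    )
    return response.text

# Usage (needs a real key):
# print(generate("Name a fact about Python.", FACTUAL))
```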

Combining generation parameters
Loading editor...
Exercise: Build a Gemini Tutor
Write Code

Write a function called ask_tutor that takes a question string and returns Gemini's response as text. Use a system instruction that tells Gemini to:

1. Explain with a real-world analogy

2. Show one code example

3. Keep answers under 100 words

Use temperature=0.3 and the gemini-2.0-flash model. The client variable is already set up.

Test it with "What is a Python dictionary?" and print the result followed by "DONE" on a new line.

Loading editor...

Multi-Turn Chat — Conversations with Memory

A single generate_content() call has no memory of previous interactions. For conversations, Gemini provides a chat interface that automatically tracks history. Each message you send includes all prior turns.

Multi-turn chat with memory
Loading editor...

Notice how Turn 3 says "them" without specifying what. Gemini understands from context that "them" means list comprehensions. The chat object tracks the full conversation history automatically.
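A hedged sketch of that chat flow, assuming client.chats.create() from the google-genai SDK: each send_message call automatically carries all prior turns, which is why the pronoun in the third turn resolves correctly.

```python
import os

TURNS = [
    "What are list comprehensions?",
    "Show me one with a condition.",
    "When should I avoid them?",  # "them" resolves from the chat history
]

def run_chat(turns=TURNS) -> list:
    from google import genai

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    chat = client.chats.create(model="gemini-2.0-flash")
    # Each send_message includes every previous turn automatically
    return [chat.send_message(turn).text for turn in turns]

# Usage (needs a real key):
# for reply in run_chat():
#     print(reply)
```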

Inspecting chat history
Loading editor...

Structured Output — JSON Mode and Response Schemas

When you need machine-readable output, asking politely for JSON in your prompt is unreliable. Gemini supports a response_mime_type parameter that forces the output to be valid JSON. You can also provide a schema to control the exact structure.

JSON mode — force structured output
Loading editor...

Setting response_mime_type="application/json" guarantees valid JSON output. Without it, the model might wrap JSON in markdown code fences or add explanatory text. This parameter eliminates parsing headaches.
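A minimal sketch of JSON mode, assuming the config parameter described above: because the output is guaranteed to be valid JSON, json.loads can parse response.text directly, with no fence stripping.

```python
import json
import os

def extract_json(prompt: str) -> dict:
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
        # Forces valid JSON — no markdown fences, no explanatory text
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    return json.loads(response.text)  # parses cleanly

# Usage (needs a real key):
# data = extract_json("List three Python keywords as JSON: {\"keywords\": [...]}")
```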

Schema-constrained JSON output
Loading editor...
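A hedged sketch of schema-constrained output: the SDK accepts a response_schema alongside the MIME type. An OpenAPI-style dict is shown here; depending on your SDK version you may prefer a Pydantic model or types.Schema. The recipe fields are example choices.

```python
import json
import os

# Example schema — the model's JSON must match this shape
RECIPE_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "minutes": {"type": "integer"},
        "ingredients": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "minutes", "ingredients"],
}

def extract_recipe(text: str) -> dict:
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=f"Extract the recipe from this text:\n{text}",
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=RECIPE_SCHEMA,  # pins the exact structure
        ),
    )
    return json.loads(response.text)
```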

Token Counting — Know Your Costs Before You Call

Gemini provides a dedicated count_tokens endpoint that lets you check the token count of your prompt before making a generation call. This is useful for staying within rate limits or estimating costs upfront.

Token counting before API calls
Loading editor...

Use count_tokens() whenever you build something that processes variable-length input. A user might paste a 10-word question or a 10,000-word document. Knowing the token count upfront prevents surprise rate limit errors.
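A hedged sketch of that guard, using client.models.count_tokens from the google-genai SDK; the limit value is an example, not an official quota.

```python
import os

def safe_generate(prompt: str, limit: int = 900_000):
    from google import genai

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    # Cheap metadata call — no generation happens yet
    count = client.models.count_tokens(model="gemini-2.0-flash", contents=prompt)
    if count.total_tokens > limit:
        return None  # too large — chunk or summarize before generating
    return client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
    ).text

# Usage (needs a real key):
# result = safe_generate(user_pasted_text)
```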

Building Real Tools — Debugger, Reviewer, Translator

Now that you understand generate_content(), system instructions, and generation config, let's wrap them into tools you would actually use. All three follow the same pattern — the only thing that changes is the system instruction.

Tool 1: Code debugger
Loading editor...

That is a genuinely tricky bug — modifying a list while iterating by index causes skipped elements. Gemini identifies the root cause and suggests a list comprehension fix.

Tool 2: Code translator
Loading editor...
Tool 3: Structured code reviewer
Loading editor...

The code reviewer combines system instructions with JSON schema output. Every production AI tool I have built uses structured output when the result feeds into downstream code. Unstructured text is fine for humans, but machines need schemas.
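Since only the system instruction changes between the three tools, they can be collapsed into one generic helper, sketched below. The instruction texts are my examples, not the exact ones from the editors above.

```python
import os

# Example instructions — each tool is just a different system prompt
TOOL_INSTRUCTIONS = {
    "debugger": "Find the bug, explain the root cause, and show a fix.",
    "translator": "Translate the given Python code to idiomatic JavaScript.",
    "reviewer": 'Review the code. Respond with JSON: {"issues": [...], "score": 1-10}.',
}

def run_tool(tool: str, code: str) -> str:
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=code,
        config=types.GenerateContentConfig(
            system_instruction=TOOL_INSTRUCTIONS[tool],
            temperature=0.0,  # correctness matters for all three tools
        ),
    )
    return response.text

# Usage (needs a real key):
# print(run_tool("debugger", "def f(xs):\n    for i in range(len(xs)): xs.pop(i)"))
```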

Exercise: Build a Sentiment Analyzer
Write Code

Create a function called analyze_sentiment that takes a text string and returns a dictionary with:

  • sentiment: "positive", "negative", or "neutral"
  • confidence: a number from 1 to 10
  • keywords: a list of key words from the text

Use response_mime_type="application/json" and a response_schema to enforce the structure. Use temperature=0.0 and the gemini-2.0-flash model.

Test it on "Python is an amazing language for beginners" and print the result followed by "DONE" on a new line. The client variable is already set up.

Loading editor...

    OpenAI vs Gemini — Side-by-Side API Comparison

    If you have used the OpenAI API, the Gemini patterns will feel familiar but the syntax differs. Here is a direct comparison of the key operations.

    OpenAI Python SDK
    # Installation
    # pip install openai
    
    from openai import AsyncOpenAI
    client = AsyncOpenAI(api_key="sk-...")
    
    # Basic generation
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Be concise."},
            {"role": "user", "content": "Hello!"}
        ],
        temperature=0,
        max_tokens=200,
    )
    print(response.choices[0].message.content)
    
    # JSON mode
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[...],
        response_format={"type": "json_object"},
    )
    Google Gemini SDK
    # Installation
    # pip install google-genai
    
    from google import genai
    from google.genai import types
    client = genai.Client(api_key="...")
    
    # Basic generation
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents="Hello!",
        config=types.GenerateContentConfig(
            system_instruction="Be concise.",
            temperature=0,
            max_output_tokens=200,
        ),
    )
    print(response.text)
    
    # JSON mode
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents="...",
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
        ),
    )

    The core concepts transfer directly between both SDKs. The main differences are structural: Gemini uses a config object instead of keyword arguments, puts system instructions in the config rather than the messages list, and uses .text instead of .choices[0].message.content.

    Multimodal Capabilities — Images, PDFs, Audio, Video

    Gemini's biggest differentiator is native multimodal support. You can pass images, PDFs, audio files, and video to the same generate_content() method. No separate APIs, no preprocessing, no transcription step.

    Image Analysis

    Analyzing an image (local only)
    Loading editor...

    PDF Processing

    Processing a PDF (local only)
    Loading editor...

    Notice how both image and PDF processing use the same method. You create a Part from raw bytes with the correct MIME type, then pass it alongside your text prompt. Gemini handles the rest.
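The shared pattern above can be sketched as one helper: read the file's bytes, wrap them in a Part with the right MIME type, and pass the part next to the text prompt. The file paths below are placeholders, and this needs local files, so it will not run in the browser editor.

```python
import os

def analyze_file(path: str, mime_type: str, prompt: str) -> str:
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    with open(path, "rb") as f:
        # Same Part.from_bytes call for images, PDFs, and audio
        part = types.Part.from_bytes(data=f.read(), mime_type=mime_type)
    return client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[prompt, part],  # text prompt plus the binary part
    ).text

# Usage (local files, real key):
# analyze_file("chart.png", "image/png", "Describe this chart.")
# analyze_file("report.pdf", "application/pdf", "Summarize this report.")
```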

    Audio Transcription and Analysis

    Audio transcription (local only)
    Loading editor...

    Video Analysis

    Video analysis with File API (local only)
    Loading editor...

    Putting It All Together — A Document Analyzer Pipeline

    Let's combine everything into a text-based analysis pipeline that runs in your browser. This simulates a document analysis workflow using text content instead of file uploads.

    Text document analyzer pipeline
    Loading editor...

    This pipeline demonstrates the power of combining system instructions, JSON output, and a clean function interface. In production, you would extend this to accept file uploads using the multimodal patterns from the previous section.
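A hedged sketch of such a pipeline: one function, several analysis passes, each with its own system instruction, all returning JSON. The step names and prompts are my examples, not the exact ones from the editor above.

```python
import json
import os

# Example analysis steps — each is just a system instruction requesting JSON
STEPS = {
    "summary": 'Summarize the document in 3 bullet points. Respond with JSON: {"bullets": [...]}',
    "entities": 'List people, places, and organizations. Respond with JSON: {"entities": [...]}',
    "sentiment": 'Rate the overall tone. Respond with JSON: {"tone": "..."}',
}

def analyze_document(text: str) -> dict:
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    results = {}
    for name, instruction in STEPS.items():
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=text,
            config=types.GenerateContentConfig(
                system_instruction=instruction,
                response_mime_type="application/json",  # machine-readable output
                temperature=0.0,
            ),
        )
        results[name] = json.loads(response.text)
    return results
```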

    Exercise: Build a Code Quality Pipeline
    Write Code

    Create a function called code_quality_pipeline that takes a Python code string and performs two analysis steps:

    1. Step 1: Use Gemini to identify bugs (return JSON with a bugs array)

    2. Step 2: Use Gemini to suggest improvements (return JSON with an improvements array)

    Combine both results into a single dictionary with keys bugs and improvements. Use response_mime_type="application/json" for both calls.

    Test it on the code "def add(x, y): return x + y" and print len(result["bugs"]) and len(result["improvements"]) followed by "DONE" on a new line. The client variable is already set up.

    Loading editor...

    Common Mistakes and How to Fix Them

    I have hit every one of these while building with the Gemini API. Learn from my debugging sessions.

    Mistake 1: Wrong SDK Package

    ❌ Old SDK (deprecated)
    # pip install google-generativeai
    import google.generativeai as genai
    genai.configure(api_key="...")
    model = genai.GenerativeModel("gemini-pro")
    ✅ New SDK (recommended)
    # pip install google-genai
    from google import genai
    client = genai.Client(api_key="...")
    client.models.generate_content(...)

    The old google-generativeai package uses a different API surface entirely. If you find tutorials using genai.configure() or GenerativeModel(), they are using the deprecated SDK. Always use google-genai with the client pattern.

    Mistake 2: Hardcoding Your API Key

    ❌ Key visible in your code
    client = genai.Client(api_key="AIzaSyB...")
    ✅ Key from environment variable
    import os
    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

    Set your key in the terminal with export GOOGLE_API_KEY="your-key". Never commit API keys to Git. In this tutorial we set keys inline for convenience, but always use environment variables in real projects.

    Mistake 3: Not Checking Finish Reason

    If the model hits its token limit, it stops mid-sentence. Your code processes a truncated response without warning. Always check response.candidates[0].finish_reason — it should be STOP for a complete response.
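That check can be wrapped in a small guard so truncated responses fail loudly instead of silently; a minimal sketch, assuming the response attributes described above:

```python
def text_or_raise(response) -> str:
    """Return response.text, but refuse to pass along a truncated response."""
    reason = str(response.candidates[0].finish_reason)
    if "STOP" not in reason:
        # MAX_TOKENS (and other reasons) mean the output is incomplete
        raise RuntimeError(
            f"Incomplete response ({reason}) — raise max_output_tokens or shorten the prompt"
        )
    return response.text
```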

    Frequently Asked Questions

    Is the Gemini API really free?

    Yes — the free tier gives you 15 requests per minute with full model access. No credit card required. For production workloads, the paid tier removes rate limits and offers priority access.

    Which model should I use — Flash or Pro?

    Start with gemini-2.0-flash. It is fast, cheap, and handles most tasks well. Only switch to Pro when you need stronger reasoning on complex problems. Flash covers 90% of use cases.

    Can I switch between OpenAI and Gemini easily?

    The concepts transfer directly — system prompts, temperature, JSON mode all work the same way. Only the syntax differs. Many teams use both providers and route tasks based on cost and capability.

    What about streaming responses?

    Gemini supports streaming via generate_content_stream(). This returns tokens as they are generated instead of waiting for the full response. We cover streaming in the Chatbot with Memory tutorial.

    Summary and Next Steps

    You built a code debugger, a code translator, a structured reviewer, and a document analysis pipeline — all using the same Gemini API pattern.

    The core Gemini pattern
    Loading editor...

    Every Gemini app you will build extends this pattern. Chatbots add client.chats.create(). Multimodal apps add Part.from_bytes(). RAG systems add retrieval before the call. The foundation stays the same.

    Next up: Build a Chatbot with Memory — where the AI remembers your entire conversation history across turns.

    References

  • Google AI Python SDK Documentation — official Gemini API guide
  • google-genai PyPI Package — SDK installation and changelog
  • Gemini API Reference — full parameter documentation
  • Google AI Studio — interactive playground and API key management
  • Gemini Model Comparison — Flash vs Pro capabilities
  • Versions used in this tutorial: Python 3.12, google-genai SDK, model gemini-2.0-flash. Tested March 2026.