Multimodal AI with Python: Vision, Audio, and Document Understanding Using LLMs
You have been sending text prompts to LLMs for months. Then a colleague drops a screenshot of an error traceback into Slack and asks: "Can your AI tool read this?" You realize your text-only pipeline is blind to images, deaf to audio, and clueless about PDFs.
Multimodal AI changes that. A single API call can describe a photograph, extract tables from a scanned invoice, or transcribe a meeting recording. This tutorial walks through vision, audio, and document understanding across OpenAI, Anthropic, and Google — with runnable code blocks for every text-and-image example.
What Is Multimodal AI?
A multimodal model accepts more than one type of input — text plus images, text plus audio, or all three at once. The model processes these inputs together, so it can answer questions about what it sees and hears, not just what you type.
I think of it this way: a text-only LLM is like talking to someone on the phone. A multimodal LLM is a video call — they can see what you are showing them. That shift from "describe the problem" to "show me the problem" makes a real difference in what you can build.
Here is a quick map as of early 2026: all three providers (OpenAI, Anthropic, Google) accept images alongside text; audio transcription in this tutorial goes through OpenAI's Whisper; native PDF input is available from Claude and Gemini; and video input is Gemini-only.
The three modalities we will cover — vision, audio, and documents — handle the vast majority of real-world multimodal use cases. Video understanding (supported by Gemini) is essentially vision applied to frames, so the patterns you learn here transfer directly.
Setup and Installation
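A minimal setup sketch. The package names match the SDKs used in this tutorial as of this writing (check each vendor's docs if they have moved on), and the missing_keys helper is my own illustrative function, not part of any SDK.

```python
# Install the three SDKs (run once in your shell):
#   pip install openai anthropic google-genai
# Then export your keys:
#   export OPENAI_API_KEY=...     (OpenAI)
#   export ANTHROPIC_API_KEY=...  (Claude)
#   export GOOGLE_API_KEY=...     (Gemini; the google-genai SDK also reads GEMINI_API_KEY)
import os

def missing_keys(env: dict) -> list:
    """Return the expected API key names that are not set in the given environment."""
    expected = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]
    return [k for k in expected if not env.get(k)]

print("Missing keys:", missing_keys(dict(os.environ)))
```

Run this once before anything else; every example below assumes the relevant key is exported.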
Vision: Image Analysis with GPT-4o
Image understanding is the most immediately useful multimodal capability. You pass an image URL (or base64-encoded bytes) alongside a text prompt, and the model describes, analyzes, or extracts data from what it sees.
The OpenAI API accepts images through the messages list. Instead of a plain text string for the content field, you pass a list of content parts — one text part for your question, one image_url part for the image. Let me show you with a public image.
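A sketch of that call. The image URL is a placeholder, and the environment-variable guard is just there so the script runs (and skips the live call) when no key is set.

```python
import os

def build_vision_message(question: str, image_url: str) -> dict:
    """One user message whose content is a list of parts: one text, one image_url."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[build_vision_message(
            "What does this chart show? List each segment and its percentage.",
            "https://example.com/sales-chart.png",  # hypothetical public image URL
        )],
    )
    print(response.choices[0].message.content)
```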
The model reads the chart, identifies segments, and lists percentages — all from a single API call. I was skeptical the first time I tried this with a messy hand-drawn whiteboard diagram, and it still extracted the key relationships correctly.
Sending Base64-Encoded Images
URLs work for public images, but most real applications deal with private images — screenshots, uploaded files, camera captures. For those, you encode the image as a base64 string and pass it inline. The pattern is a data URI: data:image/<format>;base64,<encoded_bytes>.
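A small helper for that encoding step. The function name is mine; the data URI shape is the one OpenAI's image_url part expects.

```python
import base64
from pathlib import Path

def to_data_uri(image_path: str) -> str:
    """Encode a local image file as a data URI for OpenAI's image_url content part."""
    # Derive the format from the extension; JPEG's MIME subtype is "jpeg", not "jpg".
    fmt = Path(image_path).suffix.lstrip(".").lower().replace("jpg", "jpeg")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:image/{fmt};base64,{encoded}"

# Usage: drop the result into the same image_url slot a public URL would go in.
# {"type": "image_url", "image_url": {"url": to_data_uri("screenshot.png")}}
```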
Multiple Images in One Request
You can send multiple images in a single message. This is useful for comparing screenshots, analyzing before/after states, or processing a batch of receipts.
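The content list simply grows by one image_url part per image. A sketch (helper name and URLs are mine):

```python
def build_multi_image_message(question: str, image_urls: list) -> dict:
    """Attach several images (public URLs or data URIs) to one user message."""
    parts = [{"type": "text", "text": question}]
    parts.extend(
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    )
    return {"role": "user", "content": parts}

# Example: compare two screenshots in a single request.
message = build_multi_image_message(
    "What changed between these two screenshots?",
    ["https://example.com/before.png", "https://example.com/after.png"],  # placeholders
)
```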
Vision with Claude: A Different API Pattern
Claude's image API uses a different structure than OpenAI. Instead of an image_url content type, Claude expects an image content block with explicit source metadata. The base64 approach is the same underneath, but the JSON shape differs enough to trip you up if you are porting code between providers.
```python
# OpenAI: image_url content type
{
    "type": "image_url",
    "image_url": {
        "url": "data:image/png;base64,..."
    }
}
```

```python
# Claude: image content type with source
{
    "type": "image",
    "source": {
        "type": "base64",
        "media_type": "image/png",
        "data": "..."  # no data URI prefix
    }
}
```

Two differences to watch for. First, Claude wants the raw base64 string without the data:image/png;base64, prefix. Second, you must specify media_type explicitly — Claude does not auto-detect it. Supported types are image/jpeg, image/png, image/gif, and image/webp.
Write a function called format_image_description that takes a raw description string (as returned by an LLM) and returns a formatted summary dictionary.
The function should:
1. Count the number of words in the description
2. Extract the first sentence (everything up to the first period followed by a space, or the end of the string)
3. Return a dictionary with keys "word_count", "first_sentence", and "full_description"
This is the kind of post-processing you do after every vision API call to normalize outputs across providers.
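One possible solution to the exercise, interpreting "first sentence" as inclusive of its period:

```python
def format_image_description(description: str) -> dict:
    """Normalize a raw LLM description into a summary dictionary."""
    stripped = description.strip()
    # First sentence: up to the first ". " (inclusive of the period),
    # or the whole string if no such boundary exists.
    idx = stripped.find(". ")
    first_sentence = stripped[: idx + 1] if idx != -1 else stripped
    return {
        "word_count": len(stripped.split()),
        "first_sentence": first_sentence,
        "full_description": stripped,
    }
```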
Vision with Gemini: The Simplest API
Gemini's API is the most streamlined of the three. You pass a Part.from_bytes() for the image and a Part.from_text() for the question, wrapped in a Content object. No base64 encoding on your side — the SDK handles it internally. I find this the cleanest API for quick prototyping.
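A sketch with the google-genai SDK. The model id and local filename are assumptions, and guess_mime_type is my own helper (the SDK still needs an explicit mime_type for raw bytes).

```python
import os

def guess_mime_type(filename: str) -> str:
    """Map a file extension to the MIME type Gemini expects for image bytes."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return {"jpg": "image/jpeg", "jpeg": "image/jpeg",
            "png": "image/png", "webp": "image/webp"}.get(ext, "application/octet-stream")

if os.environ.get("GOOGLE_API_KEY"):  # skip the live call when no key is set
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up GOOGLE_API_KEY from the environment
    with open("chart.png", "rb") as f:  # hypothetical local image
        image_bytes = f.read()

    response = client.models.generate_content(
        model="gemini-2.0-flash",  # assumed model id; use whatever is current
        contents=types.Content(role="user", parts=[
            types.Part.from_bytes(data=image_bytes, mime_type=guess_mime_type("chart.png")),
            types.Part.from_text(text="What does this chart show?"),
        ]),
    )
    print(response.text)
```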
Practical Vision: Extracting Structured Data from Images
Describing an image is a demo. Extracting structured data from an image is a product. The difference is your prompt. Instead of "describe this image," you tell the model exactly what format you want back — JSON, CSV, a specific schema.
Here is a pattern I use in production for receipt and invoice processing. The system prompt defines the output schema, and the model fills it in from whatever image you send.
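A sketch of that pattern. The schema, prompt wording, and validate_receipt helper are illustrative choices, not a fixed API; the receipt URL is a placeholder.

```python
import json
import os

RECEIPT_PROMPT = (
    "You are a receipt parser. Return ONLY valid JSON with keys: "
    'vendor (string), date (YYYY-MM-DD), total (number), and line_items '
    "(list of {description, amount}). Use null for anything you cannot read."
)

def validate_receipt(payload: str) -> dict:
    """Parse the model's reply and check that the schema keys are present."""
    data = json.loads(payload)
    missing = {"vendor", "date", "total", "line_items"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": RECEIPT_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": "Extract this receipt."},
                {"type": "image_url", "image_url": {
                    "url": "https://example.com/receipt.jpg",  # hypothetical image
                    "detail": "high",  # small text needs full resolution
                }},
            ]},
        ],
    )
    print(validate_receipt(resp.choices[0].message.content))
```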
Audio Transcription with Whisper
OpenAI's Whisper model handles speech-to-text. Unlike vision, which works through the chat completions endpoint, Whisper has its own dedicated endpoint: client.audio.transcriptions.create(). It accepts audio files in mp3, mp4, wav, webm, and several other formats.
Audio files cannot be sent as URLs — you must upload the raw file bytes. In practice, that means your code needs direct access to the audio itself (an open file object or an in-memory buffer) rather than a link to it.
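The minimal call looks like this. The filename is a placeholder, and is_supported_audio is my own helper; the format set reflects what OpenAI documents for Whisper, but check the current docs.

```python
import os

# Formats the Whisper endpoint accepts, per OpenAI's documentation.
SUPPORTED_AUDIO = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}

def is_supported_audio(filename: str) -> bool:
    """Check a filename's extension against Whisper's supported formats."""
    return filename.rsplit(".", 1)[-1].lower() in SUPPORTED_AUDIO

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    from openai import OpenAI

    client = OpenAI()
    with open("meeting.mp3", "rb") as audio_file:  # hypothetical recording
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    print(transcript.text)
```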
That is the minimal version. In practice, you almost always want timestamps and language detection. The verbose_json response format gives you both.
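A sketch of the verbose variant. format_segments is my own rendering helper; the response's language and segments fields are what verbose_json returns.

```python
import os

def format_segments(segments: list) -> list:
    """Render verbose_json segments (dicts with start/end/text) as timestamped lines."""
    return [f"[{s['start']:.1f}-{s['end']:.1f}] {s['text'].strip()}" for s in segments]

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    from openai import OpenAI

    client = OpenAI()
    with open("meeting.mp3", "rb") as audio_file:  # hypothetical recording
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",  # adds language + per-segment timestamps
        )
    print("Detected language:", result.language)
    segments = [{"start": s.start, "end": s.end, "text": s.text} for s in result.segments]
    for line in format_segments(segments):
        print(line)
```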
Combining Whisper with GPT-4o
Transcription alone is useful, but the real power comes from piping the transcript into GPT-4o for analysis. This two-step pattern — transcribe, then analyze — handles meeting summaries, podcast search, customer call analysis, and more.
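A sketch of the two-step pipeline. The summary prompt wording and filename are my own choices.

```python
import os

SUMMARY_PROMPT = (
    "Summarize this meeting transcript. Return: 1) three key points, "
    "2) action items with owners, 3) open questions."
)

def build_analysis_messages(transcript_text: str) -> list:
    """Step 2's chat messages: system prompt plus the raw transcript."""
    return [
        {"role": "system", "content": SUMMARY_PROMPT},
        {"role": "user", "content": transcript_text},
    ]

if os.environ.get("OPENAI_API_KEY"):  # skip the live calls when no key is set
    from openai import OpenAI

    client = OpenAI()
    # Step 1: transcribe with Whisper.
    with open("meeting.mp3", "rb") as audio_file:  # hypothetical recording
        transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
    # Step 2: analyze the transcript with GPT-4o.
    analysis = client.chat.completions.create(
        model="gpt-4o",
        messages=build_analysis_messages(transcript.text),
    )
    print(analysis.choices[0].message.content)
```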
I have used this exact pattern to build an internal tool that emails meeting summaries to the team five minutes after the call ends. The Whisper step takes about 10 seconds per minute of audio. The GPT-4o analysis adds another 2-3 seconds. The whole pipeline runs in under a minute for a typical 30-minute meeting.
Document Understanding: PDFs and Scanned Files
PDF processing is where multimodal AI gets genuinely useful for business applications. Instead of building fragile parsing pipelines with PyPDF2 or pdfminer, you hand the document to an LLM and ask questions in natural language.
There are two approaches. You can convert PDF pages to images and use the vision API (works with any provider). Or you can use native PDF support (currently available in Claude and Gemini). The image approach is more universal; the native approach is cleaner.
PDF as Images (Universal Approach)
The universal approach renders each PDF page as an image and sends it through the vision API. This works with OpenAI, Claude, and Gemini. The downside is higher token cost and slightly lower text extraction accuracy compared to native PDF parsing.
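A sketch of the rendering step using PyMuPDF (pdf2image would work too); the filename, DPI, and helper name are my own choices.

```python
import base64
import os

def page_to_data_uri(png_bytes: bytes) -> str:
    """Wrap rendered page bytes in the data URI OpenAI's image_url part expects."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode("utf-8")

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    import fitz  # PyMuPDF: pip install pymupdf
    from openai import OpenAI

    doc = fitz.open("report.pdf")  # hypothetical file
    parts = [{"type": "text", "text": "Summarize this document."}]
    for page in doc:
        png = page.get_pixmap(dpi=150).tobytes("png")  # render each page to PNG
        parts.append({"type": "image_url",
                      "image_url": {"url": page_to_data_uri(png), "detail": "high"}})

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": parts}],
    )
    print(resp.choices[0].message.content)
```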
Native PDF Support with Claude
Claude supports PDFs natively as a document content type. You pass the base64-encoded PDF bytes directly — no image conversion needed. This is faster, cheaper, and more accurate for text-heavy documents because Claude processes the actual text layer rather than OCR-ing rendered images.
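A sketch of the native call. The model id and filename are assumptions; the document content block shape is Claude's.

```python
import base64
import os

def pdf_block(pdf_bytes: bytes) -> dict:
    """Claude's document content block: base64 PDF bytes, no data URI prefix."""
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64.b64encode(pdf_bytes).decode("utf-8"),
        },
    }

if os.environ.get("ANTHROPIC_API_KEY"):  # skip the live call when no key is set
    import anthropic

    client = anthropic.Anthropic()
    with open("contract.pdf", "rb") as f:  # hypothetical file
        message = client.messages.create(
            model="claude-sonnet-4-20250514",  # assumed model id; use whatever is current
            max_tokens=1024,
            messages=[{"role": "user", "content": [
                pdf_block(f.read()),
                {"type": "text", "text": "List the key obligations in this contract."},
            ]}],
        )
    print(message.content[0].text)
```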
Write a function called choose_processing_strategy that takes a file extension and file size in MB, and returns a dictionary describing the best processing approach.
Rules:
- .pdf files under 32 MB: return {"method": "native_pdf", "provider": "claude"}
- .pdf files 32 MB or larger: return {"method": "image_conversion", "provider": "gemini"}
- Image files (.png, .jpg, .jpeg): return {"method": "vision", "provider": "gpt-4o"}
- Audio files (.mp3, .wav): return {"method": "transcription", "provider": "whisper"}
- Anything else: return {"method": "unsupported", "provider": "none"}

Cross-Provider Comparison: Same Image, Three Models
When you are choosing a provider for a vision task, abstract benchmarks do not tell you much. What matters is how each model performs on your specific image type. The approach: download one image, encode it once, then send the same prompt to all three providers.
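Starting with the shared setup and the OpenAI call. The image URL is a placeholder; fetch_as_base64 and as_data_uri are my own helpers.

```python
import base64
import os
import urllib.request

PROMPT = "Describe this image in two sentences."

def fetch_as_base64(url: str) -> str:
    """Download an image once and return its base64 encoding for reuse."""
    with urllib.request.urlopen(url) as resp:
        return base64.b64encode(resp.read()).decode("utf-8")

def as_data_uri(b64: str, fmt: str = "png") -> str:
    """OpenAI wants the data URI form; Claude will want the raw base64 string."""
    return f"data:image/{fmt};base64,{b64}"

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    from openai import OpenAI

    b64 = fetch_as_base64("https://example.com/test-image.png")  # hypothetical URL
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url", "image_url": {"url": as_data_uri(b64)}},
        ]}],
    )
    print("GPT-4o:", resp.choices[0].message.content)
```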
The same image and prompt, sent to Claude and Gemini:
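A sketch of those two calls, reusing one base64 string encoded up front (shown here as an empty placeholder). Model ids are assumptions; claude_image_block is my own helper.

```python
import base64
import os

PROMPT = "Describe this image in two sentences."
b64_image = ""  # placeholder: the base64 string you encoded once for all providers

def claude_image_block(b64: str, media_type: str = "image/png") -> dict:
    """Claude's image block: raw base64, explicit media_type, no data URI prefix."""
    return {"type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": b64}}

if os.environ.get("ANTHROPIC_API_KEY") and b64_image:
    import anthropic

    msg = anthropic.Anthropic().messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=512,
        messages=[{"role": "user", "content": [
            claude_image_block(b64_image),
            {"type": "text", "text": PROMPT},
        ]}],
    )
    print("Claude:", msg.content[0].text)

if os.environ.get("GOOGLE_API_KEY") and b64_image:
    from google import genai
    from google.genai import types

    resp = genai.Client().models.generate_content(
        model="gemini-2.0-flash",  # assumed model id
        contents=[
            types.Part.from_bytes(data=base64.b64decode(b64_image), mime_type="image/png"),
            PROMPT,
        ],
    )
    print("Gemini:", resp.text)
```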
In my experience, Gemini tends to be the fastest (and free), GPT-4o gives the most detailed descriptions, and Claude is the most precise when you ask for structured extraction. These rankings shift depending on the image type — run this with your own images to see.
Real-World Pipeline: Error Screenshot Analyzer
Time to combine everything into a practical tool. This pipeline takes a screenshot of an error (a common scenario in dev teams using Slack or Teams), identifies the error, suggests fixes, and formats the output as structured JSON.
The code works with any image URL. In a production version, you would accept uploaded screenshots and encode them as base64.
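A sketch of the whole pipeline. The system prompt, JSON schema, and strip_json_fences helper are my own design choices; the screenshot URL is a placeholder.

```python
import json
import os

SYSTEM = (
    "You are a senior engineer. Given a screenshot of an error, return ONLY JSON "
    'with keys: "error_type", "likely_cause", and "suggested_fixes" (list of strings).'
)

def strip_json_fences(text: str) -> str:
    """Remove the markdown code fences models often wrap around JSON output."""
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening fence line (``` or ```json)
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    return text.strip()

def analyze_error_screenshot(image_url: str) -> dict:
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic output, not creative interpretation
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": [
                {"type": "text", "text": "Analyze this error screenshot."},
                {"type": "image_url", "image_url": {"url": image_url, "detail": "high"}},
            ]},
        ],
    )
    return json.loads(strip_json_fences(resp.choices[0].message.content))

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    print(analyze_error_screenshot("https://example.com/traceback.png"))  # hypothetical URL
```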
The temperature is set to 0 because we want consistent, deterministic output — not creative interpretation. The JSON stripping logic handles the common case where models wrap their output in markdown fences.
Common Mistakes with Multimodal APIs
After building several multimodal applications, these are the mistakes I see most often — including ones I made myself.
Mistake 1: Forgetting the media_type for Claude
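Side by side, the broken and fixed content blocks (base64 data truncated as a placeholder):

```python
# Wrong: no media_type; the Anthropic API rejects this block.
bad_block = {
    "type": "image",
    "source": {"type": "base64", "data": "iVBOR..."},
}

# Right: media_type stated explicitly.
good_block = {
    "type": "image",
    "source": {"type": "base64", "media_type": "image/png", "data": "iVBOR..."},
}
```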
Mistake 2: Sending the data URI Prefix to Claude
OpenAI expects the full data URI (data:image/png;base64,<data>). Claude expects raw base64 without the prefix. If you copy-paste code from an OpenAI project to Claude, the request will fail silently or produce garbage results.
```python
# Claude will reject this
"data": "data:image/png;base64,iVBOR..."

# Claude wants just the base64 string
"data": "iVBOR..."
```

Mistake 3: Not Setting detail for High-Resolution Images
The default detail in OpenAI's vision API is "auto". For text-heavy images like documents, receipts, and code screenshots, "auto" often picks "low" — which downscales to 512x512 pixels. Small text becomes unreadable. Always set detail: "high" for documents.
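Setting it is one extra field on the image_url part:

```python
# Force full-resolution tiling for text-heavy images.
image_part = {
    "type": "image_url",
    "image_url": {
        "url": "data:image/png;base64,...",  # your encoded document image (placeholder)
        "detail": "high",  # "low" | "high" | "auto" (the default)
    },
}
```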
Mistake 4: Ignoring Token Costs for Images
A single high-resolution image can cost 765+ tokens. Sending 10 pages of a PDF as high-detail images costs 7,650+ tokens just for the images — before any text in your prompt. I have seen developers hit unexpected bills because they batch-processed PDFs without estimating image token costs first.
Write a function called estimate_image_tokens that calculates the token cost of an image for OpenAI's vision API.
Rules (matching OpenAI's actual pricing):
- If detail is "low": always return 85 tokens
- If detail is "high": the image is scaled so the longest side is at most 2048px, then the shortest side is scaled to 768px. The image is then divided into 512x512 tiles. Each tile costs 170 tokens, plus a base of 85 tokens.

The function takes width (int), height (int), and detail (str, default "high") and returns the token count as an integer.
For the tile calculation: use math.ceil() to round up partial tiles.
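One possible solution, following the rules exactly as stated above (OpenAI's actual implementation also avoids upscaling small images; this sketch applies the 768px scaling unconditionally, per the exercise spec):

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate OpenAI vision token cost per the tiling rules in the exercise."""
    if detail == "low":
        return 85
    # Scale so the longest side is at most 2048px.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Then scale so the shortest side is 768px.
    scale = 768 / min(w, h)
    w, h = w * scale, h * scale
    # 170 tokens per 512x512 tile (partial tiles round up), plus an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85
```

For example, a 1024x1024 image scales to 768x768, which covers four tiles: 170 * 4 + 85 = 765, matching the 765-token figure mentioned earlier.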
When NOT to Use Multimodal AI
Multimodal AI is impressive, but it is not always the right tool. Here are cases where I recommend a different approach.
High-volume OCR. If you need to extract text from thousands of documents, a dedicated OCR engine like Tesseract or Amazon Textract is faster and cheaper. LLM vision costs scale linearly with page count, and at high volumes that adds up.
Pixel-precise tasks. LLMs cannot reliably count objects in an image, measure distances, or detect exact pixel coordinates. For tasks like object detection with bounding boxes, use a specialized computer vision model (YOLOv8, SAM) instead.
Real-time audio processing. Whisper processes audio after recording. It is not designed for live transcription with sub-second latency. For real-time speech-to-text, use services like Deepgram or AssemblyAI that stream results as audio arrives.
Summary
You now know how to build multimodal AI applications across three providers. Here is what we covered:
Each provider wraps images in its own content shape: OpenAI uses image_url, Claude uses image with source, Gemini uses Part.from_bytes(). Whisper handles audio through its own transcriptions endpoint, and PDFs go in either as rendered page images (any provider) or natively (Claude and Gemini).

The pattern across all of these is the same: prepare your non-text input (encode, format), attach it to a message alongside a text prompt, and parse the response. Once you have built one multimodal pipeline, every subsequent one follows the same shape.
Frequently Asked Questions
Can I send video to GPT-4o?
Not directly. GPT-4o does not accept video files. The workaround is to extract keyframes from the video (e.g., one frame every 2 seconds using ffmpeg or OpenCV) and send them as multiple images. Gemini is the only major provider with native video input support.
How do I handle HEIC images from iPhones?
Convert HEIC to JPEG or PNG before sending to any provider. Use the pillow-heif library: pip install pillow-heif, then open the HEIC file with Pillow and save as PNG. None of the major providers accept HEIC natively.
Is Whisper available in languages other than English?
Yes. Whisper supports 57 languages and can auto-detect the language. You can also force a specific language with the language parameter: client.audio.transcriptions.create(model="whisper-1", file=audio, language="es") for Spanish. Accuracy varies by language — English, Spanish, French, and German have the highest accuracy.
Can I fine-tune vision models?
OpenAI supports fine-tuning GPT-4o with image inputs as of late 2024. You provide training examples with images and expected outputs, just like text fine-tuning. This is useful for domain-specific tasks like medical imaging or manufacturing defect detection where the base model needs specialized knowledge.