Multimodal AI with Python: Vision, Audio, and Document Understanding Using LLMs
You have been sending text prompts to LLMs for months. Then a colleague drops a screenshot of an error traceback into Slack and asks: "Can your AI tool read this?" You realize your text-only pipeline is blind to images, deaf to audio, and clueless about PDFs.
Multimodal AI changes that. A single API call can describe a photograph, extract tables from a scanned invoice, or transcribe a meeting recording. This tutorial walks through vision, audio, and document understanding across OpenAI, Anthropic, and Google — with runnable code blocks for every text-and-image example.
What Is Multimodal AI?
A multimodal model accepts more than one type of input — text plus images, text plus audio, or all three at once. The model processes these inputs together, so it can answer questions about what it sees and hears, not just what you type.
I think of it this way: a text-only LLM is like talking to someone on the phone. A multimodal LLM is a video call — they can see what you are showing them. That shift from "describe the problem" to "show me the problem" makes a real difference in what you can build.
Here is a quick map as of early 2026: all three providers (OpenAI, Anthropic, Google) accept images alongside text; audio transcription in this tutorial goes through OpenAI's Whisper; native PDF input is available from Claude and Gemini; and video input is Gemini-only.
The three modalities we will cover — vision, audio, and documents — handle the vast majority of real-world multimodal use cases. Video understanding (supported by Gemini) is essentially vision applied to frames, so the patterns you learn here transfer directly.
Setup and Installation
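A minimal setup sketch. The package names match the SDKs used in this tutorial as of this writing (check each vendor's docs if they have moved on), and the missing_keys helper is my own illustrative function, not part of any SDK.

```python
# Install the three SDKs (run once in your shell):
#   pip install openai anthropic google-genai
# Then export your keys:
#   export OPENAI_API_KEY=...     (OpenAI)
#   export ANTHROPIC_API_KEY=...  (Claude)
#   export GOOGLE_API_KEY=...     (Gemini; the google-genai SDK also reads GEMINI_API_KEY)
import os

def missing_keys(env: dict) -> list:
    """Return the expected API key names that are not set in the given environment."""
    expected = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]
    return [k for k in expected if not env.get(k)]

print("Missing keys:", missing_keys(dict(os.environ)))
```

Run this once before anything else; every example below assumes the relevant key is exported.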
Vision: Image Analysis with GPT-4o
Image understanding is the most immediately useful multimodal capability. You pass an image URL (or base64-encoded bytes) alongside a text prompt, and the model describes, analyzes, or extracts data from what it sees.
The OpenAI API accepts images through the messages list. Instead of a plain text string for the content field, you pass a list of content parts — one text part for your question, one image_url part for the image. Let me show you with a public image.
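A sketch of that call. The image URL is a placeholder, and the environment-variable guard is just there so the script runs (and skips the live call) when no key is set.

```python
import os

def build_vision_message(question: str, image_url: str) -> dict:
    """One user message whose content is a list of parts: one text, one image_url."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[build_vision_message(
            "What does this chart show? List each segment and its percentage.",
            "https://example.com/sales-chart.png",  # hypothetical public image URL
        )],
    )
    print(response.choices[0].message.content)
```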
The model reads the chart, identifies segments, and lists percentages — all from a single API call. I was skeptical the first time I tried this with a messy hand-drawn whiteboard diagram, and it still extracted the key relationships correctly.
Sending Base64-Encoded Images
URLs work for public images, but most real applications deal with private images — screenshots, uploaded files, camera captures. For those, you encode the image as a base64 string and pass it inline. The pattern is a data URI: data:image/<format>;base64,<encoded_bytes>.
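A small helper for that encoding step. The function name is mine; the data URI shape is the one OpenAI's image_url part expects.

```python
import base64
from pathlib import Path

def to_data_uri(image_path: str) -> str:
    """Encode a local image file as a data URI for OpenAI's image_url content part."""
    # Derive the format from the extension; JPEG's MIME subtype is "jpeg", not "jpg".
    fmt = Path(image_path).suffix.lstrip(".").lower().replace("jpg", "jpeg")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:image/{fmt};base64,{encoded}"

# Usage: drop the result into the same image_url slot a public URL would go in.
# {"type": "image_url", "image_url": {"url": to_data_uri("screenshot.png")}}
```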
Multiple Images in One Request
You can send multiple images in a single message. This is useful for comparing screenshots, analyzing before/after states, or processing a batch of receipts.
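The content list simply grows by one image_url part per image. A sketch (helper name and URLs are mine):

```python
def build_multi_image_message(question: str, image_urls: list) -> dict:
    """Attach several images (public URLs or data URIs) to one user message."""
    parts = [{"type": "text", "text": question}]
    parts.extend(
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    )
    return {"role": "user", "content": parts}

# Example: compare two screenshots in a single request.
message = build_multi_image_message(
    "What changed between these two screenshots?",
    ["https://example.com/before.png", "https://example.com/after.png"],  # placeholders
)
```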
Vision with Claude: A Different API Pattern
Claude's image API uses a different structure than OpenAI. Instead of an image_url content type, Claude expects an image content block with explicit source metadata. The base64 approach is the same underneath, but the JSON shape differs enough to trip you up if you are porting code between providers.
```python
# OpenAI: image_url content type
{
    "type": "image_url",
    "image_url": {
        "url": "data:image/png;base64,..."
    }
}
```

```python
# Claude: image content type with source
{
    "type": "image",
    "source": {
        "type": "base64",
        "media_type": "image/png",
        "data": "..."  # no data URI prefix
    }
}
```

Two differences to watch for. First, Claude wants the raw base64 string without the data:image/png;base64, prefix. Second, you must specify media_type explicitly — Claude does not auto-detect it. Supported types are image/jpeg, image/png, image/gif, and image/webp.
Write a function called format_image_description that takes a raw description string (as returned by an LLM) and returns a formatted summary dictionary.
The function should:
1. Count the number of words in the description
2. Extract the first sentence (everything up to the first period followed by a space, or the end of the string)
3. Return a dictionary with keys "word_count", "first_sentence", and "full_description"
This is the kind of post-processing you do after every vision API call to normalize outputs across providers.
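One possible solution to the exercise, interpreting "first sentence" as inclusive of its period:

```python
def format_image_description(description: str) -> dict:
    """Normalize a raw LLM description into a summary dictionary."""
    stripped = description.strip()
    # First sentence: up to the first ". " (inclusive of the period),
    # or the whole string if no such boundary exists.
    idx = stripped.find(". ")
    first_sentence = stripped[: idx + 1] if idx != -1 else stripped
    return {
        "word_count": len(stripped.split()),
        "first_sentence": first_sentence,
        "full_description": stripped,
    }
```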
Vision with Gemini: The Simplest API
Gemini's API is the most streamlined of the three. You pass a Part.from_bytes() for the image and a Part.from_text() for the question, wrapped in a Content object. No base64 encoding on your side — the SDK handles it internally. I find this the cleanest API for quick prototyping.
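A sketch with the google-genai SDK. The model id and local filename are assumptions, and guess_mime_type is my own helper (the SDK still needs an explicit mime_type for raw bytes).

```python
import os

def guess_mime_type(filename: str) -> str:
    """Map a file extension to the MIME type Gemini expects for image bytes."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return {"jpg": "image/jpeg", "jpeg": "image/jpeg",
            "png": "image/png", "webp": "image/webp"}.get(ext, "application/octet-stream")

if os.environ.get("GOOGLE_API_KEY"):  # skip the live call when no key is set
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up GOOGLE_API_KEY from the environment
    with open("chart.png", "rb") as f:  # hypothetical local image
        image_bytes = f.read()

    response = client.models.generate_content(
        model="gemini-2.0-flash",  # assumed model id; use whatever is current
        contents=types.Content(role="user", parts=[
            types.Part.from_bytes(data=image_bytes, mime_type=guess_mime_type("chart.png")),
            types.Part.from_text(text="What does this chart show?"),
        ]),
    )
    print(response.text)
```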
Practical Vision: Extracting Structured Data from Images
Describing an image is a demo. Extracting structured data from an image is a product. The difference is your prompt. Instead of "describe this image," you tell the model exactly what format you want back — JSON, CSV, a specific schema.
Here is a pattern I use in production for receipt and invoice processing. The system prompt defines the output schema, and the model fills it in from whatever image you send.
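A sketch of that pattern. The schema, prompt wording, and validate_receipt helper are illustrative choices, not a fixed API; the receipt URL is a placeholder.

```python
import json
import os

RECEIPT_PROMPT = (
    "You are a receipt parser. Return ONLY valid JSON with keys: "
    'vendor (string), date (YYYY-MM-DD), total (number), and line_items '
    "(list of {description, amount}). Use null for anything you cannot read."
)

def validate_receipt(payload: str) -> dict:
    """Parse the model's reply and check that the schema keys are present."""
    data = json.loads(payload)
    missing = {"vendor", "date", "total", "line_items"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": RECEIPT_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": "Extract this receipt."},
                {"type": "image_url", "image_url": {
                    "url": "https://example.com/receipt.jpg",  # hypothetical image
                    "detail": "high",  # small text needs full resolution
                }},
            ]},
        ],
    )
    print(validate_receipt(resp.choices[0].message.content))
```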
Audio Transcription with Whisper
OpenAI's Whisper model handles speech-to-text. Unlike vision, which works through the chat completions endpoint, Whisper has its own dedicated endpoint: client.audio.transcriptions.create(). It accepts audio files in mp3, mp4, wav, webm, and several other formats.
Audio files cannot be sent as URLs — you must upload the raw file bytes. In practice, that means your code needs direct access to the audio itself (an open file object or an in-memory buffer) rather than a link to it.
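The minimal call looks like this. The filename is a placeholder, and is_supported_audio is my own helper; the format set reflects what OpenAI documents for Whisper, but check the current docs.

```python
import os

# Formats the Whisper endpoint accepts, per OpenAI's documentation.
SUPPORTED_AUDIO = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}

def is_supported_audio(filename: str) -> bool:
    """Check a filename's extension against Whisper's supported formats."""
    return filename.rsplit(".", 1)[-1].lower() in SUPPORTED_AUDIO

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    from openai import OpenAI

    client = OpenAI()
    with open("meeting.mp3", "rb") as audio_file:  # hypothetical recording
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    print(transcript.text)
```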
That is the minimal version. In practice, you almost always want timestamps and language detection. The verbose_json response format gives you both.
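A sketch of the verbose variant. format_segments is my own rendering helper; the response's language and segments fields are what verbose_json returns.

```python
import os

def format_segments(segments: list) -> list:
    """Render verbose_json segments (dicts with start/end/text) as timestamped lines."""
    return [f"[{s['start']:.1f}-{s['end']:.1f}] {s['text'].strip()}" for s in segments]

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    from openai import OpenAI

    client = OpenAI()
    with open("meeting.mp3", "rb") as audio_file:  # hypothetical recording
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",  # adds language + per-segment timestamps
        )
    print("Detected language:", result.language)
    segments = [{"start": s.start, "end": s.end, "text": s.text} for s in result.segments]
    for line in format_segments(segments):
        print(line)
```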
Combining Whisper with GPT-4o
Transcription alone is useful, but the real power comes from piping the transcript into GPT-4o for analysis. This two-step pattern — transcribe, then analyze — handles meeting summaries, podcast search, customer call analysis, and more.
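A sketch of the two-step pipeline. The summary prompt wording and filename are my own choices.

```python
import os

SUMMARY_PROMPT = (
    "Summarize this meeting transcript. Return: 1) three key points, "
    "2) action items with owners, 3) open questions."
)

def build_analysis_messages(transcript_text: str) -> list:
    """Step 2's chat messages: system prompt plus the raw transcript."""
    return [
        {"role": "system", "content": SUMMARY_PROMPT},
        {"role": "user", "content": transcript_text},
    ]

if os.environ.get("OPENAI_API_KEY"):  # skip the live calls when no key is set
    from openai import OpenAI

    client = OpenAI()
    # Step 1: transcribe with Whisper.
    with open("meeting.mp3", "rb") as audio_file:  # hypothetical recording
        transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
    # Step 2: analyze the transcript with GPT-4o.
    analysis = client.chat.completions.create(
        model="gpt-4o",
        messages=build_analysis_messages(transcript.text),
    )
    print(analysis.choices[0].message.content)
```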
I have used this exact pattern to build an internal tool that emails meeting summaries to the team five minutes after the call ends. The Whisper step takes about 10 seconds per minute of audio. The GPT-4o analysis adds another 2-3 seconds. The whole pipeline runs in under a minute for a typical 30-minute meeting.
Document Understanding: PDFs and Scanned Files
PDF processing is where multimodal AI gets genuinely useful for business applications. Instead of building fragile parsing pipelines with PyPDF2 or pdfminer, you hand the document to an LLM and ask questions in natural language.
There are two approaches. You can convert PDF pages to images and use the vision API (works with any provider). Or you can use native PDF support (currently available in Claude and Gemini). The image approach is more universal; the native approach is cleaner.
PDF as Images (Universal Approach)
The universal approach renders each PDF page as an image and sends it through the vision API. This works with OpenAI, Claude, and Gemini. The downside is higher token cost and slightly lower text extraction accuracy compared to native PDF parsing.
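A sketch of the rendering step using PyMuPDF (pdf2image would work too); the filename, DPI, and helper name are my own choices.

```python
import base64
import os

def page_to_data_uri(png_bytes: bytes) -> str:
    """Wrap rendered page bytes in the data URI OpenAI's image_url part expects."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode("utf-8")

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    import fitz  # PyMuPDF: pip install pymupdf
    from openai import OpenAI

    doc = fitz.open("report.pdf")  # hypothetical file
    parts = [{"type": "text", "text": "Summarize this document."}]
    for page in doc:
        png = page.get_pixmap(dpi=150).tobytes("png")  # render each page to PNG
        parts.append({"type": "image_url",
                      "image_url": {"url": page_to_data_uri(png), "detail": "high"}})

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": parts}],
    )
    print(resp.choices[0].message.content)
```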
Native PDF Support with Claude
Claude supports PDFs natively as a document content type. You pass the base64-encoded PDF bytes directly — no image conversion needed. This is faster, cheaper, and more accurate for text-heavy documents because Claude processes the actual text layer rather than OCR-ing rendered images.
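A sketch of the native call. The model id and filename are assumptions; the document content block shape is Claude's.

```python
import base64
import os

def pdf_block(pdf_bytes: bytes) -> dict:
    """Claude's document content block: base64 PDF bytes, no data URI prefix."""
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64.b64encode(pdf_bytes).decode("utf-8"),
        },
    }

if os.environ.get("ANTHROPIC_API_KEY"):  # skip the live call when no key is set
    import anthropic

    client = anthropic.Anthropic()
    with open("contract.pdf", "rb") as f:  # hypothetical file
        message = client.messages.create(
            model="claude-sonnet-4-20250514",  # assumed model id; use whatever is current
            max_tokens=1024,
            messages=[{"role": "user", "content": [
                pdf_block(f.read()),
                {"type": "text", "text": "List the key obligations in this contract."},
            ]}],
        )
    print(message.content[0].text)
```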
Write a function called choose_processing_strategy that takes a file extension and file size in MB, and returns a dictionary describing the best processing approach.
Rules:
- .pdf files under 32 MB: return {"method": "native_pdf", "provider": "claude"}
- .pdf files 32 MB or larger: return {"method": "image_conversion", "provider": "gemini"}
- Image files (.png, .jpg, .jpeg): return {"method": "vision", "provider": "gpt-4o"}
- Audio files (.mp3, .wav): return {"method": "transcription", "provider": "whisper"}
- Anything else: return {"method": "unsupported", "provider": "none"}

Cross-Provider Comparison: Same Image, Three Models
When you are choosing a provider for a vision task, abstract benchmarks do not tell you much. What matters is how each model performs on your specific image type. The approach: download one image, encode it once, then send the same prompt to all three providers.
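Starting with the shared setup and the OpenAI call. The image URL is a placeholder; fetch_as_base64 and as_data_uri are my own helpers.

```python
import base64
import os
import urllib.request

PROMPT = "Describe this image in two sentences."

def fetch_as_base64(url: str) -> str:
    """Download an image once and return its base64 encoding for reuse."""
    with urllib.request.urlopen(url) as resp:
        return base64.b64encode(resp.read()).decode("utf-8")

def as_data_uri(b64: str, fmt: str = "png") -> str:
    """OpenAI wants the data URI form; Claude will want the raw base64 string."""
    return f"data:image/{fmt};base64,{b64}"

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    from openai import OpenAI

    b64 = fetch_as_base64("https://example.com/test-image.png")  # hypothetical URL
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url", "image_url": {"url": as_data_uri(b64)}},
        ]}],
    )
    print("GPT-4o:", resp.choices[0].message.content)
```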
The same image and prompt, sent to Claude and Gemini:
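A sketch of those two calls, reusing one base64 string encoded up front (shown here as an empty placeholder). Model ids are assumptions; claude_image_block is my own helper.

```python
import base64
import os

PROMPT = "Describe this image in two sentences."
b64_image = ""  # placeholder: the base64 string you encoded once for all providers

def claude_image_block(b64: str, media_type: str = "image/png") -> dict:
    """Claude's image block: raw base64, explicit media_type, no data URI prefix."""
    return {"type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": b64}}

if os.environ.get("ANTHROPIC_API_KEY") and b64_image:
    import anthropic

    msg = anthropic.Anthropic().messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=512,
        messages=[{"role": "user", "content": [
            claude_image_block(b64_image),
            {"type": "text", "text": PROMPT},
        ]}],
    )
    print("Claude:", msg.content[0].text)

if os.environ.get("GOOGLE_API_KEY") and b64_image:
    from google import genai
    from google.genai import types

    resp = genai.Client().models.generate_content(
        model="gemini-2.0-flash",  # assumed model id
        contents=[
            types.Part.from_bytes(data=base64.b64decode(b64_image), mime_type="image/png"),
            PROMPT,
        ],
    )
    print("Gemini:", resp.text)
```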
In my experience, Gemini tends to be the fastest (and free), GPT-4o gives the most detailed descriptions, and Claude is the most precise when you ask for structured extraction. These rankings shift depending on the image type — run this with your own images to see.
Real-World Pipeline: Error Screenshot Analyzer
Time to combine everything into a practical tool. This pipeline takes a screenshot of an error (a common scenario in dev teams using Slack or Teams), identifies the error, suggests fixes, and formats the output as structured JSON.
The code works with any image URL. In a production version, you would accept uploaded screenshots and encode them as base64.
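A sketch of the whole pipeline. The system prompt, JSON schema, and strip_json_fences helper are my own design choices; the screenshot URL is a placeholder.

```python
import json
import os

SYSTEM = (
    "You are a senior engineer. Given a screenshot of an error, return ONLY JSON "
    'with keys: "error_type", "likely_cause", and "suggested_fixes" (list of strings).'
)

def strip_json_fences(text: str) -> str:
    """Remove the markdown code fences models often wrap around JSON output."""
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening fence line (``` or ```json)
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    return text.strip()

def analyze_error_screenshot(image_url: str) -> dict:
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic output, not creative interpretation
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": [
                {"type": "text", "text": "Analyze this error screenshot."},
                {"type": "image_url", "image_url": {"url": image_url, "detail": "high"}},
            ]},
        ],
    )
    return json.loads(strip_json_fences(resp.choices[0].message.content))

if os.environ.get("OPENAI_API_KEY"):  # skip the live call when no key is set
    print(analyze_error_screenshot("https://example.com/traceback.png"))  # hypothetical URL
```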
The temperature is set to 0 because we want consistent, deterministic output — not creative interpretation. The JSON stripping logic handles the common case where models wrap their output in markdown fences.
Common Mistakes with Multimodal APIs
After building several multimodal applications, these are the mistakes I see most often — including ones I made myself.
Mistake 1: Forgetting the media_type for Claude
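Side by side, the broken and fixed content blocks (base64 data truncated as a placeholder):

```python
# Wrong: no media_type; the Anthropic API rejects this block.
bad_block = {
    "type": "image",
    "source": {"type": "base64", "data": "iVBOR..."},
}

# Right: media_type stated explicitly.
good_block = {
    "type": "image",
    "source": {"type": "base64", "media_type": "image/png", "data": "iVBOR..."},
}
```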
Mistake 2: Sending the data URI Prefix to Claude
OpenAI expects the full data URI (data:image/png;base64,<data>). Claude expects raw base64 without the prefix. If you copy-paste code from an OpenAI project to Claude, the request will fail silently or produce garbage results.
```python
# Claude will reject this
"data": "data:image/png;base64,iVBOR..."

# Claude wants just the base64 string
"data": "iVBOR..."
```

Mistake 3: Not Setting detail for High-Resolution Images
The default detail in OpenAI's vision API is "auto". For text-heavy images like documents, receipts, and code screenshots, "auto" often picks "low" — which downscales to 512x512 pixels. Small text becomes unreadable. Always set detail: "high" for documents.
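Setting it is one extra field on the image_url part:

```python
# Force full-resolution tiling for text-heavy images.
image_part = {
    "type": "image_url",
    "image_url": {
        "url": "data:image/png;base64,...",  # your encoded document image (placeholder)
        "detail": "high",  # "low" | "high" | "auto" (the default)
    },
}
```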
Mistake 4: Ignoring Token Costs for Images
A single high-resolution image can cost 765+ tokens. Sending 10 pages of a PDF as high-detail images costs 7,650+ tokens just for the images — before any text in your prompt. I have seen developers hit unexpected bills because they batch-processed PDFs without estimating image token costs first.
Write a function called estimate_image_tokens that calculates the token cost of an image for OpenAI's vision API.
Rules (matching OpenAI's actual pricing):
- If detail is "low": always return 85 tokens
- If detail is "high": the image is scaled so the longest side is at most 2048px, then the shortest side is scaled to 768px. The image is then divided into 512x512 tiles. Each tile costs 170 tokens, plus a base of 85 tokens.

The function takes width (int), height (int), and detail (str, default "high") and returns the token count as an integer.
For the tile calculation: use math.ceil() to round up partial tiles.
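One possible solution, following the rules exactly as stated above (OpenAI's actual implementation also avoids upscaling small images; this sketch applies the 768px scaling unconditionally, per the exercise spec):

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate OpenAI vision token cost per the tiling rules in the exercise."""
    if detail == "low":
        return 85
    # Scale so the longest side is at most 2048px.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Then scale so the shortest side is 768px.
    scale = 768 / min(w, h)
    w, h = w * scale, h * scale
    # 170 tokens per 512x512 tile (partial tiles round up), plus an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85
```

For example, a 1024x1024 image scales to 768x768, which covers four tiles: 170 * 4 + 85 = 765, matching the 765-token figure mentioned earlier.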
When NOT to Use Multimodal AI
Multimodal AI is impressive, but it is not always the right tool. Here are cases where I recommend a different approach.
High-volume OCR. If you need to extract text from thousands of documents, a dedicated OCR engine like Tesseract or Amazon Textract is faster and cheaper. LLM vision costs scale linearly with page count, and at high volumes that adds up.
Pixel-precise tasks. LLMs cannot reliably count objects in an image, measure distances, or detect exact pixel coordinates. For tasks like object detection with bounding boxes, use a specialized computer vision model (YOLOv8, SAM) instead.
Real-time audio processing. Whisper processes audio after recording. It is not designed for live transcription with sub-second latency. For real-time speech-to-text, use services like Deepgram or AssemblyAI that stream results as audio arrives.
Summary
You now know how to build multimodal AI applications across three providers. Here is what we covered:
Each provider wraps images in its own content shape: OpenAI uses image_url, Claude uses image with source, Gemini uses Part.from_bytes(). Whisper handles audio through its own transcriptions endpoint, and PDFs go in either as rendered page images (any provider) or natively (Claude and Gemini).

The pattern across all of these is the same: prepare your non-text input (encode, format), attach it to a message alongside a text prompt, and parse the response. Once you have built one multimodal pipeline, every subsequent one follows the same shape.
Frequently Asked Questions
Can I send video to GPT-4o?
Not directly. GPT-4o does not accept video files. The workaround is to extract keyframes from the video (e.g., one frame every 2 seconds using ffmpeg or OpenCV) and send them as multiple images. Gemini is the only major provider with native video input support.
How do I handle HEIC images from iPhones?
Convert HEIC to JPEG or PNG before sending to any provider. Use the pillow-heif library: pip install pillow-heif, then open the HEIC file with Pillow and save as PNG. None of the major providers accept HEIC natively.
Is Whisper available in languages other than English?
Yes. Whisper supports 57 languages and can auto-detect the language. You can also force a specific language with the language parameter: client.audio.transcriptions.create(model="whisper-1", file=audio, language="es") for Spanish. Accuracy varies by language — English, Spanish, French, and German have the highest accuracy.
Can I fine-tune vision models?
OpenAI supports fine-tuning GPT-4o with image inputs as of late 2024. You provide training examples with images and expected outputs, just like text fine-tuning. This is useful for domain-specific tasks like medical imaging or manufacturing defect detection where the base model needs specialized knowledge.