Workflow:Googleapis Python genai Multimodal Content Generation

Knowledge Sources	Google GenAI Python SDK, Gemini API Docs, Vertex AI Docs
Domains	LLMs, Multimodal_AI, Generative_AI
Last Updated	2026-02-15 14:00 GMT

Overview

End-to-end process for generating content from multimodal inputs (text, images, PDFs, audio, video) using the Google GenAI SDK with Gemini models.

Description

This workflow covers content generation when the input includes non-text modalities such as images, documents, audio, or video alongside text prompts. It relies on the multimodal understanding capabilities of Gemini models. The process involves uploading files (for the Gemini Developer API) or referencing GCS URIs (for Vertex AI), constructing multimodal content parts, and generating responses that reason across all provided modalities.

Usage

Execute this workflow when your application needs to process images (image understanding, OCR, visual QA), documents (PDF summarization, analysis), audio (transcription, analysis), or video (description, analysis) alongside text prompts. This is essential for applications that require reasoning over visual or audio content.

Execution Steps

Step 1: Client Initialization

Create a GenAI client configured for either the Gemini Developer API or Vertex AI, following the same initialization pattern as text generation. The choice of backend affects how files are referenced: the Gemini Developer API works with files uploaded through the Files API, while Vertex AI can reference files directly via GCS URIs. On either backend, small files can also be sent inline as bytes.

Key considerations:

Gemini Developer API uses the Files API for file management
Vertex AI references files via gs:// URIs directly
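
The two initialization paths can be sketched as follows, assuming the google-genai package is installed. The helper names client_kwargs and make_client, the GEMINI_API_KEY environment variable, and the project and location values are illustrative assumptions, not part of the workflow itself.

```python
import os
from typing import Optional


def client_kwargs(use_vertex: bool, project: Optional[str] = None,
                  location: Optional[str] = None) -> dict:
    """Return keyword arguments for genai.Client for the chosen backend."""
    if use_vertex:
        if not project or not location:
            raise ValueError("Vertex AI needs both a project and a location")
        return {"vertexai": True, "project": project, "location": location}
    # Gemini Developer API: authenticate with an API key.
    # GEMINI_API_KEY is an assumed variable name for this sketch.
    return {"api_key": os.environ.get("GEMINI_API_KEY")}


def make_client(use_vertex: bool = False, **kwargs):
    """Build a client for the Gemini Developer API or for Vertex AI."""
    # Imported lazily so client_kwargs() can be used without the SDK installed.
    from google import genai
    return genai.Client(**client_kwargs(use_vertex, **kwargs))
```

Calling make_client() targets the Gemini Developer API, while make_client(True, project="my-project", location="us-central1") targets Vertex AI; both values are placeholders.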

Step 2: File Upload or Reference

Make media files available to the model. For the Gemini Developer API, upload files using client.files.upload(), which returns a File object with a URI. For Vertex AI, reference files directly using GCS URIs. Local files can also be passed inline as bytes using Part.from_bytes(). The SDK supports images (JPEG, PNG, GIF, WebP), documents (PDF), audio (MP3, WAV, etc.), and video formats.

Key considerations:

Uploaded files have a lifecycle and can be listed, retrieved, and deleted
Large files should be uploaded rather than inlined as bytes
The SDK can infer MIME types from file extensions
For GCS URIs, use Part.from_uri() with the appropriate mime_type
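
One way to choose between the three options is sketched below. The helper names, the 20 MB inline cut-off, and the decision rule are assumptions made for illustration; check the current request size limits for your backend. Note that client.files.upload() belongs to the Gemini Developer API path, and gs:// references to the Vertex AI path.

```python
import mimetypes
from pathlib import Path

# Assumed cut-off for this sketch; verify against the current API limits.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def guess_mime_type(path_or_uri: str) -> str:
    """Infer a MIME type from the file extension, failing loudly if unknown."""
    filename = path_or_uri.rsplit("/", 1)[-1]
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError(f"Cannot infer MIME type for {path_or_uri!r}")
    return mime_type


def choose_source(path_or_uri: str, size_bytes: int = 0) -> str:
    """Pick 'uri', 'upload', or 'inline' for a media input."""
    if path_or_uri.startswith("gs://"):
        return "uri"
    return "upload" if size_bytes > INLINE_LIMIT_BYTES else "inline"


def to_media(client, path_or_uri: str):
    """Return a Part or an uploaded File for use in the contents list."""
    from google.genai import types  # lazy import: helpers above need no SDK

    if path_or_uri.startswith("gs://"):
        # Vertex AI: reference the object in place.
        return types.Part.from_uri(
            file_uri=path_or_uri, mime_type=guess_mime_type(path_or_uri)
        )
    path = Path(path_or_uri)
    if choose_source(path_or_uri, path.stat().st_size) == "upload":
        # Gemini Developer API: upload through the Files API.
        return client.files.upload(file=path)
    # Small local file: send the bytes inline with an explicit MIME type.
    return types.Part.from_bytes(
        data=path.read_bytes(), mime_type=guess_mime_type(path_or_uri)
    )
```

Uploaded files can later be removed with client.files.delete(name=uploaded.name) once they are no longer needed.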

Step 3: Multimodal Content Assembly

Construct the contents parameter by combining text parts with media parts. Parts can be created from URIs (Part.from_uri), bytes (Part.from_bytes), uploaded file objects, or PIL images. Multiple parts of different types can be combined in a single Content object to provide the model with all necessary context.

Key considerations:

Mix text and media parts in a single list for the contents parameter
The SDK auto-converts uploaded File objects when included directly in the contents list
Order of parts can affect model attention and response quality
Media resolution can be configured via media_resolution in GenerateContentConfig
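
A minimal sketch of the assembly step follows. The helper names are hypothetical, and placing the text prompt after the media is a choice made for this example, not a requirement of the SDK. The second helper shows one possible media_resolution setting and assumes the google-genai package is installed.

```python
from typing import Any, Sequence


def assemble_contents(prompt: str, media: Sequence[Any]) -> list:
    """Combine media items and a text prompt into one contents list.

    Media items may be Part objects, uploaded File objects, or PIL images.
    The plain string is converted to a text part by the SDK.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    # Media first, prompt last (an ordering chosen for this sketch).
    return [*media, prompt]


def low_resolution_config():
    """Build a config that requests low media resolution to reduce token use."""
    from google.genai import types  # lazy import: not needed for assembly

    return types.GenerateContentConfig(
        media_resolution=types.MediaResolution.MEDIA_RESOLUTION_LOW,
    )
```

For example, assemble_contents("Compare these two charts.", [chart_a, chart_b]) yields a three-item list that can be passed as the contents parameter; chart_a and chart_b stand for parts or files prepared in Step 2.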

Step 4: Content Generation

Invoke generate_content or generate_content_stream with the assembled multimodal contents. The model processes all modalities together and generates a response. Both synchronous and asynchronous execution modes are supported, with optional streaming.

Key considerations:

Multimodal inputs consume more tokens than text-only inputs
Some models have specific input modality support (check model capabilities)
For image output, include IMAGE in response_modalities in the config and use a model that supports image generation
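
The blocking and streaming calls might be wrapped as shown below. The wrapper names are hypothetical; client is expected to be a genai.Client from Step 1 and contents a list built in Step 3.

```python
from typing import Any, Iterator, Optional


def generate_text(client: Any, model: str, contents: list,
                  config: Any = None) -> Optional[str]:
    """Blocking call: return the full text once generation has finished."""
    response = client.models.generate_content(
        model=model, contents=contents, config=config
    )
    return response.text


def stream_text(client: Any, model: str, contents: list,
                config: Any = None) -> Iterator[str]:
    """Streaming call: yield text fragments as chunks arrive."""
    for chunk in client.models.generate_content_stream(
        model=model, contents=contents, config=config
    ):
        if chunk.text:  # skip chunks that carry no text
            yield chunk.text
```

The asynchronous variants live under client.aio, for example await client.aio.models.generate_content(...), and take the same arguments. The model name is left to the caller; pick one whose documented capabilities cover the input modalities you send.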

Step 5: Response Processing

Extract the generated response. For text responses, use response.text. For image generation responses, iterate over response.parts and check for inline_data, using part.as_image() to retrieve generated images. Handle the response based on the expected output modality.

Key considerations:

Multimodal responses may contain mixed part types (text and images)
Use response.parts to iterate over all response parts
Generated images can be displayed with .show() or saved to disk
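
Handling a mixed response can be sketched as follows. The helper names, the file name prefix, and the .png extension are assumptions for illustration; the actual image format depends on the MIME type the model returns.

```python
from typing import Any, List, Tuple


def split_response(response: Any) -> Tuple[List[str], List[Any]]:
    """Separate a response into text fragments and parts holding inline data."""
    texts, image_parts = [], []
    for part in response.parts or []:
        if part.text is not None:
            texts.append(part.text)
        elif part.inline_data is not None:
            image_parts.append(part)
    return texts, image_parts


def save_images(response: Any, prefix: str = "generated") -> List[str]:
    """Save each generated image to disk and return the file paths."""
    _, image_parts = split_response(response)
    paths = []
    for index, part in enumerate(image_parts):
        path = f"{prefix}_{index}.png"  # extension is an assumption
        part.as_image().save(path)
        paths.append(path)
    return paths
```

For text-only responses, response.text remains the shorter route; the loop above is only needed when the output may contain images.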

GitHub URL

Workflow Repository