Workflow:Googleapis Python genai Multimodal Content Generation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Multimodal_AI, Generative_AI |
| Last Updated | 2026-02-15 14:00 GMT |
Overview
End-to-end process for generating content from multimodal inputs (text, images, PDFs, audio, video) using the Google GenAI SDK with Gemini models.
Description
This workflow covers the generation of content when the input includes non-text modalities such as images, documents, audio, or video alongside text prompts. It leverages the Gemini model's multimodal understanding capabilities. The process involves uploading files (for Gemini Developer API) or referencing GCS URIs (for Vertex AI), constructing multimodal content parts, and generating responses that reason across all provided modalities.
Usage
Use this workflow when an application needs to process images (image understanding, OCR, visual QA), documents (PDF summarization, analysis), audio (transcription, analysis), or video (description, analysis) alongside text prompts, that is, whenever the model must reason over visual or audio content rather than text alone.
Execution Steps
Step 1: Client Initialization
Create a GenAI client configured for either the Gemini Developer API or Vertex AI, following the same initialization pattern as text generation. The choice of backend affects how stored files are referenced: the Gemini Developer API takes them through the Files API, while Vertex AI can reference files directly via GCS URIs. On either backend, small local files can instead be sent as inline bytes.
Key considerations:
- Gemini Developer API uses the Files API for file management
- Vertex AI references files via gs:// URIs directly
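A minimal sketch of the two initialization paths, assuming the google-genai package is installed. The helper names client_kwargs and make_client are hypothetical and exist only for illustration; the keyword arguments they pass (api_key, vertexai, project, location) are the ones genai.Client accepts.

```python
from typing import Optional


def client_kwargs(backend: str, *, api_key: Optional[str] = None,
                  project: Optional[str] = None,
                  location: Optional[str] = None) -> dict:
    """Build keyword arguments for genai.Client (hypothetical helper)."""
    if backend == "gemini":
        if not api_key:
            raise ValueError("the Gemini Developer API needs an API key")
        return {"api_key": api_key}
    if backend == "vertex":
        if not (project and location):
            raise ValueError("Vertex AI needs a project and a location")
        return {"vertexai": True, "project": project, "location": location}
    raise ValueError(f"unknown backend: {backend!r}")


def make_client(backend: str, **options):
    """Create a client for the chosen backend.

    The SDK is imported lazily so that client_kwargs stays usable
    (and testable) on machines without google-genai installed.
    """
    from google import genai  # pip install google-genai

    return genai.Client(**client_kwargs(backend, **options))
```

For example, make_client("gemini", api_key="...") targets the Gemini Developer API, while make_client("vertex", project="my-project", location="us-central1") targets Vertex AI; the project and location values here are placeholders.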
Step 2: File Upload or Reference
Make media files available to the model. For the Gemini Developer API, upload files using client.files.upload(), which returns a File object with a URI. For Vertex AI, reference files directly using GCS URIs. Local files can also be provided as inline bytes using Part.from_bytes(). Supported inputs include images (JPEG, PNG, GIF, WebP), documents (PDF), audio (MP3, WAV, etc.), and video formats; the exact set depends on the model.
Key considerations:
- Uploaded files have a lifecycle and can be listed, retrieved, and deleted
- Large files should be uploaded rather than inlined as bytes
- client.files.upload() can infer the MIME type from the file extension; Part.from_bytes() requires an explicit mime_type
- For GCS URIs, use Part.from_uri() with the appropriate mime_type
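The options above might be combined as follows. This is a sketch under stated assumptions: the 20 MB cut-off is a placeholder (the real limit applies to the whole request and should be checked against the API documentation), and guess_mime_type, choose_delivery, make_media_part, and gcs_part are hypothetical helpers, not SDK functions.

```python
import mimetypes
import os

# Assumed cut-off for inlining; treat this value as a placeholder.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def guess_mime_type(path: str) -> str:
    """Infer a MIME type from the file extension using the standard library."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError(f"cannot infer MIME type for {path!r}")
    return mime_type


def choose_delivery(size_bytes: int, limit: int = INLINE_LIMIT_BYTES) -> str:
    """Return 'inline' for small payloads and 'upload' for large ones."""
    return "inline" if size_bytes <= limit else "upload"


def make_media_part(client, path: str):
    """Return an object that can be placed in `contents` for a local file."""
    from google.genai import types

    if choose_delivery(os.path.getsize(path)) == "inline":
        with open(path, "rb") as handle:
            return types.Part.from_bytes(
                data=handle.read(), mime_type=guess_mime_type(path))
    # Gemini Developer API: upload through the Files API and pass the File on.
    return client.files.upload(file=path)


def gcs_part(uri: str, mime_type: str):
    """Vertex AI: reference a Cloud Storage object without uploading it."""
    from google.genai import types

    return types.Part.from_uri(file_uri=uri, mime_type=mime_type)
```

Uploaded files can later be cleaned up with client.files.delete(name=uploaded.name), where uploaded is the File returned by client.files.upload().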
Step 3: Multimodal Content Assembly
Construct the contents parameter by combining text parts with media parts. Parts can be created from URIs (Part.from_uri), bytes (Part.from_bytes), uploaded file objects, or PIL images. Multiple parts of different types can be combined in a single Content object to provide the model with all necessary context.
Key considerations:
- Mix text and media parts in a single list for the contents parameter
- The SDK auto-converts uploaded File objects when included directly in the contents list
- The order of parts can affect how the model interprets the request, and therefore the quality of the response
- Media resolution can be configured via media_resolution in GenerateContentConfig
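As an illustration of the assembly step, the sketch below puts media items first and the text prompt last, which is one possible ordering rather than a rule. build_contents and low_resolution_config are hypothetical helper names; the media list may hold Part objects, uploaded File objects, or PIL images.

```python
def build_contents(prompt: str, media: list) -> list:
    """Combine media items and a text prompt into one contents list.

    Media comes first and the prompt last; this ordering is an
    assumption made for the example, not a requirement of the SDK.
    """
    if not prompt:
        raise ValueError("prompt must not be empty")
    return [*media, prompt]


def low_resolution_config():
    """Build a config that requests low media resolution.

    Lower resolution is expected to reduce the tokens spent on media,
    at the cost of fine visual detail.
    """
    from google.genai import types

    return types.GenerateContentConfig(
        media_resolution=types.MediaResolution.MEDIA_RESOLUTION_LOW,
    )
```

The list returned by build_contents can be passed directly as the contents argument, and the config object as the config argument, of the generation call.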
Step 4: Content Generation
Invoke client.models.generate_content() or client.models.generate_content_stream() with the assembled multimodal contents. The model processes all modalities together and generates a response. Both synchronous and asynchronous execution are supported (the asynchronous methods live under client.aio.models), with optional streaming in either mode.
Key considerations:
- Multimodal inputs consume more tokens than text-only inputs; client.models.count_tokens() can measure a request before it is sent
- Some models have specific input modality support (check model capabilities)
- For image output, include IMAGE in response_modalities in the config, for example response_modalities=['TEXT', 'IMAGE'], and use a model that can generate images
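The three call styles can be sketched as thin wrappers around an existing client. The function names are hypothetical; the wrappers rely only on client.models.generate_content, client.models.generate_content_stream, and client.aio.models.generate_content, and the model name in the usage note is just an example.

```python
def generate_text(client, model: str, contents: list, config=None):
    """Blocking call: send everything, wait for the full response."""
    response = client.models.generate_content(
        model=model, contents=contents, config=config)
    return response.text


def stream_text(client, model: str, contents: list, config=None) -> str:
    """Streaming call: consume chunks as they arrive and join their text."""
    pieces = []
    for chunk in client.models.generate_content_stream(
            model=model, contents=contents, config=config):
        if chunk.text:  # some chunks carry no text
            pieces.append(chunk.text)
    return "".join(pieces)


async def generate_text_async(client, model: str, contents: list, config=None):
    """Asynchronous call through the client's aio namespace."""
    response = await client.aio.models.generate_content(
        model=model, contents=contents, config=config)
    return response.text
```

A call such as stream_text(client, "gemini-2.5-flash", contents) returns the concatenated text once the stream ends; in an interactive application each chunk would typically be displayed as it arrives instead of being collected.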
Step 5: Response Processing
Extract the generated response. For text responses, use response.text. For image generation responses, iterate over response.parts and check for inline_data, using part.as_image() to retrieve generated images. Handle the response based on the expected output modality.
Key considerations:
- Multimodal responses may contain mixed part types (text and images)
- response.text covers only the text parts; iterate over response.parts to reach every part, including images
- Generated images can be displayed with .show() or saved to disk
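A sketch of handling a mixed response, assuming parts expose text and inline_data attributes and that part.as_image() returns an object with a save() method, as described above. split_parts and save_images are hypothetical helpers, and the .png extension is an assumption about the returned image format.

```python
def split_parts(parts):
    """Separate text from inline media in a list of response parts."""
    texts, media = [], []
    for part in parts or []:
        if getattr(part, "text", None):
            texts.append(part.text)
        elif getattr(part, "inline_data", None) is not None:
            media.append(part)
    return texts, media


def save_images(response, prefix: str = "generated") -> list:
    """Save every inline image in a response and return the written paths."""
    _, media = split_parts(response.parts)
    paths = []
    for index, part in enumerate(media):
        image = part.as_image()
        if image is None:  # inline data that is not an image
            continue
        path = f"{prefix}_{index}.png"  # extension is an assumption
        image.save(path)
        paths.append(path)
    return paths
```

For a text-only response, split_parts(response.parts) yields the same text as response.text split by part, with an empty media list.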