Workflow:Googleapis Python genai Context Caching

Knowledge Sources	Google GenAI Python SDK Gemini API Docs Vertex AI Docs
Domains	LLMs, Cost_Optimization, Generative_AI
Last Updated	2026-02-15 14:00 GMT

Overview

End-to-end process for creating and using cached content with Gemini models to reduce costs and latency for repeated prompts over the same large context.

Description

This workflow covers context caching, which allows pre-processing and storing large content (documents, files, system instructions) on Google's servers so it can be reused across multiple generate_content calls without re-transmitting or re-processing the same data. This significantly reduces both cost (cached tokens are billed at a reduced rate) and latency (cached content is pre-processed). The cache has a configurable TTL (time-to-live) and can be updated, listed, and deleted.

Usage

Execute this workflow when your application makes multiple queries against the same large context, such as repeatedly asking questions about the same set of documents, analyzing the same codebase, or running multiple prompts with the same extensive system instruction. Context caching is most beneficial when the cached content is large relative to the per-query content.

Execution Steps

Step 1: Client Initialization

Create a GenAI client configured for either the Gemini Developer API or Vertex AI. Context caching is available on both backends but the content source differs: Gemini Developer API uses uploaded files, while Vertex AI uses GCS URIs.

Key considerations:

Both backends support context caching
The caches module is accessible via client.caches

Step 2: Content Preparation

Prepare the large content to be cached. For the Gemini Developer API, upload files using client.files.upload() and obtain file URIs. For Vertex AI, reference content via GCS URIs. Construct Content objects with the appropriate parts (text, file URIs, etc.) that represent the static context to cache.

Key considerations:

Cached content must meet a minimum token threshold (model-dependent)
Include all content that will be reused across queries
System instructions can also be included in the cached content
Only content with role user can be cached

Step 3: Cache Creation

Create a cached content entry using client.caches.create() with the target model, content parts, optional system instruction, a display name, and a TTL (time-to-live). The TTL specifies how long the cache remains active (e.g., '3600s' for one hour). The response contains a CachedContent object with the cache name for reference.

Key considerations:

The model specified must match the model used in subsequent generate_content calls
TTL determines the cache lifetime; expired caches are automatically deleted
display_name helps identify caches when listing
Caching incurs storage costs proportional to the cached content size and TTL

Step 4: Content Generation with Cache

Use the cached content in generate_content calls by passing the cache name in the GenerateContentConfig's cached_content parameter. The query-specific content (user's question) is provided in the contents parameter as usual. The model combines the cached context with the new query content.

Key considerations:

The model parameter must match the model used when creating the cache
Only the new query content needs to be sent; cached content is referenced by name
Cached tokens are billed at a reduced rate compared to regular input tokens
The cached content acts as a prefix to the conversation

Step 5: Cache Management

Manage cached content using the CRUD operations on client.caches: get() to retrieve cache details, list() to enumerate all caches, update() to modify TTL or expiration, and delete() to remove caches that are no longer needed. Proactive management prevents unnecessary storage costs.

Key considerations:

Caches automatically expire after their TTL
Update the TTL to extend cache lifetime if still needed
Delete unused caches to reduce storage costs
List caches with pagination for managing multiple cache entries

Execution Diagram

GitHub URL

Workflow Repository