Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Googleapis Python genai Context Caching

From Leeroopedia
Knowledge Sources
Domains LLMs, Cost_Optimization, Generative_AI
Last Updated 2026-02-15 14:00 GMT

Overview

End-to-end process for creating and using cached content with Gemini models to reduce costs and latency for repeated prompts over the same large context.

Description

This workflow covers context caching, which allows pre-processing and storing large content (documents, files, system instructions) on Google's servers so it can be reused across multiple generate_content calls without re-transmitting or re-processing the same data. This significantly reduces both cost (cached tokens are billed at a reduced rate) and latency (cached content is pre-processed). The cache has a configurable TTL (time-to-live) and can be updated, listed, and deleted.

Usage

Execute this workflow when your application makes multiple queries against the same large context, such as repeatedly asking questions about the same set of documents, analyzing the same codebase, or running multiple prompts with the same extensive system instruction. Context caching is most beneficial when the cached content is large relative to the per-query content.

Execution Steps

Step 1: Client Initialization

Create a GenAI client configured for either the Gemini Developer API or Vertex AI. Context caching is available on both backends but the content source differs: Gemini Developer API uses uploaded files, while Vertex AI uses GCS URIs.

Key considerations:

  • Both backends support context caching
  • The caches module is accessible via client.caches

Step 2: Content Preparation

Prepare the large content to be cached. For the Gemini Developer API, upload files using client.files.upload() and obtain file URIs. For Vertex AI, reference content via GCS URIs. Construct Content objects with the appropriate parts (text, file URIs, etc.) that represent the static context to cache.

Key considerations:

  • Cached content must meet a minimum token threshold (model-dependent)
  • Include all content that will be reused across queries
  • System instructions can also be included in the cached content
  • Only content with role user can be cached

Step 3: Cache Creation

Create a cached content entry using client.caches.create() with the target model, content parts, optional system instruction, a display name, and a TTL (time-to-live). The TTL specifies how long the cache remains active (e.g., '3600s' for one hour). The response contains a CachedContent object with the cache name for reference.

Key considerations:

  • The model specified must match the model used in subsequent generate_content calls
  • TTL determines the cache lifetime; expired caches are automatically deleted
  • display_name helps identify caches when listing
  • Caching incurs storage costs proportional to the cached content size and TTL

Step 4: Content Generation with Cache

Use the cached content in generate_content calls by passing the cache name in the GenerateContentConfig's cached_content parameter. The query-specific content (user's question) is provided in the contents parameter as usual. The model combines the cached context with the new query content.

Key considerations:

  • The model parameter must match the model used when creating the cache
  • Only the new query content needs to be sent; cached content is referenced by name
  • Cached tokens are billed at a reduced rate compared to regular input tokens
  • The cached content acts as a prefix to the conversation

Step 5: Cache Management

Manage cached content using the CRUD operations on client.caches: get() to retrieve cache details, list() to enumerate all caches, update() to modify TTL or expiration, and delete() to remove caches that are no longer needed. Proactive management prevents unnecessary storage costs.

Key considerations:

  • Caches automatically expire after their TTL
  • Update the TTL to extend cache lifetime if still needed
  • Delete unused caches to reduce storage costs
  • List caches with pagination for managing multiple cache entries

Execution Diagram

GitHub URL

Workflow Repository