Principle: Googleapis Python genai Cache Creation
| Knowledge Sources | Details |
|---|---|
| Domains | Optimization, Cost_Reduction |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A mechanism for pre-computing and storing model context to reduce latency and cost for repeated queries over the same large content.
Description
Cache Creation stores processed content (documents, system instructions, few-shot examples) on the server side so subsequent generation requests can reference it without re-transmitting or re-processing the content. This is particularly valuable for applications that repeatedly query the same large context (e.g., a customer support bot querying a product manual, or an analyst querying a long report). Caches have a time-to-live (TTL) and are associated with a specific model. They reduce both input token costs and latency for repeated queries.
Usage
Use context caching when you have large, stable content (documents, system prompts, few-shot examples) that multiple generation requests will reference. Upload the content first (via files.upload), then create a cache with the content and a TTL. Reference the cache in subsequent generation calls. The cache saves costs when the cached content is large relative to the per-query content and the number of queries is sufficient to amortize the cache creation cost.
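The workflow above can be sketched with the `google-genai` Python SDK. This is a minimal sketch, not a definitive implementation: the file name, model name, system instruction, and TTL value are illustrative assumptions, and running it requires a valid API key in the environment (cache-supporting model versions and pricing should be checked against current documentation).

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# 1. Upload the large, stable content (hypothetical file name).
manual = client.files.upload(file="product_manual.pdf")

# 2. Create a cache tied to a specific model, with a TTL.
cache = client.caches.create(
    model="gemini-2.0-flash-001",  # caching requires an explicit model version
    config=types.CreateCachedContentConfig(
        contents=[manual],
        system_instruction="Answer questions using the product manual.",
        ttl="3600s",  # cache expires after one hour unless refreshed
    ),
)

# 3. Reference the cache in subsequent generation calls;
#    only the per-query content is sent as fresh input.
response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="How do I reset the device?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```

Note that the model named when creating the cache must match the model used at generation time, since the cache stores content processed for that specific model.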
Theoretical Basis
Context caching applies the memoization pattern at the level of model context:
```python
# Without caching: each query re-processes the full context
for query in queries:
    response = model.generate([large_document, query])  # O(D + Q) tokens each time

# With caching: the context is processed once
cache = create_cache(large_document)  # One-time cost: O(D) tokens
for query in queries:
    response = model.generate(query, cache=cache)  # O(Q) tokens each time
```
Cost savings increase linearly with the number of queries over the same cached context.
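The linear-savings claim can be made concrete with a small cost model. This sketch counts raw input tokens only; real billing typically charges cached tokens at a discounted (not zero) rate plus a per-hour storage fee, so actual break-even points differ.

```python
def total_tokens_without_cache(doc_tokens: int, query_tokens: int, n_queries: int) -> int:
    """Every query re-sends the full document plus the query: n * (D + Q)."""
    return n_queries * (doc_tokens + query_tokens)

def total_tokens_with_cache(doc_tokens: int, query_tokens: int, n_queries: int) -> int:
    """The document is processed once at cache creation; each query sends only Q."""
    return doc_tokens + n_queries * query_tokens

# Example: a 100k-token manual queried with 200-token questions.
doc, q = 100_000, 200
for n in (1, 10, 100):
    saved = total_tokens_without_cache(doc, q, n) - total_tokens_with_cache(doc, q, n)
    print(n, saved)  # savings grow as (n - 1) * doc_tokens
```

With one query there is no saving (the document is processed once either way); each additional query over the same cached context avoids another full pass over the document, which is the linear growth the section describes.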