Principle:Run llama Llama index Pipeline State Persistence
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, RAG, Data_Management |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
Pipeline state persistence enables saving and restoring the ingestion pipeline's cache and docstore to disk, allowing resumable ingestion across process restarts without re-processing already-ingested documents.
Description
Ingestion pipelines maintain two forms of state that benefit from persistence:
- Ingestion cache: Maps (input content hash, transformation identity) pairs to transformation output. When a document has not changed and the transformation is the same, the cached result is returned immediately without recomputation.
- Document store: Tracks which documents have been ingested, their content hashes, and their associated node IDs. This enables deduplication across pipeline runs.
Without persistence, both exist only in memory and are lost when the process exits. Persisting pipeline state to disk enables:
- Resumable ingestion: If a pipeline run is interrupted, the next run reuses the most recently persisted state instead of starting from scratch
- Incremental ingestion: Only new or changed documents are processed on subsequent runs
- Cost reduction: Expensive transformations (especially embedding generation) are not repeated for unchanged content
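The incremental-ingestion behavior listed above can be sketched with a plain dictionary standing in for the document store. This is a minimal illustration, not the library's implementation: `content_hash`, `select_docs_to_process`, and the dict-based `docstore` are hypothetical names, and SHA-256 is an assumed hash choice.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a stable SHA-256 hex digest of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_docs_to_process(docs: dict[str, str], docstore: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or whose content changed.

    `docstore` (hypothetical) maps document ID -> hash recorded on an
    earlier run; it is updated in place for every document selected.
    """
    to_process = []
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if docstore.get(doc_id) != digest:
            to_process.append(doc_id)
            docstore[doc_id] = digest
    return to_process


docstore: dict[str, str] = {}
first_run = select_docs_to_process({"a": "alpha", "b": "beta"}, docstore)
second_run = select_docs_to_process(
    {"a": "alpha", "b": "beta v2", "c": "gamma"}, docstore
)
print(first_run)   # ['a', 'b']
print(second_run)  # ['b', 'c']
```

On the second call only the changed document `b` and the new document `c` are selected; `a` is skipped because its recorded hash still matches.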
Usage
Call pipeline.persist() after a successful run to save state, and call pipeline.load() on a freshly constructed pipeline before subsequent runs to restore it. Both methods accept a persist_dir path and an optional fsspec-compatible filesystem object for non-local storage.
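The persist-then-load cycle can be illustrated with a toy class. This is a sketch under stated assumptions: `TinyPipeline` is a hypothetical stand-in written with the standard library only, not the library's pipeline class, and its `persist`/`load` methods mirror only the `persist_dir` argument described above. The file names follow the layout given under Persistence File Layout.

```python
import json
import tempfile
from pathlib import Path


class TinyPipeline:
    """Hypothetical stand-in for an ingestion pipeline (not the library class)."""

    def __init__(self) -> None:
        self.cache: dict[str, str] = {}     # input text -> transformation output
        self.docstore: dict[str, str] = {}  # document ID -> last ingested text

    def run(self, docs: dict[str, str]) -> list[str]:
        """Process new or changed documents; return the IDs actually processed."""
        processed = []
        for doc_id, text in docs.items():
            if self.docstore.get(doc_id) == text:
                continue  # already ingested and unchanged
            # Upper-casing is a stand-in transformation, stored once per input.
            self.cache.setdefault(text, text.upper())
            self.docstore[doc_id] = text
            processed.append(doc_id)
        return processed

    def persist(self, persist_dir: str) -> None:
        """Write both pieces of state as JSON files under persist_dir."""
        root = Path(persist_dir)
        root.mkdir(parents=True, exist_ok=True)
        (root / "cache.json").write_text(json.dumps(self.cache))
        (root / "docstore.json").write_text(json.dumps(self.docstore))

    def load(self, persist_dir: str) -> None:
        """Restore both pieces of state from persist_dir."""
        root = Path(persist_dir)
        self.cache = json.loads((root / "cache.json").read_text())
        self.docstore = json.loads((root / "docstore.json").read_text())


docs = {"a": "alpha", "b": "beta"}
with tempfile.TemporaryDirectory() as persist_dir:
    first = TinyPipeline()
    first_processed = first.run(docs)
    first.persist(persist_dir)

    # Simulate a process restart: a new object with no in-memory state.
    restarted = TinyPipeline()
    restarted.load(persist_dir)
    second_processed = restarted.run({**docs, "c": "gamma"})
    saved_files = sorted(p.name for p in Path(persist_dir).iterdir())
```

After the simulated restart, only the new document `c` is processed, because the restored docstore already records `a` and `b`.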
Theoretical Basis
Cache-Based Transformation Deduplication
The IngestionCache implements a content-addressable store for transformation results. Each cache entry is keyed by a combination of:
- The hash of the input nodes' content
- The identity/configuration of the transformation
Because both components feed into the key, a change to either the input data or the transformation parameters produces a different key, so the lookup misses and the transformation is re-executed:
# Conceptual cache operation
cache_key = hash(input_nodes_content) + hash(transformation_config)
if cache.has(cache_key):
    output_nodes = cache.get(cache_key)  # Skip transformation
else:
    output_nodes = transformation(input_nodes)
    cache.put(cache_key, output_nodes)  # Store for future runs
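A runnable variant of the conceptual operation above is sketched below. It makes several assumptions that are not taken from the library: the key is a single SHA-256 digest over a JSON payload rather than a sum of two hashes, the cache is a plain dict, and `cache_key`, `run_with_cache`, and `truncate` are hypothetical names.

```python
import hashlib
import json
from typing import Callable


def cache_key(node_texts: list[str], transform_config: dict) -> str:
    """Derive one digest from the input content and the transformation config."""
    payload = json.dumps(
        {"nodes": node_texts, "config": transform_config}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_with_cache(
    node_texts: list[str],
    transform: Callable[..., list[str]],
    transform_config: dict,
    cache: dict[str, list[str]],
) -> tuple[list[str], bool]:
    """Return (output, cache_hit), calling the transformation only on a miss."""
    key = cache_key(node_texts, transform_config)
    if key in cache:
        return cache[key], True  # skip the transformation
    output = transform(node_texts, **transform_config)
    cache[key] = output  # store for future runs
    return output, False


def truncate(texts: list[str], max_chars: int) -> list[str]:
    """Stand-in transformation: keep the first max_chars characters."""
    return [t[:max_chars] for t in texts]


cache: dict[str, list[str]] = {}
nodes = ["hello world", "ingestion"]
out1, hit1 = run_with_cache(nodes, truncate, {"max_chars": 5}, cache)
out2, hit2 = run_with_cache(nodes, truncate, {"max_chars": 5}, cache)
out3, hit3 = run_with_cache(nodes, truncate, {"max_chars": 3}, cache)
```

The second call is a hit because neither the nodes nor the config changed; the third is a miss because `max_chars` differs. A cryptographic digest is used here instead of Python's built-in `hash()`, whose string hashes are randomized per process by default and therefore cannot serve as keys in a cache that is persisted across restarts.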
Persistence File Layout
When persisted, the pipeline creates the following files in the persist_dir:
- cache.json (or custom name): Serialized IngestionCache containing all cached transformation results
- docstore.json (or custom name): Serialized BaseDocumentStore containing document records and their hashes
Both files are JSON-serialized and can be stored on local disk or any filesystem supported by fsspec.