Principle:Run llama Llama index Pipeline State Persistence

Knowledge Sources	LlamaIndex, LlamaIndex Ingestion Pipeline
Domains	Data_Ingestion, RAG, Data_Management
Last Updated	2026-02-11 00:00 GMT

Overview

Pipeline state persistence enables saving and restoring the ingestion pipeline's cache and docstore to disk, allowing resumable ingestion across process restarts without re-processing already-ingested documents.

Description

Ingestion pipelines maintain two forms of state that benefit from persistence:

Ingestion cache: Maps (input content hash, transformation identity) pairs to transformation output. When a document has not changed and the transformation is the same, the cached result is returned immediately without recomputation.
Document store: Tracks which documents have been ingested, their content hashes, and their associated node IDs. This enables deduplication across pipeline runs.

Without persistence, both exist only in memory and are lost when the process exits. Persisting pipeline state to disk enables:

Resumable ingestion: If a pipeline run is interrupted, the next run reuses the most recently persisted state instead of starting from scratch
Incremental ingestion: Only new or changed documents are processed on subsequent runs
Cost reduction: Expensive transformations (especially embedding generation) are not repeated for unchanged content
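The document-store deduplication described above can be sketched with the standard library alone. This is a minimal illustration, not the LlamaIndex implementation: `content_hash`, `select_changed`, and the `seen_hashes` dictionary are hypothetical names, with the dictionary standing in for the document store's mapping from document ID to content hash.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Return a stable digest of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_changed(docs: Dict[str, str], seen_hashes: Dict[str, str]) -> List[str]:
    """Return IDs of documents that are new or whose content has changed.

    `seen_hashes` (hypothetical) plays the role of the document store:
    it maps doc_id -> content hash and is updated in place.
    """
    to_process = []
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if seen_hashes.get(doc_id) != digest:
            # New document, or same ID with different content
            to_process.append(doc_id)
            seen_hashes[doc_id] = digest
    return to_process
```

If `seen_hashes` is saved between runs, a second call with unchanged input returns an empty list, which is the incremental behavior the list above describes.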

Usage

Call pipeline.persist() after a successful run to save state, and pipeline.load() before subsequent runs to restore it. Both methods accept a persist_dir path and an optional filesystem abstraction.
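The call order described above might look like the following sketch. The helper `run_with_persistence` is a hypothetical wrapper, not part of LlamaIndex; it assumes only that `pipeline` exposes `load(persist_dir)`, `run(documents=...)`, and `persist(persist_dir)`. The directory check before `load()` is a defensive assumption for the first run, when nothing has been persisted yet.

```python
import os
from typing import Any, List, Sequence


def run_with_persistence(
    pipeline: Any, documents: Sequence[Any], persist_dir: str
) -> List[Any]:
    """Restore saved state if present, run the pipeline, then save state.

    `pipeline` is assumed to provide load(), run(), and persist() as
    described in the Usage section; this wrapper itself is hypothetical.
    """
    if os.path.isdir(persist_dir):
        # Restore the cache and docstore written by an earlier run
        pipeline.load(persist_dir)
    nodes = pipeline.run(documents=documents)
    # Reached only if run() returned without raising
    pipeline.persist(persist_dir)
    return nodes
```

With LlamaIndex, `pipeline` would typically be an `IngestionPipeline` constructed with a docstore so that both forms of state exist to be saved. Because `persist()` is called only after `run()` returns, a failed run leaves the previously saved state untouched.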

Theoretical Basis

Cache-Based Transformation Deduplication

The IngestionCache implements a content-addressable store for transformation results. Each cache entry is keyed by a combination of:

The hash of the input nodes' content
The identity/configuration of the transformation

This ensures that if either the input data or the transformation parameters change, the cache key changes too, so the lookup misses and the transformation is re-executed:

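# Conceptual cache operation (a plain dict stands in for the IngestionCache)
import hashlib

def run_cached(cache, transformation, transformation_config, input_nodes_content):
    # The key changes if either the input content or the transformation config changes
    raw = input_nodes_content + repr(transformation_config)
    cache_key = hashlib.sha256(raw.encode("utf-8")).hexdigest()

    if cache_key in cache:
        return cache[cache_key]  # Cache hit: skip the transformation
    output_nodes = transformation(input_nodes_content)
    cache[cache_key] = output_nodes  # Store for future runs
    return output_nodes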

Persistence File Layout

When persisted, the pipeline creates the following files in the persist_dir:

cache.json (or a custom name passed as cache_name): Serialized IngestionCache containing all cached transformation results
docstore.json (or a custom name passed as docstore_name): Serialized BaseDocumentStore containing document records and their hashes

Both files are JSON-serialized and can be stored on local disk or any filesystem supported by fsspec.
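The two-file layout above can be illustrated with a standard-library sketch. It is not how LlamaIndex serializes its objects: `persist_state` and `load_state` are hypothetical helpers, and plain dictionaries stand in for the cache and docstore. The file names are taken from the layout described above.

```python
import json
import os
from typing import Any, Dict, Tuple

# File names follow the layout described above; both can be overridden.
CACHE_FNAME = "cache.json"
DOCSTORE_FNAME = "docstore.json"


def persist_state(
    persist_dir: str,
    cache: Dict[str, Any],
    docstore: Dict[str, Any],
    cache_name: str = CACHE_FNAME,
    docstore_name: str = DOCSTORE_FNAME,
) -> None:
    """Write the two state objects as JSON files inside persist_dir."""
    os.makedirs(persist_dir, exist_ok=True)
    for name, payload in ((cache_name, cache), (docstore_name, docstore)):
        with open(os.path.join(persist_dir, name), "w", encoding="utf-8") as f:
            json.dump(payload, f)


def load_state(
    persist_dir: str,
    cache_name: str = CACHE_FNAME,
    docstore_name: str = DOCSTORE_FNAME,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Read both files back; a missing file yields an empty dict."""
    loaded = []
    for name in (cache_name, docstore_name):
        path = os.path.join(persist_dir, name)
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                loaded.append(json.load(f))
        else:
            loaded.append({})
    return loaded[0], loaded[1]
```

To target a remote filesystem, the built-in `open` and `os` calls would be swapped for the equivalent methods on an fsspec filesystem object, which is the role of the optional filesystem argument.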

Related Pages

Implemented By

Implementation:Run_llama_Llama_index_IngestionPipeline_Persist
