Principle:Run llama Llama index Pipeline State Persistence

From Leeroopedia
Knowledge Sources
Domains: Data_Ingestion, RAG, Data_Management
Last Updated: 2026-02-11 00:00 GMT

Overview

Pipeline state persistence enables saving and restoring the ingestion pipeline's cache and docstore to disk, allowing resumable ingestion across process restarts without re-processing already-ingested documents.

Description

Ingestion pipelines maintain two forms of state that benefit from persistence:

  • Ingestion cache: Maps (input content hash, transformation identity) pairs to transformation output. When a document has not changed and the transformation is the same, the cached result is returned immediately without recomputation.
  • Document store: Tracks which documents have been ingested, their content hashes, and their associated node IDs. This enables deduplication across pipeline runs.
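The document store's role in deduplication can be sketched in plain Python. This is a minimal illustration, not LlamaIndex's implementation: `seen_hashes` and `filter_unchanged` are hypothetical names standing in for the docstore's record of document IDs and content hashes, and the choice of SHA-256 is an assumption.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable hash of a document's content (SHA-256 is an assumption here)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_unchanged(docs: dict[str, str], seen_hashes: dict[str, str]) -> list[str]:
    """Return IDs of new or changed documents and record their hashes."""
    to_process = []
    for doc_id, text in docs.items():
        digest = content_hash(text)
        # A missing ID means a new document; a different hash means it changed.
        if seen_hashes.get(doc_id) != digest:
            to_process.append(doc_id)
            seen_hashes[doc_id] = digest
    return to_process


seen_hashes: dict[str, str] = {}
first_run = filter_unchanged({"a": "alpha", "b": "beta"}, seen_hashes)
second_run = filter_unchanged({"a": "alpha", "b": "beta v2", "c": "gamma"}, seen_hashes)
print(first_run)   # ['a', 'b']
print(second_run)  # ['b', 'c']
```

On the second run only the changed document `b` and the new document `c` are selected; `a` is skipped because its recorded hash still matches.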

Without persistence, both live only in memory and are lost when the process exits. Persisting pipeline state to disk enables:

  • Resumable ingestion: If a pipeline run is interrupted, the next run starts from the most recently persisted state instead of from scratch
  • Incremental ingestion: Only new or changed documents are processed on subsequent runs
  • Cost reduction: Expensive transformations (especially embedding generation) are not repeated for unchanged content

Usage

Call pipeline.persist() after a successful run to save state, and pipeline.load() before subsequent runs to restore it. Both methods accept a persist_dir path and an optional fsspec filesystem object (fs) for storage that is not on local disk.
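A minimal sketch of that call order, assuming the `llama-index-core` package is installed. The `run_ingestion` helper, the directory name, the document IDs, and the splitter settings are illustrative choices, not part of the original page. `cache_name` and `docstore_name` are passed explicitly so the files match the names used in this article, and the library imports sit inside the function only so that the snippet can be defined without the package present.

```python
import os


def run_ingestion(texts, persist_dir="./pipeline_storage"):
    """Run a small pipeline, restoring state before and saving it after."""
    # Requires llama-index-core; imported here so the module loads without it.
    from llama_index.core import Document
    from llama_index.core.ingestion import IngestionPipeline
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.core.storage.docstore import SimpleDocumentStore

    pipeline = IngestionPipeline(
        transformations=[SentenceSplitter(chunk_size=256, chunk_overlap=0)],
        docstore=SimpleDocumentStore(),
    )

    # Restore earlier state, if any, before running.
    if os.path.isdir(persist_dir):
        pipeline.load(
            persist_dir, cache_name="cache.json", docstore_name="docstore.json"
        )

    documents = [Document(text=t, id_=f"doc-{i}") for i, t in enumerate(texts)]
    nodes = pipeline.run(documents=documents)

    # Save state only after the run has completed.
    pipeline.persist(
        persist_dir, cache_name="cache.json", docstore_name="docstore.json"
    )
    return nodes
```

Calling `run_ingestion` twice with the same texts is expected to do less work the second time, since the restored docstore already holds the documents' hashes. The exact return value depends on the library version and the docstore strategy in use.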

Theoretical Basis

Cache-Based Transformation Deduplication

The IngestionCache implements a content-addressable store for transformation results. Each cache entry is keyed by a combination of:

  1. The hash of the input nodes' content
  2. The identity/configuration of the transformation

Because the key depends on both, changing either the input data or the transformation parameters produces a different key, so the lookup misses and the transformation is re-executed:

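# Conceptual cache operation (illustrative sketch, not the library's code)
import hashlib

def run_with_cache(cache, transformation, transformation_config, input_nodes_content):
    # The key covers both the input content and the transformation config.
    # sha256 is stable across processes, unlike Python's built-in hash().
    combined = f"{input_nodes_content}|{transformation_config}"
    cache_key = hashlib.sha256(combined.encode("utf-8")).hexdigest()

    if cache_key in cache:
        return cache[cache_key]  # Skip transformation
    output_nodes = transformation(input_nodes_content)
    cache[cache_key] = output_nodes  # Store for future runs
    return output_nodes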

Persistence File Layout

When persisted, the pipeline creates the following files in the persist_dir:

  • cache.json (or custom name): Serialized IngestionCache containing all cached transformation results
  • docstore.json (or custom name): Serialized BaseDocumentStore containing document records and their hashes

Both files are JSON-serialized and can be stored on local disk or any filesystem supported by fsspec.
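The two-file layout can be illustrated with a standard-library round trip. This is a sketch under stated assumptions: `persist_state` and `load_state` are hypothetical helpers, the dictionaries are placeholder data rather than the real serialized structures, and only local disk is covered (no fsspec).

```python
import json
import tempfile
from pathlib import Path


def persist_state(persist_dir, cache, docstore,
                  cache_name="cache.json", docstore_name="docstore.json"):
    """Write both pieces of state as JSON files under persist_dir."""
    root = Path(persist_dir)
    root.mkdir(parents=True, exist_ok=True)
    (root / cache_name).write_text(json.dumps(cache), encoding="utf-8")
    (root / docstore_name).write_text(json.dumps(docstore), encoding="utf-8")


def load_state(persist_dir, cache_name="cache.json", docstore_name="docstore.json"):
    """Read state back; a missing file yields empty state (first run)."""
    root = Path(persist_dir)
    state = []
    for name in (cache_name, docstore_name):
        path = root / name
        if path.exists():
            state.append(json.loads(path.read_text(encoding="utf-8")))
        else:
            state.append({})
    return tuple(state)


with tempfile.TemporaryDirectory() as tmp:
    persist_state(tmp, {"key-1": ["node-a"]}, {"doc-1": "hash-1"})
    written = sorted(p.name for p in Path(tmp).iterdir())
    cache, docstore = load_state(tmp)

print(written)  # ['cache.json', 'docstore.json']
```

Treating a missing file as empty state lets the same loading code serve both the first run and later runs.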
