Principle:Run llama Llama index Ingestion Pipeline Construction

From Leeroopedia
Knowledge Sources
Domains Data_Ingestion, RAG, Pipeline_Architecture
Last Updated 2026-02-11 00:00 GMT

Overview

Ingestion pipeline construction is the process of composing a sequence of transformation steps (splitting, embedding, metadata extraction) into a reusable, configurable pipeline for document processing.

Description

The pipeline pattern in LlamaIndex allows developers to declaratively define a chain of transformations that documents pass through during ingestion. Rather than manually orchestrating each step, the pipeline manages:

  • Transformation sequencing: The first transformation receives the input documents; each later transformation receives the nodes output by the previous step
  • Caching: Intermediate results are cached to avoid recomputing transformations on unchanged data
  • Deduplication: An optional docstore tracks which documents have already been processed
  • Vector store integration: Processed nodes can be automatically inserted into a vector store

This design separates the what (which transformations to apply, and in what order) from the how (execution, caching, deduplication, and storage), making pipelines easier to configure and maintain.
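The caching behavior described above can be illustrated with a minimal sketch in plain Python. It is not LlamaIndex code: upper_case, cache_key, and run_cached are hypothetical names, and keying the cache on a hash of the transformation name plus the input texts is an assumption made for illustration. The library's actual keying scheme may differ.

```python
import hashlib
from typing import Callable, Dict, List

# Counts real executions so cache hits are observable (illustration only).
calls = {"count": 0}


def upper_case(texts: List[str]) -> List[str]:
    """Hypothetical stand-in for an expensive transformation."""
    calls["count"] += 1
    return [t.upper() for t in texts]


def cache_key(texts: List[str], transform: Callable) -> str:
    """Assumed keying scheme: hash of the transformation name plus its input."""
    payload = transform.__name__ + "\x00" + "\x00".join(texts)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(
    texts: List[str],
    transform: Callable[[List[str]], List[str]],
    cache: Dict[str, List[str]],
) -> List[str]:
    """Run the transformation only if this exact input has not been seen."""
    key = cache_key(texts, transform)
    if key not in cache:
        cache[key] = transform(texts)
    return cache[key]


cache: Dict[str, List[str]] = {}
first = run_cached(["a", "b"], upper_case, cache)   # computed
second = run_cached(["a", "b"], upper_case, cache)  # unchanged input: cache hit
third = run_cached(["a", "c"], upper_case, cache)   # changed input: recomputed
```

Because the key depends on both the transformation and its input, changing either one invalidates the cached entry, while re-running identical input skips the work.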

Usage

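Construct an IngestionPipeline by passing a list of transformations (node parsers, embedding models, metadata extractors) and, optionally, a vector_store, a docstore, and a cache. Calling run(documents=...) on the pipeline then executes the transformations in order and returns the resulting nodes.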

Theoretical Basis

Pipeline Composition

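The pipeline composes transformations sequentially, in the style of the pipes-and-filters pattern: every transformation runs, in order, on the output of the previous one. (This differs from a strict chain of responsibility, in which a handler may stop the chain.) Each transformation implements the TransformComponent interface, whose __call__ method accepts a list of nodes and returns a list of nodes: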

# Conceptual pipeline flow
# Input:  [Document_1, Document_2, ...]
#   -> Transformation_1 (e.g., SentenceSplitter)
#   -> [Node_1a, Node_1b, Node_2a, ...]
#   -> Transformation_2 (e.g., OpenAIEmbedding)
#   -> [Node_1a_with_embedding, Node_1b_with_embedding, ...]
#   -> Vector Store (automatic insertion, if one is configured)
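
The flow above can be approximated in plain Python. This is a minimal sketch of sequential composition, not the LlamaIndex implementation: Node, split_sentences, toy_embed, and run_pipeline are hypothetical stand-ins for a node class, a splitter, an embedding model, and the pipeline runner.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a LlamaIndex node."""
    text: str
    embedding: Optional[List[float]] = None


# A transformation is any callable mapping a list of nodes to a list of nodes.
Transformation = Callable[[List[Node]], List[Node]]


def split_sentences(nodes: List[Node]) -> List[Node]:
    """Toy splitter: one output node per period-terminated sentence."""
    out: List[Node] = []
    for node in nodes:
        for sentence in node.text.split("."):
            if sentence.strip():
                out.append(Node(text=sentence.strip()))
    return out


def toy_embed(nodes: List[Node]) -> List[Node]:
    """Toy embedder: [character count, word count] instead of a real model."""
    for node in nodes:
        node.embedding = [float(len(node.text)), float(len(node.text.split()))]
    return nodes


def run_pipeline(nodes: List[Node], transformations: List[Transformation]) -> List[Node]:
    """Feed the output of each transformation into the next one, in order."""
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


docs = [Node("Pipelines chain steps. Each step returns nodes."), Node("Order matters.")]
result = run_pipeline(docs, [split_sentences, toy_embed])
```

Since every step shares the same list-in, list-out signature, steps can be reordered, removed, or added without changing the runner.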

Configuration Components

A fully configured pipeline combines several components:

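  • transformations: Ordered list of TransformComponent instances. This is the core setting; if it is omitted, recent LlamaIndex releases fall back to a default splitter and embedding model rather than raising an error. Common transformations include SentenceSplitter, TitleExtractor, and embedding models.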
  • vector_store: Optional BasePydanticVectorStore for automatic node insertion after all transformations complete.
  • docstore: Optional BaseDocumentStore for deduplication tracking.
  • cache: Optional IngestionCache for caching intermediate transformation results.
  • docstore_strategy: Controls deduplication behavior when a docstore is present (default: UPSERTS).
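
The deduplication role of the docstore can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the library's code: content_hash and filter_upserts are hypothetical helpers, and the sketch assumes an upsert-style strategy that stores one content hash per document ID and reprocesses a document only when it is new or its hash has changed.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Fingerprint a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_upserts(
    docs: List[Tuple[str, str]], seen: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return the (doc_id, text) pairs that are new or changed.

    `seen` plays the role of the docstore: it maps doc_id -> last content hash
    and is updated in place for every document that will be processed.
    """
    to_process: List[Tuple[str, str]] = []
    for doc_id, text in docs:
        digest = content_hash(text)
        if seen.get(doc_id) == digest:
            continue  # unchanged since the last run: skip
        seen[doc_id] = digest  # new or changed: record and process
        to_process.append((doc_id, text))
    return to_process


seen: Dict[str, str] = {}
run1 = filter_upserts([("a", "alpha"), ("b", "beta")], seen)
run2 = filter_upserts([("a", "alpha"), ("b", "beta v2"), ("c", "gamma")], seen)
```

On the second run only the changed document and the new document are passed on, so unchanged documents never reach the transformations again.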
