Principle:Run llama Llama index Ingestion Pipeline Construction

Knowledge Sources	LlamaIndex, LlamaIndex Ingestion Pipeline
Domains	Data_Ingestion, RAG, Pipeline_Architecture
Last Updated	2026-02-11 00:00 GMT

Overview

Ingestion pipeline construction is the process of composing a sequence of transformation steps (splitting, embedding, metadata extraction) into a reusable, configurable pipeline for document processing.

Description

The pipeline pattern in LlamaIndex allows developers to declaratively define a chain of transformations that documents pass through during ingestion. Rather than manually orchestrating each step, the pipeline manages:

Transformation sequencing: Each transformation receives the output nodes of the previous step
Caching: Intermediate results are cached to avoid recomputing transformations on unchanged data
Deduplication: An optional docstore tracks which documents have already been processed
Vector store integration: Processed nodes can be automatically inserted into a vector store

This design separates the what (which transformations to apply) from the how (execution order, caching, deduplication), making pipelines easier to configure and maintain.
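The caching idea can be illustrated with a small, library-independent sketch. It is not LlamaIndex's implementation: the helper run_with_cache, the plain dict used as a cache, and the string "nodes" are all hypothetical stand-ins, and the assumption is simply that a cache key is derived from the transformation's identity plus its input.

```python
import hashlib
from typing import Callable, Dict, List

# Hypothetical stand-in: a "transformation" is any callable mapping a list of
# text nodes to a new list of text nodes.
Transformation = Callable[[List[str]], List[str]]


def run_with_cache(
    nodes: List[str],
    transformations: List[Transformation],
    cache: Dict[str, List[str]],
) -> List[str]:
    """Apply transformations in order, reusing cached results for unchanged input."""
    for transform in transformations:
        # Assumed key scheme: hash of the transformation name plus its input nodes.
        payload = transform.__name__ + "\x00" + "\x00".join(nodes)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = transform(nodes)
        # Copy so later steps cannot mutate the cached list.
        nodes = list(cache[key])
    return nodes


call_counts = {"split_sentences": 0}


def split_sentences(nodes: List[str]) -> List[str]:
    call_counts["split_sentences"] += 1
    return [s.strip() for text in nodes for s in text.split(".") if s.strip()]


def uppercase(nodes: List[str]) -> List[str]:
    return [node.upper() for node in nodes]


cache: Dict[str, List[str]] = {}
first = run_with_cache(["One. Two."], [split_sentences, uppercase], cache)
second = run_with_cache(["One. Two."], [split_sentences, uppercase], cache)
```

On the second call the input is unchanged, so both steps are served from the cache and split_sentences is not invoked again.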

Usage

Construct an IngestionPipeline by providing a list of transformations (node parsers, embedding models, metadata extractors) and optionally a vector_store, docstore, and cache.

Theoretical Basis

Pipeline Composition

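The pipeline composes transformations sequentially, in the style of a pipes-and-filters design: every transformation runs in order, and the output of one step becomes the input of the next. Each transformation implements the TransformComponent interface with a __call__ method that accepts a list of nodes and returns a list of nodes: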

# Conceptual pipeline flow
# Input:  [Document_1, Document_2, ...]
#   -> Transformation_1 (e.g., SentenceSplitter)
#   -> [Node_1a, Node_1b, Node_2a, ...]
#   -> Transformation_2 (e.g., OpenAIEmbedding)
#   -> [Node_1a_with_embedding, Node_1b_with_embedding, ...]
#   -> Vector Store (automatic insertion, if a vector_store is configured)
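The flow above can be mimicked in plain Python to show why a shared list-in, list-out call signature is enough to compose steps. This is a conceptual sketch only: Node, WordWindowSplitter, LengthEmbedder, and run_pipeline are hypothetical names invented for illustration, not LlamaIndex classes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Node:
    """Hypothetical minimal node: text plus an optional embedding."""
    text: str
    embedding: Optional[List[float]] = None


class WordWindowSplitter:
    """Toy splitter: cuts each node into chunks of `window` words."""

    def __init__(self, window: int) -> None:
        self.window = window

    def __call__(self, nodes: Sequence[Node]) -> List[Node]:
        chunks: List[Node] = []
        for node in nodes:
            words = node.text.split()
            for start in range(0, len(words), self.window):
                chunks.append(Node(text=" ".join(words[start:start + self.window])))
        return chunks


class LengthEmbedder:
    """Toy embedder: a one-dimensional 'embedding' equal to the text length."""

    def __call__(self, nodes: Sequence[Node]) -> List[Node]:
        return [Node(text=n.text, embedding=[float(len(n.text))]) for n in nodes]


def run_pipeline(
    documents: Sequence[Node],
    transformations: Sequence[Callable[[Sequence[Node]], List[Node]]],
) -> List[Node]:
    """Feed the output of each transformation into the next one."""
    nodes = list(documents)
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


result = run_pipeline(
    [Node(text="a b c d e")],
    [WordWindowSplitter(window=2), LengthEmbedder()],
)
```

Because every step has the same signature, steps can be reordered, added, or removed by editing the list rather than the orchestration code.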

Configuration Components

A fully configured pipeline combines several components:

transformations: Ordered list of TransformComponent instances (required). Common transformations include SentenceSplitter, TitleExtractor, and embedding models.
vector_store: Optional BasePydanticVectorStore for automatic node insertion after all transformations complete.
docstore: Optional BaseDocumentStore for deduplication tracking.
cache: Optional IngestionCache for caching intermediate transformation results.
docstore_strategy: Controls deduplication behavior when a docstore is present (default: UPSERTS).
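To make the deduplication role of the docstore more concrete, the following sketch models an upsert-style check in plain Python. It is an assumption-laden simplification, not the library's code: filter_upserts is a hypothetical helper, the docstore is a plain dict mapping document IDs to content hashes, and documents are (doc_id, text) tuples.

```python
import hashlib
from typing import Dict, List, Tuple

Doc = Tuple[str, str]  # hypothetical shape: (doc_id, text)


def filter_upserts(documents: List[Doc], docstore: Dict[str, str]) -> List[Doc]:
    """Return only documents that are new or whose content hash has changed.

    The docstore is updated in place so that a later run can skip
    documents that have not changed since this one.
    """
    to_process: List[Doc] = []
    for doc_id, text in documents:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if docstore.get(doc_id) == digest:
            continue  # same ID, same content: already processed
        docstore[doc_id] = digest  # new ID or changed content: record and process
        to_process.append((doc_id, text))
    return to_process


docstore: Dict[str, str] = {}
first_run = filter_upserts([("a", "alpha"), ("b", "beta")], docstore)
second_run = filter_upserts([("a", "alpha"), ("b", "beta v2")], docstore)
```

In the second run only document "b" is returned, because its content changed while document "a" matches the stored hash.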

Related Pages

Implemented By

Implementation:Run_llama_Llama_index_IngestionPipeline_Init
