Principle:Run llama Llama index Ingestion Pipeline Construction
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, RAG, Pipeline_Architecture |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
Ingestion pipeline construction is the process of composing a sequence of transformation steps (splitting, embedding, metadata extraction) into a reusable, configurable pipeline for document processing.
Description
The pipeline pattern in LlamaIndex allows developers to declaratively define a chain of transformations that documents pass through during ingestion. Rather than manually orchestrating each step, the pipeline manages:
- Transformation sequencing: Each transformation receives the output nodes of the previous step
- Caching: Intermediate results are cached to avoid recomputing transformations on unchanged data
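- Deduplication: An optional docstore records which documents have already been processed, so unchanged documents can be skipped on later runs
- Vector store integration: If a vector store is configured, nodes that carry an embedding after the final transformation are inserted into it automatically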
This design separates the what (which transformations to apply) from the how (execution order, caching, deduplication), making pipelines easier to configure and maintain.
Usage
Construct an IngestionPipeline by providing a list of transformations (node parsers, embedding models, metadata extractors) and optionally a vector_store, docstore, and cache.
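A construction along these lines might look like the sketch below. It assumes the `llama-index-core` package is installed and uses `MockEmbedding` as a placeholder so that no external embedding service is needed. The chunk sizes, the embedding dimension, and the helper names `build_pipeline` and `ingest` are illustrative choices, not values from the source.

```python
def build_pipeline():
    """Build an in-memory pipeline; requires the llama-index-core package."""
    # Imports are local so this block can be loaded without llama_index installed.
    from llama_index.core.embeddings import MockEmbedding
    from llama_index.core.ingestion import IngestionCache, IngestionPipeline
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.core.storage.docstore import SimpleDocumentStore

    return IngestionPipeline(
        transformations=[
            # Step 1: split documents into chunks (sizes are illustrative).
            SentenceSplitter(chunk_size=256, chunk_overlap=32),
            # Step 2: attach embeddings; swap in a real embedding model here.
            MockEmbedding(embed_dim=8),
        ],
        docstore=SimpleDocumentStore(),  # optional: deduplication tracking
        cache=IngestionCache(),          # optional: transformation caching
    )


def ingest(texts):
    """Run the pipeline over raw strings and return the resulting nodes."""
    from llama_index.core import Document

    pipeline = build_pipeline()
    return pipeline.run(documents=[Document(text=t) for t in texts])
```

A vector_store argument can be passed to the constructor in the same way when the processed nodes should be written to a store.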
Theoretical Basis
Pipeline Composition
The pipeline follows a sequential pipeline (pipes-and-filters) pattern, in which every step processes the data and hands its result to the next step. Each transformation implements the TransformComponent interface with a __call__ method that accepts and returns a list of nodes:
# Conceptual pipeline flow
# Input: [Document_1, Document_2, ...]
# -> Transformation_1 (e.g., SentenceSplitter)
# -> [Node_1a, Node_1b, Node_2a, ...]
# -> Transformation_2 (e.g., OpenAIEmbedding)
# -> [Node_1a_with_embedding, Node_1b_with_embedding, ...]
# -> Vector Store (automatic insertion)
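The flow above can be modeled in a few lines of plain Python. This is a conceptual sketch only: `Node`, `SplitOnPeriod`, `LengthEmbedding`, and `run_pipeline` are hypothetical stand-ins and not LlamaIndex classes. It shows how a uniform "list of nodes in, list of nodes out" signature makes transformations composable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class Node:
    """Hypothetical stand-in for a LlamaIndex node."""
    text: str
    embedding: Optional[List[float]] = None
    metadata: Dict[str, str] = field(default_factory=dict)


# A transformation is any callable mapping a list of nodes to a list of nodes.
Transformation = Callable[[Sequence[Node]], List[Node]]


class SplitOnPeriod:
    """Toy splitter: one output node per period-delimited fragment."""

    def __call__(self, nodes: Sequence[Node]) -> List[Node]:
        out: List[Node] = []
        for node in nodes:
            for part in node.text.split("."):
                if part.strip():
                    out.append(Node(text=part.strip(), metadata=dict(node.metadata)))
        return out


class LengthEmbedding:
    """Toy embedder: a one-dimensional 'embedding' holding the text length."""

    def __call__(self, nodes: Sequence[Node]) -> List[Node]:
        for node in nodes:
            node.embedding = [float(len(node.text))]
        return list(nodes)


def run_pipeline(
    nodes: Sequence[Node], transformations: Sequence[Transformation]
) -> List[Node]:
    """Feed each transformation the output of the previous one, in order."""
    current = list(nodes)
    for transform in transformations:
        current = transform(current)
    return current


docs = [Node(text="First sentence. Second sentence.")]
result = run_pipeline(docs, [SplitOnPeriod(), LengthEmbedding()])
```

Because every step shares the same signature, reordering or swapping steps only requires editing the list passed to `run_pipeline`.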
Configuration Components
A fully configured pipeline combines several components:
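- transformations: Ordered list of TransformComponent instances that defines the processing steps. Common transformations include SentenceSplitter, TitleExtractor, and embedding models. In recent llama_index releases the argument may be omitted, in which case the pipeline falls back to a default splitter and the globally configured embedding model.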
- vector_store: Optional BasePydanticVectorStore for automatic node insertion after all transformations complete.
- docstore: Optional BaseDocumentStore for deduplication tracking.
- cache: Optional IngestionCache for caching intermediate transformation results.
- docstore_strategy: Controls deduplication behavior when a docstore is present (default: UPSERTS).
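As a rough illustration of upsert-style deduplication, the sketch below keeps a mapping from document ID to content hash and forwards only new or changed documents. This is a simplified model under that assumption, not the library's implementation. The names `content_hash` and `filter_upserts` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_upserts(
    docs: List[Tuple[str, str]], seen: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (doc_id, text) pairs that are new or changed.

    `seen` plays the role of the docstore: it maps doc_id -> content hash
    and is updated in place for every document that gets processed.
    """
    to_process: List[Tuple[str, str]] = []
    for doc_id, text in docs:
        digest = content_hash(text)
        if seen.get(doc_id) == digest:
            continue  # same ID, same content: skip
        seen[doc_id] = digest  # new ID, or known ID with changed content
        to_process.append((doc_id, text))
    return to_process


seen: Dict[str, str] = {}
run1 = filter_upserts([("a", "hello"), ("b", "world")], seen)   # both are new
run2 = filter_upserts([("a", "hello"), ("b", "world!")], seen)  # only "b" changed
```

On the second run only document "b" is returned, since "a" is unchanged and "b" has new content under a known ID.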