Principle:Unstructured IO Unstructured Ingest Processing Configuration

From Leeroopedia
Knowledge Sources
Domains Data_Ingestion, ETL, Configuration
Last Updated 2026-02-12 00:00 GMT

Overview

A configuration pattern for controlling how documents are processed within the ingest pipeline, including strategy selection, parallelism, metadata filtering, and file pattern matching.

Description

Processing configuration governs the behavior of the partition step within the ingest pipeline. While source configuration determines where documents come from and destination configuration determines where they go, processing configuration determines how they are processed.

Key configuration axes include:

  • Strategy: Which partition strategy to use (auto, fast, hi_res, ocr_only)
  • Parallelism: How many documents to process concurrently
  • Metadata filtering: Which metadata fields to include or exclude
  • File filtering: Which files to process based on glob patterns
  • Reprocessing: Whether to re-partition previously processed files
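The axes above can be sketched as a plain configuration object. The field names mirror the parameters listed here, but the class itself is illustrative, not the library's actual API:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ProcessingConfig:
    """Illustrative container for the processing-configuration axes."""
    strategy: str = "auto"               # auto | fast | hi_res | ocr_only
    num_processes: Optional[int] = None  # None -> fall back to os.cpu_count()
    metadata_exclude: List[str] = field(default_factory=list)
    file_glob: str = "*"                 # only files matching this glob are processed
    reprocess: bool = False              # force re-partition of processed files

# Example: high-resolution partitioning restricted to PDFs
cfg = ProcessingConfig(strategy="hi_res", file_glob="*.pdf")
```

Grouping these settings in one object keeps them independent of source and destination configuration, which is the separation the Usage section below relies on.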

Usage

Use this principle when tuning ingest pipeline behavior for throughput, output format, or resource constraints. Processing configuration is independent of source and destination, so the same processing settings can be applied regardless of where documents come from or go.

Theoretical Basis

Processing configuration maps directly to parameters of the underlying partition function, whether supplied through the ingest CLI or a Python entry point:

# Abstract processing configuration (illustrative call, not a literal API)
pipeline.configure(
    strategy="hi_res",           # Maps to partition(strategy=...)
    num_processes=4,             # Parallel worker count
    metadata_exclude=["filename", "file_directory"],  # Strip fields
    file_glob="*.pdf",           # Only process PDFs
    reprocess=True,              # Force re-partition
    work_dir="/tmp/work",        # Intermediate file storage
    verbose=True,                # Detailed logging
)
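The metadata_exclude setting can be sketched as a post-partition pass that drops excluded fields from each element. The dict-based element shape and the function name below are assumptions for illustration only:

```python
def apply_metadata_exclude(elements, exclude):
    """Drop each excluded metadata field from every partitioned element."""
    for element in elements:
        for field_name in exclude:
            # Remove the field if present; ignore it otherwise
            element.get("metadata", {}).pop(field_name, None)
    return elements

# Example: strip filesystem-specific fields before writing output
elements = [{"text": "Hello", "metadata": {"filename": "a.pdf", "page_number": 1}}]
cleaned = apply_metadata_exclude(elements, ["filename", "file_directory"])
```

Excluding fields such as filename and file_directory is useful when the output should not leak local filesystem layout.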

Parallelism model: The ingest pipeline uses Python multiprocessing to partition documents concurrently. Each worker process runs an independent partition() call. The default worker count is os.cpu_count().
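This parallelism model can be sketched with the standard library. Here partition is a stub standing in for the real call, and the surrounding function names are illustrative, not the pipeline's actual internals:

```python
import multiprocessing
import os
from fnmatch import fnmatch

def partition(path, strategy="auto"):
    # Stub for the real partition() call: returns one element per document.
    return [{"text": f"contents of {path}", "strategy": strategy}]

def _worker(path):
    # Each worker process runs an independent partition() call.
    return partition(path, strategy="fast")

def run_partition_step(paths, file_glob="*", num_processes=None):
    # Filter inputs by glob pattern, then fan out across worker processes.
    selected = [p for p in paths if fnmatch(os.path.basename(p), file_glob)]
    workers = num_processes or os.cpu_count()  # default mirrors os.cpu_count()
    if workers == 1:
        return [_worker(p) for p in selected]  # no pool needed for one worker
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(_worker, selected)
```

Because each worker is an independent process, per-document partitioning shares no state; throughput scales with worker count until CPU or memory becomes the bottleneck.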
