Principle: Unstructured IO's Unstructured Ingest Processing Configuration
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, ETL, Configuration |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A configuration pattern for controlling how documents are processed within the ingest pipeline, including strategy selection, parallelism, metadata filtering, and file pattern matching.
Description
Processing configuration governs the behavior of the partition step within the ingest pipeline. While source configuration determines where documents come from and destination configuration determines where they go, processing configuration determines how they are processed.
Key configuration axes include:
- Strategy: Which partition strategy to use (auto, fast, hi_res, ocr_only)
- Parallelism: How many documents to process concurrently
- Metadata filtering: Which metadata fields to include or exclude
- File filtering: Which files to process based on glob patterns
- Reprocessing: Whether to re-partition previously processed files
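The axes above can be captured in a single configuration object. A minimal sketch (the `ProcessConfig` class and its field names here are illustrative, not the library's actual API):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ProcessConfig:
    """Illustrative container for the processing axes (hypothetical names)."""
    strategy: str = "auto"            # one of: auto, fast, hi_res, ocr_only
    num_processes: int = 1            # parallel partition workers
    metadata_exclude: List[str] = field(default_factory=list)  # fields to strip
    file_glob: Optional[str] = None   # e.g. "*.pdf"; None means all files
    reprocess: bool = False           # re-partition already-processed files

# Tune throughput and output format in one place.
cfg = ProcessConfig(strategy="hi_res", num_processes=4, file_glob="*.pdf")
```

Defaults favor the cheapest path (automatic strategy selection, a single worker, no filtering), so only the axes being tuned need to be spelled out.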
Usage
Use this principle when tuning ingest pipeline behavior for throughput, output format, or resource constraints. Processing configuration is independent of source and destination, so the same processing settings can be applied regardless of where documents come from or go.
Theoretical Basis
Processing configuration maps directly onto partition() parameters; the same options are exposed as CLI flags:
```python
# Abstract processing configuration
pipeline.configure(
    strategy="hi_res",                               # Maps to partition(strategy=...)
    num_processes=4,                                 # Parallel worker count
    metadata_exclude=["filename", "file_directory"], # Strip these metadata fields
    file_glob="*.pdf",                               # Only process PDFs
    reprocess=True,                                  # Force re-partitioning
    work_dir="/tmp/work",                            # Intermediate file storage
    verbose=True,                                    # Detailed logging
)
```
Parallelism model: The ingest pipeline uses Python multiprocessing to partition documents concurrently. Each worker process runs an independent partition() call. The default worker count is os.cpu_count().
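A minimal sketch of this model, with a stand-in partition_file in place of the real partition() call (the function and its return value are illustrative):

```python
import multiprocessing as mp
import os

def partition_file(path):
    # Stand-in for partition(filename=path); returns a list of "elements".
    return [f"element from {path}"]

def run(paths, num_processes=None):
    # Each worker process runs an independent partition call on one document.
    workers = num_processes or os.cpu_count()  # default mirrors os.cpu_count()
    with mp.Pool(processes=workers) as pool:
        per_doc = pool.map(partition_file, paths)  # preserves input order
    # Flatten per-document element lists into one result list.
    return [el for doc in per_doc for el in doc]

if __name__ == "__main__":
    run(["a.pdf", "b.pdf"], num_processes=2)
```

Because partitioning is CPU-bound, process-based parallelism (rather than threads) is what lets the pipeline use multiple cores despite the GIL.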