Principle: Unstructured IO's Unstructured Ingest Processing Configuration
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, ETL, Configuration |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A configuration pattern for controlling how documents are processed within the ingest pipeline, including strategy selection, parallelism, metadata filtering, and file pattern matching.
Description
Processing configuration governs the behavior of the partition step within the ingest pipeline. While source configuration determines where documents come from and destination configuration determines where they go, processing configuration determines how they are processed.
Key configuration axes include:
- Strategy: Which partition strategy to use (auto, fast, hi_res, ocr_only)
- Parallelism: How many documents to process concurrently
- Metadata filtering: Which metadata fields to include or exclude
- File filtering: Which files to process based on glob patterns
- Reprocessing: Whether to re-partition previously processed files
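The axes above can be captured in a single configuration object. A minimal sketch (the `ProcessConfig` class and its field names here are illustrative, not the library's actual API):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ProcessConfig:
    """Illustrative container for the processing axes (hypothetical names)."""
    strategy: str = "auto"            # one of: auto, fast, hi_res, ocr_only
    num_processes: int = 1            # parallel partition workers
    metadata_exclude: List[str] = field(default_factory=list)  # fields to strip
    file_glob: Optional[str] = None   # e.g. "*.pdf"; None means all files
    reprocess: bool = False           # re-partition already-processed files

# Tune throughput and output format in one place.
cfg = ProcessConfig(strategy="hi_res", num_processes=4, file_glob="*.pdf")
```

Defaults favor the cheapest path (automatic strategy selection, a single worker, no filtering), so only the axes being tuned need to be spelled out.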
Usage
Use this principle when tuning ingest pipeline behavior for throughput, output format, or resource constraints. Processing configuration is independent of source and destination, so the same processing settings can be applied regardless of where documents come from or go.
Theoretical Basis
Processing configuration maps directly onto partition() parameters; the same options are exposed as CLI flags:
```python
# Abstract processing configuration
pipeline.configure(
    strategy="hi_res",                               # Maps to partition(strategy=...)
    num_processes=4,                                 # Parallel worker count
    metadata_exclude=["filename", "file_directory"], # Strip these metadata fields
    file_glob="*.pdf",                               # Only process PDFs
    reprocess=True,                                  # Force re-partitioning
    work_dir="/tmp/work",                            # Intermediate file storage
    verbose=True,                                    # Detailed logging
)
```
Parallelism model: The ingest pipeline uses Python multiprocessing to partition documents concurrently. Each worker process runs an independent partition() call. The default worker count is os.cpu_count().
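A minimal sketch of this model, with a stand-in partition_file in place of the real partition() call (the function and its return value are illustrative):

```python
import multiprocessing as mp
import os

def partition_file(path):
    # Stand-in for partition(filename=path); returns a list of "elements".
    return [f"element from {path}"]

def run(paths, num_processes=None):
    # Each worker process runs an independent partition call on one document.
    workers = num_processes or os.cpu_count()  # default mirrors os.cpu_count()
    with mp.Pool(processes=workers) as pool:
        per_doc = pool.map(partition_file, paths)  # preserves input order
    # Flatten per-document element lists into one result list.
    return [el for doc in per_doc for el in doc]

if __name__ == "__main__":
    run(["a.pdf", "b.pdf"], num_processes=2)
```

Because partitioning is CPU-bound, process-based parallelism (rather than threads) is what lets the pipeline use multiple cores despite the GIL.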