Workflow:Unstructured IO Unstructured Connector Ingest Pipeline

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, ETL, Document_Processing
Last Updated: 2026-02-12 09:30 GMT

Overview

End-to-end pipeline for ingesting documents from 30+ source connectors, partitioning them into structured elements, and writing the results to destination connectors using the unstructured-ingest CLI.

Description

This workflow describes the complete data ingestion pipeline that connects external data sources to downstream storage and processing systems. The pipeline follows a three-stage architecture. First, a source connector downloads documents from a remote system (S3, Azure Blob, Google Cloud Storage, SharePoint, Confluence, Slack, databases, etc.). Second, the Unstructured partition engine partitions the documents into structured elements. Third, a destination connector writes the processed elements to a target system (local filesystem, vector databases, search engines, cloud storage, etc.).

Key capabilities:

  • 30+ source connectors covering cloud storage, SaaS platforms, databases, and collaboration tools
  • Parallel multi-process document processing
  • Optional embedding generation during the pipeline
  • Multiple destination connectors for vector databases, search engines, and cloud storage
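  • Configurable processing strategies (fast, hi_res, ocr_only, auto)
  • Incremental processing, controlled by the reprocess option (force re-processing of previously processed documents) and the preserve-downloads option (retain downloaded source files)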

Usage

Execute this workflow when you need to batch-process documents from an external data source (such as an S3 bucket, SharePoint site, Confluence space, or database) and deliver the structured output to a destination system for indexing, search, or analytics. This is the primary workflow for production data pipelines that transform unstructured content at scale.

Execution Steps

Step 1: Source_Configuration

Select and configure the source connector for your data origin. Each source connector has specific authentication and connection parameters. Cloud storage connectors (S3, Azure, GCS) require bucket URLs and credentials. SaaS connectors (Confluence, Slack, SharePoint) require API tokens and workspace identifiers. Database connectors (MongoDB, Elasticsearch) require connection strings and index names.

Key considerations:

  • Each source connector has its own required and optional parameters
  • Authentication can use environment variables, API keys, or service account credentials
  • Some connectors support anonymous access (e.g., public S3 buckets)
  • The download directory stores raw files for processing
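The credential handling described above can be sketched in a few lines of Python. This is a minimal illustration for an S3 source only, not the connector's actual configuration class: `S3SourceConfig` and `s3_source_from_env` are hypothetical names, and the sketch assumes credentials arrive through the standard AWS environment variables, falling back to anonymous access when they are absent.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class S3SourceConfig:
    """Hypothetical container for S3 source settings (not a library class)."""
    remote_url: str
    download_dir: str
    anonymous: bool = False
    access_key: Optional[str] = None
    secret_key: Optional[str] = None


def s3_source_from_env(
    remote_url: str,
    download_dir: str,
    env: Optional[Mapping[str, str]] = None,
) -> S3SourceConfig:
    """Build a source config, preferring credentials from the environment."""
    if not remote_url.startswith("s3://"):
        raise ValueError(f"expected an s3:// URL, got {remote_url!r}")
    env = os.environ if env is None else env
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return S3SourceConfig(remote_url, download_dir, False, key, secret)
    # No credentials found: fall back to anonymous access (public buckets only).
    return S3SourceConfig(remote_url, download_dir, anonymous=True)
```

Keeping secrets in the environment rather than in the configuration file means the same configuration can be shared or committed without exposing credentials.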

Step 2: Processing_Configuration

Configure the document processing options including the partitioning strategy, parallelism, metadata handling, and optional embedding. The strategy parameter controls the accuracy-speed tradeoff (fast, hi_res, ocr_only, auto). The num-processes parameter enables parallel processing across multiple CPU cores. Metadata fields can be included or excluded from the output.

Key considerations:

  • Use num-processes to parallelize across available CPU cores
  • Select the appropriate strategy based on document types and accuracy needs
  • Use metadata-exclude to remove fields not needed downstream (e.g., coordinates)
  • Enable verbose mode for debugging pipeline issues
  • Set work-dir for intermediate processing artifacts
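As an illustration of the options above, the following Python sketch assembles processing arguments into a list suitable for a subprocess call. The helper `processing_args` is hypothetical, and the flag spellings (`--strategy`, `--num-processes`, `--metadata-exclude`, `--work-dir`, `--verbose`) are assumptions inferred from the parameter names in this step; check them against the help output of your installed CLI version.

```python
import os
from typing import List, Optional, Sequence

# Strategy values as listed in this workflow.
STRATEGIES = ("auto", "fast", "hi_res", "ocr_only")


def processing_args(
    strategy: str = "auto",
    num_processes: Optional[int] = None,
    metadata_exclude: Sequence[str] = (),
    work_dir: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Return CLI-style arguments for the processing stage (flag names assumed)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy {strategy!r}; expected one of {STRATEGIES}")
    if num_processes is None:
        # Default to one worker per available CPU core.
        num_processes = os.cpu_count() or 1
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    args = ["--strategy", strategy, "--num-processes", str(num_processes)]
    if metadata_exclude:
        # Assumed format: a single comma-separated value.
        args += ["--metadata-exclude", ",".join(metadata_exclude)]
    if work_dir:
        args += ["--work-dir", work_dir]
    if verbose:
        args.append("--verbose")
    return args
```

Validating the strategy before launching the pipeline surfaces typos immediately instead of after the download stage has already run.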

Step 3: Embedding_Configuration

Optionally configure an embedding provider to generate vector embeddings during the pipeline. Supported providers include OpenAI, HuggingFace, AWS Bedrock, Google Vertex AI, Voyage AI, and others. Embeddings are attached to each element and written alongside the structured output to the destination.

Key considerations:

  • Embedding is optional and adds processing time per element
  • Choose the embedding provider based on your vector database and use case
  • API keys for embedding providers are configured via environment variables or CLI flags
  • Embedding dimensions vary by provider and model
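Because embedding dimensions vary by provider and model, it can help to confirm that every element carries a vector of the size the destination index expects before writing. The sketch below is a hypothetical check, not part of the pipeline itself; it assumes each element is a dictionary with its vector stored under an `embeddings` key, which should be verified against your actual output.

```python
from typing import Any, Iterable, List, Mapping


def find_dimension_mismatches(
    elements: Iterable[Mapping[str, Any]],
    expected_dim: int,
) -> List[int]:
    """Return indices of elements whose embedding is missing or has the wrong length."""
    bad = []
    for index, element in enumerate(elements):
        vector = element.get("embeddings")  # key name is an assumption
        if vector is None or len(vector) != expected_dim:
            bad.append(index)
    return bad
```

An empty result means all elements match the expected dimension; any returned index points at an element to inspect before loading it into a vector database.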

Step 4: Destination_Configuration

Select and configure the destination connector where processed elements will be written. The simplest destination is local filesystem (JSON files per document). Vector database destinations (Chroma, Weaviate, Pinecone, Qdrant) store elements with their embeddings for similarity search. Search engine destinations (Elasticsearch, OpenSearch) enable full-text search. Cloud storage destinations (S3, Azure, GCS) store structured output remotely.

Key considerations:

  • Local destination writes one JSON file per input document
  • Vector database destinations require embeddings to be generated
  • Database destinations may require schema setup before ingestion
  • Output format varies by destination (JSON, database records, etc.)
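The dependency between vector database destinations and embeddings can be checked before the pipeline starts. The following is a minimal sketch with hypothetical names (`VECTOR_DESTINATIONS`, `check_destination`); the set of vector destinations simply mirrors the examples named in this step and is not an exhaustive list.

```python
from typing import List, Optional

# Vector database destinations named in this step (illustrative, not exhaustive).
VECTOR_DESTINATIONS = frozenset({"chroma", "weaviate", "pinecone", "qdrant"})


def check_destination(
    destination: str,
    embedding_provider: Optional[str] = None,
) -> List[str]:
    """Return a list of configuration problems; an empty list means no issue found."""
    problems = []
    if destination.lower() in VECTOR_DESTINATIONS and not embedding_provider:
        problems.append(
            f"destination {destination!r} stores embeddings, "
            "but no embedding provider is configured"
        )
    return problems
```

Running such a check up front avoids partitioning an entire corpus only to find that the destination cannot accept the output.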

Step 5: Pipeline_Execution

Execute the ingest pipeline using the unstructured-ingest CLI command. The pipeline orchestrates the full flow: downloading from the source, partitioning each document, optionally embedding, and writing to the destination. Progress and errors are logged to stdout. Failed documents are reported but do not halt the entire pipeline.

Key considerations:

  • Use the reprocess flag to force re-processing of previously processed documents
  • Use preserve-downloads to retain downloaded source files for debugging
  • Monitor pipeline output for per-document errors
  • Large pipelines benefit from higher num-processes values, up to the number of available CPU cores
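Since failed documents are reported without halting the run, one way to monitor a pipeline is to stream its output and collect the lines that look like errors. The sketch below wraps an arbitrary command; `run_and_collect_errors` is a hypothetical helper, and matching on the substring `ERROR` is an assumption about the log format that should be adapted to what your CLI version actually prints.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_and_collect_errors(
    cmd: Sequence[str],
    marker: str = "ERROR",
) -> Tuple[int, List[str]]:
    """Run cmd, echo its output, and return (exit code, lines containing marker)."""
    errors = []
    proc = subprocess.Popen(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so no message is missed
        text=True,
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        sys.stdout.write(line)  # keep progress visible while scanning
        if marker in line:
            errors.append(line.rstrip("\n"))
    return proc.wait(), errors
```

In practice, `cmd` would be the full unstructured-ingest invocation assembled from the source, processing, and destination settings. A zero exit code with a non-empty error list indicates that the run completed but some documents need attention.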

Step 6: Output_Validation

Validate the pipeline output to ensure documents were processed correctly. For local destinations, verify the number of output files matches expectations. For database destinations, query the target to confirm records were written. The Unstructured test infrastructure includes diff-checking scripts that compare output against expected baselines for regression testing.

Key considerations:

  • Check output file counts match expected document counts
  • Validate element types and metadata in sample outputs
  • Use diff-checking for regression testing against known-good baselines
  • Monitor for empty outputs indicating processing failures
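For a local destination, the checks above can be sketched as a short Python script. It assumes each output file is a JSON array of element dictionaries with a `type` key; `summarize_output` is a hypothetical helper, not part of the Unstructured test infrastructure.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict


def summarize_output(output_dir: str, expected_docs: int) -> Dict[str, Any]:
    """Count output files, flag empty ones, and tally element types."""
    files = sorted(Path(output_dir).rglob("*.json"))
    empty_files = []
    element_types: Counter = Counter()
    for path in files:
        elements = json.loads(path.read_text(encoding="utf-8"))
        if not elements:
            # An empty element list usually signals a processing failure.
            empty_files.append(str(path))
            continue
        element_types.update(el.get("type", "<missing>") for el in elements)
    return {
        "file_count": len(files),
        "count_matches": len(files) == expected_docs,
        "empty_files": empty_files,
        "element_types": dict(element_types),
    }
```

The element type tally gives a quick sanity check on a sample run: a corpus of reports that yields no titles or tables, for example, suggests the chosen strategy is not suited to the documents.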
