Principle:Unstructured IO Unstructured Ingest Source Configuration

Knowledge Sources	Unstructured Unstructured Ingest
Domains	Data_Ingestion, ETL, Cloud_Storage
Last Updated	2026-02-12 00:00 GMT

Overview

A configuration pattern for connecting to external data sources (cloud storage, databases, APIs) to download documents for processing through the partition pipeline.

Description

Source configuration defines how the ingest pipeline connects to and retrieves documents from external systems. The Unstructured ecosystem supports more than 30 source connectors, including cloud storage (S3, Azure Blob, GCS), databases (Elasticsearch, MongoDB), collaboration tools (Confluence, Notion, SharePoint, Google Drive), and communication platforms (Slack, Discord).

Each source connector requires specific authentication credentials and connection parameters. The configuration pattern normalizes these across connectors through CLI flags while allowing connector-specific options (e.g., S3 anonymous access, Azure account names, Elasticsearch index filtering).
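The split between shared and connector-specific options can be sketched as a small data model. This is a minimal illustration, not part of the Unstructured API: the `SourceConfig` class, its `extra` field, and the default values are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SourceConfig:
    """Hypothetical container for one source connector's settings."""

    source_type: str                    # e.g. "s3", "azure", "elasticsearch"
    download_dir: str = "./downloads/"  # option shared by every connector
    num_processes: int = 2              # option shared by every connector
    # Connector-specific options (anonymous, account_name, index_name, ...)
    extra: Dict[str, Any] = field(default_factory=dict)

    def merged(self) -> Dict[str, Any]:
        """Return one flat mapping of shared and connector-specific options."""
        options: Dict[str, Any] = {
            "download_dir": self.download_dir,
            "num_processes": self.num_processes,
        }
        options.update(self.extra)
        return options


# Two connectors share the same shape but carry different extras.
s3 = SourceConfig("s3", extra={"remote_url": "s3://bucket/path/", "anonymous": True})
es = SourceConfig("elasticsearch", extra={"hosts": "http://localhost:9200", "index_name": "docs"})
```

Keeping the shared fields typed and the connector-specific ones in a free-form mapping mirrors the description above: the common surface stays uniform while each connector adds only what it needs.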

Usage

Use this principle when setting up document ingestion from external sources. The source connector selection depends on where your documents are stored. Configuration requires knowing the source type, providing appropriate credentials (typically via environment variables), and specifying which documents to retrieve.

Theoretical Basis

Source configuration follows the connector pattern:

# Abstract source connector pattern
connector = SourceConnector(
    source_type="s3",        # or azure, gcs, elasticsearch, etc.
    remote_url="s3://bucket/path/",
    credentials=from_environment(),
    download_dir="./downloads/",
)

# Common parameters (--remote-url applies to file-storage connectors such as S3, Azure, GCS):
#   --remote-url     : Source data location
#   --download-dir   : Local directory for downloaded files
#   --num-processes  : Number of parallel worker processes
#   --reprocess      : Reprocess files even if output already exists

# Connector-specific parameters:
#   S3:            --anonymous
#   Azure:         --account-name
#   Elasticsearch: --hosts, --index-name, --username, --password, --fields
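As a rough sketch of how the flags listed above might be assembled programmatically, the helper below turns an options mapping into an argument list. The function name `build_args` and the convention of placing the source type first are assumptions made for illustration; consult the CLI's own help output for the exact command layout.

```python
from typing import Any, Dict, List


def build_args(source_type: str, options: Dict[str, Any]) -> List[str]:
    """Convert an options mapping into CLI-style arguments (hypothetical helper)."""
    args = [source_type]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            args.append(flag)              # boolean switch such as --anonymous
        elif value is False or value is None:
            continue                       # unset options are omitted entirely
        else:
            args.extend([flag, str(value)])
    return args


args = build_args(
    "s3",
    {
        "remote_url": "s3://bucket/path/",
        "download_dir": "./downloads/",
        "num_processes": 4,
        "reprocess": False,
        "anonymous": True,
    },
)
print(args)
```

Building the arguments as a list rather than a single string avoids shell-quoting problems if the list is later handed to a process launcher.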

Authentication patterns:

Environment variables: AWS credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), API keys
CLI flags: Explicit credential passing (--username, --password, --account-name)
Anonymous access: Public data sources (S3 with --anonymous flag)
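The three authentication patterns above could be resolved in code roughly as follows. The function `resolve_auth`, its return shape, and the precedence order (anonymous first, then explicit credentials, then environment variables) are assumptions for this sketch; the source does not state which pattern takes priority.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_auth(
    env: Mapping[str, str],
    anonymous: bool = False,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> Dict[str, str]:
    """Pick an authentication mode (hypothetical helper, assumed precedence)."""
    if anonymous:
        # Public data source: no credentials needed.
        return {"mode": "anonymous"}
    if username and password:
        # Credentials passed explicitly, as with --username / --password.
        return {"mode": "cli", "username": username, "password": password}
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        # Credentials picked up from the environment.
        return {"mode": "environment", "key": key, "secret": secret}
    raise ValueError("no credentials found and anonymous access not requested")


# Anonymous access never touches the environment.
public = resolve_auth(os.environ, anonymous=True)
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the helper easy to exercise with fake values.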

Related Pages

Implemented By

Implementation:Unstructured_IO_Unstructured_Unstructured_Ingest_CLI_Source
