Principle:Unstructured IO Unstructured Ingest Source Configuration
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, ETL, Cloud_Storage |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A configuration pattern for connecting to external data sources (cloud storage, databases, APIs) to download documents for processing through the partition pipeline.
Description
Source configuration defines how the ingest pipeline connects to and retrieves documents from external systems. The Unstructured ecosystem supports more than 30 source connectors, including cloud storage (S3, Azure Blob, GCS), databases (Elasticsearch, MongoDB), collaboration tools (Confluence, Notion, SharePoint, Google Drive), and communication platforms (Slack, Discord).
Each source connector requires specific authentication credentials and connection parameters. The configuration pattern normalizes these across connectors through CLI flags while allowing connector-specific options (e.g., S3 anonymous access, Azure account names, Elasticsearch index filtering).
Usage
Use this principle when setting up document ingestion from external sources. The source connector selection depends on where your documents are stored. Configuration requires knowing the source type, providing appropriate credentials (typically via environment variables), and specifying which documents to retrieve.
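As a minimal sketch of how connector selection can follow from where the documents live, the helper below infers a connector name from the scheme of a remote URL. The function name pick_connector and the scheme table are hypothetical and only illustrative; they are not part of the Unstructured API, and the table is not an exhaustive list of supported schemes.

```python
from urllib.parse import urlparse

# Illustrative mapping from URL scheme to connector name. This table is an
# assumption made for the sketch, not an official or complete list.
SCHEME_TO_CONNECTOR = {
    "s3": "s3",
    "abfs": "azure",
    "az": "azure",
    "gs": "gcs",
    "gcs": "gcs",
}


def pick_connector(remote_url: str) -> str:
    """Return the connector name implied by the URL scheme (hypothetical helper)."""
    scheme = urlparse(remote_url).scheme.lower()
    try:
        return SCHEME_TO_CONNECTOR[scheme]
    except KeyError:
        # Sources such as Elasticsearch are addressed by host and index,
        # not by a storage URL, so they are not covered by this lookup.
        raise ValueError(f"no connector known for scheme {scheme!r}") from None
```

Sources that are not addressed by a storage URL (for example Elasticsearch) would need to be chosen explicitly rather than inferred this way.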
Theoretical Basis
Source configuration follows the connector pattern:
# Abstract source connector pattern (pseudocode: SourceConnector and
# from_environment are illustrative names, not a library API)
connector = SourceConnector(
    source_type="s3",  # or azure, gcs, elasticsearch, etc.
    remote_url="s3://bucket/path/",
    credentials=from_environment(),
    download_dir="./downloads/",
)
# Parameters shared by most connectors:
# --remote-url : Source data location (URL-addressed sources such as S3, Azure, GCS)
# --download-dir : Local directory for downloaded files
# --num-processes : Parallel download workers
# --reprocess : Reprocess documents even if their output already exists
# Connector-specific parameters:
# S3: --anonymous
# Azure: --account-name
# Elasticsearch: --hosts, --index-name, --username, --password, --fields
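The split between shared and connector-specific parameters can be sketched as a small config object that is rendered into a command line. This is an illustration under stated assumptions, not the library's implementation: SourceConfig and build_args are hypothetical names, the flags are the ones listed above, and the executable name and subcommand layout (unstructured-ingest followed by the source type) are assumed and may differ between releases.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SourceConfig:
    """Hypothetical container for source settings (not a library class)."""

    source_type: str                      # e.g. "s3", "azure", "elasticsearch"
    download_dir: str
    remote_url: Optional[str] = None      # only for URL-addressed sources
    num_processes: Optional[int] = None
    reprocess: bool = False
    # Connector-specific options, keyed by flag name with underscores.
    extra: Dict[str, object] = field(default_factory=dict)


def build_args(cfg: SourceConfig) -> List[str]:
    """Render a SourceConfig as a list of CLI arguments."""
    # Assumed layout: executable name, then the source type as a subcommand.
    args = ["unstructured-ingest", cfg.source_type]
    if cfg.remote_url:
        args += ["--remote-url", cfg.remote_url]
    args += ["--download-dir", cfg.download_dir]
    if cfg.num_processes is not None:
        args += ["--num-processes", str(cfg.num_processes)]
    if cfg.reprocess:
        args.append("--reprocess")
    for name, value in cfg.extra.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            args.append(flag)             # boolean switch, e.g. --anonymous
        elif value is not False and value is not None:
            args += [flag, str(value)]    # valued option, e.g. --index-name docs
    return args


s3_args = build_args(
    SourceConfig(
        source_type="s3",
        download_dir="./downloads/",
        remote_url="s3://bucket/path/",
        num_processes=2,
        extra={"anonymous": True},
    )
)
```

Building the arguments as a list rather than a single string keeps each value as its own argument, so paths and URLs need no shell quoting if the list is later passed to a process launcher.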
Authentication patterns:
- Environment variables: AWS credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), API keys
- CLI flags: Explicit credential passing (--username, --password, --account-name)
- Anonymous access: Public data sources (S3 with --anonymous flag)
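The three patterns above can be sketched as a single decision for an S3 source: use anonymous access if requested, otherwise require the AWS variables to be present in the environment. The function s3_auth_args is a hypothetical helper written for this illustration; it only checks for the variables and returns extra flags, and makes no claim about how the connector itself resolves credentials.

```python
import os
from typing import List, Mapping

# Environment variables named in the list above.
AWS_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def s3_auth_args(env: Mapping[str, str], anonymous: bool = False) -> List[str]:
    """Return the extra CLI flags an S3 source needs (hypothetical helper)."""
    if anonymous:
        # Public bucket: no credentials are required.
        return ["--anonymous"]
    missing = [name for name in AWS_ENV_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    # Credentials stay in the environment, so no flags are added.
    return []


# Typical call site: pass the real process environment.
public_flags = s3_auth_args(os.environ, anonymous=True)
```

Keeping secrets in environment variables avoids placing them on the command line, where arguments such as --password can be visible to other users in process listings on many systems.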