Implementation:Unstructured IO Unstructured Unstructured Ingest CLI Source
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, ETL, CLI |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Concrete tool for configuring source connectors in the unstructured-ingest CLI pipeline.
Description
The unstructured-ingest CLI takes a source connector type as its first positional argument, followed by connector-specific flags for authentication and data selection and by common processing flags. The command ends with a destination subcommand; every example on this page uses local --output-dir <DIR>. This page documents the CLI usage patterns demonstrated in the repository's integration test scripts for the S3, Azure, and Elasticsearch connectors.
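The argument order described above can be sketched with a small helper that only prints a command line for inspection. The name ingest_cmd and its parameter order are hypothetical and not part of the CLI; the output is unquoted, so it is meant for display, not for eval.

```shell
# Hypothetical helper (not part of unstructured-ingest): print a command line
# in the order source subcommand, flags, destination subcommand.
# Usage: ingest_cmd <source_type> <output_dir> [flags...]
ingest_cmd() {
  source_type=$1
  output_dir=$2
  shift 2
  printf 'unstructured-ingest %s' "$source_type"
  # Remaining arguments are passed through as source and processing flags.
  for arg in "$@"; do
    printf ' %s' "$arg"
  done
  printf ' local --output-dir %s\n' "$output_dir"
}
```

For example, ingest_cmd s3 ./out --remote-url s3://bucket/prefix/ --anonymous prints the S3 flags between the s3 subcommand and the trailing local --output-dir ./out.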
Usage
Use this CLI tool when you need to ingest documents from external sources into the Unstructured partition pipeline. The CLI handles downloading, partitioning, optional embedding, and writing to a destination in a single command.
Code Reference
Source Location
- Repository: unstructured
- Files:
  - test_unstructured_ingest/src/s3.sh (lines 29-42)
  - test_unstructured_ingest/src/azure.sh (lines 27-40)
  - test_unstructured_ingest/src/elasticsearch.sh (lines 41-57)
Signature
# Generic CLI pattern
unstructured-ingest <source_type> \
--remote-url <URL> \
--num-processes <N> \
--download-dir <DIR> \
--metadata-exclude <FIELDS> \
--strategy <STRATEGY> \
[source-specific flags] \
local --output-dir <DIR>
# S3 source connector
unstructured-ingest s3 \
--remote-url "s3://bucket/prefix/" \
--anonymous \
--num-processes 2 \
--download-dir "$DOWNLOAD_DIR" \
--strategy fast \
local --output-dir "$OUTPUT_DIR"
# Azure Blob source connector
unstructured-ingest azure \
--remote-url "abfs://container@account.dfs.core.windows.net/path/" \
--account-name "$AZURE_ACCOUNT_NAME" \
--num-processes 2 \
local --output-dir "$OUTPUT_DIR"
# Elasticsearch source connector
unstructured-ingest elasticsearch \
--hosts "$ES_HOSTS" \
--index-name "$INDEX" \
--username "$ES_USER" \
--password "$ES_PASS" \
--fields "body" \
--num-processes 2 \
local --output-dir "$OUTPUT_DIR"
Import
pip install unstructured-ingest
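After installation it can help to confirm that the shell can resolve the command before running a pipeline. This is a minimal sketch; have_cli is a hypothetical helper name, and the check only tests that the command resolves, not that it runs correctly.

```shell
# Hypothetical helper: return 0 if the shell can resolve the named command
# (executable on PATH, builtin, or function), non-zero otherwise.
have_cli() {
  command -v "$1" >/dev/null 2>&1
}

if have_cli unstructured-ingest; then
  echo "unstructured-ingest found"
else
  echo "unstructured-ingest not found; run: pip install unstructured-ingest"
fi
```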
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| source_type | positional arg | Yes | Connector name: s3, azure, gcs, elasticsearch, local, etc. |
| --remote-url | string | Varies | Source data URL (format depends on connector) |
| --num-processes | int | No | Number of parallel processing workers |
| --download-dir | path | No | Local directory for downloaded files |
| --strategy | string | No | Partition strategy (auto, fast, hi_res, ocr_only) |
| --metadata-exclude | CSV | No | Metadata fields to exclude from output |
| --anonymous | flag | No | S3: Use anonymous access (no credentials) |
| --account-name | string | No | Azure: Storage account name |
| --hosts | string | No | Elasticsearch: Host URL |
| --index-name | string | No | Elasticsearch: Index name |
| --username | string | No | Elasticsearch: User name for authentication |
| --password | string | No | Elasticsearch: Password for authentication |
| --fields | string | No | Elasticsearch: Document field(s) to read (body in the examples) |
| --output-dir | path | No | local destination: Directory for the JSON output |
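Because --strategy accepts only the values listed in the table, a wrapper script can reject typos before any download starts. The sketch below assumes the four listed values are the complete set; is_valid_strategy is a hypothetical helper, not a CLI feature.

```shell
# Hypothetical helper: succeed only for the strategy values listed in the
# Inputs table (assumed here to be the complete set).
is_valid_strategy() {
  case "$1" in
    auto|fast|hi_res|ocr_only) return 0 ;;
    *) return 1 ;;
  esac
}
```

A script might call is_valid_strategy "$STRATEGY" || exit 1 before invoking the CLI, where STRATEGY is whatever variable the script uses for the flag value.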
Outputs
| Name | Type | Description |
|---|---|---|
| JSON files | files | Structured JSON element files in --output-dir, one per input document |
| exit code | int | 0 = success, 1 = failure, 8 = skipped because credentials are missing (convention used by the integration test scripts) |
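A calling script can turn the outputs above into a short report. The sketch below maps the three documented exit codes to labels and counts the JSON files under an output directory. Both function names are hypothetical, and the .json file extension is an assumption rather than something this page states.

```shell
# Hypothetical helper: label the exit codes listed in the Outputs table.
# Any other code is reported as "unknown" rather than guessed at.
describe_exit_code() {
  case "$1" in
    0) echo "success" ;;
    1) echo "failure" ;;
    8) echo "skipped" ;;
    *) echo "unknown" ;;
  esac
}

# Hypothetical helper: count output files under a directory, recursively.
# Assumes output files end in .json and names contain no newlines.
count_json_outputs() {
  find "$1" -type f -name '*.json' | wc -l | tr -d ' '
}
```

After a run, describe_exit_code "$?" followed by count_json_outputs ./structured-output/s3/ gives a quick sanity check that the number of outputs matches the number of input documents.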
Usage Examples
Ingest from S3 (Anonymous)
unstructured-ingest s3 \
--remote-url "s3://utic-dev-tech-fixtures/small-pdf-set/" \
--anonymous \
--num-processes 2 \
--download-dir ./downloads \
--metadata-exclude "filename,file_directory" \
--strategy fast \
local --output-dir ./structured-output/s3/
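The --metadata-exclude value in this example is a comma-separated list. When the field names come from a script, they can be joined as sketched below; join_csv is a hypothetical helper and does no escaping, so it assumes field names contain no commas.

```shell
# Hypothetical helper: join all arguments with commas, e.g. for
# --metadata-exclude. "$*" joins arguments using the first character of IFS;
# the subshell keeps the IFS change local.
join_csv() {
  ( IFS=,; printf '%s\n' "$*" )
}
```

It could be used as --metadata-exclude "$(join_csv filename file_directory)".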
Ingest from Azure Blob
export AZURE_ACCOUNT_NAME="myaccount"
unstructured-ingest azure \
--remote-url "abfs://container@${AZURE_ACCOUNT_NAME}.dfs.core.windows.net/docs/" \
--account-name "$AZURE_ACCOUNT_NAME" \
--num-processes 2 \
local --output-dir ./structured-output/azure/
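If AZURE_ACCOUNT_NAME is unset, the command above would be built with an empty account name. A guard like the following sketch can stop early and reuse exit code 8, which the Outputs table lists as "skip (missing credentials)". The function name require_env is hypothetical.

```shell
# Hypothetical helper: return 8 (the documented "skip" code) when the named
# environment variable is unset or empty, 0 otherwise.
# The variable name is passed to eval, so only call this with trusted names.
require_env() {
  eval "_req_value=\${$1:-}"
  if [ -z "$_req_value" ]; then
    echo "Skipping: $1 is not set" >&2
    return 8
  fi
}
```

A script would call require_env AZURE_ACCOUNT_NAME || exit $? on the line before the unstructured-ingest azure command.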
Ingest from Elasticsearch
unstructured-ingest elasticsearch \
--hosts "http://localhost:9200" \
--index-name "my_documents" \
--username "elastic" \
--password "$ES_PASSWORD" \
--fields "body" \
--num-processes 2 \
local --output-dir ./structured-output/elasticsearch/
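Since this command carries a password, a script that logs the commands it runs may want to mask that value first. The following is a minimal sketch of one way to do so; print_masked is a hypothetical helper, it handles only the separate-argument form --password VALUE shown above, and it does not change how the CLI itself receives the password.

```shell
# Hypothetical helper: print the arguments on one line for logging, replacing
# the argument that follows --password with ***.
print_masked() {
  mask_next=0
  line=""
  for arg in "$@"; do
    if [ "$mask_next" -eq 1 ]; then
      arg='***'
      mask_next=0
    elif [ "$arg" = "--password" ]; then
      mask_next=1
    fi
    # Append with a single space separator, without a leading space.
    line="${line:+$line }$arg"
  done
  printf '%s\n' "$line"
}
```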