Implementation:Unstructured IO Unstructured Unstructured Ingest CLI Source

Knowledge Sources	Unstructured Unstructured Ingest
Domains	Data_Ingestion, ETL, CLI
Last Updated	2026-02-12 00:00 GMT

Overview

Concrete tool for configuring source connectors in the unstructured-ingest CLI pipeline.

Description

The unstructured-ingest CLI takes a source connector type as its first positional argument, followed by connector-specific flags for authentication and data selection and by common processing flags. The command ends with a destination connector subcommand (local in every example here) and that destination's own flags, such as --output-dir. This page documents the CLI usage patterns shown in the repository's integration test scripts for the S3, Azure, and Elasticsearch connectors.

Usage

Use this CLI tool when you need to ingest documents from external sources into the Unstructured partition pipeline. The CLI handles downloading, partitioning, optional embedding, and writing to a destination in a single command.

Code Reference

Source Location

Repository: unstructured
File: test_unstructured_ingest/src/s3.sh (lines 29-42), test_unstructured_ingest/src/azure.sh (lines 27-40), test_unstructured_ingest/src/elasticsearch.sh (lines 41-57)

Signature

# Generic CLI pattern
unstructured-ingest <source_type> \
    --remote-url <URL> \
    --num-processes <N> \
    --download-dir <DIR> \
    --metadata-exclude <FIELDS> \
    --strategy <STRATEGY> \
    [source-specific flags] \
    local --output-dir <DIR>

# S3 source connector
unstructured-ingest s3 \
    --remote-url "s3://bucket/prefix/" \
    --anonymous \
    --num-processes 2 \
    --download-dir "$DOWNLOAD_DIR" \
    --strategy fast \
    local --output-dir "$OUTPUT_DIR"

# Azure Blob source connector
unstructured-ingest azure \
    --remote-url "abfs://container@account.dfs.core.windows.net/path/" \
    --account-name "$AZURE_ACCOUNT_NAME" \
    --num-processes 2 \
    local --output-dir "$OUTPUT_DIR"

# Elasticsearch source connector
unstructured-ingest elasticsearch \
    --hosts "$ES_HOSTS" \
    --index-name "$INDEX" \
    --username "$ES_USER" \
    --password "$ES_PASS" \
    --fields "body" \
    --num-processes 2 \
    local --output-dir "$OUTPUT_DIR"
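
The pattern above can be wrapped in a small dry-run helper so that a command can be inspected before it downloads anything. This is a minimal sketch, not part of the CLI: the function name ingest and the DRY_RUN variable are hypothetical, and only flags already shown in the signature are used.

```shell
# Hypothetical wrapper (not provided by unstructured-ingest).
# With DRY_RUN=1 it prints the command; otherwise it runs the CLI,
# passing every argument through unchanged.
ingest() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'unstructured-ingest'
        printf ' %s' "$@"
        printf '\n'
    else
        unstructured-ingest "$@"
    fi
}

# Print the S3 command instead of executing it.
DRY_RUN=1
ingest s3 \
    --remote-url "s3://bucket/prefix/" \
    --anonymous \
    --num-processes 2 \
    --strategy fast \
    local --output-dir ./structured-output/s3/
```

Unsetting DRY_RUN (or setting it to 0) makes the same call run the real command.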

Import

pip install unstructured-ingest

I/O Contract

Inputs

Name	Type	Required	Description
source_type	positional arg	Yes	Connector name: s3, azure, gcs, elasticsearch, local, etc.
--remote-url	string	Varies	Source data URL (format depends on connector)
--num-processes	int	No	Number of parallel worker processes
--download-dir	path	No	Local directory for downloaded files
--strategy	string	No	Partition strategy (auto, fast, hi_res, ocr_only)
--metadata-exclude	CSV	No	Metadata fields to exclude from output
--anonymous	flag	No	S3: Use anonymous access (no credentials)
--account-name	string	No	Azure: Storage account name
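--hosts	string	No	Elasticsearch: Host URL(s) of the cluster
--index-name	string	No	Elasticsearch: Index name
--username	string	No	Elasticsearch: Username for authentication
--password	string	No	Elasticsearch: Password for authentication
--fields	string	No	Elasticsearch: Document field(s) to ingest
--output-dir	path	No	Destination (local): Directory for the output JSON files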
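
Because --strategy accepts only the values listed in the table, a wrapper script may want to reject anything else before starting a long run. The check below is a sketch under that assumption; is_valid_strategy is a hypothetical helper, and turbo is a deliberately invalid value used for illustration.

```shell
# Hypothetical pre-flight check: succeed only for the strategy values
# listed in the Inputs table.
is_valid_strategy() {
    case "$1" in
        auto|fast|hi_res|ocr_only) return 0 ;;
        *) return 1 ;;
    esac
}

# "turbo" is not a real strategy; it demonstrates the rejection path.
for s in fast hi_res turbo; do
    if is_valid_strategy "$s"; then
        echo "$s: ok"
    else
        echo "$s: rejected"
    fi
done
```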

Outputs

Name	Type	Description
JSON files	files	Structured JSON element files in --output-dir, one per input document
exit code	int	0 = success, 1 = failure, 8 = skipped (returned by the integration test scripts when required credentials are missing)
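
The exit codes above can be handled in a calling script roughly as follows. This is an illustrative sketch: describe_exit_code and require_env are hypothetical helpers written for this page, and the skip behavior simply mirrors the code-8 convention from the table.

```shell
# Map an exit code from the Outputs table to a short status word.
# Any code other than 0 or 8 is treated as a failure.
describe_exit_code() {
    case "$1" in
        0) echo "success" ;;
        8) echo "skipped" ;;
        *) echo "failure" ;;
    esac
}

# Hypothetical guard: return 8 if any named environment variable is
# unset or empty, so the caller can skip the run.
require_env() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Skipping: $name is not set" >&2
            return 8
        fi
    done
    return 0
}

# Check the Azure credential used in the examples, then report the result.
status=0
require_env AZURE_ACCOUNT_NAME || status=$?
echo "status: $(describe_exit_code "$status")"
```

In a real script, the ingest command would run only when require_env returns 0, and its own exit code would be passed to describe_exit_code in the same way.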

Usage Examples

Ingest from S3 (Anonymous)

unstructured-ingest s3 \
    --remote-url "s3://utic-dev-tech-fixtures/small-pdf-set/" \
    --anonymous \
    --num-processes 2 \
    --download-dir ./downloads \
    --metadata-exclude "filename,file_directory" \
    --strategy fast \
    local --output-dir ./structured-output/s3/
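
After a run like the one above, a quick sanity check is to count the JSON files that landed in the output directory, since one is expected per input document. The helper below is a sketch; count_json_outputs is a hypothetical name, and the default path only mirrors the example's --output-dir.

```shell
# Directory passed to --output-dir in the S3 example (override via OUTPUT_DIR).
OUTPUT_DIR="${OUTPUT_DIR:-./structured-output/s3}"

# Hypothetical helper: print the number of .json files under a directory,
# or 0 if the directory does not exist.
count_json_outputs() {
    if [ -d "$1" ]; then
        find "$1" -type f -name '*.json' | wc -l | tr -d ' '
    else
        echo 0
    fi
}

echo "JSON outputs in $OUTPUT_DIR: $(count_json_outputs "$OUTPUT_DIR")"
```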

Ingest from Azure Blob

export AZURE_ACCOUNT_NAME="myaccount"
unstructured-ingest azure \
    --remote-url "abfs://container@${AZURE_ACCOUNT_NAME}.dfs.core.windows.net/docs/" \
    --account-name "$AZURE_ACCOUNT_NAME" \
    --num-processes 2 \
    local --output-dir ./structured-output/azure/

Ingest from Elasticsearch

unstructured-ingest elasticsearch \
    --hosts "http://localhost:9200" \
    --index-name "my_documents" \
    --username "elastic" \
    --password "$ES_PASSWORD" \
    --fields "body" \
    --num-processes 2 \
    local --output-dir ./structured-output/elasticsearch/

Related Pages

Implements Principle

Principle:Unstructured_IO_Unstructured_Ingest_Source_Configuration

Requires Environment

Environment:Unstructured_IO_Unstructured_Ingest_CLI
