Implementation:Unstructured IO Unstructured Unstructured Ingest CLI Source

Knowledge Sources
Domains Data_Ingestion, ETL, CLI
Last Updated 2026-02-12 00:00 GMT

Overview

Concrete tool for configuring source connectors in the unstructured-ingest CLI pipeline.

Description

The unstructured-ingest CLI takes a source connector type as its first positional argument, followed by connector-specific flags for authentication and data selection and by common processing flags. A destination subcommand (local --output-dir in the examples on this page) closes the command. This page documents the CLI usage patterns demonstrated in the repository's integration test scripts for the S3, Azure, and Elasticsearch connectors.

Usage

Use this CLI tool when you need to ingest documents from external sources into the Unstructured partition pipeline. The CLI handles downloading, partitioning, optional embedding, and writing to a destination in a single command.

Code Reference

Source Location

  • Repository: unstructured
  • File: test_unstructured_ingest/src/s3.sh (lines 29-42), test_unstructured_ingest/src/azure.sh (lines 27-40), test_unstructured_ingest/src/elasticsearch.sh (lines 41-57)

Signature

# Generic CLI pattern
unstructured-ingest <source_type> \
    --remote-url <URL> \
    --num-processes <N> \
    --download-dir <DIR> \
    --metadata-exclude <FIELDS> \
    --strategy <STRATEGY> \
    [source-specific flags] \
    local --output-dir <DIR>

# S3 source connector
unstructured-ingest s3 \
    --remote-url "s3://bucket/prefix/" \
    --anonymous \
    --num-processes 2 \
    --download-dir "$DOWNLOAD_DIR" \
    --strategy fast \
    local --output-dir "$OUTPUT_DIR"

# Azure Blob source connector
unstructured-ingest azure \
    --remote-url "abfs://container@account.dfs.core.windows.net/path/" \
    --account-name "$AZURE_ACCOUNT_NAME" \
    --num-processes 2 \
    local --output-dir "$OUTPUT_DIR"

# Elasticsearch source connector
unstructured-ingest elasticsearch \
    --hosts "$ES_HOSTS" \
    --index-name "$INDEX" \
    --username "$ES_USER" \
    --password "$ES_PASS" \
    --fields "body" \
    --num-processes 2 \
    local --output-dir "$OUTPUT_DIR"
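
The generic pattern above can be sketched as a small POSIX shell helper that places the source flags first and the local destination last. The function name ingest_cmd and the DRY_RUN variable are hypothetical and are not part of unstructured-ingest. By default the helper only prints the command, so it can be tried without credentials or network access.

```shell
# Hypothetical helper (not part of unstructured-ingest).
# Usage: ingest_cmd <source_type> <output_dir> [source and processing flags...]
ingest_cmd() {
    source_type=$1
    output_dir=$2
    shift 2
    # Rebuild the argument list: source flags first, "local" destination last.
    set -- unstructured-ingest "$source_type" "$@" local --output-dir "$output_dir"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: show the command only (shell quoting is not preserved here).
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

ingest_cmd s3 ./structured-output/s3/ \
    --remote-url "s3://bucket/prefix/" \
    --anonymous \
    --num-processes 2 \
    --strategy fast
# prints: unstructured-ingest s3 --remote-url s3://bucket/prefix/ --anonymous --num-processes 2 --strategy fast local --output-dir ./structured-output/s3/
```

Setting DRY_RUN=0 would execute the assembled command instead of printing it.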

Import

pip install unstructured-ingest

I/O Contract

Inputs

Name | Type | Required | Description
source_type | positional arg | Yes | Connector name: s3, azure, gcs, elasticsearch, local, etc.
--remote-url | string | Varies | Source data URL (format depends on connector)
--num-processes | int | No | Parallel processing workers
--download-dir | path | No | Local directory for downloaded files
--strategy | string | No | Partition strategy (auto, fast, hi_res, ocr_only)
--metadata-exclude | CSV | No | Metadata fields to exclude from output
--anonymous | flag | No | S3: Use anonymous access (no credentials)
--account-name | string | No | Azure: Storage account name
--hosts | string | No | Elasticsearch: Host URL
--index-name | string | No | Elasticsearch: Index name
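
Because --strategy accepts a small fixed set of values, a wrapper script could check the value before starting a long ingest run. The sketch below is only an illustration: is_valid_strategy is a hypothetical function, the STRATEGY variable is an assumption, and the accepted values are copied from the table above rather than queried from the CLI.

```shell
# Hypothetical pre-flight check; the values mirror the inputs table above.
is_valid_strategy() {
    case "$1" in
        auto|fast|hi_res|ocr_only) return 0 ;;
        *) return 1 ;;
    esac
}

# STRATEGY is an assumed environment variable, defaulting to "fast".
STRATEGY="${STRATEGY:-fast}"
if is_valid_strategy "$STRATEGY"; then
    echo "strategy ok: $STRATEGY"
else
    echo "unknown strategy: $STRATEGY" >&2
fi
```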

Outputs

Name | Type | Description
JSON files | files | Structured JSON element files in --output-dir, one per input document
exit code | int | 0=success, 1=failure, 8=skip (missing credentials)
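
A calling script may want to treat the skip status differently from a real failure. The following sketch assumes only the three exit codes listed above. describe_exit_code and run_and_report are hypothetical helper names, and any other status is reported as unknown rather than guessed at.

```shell
# Hypothetical helper: translate an exit status using the outputs table above.
describe_exit_code() {
    case "$1" in
        0) echo "success" ;;
        1) echo "failure" ;;
        8) echo "skipped" ;;
        *) echo "unknown" ;;
    esac
}

# Run any command, print the meaning of its exit status, and pass the status on.
run_and_report() {
    status=0
    "$@" || status=$?
    describe_exit_code "$status"
    return "$status"
}

# Stand-in command for illustration; a real call would pass unstructured-ingest
# and its arguments instead of "true".
run_and_report true
```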

Usage Examples

Ingest from S3 (Anonymous)

unstructured-ingest s3 \
    --remote-url "s3://utic-dev-tech-fixtures/small-pdf-set/" \
    --anonymous \
    --num-processes 2 \
    --download-dir ./downloads \
    --metadata-exclude "filename,file_directory" \
    --strategy fast \
    local --output-dir ./structured-output/s3/

Ingest from Azure Blob

export AZURE_ACCOUNT_NAME="myaccount"
unstructured-ingest azure \
    --remote-url "abfs://container@${AZURE_ACCOUNT_NAME}.dfs.core.windows.net/docs/" \
    --account-name "$AZURE_ACCOUNT_NAME" \
    --num-processes 2 \
    local --output-dir ./structured-output/azure/

Ingest from Elasticsearch

unstructured-ingest elasticsearch \
    --hosts "http://localhost:9200" \
    --index-name "my_documents" \
    --username "elastic" \
    --password "$ES_PASSWORD" \
    --fields "body" \
    --num-processes 2 \
    local --output-dir ./structured-output/elasticsearch/
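
After a run, a quick sanity check is to count the JSON files written under the output directory and compare that number with the number of input documents. This is a minimal sketch: count_json_outputs is a hypothetical helper, the default OUTPUT_DIR path is borrowed from the S3 example, and the helper assumes only that outputs end in .json.

```shell
# Hypothetical helper: count *.json files anywhere under the given directory.
count_json_outputs() {
    find "$1" -type f -name '*.json' | wc -l | tr -d ' '
}

# OUTPUT_DIR is an assumed variable; the default is taken from the S3 example.
OUTPUT_DIR="${OUTPUT_DIR:-./structured-output/s3}"
if [ -d "$OUTPUT_DIR" ]; then
    echo "JSON outputs: $(count_json_outputs "$OUTPUT_DIR")"
else
    echo "no output directory at $OUTPUT_DIR yet"
fi
```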
