Environment:Unstructured IO Unstructured Ingest CLI

From Leeroopedia
Knowledge Sources
Domains Data Ingestion
Last Updated 2026-02-12 09:00 GMT

Overview

The Ingest_CLI environment provides the dependencies and configuration for the unstructured-ingest command-line tool, which orchestrates document ingestion pipelines that read from source connectors and write to destination connectors.

Description

The ingest CLI is invoked as the unstructured-ingest command and is installed via the ingest extra in pyproject.toml, which pulls in the separate unstructured-ingest package. This tool enables batch processing of documents from sources such as S3 buckets, local filesystems, and other connectors, applying partitioning and optional post-processing before writing results to a configured destination.

Important deprecation notice: the unstructured.ingest module within the main unstructured package is deprecated in favor of the standalone unstructured-ingest project. A deprecation warning is emitted from embed/__init__.py when the old module path is used.

The CLI supports parallel processing via the MAX_PROCESSES environment variable, which defaults to os.cpu_count(). For testing purposes, S3 source tests use the --anonymous flag to access public buckets without requiring AWS credentials.

Usage

This environment is required when running document ingestion pipelines via the command line, including source connectors (S3, local filesystem, etc.) and processing configurations. It is the primary interface for batch document processing workflows.

System Requirements

Category | Requirement | Notes
Python | >= 3.11, < 3.14 | Required Python version range
OS | Linux, macOS, Windows | Cross-platform CLI tool
unstructured core | Must be installed | The ingest CLI depends on the core unstructured library for partitioning

Dependencies

System Packages

  • All system packages required by the document types being processed (see All_Docs or individual extras)
  • No additional system packages beyond what the core library requires

Python Packages

  • unstructured-ingest -- the standalone ingest CLI package (installed via the ingest extra in pyproject.toml)
  • All transitive dependencies of unstructured-ingest (connector-specific packages are installed separately)

Credentials

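The following environment variables configure the ingest CLI and its test harness. Only the AWS keys are actual secrets.

  • PYTHONPATH -- may need to be set to include the unstructured source tree during development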
  • OUTPUT_ROOT -- base directory for ingest output files
  • MAX_PROCESSES -- maximum number of parallel worker processes (default: os.cpu_count())
  • RUN_SCRIPT -- path to the ingest runner script (used in test harnesses)
  • AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY -- required for authenticated S3 access (not needed for public buckets with --anonymous)
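The variables listed above can be gathered into one place before a run. The following is a minimal sketch, not the CLI's own implementation: the IngestEnv class and read_ingest_env function are hypothetical names introduced here, and falling back to 1 worker when os.cpu_count() cannot determine the core count is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class IngestEnv:
    """Hypothetical container for the environment variables listed above."""
    output_root: Optional[str]
    max_processes: int
    run_script: Optional[str]
    has_aws_credentials: bool


def read_ingest_env(env: Mapping[str, str] = os.environ) -> IngestEnv:
    # MAX_PROCESSES defaults to the CPU count; os.cpu_count() may return
    # None, so fall back to 1 (an assumption, not documented behavior).
    raw = env.get("MAX_PROCESSES")
    max_processes = int(raw) if raw else (os.cpu_count() or 1)
    return IngestEnv(
        output_root=env.get("OUTPUT_ROOT"),
        max_processes=max_processes,
        run_script=env.get("RUN_SCRIPT"),
        # Both keys must be present for authenticated S3 access.
        has_aws_credentials=bool(
            env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY")
        ),
    )
```

Passing a plain dictionary instead of os.environ makes the function easy to exercise in tests without touching the real environment.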

Quick Install

# Install unstructured with ingest extras
pip install "unstructured[ingest]"

# Verify the CLI is available
unstructured-ingest --help

# Example: ingest from a public S3 bucket
unstructured-ingest s3 \
    --remote-url s3://example-public-bucket \
    --anonymous \
    --output-dir ./output
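When the CLI is driven from a script rather than typed by hand, the same invocation can be assembled as an argument list. This is a sketch under stated assumptions: build_s3_ingest_command and run_ingest are hypothetical helpers, and only the flags shown in the example above (--remote-url, --anonymous, --output-dir) are used.

```python
import shlex
import subprocess
from typing import List


def build_s3_ingest_command(
    remote_url: str, output_dir: str, anonymous: bool = False
) -> List[str]:
    """Assemble the argv list for the S3 example shown above."""
    cmd = ["unstructured-ingest", "s3", "--remote-url", remote_url]
    if anonymous:
        # Only appropriate for public buckets.
        cmd.append("--anonymous")
    cmd += ["--output-dir", output_dir]
    return cmd


def run_ingest(cmd: List[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit."""
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with bucket URLs or paths that contain spaces.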

Code Evidence

Deprecation warning for old module path (embed/__init__.py):

import warnings
warnings.warn(
    "unstructured.ingest is deprecated. Use the unstructured-ingest package instead.",
    DeprecationWarning,
    stacklevel=2,
)
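To find remaining uses of the deprecated path during a migration, the warning can be captured with the standard warnings module. The sketch below uses a hypothetical stand-in function, legacy_entry_point, in place of the real import so that it runs without unstructured installed; collect_deprecations is likewise an illustrative helper.

```python
import warnings
from typing import Callable, List


def legacy_entry_point() -> None:
    # Hypothetical stand-in for code that triggers the warning shown above.
    warnings.warn(
        "unstructured.ingest is deprecated. Use the unstructured-ingest package instead.",
        DeprecationWarning,
        stacklevel=2,
    )


def collect_deprecations(func: Callable[[], None]) -> List[str]:
    """Call func and return the messages of any DeprecationWarning raised."""
    with warnings.catch_warnings(record=True) as caught:
        # Python hides many DeprecationWarnings by default; record them all.
        warnings.simplefilter("always")
        func()
    return [
        str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)
    ]
```

Running the interpreter with -W error::DeprecationWarning is an alternative that turns such warnings into exceptions, which can be useful in CI.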

MAX_PROCESSES default (ingest CLI):

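import os

# Note: if MAX_PROCESSES is unset and os.cpu_count() returns None, int() raises TypeError.
max_processes = int(os.environ.get("MAX_PROCESSES", os.cpu_count()))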

S3 anonymous access in tests:

unstructured-ingest s3 \
    --remote-url s3://utic-dev-tech-fixtures \
    --anonymous \
    --output-dir ./output

Common Errors

Error Message | Cause | Solution
command not found: unstructured-ingest | The ingest extra is not installed | Install via pip install "unstructured[ingest]"
DeprecationWarning: unstructured.ingest is deprecated | Using the old unstructured.ingest import path | Migrate to the standalone unstructured-ingest package
NoCredentialsError: Unable to locate credentials | Trying to access a private S3 bucket without credentials | Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or use --anonymous for public buckets
OSError: [Errno 28] No space left on device | Output directory has insufficient disk space | Ensure adequate disk space or change OUTPUT_ROOT to a volume with more capacity
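Two of the errors above (the missing command and the full disk) can be detected before a long run starts. The following is a rough preflight sketch using only the standard library; the preflight function and the 1 GB default threshold are assumptions chosen for illustration, not part of the ingest CLI.

```python
import shutil
from pathlib import Path
from typing import List


def preflight(
    output_dir: str,
    min_free_bytes: int = 1_000_000_000,  # assumed threshold, adjust as needed
    cli_name: str = "unstructured-ingest",
) -> List[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems: List[str] = []
    if shutil.which(cli_name) is None:
        problems.append(
            f'{cli_name} not found on PATH; try: pip install "unstructured[ingest]"'
        )
    # disk_usage needs an existing path, so walk up to the nearest
    # existing ancestor of the (possibly not yet created) output directory.
    probe = Path(output_dir).resolve()
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    free = shutil.disk_usage(probe).free
    if free < min_free_bytes:
        problems.append(
            f"only {free} bytes free under {probe}; need at least {min_free_bytes}"
        )
    return problems
```

A caller might print the returned problems and exit early instead of starting the pipeline.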

Compatibility Notes

  • The unstructured.ingest module in the main package is deprecated; all new development should use the standalone unstructured-ingest package
  • The --anonymous flag is available for S3 source connectors to access public buckets without AWS credentials
  • MAX_PROCESSES defaults to the number of CPU cores; reduce this on memory-constrained systems
  • The ingest CLI supports multiple source and destination connectors; each may require additional connector-specific packages
