Environment:Unstructured IO Unstructured Ingest CLI
| Knowledge Sources | |
|---|---|
| Domains | Data Ingestion |
| Last Updated | 2026-02-12 09:00 GMT |
Overview
The Ingest_CLI environment provides the dependencies and configuration for the unstructured-ingest command-line tool, which orchestrates document ingestion pipelines from source connectors (such as S3 buckets or the local filesystem) to a configured destination.
Description
The ingest CLI is invoked as the unstructured-ingest command and is installed via the ingest extra in pyproject.toml, which pulls in the separate unstructured-ingest package. This tool enables batch processing of documents from sources such as S3 buckets, local filesystems, and other connectors, applying partitioning and optional post-processing before writing results to a configured destination.
Important deprecation notice: the unstructured.ingest module within the main unstructured package is deprecated in favor of the standalone unstructured-ingest project. A deprecation warning is emitted from embed/__init__.py when the old module path is used.
The CLI supports parallel processing via the MAX_PROCESSES environment variable, which defaults to os.cpu_count(). For testing purposes, S3 source tests use the --anonymous flag to access public buckets without requiring AWS credentials.
Usage
This environment is required when running document ingestion pipelines via the command line, including source connectors (S3, local filesystem, etc.) and processing configurations. It is the primary interface for batch document processing workflows.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Python | >= 3.11, < 3.14 | Required Python version range |
| OS | Linux, macOS, Windows | Cross-platform CLI tool |
| unstructured core | Must be installed | The ingest CLI depends on the core unstructured library for partitioning |
Dependencies
System Packages
- The system packages required by the document types being processed (see All_Docs or individual extras)
- The ingest CLI itself adds no system packages beyond those the core library requires
Python Packages
- unstructured-ingest -- the standalone ingest CLI package (installed via the ingest extra in pyproject.toml)
- All transitive dependencies of unstructured-ingest (connector-specific packages are installed separately)
Credentials
- PYTHONPATH -- may need to be set to include the unstructured source tree during development
- OUTPUT_ROOT -- base directory for ingest output files
- MAX_PROCESSES -- maximum number of parallel worker processes (default: os.cpu_count())
- RUN_SCRIPT -- path to the ingest runner script (used in test harnesses)
- AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY -- required for authenticated S3 access (not needed for public buckets with --anonymous)
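The credential rule in the list above (AWS keys for authenticated access, --anonymous for public buckets) can be sketched as a small helper that assembles the command line. This is only an illustration: the function name build_s3_ingest_command is hypothetical, the flags are the ones that appear in this page's examples, and falling back to --anonymous when no keys are set is an assumed policy that only makes sense for public buckets.

```python
import os
from typing import Mapping, Optional


def build_s3_ingest_command(
    remote_url: str,
    output_dir: str,
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Assemble argv for an S3 ingest run (hypothetical helper).

    Only flags shown elsewhere on this page are used:
    s3, --remote-url, --anonymous, --output-dir.
    """
    env = os.environ if env is None else env
    cmd = [
        "unstructured-ingest",
        "s3",
        "--remote-url",
        remote_url,
        "--output-dir",
        output_dir,
    ]
    # Assumption: both variables must be non-empty to count as credentials.
    has_credentials = bool(env.get("AWS_ACCESS_KEY_ID")) and bool(
        env.get("AWS_SECRET_ACCESS_KEY")
    )
    if not has_credentials:
        # Assumed policy: without keys, try anonymous access (public buckets only).
        cmd.append("--anonymous")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(build_s3_ingest_command("s3://example-public-bucket", "./output", env={}))
```

The resulting list can be passed to subprocess.run without a shell, which avoids quoting problems in bucket URLs and paths.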
Quick Install
# Install unstructured with ingest extras
pip install "unstructured[ingest]"
# Verify the CLI is available
unstructured-ingest --help
# Example: ingest from a public S3 bucket
unstructured-ingest s3 \
--remote-url s3://example-public-bucket \
--anonymous \
--output-dir ./output
Code Evidence
Deprecation warning for old module path (embed/__init__.py):
import warnings
warnings.warn(
"unstructured.ingest is deprecated. Use the unstructured-ingest package instead.",
DeprecationWarning,
stacklevel=2,
)
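Because Python's default filters hide DeprecationWarning in many contexts, code that still uses the old path may never show this message. The following is a minimal sketch of how such warnings could be collected in a test. legacy_import_stub is a hypothetical stand-in that warns the same way as the snippet above; it does not import the real module.

```python
import warnings
from typing import Callable


def legacy_import_stub() -> None:
    """Hypothetical stand-in that emits the same warning as the snippet above."""
    warnings.warn(
        "unstructured.ingest is deprecated. Use the unstructured-ingest package instead.",
        DeprecationWarning,
        stacklevel=2,
    )


def collect_deprecations(func: Callable[[], None]) -> list[str]:
    """Run func and return the messages of any DeprecationWarning it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # Override the default filters, which may suppress DeprecationWarning.
        warnings.simplefilter("always")
        func()
    return [
        str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)
    ]


if __name__ == "__main__":
    for message in collect_deprecations(legacy_import_stub):
        print(message)
```

Outside of tests, running the interpreter with -W error::DeprecationWarning turns these warnings into exceptions. Note that a warning raised at import time fires only on the first import of a module in a given process, since later imports are served from the module cache.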
MAX_PROCESSES default (ingest CLI):
max_processes = int(os.environ.get("MAX_PROCESSES", os.cpu_count()))
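One caveat with the one-liner above: os.cpu_count() is documented to return None when the core count cannot be determined, and int(None) raises TypeError. A more defensive variant might look like the sketch below. The function name resolve_max_processes and the clamping to a minimum of 1 are assumptions for illustration, not the project's code.

```python
import os
from typing import Mapping, Optional


def resolve_max_processes(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the worker count from MAX_PROCESSES, else the CPU count (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("MAX_PROCESSES")
    if raw is None or not raw.strip():
        # os.cpu_count() may return None, so fall back to a single worker.
        return os.cpu_count() or 1
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"MAX_PROCESSES must be an integer, got {raw!r}") from None
    # Assumption: zero or negative values are clamped to one worker.
    return max(1, value)


if __name__ == "__main__":
    print(resolve_max_processes())
```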
S3 anonymous access in tests:
unstructured-ingest s3 \
--remote-url s3://utic-dev-tech-fixtures \
--anonymous \
--output-dir ./output
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| command not found: unstructured-ingest | The ingest extra is not installed | Install via pip install "unstructured[ingest]" |
| DeprecationWarning: unstructured.ingest is deprecated | Using the old unstructured.ingest import path | Migrate to the standalone unstructured-ingest package |
| NoCredentialsError: Unable to locate credentials | Trying to access a private S3 bucket without credentials | Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or use --anonymous for public buckets |
| OSError: [Errno 28] No space left on device | Output directory has insufficient disk space | Ensure adequate disk space or change OUTPUT_ROOT to a volume with more capacity |
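Two of the errors above (a missing CLI and a full disk) can be caught before a long run starts. The sketch below shows one way to do that using only the standard library. The function name preflight and the 1 GiB default threshold are illustrative assumptions, not part of the tool.

```python
import shutil
from pathlib import Path


def preflight(
    output_root: str,
    min_free_bytes: int = 1 << 30,  # assumed threshold: 1 GiB
    cli_name: str = "unstructured-ingest",
) -> list[str]:
    """Return a list of human-readable problems; empty means the checks passed."""
    problems: list[str] = []

    # Check 1: is the CLI on PATH?
    if shutil.which(cli_name) is None:
        problems.append(
            f'{cli_name} not found on PATH; try: pip install "unstructured[ingest]"'
        )

    # Check 2: free space on the volume that will hold the output.
    # disk_usage needs an existing path, so walk up to the nearest existing ancestor.
    probe = Path(output_root).resolve()
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    free = shutil.disk_usage(probe).free
    if free < min_free_bytes:
        problems.append(
            f"only {free} bytes free under {probe}; need at least {min_free_bytes}"
        )
    return problems


if __name__ == "__main__":
    for problem in preflight("./output"):
        print(problem)
```

A check like this cannot guarantee that a run will fit on disk, since output size depends on the documents processed; it only rules out the obviously bad cases early.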
Compatibility Notes
- The unstructured.ingest module in the main package is deprecated; all new development should use the standalone unstructured-ingest package
- The --anonymous flag is available for S3 source connectors to access public buckets without AWS credentials
- MAX_PROCESSES defaults to the number of CPU cores; reduce this on memory-constrained systems
- The ingest CLI supports multiple source and destination connectors; each may require additional connector-specific packages