Environment:Unstructured IO Unstructured All Docs
| Knowledge Sources | |
|---|---|
| Domains | Document Partitioning |
| Last Updated | 2026-02-12 09:00 GMT |
Overview
The All_Docs environment installs every document type dependency at once via the all-docs pip extra, enabling the unstructured library to partition all supported file formats.
Description
The all-docs extra is a convenience meta-extra defined in pyproject.toml that aggregates all individual document type extras into a single installable target. It is equivalent to installing:
unstructured[csv,doc,docx,epub,image,md,odt,org,pdf,ppt,pptx,rtf,rst,tsv,xlsx]
This pulls in a wide range of third-party libraries covering spreadsheet parsing, word processing, presentation files, PDF extraction with OCR, Markdown rendering, and more. The partition/auto.py module uses dependency_exists() checks (lines 370-378) to verify that the extra required for a given file type is installed. If it is not, the module raises an ImportError whose message names the extra to install (e.g., pip install "unstructured[pdf]").
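The guard described above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: require_dependency is a hypothetical helper, and it assumes that checking importability with importlib.util.find_spec is a fair stand-in for dependency_exists().

```python
import importlib.util


def require_dependency(module_name: str, filetype: str, extra: str) -> None:
    """Raise ImportError with an install hint if `module_name` is not importable.

    Hypothetical helper for illustration; pass a top-level module name.
    """
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f'The "{module_name}" package is required to process {filetype} files. '
            f'You can install it with: pip install "unstructured[{extra}]"'
        )


# Passes silently because `json` ships with the standard library.
require_dependency("json", "JSON", "all-docs")

try:
    # Made-up module name, chosen so that the check fails.
    require_dependency("no_such_module_for_demo", "PDF", "pdf")
except ImportError as exc:
    print(exc)
```

Failing early with the name of the extra tells the user exactly what to install, which is more useful than a bare ModuleNotFoundError raised deep inside a parser.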
Usage
Use this environment when you need to partition every supported document type without tracking which individual extras to install. It is the recommended setup for development environments, CI pipelines that process mixed document sets, and deployments that must handle the full range of file types.
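With all-docs installed, a mixed batch of files can go through the single partition() entry point in unstructured.partition.auto. The sketch below assumes that entry point; partition_many and its partition_fn parameter are hypothetical names, and the parameter exists only so the loop can be exercised without the library installed.

```python
from typing import Callable, Iterable, Optional


def partition_many(paths: Iterable[str], partition_fn: Optional[Callable] = None) -> dict:
    """Partition each path and return a {path: elements} mapping.

    Hypothetical helper; `partition_fn` defaults to unstructured's partition().
    """
    if partition_fn is None:
        # Imported lazily so this module loads even where unstructured is absent.
        from unstructured.partition.auto import partition as partition_fn

    results = {}
    for path in paths:
        # partition() detects the file type and dispatches to the matching parser.
        results[str(path)] = partition_fn(filename=str(path))
    return results
```

Calling partition_many(["report.pdf", "notes.docx", "data.csv"]) (hypothetical file names) returns one list of elements per file, with no per-format setup beyond the single all-docs install.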
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Python | >= 3.11, < 3.14 | Required Python version range |
| OS (Linux) | x86_64 or aarch64 | Fully supported |
| OS (macOS) | arm64 (Apple Silicon) | Fully supported |
| OS (Windows) | AMD64 | Supported |
| Tesseract OCR | Required for PDF/image extras | System package: tesseract-ocr |
| Poppler | Required for pdf2image | System package: poppler-utils |
| libmagic | Required for file type detection | System package: libmagic1 |
| Pandoc | Required for doc, odt, epub, rtf, org extras | System package: pandoc |
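A quick preflight script can confirm the requirements in the table before installing the Python packages. This is a sketch under assumptions: the executable names (tesseract, pdftoppm, pandoc) are the ones these system packages commonly provide, and libmagic is looked up as a shared library named "magic". Both may differ by platform or packaging.

```python
import ctypes.util
import shutil
import sys

# Assumed mapping from system package to an executable it usually provides.
REQUIRED_BINARIES = {
    "tesseract-ocr": "tesseract",
    "poppler-utils": "pdftoppm",
    "pandoc": "pandoc",
}


def python_version_ok(version=sys.version_info) -> bool:
    """Check the interpreter against the documented range: >= 3.11, < 3.14."""
    return (3, 11) <= tuple(version[:2]) < (3, 14)


def missing_system_packages() -> list:
    """Return the names of system packages whose executable or library is not found."""
    missing = [pkg for pkg, exe in REQUIRED_BINARIES.items() if shutil.which(exe) is None]
    # libmagic is a shared library rather than an executable.
    if ctypes.util.find_library("magic") is None:
        missing.append("libmagic1")
    return missing


print("Python version supported:", python_version_ok())
print("Missing system packages:", ", ".join(missing_system_packages()) or "none")
```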
Dependencies
System Packages
- libmagic1 -- MIME type detection
- tesseract-ocr -- OCR engine for PDF and image processing
- poppler-utils -- PDF rendering for pdf2image
- pandoc -- document format conversion (doc, odt, epub, rtf, org)
- libreoffice -- converts .doc and .ppt files; optional, but needed if you process those formats
Python Packages
- pandas -- CSV and TSV parsing
- python-docx -- DOCX document parsing
- pypandoc-binary -- Pandoc wrapper for multiple formats (epub, odt, rtf, rst, org)
- pdf2image -- PDF page to image conversion
- pdfminer.six -- PDF text extraction
- pi-heif -- HEIF image support
- pikepdf -- PDF manipulation and repair
- pypdf -- PDF reading
- unstructured-inference -- layout detection models
- unstructured-pytesseract -- Tesseract OCR wrapper
- markdown -- Markdown parsing
- python-pptx -- PowerPoint parsing
- msoffcrypto-tool -- encrypted Office document support
- networkx -- graph-based document structure analysis
- openpyxl -- XLSX reading
- xlrd -- XLS reading
- google-cloud-vision -- Google Cloud Vision OCR (optional alternative)
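To see which of these distributions are present in the current environment, importlib.metadata can report installed versions. The sketch below checks only a subset of the list above; installed_versions is a hypothetical helper name.

```python
from importlib import metadata

# A subset of the distribution names listed above; extend as needed.
PACKAGES = ["pandas", "python-docx", "pdfminer.six", "pikepdf", "python-pptx", "openpyxl"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:20} {version or 'NOT INSTALLED'}")
```

Note that distribution names (python-docx) often differ from import names (docx), which is why this looks up package metadata rather than attempting imports.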
Credentials
- GOOGLE_APPLICATION_CREDENTIALS -- path to Google Cloud service account JSON (only if using Google Cloud Vision OCR)
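If Google Cloud Vision OCR is in use, it can help to validate the variable before starting a long job. A minimal sketch, assuming only that the variable should point to an existing file; google_credentials_path is a hypothetical helper and does not validate the JSON contents.

```python
import os
from pathlib import Path


def google_credentials_path(env=os.environ):
    """Return the credentials path if the variable is set and the file exists, else None."""
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        return None
    path = Path(value)
    return path if path.is_file() else None


if google_credentials_path() is None:
    print("GOOGLE_APPLICATION_CREDENTIALS is unset or does not point to a file")
```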
Quick Install
# Install system dependencies (Ubuntu/Debian)
sudo apt-get update && sudo apt-get install -y \
libmagic1 tesseract-ocr poppler-utils pandoc
# Install unstructured with all document type extras
pip install "unstructured[all-docs]"
Code Evidence
Dependency check with helpful error message (partition/auto.py:370-378):
if not dependency_exists(dep):
    raise ImportError(
        f'The "{dep}" package is required to process {filetype} files. '
        f'You can install it with: pip install "unstructured[{extra}]"'
    )
all-docs extra definition (pyproject.toml):
[project.optional-dependencies]
all-docs = [
"unstructured[csv,doc,docx,epub,image,md,odt,org,pdf,ppt,pptx,rtf,rst,tsv,xlsx]"
]
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| ImportError: The "python-docx" package is required to process .docx files | The docx extra is not installed | Install via pip install "unstructured[docx]" or pip install "unstructured[all-docs]" |
| ImportError: The "pdfminer" package is required to process .pdf files | The pdf extra is not installed | Install via pip install "unstructured[pdf]" or pip install "unstructured[all-docs]" |
| OSError: pandoc not found | Pandoc system package is not installed | Install via sudo apt-get install pandoc or brew install pandoc |
| TesseractNotFoundError | Tesseract OCR is not installed on the system | Install via sudo apt-get install tesseract-ocr |
Compatibility Notes
- The all-docs extra is the heaviest installation option; for production deployments processing only specific file types, prefer installing individual extras to reduce image size
- pypandoc-binary bundles its own Pandoc binary, but certain formats may still require the system Pandoc package
- Supported platforms: Linux x86_64, Linux aarch64, macOS arm64, Windows AMD64
- Some extras (e.g., image, pdf) have overlapping dependencies; installing all-docs resolves these automatically
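For deployments that handle only a few file types, the smaller install target recommended in the first note can be derived from the files themselves. This is a sketch under assumptions: the extension-to-extra mapping is illustrative (in particular, which image extensions the image extra covers is assumed), and install_target is a hypothetical helper.

```python
from pathlib import Path

# Assumed mapping from file extension to extra; verify against the extras list.
EXTENSION_TO_EXTRA = {
    ".csv": "csv", ".doc": "doc", ".docx": "docx", ".epub": "epub",
    ".md": "md", ".odt": "odt", ".org": "org", ".pdf": "pdf",
    ".ppt": "ppt", ".pptx": "pptx", ".rtf": "rtf", ".rst": "rst",
    ".tsv": "tsv", ".xlsx": "xlsx",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
}


def install_target(filenames) -> str:
    """Build a pip requirement naming only the extras these files need."""
    extras = set()
    for name in filenames:
        extra = EXTENSION_TO_EXTRA.get(Path(name).suffix.lower())
        if extra is not None:
            extras.add(extra)
    if not extras:
        return "unstructured"
    return "unstructured[" + ",".join(sorted(extras)) + "]"


# Hypothetical file names for illustration.
print(install_target(["report.pdf", "notes.DOCX", "scan.png"]))
```

The resulting string can be passed to pip install in place of "unstructured[all-docs]" when building a slimmer image.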