Environment:Unstructured IO Unstructured All Docs

Knowledge Sources	unstructured
Domains	Document Partitioning
Last Updated	2026-02-12 09:00 GMT

Overview

The All_Docs environment installs every document type dependency at once via the all-docs pip extra, enabling the unstructured library to partition all supported file formats.

Description

The all-docs extra is a convenience meta-extra defined in pyproject.toml that aggregates all individual document type extras into a single installable target. It is equivalent to installing:

unstructured[csv,doc,docx,epub,image,md,odt,org,pdf,ppt,pptx,rtf,rst,tsv,xlsx]

This pulls in a wide range of third-party libraries covering spreadsheet parsing, word processing, presentation files, PDF extraction with OCR, Markdown rendering, and more. The partition/auto.py module uses dependency_exists() checks (lines 370-378) to verify that the extra required for a given file type is installed. If it is not, the module raises an ImportError whose message tells the user which extra to install (e.g., pip install "unstructured[pdf]").

Usage

Use this environment when you need to partition any supported document type without tracking which individual extras to install. This is the recommended setup for development environments, CI pipelines that process mixed document sets, and deployments that must handle the full range of file types.
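As an illustration of the mixed-document use case, the following is a minimal sketch that partitions a batch of files through the auto partitioner. The helper name partition_mixed and its partition_fn parameter are hypothetical and exist only so the loop can be exercised without the library installed; by default the sketch imports partition from unstructured.partition.auto.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Optional


def partition_mixed(
    paths: Iterable[Path],
    partition_fn: Optional[Callable[..., list]] = None,
) -> Dict[str, List[str]]:
    """Partition each file and collect the text of its elements.

    `partition_fn` is a hypothetical injection point for testing; when it is
    omitted, the real auto partitioner is imported lazily.
    """
    if partition_fn is None:
        # Lazy import: the helper can be defined without unstructured installed.
        from unstructured.partition.auto import partition as partition_fn

    results: Dict[str, List[str]] = {}
    for path in paths:
        # The auto partitioner detects the file type and dispatches accordingly.
        elements = partition_fn(filename=str(path))
        # str() on an element yields its text.
        results[str(path)] = [str(element) for element in elements]
    return results
```

With all-docs installed, calling partition_mixed([Path("report.pdf"), Path("notes.docx")]) should work without any per-format installation step, provided the system packages listed below are present.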

System Requirements

Category	Requirement	Notes
Python	>= 3.11, < 3.14	Required Python version range
OS (Linux)	x86_64 or aarch64	Fully supported
OS (macOS)	arm64 (Apple Silicon)	Fully supported
OS (Windows)	AMD64	Supported
Tesseract OCR	Required for PDF/image extras	System package: `tesseract-ocr`
Poppler	Required for pdf2image	System package: `poppler-utils`
libmagic	Required for file type detection	System package: `libmagic1`
Pandoc	Required for epub, odt, org, rst, rtf extras	System package: `pandoc`

Dependencies

System Packages

libmagic1 -- MIME type detection
tesseract-ocr -- OCR engine for PDF and image processing
poppler-utils -- PDF rendering for pdf2image
pandoc -- document format conversion (epub, odt, org, rst, rtf)
libreoffice -- converts .doc and .ppt files before partitioning (optional, but recommended if you process those formats)
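A preflight check for these system packages could look like the sketch below. It only looks for executables on PATH. The executable names (tesseract, pdftoppm, pandoc, soffice) are assumptions about what each package installs, and libmagic is not covered because it is a shared library rather than a command.

```python
import shutil
from typing import Dict, List, Mapping, Optional

# Assumed mapping from system package to the executable it provides.
SYSTEM_TOOLS: Dict[str, str] = {
    "tesseract-ocr": "tesseract",
    "poppler-utils": "pdftoppm",
    "pandoc": "pandoc",
    "libreoffice": "soffice",
}


def find_system_tools(
    tools: Mapping[str, str] = SYSTEM_TOOLS,
) -> Dict[str, Optional[str]]:
    """Return the resolved path of each executable, or None if it is not on PATH."""
    return {package: shutil.which(exe) for package, exe in tools.items()}


def missing_system_tools(tools: Mapping[str, str] = SYSTEM_TOOLS) -> List[str]:
    """Return the package names whose executable could not be found."""
    return [pkg for pkg, path in find_system_tools(tools).items() if path is None]


if __name__ == "__main__":
    missing = missing_system_tools()
    if missing:
        print("Missing system packages:", ", ".join(missing))
    else:
        print("All checked system tools were found on PATH.")
```

Running a check like this at container build time or at the start of a CI job surfaces a missing system package earlier than a failure in the middle of partitioning.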

Python Packages

pandas -- CSV and TSV parsing
python-docx -- DOCX document parsing
pypandoc-binary -- Pandoc wrapper for multiple formats (epub, odt, rtf, rst, org)
pdf2image -- PDF page to image conversion
pdfminer.six -- PDF text extraction
pi-heif -- HEIF image support
pikepdf -- PDF manipulation and repair
pypdf -- PDF reading
unstructured-inference -- layout detection models
unstructured-pytesseract -- Tesseract OCR wrapper
markdown -- Markdown parsing
python-pptx -- PowerPoint parsing
msoffcrypto-tool -- encrypted Office document support
networkx -- graph-based document structure analysis
openpyxl -- XLSX reading
xlrd -- XLS reading
google-cloud-vision -- Google Cloud Vision OCR (optional alternative)

Credentials

GOOGLE_APPLICATION_CREDENTIALS -- path to Google Cloud service account JSON (only if using Google Cloud Vision OCR)
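If Google Cloud Vision OCR is in use, the credential variable can be validated before any OCR call is made. The sketch below is one way to do that; the function name google_credentials_path is hypothetical, and it checks only that the variable is set and points at an existing file, not that the JSON is a valid service account key.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def google_credentials_path(
    env: Mapping[str, str] = os.environ,
) -> Optional[Path]:
    """Return the credentials file path if the variable is set and the file exists."""
    raw = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not raw:
        # Variable unset or empty: Google Cloud Vision OCR is not configured.
        return None
    path = Path(raw)
    # Only report the path when it points at a regular file.
    return path if path.is_file() else None
```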

Quick Install

# Install system dependencies (Ubuntu/Debian)
sudo apt-get update && sudo apt-get install -y \
    libmagic1 tesseract-ocr poppler-utils pandoc

# Install unstructured with all document type extras
pip install "unstructured[all-docs]"

Code Evidence

Dependency check with helpful error message (partition/auto.py:370-378):

if not dependency_exists(dep):
    raise ImportError(
        f'The "{dep}" package is required to process {filetype} files. '
        f'You can install it with: pip install "unstructured[{extra}]"'
    )
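The excerpt above depends on names defined elsewhere in the module. A self-contained sketch of the same pattern follows, using importlib.util.find_spec as a stand-in for the library's dependency_exists(); the helper names dependency_installed and require_dependency are hypothetical, and the library's actual implementation may differ.

```python
import importlib.util


def dependency_installed(module_name: str) -> bool:
    """Report whether a module can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises if a parent package is missing or the spec is unset.
        return False


def require_dependency(module_name: str, filetype: str, extra: str) -> None:
    """Raise an ImportError naming the extra that provides a missing module."""
    if not dependency_installed(module_name):
        raise ImportError(
            f'The "{module_name}" package is required to process {filetype} files. '
            f'You can install it with: pip install "unstructured[{extra}]"'
        )
```

Checking with find_spec avoids importing a heavy package just to learn whether it is present, which keeps the check cheap when it runs on every partition call.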

all-docs extra definition (pyproject.toml):

[project.optional-dependencies]
all-docs = [
    "unstructured[csv,doc,docx,epub,image,md,odt,org,pdf,ppt,pptx,rtf,rst,tsv,xlsx]"
]

Common Errors

Error Message	Cause	Solution
`ImportError: The "python-docx" package is required to process .docx files`	The docx extra is not installed	Install via `pip install "unstructured[docx]"` or `pip install "unstructured[all-docs]"`
`ImportError: The "pdfminer" package is required to process .pdf files`	The pdf extra is not installed	Install via `pip install "unstructured[pdf]"` or `pip install "unstructured[all-docs]"`
`OSError: pandoc not found`	Pandoc system package is not installed	Install via `sudo apt-get install pandoc` or `brew install pandoc`
`TesseractNotFoundError`	Tesseract OCR is not installed on the system	Install via `sudo apt-get install tesseract-ocr` or `brew install tesseract`

Compatibility Notes

The all-docs extra is the heaviest installation option; for production deployments processing only specific file types, prefer installing individual extras to reduce image size
pypandoc-binary bundles its own Pandoc binary, but certain formats may still require the system Pandoc package
Supported platforms: Linux x86_64, Linux aarch64, macOS arm64, Windows AMD64
Some extras (e.g., image, pdf) have overlapping dependencies; when all-docs is installed, pip resolves the shared packages once rather than per extra
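For the production case in the first note, the set of individual extras can be derived from the file types a deployment actually handles. The sketch below is illustrative only: the SUFFIX_TO_EXTRA mapping and the pip_target helper are assumptions made for this example, not part of the library, and the image suffixes listed are a small sample.

```python
from pathlib import Path
from typing import Dict, Iterable, Set

# Assumed mapping from file suffix to extra name (illustrative, not exhaustive).
SUFFIX_TO_EXTRA: Dict[str, str] = {
    ".csv": "csv", ".doc": "doc", ".docx": "docx", ".epub": "epub",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".md": "md", ".odt": "odt", ".org": "org", ".pdf": "pdf",
    ".ppt": "ppt", ".pptx": "pptx", ".rtf": "rtf", ".rst": "rst",
    ".tsv": "tsv", ".xlsx": "xlsx",
}


def pip_target(filenames: Iterable[str]) -> str:
    """Build the narrowest pip requirement string covering the given files."""
    extras: Set[str] = set()
    for name in filenames:
        # Suffix matching is case-insensitive; unknown suffixes are ignored.
        extra = SUFFIX_TO_EXTRA.get(Path(name).suffix.lower())
        if extra is not None:
            extras.add(extra)
    if not extras:
        return "unstructured"
    return f"unstructured[{','.join(sorted(extras))}]"
```

For example, pip_target(["report.PDF", "notes.docx"]) yields unstructured[docx,pdf], which could then be passed to pip install in place of the all-docs target.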

Related Pages

Implementation:Unstructured_IO_Unstructured_Partition
