Environment:Unstructured IO Unstructured All Docs

From Leeroopedia
Knowledge Sources
Domains Document Partitioning
Last Updated 2026-02-12 09:00 GMT

Overview

The All_Docs environment installs every document type dependency at once via the all-docs pip extra, enabling the unstructured library to partition all supported file formats.

Description

The all-docs extra is a convenience meta-extra defined in pyproject.toml that aggregates all individual document type extras into a single installable target. It is equivalent to installing:

unstructured[csv,doc,docx,epub,image,md,odt,org,pdf,ppt,pptx,rtf,rst,tsv,xlsx]

This pulls in a wide range of third-party libraries covering spreadsheet parsing, word processing, presentation files, PDF extraction with OCR, Markdown rendering, and more. The partition/auto.py module uses dependency_exists() checks (lines 370-378) to verify that the package required for a given file type is installed. If it is missing, the module raises an ImportError whose message names the extra to install (e.g., pip install "unstructured[pdf]").

Usage

Use this environment when you need to partition every supported document type without tracking which individual extras to install. It is the recommended setup for development environments, CI pipelines that process mixed document sets, and deployments that must handle the full range of file types.
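As a rough sketch of what this setup enables, the loop below sends a mixed set of files through a single partition call. partition from unstructured.partition.auto is the library's auto-detecting entry point; the helper name partition_many and its injectable partition_fn parameter are assumptions made here for illustration, so the loop can be exercised without the library installed.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Optional


def partition_many(
    paths: Iterable[str],
    partition_fn: Optional[Callable[..., List]] = None,
) -> Dict[str, List]:
    """Partition each file and return a mapping of path -> elements.

    partition_many and partition_fn are hypothetical names for this sketch.
    """
    if partition_fn is None:
        # Imported lazily so the helper can be defined without unstructured.
        from unstructured.partition.auto import partition as partition_fn

    results: Dict[str, List] = {}
    for path in paths:
        # partition() detects the file type and dispatches to the
        # matching format-specific partitioner.
        results[str(Path(path))] = partition_fn(filename=str(path))
    return results
```

With all-docs installed, a call such as partition_many(["report.pdf", "slides.pptx", "data.csv"]) should need no per-format setup, provided the required system packages are present.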

System Requirements

Category | Requirement | Notes
Python | >= 3.11, < 3.14 | Required Python version range
OS (Linux) | x86_64 or aarch64 | Fully supported
OS (macOS) | arm64 (Apple Silicon) | Fully supported
OS (Windows) | AMD64 | Supported
Tesseract OCR | Required for PDF/image extras | System package: tesseract-ocr
Poppler | Required for pdf2image | System package: poppler-utils
libmagic | Required for file type detection | System package: libmagic1
Pandoc | Required for epub, odt, org, rst, rtf extras | System package: pandoc

Dependencies

System Packages

  • libmagic1 -- MIME type detection
  • tesseract-ocr -- OCR engine for PDF and image processing
  • poppler-utils -- PDF rendering for pdf2image
  • pandoc -- document format conversion (epub, odt, org, rst, rtf)
  • libreoffice -- needed only for .doc and .ppt conversion (optional, but recommended)

Python Packages

  • pandas -- CSV and TSV parsing
  • python-docx -- DOCX document parsing
  • pypandoc-binary -- Pandoc wrapper for multiple formats (epub, odt, rtf, rst, org)
  • pdf2image -- PDF page to image conversion
  • pdfminer.six -- PDF text extraction
  • pi-heif -- HEIF image support
  • pikepdf -- PDF manipulation and repair
  • pypdf -- PDF reading
  • unstructured-inference -- layout detection models
  • unstructured-pytesseract -- Tesseract OCR wrapper
  • markdown -- Markdown parsing
  • python-pptx -- PowerPoint parsing
  • msoffcrypto-tool -- encrypted Office document support
  • networkx -- graph-based document structure analysis
  • openpyxl -- XLSX reading
  • xlrd -- XLS reading
  • google-cloud-vision -- Google Cloud Vision OCR (optional alternative)

Credentials

  • GOOGLE_APPLICATION_CREDENTIALS -- path to Google Cloud service account JSON (only if using Google Cloud Vision OCR)
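A small pre-flight check can catch a missing or mistyped credentials path before any OCR call is attempted. The sketch below only inspects the environment variable named above; the function name google_credentials_path is a hypothetical helper, not part of unstructured or the Google client library.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def google_credentials_path(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Path]:
    """Return the credentials file path if the variable is set and the file exists.

    google_credentials_path is a hypothetical helper for this sketch.
    """
    env = os.environ if env is None else env
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        # Variable unset or empty: Google Cloud Vision OCR is not configured.
        return None
    path = Path(value)
    # Only report the path when it points at an actual file.
    return path if path.is_file() else None
```

A None result means either that the variable is unset or that it points at a file that does not exist, which is enough to decide whether to fall back to Tesseract.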

Quick Install

# Install system dependencies (Ubuntu/Debian)
sudo apt-get update && sudo apt-get install -y \
    libmagic1 tesseract-ocr poppler-utils pandoc

# Install unstructured with all document type extras
pip install "unstructured[all-docs]"
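After installation, it can be useful to confirm that the system tools are actually on PATH. The following is a minimal sketch using shutil.which; the executable names (tesseract, pdftoppm, pandoc, soffice) are the ones these packages commonly provide, and are an assumption that may not hold on every distribution. libmagic is a shared library rather than an executable, so it is not covered by this check.

```python
import shutil
from typing import Dict, List, Optional

# Executable name -> system package assumed to provide it.
REQUIRED_BINARIES = {
    "tesseract": "tesseract-ocr",
    "pdftoppm": "poppler-utils",
    "pandoc": "pandoc",
}
# Only needed for legacy .doc and .ppt conversion.
OPTIONAL_BINARIES = {"soffice": "libreoffice"}


def find_binaries(names) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_packages(binaries: Dict[str, str]) -> List[str]:
    """Return the package names whose executable is not found on PATH."""
    return [pkg for exe, pkg in binaries.items() if shutil.which(exe) is None]
```

Calling missing_packages(REQUIRED_BINARIES) returns the packages that still need to be installed; an empty list suggests the required system tools are in place.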

Code Evidence

Dependency check with helpful error message (partition/auto.py:370-378):

if not dependency_exists(dep):
    raise ImportError(
        f'The "{dep}" package is required to process {filetype} files. '
        f'You can install it with: pip install "unstructured[{extra}]"'
    )
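
The excerpt above relies on names defined elsewhere in the module. A self-contained sketch of the same pattern might look like the following. It assumes a dependency counts as present when importlib.util.find_spec can locate it; the helper names module_available and require_extra are hypothetical, and this is not the library's actual dependency_exists() implementation.

```python
import importlib.util


def module_available(module_name: str) -> bool:
    """Return True if the module can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised for dotted names with a missing parent, or a broken spec.
        return False


def require_extra(module_name: str, filetype: str, extra: str) -> None:
    """Raise ImportError with an install hint when module_name is missing."""
    if not module_available(module_name):
        raise ImportError(
            f'The "{module_name}" package is required to process {filetype} files. '
            f'You can install it with: pip install "unstructured[{extra}]"'
        )
```

For example, require_extra("docx", ".docx", "docx") returns silently when python-docx is importable and otherwise raises an ImportError pointing at the docx extra.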

all-docs extra definition (pyproject.toml):

[project.optional-dependencies]
all-docs = [
    "unstructured[csv,doc,docx,epub,image,md,odt,org,pdf,ppt,pptx,rtf,rst,tsv,xlsx]"
]

Common Errors

Error Message | Cause | Solution
ImportError: The "python-docx" package is required to process .docx files | The docx extra is not installed | Install via pip install "unstructured[docx]" or pip install "unstructured[all-docs]"
ImportError: The "pdfminer" package is required to process .pdf files | The pdf extra is not installed | Install via pip install "unstructured[pdf]" or pip install "unstructured[all-docs]"
OSError: pandoc not found | Pandoc system package is not installed | Install via sudo apt-get install pandoc or brew install pandoc
TesseractNotFoundError | Tesseract OCR is not installed on the system | Install via sudo apt-get install tesseract-ocr

Compatibility Notes

  • The all-docs extra is the heaviest installation option; for production deployments processing only specific file types, prefer installing individual extras to reduce image size
  • pypandoc-binary bundles its own Pandoc binary, but certain formats may still require the system Pandoc package
  • Supported platforms: Linux x86_64, Linux aarch64, macOS arm64, Windows AMD64
  • Some extras (e.g., image, pdf) have overlapping dependencies; installing all-docs resolves these automatically
