Workflow:Unstructured IO Unstructured Document Partitioning

Knowledge Sources	Unstructured, Unstructured Docs, Partitioning Guide
Domains	Document_Processing, Data_Engineering, NLP
Last Updated	2026-02-12 09:30 GMT

Overview

End-to-end process for transforming raw unstructured documents (PDF, DOCX, HTML, images, and 20+ other formats) into structured, typed elements using the Unstructured partition pipeline.

Description

This workflow describes the standard procedure for parsing unstructured documents into a list of typed elements (titles, narrative text, tables, images, list items, etc.). The library detects each file's type with libmagic and routes the document to the matching format-specific partitioner. Four processing strategies are available: fast (text extraction only), hi_res (layout-aware, using deep learning models), ocr_only (full OCR), and auto (selects a strategy based on the document). The output is a list of Element objects whose metadata includes coordinates, page numbers, languages, and data source information.

Key capabilities:

Automatic file type detection and routing to 20+ format-specific partitioners
Four processing strategies with different speed/accuracy tradeoffs
Table structure inference for extracting tabular data as HTML
Image extraction from PDFs and documents
Multi-language support with per-element language detection
Rich metadata on every extracted element (coordinates, page numbers, links, emphasis)

Usage

Execute this workflow when you have one or more documents in any supported format (PDF, DOCX, PPTX, XLSX, HTML, EML, MSG, CSV, JSON, XML, Markdown, RST, TXT, EPUB, ODT, RTF, TSV, images, and more) and need to extract their content as structured, typed elements for downstream processing such as search indexing, RAG pipelines, summarization, or data analysis.

Execution Steps

Step 1: Environment_Setup

Install the unstructured library with the appropriate extras for your document types. The base package handles plain text, HTML, XML, JSON, and emails. Additional document types require specific extras (e.g., pdf, docx, pptx, image). System dependencies such as libmagic, poppler-utils, tesseract-ocr, and libreoffice may also be required depending on the document formats being processed.

Key considerations:

Install only the extras you need to minimize dependencies
Use the all-docs extra for full format support
Ensure system dependencies are available (libmagic for detection, poppler for PDFs, tesseract for OCR)
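
A quick preflight check can confirm which of these pieces are in place before any document is partitioned. The sketch below uses only the Python standard library. The executable names (pdftoppm for poppler-utils, tesseract, soffice for LibreOffice) and the module name magic (the python-magic binding for libmagic) are assumptions based on typical Linux installs and may differ on other platforms.

```python
import importlib.util
import shutil

# Executables commonly shipped by the system packages named above.
# These names are assumptions; adjust them for your platform.
SYSTEM_TOOLS = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
}

# Python modules to probe; "magic" is the python-magic binding for libmagic.
PYTHON_MODULES = ["unstructured", "magic"]


def check_environment() -> dict[str, bool]:
    """Report which dependencies are discoverable on this machine."""
    status = {}
    for package, executable in SYSTEM_TOOLS.items():
        # shutil.which returns None when the executable is not on PATH.
        status[package] = shutil.which(executable) is not None
    for module in PYTHON_MODULES:
        # find_spec returns None when a top-level module is not installed.
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, available in check_environment().items():
        print(f"{name:15} {'ok' if available else 'MISSING'}")
```

A missing entry is only a problem if the formats you plan to process need it: tesseract matters for OCR, poppler for PDFs, and so on.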

Step 2: File_Type_Detection

The partition function automatically detects the file type using libmagic and content-type headers. Based on the detected type, the document is routed to the appropriate format-specific partitioner (e.g., partition_pdf, partition_docx, partition_html). You can also call format-specific partition functions directly if you know the document type in advance.

Key considerations:

File detection uses libmagic and falls back to the file extension when libmagic is not available
Content-type can be explicitly provided to override detection
Documents can be loaded from file path, file-like object, or URL
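
The three loading modes and the content-type override can be sketched as follows. The wrapper names (partition_from_path, partition_from_stream, partition_from_url) are hypothetical, and guess_content_type uses Python's standard mimetypes module purely as an illustration of deriving a hint from a file name; it is not how the library detects types internally. The partition import is done inside each wrapper so the helper can be used even where unstructured is not installed.

```python
import mimetypes
from typing import BinaryIO, Optional


def guess_content_type(name: str) -> Optional[str]:
    """Guess a MIME type from the file extension alone (stdlib only)."""
    mime_type, _encoding = mimetypes.guess_type(name)
    return mime_type


def partition_from_path(path: str, content_type: Optional[str] = None):
    """Partition a file on disk; a non-None content_type overrides detection."""
    from unstructured.partition.auto import partition

    return partition(filename=path, content_type=content_type)


def partition_from_stream(stream: BinaryIO, content_type: Optional[str] = None):
    """Partition an open binary file-like object."""
    from unstructured.partition.auto import partition

    return partition(file=stream, content_type=content_type)


def partition_from_url(url: str):
    """Fetch a document from a URL and partition it."""
    from unstructured.partition.auto import partition

    return partition(url=url)
```

An explicit content type is most useful for streams, which carry no file name. For example, an upload handler that still knows the original name could pass guess_content_type(original_name) along with the stream.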

Step 3: Strategy_Selection

Choose a partitioning strategy based on your accuracy and performance requirements. The auto strategy selects the best approach per document type. The fast strategy uses direct text extraction without deep learning models. The hi_res strategy applies layout detection models for precise element classification and coordinate extraction. The ocr_only strategy performs full OCR on every page.

Key considerations:

Use fast for high throughput when layout precision is not critical
Use hi_res when you need accurate element classification, table extraction, or coordinate data
Use ocr_only for scanned documents or images without embedded text
The auto strategy decides per document (e.g., fast for PDFs with extractable text, hi_res when table structure inference is requested); the strategy setting mainly affects PDFs and images
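
The guidance above can be written down as a small decision helper. This is a sketch, not part of the library: choose_strategy and its parameters are hypothetical, and the precedence (layout needs first, then scanned input, then speed) is one reasonable reading of the considerations rather than a documented rule.

```python
def choose_strategy(
    scanned: bool = False,
    needs_tables: bool = False,
    needs_coordinates: bool = False,
    prefer_speed: bool = False,
) -> str:
    """Map requirements to a strategy name, following the guidance above."""
    # Table extraction and coordinates depend on layout detection.
    if needs_tables or needs_coordinates:
        return "hi_res"
    # Scanned pages and images have no embedded text to extract.
    if scanned:
        return "ocr_only"
    # Direct text extraction is the cheapest option.
    if prefer_speed:
        return "fast"
    # Otherwise let the library decide per document.
    return "auto"
```

The returned string can be passed as the strategy argument of the partition function.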

Step 4: Document_Partitioning

Execute the partition function with the selected strategy and options. The partitioner processes the document and returns a list of Element objects. Each element has a type (Title, NarrativeText, Table, ListItem, Image, etc.), text content, and metadata. For hi_res mode, elements also include bounding box coordinates and layout detection confidence scores.

Key considerations:

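Enable pdf_infer_table_structure to extract tables as HTML (deprecated in recent releases; partition_pdf accepts infer_table_structure)
Use extract_images_in_pdf to save embedded images (deprecated in recent releases in favor of extract_image_block_types)
Set languages for non-English documents (ocr_languages is the older, deprecated form)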
Use starting_page_number to offset page numbering for partial documents
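
A minimal sketch of a partition_pdf call with these options is shown below. The helper names (build_pdf_options, partition_pdf_file) and their parameters are hypothetical. The keyword names passed through are strategy, languages, starting_page_number, and infer_table_structure, the last being the table option as partition_pdf accepts it in recent releases; check the signature of the version you have installed.

```python
from typing import Any, Optional


def build_pdf_options(
    strategy: str = "hi_res",
    languages: Optional[list[str]] = None,
    tables_as_html: bool = False,
    first_page_number: int = 1,
) -> dict[str, Any]:
    """Assemble keyword arguments for partition_pdf."""
    options: dict[str, Any] = {
        "strategy": strategy,
        # Offsets page numbers when the file is an excerpt of a larger document.
        "starting_page_number": first_page_number,
    }
    if languages:
        options["languages"] = languages  # e.g. ["deu"] for German
    if tables_as_html:
        # Adds text_as_html metadata to Table elements.
        options["infer_table_structure"] = True
    return options


def partition_pdf_file(path: str, **options: Any):
    """Partition one PDF with the given options."""
    # Imported here because partition_pdf requires the pdf extra.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=path, **options)
```

For a German-language excerpt that begins on page 41 of its source document, the call would look like partition_pdf_file("excerpt.pdf", **build_pdf_options(languages=["deu"], tables_as_html=True, first_page_number=41)), where excerpt.pdf is a placeholder file name.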

Step 5: Element_Processing

Process the returned elements as needed for your use case. Elements can be serialized to JSON or dictionaries, filtered by type, cleaned with text processing functions, or passed to downstream chunking and embedding pipelines. The elements preserve hierarchical relationships through parent_id metadata and maintain source document provenance.

Key considerations:

Elements support JSON serialization via elements_to_json and elements_from_json
Text cleaning functions can be applied to elements via the apply method
Table elements include text_as_html metadata for structured table content
Image elements can include base64-encoded image data
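
The post-processing described in this step can be sketched in two parts. clean_and_serialize works on live Element objects and uses the library's apply method, clean_extra_whitespace cleaner, and elements_to_json. The other two helpers work on the serialized form and assume each dictionary carries type, element_id, and metadata keys, as element dictionaries do. All three function names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Optional


def clean_and_serialize(elements) -> str:
    """Normalize whitespace in place, then return the elements as JSON text."""
    from unstructured.cleaners.core import clean_extra_whitespace
    from unstructured.documents.elements import Text
    from unstructured.staging.base import elements_to_json

    for element in elements:
        # apply is defined on text-bearing elements, so guard the call.
        if isinstance(element, Text):
            element.apply(clean_extra_whitespace)
    return elements_to_json(elements)


def tables_as_html(element_dicts: list[dict[str, Any]]) -> list[str]:
    """Collect the HTML of every Table that carries text_as_html metadata."""
    return [
        d["metadata"]["text_as_html"]
        for d in element_dicts
        if d.get("type") == "Table" and "text_as_html" in d.get("metadata", {})
    ]


def group_by_parent(
    element_dicts: list[dict[str, Any]],
) -> dict[Optional[str], list[str]]:
    """Group element ids by parent_id; top-level elements fall under None."""
    groups: dict[Optional[str], list[str]] = defaultdict(list)
    for d in element_dicts:
        parent = d.get("metadata", {}).get("parent_id")
        groups[parent].append(d["element_id"])
    return dict(groups)
```

Passing the string returned by clean_and_serialize through json.loads yields the list of dictionaries that the two dictionary helpers expect.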
