Workflow:Unstructured IO Unstructured Document Partitioning

From Leeroopedia
Knowledge Sources
Domains: Document_Processing, Data_Engineering, NLP
Last Updated: 2026-02-12 09:30 GMT

Overview

End-to-end process for transforming raw unstructured documents (PDF, DOCX, HTML, images, and 20+ other formats) into structured, typed elements using the Unstructured partition pipeline.

Description

This workflow describes the standard procedure for parsing unstructured documents into a list of typed elements (titles, narrative text, tables, images, list items, etc.). The library detects each file's type with libmagic and routes the document to the matching format-specific partitioner. Four processing strategies are available: fast (text extraction only), hi_res (layout-aware, using deep learning models), ocr_only (full OCR), and auto (picks one of the other three for each document). The output is a list of Element objects with rich metadata, including coordinates, page numbers, languages, and data source information.

Key capabilities:

  • Automatic file type detection and routing to 20+ format-specific partitioners
  • Four processing strategies with different speed/accuracy tradeoffs
  • Table structure inference for extracting tabular data as HTML
  • Image extraction from PDFs and documents
  • Multi-language support with per-element language detection
  • Rich metadata on every extracted element (coordinates, page numbers, links, emphasis)

Usage

Execute this workflow when you have one or more documents in any supported format (PDF, DOCX, PPTX, XLSX, HTML, EML, MSG, CSV, JSON, XML, Markdown, RST, TXT, EPUB, ODT, RTF, TSV, images, and more) and need to extract their content as structured, typed elements for downstream processing such as search indexing, RAG pipelines, summarization, or data analysis.

Execution Steps

Step 1: Environment_Setup

Install the unstructured library with the appropriate extras for your document types. The base package handles plain text, HTML, XML, JSON, and emails. Additional document types require specific extras (e.g., pdf, docx, pptx, image). System dependencies such as libmagic, poppler-utils, tesseract-ocr, and libreoffice may also be required depending on the document formats being processed.

Key considerations:

  • Install only the extras you need to minimize dependencies
  • Use the all-docs extra for full format support
  • Ensure system dependencies are available (libmagic for detection, poppler for PDFs, tesseract for OCR)
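
The dependency check described above can be sketched with the standard library alone. This is a minimal sketch, not part of the unstructured package: the helper name check_environment is hypothetical, and the executable names (pdftoppm for poppler-utils, soffice for LibreOffice) are assumptions about a typical install.

```python
# Typical installs, matching the extras named in the text:
#   pip install "unstructured[pdf,docx]"   # only the extras you need
#   pip install "unstructured[all-docs]"   # full format support
import importlib.util
import shutil

# Assumed executable names for the system dependencies listed above.
SYSTEM_TOOLS = {
    "poppler (PDF rendering)": "pdftoppm",
    "tesseract (OCR)": "tesseract",
    "libreoffice (legacy Office formats)": "soffice",
}


def check_environment() -> dict:
    """Report which parts of the partitioning stack are importable or on PATH."""
    report = {
        # find_spec returns None when a top-level package is not installed.
        "unstructured (Python package)": importlib.util.find_spec("unstructured") is not None,
        "python-magic (libmagic bindings)": importlib.util.find_spec("magic") is not None,
    }
    for label, binary in SYSTEM_TOOLS.items():
        report[label] = shutil.which(binary) is not None
    return report


for name, available in check_environment().items():
    print(f"{'ok     ' if available else 'MISSING'}  {name}")
```

A missing entry is only a problem if the corresponding format is actually processed; plain text and HTML, for example, need neither poppler nor tesseract.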

Step 2: File_Type_Detection

The partition function automatically detects the file type using libmagic and content-type headers. Based on the detected type, the document is routed to the appropriate format-specific partitioner (e.g., partition_pdf, partition_docx, partition_html). You can also call format-specific partition functions directly if you know the document type in advance.

Key considerations:

  • File detection uses libmagic and falls back to file extension
  • Content-type can be explicitly provided to override detection
  • Documents can be loaded from file path, file-like object, or URL
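
The precedence described above (explicit content type, then content sniffing, then file extension) can be illustrated with a small stand-in. This is a sketch only: the two lookup tables and the resolve_partitioner helper are hypothetical and cover three formats, and the sniffed_mime argument stands in for whatever libmagic would report. The library's own routing is internal and broader.

```python
from pathlib import Path
from typing import Optional

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

# Hypothetical subset of the MIME type -> partitioner routing.
PARTITIONER_BY_MIME = {
    "application/pdf": "partition_pdf",
    DOCX_MIME: "partition_docx",
    "text/html": "partition_html",
}

# Hypothetical extension fallback, used only when nothing else is known.
MIME_BY_EXTENSION = {
    ".pdf": "application/pdf",
    ".docx": DOCX_MIME,
    ".html": "text/html",
    ".htm": "text/html",
}


def resolve_partitioner(
    filename: str,
    content_type: Optional[str] = None,
    sniffed_mime: Optional[str] = None,
) -> Optional[str]:
    """Pick a partitioner name: explicit type > sniffed type > extension."""
    mime = (
        content_type
        or sniffed_mime
        or MIME_BY_EXTENSION.get(Path(filename).suffix.lower())
    )
    return PARTITIONER_BY_MIME.get(mime)


print(resolve_partitioner("report.pdf"))
print(resolve_partitioner("upload.bin", content_type="text/html"))

# With the library installed, the three input modes look like this:
#   from unstructured.partition.auto import partition
#   elements = partition(filename="report.pdf")
#   with open("report.pdf", "rb") as f:
#       elements = partition(file=f, content_type="application/pdf")
#   elements = partition(url="https://example.com/page.html")
```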

Step 3: Strategy_Selection

Choose a partitioning strategy based on your accuracy and performance requirements. The auto strategy selects the best approach per document type. The fast strategy uses direct text extraction without deep learning models. The hi_res strategy applies layout detection models for precise element classification and coordinate extraction. The ocr_only strategy performs full OCR on every page.

Key considerations:

  • Use fast for high throughput when layout precision is not critical
  • Use hi_res when you need accurate element classification, table extraction, or coordinate data
  • Use ocr_only for scanned documents or images without embedded text
  • The auto strategy makes per-format decisions (e.g., fast for DOCX, hi_res for PDFs with tables)
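
The guidance above can be condensed into a small decision helper. This is a sketch under stated assumptions: choose_strategy and its flags are hypothetical names, and the ordering (layout needs first, then scanned input, then throughput) is one reasonable reading of the considerations, not the logic the library's auto strategy uses.

```python
def choose_strategy(
    *,
    need_layout: bool = False,          # tables, coordinates, precise element types
    scanned: bool = False,              # no embedded text layer
    throughput_critical: bool = False,  # speed matters more than layout precision
) -> str:
    """Map requirements to one of the four strategy names from the text."""
    if need_layout:
        return "hi_res"
    if scanned:
        return "ocr_only"
    if throughput_critical:
        return "fast"
    # No specific requirement: let the library decide per document.
    return "auto"


print(choose_strategy(scanned=True))
print(choose_strategy(need_layout=True, scanned=True))

# The result is passed straight through, e.g.:
#   partition(filename="scan.pdf", strategy=choose_strategy(scanned=True))
```

Layout needs are checked first because a document that is both scanned and table-heavy still needs the layout model to recover the table structure.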

Step 4: Document_Partitioning

Execute the partition function with the selected strategy and options. The partitioner processes the document and returns a list of Element objects. Each element has a type (Title, NarrativeText, Table, ListItem, Image, etc.), text content, and metadata. For hi_res mode, elements also include bounding box coordinates and layout detection confidence scores.

Key considerations:

  • Enable pdf_infer_table_structure (infer_table_structure when calling partition_pdf directly) to extract tables as HTML
  • Use extract_images_in_pdf to save embedded images
  • Set languages (or the older ocr_languages parameter) for non-English documents
  • Use starting_page_number to offset page numbering for partial documents
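
A call combining these options might be assembled as follows. This is a minimal sketch: build_partition_kwargs and partition_document are hypothetical helpers, the keyword names are the ones used in this workflow, and some of them may have been renamed in the installed release, so check them against the version in use. The library import is deferred so that the option-building part runs without it.

```python
from typing import Any, Dict, Optional, Sequence


def build_partition_kwargs(
    filename: str,
    *,
    strategy: str = "hi_res",
    infer_tables: bool = False,
    extract_images: bool = False,
    languages: Optional[Sequence[str]] = None,
    starting_page_number: int = 1,
) -> Dict[str, Any]:
    """Translate readable options into the keyword names used in this workflow."""
    kwargs: Dict[str, Any] = {
        "filename": filename,
        "strategy": strategy,
        "starting_page_number": starting_page_number,
    }
    if infer_tables:
        kwargs["pdf_infer_table_structure"] = True
    if extract_images:
        kwargs["extract_images_in_pdf"] = True
    if languages:
        # Tesseract-style codes such as "eng" or "deu".
        kwargs["languages"] = list(languages)
    return kwargs


def partition_document(filename: str, **options: Any):
    """Run the partitioner; requires the unstructured package at call time."""
    from unstructured.partition.auto import partition  # deferred import

    return partition(**build_partition_kwargs(filename, **options))


# Pages 5 onward of a German report, with table extraction enabled.
print(build_partition_kwargs(
    "report_part2.pdf", infer_tables=True, languages=["deu"], starting_page_number=5
))
```

Each returned element exposes its type through its class name (or the category attribute), its content through text, and its metadata through metadata.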

Step 5: Element_Processing

Process the returned elements as needed for your use case. Elements can be serialized to JSON or dictionaries, filtered by type, cleaned with text processing functions, or passed to downstream chunking and embedding pipelines. The elements preserve hierarchical relationships through parent_id metadata and maintain source document provenance.

Key considerations:

  • Elements support JSON serialization via elements_to_json and elements_from_json
  • Text cleaning functions can be applied to elements via the apply method
  • Table elements include text_as_html metadata for structured table content
  • Image elements can include base64-encoded image data
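
The processing steps above can be sketched on the dictionary form of the elements (type, element_id, text, metadata), which is what serialization produces. The sample records and the three helper functions below are made up for illustration; only the key names follow the element schema described in this workflow.

```python
import json

# Made-up records in the shape produced by serializing elements to dicts.
SAMPLE = [
    {"type": "Title", "element_id": "t1", "text": "Quarterly Report",
     "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "element_id": "n1", "text": "Revenue grew.",
     "metadata": {"page_number": 1, "parent_id": "t1"}},
    {"type": "Table", "element_id": "tb1", "text": "Q1 42",
     "metadata": {"page_number": 2, "parent_id": "t1",
                  "text_as_html": "<table><tr><td>Q1</td><td>42</td></tr></table>"}},
]


def filter_by_type(elements, *types):
    """Keep only elements whose type is one of the given names."""
    return [e for e in elements if e["type"] in types]


def tables_as_html(elements):
    """Collect the HTML rendering of every table that carries one."""
    return [
        e["metadata"]["text_as_html"]
        for e in filter_by_type(elements, "Table")
        if "text_as_html" in e.get("metadata", {})
    ]


def children_of(elements, parent_id):
    """Follow the parent_id metadata to find an element's direct children."""
    return [e for e in elements if e.get("metadata", {}).get("parent_id") == parent_id]


print([e["text"] for e in filter_by_type(SAMPLE, "Title")])
print(tables_as_html(SAMPLE))
print(json.dumps(children_of(SAMPLE, "t1")[0], indent=2))

# With the library installed, the same data comes from real elements:
#   from unstructured.staging.base import elements_to_json, elements_from_json
#   from unstructured.cleaners.core import clean_extra_whitespace
#   for el in elements:
#       el.apply(clean_extra_whitespace)
#   elements_to_json(elements, filename="out.json")
#   restored = elements_from_json("out.json")
```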
