Implementation:Unstructured IO Unstructured Partition

From Leeroopedia
Knowledge Sources
Domains: Document_Processing, NLP
Last Updated: 2026-02-12 00:00 GMT

Overview

A concrete tool, provided by the Unstructured library, for partitioning documents into structured elements.

Description

The partition function is the primary entry point for document processing. It accepts a document (via file path, file object, or URL), detects its format, selects the appropriate format-specific partitioner, and returns a list of typed Element objects. It supports over 15 document formats and four processing strategies (auto, fast, hi_res, ocr_only).
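
The detect-then-route flow described above can be illustrated with a minimal sketch. Everything here is hypothetical and uses only the standard library: the stub partitioners, the `ROUTES` table, and `route_document` are stand-ins, and guessing the type from the file name is a simplification, not the library's actual detection logic.

```python
import mimetypes
from typing import Callable, Optional


# Hypothetical stand-ins for format-specific partitioners. The real ones
# (for example partition_pdf) return Element objects, not strings.
def _partition_pdf_stub(path: str) -> list[str]:
    return [f"pdf:{path}"]


def _partition_html_stub(path: str) -> list[str]:
    return [f"html:{path}"]


# Illustrative routing table keyed by MIME type (not the library's own table).
ROUTES: dict[str, Callable[[str], list[str]]] = {
    "application/pdf": _partition_pdf_stub,
    "text/html": _partition_html_stub,
}


def route_document(path: str, content_type: Optional[str] = None) -> list[str]:
    """Pick a partitioner from an explicit MIME type, else guess from the file name."""
    # An explicit content_type bypasses detection, mirroring the content_type parameter.
    mime = content_type or mimetypes.guess_type(path)[0]
    if mime not in ROUTES:
        raise ValueError(f"unsupported or undetected format: {mime!r}")
    return ROUTES[mime](path)
```

Calling `route_document("report.pdf")` dispatches to the PDF stub, while `route_document("blob", content_type="text/html")` skips the guess and goes straight to the HTML stub.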

Usage

Import this function when you need to convert any document into structured elements. It is the recommended entry point for most use cases, as it handles format detection and routing automatically. Use format-specific partitioners (partition_pdf, partition_docx) only when you need format-specific parameters not exposed by the generic function.

Code Reference

Source Location

  • Repository: unstructured
  • File: unstructured/partition/auto.py
  • Lines: 30-296

Signature

def partition(
    filename: Optional[str] = None,
    *,
    file: Optional[IO[bytes]] = None,
    encoding: Optional[str] = None,
    content_type: Optional[str] = None,
    url: Optional[str] = None,
    headers: dict[str, str] = {},
    ssl_verify: bool = True,
    request_timeout: Optional[int] = None,
    strategy: str = PartitionStrategy.AUTO,
    skip_infer_table_types: list[str] = ["pdf", "jpg", "png", "heic"],
    ocr_languages: Optional[str] = None,
    languages: Optional[list[str]] = None,
    detect_language_per_element: bool = False,
    pdf_infer_table_structure: bool = False,
    extract_images_in_pdf: bool = False,
    extract_image_block_types: Optional[list[str]] = None,
    extract_image_block_output_dir: Optional[str] = None,
    extract_image_block_to_payload: bool = False,
    data_source_metadata: Optional[DataSourceMetadata] = None,
    metadata_filename: Optional[str] = None,
    hi_res_model_name: Optional[str] = None,
    model_name: Optional[str] = None,
    starting_page_number: int = 1,
    **kwargs: Any,
) -> list[Element]:
    """Partition a document into structured elements.

    Args:
        filename: Path to the document file.
        file: File-like object with document content.
        encoding: Character encoding of the document.
        content_type: MIME type (bypasses auto-detection).
        url: URL to fetch document from.
        headers: HTTP headers for URL fetching.
        ssl_verify: Verify SSL certificates for URL fetching.
        request_timeout: Timeout for URL fetching in seconds.
        strategy: Partition strategy (auto, fast, hi_res, ocr_only).
        skip_infer_table_types: File types to skip table structure inference.
        ocr_languages: OCR language codes (deprecated, use languages).
        languages: Document languages for OCR (e.g., ["eng", "deu"]).
        detect_language_per_element: Detect language for each element individually.
        pdf_infer_table_structure: Infer table structure for PDFs.
        extract_images_in_pdf: Extract images from PDFs.
        extract_image_block_types: Element types to extract images from.
        extract_image_block_output_dir: Directory for extracted images.
        extract_image_block_to_payload: Include images as base64 in metadata.
        data_source_metadata: Source metadata for connector pipelines.
        metadata_filename: Override filename in element metadata.
        hi_res_model_name: Layout detection model name.
        model_name: Alias for hi_res_model_name.
        starting_page_number: Page number offset.

    Returns:
        Ordered list of typed Element objects.
    """

Import

from unstructured.partition.auto import partition

I/O Contract

Inputs

Name | Type | Required | Description
filename | Optional[str] | No | Path to document file on disk
file | Optional[IO[bytes]] | No | File-like object with document content
url | Optional[str] | No | URL to fetch document from
strategy | str | No | Partition strategy: "auto" (default), "fast", "hi_res", "ocr_only"
languages | Optional[list[str]] | No | Document languages for OCR (e.g., ["eng"])
hi_res_model_name | Optional[str] | No | Layout detection model name for hi_res strategy
content_type | Optional[str] | No | MIME type to bypass format auto-detection
starting_page_number | int | No | Page number assigned to the first page (default 1)
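
Because filename, file, and url are each optional, a caller may want to check the arguments before invoking partition. The sketch below is a hypothetical helper, not part of the library. It assumes the caller wants exactly one document source and one of the four strategy names listed above; `build_partition_kwargs` and `VALID_STRATEGIES` are illustrative names.

```python
from typing import Any, Optional

# The four strategy names documented for the strategy parameter.
VALID_STRATEGIES = {"auto", "fast", "hi_res", "ocr_only"}


def build_partition_kwargs(
    filename: Optional[str] = None,
    file: Optional[Any] = None,
    url: Optional[str] = None,
    strategy: str = "auto",
) -> dict[str, Any]:
    """Validate the inputs and return keyword arguments for partition() (hypothetical helper)."""
    sources = {"filename": filename, "file": file, "url": url}
    provided = {name: value for name, value in sources.items() if value is not None}
    # Assumption: the caller wants exactly one document source.
    if len(provided) != 1:
        raise ValueError("provide exactly one of filename, file, or url")
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return {**provided, "strategy": strategy}
```

The result can then be unpacked into the real call, for example `partition(**build_partition_kwargs(filename="report.pdf"))`.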

Outputs

Name | Type | Description
return | list[Element] | Ordered list of typed elements: NarrativeText, Title, Table, Image, ListItem, Header, Footer, FigureCaption, Address, Formula, PageBreak, etc.
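
Since the return value is a flat list of typed objects, a quick way to inspect a result is to count elements by class name. The helper below is a hypothetical sketch (`summarize_element_types` is not a library function). It relies only on `type(...).__name__`, so it works on any list of objects, including the output of partition.

```python
from collections import Counter
from typing import Iterable


def summarize_element_types(elements: Iterable[object]) -> Counter:
    """Count elements by class name, e.g. how many Title or NarrativeText items."""
    return Counter(type(element).__name__ for element in elements)
```

For example, `summarize_element_types(partition(filename="report.pdf")).most_common()` would list the element types found in the document, most frequent first.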

Usage Examples

Basic Document Partitioning

from unstructured.partition.auto import partition

# Partition a PDF with automatic strategy
elements = partition(filename="report.pdf")

# Print element types and text
for element in elements:
    print(f"{type(element).__name__}: {str(element)[:80]}")

High-Resolution PDF with Table Extraction

from unstructured.partition.auto import partition

elements = partition(
    filename="financial_report.pdf",
    strategy="hi_res",
    pdf_infer_table_structure=True,
    languages=["eng"],
    hi_res_model_name="yolox",
)

# Filter for tables
tables = [el for el in elements if type(el).__name__ == "Table"]
for table in tables:
    print(table.metadata.text_as_html)
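
The text_as_html metadata field is optional, so it can be None when no table structure was inferred. The following hypothetical helper (`table_html_or_text` is an illustrative name) falls back to the element's plain text in that case. It matches tables by class name, as the example above does, so it needs no imports from the library.

```python
from typing import Iterable


def table_html_or_text(elements: Iterable[object]) -> list[str]:
    """Return the HTML of each Table element, or its plain text when no HTML is set."""
    results = []
    for element in elements:
        # Match by class name, consistent with the filtering example above.
        if type(element).__name__ != "Table":
            continue
        metadata = getattr(element, "metadata", None)
        html = getattr(metadata, "text_as_html", None)
        # Fall back to the string form of the element when HTML is missing or empty.
        results.append(html if html else str(element))
    return results
```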

Partition from URL

from unstructured.partition.auto import partition

elements = partition(
    url="https://example.com/document.pdf",
    strategy="fast",
)
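
URL fetching also accepts the headers, ssl_verify, and request_timeout parameters from the signature. The sketch below assembles those arguments in one place. `url_partition_kwargs` is a hypothetical helper, and the bearer-token header, the http/https check, and the 30-second default are assumptions made for illustration rather than library behavior.

```python
from typing import Any, Optional
from urllib.parse import urlparse


def url_partition_kwargs(
    url: str,
    bearer_token: Optional[str] = None,
    timeout_seconds: int = 30,
) -> dict[str, Any]:
    """Build keyword arguments for partition(url=...) (hypothetical helper)."""
    parsed = urlparse(url)
    # Assumption: only absolute http(s) URLs are meaningful here.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {url!r}")
    headers: dict[str, str] = {}
    if bearer_token:
        # Hypothetical auth scheme; use whatever headers the server requires.
        headers["Authorization"] = f"Bearer {bearer_token}"
    return {
        "url": url,
        "headers": headers,
        "ssl_verify": True,
        "request_timeout": timeout_seconds,
        "strategy": "fast",
    }
```

The dictionary can be unpacked into the real call as `partition(**url_partition_kwargs("https://example.com/document.pdf"))`.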
