Implementation:Unstructured IO Unstructured Partition

Knowledge Sources	Unstructured Unstructured Docs
Domains	Document_Processing, NLP
Last Updated	2026-02-12 00:00 GMT

Overview

Concrete tool, provided by the Unstructured library, for partitioning documents into structured elements.

Description

The partition function is the primary entry point for document processing. It accepts a document (via file path, file object, or URL), detects its format, selects the appropriate format-specific partitioner, and returns a list of typed Element objects. It supports over 15 document formats and four processing strategies (auto, fast, hi_res, ocr_only).
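
The detect-then-dispatch flow described above can be pictured with a small stand-alone sketch. It is not the library's code: `detect_format`, `PARTITIONERS`, `partition_sketch`, and the stub partitioners are hypothetical names, and detection here relies only on the standard-library `mimetypes` module, which is an assumption made to keep the example runnable without Unstructured installed.

```python
import mimetypes
from typing import Callable, Optional

# Hypothetical stand-ins for format-specific partitioners.
# Each returns a list of (element_type, text) pairs instead of real Element objects.
def _partition_pdf_stub(path: str) -> list[tuple[str, str]]:
    return [("NarrativeText", f"pdf content of {path}")]


def _partition_html_stub(path: str) -> list[tuple[str, str]]:
    return [("Title", f"html content of {path}")]


# Routing table: detected MIME type -> partitioner (illustrative only).
PARTITIONERS: dict[str, Callable[[str], list[tuple[str, str]]]] = {
    "application/pdf": _partition_pdf_stub,
    "text/html": _partition_html_stub,
}


def detect_format(filename: str, content_type: Optional[str] = None) -> str:
    """Return a MIME type; an explicit content_type bypasses detection."""
    if content_type is not None:
        return content_type
    guessed, _encoding = mimetypes.guess_type(filename)
    if guessed is None:
        raise ValueError(f"cannot detect format of {filename!r}")
    return guessed


def partition_sketch(
    filename: str, content_type: Optional[str] = None
) -> list[tuple[str, str]]:
    """Detect the format, pick the matching partitioner, return its elements."""
    mime = detect_format(filename, content_type)
    try:
        handler = PARTITIONERS[mime]
    except KeyError:
        raise ValueError(f"unsupported format: {mime}") from None
    return handler(filename)
```

The `content_type` argument plays the same role here as in the real signature: when it is supplied, no detection is attempted.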

Usage

Import this function when you need to convert any document into structured elements. It is the recommended entry point for most use cases, as it handles format detection and routing automatically. Use format-specific partitioners (partition_pdf, partition_docx) only when you need format-specific parameters not exposed by the generic function.

Code Reference

Source Location

Repository: unstructured
File: unstructured/partition/auto.py
Lines: 30-296

Signature

def partition(
    filename: Optional[str] = None,
    *,
    file: Optional[IO[bytes]] = None,
    encoding: Optional[str] = None,
    content_type: Optional[str] = None,
    url: Optional[str] = None,
    headers: dict[str, str] = {},
    ssl_verify: bool = True,
    request_timeout: Optional[int] = None,
    strategy: str = PartitionStrategy.AUTO,
    skip_infer_table_types: list[str] = ["pdf", "jpg", "png", "heic"],
    ocr_languages: Optional[str] = None,
    languages: Optional[list[str]] = None,
    detect_language_per_element: bool = False,
    pdf_infer_table_structure: bool = False,
    extract_images_in_pdf: bool = False,
    extract_image_block_types: Optional[list[str]] = None,
    extract_image_block_output_dir: Optional[str] = None,
    extract_image_block_to_payload: bool = False,
    data_source_metadata: Optional[DataSourceMetadata] = None,
    metadata_filename: Optional[str] = None,
    hi_res_model_name: Optional[str] = None,
    model_name: Optional[str] = None,
    starting_page_number: int = 1,
    **kwargs: Any,
) -> list[Element]:
    """Partition a document into structured elements.

    Args:
        filename: Path to the document file.
        file: File-like object with document content.
        encoding: Character encoding of the document.
        content_type: MIME type (bypasses auto-detection).
        url: URL to fetch document from.
        headers: HTTP headers for URL fetching.
        ssl_verify: Verify SSL certificates for URL fetching.
        request_timeout: Timeout for URL fetching in seconds.
        strategy: Partition strategy (auto, fast, hi_res, ocr_only).
        skip_infer_table_types: File types to skip table structure inference.
        ocr_languages: OCR language codes (deprecated, use languages).
        languages: Document languages for OCR (e.g., ["eng", "deu"]).
        detect_language_per_element: Detect language for each element individually.
        pdf_infer_table_structure: Infer table structure for PDFs.
        extract_images_in_pdf: Extract images from PDFs.
        extract_image_block_types: Element types to extract images from.
        extract_image_block_output_dir: Directory for extracted images.
        extract_image_block_to_payload: Include images as base64 in metadata.
        data_source_metadata: Source metadata for connector pipelines.
        metadata_filename: Override filename in element metadata.
        hi_res_model_name: Layout detection model name.
        model_name: Alias for hi_res_model_name.
        starting_page_number: Page number offset.
    Returns:
        Ordered list of typed Element objects.
    """

Import

from unstructured.partition.auto import partition

I/O Contract

Inputs

Name	Type	Required	Description
filename	Optional[str]	No	Path to the document file on disk
file	Optional[IO[bytes]]	No	File-like object with document content
url	Optional[str]	No	URL to fetch the document from
strategy	str	No	Partition strategy: "auto" (default), "fast", "hi_res", "ocr_only"
languages	Optional[list[str]]	No	Document languages for OCR (e.g., ["eng"])
hi_res_model_name	Optional[str]	No	Layout detection model name for the hi_res strategy
content_type	Optional[str]	No	MIME type to bypass format auto-detection
starting_page_number	int	No	Page number offset (default 1)
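
The table lists filename, file, and url as individually optional, yet a call still needs a document source. The following is a minimal sketch of a pre-flight check, assuming the three sources are meant to be mutually exclusive; that assumption and the helper name `pick_source` are illustrative and not taken from this page.

```python
from typing import IO, Optional


def pick_source(
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
    url: Optional[str] = None,
) -> dict[str, object]:
    """Return the single supplied source as a kwargs dict, or raise ValueError.

    Hypothetical helper: assumes exactly one of the three sources should be set.
    """
    candidates = {"filename": filename, "file": file, "url": url}
    given = {name: value for name, value in candidates.items() if value is not None}
    if len(given) != 1:
        supplied = sorted(given) or "none"
        raise ValueError(
            f"expected exactly one of filename, file, url; got {supplied}"
        )
    return given


# Intended use (requires the library, so shown as a comment only):
#   elements = partition(**pick_source(url="https://example.com/document.pdf"))
```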

Outputs

Name	Type	Description
return	list[Element]	Ordered list of typed elements: NarrativeText, Title, Table, Image, ListItem, Header, Footer, FigureCaption, Address, Formula, PageBreak, etc.
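
Because the return value is a flat, ordered list, most post-processing is ordinary list handling. The sketch below tallies elements by class name. To stay runnable without the library, it uses locally defined stand-in classes named `Title` and `NarrativeText`; these and the helper `count_by_type` are hypothetical, and with real output you would pass the list returned by partition instead of `sample`.

```python
from collections import Counter


# Stand-in element classes (illustrative only, not the library's classes).
class Title:
    def __init__(self, text: str) -> None:
        self.text = text


class NarrativeText:
    def __init__(self, text: str) -> None:
        self.text = text


def count_by_type(elements: list) -> Counter:
    """Count elements by their class name, e.g. {'NarrativeText': 2, 'Title': 1}."""
    return Counter(type(el).__name__ for el in elements)


sample = [
    Title("Quarterly Report"),
    NarrativeText("Revenue grew in the third quarter."),
    NarrativeText("Costs were flat."),
]
counts = count_by_type(sample)
```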

Usage Examples

Basic Document Partitioning

from unstructured.partition.auto import partition

# Partition a PDF with automatic strategy
elements = partition(filename="report.pdf")

# Print element types and text
for element in elements:
    print(f"{type(element).__name__}: {str(element)[:80]}")

High-Resolution PDF with Table Extraction

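from unstructured.documents.elements import Table
from unstructured.partition.auto import partition

# Use the hi_res strategy and request table structure inference
elements = partition(
    filename="financial_report.pdf",
    strategy="hi_res",
    pdf_infer_table_structure=True,
    languages=["eng"],
    hi_res_model_name="yolox",
)

# Filter for tables; isinstance also matches subclasses of Table
tables = [el for el in elements if isinstance(el, Table)]
for table in tables:
    # text_as_html holds the table rendered as HTML
    print(table.metadata.text_as_html)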

Partition from URL

from unstructured.partition.auto import partition

elements = partition(
    url="https://example.com/document.pdf",
    strategy="fast",
)
