Implementation:Unstructured IO Unstructured Partition
| Knowledge Sources | |
|---|---|
| Domains | Document_Processing, NLP |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A concrete tool, provided by the Unstructured library, for partitioning documents into structured elements.
Description
The partition function is the primary entry point for document processing. It accepts a document (via file path, file object, or URL), detects its format, selects the appropriate format-specific partitioner, and returns a list of typed Element objects. It supports over 15 document formats and four processing strategies (auto, fast, hi_res, ocr_only).
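The detect-then-route flow described above can be pictured with a small sketch. This is not the library's implementation: `detect_format`, `route`, `ROUTES`, and the two stub partitioners are hypothetical names, and detection here relies only on the file extension via the standard-library `mimetypes` module. An explicit `content_type` skips detection, mirroring the role of that parameter in the signature.

```python
import mimetypes
from typing import Callable, Optional


# Hypothetical stand-ins for format-specific partitioners.
def _partition_pdf_stub(filename: str) -> list[str]:
    return [f"pdf:{filename}"]


def _partition_text_stub(filename: str) -> list[str]:
    return [f"text:{filename}"]


# Hypothetical routing table: MIME type -> partitioner.
ROUTES: dict[str, Callable[[str], list[str]]] = {
    "application/pdf": _partition_pdf_stub,
    "text/plain": _partition_text_stub,
}


def detect_format(filename: str, content_type: Optional[str] = None) -> str:
    """Return the MIME type, preferring an explicit content_type."""
    if content_type is not None:
        return content_type
    guessed, _ = mimetypes.guess_type(filename)
    if guessed is None:
        raise ValueError(f"cannot detect format of {filename!r}")
    return guessed


def route(filename: str, content_type: Optional[str] = None) -> list[str]:
    """Detect the format, then call the matching partitioner."""
    mime = detect_format(filename, content_type)
    handler = ROUTES.get(mime)
    if handler is None:
        raise ValueError(f"unsupported format: {mime}")
    return handler(filename)
```

Calling `route("report.pdf")` dispatches to the PDF stub, while `route("blob.bin", content_type="text/plain")` bypasses detection and goes to the text stub.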
Usage
Import this function when you need to convert any document into structured elements. It is the recommended entry point for most use cases, as it handles format detection and routing automatically. Use format-specific partitioners (partition_pdf, partition_docx) only when you need format-specific parameters not exposed by the generic function.
Code Reference
Source Location
- Repository: unstructured
- File: unstructured/partition/auto.py
- Lines: 30-296
Signature
def partition(
    filename: Optional[str] = None,
    *,
    file: Optional[IO[bytes]] = None,
    encoding: Optional[str] = None,
    content_type: Optional[str] = None,
    url: Optional[str] = None,
    headers: dict[str, str] = {},
    ssl_verify: bool = True,
    request_timeout: Optional[int] = None,
    strategy: str = PartitionStrategy.AUTO,
    skip_infer_table_types: list[str] = ["pdf", "jpg", "png", "heic"],
    ocr_languages: Optional[str] = None,
    languages: Optional[list[str]] = None,
    detect_language_per_element: bool = False,
    pdf_infer_table_structure: bool = False,
    extract_images_in_pdf: bool = False,
    extract_image_block_types: Optional[list[str]] = None,
    extract_image_block_output_dir: Optional[str] = None,
    extract_image_block_to_payload: bool = False,
    data_source_metadata: Optional[DataSourceMetadata] = None,
    metadata_filename: Optional[str] = None,
    hi_res_model_name: Optional[str] = None,
    model_name: Optional[str] = None,
    starting_page_number: int = 1,
    **kwargs: Any,
) -> list[Element]:
    """Partition a document into structured elements.

    Args:
        filename: Path to the document file.
        file: File-like object with document content.
        encoding: Character encoding of the document.
        content_type: MIME type (bypasses auto-detection).
        url: URL to fetch document from.
        headers: HTTP headers for URL fetching.
        ssl_verify: Verify SSL certificates for URL fetching.
        request_timeout: Timeout for URL fetching in seconds.
        strategy: Partition strategy (auto, fast, hi_res, ocr_only).
        skip_infer_table_types: File types to skip table structure inference.
        ocr_languages: OCR language codes (deprecated, use languages).
        languages: Document languages for OCR (e.g., ["eng", "deu"]).
        detect_language_per_element: Detect language for each element individually.
        pdf_infer_table_structure: Infer table structure for PDFs.
        extract_images_in_pdf: Extract images from PDFs.
        extract_image_block_types: Element types to extract images from.
        extract_image_block_output_dir: Directory for extracted images.
        extract_image_block_to_payload: Include images as base64 in metadata.
        data_source_metadata: Source metadata for connector pipelines.
        metadata_filename: Override filename in element metadata.
        hi_res_model_name: Layout detection model name.
        model_name: Alias for hi_res_model_name.
        starting_page_number: Page number offset.

    Returns:
        Ordered list of typed Element objects.
    """
Import
from unstructured.partition.auto import partition
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| filename | Optional[str] | No | Path to document file on disk |
| file | Optional[IO[bytes]] | No | File-like object with document content |
| url | Optional[str] | No | URL to fetch document from |
| strategy | str | No | Partition strategy: "auto" (default), "fast", "hi_res", "ocr_only" |
| languages | Optional[list[str]] | No | Document languages for OCR (e.g., ["eng"]) |
| hi_res_model_name | Optional[str] | No | Layout detection model name for hi_res strategy |
| content_type | Optional[str] | No | MIME type to bypass format auto-detection |
| starting_page_number | int | No | Page number offset (default 1) |
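Although filename, file, and url are each optional in the signature, a call still needs a document source. The following is a minimal sketch of a pre-flight check a caller might run, under the assumption that exactly one of the three should be supplied. `pick_source` is a hypothetical helper for illustration, not part of the library.

```python
import io
from typing import IO, Optional


def pick_source(
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
    url: Optional[str] = None,
) -> str:
    """Return the name of the single source that was provided.

    Assumption: exactly one of filename, file, or url must be set.
    """
    candidates = (("filename", filename), ("file", file), ("url", url))
    provided = [name for name, value in candidates if value is not None]
    if len(provided) != 1:
        raise ValueError(
            f"expected exactly one of filename, file, url; got {provided or 'none'}"
        )
    return provided[0]


# An in-memory file object counts as the "file" source.
source = pick_source(file=io.BytesIO(b"%PDF-1.7 ..."))
```

Running the check before the real call turns an ambiguous or empty request into an immediate, readable error.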
Outputs
| Name | Type | Description |
|---|---|---|
| return | list[Element] | Ordered list of typed elements: NarrativeText, Title, Table, Image, ListItem, Header, Footer, FigureCaption, Address, Formula, PageBreak, etc. |
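Because the return value is an ordered list of typed objects, common post-processing reduces to grouping or filtering by class. The sketch below uses small stand-in classes so that it runs without the library installed. The class names mirror the element types listed above, but `count_by_type` and `only` are hypothetical helpers, and the stand-ins carry only a `text` field.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Element:
    """Stand-in for the library's base element (illustration only)."""
    text: str


class Title(Element):
    pass


class NarrativeText(Element):
    pass


class Table(Element):
    pass


def count_by_type(elements: list[Element]) -> Counter:
    """Tally elements by their class name."""
    return Counter(type(el).__name__ for el in elements)


def only(elements: list[Element], kind: type) -> list[Element]:
    """Keep elements of the given class, preserving document order."""
    return [el for el in elements if isinstance(el, kind)]


sample = [
    Title("Q3 Report"),
    NarrativeText("Revenue rose."),
    Table("a|b"),
    NarrativeText("Costs fell."),
]
counts = count_by_type(sample)
tables = only(sample, Table)
```

Filtering with `isinstance` rather than comparing class names also picks up subclasses of the requested type.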
Usage Examples
Basic Document Partitioning
from unstructured.partition.auto import partition

# Partition a PDF with automatic strategy
elements = partition(filename="report.pdf")

# Print element types and text
for element in elements:
    print(f"{type(element).__name__}: {str(element)[:80]}")
High-Resolution PDF with Table Extraction
from unstructured.partition.auto import partition

elements = partition(
    filename="financial_report.pdf",
    strategy="hi_res",
    pdf_infer_table_structure=True,
    languages=["eng"],
    hi_res_model_name="yolox",
)

# Filter for tables
tables = [el for el in elements if type(el).__name__ == "Table"]
for table in tables:
    print(table.metadata.text_as_html)
Partition from URL
from unstructured.partition.auto import partition

elements = partition(
    url="https://example.com/document.pdf",
    strategy="fast",
)
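The URL-related parameters in the signature (headers, ssl_verify, request_timeout) can be assembled ahead of the call. This sketch builds the keyword arguments as a plain dictionary so the logic can be checked without network access. `build_url_kwargs`, the bearer-token scheme, and the 30-second default are assumptions for illustration, not library behavior.

```python
from typing import Any, Optional


def build_url_kwargs(
    url: str,
    token: Optional[str] = None,
    timeout_s: int = 30,
) -> dict[str, Any]:
    """Assemble keyword arguments for a URL-based partition call.

    Hypothetical helper: the parameter names match the documented
    signature, but the token handling is an assumed convention.
    """
    headers: dict[str, str] = {}
    if token is not None:
        headers["Authorization"] = f"Bearer {token}"
    return {
        "url": url,
        "headers": headers,
        "ssl_verify": True,
        "request_timeout": timeout_s,
        "strategy": "fast",
    }


kwargs = build_url_kwargs("https://example.com/document.pdf", token="EXAMPLE_TOKEN")

# With the library installed and network access available, the call would be:
# from unstructured.partition.auto import partition
# elements = partition(**kwargs)
```

Keeping the arguments in one dictionary makes it easy to log or unit-test the request configuration separately from the fetch itself.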