Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Unstructured IO Unstructured Detect Filetype

From Leeroopedia
Knowledge Sources
Domains Document_Processing, Preprocessing
Last Updated 2026-02-12 00:00 GMT

Overview

Concrete tool for identifying document file types provided by the Unstructured library.

Description

The detect_filetype function examines a file's binary content using libmagic and extension-based heuristics to determine which FileType enum member it corresponds to. It supports both file paths and in-memory file-like objects, and can accept an explicit content type to bypass auto-detection.

Usage

Import this function when you need to determine the format of a document before routing it to a format-specific partitioner. It is called internally by partition() but can also be used standalone for file triage, filtering, or validation workflows.

Code Reference

Source Location

  • Repository: unstructured
  • File: unstructured/file_utils/filetype.py
  • Lines: 63-109

Signature

def detect_filetype(
    file_path: str | None = None,
    file: IO[bytes] | tempfile.SpooledTemporaryFile | None = None,
    encoding: str | None = None,
    content_type: str | None = None,
    metadata_file_path: Optional[str] = None,
) -> FileType:
    """Detect the file type of a document.

    Args:
        file_path: Path to file on disk.
        file: File-like object with binary content.
        encoding: Character encoding (default utf-8).
        content_type: Known MIME type (disables auto-detection when provided).
        metadata_file_path: Alternative path used for extension-based detection.
    Returns:
        FileType enum member (e.g., FileType.PDF, FileType.DOCX, FileType.HTML).
    """

Import

from unstructured.file_utils.filetype import detect_filetype

I/O Contract

Inputs

Name Type Required Description
file_path None No Path to file on disk (provide this or file)
file SpooledTemporaryFile | None No File-like object with binary content
encoding None No Character encoding hint (default utf-8)
content_type None No Known MIME type; bypasses auto-detection
metadata_file_path None No Alternative path for extension-based fallback

Outputs

Name Type Description
return FileType Enum member identifying the document format (e.g., FileType.PDF, FileType.DOCX, FileType.HTML, FileType.UNK)

Usage Examples

Detect from File Path

from unstructured.file_utils.filetype import detect_filetype

# Detect type of a local PDF file
file_type = detect_filetype(file_path="documents/report.pdf")
print(file_type)  # FileType.PDF

Detect from File-like Object

from unstructured.file_utils.filetype import detect_filetype

with open("documents/report.pdf", "rb") as f:
    file_type = detect_filetype(file=f)
    print(file_type)  # FileType.PDF

Bypass Detection with Known Content Type

from unstructured.file_utils.filetype import detect_filetype

# When MIME type is already known (e.g., from HTTP headers)
file_type = detect_filetype(
    file_path="data/unknown_extension",
    content_type="application/pdf",
)
print(file_type)  # FileType.PDF

Related Pages

Implements Principle

Requires Environment

Uses Heuristic

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment