Implementation:Unstructured IO Unstructured Detect Filetype

Knowledge Sources	Unstructured python-magic
Domains	Document_Processing, Preprocessing
Last Updated	2026-02-12 00:00 GMT

Overview

Concrete tool, provided by the Unstructured library, for identifying the file type of a document.

Description

The detect_filetype function examines a file's binary content using libmagic and extension-based heuristics to determine which FileType enum member it corresponds to. It supports both file paths and in-memory file-like objects, and can accept an explicit content type to bypass auto-detection.
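
To make the combination of content inspection and extension-based heuristics concrete, here is a minimal sketch of the general idea. It is not Unstructured's implementation and does not use libmagic: it swaps in a tiny hand-written signature table plus the standard-library mimetypes module. The name guess_mime and the signature table are hypothetical, introduced only for illustration.

```python
import mimetypes
from typing import Optional

# Illustrative only: a few well-known magic-byte prefixes.
# libmagic recognizes far more formats than this hypothetical table.
_SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def guess_mime(head: bytes, filename: Optional[str] = None) -> Optional[str]:
    """Guess a MIME type from leading bytes, falling back to the extension."""
    # Step 1: content sniffing on the first bytes of the file.
    for prefix, mime in _SIGNATURES.items():
        if head.startswith(prefix):
            return mime
    # Step 2: extension-based heuristic when the content is inconclusive.
    if filename:
        mime, _encoding = mimetypes.guess_type(filename)
        return mime
    return None
```

Content wins over the extension in this sketch, so a PDF saved under a misleading name is still recognized from its leading bytes.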

Usage

Import this function when you need to determine the format of a document before routing it to a format-specific partitioner. It is called internally by partition() but can also be used standalone for file triage, filtering, or validation workflows.
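
A triage workflow of the kind described above might be sketched as follows. The helper group_by_filetype is hypothetical, not part of Unstructured; it takes the detector as a parameter so that detect_filetype can be passed in, and it assumes only that the detector accepts a file_path keyword and returns a hashable value, which holds for enum members.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List


def group_by_filetype(
    paths: Iterable[str],
    detect: Callable[..., Any],
) -> Dict[Any, List[str]]:
    """Group file paths by the type that `detect` reports for each one."""
    groups: Dict[Any, List[str]] = defaultdict(list)
    for path in paths:
        # `detect` is expected to be detect_filetype or a compatible callable.
        groups[detect(file_path=path)].append(path)
    return dict(groups)
```

With Unstructured installed, group_by_filetype(paths, detect_filetype) should return a dictionary keyed by FileType members; files that could not be identified would be expected under the FileType.UNK key, which makes it straightforward to filter them out before partitioning.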

Code Reference

Source Location

Repository: unstructured
File: unstructured/file_utils/filetype.py
Lines: 63-109

Signature

def detect_filetype(
    file_path: str | None = None,
    file: IO[bytes] | tempfile.SpooledTemporaryFile | None = None,
    encoding: str | None = None,
    content_type: str | None = None,
    metadata_file_path: str | None = None,
) -> FileType:
    """Detect the file type of a document.

    Args:
        file_path: Path to file on disk.
        file: File-like object with binary content.
        encoding: Character encoding (default utf-8).
        content_type: Known MIME type (disables auto-detection when provided).
        metadata_file_path: Alternative path used for extension-based detection.
    Returns:
        FileType enum member (e.g., FileType.PDF, FileType.DOCX, FileType.HTML).
    """

Import

from unstructured.file_utils.filetype import detect_filetype

I/O Contract

Inputs

Name	Type	Required	Description
file_path	str | None	No	Path to file on disk (provide this or file)
file	IO[bytes] | tempfile.SpooledTemporaryFile | None	No	File-like object with binary content (provide this or file_path)
encoding	str | None	No	Character encoding hint (utf-8 when omitted)
content_type	str | None	No	Known MIME type; bypasses auto-detection
metadata_file_path	str | None	No	Alternative path for extension-based fallback

Outputs

Name	Type	Description
return	FileType	Enum member identifying the document format (e.g., FileType.PDF, FileType.DOCX, FileType.HTML, FileType.UNK)

Usage Examples

Detect from File Path

from unstructured.file_utils.filetype import detect_filetype

# Detect type of a local PDF file
file_type = detect_filetype(file_path="documents/report.pdf")
print(file_type)  # FileType.PDF

Detect from File-like Object

from unstructured.file_utils.filetype import detect_filetype

with open("documents/report.pdf", "rb") as f:
    file_type = detect_filetype(file=f)
    print(file_type)  # FileType.PDF
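
When the content is only available in memory (for example, an uploaded payload), there is no path whose extension could be inspected. The signature above lists metadata_file_path as an alternative path for extension-based detection, so one possible pattern is to pass the original file name through it. The wrapper detect_from_bytes below is a hypothetical helper, and the detector is injected so that the sketch does not depend on a particular Unstructured version.

```python
import io
from typing import Any, Callable


def detect_from_bytes(
    data: bytes,
    original_name: str,
    detect: Callable[..., Any],
) -> Any:
    """Run `detect` on in-memory bytes, passing the original name as a hint."""
    buffer = io.BytesIO(data)  # binary file-like object, as `file` expects
    # Assumption: `detect` accepts the `file` and `metadata_file_path`
    # keywords shown in the detect_filetype signature.
    return detect(file=buffer, metadata_file_path=original_name)
```

A call such as detect_from_bytes(payload, "report.docx", detect_filetype) would then give the detector both the bytes and the name the file originally carried.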

Bypass Detection with Known Content Type

from unstructured.file_utils.filetype import detect_filetype

# When MIME type is already known (e.g., from HTTP headers)
file_type = detect_filetype(
    file_path="data/unknown_extension",
    content_type="application/pdf",
)
print(file_type)  # FileType.PDF
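
Content-Type values taken from HTTP headers often carry parameters, as in "application/pdf; charset=binary". Assuming content_type is meant to be a bare MIME type (an assumption, not something this page confirms), a small normalization step before the call may help. The helper bare_mime_type is hypothetical.

```python
def bare_mime_type(header_value: str) -> str:
    """Strip parameters and normalize case in a Content-Type header value."""
    # Keep only the part before the first ';', then trim and lowercase it.
    return header_value.split(";", 1)[0].strip().lower()
```

The result can then be passed as content_type, for example detect_filetype(file_path=path, content_type=bare_mime_type(header)).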
