Implementation:Unstructured IO Unstructured Detect Filetype
| Knowledge Sources | |
|---|---|
| Domains | Document_Processing, Preprocessing |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A concrete tool, provided by the Unstructured library, for identifying the file type of a document.
Description
The detect_filetype function examines a file's binary content using libmagic and extension-based heuristics to determine which FileType enum member it corresponds to. It supports both file paths and in-memory file-like objects, and can accept an explicit content type to bypass auto-detection.
Usage
Import this function when you need to determine the format of a document before routing it to a format-specific partitioner. It is called internally by partition() but can also be used standalone for file triage, filtering, or validation workflows.
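The triage use case can be sketched as a small grouping helper. This is only an illustration under stated assumptions: `group_by_filetype`, `_detect_with_unstructured`, and the `detect` parameter are hypothetical names that are not part of Unstructured. The default detector simply calls `detect_filetype(file_path=...)`, and the result is keyed by the enum member's name.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional


def _detect_with_unstructured(path: str):
    # Imported lazily so the helper can also be run with a stub detector.
    from unstructured.file_utils.filetype import detect_filetype

    return detect_filetype(file_path=path)


def group_by_filetype(
    paths: Iterable[str],
    detect: Optional[Callable[[str], object]] = None,
) -> Dict[str, List[str]]:
    """Group paths by detected type name (hypothetical helper)."""
    detect = detect or _detect_with_unstructured
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        file_type = detect(path)
        # Enum members expose .name; plain strings from a stub fall back to str().
        key = getattr(file_type, "name", str(file_type))
        groups[key].append(path)
    return dict(groups)
```

Accepting the detector as a parameter keeps the grouping logic easy to test without real documents; in normal use the argument can be omitted.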
Code Reference
Source Location
- Repository: unstructured
- File: unstructured/file_utils/filetype.py
- Lines: 63-109
Signature
def detect_filetype(
    file_path: str | None = None,
    file: IO[bytes] | tempfile.SpooledTemporaryFile | None = None,
    encoding: str | None = None,
    content_type: str | None = None,
    metadata_file_path: Optional[str] = None,
) -> FileType:
    """Detect the file type of a document.

    Args:
        file_path: Path to file on disk.
        file: File-like object with binary content.
        encoding: Character encoding (default utf-8).
        content_type: Known MIME type (disables auto-detection when provided).
        metadata_file_path: Alternative path used for extension-based detection.

    Returns:
        FileType enum member (e.g., FileType.PDF, FileType.DOCX, FileType.HTML).
    """
Import
from unstructured.file_utils.filetype import detect_filetype
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| file_path | str or None | No | Path to file on disk (provide this or file) |
| file | IO[bytes], SpooledTemporaryFile, or None | No | File-like object with binary content |
| encoding | str or None | No | Character encoding hint (default utf-8) |
| content_type | str or None | No | Known MIME type; bypasses auto-detection |
| metadata_file_path | str or None | No | Alternative path for extension-based fallback |
Outputs
| Name | Type | Description |
|---|---|---|
| return | FileType | Enum member identifying the document format (e.g., FileType.PDF, FileType.DOCX, FileType.HTML, FileType.UNK) |
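For validation workflows, the returned member can be checked against an allow-list, treating FileType.UNK as a rejection. The following is a minimal sketch, not part of the library: `is_accepted` and its `allowed` parameter are hypothetical, and the check compares member names so it does not depend on where FileType is imported from.

```python
from typing import Iterable


def is_accepted(
    file_type: object,
    allowed: Iterable[str] = ("PDF", "DOCX", "HTML"),
) -> bool:
    """Return True when the detected type is known and allow-listed.

    Hypothetical helper: `file_type` is expected to be a FileType member,
    but any object with a `.name` attribute (or a plain string) works.
    """
    name = getattr(file_type, "name", str(file_type))
    # UNK means detection failed, so it is never accepted.
    return name != "UNK" and name in set(allowed)
```

A caller might use it as `is_accepted(detect_filetype(file_path=path))` before handing the file to a partitioner.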
Usage Examples
Detect from File Path
from unstructured.file_utils.filetype import detect_filetype
# Detect type of a local PDF file
file_type = detect_filetype(file_path="documents/report.pdf")
print(file_type) # FileType.PDF
Detect from File-like Object
from unstructured.file_utils.filetype import detect_filetype
with open("documents/report.pdf", "rb") as f:
    file_type = detect_filetype(file=f)
print(file_type)  # FileType.PDF
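When the content is already in memory (for example, an uploaded payload), the bytes can be wrapped in `io.BytesIO` and the original filename passed as `metadata_file_path` so that extension-based detection still has something to work with. This is a sketch under assumptions: `detect_from_bytes` and its `detect` parameter are hypothetical names, and only the `file` and `metadata_file_path` arguments from the signature above are used.

```python
import io
from typing import Callable, Optional


def detect_from_bytes(
    data: bytes,
    filename: Optional[str] = None,
    detect: Optional[Callable[..., object]] = None,
):
    """Detect the type of in-memory content (hypothetical helper)."""
    if detect is None:
        # Imported lazily so the helper can be exercised with a stub detector.
        from unstructured.file_utils.filetype import detect_filetype as detect
    # The filename is only a hint for extension-based detection.
    return detect(file=io.BytesIO(data), metadata_file_path=filename)
```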
Bypass Detection with Known Content Type
from unstructured.file_utils.filetype import detect_filetype
# When MIME type is already known (e.g., from HTTP headers)
file_type = detect_filetype(
    file_path="data/unknown_extension",
    content_type="application/pdf",
)
print(file_type) # FileType.PDF
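HTTP `Content-Type` headers often carry parameters such as `; charset=utf-8`. Whether `detect_filetype` tolerates such parameters is not stated here, so a cautious approach is to reduce the header to its bare MIME type first. The helper below is a hypothetical sketch, not part of Unstructured.

```python
from typing import Optional


def mime_from_header(content_type_header: Optional[str]) -> Optional[str]:
    """Return the bare, lowercased MIME type from a Content-Type header.

    Hypothetical helper: returns None when the header is missing or empty,
    which leaves the content_type argument unset.
    """
    if not content_type_header:
        return None
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type_header.split(";", 1)[0].strip().lower()
    return mime or None
```

The result can then be passed along, for example `detect_filetype(file_path=path, content_type=mime_from_header(header))`, where `header` is whatever the HTTP client returned.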