Heuristic:Unstructured IO Unstructured Libmagic Filetype Accuracy
| Knowledge Sources | |
|---|---|
| Domains | File Type Detection, MIME Classification, Dependency Management |
| Last Updated | 2026-02-12 09:00 GMT |
Overview
File type detection accuracy depends heavily on whether libmagic is installed and which version is present. Without it, textual file types cannot be detected at all; with it, older versions still misclassify some files (for example, .json reported as text/plain).
Description
The Unstructured library uses a multi-layered file type detection strategy, but each layer has known limitations:
Without libmagic installed: The Python filetype package alone cannot detect textual file types at all (filetype.py:243-248). This means CSV, EML, HTML, Markdown, RST, RTF, TSV, and TXT files will not be correctly identified. The filetype package only detects binary formats by examining magic bytes.
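A deployment can check up front whether libmagic is usable before any files are processed. The sketch below assumes libmagic is reached through the python-magic binding (imported as magic); the helper name libmagic_available is hypothetical and not part of Unstructured.

```python
def libmagic_available() -> bool:
    """Return True if the `magic` binding and the libmagic shared library both load."""
    try:
        # Assumption: libmagic is accessed via the python-magic package.
        # Importing it fails when the package or the shared library is missing.
        import magic  # noqa: F401
    except (ImportError, OSError):
        return False
    return True
```

Calling this once at startup and logging a warning when it returns False makes a missing system dependency visible before CSV, EML, HTML, MD, RST, RTF, TSV, or TXT inputs are silently mis-routed.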
Binary detection before caller-asserted content-type: The system performs its own binary content detection before trusting any caller-supplied content-type header (filetype.py:167-170). This is because HTTP content-type headers are inherently unreliable -- servers frequently misconfigure them, and user-supplied metadata cannot be trusted for correctness.
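The ordering described above can be illustrated with a minimal, standard-library-only sketch. The signature table and the names sniff_binary and resolve_mime are hypothetical and are not Unstructured's implementation; they only show why inspecting the bytes first protects against a wrong content-type.

```python
from typing import Optional

# Hypothetical, deliberately tiny signature table; real detectors know many more formats.
_MAGIC_SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"PK\x03\x04": "application/zip",
}


def sniff_binary(head: bytes) -> Optional[str]:
    """Return a MIME type if the leading bytes match a known signature, else None."""
    for signature, mime_type in _MAGIC_SIGNATURES.items():
        if head.startswith(signature):
            return mime_type
    return None


def resolve_mime(head: bytes, asserted_content_type: Optional[str] = None) -> Optional[str]:
    """Prefer what the bytes say; use the caller-asserted type only when sniffing finds nothing."""
    return sniff_binary(head) or asserted_content_type
```

With this order, a PDF uploaded with a text/plain header is still resolved as application/pdf, while a CSV payload (which has no magic bytes) falls through to the asserted type.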
Known libmagic bugs and matching pitfalls:
- Older libmagic versions, including the one shipped in the Unstructured Docker image, misclassify .json files as text/plain instead of application/json (filetype.py:638-640). The code includes a workaround for this.
- Go source files require exact MIME matching to avoid false positives from the substring "go" appearing in other MIME types (filetype.py:387-389).
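The Go pitfall is easy to reproduce. The source does not show what a substring-based check would look like, so is_go_substring below is a hypothetical stand-in; is_go_exact mirrors the exact comparison against text/x-go quoted in the Code Evidence section.

```python
def is_go_substring(mime_type: str) -> bool:
    """Naive check: true for any MIME type that merely contains the letters 'go'."""
    return "go" in mime_type


def is_go_exact(mime_type: str) -> bool:
    """Exact comparison: true only for the Go source MIME type itself."""
    return mime_type == "text/x-go"


# A registered MIME type that contains "go" (inside "google") but is unrelated to Go.
unrelated = "application/vnd.google-earth.kml+xml"

substring_says_go = is_go_substring(unrelated)  # True: a false positive
exact_says_go = is_go_exact(unrelated)          # False: correctly rejected
```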
Design decisions:
- Source code files (.py, .js, .go, etc.) are treated as plain text for partitioning purposes (filetype.py:256-258), since no special parsing is applied.
- When the Chipper model is in use, the hierarchy assignment step is skipped because Chipper handles hierarchy internally (filetype.py:779-781).
Usage
Apply this heuristic when:
- Deploying Unstructured in environments where libmagic may not be installed (e.g., minimal Docker images, serverless functions).
- Processing files where the caller-supplied content-type may be incorrect (e.g., files uploaded via web forms, files fetched from URLs).
- Debugging why certain text-based file types are not being detected or are being misclassified.
- Working with JSON files in the official Unstructured Docker image.
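For the JSON case, a caller who cannot upgrade libmagic could apply a correction of their own before routing a file. The sketch below is an assumption, not the library's workaround: the library's check relies on the extension alone, whereas this hypothetical refine_mime helper also confirms that the payload parses as JSON.

```python
import json
from pathlib import PurePath


def refine_mime(mime_type: str, filename: str, payload: bytes) -> str:
    """Upgrade text/plain to application/json when both the name and the content say JSON."""
    if mime_type != "text/plain" or PurePath(filename).suffix.lower() != ".json":
        return mime_type
    try:
        json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Not valid UTF-8 JSON; keep what the detector reported.
        return mime_type
    return "application/json"
```

The extra parse costs time on large files, so the extension-only check may be the better trade-off when inputs are trusted.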
The Insight (Rule of Thumb)
- Action: Always install libmagic alongside the filetype package to enable detection of textual file types. Do not trust caller-supplied content-type headers without binary verification. Apply known workarounds for JSON misclassification in older libmagic versions.
- Value: Without libmagic, 8 textual file types (CSV, EML, HTML, MD, RST, RTF, TSV, TXT) are undetectable. Binary detection runs first regardless of caller-supplied content-type. With older libmagic versions, including the one in the Unstructured Docker image, .json files may be reported as text/plain instead of application/json.
- Trade-off: Installing libmagic adds a system-level dependency (not pure Python), which complicates some deployment environments. The binary-first detection order adds a small performance overhead but prevents misclassification from unreliable metadata.
Reasoning
File type detection is the first step in the partitioning pipeline: if the file type is wrong, the wrong partitioner is selected and the output is unusable. The library cannot rely on caller-supplied content-type because files obtained through web scraping, email attachments, and user uploads often carry incorrect metadata or none at all. By performing binary detection first and requiring libmagic for textual types, the system maximizes classification accuracy at the cost of an additional system dependency. The JSON workaround and the exact MIME matching for Go are defensive measures: the first compensates for a known bug in older libmagic versions, and the second prevents substring false positives. Either problem would otherwise cause silent misclassification.
Code Evidence
Textual file types undetectable without libmagic (filetype.py:243-248):
# filetype.py:243-248
# NOTE: without libmagic, the filetype package CANNOT detect these textual types:
# CSV, EML, HTML, MD, RST, RTF, TSV, TXT
# The filetype package only inspects magic bytes, which are only
# reliable for binary formats.
LIBMAGIC_TEXTUAL_TYPES = {
    FileType.CSV, FileType.EML, FileType.HTML, FileType.MD,
    FileType.RST, FileType.RTF, FileType.TSV, FileType.TXT,
}
Binary detection before trusting content-type (filetype.py:167-170):
# filetype.py:167-170
# Perform binary detection BEFORE trusting caller-asserted content_type
# because content-type headers are inherently unreliable.
detected_type = _detect_filetype_from_binary(file_path)
if detected_type is not None:
    return detected_type
JSON misclassification workaround (filetype.py:638-640):
# filetype.py:638-640
# Older libmagic (including Unstructured Docker image) misclassifies
# .json as text/plain instead of application/json
if mime_type == "text/plain" and extension == ".json":
    return FileType.JSON
Go MIME type exact matching (filetype.py:387-389):
# filetype.py:387-389
# Use exact match for Go to avoid false positives from "go" substring
# appearing in other MIME type strings
if mime_type == "text/x-go":
    return FileType.GO