Implementation:Huggingface Datasets Extractor
| Source | src/datasets/utils/extract.py |
|---|---|
| Domain(s) | Data_Processing, File_Handling |
| Last Updated | 2026-02-14 |
Overview
Description
The Extractor module provides a unified archive extraction framework that supports nine compressed and archived file formats. It is built around a factory pattern: the top-level Extractor class maintains a registry of format-specific extractor classes and dispatches extraction operations to the appropriate one.
The architecture consists of three layers:
- BaseExtractor -- An abstract base class defining the interface: is_extractable(path) and extract(input_path, output_path).
- MagicNumberBaseExtractor -- An intermediate ABC that implements format detection by reading magic bytes from the beginning of a file and comparing them against known signatures.
- Concrete extractors -- Format-specific classes that implement the actual extraction logic:
  - TarExtractor -- Handles tar archives (with CVE-2007-4559 path traversal protection via safemembers).
  - GzipExtractor -- Handles gzip files (magic: \x1f\x8b).
  - ZipExtractor -- Handles zip files (magic: PK\x03\x04, plus enhanced detection using central directory inspection).
  - XzExtractor -- Handles xz/LZMA files (magic: \xfd7zXZ\x00).
  - RarExtractor -- Handles RAR archives (requires optional rarfile dependency).
  - ZstdExtractor -- Handles Zstandard files (requires optional zstandard dependency).
  - Bzip2Extractor -- Handles bzip2 files (magic: BZh).
  - SevenZipExtractor -- Handles 7z archives (requires optional py7zr dependency).
  - Lz4Extractor -- Handles LZ4 frames (requires optional lz4 dependency).
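The three layers can be illustrated with a minimal standalone sketch. It reuses the class names from the description above but does not import the library; the magic_numbers attribute, the read_magic_number helper, and the exact method bodies are assumptions made for illustration, not the module's actual code.

```python
import gzip
import shutil
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Union


class BaseExtractor(ABC):
    """Interface layer: detection plus extraction."""

    @classmethod
    @abstractmethod
    def is_extractable(cls, path: Union[Path, str], **kwargs) -> bool: ...

    @staticmethod
    @abstractmethod
    def extract(input_path: Union[Path, str], output_path: Union[Path, str]) -> None: ...


class MagicNumberBaseExtractor(BaseExtractor, ABC):
    """Detection layer: compare the file's leading bytes with known signatures."""

    # Assumed attribute name: the signatures a subclass recognizes.
    magic_numbers: list[bytes] = []

    @staticmethod
    def read_magic_number(path: Union[Path, str], length: int) -> bytes:
        # Assumed helper: read only the first `length` bytes of the file.
        with open(path, "rb") as f:
            return f.read(length)

    @classmethod
    def is_extractable(cls, path: Union[Path, str], magic_number: bytes = b"") -> bool:
        # A caller may pass bytes it has already read, to avoid re-reading the file.
        if not magic_number:
            longest = max(len(m) for m in cls.magic_numbers)
            magic_number = cls.read_magic_number(path, longest)
        return any(magic_number.startswith(m) for m in cls.magic_numbers)


class GzipExtractor(MagicNumberBaseExtractor):
    """Concrete layer: gzip signature from the list above, stdlib decompression."""

    magic_numbers = [b"\x1f\x8b"]

    @staticmethod
    def extract(input_path: Union[Path, str], output_path: Union[Path, str]) -> None:
        with gzip.open(input_path, "rb") as src, open(output_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
```

In this sketch a concrete extractor only declares its signatures and its extraction routine; detection is inherited from the intermediate class.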
The Extractor.infer_extractor_format class method reads the magic bytes once and iterates through all registered extractors to find one that matches, returning the format name string (e.g., "tar", "gzip", "zip"). The Extractor.extract class method then delegates to the matched extractor's extract static method, using a file lock to prevent parallel extractions to the same output path.
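The read-once, dispatch-by-name flow described above can be sketched with the standard library alone. The function names infer_format and extract_with, and the SIGNATURES and OPENERS tables, are hypothetical; only the two signatures (gzip and bzip2) come from the list above. The file lock is deliberately left out of this sketch.

```python
import bz2
import gzip
import shutil
from pathlib import Path
from typing import Callable, Optional

# Hypothetical registry: format name -> signature. Dict order is the detection order.
SIGNATURES: dict[str, bytes] = {
    "gzip": b"\x1f\x8b",
    "bz2": b"BZh",
}

# Hypothetical registry: format name -> function that opens the compressed stream.
OPENERS: dict[str, Callable] = {
    "gzip": gzip.open,
    "bz2": bz2.open,
}


def infer_format(path) -> Optional[str]:
    """Read the leading bytes once, then test each registered signature in order."""
    longest = max(len(sig) for sig in SIGNATURES.values())
    with open(path, "rb") as f:
        head = f.read(longest)
    for name, sig in SIGNATURES.items():
        if head.startswith(sig):
            return name
    return None  # not a recognized format


def extract_with(input_path, output_path, extractor_format: str) -> None:
    """Delegate to the routine registered under the given format name."""
    opener = OPENERS[extractor_format]
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with opener(input_path, "rb") as src, open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

Because the first matching entry wins, the order of the registry matters whenever a file could satisfy more than one check.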
An ExtractManager class provides higher-level caching integration: it computes a hashed output path from the input file's absolute path and only extracts if the output does not already exist.
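The caching behaviour can be sketched as follows. This is a minimal illustration, assuming SHA-256 as the hash and a caller-supplied extraction callback; the function name cached_extract and its parameters are hypothetical, and the library's own hashing scheme may differ.

```python
import hashlib
import os
from typing import Callable


def cached_extract(
    input_path: str,
    extract_dir: str,
    do_extract: Callable[[str, str], None],
) -> str:
    """Extract input_path into a hash-named location unless it is already there."""
    abs_path = os.path.abspath(input_path)
    # Assumption: SHA-256 of the absolute path names the output location.
    digest = hashlib.sha256(abs_path.encode("utf-8")).hexdigest()
    output_path = os.path.join(extract_dir, digest)
    if not os.path.exists(output_path):
        os.makedirs(extract_dir, exist_ok=True)
        do_extract(abs_path, output_path)
    return output_path
```

Keying the output on the absolute input path means that repeated calls for the same file return the same location and skip the extraction step.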
Usage
Use this module when downloading and preparing datasets that are distributed in compressed or archived formats. The Extractor class auto-detects the format and extracts the archive contents to a specified output directory. It is used internally by the dataset download and preparation pipeline.
Code Reference
Source Location
src/datasets/utils/extract.py (332 lines)
Signature
class Extractor:
    extractors: dict[str, type[BaseExtractor]] = {
        "tar": TarExtractor,
        "gzip": GzipExtractor,
        "zip": ZipExtractor,
        "xz": XzExtractor,
        "rar": RarExtractor,
        "zstd": ZstdExtractor,
        "bz2": Bzip2Extractor,
        "7z": SevenZipExtractor,
        "lz4": Lz4Extractor,
    }
    @classmethod
    def infer_extractor_format(cls, path: Union[Path, str]) -> Optional[str]: ...
    @classmethod
    def extract(
        cls,
        input_path: Union[Path, str],
        output_path: Union[Path, str],
        extractor_format: str,
    ) -> None: ...
Import
from datasets.utils.extract import Extractor
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| input_path | Union[Path, str] | Path to the compressed or archived file to extract |
| output_path | Union[Path, str] | Destination directory or file path for extracted content |
| extractor_format | str | Format identifier (e.g., "tar", "gzip", "zip") |
Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) | files on disk | Extracted files written to output_path |
| format name | Optional[str] | From infer_extractor_format: the detected format string, or None if not extractable |
Usage Examples
Auto-detecting and extracting an archive:
from datasets.utils.extract import Extractor
# Detect the format from magic bytes
fmt = Extractor.infer_extractor_format("data/train.tar.gz")
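# fmt is likely "tar" rather than "gzip": "tar" is checked first in the registry,
# and Python's tarfile module recognizes gzip-compressed tarballs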
# Extract the archive
if fmt:
    Extractor.extract("data/train.tar.gz", "data/extracted/", fmt)
Using ExtractManager for cached extraction:
from datasets.utils.extract import ExtractManager
manager = ExtractManager(cache_dir="/home/user/.cache/huggingface/datasets")
# Extracts only if the output does not already exist
output_path = manager.extract("downloads/dataset.zip")
# output_path == "/home/user/.cache/huggingface/datasets/extracted/<hash>"
Checking if a file is extractable:
from datasets.utils.extract import Extractor
fmt = Extractor.infer_extractor_format("data/plain_text.csv")
if fmt is None:
    print("File is not a recognized archive format")
Related Pages
- Principle: Archive Extraction -- The design principle governing how compressed and archived dataset files are detected, extracted, and cached during data preparation.