Implementation:Huggingface Datasets Extractor

Source	src/datasets/utils/extract.py
Domain(s)	Data_Processing, File_Handling
Last Updated	2026-02-14

Overview

Description

The Extractor module provides a unified archive extraction framework that supports nine compressed and archived file formats. It is built around a factory pattern: the top-level Extractor class maintains a registry of format-specific extractor classes and dispatches extraction operations to the appropriate one.

The architecture consists of three layers:

- BaseExtractor -- An abstract base class defining the interface: is_extractable(path) and extract(input_path, output_path).
- MagicNumberBaseExtractor -- An intermediate ABC that implements format detection by reading magic bytes from the beginning of a file and comparing them against known signatures.
- Concrete extractors -- Format-specific classes that implement the actual extraction logic:
  - TarExtractor -- Handles tar archives (with CVE-2007-4559 path traversal protection via safemembers).
  - GzipExtractor -- Handles gzip files (magic: \x1f\x8b).
  - ZipExtractor -- Handles zip files (magic: PK\x03\x04, plus enhanced detection using central directory inspection).
  - XzExtractor -- Handles xz/LZMA files (magic: \xfd7zXZ\x00).
  - RarExtractor -- Handles RAR archives (requires optional rarfile dependency).
  - ZstdExtractor -- Handles Zstandard files (requires optional zstandard dependency).
  - Bzip2Extractor -- Handles bzip2 files (magic: BZh).
  - SevenZipExtractor -- Handles 7z archives (requires optional py7zr dependency).
  - Lz4Extractor -- Handles LZ4 frames (requires optional lz4 dependency).
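
The path traversal protection mentioned for TarExtractor can be illustrated with a minimal sketch. The helper name safe_members and its exact checks are assumptions made for illustration, not the library's safemembers implementation; the sketch only shows the general idea of skipping members whose resolved target falls outside the output directory.

```python
import io
import os
import tarfile
import tempfile


def safe_members(members, output_path):
    """Yield only members whose target stays inside output_path (hypothetical helper)."""
    base = os.path.realpath(output_path)
    for member in members:
        target = os.path.realpath(os.path.join(base, member.name))
        # commonpath equals base only when target is base itself or nested under it
        if os.path.commonpath([base, target]) != base:
            continue  # e.g. "../evil.txt" or an absolute path
        if member.issym() or member.islnk():
            continue  # links are skipped entirely in this simplified sketch
        yield member


def build_demo_tar():
    """Build an in-memory tar with one benign member and one traversal attempt."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in ("ok.txt", "../evil.txt"):
            data = b"hello"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


with tempfile.TemporaryDirectory() as out_dir:
    with tarfile.open(fileobj=build_demo_tar(), mode="r") as tar:
        kept = [m.name for m in safe_members(tar.getmembers(), out_dir)]
        # Only the filtered members are handed to extractall
        tar.extractall(out_dir, members=safe_members(tar.getmembers(), out_dir))
    extracted = sorted(os.listdir(out_dir))
```

Filtering before extraction, rather than cleaning up afterwards, means a malicious member is never written to disk in the first place.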

The Extractor.infer_extractor_format class method reads the magic bytes once and iterates through all registered extractors to find one that matches, returning the format name string (e.g., "tar", "gzip", "zip"). The Extractor.extract class method then delegates to the matched extractor's extract static method, using a file lock to prevent parallel extractions to the same output path.
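
The read-once-then-iterate detection described above can be sketched as follows. This is a simplified stand-in, not the module's code: the MAGIC_NUMBERS table and the infer_format and read_magic_number names are hypothetical, and only the three stdlib-supported formats whose signatures are listed above are covered.

```python
import bz2
import gzip
import lzma
import os
import tempfile
from typing import Optional

# Signatures taken from the extractor list above; hypothetical registry for this sketch.
MAGIC_NUMBERS = {
    "gzip": [b"\x1f\x8b"],
    "xz": [b"\xfd7zXZ\x00"],
    "bz2": [b"BZh"],
}


def read_magic_number(path: str, length: int) -> bytes:
    """Read the first `length` bytes of the file."""
    with open(path, "rb") as f:
        return f.read(length)


def infer_format(path: str) -> Optional[str]:
    """Return the first registered format whose signature prefixes the file, else None."""
    longest = max(len(sig) for sigs in MAGIC_NUMBERS.values() for sig in sigs)
    header = read_magic_number(path, longest)  # one read shared by all checks
    for name, signatures in MAGIC_NUMBERS.items():
        if any(header.startswith(sig) for sig in signatures):
            return name
    return None


with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "a.gz": gzip.compress(b"data"),
        "a.xz": lzma.compress(b"data"),
        "a.bz2": bz2.compress(b"data"),
        "a.csv": b"col1,col2\n",
    }
    detected = {}
    for filename, payload in samples.items():
        file_path = os.path.join(tmp, filename)
        with open(file_path, "wb") as f:
            f.write(payload)
        detected[filename] = infer_format(file_path)
```

Reading the longest signature length up front lets every format check reuse the same header bytes instead of reopening the file once per extractor.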

An ExtractManager class provides higher-level caching integration: it computes a hashed output path from the input file's absolute path and only extracts if the output does not already exist.
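
The caching behaviour can be sketched roughly as below. The function names get_output_path and cached_extract are hypothetical, and SHA-256 is an assumption: the description only states that the output path is derived from a hash of the input file's absolute path, not which hash function is used.

```python
import hashlib
import os
import tempfile


def get_output_path(extract_dir: str, input_path: str) -> str:
    """Derive a stable output location from the input file's absolute path."""
    abs_path = os.path.abspath(input_path)
    # SHA-256 is an assumption; the source only says the path is hashed.
    digest = hashlib.sha256(abs_path.encode("utf-8")).hexdigest()
    return os.path.join(extract_dir, digest)


def cached_extract(extract_dir: str, input_path: str, do_extract) -> str:
    """Call do_extract(input_path, output_path) only when output_path is missing."""
    output_path = get_output_path(extract_dir, input_path)
    if not os.path.exists(output_path):
        do_extract(input_path, output_path)
    return output_path


calls = []


def fake_extract(input_path: str, output_path: str) -> None:
    """Stand-in for a real extractor: records the call and creates the output."""
    calls.append(input_path)
    os.makedirs(output_path)


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_extract(cache_dir, "downloads/dataset.zip", fake_extract)
    second = cached_extract(cache_dir, "downloads/dataset.zip", fake_extract)
```

Because the output location depends only on the input's absolute path, the second call resolves to the same directory and skips extraction.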

Usage

Use this module when downloading and preparing datasets that are distributed in compressed or archived formats. The Extractor class auto-detects the format and extracts the archive contents to a specified output directory. It is used internally by the dataset download and preparation pipeline.

Code Reference

Source Location

src/datasets/utils/extract.py (332 lines)

Signature

class Extractor:
    extractors: dict[str, type[BaseExtractor]] = {
        "tar": TarExtractor,
        "gzip": GzipExtractor,
        "zip": ZipExtractor,
        "xz": XzExtractor,
        "rar": RarExtractor,
        "zstd": ZstdExtractor,
        "bz2": Bzip2Extractor,
        "7z": SevenZipExtractor,
        "lz4": Lz4Extractor,
    }

    @classmethod
    def infer_extractor_format(cls, path: Union[Path, str]) -> Optional[str]: ...

    @classmethod
    def extract(
        cls,
        input_path: Union[Path, str],
        output_path: Union[Path, str],
        extractor_format: str,
    ) -> None: ...

Import

from datasets.utils.extract import Extractor

I/O Contract

Inputs

Name	Type	Description
`input_path`	`Union[Path, str]`	Path to the compressed or archived file to extract
`output_path`	`Union[Path, str]`	Destination directory or file path for extracted content
`extractor_format`	`str`	Format identifier (e.g., `"tar"`, `"gzip"`, `"zip"`)

Outputs

Name	Type	Description
(side effect)	files on disk	Extracted files written to `output_path`
format name	`Optional[str]`	From `infer_extractor_format`: the detected format string, or `None` if not extractable

Usage Examples

Auto-detecting and extracting an archive:

from datasets.utils.extract import Extractor

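# Detect the format by asking each registered extractor in order
fmt = Extractor.infer_extractor_format("data/train.tar.gz")
# fmt is expected to be "tar" rather than "gzip": "tar" is checked first in the
# registry, and Python's tarfile module reads gzip-compressed tarballs transparently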

# Extract the archive
if fmt:
    Extractor.extract("data/train.tar.gz", "data/extracted/", fmt)

Using ExtractManager for cached extraction:

from datasets.utils.extract import ExtractManager

manager = ExtractManager(cache_dir="/home/user/.cache/huggingface/datasets")

# Extracts only if the output does not already exist
output_path = manager.extract("downloads/dataset.zip")
# output_path == "/home/user/.cache/huggingface/datasets/extracted/<hash>"

Checking if a file is extractable:

from datasets.utils.extract import Extractor

fmt = Extractor.infer_extractor_format("data/plain_text.csv")
if fmt is None:
    print("File is not a recognized archive format")

Related Pages

Principle: Archive Extraction -- The design principle governing how compressed and archived dataset files are detected, extracted, and cached during data preparation.
