Implementation:Huggingface Datasets Extractor

From Leeroopedia
Source src/datasets/utils/extract.py
Domain(s) Data_Processing, File_Handling
Last Updated 2026-02-14

Overview

Description

The Extractor module provides a unified archive extraction framework that supports nine compressed and archived file formats. It is built around a factory pattern: the top-level Extractor class maintains a registry of format-specific extractor classes and dispatches extraction operations to the appropriate one.

The architecture consists of three layers:

  1. BaseExtractor -- An abstract base class defining the interface: is_extractable(path) and extract(input_path, output_path).
  2. MagicNumberBaseExtractor -- An intermediate ABC that implements format detection by reading magic bytes from the beginning of a file and comparing them against known signatures.
  3. Concrete extractors -- Format-specific classes that implement the actual extraction logic:
    • TarExtractor -- Handles tar archives (with CVE-2007-4559 path traversal protection via safemembers).
    • GzipExtractor -- Handles gzip files (magic: \x1f\x8b).
    • ZipExtractor -- Handles zip files (magic: PK\x03\x04, plus enhanced detection using central directory inspection).
    • XzExtractor -- Handles xz/LZMA files (magic: \xfd7zXZ\x00).
    • RarExtractor -- Handles RAR archives (requires optional rarfile dependency).
    • ZstdExtractor -- Handles Zstandard files (requires optional zstandard dependency).
    • Bzip2Extractor -- Handles bzip2 files (magic: BZh).
    • SevenZipExtractor -- Handles 7z archives (requires optional py7zr dependency).
    • Lz4Extractor -- Handles LZ4 frames (requires optional lz4 dependency).
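
The three layers above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual source: the class names mirror the ones listed, but the magic_numbers attribute, the read_magic_number helper, and the method bodies are assumptions made for the example.

```python
import gzip
import shutil
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Union


class BaseExtractor(ABC):
    """Layer 1: the interface every extractor implements."""

    @classmethod
    @abstractmethod
    def is_extractable(cls, path: Union[Path, str], **kwargs) -> bool: ...

    @staticmethod
    @abstractmethod
    def extract(input_path: Union[Path, str], output_path: Union[Path, str]) -> None: ...


class MagicNumberBaseExtractor(BaseExtractor, ABC):
    """Layer 2: detect a format by comparing a file's first bytes to known signatures."""

    # Assumed attribute name; each concrete class lists its own signatures.
    magic_numbers: list[bytes] = []

    @staticmethod
    def read_magic_number(path: Union[Path, str], length: int) -> bytes:
        # Hypothetical helper: read only the first `length` bytes.
        with open(path, "rb") as f:
            return f.read(length)

    @classmethod
    def is_extractable(cls, path: Union[Path, str], magic_number: bytes = b"") -> bool:
        # A caller may pass bytes it has already read, so the file is not reopened.
        if not magic_number:
            length = max(len(m) for m in cls.magic_numbers)
            try:
                magic_number = cls.read_magic_number(path, length)
            except OSError:
                return False
        return any(magic_number.startswith(m) for m in cls.magic_numbers)


class GzipExtractor(MagicNumberBaseExtractor):
    """Layer 3: a concrete extractor for single-file gzip streams."""

    magic_numbers = [b"\x1f\x8b"]

    @staticmethod
    def extract(input_path: Union[Path, str], output_path: Union[Path, str]) -> None:
        # Stream-decompress to the output file without loading it all into memory.
        with gzip.open(input_path, "rb") as src, open(output_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
```

Keeping detection in the intermediate class means a new signature-based format only needs to declare its magic bytes and an extract method.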

The Extractor.infer_extractor_format class method reads the magic bytes once, then iterates through the registered extractors in registry order and returns the name of the first format that matches (e.g., "tar", "gzip", "zip"), or None if none does. The Extractor.extract class method looks up the extractor registered under the given extractor_format and delegates to its extract static method, holding a file lock so that parallel processes do not extract to the same output path at the same time.
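
The "read once, then iterate" detection described above can be sketched as follows. This is a reduced illustration under stated assumptions: SIGNATURES and infer_format are hypothetical names, only the four formats whose magic bytes are quoted on this page are included, and the file lock used during extraction is left out.

```python
from pathlib import Path
from typing import Optional, Union

# Hypothetical, reduced registry: format name -> magic-byte signatures.
# Dict order is the detection order, so earlier entries win on a tie.
SIGNATURES: dict[str, list[bytes]] = {
    "gzip": [b"\x1f\x8b"],
    "zip": [b"PK\x03\x04"],
    "xz": [b"\xfd7zXZ\x00"],
    "bz2": [b"BZh"],
}


def infer_format(path: Union[Path, str]) -> Optional[str]:
    """Return the first format whose signature matches the file header, else None."""
    # Read just enough bytes to cover the longest signature, and read them once.
    max_len = max(len(sig) for sigs in SIGNATURES.values() for sig in sigs)
    try:
        with open(path, "rb") as f:
            head = f.read(max_len)
    except OSError:
        return None
    # Every candidate format is tested against the same in-memory header.
    for name, sigs in SIGNATURES.items():
        if any(head.startswith(sig) for sig in sigs):
            return name
    return None
```

Reading the header a single time keeps detection at one file open regardless of how many formats are registered.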

An ExtractManager class provides higher-level caching integration: it computes a hashed output path from the input file's absolute path and only extracts if the output does not already exist.
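
The caching behavior can be illustrated with a short sketch. It assumes SHA-256 as the hash and a caller-supplied extract_fn; both are stand-ins chosen for the example, since the hashing helper that ExtractManager actually uses is not shown on this page.

```python
import hashlib
import os
from typing import Callable


def cached_extract(
    input_path: str,
    extract_dir: str,
    extract_fn: Callable[[str, str], None],
) -> str:
    """Extract input_path into extract_dir/<hash> unless that output already exists."""
    # The output name depends only on the input's absolute path (assumed: SHA-256).
    abs_path = os.path.abspath(input_path)
    digest = hashlib.sha256(abs_path.encode("utf-8")).hexdigest()
    output_path = os.path.join(extract_dir, digest)
    # Skip the expensive extraction when a previous run already produced the output.
    if not os.path.exists(output_path):
        os.makedirs(extract_dir, exist_ok=True)
        extract_fn(abs_path, output_path)
    return output_path
```

Because the key is derived from the path rather than the file contents, replacing the file at the same location would not invalidate the cached output in this sketch.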

Usage

Use this module when downloading and preparing datasets that are distributed in compressed or archived formats. The Extractor class auto-detects the format and extracts the archive contents to a specified output directory. It is used internally by the dataset download and preparation pipeline.

Code Reference

Source Location

src/datasets/utils/extract.py (332 lines)

Signature

class Extractor:
    extractors: dict[str, type[BaseExtractor]] = {
        "tar": TarExtractor,
        "gzip": GzipExtractor,
        "zip": ZipExtractor,
        "xz": XzExtractor,
        "rar": RarExtractor,
        "zstd": ZstdExtractor,
        "bz2": Bzip2Extractor,
        "7z": SevenZipExtractor,
        "lz4": Lz4Extractor,
    }

    @classmethod
    def infer_extractor_format(cls, path: Union[Path, str]) -> Optional[str]: ...

    @classmethod
    def extract(
        cls,
        input_path: Union[Path, str],
        output_path: Union[Path, str],
        extractor_format: str,
    ) -> None: ...

Import

from datasets.utils.extract import Extractor

I/O Contract

Inputs

Name | Type | Description
input_path | Union[Path, str] | Path to the compressed or archived file to extract
output_path | Union[Path, str] | Destination directory or file path for extracted content
extractor_format | str | Format identifier (e.g., "tar", "gzip", "zip")

Outputs

Name | Type | Description
(side effect) | files on disk | Extracted files written to output_path
format name | Optional[str] | From infer_extractor_format: the detected format string, or None if not extractable

Usage Examples

Auto-detecting and extracting an archive:

from datasets.utils.extract import Extractor

# Detect the format from magic bytes
fmt = Extractor.infer_extractor_format("data/train.tar.gz")
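# For a gzip-compressed tarball, fmt is expected to be "tar" rather than "gzip":
# "tar" is checked first, and (assuming detection relies on tarfile.is_tarfile)
# compressed tarballs are recognized as tar. A bare .gz file would yield "gzip".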

# Extract the archive
if fmt:
    Extractor.extract("data/train.tar.gz", "data/extracted/", fmt)

Using ExtractManager for cached extraction:

from datasets.utils.extract import ExtractManager

manager = ExtractManager(cache_dir="/home/user/.cache/huggingface/datasets")

# Extracts only if the output does not already exist
output_path = manager.extract("downloads/dataset.zip")
# output_path == "/home/user/.cache/huggingface/datasets/extracted/<hash>"

Checking if a file is extractable:

from datasets.utils.extract import Extractor

fmt = Extractor.infer_extractor_format("data/plain_text.csv")
if fmt is None:
    print("File is not a recognized archive format")

Related Pages

  • Principle: Archive Extraction -- The design principle governing how compressed and archived dataset files are detected, extracted, and cached during data preparation.
