Implementation:Huggingface Datasets Compression Filesystems

From Leeroopedia
Knowledge Sources
Domains: File_Systems, Compression
Last Updated: 2026-02-14 18:00 GMT

Overview

Fsspec-compatible filesystem classes for transparently reading single-file compressed archives, provided by the HuggingFace Datasets library.

Description

BaseCompressedFileFileSystem is an abstract base class extending fsspec.archive.AbstractArchiveFileSystem that treats a single compressed file as a virtual filesystem containing one uncompressed file. The uncompressed filename is derived by stripping the compression extension from the original filename. The class uses fsspec.open() internally with the appropriate compression codec to transparently decompress data on read.
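The name derivation described above can be sketched in plain Python. This is a minimal illustration rather than the library's code: `uncompressed_name` is a hypothetical helper, and it assumes the name is formed by dropping a known compression extension from the file's base name.

```python
import os


def uncompressed_name(fo: str, extensions: list[str]) -> str:
    """Hypothetical helper: drop a known compression extension from a path."""
    name = os.path.basename(fo)
    for ext in extensions:
        if name.endswith(ext):
            return name[: -len(ext)]
    # No known extension: keep the name unchanged.
    return name


print(uncompressed_name("data/train.csv.gz", [".gz", ".gzip"]))  # train.csv
```

The single virtual file inside the filesystem is then addressed by this derived name.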

Five concrete subclasses are provided, each setting the protocol, compression, and extensions class attributes for a specific compression format:

  • Bz2FileSystem -- protocol "bz2", extension .bz2
  • GzipFileSystem -- protocol "gzip", extensions .gz, .gzip
  • Lz4FileSystem -- protocol "lz4", extension .lz4
  • XzFileSystem -- protocol "xz", extension .xz
  • ZstdFileSystem -- protocol "zstd", extensions .zst, .zstd

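Each filesystem is read-only and supports only "rb" mode. The _open_with_fsspec attribute is a partial application of fsspec.open with the target path, mode, and compression codec already bound. It also passes trust_env=True, so that proxy settings are read from environment variables, and it avoids URL requoting issues on redirects.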
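The pattern of pre-binding an opener and rejecting any mode other than "rb" can be sketched with the standard library alone. In this sketch `gzip.open` stands in for `fsspec.open`, the class `ReadOnlyGzipView` and the function `demo` are hypothetical names, and the proxy and redirect options are not modeled.

```python
import gzip
import os
import tempfile
from functools import partial


class ReadOnlyGzipView:
    """Hypothetical stand-in: one compressed file exposed read-only."""

    def __init__(self, fo: str):
        self.fo = fo
        # Bind the path and mode once; every later read reuses this opener.
        self._opener = partial(gzip.open, self.fo, mode="rb")

    def open(self, mode: str = "rb"):
        if mode != "rb":
            raise ValueError(f"Only 'rb' is supported, got {mode!r}")
        return self._opener()

    def cat(self) -> bytes:
        with self._opener() as f:
            return f.read()


def demo() -> bytes:
    # Write a small gzip file, then read it back through the view.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train.csv.gz")
        with gzip.open(path, "wb") as f:
            f.write(b"a,b\n1,2\n")
        return ReadOnlyGzipView(path).cat()
```

Binding the arguments at construction time keeps the read paths short, since each one only has to call the stored opener.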

Usage

These filesystem classes are registered with fsspec and used internally by the Datasets library to transparently decompress data files during dataset loading. They enable URL chaining such as gzip://file.txt::http://example.com/file.txt.gz. End users typically do not instantiate them directly.
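To make the shape of a chained URL concrete, the sketch below splits one into its (protocol, path) links. `split_chain` is a hypothetical helper written for illustration; fsspec performs its own parsing internally, and this is not that code.

```python
def split_chain(url: str) -> list[tuple[str, str]]:
    """Hypothetical helper: split a chained URL into (protocol, path) links."""
    links = []
    for segment in url.split("::"):
        protocol, sep, path = segment.partition("://")
        if not sep:
            # No "://" in this segment: treat it as a bare path.
            protocol, path = "", segment
        links.append((protocol, path))
    return links


print(split_chain("gzip://file.txt::http://example.com/file.txt.gz"))
# [('gzip', 'file.txt'), ('http', 'example.com/file.txt.gz')]
```

The first link names the decompressed file and its codec; the second names where the compressed bytes actually live.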

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/filesystems/compression.py
  • Lines: 1-127

Signature

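from typing import Optional

from fsspec.archive import AbstractArchiveFileSystem


class BaseCompressedFileFileSystem(AbstractArchiveFileSystem):
    """Read contents of compressed file as a filesystem with one file inside."""

    root_marker = ""
    protocol: str = None  # protocol prefix used in URLs, e.g. "gzip"
    compression: str = None  # compression codec name passed to fsspec
    extensions: list[str] = None  # filename extensions for this format

    def __init__(
        self, fo: str = "", target_protocol: Optional[str] = None,
        target_options: Optional[dict] = None, **kwargs
    ):
        ...  # body omitted; see the source file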

Key methods:

# Excerpt of methods defined on BaseCompressedFileFileSystem (bodies omitted where marked)

@classmethod
def _strip_protocol(cls, path):
    # compressed file paths are always relative to the archive root
    return super()._strip_protocol(path).lstrip("/")

def _get_dirs(self):
    # Populates dir_cache with the single uncompressed file entry
    ...

def cat(self, path: str):
    # Returns the full uncompressed content as bytes
    ...

def _open(self, path: str, mode: str = "rb", block_size=None,
          autocommit=True, cache_options=None, **kwargs):
    # Opens the compressed file for reading (only 'rb' mode is supported)
    ...
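The single-entry directory cache mentioned for _get_dirs can be pictured as a dictionary with one key. The sketch below is an assumption about its general shape, not the library's implementation: `build_dir_cache` and `strip_leading_slash` are hypothetical helpers, and reporting the compressed file's on-disk size is an illustrative choice.

```python
import os
import tempfile


def strip_leading_slash(path: str) -> str:
    # Paths are relative to the archive root, so "/train.csv" and "train.csv" match.
    return path.lstrip("/")


def build_dir_cache(compressed_path: str, uncompressed_name: str) -> dict:
    """Hypothetical: a cache with one entry, keyed by the uncompressed file name."""
    entry = {
        "name": uncompressed_name,
        "size": os.path.getsize(compressed_path),  # size of the compressed file on disk
        "type": "file",
    }
    return {entry["name"]: entry}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.csv.gz")
    with open(path, "wb") as f:
        f.write(b"placeholder bytes")
    cache = build_dir_cache(path, "train.csv")
    print(list(cache))  # ['train.csv']
```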

Concrete subclasses:

class Bz2FileSystem(BaseCompressedFileFileSystem):
    protocol = "bz2"
    compression = "bz2"
    extensions = [".bz2"]

class GzipFileSystem(BaseCompressedFileFileSystem):
    protocol = "gzip"
    compression = "gzip"
    extensions = [".gz", ".gzip"]

class Lz4FileSystem(BaseCompressedFileFileSystem):
    protocol = "lz4"
    compression = "lz4"
    extensions = [".lz4"]

class XzFileSystem(BaseCompressedFileFileSystem):
    protocol = "xz"
    compression = "xz"
    extensions = [".xz"]

class ZstdFileSystem(BaseCompressedFileFileSystem):
    protocol = "zstd"
    compression = "zstd"
    extensions = [".zst", ".zstd"]

Import

from datasets.filesystems.compression import (
    Bz2FileSystem,
    GzipFileSystem,
    Lz4FileSystem,
    XzFileSystem,
    ZstdFileSystem,
)

I/O Contract

Inputs

Name | Type | Required | Description
fo | str | No | Path or URL of the compressed file. Defaults to "". With fsspec URL chaining (e.g., "gzip://file.txt::http://host/file.txt.gz"), fsspec supplies the target URL as fo.
target_protocol | Optional[str] | No | Override the filesystem protocol inferred from the URL.
target_options | Optional[dict] | No | Additional keyword arguments passed when instantiating the target filesystem.

Outputs

Name | Type | Description
(from cat) | bytes | The full uncompressed content of the file.
(from _open) | file-like object | A file-like object for reading the uncompressed data in binary mode.
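Because both outputs are binary, text has to be decoded by the caller. The sketch below shows the two consumption styles using the standard library's `gzip` module as a stand-in for the filesystem's binary file object; `read_bytes` and `read_lines` are hypothetical helper names, and UTF-8 is an assumed encoding.

```python
import gzip
import io

# In-memory gzip payload standing in for a compressed data file.
raw = gzip.compress("id,label\n1,cat\n2,dog\n".encode("utf-8"))


def read_bytes(blob: bytes) -> bytes:
    # Equivalent in spirit to cat(): the whole uncompressed content at once.
    with gzip.open(io.BytesIO(blob), "rb") as f:
        return f.read()


def read_lines(blob: bytes, encoding: str = "utf-8") -> list[str]:
    # Equivalent in spirit to _open(): wrap the binary stream to iterate text lines.
    with gzip.open(io.BytesIO(blob), "rb") as binary:
        text = io.TextIOWrapper(binary, encoding=encoding)
        return [line.rstrip("\n") for line in text]


print(read_lines(raw))  # ['id,label', '1,cat', '2,dog']
```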

Supported Compression Formats

Class | Protocol | Compression | Extensions
Bz2FileSystem | bz2 | bz2 | .bz2
GzipFileSystem | gzip | gzip | .gz, .gzip
Lz4FileSystem | lz4 | lz4 | .lz4
XzFileSystem | xz | xz | .xz
ZstdFileSystem | zstd | zstd | .zst, .zstd
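The table above can be turned into a lookup from file extension to protocol, which is one way a chained URL could be assembled for a compressed file. This is an illustrative sketch only: `FORMATS`, `protocol_for`, and `chained_url` are hypothetical names, and nothing here is claimed to match how the Datasets library builds its URLs.

```python
import os

# Protocol -> extensions, copied from the table above.
FORMATS = {
    "bz2": [".bz2"],
    "gzip": [".gz", ".gzip"],
    "lz4": [".lz4"],
    "xz": [".xz"],
    "zstd": [".zst", ".zstd"],
}

EXTENSION_TO_PROTOCOL = {ext: proto for proto, exts in FORMATS.items() for ext in exts}


def protocol_for(filename: str):
    """Return the protocol for a file name, or None if it is not a known format."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_PROTOCOL.get(ext)


def chained_url(url: str) -> str:
    """Prefix a URL with '<protocol>://<inner name>::' when the extension is known."""
    proto = protocol_for(url)
    if proto is None:
        return url
    base = url.rsplit("/", 1)[-1]
    inner = os.path.splitext(base)[0]
    return f"{proto}://{inner}::{url}"


print(chained_url("http://example.com/data/train.csv.gz"))
# gzip://train.csv::http://example.com/data/train.csv.gz
```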

Usage Examples

Basic Usage

from datasets.filesystems.compression import GzipFileSystem

# Open a gzip-compressed file as a virtual filesystem
fs = GzipFileSystem("data/train.csv.gz")

# Read the full uncompressed content (returned as bytes)
content = fs.cat("train.csv")

# List file names in the virtual filesystem; detail=False returns names instead of info dicts
files = fs.ls("", detail=False)
print(files)  # ['train.csv']

URL Chaining

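import fsspec

import datasets.filesystems  # noqa: F401  (importing this registers the compression protocols with fsspec)

# Read a remote gzip file via URL chaining.
# fsspec resolves the "gzip" link to GzipFileSystem and hands it the HTTP URL as `fo`.
with fsspec.open("gzip://train.csv::http://example.com/data/train.csv.gz", "rb") as f:
    content = f.read()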

Related Pages

Implements Principle
