Implementation:Huggingface Datasets Compression Filesystems
| Knowledge Sources | |
|---|---|
| Domains | File_Systems, Compression |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Fsspec-compatible filesystem classes for transparently reading single-file compressed archives, provided by the HuggingFace Datasets library.
Description
BaseCompressedFileFileSystem is an abstract base class extending fsspec.archive.AbstractArchiveFileSystem that treats a single compressed file as a virtual filesystem containing one uncompressed file. The uncompressed filename is derived by stripping the compression extension from the original filename. The class uses fsspec.open() internally with the appropriate compression codec to transparently decompress data on read.
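As a rough illustration of the naming rule described above, the sketch below derives the inner filename from a compressed path. The helper name `uncompressed_name` is hypothetical, and cutting at the last dot is an assumption about how the extension is removed; the library's own logic may differ in detail.

```python
import os


def uncompressed_name(fo: str) -> str:
    """Guess the name of the single file exposed for a compressed path."""
    # Take the last path component of the compressed file.
    compressed = os.path.basename(fo)
    # Assumption: the compression extension is whatever follows the last dot.
    if "." in compressed:
        return compressed[: compressed.rindex(".")]
    # No extension to strip, so the name is used unchanged.
    return compressed


print(uncompressed_name("data/train.csv.gz"))
print(uncompressed_name("archive.bz2"))
```

Under this rule a multi-part name such as train.csv.gz loses only its final extension, so the virtual file keeps the .csv suffix.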
Five concrete subclasses are provided, each setting the protocol, compression, and extensions class attributes for a specific compression format:
- Bz2FileSystem -- protocol "bz2", extension .bz2
- GzipFileSystem -- protocol "gzip", extensions .gz, .gzip
- Lz4FileSystem -- protocol "lz4", extension .lz4
- XzFileSystem -- protocol "xz", extension .xz
- ZstdFileSystem -- protocol "zstd", extensions .zst, .zstd
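Each filesystem is read-only and supports only "rb" mode. The _open_with_fsspec partial forwards client options to the underlying HTTP client: trust_env=True so that proxy settings from the environment are honored, and a setting that disables URL requoting to avoid issues on redirects.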
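The read-only contract can be pictured with a small standard-library sketch. `SingleFileGzipView` is a hypothetical stand-in written for this page, not the library class: it exposes one decompressed stream and rejects every mode other than "rb".

```python
import gzip
import os
import tempfile


class SingleFileGzipView:
    """Hypothetical read-only view over one gzip file (not the datasets class)."""

    def __init__(self, fo: str):
        self.fo = fo

    def open(self, mode: str = "rb"):
        # Mirror the documented contract: binary reads only.
        if mode != "rb":
            raise ValueError(f"Only 'rb' mode is supported, got {mode!r}")
        return gzip.open(self.fo, "rb")


# Build a small gzip file to read back.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "train.csv.gz")
with gzip.open(gz_path, "wb") as f:
    f.write(b"id,label\n1,cat\n")

view = SingleFileGzipView(gz_path)
with view.open("rb") as f:
    data = f.read()

# Any write attempt is refused before the file is touched.
try:
    view.open("wb")
    write_rejected = False
except ValueError:
    write_rejected = True
```

Text-mode reading is not shown here; callers that need text would wrap the binary stream themselves.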
Usage
These filesystem classes are registered with fsspec and used internally by the Datasets library to transparently decompress data files during dataset loading. They enable URL chaining such as gzip://file.txt::http://example.com/file.txt.gz. End users typically do not instantiate them directly.
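To make the chained URL form concrete, the sketch below splits one into its layers using plain string operations. It only illustrates the `::` syntax; fsspec performs its own parsing, and the helper name `split_chain` is hypothetical.

```python
def split_chain(url: str) -> list[tuple[str, str]]:
    """Split a chained URL into (protocol, path) pairs, outermost layer first."""
    layers = []
    for link in url.split("::"):
        # Each link looks like "protocol://path"; a bare path has no protocol.
        protocol, sep, path = link.partition("://")
        layers.append((protocol, path) if sep else ("", link))
    return layers


layers = split_chain("gzip://file.txt::http://example.com/file.txt.gz")
print(layers)
```

The first layer names the decompression protocol and the file to expose; the second names where the compressed bytes come from.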
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/filesystems/compression.py
- Lines: 1-127
Signature
class BaseCompressedFileFileSystem(AbstractArchiveFileSystem):
    """Read contents of compressed file as a filesystem with one file inside."""

    root_marker = ""
    protocol: str = None
    compression: str = None
    extensions: list[str] = None

    def __init__(
        self, fo: str = "", target_protocol: Optional[str] = None,
        target_options: Optional[dict] = None, **kwargs
    ):
        ...
Key methods:
@classmethod
def _strip_protocol(cls, path):
    # compressed file paths are always relative to the archive root
    return super()._strip_protocol(path).lstrip("/")

def _get_dirs(self):
    # Populates dir_cache with single uncompressed file entry
    ...

def cat(self, path: str):
    # Returns full uncompressed content as bytes
    ...

def _open(self, path: str, mode: str = "rb", block_size=None,
          autocommit=True, cache_options=None, **kwargs):
    # Opens the compressed file for reading (only 'rb' mode supported)
    ...
Concrete subclasses:
class Bz2FileSystem(BaseCompressedFileFileSystem):
    protocol = "bz2"
    compression = "bz2"
    extensions = [".bz2"]

class GzipFileSystem(BaseCompressedFileFileSystem):
    protocol = "gzip"
    compression = "gzip"
    extensions = [".gz", ".gzip"]

class Lz4FileSystem(BaseCompressedFileFileSystem):
    protocol = "lz4"
    compression = "lz4"
    extensions = [".lz4"]

class XzFileSystem(BaseCompressedFileFileSystem):
    protocol = "xz"
    compression = "xz"
    extensions = [".xz"]

class ZstdFileSystem(BaseCompressedFileFileSystem):
    protocol = "zstd"
    compression = "zstd"
    extensions = [".zst", ".zstd"]
Import
from datasets.filesystems.compression import (
    Bz2FileSystem,
    GzipFileSystem,
    Lz4FileSystem,
    XzFileSystem,
    ZstdFileSystem,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| fo | str | No | Path to the compressed file. Supports fsspec URL chaining (e.g., "gzip://file.txt::http://host/file.txt.gz"). Defaults to "". |
| target_protocol | Optional[str] | No | Override the filesystem protocol inferred from the URL. |
| target_options | Optional[dict] | No | Additional keyword arguments passed when instantiating the target filesystem. |
Outputs
| Name | Type | Description |
|---|---|---|
| (from cat) | bytes | The full uncompressed content of the file. |
| (from _open) | file-like object | A file-like object for reading the uncompressed data in binary mode. |
Supported Compression Formats
| Class | Protocol | Compression | Extensions |
|---|---|---|---|
| Bz2FileSystem | bz2 | bz2 | .bz2 |
| GzipFileSystem | gzip | gzip | .gz, .gzip |
| Lz4FileSystem | lz4 | lz4 | .lz4 |
| XzFileSystem | xz | xz | .xz |
| ZstdFileSystem | zstd | zstd | .zst, .zstd |
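The table above can also be read as a lookup from file extension to protocol. The following sketch builds that lookup from the listed values; `FORMATS`, `EXTENSION_TO_PROTOCOL`, and `protocol_for` are hypothetical names chosen for illustration, not part of the library's API.

```python
from typing import Optional

# Values copied from the table of supported formats.
FORMATS = {
    "bz2": [".bz2"],
    "gzip": [".gz", ".gzip"],
    "lz4": [".lz4"],
    "xz": [".xz"],
    "zstd": [".zst", ".zstd"],
}

# Invert the mapping so each extension points at its protocol.
EXTENSION_TO_PROTOCOL = {
    ext: protocol for protocol, exts in FORMATS.items() for ext in exts
}


def protocol_for(filename: str) -> Optional[str]:
    """Return the protocol whose extension matches the filename, if any."""
    for ext, protocol in EXTENSION_TO_PROTOCOL.items():
        if filename.endswith(ext):
            return protocol
    return None


print(protocol_for("train.csv.gz"))
print(protocol_for("notes.txt"))
```

A lookup like this could be used to decide which protocol prefix to place in front of a chained URL, assuming the file extension reflects the actual compression format.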
Usage Examples
Basic Usage
from datasets.filesystems.compression import GzipFileSystem

# Open a gzip-compressed file as a virtual filesystem
fs = GzipFileSystem("data/train.csv.gz")

# Read the full uncompressed content
content = fs.cat("train.csv")

# List files in the virtual filesystem
# (detail=False returns names only; the default returns info dicts)
files = fs.ls("", detail=False)
print(files)  # ['train.csv']
URL Chaining
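import fsspec
import datasets  # importing datasets registers the compression protocols with fsspec

# Read a remote gzip file via URL chaining:
# "gzip://train.csv" names the file to expose, the part after "::" is the compressed source
with fsspec.open("gzip://train.csv::http://example.com/data/train.csv.gz", "rb") as f:
    content = f.read()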