Implementation:Huggingface Datatrove ZstdMediaWriter

From Leeroopedia
Knowledge Sources
Domains Media Processing, Data Compression
Last Updated 2026-02-14 17:00 GMT

Overview

ZstdWriter is a media writer that compresses and stores binary media content using the Zstandard compression algorithm in a magicless frame format, producing files that can be efficiently read back via offset-based random access.

Description

The ZstdWriter class extends BaseMediaWriter to provide Zstandard-compressed media storage. It writes each media record as an independently decompressible frame within a binary output file, with the default filename pattern ${rank}.bin.zst and a default maximum file size of 5GB.

The compression is configured through a lazily-initialized ZstdCompressor property that uses FORMAT_ZSTD1_MAGICLESS (omitting the standard zstd magic bytes to save space), disables checksums (write_checksum=0), and omits content size headers (write_content_size=0). These settings optimize for storage density when writing many small records. The compression level is configurable (default: 3), balancing compression ratio with speed.

The _write method records the starting byte offset, opens a streaming zstd writer on the file handler (with closefd=False to prevent closing the shared file handle), writes the media bytes as a complete frame, and returns the tuple of (filename, start_offset, compressed_size). This compressed_size is the total number of bytes written to the file for this record, which is essential for the corresponding ZstdReader to know exactly how many compressed bytes to read back.

Usage

Use ZstdWriter when you need to store media data with compression for space efficiency while maintaining the ability to randomly access individual records. It pairs naturally with ZstdReader for reading the stored data back. It is well suited for large-scale media processing pipelines where storage cost is a concern.

Code Reference

Source Location

Signature

class ZstdWriter(BaseMediaWriter):
    default_output_filename: str = "${rank}.bin.zst"
    name = "🗜️ - Binary Zstd"

    def __init__(
        self,
        output_folder: DataFolderLike,
        output_filename: str | None = None,
        max_file_size: int = 5 * 2**30,  # 5GB
        compression_level: int = 3,
    ):
        ...

    @property
    def compressor(self):
        ...

    def _write(self, media: Media, file_handler: IO, filename: str):
        ...

    def close(self):
        ...

Import

from datatrove.pipeline.media.media_writers.zstd import ZstdWriter

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Destination path for compressed output files |
| output_filename | str | No | Template string for output filenames (default: "${rank}.bin.zst") |
| max_file_size | int | No | Maximum file size in bytes before splitting (default: 5GB) |
| compression_level | int | No | Zstandard compression level; higher means better ratio but slower (default: 3) |
| media | Media | Yes (via _write) | A Media object with non-None media_bytes to compress and write |

Outputs

| Name | Type | Description |
|---|---|---|
| result | tuple[str, int, int] | Tuple of (filename, byte_offset, compressed_size) identifying the written record's location |

Usage Examples

Basic Usage

from datatrove.pipeline.media.media_writers.zstd import ZstdWriter

# Write compressed media with default settings
zstd_writer = ZstdWriter(
    output_folder="/output/compressed-media/",
    max_file_size=5 * 2**30,  # 5GB per file
    compression_level=3,
)

# Use in a pipeline
pipeline = [
    # ... media reader step ...
    # ... optional filter steps ...
    zstd_writer,
    # ... document writer to save updated metadata ...
]

Custom Compression Level

from datatrove.pipeline.media.media_writers.zstd import ZstdWriter

# Higher compression for archival storage
archival_writer = ZstdWriter(
    output_folder="s3://archive-bucket/media/",
    compression_level=9,
    max_file_size=10 * 2**30,  # 10GB per file
)
