Implementation: Huggingface Datatrove ZstdWriter
| Knowledge Sources | |
|---|---|
| Domains | Media Processing, Data Compression |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
ZstdWriter is a media writer that compresses and stores binary media content using the Zstandard compression algorithm in a magicless frame format, producing files that can be efficiently read back via offset-based random access.
Description
The ZstdWriter class extends BaseMediaWriter to provide Zstandard-compressed media storage. It writes each media record as an independently decompressible frame within a binary output file, with the default filename pattern ${rank}.bin.zst and a default maximum file size of 5GB.
The compression is configured through a lazily-initialized ZstdCompressor property that uses FORMAT_ZSTD1_MAGICLESS (omitting the standard zstd magic bytes to save space), disables checksums (write_checksum=0), and omits content size headers (write_content_size=0). These settings optimize for storage density when writing many small records. The compression level is configurable (default: 3), balancing compression ratio with speed.
The _write method records the starting byte offset, opens a streaming zstd writer on the file handler (with closefd=False to prevent closing the shared file handle), writes the media bytes as a complete frame, and returns the tuple of (filename, start_offset, compressed_size). This compressed_size is the total number of bytes written to the file for this record, which is essential for the corresponding ZstdReader to know exactly how many compressed bytes to read back.
Usage
Use ZstdWriter when you need to store media data with compression for space efficiency while maintaining the ability to randomly access individual records. It pairs naturally with ZstdReader for reading the stored data back. It is well suited for large-scale media processing pipelines where storage cost is a concern.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/media/media_writers/zstd.py
- Lines: 1-53
Signature
class ZstdWriter(BaseMediaWriter):
    default_output_filename: str = "${rank}.bin.zst"
    name = "🗜️ - Binary Zstd"

    def __init__(
        self,
        output_folder: DataFolderLike,
        output_filename: str = None,
        max_file_size: int = 5 * 2**30,  # 5GB
        compression_level: int = 3,
    ):
        ...

    @property
    def compressor(self):
        ...

    def _write(self, media: Media, file_handler: IO, filename: str):
        ...

    def close(self):
        ...
Import
from datatrove.pipeline.media.media_writers.zstd import ZstdWriter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Destination path for compressed output files |
| output_filename | str | No | Template string for output filenames (default: "${rank}.bin.zst") |
| max_file_size | int | No | Maximum file size in bytes before splitting (default: 5GB) |
| compression_level | int | No | Zstandard compression level; higher means better ratio but slower (default: 3) |
| media | Media | Yes (via _write) | A Media object with non-None media_bytes to compress and write |
Outputs
| Name | Type | Description |
|---|---|---|
| result | tuple[str, int, int] | Tuple of (filename, byte_offset, compressed_size) identifying the written record's location |
Usage Examples
Basic Usage
from datatrove.pipeline.media.media_writers.zstd import ZstdWriter
# Write compressed media with default settings
zstd_writer = ZstdWriter(
    output_folder="/output/compressed-media/",
    max_file_size=5 * 2**30,  # 5GB per file
    compression_level=3,
)

# Use in a pipeline
pipeline = [
    # ... media reader step ...
    # ... optional filter steps ...
    zstd_writer,
    # ... document writer to save updated metadata ...
]
Custom Compression Level
from datatrove.pipeline.media.media_writers.zstd import ZstdWriter
# Higher compression for archival storage
archival_writer = ZstdWriter(
    output_folder="s3://archive-bucket/media/",
    compression_level=9,
    max_file_size=10 * 2**30,  # 10GB per file
)