Implementation: Huggingface Datatrove JsonlWriter
| Sources | Domains | Last Updated |
|---|---|---|
| Huggingface Datatrove | Data_Output, Serialization | 2026-02-14 |
Overview
Concrete writer class that serializes Document objects to JSONL files on local or remote storage using orjson for high-performance JSON encoding.
Description
JsonlWriter extends DiskWriter to produce line-delimited JSON output. Each document is adapted to a dictionary (via a configurable adapter function), then serialized with orjson.dumps using the OPT_APPEND_NEWLINE option so each record occupies exactly one line. When save_media_bytes is enabled, any media bytes present are base64-encoded before writing. The writer inherits filename templating, max file size splitting, and output file management from DiskWriter.
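The per-document write path can be sketched as follows. This is an illustration only: it uses the standard-library json module in place of orjson, and the default_adapter shown here is a hypothetical stand-in that mirrors the record shape described above, not the library's actual implementation.

```python
import base64
import io
import json


def default_adapter(doc: dict) -> dict:
    # Hypothetical adapter: maps a document to the output record shape
    # (text, id, media, metadata) described in this article.
    return {
        "text": doc["text"],
        "id": doc["id"],
        "media": [base64.b64encode(m).decode("ascii") for m in doc.get("media", [])],
        "metadata": doc.get("metadata", {}),
    }


def write_jsonl(documents, fh, adapter=default_adapter):
    # One JSON object per line, emulating
    # orjson.dumps(..., option=orjson.OPT_APPEND_NEWLINE).
    for doc in documents:
        fh.write(json.dumps(adapter(doc), ensure_ascii=False) + "\n")


if __name__ == "__main__":
    buf = io.StringIO()
    docs = [
        {"text": "hello", "id": "0", "metadata": {"lang": "en"}},
        {"text": "world", "id": "1", "media": [b"\x00\x01"]},
    ]
    write_jsonl(docs, buf)
    print(buf.getvalue().count("\n"))  # → 2 (one line per document)
```

The real writer delegates file handling (compression, rotation at max_file_size) to DiskWriter; only the adapt-and-serialize step is shown here.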
Usage
Place as the final step in a datatrove pipeline to persist processed documents. Configure output_folder to a local path or remote storage (S3, GCS) and optionally set compression, expand_metadata, or a custom adapter function.
Code Reference
Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/writers/jsonl.py (L8-50)
Signature:
class JsonlWriter(DiskWriter):
    def __init__(
        self,
        output_folder: DataFolderLike,
        output_filename: str = None,
        compression: str | None = "gzip",
        adapter: Callable = None,
        expand_metadata: bool = False,
        max_file_size: int = -1,
        save_media_bytes: bool = False,
    ):
Import:
from datatrove.pipeline.writers import JsonlWriter
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Local path, remote URI, or DataFolder where JSONL files are written |
| output_filename | str | No | Filename template with placeholders (default ${rank}.jsonl) |
| compression | str or None | No | Compression scheme; default "gzip". Set to None for no compression |
| adapter | Callable | No | Custom function to transform Document to output dict |
| expand_metadata | bool | No | If True, flatten metadata keys to top-level fields (default False) |
| max_file_size | int | No | Max bytes per file; -1 for unlimited (default -1) |
| save_media_bytes | bool | No | If True, include base64-encoded media bytes in output (default False) |
Outputs:
- JSONL files on disk or remote storage, one JSON object per line
- Each line contains text, id, media, and metadata fields (or expanded metadata fields)
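The effect of expand_metadata on the output record can be illustrated with a small sketch. The to_record helper below is hypothetical, based on the behavior described above (metadata keys promoted to top-level fields when the flag is set), not the library's actual code.

```python
import json


def to_record(doc: dict, expand_metadata: bool = False) -> dict:
    # Sketch of the output-record shape: with expand_metadata=True,
    # metadata keys become top-level fields; otherwise they stay
    # nested under a "metadata" key.
    record = {"text": doc["text"], "id": doc["id"]}
    metadata = doc.get("metadata", {})
    if expand_metadata:
        record.update(metadata)
    else:
        record["metadata"] = metadata
    return record


doc = {"text": "hello", "id": "doc-1", "metadata": {"language": "en", "score": 0.9}}
print(json.dumps(to_record(doc)))
print(json.dumps(to_record(doc, expand_metadata=True)))
```

Note that with expand_metadata=True, a metadata key named text or id would collide with the document's own fields, so flat metadata schemas are safest with this option.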
Usage Examples
Example 1 -- Basic JSONL output with gzip:
from datatrove.pipeline.writers import JsonlWriter
writer = JsonlWriter(
    output_folder="s3://my-bucket/output/",
    compression="gzip",
)
Example 2 -- Expanded metadata with custom filename:
from datatrove.pipeline.writers import JsonlWriter
writer = JsonlWriter(
    output_folder="/data/output",
    output_filename="${rank}_${language}.jsonl",
    expand_metadata=True,
    compression=None,
)
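Downstream consumers can read the resulting files with standard tooling. For example, a gzip-compressed shard in the format described above can be parsed line by line with the standard library alone (this round-trip sketch is independent of datatrove):

```python
import gzip
import io
import json


def read_jsonl_gz(raw: bytes):
    # Decompress and yield one parsed record per line.
    with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)


# Round-trip demo: write two records in gzip-compressed JSONL, read them back.
payload = io.BytesIO()
with gzip.open(payload, mode="wt", encoding="utf-8") as fh:
    fh.write(json.dumps({"id": "0", "text": "hello"}) + "\n")
    fh.write(json.dumps({"id": "1", "text": "world"}) + "\n")

records = list(read_jsonl_gz(payload.getvalue()))
print([r["id"] for r in records])  # → ['0', '1']
```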