Implementation: Huggingface Datatrove JsonlWriter
| Sources | Domains | Last Updated |
|---|---|---|
| Huggingface Datatrove | Data_Output, Serialization | 2026-02-14 |
Overview
Concrete writer class that serializes Document objects to JSONL files on local or remote storage using orjson for high-performance JSON encoding.
Description
JsonlWriter extends DiskWriter to produce line-delimited JSON output. Each document is adapted to a dictionary (via a configurable adapter function), then serialized with orjson.dumps using the OPT_APPEND_NEWLINE option so each record occupies exactly one line. When save_media_bytes is enabled, any media bytes present are base64-encoded before writing. The writer inherits filename templating, max file size splitting, and output file management from DiskWriter.
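The per-document write path can be sketched as follows. This is an illustration only: it uses the standard-library json module in place of orjson, and the default_adapter shown here is a hypothetical stand-in that mirrors the record shape described above, not the library's actual implementation.

```python
import base64
import io
import json


def default_adapter(doc: dict) -> dict:
    # Hypothetical adapter: maps a document to the output record shape
    # (text, id, media, metadata) described in this article.
    return {
        "text": doc["text"],
        "id": doc["id"],
        "media": [base64.b64encode(m).decode("ascii") for m in doc.get("media", [])],
        "metadata": doc.get("metadata", {}),
    }


def write_jsonl(documents, fh, adapter=default_adapter):
    # One JSON object per line, emulating
    # orjson.dumps(..., option=orjson.OPT_APPEND_NEWLINE).
    for doc in documents:
        fh.write(json.dumps(adapter(doc), ensure_ascii=False) + "\n")


if __name__ == "__main__":
    buf = io.StringIO()
    docs = [
        {"text": "hello", "id": "0", "metadata": {"lang": "en"}},
        {"text": "world", "id": "1", "media": [b"\x00\x01"]},
    ]
    write_jsonl(docs, buf)
    print(buf.getvalue().count("\n"))  # → 2 (one line per document)
```

The real writer delegates file handling (compression, rotation at max_file_size) to DiskWriter; only the adapt-and-serialize step is shown here.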
Usage
Place as the final step in a datatrove pipeline to persist processed documents. Configure output_folder to a local path or remote storage (S3, GCS) and optionally set compression, expand_metadata, or a custom adapter function.
Code Reference
Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/writers/jsonl.py (L8-50)
Signature:
class JsonlWriter(DiskWriter):
    def __init__(
        self,
        output_folder: DataFolderLike,
        output_filename: str = None,
        compression: str | None = "gzip",
        adapter: Callable = None,
        expand_metadata: bool = False,
        max_file_size: int = -1,
        save_media_bytes: bool = False,
    ):
Import:
from datatrove.pipeline.writers import JsonlWriter
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Local path, remote URI, or DataFolder where JSONL files are written |
| output_filename | str | No | Filename template with placeholders (default ${rank}.jsonl) |
| compression | str or None | No | Compression scheme; default "gzip". Set to None for no compression |
| adapter | Callable | No | Custom function to transform Document to output dict |
| expand_metadata | bool | No | If True, flatten metadata keys to top-level fields (default False) |
| max_file_size | int | No | Max bytes per file; -1 for unlimited (default -1) |
| save_media_bytes | bool | No | If True, include base64-encoded media bytes in output (default False) |
Outputs:
- JSONL files on disk or remote storage, one JSON object per line
- Each line contains text, id, media, and metadata fields (or expanded metadata fields)
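The effect of expand_metadata on the output record can be illustrated with a small sketch. The to_record helper below is hypothetical, based on the behavior described above (metadata keys promoted to top-level fields when the flag is set), not the library's actual code.

```python
import json


def to_record(doc: dict, expand_metadata: bool = False) -> dict:
    # Sketch of the output-record shape: with expand_metadata=True,
    # metadata keys become top-level fields; otherwise they stay
    # nested under a "metadata" key.
    record = {"text": doc["text"], "id": doc["id"]}
    metadata = doc.get("metadata", {})
    if expand_metadata:
        record.update(metadata)
    else:
        record["metadata"] = metadata
    return record


doc = {"text": "hello", "id": "doc-1", "metadata": {"language": "en", "score": 0.9}}
print(json.dumps(to_record(doc)))
print(json.dumps(to_record(doc, expand_metadata=True)))
```

Note that with expand_metadata=True, a metadata key named text or id would collide with the document's own fields, so flat metadata schemas are safest with this option.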
Usage Examples
Example 1 -- Basic JSONL output with gzip:
from datatrove.pipeline.writers import JsonlWriter
writer = JsonlWriter(
    output_folder="s3://my-bucket/output/",
    compression="gzip",
)
Example 2 -- Expanded metadata with custom filename:
from datatrove.pipeline.writers import JsonlWriter
writer = JsonlWriter(
    output_folder="/data/output",
    output_filename="${rank}_${language}.jsonl",
    expand_metadata=True,
    compression=None,
)
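Downstream consumers can read the resulting files with standard tooling. For example, a gzip-compressed shard in the format described above can be parsed line by line with the standard library alone (this round-trip sketch is independent of datatrove):

```python
import gzip
import io
import json


def read_jsonl_gz(raw: bytes):
    # Decompress and yield one parsed record per line.
    with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)


# Round-trip demo: write two records in gzip-compressed JSONL, read them back.
payload = io.BytesIO()
with gzip.open(payload, mode="wt", encoding="utf-8") as fh:
    fh.write(json.dumps({"id": "0", "text": "hello"}) + "\n")
    fh.write(json.dumps({"id": "1", "text": "world"}) + "\n")

records = list(read_jsonl_gz(payload.getvalue()))
print([r["id"] for r in records])  # → ['0', '1']
```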