
Implementation:Huggingface Datatrove JsonlReader

From Leeroopedia
Knowledge Sources
Domains Data_Ingestion, NLP_Data_Processing
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool, provided by the datatrove library, for reading JSONL (JSON Lines) files. JsonlReader extends BaseDiskReader to stream line-delimited JSON files from disk or remote storage, converting each JSON line into a Document object for pipeline processing.

Description

JsonlReader is a pipeline reader component that processes JSONL files where each line contains a JSON object representing a single document. It parses each line independently, extracts the text content and identifier fields, and assembles a Document object with associated metadata.

Key capabilities include:

  • Streaming line-by-line parsing: Each line is parsed as an independent JSON object, enabling constant-memory processing of arbitrarily large files
  • Compression support: Handles gzip and zstd compressed JSONL files, with automatic detection via "infer" mode
  • Configurable field mapping: The text_key and id_key parameters specify which JSON fields map to the Document text and identifier
  • File path annotation: The add_file_path parameter (default True) attaches the source file path to each Document's metadata
  • Glob patterns: Supports file glob patterns to select specific JSONL files from a directory
  • Shard-based parallel reading: Files can be distributed across multiple workers for parallel ingestion
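The streaming behavior above can be sketched in plain Python. This is a minimal, stdlib-only illustration of how line-delimited JSON is parsed one record at a time with compression inferred from the file extension; `JsonlReader` itself additionally handles remote storage, zstd, glob patterns, and sharding:

```python
import gzip
import json
import tempfile
from pathlib import Path

def stream_jsonl(path, text_key="text", id_key="id"):
    """Yield one parsed record per line; only one line is held in memory at a time."""
    # Infer compression from the file extension, as the "infer" default does
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            data = json.loads(line)  # each line is an independent JSON object
            yield {"text": data[text_key], "id": data[id_key]}

# Demo: write a small gzip-compressed JSONL file and stream it back
tmp = Path(tempfile.mkdtemp()) / "sample.jsonl.gz"
with gzip.open(tmp, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"text": "hello", "id": "doc-1"}) + "\n")
    f.write(json.dumps({"text": "world", "id": "doc-2"}) + "\n")

docs = list(stream_jsonl(tmp))
print(docs[0]["text"], docs[1]["id"])  # hello doc-2
```

Because each record is a full JSON document on its own line, memory use stays constant regardless of file size.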

Usage

Import and use JsonlReader when loading preprocessed text datasets stored in JSONL format into a datatrove pipeline. It is commonly used to read intermediate pipeline outputs, cleaned document collections, or externally provided datasets in JSONL format.

Code Reference

Source Location

  • Repository: datatrove
  • File: src/datatrove/pipeline/readers/jsonl.py
  • Lines: L9-94 (JsonlReader class and methods)

Signature

class JsonlReader(BaseDiskReader):
    def __init__(
        self,
        data_folder: DataFolderLike,
        paths_file: DataFileLike | None = None,
        compression: Literal["infer", "gzip", "zstd"] | None = "infer",
        limit: int = -1,
        skip: int = 0,
        file_progress: bool = False,
        doc_progress: bool = False,
        adapter: Callable = None,
        text_key: str = "text",
        id_key: str = "id",
        default_metadata: dict = None,
        recursive: bool = True,
        glob_pattern: str | None = None,
        shuffle_files: bool = False,
        add_file_path: bool = True,
    ):

Import

from datatrove.pipeline.readers import JsonlReader

I/O Contract

Inputs

Name Type Required Description
data_folder DataFolderLike Yes Path or data folder object pointing to JSONL files
paths_file DataFileLike No File listing specific JSONL file paths to read
compression Literal["infer", "gzip", "zstd"] or None No (default: "infer") Compression scheme for JSONL files; "infer" auto-detects from file extension
limit int No (default: -1) Maximum number of documents to read; -1 for unlimited
skip int No (default: 0) Number of documents to skip from the beginning
file_progress bool No (default: False) Show progress bar for files processed
doc_progress bool No (default: False) Show progress bar for documents processed
adapter Callable No (default: None) Custom function to transform raw JSON data before Document creation
text_key str No (default: "text") JSON key containing the document text content
id_key str No (default: "id") JSON key containing the document identifier
default_metadata dict No (default: None) Default metadata to attach to every Document
recursive bool No (default: True) Recursively search subdirectories for JSONL files
glob_pattern str No (default: None) Glob pattern to filter which files to read
shuffle_files bool No (default: False) Shuffle the order of files before reading
add_file_path bool No (default: True) Add the source file path to each Document's metadata

Outputs

Name Type Description
documents Generator[Document] Stream of Document objects, each containing:
  • text - the text content extracted from the JSON object via text_key
  • id - the document identifier extracted via id_key
  • metadata - remaining JSON fields plus optionally the source file_path
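The field mapping in the contract above can be sketched as follows. This is an illustrative approximation of how one JSON line becomes a Document-like record (the field names and placement follow the table above; the real Document class lives in datatrove):

```python
import json

def to_document(line, path, text_key="text", id_key="id",
                default_metadata=None, add_file_path=True):
    """Map one JSON line to a Document-like dict per the I/O contract."""
    data = json.loads(line)
    text = data.pop(text_key)        # text content via text_key
    doc_id = data.pop(id_key)        # identifier via id_key
    metadata = dict(default_metadata or {})
    metadata.update(data)            # remaining JSON fields become metadata
    if add_file_path:
        metadata["file_path"] = path # optional provenance annotation
    return {"text": text, "id": doc_id, "metadata": metadata}

doc = to_document(
    '{"text": "hello", "id": "doc-1", "lang": "en"}',
    path="corpus/part-0.jsonl",
    default_metadata={"source": "external"},
)
```

Note that `default_metadata` entries can be overridden by per-record fields, since the record's remaining keys are merged in afterwards.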

Usage Examples

Reading JSONL Files from a Directory

from datatrove.pipeline.readers import JsonlReader

# Read all JSONL files from a local directory
reader = JsonlReader(
    data_folder="/data/processed-documents/",
    glob_pattern="*.jsonl.gz",
    text_key="text",
    id_key="id",
)

# Use in a pipeline; here the reader feeds a writer to re-shard the corpus
# (minhash deduplication in datatrove is a separate multi-stage pipeline)
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.writers import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader(
            data_folder="/data/cleaned-corpus/",
            glob_pattern="*.jsonl.gz",
        ),
        JsonlWriter(output_folder="/data/resharded/"),
    ],
    tasks=16,
)
executor.run()

Reading with Custom Field Mapping

from datatrove.pipeline.readers import JsonlReader

# Dataset uses "content" and "doc_id" instead of "text" and "id"
reader = JsonlReader(
    data_folder="/data/external-dataset/",
    text_key="content",
    id_key="doc_id",
    default_metadata={"source": "external"},
    add_file_path=True,
)
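When renaming keys is not enough, the adapter parameter accepts a custom function that transforms the raw parsed dict before Document creation. The sketch below shows a hypothetical adapter that flattens a nested field; the signature shown (receiving the reader instance, the parsed dict, the file path, and the line index) mirrors the default adapter but should be checked against your datatrove version:

```python
def flatten_adapter(self, data: dict, path: str, id_in_file: int) -> dict:
    """Hypothetical adapter: pull text out of a nested "article" field and
    build a synthetic id from the file path and line index."""
    return {
        "text": data["article"]["body"],
        "id": f"{path}/{id_in_file}",
        "metadata": {"title": data["article"].get("title", "")},
    }

# In a pipeline this would be passed as JsonlReader(..., adapter=flatten_adapter);
# here we invoke it directly on a sample record to show the transformation.
record = {"article": {"body": "some text", "title": "A Title"}}
out = flatten_adapter(None, record, "data/part-0.jsonl", 0)
```

An adapter is the right tool when the source schema differs structurally from the flat `{text, id, ...}` layout that `text_key`/`id_key` assume.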

Related Pages

Implements Principle
