
Implementation:Huggingface Datatrove JsonlReader

From Leeroopedia
Knowledge Sources
Domains Data_Ingestion, NLP_Data_Processing
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool, provided by the datatrove library, for reading JSONL (JSON Lines) files. JsonlReader extends BaseDiskReader to stream line-delimited JSON files from disk or remote storage, converting each JSON line into a Document object for pipeline processing.

Description

JsonlReader is a pipeline reader component that processes JSONL files where each line contains a JSON object representing a single document. It parses each line independently, extracts the text content and identifier fields, and assembles a Document object with associated metadata.

Key capabilities include:

  • Streaming line-by-line parsing: Each line is parsed as an independent JSON object, enabling constant-memory processing of arbitrarily large files
  • Compression support: Handles gzip and zstd compressed JSONL files, with automatic detection via "infer" mode
  • Configurable field mapping: The text_key and id_key parameters specify which JSON fields map to the Document text and identifier
  • File path annotation: The add_file_path parameter (default True) attaches the source file path to each Document's metadata
  • Glob patterns: Supports file glob patterns to select specific JSONL files from a directory
  • Shard-based parallel reading: Files can be distributed across multiple workers for parallel ingestion
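The streaming behavior above can be sketched in plain Python. This is a minimal, stdlib-only illustration of how line-delimited JSON is parsed one record at a time with compression inferred from the file extension; `JsonlReader` itself additionally handles remote storage, zstd, glob patterns, and sharding:

```python
import gzip
import json
import tempfile
from pathlib import Path

def stream_jsonl(path, text_key="text", id_key="id"):
    """Yield one parsed record per line; only one line is held in memory at a time."""
    # Infer compression from the file extension, as the "infer" default does
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            data = json.loads(line)  # each line is an independent JSON object
            yield {"text": data[text_key], "id": data[id_key]}

# Demo: write a small gzip-compressed JSONL file and stream it back
tmp = Path(tempfile.mkdtemp()) / "sample.jsonl.gz"
with gzip.open(tmp, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"text": "hello", "id": "doc-1"}) + "\n")
    f.write(json.dumps({"text": "world", "id": "doc-2"}) + "\n")

docs = list(stream_jsonl(tmp))
print(docs[0]["text"], docs[1]["id"])  # hello doc-2
```

Because each record is a full JSON document on its own line, memory use stays constant regardless of file size.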

Usage

Import and use JsonlReader when loading preprocessed text datasets stored in JSONL format into a datatrove pipeline. It is commonly used to read intermediate pipeline outputs, cleaned document collections, or externally provided datasets in JSONL format.

Code Reference

Source Location

  • Repository: datatrove
  • File: src/datatrove/pipeline/readers/jsonl.py
  • Lines: L9-94 (JsonlReader class and methods)

Signature

class JsonlReader(BaseDiskReader):
    def __init__(
        self,
        data_folder: DataFolderLike,
        paths_file: DataFileLike | None = None,
        compression: Literal["infer", "gzip", "zstd"] | None = "infer",
        limit: int = -1,
        skip: int = 0,
        file_progress: bool = False,
        doc_progress: bool = False,
        adapter: Callable = None,
        text_key: str = "text",
        id_key: str = "id",
        default_metadata: dict = None,
        recursive: bool = True,
        glob_pattern: str | None = None,
        shuffle_files: bool = False,
        add_file_path: bool = True,
    ):

Import

from datatrove.pipeline.readers import JsonlReader

I/O Contract

Inputs

Name Type Required Description
data_folder DataFolderLike Yes Path or data folder object pointing to JSONL files
paths_file DataFileLike No File listing specific JSONL file paths to read
compression Literal["infer", "gzip", "zstd"] or None No (default: "infer") Compression scheme for JSONL files; "infer" auto-detects from file extension
limit int No (default: -1) Maximum number of documents to read; -1 for unlimited
skip int No (default: 0) Number of documents to skip from the beginning
file_progress bool No (default: False) Show progress bar for files processed
doc_progress bool No (default: False) Show progress bar for documents processed
adapter Callable No (default: None) Custom function to transform raw JSON data before Document creation
text_key str No (default: "text") JSON key containing the document text content
id_key str No (default: "id") JSON key containing the document identifier
default_metadata dict No (default: None) Default metadata to attach to every Document
recursive bool No (default: True) Recursively search subdirectories for JSONL files
glob_pattern str No (default: None) Glob pattern to filter which files to read
shuffle_files bool No (default: False) Shuffle the order of files before reading
add_file_path bool No (default: True) Add the source file path to each Document's metadata

Outputs

Name Type Description
documents Generator[Document] Stream of Document objects, each containing:
  • text - the text content extracted from the JSON object via text_key
  • id - the document identifier extracted via id_key
  • metadata - remaining JSON fields plus optionally the source file_path
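The field mapping in the contract above can be sketched as follows. This is an illustrative approximation of how one JSON line becomes a Document-like record (the field names and placement follow the table above; the real Document class lives in datatrove):

```python
import json

def to_document(line, path, text_key="text", id_key="id",
                default_metadata=None, add_file_path=True):
    """Map one JSON line to a Document-like dict per the I/O contract."""
    data = json.loads(line)
    text = data.pop(text_key)        # text content via text_key
    doc_id = data.pop(id_key)        # identifier via id_key
    metadata = dict(default_metadata or {})
    metadata.update(data)            # remaining JSON fields become metadata
    if add_file_path:
        metadata["file_path"] = path # optional provenance annotation
    return {"text": text, "id": doc_id, "metadata": metadata}

doc = to_document(
    '{"text": "hello", "id": "doc-1", "lang": "en"}',
    path="corpus/part-0.jsonl",
    default_metadata={"source": "external"},
)
```

Note that `default_metadata` entries can be overridden by per-record fields, since the record's remaining keys are merged in afterwards.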

Usage Examples

Reading JSONL Files from a Directory

from datatrove.pipeline.readers import JsonlReader

# Read all JSONL files from a local directory
reader = JsonlReader(
    data_folder="/data/processed-documents/",
    glob_pattern="*.jsonl.gz",
    text_key="text",
    id_key="id",
)

# Use in a pipeline; here the reader feeds a writer to re-shard the corpus
# (minhash deduplication in datatrove is a separate multi-stage pipeline)
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.writers import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader(
            data_folder="/data/cleaned-corpus/",
            glob_pattern="*.jsonl.gz",
        ),
        JsonlWriter(output_folder="/data/resharded/"),
    ],
    tasks=16,
)
executor.run()

Reading with Custom Field Mapping

from datatrove.pipeline.readers import JsonlReader

# Dataset uses "content" and "doc_id" instead of "text" and "id"
reader = JsonlReader(
    data_folder="/data/external-dataset/",
    text_key="content",
    id_key="doc_id",
    default_metadata={"source": "external"},
    add_file_path=True,
)
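When renaming keys is not enough, the adapter parameter accepts a custom function that transforms the raw parsed dict before Document creation. The sketch below shows a hypothetical adapter that flattens a nested field; the signature shown (receiving the reader instance, the parsed dict, the file path, and the line index) mirrors the default adapter but should be checked against your datatrove version:

```python
def flatten_adapter(self, data: dict, path: str, id_in_file: int) -> dict:
    """Hypothetical adapter: pull text out of a nested "article" field and
    build a synthetic id from the file path and line index."""
    return {
        "text": data["article"]["body"],
        "id": f"{path}/{id_in_file}",
        "metadata": {"title": data["article"].get("title", "")},
    }

# In a pipeline this would be passed as JsonlReader(..., adapter=flatten_adapter);
# here we invoke it directly on a sample record to show the transformation.
record = {"article": {"body": "some text", "title": "A Title"}}
out = flatten_adapter(None, record, "data/part-0.jsonl", 0)
```

An adapter is the right tool when the source schema differs structurally from the flat `{text, id, ...}` layout that `text_key`/`id_key` assume.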

Related Pages

Implements Principle
