Implementation: Hugging Face Datatrove JsonlReader
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, NLP_Data_Processing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete tool for reading JSONL (JSON Lines) files provided by the datatrove library. JsonlReader extends BaseDiskReader to stream line-delimited JSON files from disk or remote storage, converting each JSON line into a Document object for pipeline processing.
Description
JsonlReader is a pipeline reader component that processes JSONL files where each line contains a JSON object representing a single document. It parses each line independently, extracts the text content and identifier fields, and assembles a Document object with associated metadata.
Key capabilities include:
- Streaming line-by-line parsing: Each line is parsed as an independent JSON object, enabling constant-memory processing of arbitrarily large files
- Compression support: Handles gzip and zstd compressed JSONL files, with automatic detection via `"infer"` mode
- Configurable field mapping: The `text_key` and `id_key` parameters specify which JSON fields map to the Document text and identifier
- File path annotation: The `add_file_path` parameter (default `True`) attaches the source file path to each Document's metadata
- Glob patterns: Supports file glob patterns to select specific JSONL files from a directory
- Shard-based parallel reading: Files can be distributed across multiple workers for parallel ingestion
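The streaming and compression-inference behavior described above can be sketched in plain Python. This is an illustrative stand-in, not datatrove's implementation: the real JsonlReader additionally supports zstd, remote storage, sharding, and Document assembly.

```python
import gzip
import json

def iter_jsonl(path: str):
    """Yield one parsed JSON object per line of a JSONL file.

    Illustrative sketch only: compression is "inferred" here by checking for
    a .gz extension, mirroring the idea behind compression="infer".
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)
```

Because each line is parsed independently, memory use stays constant regardless of file size.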
Usage
Import and use JsonlReader when loading preprocessed text datasets stored in JSONL format into a datatrove pipeline. It is commonly used to read intermediate pipeline outputs, cleaned document collections, or externally provided datasets in JSONL format.
Code Reference
Source Location
- Repository: datatrove
- File: `src/datatrove/pipeline/readers/jsonl.py`
- Lines: L9-94 (JsonlReader class and methods)
Signature
class JsonlReader(BaseDiskReader):
def __init__(
self,
data_folder: DataFolderLike,
paths_file: DataFileLike | None = None,
compression: Literal["infer", "gzip", "zstd"] | None = "infer",
limit: int = -1,
skip: int = 0,
file_progress: bool = False,
doc_progress: bool = False,
adapter: Callable = None,
text_key: str = "text",
id_key: str = "id",
default_metadata: dict = None,
recursive: bool = True,
glob_pattern: str | None = None,
shuffle_files: bool = False,
add_file_path: bool = True,
):
Import
from datatrove.pipeline.readers import JsonlReader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | DataFolderLike | Yes | Path or data folder object pointing to JSONL files |
| paths_file | DataFileLike | No | File listing specific JSONL file paths to read |
| compression | Literal["infer", "gzip", "zstd"] or None | No (default: "infer") | Compression scheme for JSONL files; "infer" auto-detects from file extension |
| limit | int | No (default: -1) | Maximum number of documents to read; -1 for unlimited |
| skip | int | No (default: 0) | Number of documents to skip from the beginning |
| file_progress | bool | No (default: False) | Show progress bar for files processed |
| doc_progress | bool | No (default: False) | Show progress bar for documents processed |
| adapter | Callable | No (default: None) | Custom function to transform raw JSON data before Document creation |
| text_key | str | No (default: "text") | JSON key containing the document text content |
| id_key | str | No (default: "id") | JSON key containing the document identifier |
| default_metadata | dict | No (default: None) | Default metadata to attach to every Document |
| recursive | bool | No (default: True) | Recursively search subdirectories for JSONL files |
| glob_pattern | str | No (default: None) | Glob pattern to filter which files to read |
| shuffle_files | bool | No (default: False) | Shuffle the order of files before reading |
| add_file_path | bool | No (default: True) | Add the source file path to each Document's metadata |
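The interaction of `skip` and `limit` can be illustrated with a plain generator. This is a hypothetical helper showing the semantics described in the table, not a datatrove API:

```python
from typing import Iterable, Iterator

def apply_skip_limit(docs: Iterable, skip: int = 0, limit: int = -1) -> Iterator:
    """Skip the first `skip` items, then yield at most `limit` items.

    A limit of -1 means unlimited, mirroring JsonlReader's defaults.
    """
    yielded = 0
    for i, doc in enumerate(docs):
        if i < skip:
            continue  # documents skipped from the beginning
        if limit != -1 and yielded >= limit:
            return  # limit reached, stop reading
        yielded += 1
        yield doc
```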
Outputs
| Name | Type | Description |
|---|---|---|
| documents | Generator[Document] | Stream of Document objects, each containing the text content, the document identifier, and metadata (including the source file path when `add_file_path=True`) |
Usage Examples
Reading JSONL Files from a Directory
from datatrove.pipeline.readers import JsonlReader
# Read all JSONL files from a local directory
reader = JsonlReader(
data_folder="/data/processed-documents/",
glob_pattern="*.jsonl.gz",
text_key="text",
id_key="id",
)
# Use in a pipeline (e.g. the signature stage of minhash deduplication)
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupSignature
executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader(
            data_folder="/data/cleaned-corpus/",
            glob_pattern="*.jsonl.gz",
        ),
        MinhashDedupSignature(output_folder="/data/signatures/"),
    ],
    tasks=16,
)
executor.run()
Reading with Custom Field Mapping
from datatrove.pipeline.readers import JsonlReader
# Dataset uses "content" and "doc_id" instead of "text" and "id"
reader = JsonlReader(
data_folder="/data/external-dataset/",
text_key="content",
id_key="doc_id",
default_metadata={"source": "external"},
add_file_path=True,
)
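When field renaming alone is not enough, the `adapter` parameter accepts a custom function. The sketch below assumes the adapter receives the raw data dict, the file path, and the in-file index, and returns a dict whose keys populate the Document; this signature is an assumption based on the default adapter, so check your datatrove version before relying on it.

```python
def my_adapter(data: dict, path: str, id_in_file: int) -> dict:
    """Example adapter: rename fields and tag provenance.

    The (data, path, id_in_file) signature is an assumption, not a
    documented guarantee; "body", "uid", and "origin" are example names.
    """
    return {
        "text": data.get("body", ""),
        # Fall back to a path-derived id when the source has none.
        "id": data.get("uid", f"{path}/{id_in_file}"),
        "metadata": {"origin": path},
    }
```

It would then be passed as `JsonlReader(data_folder=..., adapter=my_adapter)`.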