Implementation:Huggingface Datasets JsonDatasetReader

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

A concrete reader class in the HuggingFace Datasets library that imports JSON and JSON Lines files as a Dataset (or, in streaming mode, an IterableDataset).

Description

JsonDatasetReader is a reader class that extends AbstractDatasetReader and uses the packaged Json builder to parse one or more JSON/JSONL files into an Arrow-backed Dataset or IterableDataset. It supports an optional field parameter to extract records from a nested JSON key, configurable features, caching, in-memory loading, streaming mode, and multiprocessing. All additional keyword arguments are forwarded to the underlying Json builder.
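To make the two input layouts concrete, the following standard-library sketch builds one JSON Lines payload and one nested JSON payload, then shows which records a field value would select. It does not call the library and is not how the Json builder parses files; the helper name select_records and the sample keys are invented for illustration.

```python
import json

# JSON Lines: one complete JSON object per line.
jsonl_text = '{"text": "hello", "label": 0}\n{"text": "world", "label": 1}\n'

# Nested JSON: the records live under a top-level key ("data" in this sample).
nested_text = json.dumps(
    {
        "version": "1.0",
        "data": [
            {"text": "hello", "label": 0},
            {"text": "world", "label": 1},
        ],
    }
)


def select_records(document, field=None):
    """Hypothetical helper: return the records, optionally taken from under `field`."""
    return document[field] if field is not None else document


# Parse each non-empty line of the JSON Lines payload as its own record.
records_from_jsonl = [
    json.loads(line) for line in jsonl_text.splitlines() if line.strip()
]

# Parse the nested payload once, then pick the list stored under "data".
records_from_nested = select_records(json.loads(nested_text), field="data")

# Both layouts describe the same two records.
print(records_from_jsonl == records_from_nested)
```

In this sketch, field="data" plays the role that the field parameter plays for the reader: it names the key under which the list of records is stored.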

Usage

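Use JsonDatasetReader when you need to programmatically load JSON or JSON Lines files into a HuggingFace Dataset. It is typically invoked indirectly via Dataset.from_json(), but can also be instantiated directly. load_dataset("json", ...) relies on the same packaged Json builder, although it reaches that builder through its own loading path rather than through this reader.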

Code Reference

Source Location

Repository: datasets
File: src/datasets/io/json.py
Lines: L15-L69

Signature

class JsonDatasetReader(AbstractDatasetReader):
    def __init__(
        self,
        path_or_paths: NestedDataStructureLike[PathLike],
        split: Optional[NamedSplit] = None,
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        streaming: bool = False,
        field: Optional[str] = None,
        num_proc: Optional[int] = None,
        **kwargs,
    ):

    def read(self):

Import

from datasets.io.json import JsonDatasetReader

I/O Contract

Inputs

Name	Type	Required	Description
path_or_paths	`NestedDataStructureLike[PathLike]`	Yes	Path(s) to JSON/JSONL file(s). Can be a single path, a list of paths, or a dict mapping split names to paths.
split	`Optional[NamedSplit]`	No	Name of the dataset split to assign to the loaded data.
features	`Optional[Features]`	No	Explicit schema to apply instead of inferring from the JSON data.
cache_dir	`str`	No	Directory for caching the processed dataset.
keep_in_memory	`bool`	No	Whether to keep the dataset in memory instead of memory-mapping. Defaults to False.
streaming	`bool`	No	If True, returns an IterableDataset for streaming access. Defaults to False.
field	`Optional[str]`	No	Name of the JSON field containing the records (for nested JSON structures).
num_proc	`Optional[int]`	No	Number of processes for parallel dataset preparation.
**kwargs		No	Additional keyword arguments forwarded to the Json builder.

Outputs

Name	Type	Description
dataset	`Dataset` or `IterableDataset`	The loaded dataset, either map-style or iterable depending on the streaming parameter.

Usage Examples

Basic Usage

from datasets.io.json import JsonDatasetReader

# Load a JSON Lines file
reader = JsonDatasetReader("data/train.jsonl", split="train")
dataset = reader.read()

# Load a nested JSON file where records are under "data" key
reader = JsonDatasetReader("data/train.json", split="train", field="data")
dataset = reader.read()

# Load with streaming
reader = JsonDatasetReader("data/train.jsonl", streaming=True)
iterable_dataset = reader.read()

Related Pages

Implements Principle

Principle:Huggingface_Datasets_JSON_Import
