
Implementation:Huggingface Datasets JsonDatasetReader

From Leeroopedia
Revision as of 12:59, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Datasets_JsonDatasetReader.md)
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the HuggingFace Datasets library, for importing JSON and JSON Lines (JSONL) files into the library's Dataset format.

Description

JsonDatasetReader is a reader class that extends AbstractDatasetReader and uses the packaged Json builder to parse one or more JSON/JSONL files into an Arrow-backed Dataset or IterableDataset. It supports an optional field parameter to extract records from a nested JSON key, configurable features, caching, in-memory loading, streaming mode, and multiprocessing. All additional keyword arguments are forwarded to the underlying Json builder.
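To make the field parameter concrete, the following sketch uses only Python's standard json module to show the two file shapes involved: JSON Lines, where each line is one record, and a nested JSON document, where the records sit under a top-level key. It only illustrates which part of the document field points at; it is not how the Json builder is implemented, and the sample keys (version, data, text, label) are invented for illustration.

```python
import json

# JSON Lines: one self-contained JSON object per line.
jsonl_text = '{"text": "good", "label": 1}\n{"text": "bad", "label": 0}\n'
jsonl_records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# Nested JSON: the records live under a top-level key (hypothetical layout).
nested_text = json.dumps(
    {
        "version": "1.0",
        "data": [
            {"text": "good", "label": 1},
            {"text": "bad", "label": 0},
        ],
    }
)

# `field` names the key whose value holds the list of records.
field = "data"
nested_records = json.loads(nested_text)[field]

# Both shapes describe the same two records.
print(jsonl_records == nested_records)
```

With a JSON Lines file no field is needed, since every line is already a record; with the nested layout, field="data" is what tells the reader where the records are.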

Usage

Use JsonDatasetReader when you need to programmatically load JSON or JSON Lines files into a HuggingFace Dataset. It is typically invoked indirectly via Dataset.from_json() or load_dataset("json", ...), but can also be instantiated directly.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/io/json.py
  • Lines: L15-L69

Signature

class JsonDatasetReader(AbstractDatasetReader):
    def __init__(
        self,
        path_or_paths: NestedDataStructureLike[PathLike],
        split: Optional[NamedSplit] = None,
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        streaming: bool = False,
        field: Optional[str] = None,
        num_proc: Optional[int] = None,
        **kwargs,
    ): ...

    def read(self): ...

Import

from datasets.io.json import JsonDatasetReader

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| path_or_paths | NestedDataStructureLike[PathLike] | Yes | Path(s) to JSON/JSONL file(s). Can be a single path, a list of paths, or a dict mapping split names to paths. |
| split | Optional[NamedSplit] | No | Name of the dataset split to assign to the loaded data. |
| features | Optional[Features] | No | Explicit schema to apply instead of inferring from the JSON data. |
| cache_dir | str | No | Directory for caching the processed dataset. |
| keep_in_memory | bool | No | Whether to keep the dataset in memory instead of memory-mapping. Defaults to False. |
| streaming | bool | No | If True, returns an IterableDataset for streaming access. Defaults to False. |
| field | Optional[str] | No | Name of the JSON field containing the records (for nested JSON structures). |
| num_proc | Optional[int] | No | Number of processes for parallel dataset preparation. |
| **kwargs | | No | Additional keyword arguments forwarded to the Json builder. |
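The three accepted shapes of path_or_paths can be pictured as a small normalization step. The sketch below is a plain-Python illustration under two assumptions: that a non-dict input is wrapped under the split name, and that the split falls back to "train" when none is given. normalize_data_files is a hypothetical helper written for this page, not part of the datasets API.

```python
from typing import Optional


def normalize_data_files(path_or_paths, split: Optional[str] = None) -> dict:
    """Hypothetical helper: map every input shape to {split_name: path(s)}."""
    if isinstance(path_or_paths, dict):
        # A dict already maps split names to paths; copy it unchanged.
        return dict(path_or_paths)
    # Assumption: a single path or a list of paths is filed under the
    # requested split, or under "train" when no split is given.
    return {split or "train": path_or_paths}


single = normalize_data_files("data/train.jsonl")
several = normalize_data_files(["data/a.jsonl", "data/b.jsonl"], split="validation")
mapping = normalize_data_files({"train": "data/train.jsonl", "test": "data/test.jsonl"})

print(single)
print(several)
print(mapping)
```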

Outputs

| Name | Type | Description |
|---|---|---|
| dataset | Dataset or IterableDataset | The loaded dataset, either map-style or iterable depending on the streaming parameter. |

Usage Examples

Basic Usage

from datasets.io.json import JsonDatasetReader

# Load a JSON Lines file
reader = JsonDatasetReader("data/train.jsonl", split="train")
dataset = reader.read()

# Load a nested JSON file where records are under "data" key
reader = JsonDatasetReader("data/train.json", split="train", field="data")
dataset = reader.read()

# Load with streaming
reader = JsonDatasetReader("data/train.jsonl", streaming=True)
iterable_dataset = reader.read()
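The examples above assume that data/train.jsonl and data/train.json already exist. A minimal sketch for creating small stand-in files with the standard library follows; the record contents and the write_sample_files helper are invented for illustration, and the root directory is a parameter so the files can be written somewhere disposable.

```python
import json
import tempfile
from pathlib import Path


def write_sample_files(root):
    """Hypothetical helper: write data/train.jsonl and data/train.json under root."""
    # Invented sample rows, used only for this illustration.
    records = [{"text": "good", "label": 1}, {"text": "bad", "label": 0}]

    data_dir = Path(root) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    # JSON Lines: one record per line.
    jsonl_path = data_dir / "train.jsonl"
    jsonl_path.write_text(
        "".join(json.dumps(record) + "\n" for record in records), encoding="utf-8"
    )

    # Nested JSON: the same records under a top-level "data" key.
    json_path = data_dir / "train.json"
    json_path.write_text(json.dumps({"data": records}), encoding="utf-8")

    return jsonl_path, json_path


with tempfile.TemporaryDirectory() as tmp:
    jsonl_path, json_path = write_sample_files(tmp)
    line_count = len(jsonl_path.read_text(encoding="utf-8").splitlines())
    nested_keys = list(json.loads(json_path.read_text(encoding="utf-8")))

print(line_count, nested_keys)
```

Calling write_sample_files(".") instead would place the files at the relative paths used in the examples above.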

Related Pages

Implements Principle
