Implementation:Huggingface Datasets JsonDatasetReader
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the HuggingFace Datasets library, for importing JSON and JSON Lines files into the HuggingFace Dataset format.
Description
JsonDatasetReader is a reader class that extends AbstractDatasetReader and uses the packaged Json builder to parse one or more JSON/JSONL files into an Arrow-backed Dataset or IterableDataset. It supports an optional field parameter to extract records from a nested JSON key, configurable features, caching, in-memory loading, streaming mode, and multiprocessing. All additional keyword arguments are forwarded to the underlying Json builder.
Usage
Use JsonDatasetReader when you need to programmatically load JSON or JSON Lines files into a HuggingFace Dataset. It is typically invoked indirectly via Dataset.from_json() or load_dataset("json", ...), but can also be instantiated directly.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/io/json.py
- Lines: L15-L69
Signature
class JsonDatasetReader(AbstractDatasetReader):
    def __init__(
        self,
        path_or_paths: NestedDataStructureLike[PathLike],
        split: Optional[NamedSplit] = None,
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        streaming: bool = False,
        field: Optional[str] = None,
        num_proc: Optional[int] = None,
        **kwargs,
    ):

    def read(self):
Import
from datasets.io.json import JsonDatasetReader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path_or_paths | NestedDataStructureLike[PathLike] | Yes | Path(s) to JSON/JSONL file(s). Can be a single path, a list of paths, or a dict mapping split names to paths. |
| split | Optional[NamedSplit] | No | Name of the dataset split to assign to the loaded data. |
| features | Optional[Features] | No | Explicit schema to apply instead of inferring from the JSON data. |
| cache_dir | str | No | Directory for caching the processed dataset. |
| keep_in_memory | bool | No | Whether to keep the dataset in memory instead of memory-mapping. Defaults to False. |
| streaming | bool | No | If True, returns an IterableDataset for streaming access. Defaults to False. |
| field | Optional[str] | No | Name of the JSON field containing the records (for nested JSON structures). |
| num_proc | Optional[int] | No | Number of processes for parallel dataset preparation. |
| **kwargs | Any | No | Additional keyword arguments forwarded to the Json builder. |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset or IterableDataset | The loaded dataset, either map-style or iterable depending on the streaming parameter. |
Usage Examples
Basic Usage
from datasets.io.json import JsonDatasetReader
# Load a JSON Lines file
reader = JsonDatasetReader("data/train.jsonl", split="train")
dataset = reader.read()
# Load a nested JSON file where records are under "data" key
reader = JsonDatasetReader("data/train.json", split="train", field="data")
dataset = reader.read()
# Load with streaming
reader = JsonDatasetReader("data/train.jsonl", streaming=True)
iterable_dataset = reader.read()