Implementation:Huggingface Datasets Json Builder

Knowledge Sources	Huggingface Datasets, HF Datasets Docs
Domains	Data_Loading, Tabular
Last Updated	2026-02-14 18:00 GMT

Overview

Packaged dataset builder, provided by the HuggingFace Datasets library, for loading JSON and JSON Lines files into Arrow-backed datasets.

Description

Json is a packaged dataset builder that extends ArrowBasedBuilder and reads JSON and JSON Lines (JSONL) files into Arrow tables. It is configured through JsonConfig, a dataclass extending BuilderConfig, with these fields: features, encoding (default "utf-8"), encoding_errors, field (the key under which a JSON object stores its records), chunksize (default 10 << 20 bytes, i.e. 10 MiB), and the deprecated use_threads and block_size. The newlines_in_values parameter is no longer supported and raises ValueError if set.
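The rejection of newlines_in_values can be pictured with a small standalone dataclass. This is a minimal sketch rather than the library's code: SketchJsonConfig is a hypothetical stand-in that does not extend BuilderConfig, it assumes "set" means any value other than None, and the error message text is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchJsonConfig:
    """Hypothetical stand-in for JsonConfig (does not extend BuilderConfig)."""

    encoding: str = "utf-8"
    encoding_errors: Optional[str] = None
    field: Optional[str] = None
    chunksize: int = 10 << 20  # 10 MiB = 10,485,760 bytes
    newlines_in_values: Optional[bool] = None

    def __post_init__(self):
        # Assumption: any explicit value (True or False) counts as "set".
        if self.newlines_in_values is not None:
            # The message text below is illustrative, not the library's.
            raise ValueError("newlines_in_values is no longer supported")


default_config = SketchJsonConfig()

try:
    SketchJsonConfig(newlines_in_values=False)
    rejected = False
except ValueError:
    rejected = True
```

Validating in __post_init__ means the error surfaces when the configuration is built, before any file is opened.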

The builder handles two reading modes: (1) when field is specified, it reads the entire JSON file, extracts the named field, and converts it via pandas; (2) otherwise, it reads JSONL data in chunks using PyArrow's JSON reader with adaptive block sizing. If PyArrow parsing fails, it falls back to pandas-based JSON reading. The _cast_table method handles missing columns, struct-to-string conversion, and schema casting via table_cast.
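The chunked JSONL mode can be sketched with the standard library alone. The example below reads a fixed number of bytes, then extends the chunk to the end of the current line so that no record is split across chunks. The function names iter_jsonl_chunks and parse_chunk are hypothetical, json.loads stands in for PyArrow's JSON reader, and the line-completion step is an assumption about how record boundaries are preserved, not a copy of the builder's code.

```python
import io
import json


def iter_jsonl_chunks(stream, chunksize):
    """Yield byte chunks of about `chunksize` bytes that end on a line boundary."""
    while True:
        batch = stream.read(chunksize)
        if not batch:
            break
        # Finish the line that the fixed-size read may have cut in half.
        batch += stream.readline()
        yield batch


def parse_chunk(batch, encoding="utf-8"):
    """Stand-in parser: decode the chunk and load one JSON object per line."""
    text = batch.decode(encoding)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


raw = b'{"id": 1}\n{"id": 2}\n{"id": 3}\n'
chunks = list(iter_jsonl_chunks(io.BytesIO(raw), chunksize=12))
rows = [row for chunk in chunks for row in parse_chunk(chunk)]
```

With a 12-byte chunk size the first read stops inside the second record, so the first chunk is extended to cover two complete lines and the third line arrives in a second chunk.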

The module also provides compatibility helper functions ujson_dumps, ujson_loads, and pandas_read_json to handle differences across pandas versions.

Usage

Use Json via load_dataset("json", data_files=...) to load JSON or JSONL files. For JSON files with nested structure, use the field parameter to specify which key contains the data records.

Code Reference

Source Location

Repository: datasets
File: src/datasets/packaged_modules/json/json.py
Lines: 1-201

Signature

@dataclass
class JsonConfig(datasets.BuilderConfig):
    """BuilderConfig for JSON."""
    features: Optional[datasets.Features] = None
    encoding: str = "utf-8"
    encoding_errors: Optional[str] = None
    field: Optional[str] = None
    use_threads: bool = True  # deprecated
    block_size: Optional[int] = None  # deprecated
    chunksize: int = 10 << 20  # 10MB
    newlines_in_values: Optional[bool] = None


class Json(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = JsonConfig

    def _info(self): ...
    def _split_generators(self, dl_manager): ...
    def _cast_table(self, pa_table: pa.Table) -> pa.Table: ...
    def _generate_shards(self, base_files, files_iterables): ...
    def _generate_tables(self, base_files, files_iterables): ...

Import

from datasets.packaged_modules.json.json import Json, JsonConfig

I/O Contract

Inputs (JsonConfig)

Name	Type	Required	Description
features	`Optional[Features]`	No	Schema describing the dataset features. If provided, missing columns are added as null and type casting is applied.
encoding	`str`	No	File encoding. Defaults to `"utf-8"`. Data in any other encoding is re-encoded to UTF-8 before parsing.
encoding_errors	`Optional[str]`	No	How to handle encoding errors. Defaults to `"strict"` when not set.
field	`Optional[str]`	No	Name of the JSON key containing the data records. Used when the JSON file is a single object with a nested array of records.
use_threads	`bool`	No	Deprecated. Previously controlled multithreaded reading. No longer has any effect.
block_size	`Optional[int]`	No	Deprecated. Use `chunksize` instead.
chunksize	`int`	No	Number of bytes to read per chunk when processing JSONL files. Defaults to 10 MiB (`10 << 20` bytes).
newlines_in_values	`Optional[bool]`	No	No longer supported. Raises `ValueError` if set.
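The "missing columns are added as null" behavior described for features can be illustrated on plain Python rows. This is a simplified sketch under stated assumptions: add_missing_columns is a hypothetical helper, rows are dicts rather than Arrow tables, and None plays the role of an Arrow null. No type casting is attempted here.

```python
def add_missing_columns(rows, column_names):
    """Return copies of `rows` where every expected column is present."""
    filled_rows = []
    for row in rows:
        filled = dict(row)
        for name in column_names:
            # Columns absent from the record are filled with None (null).
            filled.setdefault(name, None)
        filled_rows.append(filled)
    return filled_rows


expected_columns = ["text", "label"]
rows = [{"text": "hello", "label": 1}, {"text": "world"}]
filled_rows = add_missing_columns(rows, expected_columns)
```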

Outputs

Name	Type	Description
dataset	`Dataset`	An Arrow-backed Dataset constructed from the JSON/JSONL file contents.

Usage Examples

Loading a JSON Lines File

from datasets import load_dataset

# Load a JSONL file (one JSON object per line)
ds = load_dataset("json", data_files="data/train.jsonl", split="train")
print(ds[0])

Loading a Nested JSON File

from datasets import load_dataset

# JSON file structure: {"data": [{"text": "hello"}, {"text": "world"}]}
ds = load_dataset("json", data_files="data/dataset.json", field="data", split="train")
print(ds[0])  # {"text": "hello"}
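To make the effect of field concrete, the following stdlib-only sketch mimics the selection step: parse the whole document, then keep only the value stored under the named key. The helper extract_field is hypothetical. The builder converts the extracted records through pandas into an Arrow table, whereas this sketch simply returns Python dicts.

```python
import json


def extract_field(raw_json, field):
    """Parse a complete JSON document and return the records under `field`."""
    document = json.loads(raw_json)
    return document[field]


raw = '{"version": 1, "data": [{"text": "hello"}, {"text": "world"}]}'
records = extract_field(raw, "data")
```

Because the whole file must be parsed before the key can be looked up, this mode cannot stream the input the way the chunked JSONL mode does.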

Loading with Custom Encoding

from datasets import load_dataset

# Load a Latin-1 encoded JSONL file
ds = load_dataset(
    "json",
    data_files="data/latin1_data.jsonl",
    encoding="latin-1",
    split="train",
)
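What re-encoding a non-UTF-8 chunk involves can be sketched as a decode followed by an encode. The helper reencode_to_utf8 is hypothetical, and the "strict" default mirrors the documented fallback for encoding_errors. json.loads is used only to check that the result parses; it is not the parser the builder uses.

```python
import json


def reencode_to_utf8(batch, encoding, errors="strict"):
    """Decode bytes from `encoding` and return the same text as UTF-8 bytes."""
    return batch.decode(encoding, errors=errors).encode("utf-8")


# "M\u00e1laga" contains one non-ASCII character (a with acute accent).
latin1_bytes = '{"city": "M\u00e1laga"}\n'.encode("latin-1")
utf8_bytes = reencode_to_utf8(latin1_bytes, "latin-1")
row = json.loads(utf8_bytes)
```

The accented character occupies one byte in Latin-1 and two in UTF-8, so the re-encoded chunk is one byte longer than the input.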

Related Pages

Implements Principle

Principle:Huggingface_Datasets_JSON_Dataset_Building

Requires Environment

Environment:Huggingface_Datasets_Python_PyArrow_Core
