Implementation:Huggingface Datasets Json Builder

From Leeroopedia
Knowledge Sources
Domains Data_Loading, Tabular
Last Updated 2026-02-14 18:00 GMT

Overview

A packaged dataset builder in the HuggingFace Datasets library that loads JSON and JSON Lines files into Arrow-backed datasets.

Description

Json is a packaged dataset builder that extends ArrowBasedBuilder and reads JSON and JSON Lines (JSONL) files into Arrow tables. It is configured via JsonConfig, a dataclass extending BuilderConfig, with the fields features, encoding (default "utf-8"), encoding_errors, field (the key whose value holds the records when the file is a single JSON object), chunksize (default 10 << 20 bytes, about 10 MB), and the deprecated fields use_threads and block_size. The newlines_in_values parameter is no longer supported; setting it raises ValueError.
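The deprecation rules for the configuration can be illustrated with a minimal sketch. JsonConfigSketch and check_config are hypothetical stand-ins written for this page, not the library's classes, and treating a set block_size as the chunk size is an assumption based on the "use chunksize instead" guidance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JsonConfigSketch:
    """Hypothetical stand-in for JsonConfig; not the library class."""

    encoding: str = "utf-8"
    encoding_errors: Optional[str] = None
    field: Optional[str] = None
    block_size: Optional[int] = None  # deprecated
    chunksize: int = 10 << 20  # 10 * 2**20 = 10,485,760 bytes
    newlines_in_values: Optional[bool] = None


def check_config(config: JsonConfigSketch) -> JsonConfigSketch:
    """Apply the deprecation rules described above (sketch only)."""
    if config.newlines_in_values is not None:
        raise ValueError("newlines_in_values is no longer supported")
    if config.block_size is not None:
        # Assumption: a deprecated block_size is treated as the chunk size.
        config.chunksize = config.block_size
    return config
```

With this sketch, check_config(JsonConfigSketch(block_size=4096)) yields a chunk size of 4096 bytes, while passing newlines_in_values=True raises ValueError.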

The builder handles two reading modes: (1) when field is specified, it reads the entire JSON file, extracts the named field, and converts it via pandas; (2) otherwise, it reads JSONL data in chunks using PyArrow's JSON reader with adaptive block sizing. If PyArrow parsing fails, it falls back to pandas-based JSON reading. The _cast_table method handles missing columns, struct-to-string conversion, and schema casting via table_cast.
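The chunked JSONL mode can be sketched as follows. This is an illustration only: it uses the standard-library json module in place of PyArrow's JSON reader, omits the adaptive block sizing and the pandas fallback, and assumes each chunk is extended to the next newline so that no record is split. The function name iter_jsonl_chunks is hypothetical.

```python
import io
import json
from typing import BinaryIO, Iterator, List


def iter_jsonl_chunks(f: BinaryIO, chunksize: int) -> Iterator[List[dict]]:
    """Yield lists of records, reading roughly `chunksize` bytes at a time."""
    while True:
        batch = f.read(chunksize)
        if not batch:
            break
        # Finish the current line so that no record is split across chunks.
        batch += f.readline()
        yield [json.loads(line) for line in batch.splitlines() if line.strip()]


# Three 9-byte lines read with a 10-byte chunk size.
data = io.BytesIO(b'{"a": 1}\n{"a": 2}\n{"a": 3}\n')
chunks = list(iter_jsonl_chunks(data, chunksize=10))
```

Because the first 10-byte read ends inside the second record, that record is completed and included in the first chunk; the third record forms the second chunk.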

The module also provides compatibility helper functions ujson_dumps, ujson_loads, and pandas_read_json to handle differences across pandas versions.

Usage

Use Json via load_dataset("json", data_files=...) to load JSON or JSONL files. For JSON files with nested structure, use the field parameter to specify which key contains the data records.
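What the field parameter selects can be shown with a small standard-library sketch. It is not the builder's code path (the builder converts the extracted records via pandas); records_from_field is a hypothetical helper that pivots the records into plain Python columns, filling absent keys with None.

```python
import json
from typing import Any, Dict, List


def records_from_field(json_text: str, field: str) -> Dict[str, List[Any]]:
    """Extract the list stored under `field` and pivot it into columns."""
    records = json.loads(json_text)[field]
    # Collect column names in first-seen order.
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for key in record:
            columns.setdefault(key, [])
    # Fill each column row by row; keys absent from a record become None.
    for record in records:
        for key, values in columns.items():
            values.append(record.get(key))
    return columns


text = '{"version": 1, "data": [{"text": "hello"}, {"text": "world", "id": 2}]}'
columns = records_from_field(text, "data")
```

Only the list under "data" is used; the sibling key "version" is ignored.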

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/packaged_modules/json/json.py
  • Lines: 1-201

Signature

@dataclass
class JsonConfig(datasets.BuilderConfig):
    """BuilderConfig for JSON."""
    features: Optional[datasets.Features] = None
    encoding: str = "utf-8"
    encoding_errors: Optional[str] = None
    field: Optional[str] = None
    use_threads: bool = True  # deprecated
    block_size: Optional[int] = None  # deprecated
    chunksize: int = 10 << 20  # 10MB
    newlines_in_values: Optional[bool] = None


class Json(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = JsonConfig

    def _info(self): ...
    def _split_generators(self, dl_manager): ...
    def _cast_table(self, pa_table: pa.Table) -> pa.Table: ...
    def _generate_shards(self, base_files, files_iterables): ...
    def _generate_tables(self, base_files, files_iterables): ...

Import

from datasets.packaged_modules.json.json import Json, JsonConfig

I/O Contract

Inputs (JsonConfig)

Name | Type | Required | Description
--- | --- | --- | ---
features | Optional[Features] | No | Schema describing the dataset features. If provided, missing columns are added as null and type casting is applied.
encoding | str | No | File encoding. Defaults to "utf-8". Non-UTF-8 files are re-encoded before parsing.
encoding_errors | Optional[str] | No | How to handle encoding errors. Defaults to "strict" when not set.
field | Optional[str] | No | Name of the JSON key containing the data records. Used when the JSON file is a single object with a nested array of records.
use_threads | bool | No | Deprecated. Previously controlled multithreaded reading. No longer has any effect.
block_size | Optional[int] | No | Deprecated. Use chunksize instead.
chunksize | int | No | Number of bytes to read per chunk when processing JSONL files. Defaults to 10 MB (10 << 20).
newlines_in_values | Optional[bool] | No | No longer supported. Raises ValueError if set.
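The effect of features on missing columns can be sketched in plain Python. This is an illustration under simplifying assumptions: columns are ordinary lists rather than Arrow arrays, no type casting is performed, and add_missing_columns is a hypothetical helper, not a library function.

```python
from typing import Any, Dict, List


def add_missing_columns(
    columns: Dict[str, List[Any]], feature_names: List[str]
) -> Dict[str, List[Any]]:
    """Return a copy of `columns` with an all-None column per missing feature."""
    num_rows = len(next(iter(columns.values()))) if columns else 0
    filled = dict(columns)
    for name in feature_names:
        if name not in filled:
            filled[name] = [None] * num_rows
    return filled


table = add_missing_columns({"text": ["a", "b"]}, ["text", "label"])
```

Here the "label" column is absent from the parsed data, so it is added with one None per existing row.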

Outputs

Name | Type | Description
--- | --- | ---
dataset | Dataset | An Arrow-backed Dataset constructed from the JSON/JSONL file contents.

Usage Examples

Loading a JSON Lines File

from datasets import load_dataset

# Load a JSONL file (one JSON object per line)
ds = load_dataset("json", data_files="data/train.jsonl", split="train")
print(ds[0])

Loading a Nested JSON File

from datasets import load_dataset

# JSON file structure: {"data": [{"text": "hello"}, {"text": "world"}]}
ds = load_dataset("json", data_files="data/dataset.json", field="data", split="train")
print(ds[0])  # {'text': 'hello'}

Loading with Custom Encoding

from datasets import load_dataset

# Load a Latin-1 encoded JSONL file
ds = load_dataset(
    "json",
    data_files="data/latin1_data.jsonl",
    encoding="latin-1",
    split="train",
)
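The re-encoding step that the encoding and encoding_errors options imply can be sketched with the standard library alone. The helper name reencode_to_utf8 is hypothetical; the sketch assumes raw bytes are decoded with the configured encoding and re-encoded as UTF-8 before parsing, with "strict" used when no error handler is given.

```python
from typing import Optional


def reencode_to_utf8(
    batch: bytes, encoding: str, encoding_errors: Optional[str] = None
) -> bytes:
    """Return `batch` as UTF-8 bytes, decoding from `encoding` if needed."""
    if encoding == "utf-8":
        return batch
    errors = encoding_errors if encoding_errors is not None else "strict"
    return batch.decode(encoding, errors=errors).encode("utf-8")


latin1_line = '{"name": "café"}'.encode("latin-1")
utf8_line = reencode_to_utf8(latin1_line, "latin-1")
```

In Latin-1 the character "é" is a single byte, while in UTF-8 it takes two, so utf8_line is one byte longer than latin1_line and decodes cleanly as UTF-8.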

Related Pages

Implements Principle

Requires Environment
