Implementation:Huggingface Datasets Json Builder
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Tabular |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Packaged dataset builder, provided by the HuggingFace Datasets library, for loading JSON and JSON Lines files into Arrow-backed datasets.
Description
Json is a packaged dataset builder extending ArrowBasedBuilder that reads JSON and JSON Lines (JSONL) files into Arrow tables. It is configured via JsonConfig, a dataclass extending BuilderConfig, with fields for features, encoding (default "utf-8"), encoding_errors, field (for extracting a specific nested field from a JSON object), chunksize (default 10 MB), and deprecated fields use_threads and block_size. The newlines_in_values parameter is no longer supported and raises ValueError if set.
The builder handles two reading modes: (1) when field is specified, it reads the entire JSON file, extracts the named field, and converts it via pandas; (2) otherwise, it reads JSONL data in chunks using PyArrow's JSON reader with adaptive block sizing. If PyArrow parsing fails, it falls back to pandas-based JSON reading. The _cast_table method handles missing columns, struct-to-string conversion, and schema casting via table_cast.
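The two reading modes can be illustrated with a minimal sketch. It uses only the standard-library `json` module as a stand-in for the pandas and PyArrow readers the builder actually relies on, and the function names `read_field` and `iter_jsonl_chunks` are hypothetical. It does not reproduce adaptive block sizing or the pandas fallback. It only shows how a named field is extracted from a whole document and how a byte chunk can be extended to the next newline so that no record is split.

```python
import io
import json
from typing import Any, BinaryIO, Dict, Iterator, List


def read_field(f: BinaryIO, field: str, encoding: str = "utf-8") -> List[Dict[str, Any]]:
    """Mode 1: parse the whole file, then take the records stored under one key."""
    document = json.loads(f.read().decode(encoding))
    return document[field]


def iter_jsonl_chunks(
    f: BinaryIO,
    chunksize: int = 10 << 20,
    encoding: str = "utf-8",
    encoding_errors: str = "strict",
) -> Iterator[List[Dict[str, Any]]]:
    """Mode 2: yield lists of records, reading about `chunksize` bytes at a time."""
    while True:
        batch = f.read(chunksize)
        if not batch:
            break
        # Extend the chunk to the end of the current line so no record is cut in half.
        batch += f.readline()
        # Decode with the configured encoding before parsing.
        text = batch.decode(encoding, errors=encoding_errors)
        # Split on "\n" only: JSON strings may legally contain other line separators.
        yield [json.loads(line) for line in text.split("\n") if line.strip()]


# Tiny demonstration with in-memory files and a deliberately small chunk size.
nested = io.BytesIO(b'{"data": [{"text": "hello"}, {"text": "world"}]}')
records = read_field(nested, "data")

jsonl = io.BytesIO(b'{"a": 1}\n{"a": 2}\n{"a": 3}\n')
chunks = list(iter_jsonl_chunks(jsonl, chunksize=4))
```

With `chunksize=4` every read stops mid-record, so the `readline()` call is what keeps each chunk parseable. In practice the default of 10 MB means most chunks hold many records.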
The module also provides compatibility helper functions ujson_dumps, ujson_loads, and pandas_read_json to handle differences across pandas versions.
Usage
Use Json via load_dataset("json", data_files=...) to load JSON or JSONL files. For JSON files with nested structure, use the field parameter to specify which key contains the data records.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/packaged_modules/json/json.py
- Lines: 1-201
Signature
@dataclass
class JsonConfig(datasets.BuilderConfig):
    """BuilderConfig for JSON."""

    features: Optional[datasets.Features] = None
    encoding: str = "utf-8"
    encoding_errors: Optional[str] = None
    field: Optional[str] = None
    use_threads: bool = True  # deprecated
    block_size: Optional[int] = None  # deprecated
    chunksize: int = 10 << 20  # 10MB
    newlines_in_values: Optional[bool] = None


class Json(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = JsonConfig

    def _info(self): ...
    def _split_generators(self, dl_manager): ...
    def _cast_table(self, pa_table: pa.Table) -> pa.Table: ...
    def _generate_shards(self, base_files, files_iterables): ...
    def _generate_tables(self, base_files, files_iterables): ...
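Of the methods listed above, `_cast_table` is the one whose behavior is hardest to read off its signature. The following is a rough sketch of the idea only, under clear assumptions: it works on plain Python dict rows instead of a `pa.Table`, the schema is a hypothetical `{column: type name}` mapping instead of a `Features` object, and the helper name `cast_rows` is invented for illustration. It shows missing columns being filled with nulls and nested objects being stored as JSON text where a string column is expected.

```python
import json
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def cast_rows(rows: List[Row], features: Optional[Dict[str, str]]) -> List[Row]:
    """Align plain-dict rows to a hypothetical {column: type name} schema."""
    if features is None:
        return rows  # no schema given: keep whatever was parsed
    casted = []
    for row in rows:
        out: Row = {}
        # Simplification: columns outside the schema are dropped here.
        for name, type_name in features.items():
            value = row.get(name)  # a missing column becomes None (null)
            if type_name == "string" and isinstance(value, dict):
                # Nested object where a string is expected: keep it as JSON text.
                value = json.dumps(value)
            out[name] = value
        casted.append(out)
    return casted


rows = [{"id": 1, "meta": {"k": "v"}}, {"id": 2}]
schema = {"id": "int64", "meta": "string", "label": "string"}
casted = cast_rows(rows, schema)
```

The real method performs the equivalent steps on Arrow columns and finishes with `table_cast`, which this sketch does not attempt to model.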
Import
from datasets.packaged_modules.json.json import Json, JsonConfig
I/O Contract
Inputs (JsonConfig)
| Name | Type | Required | Description |
|---|---|---|---|
| features | Optional[Features] | No | Schema describing the dataset features. If provided, missing columns are added as null and type casting is applied. |
| encoding | str | No | File encoding. Defaults to "utf-8". Non-UTF-8 files are re-encoded before parsing. |
| encoding_errors | Optional[str] | No | How to handle encoding errors. Defaults to "strict" when not set. |
| field | Optional[str] | No | Name of the JSON key containing the data records. Used when the JSON file is a single object with a nested array of records. |
| use_threads | bool | No | Deprecated. Previously controlled multithreaded reading. No longer has any effect. |
| block_size | Optional[int] | No | Deprecated. Use chunksize instead. |
| chunksize | int | No | Number of bytes to read per chunk when processing JSONL files. Defaults to 10 MB (10 << 20). |
| newlines_in_values | Optional[bool] | No | No longer supported. Raises ValueError if set. |
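The defaults and validation rules in the table can be summarized in a small stand-in dataclass. This is a sketch, not the library class: `JsonConfigSketch` and its `effective_encoding_errors` property are hypothetical names, `features` and `use_threads` are omitted, and carrying a deprecated `block_size` value over to `chunksize` is an assumption about how "use chunksize instead" is honored.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class JsonConfigSketch:
    """Stand-in mirroring the documented JsonConfig fields; not the library class."""

    encoding: str = "utf-8"
    encoding_errors: Optional[str] = None
    field: Optional[str] = None
    block_size: Optional[int] = None  # deprecated
    chunksize: int = 10 << 20  # 10 MB
    newlines_in_values: Optional[bool] = None

    def __post_init__(self) -> None:
        if self.newlines_in_values is not None:
            raise ValueError("newlines_in_values is no longer supported")
        if self.block_size is not None:
            # Assumption: the deprecated value is carried over to chunksize.
            warnings.warn("block_size is deprecated, use chunksize", DeprecationWarning)
            self.chunksize = self.block_size

    @property
    def effective_encoding_errors(self) -> str:
        # "strict" is used when encoding_errors is left unset.
        return self.encoding_errors or "strict"


default_cfg = JsonConfigSketch()
nested_cfg = JsonConfigSketch(field="data", encoding="latin-1")
```

When going through `load_dataset("json", ...)`, these fields are passed as keyword arguments and never constructed by hand, so a sketch like this is mainly useful for reasoning about which combinations are rejected.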
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | An Arrow-backed Dataset constructed from the JSON/JSONL file contents. |
Usage Examples
Loading a JSON Lines File
from datasets import load_dataset
# Load a JSONL file (one JSON object per line)
ds = load_dataset("json", data_files="data/train.jsonl", split="train")
print(ds[0])
Loading a Nested JSON File
from datasets import load_dataset
# JSON file structure: {"data": [{"text": "hello"}, {"text": "world"}]}
ds = load_dataset("json", data_files="data/dataset.json", field="data", split="train")
print(ds[0])  # {'text': 'hello'}
Loading with Custom Encoding
from datasets import load_dataset
# Load a Latin-1 encoded JSONL file
ds = load_dataset(
    "json",
    data_files="data/latin1_data.jsonl",
    encoding="latin-1",
    split="train",
)