Implementation:Huggingface Datasets Dataset To Json

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the HuggingFace Datasets library, for exporting a HuggingFace Dataset to JSON or JSON Lines format.

Description

Dataset.to_json serializes the dataset's Arrow data to JSON Lines (one JSON object per line) or to another JSON layout supported by pandas. Internally it delegates to JsonDatasetWriter, which reads the data in batches of a configurable size, converts each batch to a pandas DataFrame, and calls DataFrame.to_json on it. The default output is JSON Lines (orient="records", lines=True). The method also supports multiprocessing via num_proc, compression (gzip, bz2, xz), and remote storage through fsspec, and it forwards any extra keyword arguments to pandas DataFrame.to_json.
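The batch-wise idea described above can be illustrated with a small standard-library sketch. This is not the library's code: it uses the json module instead of pandas, and the names iter_batches and write_json_lines are hypothetical helpers introduced only for illustration. It shows rows being serialized one batch at a time to a binary file object, with the running byte count returned.

```python
import io
import json
from typing import BinaryIO, Dict, Iterator, List

Row = Dict[str, object]


def iter_batches(rows: List[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive slices of at most batch_size rows (hypothetical helper)."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def write_json_lines(rows: List[Row], file_obj: BinaryIO, batch_size: int = 2) -> int:
    """Serialize rows batch by batch as JSON Lines; return the number of bytes written."""
    written = 0
    for batch in iter_batches(rows, batch_size):
        # One JSON object per line, each line terminated by a newline.
        payload = "".join(json.dumps(row) + "\n" for row in batch)
        written += file_obj.write(payload.encode("utf-8"))
    return written


rows = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}, {"id": 3, "text": "c"}]
buffer = io.BytesIO()
total = write_json_lines(rows, buffer, batch_size=2)
```

Only one batch is held in memory as text at any time, which is the reason a batch size is exposed in the first place.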

Usage

Use Dataset.to_json when you need to export a processed dataset to JSON Lines for downstream NLP pipelines, web APIs, or any system that consumes line-delimited JSON. Use the orient and lines parameters to produce other JSON formats when needed.
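To make the difference between the two output shapes concrete, here is a minimal sketch using only the standard json module (an assumption for illustration; it does not call pandas or the datasets library). It builds the same records as JSON Lines and as a single JSON array, then parses each the way a downstream consumer would.

```python
import json

records = [{"id": 1, "label": "pos"}, {"id": 2, "label": "neg"}]

# JSON Lines: one object per line, so a consumer can parse line by line.
json_lines_text = "".join(json.dumps(r) + "\n" for r in records)
parsed_from_lines = [json.loads(line) for line in json_lines_text.splitlines() if line]

# Single JSON document: one array holding all records, parsed as a whole.
json_array_text = json.dumps(records)
parsed_from_array = json.loads(json_array_text)
```

Line-delimited output suits streaming consumers, while a single array is easier for tools that expect one JSON document per file.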

Code Reference

Source Location

Repository: datasets
File: src/datasets/arrow_dataset.py
Lines: L5097-L5157

Signature

def to_json(
    self,
    path_or_buf: Union[PathLike, BinaryIO],
    batch_size: Optional[int] = None,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
    **to_json_kwargs,
) -> int:

Import

from datasets import Dataset
# to_json is a method on Dataset instances

I/O Contract

Inputs

Name	Type	Required	Description
path_or_buf	`Union[PathLike, BinaryIO]`	Yes	Path to a file, a remote URI, or a binary file object where the JSON will be written.
batch_size	`Optional[int]`	No	Number of rows to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.
num_proc	`Optional[int]`	No	Number of processes for multiprocessing. Defaults to None (single process).
storage_options	`Optional[dict]`	No	Key/value pairs for the fsspec file-system backend.
**to_json_kwargs		No	Additional keyword arguments forwarded to pandas.DataFrame.to_json (e.g., orient, lines, compression, index).

Outputs

Name	Type	Description
written	`int`	The number of characters or bytes written.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("json", data_files="input.jsonl", split="train")

# Export to JSON Lines (default)
num_bytes = ds.to_json("output.jsonl")

# Export to full JSON
num_bytes = ds.to_json("output.json", orient="records", lines=False)

# Export with compression
num_bytes = ds.to_json("output.jsonl.gz", compression="gzip")

# Export with multiprocessing
num_bytes = ds.to_json("output.jsonl", num_proc=4)
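
A downstream consumer of a gzip-compressed JSON Lines export might read it back as sketched below. This is an illustrative, standard-library-only example: the file is written by hand as a stand-in for the export step, and the file name and records are assumptions, not output of the library.

```python
import gzip
import json
import os
import tempfile

records = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.jsonl.gz")  # stand-in file name

    # Stand-in for the export step: write gzip-compressed JSON Lines by hand.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Consumer side: stream the file and parse one line at a time.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]
```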

Related Pages

Implements Principle

Principle:Huggingface_Datasets_JSON_Export
