
Implementation:Huggingface Datasets Dataset To Json

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the HuggingFace Datasets library, for exporting a Dataset to JSON or JSON Lines format.

Description

Dataset.to_json serializes the dataset's Arrow data to JSON Lines (one JSON object per line) or to a full JSON document. Internally, it delegates to JsonDatasetWriter, which processes the data in batches of configurable size, converts each batch to a pandas DataFrame, and calls DataFrame.to_json on it. The default output is JSON Lines (orient="records", lines=True). The method supports multiprocessing, compression (gzip, bz2, xz), and remote storage via fsspec, and it forwards any additional keyword arguments to pandas DataFrame.to_json.
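The batched writing strategy described above can be illustrated with a small dependency-free sketch. This is not the library's code: write_json_lines is a hypothetical helper, and the standard json module stands in for the pandas DataFrame.to_json call so that the example runs without extra packages. It only shows the general idea of serializing one batch at a time and accumulating the number of bytes written.

```python
import io
import json
from typing import BinaryIO


def write_json_lines(rows: list[dict], file_obj: BinaryIO, batch_size: int = 2) -> int:
    """Hypothetical helper: write rows as JSON Lines in batches, return bytes written."""
    written = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One JSON object per line; each batch ends with a newline so that
        # consecutive batches concatenate into valid JSON Lines.
        payload = "".join(json.dumps(row) + "\n" for row in batch)
        written += file_obj.write(payload.encode("utf-8"))
    return written


rows = [{"id": i, "text": f"example {i}"} for i in range(5)]
buffer = io.BytesIO()
written = write_json_lines(rows, buffer)
lines = buffer.getvalue().decode("utf-8").splitlines()
print(written, len(lines))
```

Because only one batch is held as a string at a time, peak memory depends on the batch size rather than on the size of the whole dataset.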

Usage

Use Dataset.to_json when you need to export a processed dataset to JSON Lines for downstream NLP pipelines, web APIs, or any system that consumes line-delimited JSON. Use the orient and lines parameters to produce other JSON formats when needed.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L5097-L5157

Signature

def to_json(
    self,
    path_or_buf: Union[PathLike, BinaryIO],
    batch_size: Optional[int] = None,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
    **to_json_kwargs,
) -> int:

Import

from datasets import Dataset
# to_json is a method on Dataset instances

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| path_or_buf | Union[PathLike, BinaryIO] | Yes | Path to a file, a remote URI, or a binary file object where the JSON will be written. |
| batch_size | Optional[int] | No | Number of rows to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE. |
| num_proc | Optional[int] | No | Number of processes for multiprocessing. Defaults to None (single process). |
| storage_options | Optional[dict] | No | Key/value pairs for the fsspec file-system backend. |
| **to_json_kwargs | | No | Additional keyword arguments forwarded to pandas.DataFrame.to_json (e.g., orient, lines, compression, index). |

Outputs

| Name | Type | Description |
|---|---|---|
| written | int | The number of characters or bytes written. |

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("json", data_files="input.jsonl", split="train")

# Export to JSON Lines (default)
num_bytes = ds.to_json("output.jsonl")

# Export to full JSON
num_bytes = ds.to_json("output.json", orient="records", lines=False)

# Export with compression
num_bytes = ds.to_json("output.jsonl.gz", compression="gzip")

# Export with multiprocessing
num_bytes = ds.to_json("output.jsonl", num_proc=4)

Related Pages

Implements Principle
