Implementation:Huggingface Datasets Dataset To Json
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the HuggingFace Datasets library, for exporting a Dataset to JSON or JSON Lines format.
Description
Dataset.to_json is a method that serializes the dataset's Arrow data to JSON Lines (one JSON object per line) or full JSON format. Internally, it delegates to JsonDatasetWriter, which processes the data in configurable batch sizes, converting each batch to a pandas DataFrame and calling DataFrame.to_json. The default output is JSON Lines (orient="records", lines=True). The method supports multiprocessing, compression (gzip, bz2, xz), remote storage via fsspec, and all pandas to_json keyword arguments.
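The batch-wise serialization described above can be illustrated with a minimal sketch. This is not the library's implementation: it swaps pandas and Arrow for the standard-library `json` module, and the names `iter_batches`, `write_json_lines`, and `rows` are hypothetical. It only shows the general pattern of slicing rows into batches, encoding each batch as JSON Lines, and returning the total number of bytes written.

```python
import io
import json
from typing import BinaryIO, Dict, Iterator, List


def iter_batches(rows: List[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def write_json_lines(rows: List[Dict], buf: BinaryIO, batch_size: int = 1000) -> int:
    """Write rows as JSON Lines, one batch at a time; return bytes written."""
    written = 0
    for batch in iter_batches(rows, batch_size):
        # One JSON object per line, each line terminated by a newline.
        payload = "".join(json.dumps(row) + "\n" for row in batch)
        written += buf.write(payload.encode("utf-8"))
    return written


# Made-up rows standing in for a dataset.
rows = [
    {"text": "hello", "label": 0},
    {"text": "world", "label": 1},
    {"text": "!", "label": 0},
]
buf = io.BytesIO()
total = write_json_lines(rows, buf, batch_size=2)
```

Only one batch is held as an encoded string at a time, which is the main reason to write in batches rather than serialize the whole dataset at once.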
Usage
Use Dataset.to_json when you need to export a processed dataset to JSON Lines for downstream NLP pipelines, web APIs, or any system that consumes line-delimited JSON. Use the orient and lines parameters to produce other JSON formats when needed.
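As an illustration of the downstream side, a consumer of line-delimited JSON might look like the sketch below. The helper `read_json_lines` and the file contents are hypothetical, and the file is written by hand here as a stand-in for a real export.

```python
import json
import os
import tempfile
from typing import Dict, Iterator


def read_json_lines(path: str) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Stand-in for a file produced by an export; the content is made up.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "output.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write('{"text": "hello", "label": 0}\n{"text": "world", "label": 1}\n')

records = list(read_json_lines(path))
```

Because each line is an independent JSON document, the consumer can stream records without loading the whole file into memory.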
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L5097-L5157
Signature
def to_json(
    self,
    path_or_buf: Union[PathLike, BinaryIO],
    batch_size: Optional[int] = None,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
    **to_json_kwargs,
) -> int:
Import
from datasets import Dataset
# to_json is a method on Dataset instances
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path_or_buf | Union[PathLike, BinaryIO] | Yes | Path to a file, a remote URI, or a binary file object where the JSON will be written. |
| batch_size | Optional[int] | No | Number of rows to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE. |
| num_proc | Optional[int] | No | Number of processes for multiprocessing. Defaults to None (single process). |
| storage_options | Optional[dict] | No | Key/value pairs for the fsspec file-system backend. |
| **to_json_kwargs | kwargs | No | Additional keyword arguments forwarded to pandas.DataFrame.to_json (e.g., orient, lines, compression, index). |
Outputs
| Name | Type | Description |
|---|---|---|
| written | int | The number of characters or bytes written. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("json", data_files="input.jsonl", split="train")
# Export to JSON Lines (default)
num_bytes = ds.to_json("output.jsonl")
# Export to full JSON
num_bytes = ds.to_json("output.json", orient="records", lines=False)
# Export with compression
num_bytes = ds.to_json("output.jsonl.gz", compression="gzip")
# Export with multiprocessing
num_bytes = ds.to_json("output.jsonl", num_proc=4)
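To check a gzip-compressed export such as the one above, the file can be read back with the standard library alone. The sketch below is an assumption-laden illustration: it writes a small gzip JSON Lines file by hand as a stand-in for real output, and the rows are made up.

```python
import gzip
import json
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "output.jsonl.gz")

# Stand-in for a gzip-compressed export; the rows are made up.
rows = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
with gzip.open(path, "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read it back in text mode, one JSON object per line.
with gzip.open(path, "rt", encoding="utf-8") as f:
    restored = [json.loads(line) for line in f if line.strip()]
```

A round trip like this is a cheap sanity check that the row count and field names survived the export.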