Implementation:Datajuicer Data juicer JsonFormatter

From Leeroopedia
Knowledge Sources
Domains: Data_Loading, Formatting
Last Updated: 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for loading JSON and JSONL files and formatting them as datasets.

Description

JsonFormatter extends LocalFormatter, setting SUFFIXES = ['.json', '.jsonl', '.jsonl.zst'] and type='json'. All loading logic is delegated to the parent class, which calls HuggingFace's load_dataset with the JSON reader. The class is registered with the FORMATTERS registry via @FORMATTERS.register_module(). It is the primary formatter for JSON/JSONL input, the default and most widely used Data-Juicer format for LLM training data.
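The registration and delegation pattern described above can be sketched in a few lines. This is a minimal, self-contained illustration and not Data-Juicer's actual code: the Registry class, its get method, and the LocalFormatter constructor shown here are hypothetical stand-ins, and only the JsonFormatter class shape follows the signature documented on this page.

```python
class Registry:
    """Toy name -> class registry (hypothetical stand-in for FORMATTERS)."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self):
        # Returns a decorator, which is why the call site uses parentheses.
        def decorator(cls):
            self._modules[cls.__name__] = cls
            return cls
        return decorator

    def get(self, key):
        # Hypothetical lookup helper, added for illustration only.
        return self._modules.get(key)


FORMATTERS = Registry("formatters")


class LocalFormatter:
    """Hypothetical parent: only stores what a real loader would use."""

    def __init__(self, dataset_path, type, suffixes=None, **kwargs):
        self.dataset_path = dataset_path
        self.type = type
        self.suffixes = suffixes
        self.kwargs = kwargs


@FORMATTERS.register_module()
class JsonFormatter(LocalFormatter):
    SUFFIXES = [".json", ".jsonl", ".jsonl.zst"]

    def __init__(self, dataset_path, suffixes=None, **kwargs):
        # Assumed delegation: fix type to "json" and fall back to SUFFIXES.
        super().__init__(
            dataset_path=dataset_path,
            type="json",
            suffixes=suffixes if suffixes else self.SUFFIXES,
            **kwargs,
        )


# Look the class up by name, then instantiate it.
formatter = FORMATTERS.get("JsonFormatter")("/path/to/data.jsonl")
print(formatter.type, formatter.suffixes)
```

The subclass carries no loading logic of its own in this sketch; it only pins the reader type and the default suffix list, which mirrors the division of responsibility the description attributes to JsonFormatter and its parent.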

Usage

Use JsonFormatter to load JSON or JSONL datasets into Data-Juicer for processing, either by instantiating it directly or through the automatic format detection system.
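Suffix-based file selection, which format detection relies on, might look roughly like the sketch below. It uses only the Python standard library; the helper names has_suffix and find_files are hypothetical and do not correspond to Data-Juicer functions. One detail worth noting is the compound suffix '.jsonl.zst': splitting off only the last extension would yield '.zst', so the sketch matches with str.endswith instead.

```python
import os
import tempfile

JSON_SUFFIXES = [".json", ".jsonl", ".jsonl.zst"]


def has_suffix(path, suffixes):
    # endswith handles compound suffixes such as ".jsonl.zst",
    # which os.path.splitext would reduce to ".zst".
    return path.lower().endswith(tuple(suffixes))


def find_files(dataset_path, suffixes):
    """Return sorted matching files for a single file or a directory tree."""
    if os.path.isfile(dataset_path):
        return [dataset_path] if has_suffix(dataset_path, suffixes) else []
    matches = []
    for root, _dirs, files in os.walk(dataset_path):
        for name in files:
            if has_suffix(name, suffixes):
                matches.append(os.path.join(root, name))
    return sorted(matches)


# Usage: build a throwaway directory and list the files that qualify.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.json", "b.jsonl.zst", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    selected = [os.path.basename(p) for p in find_files(tmp, JSON_SUFFIXES)]

print(selected)
```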

Code Reference

Source Location

Signature

@FORMATTERS.register_module()
class JsonFormatter(LocalFormatter):
    SUFFIXES = [".json", ".jsonl", ".jsonl.zst"]

    def __init__(self, dataset_path, suffixes=None, **kwargs):

Import

from data_juicer.format.json_formatter import JsonFormatter

I/O Contract

Inputs

Name | Type | Required | Description
dataset_path | str | Yes | Path to a JSON/JSONL dataset file or directory containing JSON files
suffixes | list | No | File suffixes to be processed. Default: ['.json', '.jsonl', '.jsonl.zst']
**kwargs | Any | No | Extra arguments passed to the parent LocalFormatter

Outputs

Name | Type | Description
dataset | Dataset | A unified HuggingFace Dataset loaded from the JSON/JSONL files
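To make the input side of this contract concrete, the sketch below shows the two file layouts involved and how records from both can be merged into one list. It is a standard-library illustration only: it does not produce a HuggingFace Dataset, the read_records helper is hypothetical, and the "text" field name is an assumption chosen for the example.

```python
import json
import os
import tempfile


def read_records(path):
    """Read a .jsonl file (one object per line) or a .json file (array or object)."""
    with open(path, encoding="utf-8") as f:
        if path.endswith(".jsonl"):
            # JSONL: each non-empty line is an independent JSON document.
            return [json.loads(line) for line in f if line.strip()]
        data = json.load(f)
        return data if isinstance(data, list) else [data]


with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = os.path.join(tmp, "a.jsonl")
    json_path = os.path.join(tmp, "b.json")
    with open(jsonl_path, "w", encoding="utf-8") as f:
        f.write('{"text": "first"}\n{"text": "second"}\n')
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump([{"text": "third"}], f)
    # Merge records from both layouts into a single collection.
    records = read_records(jsonl_path) + read_records(json_path)

print(len(records))
```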

Usage Examples

from data_juicer.format.json_formatter import JsonFormatter

# Load a JSONL dataset
formatter = JsonFormatter(dataset_path="/path/to/data.jsonl")
dataset = formatter.load_dataset(num_proc=4)

# Load from a directory of JSON files
formatter = JsonFormatter(dataset_path="/path/to/json_dir/")
dataset = formatter.load_dataset()
