Implementation:Datajuicer Data juicer JsonFormatter
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Formatting |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete formatter, provided by Data-Juicer, for loading JSON and JSONL files as datasets.
Description
JsonFormatter extends LocalFormatter, setting SUFFIXES = ['.json', '.jsonl', '.jsonl.zst'] and type='json'. All loading logic is delegated to the parent class, which uses HuggingFace's load_dataset with the JSON reader. The class is registered with the FORMATTERS registry via @FORMATTERS.register_module(). It handles JSON/JSONL, the most common Data-Juicer input format and a widely used format for LLM training data.
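The registration and delegation pattern described above can be sketched without Data-Juicer installed. This is a minimal illustration, not the library's code: `Registry`, `SKETCH_FORMATTERS`, `LocalFormatterSketch`, and `JsonFormatterSketch` are hypothetical stand-ins, and the fallback from `suffixes=None` to the class-level `SUFFIXES` is an assumption about how the default is applied.

```python
class Registry:
    """Hypothetical stand-in for a registry such as FORMATTERS."""

    def __init__(self, name):
        self.name = name
        self.modules = {}

    def register_module(self):
        # Return a class decorator that records the class under its own name.
        def decorator(cls):
            self.modules[cls.__name__] = cls
            return cls

        return decorator


SKETCH_FORMATTERS = Registry("formatters")


class LocalFormatterSketch:
    """Hypothetical parent: only stores the path, suffixes, and reader type."""

    def __init__(self, dataset_path, type, suffixes=None, **kwargs):
        self.dataset_path = dataset_path
        self.type = type
        self.suffixes = suffixes
        self.kwargs = kwargs


@SKETCH_FORMATTERS.register_module()
class JsonFormatterSketch(LocalFormatterSketch):
    SUFFIXES = [".json", ".jsonl", ".jsonl.zst"]

    def __init__(self, dataset_path, suffixes=None, **kwargs):
        # Assumed behavior: fall back to the class-level suffixes and
        # pin the reader type to 'json' before delegating to the parent.
        super().__init__(
            dataset_path=dataset_path,
            suffixes=suffixes if suffixes else self.SUFFIXES,
            type="json",
            **kwargs,
        )


formatter = JsonFormatterSketch("/path/to/data.jsonl")
print(formatter.type, formatter.suffixes)
```

The subclass carries no loading logic of its own; it only fixes two parameters, which is why the real class can be so short.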
Usage
Use this formatter to load JSON- or JSONL-formatted datasets into Data-Juicer for processing, either by instantiating it directly or through the automatic format detection system.
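Suffix-based selection of input files can be illustrated with a small standalone sketch. How Data-Juicer's detection resolves files internally is not shown on this page, so `find_json_files` and `JSON_SUFFIXES` below are hypothetical names used only to show the idea of matching file names against the suffix list.

```python
import tempfile
from pathlib import Path

# Same suffix values as JsonFormatter.SUFFIXES on this page.
JSON_SUFFIXES = (".json", ".jsonl", ".jsonl.zst")


def find_json_files(dataset_path, suffixes=JSON_SUFFIXES):
    """Hypothetical helper: list files whose names end with one of `suffixes`."""
    root = Path(dataset_path)
    if root.is_file():
        candidates = [root]
    else:
        candidates = [p for p in root.rglob("*") if p.is_file()]
    # Compare against the full file name: Path.suffix would report only
    # ".zst" for "part.jsonl.zst" and miss the compound suffix.
    wanted = tuple(suffixes)
    return sorted(str(p) for p in candidates if p.name.endswith(wanted))


# Small demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jsonl", "b.json", "c.jsonl.zst", "notes.csv"):
        (Path(tmp) / name).write_text("")
    found = [Path(p).name for p in find_json_files(tmp)]

print(found)
```

Matching on the end of the file name rather than on a single extension is what lets a compound suffix such as `.jsonl.zst` be recognized.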
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/format/json_formatter.py
Signature
@FORMATTERS.register_module()
class JsonFormatter(LocalFormatter):
    SUFFIXES = [".json", ".jsonl", ".jsonl.zst"]

    def __init__(self, dataset_path, suffixes=None, **kwargs):
Import
from data_juicer.format.json_formatter import JsonFormatter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_path | str | Yes | Path to a JSON/JSONL dataset file or directory containing JSON files |
| suffixes | list | No | File suffixes to be processed. Default: None, which falls back to SUFFIXES (['.json', '.jsonl', '.jsonl.zst']) |
| **kwargs | Any | No | Extra arguments passed to the parent LocalFormatter |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | A unified HuggingFace Dataset loaded from the JSON/JSONL files |
Usage Examples
from data_juicer.format.json_formatter import JsonFormatter
# Load a JSONL dataset
formatter = JsonFormatter(dataset_path="/path/to/data.jsonl")
dataset = formatter.load_dataset(num_proc=4)
# Load from a directory of JSON files
formatter = JsonFormatter(dataset_path="/path/to/json_dir/")
dataset = formatter.load_dataset()
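To try the examples above on real input, a small JSONL file can be produced with the standard library alone. This is a sketch under stated assumptions: the field names `text` and `meta` are illustrative choices for the demo, not a schema required by this page.

```python
import json
import tempfile
from pathlib import Path

# Illustrative records; the field names are assumptions for this demo.
records = [
    {"text": "First sample document.", "meta": {"source": "demo"}},
    {"text": "Second sample document.", "meta": {"source": "demo"}},
]

with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "data.jsonl"

    # JSONL: one complete JSON object per line.
    with jsonl_path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Read the file back line by line to confirm it round-trips.
    with jsonl_path.open(encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]["text"])
```

A file written this way, saved to a persistent location instead of a temporary directory, could then be passed as dataset_path to JsonFormatter.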