Implementation:Datajuicer Data juicer BaseFormatter
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Formatting |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Defines the base formatter classes and the dataset format unification logic provided by Data-Juicer.
Description
BaseFormatter is the abstract base class for all dataset loaders. LocalFormatter extends it to load datasets from local files or directories: it finds files matching the specified suffixes, loads them via HuggingFace's load_dataset, optionally adds suffix metadata, and calls unify_format. RemoteFormatter loads datasets directly from HuggingFace Hub. The unify_format function checks that the text keys are present, filters out samples whose text fields are None or empty, wraps the dataset as a NestedDataset, and converts relative media paths (images, audio, video) to absolute paths. The FORMATTERS registry lets format-specific subclasses register themselves.
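The suffix-matching step described above can be illustrated with a small standalone sketch. It is not the code in formatter.py: the helper name `collect_files_by_suffix` and its return shape (a mapping from suffix to file paths) are assumptions made for illustration, and it uses only the standard library.

```python
from pathlib import Path
from typing import Dict, List, Union


def collect_files_by_suffix(
    dataset_path: str,
    suffixes: Union[str, List[str], None] = None,
) -> Dict[str, List[str]]:
    """Group files under dataset_path by suffix (hypothetical helper)."""
    if isinstance(suffixes, str):
        suffixes = [suffixes]

    # Normalise so that "json", ".json" and ".JSON" are treated alike.
    wanted = None
    if suffixes is not None:
        wanted = {("." + s.lstrip(".")).lower() for s in suffixes}

    root = Path(dataset_path)
    # A single file is its own candidate; a directory is scanned recursively.
    candidates = [root] if root.is_file() else sorted(root.rglob("*"))

    grouped: Dict[str, List[str]] = {}
    for path in candidates:
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if wanted is not None and suffix not in wanted:
            continue
        grouped.setdefault(suffix, []).append(str(path))
    return grouped
```

Grouping by suffix keeps enough information to attach the suffix to each sample later, which is roughly what the add_suffix option is described as doing.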
Usage
Use as the base class when implementing new format-specific loaders, or use LocalFormatter/RemoteFormatter directly to load datasets from files or HuggingFace Hub with automatic format unification.
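As a rough illustration of the self-registration pattern, the sketch below uses a simplified stand-in for the registry rather than Data-Juicer's own Registry class. The `register_module` and `get` methods, the `demo_lines` name, and the `DemoLinesFormatter` class are all assumptions made for this example; check the real registry's API before relying on any of them.

```python
from typing import Callable, Dict, List, Optional


class Registry:
    """Simplified name -> class registry (stand-in, not the real class)."""

    def __init__(self, name: str):
        self.name = name
        self._modules: Dict[str, type] = {}

    def register_module(self, module_name: str) -> Callable[[type], type]:
        # Used as a class decorator so subclasses register themselves.
        def decorator(cls: type) -> type:
            if module_name in self._modules:
                raise KeyError(f"{module_name!r} already registered in {self.name}")
            self._modules[module_name] = cls
            return cls

        return decorator

    def get(self, module_name: str) -> Optional[type]:
        return self._modules.get(module_name)


FORMATTERS = Registry("Formatters")


class BaseFormatter:
    """Abstract base: subclasses must implement load_dataset."""

    def load_dataset(self, *args):
        raise NotImplementedError


@FORMATTERS.register_module("demo_lines")
class DemoLinesFormatter(BaseFormatter):
    """Hypothetical loader that turns in-memory lines into samples."""

    def __init__(self, lines: List[str]):
        self.lines = lines

    def load_dataset(self, *args):
        return [{"text": line} for line in self.lines]


# Look the class up by name, then instantiate and load.
formatter_cls = FORMATTERS.get("demo_lines")
samples = formatter_cls(["first line", "second line"]).load_dataset()
```

Looking formatters up by name means a configuration file can select a loader with a string, without the calling code importing each subclass explicitly.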
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/format/formatter.py
Signature
FORMATTERS = Registry("Formatters")

class BaseFormatter:
    def load_dataset(self, *args) -> Dataset:

class LocalFormatter(BaseFormatter):
    def __init__(
        self,
        dataset_path: str,
        type: str,
        suffixes: Union[str, List[str], None] = None,
        text_keys: List[str] = None,
        add_suffix=False,
        **kwargs,
    ):
    def load_dataset(self, num_proc: Optional[int] = None, global_cfg=None) -> Dataset:

class RemoteFormatter(BaseFormatter):
    def __init__(self, dataset_path: str, text_keys: List[str] = None, **kwargs):
    def load_dataset(self, num_proc: int = 1, global_cfg=None) -> Dataset:

def add_suffixes(datasets: DatasetDict, num_proc: int = 1) -> Dataset:

def unify_format(
    dataset: Dataset,
    text_keys: Union[List[str], str] = "text",
    num_proc: int = 1,
    global_cfg: Union[dict, Namespace] = None,
) -> Dataset:
Import
from data_juicer.format.formatter import BaseFormatter, LocalFormatter, RemoteFormatter, FORMATTERS, unify_format
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_path | str | Yes | Path to a dataset file, directory, or HuggingFace Hub repository |
| type | str | Yes (LocalFormatter) | HuggingFace dataset module type (json, csv, text, parquet, etc.) |
| suffixes | Union[str, List[str], None] | No | File suffixes to include when scanning directories. Default: None |
| text_keys | List[str] | No | Key names of fields that store sample text. Default: None |
| add_suffix | bool | No | Whether to add file suffix to dataset meta info. Default: False |
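| num_proc | int | No | Number of processes for parallel loading. Default: 1 for RemoteFormatter.load_dataset, add_suffixes, and unify_format; None for LocalFormatter.load_dataset |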
| global_cfg | dict or Namespace | No | Global configuration for path conversion and key mapping |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | A unified NestedDataset with validated text fields, filtered empty samples, and absolute media paths |
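The output contract can be sketched on plain Python dictionaries, as a simplified model of what the unified dataset guarantees. This is not the unify_format implementation: the function name `unify_samples`, the media key names (`images`, `audios`, `videos`), the `base_dir` parameter, and the rule that a sample is dropped when any text key is None or an empty string are all assumptions for illustration. The real function operates on a HuggingFace Dataset and returns a NestedDataset.

```python
import os
from typing import Any, Dict, List, Sequence, Union

Sample = Dict[str, Any]


def unify_samples(
    samples: List[Sample],
    text_keys: Union[List[str], str] = "text",
    media_keys: Sequence[str] = ("images", "audios", "videos"),  # assumed names
    base_dir: str = ".",  # assumed: directory that relative paths resolve against
) -> List[Sample]:
    """Validate text keys, drop empty samples, absolutise media paths."""
    if isinstance(text_keys, str):
        text_keys = [text_keys]

    # 1. Every text key must be present in every sample.
    for key in text_keys:
        if any(key not in sample for sample in samples):
            raise KeyError(f"text key {key!r} is missing from at least one sample")

    unified: List[Sample] = []
    for sample in samples:
        # 2. Skip samples whose text is None or an empty string.
        if any(sample[key] is None or sample[key] == "" for key in text_keys):
            continue

        # 3. Copy the sample, then rewrite relative media paths as absolute.
        converted = dict(sample)
        for key in media_keys:
            paths = converted.get(key)
            if not paths:
                continue
            converted[key] = [
                p if os.path.isabs(p) else os.path.abspath(os.path.join(base_dir, p))
                for p in paths
            ]
        unified.append(converted)
    return unified
```

Copying each sample before rewriting its paths leaves the input list untouched, which makes the sketch easier to reason about. A dataset-backed implementation would typically express the same steps as filter and map passes.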
Usage Examples
from data_juicer.format.formatter import LocalFormatter, RemoteFormatter

# Load local JSON files
formatter = LocalFormatter(
    dataset_path="/path/to/data/",
    type="json",
    suffixes=[".json", ".jsonl"],
    text_keys=["text"],
)
dataset = formatter.load_dataset(num_proc=4)

# Load from HuggingFace Hub
formatter = RemoteFormatter(
    dataset_path="squad",
    text_keys=["context"],
)
dataset = formatter.load_dataset()