Implementation:Datajuicer Data juicer EmptyFormatter

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Loading, Formatting
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for creating empty datasets with a specified schema.

Description

EmptyFormatter creates an empty dataset with the given feature keys and number of rows, backed by HuggingFace datasets. It builds a HuggingFace Dataset from a dict of columns whose values are all None, typed with Value('string') features, and then wraps the result in a NestedDataset. A companion class, RayEmptyFormatter, creates a pandas DataFrame of empty dicts and converts it to a Ray dataset via ray.data.from_pandas. Both classes are registered with the FORMATTERS registry.
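The column construction described above can be sketched as follows. This is a minimal illustration, not the code in empty_formatter.py: the helper name build_null_columns is hypothetical, and the HuggingFace step is skipped when the datasets package is not installed. The NestedDataset wrapping is left out.

```python
from typing import Dict, List, Optional


def build_null_columns(
    length: int, feature_keys: List[str]
) -> Dict[str, List[Optional[str]]]:
    """Hypothetical helper: one column per key, each holding `length` None values."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return {key: [None] * length for key in feature_keys}


columns = build_null_columns(3, ["text", "label"])
print(columns)

# Optional step: pass the dict to HuggingFace `datasets` when it is available.
try:
    from datasets import Dataset, Features, Value
except ImportError:
    Dataset = None

if Dataset is not None:
    # Every column is declared as a string feature, as the description states.
    features = Features({key: Value("string") for key in columns})
    hf_dataset = Dataset.from_dict(columns, features=features)
    print(hf_dataset.num_rows, hf_dataset.column_names)
```

Declaring the features explicitly matters here because a column containing only None values carries no type information of its own.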

Usage

Use EmptyFormatter when you need an empty dataset with a specific schema as a starting point, for example in data generation pipelines, in tests, or when initializing a blank dataset to be populated later.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/format/empty_formatter.py

Signature

@FORMATTERS.register_module()
class EmptyFormatter(BaseFormatter):
    SUFFIXES = []

    def __init__(self, length, feature_keys: List[str] = [], *args, **kwargs):
        ...

    def load_dataset(self, *args, **kwargs):
        ...

@FORMATTERS.register_module()
class RayEmptyFormatter(BaseFormatter):
    SUFFIXES = []

    def __init__(self, length, feature_keys: List[str] = [], *args, **kwargs):
        ...

    def load_dataset(self, *args, **kwargs):
        ...

Import

from data_juicer.format.empty_formatter import EmptyFormatter, RayEmptyFormatter

I/O Contract

Inputs

Name	Type	Required	Description
length	int	Yes	The number of rows in the empty dataset
feature_keys	List[str]	No	List of column names for the empty dataset. Default: []

Outputs

Name	Type	Description
dataset	NestedDataset or Ray Dataset	An empty dataset with the specified schema and length. Values are None in the HuggingFace variant and empty dicts in the Ray variant.
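The Ray-side shape in the table can be sketched in the same spirit. This is an assumption-laden illustration rather than the RayEmptyFormatter code: build_empty_rows is a hypothetical helper, the pandas step runs only if pandas is installed, and the final ray.data.from_pandas call named in the description is shown as a comment rather than executed.

```python
from typing import Dict, List


def build_empty_rows(length: int) -> List[Dict]:
    """Hypothetical helper: `length` distinct empty dicts, one per row."""
    return [{} for _ in range(length)]


rows = build_empty_rows(3)
print(len(rows))

# Optional step: build the intermediate pandas DataFrame when pandas is available.
try:
    import pandas as pd
except ImportError:
    pd = None

if pd is not None:
    df = pd.DataFrame(rows)
    print(df.shape)
    # The description's last step would then be:
    #   ray_dataset = ray.data.from_pandas(df)
    # It is not run here because it requires a Ray installation and runtime.
```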

Usage Examples

from data_juicer.format.empty_formatter import EmptyFormatter

# Create an empty dataset with 100 rows and specific columns
formatter = EmptyFormatter(
    length=100,
    feature_keys=["text", "label", "source"]
)
empty_dataset = formatter.load_dataset()
# Result: dataset with 100 rows, columns text/label/source all set to None

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
