Implementation:Datajuicer Data juicer EmptyFormatter

From Leeroopedia
Knowledge Sources
Domains Data_Loading, Formatting
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for creating empty datasets with a specified schema.

Description

EmptyFormatter creates an empty dataset with the given feature keys and length, backed by HuggingFace datasets. "Empty" here means the rows carry no values: the dataset still has `length` rows. The formatter builds a HuggingFace Dataset from a dict of null-valued columns typed as Value('string'), then wraps it in a NestedDataset. A companion class, RayEmptyFormatter, builds a pandas DataFrame of empty dicts and converts it to a Ray dataset via ray.data.from_pandas. Both classes are registered in the FORMATTERS registry.
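The two payload shapes described above can be sketched in plain Python, without importing HuggingFace datasets, pandas, or Ray. This is only an illustration of the data that would be handed to those libraries, not the Data-Juicer source; the helper names `null_columns` and `empty_dict_rows` are hypothetical.

```python
from typing import Dict, List


def null_columns(length: int, feature_keys: List[str]) -> Dict[str, List[None]]:
    """Column-oriented payload: one list of `length` None values per feature key.

    Hypothetical helper; mirrors the "dict of null-valued columns" in the description.
    """
    return {key: [None] * length for key in feature_keys}


def empty_dict_rows(length: int) -> List[dict]:
    """Row-oriented payload: `length` independent empty dicts.

    Hypothetical helper; mirrors the "empty dicts" used on the Ray side.
    """
    return [{} for _ in range(length)]


columns = null_columns(3, ["text", "label"])
rows = empty_dict_rows(3)
print(columns)  # {'text': [None, None, None], 'label': [None, None, None]}
print(rows)     # [{}, {}, {}]
```

Note that with an empty `feature_keys` list the column-oriented payload is an empty dict, so the requested length has nothing to attach to in this sketch.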

Usage

Use when you need an empty dataset of a specific schema as a starting point, such as for data generation pipelines, testing scenarios, or initializing blank datasets to be populated later.
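The "populate later" pattern can be sketched in plain Python. This is a minimal illustration under assumptions: the `populate` helper and the `generate` callback are hypothetical, and in a real pipeline the filling step would be done by a Data-Juicer operator or a dataset mapping call rather than by this function.

```python
from typing import Callable, Dict, List, Optional

Columns = Dict[str, List[Optional[str]]]


def populate(columns: Columns, generate: Callable[[int], Dict[str, str]]) -> Columns:
    """Return a copy of `columns` with each row filled from `generate(row_index)`.

    Keys returned by `generate` that are not in the schema are ignored.
    """
    filled = {key: list(values) for key, values in columns.items()}
    n_rows = len(next(iter(filled.values()), []))
    for index in range(n_rows):
        for key, value in generate(index).items():
            if key in filled:
                filled[key][index] = value
    return filled


# Placeholder with the schema fixed up front and every value still None.
placeholder: Columns = {"text": [None] * 2, "source": [None] * 2}
result = populate(placeholder, lambda i: {"text": f"sample {i}", "source": "synthetic"})
print(result)
# {'text': ['sample 0', 'sample 1'], 'source': ['synthetic', 'synthetic']}
```

Fixing the schema before generation keeps every later step working against the same column names, which is the main reason to start from an empty dataset instead of building columns on the fly.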

Code Reference

Source Location

Signature

@FORMATTERS.register_module()
class EmptyFormatter(BaseFormatter):
    SUFFIXES = []

    def __init__(self, length, feature_keys: List[str] = [], *args, **kwargs):

    def load_dataset(self, *args, **kwargs):

@FORMATTERS.register_module()
class RayEmptyFormatter(BaseFormatter):
    SUFFIXES = []

    def __init__(self, length, feature_keys: List[str] = [], *args, **kwargs):

    def load_dataset(self, *args, **kwargs):

Import

from data_juicer.format.empty_formatter import EmptyFormatter, RayEmptyFormatter

I/O Contract

Inputs

Name | Type | Required | Description
length | int | Yes | The number of rows in the empty dataset
feature_keys | List[str] | No | List of column names for the empty dataset. Default: []

Outputs

Name | Type | Description
dataset | NestedDataset or Ray Dataset | An empty dataset with the specified schema and length, with all values set to None (HuggingFace) or empty dicts (Ray)

Usage Examples

from data_juicer.format.empty_formatter import EmptyFormatter

# Create an empty dataset with 100 rows and specific columns
formatter = EmptyFormatter(
    length=100,
    feature_keys=["text", "label", "source"]
)
empty_dataset = formatter.load_dataset()
# Result: dataset with 100 rows, columns text/label/source all set to None
