Implementation:Datajuicer Data juicer EmptyFormatter
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Formatting |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete tool, provided by Data-Juicer, for creating empty datasets with a specified schema.
Description
EmptyFormatter is the HuggingFace-backed implementation: it creates a dataset with the given feature keys and number of rows, where "empty" means every value is null. It builds a HuggingFace Dataset from a dict of null-valued columns, each declared with a Value('string') feature, then wraps the result in a NestedDataset. A companion class, RayEmptyFormatter, creates a pandas DataFrame of empty dicts and converts it to a Ray dataset via ray.data.from_pandas. Both classes are registered with the FORMATTERS registry.
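The HuggingFace path described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's source: the helper names `build_null_columns` and `to_string_dataset` are hypothetical, and the final NestedDataset wrapping step is left out. The `datasets` import is deferred so that the column-building helper runs without the package installed.

```python
from typing import Dict, List, Optional


def build_null_columns(
    length: int, feature_keys: List[str]
) -> Dict[str, List[Optional[str]]]:
    """Return one column of `length` None values per feature key (hypothetical helper)."""
    return {key: [None] * length for key in feature_keys}


def to_string_dataset(columns: Dict[str, List[Optional[str]]]):
    """Build a HuggingFace Dataset with every column declared as a string (hypothetical helper)."""
    # Deferred import: calling this function requires the `datasets` package.
    from datasets import Dataset, Features, Value

    features = Features({key: Value("string") for key in columns})
    return Dataset.from_dict(columns, features=features)


# Pure-Python part: two columns, three null rows each.
columns = build_null_columns(3, ["text", "label"])
```

Declaring the features explicitly is the notable design choice here: a column that holds only None gives the backend no values to infer a type from, so without the declaration the columns would likely not be typed as strings.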
Usage
Use when you need an empty dataset of a specific schema as a starting point, such as for data generation pipelines, testing scenarios, or initializing blank datasets to be populated later.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/format/empty_formatter.py
Signature
@FORMATTERS.register_module()
class EmptyFormatter(BaseFormatter):
    SUFFIXES = []
    def __init__(self, length, feature_keys: List[str] = [], *args, **kwargs):
    def load_dataset(self, *args, **kwargs):

@FORMATTERS.register_module()
class RayEmptyFormatter(BaseFormatter):
    SUFFIXES = []
    def __init__(self, length, feature_keys: List[str] = [], *args, **kwargs):
    def load_dataset(self, *args, **kwargs):
Import
from data_juicer.format.empty_formatter import EmptyFormatter, RayEmptyFormatter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| length | int | Yes | The number of rows in the empty dataset |
| feature_keys | List[str] | No | List of column names for the empty dataset. Default: [] |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | NestedDataset or Ray Dataset | An empty dataset with the specified schema and length, with all values set to None (HuggingFace) or empty dicts (Ray) |
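To make the Ray-side output more concrete, the sketch below builds the kind of frame the description mentions: a pandas DataFrame whose cells are empty dicts, which could then be handed to `ray.data.from_pandas`. The helper names are hypothetical, and the one-column-per-feature-key layout is an assumption; the source only states that the frame holds empty dicts. The pandas and ray imports are deferred so the row-building helper runs on its own.

```python
from typing import Dict, List


def build_empty_dict_columns(
    length: int, feature_keys: List[str]
) -> Dict[str, List[dict]]:
    """Return one column of `length` empty dicts per feature key (hypothetical helper)."""
    # A fresh dict per cell, so no two cells share one mutable object.
    return {key: [{} for _ in range(length)] for key in feature_keys}


def to_ray_dataset(columns: Dict[str, List[dict]]):
    """Convert the columns to a Ray dataset via pandas (assumed layout, hypothetical helper)."""
    # Deferred imports: calling this function requires pandas and ray.
    import pandas as pd
    import ray

    return ray.data.from_pandas(pd.DataFrame(columns))


# Pure-Python part: one column, two rows of empty dicts.
ray_columns = build_empty_dict_columns(2, ["text"])
```

The contrast with the HuggingFace output is the placeholder value: None there, an empty dict here.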
Usage Examples
from data_juicer.format.empty_formatter import EmptyFormatter
# Create an empty dataset with 100 rows and specific columns
formatter = EmptyFormatter(
    length=100,
    feature_keys=["text", "label", "source"]
)
empty_dataset = formatter.load_dataset()
# Result: dataset with 100 rows, columns text/label/source all set to None