Implementation:Datajuicer Data juicer Schema

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Core
Last Updated	2026-02-14 16:00 GMT

Overview

Schema is the concrete class Data-Juicer provides for representing a dataset's schema in one unified form, independent of the backend that holds the data.

Description

Schema is a dataclass that represents dataset column names and types, providing conversion methods from both HuggingFace Features and Ray/PyArrow schemas to a common Python type representation. It stores column_types (a dict mapping column names to Python types) and columns (an ordered list of names). Factory methods from_hf_features and from_ray_schema convert HuggingFace Features and PyArrow schemas respectively, using recursive type mapping functions that handle primitives, sequences, structs, and nested types.
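The recursive mapping described above can be sketched without either backend installed. The snippet below is a minimal illustration under stated assumptions, not Data-Juicer's actual code: the `SketchSchema` class, the `map_type` helper, the `PRIMITIVES` table, and the nested string/list/dict type descriptors are all hypothetical stand-ins for HuggingFace Features or PyArrow types.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical primitive table: backend type names -> Python types.
PRIMITIVES = {
    "string": str,
    "int32": int,
    "int64": int,
    "float32": float,
    "float64": float,
    "bool": bool,
}


def map_type(descriptor: Any) -> Any:
    """Recursively map a simplified type descriptor to a Python type.

    Assumed descriptor shapes (illustrative only):
      "int64"         -> primitive
      ["float32"]     -> sequence with one element type
      {"x": "int64"}  -> struct with named fields
    """
    if isinstance(descriptor, str):
        # Unknown primitive names fall back to Any instead of raising.
        return PRIMITIVES.get(descriptor, Any)
    if isinstance(descriptor, list) and len(descriptor) == 1:
        inner = map_type(descriptor[0])
        # typing.List needs a real type as its parameter, so a sequence of
        # structs is represented as List[dict] in this sketch.
        return List[dict] if isinstance(inner, dict) else List[inner]
    if isinstance(descriptor, dict):
        # Structs recurse field by field.
        return {name: map_type(sub) for name, sub in descriptor.items()}
    return Any


@dataclass
class SketchSchema:
    # Same two fields as the Schema described on this page.
    column_types: Dict[str, Any]
    columns: List[str]

    @classmethod
    def from_descriptors(cls, descriptors: Dict[str, Any]) -> "SketchSchema":
        # Hypothetical factory, analogous in spirit to from_hf_features.
        column_types = {name: map_type(d) for name, d in descriptors.items()}
        return cls(column_types=column_types, columns=list(descriptors))


schema = SketchSchema.from_descriptors({
    "text": "string",
    "scores": ["float32"],
    "meta": {"source": "string", "tokens": ["int64"]},
})
```

Keeping `columns` as a separate list preserves column order explicitly, while `column_types` gives direct lookup by name.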

Usage

Use when you need to inspect or compare dataset structure in a backend-agnostic way, regardless of whether the data comes from HuggingFace datasets or Ray.
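As an illustration of backend-agnostic comparison, the sketch below diffs two schema objects that expose the same `column_types` and `columns` fields. `SketchSchema` and `diff_schemas` are hypothetical names introduced for this example; the page does not state that Data-Juicer ships a comparison helper, so treat this as one possible approach rather than the library's API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SketchSchema:
    # Stand-in with the same two fields as the Schema described on this page.
    column_types: Dict[str, Any]
    columns: List[str]


def diff_schemas(left: SketchSchema, right: SketchSchema) -> Dict[str, Any]:
    """Report columns missing on either side or present with different types."""
    left_only = [c for c in left.columns if c not in right.column_types]
    right_only = [c for c in right.columns if c not in left.column_types]
    mismatched = {
        c: (left.column_types[c], right.column_types[c])
        for c in left.columns
        if c in right.column_types
        and left.column_types[c] != right.column_types[c]
    }
    return {
        "left_only": left_only,
        "right_only": right_only,
        "mismatched": mismatched,
    }


# Two schemas as they might look after conversion from different backends.
hf_side = SketchSchema(
    {"text": str, "label": int, "scores": List[float]},
    ["text", "label", "scores"],
)
ray_side = SketchSchema(
    {"text": str, "label": float, "lang": str},
    ["text", "label", "lang"],
)

report = diff_schemas(hf_side, ray_side)
```

Because both sides hold plain Python types, the comparison needs no knowledge of which backend produced each schema.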

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/core/data/schema.py

Signature

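from dataclasses import dataclass
from typing import Any, Dict, List

import pyarrow as pa
from datasets import Features


@dataclass
class Schema:
    column_types: Dict[str, Any]  # column name -> Python type
    columns: List[str]            # ordered column names

    # Method bodies are elided; only the signatures are shown.
    @classmethod
    def from_hf_features(cls, features: Features): ...

    @classmethod
    def from_ray_schema(cls, schema): ...

    @classmethod
    def map_hf_type_to_python(cls, feature): ...

    @classmethod
    def map_ray_type_to_python(cls, ray_type: pa.DataType): ...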

Import

from data_juicer.core.data.schema import Schema

I/O Contract

Inputs

Name	Type	Required	Description
column_types	Dict[str, Any]	Yes	Mapping of column names to their Python types
columns	List[str]	Yes	Ordered list of column names
features	Features	Yes (for from_hf_features)	HuggingFace Features object to convert
schema	PyArrow Schema	Yes (for from_ray_schema)	Ray/PyArrow schema to convert

Outputs

Name	Type	Description
schema	Schema	A Schema instance with column_types and columns populated from the source schema

Usage Examples

from data_juicer.core.data.schema import Schema
from datasets import Features, Value, Sequence

# Create schema from HuggingFace features
features = Features({
    "text": Value("string"),
    "label": Value("int64"),
    "scores": Sequence(Value("float32"))
})
schema = Schema.from_hf_features(features)
print(schema)
# Dataset Schema:
# ----------------------------------------
# text: <class 'str'>
# label: <class 'int'>
# scores: typing.List[float]

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
