Implementation:Datajuicer Data juicer Schema

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Core
Last Updated 2026-02-14 16:00 GMT

Overview

Schema is the concrete class Data-Juicer provides for representing dataset schemas in a unified way.

Description

Schema is a dataclass that records a dataset's column names and types in a common Python type representation. It stores column_types (a dict mapping each column name to a Python type) and columns (an ordered list of column names). Two factory methods build a Schema from backend-specific schemas: from_hf_features converts HuggingFace Features, and from_ray_schema converts Ray/PyArrow schemas. Both rely on recursive type-mapping functions (map_hf_type_to_python and map_ray_type_to_python) that handle primitives, sequences, structs, and nested types.
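The recursive mapping described above can be illustrated with a minimal, standard-library-only sketch. It does not use Data-Juicer, HuggingFace, or PyArrow: the toy type spec (dtype-name strings for primitives, one-element lists for sequences, dicts for structs), the `_PRIMITIVES` table, and `map_spec_to_python` are all hypothetical names chosen for illustration, and the output representation is simplified compared with what the real mapping functions return.

```python
from typing import Any, Dict

# Hypothetical dtype-name table for this sketch; not Data-Juicer's actual mapping.
_PRIMITIVES: Dict[str, type] = {
    "string": str,
    "int32": int,
    "int64": int,
    "float32": float,
    "float64": float,
    "bool": bool,
}


def map_spec_to_python(spec: Any) -> Any:
    """Recursively map a toy type spec to a Python type representation."""
    if isinstance(spec, str):
        # Primitive: look up by dtype name, falling back to Any if unknown.
        return _PRIMITIVES.get(spec, Any)
    if isinstance(spec, list):
        # Sequence: a one-element list holding the element spec.
        if len(spec) != 1:
            raise ValueError("a sequence spec must hold exactly one element spec")
        return [map_spec_to_python(spec[0])]
    if isinstance(spec, dict):
        # Struct: map every field recursively, preserving field order.
        return {name: map_spec_to_python(sub) for name, sub in spec.items()}
    raise TypeError(f"unsupported spec: {spec!r}")


# A nested spec: a primitive, a sequence, and a struct containing a sequence.
spec = {"text": "string", "scores": ["float32"], "meta": {"tags": ["string"]}}
column_types = map_spec_to_python(spec)
columns = list(column_types)
```

The same shape (one branch per kind of type, recursing into element and field types) is what lets a single function cover arbitrarily nested schemas.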

Usage

Use Schema when you need to inspect or compare dataset structure independently of the backend, whether the data comes from HuggingFace datasets or from Ray.
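As a rough illustration of such a comparison, the sketch below diffs two schemas once both are expressed as column names plus Python types. `SchemaSketch` is a hypothetical stand-in that only mirrors the two documented fields (column_types and columns), and `diff_schemas` is a helper invented for this example; neither is part of Data-Juicer.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SchemaSketch:
    """Hypothetical stand-in with the same two fields as Schema."""
    column_types: Dict[str, Any]
    columns: List[str]


def diff_schemas(left: SchemaSketch, right: SchemaSketch) -> Dict[str, Any]:
    """Report columns found on one side only and columns whose types differ."""
    left_only = [c for c in left.columns if c not in right.column_types]
    right_only = [c for c in right.columns if c not in left.column_types]
    mismatched = {
        c: (left.column_types[c], right.column_types[c])
        for c in left.columns
        if c in right.column_types and left.column_types[c] != right.column_types[c]
    }
    return {"left_only": left_only, "right_only": right_only, "mismatched": mismatched}


# Two schemas as they might look after conversion from different backends.
hf_side = SchemaSketch({"text": str, "label": int, "score": float}, ["text", "label", "score"])
ray_side = SchemaSketch({"text": str, "label": str, "source": str}, ["text", "label", "source"])

report = diff_schemas(hf_side, ray_side)
```

Because both sides are reduced to plain Python types first, the comparison itself needs no knowledge of HuggingFace or PyArrow type systems.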

Code Reference

Source Location

Signature

@dataclass
class Schema:
    column_types: Dict[str, Any]  # column name -> Python type
    columns: List[str]            # column names, in order

    @classmethod
    def from_hf_features(cls, features: Features): ...

    @classmethod
    def from_ray_schema(cls, schema): ...

    @classmethod
    def map_hf_type_to_python(cls, feature): ...

    @classmethod
    def map_ray_type_to_python(cls, ray_type: pa.DataType): ...

Import

from data_juicer.core.data.schema import Schema

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| column_types | Dict[str, Any] | Yes | Mapping of column names to their Python types |
| columns | List[str] | Yes | Ordered list of column names |
| features | Features | Yes (for from_hf_features) | HuggingFace Features object to convert |
| schema | PyArrow Schema | Yes (for from_ray_schema) | Ray/PyArrow schema to convert |

Outputs

| Name | Type | Description |
|---|---|---|
| schema | Schema | A Schema instance with column_types and columns populated from the source schema |

Usage Examples

from data_juicer.core.data.schema import Schema
from datasets import Features, Value, Sequence

# Create schema from HuggingFace features
features = Features({
    "text": Value("string"),
    "label": Value("int64"),
    "scores": Sequence(Value("float32"))
})
schema = Schema.from_hf_features(features)
print(schema)
# Dataset Schema:
# ----------------------------------------
# text: <class 'str'>
# label: <class 'int'>
# scores: typing.List[float]
