Implementation:Datajuicer Data juicer Schema
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Core |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete tool, provided by Data-Juicer, for representing dataset schemas in a unified way.
Description
Schema is a dataclass that represents a dataset's column names and types in a common Python type representation. It stores column_types (a dict mapping column names to Python types) and columns (an ordered list of column names). The factory methods from_hf_features and from_ray_schema build a Schema from HuggingFace Features and Ray/PyArrow schemas respectively, using recursive type-mapping functions that handle primitives, sequences, structs, and nested types.
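The recursive mapping described above can be illustrated with a minimal, standard-library-only sketch. It is not the Data-Juicer implementation: the simplified type spec (strings for primitives, one-element lists for sequences, dicts for structs), the PRIMITIVES table, and the function name map_spec_to_python are all hypothetical stand-ins for the HuggingFace and PyArrow type objects that the real methods receive.

```python
from typing import Any, Dict, List

# Hypothetical primitive-name table; the real mapping in Data-Juicer may differ.
PRIMITIVES: Dict[str, type] = {
    "string": str,
    "int32": int,
    "int64": int,
    "float32": float,
    "float64": float,
    "bool": bool,
}


def map_spec_to_python(spec: Any) -> Any:
    """Recursively map a simplified type spec to a Python type."""
    if isinstance(spec, str):
        # Primitive: look up by name, fall back to Any for unknown names.
        return PRIMITIVES.get(spec, Any)
    if isinstance(spec, list) and len(spec) == 1:
        # Sequence: a one-element list holding the element spec.
        elem = map_spec_to_python(spec[0])
        # typing.List needs a type parameter, so structs collapse to dict here.
        return List[dict] if isinstance(elem, dict) else List[elem]
    if isinstance(spec, dict):
        # Struct: map every field, keeping the field names.
        return {name: map_spec_to_python(sub) for name, sub in spec.items()}
    # Anything else is treated as unknown.
    return Any


# Build the two pieces a Schema stores: column_types and ordered columns.
column_types = {
    name: map_spec_to_python(spec)
    for name, spec in {
        "text": "string",
        "scores": ["float32"],
        "meta": {"source": "string", "tags": ["string"]},
    }.items()
}
columns = list(column_types)
```

The point of the sketch is the shape of the recursion: each container type delegates to the same function for its children, so arbitrarily nested inputs reduce to the same small set of Python types.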
Usage
Use Schema when you need to inspect or compare dataset structure in a backend-agnostic way, whether the data comes from HuggingFace datasets or Ray.
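As a rough illustration of backend-agnostic comparison, the sketch below diffs two schema-like objects using only their column_types and columns fields. SchemaLike and diff_schemas are hypothetical names introduced here so the example runs without Data-Juicer installed; a real Schema instance exposes the same two fields, so it should be usable in place of SchemaLike.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SchemaLike:
    """Hypothetical stand-in exposing the same two fields as Schema."""
    column_types: Dict[str, Any]
    columns: List[str]


def diff_schemas(left, right) -> Dict[str, Any]:
    """Report columns unique to each side and shared columns whose types differ."""
    left_names, right_names = set(left.columns), set(right.columns)
    shared = [c for c in left.columns if c in right_names]
    return {
        "only_left": [c for c in left.columns if c not in right_names],
        "only_right": [c for c in right.columns if c not in left_names],
        "type_mismatch": {
            c: (left.column_types[c], right.column_types[c])
            for c in shared
            if left.column_types[c] != right.column_types[c]
        },
    }


# Two schemas as they might look after conversion from different backends.
hf_side = SchemaLike({"text": str, "label": int}, ["text", "label"])
ray_side = SchemaLike({"text": str, "label": float, "id": int}, ["text", "label", "id"])
report = diff_schemas(hf_side, ray_side)
```

Because both sides are already expressed as Python types, the comparison needs no knowledge of which backend produced either schema.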
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/core/data/schema.py
Signature
@dataclass
class Schema:
    column_types: Dict[str, Any]
    columns: List[str]

    @classmethod
    def from_hf_features(cls, features: Features): ...

    @classmethod
    def from_ray_schema(cls, schema): ...

    @classmethod
    def map_hf_type_to_python(cls, feature): ...

    @classmethod
    def map_ray_type_to_python(cls, ray_type: pa.DataType): ...
Import
from data_juicer.core.data.schema import Schema
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| column_types | Dict[str, Any] | Yes | Mapping of column names to their Python types |
| columns | List[str] | Yes | Ordered list of column names |
| features | Features | Yes (for from_hf_features) | HuggingFace Features object to convert |
| schema | PyArrow Schema | Yes (for from_ray_schema) | Ray/PyArrow schema to convert |
Outputs
| Name | Type | Description |
|---|---|---|
| schema | Schema | A Schema instance with column_types and columns populated from the source schema |
Usage Examples
from data_juicer.core.data.schema import Schema
from datasets import Features, Value, Sequence
# Create schema from HuggingFace features
features = Features({
    "text": Value("string"),
    "label": Value("int64"),
    "scores": Sequence(Value("float32")),
})
schema = Schema.from_hf_features(features)
print(schema)
# Dataset Schema:
# ----------------------------------------
# text: <class 'str'>
# label: <class 'int'>
# scores: typing.List[float]