Implementation:Huggingface Datasets Dataset Cast

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, ML_Preprocessing
Last Updated 2026-02-14 18:00 GMT

Overview

Dataset.cast is the concrete tool in the HuggingFace Datasets library for casting a dataset's columns to a new set of feature types.

Description

The cast method converts every column in a dataset to match a new Features specification. The feature names in the new specification must match the current column names exactly, and each column's current type must be convertible to its new type (e.g., Value('string') to Value('large_string'), or between compatible numeric types). For non-trivial conversions such as str to ClassLabel, use the map method instead. Internally, cast calls map with table_cast to perform the conversion, so it benefits from map's caching and multiprocessing.

Usage

Use Dataset.cast when you need to change the data types of columns to match model or framework requirements, such as changing label types, numeric precision, or string representations.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L2081-L2163

Signature

def cast(
    self,
    features: Features,
    batch_size: Optional[int] = 1000,
    keep_in_memory: bool = False,
    load_from_cache_file: Optional[bool] = None,
    cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    num_proc: Optional[int] = None,
) -> "Dataset":

Import

from datasets import load_dataset, ClassLabel

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
new_features = ds.features.copy()
new_features["label"] = ClassLabel(names=["bad", "good"])
ds = ds.cast(new_features)

I/O Contract

Inputs

Name | Type | Required | Description
features | Features | Yes | New features to cast the dataset to. Column names must match exactly.
batch_size | Optional[int] | No | Number of examples per batch provided to cast. Defaults to 1000. If <= 0 or None, the full dataset is a single batch.
keep_in_memory | bool | No | Whether to copy the data in-memory. Defaults to False.
load_from_cache_file | Optional[bool] | No | If a cache file exists, use it instead of recomputing. Defaults to True if caching is enabled.
cache_file_name | Optional[str] | No | Path for the cache file. If None, auto-generated.
writer_batch_size | Optional[int] | No | Number of rows per write operation for the cache file writer. Defaults to 1000.
num_proc | Optional[int] | No | Number of processes for multiprocessing. Defaults to None (no multiprocessing).

Outputs

Name | Type | Description
return | Dataset | A copy of the dataset with cast features.

Usage Examples

Basic Usage

from datasets import load_dataset, ClassLabel, Value

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
print(ds.features)
# {'label': ClassLabel(names=['neg', 'pos']), 'text': Value('string')}

new_features = ds.features.copy()
new_features["label"] = ClassLabel(names=["bad", "good"])
new_features["text"] = Value("large_string")
ds = ds.cast(new_features)
print(ds.features)
# {'label': ClassLabel(names=['bad', 'good']), 'text': Value('large_string')}

Related Pages

Implements Principle
