Implementation:Huggingface Datasets Dataset Cast
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ML_Preprocessing |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the HuggingFace Datasets library, for casting dataset columns to a new set of feature types.
Description
The cast method converts all columns in a dataset to match a new Features specification. The feature names in the new specification must match the current column names exactly. The data types must be convertible (e.g., Value('string') to Value('large_string'), or between compatible numeric types). For non-trivial conversions like str to ClassLabel, the map method should be used instead. Internally, cast uses map with table_cast to perform the conversion, so it benefits from caching and multiprocessing.
Usage
Use Dataset.cast when you need to change the data types of columns to match model or framework requirements, such as changing label types, numeric precision, or string representations.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L2081-L2163
Signature
def cast(
    self,
    features: Features,
    batch_size: Optional[int] = 1000,
    keep_in_memory: bool = False,
    load_from_cache_file: Optional[bool] = None,
    cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    num_proc: Optional[int] = None,
) -> "Dataset":
Import
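from datasets import load_dataset, ClassLabel
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
# Start from the current features so that every column name is preserved.
new_features = ds.features.copy()
# Rename the two label classes, then cast the whole dataset to the new specification.
new_features["label"] = ClassLabel(names=["bad", "good"])
ds = ds.cast(new_features)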
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| features | Features | Yes | New features to cast the dataset to. Column names must match exactly. |
| batch_size | Optional[int] | No | Number of examples per batch provided to cast. Defaults to 1000. If <= 0 or None, the full dataset is a single batch. |
| keep_in_memory | bool | No | Whether to copy the data in-memory. Defaults to False. |
| load_from_cache_file | Optional[bool] | No | If a cache file exists, use it instead of recomputing. Defaults to True if caching is enabled. |
| cache_file_name | Optional[str] | No | Path for the cache file. If None, auto-generated. |
| writer_batch_size | Optional[int] | No | Number of rows per write operation for the cache file writer. Defaults to 1000. |
| num_proc | Optional[int] | No | Number of processes for multiprocessing. Defaults to None (no multiprocessing). |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Dataset | A copy of the dataset with cast features. |
Usage Examples
Basic Usage
from datasets import load_dataset, ClassLabel, Value
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
print(ds.features)
# {'label': ClassLabel(names=['neg', 'pos']), 'text': Value('string')}
new_features = ds.features.copy()
new_features["label"] = ClassLabel(names=["bad", "good"])
new_features["text"] = Value("large_string")
ds = ds.cast(new_features)
print(ds.features)
# {'label': ClassLabel(names=['bad', 'good']), 'text': Value('large_string')}