Workflow:Huggingface Datasets Dataset Preprocessing

Knowledge Sources	Huggingface Datasets, Datasets Documentation, Processing Data, Audio Processing, Image Processing
Domains	Data_Engineering, Machine_Learning, Feature_Engineering
Last Updated	2026-02-14 18:00 GMT

Overview

End-to-end process for transforming, filtering, and restructuring loaded datasets to prepare them for machine learning model training or evaluation.

Description

This workflow covers the full preprocessing pipeline built on the Dataset class's transformation methods. The library provides a rich set of operations (map, filter, select, sort, shuffle, rename, remove, cast, flatten), each of which returns a new Dataset backed by Arrow data. Every transformation updates the dataset's fingerprint, and compute-heavy operations such as map and filter write their results to cache files keyed by that fingerprint, so rerunning an identical transformation on the same data reloads the cached result instead of recomputing it. The map function supports both single-example and batched processing, with optional multiprocessing for parallelism.

Usage

Execute this workflow after loading a dataset and before feeding data to a model training loop. Typical use cases include tokenizing text for NLP models, resizing and normalizing images for vision models, extracting audio features, creating derived columns, filtering out low-quality examples, splitting data into train/test sets, and restructuring column schemas to match model input requirements.

Execution Steps

Step 1: Inspect and Understand the Raw Schema

Examine the loaded dataset's feature schema, column names, data types, and sample rows to understand what transformations are needed. Identify columns that need to be renamed, removed, cast to different types, or flattened from nested structures.

Key considerations:

Review dataset.features for the full type schema including nested types
Check for special feature types (Image, Audio, Video, ClassLabel) that have custom encoding/decoding
Identify columns that are unnecessary for your task and can be removed early to reduce memory usage
Detect nested struct columns that may need to be flattened

Step 2: Apply Column Transformations

Restructure the dataset schema by renaming columns to match model expectations, removing unnecessary columns to reduce memory and processing overhead, casting columns to different types (e.g., string to ClassLabel), and flattening nested structures into top-level columns.

What happens:

rename_column and rename_columns change column names without copying data
remove_columns drops specified columns from the Arrow table
cast changes column types using Arrow's type casting
flatten expands nested struct columns into dot-separated top-level columns

Step 3: Apply the Map Function

Transform each example (or batch of examples) using a user-defined function via the map method. This is the primary transformation tool, supporting both element-wise and batched processing, optional multi-processing, and the ability to add, modify, or remove columns.

Key considerations:

Batched mode (batched=True) processes multiple examples at once and is significantly faster for tokenization and similar operations
Multi-processing (num_proc > 1) parallelizes the transformation across CPU cores
The function can return new columns or overwrite existing ones; columns it does not return are kept unchanged, so drop columns explicitly with the remove_columns argument
Results are automatically cached to Arrow files, keyed by a deterministic fingerprint of the function and input data
with_indices=True passes the row index to the function for positional operations

Step 4: Filter the Dataset

Remove examples that do not meet quality criteria or relevance conditions using the filter method. The filter function evaluates a boolean predicate for each example and retains only those that return True.

Key considerations:

Filter supports both single-example and batched modes
Multi-processing is available for CPU-intensive filter conditions
The filter result is cached independently from the map result
Chaining multiple filters yields the same rows as a single filter with the combined condition, though each call in the chain is fingerprinted and cached separately

Step 5: Split, Shuffle, and Select

Prepare the dataset for training by splitting into train/test subsets, shuffling to randomize example order, and optionally selecting specific subsets of rows. The train_test_split method provides stratified splitting capability when class balance matters.

Key considerations:

train_test_split creates a DatasetDict with "train" and "test" splits
Stratified splitting (the stratify_by_column argument, which expects a ClassLabel column) preserves label distribution across splits
shuffle generates a random permutation of indices and creates a new indexed view
select picks specific rows by index for targeted subset creation
sort reorders examples by one or more column values

Step 6: Set the Output Format

Configure the dataset to return examples in the format expected by your ML framework. The format setting controls how data is converted when accessed via indexing or iteration, enabling seamless integration with PyTorch DataLoaders, TensorFlow data pipelines, or NumPy-based workflows.

Key considerations:

set_format modifies the dataset in-place; with_format returns a new dataset
Supported formats: torch, tensorflow, numpy, jax, pandas, polars, arrow, or None (Python dicts)
When columns is specified, only those columns are returned, in the chosen format; pass output_all_columns=True to also return the remaining columns as Python objects
Format is applied at access time, not stored in the Arrow files
