Workflow:Huggingface Datasets Dataset Preprocessing
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Machine_Learning, Feature_Engineering |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
End-to-end process for transforming, filtering, and restructuring loaded datasets to prepare them for machine learning model training or evaluation.
Description
This workflow covers the full data preprocessing pipeline using the Dataset's built-in transformation methods. The library provides a rich set of operations (map, filter, select, sort, shuffle, rename, remove, cast, flatten), each of which returns a new Dataset and leaves the original unchanged. Operations that rewrite data, such as map and filter, write their results to cached Arrow files, while row-reordering operations such as select, shuffle, and sort record an indices mapping over the existing table. Transformations are fingerprinted for automatic cache reuse: for a dataset backed by on-disk cache files, an identical transformation applied to the same data is loaded from cache rather than recomputed. The map function supports both single-example and batched processing, with optional multi-processing for parallelism.
Usage
Execute this workflow after loading a dataset and before feeding data to a model training loop. Typical use cases include tokenizing text for NLP models, resizing and normalizing images for vision models, extracting audio features, creating derived columns, filtering out low-quality examples, splitting data into train/test sets, and restructuring column schemas to match model input requirements.
Execution Steps
Step 1: Inspect and Understand the Raw Schema
Examine the loaded dataset's feature schema, column names, data types, and sample rows to understand what transformations are needed. Identify columns that need to be renamed, removed, cast to different types, or flattened from nested structures.
Key considerations:
- Review dataset.features for the full type schema including nested types
- Check for special feature types (Image, Audio, Video, ClassLabel) that have custom encoding/decoding
- Identify columns that are unnecessary for your task and can be removed early to reduce memory usage
- Detect nested struct columns that may need to be flattened
Step 2: Apply Column Transformations
Restructure the dataset schema by renaming columns to match model expectations, removing unnecessary columns to reduce memory and processing overhead, casting columns to different types (e.g., string to ClassLabel), and flattening nested structures into top-level columns.
What happens:
- rename_column and rename_columns change column names without copying data
- remove_columns drops specified columns from the Arrow table
- cast changes column types using Arrow's type casting and takes a full Features schema; cast_column does the same for a single column
- flatten expands nested struct columns into dot-separated top-level columns
Step 3: Apply the Map Function
Transform each example (or batch of examples) using a user-defined function via the map method. This is the primary transformation tool, supporting both element-wise and batched processing, optional multi-processing, and the ability to add, modify, or remove columns.
Key considerations:
- Batched mode (batched=True) processes multiple examples at once and is significantly faster for tokenization and similar operations
- Multi-processing (num_proc > 1) parallelizes the transformation across CPU cores
- The dictionary returned by the function adds new columns or overwrites existing ones; columns it does not return are kept, so pass remove_columns to map to drop them
- Results are automatically cached to Arrow files, keyed by a deterministic fingerprint of the function and input data
- with_indices=True passes the row index to the function for positional operations
Step 4: Filter the Dataset
Remove examples that do not meet quality criteria or relevance conditions using the filter method. The filter function evaluates a boolean predicate for each example and retains only those that return True.
Key considerations:
- Filter supports both single-example and batched modes
- Multi-processing is available for CPU-intensive filter conditions
- The filter result is cached independently from the map result
- Chaining multiple filters keeps the same rows as a single filter with the combined conditions, but each call is a separate pass over the data
Step 5: Split, Shuffle, and Select
Prepare the dataset for training by splitting into train/test subsets, shuffling to randomize example order, and optionally selecting specific subsets of rows. The train_test_split method supports stratified splitting through its stratify_by_column argument, which is useful when class balance matters.
Key considerations:
- train_test_split creates a DatasetDict with "train" and "test" splits
- Stratified splitting (stratify_by_column) preserves label distribution across splits and requires the column to be a ClassLabel feature
- shuffle generates a random permutation of indices and creates a new indexed view
- select picks specific rows by index for targeted subset creation
- sort reorders examples by one or more column values
Step 6: Set the Output Format
Configure the dataset to return examples in the format expected by your ML framework. The format setting controls how data is converted when accessed via indexing or iteration, enabling seamless integration with PyTorch DataLoaders, TensorFlow data pipelines, or NumPy-based workflows.
Key considerations:
- set_format modifies the dataset in-place; with_format returns a new dataset
- Supported formats: torch, tensorflow, numpy, jax, pandas, polars, arrow, or None (Python dicts)
- When columns is given, only those columns are returned and converted; pass output_all_columns=True to also return the remaining columns as plain Python objects
- Format is applied at access time, not stored in the Arrow files