Implementation:Huggingface Datasets Dataset Select
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ML_Preprocessing |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for selecting specific rows of a dataset by index.
Description
The select method creates a new dataset containing only the rows at the specified indices. It accepts various index formats including range, list, iterable, NumPy ndarray, and pandas Series. When the indices form a contiguous range, the Arrow table is efficiently sliced. For non-contiguous indices, an indices mapping is created, which is still faster than rebuilding the Arrow table from scratch. The method also converts PyArrow arrays to NumPy and generator objects to lists automatically.
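The distinction between contiguous and non-contiguous indices can be illustrated with a small sketch. The helper below, `contiguous_span`, is a hypothetical name introduced only for this example and is not part of the datasets API; it shows one way a caller could tell whether a set of indices would qualify for the slicing path, under the assumption that "contiguous" means consecutive ascending integers.

```python
from typing import Iterable, Optional, Tuple


def contiguous_span(indices: Iterable[int]) -> Optional[Tuple[int, int]]:
    """Return (start, length) if indices are consecutive ascending integers, else None.

    Hypothetical helper for illustration only; not the library's implementation.
    """
    if isinstance(indices, range) and indices.step == 1 and len(indices) > 0:
        # A non-empty range with step 1 is contiguous by construction.
        return indices.start, len(indices)
    materialized = list(indices)  # also consumes generator objects
    if not materialized:
        return None
    start = materialized[0]
    for offset, idx in enumerate(materialized):
        if idx != start + offset:
            return None  # a gap or reordering: an indices mapping would be needed
    return start, len(materialized)


print(contiguous_span(range(4)))        # (0, 4)
print(contiguous_span([0, 10, 20, 30])) # None
```

In this sketch, a non-`None` result corresponds to the case where a slice of the underlying table suffices, while `None` corresponds to the case where an indices mapping is built.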
Usage
Use Dataset.select when you need to create a subset of the dataset based on known indices, such as selecting the first N examples for debugging, implementing custom sampling strategies, or partitioning data according to externally computed indices.
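As a minimal sketch of the "externally computed indices" use case, the snippet below builds a shuffled train/test partition using only the standard library. The function name `split_indices` and its parameters are assumptions made for this example; the resulting lists are the kind of input that could then be passed to `Dataset.select`.

```python
import random
from typing import List, Tuple


def split_indices(num_rows: int, test_fraction: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and split them into (train, test) lists.

    Hypothetical helper for illustration; not part of the datasets API.
    """
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    indices = list(range(num_rows))
    rng.shuffle(indices)
    n_test = int(num_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(num_rows=10, test_fraction=0.2)

# With a loaded dataset `ds` of 10 rows, the subsets would then be created as:
# train_ds = ds.select(train_idx)
# test_ds = ds.select(test_idx)
```

Because the shuffled indices are non-contiguous, such a selection would go through the indices-mapping path described above rather than a slice.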
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L4037-L4125
Signature
@transmit_format
@fingerprint_transform(inplace=False, ignore_kwargs=["indices_cache_file_name"])
def select(
self,
indices: Iterable,
keep_in_memory: bool = False,
indices_cache_file_name: Optional[str] = None,
writer_batch_size: Optional[int] = 1000,
new_fingerprint: Optional[str] = None,
) -> "Dataset":
Import
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
ds = ds.select(range(4))
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| indices | Iterable | Yes | Range, list, or 1D-array of integer indices for row selection. Contiguous ranges are optimized via slicing. |
| keep_in_memory | bool | No | Keep indices mapping in memory. Defaults to False. |
| indices_cache_file_name | Optional[str] | No | Cache file path for the indices mapping. |
| writer_batch_size | Optional[int] | No | Rows per write operation. Defaults to 1000. |
| new_fingerprint | Optional[str] | No | The new fingerprint after transform. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Dataset | A new dataset containing only the rows at the specified indices. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
# Select the first 4 examples
subset = ds.select(range(4))
print(subset)
# Dataset({ features: ['text', 'label'], num_rows: 4 })
# Select specific indices
subset = ds.select([0, 10, 20, 30])
print(subset.num_rows)
# 4