Implementation:Huggingface Datasets Dataset Select

From Leeroopedia
Revision as of 12:58, 16 February 2026 by Admin (Auto-imported from implementations/Huggingface_Datasets_Dataset_Select.md)
Knowledge Sources
Domains Data_Engineering, ML_Preprocessing
Last Updated 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for selecting specific rows of a dataset by index.

Description

The select method returns a new dataset containing only the rows at the specified indices, in the order given. It accepts several index formats: range, list, any other iterable, NumPy ndarray, and pandas Series. When the indices form a contiguous range, the underlying Arrow table is simply sliced. For non-contiguous indices, an indices mapping is created instead, which is still faster than rebuilding the Arrow table from scratch. Before the indices are processed, PyArrow arrays are converted to NumPy arrays and generator objects are converted to lists.
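
The slicing-versus-mapping decision described above can be sketched in plain Python. This is a standalone illustration under simplifying assumptions, not the library's code: `plan_selection` and its tuple return values are hypothetical names introduced here, and the sketch only reports which strategy would apply rather than touching any Arrow data.

```python
from itertools import count
from typing import Iterable, List, Tuple, Union

Plan = Union[Tuple[str, int, int], Tuple[str, List[int]]]


def plan_selection(indices: Iterable[int]) -> Plan:
    """Decide between slicing and an indices mapping (illustrative only)."""
    # Materialize the indices so generators can be inspected more than once.
    idx = [int(i) for i in indices]

    # No indices at all: an empty slice is enough in this sketch.
    if not idx:
        return ("slice", 0, 0)

    start = idx[0]
    # Contiguous means start, start + 1, start + 2, ... with a non-negative start.
    if start >= 0 and all(i == j for i, j in zip(idx, count(start))):
        return ("slice", start, len(idx))

    # Anything else needs an explicit mapping from new row to original row.
    return ("mapping", idx)


print(plan_selection(range(4)))              # ('slice', 0, 4)
print(plan_selection([0, 10, 20, 30]))       # ('mapping', [0, 10, 20, 30])
print(plan_selection(i for i in (2, 3, 4)))  # ('slice', 2, 3)
```

The practical takeaway is that passing `range(start, stop)` or any ascending run of consecutive indices lets the cheaper slicing path apply, while arbitrary index lists fall back to the mapping.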

Usage

Use Dataset.select when you need to create a subset of the dataset based on known indices, such as selecting the first N examples for debugging, implementing custom sampling strategies, or partitioning data according to externally computed indices.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L4037-L4125

Signature

@transmit_format
@fingerprint_transform(inplace=False, ignore_kwargs=["indices_cache_file_name"])
def select(
    self,
    indices: Iterable,
    keep_in_memory: bool = False,
    indices_cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    new_fingerprint: Optional[str] = None,
) -> "Dataset":

Import

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
ds = ds.select(range(4))

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| indices | Iterable | Yes | Range, list, or 1D-array of integer indices for row selection. Contiguous ranges are optimized via slicing. |
| keep_in_memory | bool | No | Keep indices mapping in memory. Defaults to False. |
| indices_cache_file_name | Optional[str] | No | Cache file path for the indices mapping. |
| writer_batch_size | Optional[int] | No | Rows per write operation. Defaults to 1000. |
| new_fingerprint | Optional[str] | No | The new fingerprint after transform. |

Outputs

| Name | Type | Description |
|---|---|---|
| return | Dataset | A new dataset containing only the rows at the specified indices. |

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")

# Select the first 4 examples
subset = ds.select(range(4))
print(subset)
# Dataset({ features: ['text', 'label'], num_rows: 4 })

# Select specific indices
subset = ds.select([0, 10, 20, 30])
print(subset.num_rows)
# 4

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
