Principle:Huggingface Datasets Row Selection
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ML_Preprocessing |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Selecting specific rows from a dataset by index for subsetting, sampling, or partitioning.
Description
Row Selection is the practice of creating a new dataset containing only the rows at specified indices. Unlike filtering (which uses a predicate function), row selection operates directly on indices, providing precise control over which examples are included. This is useful for creating deterministic subsets, implementing custom sampling strategies, manual data partitioning, and retrieving specific examples for inspection.
Row selection accepts several index formats, including ranges, lists, and arrays. When the indices form a contiguous range, the operation reduces to slicing the underlying Arrow table, which avoids building any additional index structure. For non-contiguous indices, an indices mapping is created instead: the underlying data is left in place and each row access is redirected through the mapping. This keeps random access fast, at the cost of some sequential read performance.
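The choice between the two paths can be illustrated with a small sketch. This is not the library's implementation: `plan_selection` and its return values are hypothetical names, and the sketch assumes that a run of indices counts as contiguous when each index is exactly one more than the previous one.

```python
from typing import Sequence, Tuple, Union

# Hypothetical description of how a selection could be served.
Plan = Union[Tuple[str, int, int], Tuple[str, list]]


def plan_selection(indices: Sequence[int]) -> Plan:
    """Decide how a selection could be served (illustrative only).

    Returns ("slice", start, length) when the indices are a contiguous
    ascending run, otherwise ("mapping", indices_as_list).
    """
    idx = list(indices)
    if not idx:
        return ("slice", 0, 0)  # empty selection: zero-length slice
    start = idx[0]
    # Contiguous means idx == [start, start + 1, start + 2, ...].
    is_contiguous = all(i == start + offset for offset, i in enumerate(idx))
    if is_contiguous:
        return ("slice", start, len(idx))
    return ("mapping", idx)


print(plan_selection(range(100, 105)))  # contiguous range
print(plan_selection([7, 2, 5]))        # arbitrary order
```

A slice plan needs only a start offset and a length, whereas a mapping plan has to keep one stored index per selected row, which is where the extra indirection on sequential reads comes from.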
Usage
Use Row Selection when:
- You need to create a fixed-size subset of a dataset (e.g., the first N examples for debugging).
- You are implementing custom sampling strategies that produce a list of indices.
- You need to partition a dataset according to externally computed indices (e.g., from cross-validation folds).
- You want to select specific examples by their position for inspection or analysis.
- You are implementing curriculum learning by selecting examples in a specific order.
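Each of the cases above comes down to producing a sequence of row indices. The following is a minimal sketch using only the Python standard library; all function names (`first_n`, `random_sample`, `kfold_indices`, `curriculum_order`) and the round-robin fold assignment are assumptions made for illustration, not part of any library.

```python
import random
from typing import List, Sequence, Tuple


def first_n(num_rows: int, n: int) -> range:
    """Indices of the first n rows (fewer if the dataset is shorter)."""
    return range(min(n, num_rows))


def random_sample(num_rows: int, k: int, seed: int) -> List[int]:
    """k distinct row indices drawn with a fixed seed, so reruns agree."""
    rng = random.Random(seed)
    return rng.sample(range(num_rows), k)


def kfold_indices(num_rows: int, num_folds: int, fold: int) -> Tuple[List[int], List[int]]:
    """Train/validation indices for one fold.

    Rows are assigned round-robin: row i belongs to fold i % num_folds.
    """
    val = [i for i in range(num_rows) if i % num_folds == fold]
    train = [i for i in range(num_rows) if i % num_folds != fold]
    return train, val


def curriculum_order(difficulty: Sequence[float]) -> List[int]:
    """Row indices sorted from easiest to hardest by a per-row score."""
    return sorted(range(len(difficulty)), key=lambda i: difficulty[i])


num_rows = 10
debug_idx = first_n(num_rows, 3)
sample_idx = random_sample(num_rows, k=4, seed=0)
train_idx, val_idx = kfold_indices(num_rows, num_folds=5, fold=1)
easy_to_hard = curriculum_order([0.9, 0.1, 0.5])

print(list(debug_idx))
print(sample_idx)
print(train_idx, val_idx)
print(easy_to_hard)
```

With the `datasets` library, each of these index sequences would then be passed to `Dataset.select` to obtain the corresponding subset.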
Theoretical Basis
Row Selection implements index-based subsetting, a fundamental operation in data management systems. Classical relational algebra treats relations as unordered sets and has no notion of row position, so the closest analogue is positional selection on ordered relations. The operation is order-preserving with respect to the provided indices: the resulting dataset follows the order of the index sequence, not the order of the source dataset. This makes it strictly more general than predicate-based filtering. Any filter result can be expressed as a selection on the ascending sequence of indices where the predicate is true, but the converse does not hold, because a selection can also reorder rows, which no predicate can express.
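The relationship between filtering and selection can be made concrete with a toy column-oriented table. This is a sketch in plain Python, not the library's code: `Table`, `select_rows`, and `filter_rows` are hypothetical names, and the table is assumed to be a dict of equal-length lists.

```python
from typing import Callable, Dict, Sequence

Table = Dict[str, list]  # column name -> column values, all the same length


def select_rows(table: Table, indices: Sequence[int]) -> Table:
    """Return a new table with the rows at `indices`, in that order."""
    return {name: [col[i] for i in indices] for name, col in table.items()}


def filter_rows(table: Table, predicate: Callable[[dict], bool]) -> Table:
    """Filtering expressed as a selection on the indices where predicate holds."""
    num_rows = len(next(iter(table.values()))) if table else 0
    keep = [
        i for i in range(num_rows)
        if predicate({name: col[i] for name, col in table.items()})
    ]
    return select_rows(table, keep)


table = {"id": [0, 1, 2, 3, 4], "label": ["a", "b", "a", "b", "a"]}

only_a = filter_rows(table, lambda row: row["label"] == "a")
print(only_a)     # {'id': [0, 2, 4], 'label': ['a', 'a', 'a']}

reordered = select_rows(table, [4, 0, 2])
print(reordered)  # {'id': [4, 0, 2], 'label': ['a', 'a', 'a']}
```

Both results contain the same rows, but only the selection can return them in the order 4, 0, 2. A predicate decides row by row whether to keep a row and therefore always yields rows in source order.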