Principle:Huggingface Datasets Dataset Item Access

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Dataset Item Access is the mechanism for retrieving individual examples, slices, or columns from a loaded dataset using index-based or name-based lookups.

Description

Once a dataset is loaded as a Dataset object backed by an Arrow table (memory-mapped from disk or held in memory), users need efficient ways to access its contents. Dataset Item Access provides a unified interface that supports multiple access patterns through Python's subscript operator ([]):

Single row access: Using an integer index returns a dictionary mapping column names to values for that row.
Slice access: Using a Python slice (e.g., [10:20]) returns a dictionary of lists, one list per column, covering the specified row range.
Batch access: Using a list or array of integer indices returns a dictionary of lists for the selected rows.
Column access: Using a string column name returns the values of that column across all rows, either as a lazy Column object or as a plain list of values, depending on the library version.
Boolean indexing: Using a list or array of booleans selects rows where the value is True.

The access mechanism is format-aware. The Dataset object can be configured (via set_format) to return data as Python objects, NumPy arrays, PyTorch tensors, TensorFlow tensors, JAX arrays, Arrow tables, Pandas DataFrames, or Polars DataFrames. The item access layer applies the appropriate formatting transformation after querying the underlying Arrow table.

The architecture also supports indexed datasets where an indices mapping redirects logical indices to physical row positions, enabling efficient subset views without copying data.

Usage

Apply Dataset Item Access when:

Retrieving a single example for inspection or debugging.
Selecting a batch of examples for model input during training or evaluation.
Extracting a specific column (e.g., labels) for analysis.
Slicing a dataset to take a quick look at a range of examples.
Working with a formatted dataset where output must be in a specific tensor framework format.

Theoretical Basis

The access mechanism operates in two layers:

__getitem__(key):
  If key is a string:
    Return Column(dataset, key) for lazy column access
  Else:
    Delegate to _getitem(key)

_getitem(key):
  1. DETERMINE format settings (type, columns, kwargs)
  2. CREATE formatter for the configured output format
  3. QUERY Arrow table:
     - If indices mapping exists: translate logical indices to physical
     - Apply key (int, slice, list, or bool mask) to Arrow table
     - Return a sub-table (pa.Table)
  4. FORMAT the sub-table:
     - Select only the configured format_columns (if set)
     - Convert Arrow data to the target format (dict, numpy, torch, etc.)
  5. Return formatted output
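
The two layers above can be mimicked with a small pure-Python model. This is a sketch under simplifying assumptions, not the library's implementation: ToyDataset and all of its attributes are hypothetical, plain lists stand in for the Arrow table, column access is eager rather than lazy, and the formatter is an arbitrary callable applied to the queried result.

```python
from typing import Any, Callable, Dict, List, Optional


class ToyDataset:
    """Hypothetical stand-in for a Dataset; lists play the role of the Arrow table."""

    def __init__(
        self,
        columns: Dict[str, List[Any]],
        indices: Optional[List[int]] = None,
        format_columns: Optional[List[str]] = None,
        formatter: Callable[[Dict[str, Any]], Any] = lambda out: out,
    ) -> None:
        self._columns = columns                # column name -> physical values
        self._indices = indices                # optional logical -> physical mapping
        self._format_columns = format_columns  # restrict output columns if set
        self._formatter = formatter            # identity models the Python format

    def __len__(self) -> int:
        if self._indices is not None:
            return len(self._indices)
        return len(next(iter(self._columns.values())))

    def _physical_rows(self, key: Any) -> List[int]:
        """Resolve a key to physical row positions (the QUERY step)."""
        logical = range(len(self))
        if isinstance(key, int):
            rows = [logical[key]]              # supports negatives, raises IndexError
        elif isinstance(key, slice):
            rows = list(logical[key])
        else:
            rows = [logical[i] for i in key]
        if self._indices is not None:
            rows = [self._indices[r] for r in rows]
        return rows

    def __getitem__(self, key: Any) -> Any:
        # Layer 1: strings mean column access, everything else is delegated.
        if isinstance(key, str):
            return [self._columns[key][r] for r in self._physical_rows(slice(None))]
        return self._getitem(key)

    def _getitem(self, key: Any) -> Any:
        # Layer 2: query the rows, select columns, then format the result.
        rows = self._physical_rows(key)
        names = self._format_columns or list(self._columns)
        sub = {name: [self._columns[name][r] for r in rows] for name in names}
        if isinstance(key, int):
            sub = {name: values[0] for name, values in sub.items()}
        return self._formatter(sub)


toy = ToyDataset({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})
view = ToyDataset(toy._columns, indices=[2, 0], format_columns=["label"])
print(toy[0], toy[1:3], toy[[3, 0]], toy["label"])
print(view[0], view[0:2])
```

The view object shares the column lists with toy and only stores an index list, which mirrors the idea of a subset view that does not copy data.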

The querying step leverages Arrow's zero-copy slicing for contiguous ranges, so slice-based access does not copy the underlying buffers. For non-contiguous index lists, Arrow performs a take operation, which gathers the selected rows into new arrays. Formatting runs only on the queried sub-table, after the query, so rows that were not requested are never converted.

Related Pages

Implemented By

Implementation:Huggingface_Datasets_Dataset___getitem__
